Journal of Information Security and Applications 46 (2019) 121–137
Repositioning privacy concerns: Web servers controlling URL metadata

Rui Ferreira∗, Rui L. Aguiar
Instituto de Telecomunicações, University of Aveiro, Portugal
Keywords: Privacy; Web; URL; Namespaces
Abstract

Uniform Resource Locators reveal a significant amount of metadata about user actions, in ways that inherently violate our natural expectations of privacy. Adequately protecting this information is an important issue tackled (sometimes partially) in areas such as secure transport protocols, terminal encryption and caching strategies. A different (and complementary) approach would be to design the application namespace to minimise privacy leakage. Our goal is to develop a practical concept of this approach, where service providers enforce fully transient URL namespaces that intentionally conceal data through encryption. We aim to determine the design challenges and required compromises to make this a feasible technique to protect data privacy. We first gather requirements from the constraints of URLs in general and compatibility issues seen in web applications, and propose a mapping process for a namespace of encrypted URLs. We implement this approach over an existing web development framework, and analyse the resulting workload from different popular websites to measure its impact under various conditions. Based on our results, we discuss critical design and implementation choices, consider the deployment issues that were encountered, and discuss what compromises can be made to address them if web service providers want to embed user privacy in their services. Based on this analysis it can be concluded that this type of privacy approach is expensive, with a significant impact on performance and deployment costs that increases with the expected degree of privacy, but there is also room for improvement in various areas. Furthermore, privacy implemented in this way is not a replacement for other types of privacy solutions, but rather a complementary or even conflicting approach, driven by entirely different motives. © 2019 Elsevier Ltd. All rights reserved.
1. Introduction

Digital privacy is now a recurrent topic. While the Internet is often presented as a tool that enables Freedom of Speech for whistle-blowers, reporters or political activists, the debate surrounding online privacy is far from such clarity and is unavoidably entrenched with society's (different) notions of Freedom of Speech, public space, safety, and privacy. There is an ongoing debate about the boundaries of public and non-public spaces, with the automated gathering of information by CCTV systems, mobile phones or even Google Glass. Different physical spaces in different societies can enforce different rules. For example, in some private establishments, user devices are left at the door to prevent recording, while in public spaces, local laws vary and often clash with reasonable expectations of privacy. This challenge extends into the digital world, as metadata easily discloses online movements and the content viewed by users.
∗ Corresponding author.
E-mail addresses: [email protected] (R. Ferreira), [email protected] (R.L. Aguiar).
https://doi.org/10.1016/j.jisa.2019.03.010
2214-2126/© 2019 Elsevier Ltd. All rights reserved.
Digital interactions are not constrained by the physical barriers of the real world, and this metadata is available to anyone with deep enough technical knowledge. A valuable observation is that privacy, in the real world, is a result of mutual enforcement within a certain space - e.g., public outrage over Google Glass usage gained expression as private establishments banned the devices. In public spaces the expectation of privacy follows the letter of the law, which can nevertheless differ significantly between countries. We clearly understand that some locations are more private than others, either due to physical barriers, or human behaviour, and we adapt our (human) behaviour accordingly. This physical world notion of variable privacy has not been translated into "online spaces". The technical burden of protecting privacy is treated as a user challenge, or as part of specialised infrastructure (e.g., ToR [1]); the service provider rarely takes action to protect user privacy (there are exceptions in extreme cases, e.g., Wikileaks or The Guardian using ToR to ensure anonymity for whistle-blowers). Such measures seem justified in well-identified hostile environments, where state sponsored traffic interception defeats or blocks common security techniques such as HTTPS.
This stringent view is hardly ever supported, because "absolute" anonymity is only required in some cases, just like in the physical world. A more practical concern is for reasonable expectations of privacy - users expect information online to be private (known to a restricted number of parties). With the increased surveillance at the state and corporate levels, simply reading a document can affect one's personal life. There is already extensive work on how individual users can protect their privacy, but following the real world concept of associating privacy with spaces, it is also relevant to ask how "online spaces" can be associated with privacy. Uniform Resource Locators (URLs) [2] are a fundamental form of metadata in computer networks, fulfilling multiple purposes: they represent network locators for services, resource identifiers, or even user/protocol attributes and actions. They mediate many user interactions. At the application layer, namespaces are often designed to be globally unique, semantically meaningful, memorable and persistent. This means URLs alone hold metadata about actions being taken by users - even actions we typically consider as passive human behaviour (what we listen to, or what we glance at). URL leakage is often discussed as a privacy issue within HTTP browsers, where there is significant research around URL cache attacks and unintentional disclosure as a direct consequence of browser and application design [3,4], but the same holds for other protocols. Intuitively, it follows that common privacy approaches for disclosure, such as segmentation and pseudonyms [5,6], can help mitigate these issues. However, to our knowledge, discussions on the topic treat this as an implementation aspect on a case by case basis. It would be more accurate to treat it as a namespace design issue, which then warrants a broader study of the problem to enable services to make the best design choices.
One can argue that the use of secure transport protocols (e.g., TLS) would protect URLs from prying eyes. However, multiple privacy issues stem from URL leakage that happens outside the coverage scope of these protocols [3,4,7]: browser caches can be probed for entries, intermediate proxies can read content, eavesdroppers can still gather hostnames from DNS queries or the initial TLS negotiation [8], and devices can be stolen and analysed. From this problem we draw the following hypothesis: what if, for privacy concerns, services enforced privacy-friendly URLs for the majority of requests that reach them, creating privacy-oriented online spaces? Being transient, the lifetime of these URLs is bound to a session (e.g., user session or context) that actively conceals information placed in URLs as a way to reduce privacy leakage. This work studies the feasibility for service providers to implement URL schemes that are transient, conceal information and preserve privacy, while discussing their main implications and how to address them. While primarily focusing on HTTP examples and implementation details, the applicability of such schemes also extends to other protocols. Towards this goal, we start by proposing a URL namespace where URL components (Host, Path, Query) are encrypted, and unique for distinct sessions. In a proof of concept implementation, we extend the Flask web framework to automatically rewrite HTTP requests and responses in accordance with this namespace. This can be used to transparently enable web applications (under that framework) to use our namespace, or to build new applications that take full advantage of the new capabilities (to enable the namespace on contextual parameters, or embed additional information into the URL scheme). Based on this implementation we study the content of a set of popular web sites, to gauge the cost of adopting such schemes under different loads.
From those results we identify the main implications of this solution and discuss its applications and the compromises made for deployment.
The remainder of this paper is structured as follows. Section 2 briefly outlines the contributions of this paper. Section 3 describes the motivation and related work, while Section 4 surveys the main goals for this work and identifies related constraints. After considering practical options and limitations, an example scheme for a Session Bound Namespace (SBN) is defined in Section 5. Section 6 describes implementation and deployment details, followed by an analysis of experimental results in Section 7. Finally we discuss applications and consequences of this type of solution in Section 8, and draw conclusions in Section 9.

2. Contributions

In this paper we aim to develop a digital concept of online privacy spaces, and as such we focus on defining a privacy-preserving URL namespace that conceals information through the encryption of various components of a URL. The primary contributions of this paper focus on the design of such a scheme and its consequences in terms of backward compatibility, implementation performance, and implications for related systems. We first design a scheme that encrypts URL components as a way to conceal their value. To achieve this, we study the limitations of each component and how they constrain our scheme. Based on this design we implement a prototype that applies our scheme to the URLs used in web pages, as used by the HTTP protocol and HTML content. It is through this implementation that we conduct a study on the performance implications of our scheme. Our study analyses HTTP traffic for popular websites and determines the impact our scheme would have in serving the same data, in particular the delay caused by the increased number of encryption/decryption operations. To minimise this delay we also study the use of different encryption algorithms, as well as the use of caching.
Since this scheme affects other systems that relate to the service of URLs as part of the HTTP protocol, we also study the implications for related protocols such as DNS, HTTP and TLS. Finally we discuss how a partial use of this scheme can achieve the desired compromise between the performance goals and privacy expectations of a web service. Furthermore, we point out future directions to improve on this approach.

3. Motivation & related work

The history of Uniform Resource Locators (URLs) is intertwined with the World Wide Web, but this namespace is now broadly used in numerous applications and protocols. Many assumptions are made when using the URL namespace, some of which are a result of long-standing habits in HTTP/HTML:
1. URLs are reasonably long lived.
2. URLs are memorable, and convey information about the content or resources they address.
3. URLs are transferable: a single URL yields the same content for two independent requests.
While the previous properties are certainly true for the general HTTP case, there are exceptions:
1. In some cases a URL is only valid for a certain time interval (e.g., 15 minutes, one-time use), to protect content from unauthorised access, or to dissuade users from sharing their links with each other (e.g., in content hosting websites).
2. Although memorability in URLs is not mandatory, it is common practice for web applications to enforce URL paths that convey a content description to the user. This also applies to DNS hostnames, with a fierce market for the acquisition and resale of hostnames due to brand value.
3. Content associated with the same URL can vary, based on the identity of the user requesting it, localisation, and other contextual user information used to customise content.
Nevertheless, in practice, URLs hold a reasonable amount of metadata about interactions taking place within network protocols. URLs from HTTP messages can convey content identification, or even user preferences as part of the path and query. For email messages (POP/SMTP), they identify user accounts, sources and destinations, while in other protocols/applications they hold additional types of information. Depending on the protocol in use, URLs will be handled at multiple locations: at the client terminal (HTTP caches); at intermediate parties (SMTP relays, HTTP proxies); at the network provider; or at the service provider. While each of these entities is entitled to operate over these URLs, some may be more or less trustworthy to hold this information. Therefore it is worthwhile to analyse application/protocol namespaces, the amount of information they disclose, and the compromise they imply. HTTP, due to its reusability, is often used as a building block for services and applications. There is extensive research on privacy attacks on the browser [9,10] based on state information placed in the browser, such as cookies or HTML elements, and more generically on individual properties that make each browser instance unique [11]. Another vector for URL leakage is the HTTP Referer header [12], which includes the URL previously viewed in the browser, revealing it to the next server being visited. This header is commonly used by websites to determine traffic sources (e.g., search engine, or shop page [13]) and often leaks the URL, along with any embedded query arguments, which may include sensitive information [14–16]. To mitigate these privacy threats, multiple approaches have been proposed as client-side extensions to the browser [3,4,7,17,18].
However these works assume only the browser is invested in avoiding privacy disclosure, and cannot tackle attacks that have unrestricted access to the user terminal, such as remote exploits [19–21] or forensic analysis [22,23] of the browser cache/history in a stolen device. When dealing with HTTP, there is a large body of work concerning URL transformations for various purposes:
1. Signed query arguments are often used in URLs to reduce server state for one-time operations, such as download links, account recovery or creation - these include a signed statement from the service provider with a validity timestamp. For example, the scheme in [24] stores user agent information (IP address) as part of a URL, and appends signature elements as attributes in the URL query.
2. Over the years, a number of online services have been created with the sole purpose of mapping single URLs into URLs managed by a different authority. Most of these are named "URL shorteners", because their focus is to generate very short URLs for distribution over restricted media such as SMS and Twitter. However there are services that focus on other properties, such as temporary URLs (both HTTP and email addresses) and on enforcing access restrictions (e.g., link/address protectors that enforce password or captcha verification). Due to URL obfuscation, these are viable mechanisms for privacy protection, but they are also a source of security issues [25]. Such services work well for distributing individual links but cannot be used for the general case, because redirecting an entire website domain would be equivalent to a binding attack [26].
3. Another approach, used in SAML [27], allowed messages to carry temporary references named "Artifacts", used to reference content that could not be transmitted in a certain channel and should be retrieved through a side channel. However, this approach requires additional protocol support.
4. The use of session binding arguments in HTTP URLs was an alternative when cookies were unavailable. A well known example was the PHPSESSIONID query argument used by PHP-based websites. Given its popularity, it was targeted by session fixation attacks, where an attacker could take over a user session by discovering the PHPSESSIONID value, or inject a new value to bind the user into a crafted session (this type of use seems to be largely discontinued [28]).
The work in [29,30] partially shares our motivation (server-side enabled privacy protection). However [29] is fully dependent on HTTP, only targets directed cache attacks with the use of a pseudonym suffix, and does not address obfuscation. Studies such as [30–32] successfully extract private user information from URL data sets. To partially address this, [30] proposes automated methods for sanitising URLs that effectively remove unnecessary content from URL queries, but only for URL query arguments that are optional for the service being provided. The Veil framework [33] shares our initial approach to this problem. It considers that the only way to generate new identifiers without breaking the application is with inside knowledge, as part of web application development. As such, it takes an aggressive approach that requires full recompilation of a web application (HTML, CSS, JavaScript), and replaces content references with JavaScript code. Its approach to identifier concealment is similar to the one presented here, but it makes use of coarser techniques that are tied together with content obfuscation; furthermore, it does not cope with hostnames. While there is an overlap of concerns with these proposals, in the sense that both concern privacy leakage through the disclosure of URLs, our motivation, assumptions and technical approaches differ significantly. Primarily, our approach is not user-centric.
Instead we argue that some user privacy issues are also damaging for service providers at both reputation and business levels (e.g., a user may not consider it a privacy breach if an online store has the capability to determine which competitor websites were previously visited by the user, and uses this information for targeted offers [3]; competing services, however, will consider it a threat and should be motivated to hinder such information disclosure). Other approaches, e.g., remotely executed browsers [33,34] or private browsing mode [35], require a savvy and/or conscious user to configure them, or some kind of bootstrap process. The methods we discuss in this paper are to be enabled by service providers, i.e. we consider it worthwhile for services to take a stand to prevent privacy disclosure to third parties they cannot control. Transitively, this may end up improving user privacy in ways similar to the aforementioned proposals, but now relying on an association between "specific online space" and "privacy assurance", as happens in the real world. But since the service provider hosts these mechanisms, it would be naive to assume companies will not be mostly self-serving when dealing with user privacy.

3.1. Threat scenarios

As users browse the Internet, records of visited websites are stored in multiple locations at the user terminal:
• the browser cache temporarily holds website content for reuse
• the browser history holds the URLs of all the pages that were viewed
• the browser itself, and/or its extensions, store logs of requested URLs
• URLs are revealed to third parties through the Referer header or similar
• the user, directly or indirectly, can store webpage content, e.g., for offline/later reading
The actual content seen in webpages, and stored in the cache, is temporary, but the URL metadata can persist until the user takes action to remove it. In this work, we are not concerned with cached page content, since other solutions already cover this problem at the client side [35] and through server-side obfuscation [33]. Our primary concern is the URLs stored in these multiple ways, as they hold information value in themselves. Also, we assume the user has not installed any special mechanisms for privacy protection, and that an attacker can attempt to extract information from the user terminal using multiple types of attacks. The attacks that can target this information follow diverse strategies. Forensic inspection of a stolen terminal could enable an attacker to access this information, either from the system memory or from persistent storage (history or bookmarks). Likewise, remote exploits that target user devices can also extract it. In particular, some attacks specifically target the web browser's cached information, through which an attacker website can determine if a user previously visited a specific URL [4]. For example, a sales website can probe the browser to determine if the user already visited its main competitor for specific pages. A motivated attacker, with enough information about the target website, can use these techniques to probe content that matches specific purposes:
• URLs for content that requires a user to log in can be probed to determine if the user has an account in the target website
• the attacker cannot access the complete browser caches, but it can profile users based on content, e.g., "users that visit wikileaks"
The privacy implications of this type of disclosure vary with each site and the content being targeted. They can be damaging for the users, for the services that run the websites, or both.
A similar threat could be posed by intermediate HTTP proxies scanning for metadata in HTTP requests, but we feel this particular type of attack is already addressed by HTTPS.

4. URLs as session bound namespaces

URLs are used in various network protocols to identify and locate related resources, such as content, users, or the network attachment point for a service. These identifiers are defined based on parameters such as resource availability, human recognition, brand value, or semantic meaning. While properties such as resource availability are directly related to the network, others are left to the criteria of the services that define them. Our approach is that a service provider can, at will, start providing a service under a different temporary URL namespace that can be converted to the original URL namespace, in order to create a 'privacy-enabled online space'. By implementing logic that translates URLs (Fig. 1) at the server, clients will see URLs belonging to that transient namespace instead of the original namespace. The same transient namespace will also be witnessed by any observer in the network. We designate the mapping function that converts URLs into our Session Bound Namespace as E_SBN
E_SBN(URL) → URL′

and conversely a reverse mapping function

D_SBN(URL′) → URL

The main implementation goal is to enable service providers to start using this type of namespace without requiring changes to the client browser (steps 1 and 2 in Fig. 1). A user still uses a well-known URL for reaching a service, but when a session "starts" the service provider redirects the browser to a new URL (which may also be co-located with the original URL), effectively hiding the specifics of what occurs in that session.
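As an illustration of such a mapping pair (a sketch under our own assumptions, not the paper's prototype), an authenticated symmetric cipher whose output is already URL-safe can serve as E_SBN/D_SBN. Fernet, from Python's `cryptography` package (AES-128-CBC with HMAC-SHA256, one of the symmetric schemes considered later), conveniently emits URL-safe base64 tokens:

```python
# Sketch of E_SBN/D_SBN over a shared symmetric key (an assumption;
# asymmetric schemes are also considered later in the paper).
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # per-deployment secret, shared by instances
cipher = Fernet(key)

def e_sbn(component: str) -> str:
    """Map a URL component into the session-bound namespace."""
    return cipher.encrypt(component.encode()).decode()

def d_sbn(token: str) -> str:
    """Reverse mapping: recover the original component."""
    return cipher.decrypt(token.encode()).decode()

token = e_sbn("/account/settings?view=full")
assert d_sbn(token) == "/account/settings?view=full"
# A fresh random IV is used per call, so repeated mappings differ:
assert token != e_sbn("/account/settings?view=full")
```

A lookup-table variant (discussed in Section 4.3) would expose the same two functions, with lookup costs replacing encryption costs.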
Formally we want to map the namespace of all URLs into a second URL sub-namespace that offers more suitable properties for our purposes. The key properties of this new namespace are the following:
1. Transient: URLs in this namespace are only useful within the session that generated them, and their use outside that session will result in an error.
2. Security: unintended parties should not be able to trivially reverse the mapping.
3. Extensible: URLs can hold ancillary information, placed therein as part of the conversion process - this may require compromising somewhat on the previous two points.
However, before we can look into valid mapping functions, we first need to outline the constraints of the URL namespace, and the limitations that stipulate valid mapping functions.

4.1. Practical URL building constraints

As defined in RFC 3986 [2], a URL can include multiple components
Scheme://Authority/Path?Query#Fragment

Scheme: used by the client terminal to determine the protocol to be applied.
Authority: a tuple [User:Password@]Host[:Port] that holds a server hostname, and optionally a port and user information. The User and Password are used for identification and authentication according to the service provider. The Host and Port are used to reach the server, and must refer to a valid network node (DNS hostname or IP).
Path: a set of segments separated by a slash. Segments "." and ".." denote relative references in the path.
Query: represents key/value pairs that further identify a resource.
Fragment: used for client-side indirect referencing in resources associated with a URL. This means fragments are normally not sent as part of a request, and are used solely by the client implementation.
While RFC specifications do not impose strict limits on URL lengths, practical limits must be considered, based on common implementation practices, tools and protocols. A common denominator for an upper limit to URL length in various browsers and web indexing engines is 2000 octets. This value is based on empirical observation of web browsers and HTTP server implementations, and while some implementations allow longer URLs (e.g., 100,000 octets) this value remains a reasonable assumption for practical purposes1. Our own analysis of the Common Crawl2 dataset, which contains 1,603,205,557 unique URLs collected from web pages, found only 299 URLs that exceeded 2000 bytes. Additional restrictions apply to the host name inside the Authority component: the total size for a DNS hostname must not exceed 253 octets, and in addition each label (between ".") is limited to 63 octets. Furthermore, domain names only accept a restricted set of characters and would require encoding schemes such as base32 [36] in order to represent arbitrary data.
A wider character set could be used (per RFC 2181 [37]), but we assume the lowest common denominator to avoid compatibility issues. Similarly, the Path, Query and Fragment are also restricted in the character set they can use, and each component has a different set
1 To our knowledge the most up to date (2015) compilation of data on this subject can be found at http://www.boutell.com/newfaq/misc/urllength.html and its conclusions still hold.
2 http://commoncrawl.org as of July 2015.
Fig. 1. SBN implemented at the server side.
of restricted characters. A one-size-fits-all approach seen in several applications to encode arbitrary data inside these components is the use of base64 [38] with a URL-safe alphabet.

4.2. Backward compatibility

As was pointed out earlier, this solution aims to work with existing client-side implementations, requiring changes only at the server side. In our approach, it is in the interest of the service provider to enforce these privacy mechanisms. As such, it is important to retain backward compatibility with existing client-side systems. From the previous breakdown of URL components and empirical observation of existing protocols and applications, additional requirements can be extracted concerning backwards compatibility:
1. The Scheme defines the protocol handling at the client, and therefore should not be changed, otherwise the client implementation would not recognise the URL scheme. It provides little or no information to a privacy attacker.
2. The Password in the Authority segment, as seen in HTTP browsers, is not sent unless the client first receives an authentication header from the server, and in general is protected from unintended access.
3. URLs in the same session must not break the same-origin policy, i.e. they must use hostnames under a common domain. However, different sessions can potentially use distinct hostnames.
4. Protocols (e.g., HTTP, FTP) navigate the URL Path using relative references. As a consequence, relative references must also work for URLs in the SBN - thus the number of segments in a mapped path must remain unchanged.
5. For some HTTP-based applications, we observe the URL Query being manipulated at the client side. Ideally the mapped query attribute names should not clash with the ones used by the application. A less common case can be seen in JavaScript applications, where the client expects to read the Query string; in such cases the mapping function should not alter it.
6. The Fragment is considered to be a local (client-side) reference, and can point to protocol-specific content. Moreover, the URL fragment is typically not sent as part of a request (HTTP), so there is little value in modifying it.
Paths are frequently manipulated at the client side, using relative links or known path segment names. In order to maximise compatibility we define an additional requirement for Paths:
7. A path can include a mixture of segments encoded and non-encoded according to the SBN.
Requirement 7 is a compromise that relaxes one of our main goals - to prevent information gathering from URLs. For completeness, we enforce this expectation here to avoid compatibility issues, but we expect it to be relaxed in some cases at the discretion of the Service Provider, since it is a matter of its specific policy for privacy support and its server code development. From requirements 4 and 7 we can extract two conclusions. First, one must be able to distinguish encoded and non-encoded Path segments. And second, path segment transformations must be independent from one another, i.e.
E_SBN(Path) = /E_SBN(Path_0)/E_SBN(Path_1)/...
D_SBN(E_SBN(Path_i)) = Path_i

Naturally, it follows that using the reverse mapping function on a Path that was not encoded returns the same Path.
D_SBN(Path_i) = Path_i

A fundamental assumption for operating a solution without client-side support is that protocols are agnostic to internal URL semantics, meaning protocol semantics are not explicitly stored in the URL. This assumption clearly holds for HTML content, e.g., a link is a link regardless of its path. However, if we look at examples where URL Paths or even parts of the Host are used to convey meaning, as is the case in HTTP REST APIs, then the previous set of requirements might not be sufficient. For many applications, the full set of constraints may be unnecessary. For now all these assumptions are left in place, while Sections 7 and 8 discuss the practical benefits of relaxing them.

4.3. Viable mapping functions

The E_SBN() function needs to be reversible for the service provider to be able to determine the original URL. This implies that either the service provider has a method to store and look up the mappings between arbitrary URLs and their original representations, or the new URL components contain all the original information within their new representation.
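Requirements 4 and 7 and the per-segment relations of Section 4.2 can be sketched as follows. The "@" sentinel marking encoded segments is borrowed from the Table 1 example, and plain URL-safe base64 stands in for the encryption step; both are assumptions of this sketch, not the paper's exact construction:

```python
# Per-segment path mapping: each segment is transformed independently,
# segment count is preserved (requirement 4), and encoded segments can be
# mixed with plain ones (requirement 7). D_SBN returns plain segments
# unchanged, i.e. D_SBN(Path_i) = Path_i.
import base64

SENTINEL = "@"  # marks an encoded segment (assumed convention)

def encode_segment(seg: str) -> str:
    # Placeholder for encryption: URL-safe base64, padding stripped.
    return SENTINEL + base64.urlsafe_b64encode(seg.encode()).decode().rstrip("=")

def decode_segment(seg: str) -> str:
    if not seg.startswith(SENTINEL):
        return seg  # non-encoded segment maps to itself
    raw = seg[len(SENTINEL):]
    raw += "=" * (-len(raw) % 4)  # restore stripped base64 padding
    return base64.urlsafe_b64decode(raw).decode()

def e_sbn_path(path: str, conceal=lambda s: True) -> str:
    segs = path.strip("/").split("/")
    return "/" + "/".join(encode_segment(s) if conceal(s) else s for s in segs)

def d_sbn_path(path: str) -> str:
    segs = path.strip("/").split("/")
    return "/" + "/".join(decode_segment(s) for s in segs)

# Mixed encoding per requirement 7: keep "static", conceal the file name.
mixed = e_sbn_path("/static/report.pdf", conceal=lambda s: s != "static")
assert mixed.startswith("/static/@")
assert d_sbn_path(mixed) == "/static/report.pdf"
```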
Table 1
Encoding examples for different URL components (K() is ECC/p256).

Component | Input         | E_SBN(Input)
Hostname  | sbn.atnog.org | ad4cbfteczl5i5jlqdyop7ji7lb7kxsy3lgcvqzcw5i5ks3kt4kdsrexsc5deqw.s7moiamauu7uogxhmi53dwj5b3my36bh2c3wkkqgfi7rf3eydqwuq.sbndomain.tk
Path      | /hello        | /@AUNuJ4C2jyxQxkWdN-jKX689p4wLfInywtEqTiAVW4EgazeDo4Wq1n84iXJm2JjL0A
Query     | ?action=42    | ?sbnquery=AL8Z-RdNVJ5-41-Oqu3K09l4xhOy8mmorBzj3xnDNXVNI2PPjVje2cHtC0LMH3dlI8J7fw
The first option is typically used by URL shortening services: they generate a unique short URL that maps into the original URL, and then look up the new URL. However this approach does not fit several use cases, since it requires a synchronised method to generate new URLs, and if the service is distributed all instances require access to the URL mappings. A second option is to encode the original URL information as part of the new one. To prevent third parties from reversing the transformation, this means encrypting the content before it is encoded as part of the new URL. For the remainder of this work, this is the approach we will emphasise. However the results (Section 7) are still applicable under a different assumption (as long as the cost of encryption/decryption is replaced by lookup costs). Choosing an encryption scheme for this task means balancing the overhead of longer URLs with the performance costs of different encryption schemes and the deployment constraints of different services. For example, do we use a symmetric key (shared by all service instances), or asymmetric keys, deploying public keys in the instances that redirect into the namespace and private keys at the SBN instances (step 4 in Fig. 1)? Without full knowledge of operational conditions it is hard to determine which strategy would be more viable, and thus we will consider several options in the following section. Note this paper does not tackle all potential scalability aspects, as key distribution and mixed encryption schemes (e.g., deriving a symmetric key from the public/private key and including the encrypted symmetric key as part of the ciphertext) are not addressed.
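The first (lookup-based) option can be sketched in a few lines; the class name and token length below are arbitrary choices for illustration, and the dictionary stands in for the shared state that every service instance would need to reach:

```python
# Shortener-style mapping: an opaque random token per URL, with the
# mapping kept in (notionally shared) state. The synchronisation of this
# table across instances is the scalability cost noted in the text.
import secrets

class MappingStore:
    def __init__(self):
        self._by_token = {}

    def e_sbn(self, url: str) -> str:
        token = secrets.token_urlsafe(16)  # unguessable opaque name
        self._by_token[token] = url
        return token

    def d_sbn(self, token: str) -> str:
        # Raises KeyError for tokens outside the session, matching the
        # "transient" property: unknown URLs result in an error.
        return self._by_token[token]

store = MappingStore()
t = store.e_sbn("https://sbn.atnog.org/hello?action=42")
assert store.d_sbn(t) == "https://sbn.atnog.org/hello?action=42"
```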
5. The SBN namespace

We can now define a mapping function for an example session bound namespace based on the constraints identified in Section 4. When converting a URL, the original URL components are encrypted with a function designated K(), where K is one of the encryption schemes under study. Our choice of encryption schemes is not meant to be exhaustive but illustrative: we selected a few of the more popular off-the-shelf implementations available. For symmetric encryption, we considered AES-CBC-128 [3] and Salsa20 [4]. The latter is a stream cipher, and requires the caller to ensure non-repeatable nonces are used with the same key. For asymmetric encryption we considered RSA and Elliptic Curve Cryptography [5]. The conversion process for each URL component is described in separate subsections. The Username, Password, Port and Fragment components are not covered, since they are left unaltered for compatibility purposes. Table 1 shows an example of the output of the ESBN function for each individual component.
[3] AES-CBC-128 as used by the python-cryptography module, with PKCS7 padding and HMAC256.
[4] As used by the sodium/NaCl implementation.
[5] Standard ECC curves in the seccure implementation.
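As an illustration of the symmetric case, the sketch below keeps the encrypt-then-MAC shape of the AES-CBC+HMAC option. It is a stdlib-only stand-in (a hash-based keystream), not one of the benchmarked schemes, and the 20-byte random nonce mirrors the one discussed later for the Host component.

```python
# Authenticated-encryption stand-in for K(), assuming a shared symmetric
# key. Real deployments would use the footnoted schemes (AES-CBC+HMAC via
# python-cryptography, or NaCl's Salsa20+Poly1305); this sketch only
# reproduces the encrypt-then-MAC structure with stdlib primitives.
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Counter-mode keystream derived from SHA-256 (illustrative only).
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def K_encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(20)                       # fresh nonce per call
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag                      # nonce || ciphertext || MAC

def K_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:20], blob[20:-32], blob[-32:]
    if not hmac.compare_digest(tag, hmac.new(key, nonce + ct, hashlib.sha256).digest()):
        raise ValueError("invalid ciphertext")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))
```

The random nonce makes equal plaintexts encrypt to different ciphertexts, which matters for the uniqueness discussion in Section 5.4.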
5.1. Authority

For a distinct namespace, we want to replace the DNS hostname with a new one, apparently unrelated to the original and unique for each session. As an example, the two hosts in the relation ESBN(sbn.atnog.org) → session1.sbndomain.tk seem unrelated to an external observer. This means the new Host should not be based on the original domain name (e.g., session11.srv.com shares a common suffix with srv.com), but hosts for different sessions can be part of a common domain (e.g., session1.sbndomain.tk and session2.sbndomain.tk). The problem can be subdivided into the following points:
1. Efficient allocation and management of bulk DNS names as session identifiers that point to the correct servers
2. Verifying a hostname as valid for a given SBN
This implies allocating a separate DNS domain. Even assuming the process could be time consuming, domain names can be acquired/configured in bulk. Configuring one hostname to point to the proper host takes at most a DNS update [39]. If necessary, the DNS hosts can be preconfigured, long before they are needed by the service provider. Alternatively, each individual hostname doesn't have to actually exist as a record: a wildcard rule can be enforced at the authoritative DNS server, directing a whole subdomain to the same set of servers. The hostname part of the URL is converted as follows:
Host → base32(K(Host)).2LD

where 2LD is a separate second-level domain under the control of the service provider, and Host is the original service hostname. As a result, the reverse mapping function allows the service provider to extract the original hostname and verify its validity. Fig. 2 shows an overhead comparison for different encryption schemes (K). The final sizes include the overhead of base32 encoding. Even the most expensive scheme (ECC/p521 in Fig. 2) would still allow us to encode 80 bytes within the new hostname. Using a 20-byte nonce further implies the original hostname must not exceed 60 bytes. RSA is always padded to a constant size, and a larger RSA key cannot be used because the ciphertext would be larger than 250 bytes. The 1024-bit key size implies we cannot encrypt plaintext larger than 86 bytes (the maximum plaintext size when using RSA-OAEP). For the example in Table 1, since the encoded content is larger than 63 octets, it was split over two labels.

5.2. Path

As outlined in Section 4.2, for compatibility purposes each path segment is encoded separately in order to retain the original Path structure. If K() is an encryption function, then any path segment is encoded as @base64(K(segment)). The full path is encoded as
/p0/p1/... → /@base64(K(p0))/@base64(K(p1))/...
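The Host and Path mappings above can be sketched as follows, using an identity stand-in for K so the encoding structure is visible, and the paper's example 2LD:

```python
# ESBN conversion sketches for Host and Path. K is a stand-in for one of
# the studied ciphers (an identity function works for illustration).
import base64

SBN_2LD = "sbndomain.tk"   # provider-controlled second-level domain

def esbn_host(host: str, K) -> str:
    enc = base64.b32encode(K(host.encode())).decode().rstrip("=").lower()
    # DNS labels are limited to 63 octets, so split the encoded blob
    labels = [enc[i:i + 63] for i in range(0, len(enc), 63)]
    return ".".join(labels + [SBN_2LD])

def esbn_path(path: str, K) -> str:
    # Each segment is encrypted separately to retain the Path structure,
    # and the @ marker flags segments that belong to the SBN.
    segments = path.lstrip("/").split("/")
    enc = ["@" + base64.urlsafe_b64encode(K(s.encode())).decode().rstrip("=")
           for s in segments]
    return "/" + "/".join(enc)
```

With a real cipher the base32 blob grows with the ciphertext size, which is why longer hostnames spill over into a second label, as in Table 1.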
Fig. 2. Host overhead for multiple encryption schemes.
Fig. 3. Path segment overhead for multiple encryption schemes.
The “at” symbol (@) is used here as a special marker to distinguish between Path segments encoded into the namespace and those that are not (Section 4, Requirements 4 and 7). This marker must not appear in non-SBN URLs, which led to our choice of @. We use it here as an example, but other symbols or distinction rules could be used instead. The encoding overhead for each path segment under different encryption schemes can be observed in Fig. 3. As in the previous case, RSA (PKCS7) is also included, now for larger key sizes to allow for larger payloads. As before, they are still limited in the maximum
payload (RSA-1024: 86 bytes; RSA-2048: 214 bytes; RSA-4096: 470 bytes) and padded to a constant size.

5.3. Query

Query arguments can be concealed by encoding the original query inside a new query argument, for example using the sbnquery argument to hold the encoded query:
Query → sbnquery=base64(K(Query))
Concerning data encoding, the same method used for Path segments (Fig. 3) is reused here for the original Query.
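A hypothetical roundtrip for the sbnquery convention, again with identity stand-ins for K and its inverse:

```python
# Query mapping sketch for the sbnquery convention. K / K_inv are assumed
# encryption/decryption helpers; identity functions suffice to illustrate
# the encoding.
import base64

def esbn_query(query: str, K) -> str:
    blob = base64.urlsafe_b64encode(K(query.encode())).decode().rstrip("=")
    return "sbnquery=" + blob

def esbn_query_reverse(query: str, K_inv) -> str:
    blob = query.split("=", 1)[1]
    blob += "=" * (-len(blob) % 4)        # restore stripped base64 padding
    return K_inv(base64.urlsafe_b64decode(blob)).decode()
```

Because the whole original query becomes a single argument, its internal structure (number and names of parameters) is also hidden, unlike the per-segment Path encoding.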
As pointed out in Section 5.1, step 2 in Fig. 4 can be avoided if replaced with a wildcard configuration in the authoritative DNS server. This is the method used for our prototype deployment.
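Assuming a provider-controlled zone such as sbndomain.tk, the wildcard configuration amounts to a single record in the authoritative zone; the address below is illustrative:

```
; Hypothetical zone fragment for the SBN domain
$ORIGIN sbndomain.tk.
*    300    IN    A    203.0.113.10    ; every session hostname resolves here
```

A short TTL keeps transient session hostnames from lingering in resolver caches (see Section 8.4).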
5.4. Additional considerations

While the previous scheme conceals information through encryption, it does not necessarily enforce uniqueness across different sessions. The underlying encryption scheme may guarantee that encrypting the same plaintext twice will result in different ciphertexts, but if this assumption does not hold we can consider two strategies:
1. Use distinct keys for each session, and identify keys based on signed HTTP cookies
2. If using a single key, insert a random session-specific nonce (e.g., a 20-byte UUID) as part of encrypted payloads
Relying on encryption for the various URL components implies additional overhead when handling these URLs. With the encoding methods outlined in this section, processing each URL requires a number of encryption operations that increases linearly with the number of path segments:
K(hostname) + Σ_{i=0..N} K(Path_i) + K(Query)
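The two uniqueness strategies above can be sketched as follows; the master secret and the derivation labels are illustrative assumptions.

```python
# Session-uniqueness strategies from Section 5.4, assuming a master
# secret held by the service provider.
import hashlib
import hmac
import os

MASTER_KEY = os.urandom(32)

# Strategy 1: a distinct key per session, derived from the session id
# (which would travel in a signed HTTP cookie).
def session_key(session_id: str) -> bytes:
    return hmac.new(MASTER_KEY, b"sbn-session:" + session_id.encode(),
                    hashlib.sha256).digest()

# Strategy 2: a single key, with a random session nonce mixed into each
# encrypted payload so equal plaintexts differ across sessions.
def session_nonce() -> bytes:
    return os.urandom(20)                 # 20-byte nonce, as in the text

def payload(plaintext: bytes, nonce: bytes) -> bytes:
    # The concatenation is then encrypted as one unit by K().
    return nonce + plaintext
```

Strategy 1 keeps payloads shorter; Strategy 2 avoids per-session key management at the cost of 20 extra bytes per encrypted component.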
We will not go into an exhaustive performance comparison between all encryption schemes, since the choice of algorithm is application specific, and performance may vary as software [6] and hardware implementations of these schemes become available. However, we point out that the service provider can cache the results of encryption/decryption for the same key to avoid those penalties, turning K() into an amortised constant-time operation for each session at the expense of more memory. This issue is discussed further in Section 8.

6. Instantiation over HTTP

After considering the methods and limitations for building Session Bound Namespaces, we can now instantiate them in existing protocols and applications. Our scenario and implementation focus on HTTP, since it enables an implementation of Session Bound Namespaces without client-side support and is the most widespread protocol for service provider access. A prototype instance of our implementation is available for consultation at http://sbn.atnog.org. The source code for the application is publicly available as an Open Source package [7].

6.1. Scenario

A sequence diagram of an instantiation of SBN in HTTP is depicted in Fig. 4. As with any other HTTP content, the user starts browsing through a well-known (or previously recorded) URL. The Service Provider then decides when to redirect the browser to start using URLs in the Session Bound Namespace (step 3 in Fig. 4). This can be done arbitrarily, e.g., when the user logs in at the web site, for every new session, or by any other rule defined by the "space privacy policy". However, some care must be taken when handling the redirection step: since the hostname in the URL may change, cookies established when accessing the first instance may not be in effect due to the same-origin policy.
As in other login procedures that cross different domains, the original Service Provider instance should redirect the browser along with a signed token to establish the session at the new instance.

[6] http://www.cryptopp.com/benchmarks.html
[7] https://github.com/ATNoG/flask-sbn
6.2. Implementation

One observation to be made for implementation purposes is that many web development frameworks exercise complete control over the internal (i.e. within the same domain) links in HTML pages. Popular web frameworks (such as Django, Ruby on Rails, or JSP) generate HTML content from page templates, where internal URLs are formed automatically based on dynamic settings (service hostname, path locations, localisation settings). This means these frameworks are well suited to implement the kind of logic required for our purposes, since they already deal with dynamic URL schemes, due to virtual hostnames, component path discovery and contextual URL creation. Several proposals (such as [29]) implement their solutions as transparent HTTP proxies that rewrite HTML content. We prefer instead to implement our proposal as part of an existing web framework. Doing so places control of the namespace closer to the application, but we can also use it as a transparent proxy for testing and instrumentation purposes. Our implementation is based on the Flask [8] web development framework, and is defined by three main components (Fig. 5):
1) a middleware that handles incoming HTTP requests, decrypts URLs in HTTP headers, and replaces them with the decrypted values;
2) a set of URL generator stubs that replace the ones used by default in application code or by the template engine, rewriting all URLs created in HTML content (encrypting URLs in the SBN);
3) a module to rewrite HTTP responses.
All these components share a common configuration that defines the Session Bound Namespace parameters (encoding and encryption scheme), and communicate with each other through a context shared for each HTTP request or through a shared cache for URL encryption and decryption operations. URL components are never encrypted or decrypted more than once for the same HTTP request, since the results are cached during the request.
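The first component (the request-decoding middleware) can be sketched as plain WSGI, which is how Flask-style middlewares wrap an application. The decoding helper below is an identity stand-in for the real decryption of Section 5, and the class name is ours.

```python
# WSGI middleware sketch: decode SBN path segments on incoming requests
# before the application routes them. `decrypt_segment` is an assumed
# helper implementing the reverse of the Section 5 path mapping.
import base64

def decrypt_segment(seg: bytes) -> str:
    # Stand-in "cipher": base64-decode only; a real deployment decrypts.
    return base64.urlsafe_b64decode(seg + b"=" * (-len(seg) % 4)).decode()

class SBNMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        parts = environ.get("PATH_INFO", "/").split("/")
        # Only @-marked segments belong to the SBN (Section 5.2).
        decoded = [decrypt_segment(p[1:].encode()) if p.startswith("@") else p
                   for p in parts]
        environ["PATH_INFO"] = "/".join(decoded)
        return self.app(environ, start_response)

# With Flask this would plug in as: app.wsgi_app = SBNMiddleware(app.wsgi_app)
```

The application underneath then sees only original, decrypted paths, matching the design goal that application code stays unaware of the SBN.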
By default the application code is unaware of the real URLs being used, as it always operates on the original decrypted URLs. This is possible because application code in Flask (as in other frameworks) rarely handles URLs directly, and instead builds them through internal APIs that go through our URL generators. In addition, we provide extensions for the application to take control of the namespace transformation. These are necessary for application code to redirect the browser into the new namespace or, conversely, to end a session by redirecting back to the original website.

[8] http://flask.pocoo.org/

Fig. 4. SBN workflow for an HTTP browser.
Fig. 5. Functional diagram for SBN implementation.

7. Results

To evaluate the impact of this solution we analyse the HTTP traffic of a set of popular websites, through a transparent HTTP proxy built on top of our implementation that rewrites HTTP requests and responses. Our dataset uses some of the top websites in the Alexa ranking, but ignores some popular cases that are optimised to display minimal content on load, or that defer content loading using Javascript, making it harder to gather a reliable comparison of website behaviour over multiple samples. We are mainly interested in the potential impact caused by these methods, in direct proportion to the number of links in a webpage and the URL length (number of path segments, due to Requirement 7 from Section 4). This analysis can determine the number of encryption and decryption operations required to enable the use of SBN on these websites. As a starting point for our analysis, we assume an extreme scenario where all links in a web page are encoded into the Session Bound Namespace, including links and references to other websites (i.e. other domains). When an HTTP request is made, all URLs in headers are decrypted (if the URL falls under the SBN), and all URLs in the headers of the HTTP response and in HTML content (anchors, images, CSS links, etc.) are encrypted. This goes beyond the scope of our objectives, but enables us to observe how our solution behaves under extreme conditions. Naturally, different web pages have distinct load behaviours in the number of HTTP requests (Fig. 6) and HTML URLs (Fig. 7) that must be encoded according to the SBN. For each case we distinguish between requests and URLs that fall under the Same Origin Policy (SOP) of the visited website (same domain/port); this distinction is considered for optimisation purposes. Fig. 6 shows the number of HTTP requests made for each webpage, and Fig. 7 shows the number of URLs in an HTML page that need to be rewritten (anchors, links, images, etc.).
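The SOP classification used in these measurements can be sketched as a simple origin comparison; the function names are ours, and relative URLs (always same-origin) are handled explicitly:

```python
# Same Origin Policy check used to classify URLs into the SOP/non-SOP
# split of Figs. 6 and 7: scheme, host and port must all match.
from urllib.parse import urlsplit

def same_origin(url: str, site: str) -> bool:
    a, b = urlsplit(url), urlsplit(site)
    if not a.netloc:                  # relative URL: same origin by definition
        return True
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)
```

As Section 7.1 notes, this check alone misses provider-operated domains that differ from the visited one (CDN hosts); a provider-maintained allow-list would be needed to cover those.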
For reference, an existing survey on this topic [9] provides insight into what can be considered an average web page (at 100 objects per page). The site nytimes.com (Fig. 7) is well above average in this regard, but we also observe that some websites defer the loading of content until the user performs some action (e.g., scrolling). For the measurements presented, the number of HTTP requests includes all requests from the moment the page starts loading until it is fully visible in the browser and the browser stops issuing new requests. Fig. 8 shows the number of decryption and encryption operations (per URL component) required to serve each webpage. The number of required operations is certainly high, but we point out that this includes all requests involved in fetching the webpage (i.e. images, CSS, Javascript, HTML, etc.), and some sites split the load process over multiple HTTP requests (Fig. 6), e.g., amazon.com (142 requests). In the following subsections we analyse how relaxing some of the requirements identified earlier and adjusting the cache settings of our implementation can impact these costs.

[9] http://www.websiteoptimization.com/speed/tweak/average-web-page/

Fig. 6. Number of HTTP requests.
Fig. 7. Number of URLs in HTTP content.

7.1. Same origin policy content only

An approach more aligned with the privacy goals of a service provider is to only encode links that fall under the Same Origin Policy (SOP) of the original website domain, i.e. all URLs under the same domain or subdomain. This avoids encoding URLs in HTTP requests to CDN hosts, or encoding links pointing to external domains. Ignoring these requests, the website load patterns reveal that most websites only issue between 1 and 6 requests (Fig. 6) under the Same Origin Policy, except amazon.com, which needs 13 requests. Similarly, the number of HTML URLs that fall under the SOP is also smaller (Fig. 7). However, the SOP distinction is not always a clear indicator of origin: e.g., nytimes.com loads static content from the nyt.com domain, and facebook.com from fbstatic-a.akamaihd.net, which are not rewritten since they fall outside the SOP. This information is, however, known to the service provider itself, which operates all those domains and can take advantage of this knowledge. The most noticeable difference is the decrease in the number of decryption operations for the Host and Path (Fig. 9). Decryption operations take place when decoding the URL in HTTP headers, and since the number of HTTP requests under the same origin policy is lower, the number of decryption operations decreases proportionally. For a "selfish" privacy policy on the part of the service provider, this is the simplest optimisation to use.

7.2. Encryption/decryption caching

As pointed out in Section 5.4, there is room for optimisation, provided the server can cache encryption/decryption operations across HTTP requests. Repeating the previous scenario, but now with an encryption/decryption cache that keeps results from previous HTTP requests, results in Fig. 10. Unsurprisingly, there is a decrease in the number of operations for all sites. There are some cases where
caching is not very effective, e.g., Linkedin, which generates HTML links using 238 different subdomains, causing a high number of encryption operations. Similarly, Wikipedia uses 289 different subdomains. The benefits of caching are less significant for pages with a high number of unique links, such as amazon.com or yahoo.com.
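A cross-request cache of this kind can be sketched as a dictionary keyed on session and plaintext; the cipher passed in is an assumed stand-in:

```python
# Cross-request encryption cache (Section 7.2 scenario): K() results are
# memoised per session, so repeated URL components cost one encryption
# for the whole session rather than one per request.
class SessionEncryptionCache:
    def __init__(self, encrypt):
        self.encrypt = encrypt
        self.store = {}           # (session_id, plaintext) -> ciphertext
        self.misses = 0

    def K(self, session_id: str, plaintext: bytes) -> bytes:
        key = (session_id, plaintext)
        if key not in self.store:
            self.misses += 1      # only a miss runs the actual cipher
            self.store[key] = self.encrypt(plaintext)
        return self.store[key]
```

Pages dominated by unique links or many distinct subdomains keep the miss count high, which is consistent with the weak caching gains observed above for Linkedin, Wikipedia, amazon.com and yahoo.com.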
7.3. Optimise path encoding

Looking at the previous results (Fig. 10), we observe that the majority of the effort is spent encoding path labels. Only two cases (linkedin.com and wikipedia.com) spend more time encoding hostnames, due to the large number of domains used by the website. Requirement 7, as defined in Section 4, requires that path labels be encoded separately. If we drop this requirement and encode the full path in a single encryption operation (just like the Host and Query), the Path encryption/decryption cost drops to a fraction of the previous case (Fig. 11).
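The collapsed variant then costs a single K() call per URL regardless of path depth; identity and counting stand-ins for K are used below for illustration:

```python
# Collapsed-path variant: the whole path is one encryption unit, trading
# Requirement 7 (per-segment structure preserved) for one K() call per
# URL instead of one per path label.
import base64

def esbn_path_collapsed(path: str, K) -> str:
    blob = base64.urlsafe_b64encode(K(path.encode())).decode().rstrip("=")
    return "/@" + blob
```

The cost, as noted in Section 4, is that intermediate path prefixes lose meaning: the server can no longer route on partially decoded paths.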
Fig. 8. Encryption/Decryption operations for different web pages.
Fig. 9. Encryption/Decryption operations for different web pages (Same Origin Policy content only).
While the Path encryption operation count still looks high for amazon.com and yahoo.com, these websites have a high number of URLs that need to be rewritten (290 and 307 respectively). As such, the results in Fig. 11 are already under one path encryption operation per URL (due to caching).
Table 2. Encryption/decryption microbenchmarks for a 1000-byte plaintext (average per operation).

Scheme             Encryption (ms)   Decryption (ms)
Salsa20+Poly1305   0.063             0.049
ECC 160            3.524             1.914
AES-CBC            0.235             0.238
7.4. Encryption/Decryption delay

To estimate how the previous results can impact operation times, we benchmarked execution times for encryption/decryption operations (Table 2). All presented times (in milliseconds) are averages over 3000 executions, and the benchmark was performed on an Intel i5-4570R CPU at 2.70GHz. We omit the remaining ECC curves and RSA, since their running times are always higher than those of the other schemes and too expensive for our purposes.
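Combining the per-operation times in Table 2 with per-page operation counts gives a back-of-the-envelope delay estimate; the amazon.com counts below are those reported for Fig. 8:

```python
# Delay estimate: total = n_encrypt * enc_ms + n_decrypt * dec_ms,
# with per-operation costs taken from Table 2.
COST_MS = {  # scheme -> (encrypt_ms, decrypt_ms)
    "Salsa20+Poly1305": (0.063, 0.049),
    "ECC 160":          (3.524, 1.914),
    "AES-CBC":          (0.235, 0.238),
}

def total_delay_ms(scheme: str, n_encrypt: int, n_decrypt: int) -> float:
    enc, dec = COST_MS[scheme]
    return n_encrypt * enc + n_decrypt * dec

# amazon.com, initial scenario: 776 encryptions, 605 decryptions,
# giving (within rounding) the per-scheme totals quoted in the text.
```

The same function can be evaluated over a grid of operation counts to reproduce the delay zones of Fig. 13.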
For example, consider the results for amazon.com from the initial scenario (Fig. 8, with 776 encryption and 605 decryption operations). This would represent a total delay of 79 ms (Salsa20), 326 ms (AES) or 3892 ms (ECC). In the final caching scenario (as shown in Fig. 11, 370 encryption and 3 decryption operations), amazon.com would suffer a delay of 11 ms (Salsa20), 59 ms (AES) or 1410 ms (ECC). One can expand these estimates using different encryption and decryption counts to highlight (potentially targeted) delay zones (Fig. 13).

Fig. 10. Encryption/Decryption operations for different web pages (SOP content only; with caching).
Fig. 11. Encryption/Decryption operations for different web pages (collapsed path encoding).

Fig. 12 shows the 99.999% confidence interval for encryption and decryption times over various payload sizes. We can observe that plaintext size has little impact on the delay of each individual operation. While AES has similar times for encryption and decryption, in the other algorithms encryption is generally more expensive than decryption, which is contrary to our goals, since the number of URLs in a page that are encrypted is higher than the number of URLs in HTTP headers that are decrypted. The same relation can be observed as the slope in Fig. 13.

8. Discussion

This section discusses the main aspects of these techniques for generalised use in "digital privacy spaces", followed by the challenges encountered while deploying them and a discussion on how to address them. Fundamentally, we are motivated by the notion that services can improve privacy in the namespaces they control, even if doing so requires additional effort or potentially compromises other goals. But it is relevant to discuss the choices that must be made and the future directions that can be considered. A number of privacy attacks focus on extracting information from URL metadata, through traffic analysis, active probing of browser caches, remote exploits or forensic analysis of user equipment. If all URLs are transformed by the service provider and the namespace changes with each user session (or another privacy policy, such as per-minute/hour), the applicability of such attacks is diminished. Since URLs lose their global meaning, an attacker has less information to gain from a remote attack, and even forensic analysis of a stolen user terminal would hardly be effective.
Fig. 12. Encryption/Decryption times vs plaintext size.
8.1. Relation to other privacy tools

The use of HTTPS protects against content scanning of network traffic. However, our key concern is attacks that can either probe the user browser for URL metadata or access the user terminal through other means. Since URLs are sent and stored unchanged, protection against these types of attacks falls outside the scope of HTTPS. Furthermore, HTTPS is not guaranteed to conceal the hostname component of the URL, which is revealed in the initial exchange [8], unless recent extensions [40] are used. Similarly to HTTPS, ToR [1] provides traffic encryption (within the ToR circuit). It also provides anonymity, concealing the real source network address in transit, although anonymity is orthogonal to the goals of this work. ToR also provides the ability to create hidden services. These are pseudo-hostnames under the .onion TLD that can be used to connect with services attached to ToR. The remainder of the host in a .onion address is derived from a public key and holds no special semantics. Clients connect to these services using software that automatically maps hostnames in the .onion TLD into connection parameters in the ToR network. This works as an effective form of privacy, as a .onion address can belong to any party connected to ToR. However, such privacy only applies to the hostname; the remaining URL is unchanged. For transient hostnames, a service provider would need to set up multiple .onion addresses, much like we do here with DNS. The main disadvantage of ToR hidden services in the context of our scenario is that they require server-side setup of the ToR connection, as well as client-side configuration (i.e. installing ToR). Alternatively, the user can access these services through proxy services such as ToR2Web [10], at the expense of anonymity, and assuming such services are trustworthy.
This is not surprising, as the primary goal of ToR is network anonymity, while other client-side tools (e.g., the ToR browser) address privacy issues. With regards to the solution proposed in this paper, .onion addresses could be used for the hostname, while SBN is used for the other URL components.

[10] https://www.tor2web.org

The Veil framework [33] implements multiple techniques for URL and page content obfuscation. It does not use URL encoding schemes like those described in this paper; instead it injects a Javascript library into web pages to decode and fetch encrypted URLs. Since it completely recompiles HTML, CSS and Javascript content before application deployment, it avoids constraining its requirements on URL usage. On the client side there are multiple tools for browser privacy protection. For example, most browsers currently implement a private browsing mode (PBM), under which no browsing records are stored, although these modes are sometimes faulty, can be undermined by third-party components [35,41,42], or fail against forensic inspection [43]. The same can be said of privacy-savvy habits, such as regularly cleaning the cache and history, or using specialised extensions. None of these are practices the service provider can control or enforce on users. Finally, disk encryption tools can protect data privacy in stolen devices, but passwords can still be obtained through coercion, and poor judgement can lead to the use of a weak password.

8.2. HTTP Applications

We detail our design rationale in Sections 4 and 5 in order to highlight the choices that are available. Our example namespace clearly favours backward compatibility for the HTTP case, but other options could be more efficient or useful under different goals (path encoding is one such case). There are clear limitations in URL lengths, but we did not encounter actual cases where this was an issue. In general terms, a service provider can use the techniques discussed earlier to place arbitrary information in URLs. For our analysis we considered both symmetric and asymmetric encryption, meaning the data can be read by third parties that hold the appropriate key.
However, it is important to emphasise that since URLs are not trustworthy, the encrypted content does not prevent other parties from capturing and using them (i.e. replay attacks). To overcome this, transient strategies must be in place, with a specific timing granularity. Transient (time-limited, or one-time) URLs are commonly used for email or registration confirmation workflows, where a unique URL is mapped into a record (e.g., an email) stored by the service provider, or the URL includes a signed token. While this study focused on HTTP (due to its pervasive use and because it can be implemented and tested without changing the browser client application), similar methods can be applied to URLs in other protocols (for example, email addresses are pseudo-URLs, and URLs are often used as token carriers in NFC tags or QR codes). The methods discussed here can be used in similar scenarios, where the service provider wants to simultaneously minimise internal state by including data in URLs and needs to protect the enclosed data.

8.3. URLs lose global meaning

In "digital privacy spaces", URLs lose global meaning because they are only valid within a certain scope. Immediate implications of this feature follow from the browser/user being unaware of the transient nature of these URLs:
1. Indexing of content (e.g., by search engines) will not work
2. Bookmarking and sharing of URLs will not work
These are significant usability issues that can hinder user experience, but they can be addressed with additional work. Indexing of content by search engines is also covered in [29], which notes that search engines use well-known user agents. A similar approach can be adopted here by disabling the use of SBNs for well-known indexers, if so desired. Bookmarking and sharing stop working because the URLs may have a limited lifetime or, worse, if they are pinned to the user session, a shared URL will always result in an error. To approach this issue, one needs to enable the client browser to decode the URLs under specific conditions. Possibilities include sharing the encryption key with the client (e.g., sending the key as part of a cookie when the session starts) and giving the browser the ability to understand the SBN namespace. For simpler cases, like bookmarking or content-sharing widgets, this can be implemented using Javascript served as part of the web page. In more complex scenarios, where we want to decode all possible URLs in a webpage at the client side, it would be easier to implement this logic as a browser extension. On the other hand, the impact of these usability issues varies with the scenario. In a privacy-dominated scenario these consequences may be seen as irrelevant, or even desirable.

8.4. Impact in DNS caches

The number of entries in DNS caches increases linearly with the number of sessions cached at each DNS resolver. In addition, each entry is usually large due to the base32 encoding. Since SBN URLs use unique hostnames for each session, all DNS requests arrive from the same user terminal and there is little benefit to DNS caching. Assuming DNS caching resolvers apply a least-recently-used eviction policy, these entries are removed as soon as the cache becomes saturated. A non-optimal case occurs for alternative eviction policies, because the cache may keep these transient entries when they are not needed and evict more relevant entries.

8.5. Impact in HTTP caching

Fig. 13. Estimated total Encryption+Decryption delay based on op-count.

Given their nature as resource identifiers, URLs are used as keys for caching content at various locations:
1. The client terminal, to reduce loading times.
2. The Service Provider, to reduce load on its infrastructure.
3. Transparently, at the network provider infrastructure, to reduce network load.
4. In leased infrastructure close to the access network (such as Akamai), where the service provider can place content to minimise retrieval times.

At the client terminal, the impact of this scheme is inversely proportional to session duration. Short sessions can result in a cache filled with URLs that will not be reused. Privacy-conscious web sites could request a cache cleanup when logging out, minimising this issue, but to our knowledge no such mechanism is widely available. Within service provider premises, the ability to revert the namespace mapping for incoming requests means content can be cached based on the original URL. If the mapping function is considered expensive, its results can also be cached to minimise retrieval delays. Transparent caches are those most affected by this scheme, since multiple copies of the same content can exist in the same cache. In the specific case of HTTP this is a known issue when serving user-specific content that cannot be cached as a whole. To minimise this issue, the Service Provider may choose to serve some recurrent content outside of the SBN (potentially sacrificing privacy for performance). Finally, for service provider caches leased at the access network, the mapping can be reversed before querying the cache, but this would require distributing the necessary key through that infrastructure. Such a compromise does not seem adequate for strong privacy scenarios, and is hardly in the best interest of either the user or the service provider.
8.6. Impact in transport protocols

With regards to the network protocols used while serving content, two additional consequences were observed while conducting our experiments. First, in DNS traffic, the payload of a DNS query increases with the hostname length. In systems where the maximum DNS message size is 512 bytes, this means the use of TCP [44] might be required to serve requests for these hostnames. Second, with regards to HTTP and its caches, the use of different hostnames means that optimisations such as reusing a single TCP connection to handle multiple requests are not triggered; instead of a single TCP connection we will have multiple connections. When using HTTPS this difference is more pronounced, since the TLS negotiation needs to be performed multiple times.
8.7. TLS deployment complexity

The use of multiple hostnames for the same service requires operating more complex service deployments for DNS and TLS. The latter in particular has high maintenance costs: acquiring a dozen wildcard TLS certificates would be expensive. An additional usability concern is that the change to a strange hostname as the session starts clashes with established security practices: users don't expect the hostname in webpages to change drastically as they log in, and they look for the correct hostname for fear of phishing or TLS attacks. It is also debatable whether extending this level of protection to the hostname is worth the cost and performance impact. Since the user always starts by using the real service hostname (step 1 in Fig. 1), it can be argued that a capable attacker will always be able to extract the hostname (using TLS or DNS man-in-the-middle attacks), unless other privacy tools are in place. Existing privacy tools can help address these issues. For example, ToR [1] can protect the
135
initial TLS handshake and DNSCrypt11 (or [45]) can protect the DNS traffic. Alternatives to the economic constraints associated with these issues may still be worth exploring. If one service can’t host multiple domains for SBN, maybe several services can agree on sharing a pool of subdomains for this purpose. One can easily imagine a .sbn-sessions.org service, that provides pools of subdomains for privacy conscious services to use. Nevertheless the final compromise in these design issues is dictated by service provider policy, and the required privacy levels. 8.8. Privacy caveats While primarily motivated by privacy, there are a number of related caveats that are worth considering when using the proposed privacy strategy. Under the proposed scheme URLs retain their original structure: the number of Path labels presence of Hostname and Query are not changed. To our knowledge no privacy study addresses the possibility of uniquely identifying a service provider (or content) based on the structure of the URLs it generates - i.e. if one inspects a browser cache solely based on URL structure and the URL relation graph, is there a significant probability of identifying a specific website? Objectively this is left as an open question, however the proposed scheme could be extended to randomise URL structure as a way to mitigate that possibility (e.g., insert discardable path labels). The primary assumption of this work is that transient URLs function as privacy preserving mechanism. As such, a valid concern is that the uniqueness of the URL components can be used to track the client. While the placement of unique identifiers inside URLs for tracking purposes is not new, unique session hostnames are not common, and will be “leaked” as the client terminal issues DNS requests. An attacker with the ability to intercept DNS traffic and a-priori knowledge about the SBN namespace, could exploit this to track user movement, albeit with reduced metadata. 
The service provider can minimise this issue by forcing periodic rotation of the namespace/session, otherwise the issue becomes more significant for long lived sessions. Finally, notice that since this type of schemes are implemented by the service provider, the level of privacy is at the discretion of the service provider, much like real world venues enforce different rules to apply different levels of privacy. For example, the service provider still holds information about who (IP, usernames) visited what (canonical resource URL) and lawful interception can still occur at the service provider. 8.9. Reasonable use policy and performance We have weighted some pros and cons of systematically using this approach. While the sacrifices of adopting this solution are considerable (transferability, memorability, performance, setup costs), it is worth mentioning that it is possible for a service provider to adopt a partial scheme that only covers some URL components, e.g., Path: Protecting the URL Path alone will not prevent an attacker from determining which websites (hostnames) a user visited, but it does hide specific resources represented in the Path. In some cases this may be sufficient, if the Service Provider sees no need to conceal the remainder of the URL. Hostname: From the points already discussed in this section, assigning new hostnames for different sessions may involve additional setup costs, and the privacy benefits are arguable without
11
https://www.opendns.com/about/innovations/dnscrypt/
136
R. Ferreira and R.L. Aguiar / Journal of Information Security and Applications 46 (2019) 121–137
further considerations, e.g., the first contact will always disclose the hostname, and the leakage in DNS queries may be detrimental. Query: Concealment of Query information through this scheme is highly dependent on the application, as some service providers only use queries as part of client (browser) crafted URLs. Mixed use: Using a mixture of bound and unbound URLs in the same website is also possible, provided the limitations of the Same Origin Policy are not an issue. It is already common behaviour for websites to offload popular website components (e.g., Javascript, images, videos) to external CDNs. Whether or not this is privacy leaking depends on the website. These choices have direct impact in the overall performance of this solutions, because they impact the technical approaches made earlier, namely: 1. The encryption scheme 2. The number of encryption/decryption operations per URL 3. The effectiveness of caching these operations The microbenchmark in Table 2 offers some indication on the cost of different encryption schemes. In the presented example, a delay of 10–30 ms (Salsa20) might be acceptable for websites that are not delay sensitive, but the 1–4 s delay seen when using ECC is unacceptable for most cases. An additional observation can be made that, for the described mapping function, the number of encryption operations is higher than the number of decryption operations (because the user might not open all encoded URLs). The choice of encryption scheme could take advantage of this. Some other approaches are left unexplored, for example would it be possible to pre-compute all URLs in the Session Bound Namespace when a session starts? This seems unlikely to be feasible: for large websites with dynamic URLs one usually assumes that the amount of valid URLs is not known (and depends on external factors), but it might be a valid approach to reduce delay in some cases. 9. 
Conclusions In the real world, private locations apply rules to uphold expected privacy standards: in some private spaces, the use of cameras or phones is forbidden to avoid recording. For digital privacy, it is not always clear what technical tools are used to this end, or the involved trade-offs. URLs hold a significant amount of metadata about online interaction, describing user action or content information in some detail, and their leakage can be damaging to either individuals or services. But how can digital services enforce reasonable expectations of privacy against URL metadata leakage, often outside of its control (e.g., network and browser caches, user terminal), creating digital private spaces? In this work we are concerned with the feasibility for service providers to conceal URL metadata through encryption, creating a Session Bound Namespace of URLs. These are meant to conceal (all) parts of an URL from unintended disclosure (browser cache attacks, forensic inspection of the user terminal) and cross reference. Compatibility issues, implementation constraints and established practices in HTTP, limit the length of and properties of URLs and consequently the transformations that can be applied for privacy support. Our privacy proposal has implications on service deployment that affect HTTP caching and incur in additional costs or deployment work for DNS and TLS. However the service provider can choose the best compromise between privacy, cost and performance through selection of namespace features. Whenever possible we maintain compatibility, and we also identify significant benefits of compromising compatibility for performance.
The choice of encryption scheme and namespace transformation functions have also performance implications, that to some degree can be addressed through various caching strategies. However, note that some encryption schemes introduce too much overhead to be viable for widespread deployment. Our namespace proposal intentionally sacrifices performance for backwards compatibility, but this need not be the case in all scenarios. Furthermore, we also identified some usability issues with our approach, as a consequence of using transient and opaque URLs. But we also consider that these issues can be partially addressed through client side support, such as browser extensions, that understand the namespace semantics. Our expectation is for this work to enable society to design services with different expectations of privacy with regards to URL metadata, rather than on absolute privacy or complete lack of it, much like we hold different notions of privacy for physical locations. Acknowledgments This work was supported within the scope of R&D Unit 50 0 08, and financed by the applicable financial framework (FCT/ MEC through national funds and when applicable co-funded by FEDER PT2020 partnership agreement) with ref. no. UID/EEA/ 50 0 08/2013. Supplementary materials Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jisa.2019.03.010. References [1] Dingledine R, Mathewson N, Syverson P. Tor: the second-generation onion router. In: Proceedings of the 13th conference on USENIX security symposium - volume 13. Berkeley, CA, USA: USENIX Association; 2004. p. 21. [2] Berners-Lee T, Fielding R, Masinter L. Uniform resource identifier (URI): generic syntax; 2005. [3] Jackson C, Bortz A, Boneh D, Mitchell JC. Protecting browser state from web privacy attacks. In: Proceedings of the 15th international conference on world wide web. ACM; 2006. p. 737–44. [4] Janc A, Olejnik L. 
Web browser history detection as a real-world privacy threat. In: Proceedings of the 15th European conference on research in computer security. Springer-Verlag; 2010. p. 215–31. [5] Vanhoef M, Matte C, Cunche M, Cardoso LS, Piessens F. Why mac address randomization is not enough: an analysis of wi-fi network discovery mechanisms. In: Proceedings of the 11th ACM on Asia conference on computer and communications security. New York, NY, USA: ACM; 2016. p. 413–24. ISBN 978-1-4503-4233-9. [6] Rajavelsamy R, Das D, Choudhary M. Privacy protection and mitigation of unauthorized tracking in 3gpp-wifi interworking networks. In: 2018 IEEE wireless communications and networking conference (WCNC); 2018. p. 1–6. [7] Acar G, Eubank C, Englehardt S, Juarez M, Narayanan A, Diaz C. The web never forgets: persistent tracking mechanisms in the wild. In: Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. New York, NY, USA: ACM; 2014. p. 674–89. [8] Gonzalez R, Soriente C, Laoutaris N. User profiling in the time of https. In: Proceedings of the 2016 internet measurement conference. New York, NY, USA: ACM; 2016. p. 373–9. ISBN 978-1-4503-4526-2. [9] Weinberg Z, Chen EY, Jayaraman PR, Jackson C. I still know what you visited last summer: leaking browsing history via user interaction and side channel attacks. In: 2011 IEEE symposium on security and privacy; 2011. p. 147–61. [10] Englehardt S, Reisman D, Eubank C, Zimmerman P, Mayer J, Narayanan A, et al. Cookies that give you away: the surveillance implications of web tracking. In: Proceedings of the 24th international conference on world wide web. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2015. p. 289–99. ISBN 978-1-4503-3469-3. [11] Englehardt S, Narayanan A. Online tracking: a 1-million-site measurement and analysis. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. New York, NY, USA: ACM; 2016. p. 
1388–401. ISBN 978-1-4503-4139-4. [12] Fielding RT, Reschke J. Hypertext transfer protocol (HTTP/1.1): semantics and content; 2014. https://rfc-editor.org/rfc/rfc7231.txt [13] Preibusch S, Peetz T, Acar G, Berendt B. Shopping for privacy: purchase details leaked to paypal. Electron Commer Res Appl 2016;15:52–64. doi:10.1016/ j.elerap.2015.11.004. [14] Krishnamurthy B, Naryshkin K, Wills CE. Privacy leakage vs. protection measures: the growing disconnect. In: In Web 2.0 workshop on security and privacy; 2011. p. 2–11.
R. Ferreira and R.L. Aguiar / Journal of Information Security and Applications 46 (2019) 121–137 [15] Malandrino D, Petta A, Scarano V, Serra L, Spinelli R, Krishnamurthy B. Privacy awareness about information leakage: who knows what about me?. In: Proceedings of the 12th ACM workshop on workshop on privacy in the electronic society. New York, NY, USA: ACM; 2013. p. 279–84. ISBN 978-1-4503-2485-4. doi:10.1145/2517840.2517868. [16] Starov O, Gill P, Nikiforakis N. Are you sure you want to contact us? quantifying the leakage of pii via website contact forms. Proc Priv Enhanc Technol, 2016, Iss 1, Pp 20–33 (2016) 2016:20. [17] Ruiz-MartÃnez A. A survey on solutions and main free tools for privacy enhancing web communications. J Netw Comput Appl 2012;35(5):1473–92. Service Delivery Management in Broadband Networks doi: 10.1016/j.jnca.2012.02. 011 [18] Gómez-Boix A, Laperdrix P, Baudry B. Hiding in the crowd: an analysis of the effectiveness of browser fingerprinting at large scale. In: Proceedings of the 2018 world wide web conference. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2018. p. 309– 18. ISBN 978-1-4503-5639-8. doi:10.1145/3178876.3186097. [19] Lipp M, Schwarz M, Gruss D, Prescher T, Haas W, Fogh A, et al. Meltdown: Reading kernel memory from user space 27th USENIX security symposium (USENIX Security 18); 2018. [20] Kocher P, Horn J, Fogh A, Genkin D, Gruss D, et al. Spectre attacks: exploiting speculative execution 40th IEEE symposium on security and privacy (S&P’19); 2019. [21] Frigo P, Giuffrida C, Bos H, Razavi K. Grand Pwning unit: accelerating microarchitectural attacks with the gpu. In: 2018 IEEE symposium on security and privacy (SP); 2018. p. 195–210. doi:10.1109/SP.2018.0 0 022. [22] Said H, Mutawa NA, Awadhi IA, Guimaraes M. Forensic analysis of private browsing artifacts. In: 2011 International conference on innovations in information technology; 2011. p. 197–202. 
doi:10.1109/INNOVATIONS.2011.5893816. [23] Suma GS, Dija S, Pillai AT. Forensic analysis of Google chrome cache files. In: 2017 IEEE international conference on computational intelligence and computing research (ICCIC); 2017. p. 1–5. doi:10.1109/ICCIC.2017.8524272. [24] Leung K, Faucheur FL, van Brandenburg R, Downey B, Fisher M. URI signing for CDN interconnection (CDNI); 2015. https://tools.ietf.org/id/ draft- ietf- cdni- uri- signing- 04.txt. [25] Maggi F, Frossi A, Zanero S, Stringhini G, Stone-Gross B, Kruegel C, et al. Two years of short URLs internet measurement: security threats and countermeasures. In: Proceedings of the 22Nd international conference on world wide web. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2013. p. 861–72. ISBN 978-1-4503-2035-1. [26] Jackson C, Barth A, Bortz A, Shao W, Boneh D. Protecting browsers from dns rebinding attacks. In: Proceedings of the 14th ACM conference on computer and communications security. New York, NY, USA: ACM; 2007. p. 421–31. [27] Cantor S, Hirsch F, Kemp J, Philpott R, Maler E. Bindings for the OASIS security assertion markup language (SAML) V2.0; 2005. http://docs.oasis-open.org/ security/saml/v2.0/. [28] Tang S, Dautenhahn N, King ST. Fortifying web-based applications automatically. In: Proceedings of the 18th ACM conference on computer and communications security. New York, NY, USA: ACM; 2011. p. 615–26. ISBN 978-1-4503-0948-6.
137
[29] Jakobsson M, Stamm S. Web camouflage: protecting your clients from browser-sniffing attacks. IEEE Secur Priv 2007;5(6):16–24. [30] West A, Aviv A. Measuring privacy disclosures in url query strings. Internet Comput, IEEE 2014;18(6):52–9. [31] Rapoport M, Suter P, Wittern E, Lhoták O, Dolby J. Who you gonna call? Analyzing web requests in android applications CoRR 2017. arXiv: 1705.06629. [32] Lee LN, Chow R, Rashid AM. User attitudes towards browsing data collection. In: Proceedings of the 2017 CHI conference extended abstracts on human factors in computing systems. New York, NY, USA: ACM; 2017. p. 1816–23. ISBN 978-1-4503-4656-6. doi:10.1145/3027063.3053078. [33] Wang F, Mickens J. Veil: private browsing semantics without browser-side assistance. NDSS. San Diego, CA; 2018. [34] Lowet D, Goergen D. Co-browsing dynamic web pages. In: Proceedings of the 18th international conference on world wide web. New York, NY, USA: ACM; 2009. p. 941–50. ISBN 978-1-60558-487-4. doi:10.1145/1526709.1526836. [35] Aggarwal G, Bursztein E, Jackson C, Boneh D. An analysis of private browsing modes in modern browsers. In: Proceedings of the 19th USENIX conference on security. Berkeley, CA, USA: USENIX Association; 2010. p. 6. 888-7-66665555-4 [36] Braden R. Requirements for internet hosts - Application and support; 1989. [37] Elz R, Bush R. Clarifications to the DNS specification; 1997. [38] Josefsson S. The base16, base32, and base64 data encodings; 2006. [39] Vixie P, Thomson S, Rekhter Y, Bound J. Dynamic updates in the domain name system (DNS UPDATE); 1997. [40] Rescorla E, Oku K, Sullivan N, Wood CA. Encrypted server name indication for TLS 1.3. Internet-Draft. Internet Engineering Task Force; 2018. https: //datatracker.ietf.org/doc/html/draft- ietf- tls- esni- 02. [41] Zhao B, Liu P. Private browsing mode not really that private: dealing with privacy breach caused by browser extensions. 
In: 2015 45th annual IEEE/IFIP international conference on dependable systems and networks; 2015. p. 184–95. doi:10.1109/DSN.2015.18. [42] Wu Y, Gupta P, Wei M, Acar Y, Fahl S, Ur B. Your secrets are safe: how browsers’ explanations impact misconceptions about private browsing mode. In: Proceedings of the 2018 world wide web conference. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2018. p. 217–26. ISBN 978-1-4503-5639-8. doi:10.1145/3178876. 3186088. [43] Satvat K, Forshaw M, Hao F, Toreini E. On the privacy of private browsing â a forensic approach. J Inf Secur Appl 2014;19(1):88–100. doi:10.1016/j.jisa.2014. 02.002. [44] Dickinson J, Dickinson S, Bellis R, Mankin A, Wessels D. DNS Transport over TCP - Implementation requirements; 2016. https://rfc-editor.org/rfc/rfc7766. txt. [45] Zhu L, Heidemann J, Wessels D, Hoffman PE, Mankin A, Hu Z. Specification for DNS over TLS; 2016. https://tools.ietf.org/html/draft-ietf-dprive - dns- over- tls- 09.