Geolocation Inference

Inferring user geolocation via browser cache.

Based on a paper by Yaoqi Jia, Xinshu Dongy, Zhenkai Liang, and Prateek Saxena.

Presented by Mohammad Doleh

Locating the User

  • IP Address
    • Users can use VPN or Tor for anonymity
    • Not accurate for mobile networks
  • Device GPS Sensors
    • Users can easily decline permission

Some Websites Provide Location-Oriented Pages

  • Google.com
  • Craigslist
  • Google Maps

What's the Big Deal?

  • Modern browsers tend to cache static resources
  • Can't query the browser cache
  • But can observe the load time of the resource

Threat Model

  • Attacker has a website
  • A victim visits their site and gets their browser cache probed
  • Attacker can attempt to load location-based resources from popular sites to determine the victim's location
  • We are assuming the victim as rejected GPS access and is using IP Address masking services (VPN or Tor)

Querying the Victim's Browser Cache #1

    <img src="some-location-specific-resource" />
  • Save the start time and end time and compute the difference after the resource loads
  • var image = document.createElement(‘img’);
    image.src = url;
    image.setAttribute(‘startTime’, (new Date().getTime()));
    image.onload = function()
    {
    	var endTime = new Date().getTime();
    	var loadTime = endTime - parseInt(this.	getAttribute(‘startTime’));
    }

Querying the Victim's Browser Cache #2

  • Cross-Origin Resource Sharing (CORS)
    • Similar to the previous technique but make the request manually in JavaScript
    • var starTime, endTime, loadTime;
      var xmlhttp = new XMLHttpRequest();
      xmlhttp.onloadstart = function()
      {
      	startTime = (new Date()).getTime();
      }
      xmlhttp.onloadend = function()
      {
      	endTime = (new Date()).getTime();
      	loadTime = endTime - startTime;
      }
      

Querying the Victim's Browser Cache #3

    <img src="some-location-specific-resource" complete="complete"/>
    • Check the complete attribute of the image tag immediately after setting its src property
    function cached(url)
    {
    	var image = document.createElement(‘img’);
    	image.src = url;
    	return image.complete || image.width+image.height > 0;
    }

Querying the Victim's Browser Cache #4

  • <iframe>
    • Similar to CORS, measure the start and end time then compute the difference
    var page = document.createElement(‘iframe’);
    page.setAttribute(‘startTime’, (new Date()).getTime());
    page.onload = function ()
    {
      var endTime = (new Date()).getTime();
      var loadTime = ( endTime - parseInt(this.getAttribute(‘startTime’)));
    }

Geo-interence Attacks on Mainstream Browsers

Browser Image Load Time CORS <img> complete <iframe>
Chrome X X - X
Firefox X - X X
Safari X - X X
Opera X X - X
IE X - X X

Locating a Victim's Country

  • Google has 191 geography-specific domains (currently around 200)
  • Utilize Google's logo image
    • google.com.sg
    • /images/srpr/logo11w.png

Experiment

  • Attempted to request the logo image and observe the load time
  • Utilized the <img> technique
  • Queried for the logo 3 times and compared the first query to the other 2
  • Logo's Max Age in Cache-Control header: 31536000ms ~ 8.76 hours

Results

    Locating Country Graph
  • Cache hits are easily distinguishable from cache misses

Live Demo

Locating a Victim's City

  • Craigslist has 712 geography-specific domains (currently around 714)
  • Utilize the <iframe> method
    • https://cleveland.craigslist.org/
    • https://toledo.craigslist.org/

Experiment

  • Attempted to request the local page and compare load times utilizing the <iframe> method
  • Queried each of Craigslist's 712 city-oriented websites 3 times
  • The first query was compared to the other 2

Results

Locating a Victim's Neighborhood

  • Google Maps tiles are cached with coordinate information
  • google.com/maps/vt/pb=!1m5!1m4!1i15!2i12627!3i23720!4i128!2m1!1e0!3m3!5e1105!12m1!1e47!4e0
  • A specific area will have similar URLs so it is possible to predict the URL for other map tiles

Experiment

  • Utilized the <img> method and analyzed load times
  • Measured image load time of 4,646 map tiles in New York City
  • Each tile queried 3 times and the first query was compared to the other 2
  • Max Age in Cache-Control header: 22222222ms ~ 6.17 hours

Results

Reliability of Timing-Based Attacks

  • Chrome, Firefox, Safari, Opera, and IE are all vulnerable

Experiment

  • Measure page load times 3 times per site against the top 100 Alexa websites
  • The average of the 2nd and 3rd measurements are considered cache hit times

Results

    Locating Neighborhood Graph
  • Can reliably measure load time and sniff browser history

Some Observations

  • Websites can set X-Frame-Options to SAMEORIGIN or DENY in response headers
  • This would prevent loading sites into frames
  • Page load time would equal request load time preventing this attack
  • However, location-specific resources can still be targeted individually

Prevalence of Location-Specific Resources

  • How many websites utilize location-sensitive resources?

Experiment

  • Analyzed top 100 Alexa websites and identified location-sensitive sites and their resources
  • Excluded 45 domains due to the following:
    • Sites related to specific countries (e.g. google.de)
    • Sites with pornographic material
    • Unreachable sites (e.g. akamaihd.net)
  • Visited 55 websites in 5 different countries and recorded URLs of cached resources

Results

  • 62% of the analyzed websites have location-specific resources
  • A user having visited any of those sites would be vulnerable to the geo-inference attacks

Is this an Effective Attack?

  • Does not directly pinpoint geo-locations of users
  • Can only verify if a user has been to a given location

Usage Examples

  • Paper author wants to know which of 20 cities a review came from
  • Targeting specific areas

Problems with Time

  • Google Maps has 4,646 map tiles for New York City alone
  • 5-10 geo-locations can verified every second
  • Can take up to 8 minutes of time
  • Can reduce time by running calls in parallel

Potential Defenses

  • Existing defenses can prevent this type of attack

What Doesn't Work

  • Private browsing
    • Cache is only cleared once the browser is closed not during its use
  • VPN
    • Masking an IP Address does nothing about the browser cache
  • Tor
    • Tor browser disables disk cache by default
    • Browser cache is still active in memory

What Could Work

  • Segregating browser cache between websites
    • Would work but is expensive in terms of load time
    • 50% performance overhead
  • XMLHttpRequest
    • Browsers can block status notifications if the website denied the request to access it
    • Would prevent the CORS & <iframe> methods but <img> method could still work

What Else Could Work?

  • Add noise into measurement mechanisms related to browser cache queries
    • Would need to be carefully engineered to not interfere with browser performance
  • Server-side protection
    • Websites may switch to non-geo-targeting mode upon request of users
    • Likely will degrade site performance
    • Could also periodically randomize URLs of geo-targeting resources
    • Could also use cache-invalidating headers in HTTP response on location-based resources

Even More Solutions

  • Server-side protection Part 2
    • Prototype tool collects URLs of cached resources and labels location-sensitive resources
    • Can be utilized to identify resources that require the cache-invalidating header
  • Browsers can allow end users to opt out of caching
    • In solutions aimed for privacy, it should be the default option

Some Ideas of My Own

  • Browser extension
    • Automatically clear the browser cache on each tab open and close
    • Could make it smarter by only clearing items in cache containing a given URL
  • <img> tags are not treated with the same restrictions as straight web requests from the client
    • Browsers could treat <img> tags with the same restrictions as CORS requests
    • Could break a significant portion of websites with this change (ex: Buzzfeed)

Some Thoughts

  • If I can convince a user to visit my site, couldn't I just have malware downloaded to their machine?
  • It's slightly misleading when they say they've scanned the top 100 Alexa websites when they left out almost half
  • Slides Available Here

    http://mdoleh.github.io/Geolocation-Inference/