Wednesday, April 12, 2017

ISP Data Pollution: Hiding the Needle in a Pile of Needles?


theatlantic | The basic idea is simple. Internet providers want to know as much as possible about your browsing habits in order to sell a detailed profile of you to advertisers. If the data the provider gathers from your home network is full of confusing, random online activity, in addition to your actual web-browsing history, it’s harder to make any inferences about you based on your data output.

Steven Smith, a senior staff member at MIT’s Lincoln Laboratory, cooked up a data-pollution program for his own family last month, after the Senate passed the privacy bill that would later become law. He uploaded the code for the project, which is unaffiliated with his employer, to GitHub. For a week and a half, his program has been pumping fake web traffic out of his home network, in an effort to mask his family’s real web activity.

Smith’s algorithm begins by stringing together a few words from an open-source dictionary and googling them. It grabs the resulting links in a random order, and saves them in a database for later use. The program also follows the Google results, capturing the links that appear on those pages, and then follows those links, and so on. The table of URLs grows quickly, but it’s capped at around 100,000 entries to keep the computer’s memory from overloading.
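To make the seeding and capping idea concrete, here is a minimal offline sketch in Python. It is not Smith's code: the names `seed_query` and `UrlPool`, the three-word query length, and the random-eviction policy once the cap is hit are all assumptions made for illustration, and no search or crawling is actually performed.

```python
import random

MAX_URLS = 100_000  # approximate cap mentioned in the article


def seed_query(words, rng, n_words=3):
    """Join a few randomly chosen dictionary words into a search query.

    The query length (n_words) is an assumption; the article only says
    "a few words".
    """
    return " ".join(rng.sample(words, n_words))


class UrlPool:
    """Bounded collection of unique URLs (hypothetical stand-in for the URL table)."""

    def __init__(self, max_size=MAX_URLS, rng=None):
        self.max_size = max_size
        self.rng = rng if rng is not None else random.Random()
        self.urls = []
        self._seen = set()

    def add_links(self, links):
        """Add links in random order, skipping duplicates."""
        links = list(links)
        self.rng.shuffle(links)
        for url in links:
            if url in self._seen:
                continue
            if len(self.urls) >= self.max_size:
                # Assumed policy: overwrite a random existing entry when full.
                idx = self.rng.randrange(len(self.urls))
                self._seen.discard(self.urls[idx])
                self.urls[idx] = url
            else:
                self.urls.append(url)
            self._seen.add(url)

    def pick(self):
        """Return a random URL to fetch next."""
        return self.rng.choice(self.urls)
```

In a real crawler, the links passed to `add_links` would come from search results and from each fetched page; here they can be any iterable of strings.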

A program called PhantomJS, which mimics a person using a web browser, regularly downloads data from the URLs that have been captured—minus the images, to avoid downloading unsavory or infected files. Smith set his program to download a page about every five seconds. Over the course of a month, that’s enough data to max out the 50 gigabytes of data that Smith buys from his internet service provider.
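A quick back-of-the-envelope check shows what those figures imply. The sketch below assumes a constant rate of one page every five seconds, a 30-day month, and decimal gigabytes (10^9 bytes); none of these details are stated in the article, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_month(seconds_per_page=5, days=30):
    """Pages fetched in a month at a constant rate (ignores any slowdown at night)."""
    return days * SECONDS_PER_DAY // seconds_per_page


def avg_page_bytes(cap_gb=50, seconds_per_page=5, days=30):
    """Average page size that would exactly use up the monthly data cap."""
    return cap_gb * 10**9 / pages_per_month(seconds_per_page, days)


print(pages_per_month())        # 518400 pages
print(round(avg_page_bytes()))  # about 96451 bytes per page
```

Under these assumptions, roughly half a million page loads at a little under 100 kilobytes each would consume the 50-gigabyte allowance.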

Although it relies heavily on randomness, the program tries to emulate user behavior in certain ways. Smith programmed it to visit no more than 100 domains a day, and to occasionally visit a URL twice—simulating a user reload. The pace of browsing slows down at night, and speeds up again during the day. And as PhantomJS roams around the internet, it changes its camouflage by switching between different user agents, which are identifiers that announce what type of browser a visitor is using. By doing so, Smith hopes to create the illusion of multiple users browsing on his network using different devices and software. “I’m basically using common sense and intuition,” Smith said.
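The behavioral rules described above can be sketched roughly as follows. This is an illustration, not Smith's implementation: the user-agent strings are placeholders, and the night-time window, slowdown factor, and reload probability are assumed values. Only the 100-domain daily limit and the five-second base pace come from the article.

```python
import random
from urllib.parse import urlparse

# Placeholder identifiers; a real program would use genuine browser strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (Desktop)",
    "ExampleBrowser/2.0 (Mobile)",
    "ExampleBrowser/3.0 (Tablet)",
]
MAX_DOMAINS_PER_DAY = 100  # limit mentioned in the article
BASE_DELAY_S = 5.0         # roughly one page every five seconds


def delay_for_hour(hour, base=BASE_DELAY_S, night_factor=4.0):
    """Return the wait between requests; longer at night (assumed 23:00 to 07:00)."""
    is_night = hour >= 23 or hour < 7
    return base * night_factor if is_night else base


def next_request(urls, domains_today, last_url, rng, reload_prob=0.05):
    """Choose the next (url, user_agent) pair, or None if nothing fits the budget.

    domains_today is a set that the caller resets once per day.
    """
    # Occasionally revisit the previous URL, simulating a user reload.
    if last_url is not None and rng.random() < reload_prob:
        return last_url, rng.choice(USER_AGENTS)
    candidates = list(urls)
    rng.shuffle(candidates)
    for url in candidates:
        domain = urlparse(url).netloc
        # Allow domains already seen today, or new ones while under the limit.
        if domain in domains_today or len(domains_today) < MAX_DOMAINS_PER_DAY:
            domains_today.add(domain)
            return url, rng.choice(USER_AGENTS)
    return None
```

A driver loop would call `next_request`, fetch the page with the chosen user agent, and then sleep for `delay_for_hour(current_hour)` seconds before repeating.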