
A Distributed Approach to Uncovering Tailored Information and Exploitation on the Web



  1. A Distributed Approach to Uncovering Tailored Information and Exploitation on the Web Kenton P. Born Kansas State University October 26th, 2011

  2. “The power of individual targeting – the technology will be so good it will be very hard for people to watch or consume something that has not in some sense been tailored for them” Eric Schmidt, Google

  3. ROADMAP • Introduction • Background • Problem Statement • Hypothesis • Methodology • Results • Limitations • Future Work

  4. Background • What attributes can be used to distinguish a system? • Anything with consistency on one machine, but entropy across varying machines • MAC and IP Address • TCP/IP Stack • Application Layer Content • Client Fingerprint – The combination of all identifiable attributes of a client system that distinguish it from others.

  5. HTTP Fingerprinting • Browser Identification • User-Agent string • Object detection • Plugins • Operating System Identification • User-Agent string • TCP/IP stack • User Identification • IP Address • Cookies • Aggregation of fingerprintable attributes
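
The last bullet is the core idea: each attribute leaks only a little entropy, but aggregated they can distinguish one client from most others. The Java sketch below is a minimal illustration of that aggregation using an assumed attribute set (User-Agent, IP address, plugin list); it is not the attribute set or code from the dissertation.

```java
import java.util.List;
import java.util.Objects;

/** Minimal sketch: combine a few observable attributes into one identifier. */
public class ClientFingerprint {
    private final String userAgent;      // from the User-Agent header
    private final String ipAddress;      // network-layer identifier
    private final List<String> plugins;  // enumerable via JavaScript in the browser

    public ClientFingerprint(String userAgent, String ipAddress, List<String> plugins) {
        this.userAgent = userAgent;
        this.ipAddress = ipAddress;
        this.plugins = plugins;
    }

    /** Combined identifier: consistent on one machine, high entropy across machines. */
    public String id() {
        return Integer.toHexString(Objects.hash(userAgent, ipAddress, plugins));
    }
}
```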

  6. Panopticlick

  7. The Ephemeral Web • Websites are growing in complexity • Static content → Dynamic content → Instantly dynamic content → Tailored content • Has an effect on: • Web crawlers / Search engines • Change detection services • Semantic analysis • Trust • These are not solved problems! (Third-party ads, client tracking, analytics)

  8. Tailored web content • How do you know when a web response has been modified because of the fingerprintable attributes of your system? • User and location-based tailoring • Services • Misinformation • Browser and operating system tailoring • Software downloads • Exploits for specific client fingerprints How can we assign a level of trust to web content?

  9. HYPOTHESIS • A Multiplexing proxy, through the real-time detection and categorization of web content that has been modified for specific locations, browsers, and operating systems, provides enhanced misinformation, exploit, and web design analytics. • This study analyzed the utility of the multiplexing proxy’s detection, classification, and visualization methods for three different roles: open source analysts, cyber analysts, and reverse engineers. • Both qualitative and quantitative approaches were taken in an attempt to understand whether the techniques used by many sites are similar, or whether the breadth of dynamic changes is too vast to ever be handled well in an automated system.

  10. Methodology • Multiplex requests at a proxy • Change at most one fingerprintable attribute per request • Modify User-Agent string (Browser) • Modify TCP/IP stack (Operating System) • Modify IP address (Location) • Send duplicate requests • Aggregate the responses at the proxy • Analyze them against the original response • Present the user with detailed analysis along with the original response
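
A minimal sketch of the multiplexing step, assuming a plain Java HTTP client and two hypothetical User-Agent variants: each extra request changes exactly one fingerprintable attribute, and every response is collected for comparison against the original. The real system also varies the TCP/IP stack and exit IP address through distributed agents, which header rewriting alone cannot do.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: duplicate one client request, changing a single fingerprintable attribute per copy. */
public class RequestMultiplexer {
    private final HttpClient client = HttpClient.newHttpClient();

    // Hypothetical browser variants; only the User-Agent string changes per request.
    private final Map<String, String> userAgentVariants = Map.of(
            "ie8",     "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
            "firefox", "Mozilla/5.0 (Windows NT 6.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");

    public List<HttpResponse<String>> multiplex(URI url, String originalUserAgent) throws Exception {
        List<HttpResponse<String>> responses = new ArrayList<>();
        responses.add(send(url, originalUserAgent));          // the unmodified request
        for (String variant : userAgentVariants.values()) {   // one changed attribute per copy
            responses.add(send(url, variant));
        }
        return responses;                                      // aggregated for analysis
    }

    private HttpResponse<String> send(URI url, String userAgent) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(url)
                .header("User-Agent", userAgent)
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```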

  11. false positive MITIGATION • How do we handle the false positives due to instantly dynamic content? • Send several requests that duplicate the fingerprint of the original request! • Provides a baseline of the instantly dynamic data • Anomalies from this baseline are the tailored content! • Accuracy improves with additional duplicate requests
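
A sketch of the baseline check, assuming response bodies are compared directly (the real tool compares hashes and stripped variants, per the next slide): a modified-fingerprint response is flagged only when it matches neither the original nor any same-fingerprint duplicate.

```java
import java.util.List;

/** Sketch: duplicates of the original fingerprint form a baseline of instantly
 *  dynamic content; only responses outside that baseline are treated as tailored. */
public class BaselineFilter {

    public static boolean looksTailored(String original, List<String> duplicates, String modified) {
        if (modified.equals(original)) {
            return false;                 // identical to the original: nothing tailored
        }
        for (String duplicate : duplicates) {
            if (modified.equals(duplicate)) {
                return false;             // matches the baseline's normal rotation of content
            }
        }
        return true;                      // differs from the original and every duplicate
    }
}
```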

  12. Classification • Classify the resources and provide tools for analyzing them • Look at: • MD5 hash (byte-for-byte comparison) • Stripped hash (Only compare content) • Structure hash (Do they share the same structure?) • Response length anomalies • Visualization
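
A hedged sketch of the three hash levels in Java: the MD5 comparison is stated on the slide, but the stripping and structure-extraction rules shown here (whitespace removal, tag skeleton) are illustrative assumptions rather than the dissertation's exact normalization.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Sketch: three progressively looser hashes for comparing a response to the original. */
public class ResponseHashes {

    /** Byte-for-byte comparison: MD5 over the raw body. */
    public static String md5Hash(String body) {
        return md5(body);
    }

    /** "Stripped" comparison: ignore whitespace/formatting so only content is compared. */
    public static String strippedHash(String body) {
        return md5(body.replaceAll("\\s+", ""));
    }

    /** Structure comparison: keep only the tag skeleton, dropping attributes and text nodes. */
    public static String structureHash(String body) {
        String skeleton = body.replaceAll("<([a-zA-Z0-9]+)[^>]*>", "<$1>")  // strip attributes
                              .replaceAll(">[^<]*<", "><");                 // strip text between tags
        return md5(skeleton);
    }

    private static String md5(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```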

  13. Open Source analyst

  14. CYBER analyst

  15. Reverse engineer • Analyze and Investigate websites from empirical classification study • Determine insights gained about server behavior • Determine value of information extracted by the tool’s analytical and visual capabilities

  16. System Components • Client-side Firefox Browser Plugin (XUL/JavaScript) • Easily distributed • Piggy-back off of browser functionality • APIs for request/header manipulation • Other Plugins • e.g. Ad-Block Plus • Customized Proxy (Java) • Distributed Agents (Java) • Fingerprint modifications • Load-balancing • Web-Service (Java/GWT)

  17. Configuring the Multiplexing Proxy Navigate to the configuration screen and select the specific browser, operating system, and location modifications you would like made to the requests going through the multiplexing proxy.

  18. Using the Multiplexing Proxy The Firefox plugin adds an additional toolbar to the browser. Anomalies will first be signaled in the toolbar; click on them to view a summary of the analysis of the resources! Additional search bar

  19. Listing the dynamic resources The dynamic resources are listed in order of “most interesting” to “least interesting”. Click on a resource to analyze further. Overview of the analysis for each resource

  20. RESULTS

  21. Theoretical classification accuracy • Dissertation includes many formulas for calculating the probability of misclassification • Generally revolves around “dice rolling” calculations • Websites returning multiple versions of a webpage with different probabilities requires more in-depth formulas than versions with equal probabilities • “Weighted dice” vs. “Fair dice” • Increasing the number of duplicate requests increases the accuracy! • Three duplicate requests provides sufficient accuracy for most cases • Chicken-or-the-egg problem: you can only calculate the accuracy of a particular website once you know everything about the website • Looking at the empirical classification accuracy helps with this
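
As a hedged illustration of the "fair dice" case (an assumption for exposition, not the dissertation's exact formulas): suppose a page rotates among k equally likely variants on every request, independently of the client fingerprint, and d duplicate requests reuse the original fingerprint. A false positive then requires every duplicate to draw the original's variant by chance while the modified-fingerprint request draws a different one:

```latex
\[
  P(\text{false positive}) \;\approx\; \left(\tfrac{1}{k}\right)^{d} \cdot \tfrac{k-1}{k}
\]
```

With k = 2 and d = 3 this is (1/2)^3 · (1/2) = 1/16, which is why a handful of duplicate requests already gives reasonable accuracy; the "weighted dice" case replaces the uniform 1/k with per-variant probabilities.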

  22. Empirical classification accuracy

  23. Empirical classification accuracy

  24. UNDERSTANDING the Analysis RESULTS • The response to the original request • A response that matches, byte-for-byte, the original response • A response that matches the original after stripping out benign attributes, formatting, etc. • A response that does not seem to match the original response • A response that has a significantly anomalous length compared to the original/duplicates • A request that failed to receive a response from the server • Each type of response has various tools for analyzing it more deeply against the original response: e.g. diff / inline diff / stripped diff / view

  25. Client Fingerprint / Analytics Collection Location/IP based modifications flag an anomaly! IP Address geolocation! Woah!

  26. Location-based Blocking/tailoring Location-based anomalies! The request sent through a proxy in China failed… “The great firewall of China”

  27. AD tailoring Modifying the User-Agent string flags a response anomaly! You have Windows 7 Starter, upgrade to Home Premium! You do not have Windows 7, get it! Get the latest service pack for Windows! You do not have Windows 7, buy a new PC!

  28. Misconfigured / malicious proxy The response from the proxy in China did not match the rest! Default Apache response, a misconfiguration! While this was a misconfigured proxy, a malicious proxy would similarly throw an anomaly!

  29. JavaScript injection (lab environment) The Internet Explorer fingerprint triggered an anomaly! JavaScript Injection! A malicious, obfuscated redirect…

  30. REAL JAVASCRIPT INJECTION! Seemingly random anomalies! Performance analysis script!

  31. Price TailorinG Location-based anomalies! Prices are shown in a new currency… same price?

  32. Search Engine redirect Location-based anomalies! Redirect to www.google.ru! Significant length anomaly!

  33. image tailoring Internet Explorer 6 anomaly! Handling image transparency issues for IE6!

  34. Formatting inconsistencies Tailored formatting for different browsers Tailored images for IE6

  35. Current limitations • No support for HTTPS • Requires the proxy to hijack the handshake and act as a man-in-the-middle (MITM) • No ability to manipulate and analyze the effects of HTTP cookies • HTTP POST requests are ignored • If they aren't idempotent, they could cause issues • e.g. Adding items to shopping carts • TCP/IP stack manipulation is not robust • Requires a machine or VM for each operating system fingerprint • Need a tool to quickly modify the stack as necessary on any machine

  36. Future work • More robust real-time TCP/IP stack manipulation tool • Cookies • Tailored Content due to the presence of certain cookies • Find websites that share cookies across various domains! • Look at other protocols • DNS • Routing • Etc.

  37. Contact information Kenton Born Kansas State University Lawrence Livermore National Laboratory kenton.born@gmail.com

  38. BACKUP SLIDES

  39. Related Work • Many studies on dynamic aspects of websites • Cho and Garcia-Molina (2000) • 25% of pages in .com domains changed within a day • 50% of pages changed after 11 days • Other TLDs such as .gov were less dynamic • Olston and Pandey (2008) • Developed web crawl policies that accounted for longevity of information • Periodic crawling of dynamic material instead of batched crawling Must switch from batched crawling to complex, incremental crawlers!

  40. Related Work (2) • Measuring website differences • Cho and Garcia-Molina (2000) • MD5 checksum • Fetterly et al. (2003) • Vector of syntactic properties using the shingling metric • Most changes are trivial (non-content) • Greater frequency of change in top-level domains • Larger documents have a greater frequency and degree of change • Past changes can predict the size and frequency of future changes • Adar et al. (WSDM 2009) • XPath extraction of “cleaned” website • Calculated survivability of each element Patterns can be found in website changes by analyzing them more deeply

  41. Related Work (3) • Change frequency • Adar et al. (WSDM 2009) • Over 50% of the websites examined had frequently changing data • Over 10% of the websites contained instantly dynamic data • An instantly dynamic website typically modifies similar amounts of information each time, in contrast with sites such as blogs • Ntoulas et al. (2004) • 8% of downloaded sites each week were new web pages • Calculated TF-IDF cosine distance and word distance between versions. • Most website changes were minor, not causing significant differences • Kim and Lee (2005) • After 100 days, 40% of URLs were not found on initial crawls. • Calculated download rate, modification rate, coefficient of age • Did nothing to handle instantly dynamic data!

  42. Related Work (4) • Website comparison • Kwan et al. (2006) • Analyzed comparison methods against specific types of change (markup removed) • Byte-to-byte (checksum) • TF-IDF cosine distance not sensitive enough for most changes • Word distance only effective for “replace” changes • Could not report on “moved” text • Edit distance differed from word distance by treating “move” and “replace” similarly • Shingling metric performed best against “add” and “drop” changes • Over-sensitive to the rest Different types of changes are best detected using different methods!

  43. Related Work (5) • Structural changes • Dontcheva et al. (2007) • Removed structurally irrelevant elements and analyzed the DOM tree • Small changes or layout modifications happen toward the leaves of the DOM tree • Major website changes happen deeper in the tree. • Larger websites with large amounts of traffic and highly dynamic content tended to have a larger number of structural changes. • Automated extraction is difficult for changes away from leaf nodes • Did not take AJAX/Flash applications into account Element depth can help classify the type of change!

  44. Related Work (6) • Revisitation patterns • Adar et al. (CHI 2008; CHI 2009) • Enhance user experience by highlighting relevant content that changed from previous visits. • Polled users to find relationships between users' intentions and site revisitation. • Dynamic website revisitation - users searching for new information • Static website revisitation - people revisiting something previously viewed Is it possible to identify relevant information for a user?

  45. Related Work (7) • Change detection in XML/HTML • Longest common subsequence • Diff • HtmlDiff • Hirschberg algorithm • Mikhaiel and Stroulia (2005) • Labeled-ordered tree comparison • Chawathe and Garcia-Molina (1997) • Minimum cost edge cover of a bipartite graph • Wang et al. (2003) • X-Diff • Tree-to-tree correction techniques. • Xing et al. (2008) • X-Diff+ • Visual representation of how an XML document conformed to its DTD

  46. Related Work (8) • Tools for visualizing web differences • Chen et al. (2000) • AT&T Difference Engine (AIDE) • Used TopBlend - Heaviest common subsequence solver • Jacobson-Vo algorithm • Web-crawler that collects temporal versions of websites and highlights differences. • Adar et al. (UIST 2008) • Builds a collection of documents and snapshots of websites over time. • Explore websites through different lenses • Greenberg and Boyle (2006) • Stored bitmaps of user-selected regions, notifying the user when significant changes were detected. • Limiting and ineffective in many cases.

  47. Related Work (9) • Many services that monitor websites and alert users • Liu et al. (2000) • WebCQ • http://www.rba.co.uk/sources/monitor.htm

  48. Related Work (10) • Real-time comparative web browsing • Nadamoto and Tanaka (2003) • Comparative Web Browser (CWB) • Displays and synchronizes multiple web pages based on relevance. • Nadamoto et al. (2005) • Bilingual Comparative Web Browser (B-CWB) • Same as CWB, but attempts to do it across varying languages • Selenium • Framework providing an API to invoke web requests in varying browsers and run tests against their responses.
