Statistical Identification of Encrypted Web-Browsing Traffic
Qixiang Sun (Stanford University)
Daniel R. Simon, Yi-Min Wang, Wilf Russell, Venkata N. Padmanabhan, Lili Qiu (Microsoft Research)
Outline
• Motivation & Problem
• Intuition
• Hypothetical Attacker
• Attacker’s Success Rate
• Countermeasures
• Conclusion
Anonymous Web Browsing
[Figure: a chain of routers R1, R2, R3, R4]
• Protect personal information from an attacker’s inference
  • Medical (online support group)
  • Questionable activities
• Question: Is this REALLY anonymous?
What’s Different?
In anonymous Web browsing:
• The chain of routers is used for both sending and receiving data, so HTTP requests and responses can be linked!
• The target Web pages are publicly accessible, so the responses are known!
Implication: The first link/router is an exploitable weakness.
What Information is Available?
[Figure: HTTP GET requests and responses passing between the browser and the 1st router of the chain R1, R2, R3, R4]
• Number of objects
• Object sizes
• Ordering of the objects
• Delay between packets
Intuition
• Number of objects and object sizes are sufficient to identify a Web page!
• On average, a Web page has 11 objects, with each object yielding 8.4 bits of information: $8.4 \times 11 - \log_2(11!) \approx 67$ bits, or about $10^{20}$ possibilities!!
• Currently, there are about $10^{9}$ Web pages
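The arithmetic on this slide can be checked with a short sketch. The function name `pattern_entropy_bits` is hypothetical, and reading the `log2(11!)` term as a discount for ignoring the order of the objects is an assumption about the slide's intent.

```python
import math

def pattern_entropy_bits(num_objects: int, bits_per_object: float) -> float:
    """Estimate the information in an unordered set of object sizes.

    Assumption: each object contributes bits_per_object bits, and
    log2(num_objects!) is subtracted because ordering is ignored.
    """
    return num_objects * bits_per_object - math.log2(math.factorial(num_objects))

# Figures quoted on the slide: 11 objects per page, 8.4 bits per object.
bits = pattern_entropy_bits(11, 8.4)
possibilities = 2 ** bits

print(f"{bits:.1f} bits, roughly {possibilities:.1e} distinct patterns")
print(f"ratio to 1e9 Web pages: {possibilities / 1e9:.1e}")
```

The estimated pattern space exceeds the quoted number of Web pages by many orders of magnitude, which is the slide's argument for why object count and sizes alone can single out a page.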
A Hypothetical Attacker
[Figure: attacker architecture. Labels: list of target sensitive-site URLs; programmatic access to URLs and traffic recording; traffic pattern construction and database update; traffic pattern database; browser; R1; traffic recording and pattern construction; traffic pattern; similarity score calculation; decision module; history; positive; negative.]
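Guts of the Pattern Matching
[Figure: traffic pattern database, traffic pattern, similarity score calculation, decision module]
• Given two multisets of object sizes $S_1$ and $S_2$: $\mathrm{Sim}(S_1, S_2) = |S_1 \cap S_2| \,/\, |S_1 \cup S_2|$
• For example: $S_1 = \{3\,\mathrm{KB}, 3\,\mathrm{KB}, 5\,\mathrm{KB}\}$ and $S_2 = \{3\,\mathrm{KB}, 5\,\mathrm{KB}, 5\,\mathrm{KB}\}$ give $\mathrm{Sim}(S_1, S_2) = |\{3\,\mathrm{KB}, 5\,\mathrm{KB}\}| \,/\, |\{3\,\mathrm{KB}, 3\,\mathrm{KB}, 5\,\mathrm{KB}, 5\,\mathrm{KB}\}| = 2/4 = 0.5$
• Decision module uses an absolute threshold.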
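A minimal sketch of the multiset similarity score, using Python's `collections.Counter` for multiset intersection and union. The names `sim` and `passes_threshold` are hypothetical, sizes are given in whole KB for simplicity, and the threshold of 0.5 is taken from the success-rate slide; this is not the authors' implementation.

```python
from collections import Counter

def sim(s1, s2):
    """Multiset Jaccard similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
    c1, c2 = Counter(s1), Counter(s2)
    # Counter & keeps the minimum count per element, | keeps the maximum.
    intersection = sum((c1 & c2).values())
    union = sum((c1 | c2).values())
    return intersection / union if union else 0.0

def passes_threshold(observed, target, threshold=0.5):
    """Decision step with an absolute threshold (hypothetical helper)."""
    return sim(observed, target) >= threshold

# The example from the slide, with object sizes in KB.
s1 = [3, 3, 5]
s2 = [3, 5, 5]
print(sim(s1, s2))               # 2 shared objects out of 4 in the union
print(passes_threshold(s1, s2))
```

Treating the patterns as multisets rather than plain sets matters here: two pages with the same distinct sizes but different repeat counts do not score as identical.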
Experiment Setup
• Approximately 100,000 Web pages in total (URLs obtained from the Open Directory Project).
• The hypothetical attacker chooses about 2,200 pages as target pages.
• Goal: Can these 2,200 pages be identified without causing many false positives?
What is a Success and a Failure?
• Successful identification:
  • A target page passes the similarity threshold and is not confused with other pages in the target set.
• False positive:
  • A non-target page is incorrectly identified as one of the target pages.
• Potential false positive:
  • A page passes the similarity threshold when compared with a single selected target page.
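These definitions can be sketched as a small classifier. Everything here is illustrative: the function `outcome`, the labels `"missed"`, `"confused"` and `"rejected"`, and the reading of "not confused" as "no other target scores at least as high" are assumptions, not details from the slides.

```python
from collections import Counter

def sim(s1, s2):
    """Multiset Jaccard similarity between two lists of object sizes."""
    c1, c2 = Counter(s1), Counter(s2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0

def outcome(true_name, observed, targets, threshold=0.5):
    """Classify one observed page against a dict of target patterns."""
    passing = {}
    for name, pattern in targets.items():
        score = sim(observed, pattern)
        if score >= threshold:
            passing[name] = score
    if true_name in targets:
        own = passing.get(true_name)
        if own is None:
            return "missed"          # target page fell below the threshold
        # Assumption: confusion means another target scores at least as high.
        rivals = [s for n, s in passing.items() if n != true_name]
        return "confused" if any(s >= own for s in rivals) else "success"
    # Non-target page: any passing target is counted as a false positive.
    return "false positive" if passing else "rejected"

targets = {"a": [3, 3, 5], "b": [10, 20, 30]}   # hypothetical target set, sizes in KB
print(outcome("a", [3, 5, 5], targets))
print(outcome("other", [10, 20, 31], targets))
print(outcome("other", [1, 2], targets))
```

A potential false positive corresponds to the same check run against a `targets` dict holding a single selected page.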
Attacker’s Success Rate
• A threshold of 0.5 is sufficient.
[Figure: chart annotated with the values 80.4% and 2.1%, with the callout “Is this small enough?”]
A Detailed Look Inside
• False positives are NOT generated uniformly!
[Figure labels: common-looking pages; HTTP 404s; 0-identifiable pages]
Dynamism in Web Pages
• Most pages are relatively static
• A one-day-old pattern database is sufficient
Countermeasures
• Padding
  • Individual objects
  • Add random-sized objects
• Morphing
  • Pipelining the HTTP GET requests
  • Pre-fetching
• Mimicking
  • Common templates or Web-hosting services
Padding Object Size
• Linear: pad each object up to the nearest multiple of the padding size
• Exponential: pad each object up to the nearest power of 2
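The two padding schemes can be sketched as follows. The function names and the 1024-byte padding size are hypothetical, and the sketch assumes padding always rounds up, since padding can only add bytes to an object.

```python
def pad_linear(size: int, padding_size: int) -> int:
    """Round size up to the nearest multiple of padding_size."""
    if size <= 0:
        return 0
    # Ceiling division without floating point.
    return -(-size // padding_size) * padding_size

def pad_exponential(size: int) -> int:
    """Round size up to the nearest power of 2."""
    if size <= 1:
        return 1
    return 1 << (size - 1).bit_length()

# Object sizes in bytes (illustrative values only).
for size in (3000, 4096, 5000):
    print(size, pad_linear(size, 1024), pad_exponential(size))
```

Linear padding adds at most one padding unit per object, while exponential padding can nearly double an object's size but maps many more distinct sizes onto the same value.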
Two-chunk Pipelining
• Approximately 36% of the target pages are 0-identifiable.
• This is very close to the theoretical limit of 1/e, about 36.8% (assuming traffic patterns are random).
• Implication: Can harness the total entropy in the Web page traffic patterns.
Conclusion
• Encrypted Web browsing can be identified by the target page’s “unique” traffic pattern.