An Invitation to Data-Mining


    1. An Invitation to Data-Mining Virgil -- virgil@yak.net GregR -- gregr@yak.net Interz0ne IV March 12, 2005

    2. Lecture Outline: Introducing Data-Mining; Google Hacking; Intermission; Examples of Using Data-Mining for Money, Power, and Sex; Closing

    3. The Advent of Databases and the Internet Fact: The amount of data we have access to is greater than ever before and is still growing exponentially. If nothing else, the continued archival of current data will quickly add up.

    4. Continued Growth of the Internet

    5. Growth of Digital Information: A Practical Example… Back in the old days, news of interesting websites propagated through word of mouth. Then it moved to USENET groups (blogs are a modern equivalent). But then it became difficult to find the hottest newsgroups. To compensate, we started using search engines. Today, we’re frequently using meta-search engines and meta-blogging sites like technorati.com, memestreams.net, and del.icio.us. Data-mining is an increasingly powerful tool for taking advantage of the huge amounts of digitized information now available.

    6. What is Data-Mining… From Wikipedia: Data mining has been defined as [1] “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” and [2] “The science of extracting useful information from large data sets or databases”. Like “Artificial Intelligence”, “Data-Mining” is a widely used term with general connotations.

    7. What is Data-Mining (contd.) Data-Mining is usually broken up into two distinct steps: 1. “Data-Warehousing” – collecting large amounts of data. 2. “Mining / Extraction” – analysis (often statistical) of the collected information.

    8. Some Examples of Data Mining... Amazon.com’s Recommendation System; MusicPlasma.com; the National Security Agency’s ECHELON. ECHELON is the largest electronic spy network in history, run by the United States, the United Kingdom, Canada, Australia, and New Zealand. It captures telephone calls, faxes, e-mails, and IMs from around the world. ECHELON is estimated to intercept about 3 billion communications every day (text-mining).

    9. Other Users of Data Mining: Nazis in France during WWII; Mormons; the Alexa/Google Toolbar; Wal-Mart (e.g. the urban myth of a correlation between purchases of beer and diapers); RIAA/MPAA in P2P; Microsoft in BitTorrent; Rotten.com’s NNDB. Basically, just about everyone is using data mining for all sorts of things.

    10. Getting your feet wet in Data-Mining: Using Google Using Google is a great place to start data-mining. The data collection stage has already been done for you! All you need to do is craft the perfect query to find the interesting parts.

    11. But what could you possibly find just using Google?

    12. How About…

    13. Intro: “Google Hacking” “Google Hacking” is the use of Google’s data stores for naughty things. It makes extensive use of the advanced Google syntaxes. It is trivially easy to do and is rather trendy. An excellent guide to get up to speed on the techniques of “Google Hacking” is the O'Reilly book Google Hacks by Tara Calishain and Rael Dornfest.

    14. Google Hacking: Tools of the Trade On the surface, searching Google is straightforward. But there are many special parameters (some of which are undocumented). You can use these parameters to exclude everything but the data you're looking for.

    15. Google Syntax Examples Operators: " ", -, +, ( ), site:, filetype:, related:, link:, [all]inanchor:, [all]inurl:, [all]intext:, [all]intitle:. Examples: (interz0ne | outerz0ne) extraz0ne; site:.mil; filetype:.doc; related:yak.net; inanchor:"miserable failure"; inurl:robots.txt
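
Queries that combine several of these operators can be assembled programmatically. The following is a minimal sketch, not part of the original talk: the helper names build_query and search_url are hypothetical, and it assumes only that Google accepts the query string in the q parameter of its search URL.

```python
from urllib.parse import quote_plus


def build_query(terms=(), **operators):
    """Join free-text terms and operator:value pairs into one query string.

    Hypothetical helper for illustration; values containing spaces are quoted
    so that they are matched as a phrase.
    """
    parts = list(terms)
    for op, value in operators.items():
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{op}:{value}")
    return " ".join(parts)


def search_url(query):
    """Percent-encode the query and place it in the q parameter."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    q = build_query(site=".mil", filetype="doc")
    print(q)
    print(search_url(q))
```

Keeping the query construction separate from the URL encoding makes it easy to generate many variations of a query (for example, one per target domain) before sending any of them.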

    16. Some Undocumented Syntaxes… Find between ranges of numbers; single-word wild-card; “fuzzify”; search only documents indexed within a particular timeframe.

    17. Google Hacking: Further Reading Due to its ease, Google Hacking already has a large following. Johnny Long runs a user-contributed “Google Hacking Database” which contains over 1,000 ready-made search queries. http://johnny.ihackstuff.com/ Johnny Long also has a concise Google Hacking guide. http://johnny.ihackstuff.com/security/premium/The_Google_Hackers_Guide_v1.0.pdf

    18. Intermission Questions on anything related or unrelated so far?

    19. Going Beyond Google “Google Hacking” is just the easy stuff. Data-mining techniques are applicable to virtually everything. There is a large amount of interesting information digitally available which is not indexed by Google (or anyone else). To do more interesting things you'll typically be using one of these unindexed sources as your data set. All sorts of data is already out there; all you need is the ingenuity to find applications for it.

    20. Further Examples of Data Mining

    21. Using Data-Mining to… Derive Mother's Maiden Names; Uncover Corporate and Government Secrets; Embarrass Minor Celebrities

    22. Deriving Mother's Maiden Names Mother’s Maiden Names (MMNs) are a common security authenticator, used for credit cards, email accounts, websites, etc. Idea: You could mine public records information from online databases to automatically derive MMNs for random people.

    23. About our Study The most relevant records are the birth and marriage records, both of which are “vital records” in the public domain. At the very least, there will be some easy cases for deriving MMNs (e.g. uncommon last names, hyphenated last names, “Jr.”, “III”, etc.). Although these techniques can be applied anywhere, we focused on Texas.

    24. Availability of Related Records Related public records are available at the county, state, and national level. US Census records are held for 72 years before release. Searchsystems.net has a large listing of county-level records. Rootsweb provides full user-submitted family trees. We got most of our records from the Texas Bureau of Vital Statistics’ website.

    25. Getting Texas Vital Records Collected marriage data from the State Dept of Vital Statistics (records 1966-2002). However, the birth records were sealed in 2000, the death records in 2003. We found partial copies of the sealed records on archive.org and full copies on rootsweb.com and searchsystems.net. Furthermore, the death records were only unlinked, and you can still download death info from their own servers 2 ½ years later.

    26. Analyzing the Records Once we have a large corpus of both birth and marriage data, we can apply whatever heuristics we want in connecting children to marriages. Lucky us! Birth records from 1950 and earlier include the MMN in plaintext! This left us with mostly state marriage records from 1966-2002 and state birth records from 1951-1995 to analyze.

    27. Children will have the same last name as their parents. We do not have to link a child to a particular marriage record, only to a particular maiden name. An attacker doesn’t have to pick the correct parents, just the correct MMN! The parents' first and middle names are often repeated within a child's first or middle name. Children are often born in the same county in which their parents were recently married. Factor in divorce records [public domain]. Factor in SSDI / state death records [public domain].
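
The first and the county heuristic above can be sketched as a simple filter over marriage records. This is a minimal illustration under stated assumptions, not the method used in the study: the record fields (groom_last, bride_maiden, county) and the function name candidate_mmns are hypothetical, and the weight of 2 for a county match is an arbitrary choice.

```python
from collections import Counter


def candidate_mmns(child_last, child_county, marriages):
    """Return weighted maiden-name candidates for one child.

    Only marriages whose groom shares the child's last name are kept.
    A marriage in the child's birth county counts double (arbitrary weight).
    """
    candidates = Counter()
    for m in marriages:
        if m["groom_last"] != child_last:
            continue
        weight = 2 if m["county"] == child_county else 1
        candidates[m["bride_maiden"]] += weight
    return candidates


if __name__ == "__main__":
    # Made-up records for illustration only.
    marriages = [
        {"groom_last": "Smith", "bride_maiden": "Jones", "county": "Travis"},
        {"groom_last": "Smith", "bride_maiden": "Lee", "county": "Harris"},
        {"groom_last": "Brown", "bride_maiden": "Jones", "county": "Travis"},
    ]
    print(candidate_mmns("Smith", "Travis", marriages))
```

Because the function returns maiden names rather than marriage records, two different couples with the same maiden name reinforce each other, which matches the observation that only the MMN has to be right.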

    28. Measuring our Success for Compromise Recall we need only match up to the correct MMN, not the correct parents. After applying our heuristics we’ll have a list of possible maiden names. We use data entropy (Shannon entropy) to measure the ‘disorder’ of the set of remaining MMNs. We then compare the entropy before and after the application of the heuristics to measure the success of our attack. Before the heuristics are applied, the entropy of the set of MMNs is ≈ 13 bits.
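
The before-and-after comparison can be illustrated with a small calculation. Shannon entropy is H = -sum(p_i * log2(p_i)) over the probabilities p_i of each candidate name. The sketch below is illustrative only: the function name shannon_entropy and the sample counts are made up, and a uniform set of 2^13 names is used merely as a stand-in for a roughly 13-bit starting distribution.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Entropy in bits of the distribution given by a mapping name -> count."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0
    )


if __name__ == "__main__":
    # Stand-in "before" set: 2**13 equally likely maiden names.
    before = Counter({f"name{i}": 1 for i in range(2 ** 13)})
    # Made-up "after" set: heuristics left two candidates, one twice as likely.
    after = Counter({"Jones": 2, "Lee": 1})
    print("before:", shannon_entropy(before), "bits")
    print("after: ", shannon_entropy(after), "bits")
```

The fewer bits remain after the heuristics, the fewer guesses an attacker needs on average, so the drop in entropy is a direct measure of how much the public records leak.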

    29. Entropy Graph assuming only same last names

    30. Results from just assuming same last name.

    31. Questions? (By the way, George Bush’s MMN is “Pierce”)

    32. Data-Mining for .doc’s In case you weren't aware, the Microsoft .doc format contains all sorts of interesting “metadata” within the document. At times, this metadata has been known to be intensely interesting. This metadata includes (among other things) the: Title, Author, Date Created, Date Last Saved, Editing Time, User’s Machine ID#, and usernames of who made the last 10 revisions. This fact is known to some groups (such as lawyers), but by and large people don't know about it.

    33. Past Incidents UK Prime Minister Tony Blair published a dossier on the Iraq War. A Cambridge prof revealed that most of the document was plagiarized from a grad student in Monterey. Inspired by this, Richard Smith of computerbytesman.com ran an analysis of the dossier's .doc metadata. Smith uncovered a good deal more incriminating evidence and made the Blair government squirm. [Link]

    34. That's a great idea! Let's do it better! Do massive crawling for all .doc’s on a particular domain. Extract all of their metadata. Put it into a database with a web interface. See if anything interesting turns up!
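
The "put it into a database" step can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions, not the system shown in the talk: it assumes the crawling and metadata extraction have already produced one dictionary per document, and the table name doc_meta, its columns, and the function names are all hypothetical.

```python
import sqlite3


def store_metadata(conn, records):
    """Create the (hypothetical) doc_meta table if needed and upsert records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_meta ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, last_saved_by TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO doc_meta "
        "VALUES (:url, :title, :author, :last_saved_by)",
        records,
    )
    conn.commit()


def authors_by_count(conn):
    """List (author, number of documents), most prolific author first."""
    return conn.execute(
        "SELECT author, COUNT(*) FROM doc_meta "
        "GROUP BY author ORDER BY COUNT(*) DESC, author"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up metadata for illustration only.
    store_metadata(conn, [
        {"url": "http://example.org/a.doc", "title": "A",
         "author": "alice", "last_saved_by": "bob"},
        {"url": "http://example.org/b.doc", "title": "B",
         "author": "alice", "last_saved_by": "alice"},
        {"url": "http://example.org/c.doc", "title": "C",
         "author": "carol", "last_saved_by": "carol"},
    ])
    print(authors_by_count(conn))
```

Once the metadata is in a table, "seeing if anything interesting turns up" becomes a matter of writing queries, for example finding documents whose last editor differs from the stated author.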

    35. What we've done (work in progress) No conclusive Word metadata analysis system exists. We’ve been weaving bits and pieces together into an eventual whole. Demonstrations: [Demo of “The Revisionist” by Michal Zalewski] [Demo of Yak’ified “WordLeaker” by Madelman] [Demo of unreleased script strings_against_references. (Works similarly to Simon Byers’ work)]

    36. .doc Mining -- Conclusions Okay, it's not finished yet. But not bad for starting this project last week. The core concept works completely, but needs a little more refinement. Better integration is needed, still a few bugs.

    37. Last Example

    38. Cat Schwartz, TechTV eye candy As one of her fans comments…. Cat Schwartz is one of the cute girls on TechTV. I know everybody jerks it to Morgan Webb, but Cat has that nerdy emo girl cuteness that I and many others find hard to resist. She has a blog on which she does bloggy things like posting pics of herself, writing crappy poems, and keeping her fans abreast of her schedule.

    39. Cat Schwartz and her blog Like all blog girls, she likes to post suggestive images of herself on her blog. No one knows why blog girls do this, but for now let us simply accept that they do. [www.catschwartz.com]

    42. A little-known fact… Programs like Photoshop store a full thumbnail of the photo in the EXIF header extension. Furthermore, if only a slight alteration is made (e.g. cropping), Photoshop doesn’t regenerate the thumbnail stored in the EXIF header.

    43. So....

    48. And the net goes wild! One enthusiastic fan comments… “I SPANKED TWICE IN A ROW TO THESE!!! AND I'M GONNA SPANK AGAIN!!! OMG! OMG! OMG! I EVEN LICKED MY MONITOR!!!!!!!”

    49. Doing This Even Better Crawl USENET for images. Do math to determine if the image in the EXIF thumbnail is different from the actual image. Display the images. Live demo using a “Hot or Not” rating system. Sadly, the results haven’t been that amazing; most are just uninteresting croppings. But a few interesting bits….
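
One way the "do math" step could work is to shrink the full image to the thumbnail's size and measure the average pixel difference. The sketch below is an assumption about a plausible approach, not the script used in the demo: it works on grayscale pixel grids (lists of rows of 0-255 values), leaves decoding the JPEG and its EXIF thumbnail to an image library, ignores aspect-ratio changes, and uses hypothetical function names and an arbitrary threshold.

```python
def downscale(pixels, out_w, out_h):
    """Nearest-neighbour downscale of a 2D grayscale grid to out_w x out_h."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def mean_abs_diff(a, b):
    """Average absolute per-pixel difference of two equally sized grids."""
    total = sum(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return total / (len(a) * len(a[0]))


def thumbnail_mismatch(image, thumb, threshold=10.0):
    """True if the embedded thumbnail differs noticeably from the image."""
    scaled = downscale(image, len(thumb[0]), len(thumb))
    return mean_abs_diff(scaled, thumb) > threshold


if __name__ == "__main__":
    # Made-up 4x4 original: dark left half, bright right half.
    original = [[0, 0, 255, 255] for _ in range(4)]
    stale_thumb = downscale(original, 2, 2)   # thumbnail of the uncropped photo
    cropped = [row[2:] for row in original]   # published image: right half only
    print(thumbnail_mismatch(original, stale_thumb))
    print(thumbnail_mismatch(cropped, stale_thumb))
```

A threshold well above zero is needed in practice because JPEG compression and resampling make even an honest thumbnail differ slightly from the full image.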

    50. Some Data Sets dying for interesting applications: FEC Political Donation Data http://ftp.fec.gov/FEC/presidential/; GPS Coordinates of Zipcodes + TerraServer http://www.census.gov/geo/www/tiger/zip1999.zip; More Public Records / Sexual Offender Databases http://www.searchsystems.net/; Social Security Death Index http://ssdi.genealogy.rootsweb.com/; Library of Congress Print Catalog http://www.loc.gov/rr/print/catalog.html; Flickr.com (e.g. http://www.mappr.com); P2P Network User Behavior; Nanpa.com

    51. End. V. Griffith, M. Jakobsson (2005), Messin' with Texas: Deriving Mother’s Maiden Names Using Public Records, is available at: http://romanpoet.org/1/mmn.pdf EXIF Data-Mining References: Steven J. Murdoch: www.cl.cam.ac.uk/~sjm217; Maximillian Dornseif: md.hudora.de
