Data Mining. Chris Nelson, CS 157A, Fall 2007. New buzzword, old idea: inferring new information from already collected data. Traditionally the job of data analysts.
Wikipedia definition: “Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data.”
Knowledge Discovery: Concrete information gleaned from known data. Data you may not have known, but which is supported by recorded facts (e.g., the diapers and beer example from the previous presentation).
Knowledge Prediction: Uses known data to forecast future trends, events, etc. (e.g., stock market predictions).
Wikipedia note: "some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery." These include applications in AI and symbol analysis.
In terms of software and the marketing thereof: Data Mining != Data Analysis
Data Mining implies that the software applies some intelligence, beyond simple grouping and partitioning of data, to infer new information.
Data Analysis is more in line with standard statistical software (e.g., web stats). These tools usually present information about subsets and relations within the recorded data set (e.g., browser/search engine usage, average visit time, etc.).
Data Dredging: The process of scanning a data set for relations and then coming up with a hypothesis for the existence of those relations.
Metadata: Data that describes other data. It can describe an individual element or a collection of elements. Wikipedia example: “In a library, where the data is the content of the titles stocked, metadata about a title would typically include a description of the content, the author, the publication date and the physical location.”
Applications for Data Dredging in business include Market and Risk Analysis, as well as trading strategies.
Applications for Science include disaster prediction.
Old data mining methods relied on Propositional Data: data related to a single, central element that can be represented in vector format (e.g., the purchasing history of a single user). Amazon uses such vectors in its related item suggestions (a multidimensional dot product).
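As a rough illustration of the dot-product idea, the sketch below compares purchase-history vectors and suggests items bought by the most similar user. The item list, the user names, and the helper functions are all hypothetical; this is not a description of how Amazon's system actually works.

```python
# Hypothetical catalog: each vector position corresponds to one item.
ITEMS = ["diapers", "beer", "wipes", "chips"]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def most_similar(target, others):
    """Name of the user whose vector has the largest dot product with target."""
    return max(others, key=lambda name: dot(target, others[name]))

def suggest(target, neighbor):
    """Items the neighbor bought that the target has not bought."""
    return [item for item, t, n in zip(ITEMS, target, neighbor) if n and not t]

# Made-up purchase histories: 1 = bought, 0 = not bought.
alice = [1, 1, 0, 0]
others = {
    "bob":   [1, 1, 1, 0],
    "carol": [0, 0, 0, 1],
}

nearest = most_similar(alice, others)
suggestions = suggest(alice, others[nearest])
print(nearest, suggestions)
```

A raw dot product favors users with many purchases; a real system would likely normalize the vectors (for example, cosine similarity), which is omitted here to keep the sketch short.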
Current, advanced data mining methods rely on Relational Data, or data that can be stored and modeled easily through use of relational databases. An example of this would be data used to represent interpersonal relations.
Relational Data is more interesting than Propositional data to miners in the sense that an entity, and all the entities to which it is related, factor into the data inference process.
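A minimal sketch of that idea, assuming a made-up friendship table and purchase table: the inference about one user draws on the records of every entity related to that user. All names and data here are hypothetical.

```python
from collections import Counter

# Hypothetical relational data: who is connected to whom, and who bought what.
friends = {
    "ann": {"ben", "cal"},
    "ben": {"ann"},
    "cal": {"ann"},
}
purchases = {
    "ann": {"book"},
    "ben": {"book", "lamp"},
    "cal": {"lamp", "desk"},
}

def related_item_counts(user):
    """Count items bought by the user's direct relations that the user lacks."""
    owned = purchases.get(user, set())
    counts = Counter()
    for friend in friends.get(user, ()):
        for item in purchases.get(friend, ()):
            if item not in owned:
                counts[item] += 1
    return counts

counts = related_item_counts("ann")
print(counts.most_common())
```

The same lookup on a propositional representation would only see ann's own purchase vector; here the related entities' rows contribute as well.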
Whether Knowledge Discovery or Knowledge Prediction, data mining takes information that was once quite difficult to detect and presents it in an easily understandable format (e.g., graphical or statistical).
Data mining techniques involve sophisticated algorithms, including decision tree classification, association detection, and clustering.
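To make association detection a little more concrete, here is a small sketch that computes the two standard measures, support and confidence, for a rule such as "diapers implies beer". The baskets are invented for illustration, and real association miners (e.g., Apriori) search over many itemsets rather than scoring a single rule.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"diapers", "beer"})
c = confidence({"diapers"}, {"beer"})
print(f"support={s:.2f} confidence={c:.2f}")
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.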
Since data mining is not on the test, I will keep things superficial.
User Behavior Validation: Fraud Detection
In the realm of cell phones: comparing phone activity to calling records can help detect calls made on cloned phones.
Similarly, with credit cards: comparing purchases with historical purchases can detect activity with stolen cards.
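One simple way to compare a new purchase with historical purchases is to flag amounts that lie far from the user's past average. The sketch below uses a z-score test; the function name, the threshold of 3 standard deviations, and the sample amounts are assumptions for illustration, and production fraud systems use far richer features.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag amount if it is more than threshold standard deviations from the mean of history."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Made-up purchase amounts for one cardholder.
past = [20.0, 25.0, 22.0, 30.0, 24.0]

print(is_suspicious(past, 26.0))   # close to past behavior
print(is_suspicious(past, 500.0))  # far outside past behavior
```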
Health and Science
Protein Folding: Predicting protein interactions and functionality within biological cells. Applications of this research include determining causes and possible cures for Alzheimer's, Parkinson's, and some cancers (caused by protein "misfolds").
Extra-Terrestrial Intelligence: Scanning radio telescope recordings for possible transmissions from other planets.
For more information see Stanford’s Folding@home and UC Berkeley's SETI@home projects. Both involve participation in a widely distributed computing application.
Mining of public and government databases is done, though people have raised, and continue to raise, concerns.
Wiki quote: "data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics."
Your data is already being mined, whether you like it or not.
Many web services require that you allow access to your information [for data mining] in order to use the service.
Google mines email data in Gmail accounts to present account owners with ads.
This allows access to your blog RSS feed (rather innocuous), as well as information obtained through partner sites (worthy of concern).
Latest one: Facebook's Beacon Advertising program (Just popped on Slashdot within the last week)
What Beacon does: “when you engage in consumer activity at a [Facebook] partner website, such as Amazon, eBay, or the New York Times, not only will Facebook record that activity, but your Facebook connections will also be informed of your purchases or actions.” [taken from http://trickytrickywhiteboy.blogspot.com/2007/11/beware-of-facebooks-beacon.html]
Verdict is still out. This may violate an old (100+ years) New York law prohibiting advertising using endorsements without the endorsee’s consent.
Facebook currently offers users no way to opt out of Beacon (once it has been activated?). Users can close their accounts, but account data is never deleted.