
The Dangers and Delights of Data Mining


Presentation Transcript


  1. The Dangers and Delights of Data Mining Glenn Roe Digital.Humanities@Oxford Summer School July 3 2012

  2. Some opening thoughts.... • Machine Learning (ML) and Data Mining (DM) techniques will drive future humanistic research as a central component of future digital libraries. • Old Digital Humanities (DH) tools were transparent. ML/DM are opaque. • General impact of ML on all humanities research: categorize, link, organize, direct attention to some texts rather than others automatically. • Examine three areas of possible critical assessment. • DH is uniquely well-suited to critique the application of machine learning techniques in the humanities.

  3. Emerging Digital Libraries Scale of digital collections requires machine assistance to: • categorize and organize • propose intertextual relations • evaluate and rank queries • facilitate discovery and navigation There are only about 30,000 days in a human life -- at a book a day, it would take 30 lifetimes to read a million books and our research libraries contain more than ten times that number. Only machines can read through the 400,000 books already publicly available for free download from the Open Content Alliance. -- Gregory Crane Only machines will read all the books.

  4. And 5 million books? We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics”, focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities. www.sciencexpress.org / 16 December 2010

  5. Culturomics…

  6. Reading from afar… (or not at all). Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more. Franco Moretti, “Conjectures on World Literature” (2000) http://www.newleftreview.org/A2094

  7. “Not Reading” has a long history. • L’Histoire du livre • Dépot légal • After death inventories • Library holdings/circulation records • Archives of publishers • Vocabulary of titles (Furet) • Censorship records • … • Martin, Furet, Darnton, Chartier, etc…

  8. From Not Reading to Text Mining By “not reading” we examine: concordances, frequency tables, feature lists, classification accuracies, collocation tables, statistical models, etc. We track: literary topoi (E.R. Curtius), concepts (R. Koselleck, Begriffsgeschichte), and other semantic patterns: over time, between categories, across genres. Distant reading and text mining can thus provide larger contexts for close reading.
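
The frequency and collocation tables mentioned above can be computed with very little code. A minimal sketch in plain Python (the sample text, node word, and window size are invented for illustration; real concordancers like PhiloLogic do far more):

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != node)
    return counts

text = "the glory of the king and the glory of the nation".split()
print(collocates(text, "glory").most_common(3))
```

Sorting such a table by frequency (or by a statistical association measure) is what turns raw concordance output into a trackable semantic pattern.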

  9. Text Mining as Pattern Detection Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful. -- Ian Witten, Data Mining: Practical Machine Learning Tools and Techniques, xvix.

  10. Transparency of traditional DH approaches PhiloLogic: A few choice words... Open Source: http://philologic.uchicago.edu/ Advantages: • Fast, robust, many search and reporting features. • Collocation tables, sortable KWICS, etc. • Handles various encoding schemes and object types. • Known to work with most languages. Limitations: • User initiated search for small number of words. • Limited order of generalization. • How to address larger issues (gender or genre). • What to do with 150,000 (or more) hits?

  11. Transparency of traditional DH approaches PhiloLogic searches return what you asked for in the order in which you asked. Example: a search for various forms of moderni.* in 1850-99 returns 82 hits. Results can be sorted and organized, but require user selection: the user sifts through results and analyzes effectively raw output data.

  12. Machine Learning is opaque... ML systems depend on many assumptions and selections that are not readily available to end users. The hunt for Google’s infamous “secret sauce.” Open competition to find the over 250 ingredients in the Google search/sauce algorithms. A “Black-box” industry: analyzing the secret sauce for profit. Many commercial organizations examine Web mining extensively: e.g., “Search Engine Watch” www.searchenginewatch.com

  13. Two ways of using DM in the humanities 1) Tool approach: PhiloMine, MONK, etc. allows direct manipulation of data mining materials. 2) Embedded approach: results of machine learning or text mining become part of general systems. - Google and other WWW search engines - Dedicated library systems (AquaBrowser) Most humanities scholars will use embedded machine learning systems.

  14. Embedded Machine Learning Systems Humanists are already using machine learning and data mining in general applications: spam filters movie recommendations (Netflix) related book/article suggestions (Amazon) Adwords (monetizing the noun) etc... And coming soon to a library near you: LENS....

  15. Embedded Machine Learning Systems

  16. Building Data Mining Tools: Three types of data/text mining *Distinction is arbitrary and does not cover all text mining tasks. Predictive Classification: learn categories from labeled data, predict on unknown instances. Comparative Classification: learn categories from labeled data to find accuracy rate, errors, and most important features. Similarity: measure document/part similarities, looking for meaningful connections.

  17. Predictive Classification Widely used: spam filters, recommendation systems, etc. Computer “reads” text, identifies the words (features) most associated with each class (author, class of knowledge). Humanities applications: extract classes or labels from contemporary documents. Use contemporary classification system rather than modern system to predict classes. *Problem: information space can be noisy, incoherent.

  18. Predictive Classification Text Mining the Digital Encyclopédie 74,131 articles in the current database 13,272 articles without classification (18%) We trained our classifiers on the 60K classified articles (comprising 2,899 individual classes) to generate a model, which was then used to classify the unknown instances and to reclassify all 74K articles. The resulting “ontology” was optimized to 360 classes – a typical result of machine classification.
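
The train-on-labeled, predict-on-unlabeled workflow described here can be illustrated with a toy multinomial naive Bayes classifier. This is only one of many algorithms suited to the task, and the miniature French training set below is invented; it merely stands in for the 60K labeled articles:

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Multinomial naive Bayes over bag-of-words features, with Laplace smoothing."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-class word frequencies
        self.class_counts = Counter(labels)       # class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            toks = doc.lower().split()
            self.word_counts[label].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.class_counts.values())
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in doc.lower().split():
                lp += math.log((self.word_counts[label][tok] + 1) / total)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Invented miniature training set standing in for the labeled articles
docs = ["le roi et la loi", "la chartre et la coutume",
        "les atomes et la matiere", "la chimie des corps"]
labels = ["Jurisprudence", "Jurisprudence", "Physics", "Physics"]
clf = TinyNB().fit(docs, labels)
print(clf.predict("la chartre du roi"))
```

The same fitted model can then be run over every article, classified or not, which is how the reclassification of all 74K articles proceeds in principle.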

  19. Predictive Classification Classifying the unclassified: • DISCOURS PRELIMINAIRE DES EDITEURS, Class=Philosophy • DEMI-PARABOLE, Class=Algebra • Bois de chauffage, Class=Commerce • Canard, Class=Natural history; Ornithology • Chartre de Champagne, Class=Jurisprudence • Chartre de commune, Class=Jurisprudence • Chartre aux Normands, Class=Jurisprudence • Chartre au roi Philippe, Class=Ecclesiastical history Chartre au roi Philippe fut donnée par Philippe Auguste vers la fin de l'an 1208, ou au commencement de l'an 1209, pour régler les formalités nouvelles que l'on devoit observer en Normandie dans les contestations qui survenoient pour raison des patronnages d'église, entre des patrons laiques & des patrons ecclésiastiques. Cette chartre se trouve employée dans l'ancien coûtumier de Normandie, après le titre de patronnage d'église; & lorsqu'on relut en 1585 le cahier de la nouvelle coûtume, il fut ordonné qu' à la fin de ce cahier l'on inséreroit la chartre au roi Philippe & la chartre Normande. Quelques - uns ont attribué la premiere de ces deux chartres à Philippe III. dit le Hardi; mais elle est de Philippe Auguste, ainsi que l'a prouvé M. de Lauriere au I. volume des ordonnances de la troisieme race, page 26. Voyez aussi à ce sujet le recueil d' arrêts de M. Froland, partie I. chap. vij.

  20. Comparative Classification “Comparative Categorical Feature Analysis” Use classifiers as a form of hypothesis testing. Train a classifier on a set of categories (gender of author, class of knowledge). Run the trained model on the same data to find: • Accuracy of classification • Most salient features • Errors or mis-classified instances *Classification errors can be rich sources of inquiry for humanists.
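
The run-the-model-on-its-own-training-data step can be sketched with a deliberately simple word-overlap scorer; the labeled documents are invented (the last one is worded to resemble the other class), and the point is only the shape of the output: accuracy, salient features, and the misclassified instances that invite closer reading.

```python
from collections import Counter

# Invented labeled data standing in for the Encyclopédie articles;
# the last document's vocabulary deliberately pulls toward the other class.
docs = [("la loi du royaume", "Jurisprudence"),
        ("la loi et la coutume", "Jurisprudence"),
        ("la loi de la chartre", "Jurisprudence"),
        ("les atomes et la matiere des corps", "Physics"),
        ("la loi des atomes", "Physics")]

profiles = {}
for text, label in docs:
    profiles.setdefault(label, Counter()).update(text.split())

def predict(text):
    scores = {label: sum(prof[t] for t in text.split())
              for label, prof in profiles.items()}
    return max(scores, key=scores.get)

# Most salient features per class
for label, prof in profiles.items():
    print(label, prof.most_common(3))

# Re-run the trained model on its own training data: accuracy + errors
errors = [(text, label, predict(text))
          for text, label in docs if predict(text) != label]
accuracy = 1 - len(errors) / len(docs)
print(f"accuracy: {accuracy:.2f}")
print("misclassified:", errors)
```

Here the "error" is exactly the interesting case: an article whose vocabulary belongs to one domain while its assigned class belongs to another.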

  21. Comparative Classification Text Mining the Digital Encyclopédie Original # of classes: 2,899 - New # of classes: 360 73.3% of articles were assigned to their original class, a remarkably high figure given the complexity of the ontology. This means that 26.7% of articles were assigned a different class. Of the 74,131 articles: 44,628 classified correctly 16,231 classified “incorrectly” 13,272 unclassified were classified

  22. Comparative Classification Text Mining the Digital Encyclopédie Accrues: original classification too specific Tepidarium: reclassification seems more logical Achées: incorrect prediction although appropriate given vocabulary

  23. Comparative Classification Predict classifications in other texts: classification of Diderot's Éléments de physiologie by chapter. Most chapters classed as anatomy, medicine, physiology. "Avertissement": literature. Chapter "Des Etres": metaphysics. Chapter "Entendement": metaphysics and grammar. Chapter "Volonté": ethics. Leverage a contemporary classification system as a way to support search and result filtering.

  24. Clusters of Knowledge Top: History, Geography, Literature, Grammar, etc. Middle : Physical Sciences, Physics, Chemistry, etc. Lower: Biological Sciences & Natural History


  28. Similarity: Documents Comparative and Predictive Classification are one way to find meaningful patterns by abstracting data from the text. They typically build abstract models of a knowledge space based on identified characteristics of documents (supervised learning). Document similarity: unsupervised learning based on statistical characteristics of the contents of texts. Many applications: clustering, topic modeling, kNN classifiers, etc.

  29. Vector Space Similarity (VSM) Documents are “bags of words” (no word order). Each bag can be viewed as a vector. Vector dimensionality corresponds to the number of words in our vocabulary. The value at each dimension is the number of occurrences of the associated word in the given document, e.g. amour: 1, ancien: 0, livre: 3, propre: 0. All document vectors taken together comprise a document-term matrix. *Used for many applications, from information retrieval to topic segmentation.
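
A document-term matrix like the one described above can be built in a few lines; the three short "documents" below are invented to echo the slide's example vocabulary:

```python
from collections import Counter

docs = ["amour et livre", "livre ancien livre livre", "amour propre"]

# Vocabulary = sorted union of all tokens; one matrix column per word
vocab = sorted({t for d in docs for t in d.split()})

# One row per document, counting occurrences of each vocabulary word
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the "bag of words" vector for one document; word order within the document is discarded, exactly as the model assumes.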

  30. Identification of similar articles dj = (w1,j, w2,j, ..., wt,j) and q = (w1,q, w2,q, ..., wt,q) Similarity: cosine of the angle between the two vectors in n-dimensional space, where the dimensionality is equal to the number of words in the vectors.
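
Cosine similarity between two such vectors is the dot product divided by the product of the vector norms. A minimal sketch (the example vectors are invented; the second is a scaled copy of the first, so their cosine is 1.0):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d = [1, 0, 3, 0]
q = [2, 0, 6, 0]   # same direction as d, twice the length
print(cosine(d, q))  # 1.0
```

Because the measure depends only on the angle, a short article and a long article about the same subject can still score as highly similar, which is both the strength and the "size matters" caveat of the approach.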

  31. Identification of similar articles Vector space can be used to identify similar articles. Size matters - some unexpected results. Articles most similar to GLOIRE, GLORIEUX, GLORIEUSEMENT (Voltaire): VANITÉ, NA, [Ethics] [0.539] VOLUPTÉ, NA, [Ethics] [0.514] FLATEUR, Jaucourt, [Ethics] [0.513] GOUVERNANTE d’enfans, Lefebvre, [0.511] CHRISTIANISME, NA, [Theology | Political science] [0.502] PAU, Jaucourt, [Modern geography] [0.493] PAU: birthplace of Henri IV.

  32. VSM: Strengths/Limitations Well understood. Standard and robust. Many applications: kNN classifiers, clustering, topic segmentation. Assigns a numeric score which can be used with other measures (e.g., edit distance of headword) Numerous extensions and modifications : Latent Semantic Analysis, etc. Bag of words: no notion of text order. Requires identification of documents or block: articles.   Not suitable for running text. Cannot identify smaller borrowings in longer texts. Similarity can reflect topic, subject, or theme, unrelated to “borrowing” or reuse.

  33. Topic Modeling and LDA • Topic modeling is a probabilistic method to classify text using distributions over words. • In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. • This method of analyzing text was first demonstrated by David Blei, Andrew Ng and Michael Jordan in 2002. (Image: Johann Peter Gustav Lejeune Dirichlet)

  34. What does LDA do? • LDA is an unsupervised word clusterer and classifier. • Preliminary assumption: each text is a combination of several topics. • Each document is given a classification with a ranking of the most important topics. • LDA generates distributions over words, or topics, from the text and classifies the corpus accordingly. Example topic: dieu ame monde etre nature matiere esprit chose homme substance principe corps univers philosophe systeme idee intelligence eternite rien divine existence creature
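
To make the mechanics less of a black box, here is a toy collapsed Gibbs sampler, the standard inference method for LDA, run over an invented four-document corpus with two thematic clusters. Production implementations (MALLET, gensim, etc.) are far more sophisticated; this sketch only shows the count-and-resample loop at the heart of the method:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})          # vocabulary size
    ndk = [[0] * k for _ in docs]                  # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]     # topic-word counts
    nk = [0] * k                                   # tokens per topic
    z = []                                         # topic assignment per token
    for di, doc in enumerate(docs):                # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                      # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # resample topic from the conditional distribution
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                r = rng.uniform(0, sum(weights))
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        t = j
                        break
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw

# Invented miniature corpus: metaphysics words vs. matter words
docs = [["dieu", "ame", "esprit", "dieu"], ["atome", "matiere", "corps", "atome"],
        ["dieu", "esprit", "ame"], ["matiere", "corps", "atome"]]
ndk, nkw = lda_gibbs(docs, k=2)
for t, counts in enumerate(nkw):
    print("topic", t, sorted(counts, key=counts.get, reverse=True)[:3])
```

`ndk` gives each document's topic mixture (the per-document ranking of topics described above), and `nkw` gives each topic's distribution over words.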

  35. Prior research on LDA • David Blei ran a series of experiments on the journal Science from 1880 to 2002. Topic: energy molecule atoms matter atomic molecular theory (1900-1910) "The Atomic Theory from the Chemical Standpoint" "On Kathode Rays and Some Related Phenomena" "The Electrical Theory of Gravitation" "A Determination of the Nature and Velocity of Gravitation" "Experiments of J. J. Thomson on the Structure of the Atom"

  36. Future research with LDA • Text segmentation: identify topic shifts within a document by classifying paragraphs. • Dynamic topic modeling: understand how discourse evolves over time. Example from David Blei on epidemiology: 1880: disease, cholera, found, fever, organisms 1910: disease, fund, fungus, spores, cultures 1940: cultures, virus, culture, strain, strains 1970: mice, strain, strains, host, bacteria 2000: bacteria, strain, strains, resistance, bacterial

  37. Strengths/Weaknesses of LDA • LDA is a powerful tool for classifying unclassified data sets. • A lot of research is being done on topic modeling by computer scientists: it is our challenge to take their findings and apply them to text analysis. • LDA is just one aspect of the wider goal of having machines contextualize text, identify coherent segments, and ultimately ease the processing of very large corpora.

  38. A “Critical” Approach to Data Mining Critique is a fundamental humanistic activity which is not necessarily limited to texts (i.e., “reading the body”). Machine learning will be a necessary component of future humanities research, and Digital Humanities is uniquely situated to critique ML tools and their applicability moving forward. I will touch on three primary areas of critique drawn from our own experiments with machine learning: 1) algorithms, features, and parameters; 2) classification and ontologies; 3) intertextual relations.

  39. Opening the Black Box: PhiloMine Open Source: http://code.google.com/p/philomine/ • PhiloLogic extension uses existing services. • Permits moving to particular texts or features. • WWW based form submission with defined tasks. • Many classifiers (Support Vector Machine, etc). • Many features (words, n-grams, lemmas, etc). • Many feature selection and normalization options.

  40. Opening the Black Box: PhiloMine

  41. Algorithms, Features, & Parameters Algorithms = classifiers, segmenters, similarity measures, aligners Features = elements of texts, salient to the task, which can be computed (words, lemmas, n-grams, etc.) Parameters = the many settings that can significantly alter results The devil is in the combination of details at all levels...

  42. Features and Parameters Matter Parameter selection includes: • type of features, such as words, n-grams, and lemmas • range of features, such as limiting to features that appear in a minimum number of instances • statistical normalization of features • thresholds for various functions Algorithm and parameter selection are task and data dependent. The selection of algorithms and adjustment of parameters can radically alter results. For example...
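
Two of the parameters listed here, feature type (n-gram size) and a minimum-frequency threshold, can be seen changing the feature set directly. The documents and parameter names below are invented for illustration (the `min_df` name mirrors a common convention in text-mining toolkits):

```python
from collections import Counter

docs = ["la chartre du roi", "la chartre normande",
        "le roi philippe", "les atomes de la matiere"]

def features(docs, min_df=1, ngrams=1):
    """Extract n-gram features that appear in at least `min_df` documents."""
    df = Counter()
    for d in docs:
        toks = d.split()
        grams = {" ".join(toks[i:i + ngrams]) for i in range(len(toks) - ngrams + 1)}
        df.update(grams)               # document frequency: one count per doc
    return sorted(g for g, c in df.items() if c >= min_df)

print(features(docs, min_df=1))            # every unigram
print(features(docs, min_df=2))            # only terms shared across documents
print(features(docs, min_df=1, ngrams=2))  # bigrams instead of words
```

Raising `min_df` from 1 to 2 collapses the feature space from eleven words to three, and switching to bigrams produces an entirely different feature set: the same corpus, radically different inputs to any downstream classifier.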
