The Unreasonable Effectiveness of Data


Presentation Transcript


  1. The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, and Fernando Pereira Kristine Monteith May 1, 2009 CS 652

  2. Why “Unreasonable Effectiveness”? • Title taken from an article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” • Physics formulas are nice and tidy • Linguistic formulas not so simple (e.g. even an incomplete grammar of the English language runs over 1,700 pages) • But what linguistics lacks in elegant formulas, it makes up for with LOTS of data • In 2006, Google released an annotated corpus of over one trillion words

  3. What makes a task “easy”? • Biggest successes in NLP: statistical speech recognition and statistical machine translation • Harder than tasks such as document classification • But there’s lots of data available that doesn’t require expensive manual annotation (e.g. European Union translators, closed captioning) • Automatically discover semantic relationships from the accumulated evidence of web-based text patterns • “Invariably, simple models and a lot of data trump more elaborate models based on less data”

  4. False dichotomy • “Deep” approach • Hand coded grammars and ontologies, represented by complex networks of relationships • Statistical approach • N-gram statistics from large corpora
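The statistical approach named above can be made concrete with a minimal sketch: counting n-grams over a toy corpus. The corpus and counts here are purely illustrative, not drawn from Google's trillion-word release.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-token window in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# A toy corpus; real systems run this over billions of words.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(corpus, 2)
print(bigrams[("the", "cat")])  # → 2
```

The same counting loop scales from this toy example to web-sized corpora; only the storage and sharding change, not the idea.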

  5. Actually three problems of NLP • Choosing a representation language • First order logic, finite state machines, etc. • Encoding a model in that language • Manual encoding, word counts, etc. • Performing inference on that model • Complex inference models, Bayesian statistics

  6. Semantic Web vs. Semantic Interpretation • “The Semantic Web is a convention for formal representation languages that lets software services interact with each other ‘without needing artificial intelligence.’” • Agree on standards for representing dates, prices, locations, etc. • Services can then interact with other services that use the same standard or a different one with a known translation • “The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability.”

  7. Challenges of Building Semantic Web Services • Ontology writing • Difficulty of implementation • Competition • Inaccuracy and deception

  8. Challenges of Achieving Accurate Semantic Interpretation • The semantic web has managed to get hundreds of millions of authors to share a trillion pages of content, aggregated and indexed • Still need to find the meaning of entries: • Does “HP” refer to “Helmerich and Payne” or “Hewlett-Packard”? • Which “Joe’s Pizza” are we talking about?
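One data-driven way to attack the “HP” ambiguity on this slide is a bag-of-words overlap heuristic: score each candidate expansion by how many of its typical context words appear near the mention. The context profiles below are hypothetical stand-ins for statistics that would really be gathered from a large corpus.

```python
def disambiguate(context_tokens, senses):
    """Pick the sense whose context profile shares the most words
    with the observed context (a simple bag-of-words overlap)."""
    return max(senses, key=lambda s: len(senses[s] & set(context_tokens)))

# Hypothetical context profiles for the two expansions of "HP".
senses = {
    "Hewlett-Packard": {"printer", "laptop", "computer", "ink"},
    "Helmerich and Payne": {"drilling", "rig", "oil", "contract"},
}
print(disambiguate("my hp printer is out of ink".split(), senses))
# → Hewlett-Packard
```

With enough unlabeled text, these profiles fall out of co-occurrence counts automatically, which is exactly the paper's point: more data, simpler model.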

  9. Example Task: Find Synonyms for Attribute Names • Looking to recognize facts such as “Company Name” = “Company” or “Price” = “Discount” • Extract 2.5 million distinct schemata from 150 million tables • Examine co-occurrence of names in these schemata • If A and B rarely occur together, but both often occur with C, then A and B may be synonyms
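The co-occurrence heuristic on this slide can be sketched directly: flag attribute pairs that never appear in the same schema but share many co-occurring attributes. The schemata, attribute names, and thresholds below are invented for illustration; the real system ran over 2.5 million schemata.

```python
from itertools import combinations

def synonym_candidates(schemata, min_shared=2, max_together=0):
    """Return attribute pairs that (almost) never co-occur in one schema
    but frequently co-occur with the same third attributes."""
    together = {}   # how often a pair appears in the same schema
    contexts = {}   # for each attribute, attributes seen alongside it
    for schema in schemata:
        for a, b in combinations(sorted(schema), 2):
            together[(a, b)] = together.get((a, b), 0) + 1
        for attr in schema:
            contexts.setdefault(attr, set()).update(schema - {attr})
    pairs = []
    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        if together.get((a, b), 0) <= max_together and len(shared) >= min_shared:
            pairs.append((a, b))
    return pairs

# Toy schemata: "company" and "company-name" never co-occur,
# but both occur with "price", "date", and "address".
schemata = [
    {"company", "price", "date"},
    {"company-name", "price", "date"},
    {"company", "address", "price"},
    {"company-name", "address", "date"},
]
print(synonym_candidates(schemata))  # → [('company', 'company-name')]
```

At web scale the pair counts would be noisy rather than exactly zero, so the thresholds become statistical tests, but the shape of the computation is the same.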

  10. “So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”
