The Unreasonable Effectiveness of Data


Presentation Transcript


  1. The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, and Fernando Pereira Kristine Monteith May 1, 2009 CS 652

  2. Why “Unreasonable Effectiveness”? • Title taken from an article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” • Physics formulas are nice and tidy • Linguistic formulas not so simple (e.g. even an incomplete grammar of the English language runs over 1,700 pages) • But what linguistics lacks in elegant formulas, it makes up for with LOTS of data • In 2006, Google released an annotated corpus of over one trillion words

  3. What makes a task “easy”? • Biggest successes in NLP: statistical speech recognition and statistical machine translation • Harder than tasks such as document classification • But there’s lots of data available that doesn’t require expensive manual annotation (e.g. European Union translators, closed captioning) • Automatically discover semantic relationships from the accumulated evidence of web-based text patterns • “Invariably, simple models and a lot of data trump more elaborate models based on less data”

  4. False dichotomy • “Deep” approach • Hand coded grammars and ontologies, represented by complex networks of relationships • Statistical approach • N-gram statistics from large corpora
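The statistical approach named above can be made concrete with a minimal sketch: counting n-grams over a toy corpus. The corpus and counts here are purely illustrative, not drawn from Google's trillion-word release.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-token window in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# A toy corpus; real systems run this over billions of words.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(corpus, 2)
print(bigrams[("the", "cat")])  # → 2
```

The same counting loop scales from this toy example to web-sized corpora; only the storage and sharding change, not the idea.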

  5. Actually three problems of NLP • Choosing a representation language • First order logic, finite state machines, etc. • Encoding a model in that language • Manual encoding, word counts, etc. • Performing inference on that model • Complex inference models, Bayesian statistics

  6. Semantic Web vs. Semantic Interpretation • “The Semantic Web is a convention for formal representation languages that lets software services interact with each other ‘without needing artificial intelligence.’” • Agree on standards for representing dates, prices, locations, etc. • Services can then interact with other services that use the same standard or a different one with a known translation • “The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability.”

  7. Challenges of Building Semantic Web Services • Ontology writing • Difficulty of implementation • Competition • Inaccuracy and deception

  8. Challenges of Achieving Accurate Semantic Interpretation • The semantic web has managed to get hundreds of millions of authors to share a trillion pages of content, aggregated and indexed • Still need to find the meaning of entries: • Does “HP” refer to “Helmerich and Payne” or “Hewlett-Packard”? • Which “Joe’s Pizza” are we talking about?
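One data-driven way to attack the “HP” ambiguity on this slide is a bag-of-words overlap heuristic: score each candidate expansion by how many of its typical context words appear near the mention. The context profiles below are hypothetical stand-ins for statistics that would really be gathered from a large corpus.

```python
def disambiguate(context_tokens, senses):
    """Pick the sense whose context profile shares the most words
    with the observed context (a simple bag-of-words overlap)."""
    return max(senses, key=lambda s: len(senses[s] & set(context_tokens)))

# Hypothetical context profiles for the two expansions of "HP".
senses = {
    "Hewlett-Packard": {"printer", "laptop", "computer", "ink"},
    "Helmerich and Payne": {"drilling", "rig", "oil", "contract"},
}
print(disambiguate("my hp printer is out of ink".split(), senses))
# → Hewlett-Packard
```

With enough unlabeled text, these profiles fall out of co-occurrence counts automatically, which is exactly the paper's point: more data, simpler model.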

  9. Example Task: Find Synonyms for Attribute Names • Looking to recognize facts such as “Company Name” = “Company” or “Price” = “Discount” • Extract 2.5 million distinct schemata from 150 million tables • Examine co-occurrence of names in these schemata • If A and B rarely occur together, but both often occur with C, then A and B may be synonyms
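The co-occurrence heuristic on this slide can be sketched directly: flag attribute pairs that never appear in the same schema but share many co-occurring attributes. The schemata, attribute names, and thresholds below are invented for illustration; the real system ran over 2.5 million schemata.

```python
from itertools import combinations

def synonym_candidates(schemata, min_shared=2, max_together=0):
    """Return attribute pairs that (almost) never co-occur in one schema
    but frequently co-occur with the same third attributes."""
    together = {}   # how often a pair appears in the same schema
    contexts = {}   # for each attribute, attributes seen alongside it
    for schema in schemata:
        for a, b in combinations(sorted(schema), 2):
            together[(a, b)] = together.get((a, b), 0) + 1
        for attr in schema:
            contexts.setdefault(attr, set()).update(schema - {attr})
    pairs = []
    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        if together.get((a, b), 0) <= max_together and len(shared) >= min_shared:
            pairs.append((a, b))
    return pairs

# Toy schemata: "company" and "company-name" never co-occur,
# but both occur with "price", "date", and "address".
schemata = [
    {"company", "price", "date"},
    {"company-name", "price", "date"},
    {"company", "address", "price"},
    {"company-name", "address", "date"},
]
print(synonym_candidates(schemata))  # → [('company', 'company-name')]
```

At web scale the pair counts would be noisy rather than exactly zero, so the thresholds become statistical tests, but the shape of the computation is the same.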

  10. “So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”
