The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig and Fernando Pereira Google 2011. 10. 24 Eun-Sol Kim
The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve. • Eugene Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences • Essentially, all models are wrong but some are useful • George Box
Two approaches to AI • GOFAI ( Good Old-Fashioned Artificial Intelligence ) • Based on Logic • Symbolic AI • SML ( Statistical Machine Learning ) • Based on empirical data ( sensor data or databases ) • Inductive inference based on data, generalize data to rules, predict on future data
Scene completion using millions of photographs - Hays et al., CMU, SIGGRAPH 2007
Learning from Text at Web Scale • Brown Corpus • 1 million English words • Complete sentences, no spelling errors, no grammatical errors • Google's trillion-word corpus • A million times larger than the Brown Corpus • Frequency counts for all sequences up to 5 words long.
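A minimal sketch of what "frequency counts for all sequences up to 5 words long" means in practice. The function name `ngram_counts` and the toy sentence are illustrative assumptions, not part of the original corpus tooling; a real web-scale count would be distributed over many machines rather than held in one in-memory counter.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy input (hypothetical); the real corpus has about a trillion tokens.
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[("the",)])        # unigram count
print(counts[("the", "cat")])  # bigram count
```

Keys are tuples of words, so the same table serves unigrams through 5-grams.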
Some lessons of web-scale learning 1. Use available large-scale data rather than annotated data • We can find useful semantic relationships automatically from the statistics of search queries and the corresponding results or from the accumulated evidence of web-based text patterns without annotated data.
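One way to picture "accumulated evidence of web-based text patterns": match a simple lexical pattern such as "X such as A, B and C" across many sentences and count how often each (instance, category) pair shows up. This is a sketch under assumptions: the pattern, the function name `extract_is_a`, and the sample sentences are all hypothetical, and real systems use many patterns plus noise filtering.

```python
import re
from collections import Counter

# Separators between listed instances, longest alternative first.
SEP = r"(?:, and | and | or |, )"
PATTERN = re.compile(r"(\w+) such as (\w+(?:" + SEP + r"\w+)*)")

def extract_is_a(sentences):
    """Accumulate (instance, category) evidence from 'such as' patterns."""
    evidence = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence.lower()):
            category = match.group(1)
            for instance in re.split(SEP, match.group(2)):
                evidence[(instance, category)] += 1
    return evidence

# Hypothetical sentences standing in for unannotated web text.
sentences = [
    "Cities such as Paris, London and Rome are crowded.",
    "Cities such as Paris and Berlin attract tourists.",
    "She likes fruits such as apples, pears, and plums.",
]
evidence = extract_is_a(sentences)
print(evidence.most_common(3))
```

No sentence is labeled by hand; pairs that recur across independent sentences simply gather more evidence.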
2. Memorization is a good policy • Memorizing specific phrases is more effective than general patterns. • Machine translation example : Large memorized phrase tables that give candidate mappings between specific source- and target-language phrases. • For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
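A toy illustration of a memorized phrase table. The table entries, scores, and the greedy longest-match strategy are assumptions made for this sketch; production systems search over many segmentations and combine phrase scores with a language model.

```python
# Hypothetical German-to-English phrase table: source phrase -> candidates
# with made-up scores.
PHRASE_TABLE = {
    ("guten", "morgen"): [("good morning", 0.9), ("good day", 0.1)],
    ("guten",): [("good", 0.8)],
    ("morgen",): [("tomorrow", 0.6), ("morning", 0.4)],
    ("wie", "geht", "es", "dir"): [("how are you", 0.95)],
}

def translate(tokens, table, max_len=4):
    """Greedy left-to-right translation preferring the longest known phrase."""
    output = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                # Take the highest-scoring candidate for this phrase.
                best, _score = max(table[phrase], key=lambda c: c[1])
                output.append(best)
                i += n
                break
        else:
            # Unknown word: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return " ".join(output)

print(translate("guten morgen wie geht es dir".split(), PHRASE_TABLE))
```

The memorized pair "guten morgen" yields "good morning", whereas word-by-word lookup in the same table would give "good tomorrow".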
Two conventional approaches to NLP • Deep approach • Hand-coded grammars and ontologies • Complex networks of relations • Statistical approach • Learning n-gram statistics from large corpora
New approaches to NLP • Combination of the two conventional approaches • Statistical relational learning • Represent relations between objects with rules (first-order logic) • Model built by statistical learning
Semantic interpretation • Semantic web • A convention for formal representation languages that lets software services interact with each other • Semantic interpretation • Deals with imprecise, ambiguous natural languages. • Embodied in human cognitive and cultural processes whereby linguistic expression elicits expected responses and expected changes in cognitive states
The challenges for achieving accurate semantic interpretation • Interpreting the content • Methods are needed to infer relationships between column headers or mentions of entities in the world. • Web-scale data might be an important part of the solution. • Hundreds of millions of independently created tables. • Tables represent structured data • With tables, we can resolve semantic heterogeneity.
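A sketch of how a large collection of independently created tables could help relate column headers: two headers that never appear in the same table but appear alongside the same other headers are candidate synonyms. The function name `synonym_candidates`, the Jaccard score, the 0.5 threshold, and the sample schemas are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

def synonym_candidates(schemas, min_overlap=0.5):
    """Find header pairs that never co-occur but share co-occurring headers."""
    contexts = defaultdict(set)  # header -> headers seen in the same table
    together = set()             # header pairs seen in the same table
    for headers in schemas:
        header_set = set(headers)
        for h in header_set:
            contexts[h] |= header_set - {h}
        together.update(combinations(sorted(header_set), 2))

    results = []
    for a, b in combinations(sorted(contexts), 2):
        if (a, b) in together:
            continue  # real synonyms rarely share a table
        union = contexts[a] | contexts[b]
        if not union:
            continue
        # Jaccard overlap of the two headers' contexts.
        score = len(contexts[a] & contexts[b]) / len(union)
        if score >= min_overlap:
            results.append((a, b, score))
    return sorted(results, key=lambda r: -r[2])

# Hypothetical table schemas standing in for web tables.
schemas = [
    ["make", "model", "year", "price"],
    ["manufacturer", "model", "year", "price"],
    ["make", "model", "color", "price"],
    ["manufacturer", "model", "color", "year"],
    ["title", "author", "year"],
]
result = synonym_candidates(schemas)
print(result)
```

With only five tables the signal is fragile; the argument on the slide is that hundreds of millions of tables make such statistics much more reliable.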
Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.