
How to make sense out of unstructured data?

This presentation surveys ways to obtain information from unstructured data: search engines, information extraction (IE), and natural language processing (NLP). It asks what databases can do for unstructured data and highlights the challenges, such as incomplete and error-prone extracted data and imprecise user queries. It introduces QUIC, a system that handles data incompleteness and query imprecision together, and LPath, a query language for linguistic annotation data produced by NLP. It closes by asking how to close the loop between documents, databases, and queries.


Presentation Transcript


  1. How to make sense out of unstructured data?
     Yi Chen, Dept. of Computer Science and Engineering, Arizona State University

  2. Databases Have Been a Great Success
     • for managing structured data
     • But 85% of the world’s data is not in databases!

  3. How to Obtain Information from Unstructured Data?
     • Efforts have been made by other areas
       • Search engines: Google, Yahoo, MSN, Ask, …
       • Information extraction (IE) [Avatar, TIES, …]
       • Natural language processing (NLP) [Treebank, UIMA, …]
       • These efforts produce semi-structured data from text
     • What can databases do for unstructured data?
       • XML provides a good basis for representing semi-structured data
       • However, challenges remain!

  4. Querying Data Generated from IE
     • Information extraction produces data about specific entities and relationships
     • Data generated from information extraction are error-prone
       • incomplete data [Imieliński, Koch, …]
       • probabilistic databases [Getoor, Jagadish, Halevy, Subrahmanian, Suciu, Tannen, Widom, …]
       • malleable schemas [Chang, Halevy, Ives, …]
     • Queries posed by naïve users are inaccurate
       • keywords [Agrawal, Chaudhuri, Das, Doan, Gravano, Papakonstantinou, Shanmugasundaram, …]
       • over- or under-specified queries [Chaudhuri, …]
       • natural language queries [Jagadish, …]
     • QUIC: a system that handles data incompleteness and query imprecision at the same time for autonomous databases [CIDR 07, ICDE 07]
     • Collaborated with Subbarao Kambhampati, Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, and Ullas Nambiar
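To make the data-incompleteness problem on this slide concrete, here is a minimal sketch of answering a selection query over rows with missing values. It is not QUIC's algorithm: the table `CARS`, the function `answer`, and the idea of scoring a row with a missing value by the fraction of complete rows sharing its `evidence` attribute that match the query are all assumptions made for illustration. It covers only the incompleteness side, not query imprecision.

```python
from collections import Counter

# Hypothetical extracted table; None marks a value that extraction missed.
CARS = [
    {"make": "BMW", "model": "Z4", "body": "convertible"},
    {"make": "BMW", "model": "Z4", "body": "convertible"},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Honda", "model": "Civic", "body": "sedan"},
    {"make": "Honda", "model": "Civic", "body": "coupe"},
    {"make": "Honda", "model": "Civic", "body": None},
    {"make": "Audi", "model": "TT", "body": "convertible"},
    {"make": "Audi", "model": "TT", "body": "coupe"},
    {"make": "Audi", "model": "TT", "body": None},
]


def answer(rows, attr, value, evidence):
    """Answer the query `attr = value` over rows that may lack `attr`.

    Returns (certain, possible):
      certain  - rows whose stored value matches the query;
      possible - (score, row) pairs for rows where `attr` is missing, scored
                 by the fraction of complete rows with the same `evidence`
                 value that match (an assumed, simplistic estimate).
    """
    complete = [r for r in rows if r[attr] is not None]
    totals = Counter(r[evidence] for r in complete)
    matches = Counter(r[evidence] for r in complete if r[attr] == value)

    certain = [r for r in complete if r[attr] == value]
    possible = []
    for r in rows:
        seen = totals[r[evidence]]
        if r[attr] is None and seen:
            score = matches[r[evidence]] / seen
            if score > 0:
                possible.append((score, r))
    # Highest estimated relevance first; ties keep their original order.
    possible.sort(key=lambda pair: pair[0], reverse=True)
    return certain, possible


certain, possible = answer(CARS, "body", "convertible", evidence="model")
for score, row in possible:
    print(f"{score:.2f}", row["make"], row["model"])
```

A conventional query would return only the `certain` rows; the ranked `possible` list shows how rows with a missing value can still be offered to the user instead of being silently dropped.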

  5. Querying Data Generated from NLP
     [Figure: parse tree with node labels S, NP, VP, V, PP, Prep, Det over the words “Alice”, “saw”, “a”, “dog”, “with”, “Bob”, “today”]
     • Natural language processing generates tree-structured data (parse trees)
     • Understanding the lexical structure of a sentence helps query answering
       • E.g. find the NP after “Bob” and “with” within an NP
     • Demands queries similar to but different from XQuery/XPath queries
     • LPath: a query language for linguistic annotation data generated from NLP over text documents [ICDE 06]
     • Collaborated with Susan Davidson, Steven Bird, Haejoong Lee, and Yifeng Zheng
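The kind of query this slide describes, "the NP that comes right after a given word, optionally restricted to one enclosing phrase", can be sketched in plain Python. This is not LPath syntax or its implementation: the tree shape for "Alice saw a dog with Bob today", the `N` label, and the helper names `annotate` and `following` are assumptions for illustration. Each node is given a word span, and "immediately follows" is then a comparison of span boundaries.

```python
# Assumed parse tree as nested tuples: (label, child, ...) or (label, "word").
TREE = ("S",
        ("NP", ("N", "Alice")),
        ("VP",
         ("V", "saw"),
         ("NP",
          ("NP", ("Det", "a"), ("N", "dog")),
          ("PP", ("Prep", "with"), ("NP", ("N", "Bob"))))),
        ("NP", ("N", "today")))


def annotate(tree, start=0):
    """Return (label, start, end, text) records for every node, in pre-order.

    start/end are word offsets, so a node covers words[start:end].
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, start, start + 1, children[0])]
    records, words, pos = [], [], start
    for child in children:
        sub = annotate(child, pos)
        records.extend(sub)
        pos = sub[0][2]            # the child's own record comes first
        words.append(sub[0][3])
    return [(label, start, pos, " ".join(words))] + records


def following(records, word, label, scope=None):
    """Texts of `label` nodes that start right after `word`.

    If `scope` is given, the word and the node must both lie inside
    a single node carrying the `scope` label.
    """
    word_spans = sorted({(s, e) for _, s, e, text in records
                         if text == word and e - s == 1})
    hits = []
    for w_start, w_end in word_spans:
        for lab, start, end, phrase in records:
            if lab != label or start != w_end:
                continue
            in_scope = scope is None or any(
                s_lab == scope and s_start <= w_start and end <= s_end
                for s_lab, s_start, s_end, _ in records)
            if in_scope:
                hits.append(phrase)
    return hits


RECORDS = annotate(TREE)
print(following(RECORDS, "saw", "NP"))               # NPs right after "saw"
print(following(RECORDS, "Bob", "NP"))               # crosses the phrase boundary
print(following(RECORDS, "Bob", "NP", scope="NP"))   # restricted to one NP
```

The unscoped query after "Bob" reaches "today", which sits in a different branch of the tree, while the scoped version does not. Expressing that kind of adjacency plus scoping is where tree queries over linguistic data differ from what XPath's axes offer directly.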

  6. Challenge
     • How should we close the loop?
     [Figure: loop diagram with the labels Documents, Databases, Queries, Revised queries, Result 1, Result 2]
