Managing Unstructured Data
AnHai Doan
University of Wisconsin-Madison
Unstructured Data ...
• Appears in many forms
  • emails, Web pages, memos, call center text records, etc.
• Is pervasive
  • 80% of the world's data, and growing
• Managed by many players
  • SIGIR/WWW/KDD/AAAI, Google/Yahoo/Microsoft/IBM
We should work on it, or risk missing the boat!
But what sets us apart from the above guys?
Structure + System Focus!
• Make it very easy to extract structures from raw data
  • in raw form: keyword search / bag analysis
  • many apps want to go beyond that; they want structure
  • we should encourage this: it leads back to our playground
  • not just DB + IR, but DB + IR + IE
• Instead of working on isolated research problems, let's build an end-to-end UDMS
  • should repeat what we did with System R / Ingres: a system blueprint, followed by 20 years of rapid progress
  • unifies & accelerates our research efforts
  • keeps work grounded, makes impact
What Does this System Look Like?
DB + IR + IE + II, in a best-effort, Web 2.0 fashion
• Joe Hellerstein
• Flexible modes of interaction
• Extraction + Integration
• Joe Six-Pack
• Mass collaboration
• Best-effort, pay-as-you-go, improving over time
• Scale up to huge data (by running over clusters)
Broader Impacts
• Great for many current applications
  • e-science, business, personal data, Web data, etc.
• Great for many current research topics
  • IR, integration, PIM, data spaces
  • user interfaces, HCI, mashup
  • provenance, uncertainty
  • cluster management
  • query processing
  • monitoring, handling changes, pub/sub systems
• Raises novel research issues
  • mass collab, best-effort, extraction, helping Joe Six-Pack
• Helps define data mgt principles in broader contexts