1. Semex: A Platform for Personal Information Management and Integration Xin (Luna) Dong
University of Washington
June 24, 2005
2. Is Your Personal Information a Mine or a Mess?
[Speaker notes: Mention Tim Berners-Lee. PIM workshop at last VLDB?]
4. Questions Hard to Answer Where are my SEMEX papers and presentation slides (maybe in an attachment)?
5. Index Data from Different Sources E.g., Google, MSN desktop search
6. Questions Hard to Answer Where are my SEMEX papers and presentation slides (maybe in an attachment)?
Who is working on SEMEX?
What are the emails sent by my PKU alumni?
What are the phone numbers and emails of my coauthors?
7. Organize Data in a Semantically Meaningful Way
8. Questions Hard to Answer Where are my SEMEX papers and presentation slides (maybe in an attachment)?
Who is working on SEMEX?
What are the emails sent by my PKU alumni?
What are the phone numbers and emails of my coauthors?
Which of the SIGMOD05 authors do I know?
9. Integrate Organizational and Public Data with Personal Data
11. SEMEX (SEMantic EXplorer) I. Provide a Logical View of Data
12. SEMEX (SEMantic EXplorer) II. On-the-fly Data Integration
13. How to Find Alon's Papers on My Desktop?
14. How to Find Alon's Papers on My Desktop? Google Search Results
17. Semex Goal Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
Build the logical view automatically
Extract object instances and associations
Remove instance duplications
Leverage the logical view for on-the-fly data integration
Exploit the logical view for information search and browsing to improve people's productivity
Be resilient to the evolution of the logical view
18. An Ideal PIM is a Magic Wand
20. Outline Problem definition and project goals
Technical issues:
System architecture and instance extraction [CIDR05]
Reference reconciliation [Sigmod05]
On-the-fly data integration
Association search and browsing
Domain model personalization and evolution [WebDB05]
Interleaved with Semex demo [Best demo in Sigmod05]
Overarching PIM Themes
21. System Architecture
23. Reference Reconciliation in Semex
24. Semex Without Reference Reconciliation
28. Semex NEEDS Reference Reconciliation
29. Reference Reconciliation A very active area of research in Databases, Data Mining and AI. (Surveyed in [Cohen, et al. 2003])
Traditional approaches assume matching tuples from a single table
Based on pair-wise comparisons
Harder in our context
30. Challenges
Article:
a1=(Bounds on the Sample Complexity of Bayesian Learning, 703-746, {p1,p2,p3}, c1)
a2=(Bounds on the sample complexity of bayesian learning, 703-746, {p4,p5,p6}, c2)
Venue:
c1=(Computational learning theory, 1992, Austin, Texas)
c2=(COLT, 1992, null)
Person:
p1=(David Haussler, null)
p2=(Michael Kearns, null)
p3=(Robert Schapire, null)
p4=(Haussler, D., null)
p5=(Kearns, M. J., null)
p6=(Schapire, R., null)
31. Challenges
Article:
a1=(Bounds on the Sample Complexity of Bayesian Learning, 703-746, {p1,p2,p3}, c1)
a2=(Bounds on the sample complexity of bayesian learning, 703-746, {p4,p5,p6}, c2)
Venue:
c1=(Computational learning theory, 1992, Austin, Texas)
c2=(COLT, 1992, null)
Person:
p1=(David Haussler, null)
p2=(Michael Kearns, null)
p3=(Robert Schapire, null)
p4=(Haussler, D., null)
p5=(Kearns, M. J., null)
p6=(Schapire, R., null)
p7=(Robert Schapire, schapire@research.att.com)
p8=(null, mkearns@cis.uppen.edu)
p9=(mike, mkearns@cis.uppen.edu)
32. Intuition Complex information spaces can be considered as networks of instances and associations between the instances
Key: exploit the network, specifically, the clues hidden in the associations
33. I. Exploiting Richer Evidence
Cross-attribute similarity: name and email
p5=(Stonebraker, M., null)
p8=(null, stonebraker@csail.mit.edu)
Context information I: contact list
p5=(Stonebraker, M., null, {p4, p6})
p8=(null, stonebraker@csail.mit.edu, {p7})
p6=p7
Context information II: authored articles
p2=(Michael Stonebraker, null)
p5=(Stonebraker, M., null)
p2 and p5 authored the same article
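A minimal sketch of the two kinds of evidence on this slide, assuming references are plain strings and sets. The function names and the substring heuristic are illustrative assumptions, not the Semex implementation.

```python
import re

def name_tokens(name):
    """Lowercased alphabetic tokens of a name such as 'Stonebraker, M.'."""
    return [t for t in re.split(r"[^a-z]+", name.lower()) if t]

def name_matches_email(name, email):
    """Cross-attribute evidence: a name token longer than one letter
    appears in the local part of the email address (assumed heuristic)."""
    local = email.split("@", 1)[0].lower()
    return any(len(tok) > 1 and tok in local for tok in name_tokens(name))

def shared_contacts(contacts_a, contacts_b, same):
    """Context evidence: count contact pairs that are identical or
    already reconciled (listed in `same` as frozensets)."""
    return sum(
        1
        for a in contacts_a
        for b in contacts_b
        if a == b or frozenset((a, b)) in same
    )

# p5 and p8 from the slide: the name matches the email, and their
# contact lists overlap once p6 = p7 is known.
print(name_matches_email("Stonebraker, M.", "stonebraker@csail.mit.edu"))
print(shared_contacts({"p4", "p6"}, {"p7"}, {frozenset(("p6", "p7"))}))
```

Either signal alone is weak; the slide's point is that they add up when attribute-wise comparison (name vs. name, email vs. email) has nothing to compare.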
34. Considering Only Attribute-wise Similarities Cannot Merge Persons Well
35. Considering Richer Evidence Improves the Recall
36. II. Propagate Information between Reconciliation Decisions
Article:
a1=(Distributed Query Processing, 169-180, {p1,p2,p3}, c1)
a2=(Distributed query processing, 169-180, {p4,p5,p6}, c2)
Venue:
c1=(ACM Conference on Management of Data, 1978, Austin, Texas)
c2=(ACM SIGMOD, 1978, null)
Person:
p1=(Robert S. Epstein, null)
p2=(Michael Stonebraker, null)
p3=(Eugene Wong, null)
p4=(Epstein, R.S., null)
p5=(Stonebraker, M., null)
p6=(Wong, E., null)
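One way to picture propagation is a work queue: once two articles are reconciled, the pairs that depend on them (their authors, their venues) receive extra evidence and may be reconciled in turn. The sketch below assumes a fixed additive boost and a single threshold; both numbers, and the `dependents` map, are hypothetical and chosen only for illustration.

```python
from collections import deque

def propagate(initial_matches, dependents, score, threshold=0.6, boost=0.3):
    """Return all pairs reconciled after propagation.

    initial_matches: pairs already decided to be the same.
    dependents: maps a pair to the pairs whose decision depends on it.
    score: attribute-only similarity of each pair, in [0, 1].
    """
    merged = set()
    evidence = dict(score)
    queue = deque(initial_matches)
    while queue:
        pair = queue.popleft()
        if pair in merged:
            continue
        merged.add(pair)
        for dep in dependents.get(pair, ()):
            # A reconciled neighbor raises the evidence for the dependent pair.
            evidence[dep] = min(1.0, evidence.get(dep, 0.0) + boost)
            if evidence[dep] >= threshold and dep not in merged:
                queue.append(dep)
    return merged

# Reconciling a1 and a2 pushes their author and venue pairs over the threshold.
dependents = {("a1", "a2"): [("p1", "p4"), ("p2", "p5"), ("p3", "p6"), ("c1", "c2")]}
score = {("p1", "p4"): 0.4, ("p2", "p5"): 0.4, ("p3", "p6"): 0.4, ("c1", "c2"): 0.35}
result = propagate([("a1", "a2")], dependents, score)
print(sorted(result))
```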
37. Propagating Information between Reconciliation Decisions Further Improves Recall
38. III. Reference Enrichment
p2=(Michael Stonebraker, null, {p1,p3})
p8=(null, stonebraker@csail.mit.edu, {p7})
p9=(mike, stonebraker@csail.mit.edu, null)
p8-9=(mike, stonebraker@csail.mit.edu, {p7})
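The merge of p8 and p9 above can be sketched as a union of attribute values. Representing a reference as a dictionary of sets, with the empty set standing in for null, is an assumption made for this example.

```python
def enrich(ref_a, ref_b):
    """Merge two references judged to be the same entity.
    Each attribute keeps the union of the values seen in either reference."""
    merged = {}
    for key in ref_a.keys() | ref_b.keys():
        merged[key] = ref_a.get(key, set()) | ref_b.get(key, set())
    return merged

p8 = {"name": set(), "email": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"name": {"mike"}, "email": {"stonebraker@csail.mit.edu"}, "contacts": set()}
p8_9 = enrich(p8, p9)
print(p8_9)
```

The enriched reference carries a name, an email, and a contact list, so it can be compared against p2 on more attributes than either p8 or p9 alone.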
39. Reference Enrichment Improves Recall More than Information Propagation
40. Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
42. Importing External Data Sources
43. Traditional approaches: proceed in two steps
Step 1. Schema matching (Surveyed in [Rahm&Bernstein, 2001])
Generate term matching candidates
E.g., paperTitle in table Author matches title in table Article
Step 2. Query discovery [Miller et al., 2000]
Take term matching as input, generate mapping expressions (typically queries)
E.g.,
SELECT Article.title as paperTitle, Person.name as author
FROM Article, Person
WHERE Article.author = Person.id
Intuition: Explore associations in schema mapping
44. Traditional approaches: proceed in two steps
Step 1. Schema matching (Surveyed in [Rahm&Bernstein, 2001])
Generate term matching candidates
E.g., paperTitle in table Author matches title in table Article
Step 2. Query discovery [Miller et al., 2000]
Take term matching as input, generate mapping expressions (typically queries)
E.g.,
SELECT Article.title as paperTitle, Person.name as author
FROM Article, Person
WHERE Article.author = Person.id
User input is needed to fill the gap between Step 1 output and Step 2 input
Our approach: check association violations to filter inappropriate matching candidates
Intuition: Explore associations in schema mapping
45. Integration Example
48. Explore the association network 1. Find the relationship between two instances Example: How did I know this person?
Solution: Lineage
Find an association chain between two object instances
Shortest chain?
Earliest chain OR Latest chain
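For the "shortest chain" option, a breadth-first search over the association network is enough. The sketch below assumes the network is an adjacency map from an instance to (association, neighbor) pairs; the instance and association names in the example are made up.

```python
from collections import deque

def shortest_chain(graph, start, goal):
    """Breadth-first search for a shortest association chain.
    Returns a list of (source, association, target) steps, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while parents[node] is not None:
                prev, assoc = parents[node]
                chain.append((prev, assoc, node))
                node = prev
            return chain[::-1]
        for assoc, nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = (node, assoc)
                queue.append(nxt)
    return None

# Hypothetical network: I received an email that was sent by Alice.
graph = {
    "me": [("recipientOf", "email1")],
    "email1": [("sentBy", "Alice")],
    "Alice": [("authorOf", "paper1")],
}
print(shortest_chain(graph, "me", "Alice"))
```

Finding the earliest or latest chain instead would require timestamps on the associations, which this sketch does not model.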
49. Explore the association network 2. Find all instances related to a given keyword Example: Who is working on Schema Matching?
Solution:
Naive approach: index object instances on attribute values
A list of papers on schema matching
A list of emails on schema matching
A list of persons working on schema matching
A list of conferences for schema-matching papers
A list of institutes that conduct schema-matching research
Our approach: index objects on the attributes of associated objects
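A small sketch of the difference between the two approaches: each object is indexed under the keywords of its own attributes and also under the keywords of its directly associated objects. The one-hop limit, the whitespace tokenizer, and the sample data are assumptions for this example.

```python
from collections import defaultdict

def build_index(objects, associations):
    """objects: id -> text of the object's own attribute values.
    associations: (id_a, id_b) links, treated as undirected.
    Returns keyword -> ids whose own or neighbors' attributes contain it."""
    neighbors = defaultdict(set)
    for a, b in associations:
        neighbors[a].add(b)
        neighbors[b].add(a)
    index = defaultdict(set)
    for oid, text in objects.items():
        for word in text.lower().split():
            index[word].add(oid)            # naive approach stops here
            index[word].update(neighbors[oid])  # also index associated objects
    return index

objects = {
    "paper1": "Generic Schema Matching",
    "person1": "Alice",
    "conf1": "VLDB",
}
associations = [("person1", "paper1"), ("paper1", "conf1")]
index = build_index(objects, associations)
print(sorted(index["schema"]))
```

With this index, the keyword "schema" returns the paper, its author, and its venue, even though only the paper's title contains the word.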
50. Explore the association network 3. Rank returned instances in a keyword search Example: What are important papers on schema matching?
Solution:
Naive approach: rank by TF/IDF metric
Our approach: ranking by
Significance score: PageRank measure
Relevance score: TF/IDF metric
Usage score: last visit time and modification time
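The slide names three scores but not how they are combined. The sketch below assumes a weighted sum with made-up weights, and computes the significance score with a plain power-iteration PageRank over the association links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; links maps a node to the nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n) or nodes  # dangling node: spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores; the weights are hypothetical."""
    w_sig, w_rel, w_use = weights
    return w_sig * significance + w_rel * relevance + w_use * usage

# Paper "c" is cited by both "a" and "b", so it is the most significant.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

In practice the three scores would first be normalized to a common range so that no single one dominates the sum.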
51. Explore the association network 4. Fuzzy Queries Queries we pose today: something we can describe
Find me something with (related to) keyword X
Find me the co-authors of Person Y
Fuzzy queries:
Q: What do I want to know?
A: In this webpage, 5 papers are written by your friends
Q: What significant things have happened today?
A: The President wrote an email to you!!
53. The Domain Model
54. Problems in Domain Model Personalization Problem: hard to precisely model a domain
At a certain point we are not able to give a precise domain model
Not enough knowledge of the domain
Inherent evolution of a domain
Non-existence of a precise model
Overly detailed models may be a burden to users
Modeling every detail of the information on one's desktop is often overwhelming
We may want to leave part of the domain unstructured
Extract descriptions at different levels of granularity, e.g., Address vs. street, city, state, zip
55. Malleable Schemas
56. Malleable Schema Introduce text into schemas
Phrases as element names, e.g., InitialPlanningPhaseParticipant
Regular expressions as element names, e.g., *Phone, State|Province
Chains as element names, e.g., name/firstName
Introduce imprecision into queries
SELECT S.~name, S.~phone
FROM Student as S, ~Project as P
WHERE (S ~initialParticipant P) AND (P.name = 'Semex')
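To illustrate pattern-style element names such as *Phone and State|Province, the sketch below resolves a pattern against concrete element names. Treating `*` as a wildcard and `|` as alternation is an assumed reading of the slide's notation, and the helper names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Compile a slide-style pattern: '*' matches any run of characters,
    '|' separates alternatives, everything else is literal."""
    parts = [re.escape(alt).replace(r"\*", ".*") for alt in pattern.split("|")]
    return re.compile("^(?:" + "|".join(parts) + ")$")

def matching_elements(pattern, element_names):
    """Return the concrete element names that the pattern covers."""
    rx = pattern_to_regex(pattern)
    return [name for name in element_names if rx.match(name)]

print(matching_elements("*Phone", ["homePhone", "Phone", "email", "PhoneNumber"]))
print(matching_elements("State|Province", ["State", "Province", "City"]))
```

The `~` operator in the query above asks for approximate matching of names, which goes beyond this exact pattern match and is not modeled here.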
58. Overarching PIM Themes It is PERSONAL data!
How to build a system supporting users in their own habitat?
How to create an AHA! browsing experience and increase users' productivity?
There can be any kind of INFORMATION
How to combine structured and unstructured data?
We are pursuing life-long data MANAGEMENT
What is the right granularity for modeling personal data?
How to manage data and schema that evolve over time?
59. Related Work Personal Information Management Systems
Indexing
Stuff I've Seen (MSN Desktop Search) [Dumais et al., 2003]
Google Desktop Search [2004]
Richer relationships
MyLifeBits [Gemmell et al., 2002]
Placeless Documents [Dourish et al., 2000]
LifeStreams [Freeman and Gelernter, 1996]
Objects and associations
Haystack [Karger et al., 2005]
60. Summary 60 years have passed since the personal Memex was envisioned
It's time to get serious
Great challenges for data management
Deliverables of the project
An approach to automatically build a database of objects and associations from personal data
An algorithm for on-the-fly integration
Algorithms for data analysis for association search and browsing
The concept of malleable schema as a modeling tool
A PIM system incorporating the above
61. Association Network for Semex