
Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang


Presentation Transcript


  1. Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang

  2. Problem Definition: Given the full name of a database researcher, find his/her homepage. Homepage definition: (to be discussed in class)

  3. Architecture: Name → Personal Dictionary, Domain Dictionary, Heuristics → Weighting function → Homepage. Diagram notes: to distinguish personal homepages from common sites; to distinguish database-related webpages from the rest.

  4. Domain Dictionary: A set of words that are common in the database community. Our approach: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)

  5. Domain Dictionary: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual). Dictionary Building: parse documents from each source into 2-word phrases and calculate their frequency. Sample (stemmed) phrases and frequencies per source:

     DBWorld
       data mine             4.47E-03
       dbworld messag        4.38E-03
       paper submiss         3.78E-03
       program committe      3.10E-03
       import date           2.98E-03
       state univers         2.74E-03
       intern confer         2.73E-03
       comput scienc         2.70E-03
       hong kong             2.65E-03
       camera readi          2.56E-03
       data manag            2.33E-03

     DBConference
       queri process         1.63E-02
       mobil databas         1.36E-02
       languag featur        1.09E-02
       data manag            1.09E-02
       xqueri implement      0.008174387
       queri languag         8.17E-03
       queri optim           0.005449591
       process data          0.005449591
       data mine             0.005449591
       research prototyp     0.005449591
       databas architectur   0.005449591

     Contrast Area
       program committe      0.019085487
       mathemat scienc       0.007952286
       mathemat physic       0.006361829
       intern confer         0.0055666
       date june             0.005168986
       intern institut       0.004373758
       schr dinger           0.003976143
       erwin schr            0.003976143
       dinger intern         0.003976143
       degli studi           0.003578529
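The dictionary-building step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `two_word_phrases` and `build_dictionary` are hypothetical, "frequency" is assumed to mean a phrase's share of all phrases from one source, and the stemming visible in the slide's phrases (e.g. "queri process") is left out for brevity.

```python
import re
from collections import Counter


def two_word_phrases(text):
    """Lowercase the text and return its adjacent word pairs (no stemming here)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


def build_dictionary(documents):
    """Map each 2-word phrase to its relative frequency over all documents.

    Assumption: frequency = phrase count / total phrase count for the source.
    """
    counts = Counter()
    for document in documents:
        counts.update(two_word_phrases(document))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phrase: count / total for phrase, count in counts.items()}


# Hypothetical sample documents, for illustration only.
sample = [
    "Query processing and query optimization",
    "Data mining and query processing",
]
dictionary = build_dictionary(sample)
for phrase, freq in sorted(dictionary.items(), key=lambda kv: (-kv[1], kv[0]))[:3]:
    print(f"{phrase}\t{freq:.3f}")
```

Running the same function once per source (DBWorld, DBConference, contrast area) would give the three frequency tables that the similarity step compares against.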

  6. Domain Dictionary (cont.): Similarity Measuring:
     • Parse the webpage into 2-word phrases and calculate their frequency.
     • Use the cosine similarity measure on phrase frequencies to get a score from each dictionary: S_dbworld, S_dbconf, S_contrast.
     • Combine S_dbworld, S_dbconf, and (1 - S_contrast) using the geometric mean.
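A small sketch of the similarity step above, assuming the webpage and each dictionary are already available as phrase-to-frequency mappings. The names `cosine_similarity` and `domain_score` are illustrative, and the clamp on (1 - S_contrast) is an added guard against floating-point rounding rather than something stated on the slide.

```python
import math


def cosine_similarity(page, dictionary):
    """Cosine of the angle between two sparse phrase-frequency vectors."""
    shared = set(page) & set(dictionary)
    dot = sum(page[p] * dictionary[p] for p in shared)
    norm_page = math.sqrt(sum(v * v for v in page.values()))
    norm_dict = math.sqrt(sum(v * v for v in dictionary.values()))
    if norm_page == 0 or norm_dict == 0:
        return 0.0
    return dot / (norm_page * norm_dict)


def domain_score(page, dbworld, dbconf, contrast):
    """Geometric mean of S_dbworld, S_dbconf and (1 - S_contrast)."""
    s_dbworld = cosine_similarity(page, dbworld)
    s_dbconf = cosine_similarity(page, dbconf)
    s_contrast = cosine_similarity(page, contrast)
    # Clamp at 0 in case rounding pushes a cosine slightly above 1.
    factors = [s_dbworld, s_dbconf, max(0.0, 1.0 - s_contrast)]
    return math.prod(factors) ** (1.0 / 3.0)


# Hypothetical frequency vectors, for illustration only.
page = {"query processing": 0.5, "data mining": 0.5}
unrelated = {"mathematical physics": 1.0}
print(domain_score(page, page, page, unrelated))
```

Because the three factors are multiplied, a page scores near zero if it matches neither database dictionary or if it closely matches the contrast dictionary.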

  7. Personal Dictionary: A set of words related to the specific person we are looking for. Our approach: use DBLP to find information about co-authors, research keywords, and conferences.

  8. Personal Dictionary:
     • Given a researcher’s name, find his/her DBLP page.
     • Build the personal dictionary using Term Frequency and Entry Frequency (the number of publication entries in which a term appears).
     • Use the cosine measure to evaluate the similarity between a webpage and this personal dictionary.
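The slide does not say how Term Frequency and Entry Frequency are combined, so the sketch below makes one possible choice explicit: weight = term frequency × (entry frequency / number of entries). That formula, the function names, and the sample publication entries are all assumptions for illustration; fetching and parsing the DBLP page is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_personal_dictionary(entries):
    """Weight each term by TF * (EF / number of entries). Assumed combination."""
    term_frequency = Counter()   # total occurrences of a term
    entry_frequency = Counter()  # number of entries containing the term
    for entry in entries:
        tokens = tokenize(entry)
        term_frequency.update(tokens)
        entry_frequency.update(set(tokens))
    n = len(entries)
    return {t: term_frequency[t] * entry_frequency[t] / n for t in term_frequency}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def personal_score(page_text, personal_dictionary):
    """Similarity between a webpage's raw term counts and the personal dictionary."""
    return cosine_similarity(Counter(tokenize(page_text)), personal_dictionary)


# Hypothetical publication entries (co-authors, title, venue).
entries = [
    "Jane Doe, John Roe: Query optimization for streams. VLDB",
    "Jane Doe: Adaptive query processing. SIGMOD",
]
personal = build_personal_dictionary(entries)
print(personal_score("Jane Doe works on query optimization", personal))
```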

  9. Heuristics: Rules to distinguish a homepage from other websites. Our Heuristics:
     • In title: Name, “Homepage”, “DBLP”, “eventseer”
     • In URL: A version of the person’s name, “citeseer”
     • In body: Visual cues, specific keywords {University, Department, Professor, Research, Homepage}
     • Co-occurrence of “publication” and the person’s name
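As a rough illustration, the textual cues listed above could be extracted as simple features. The slide does not state how each cue is scored or whether it counts for or against a page, so this sketch only detects the cues and leaves weighting to a later step. The function name, the feature names, and the use of the last name as the "version of the person's name" are assumptions; visual cues are not covered.

```python
import re

BODY_KEYWORDS = ("university", "department", "professor", "research", "homepage")


def heuristic_features(full_name, title, url, body):
    """Detect the slide's textual cues; returns a dict of feature values."""
    name = full_name.lower()
    name_parts = re.findall(r"[a-z]+", name)
    last_name = name_parts[-1] if name_parts else ""
    title_l, url_l, body_l = title.lower(), url.lower(), body.lower()
    return {
        "title_has_name": bool(name) and name in title_l,
        "title_has_homepage": "homepage" in title_l,
        "title_has_dblp": "dblp" in title_l,
        "title_has_eventseer": "eventseer" in title_l,
        # Assumption: the last name stands in for "a version of the name".
        "url_has_name": bool(last_name) and last_name in url_l,
        "url_has_citeseer": "citeseer" in url_l,
        "body_keyword_count": sum(keyword in body_l for keyword in BODY_KEYWORDS),
        "publication_with_name": "publication" in body_l and bool(name) and name in body_l,
    }


# Hypothetical page, for illustration only.
features = heuristic_features(
    "Jane Doe",
    "Jane Doe's Homepage",
    "http://example.edu/~doe/",
    "Professor, Department of Computer Science. Publications by Jane Doe",
)
print(features)
```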

  10. Recall… Name → Personal Dictionary, Domain Dictionary, Heuristics → Weighting function → Homepage

  11. Combining Scores: Experimentally assign weights to the previous scoring functions. Return the URL with the highest score.
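One simple way to realize this combination step is a weighted sum over the component scores followed by an argmax over candidate URLs. The slide only says that weights are assigned experimentally, so the linear form, the weight values, and the names `combined_score` and `best_homepage` are assumptions made for this sketch.

```python
def combined_score(scores, weights):
    """Weighted sum of component scores (assumed linear combination)."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


def best_homepage(candidates, weights):
    """Return the candidate URL with the highest combined score, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda url: combined_score(candidates[url], weights))


# Hypothetical weights and candidate scores, for illustration only.
WEIGHTS = {"domain": 0.3, "personal": 0.4, "heuristics": 0.3}
candidates = {
    "http://example.edu/~doe/": {"domain": 0.6, "personal": 0.7, "heuristics": 0.9},
    "http://example.org/dblp/doe": {"domain": 0.8, "personal": 0.9, "heuristics": 0.1},
}
print(best_homepage(candidates, WEIGHTS))
```

Keeping the weights in a single mapping means a new scoring function can be added by introducing one more key.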

  12. Strengths:
     • Disambiguates between people with the same name, provided that only one of them works in the database field.
     • Fits well in the DBLife architecture, since our algorithm runs offline over the whole list of researchers that we get from DBLP.

  13. Strengths (cont.):
     • Incremental architecture:
       • Finds new researchers through DBLP
       • Finds new domain-related words through DBWorld
     • Modular architecture: we can add more scoring functions.

  14. Limitations:
     • Can’t distinguish between pages that look like the homepage we are looking for.
     • Can’t distinguish between people with the same name who work in the same area (databases).
     • Depends on Google, DBLP, and DBWorld.

  15. Demo

  16. Questions?

  17. Thank you!
