
Harvesting useful data from researchers’ homepages


sadah

Presentation Transcript


  1. Harvesting useful data from researchers’ homepages

  2. Outline • Researchers’ homepages • Challenges • Related works

  3. Researchers’ homepages
     • Lots of useful information about the researchers themselves
       • Basic information
       • Contact information
       • Educational history
       • Publications

  4. Challenges
     • Different layouts
       • Templates
       • Personal pages
     • Different content
       • Pages introducing researchers
       • CV-like
       • Personal pages
     • Different content structures
       • Tables / lists
       • Natural language text

  5. Challenges
     • Different data presentations
       • hangli at microsoft dot com
       • cs.duke.edu, junyang
       • ASJMZheng@ntu.edu.sg
       • erafalin(at)cs.tufts.edu
       • <Image src=’email.jpg’/>
       • Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
       • wmt then the at-sign then uci dot edu
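The variety of e-mail presentations listed above can be illustrated with a small normalizer. This is a minimal sketch, not part of the original talk: the function name `normalize_email` and the regular expressions are assumptions tuned only to the examples on this slide, and an address embedded in an image is reported as unrecoverable (`None`) since it would need OCR.

```python
import re

# Loose shape check for the final result (an assumption, not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")

def normalize_email(text):
    """Return a plain user@domain string, or None if it cannot be recovered."""
    s = text.strip()
    # Addresses shown as images cannot be read from the markup alone.
    if s.lower().startswith(("<image", "<img")):
        return None
    # "domain, user" ordering, e.g. "cs.duke.edu, junyang".
    m = re.fullmatch(r"((?:[\w-]+\.)+[A-Za-z]{2,}),\s*([\w.+-]+)", s)
    if m:
        s = f"{m.group(2)}@{m.group(1)}"
    # An instruction wrapped in dashes, e.g. "-replace all this by at symbol-".
    s = re.sub(r"\s+-[^-@]*\bat\s+(?:symbol|sign)[^-@]*-\s+", "@", s, flags=re.I)
    # Spelled-out phrases such as "then the at-sign then".
    s = re.sub(r"\s+(?:then\s+)?(?:the\s+)?at[- ]sign(?:\s+then)?\s+", "@", s, flags=re.I)
    # "(at)", "[at]" and " at ".
    s = re.sub(r"\s*[\(\[]\s*at\s*[\)\]]\s*|\s+at\s+", "@", s, flags=re.I)
    # "(dot)", "[dot]" and " dot ".
    s = re.sub(r"\s*[\(\[]\s*dot\s*[\)\]]\s*|\s+dot\s+", ".", s, flags=re.I)
    return s if EMAIL_RE.match(s) else None
```

Hand-written rules like these cover only the patterns they were written for, which is one reason the related work below turns to learned taggers.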

  6. Related works – Tang et al (2008)
     • Tang et al. (2008) – ArnetMiner
       • Separate text into tokens (5 token types)
       • Assign possible tags to each token (CRF)
       • Extract profile properties (Amilcare tool and SVM)
       • F1 = 83.37% (1,000 researchers)
       • Name disambiguation: may be simpler in our case
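For reference, the F1 figure quoted above is the harmonic mean of precision and recall. The sketch below shows the standard computation from raw counts; the function name and the example counts are illustrative assumptions, not values from Tang et al.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    # Precision: share of extracted items that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of correct items that were extracted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of the two.
    return 2 * precision * recall / (precision + recall)
```

With 8 correct extractions, 2 spurious ones and 2 misses (made-up counts), precision and recall are both 0.8, so F1 is also 0.8.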

  7. Related works - Cai et al (2003)
     • Cai et al. (2003) - Visual-based content structure extraction
       • Independent of the underlying document representation
       • Vision-based Page Segmentation (VIPS)
       • Combines DOM structure and visual cues (tag, color, text, size)
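To give a feel for segmenting a page into blocks, here is a toy splitter. It is far simpler than VIPS and is not the authors' algorithm: it uses only the tag cue (a hypothetical set of block-level tags) and ignores color, size and rendered layout. The names `BlockSplitter`, `BLOCK_TAGS` and `split_blocks` are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as block boundaries (tag cue only).
BLOCK_TAGS = {"div", "table", "ul", "ol", "p", "h1", "h2", "h3"}

class BlockSplitter(HTMLParser):
    """Collects (tag, text) pairs, starting a new block at each block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = "body"
        self._text = []

    def flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((self._tag, text))
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            self._tag = "body"

    def handle_data(self, data):
        self._text.append(data)

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block tag
    return parser.blocks
```

A homepage fragment with a heading, a contact paragraph and a publication list would come back as three separate blocks, each of which could then be passed to a field-specific extractor.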

  8. Related works - Cai et al (2003)

  9. Related works - Cai et al (2003)
     • Strengths
       • Domain independent, layout independent
       • No training data required
       • Good results in evaluation report (97% of pages correctly detected)
     • Applicability
       • Can be used to improve speed and correctness of the retrieval
       • Different levels of complexity in homepage layouts

  10. References
     • J. Tang, D. Zhang, and L. Yao (2007). Social network extraction of academic researchers. In Proc. of ICDM 2007, pp. 292-301.
     • D. Cai, S. Yu, J.R. Wen, and W.Y. Ma (2003). Extracting content structure for web pages based on visual representation. In the 5th APWC, pp. 406-417.
     • C.H. Lee (2004). PARCELS: PARser for Content Extraction and Logical Structure (Stylistic detection). Honours Thesis, School of Computing, NUS.
     • J. Chen and K. Xiao (2008). Perception-oriented online news extraction. In JCDL 2008, pp. 363.
     • Amilcare Webpage – http://nlp.shef.ac.uk/amilcare/amilcare.html
     • Wikipedia Webpage – http://en.wikipedia.org
     • W3Schools Webpage – http://www.w3schools.com/default.asp

  11. Thank you!
