Harvesting useful data from researchers’ homepages

Outline
• Researchers’ homepages
• Challenges
• Related works

Researchers’ homepages
• Lots of useful information about the researchers themselves
  • Basic information
  • Contact information
  • Educational history
  • Publications

Challenges
• Different layouts
  • Templates
  • Personal pages
• Different content
  • Pages introducing researchers
  • CV-like
  • Personal pages
• Different content structures
  • Tables / lists
  • Natural language text

Challenges
• Different data presentations
  • hangli at microsoft dot com
  • cs.duke.edu, junyang
  • ASJMZheng@ntu.edu.sg
  • erafalin(at)cs.tufts.edu
  • <Image src=’email.jpg’/>
  • Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
  • wmt then the at-sign then uci dot edu
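The text-based email variants listed above could be normalized with a handful of rewrite rules. The following is a minimal sketch under stated assumptions, not the method of any cited work: the function name `normalize_email` and the rule lists are hypothetical and cover only the patterns shown on this slide. Addresses published as images cannot be recovered this way (they would need OCR), so the sketch returns `None` for them.

```python
import re

# Plain address: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical rewrite rules, covering only the obfuscations listed above.
# Order matters: the long phrases must be tried before the bare " at ".
AT_PATTERNS = [
    r"\s*-\s*replace all this by at symbol\s*-\s*",
    r"\s+then the at-sign then\s+",
    r"\s*\(at\)\s*",
    r"\s+at\s+",
]
DOT_PATTERNS = [r"\s+dot\s+", r"\s*\(dot\)\s*"]

# "domain, user" ordering, e.g. "cs.duke.edu, junyang".
DOMAIN_FIRST_RE = re.compile(
    r"([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+),\s*([A-Za-z0-9._%+-]+)"
)


def normalize_email(text):
    """Return a plain address, or None if no rule applies."""
    s = text.strip()
    if s.startswith("<"):
        # Markup such as an image tag: nothing to recover from the text.
        return None
    for pattern in AT_PATTERNS:
        s = re.sub(pattern, "@", s, flags=re.IGNORECASE)
    for pattern in DOT_PATTERNS:
        s = re.sub(pattern, ".", s, flags=re.IGNORECASE)
    swapped = DOMAIN_FIRST_RE.fullmatch(s)
    if swapped:
        s = f"{swapped.group(2)}@{swapped.group(1)}"
    return s if EMAIL_RE.fullmatch(s) else None


if __name__ == "__main__":
    samples = [
        "hangli at microsoft dot com",
        "cs.duke.edu, junyang",
        "erafalin(at)cs.tufts.edu",
        "wmt then the at-sign then uci dot edu",
    ]
    for sample in samples:
        print(sample, "->", normalize_email(sample))
```

Hand-written rules like these break as soon as a page uses a phrasing not in the list, which is one reason the related work below relies on learned taggers rather than fixed patterns.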

Related works - Tang et al. (2008)
• Tang et al. (2008) - ArnetMiner
  • Separate text into tokens (5 token types)
  • Assign possible tags to each token (CRF)
  • Extract profile properties (Amilcare tool and SVM)
  • F1 = 83.37% (1,000 researchers)
  • Name disambiguation: may be simpler in our case

Related works - Cai et al. (2003)
• Cai et al. (2003) - Visual-based content structure extraction
  • Independent of the underlying document representation
  • Vision-based Page Segmentation (VIPS)
    • Combines DOM structure and visual cues (tag, color, text, size)

Related works - Cai et al. (2003)
• Strengths
  • Domain independent, layout independent
  • No training data required
  • Good results in the evaluation report (97% of pages correctly detected)
• Applicability
  • Can be used to improve the speed and correctness of retrieval
  • Different levels of complexity in homepage layouts

References
• J. Tang, D. Zhang, and L. Yao (2007). Social network extraction of academic researchers. In Proc. of ICDM 2007, pp. 292-301.
• D. Cai, S. Yu, J.R. Wen, and W.Y. Ma (2003). Extracting content structure for web pages based on visual representation. In the 5th APWC, pp. 406-417.
• C.H. Lee (2004). PARCELS: PARser for Content Extraction and Logical Structure (Stylistic detection). Honours Thesis, School of Computing, NUS.
• J. Chen and K. Xiao (2008). Perception-oriented online news extraction. In JCDL 2008, p. 363.
• Amilcare webpage - http://nlp.shef.ac.uk/amilcare/amilcare.html
• Wikipedia webpage - http://en.wikipedia.org
• W3Schools webpage - http://www.w3schools.com/default.asp

Thank you!
