The BINGO! System for Information Portal Generation and Expert Web Search

Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer
University of the Saarland, Saarbrücken, Germany
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de

Outline:
• Problem & research challenge
• Our work in progress
• Future research avenues

(In-)Effectivity of Web Search Engines

query = "ARIES recovery open source download"

Google (and Yahoo) top 3:
• CpSc 862 Assignment 6 ... You can open it by double clicking the Mars ...(logfunc ... and any deviations from the standard Aries recovery ...
  www.cs.clemson.edu/~jzwang/cpsc862/hw07/cpsc862hw07.htm
• Tutorial: Application Servers and Associated Technologies ... an OIA for his inventions (ARIES, ARIES/IM, Commit_LSN ...
  www.almaden.ibm.com/u/mohan/AppServersTutorial_SIGMOD2002_Slides.pdf
• Recovery from Spiritual Abuse ... with you no matter how long your recovery takes ... powerful form of spiritual abuse and a source of spiritual ... their denial in hopes that ...
  www.nacronline.com/dox/library/biblestudies/sp_abuse.pdf

Sourceforge:
• BZFlag - OpenSource OpenGL Multiplayer Multiplatform battlezone ...
• JBoss.org - The JBoss/Server is the leading Open Source ...
• Dev-C++ - Dev-C++ is an full-featured Integrated ... libraries ...

But there is hope:
• exploit structure
• explore neighborhood
• use topic directory

Topic –> Database –> Database Engines/Servers

From Observations to Research Avenues

Key observation: yes, there are ways to find what you are searching for, but intellectual time is expensive!
–> requires "intelligent" automation

Research Avenues:
• Structure and annotate information: XML
• Organize documents "semantically": ontologies
• Leverage machine learning: automatic classification
• More computer time for better results: focused crawling

Goal: for an advanced information request, find within one day and with < 5 min of intellectual effort the results that the best human experts can find with infinite time

Challenge: Expert Web Queries
• Where can I download an open source implementation of the ARIES recovery algorithm?
• Find the text and notes of the western song Raw Hide.
• What are Chernoff-Hoeffding bounds?
• Find Fermat's last / Wiles' theorem in MathML format.
• Are there any theorems isomorphic to my new conjecture? Find related theorems.
• Which professors from D are teaching DBS and have research projects on XML?
• Which Shakespeare drama has a scene where a woman talks a Scottish nobleman into murder?
• Who was the Italian woman that I met at the PC meeting where Moshe Vardi was PC Chair?

Challenge: (Meta-) Portal Building
• Who are the top researchers in the database system community?
• Who is working on using machine learning techniques for searching XML data?
• What are the most important results in large deviation theory?
• Find information about public subsidies for plumbers.
• Find new EU regulations that affect an electrician's business.
• Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g?
• Are there metabolic models for acid reflux that could be related to the gene expression data?

Our Research Agenda

[Architecture diagram. Labels: Crawler, Focus Mgr, Classifier, Ontology Mgr, Ontologies, Ontology Service, DB / Cache / Queue Mgr, Index Mgr, Search Engine, Query Processor, XXL, BINGO!, HTML2XML Auto Annotation, Filter (XML ...), Portal Explorer, Web Service Generator (UDDI, WSDL), Deep Web, Semantic Web, Web Applications, Intranet Applications, Materialography Portal, ..., Chamber of Crafts Portal]

Focused Crawling
(Soumen Chakrabarti et al. 1999)

automatically build personal topic directory

[Diagram labels: WWW, seeds, training, Crawler, Classifier, Link Analysis; topic directory nodes: Root, DB Core Technology, Semistructured Data, Web Retrieval, Data Mining, XML]

• critical issues:
  • classifier accuracy
  • feature selection
  • quality of training data
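The crawler/classifier loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the BINGO! implementation: the in-memory `TOY_WEB`, the keyword-based `is_on_topic` stand-in for a trained classifier, and all page names are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("database recovery logging", ["b", "c"]),
    "b": ("database index structures", ["d"]),
    "c": ("soccer results", ["e"]),
    "d": ("transaction recovery", []),
    "e": ("database benchmarks", []),
}

# Stand-in for a trained classifier (assumption: simple keyword overlap).
TOPIC_TERMS = {"database", "recovery", "transaction"}


def is_on_topic(text):
    """Toy classifier: accept a page if it shares a term with the topic."""
    return bool(TOPIC_TERMS & set(text.split()))


def focused_crawl(seeds, web):
    """Visit pages breadth-first, but expand links only from accepted pages."""
    queue, seen, accepted = deque(seeds), set(seeds), []
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if not is_on_topic(text):
            continue  # off-topic page: its links are not followed
        accepted.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return accepted


print(focused_crawl(["a"], TOY_WEB))
```

In this toy graph page "e" is on topic but reachable only through the off-topic page "c", so a strictly focused crawl never sees it. This is why classifier accuracy matters so much for the crawl frontier.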

The BINGO! Focused Crawler

[Diagram labels: WWW, seeds, training, Crawler, Classifier, Link Analysis; topic directory nodes: Root, DB Core Technology, Semistructured Data, Web Retrieval, Data Mining, XML]

re-training with topic-specific archetypes:
• high SVM confidence
• high authority score
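The slide names only "Link Analysis" and an "authority score". As an assumption, the sketch below uses a HITS-style hub/authority iteration to show how such a score could be computed. The function name `hits` and the example graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """HITS-style iteration. links: dict node -> list of nodes it points to.

    Returns (hub, authority) dicts with L2-normalized scores.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical link graph: three pages pointing at candidate authorities x and y.
graph = {"p1": ["x", "y"], "p2": ["x"], "p3": ["x", "y"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))
```

Pages with a high authority score would then be candidates for the topic-specific archetypes, alongside pages with high SVM confidence.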

Classification using Support Vector Machines (SVM)

n training vectors with components (x1, ..., xm, C) with C = +1 or -1

Training: compute the separating hyperplane that maximizes the minimum distance of all positive and negative samples to the hyperplane
–> solve a quadratic programming problem

Decision: test a new vector for membership in C: the distance of the new vector from the hyperplane yields the classification confidence

[Figure: samples in the (x1, x2) plane separated by a hyperplane, with the two sides labeled by class; a new point is marked "?"]
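The decision step can be illustrated with a small sketch. It assumes a linear hyperplane w·x + b = 0 that has already been obtained by training. The values of `W` and `B` below are hand-picked for illustration, not the result of solving the quadratic program.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance of x from the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))


def classify(w, b, x):
    """Return (label, confidence): the side of the hyperplane and the distance."""
    d = signed_distance(w, b, x)
    return (1 if d >= 0 else -1), abs(d)


# Hypothetical hyperplane x1 + x2 - 3 = 0 (assumption, not a trained model).
W, B = (1.0, 1.0), -3.0

print(classify(W, B, (3.0, 4.0)))  # far on the positive side: high confidence
print(classify(W, B, (1.0, 1.0)))  # close to the hyperplane: low confidence
```

A vector far from the hyperplane is classified with high confidence, while a vector close to it is a borderline case. This is the quantity the slides later refer to as SVM confidence.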

BINGO! Adaptive Re-training and Focus Control

for each topic V do {
  archetypes(V) := top docs of SVM confidence ranking ∪ top authorities of V ;
  remove from archetypes(V) all docs d that do not satisfy confidence(d) ≥ mean confidence(V) ;
  recompute feature selection based on archetypes(V) ;
  recompute SVM model for V with archetypes(V) as training data
}

combine re-training with two-phase crawling strategy:
• learning phase:
  • aims to find archetypes (high precision)
  • –> hard focus for crawling vicinity of training docs
• harvesting phase:
  • aims to find results (high recall)
  • –> soft focus for long-range exploration with tunnelling
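The archetype selection in the loop above might look as follows in Python. This is a sketch under assumptions: the function name, the cut-offs `n_conf` and `n_auth`, and passing the mean confidence in as a parameter are all hypothetical, since the slide does not say how many top documents are taken or over which documents the mean is computed.

```python
def select_archetypes(confidence, authority, mean_confidence, n_conf=2, n_auth=2):
    """Pick archetypes for one topic V.

    confidence, authority: dict doc -> score (hypothetical inputs).
    mean_confidence: threshold corresponding to "mean confidence(V)".
    """
    # Top docs of the SVM confidence ranking.
    top_conf = sorted(confidence, key=confidence.get, reverse=True)[:n_conf]
    # Top authorities of the topic.
    top_auth = sorted(authority, key=authority.get, reverse=True)[:n_auth]
    # Union of both candidate sets.
    candidates = set(top_conf) | set(top_auth)
    # Drop candidates whose confidence is below the mean confidence.
    return {d for d in candidates if confidence.get(d, 0.0) >= mean_confidence}


confidence = {"d1": 0.9, "d2": 0.7, "d3": 0.2, "d4": 0.6}
authority = {"d1": 0.1, "d2": 0.2, "d3": 0.8, "d4": 0.5}
print(sorted(select_archetypes(confidence, authority, mean_confidence=0.5)))
```

Here "d3" is a strong authority but is rejected because its classification confidence is too low. The surviving documents would then serve as training data for the recomputed feature selection and SVM model.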

Feature Space Construction & Meta Strategies

possible strategies:
• single term frequencies or tf*idf with top n cross-entropy terms
• term pairs within proximity window
  (e.g., support vector, match about world championship)
• text terms from hyperlink neighbors
• anchor text terms from neighbors
  (e.g., <a href=...> click here for soccer results </a>)

meta strategies (over m feature spaces for class k):
• unanimous decision
• weighted average, with the ξα estimator (Joachims '00) for the precision of each model for class k, based on leave-one-out training

strategy with best ratio of estimated precision to runtime cost
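The formulas for the two meta strategies did not survive on this slide, so the following is only one plausible reading. It assumes each of the m feature-space classifiers casts a +1/-1 vote, and that the weighted average weights each vote by the estimated precision of its model and accepts when the weighted sum is positive. The function names and the zero threshold are assumptions.

```python
def unanimous(decisions):
    """Accept (+1) only if every feature-space classifier votes +1."""
    return 1 if all(d == 1 for d in decisions) else -1


def weighted_average(decisions, precisions):
    """Weight each +1/-1 vote by the estimated precision of its model."""
    score = sum(p * d for d, p in zip(decisions, precisions))
    return 1 if score > 0 else -1  # threshold 0 is an assumption


votes = [1, -1, -1]
print(unanimous(votes))
print(weighted_average(votes, [0.9, 0.3, 0.4]))
```

With these numbers the unanimous strategy rejects the document, while the weighted average accepts it because the single positive vote comes from the model with the highest estimated precision.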

Experiment on Information Portal Generation (1)

for single-topic portal on "Database Research"
• start with only 2 initial seeds: homepages of DeWitt and Gray
• goal: automatic gathering of DBLP author homepages (with DBLP excluded from crawl)
• learning phase for improved feature selection and classification: depth-first crawl limited to domains of seeds, followed by archetype selection and retraining (for high precision)
• harvesting phase for building a rich portal: prioritized breadth-first crawl (for high recall)
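A crawl frontier that switches between the two phases could be organized roughly as below. This is a sketch, not the BINGO! data structure: the class `TwoPhaseFrontier`, its method names, the use of a stack for depth-first order and of a priority queue with insertion-order tie-breaking for "prioritized breadth-first" order, and the example hosts are all assumptions.

```python
import heapq
from urllib.parse import urlparse


class TwoPhaseFrontier:
    def __init__(self, seeds):
        self.seed_hosts = {urlparse(s).netloc for s in seeds}
        self.harvesting = False
        self._stack = list(seeds)  # learning phase: LIFO gives depth-first order
        self._heap = []            # harvesting phase: best priority first
        self._tick = 0             # insertion counter, breaks priority ties FIFO

    def add(self, url, priority=0.0):
        """Queue a URL; returns False if the current phase rejects it."""
        if not self.harvesting:
            # Learning phase: stay within the domains of the seeds.
            if urlparse(url).netloc not in self.seed_hosts:
                return False
            self._stack.append(url)
            return True
        self._tick += 1
        heapq.heappush(self._heap, (-priority, self._tick, url))
        return True

    def start_harvesting(self):
        """Switch phases and carry pending URLs over with neutral priority."""
        self.harvesting = True
        for url in self._stack:
            self.add(url)
        self._stack = []

    def next(self):
        if not self.harvesting:
            return self._stack.pop()
        return heapq.heappop(self._heap)[2]


frontier = TwoPhaseFrontier(["http://a.example/"])
frontier.add("http://a.example/p1")       # accepted: same host as a seed
frontier.add("http://b.example/q")        # rejected during the learning phase
frontier.start_harvesting()
frontier.add("http://b.example/q", 0.9)   # accepted now, ranked by priority
print(frontier.next())
```

The priority passed to `add` could be, for example, the classification confidence of the page on which the link was found.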

Experiment on Information Portal Generation (2)
• result after 12 hours (on commodity PC):
  • 3 million URLs crawled on 30 000 hosts, 1 million pages analyzed, 0.5 million pages positively classified
  • found 7000 homepages out of 30 000 DBLP authors
  • 712 authors of the top 1000 DBLP authors, with 267 among the 1000 best crawl results
• + postprocessing for querying and analysis:
  • ranking by SVM confidence, authority score, etc.
  • clustering, relevance feedback, etc.
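The counts above translate into the following ratios. This is plain arithmetic on the numbers from the slide. Reading them as coverage figures is an interpretation, and the helper `percent` is hypothetical.

```python
def percent(part, whole):
    """Share of part in whole, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Numbers taken from the slide.
print(percent(7000, 30000))        # homepages found among all DBLP authors
print(percent(712, 1000))          # top-1000 DBLP authors found by the crawl
print(percent(267, 1000))          # top-1000 authors among the 1000 best crawl results
print(percent(500_000, 1_000_000)) # analyzed pages that were positively classified
```

So roughly a quarter of all DBLP authors and about 71% of the top 1000 authors were found, starting from two seed pages.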

Ongoing and Future Work
• Deep Web exploration with auto-generated queries
• Exploiting ontological knowledge
  e.g.: search for a "woman talking someone into murder"
• Construct richer feature spaces
• Exploiting linguistic analysis methods
  e.g.: "... cut his throat ..."
  act: killing, subject: ..., object: ...
• Generalized links & semantic joins, e.g. named entities
• Identifying semantically coherent units
• Combining focused crawling with XML search
  • auto-annotation of HTML, LaTeX, PDF, etc. docs
  • cross-document querying à la XXL
• User guidance & portal admin methodology

Towards "Semantically Coherent" Units

[Figure: two homepage layouts compared]

Homepage of ...
• Teaching: DBS, IR, ...
• My work: J2EE, XML, WSDL, ...
• My hobbies: Poisonous frogs, South American stamps, Canoeing, ...

vs.

Homepage of ...
• Teaching
• Research
• Address: ...
• Research: XML, Auto-tuning, ...

Concluding Remarks
• long-term goal: exploit the Web's potential for being the world's largest knowledge base
• XML and Semantic Web are key assets, but by themselves not sufficient; we need to cope with diversity, incompleteness, and uncertainty
  –> absolute need for ranked retrieval
• view information organization and information search as dual views of the same problem
• combine techniques from DBS, IR, CL, AI, and ML
• need better theory about quality/efficiency tradeoffs as well as large-scale experiments
