Web of Concepts: Transforming Hyperlinked Information

A Web of Concepts • Dalvi, et al. • Presented by Andrew Zitzelberger

Vision • Transform hyperlinked bags of words into a semantically rich, aggregate view of information on the web.

Concept • Things of interest • Searching for information • Accomplishing a task • Reservations, etc.

Instances • Record of a concept • Restaurant • Gochi (19980 Homestead Rd Cupertino CA) • Academia? • Publications, research institutions

Instance Representation • Loosely-structured record (lrec) • Attribute-key, value pairs • Unique id field • Entity matching problem • Metadata • Attribute list
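The lrec described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `LRec`, the multi-valued attribute lists, and the shape of the `metadata` field are hypothetical choices, not details given in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import uuid


@dataclass
class LRec:
    """Loosely-structured record: a unique id, a concept label, and free-form attributes."""
    concept: str
    # Assumption: each attribute key maps to a list of values, since no schema is enforced.
    attributes: Dict[str, List[str]] = field(default_factory=dict)
    # Assumption: metadata is a flat string map (e.g. a source URL); the slides do not specify it.
    metadata: Dict[str, str] = field(default_factory=dict)
    # Unique id field, generated automatically when none is supplied.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def add(self, key: str, value: str) -> None:
        """Append a value under an attribute key, creating the key if needed."""
        self.attributes.setdefault(key, []).append(value)


# The restaurant instance from the Instances slide.
gochi = LRec(concept="restaurant")
gochi.add("name", "Gochi")
gochi.add("address", "19980 Homestead Rd Cupertino CA")
```

Keeping attributes as open key/value pairs rather than fixed fields is what makes the record "loosely structured": two restaurant records need not share the same attribute list.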

Domain • Set of related concepts • Academic community domain = {publications, people, conferences}

Usage Study: Instance vs. Concept Search • yelp.com • Month of queries resulting in a click (restaurants) • 59% specific business URL • 19% search URL (either specific business or group) • 11% specific group URL

Usage Study: Concept Attribute Search • Remove restaurant name and location information from query • Co-occurring words: • Menu (3%), coupons (1.8%), online, weekly specials, locations (1.5%) • Nutrition, to go, delivery, careers, cod
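The counting step on this slide could look roughly like the sketch below. It assumes the name and location terms are already known as a set of tokens; the function name `cooccurring_words` and the sample queries are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Iterable, Set


def cooccurring_words(queries: Iterable[str], stop_terms: Set[str]) -> Counter:
    """Count the words left in each query after dropping name/location terms."""
    counts: Counter = Counter()
    for query in queries:
        remaining = {t for t in query.lower().split() if t not in stop_terms}
        counts.update(remaining)  # each word counts once per query
    return counts


# Made-up queries for illustration only.
queries = ["gochi cupertino menu", "gochi coupons", "gochi menu"]
stop_terms = {"gochi", "cupertino"}
counts = cooccurring_words(queries, stop_terms)
```

Dividing each count by the number of queries would give per-word percentages comparable in form to those on the slide.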

Usage Study: Aggregation Value • 59% clicked on at least one other URL • 35% clicked on at least two other URLs • Small manual evaluation indicates pages are often about the same business.

Usage Study: Concepts vs. Browsing • 42% of homepage visits are from search engine • Immediately following URL • 11.5% location • 9% menu • 1% coupons • 10.5% of user trails contain more than one distinct instance of the restaurant concept

Extraction • Create new records from the web • Information extraction • Linking • Analysis • Meta-data tagging (cuisine type)

Domain-centric vs. Site-centric Extraction • Site-centric extraction • Wrappers for page structure • Probabilistic models (CRF) • Domain-centric extraction • Fields of interest • Statistical properties (single zip code, etc.) • Structure components (lists, link relationships)
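One of the statistical properties mentioned for domain-centric extraction, a page carrying a single zip code, can be checked with a simple heuristic. The sketch below is an assumption-laden illustration: the pattern requires a two-letter state abbreviation before the ZIP so that five-digit street numbers are not miscounted, and both the function name `has_single_zip` and the sample text are hypothetical.

```python
import re

# Assumption: US-style addresses where the ZIP follows a two-letter state code.
ZIP_RE = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def has_single_zip(page_text: str) -> bool:
    """True when the text mentions exactly one distinct ZIP code."""
    return len(set(ZIP_RE.findall(page_text))) == 1


# Made-up page text for illustration only.
single = "Example Diner, 12345 Main St, Springfield IL 62701. Call us today."
listing = "Diner A, Springfield IL 62701. Diner B, Springfield IL 62704."
```

A page passing this check is more likely to describe one business than a listing page, which is the kind of site-independent signal the slide contrasts with per-site wrappers.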

Domain-centric Extraction • Aggregator mining • Learn from extracted knowledge (similar menus) • Matching • Text is “about” a record (restaurant review)
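Deciding that a piece of text is "about" a record can be approximated, very roughly, by token overlap with the record's name. The sketch below is not the method from the paper; the function names, the 0.5 threshold, and the review text are hypothetical, and the record names are borrowed from the Birks example used later in the deck.

```python
import re
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens, keeping apostrophes inside words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def best_match(text: str, records: Dict[str, str], threshold: float = 0.5) -> Optional[str]:
    """Return the id of the record whose name tokens are best covered by the text."""
    text_tokens = tokens(text)
    best_id, best_score = None, 0.0
    for rec_id, name in records.items():
        name_tokens = tokens(name)
        if not name_tokens:
            continue
        # Fraction of the record's name tokens that appear in the text.
        score = len(name_tokens & text_tokens) / len(name_tokens)
        if score > best_score:
            best_id, best_score = rec_id, score
    return best_id if best_score >= threshold else None


records = {"r1": "Birks and Mayors", "r2": "Birk's Steakhouse"}
review = "The ribeye at Birk's Steakhouse was excellent"  # made-up review text
```

A realistic matcher would also weigh address, phone number, and context, but even this simple score separates the steakhouse review from the jeweler's record.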

Application: Aggregation

Application: Session Optimization • User understanding • Historical modeling • Session modeling • Content understanding • Example: Birks • Birks and Mayors (luxury jewelers) vs. Birk’s Steakhouse

Application: Browse Optimization • Alternatives: (Restaurants) • Similar type of cuisine • Similar location • Similar quality • Augmentations: (Camera) • Batteries • Memory cards
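Suggesting alternatives can be sketched as ranking other records by how many attributes they share with the one being viewed. This is a minimal illustration only: the function name `alternatives`, the attribute keys `cuisine` and `city`, and the sample records are all hypothetical.

```python
from typing import Dict, List, Sequence


def alternatives(current: Dict[str, str], candidates: List[Dict[str, str]],
                 keys: Sequence[str] = ("cuisine", "city")) -> List[Dict[str, str]]:
    """Rank candidates by the number of listed attributes they share with the current record."""
    def shared(candidate: Dict[str, str]) -> int:
        return sum(1 for k in keys if k in current and candidate.get(k) == current[k])

    ranked = sorted(candidates, key=shared, reverse=True)
    # Drop candidates that share nothing with the current record.
    return [c for c in ranked if shared(c) > 0]


# Made-up records for illustration only.
current = {"name": "A", "cuisine": "japanese", "city": "Cupertino"}
candidates = [
    {"name": "B", "cuisine": "italian", "city": "Cupertino"},
    {"name": "C", "cuisine": "japanese", "city": "Cupertino"},
    {"name": "D", "cuisine": "thai", "city": "Fresno"},
]
suggested = [c["name"] for c in alternatives(current, candidates)]
```

Augmentations such as batteries for a camera would need a different relation (items that complement a record rather than resemble it), which this similarity ranking does not capture.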

Concept Search • Result Pages – show multiple records • Concept Pages – information about an instance • Article Pages – a piece of authored text

Advertising • Increase in targeted advertisements • Target concepts rather than keywords

Challenges • Transfer learning • Transfer extractor knowledge • Tracking uncertainty • Accuracy issues • “Web of concepts is not a one time affair” • Wrapper problems • Concept updates • Relevance Measures • User satisfaction

Related Work • Information Extraction/Integration Systems • Dataspace Systems • Semantic Web

Future Work • Enrich representation model • Path storage to data • Provenance, versions, uncertainty • Hierarchical relationships (containment or inheritance) • Ranking of disparate sources
