Learn how to aggregate and extract semantically rich information from the web using instances, concept search, and domain-centric extraction. Discover applications for user optimization and challenges faced in this dynamic field.
A Web of Concepts Dalvi, et al. Presented by Andrew Zitzelberger
Vision • Transform hyperlinked bags of words into a semantically rich aggregate view of information on the web.
Concept • Things of interest • Searching for information • Accomplishing a task • Reservations, etc.
Instances • Record of a concept • Restaurant • Gochi (19980 Homestead Rd Cupertino CA) • Academia? • Publications, research institutions
Instance Representation • Loosely-structured record (lrec) • Attribute-key, value pairs • Unique id field • Entity matching problem • Metadata • Attribute list
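The lrec described on this slide can be pictured as a small data structure. The following is a minimal sketch, not the paper's definition: the class name `LRec`, its field names, and the id value are hypothetical, and attribute values are modeled as lists of strings only because no fixed schema is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LRec:
    """Loosely-structured record: a unique id plus free-form attributes."""
    id: str        # unique id field (the value used below is made up)
    concept: str   # the concept this record is an instance of
    # Attribute-key -> values; a key may carry several values.
    attributes: Dict[str, List[str]] = field(default_factory=dict)
    # Metadata is kept apart from the attribute values themselves.
    metadata: Dict[str, str] = field(default_factory=dict)

    def add(self, key: str, value: str) -> None:
        """Append a value under an attribute key, creating the key if needed."""
        self.attributes.setdefault(key, []).append(value)

    def attribute_list(self) -> List[str]:
        """Return the attribute keys present on this record, sorted."""
        return sorted(self.attributes)


# The restaurant instance from the Instances slide.
gochi = LRec(id="restaurant:0001", concept="restaurant")
gochi.add("name", "Gochi")
gochi.add("address", "19980 Homestead Rd Cupertino CA")
print(gochi.attribute_list())
```

Deciding whether two such records describe the same real-world instance is the entity matching problem named on the slide; this sketch does not attempt it.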
Domain • Set of related concepts • Academic community domain = {publications, people, conferences}
Usage Study: Instance vs. Concept Search • yelp.com • Month of queries resulting in a click (restaurants) • 59% specific business URL • 19% search URL (either specific business or group) • 11% specific group URL
Usage Study: Concept Attribute Search • Remove restaurant name and location information from query • Co-occurring words: • Menu (3%), coupons (1.8%), online, weekly specials, locations (1.5%) • Nutrition, to go, delivery, careers, cod
Usage Study: Aggregation Value • 59% clicked on at least one other URL • 35% clicked on at least two other URLs • Small manual evaluation indicates pages are often about the same business.
Usage Study: Concepts vs. Browsing • 42% of homepage visits are from search engine • Immediately following URL • 11.5% location • 9% menu • 1% coupons • 10.5% of user trails contain more than one distinct instance of the restaurant concept
Extraction • Create new records from the web • Information extraction • Linking • Analysis • Metadata tagging (cuisine type)
Domain-centric vs. Site-centric Extraction • Site-centric extraction • Wrappers for page structure • Probabilistic models (CRF) • Domain-centric extraction • Fields of interest • Statistical properties (single zip code, etc.) • Structure components (lists, link relationships)
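The "single zip code" statistical property can be illustrated with a small check. This is a sketch under stated assumptions, not the extraction method from the paper: the function names are hypothetical, the sample page texts are invented, and the regular expression only recognizes US-style zip codes that follow a two-letter state abbreviation.

```python
import re
from typing import Set

# Assumption: a zip code is 5 digits (optionally ZIP+4) right after a
# two-letter state abbreviation. Requiring the state avoids counting
# street numbers such as "19980" as zip codes.
ZIP_AFTER_STATE = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def distinct_zip_codes(page_text: str) -> Set[str]:
    """Collect the distinct 5-digit zip codes found in the page text."""
    return set(ZIP_AFTER_STATE.findall(page_text))


def looks_like_single_instance_page(page_text: str) -> bool:
    """Heuristic: exactly one distinct zip code suggests one business."""
    return len(distinct_zip_codes(page_text)) == 1


# Invented sample texts for illustration only.
business_page = "Gochi, 19980 Homestead Rd, Cupertino, CA 95014. Open daily."
listing_page = "Cafe A, 1 Main St, CA 95014. Cafe B, 2 Oak St, CA 94086."

print(looks_like_single_instance_page(business_page))
print(looks_like_single_instance_page(listing_page))
```

A property like this depends on the domain (restaurants have one address) rather than on any one site's page layout, which is the contrast the slide draws with site-centric wrappers.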
Domain-centric Extraction • Aggregator mining • Learn from extracted knowledge (similar menus) • Matching • Text is “about” a record (restaurant review)
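The matching step, deciding whether a piece of text is "about" a record, can be approximated very roughly by token overlap. The sketch below is a stand-in for illustration, not the technique used in the paper: `about_score`, the record fields, and the sample texts are all hypothetical.

```python
import re
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def about_score(text: str, record: Dict[str, str]) -> float:
    """Fraction of the record's attribute tokens that appear in the text."""
    record_tokens: Set[str] = set()
    for value in record.values():
        record_tokens |= tokens(value)
    if not record_tokens:
        return 0.0
    return len(record_tokens & tokens(text)) / len(record_tokens)


# Invented record and texts for illustration only.
record = {"name": "Gochi", "city": "Cupertino"}
review = "We had dinner at Gochi in Cupertino last week."
unrelated = "The camera ships with two batteries."

print(about_score(review, record))
print(about_score(unrelated, record))
```

A plain overlap score ignores ambiguity (two businesses sharing a name, for instance), so a realistic matcher would need more evidence than this.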
Application: Session Optimization • User understanding • Historical modeling • Session modeling • Content understanding • Example: Birks • Birks and Mayors (luxury jewelers) vs. Birk’s Steakhouse
Application: Browse Optimization • Alternatives: (Restaurants) • Similar type of cuisine • Similar location • Similar quality • Augmentations: (Camera) • Batteries • Memory cards
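Suggesting alternatives can be sketched as ranking other records by how many attributes they share with the one being viewed. This is only an illustration of the idea on the slide: the function name `alternatives`, the attribute keys, and the sample records are hypothetical, and a count of equal attribute values stands in for a real similarity measure.

```python
from typing import Dict, List, Sequence

Record = Dict[str, str]


def alternatives(target: Record,
                 candidates: Sequence[Record],
                 keys: Sequence[str] = ("cuisine", "city")) -> List[Record]:
    """Rank candidates by the number of `keys` on which they equal the target."""
    scored = []
    for cand in candidates:
        if cand.get("id") == target.get("id"):
            continue  # never suggest the record being viewed
        score = sum(1 for k in keys
                    if k in target and cand.get(k) == target[k])
        if score > 0:
            scored.append((score, cand))
    # sorted() is stable, so ties keep their original candidate order.
    scored.sort(key=lambda pair: -pair[0])
    return [cand for _, cand in scored]


# Invented records for illustration only.
viewed = {"id": "r1", "cuisine": "japanese", "city": "Cupertino"}
others = [
    {"id": "r3", "cuisine": "japanese", "city": "Sunnyvale"},
    {"id": "r4", "cuisine": "mexican", "city": "San Jose"},
    {"id": "r2", "cuisine": "japanese", "city": "Cupertino"},
    viewed,
]

print([r["id"] for r in alternatives(viewed, others)])
```

Augmentations (batteries or memory cards for a camera) would need a relation between different concepts rather than similarity within one, which this sketch does not cover.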
Concept Search Result Pages – show multiple records • Concept Pages – information about an instance • Article Pages – a piece of authored text
Advertising • Increase in targeted advertisements • Target concepts rather than keywords
Challenges • Transfer learning • Transfer extractor knowledge • Tracking uncertainty • Accuracy issues • “Web of concepts is not a one time affair” • Wrapper problems • Concept updates • Relevance Measures • User satisfaction
Related Work • Information Extraction/Integration Systems • Dataspace Systems • Semantic Web
Future Work • Enrich representation model • Path storage to data • Provenance, versions, uncertainty • Hierarchical relationships (containment or inheritance) • Ranking of disparate sources