Extracting and Structuring Web Data

Extracting and Structuring Web Data PowerPoint PPT Presentation

classified AUTO in AUTO SALES as automatic transmission. could adjust exceptions in ... in an automobile accident. He was born. August 4, 1956 in Salt Lake ...

Related searches for Extracting and Structuring Web Data

Download Presentation

Extracting and Structuring Web Data

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Slide 1:Extracting and Structuring Web Data

Slide 2:GOALQuery the Web like we query a database

Slide 3:PROBLEMThe Web is not structured like a database.

Slide 4:Making the Web Look Like a Database Web Query Languages Treat the Web as a graph (pages = nodes, links = edges). Query the graph (e.g., Find all pages within one hop of pages with the words “Cars for Sale”). Wrappers Find page of interest. Parse page to extract attribute-value pairs and insert them into a database. Write parser by hand. Use syntactic clues to generate parser semi-automatically. Query the database.

Slide 5:Automatic Wrapper Generation

Slide 6:Application Ontology:Object-Relationship Model Instance

Slide 7:Application Ontology: Data Frames

Slide 8:Ontology Parser

Slide 9:Record Extractor

Slide 10:Record Extractor:High Fan-Out Heuristic

Slide 11:Record Extractor:Record-Separator Heuristics Identifiable “separator” tags Highest-count tag(s) Interval standard deviation Ontological match Repeating tag patterns

Slide 12:Record Extractor:Consensus Heuristic

Slide 13:Record Extractor: Results

Slide 14:Constant/Keyword Recognizer

Slide 15:Heuristics Keyword proximity Subsumed and overlapping constants Functional relationships Nonfunctional relationships First occurrence without constraint violation

Slide 16:Keyword Proximity

Slide 17:Subsumed/Overlapping Constants

Slide 18:Functional Relationships

Slide 19:Nonfunctional Relationships

Slide 20:First Occurrence without Constraint Violation

Slide 21:Database-Instance Generator

Slide 22:Recall & Precision

Slide 23:Results: Car Ads

Slide 24:Car Ads: Comments Unbounded sets missed: MERC, Town Car, 98 Royale could use lexicon of makes and models Unspecified variation in lexical patterns missed: 5 speed (instead of 5 spd), p.l (instead of p.l.) could adjust lexical patterns Misidentification of attributes classified AUTO in AUTO SALES as automatic transmission could adjust exceptions in lexical patterns Typographical errors “Chrystler”, “DODG ENeon”, “I-15566-2441” could look for spelling variations and common typos

Slide 25:Results: Computer Job Ads

Slide 26:Obituaries(A More Demanding Application)

Slide 27:Obituary Ontology

Slide 28:Data FramesLexicons & Specializations

Slide 29:Keyword HeuristicsSingleton Items

Slide 30:Keyword HeuristicsMultiple Items

Slide 31:Results: Obituaries

Slide 32:Results: Obituaries

Slide 33:Conclusions Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically. Recall and Precision results are encouraging. Car Ads: ~ 94% recall and ~ 99% precision Job Ads: ~ 84% recall and ~ 98% precision Obituaries: ~ 90% recall and ~ 95% precision (except on names: ~ 73% precision) Future Work Find and categorize pages of interest. Strengthen heuristics for separation, extraction, and construction. Add richer conversions and additional constraints to data frames.

  • Login