CS345 Data Mining

Virtual Databases


Virtual Relations

Web

Example
  • Find marketing manager openings in Internet companies so that my commute is shorter than 10 miles.

Structured queries

e.g., in SQL

Applications
  • Comparison shopping
    • shopping.com, fatlens, mobissimo,…
  • Job search
    • indeed.com, simplyhired,…
  • Classifieds Search
    • oodle
  • Integrating web data with relational enterprise apps
    • purchasing, pricing,…

Wrappers
  • Extract tuples from a single website
  • Assume the website is a static collection of pages, i.e., no forms
  • Website → Wrapper → Relation

Not the same as Relation Extraction
  • Why can’t we use DIPRE or Snowball?
    • Can’t assume that the same tuple can be found on many different websites
    • Need to extract all the tuples from each website
    • May need to normalize data values across websites
    • Data may be behind forms
      • Need to account for query capabilities of websites
Brute force approach
  • Write a custom program tailored to the website
    • e.g., in Perl, Python, …
  • Does not scale to thousands of websites
    • Each site needs a different wrapper
  • Website changes break wrappers
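
A minimal sketch of what such a hand-written wrapper might look like, and why it is brittle. The page markup, the regular expression, and the name `scrape_product` are illustrative assumptions, not taken from any real site.

```python
import re

# Hypothetical hand-written wrapper for one site's product pages.
# The regex hard-codes this site's markup, so a redesign silently breaks it.
PRODUCT_PAGE = re.compile(
    r"<h1>(?P<product>.*?)</h1>.*?Our Price:\s*(?P<price>\$[\d.,]+)",
    re.DOTALL,
)

def scrape_product(html):
    """Return (product, price) from one page, or None if the layout is unrecognized."""
    match = PRODUCT_PAGE.search(html)
    if match is None:
        return None
    return match.group("product"), match.group("price")

# Assumed sample pages: the current layout and a hypothetical redesign.
page = '<body><h1>Apple 20GB iPod</h1><img href="xyz">Our Price: $204.99<p> Cool product.</body>'
redesigned = '<body><h2>Apple 20GB iPod</h2><span>Now only $204.99</span></body>'

print(scrape_product(page))
print(scrape_product(redesigned))
```

The second call returns `None` because the heading tag and the price label changed, which is the kind of breakage the last bullet refers to.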
Simpler problem
  • Simplified version of wrapper problem
    • Given a set of pages from the same website, that share the same structure
      • E.g., product detail pages from Amazon.com
    • We have a target relation schema
      • E.g., (product,description,price)
    • Human labels a small subset of pages
      • Marks tuple components on pages
    • Can we deduce the structure?
Two web pages

<body><h1>Apple 20GB iPod</h1>

<img href=“xyz”>

Our Price: $204.99

<p> Cool product.

</body>

<body><h1>Apple 4GB iPod nano</h1>

<img href=“abc”>

Our Price: $250.99

<p> Even cooler product.

</body>

Labeled pages

The human marks the tuple components (product, price, description) on each of the two pages:
  • Page 1: product = “Apple 20GB iPod”, price = “$204.99”, description = “Cool product.”
  • Page 2: product = “Apple 4GB iPod nano”, price = “$250.99”, description = “Even cooler product.”
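
One way to deduce structure from labeled pages is to take, for each attribute, the longest text that always precedes it and the longest text that always follows it. The sketch below does this under several assumptions: whitespace between tags is dropped, quotes are plain ASCII, each labeled value occurs once per page, and the names `learn_lr`, `common_prefix`, and `common_suffix` are hypothetical helpers, not part of the course material.

```python
import os

def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr(labeled_pages):
    """labeled_pages: list of (html, values), values in a fixed attribute order.
    Returns one (left, right) candidate delimiter pair per attribute."""
    n_attrs = len(labeled_pages[0][1])
    delimiters = []
    for i in range(n_attrs):
        before, after = [], []
        for html, values in labeled_pages:
            start = html.index(values[i])  # assumes the value occurs once
            before.append(html[:start])
            after.append(html[start + len(values[i]):])
        delimiters.append((common_suffix(before), common_prefix(after)))
    return delimiters

# The two product pages, compacted (an assumption about whitespace handling).
labeled = [
    ('<body><h1>Apple 20GB iPod</h1><img href="xyz">Our Price: $204.99'
     '<p> Cool product.</body>',
     ("Apple 20GB iPod", "$204.99", "Cool product.")),
    ('<body><h1>Apple 4GB iPod nano</h1><img href="abc">Our Price: $250.99'
     '<p> Even cooler product.</body>',
     ("Apple 4GB iPod nano", "$250.99", "Even cooler product.")),
]

learned = learn_lr(labeled)
for left, right in learned:
    print(repr(left), repr(right))
```

The longest common strings are only candidates. With just two pages, the left delimiter learned for the description also absorbs `.99`, because both prices happen to end that way; a shorter suffix such as `<p>` or more labeled pages would generalize better.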

LR (Left-Right) Wrapper

L1 = “<body><h1>” R1 = “</h1><img href=”

L2 = “Our Price: ” R2 = “<p>”

L3 = “<p>” R3 = “</body>”

<body><h1>Apple 20GB iPod</h1>

<img href=“xyz”>

Our Price: $204.99

<p> Cool product.

</body>

  • Fix an order for attributes (product, price, description)
  • Use patterns of the form * L_i (attribute_i) R_i *
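
Applying an LR wrapper to a single-tuple page can be sketched as a left-to-right scan. This is a minimal illustration assuming whitespace between tags is ignored (the sample pages are compacted and extracted values are stripped); `lr_extract` and `DELIMITERS` are hypothetical names.

```python
def lr_extract(page, delimiters):
    """For each (left, right) pair, in attribute order, return the text between them."""
    values = []
    pos = 0
    for left, right in delimiters:
        start = page.index(left, pos) + len(left)
        end = page.index(right, start)
        values.append(page[start:end].strip())
        # Resume at the right delimiter itself: R_i may double as L_(i+1),
        # as with "<p>" for price and description.
        pos = end
    return tuple(values)

# Delimiters for (product, price, description), in the fixed attribute order.
DELIMITERS = [
    ("<body><h1>", "</h1><img href="),
    ("Our Price: ", "<p>"),
    ("<p>", "</body>"),
]

page1 = '<body><h1>Apple 20GB iPod</h1><img href="xyz">Our Price: $204.99<p> Cool product.</body>'
page2 = '<body><h1>Apple 4GB iPod nano</h1><img href="abc">Our Price: $250.99<p> Even cooler product.</body>'

print(lr_extract(page1, DELIMITERS))
print(lr_extract(page2, DELIMITERS))
```

`str.index` raises `ValueError` when a delimiter is missing, which is a reasonable signal that the page does not share the expected structure.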
Example: (Product, Price)

<body>

<b>Holiday Sale</b><em>save $$</em>

<p>

<b>Shoes:</b><em>$100</em> <br>

<b>Ship:</b><em>$1000</em>

</body>

<body>

<b>Everyday low prices</b><em>guaranteed</em>

<p>

<b>Sealing wax:</b><em>$1</em>

</body>

L1=“<b>” R1=“</b>”

L2=“<em>” R2=“</em>”
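
Running these delimiters repeatedly over a multi-tuple page shows the weakness of plain LR wrappers: the page header matches the same pattern as a real tuple. The sketch assumes compacted pages and uses a hypothetical helper `lr_extract_all`.

```python
def lr_extract_all(page, delimiters):
    """Repeat the LR scan until some delimiter can no longer be found."""
    tuples = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further tuple; drop any partial row
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

DELIMITERS = [("<b>", "</b>"), ("<em>", "</em>")]

page1 = ('<body><b>Holiday Sale</b><em>save $$</em><p>'
         '<b>Shoes:</b><em>$100</em> <br><b>Ship:</b><em>$1000</em></body>')
page2 = ('<body><b>Everyday low prices</b><em>guaranteed</em><p>'
         '<b>Sealing wax:</b><em>$1</em></body>')

for row in lr_extract_all(page1, DELIMITERS) + lr_extract_all(page2, DELIMITERS):
    print(row)
```

The first row extracted from each page is the banner text, not a (product, price) tuple.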

HLRT (Head-Left-Right-Tail) Wrappers

<body>

<b>Holiday Sale</b><em>save $$</em>

<p>

<b>Shoes:</b><em>$100</em> <br>

<b>Ship:</b><em>$1000</em>

</body>

<body>

<b>Everyday low prices</b><em>guaranteed</em>

<p>

<b>Sealing wax:</b><em>$1</em>

</body>

L1=“<b>” R1=“</b>”

L2=“<em>” R2=“</em>”

H=“</em><p>” T = “</body>”
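
A rough sketch of HLRT execution: skip past the first occurrence of the head, cut the page at the tail, and run the LR scan on what is in between. It assumes compacted pages and takes the tail to be the closing `</body>` tag, which is an assumption about the intended value; slicing out the region first is a simplification, and `hlrt_extract` is a hypothetical name.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Apply an HLRT wrapper: skip the head, stop at the tail, run LR in between."""
    start = page.index(head) + len(head)
    stop = page.index(tail, start)
    region = page[start:stop]

    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            begin = region.find(left, pos)
            if begin == -1:
                return tuples  # no further tuple inside the region
            begin += len(left)
            end = region.find(right, begin)
            if end == -1:
                return tuples
            row.append(region[begin:end])
            pos = end + len(right)
        tuples.append(tuple(row))

HEAD, TAIL = "</em><p>", "</body>"
DELIMITERS = [("<b>", "</b>"), ("<em>", "</em>")]

page1 = ('<body><b>Holiday Sale</b><em>save $$</em><p>'
         '<b>Shoes:</b><em>$100</em> <br><b>Ship:</b><em>$1000</em></body>')
page2 = ('<body><b>Everyday low prices</b><em>guaranteed</em><p>'
         '<b>Sealing wax:</b><em>$1</em></body>')

print(hlrt_extract(page1, HEAD, TAIL, DELIMITERS))
print(hlrt_extract(page2, HEAD, TAIL, DELIMITERS))
```

With the head in place, the banner rows disappear and only the product tuples remain.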

Example: (Product, Price)

<body>

<b>Holiday Sale</b><em>save $$</em>

<p>

<b>Shoes:</b><em>$100</em> <br>

<b>Ship:</b><em>$1000</em>

</body>

<body>

<b>Cabbages</b><em>$10</em>

<p>

<b>Sealing wax:</b><em>$1</em>

</body>

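Cannot construct an HLRT wrapper: on the second page the first tuple (Cabbages, $10) sits exactly where the first page has its non-tuple header (Holiday Sale, save $$), so no single head H skips one and keeps the other.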
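
The conflict can be checked by trying the two obvious head choices on both pages. This is a sketch under the same assumptions as before (compacted pages, closing `</body>` as the tail); `extract` is a hypothetical helper and the two heads tried are illustrative, not an exhaustive search.

```python
def extract(page, delimiters, head="", tail="</body>"):
    """LR scan over the text between head and tail (an empty head means page start)."""
    start = page.index(head) + len(head)
    region = page[start:page.index(tail, start)]
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            begin = region.find(left, pos)
            if begin == -1:
                return tuples
            begin += len(left)
            end = region.find(right, begin)
            if end == -1:
                return tuples
            row.append(region[begin:end])
            pos = end + len(right)
        tuples.append(tuple(row))

DELIMITERS = [("<b>", "</b>"), ("<em>", "</em>")]
page1 = ('<body><b>Holiday Sale</b><em>save $$</em><p>'
         '<b>Shoes:</b><em>$100</em> <br><b>Ship:</b><em>$1000</em></body>')
page2 = ('<body><b>Cabbages</b><em>$10</em><p>'
         '<b>Sealing wax:</b><em>$1</em></body>')
wanted1 = [("Shoes:", "$100"), ("Ship:", "$1000")]
wanted2 = [("Cabbages", "$10"), ("Sealing wax:", "$1")]

for head in ("</em><p>", ""):
    ok1 = extract(page1, DELIMITERS, head) == wanted1
    ok2 = extract(page2, DELIMITERS, head) == wanted2
    print(repr(head), "page 1 correct:", ok1, "| page 2 correct:", ok2)
```

Each head gets one page right and the other wrong: skipping to `</em><p>` loses the Cabbages tuple, while not skipping picks up the Holiday Sale banner.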

Book-author-year example

Books by <b>Isaac Asimov</b><ul>

<li>Foundation (1951)</li>

<li>Nightfall (1941)</li>

</ul><p>

Books by <b>Arthur C Clarke</b><ul>

<li>Rendezvous with Rama (1976)</li>

</ul>

Limitations of HLRT
  • Contiguous tuples
    • All tuple components must be on the same page
    • One tuple must end before next one begins
  • Needs human labeling
    • Because labeling needs to be accurate
    • Can we use “noisy” automatic taggers that can make some mistakes?