
Automating the Extraction of Data Behind Web Forms

Automating the Extraction of Data Behind Web Forms. Brigham Young University. Sai Ho Yau. Hurdles Against Automating Data Extraction. There are enormous amounts of information available from the Web, but it is difficult to extract the data automatically for several reasons.


Presentation Transcript


  1. Automating the Extraction of Data Behind Web Forms Brigham Young University Sai Ho Yau

  2. Hurdles Against Automating Data Extraction There are enormous amounts of information available from the Web, but it is difficult to extract the data automatically for several reasons: • Web information is stored in databases • Form interfaces • Relevant information can be obtained only after a Web form is filled out and submitted

  3. Problems Dealing with Forms • No general Web form design • Required text fields • One form may lead to another • Resulting information embedded within forms • Returned error messages versus valid data • Elimination of possible duplicate data

  4. Motivations We want to automatically: • Fill in Web forms. • Extract information behind forms. • Screen out errors. • Eliminate duplicate data and merge resulting information.

  5. The Framework

  6. Method: Construct the Query String

  7. Method: Construct the Query String

  8. Method: Construct the Query String
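The slides on constructing the query string survive only as titles, so the following is a minimal sketch of one common way to do it, not the author's method. It assumes the form is submitted with GET, and the function name `build_query_url`, the example URL, and the field names are all hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(page_url, form_action, fields):
    """Combine a form's action URL with its field settings into a GET URL.

    page_url    -- URL of the page containing the form (hypothetical input)
    form_action -- the form's action attribute, possibly relative
    fields      -- dict mapping field names to the chosen values
    """
    base = urljoin(page_url, form_action)    # resolve a relative action
    query = urlencode(fields)                # percent-encode names and values
    separator = "&" if "?" in base else "?"  # action may already carry a query
    return base + separator + query if query else base


url = build_query_url(
    "http://example.com/cars/search.html",
    "results.cgi",
    {"make": "Ford", "max price": "5000"},
)
print(url)  # http://example.com/cars/results.cgi?make=Ford&max+price=5000
```

A form that uses POST would send the same encoded string in the request body instead of appending it to the URL.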

  9. Returned Web Page

  10. Solutions Two phases to deal with many possible responses to a query*: • Sampling phase • Exhaustive phase * Assuming no HTTP error

  11. Sampling Phase Submit the default form. Randomly select N form-field settings and submit the form N times. If these submissions return no new information, STOP and send the result downstream (N is set so that the probability of subsequent submissions yielding new data is less than 5%). Otherwise, ENTER the Exhaustive Phase.
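The sampling phase can be sketched roughly as follows. This is an illustration under stated assumptions rather than the presented system: `fields`, `submit`, and `sampling_phase` are hypothetical names, the first option of each field is assumed to be its default, and N is taken as a given parameter because the slides do not say how it is derived.

```python
import random


def sampling_phase(fields, submit, n, rng=random):
    """Submit the default form, then n randomly chosen field settings.

    fields -- dict: field name -> list of options (first one assumed default)
    submit -- callable taking a dict of settings, returning an iterable of records
    Returns (records_seen, found_new), where found_new tells the caller
    whether to enter the exhaustive phase.
    """
    default = {name: options[0] for name, options in fields.items()}
    seen = set(submit(default))
    found_new = False
    for _ in range(n):
        setting = {name: rng.choice(options) for name, options in fields.items()}
        records = set(submit(setting))
        if not records <= seen:      # at least one record not seen before
            found_new = True
        seen |= records
    return seen, found_new


# Stand-in for a real form: every setting returns the same two records.
fields = {"make": ["any", "Ford", "Honda"], "year": ["any", "1998", "1999"]}
records, found_new = sampling_phase(fields, lambda s: ["rec-1", "rec-2"], n=5)
print(found_new)  # False, so the sampled result would be sent downstream
```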

  12. Exhaustive Phase • Estimate the total time and quantity of data. • If below threshold, exhaustively obtain the rest of the information. • Otherwise, return the results of the sampling and report to the user the estimate of time and quantity of data.
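One way to picture the exhaustive-phase decision is sketched below. The slides do not say how time and quantity are estimated, so this sketch assumes the simplest model: every remaining field combination costs an average time and page size measured during sampling. All names and thresholds are hypothetical.

```python
from itertools import product
from math import prod


def plan_exhaustive_phase(fields, sampled, seconds_per_query, bytes_per_page,
                          max_seconds, max_bytes):
    """Estimate the remaining work and decide whether to proceed.

    sampled -- number of distinct settings already submitted while sampling
    Returns (proceed, estimated_seconds, estimated_bytes).
    """
    total = prod(len(options) for options in fields.values())
    remaining = max(total - sampled, 0)
    est_seconds = remaining * seconds_per_query
    est_bytes = remaining * bytes_per_page
    proceed = est_seconds < max_seconds and est_bytes < max_bytes
    return proceed, est_seconds, est_bytes


def all_settings(fields):
    """Yield every combination of field options as a dict of settings."""
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield dict(zip(names, combo))


fields = {"make": ["any", "Ford", "Honda"], "year": ["1998", "1999"]}
proceed, est_seconds, est_bytes = plan_exhaustive_phase(
    fields, sampled=2, seconds_per_query=0.5, bytes_per_page=1000,
    max_seconds=10, max_bytes=10_000,
)
print(proceed, est_seconds, est_bytes)  # True 2.0 4000
```

When `proceed` is false, the caller would return the sampled results together with the two estimates, as the slide describes.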

  13. Data Retrieving Strategy Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases.

  14. Retrieved Web Pages

  15. Data Retrieving Strategy Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases. Discard duplicates and merge new information.

  16. Duplicates Discarded and New Information Merged

  17. Data Retrieving Strategy Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases. Discard duplicates and merge new information. Send fully merged data downstream for data extraction.
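The discard-and-merge step might look like the sketch below. The slides do not state how duplicates are detected, so this assumes each retrieved page has already been split into record strings and treats two records as duplicates when they match after whitespace and case normalization. `merge_records` and the sample data are hypothetical.

```python
def merge_records(pages):
    """Merge records from several retrieved pages, dropping duplicates.

    pages -- iterable of pages, each an iterable of record strings
    Returns the records in first-seen order.
    """
    merged = []
    seen = set()
    for page in pages:
        for record in page:
            # Assumed duplicate test: same text ignoring case and spacing.
            key = " ".join(record.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


pages = [
    ["Ford  Escort 1995", "Honda Civic 1998"],
    ["ford escort 1995", "Toyota Camry 2000"],
]
print(merge_records(pages))
# ['Ford  Escort 1995', 'Honda Civic 1998', 'Toyota Camry 2000']
```

The merged list is what would be sent downstream for data extraction.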

  18. Conclusions We can automate the data extraction process by automatically: • Filling in Web forms. • Retrieving information behind forms. • Handling errors. • Filtering duplicate data and merging resulting information.
