Using Weight-controlled Token Matching to Extract Data From HTML Files

Yan Xu, Tok Wang Ling. Dept. of Computer Science, National University of Singapore. (xuyan, lingtw)@comp.nus.edu.sg

Outline Outline • Motivation and background • Our approach • Generating the wrapper • Extracting data • Experimental results • Conclusion

Motivation and Background Motivation and Background • What is a wrapper • XML and HTML • Related works • Some criteria to build wrappers for Web pages

Motivation and Background What Is a Wrapper? • A wrapper is a software component. • A wrapper is used to extract data from source files and convert it into a structured form. • On the Web, the source files are usually HTML files.

Motivation and Background What is a Wrapper? (Cont…) • The source files are usually semistructured or unstructured. • We only discuss HTML files as source files in this paper.

Motivation and Background XML and HTML • XML is better suited to organizing data than HTML. HTML is simple, widely used, and widely accepted. • More and more XML sites appear on the Web, but HTML files still far outnumber XML files on the Web

Motivation and Background XML and HTML (Cont…) • Querying XML is easier, and a standard query language is coming. HTML files are usually queried through web search engines. • XML contains no less information than HTML • XML is easier to convert to other data models, especially semistructured data models • XML is suitable as the output of a wrapper • XML provides one possible semantic interpretation of a document.

Motivation and Background Related Works • Construct wrappers for HTML files manually or automatically • Use specification files to extract data, as in the extraction system of TSIMMIS • Advantages: sufficiently expressive, with high precision • Limitations: must be built by experienced programmers and are hard to maintain • Building a wrapper is very time-consuming

Motivation and Background Related Works (Cont…) • Rule-based wrappers • Using rules to extract data • Inducing rules from training examples • Using delimiter-based rules • Such as: WIEN, STALKER, SoftMealy • Our wrapper is rule-based

Motivation and Background Some Criteria to Build Wrappers for Web Pages • Simple and powerful extraction rules • Need fewer examples and less user interaction • Use HTML structure information as much as possible • Easy to maintain and update • Less time to build a wrapper

Our Approach Our Approach • Rule-based wrapper • Use delimiters to identify data • Use training examples to induce rules • Use a Weighted Token List to identify each delimiter • Use rules and a threshold to extract data

Our Approach Our Approach (Cont…) • Weighted token list (WTL): a list of vectors. Each vector contains a set of <token, weight> pairs • Token: an HTML tag, a word, or a punctuation mark in an HTML file • Weight: how important a token is at its position, a number between 0 and 1 • The WTL is generated from labeled examples
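A WTL as defined above can be pictured as a list of positions, each mapping tokens to weights. The following is a minimal sketch under that reading, not the authors' implementation; the names `WTL`, `PositionVector`, and `weight_at` are hypothetical, and the sample weights are illustrative.

```python
from typing import Dict, List

# One position of a WTL: every token seen at that position, with its weight.
PositionVector = Dict[str, float]
# A WTL is an ordered list of positions.
WTL = List[PositionVector]


def weight_at(wtl: WTL, position: int, token: str) -> float:
    """Return the weight of `token` at `position`, or 0.0 if it is not listed."""
    if 0 <= position < len(wtl):
        return wtl[position].get(token, 0.0)
    return 0.0


# Illustrative three-position WTL (values chosen for the example only).
example_wtl: WTL = [
    {"(": 1.0},
    {"paperback": 0.67, "hardcover": 0.33},
    {")": 0.67, "-": 0.33},
]

print(weight_at(example_wtl, 1, "paperback"))
print(weight_at(example_wtl, 1, "ebook"))
```

A dictionary per position keeps lookups cheap and lets one position hold several alternative tokens.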

Our Approach An Example • Part of a page from Amazon.Com

Our Approach An Example (Cont…) • We hope the wrapper could output the below result : TITLE: professional xml (2nd edition) AUTHOR: nikola ozu, et al TYPE: paperback DATE: may 2001 SHIPINFO: usually ships in 24 hours LISTPRICE: $59.99 OURPRICE: $47.99 Save: 20%
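Since XML is named earlier as a suitable wrapper output, the labeled result above could be serialized as XML. The sketch below is only one possible encoding and is not taken from the slides; the function `fields_to_xml`, the element name `book`, and the lowercase child tags are all assumptions.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(fields, item_tag="item"):
    """Serialize (label, value) pairs as child elements of a single item element."""
    item = ET.Element(item_tag)
    for label, value in fields:
        # Assumption: each label becomes a lowercase element name.
        child = ET.SubElement(item, label.lower())
        child.text = value
    return ET.tostring(item, encoding="unicode")


# The fields of the example book, as listed on the slide.
book = [
    ("TITLE", "professional xml (2nd edition)"),
    ("AUTHOR", "nikola ozu, et al"),
    ("TYPE", "paperback"),
    ("DATE", "may 2001"),
    ("SHIPINFO", "usually ships in 24 hours"),
    ("LISTPRICE", "$59.99"),
    ("OURPRICE", "$47.99"),
    ("SAVE", "20%"),
]

xml_text = fields_to_xml(book, item_tag="book")
print(xml_text)
```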

Our Approach An Example (Cont…) • The label information is input by the user • The label gives the meaning of the data, so that we can identify the extracted data; for example, “TITLE”, “AUTHOR”, etc. on the previous page • There are two kinds of users: • The user who builds the wrapper • The user who uses the generated wrapper to extract data

Our Approach An Example (Cont…) • Part of the HTML source code for the author information in the example is: …… </td> <td > <A href="/exec/obidos/ASIN/1861005059/qid=992196848/sr=1-4/ref=sc_b_4/104-9977965-3139126"> Professional XML (2nd Edition) </A> by Nikola Ozu, et al (Paperback – May 2001) ……

Our Approach An Example (Cont…) • If we choose token “by” as the left delimiter of the author data, and HTML tag “” as the right delimiter, we will have high recall but low precision when we try to extract author data. • If we choose a sequence of tokens as the delimiter, for example 5 tokens: • The 5 tokens before the author information: “<a>” “” “ ” “” “by” and the 5 tokens after the author information: “” “(“ “Paperback” “-” “May”

Our Approach An Example (Cont…) • Surveying the entire example page (25 books), we find: • The 5 tokens before the author data do not change, and they are expressive enough to serve as the left delimiter. • The 5 tokens after the author data are not precise enough to serve as the delimiter. For example, for a hardcover book without a publication date, the 5 tokens after the author data are: “” “(“ “Hardcover” “)” “ ” • We will have high precision but low recall

Our Approach An Example (Cont…) • Surveying the example page, we find 6 out of 25 books are hardcover and 18 out of 25 books are paperback. • Using 3 books as training examples, we obtain the following token lists:

Our Approach An Example (Cont…) • The begin weighted token list (the tokens before the author data): {<a>,1.0} {,1.0} { ,1.0} {,1.0} {by,1.0} • The end weighted token list (the tokens after the author data): {,1.0} {(,1.0} {hardcover,0.33} {paperback,0.67} {),0.67} {-,0.33} { ,0.67} {may,0.33}

Our Approach An Example (Cont…) • Each token near the data is associated with a weight at its position • For example, “Paperback, 0.67” means that the token “paperback” is found at this position in 2 out of 3 training examples (i.e. 67%). We assign this frequency (0.67) as the weight of the token at this position

Our Approach An Example (Cont…) • The Weighted Token List to identify the left delimiter is: {by,1.0} {,1.0} { ,1.0} {,1.0} {<a>,1.0}

Our Approach An Example (Cont…) • The Weighted Token List to identify the right delimiter is: {,1.0} {(,1.0} {hardcover,0.33} {paperback,0.67} {),0.67} {-,0.33} { ,0.67} {may,0.33} • Entries that share a position, such as {hardcover,0.33} {paperback,0.67}, mean that two different tokens were found at that position (here, the third position after the author data).

Our Approach An Example (Cont…) • Using a Weighted Token List, we achieve: • A list of tokens as the delimiter • By associating weights with tokens, a better recall-precision tradeoff • Tolerance of small modifications to HTML pages, especially when the modification does not occur near the data

Our Approach Label the Example Page • Using our GUI tool:

Our Approach Label the Example Page (Cont…) • The user highlights the data of interest • The user clicks the “input label” button • A dialog window pops up and the user inputs the label • We insert the label into the HTML file following our specification

Our Approach Label the Example Page (Cont…) • After labeling, the modified HTML file is: … [LABEL:TITLE]Professional XML (2nd Edition)[INFOREND] </a> by [LABEL:AUTHOR]Nikola Ozu, et al[INFOREND] ( [LABEL:TYPE]Paperback[INFOREND] – [LABEL:DATE]May 2001[INFOREND]) … • The user input parts are “TITLE”, “AUTHOR” etc.
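The labeled markup shown above can be read back with a regular expression. This is a sketch that assumes each field is written as `[LABEL:NAME]data[INFOREND]`, as in the slide; the function name `find_labeled_fields` is hypothetical.

```python
import re

# "[LABEL:" + label + "]" + data + "[INFOREND]"; the data is matched non-greedily.
LABELED_FIELD = re.compile(r"\[LABEL:([^\]]+)\](.*?)\[INFOREND\]", re.DOTALL)


def find_labeled_fields(labeled_html):
    """Return (label, data) pairs in the order they occur in the labeled HTML."""
    return [(m.group(1), m.group(2)) for m in LABELED_FIELD.finditer(labeled_html)]


sample = (
    "[LABEL:TITLE]Professional XML (2nd Edition)[INFOREND] </a> by "
    "[LABEL:AUTHOR]Nikola Ozu, et al[INFOREND] ( "
    "[LABEL:TYPE]Paperback[INFOREND] - [LABEL:DATE]May 2001[INFOREND])"
)

fields = find_labeled_fields(sample)
for label, data in fields:
    print(label, "=>", data)
```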

Our Approach An Extraction Rule Has… • Label information • Delimiter information: • Begin WTL (BWTL): a WTL that describes a list of tokens as the begin delimiter • End WTL (EWTL): a WTL that describes a list of tokens as the end delimiter • A rule contains enough information to extract a piece of data

Our Approach A Rule Looks Like: • <LABEL, BWTL, EWTL> • LABEL is “AUTHOR” in our example • BWTL in our example is : {by,1.0} {,1.0} { ,1.0} {,1.0} {<a>,1.0} • EWTL in our example is : {,1.0} {(,1.0} {hardcover,0.33}{paperback,0.67} {),0.67}{-,0.33} { ,0.67} {may,0.33}

Our Approach How to Generate a Rule? • Find the label information after “[LABEL:” and before the next “]” in the examples that the user labeled with our GUI tool • The number of tokens in each delimiter is set to 5 by default. The user can generate rules with this setting and test the result; if the result is not good, the user can set the number manually • Generate the BWTL for the left delimiter • Generate the EWTL for the right delimiter • Assemble the label, BWTL, and EWTL into a rule

Our Approach How to Generate WTL • Find the begin point and end point of the data in the labeled training example • Collect the lists of tokens before and after the data • Use the collected tokens to generate a new WTL, or add them to the corresponding existing WTL, and calculate the weight of each token • The weight of a token at a position is the number of times the token appears at that position near the data, divided by the total number of times all tokens appear at that position near the same data in the training examples
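The weight calculation can be sketched as a per-position frequency count over the training examples. This is an illustration under assumptions, not the authors' code: `build_wtl` is a hypothetical name, weights are rounded to two decimals to match the slides, and `<tag>` stands in for an HTML tag whose name is not given here.

```python
from collections import Counter
from typing import Dict, List


def build_wtl(token_lists: List[List[str]]) -> List[Dict[str, float]]:
    """Build a WTL from equal-length token lists, one list per training example."""
    if not token_lists:
        return []
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("all token lists must have the same length")
    wtl = []
    for position in range(length):
        # Count how often each token occurs at this position across the examples.
        counts = Counter(tokens[position] for tokens in token_lists)
        total = sum(counts.values())
        wtl.append({token: round(count / total, 2) for token, count in counts.items()})
    return wtl


# Three training examples: the five tokens after the author data
# (two paperbacks, one of them dated, and one hardcover).
after_author = [
    ["<tag>", "(", "paperback", "-", "may"],
    ["<tag>", "(", "paperback", ")", " "],
    ["<tag>", "(", "hardcover", ")", " "],
]

ewtl = build_wtl(after_author)
for position in ewtl:
    print(position)
```

With these three examples the third position receives paperback 0.67 and hardcover 0.33, the same weights as in the end weighted token list of the example.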

Our Approach Extract Data Using Rules • Tokenize the target HTML file • Obtain a list of tokens and find the corresponding rule in the rule set • Obtain the data • Associate the label with the data • Output the result
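The tokenization step can be approximated with a regular expression. The slides do not specify the tokenizer, so the choices below are assumptions: tags are reduced to their lowercase name without attributes, words are lowercased, each punctuation mark is its own token, and whitespace is not emitted. The name `tokenize` is hypothetical.

```python
import re

# A tag, a character entity, a word, or a single punctuation character.
TOKEN = re.compile(r"<[^>]*>|&\w+;|\w+|[^\w\s]")
# Captures the optional closing slash and the tag name.
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def tokenize(html):
    """Split HTML text into tag, word, and punctuation tokens."""
    tokens = []
    for raw in TOKEN.findall(html):
        tag = TAG_NAME.match(raw)
        if raw.startswith("<") and tag:
            # Drop attributes: '<A href="...">' becomes '<a>'.
            tokens.append("<" + tag.group(1) + tag.group(2).lower() + ">")
        else:
            tokens.append(raw.lower())
    return tokens


sample = '<A href="/x">Professional XML</A> by Nikola Ozu, et al (Paperback)'
print(tokenize(sample))
```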

Our Approach Find the Corresponding Rule • Obtain a list of tokens from the web page • Find a rule in the rule set such that the given tokens are found in its Weighted Token List and the sum of the weights of the matched tokens is larger than the threshold multiplied by the number of tokens

Our Approach Threshold • The threshold is between 0 and 1 • In our tests, results were usually good with the threshold set between 0.4 and 0.6; we set it to 0.5 by default • The user can test the wrapper and change the threshold
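The matching condition can be written as a weight sum compared against threshold × number of tokens. The sketch below assumes a strict "larger than" comparison, as stated above; the function names and the two short rules (`AUTHOR`, `PRICE`) are invented for illustration and do not come from the slides.

```python
from typing import Dict, List, Optional, Tuple

WTL = List[Dict[str, float]]


def weight_sum(wtl: WTL, tokens: List[str]) -> float:
    """Add up the weight of each token at its own position (0.0 if not listed)."""
    return sum(position.get(token, 0.0) for position, token in zip(wtl, tokens))


def find_rule(rules: List[Tuple[str, WTL]], tokens: List[str],
              threshold: float = 0.5) -> Optional[str]:
    """Return the label of the first rule whose WTL accepts `tokens`, else None."""
    for label, wtl in rules:
        if len(wtl) != len(tokens):
            continue
        if weight_sum(wtl, tokens) > threshold * len(tokens):
            return label
    return None


# Hypothetical two-token begin delimiters, for illustration only.
rules = [
    ("AUTHOR", [{"</a>": 1.0}, {"by": 1.0}]),
    ("PRICE", [{"price": 1.0}, {":": 0.67, "-": 0.33}]),
]

print(find_rule(rules, ["</a>", "by"]))
print(find_rule(rules, ["price", ":"]))
print(find_rule(rules, ["price", "by"]))
```

The last call returns no rule: each WTL scores exactly 1.0, which is not larger than 0.5 × 2.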

Our Approach Another Example • HTML source code from Amazon.com about author data of a book … </a> by Cisco Systems (Editor), Vito Amato (Hardcover) …

Our Approach Another Example (Cont…) • We detect “<a> by” as the left delimiter of the author data; the weight sum is 5, which is larger than 5*0.5 = 2.5 • We detect “(Hardcover) ” as the right delimiter of the author data; the weight sum is about 3.7 (1.0 + 1.0 + 0.33 + 0.67 + 0.67 = 3.67), which is larger than 5*0.5 = 2.5 • The author data lies between the two lists of tokens: “Cisco Systems (Editor), Vito Amato”
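The extraction in this example can be sketched as two window scans over the token stream: one for the begin delimiter, then one for the end delimiter. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, and `<tag1>`, `<tag2>`, `<tag3>` are placeholders for HTML tags whose names are not given here.

```python
def score(wtl, window):
    """Sum the weights of the window's tokens at their positions in the WTL."""
    return sum(position.get(token, 0.0) for position, token in zip(wtl, window))


def find_delimiter(tokens, wtl, threshold, start=0):
    """Return the index of the first window accepted by the WTL, or -1."""
    size = len(wtl)
    for i in range(start, len(tokens) - size + 1):
        if score(wtl, tokens[i:i + size]) > threshold * size:
            return i
    return -1


def extract(tokens, bwtl, ewtl, threshold=0.5):
    """Return the tokens between the begin and end delimiters, or None."""
    begin = find_delimiter(tokens, bwtl, threshold)
    if begin < 0:
        return None
    data_start = begin + len(bwtl)
    end = find_delimiter(tokens, ewtl, threshold, data_start)
    if end < 0:
        return None
    return tokens[data_start:end]


bwtl = [{"<a>": 1.0}, {"<tag1>": 1.0}, {" ": 1.0}, {"<tag2>": 1.0}, {"by": 1.0}]
ewtl = [{"<tag3>": 1.0}, {"(": 1.0}, {"hardcover": 0.33, "paperback": 0.67},
        {")": 0.67, "-": 0.33}, {" ": 0.67, "may": 0.33}]

tokens = ["<a>", "<tag1>", " ", "<tag2>", "by",
          "cisco", "systems", "(", "editor", ")", ",", "vito", "amato",
          "<tag3>", "(", "hardcover", ")", " "]

author = extract(tokens, bwtl, ewtl)
print(author)
```

The "(" and ")" inside the author data do not trigger the end delimiter, because a window containing them reaches a weight sum of at most 1.67, below the required 2.5.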

Result Analysis Result Analysis • We define: • field: a piece of data, the smallest unit that our wrapper can handle. For example, the author data of a book • item: a group of fields, such as all the data of a book • Our wrapper’s training examples are items. For example, an Amazon.com page usually contains information on more than ten books (ten items); only a few of them (3 items), not the entire page, need to be labeled as training examples

Result Analysis Result Analysis (Cont…) • Ten test web sites’ basic information • Java SDK 1.3. PC with Windows NT 4.0 Workstation (Intel PIII 800 / 128 MB RAM)

Source           Size (KB)   No. of items   No. of fields in each item   No. of items as examples
CNN              34          10             4                            3
WorldFact Book   25          1              165                          1
MSN              46          15             3                            3
Film.com         22          10             4                            3
Amazon           99          25             8                            3
Google           17          10             5                            3
ACM DL           12          10             6                            3
Ebay.com         65          50             5                            3
BBC              33          10             4                            3
News.com         43          30             3                            3

Result Analysis Result Analysis (Cont…) • Ten test web sites’ recall-precision table

Source           Recall (%)   Precision (%)   Generation time (s)   Extraction time (s)
CNN              100          100             3                     9
WorldFact Book   100          100             25                    40
MSN              100          100             2                     4
Film.com         100          100             1                     5
Amazon           100          88              10                    43
Google           100          86              3                     9
ACM DL           90           98              1                     2
Ebay.com         100          83.3            3                     19
BBC              100          80              1                     4
News.com         63.3         100             1                     6

Result Analysis Recall and Precision • Recall: 80% of the sites have 100% recall • Precision: 50% of the sites have 100% precision, and all have at least 80% precision • Four sites have 100% in both recall and precision • The recall of News.com is the lowest because News.com’s web pages are assembled from several news and newspaper web sites • The results show the best recall-precision balance. Increasing the number of tokens gives better recall but lower precision; increasing the threshold gives better precision but lower recall.
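The summary percentages can be checked directly against the recall-precision table. The figures below are copied from that table; the variable names are only for this illustration.

```python
# (recall %, precision %) per site, copied from the recall-precision table.
results = {
    "CNN": (100, 100),
    "WorldFact Book": (100, 100),
    "MSN": (100, 100),
    "Film.com": (100, 100),
    "Amazon": (100, 88),
    "Google": (100, 86),
    "ACM DL": (90, 98),
    "Ebay.com": (100, 83.3),
    "BBC": (100, 80),
    "News.com": (63.3, 100),
}

sites = len(results)
full_recall = sum(1 for recall, _ in results.values() if recall == 100)
full_precision = sum(1 for _, precision in results.values() if precision == 100)
full_both = sum(1 for recall, precision in results.values()
                if recall == 100 and precision == 100)
lowest_precision = min(precision for _, precision in results.values())
lowest_recall_site = min(results, key=lambda site: results[site][0])

print(f"100% recall: {full_recall} of {sites} sites")
print(f"100% precision: {full_precision} of {sites} sites")
print(f"100% in both: {full_both} sites")
print(f"lowest precision: {lowest_precision}%")
print(f"lowest recall: {lowest_recall_site}")
```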

Result Analysis Wrapper Generation Time • Wrapper generation time: all but two examples need less than 5 seconds • Labeling time is not included in the wrapper generation time; labels are input with the help of our GUI tool. Labeling time depends on how many items are selected as training examples and how many fields each item contains. Labeling takes less than 10 minutes for every example except the WorldFact Book page

Result Analysis Extraction Time • Data extraction time: 70% of the sites take less than 10 seconds • The extraction time is related to the HTML file size, which is usually not very large • The wrapper generation time and the labeling time are acceptable • The data extraction time is not too long and is tolerable for real-time web applications

Result Analysis Comparison with Other Approaches • We generate wrappers automatically and provide a friendly GUI tool to help users input labels and extract data • Simple and powerful rules that can deal with missing and mis-ordered items in web pages

Result Analysis Comparison with Other Approaches (Cont…) • We need fewer training examples because: • When the HTML file does not have missing or mis-ordered items, we need no more examples than other methods • When there are missing or mis-ordered items, the training examples need not cover every combination of missing and mis-ordered items in the web pages • Quick wrapper generation and the assignment of weights to tokens make maintenance and updates easier

Conclusion Conclusion • Use weighted token list to find and extract data from HTML files. • A friendly GUI tool to generate wrappers easily • Acceptable result

