Multimedia Information extraction from HTML product catalogues

Multimedia Information extraction from HTML product catalogues. Martin Labský (1), Vojtěch Svátek (1), Pavel Praks (2), Ondřej Šváb (1). {labsky, svatek, xsvao06}@vse.cz, pavel.praks@vsb.cz. rainbow.vse.cz. (1) Dept. of Information and Knowledge Engineering, Prague University of Economics

Presentation Transcript


  1. Multimedia Information extraction from HTML product catalogues
     Martin Labský (1), Vojtěch Svátek (1), Pavel Praks (2), Ondřej Šváb (1)
     {labsky, svatek, xsvao06}@vse.cz, pavel.praks@vsb.cz
     rainbow.vse.cz
     (1) Dept. of Information and Knowledge Engineering, Prague University of Economics
     (2) Dept. of Applied Mathematics, Technical University of Ostrava
     DATESO, April 14th 2005

  2. Agenda
     • Information Extraction from Internet
     • Annotation using Hidden Markov Models
     • Extracting images
     • Instance composition guided by ontology
     • Bicycle search application

  3. IE from Internet
     • Motivation
       • Semantic and structured search over large document collections (e.g. searching for objects of type Bicycle in the price range €500-€900)
     • Requirements
       • Identify relevant documents
       • Perform automatic IE: find structures (name, price, equipment)
       • Documents are semi-structured, with heterogeneous layouts and formatting

  4. IE from Internet: Our approach to IE
     [Diagram: processing pipeline]
     • Acquire new document (HTML)
     • Preprocessing: the document becomes a token sequence w1 w2 ... wn
     • Annotation using HMMs: tokens are labelled as name, price or picture (in the example, w3 w4 = name, w6 w7 = price, w9 = picture)
     • Instance extraction: a Bicycle offer instance with name = w3 w4, price = w6 w7, picture = w9

  5. IE from Internet: Relevant documents

  6. Agenda
     • Information Extraction from Internet
     • Annotation using Hidden Markov Models
     • Extracting images
     • Instance composition guided by ontology
     • Bicycle search application

  7. Annotation using HMMs: Preprocessing
     • HTML cleanup
       • conversion to valid XHTML
     • Only potentially relevant blocks kept
       • blocks that do not directly contain text or images omitted
     • Formatting tags
       • attributes removed
       • several rules matching common constructions (add-to-basket form, choose-amount button)
     • Images
       • baseline: all images treated as a single token

  8. Annotation using HMMs: Preprocessing example

     Original HTML:

       <p align="center"><a href="/products.php?plid=m1b0s1p979">
       <img src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70 alt="TREK Session 77" border=0><br>
       TREK Session 77</a><br>
       (2005)<br>
       OUR PRICE £3000.00
       <form method=post action=/products.php?plid=m1b0s1p0 name=buyit>
       <input type=hidden name=cartadditem id=cartadditem value=979>
       <select name="selected_size" id="selected_size">
       <option value="size not specified">-- Select Size --</option>
       <option value="15.5">15.5</option>
       <option value=" 17.5"> 17.5</option>
       <option value=" 19"> 19</option>
       </select><br>
       <input type="hidden" name="selected_colour" id="selected_colour" value="default">
       <select name=add_qty id=add_qty><option value=0>0</option><option value=1 SELECTED>1</option><option value=2>2</option><option value=3>3</option><option value=4>4</option><option value=5>5</option></select>
       <input type=submit name=submit id=submit value="Add to Basket"></form>

     After preprocessing:

       <p> <img/> <br/> TREK Session 77 <br/> ( 2005 ) <br/> OUR PRICE &pound; 3000 . 00
       <p> - - Select Size - - 15 . 5 17 . 5 19 <br/> <_CHOOSEAMOUNT/> <_ADDTOBASKET/>
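
The cleanup illustrated by this example can be sketched roughly as follows. This is a simplified illustration, not the authors' preprocessor: the class `CatalogueTokenizer`, the function `tokenize`, the set of kept tags, and the two matching rules (a `<select>` whose name contains `qty`, a submit button whose value mentions "basket") are all assumptions made for the sketch. Unlike the slide, it emits the decoded character for `&pound;` rather than the entity.

```python
import re
from html.parser import HTMLParser

# Words, or single punctuation marks, so "3000.00" becomes "3000", ".", "00".
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


class CatalogueTokenizer(HTMLParser):
    """Flattens raw HTML into a token list (hypothetical simplification)."""

    KEPT_TAGS = {"p", "br"}  # assumed subset of formatting tags worth keeping

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self.skip_depth = 0  # > 0 while inside a <select> replaced by a rule

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skip_depth:
            if tag == "select":
                self.skip_depth += 1
            return
        if tag == "img":
            # Baseline from the slides: every image is the same token.
            self.tokens.append("<img/>")
        elif tag == "select" and "qty" in (attrs.get("name") or ""):
            self.tokens.append("<_CHOOSEAMOUNT/>")
            self.skip_depth = 1
        elif (tag == "input" and attrs.get("type") == "submit"
              and "basket" in (attrs.get("value") or "").lower()):
            self.tokens.append("<_ADDTOBASKET/>")
        elif tag in self.KEPT_TAGS:
            # Attributes are dropped; only the bare tag survives.
            self.tokens.append("<br/>" if tag == "br" else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag == "select" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.tokens.extend(TOKEN_RE.findall(data))


def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()  # flushes any buffered trailing text
    return parser.tokens
```

Calling `tokenize` on a fragment of a product page returns a flat list in which markup attributes are gone and the shop widgets are reduced to single placeholder tokens, which is the form the HMM annotator consumes.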

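  9. Annotation using HMMs: Document modeling using HMMs
     • Generative model
     • Document = $[w_1 c_1]\,[w_2 c_2]$, where each word $w_k$ is paired with its class $c_k$
     • $P([w_1 c_1]\,[w_2 c_2]) = P(c_1)\,P(c_2 \mid c_1)\,P(w_1 \mid c_1)\,P(w_2 \mid c_2)$
     • $c_1 c_2 = \arg\max_{i,j} P([w_1 c_i]\,[w_2 c_j])$
     • Transition probabilities, e.g. $P(c_2 \mid c_1)$, and lexical probabilities, e.g. $P(w_1 \mid c_1)$, are estimated from training data (frequencies)
     [Diagram: states c1 and c2 linked by transitions P(c2|c1) and P(c1|c2), with lexical probabilities P(w1|c1) and P(w1|c2)]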
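
The two-word argmax on this slide generalises to documents of any length. The slides do not say which decoding algorithm was used; the sketch below uses Viterbi decoding, the standard choice for first-order HMMs. The function name `viterbi`, the probability tables and the `floor` value for unseen events are illustrative assumptions, not values from the talk.

```python
import math


def viterbi(words, classes, start_p, trans_p, lex_p, floor=1e-6):
    """Return the most probable class sequence for `words` under a first-order HMM.

    start_p[c]       ~ P(c) for the first word
    trans_p[(c1,c2)] ~ P(c2 | c1), the transition probability
    lex_p[(w, c)]    ~ P(w | c), the lexical probability
    Unseen events get the small probability `floor` (an assumed smoothing).
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[c] = (log-probability of the best path ending in class c, that path)
    best = {c: (lp(start_p, c) + lp(lex_p, (words[0], c)), [c]) for c in classes}
    for w in words[1:]:
        step = {}
        for c in classes:
            score, path = max(
                (best[prev][0] + lp(trans_p, (prev, c)), best[prev][1])
                for prev in classes
            )
            step[c] = (score + lp(lex_p, (w, c)), path + [c])
        best = step
    return max(best.values())[1]


# Toy model with made-up numbers, for illustration only.
classes = ["bg", "name", "price"]
start_p = {"bg": 0.8, "name": 0.1, "price": 0.1}
trans_p = {("bg", "bg"): 0.6, ("bg", "name"): 0.2, ("bg", "price"): 0.2,
           ("name", "name"): 0.6, ("name", "bg"): 0.3, ("name", "price"): 0.1,
           ("price", "price"): 0.5, ("price", "bg"): 0.5}
lex_p = {("TREK", "name"): 0.5, ("Session", "name"): 0.3,
         ("OUR", "bg"): 0.3, ("PRICE", "bg"): 0.3, ("3000", "price"): 0.5}

labels = viterbi(["TREK", "Session", "OUR", "PRICE", "3000"],
                 classes, start_p, trans_p, lex_p)
```

Working in log space avoids numerical underflow on long documents, since the joint probability is a product of many small factors.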

  10. Annotation using HMMs: HMM Structure
      • States
        • adopted from [Freitag, McCallum 99]
        • Target, Prefix, Suffix and Background
        • densely connected
      • Class trigram model
        • P(name | name_prefix, name)
      • Variations
        • word n-gram models for lexical probabilities of target states, P(w_i | w_{i-1}, name)
        • state substructures instead of single target states, learned by EM
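
A class trigram model conditions each class on the two preceding classes. A minimal sketch of estimating such probabilities by relative frequency is given below; it assumes that in P(name | name_prefix, name) the conditioning context lists the two previous classes in order, and it applies no smoothing. The function name `trigram_transitions` and the start symbol `<s>` are hypothetical.

```python
from collections import Counter


def trigram_transitions(label_sequences):
    """Estimate P(c3 | c1, c2) by relative frequency from labelled documents."""
    tri, bi = Counter(), Counter()
    for labels in label_sequences:
        # Pad with two start symbols so the first classes also have a context.
        padded = ["<s>", "<s>"] + list(labels)
        for c1, c2, c3 in zip(padded, padded[1:], padded[2:]):
            tri[(c1, c2, c3)] += 1
            bi[(c1, c2)] += 1
    return {key: n / bi[key[:2]] for key, n in tri.items()}


# Two tiny made-up training sequences.
probs = trigram_transitions([
    ["bg", "name_prefix", "name", "name", "bg"],
    ["name_prefix", "name", "bg"],
])
```

Here `probs[("name_prefix", "name", "name")]` is the estimate for "a second name token follows a prefix and a first name token"; a real system would smooth these counts to handle unseen contexts.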

  11. Agenda
      • Information Extraction from Internet
      • Annotation using Hidden Markov Models
      • Extracting images
      • Instance composition guided by ontology
      • Bicycle search application

  12. Extracting Images
      • Baseline
        • every image represented by the same <img/> token
        • HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)
      • Use image classifier to preprocess images
        • classifies into 3 classes: Pos, Neg, Unk
        • before HMM annotation, each image occurrence in the document is substituted by its class
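
The substitution step described above can be sketched as a pass over the token sequence. The class-specific token names (`<img_pos/>` and so on) and the function `substitute_image_classes` are assumptions for illustration; the slides do not specify how the classes are encoded.

```python
def substitute_image_classes(tokens, image_classes):
    """Replace each generic <img/> token by a token naming its predicted class.

    `image_classes` holds one of "Pos", "Neg", "Unk" per image, in document
    order; images without a prediction fall back to "Unk".
    """
    classes = iter(image_classes)
    out = []
    for token in tokens:
        if token == "<img/>":
            out.append(f"<img_{next(classes, 'Unk').lower()}/>")
        else:
            out.append(token)
    return out


annotated = substitute_image_classes(
    ["<img/>", "TREK", "Session", "77", "<img/>"], ["Neg", "Pos"]
)
```

With distinct tokens per class, the HMM's lexical probabilities can separate likely product pictures from decorative images instead of relying on context alone.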

  13. Extracting Images: Image Classification Features
      • Image size
        • estimated 2-dimensional normal distribution from a set of 1000 unique bicycle images → NC(x, y)
        • estimated decision threshold (1-feature binary classifier) using a held-out set of 150 images (60% positive)
      • Image similarity
        • latent semantic similarity [Praks 2004] → sim(I1, I2)
        • estimated decision threshold for a 1-feature binary classifier
      • Does the image repeat in the document?
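
The image-size feature can be sketched as fitting a two-dimensional normal distribution to (width, height) pairs and thresholding its density. The sketch below uses maximum-likelihood estimates and picks the threshold that maximises held-out accuracy; both choices, the function names, and the sample sizes are assumptions, since the slides give only the general idea.

```python
import math


def fit_gaussian_2d(points):
    """Maximum-likelihood mean and covariance of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sxx, syy, sxy


def density(model, x, y):
    """Density of the fitted normal at (x, y); plays the role of NC(x, y)."""
    mx, my, sxx, syy, sxy = model
    det = sxx * syy - sxy ** 2
    dx, dy = x - mx, y - my
    # Squared Mahalanobis distance via the closed-form 2x2 matrix inverse.
    m = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))


def best_threshold(scores, labels):
    """Pick the cut-off on held-out scores that maximises accuracy."""
    def accuracy(t):
        return sum((s >= t) == y for s, y in zip(scores, labels)) / len(scores)
    return max(sorted(set(scores)), key=accuracy)


# Made-up image sizes in pixels, for illustration only.
model = fit_gaussian_2d([(100, 70), (120, 80), (110, 75), (130, 95), (90, 60)])
threshold = best_threshold([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
```

An image is then treated as a likely product picture when `density(model, width, height)` is at or above the chosen threshold; a 16x16 icon, for example, scores far lower than an image near the mean size.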

  14. Extracting Images: Image Classification
      • Combined binary classifier
        • Multi-layer perceptron (Weka)
        • Features: NC(x, y), simC(I), repeats(I)
      • Performance of binary classifiers
        • 10-fold cross-validation, document-level folds

  15. Extracting Images: Annotation Results
      • Combined ternary classifier
        • outputs Pos, Unk or Neg
        • decision list based on predictions of all 3 single-feature ternary classifiers
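
A decision list checks rules in order and returns the outcome of the first rule that fires. The slides do not give the actual rules, so the two rules below are invented purely to show the mechanism; only the three inputs (size, similarity, repetition) and the three outputs come from the talk.

```python
def combine(size, similarity, repeats):
    """Combine three single-feature ternary predictions with a decision list.

    Each argument is "Pos", "Neg" or "Unk". The rules are hypothetical.
    """
    votes = (size, similarity, repeats)
    rules = [
        # Hypothetical rule 1: two or more positives and no objection.
        (lambda v: v.count("Pos") >= 2 and "Neg" not in v, "Pos"),
        # Hypothetical rule 2: two or more negatives and no support.
        (lambda v: v.count("Neg") >= 2 and "Pos" not in v, "Neg"),
    ]
    for condition, outcome in rules:
        if condition(votes):
            return outcome
    return "Unk"  # default when no rule fires
```

Falling back to Unk when the single-feature classifiers disagree leaves the final decision to the HMM, which can still use the surrounding context.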

  16. Agenda
      • Information Extraction from Internet
      • Annotation using Hidden Markov Models
      • Extracting images
      • Instance composition guided by ontology
      • Bicycle search application

  17. Instance Composition
      [Diagram: Document annotated by HMM → Instance extraction algorithm (using the presentation ontology) → Instances (XML) → Sesame RDF repository]

  18. Instance Composition: Presentation Ontology
      [Diagram labels: Presentation ontology, Domain ontology]

  19. Instance Composition: Instance extraction algorithm
      • Sequentially parses the annotated document
      • Adds annotated attributes to the working instance WI
      • If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent.
      • Demo: http://eso.vse.cz/~labsky/cgi-bin/client/

      Pseudocode:

        WI = empty_instance;
        while (more_attributes) {
          A = next_attribute;
          if (cannot_add (WI, A)) {
            if (consistent (WI)) {
              store (WI);
            }
            WI = empty_instance;
          }
          add (WI, A);
        }
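
The pseudocode above can be turned into a runnable sketch once `cannot_add` and `consistent` are given concrete meanings. Here both are stand-ins: `MAX_CARD` and `REQUIRED` are hypothetical constraints playing the role of the presentation ontology, and the final flush after the loop is an addition that the slide's pseudocode does not show.

```python
MAX_CARD = {"name": 1, "price": 1, "picture": 1}  # hypothetical cardinalities
REQUIRED = {"name", "price"}                      # hypothetical consistency rule


def cannot_add(instance, attribute):
    """True if the instance already holds the maximum count of this attribute."""
    name, _ = attribute
    count = sum(1 for n, _ in instance if n == name)
    return count >= MAX_CARD.get(name, 1)


def consistent(instance):
    """True if every required attribute is present."""
    return REQUIRED <= {n for n, _ in instance}


def extract_instances(attributes):
    """Group a stream of (attribute, value) annotations into instances."""
    stored, working = [], []
    for attribute in attributes:
        if cannot_add(working, attribute):
            if consistent(working):
                stored.append(working)
            working = []
        working.append(attribute)
    # Assumed extra step: keep the last working instance if it is consistent.
    if consistent(working):
        stored.append(working)
    return stored


offers = extract_instances([
    ("name", "TREK Session 77"), ("price", "3000.00"), ("picture", "a.jpg"),
    ("name", "Other bike"), ("picture", "b.jpg"),  # no price: dropped
    ("name", "Third bike"), ("price", "500.00"),
])
```

In this run the second group is discarded because it never receives a price, which mirrors the rule that an old working instance is saved only if it is consistent.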

  20. Agenda
      • Information Extraction from Internet
      • Annotation using Hidden Markov Models
      • Extracting images
      • Instance composition guided by ontology
      • Bicycle search application

  21. Bicycle search application, powered by Sesame RDF DB
      http://rainbow.vse.cz:8000/sesame/

  22. Future work
      • Learn to correct annotation errors
        • use document structure to detect unlabeled attributes
        • bootstrap from these new examples
        • use ontology constraints on values (types, lists, regexps)
      • Population algorithm
        • utilize scores for each annotated attribute
        • augment presentation ontology with frequencies of attribute orderings
        • use approximate name matching to identify instances
      • Improve search interface
        • approximate name matching (word and char edit distance)
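
Word-level and character-level edit distance, mentioned here for approximate name matching, can share one implementation if the function accepts any sequence. The sketch below is the standard Levenshtein dynamic program with unit costs; the cost scheme and the function name `edit_distance` are assumptions, as the slides list this only as future work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


char_level = edit_distance("kitten", "sitting")
word_level = edit_distance("trek session 77".split(),
                           "trek session 88 2005".split())
```

Passing strings compares names character by character, which tolerates typos, while passing token lists compares them word by word, which tolerates added or changed model-year tokens.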

  23. Thank you!
      rainbow.vse.cz
