1 / 15

Automatically Annotating Web Pages Using Google Rich Snippets

Automatically Annotating Web Pages Using Google Rich Snippets.

refugioa
Download Presentation

Automatically Annotating Web Pages Using Google Rich Snippets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatically Annotating Web Pages Using Google Rich Snippets This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM. 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  2. Introduction (1) • Semantically annotating Web pages enhances machine interpretation • Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages • The vocabulary enables interesting applications 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  3. Introduction (2) • Automating annotation for static and 3rd party Web sites is deemed necessary • Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  4. Framework (1) • Four main stages: • Hotspot identification • Subjectivity analysis • Information extraction • Page annotation • Web pages are converted to DOM trees in order to enable easy processing 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  5. Framework (2) RDFa 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  6. Framework (3): Hotspots • Reviews are characterized by large blocks of text: hotspots • Headers, navigation elements, footers, etc., do not contain these blocks • Text blocks have few HTML elements • For each element in the DOM tree, we compute the text-to-content-ratio (TTCR): , with = # textual characters, and = total # characters in DOM 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  7. Framework (4): Hotspots • Illustrative example: • The h1 element contains 64/73 × 100% ≈ 88% text • However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements <h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed </h1> <div> <p> Page <span class="page-number">1</span> of <span class="num-pages">15</span> </p> </div> 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  8. Framework (5): Subjectivity • Hotspots are verified as reviews whenever they are subjective enough • We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009): • Original: check if document has ≥ ksentences that contain ≥ nsubjectivity words each • Modification: check if document has ≥ m percent of all sentences that contain ≥ nsubjectivity words each 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  9. Framework (6): IE • Various information is extracted: • Authors: • Named entities are detected in the vicinity of hotspots • Named Entity Recognizer (NER) • Dates: • Many different date formats are easily parsed • Regular expressions • Products: • Name often found in title and h1 elements • Overlapping words • Ratings: • Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%) • We focus on numerical ratings • Regular expressions on plain text or alt text of images (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) MM ddyyyy ([0-9.,]+)\s?/\s?([0-9.,]+) 4/5 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  10. Framework (7): Annotation • Key elements are tagged using Google Rich Snippets • A new annotated Web page is returned <div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review"> <span property="v:itemreviewed"> Tango Hotel Taichung </span> <span property="v:reviewer">Sarah Lee</span> <span property="v:rating">4 stars</span> <span property="v:dtreviewed">18th December 2008</span> <p property="v:summary"> Boutique like hotel without the boutique price </p> </div> 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  11. Implementation (1) • We have implemented the ARROW framework as a Web application: • Java-based • Apache Tomcat server • Input: • URL • Preferred output: • Visualizer • Annotated document 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  12. Implementation (2) 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  13. Evaluation • Test set: 100 review, 100 non-review Web pages • Sub-second performance • Precision and specificity are good (both ± 90%), while accuracy and recall are varying (± 40% – 60%) • Main problems related to detecting authors, likely caused by the use of nicknames • Dependency on Web site structures 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  14. Conclusions • We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets • Framework not bound to vocabulary • Proof-of-concept implementation shows promising results • Future work: • Improve heuristics • Add intelligent (semantically enabled) text parsers • Extend to other domains, e.g., recipes, videos, etc. 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

  15. Questions http://www.arrow-project.com/ 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

More Related