Web Mining for Unknown Term Translation

Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, whlu@mail.ncku.edu.tw, http://myweb.ncku.edu.tw/~whlu

Presentation Transcript


  1. Web Mining for Unknown Term Translation • Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering • whlu@mail.ncku.edu.tw • http://myweb.ncku.edu.tw/~whlu

  2. Web Mining

  3. Research Problems • Difficulties in automatic construction of multilingual translation lexicons • Techniques: parallel/comparable corpora • Bottleneck: lack of diverse/multilingual resources • Difficulties in query translation for cross-language information retrieval (CLIR) • Techniques: bilingual dictionaries/machine translation/parallel corpora • Bottleneck: queries that are ambiguous (multiple senses), short, diverse, or unknown • Challenges • Web queries are often • Short: 2-3 words (Silverstein et al. 1998) • Diverse: wide-ranging topics • Unknown (out of vocabulary): 74% are unavailable in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries • E.g. • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein) • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infections)

  4. Cross-Language Information Retrieval • Query in the source language and retrieve relevant documents in the target languages • Example source queries: SARS, 愛因斯坦, 老年癡呆症, National Palace Museum • [Figure labels: Source Query, Query Translation, Target Translation, Information Retrieval, Target Documents]

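  5. Difficulties in Web Query Translation Using Machine Translation • English source query: National Palace Museum • Chinese translation produced by machine translation: 全國宮殿博物館 (a literal rendering; the museum's actual Chinese name is 國立故宮博物院)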

  6. Research Paradigm • [Figure labels: New approach; Live Translation Lexicon; Internet; Web Mining; Anchor-Text Mining; Search-Result Mining; Term-Translation Extraction; Applications; Cross-Language Information Retrieval; Cross-Language Web Search]

  7. Multilingual Anchor-Texts

  8. Language-Mixed Texts in Search Result Pages

  9. Anchor-Text Mining with Probabilistic Inference Model Conventional translation model • Asymmetric translation models: • Symmetric model with link information: Co-occurrence Page authority

  10. Transitive Translation Model for Multilingual Translation • Direct Translation Model • Indirect Translation Model • Transitive Translation Model • Notation: s: source term; t: target translation; m: intermediate translation • Example: direct translation from s = 新力 (Traditional Chinese) to t = ソニー (Japanese); indirect translation through m = Sony (English), …
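The transcript names the direct, indirect, and transitive models but does not preserve their formulas. The following Python fragment is only a minimal sketch of one plausible reading: use the direct score when it is strong enough, otherwise sum over intermediate terms the product of the two direct scores. The function name `transitive_score`, the threshold value, the combination rule, and all similarity values are assumptions made for illustration, not the author's definitions.

```python
def transitive_score(direct, s, t, threshold=0.1):
    """Score a candidate translation pair (s, t).

    direct: dict mapping (term_a, term_b) -> symmetric similarity in [0, 1].
    The threshold and the fallback rule are assumptions, not from the slides.
    """
    def sim(a, b):
        # The similarity is treated as symmetric, so look up both orders.
        return direct.get((a, b), direct.get((b, a), 0.0))

    d = sim(s, t)
    if d > threshold:
        return d  # direct translation is trusted

    # Indirect translation: route through every other known term m.
    intermediates = {x for pair in direct for x in pair} - {s, t}
    return sum(sim(s, m) * sim(m, t) for m in intermediates)


# Hypothetical similarity values for the example on the slide.
scores = {
    ("新力", "Sony"): 0.8,
    ("Sony", "ソニー"): 0.7,
    ("新力", "ソニー"): 0.02,  # weak direct evidence
}
via_english = transitive_score(scores, "新力", "ソニー")
direct_only = transitive_score(scores, "新力", "Sony")
```

With these made-up numbers the weak direct link between 新力 and ソニー is replaced by the stronger path through the English term Sony.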

  11. Promising Results for Automatic Construction of Multilingual Translation Lexicons

  12. Search-Result Mining • Goal: Improve translation coverage for diverse queries • Idea • Chi-square test: co-occurrence relation • Context-vector analysis: context information • Context-vector similarity measure • Weighting scheme: TF*IDF • Chi-square similarity measure • 2-way contingency table
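The two similarity measures named on this slide can be sketched as follows. The chi-square statistic uses the standard formula for a 2-way (2x2) contingency table, and the context-vector measure is shown as cosine similarity over TF*IDF weights. The function names, the choice of cosine, the logarithmic IDF, and the counts in the usage lines are assumptions for illustration; the slide does not spell them out.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: pages containing both source term s and candidate t
    b: pages containing s but not t
    c: pages containing t but not s
    d: pages containing neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def tfidf_vector(term_counts, doc_freq, n_docs):
    """Weight each context term by TF * log(N / DF)."""
    return {
        w: tf * math.log(n_docs / doc_freq[w])
        for w, tf in term_counts.items()
        if doc_freq.get(w)
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical page counts: s and t always appear together.
strong = chi_square(10, 0, 0, 10)
# Hypothetical page counts: s and t are independent.
independent = chi_square(5, 5, 5, 5)
```

The chi-square score needs only page counts, while the context-vector score compares the words that surround each term in the returned snippets, so the two can complement each other.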

  13. Workshop on Web Mining Technology and Applications (Dec. 13, 2006) • Panel: Web Mining: Recent Development and Trends • Prof. Vincent S. Tseng (曾新穆), Department of Computer Science and Information Engineering, National Cheng Kung University

  14. Main Categories of Web Mining • Web content mining • Web usage mining • Web structure mining

  15. Web Content Mining • Trends • Deep web mining • Semantic web mining • Vertical search • Web multimedia content mining • Web image/video search • Web image/video annotation/classification/clustering • Web multimedia content filtering • Example: YouTube • Integration with web log mining

  16. Web Usage Mining • Developed techniques • Mining of frequent usage patterns • Association rules, sequential patterns, traversal patterns, etc. • Trends • Personalization • Recommendation • Web Ads • Incorporation of content semantics/ontology • Considerations of Temporality • Extension to mobile web applications • Multidiscipline integration
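As a small illustration of mining frequent usage patterns, the sketch below counts contiguous page-to-page paths across user sessions and keeps those that meet a minimum support. It is a simplified stand-in for the traversal-pattern techniques the slide lists, not a specific published algorithm; the function name `frequent_paths` and the session data are hypothetical.

```python
from collections import Counter


def frequent_paths(sessions, length=2, min_support=2):
    """Return contiguous paths of the given length seen in at least
    min_support sessions, mapped to their session counts."""
    counts = Counter()
    for session in sessions:
        # Use a set so each path is counted once per session.
        seen = {
            tuple(session[i:i + length])
            for i in range(len(session) - length + 1)
        }
        counts.update(seen)
    return {path: c for path, c in counts.items() if c >= min_support}


# Hypothetical clickstream sessions (ordered page visits).
sessions = [
    ["home", "search", "product"],
    ["home", "search", "cart"],
    ["search", "product", "cart"],
]
patterns = frequent_paths(sessions)
```

Counting support per session rather than per occurrence keeps one heavy user from dominating the result.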

  17. Problems: Under-utilization of Clickstream Data • Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004 • Top sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com • Aberdeen Group: • 70% of site companies use clickstream data only for basic website management!

  18. Challenges for Clickstream Data Mining (Arun Sen et al., Communications of the ACM, Nov. 2006) • Problems with data • Data incompleteness • Very large data size • Messiness in the data • Integration problems with enterprise data • Too many analytical methodologies • Web metric-based methodologies • Basic marketing metric-based methodologies • Navigation-based methodologies • Traffic-based methodologies • Data analysis problems • Across-dimension analysis problems • Timeliness of data mining under very large data size • Determination of useful/actionable analysis under thousands of metrics

  19. Web Information Extraction: The Issues for Unsupervised Approaches • Dr. Chia-Hui Chang (張嘉惠), Department of Computer Science and Information Engineering, National Central University, Taiwan • (Talk given at the 2006 Workshop on Web Mining Technology and Trends, 網路探勘技術與趨勢研討會)

  20. Outline • Web Information Extraction • The key to web information integration • Three Dimensions • Task definition • Automation degree • Technology • Focused on Template Pages IE task • Issues for record-level IE • Techniques for solving these issues

  21. Introduction • The coverage of Web information is very wide and diverse • The Web has changed the way we obtain information. • Information search on the Web is no longer enough. • The need for Web information integration is stronger than ever (for both businesses and individuals). • Understanding Web pages and discovering valuable information from them is called Web content mining. • Information extraction is one of the keys to Web content mining.

  22. Web Information Integration • From information search to information extraction, to information mapping • Focused crawling / Web page gathering • Information search • Information (Data) extraction • Discovering structured information from input • Schema matching • With a unified interface / single ontology

  23. Three Dimensions to See IE • Task Definition • Input (Unstructured free texts, semi-structured Web pages) • Output Targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, annotation-based or annotation-free approaches • Techniques • Learning algorithm: specific/general to general/specific • Rule type: regular expression rules vs logic rules • Deterministic finite-state transducer vs probabilistic hidden Markov models

  24. IE from Nearly-structured Documents Google search result Multiple-records Web page

  25. IE from Nearly-structured Documents Single-record Pages Amazon.com book pages

  26. IE from Semi-structured Documents Ungrammatical snippets A publication list Selected articles

  27. Information Extraction From Free Texts • Filling slots in a database from sub-segments of text. • Source text: October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… • Named entity extraction (IE) output, as NAME / TITLE / ORGANIZATION: Bill Gates / CEO / Microsoft; Bill Veghte / VP / Microsoft; Richard Stallman / founder / Free Soft.. • [Excerpted from Cohen & McCallum's talk]

  28. Information Extraction From Free Texts • As a family of techniques: Information Extraction = segmentation + classification + association + clustering • Applied to the same news excerpt (October 14, 2002, 4:00 a.m. PT), the segments found are: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation • Resulting records, as NAME / TITLE / ORGANIZATION: Bill Gates / CEO / Microsoft; Bill Veghte / VP / Microsoft; Richard Stallman / founder / Free Soft.. • [Excerpted from Cohen & McCallum's talk]

  29. Dimension 1: Task Definition - Input

  30. Dimension 1: Task Definition - Output • Attribute level (single-slot) • Named entity extraction, concept annotation • Record level • Relation between slots • Page level • All data embedded in a dynamic page • Site level • All information about a web site

  31. Template Page Generation & Extraction • Generation/Encoding: a CGI program combines a template T with data x from a database to produce the output pages, CGI(T, x) • Extraction/Decoding: a reverse engineering of the generation process
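The encode/decode view of template pages can be sketched in a few lines. Generation fills a template with a record; extraction reverses this by comparing two generated pages and treating the tokens that differ as data. This is a deliberately simplified illustration, assuming whitespace-separated tokens, single-token field values, and pages of equal length. The function names and the sample template are hypothetical and do not reproduce any published system.

```python
def generate(template, record):
    """Encoding: fill the template's named slots with one record."""
    return template.format(**record)


def infer_template(page_a, page_b, slot="*"):
    """Decoding, step 1: tokens shared by both pages are template,
    tokens that differ are marked as data slots."""
    tokens_a, tokens_b = page_a.split(), page_b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("this sketch only aligns pages of equal length")
    return [x if x == y else slot for x, y in zip(tokens_a, tokens_b)]


def extract(template_tokens, page, slot="*"):
    """Decoding, step 2: read the data tokens found at slot positions."""
    return [
        token
        for tpl, token in zip(template_tokens, page.split())
        if tpl == slot
    ]


# Hypothetical template and records.
template = "<li> <b> {title} </b> {price} </li>"
page_a = generate(template, {"title": "Dune", "price": "$9"})
page_b = generate(template, {"title": "Emma", "price": "$7"})

learned = infer_template(page_a, page_b)
fields = extract(learned, page_a)
```

Real pages need a proper alignment step, since records vary in length and may contain optional or repeated fields, but the principle of separating shared template tokens from varying data tokens is the same.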

  32. Dimension 2: Automation Degree • Programming-based • For programmers • Supervised learning • A bunch of labeled examples • Semi-supervised learning/Active learning • Interactive wrapper induction • Unsupervised learning • Mostly for template pages only

  33. Tasks vs. Automation Degree • High Automation Degree (Unsupervised) • Template page IE • Semi-Automatic / Interactive • Semi-structured document IE • Low Automation Degree (Supervised) • Free text IE

  34. Dimension 3: Technologies • Learning Technology • Supervised: rule generalization, hypothesis testing, statistical modeling • Unsupervised learning: pattern mining, clustering • Features used • Plain text information: tokens, token class, etc. • HTML information: DOM tree path, sibling, etc. • Visual information: font, style, position, etc. • Rule Types (Expressiveness of the rules) • Regular expression, first-order logic rules, HMM model

  35. Issues for Unsupervised Approaches • For Record-level Extraction • Data-rich Section Discovery • Record Boundary (Separator) Mining • Schema Detection & Data Annotation • For Page-level Extraction • Schema Detection - differentiate template from data tokens

  36. [Figure labels: Data-Rich Section; Record Boundary; Attribute]

  37. Some Related Works on Unsupervised Approaches • Record-level • IEPAD [Chang and Liu, WWW2001] • DeLa [Wang and Lochovsky, WWW2003] • DEPTA [Zhai and Liu, WWW2005] • ViPER [Simon and Lausen, CIKM 2005] • ViNT [Zhao et al., WWW 2005] • Page-level • Roadrunner [Crescenzi, VLDB2001] • EXALG [Arasu and Garcia-Molina, SIGMOD2003] • MSE [Zhao et al., VLDB 2006]

  38. Issue 1: Data-Rich Section Discovery • Comparing a normal page with a no-result page • Comparing two normal pages • Locate static text lines, e.g. Books, Related Searches, Narrow or Expand Results, Showing Results … • ViNT [Zhao et al., WWW2005] • MSE [Zhao et al., VLDB2006]

  39. Issue 1: Data-Rich Section Discovery (Cont.) • HL(R) [Papadakis et al., SAINT2005] • Similarity between two adjacent leaf nodes • 1-dimension clustering • Pitch estimation

  40. Issue 2: Record Boundary Mining • Approaches: String Pattern Mining, Tree Pattern Mining • Example 1: <html><body><b>T</b><ol> <li><b>T</b>T<b>T</b>T</li> <li><b>T</b>T<b>T</b></li> </ol></body></html> (IEPAD [Chang and Liu, WWW2001]) • Example 2: <P><A>T</A><A>T</A> T</P><P><A>T</A>T</P> <P><A>T</A>T</P> <P><A>T</A>T</P> (DeLa [Wang and Lochovsky, WWW2003]; DEPTA [Zhai and Liu, WWW2005])
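String pattern mining for record boundaries can be illustrated on the first example from this slide: encode the page as a sequence of tag tokens, with every text run abstracted to T, and look for the longest token sequence that repeats. The brute-force search below is a stand-in chosen for brevity; the cited systems use more elaborate machinery. The function names and the regular expression are assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(html):
    """Split a page into tag tokens; any text run becomes the token 'T'."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p if p.startswith("<") else "T" for p in parts if p.strip()]


def longest_repeat(tokens, min_count=2):
    """Return (pattern, count) for the longest contiguous token sequence
    that occurs at least min_count times, or (None, 0) if none exists."""
    for length in range(len(tokens) - 1, 0, -1):
        counts = Counter(
            tuple(tokens[i:i + length])
            for i in range(len(tokens) - length + 1)
        )
        for pattern, count in counts.items():
            if count >= min_count:
                return pattern, count
    return None, 0


page = (
    "<html><body><b>T</b><ol>"
    "<li><b>T</b>T<b>T</b>T</li>"
    "<li><b>T</b>T<b>T</b></li>"
    "</ol></body></html>"
)
pattern, count = longest_repeat(tokenize(page))
```

On this input the repeated pattern starts at each `<li>` tag, which marks the two list items as candidate records even though the second item lacks the trailing text field.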
