Web Mining for Unknown Term Translation

Wen-Hsiang Lu (盧文祥) • Department of Computer Science and Information Engineering • whlu@mail.ncku.edu.tw • http://myweb.ncku.edu.tw/~whlu

Web Mining

Research Problems • Difficulties in automatic construction of multilingual translation lexicons • Techniques: Parallel/comparable corpora • Bottlenecks: Lack of diverse/multilingual resources • Difficulties in query translation for cross-language information retrieval (CLIR) • Techniques: Bilingual dictionary/machine translation/parallel corpora • Bottlenecks: Multiple-sense/short/diverse/unknown queries • Challenges • Web queries are often • Short: 2-3 words (Silverstein et al. 1998) • Diverse: wide-ranging topics • Unknown (out of vocabulary): 74% are unavailable in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries • E.g. • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein) • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (Nosocomial infections)

Cross-Language Information Retrieval • Query in a source language and retrieve relevant documents in target languages • Example queries: SARS, 愛因斯坦 (Einstein), 老年癡呆症 (senile dementia), National Palace Museum • Pipeline: Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents

Difficulties in Web Query Translation Using Machine Translation • English source query: National Palace Museum • Chinese translation by machine translation: 全國宮殿博物館, a word-by-word rendering; the museum's actual Chinese name is 國立故宮博物院

Research Paradigm • Source: Internet • New approach: Web Mining (Anchor-Text Mining, Search-Result Mining) for Term-Translation Extraction, producing a Live Translation Lexicon • Applications: Cross-Language Information Retrieval, Cross-Language Web Search

Multilingual Anchor-Texts

Language-Mixed Texts in Search Result Pages

Anchor-Text Mining with Probabilistic Inference Model • Conventional translation model • Asymmetric translation models • Symmetric model with link information: co-occurrence, page authority
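
The formulas for this slide are not present in the transcript. As a rough illustration only, the sketch below shows one way a symmetric score could combine the two ingredients the slide names: co-occurrence of two terms in anchor texts pointing to the same page, and page authority. It assumes authority is a page's share of all in-links and that the two terms are independent given the page; the function name `symmetric_score`, the toy URLs, and the anchor texts are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

def symmetric_score(anchors, s, t):
    """Jaccard-style score for terms s and t over anchor texts.

    anchors: list of (target_url, anchor_text) pairs, one per in-link.
    Assumption: page authority P(u) = in-links of u / all in-links.
    Assumption: P(s and t | u) is approximated by P(s | u) * P(t | u).
    """
    by_page = defaultdict(list)
    for url, text in anchors:
        by_page[url].append(text)

    total_links = len(anchors)
    numerator = 0.0    # weighted probability that both terms occur
    denominator = 0.0  # weighted probability that either term occurs
    for texts in by_page.values():
        p_u = len(texts) / total_links
        p_s = sum(s in x for x in texts) / len(texts)
        p_t = sum(t in x for x in texts) / len(texts)
        joint = p_s * p_t
        numerator += joint * p_u
        denominator += (p_s + p_t - joint) * p_u
    return numerator / denominator if denominator > 0 else 0.0

# Toy in-link data (hypothetical).
anchors = [
    ("http://www.sony.example/", "Sony"),
    ("http://www.sony.example/", "新力"),
    ("http://www.sony.example/", "Sony Corporation"),
    ("http://www.other.example/", "新聞"),
]
score = symmetric_score(anchors, "新力", "Sony")
```

Because the score is a ratio of "both" to "either", swapping `s` and `t` gives the same value, which is the sense in which such a model is symmetric.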

Transitive Translation Model for Multilingual Translation • Direct Translation Model • Indirect Translation Model • Transitive Translation Model • Notation: s: source term; t: target translation; m: intermediate translation • Example: direct translation from s = 新力 (Traditional Chinese) to t = ソニー (Japanese); indirect translation through m = Sony (English)
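
A minimal sketch of how the direct and indirect models might be combined, using the slide's s, t, m notation. The combination rule is an assumption: the direct score is used when it exceeds a threshold, otherwise the score is summed over intermediate terms m as the product of the two direct scores. The function name, the threshold value, and the toy scores are hypothetical.

```python
def transitive_score(s, t, direct, intermediates, threshold=0.1):
    """Score the translation pair (s, t).

    direct: dict mapping (term_a, term_b) to a direct similarity score.
    intermediates: candidate intermediate terms m (e.g. English terms).
    Assumption: fall back to the indirect path only when the direct
    score is at or below `threshold`.
    """
    direct_score = direct.get((s, t), 0.0)
    if direct_score > threshold:
        return direct_score
    # Indirect translation: go through each intermediate term m.
    return sum(
        direct.get((s, m), 0.0) * direct.get((m, t), 0.0)
        for m in intermediates
    )

# Toy scores (hypothetical): the direct Chinese-Japanese evidence is weak,
# but both terms are strongly linked to the English term.
direct = {
    ("新力", "Sony"): 0.8,
    ("Sony", "ソニー"): 0.7,
    ("新力", "ソニー"): 0.02,
}
score = transitive_score("新力", "ソニー", direct, ["Sony"])
```

With these numbers the weak direct score is bypassed and the pair is scored through the English intermediate.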

Promising Results for Automatic Construction of Multilingual Translation Lexicons

Search-Result Mining • Goal: Improve translation coverage for diverse queries • Idea • Chi-square test: co-occurrence relation • Context-vector analysis: context information • Context-vector similarity measure • Weighting scheme: TF*IDF • Chi-square similarity measure • 2-way contingency table
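
The two measures listed above can be sketched as follows. The chi-square function uses the standard statistic for a 2-way (2x2) contingency table; the context-vector functions use TF*IDF weights with cosine similarity. How the counts are obtained from search results, and the assumption that both context vectors already share one feature space, are not stated on the slide; the function and parameter names are illustrative.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: pages containing both s and t    b: pages containing s but not t
    c: pages containing t but not s     d: pages containing neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """Weight each context word by TF * IDF, with IDF = log(N / df)."""
    return {
        w: tf * math.log(n_docs / doc_freqs[w])
        for w, tf in term_freqs.items()
        if doc_freqs.get(w)
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The chi-square measure needs only page counts, while the context-vector measure needs the text surrounding each term.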

Workshop on Web Mining Technology and Applications (Dec. 13, 2006) • Panel: Web Mining: Recent Development and Trends • Prof. Vincent S. Tseng (曾新穆), Department of Computer Science and Information Engineering, National Cheng Kung University

Main Categories of Web Mining • Web content mining • Web usage mining • Web structure mining

Web Content Mining • Trends • Deep web mining • Semantic web mining • Vertical search • Web multimedia content mining • Web image/video search • Web image/video annotation/classification/clustering • Web multimedia content filtering • Example: YouTube • Integration with web log mining

Web Usage Mining • Developed techniques • Mining of frequent usage patterns • Association rules, sequential patterns, traversal patterns, etc. • Trends • Personalization • Recommendation • Web Ads • Incorporation of content semantics/ontology • Considerations of Temporality • Extension to mobile web applications • Multidiscipline integration

Problems: Under-utilization of Clickstream Data • Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004 • Top sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com • Aberdeen Group: • 70% of companies with Web sites use clickstream data only for basic website management!

Challenges for Clickstream Data Mining (Arun Sen et al., Communications of the ACM, Nov. 2006) • Problems with data • Data incompleteness • Very large data size • Messiness in the data • Integration problems with Enterprise Data • Too Many Analytical Methodologies • Web Metric-based Methodologies • Basic Marketing Metric-based Methodologies • Navigation-based Methodologies • Traffic-based Methodologies • Data Analysis Problems • Across-dimension analysis problems • Timeliness of data mining under very large data size • Determination of useful/actionable analysis under thousands of metrics

Web Information Extraction: The Issues for Unsupervised Approaches • Dr. Chia-Hui Chang (張嘉惠), Department of Computer Science and Information Engineering, National Central University, Taiwan (Talk given at the 2006 Workshop on Web Mining Technology and Trends, 網路探勘技術與趨勢研討會)

Outline • Web Information Extraction • The key to web information integration • Three Dimensions • Task definition • Automation degree • Technology • Focus on the template-page IE task • Issues for record-level IE • Techniques for solving these issues

Introduction • The coverage of Web information is very wide and diverse • The Web has changed the way we obtain information • Information search on the Web is no longer enough • The need for Web information integration is stronger than ever (for both businesses and individuals) • Understanding Web pages and discovering valuable information from them is called Web content mining • Information extraction is one of the keys to Web content mining

Web Information Integration • From information search to information extraction, to information mapping • Focused crawling / Web page gathering • Information search • Information (Data) extraction • Discovering structured information from input • Schema matching • With a unified interface / single ontology

Three Dimensions to See IE • Task Definition • Input (unstructured free texts, semi-structured Web pages) • Output targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, annotation-based or annotation-free approaches • Techniques • Learning algorithm: specific-to-general vs. general-to-specific • Rule type: regular expression rules vs. logic rules • Deterministic finite-state transducers vs. probabilistic hidden Markov models

IE from Nearly-structured Documents • Multiple-record Web pages • Example: Google search results

IE from Nearly-structured Documents • Single-record pages • Example: Amazon.com book pages

IE from Semi-structured Documents • Ungrammatical snippets • Examples: a publication list, selected articles

Information Extraction From Free Texts • Filling slots in a database from sub-segments of text • Named entity extraction, IE
October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted slots (NAME, TITLE, ORGANIZATION): Bill Gates, CEO, Microsoft • Bill Veghte, VP, Microsoft • Richard Stallman, founder, Free Soft..
[Excerpted from Cohen & McCallum’s talk]

Information Extraction From Free Texts • As a family of techniques: Information Extraction = segmentation + classification + association + clustering • Segments marked in the same news passage (October 14, 2002) as on the previous slide: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation • Resulting records (NAME, TITLE, ORGANIZATION): Bill Gates, CEO, Microsoft • Bill Veghte, VP, Microsoft • Richard Stallman, founder, Free Soft.. • [Excerpted from Cohen & McCallum’s talk]

Dimension 1: Task Definition - Input

Dimension 1: Task Definition - Output • Attribute level (single-slot) • Named entity extraction, concept annotation • Record level • Relation between slots • Page level • All data embedded in a dynamic page • Site level • All information about a web site

Template Page Generation & Extraction • Generation/Encoding: a CGI program fills a template T with data x from a database to produce the output pages, CGI(T, x) • Extraction/Decoding: reverse engineering of the generation process
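
The encoding/decoding view can be made concrete with a small sketch. Here the template is known in advance and is turned into a regular expression to recover the records; unsupervised approaches have to infer the template from the pages themselves, which this sketch does not attempt. The template string, slot names, and function names are hypothetical.

```python
import re

# Hypothetical template T with two data slots.
TEMPLATE = "<li><b>{title}</b> by {author}</li>"

def generate(template, record):
    """Encoding: fill the template's slots with one database record."""
    return template.format(**record)

def extract(template, page):
    """Decoding: recover the records from a page built with the template.

    The template is split into literal text and slot names; literal text
    is escaped and each slot becomes a non-greedy named group.
    """
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)      # literal template text
        else:
            pattern += f"(?P<{part}>.*?)"   # data slot
    return [m.groupdict() for m in re.finditer(pattern, page)]

records = [
    {"title": "Web Mining", "author": "A. Author"},
    {"title": "Data Extraction", "author": "B. Writer"},
]
page = "<ul>" + "".join(generate(TEMPLATE, r) for r in records) + "</ul>"
recovered = extract(TEMPLATE, page)
```

Extraction returns the same records that generation started from, which is the sense in which decoding reverses encoding.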

Dimension 2: Automation Degree • Programming-based • For programmers • Supervised learning • A bunch of labeled examples • Semi-supervised learning/Active learning • Interactive wrapper induction • Unsupervised learning • Mostly for template pages only

Tasks vs. Automation Degree • High Automation Degree (Unsupervised) • Template page IE • Semi-Automatic / Interactive • Semi-structured document IE • Low Automation Degree (Supervised) • Free text IE

Dimension 3: Technologies • Learning Technology • Supervised: rule generalization, hypothesis testing, statistical modeling • Unsupervised learning: pattern mining, clustering • Features used • Plain text information: tokens, token class, etc. • HTML information: DOM tree path, sibling, etc. • Visual information: font, style, position, etc. • Rule Types (Expressiveness of the rules) • Regular expression, first-order logic rules, HMM model

Issues for Unsupervised Approaches • For Record-level Extraction • Data-rich Section Discovery • Record Boundary (Separator) Mining • Schema Detection & Data Annotation • For Page-level Extraction • Schema Detection - differentiate template from data tokens

[Figure: example result page annotated with Data-Rich Section, Record Boundary, and Attribute labels]

Some Related Works on Unsupervised Approaches • Record-level • IEPAD [Chang and Liu, WWW 2001] • DeLa [Wang and Lochovsky, WWW 2003] • DEPTA [Zhai and Liu, WWW 2005] • ViPER [Simon and Lausen, CIKM 2005] • ViNT [Zhao et al., WWW 2005] • Page-level • Roadrunner [Crescenzi, VLDB 2001] • EXALG [Arasu and Garcia-Molina, SIGMOD 2003] • MSE [Zhao et al., VLDB 2006]

Issue 1: Data-Rich Section Discovery • Comparing a normal page with a no-result page • Comparing two normal pages • Locate static text lines, e.g. Books, Related Searches, Narrow or Expand Results, Showing Results … • ViNT [Zhao et al., WWW 2005] • MSE [Zhao et al., VLDB 2006]
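
The page-comparison idea can be sketched in a few lines: text lines that appear in both pages are treated as static template text, and the remaining lines are candidates for the data-rich section. This is a simplified illustration, not the ViNT or MSE algorithm; the function names and the toy pages are hypothetical, and the pages are assumed to be already rendered to plain text lines.

```python
def static_lines(page_a, page_b):
    """Lines of page_a that also occur in page_b, in page_a's order."""
    lines_b = {ln.strip() for ln in page_b.splitlines() if ln.strip()}
    return [ln.strip() for ln in page_a.splitlines() if ln.strip() in lines_b]

def dynamic_lines(page_a, page_b):
    """Lines of page_a that do not occur in page_b (candidate data)."""
    static = set(static_lines(page_a, page_b))
    return [
        ln.strip()
        for ln in page_a.splitlines()
        if ln.strip() and ln.strip() not in static
    ]

# Two toy result pages from the same (hypothetical) site.
page_a = "Books\nShowing Results\nWeb Mining Basics\nData Extraction Guide\nRelated Searches"
page_b = "Books\nShowing Results\nIntro to CLIR\nRelated Searches"

template_text = static_lines(page_a, page_b)
data_text = dynamic_lines(page_a, page_b)
```

Comparing against a no-result page works the same way, with the advantage that the second page contains no data lines at all.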

Issue 1: Data-Rich Section Discovery (Cont.) • HL(R) • [Papadakis et al., SAINT 2005] • Similarity between two adjacent leaf nodes • 1-dimension clustering • Pitch Estimation

Issue 2: Record Boundary Mining • String Pattern Mining • Tree Pattern Mining • Example token string: <html><body>T<ol><li>TTTT</li><li>TTT</li></ol></body></html> • IEPAD [Chang and Liu, WWW 2001] • Example token strings: <A>T</A><A>T</A> T<A>T</A>T and <A>T</A>T <A>T</A>T • DeLa [Wang and Lochovsky, WWW 2003] • DEPTA [Zhai and Liu, WWW 2005]
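
A minimal sketch of string pattern mining on a token string like the ones above: tags are kept as tokens, each run of text becomes the token T, and token subsequences that repeat are reported as candidate record patterns. This brute-force counting stands in for the more efficient index structures used by the cited systems and is not their algorithm; the function names and parameters are hypothetical.

```python
import re
from collections import Counter

def tokenize(html):
    """Keep each tag as a token and encode each run of text as 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("T")
    return tokens

def repeated_patterns(tokens, min_len=2, min_count=2):
    """Count every token subsequence and keep those that repeat.

    Occurrences may overlap; a real system would also filter patterns
    by regularity and by how compactly they cover the page.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, n // min_count + 1):
        for i in range(n - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}

html = "<ol><li>a</li><li>b</li><li>c</li></ol>"
tokens = tokenize(html)
patterns = repeated_patterns(tokens)
```

In this toy page the pattern `<li> T </li>` occurs three times, so each occurrence is a candidate record and its start marks a record boundary.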
