
Translation of Web Queries Using Anchor Text Mining

Translation of Web Queries Using Anchor Text Mining. Advisor : Dr. Hsu Graduate : Wen-Hsiang Hu Authors : Wen-Hsiang Lu. ACM, June 2002. Outline. Motivation Objective Introduction Anchor Text Mining Probabilistic Inference Model Query Translation System Experiments Discussion


Presentation Transcript


  1. Translation of Web Queries Using Anchor Text Mining • Advisor: Dr. Hsu • Graduate: Wen-Hsiang Hu • Authors: Wen-Hsiang Lu • ACM, June 2002

  2. Outline • Motivation • Objective • Introduction • Anchor Text Mining • Probabilistic Inference Model • Query Translation System • Experiments • Discussion • Conclusion • Personal Opinion

  3. Motivation • One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.

  4. Objective • Automatically extract translations of Web query terms.

  5. Introduction • In this paper, we are interested in discovering translations of new terminology and proper names through mining Web anchor texts. • Problems with previous research methods: the need for parallel corpora covering various subjects and multiple languages; the lack of parallel correlation between word pairs; short query terms. [Figure: anchor texts for Yahoo, including 雅虎 (Yahoo), 搜尋、雅虎 (search, Yahoo), and 美國雅虎 (Yahoo US)]

  6. Anchor Text Mining • We use a triple <Uj, Ui, Dk> to indicate that page Uj points to page Ui with description text Dk. • For a Web page (or URL) Ui, its anchor-text set AT(Ui) is defined as all of the anchor texts of the links pointing to Ui, i.e., Ui's in-links. • If a query term appears in AT(Ui), its corresponding translations are likely to appear there as well. [Figure: several pages Uj linking to page Ui]
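The anchor-text set defined on this slide can be sketched in a few lines of Python. This is only an illustration of the definition, not the authors' implementation; the function name `build_anchor_text_sets` and the example URLs are hypothetical.

```python
from collections import defaultdict


def build_anchor_text_sets(triples):
    """Group description texts by target page.

    Each triple is (Uj, Ui, Dk): page Uj points to page Ui
    with description (anchor) text Dk. The result maps every
    target page Ui to its anchor-text set AT(Ui), kept as a list
    so that repeated anchor texts are counted.
    """
    anchor_sets = defaultdict(list)
    for source_url, target_url, description in triples:
        anchor_sets[target_url].append(description)
    return dict(anchor_sets)


# Hypothetical triples, for illustration only.
triples = [
    ("http://a.example/", "http://www.yahoo.com/", "Yahoo"),
    ("http://b.example/", "http://www.yahoo.com/", "雅虎"),
    ("http://c.example/", "http://www.yahoo.com/", "美國雅虎"),
    ("http://a.example/", "http://other.example/", "news"),
]
anchor_sets = build_anchor_text_sets(triples)
```

Here `anchor_sets["http://www.yahoo.com/"]` collects both the English term and its Chinese counterparts, which is the co-occurrence the method relies on.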

  7. Probabilistic Inference Model • Asymmetric similarity estimation model, based on the inductive rule "if Ts then Tt": $P(T_s \rightarrow T_t) = P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)}$. Drawback: some common terms may become the best translations. • Symmetric similarity estimation function, based on both inductive rules "if Ts then Tt" and "if Tt then Ts": $P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} = \frac{P(T_s \cap T_t)}{P(T_s) + P(T_t) - P(T_s \cap T_t)}$ (2), where Ts is the source term and Tt is the target translation. • Example: out of 100 anchor texts in total, Ts = Yahoo appears in only one and Tt = 雅虎 appears in 10 (e.g., "雅虎 Yahoo", "雅虎 動物", "雅虎 企業", ...). Then P(Tt | Ts) = 0.01 / 0.01 = 1 and P(Ts ↔ Tt) = 0.01 / [(0.01 + 0.1) - 0.01] = 0.1.
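The worked example on this slide can be checked with a short Python sketch. It assumes the probabilities are estimated as simple relative frequencies over anchor texts, so the total count cancels and raw counts can be used directly; the function names are illustrative, not taken from the paper.

```python
def asymmetric_similarity(n_both, n_source):
    """P(Tt | Ts): share of anchor texts containing Ts that also contain Tt."""
    return n_both / n_source


def symmetric_similarity(n_both, n_source, n_target):
    """P(Ts <-> Tt): co-occurrences divided by anchor texts containing either term."""
    return n_both / (n_source + n_target - n_both)


# Counts from the slide: 100 anchor texts in total,
# "Yahoo" in 1 of them, "雅虎" in 10, both together in 1.
n_source, n_target, n_both = 1, 10, 1

asym = asymmetric_similarity(n_both, n_source)
sym = symmetric_similarity(n_both, n_source, n_target)
print(asym, sym)
```

The asymmetric score saturates at 1 as soon as every occurrence of the source term co-occurs with the candidate, while the symmetric score also penalizes a candidate that appears in many other anchor texts.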

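  8. Probabilistic Inference Model (cont.) • Let U = (U1, U2, …, Un) be a concept space (Web page space) consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., Ui ∩ Uj = ∅ for i ≠ j. We can rewrite Eq. (2) as follows: $P(T_s \leftrightarrow T_t) = \frac{\sum_{i=1}^{n} P(T_s \cap T_t \mid U_i)\,P(U_i)}{\sum_{i=1}^{n} P(T_s \cup T_t \mid U_i)\,P(U_i)}$, with $P(U_i) = \frac{L(U_i)}{\sum_{j=1}^{n} L(U_j)}$, where L(Uj) is the number of in-links of page Uj.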

  9. Probabilistic Inference Model (cont.) • We assume that Ts and Tt are independent given Ui, so the joint probability P(Ts∩Tt|Ui) equals the product of P(Ts|Ui) and P(Tt|Ui). • This estimation approach takes into account both the link information and the degree of authority among Web pages.
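A minimal Python sketch of a link-weighted symmetric score under the independence assumption follows. Several choices here are assumptions made for illustration and are not stated on the slides: each page is weighted by its share of all in-links, P(T|Ui) is estimated as the fraction of the page's anchor texts containing the term, and the union term is expanded by inclusion-exclusion as P(Ts|Ui) + P(Tt|Ui) - P(Ts|Ui)P(Tt|Ui). All function names and page keys are hypothetical.

```python
def term_given_page(term, anchor_texts):
    """Assumed estimate of P(T | Ui): fraction of Ui's anchor texts containing T."""
    if not anchor_texts:
        return 0.0
    return sum(term in text for text in anchor_texts) / len(anchor_texts)


def link_weighted_symmetric_similarity(source, target, anchor_sets):
    """Symmetric similarity summed over pages, weighted by in-link share."""
    # One anchor text per in-link, so len(texts) stands in for L(Ui).
    total_inlinks = sum(len(texts) for texts in anchor_sets.values())
    numerator = 0.0
    denominator = 0.0
    for texts in anchor_sets.values():
        prior = len(texts) / total_inlinks      # weight of page Ui
        p_s = term_given_page(source, texts)
        p_t = term_given_page(target, texts)
        joint = p_s * p_t                       # independence given Ui
        numerator += joint * prior
        denominator += (p_s + p_t - joint) * prior
    return numerator / denominator if denominator > 0 else 0.0


# Toy anchor-text sets, for illustration only.
anchor_sets = {
    "yahoo-page": ["Yahoo", "雅虎", "Yahoo 雅虎", "美國雅虎"],
    "zoo-page": ["動物", "動物園"],
}
score_yahoo = link_weighted_symmetric_similarity("Yahoo", "雅虎", anchor_sets)
score_animal = link_weighted_symmetric_similarity("Yahoo", "動物", anchor_sets)
```

In this toy data the candidate 雅虎 shares a heavily linked page with "Yahoo" and scores well above 動物, which never co-occurs with it on any page.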

  10. Query Translation System • Three different methods are used to extract Chinese terms: • PAT-tree-based: checks whether the string of a candidate term is complete within a lexical boundary, and decides the importance of a term based on its relative frequency. • Query-set-based: takes queries from search engines, using query sets of different sizes. • Tagger-based: uses the CKIP tagger and extracts unknown words. [Figure: anchor texts for Yahoo, including 雅虎, 搜尋、雅虎, and 美國雅虎]
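One plausible reading of the query-set-based method is that a set of known queries acts as a vocabulary, and any query that occurs inside an anchor text becomes a candidate term. The sketch below follows that reading; it is an assumption for illustration, not the system described in the paper, and `extract_candidates` is a hypothetical name.

```python
def extract_candidates(anchor_text, query_set):
    """Return the query-set terms found in the anchor text, longest first."""
    found = [query for query in query_set if query in anchor_text]
    # Longer matches first; ties are broken by code point order for determinism.
    return sorted(found, key=lambda query: (-len(query), query))


# Toy query set, for illustration only.
query_set = {"雅虎", "搜尋", "美國雅虎", "櫻花"}
candidates = extract_candidates("美國雅虎搜尋", query_set)
print(candidates)
```

Because the vocabulary comes from a query log, the size of the query set directly controls how many candidates each anchor text can yield.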

  11. Experiments • Experimental Environment • Popular query terms were collected from the logs of Dreamer and GAIS. • These query terms were taken as the major test set in our term translation extraction analysis. • We filtered out the terms that had no corresponding Chinese translations in the anchor-text database and selected 622 English terms as the source query set.

  12. Experiments (cont.) • Evaluation Metric • For a set of test query terms, the top-n inclusion rate is defined as the percentage of query terms whose effective translation(s) can be found among the top n extracted translations.
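The metric can be sketched in Python as follows. The data layout (a ranked candidate list per query term and a set of accepted translations per term) and the toy values are assumptions for illustration; they are not results from the paper. The function returns a fraction, so multiply by 100 for a percentage.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of query terms with a correct translation in the top n candidates.

    extracted: maps each query term to its ranked list of candidates.
    gold: maps each query term to the set of accepted translations.
    """
    if not extracted:
        return 0.0
    hits = 0
    for term, candidates in extracted.items():
        accepted = gold.get(term, set())
        if any(candidate in accepted for candidate in candidates[:n]):
            hits += 1
    return hits / len(extracted)


# Toy rankings and reference translations, for illustration only.
extracted = {
    "yahoo": ["搜尋", "雅虎"],
    "sakura": ["櫻花", "台灣櫻花"],
    "nokia": ["手機"],
}
gold = {"yahoo": {"雅虎"}, "sakura": {"櫻花"}, "nokia": {"諾基亞"}}

top1 = top_n_inclusion_rate(extracted, gold, 1)
top2 = top_n_inclusion_rate(extracted, gold, 2)
```

In this toy data only "sakura" is correct at rank 1, while "yahoo" becomes a hit once the second candidate is allowed.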

  13. Experiments (cont.) • Performance with Various Similarity Estimation Models • MA: asymmetric model. • MAL: asymmetric model with link information. • MS: symmetric model. • MSL: symmetric model with link information (the proposed model). • Tested with 622 English query terms and the query-set-based method.

  14. Experiments (cont.) • Performance with Various Term Extraction Methods • MSL is used as the similarity estimation model.

  15. Experiments (cont.) • Performance with Various Query-Set Sizes • The medium-sized query set achieved the best performance. • Example: "sakura" • Query set of 9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving) • Query set of 228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving) • A larger query set yields more candidate translations but might also produce more noise.

  16. Discussion • Comparisons with a translation lexicon • Queries suitable for finding translations • Extracting domain-specific translations • Experiments on Simplified Chinese pages

  17. Conclusion • We propose a new and effective approach that mines Web link structures and anchor texts for translations of Web query terms. • Future research: combining more in-depth linguistic knowledge to remove noisy terms.

  18. Personal Opinion • ……..
