
Alexander Yeh MITRE Corp. October 2008



Presentation Transcript


  1. Potential Query Log Sets Alexander Yeh MITRE Corp. October 2008

  2. Possible Issues with a "Query Log" Corpus • Resembles queries of real interest to somebody • Has some 'geo' aspect • Multi-lingual • MITRE in-house has limitations on languages • Permission to use and distribute (even after the evaluation)

  3. More Recent Suggestions (While at Workshop) • Local search queries from various Wikipedias • Multi-lingual • Privacy? Probably not as bad as other search logs (more like encyclopedia lookup) • Permission? • Long enough to be interesting from a "geo" standpoint?

  4. More Recent Suggestions (Continued) • Treat GikiP topics as queries, e.g., GP4: "Which Swiss cantons border Germany?" • Multi-lingual, have permission, no privacy problem • Combine with GikiP 2009 for publicity purposes • But few in number (15 in 2008 pilot) • Realistic enough? • Use logs generated by an evaluation (like iCLEF) • Multi-lingual, permissions & privacy dealt with • But realistic enough? • Has "geo" aspect?

  5. More Recent Suggestions (Concluded) • Timway search logs from Hong Kong • Chinese, English, usually 1 language in a query • Used in some studies, but usual permission & privacy issues • Also, finding annotator(s) may be an issue: • Chinese probably in Cantonese (versus "official" Mandarin dialect) - not too bad in written form • Probably traditional characters (not mainland China’s simplified characters)

  6. Potential Query Log Data Sets - 1 • Tumba! (Diana Santos, Nuno Cardoso and others) • Available, large amount, a lot not released before • In Portuguese: need to hire and train somebody who can annotate Portuguese

  7. Potential Query Log Data Sets - 2 • Workshop on Web Search Click Data 2009 (WSCD 2009) • http://research.microsoft.com/users/nickcr/wscd09/ • MSN search query log • Large amount, relatively new (and so not seen as much) • Pursuing getting permission (asking Nick Craswell) • Cancelled query parsing task in CLEF 2008 • Current status: cannot release data outside of Microsoft

  8. Potential Query Log Data Sets - 3 • Query parsing task in CLEF 2007 • Query log of 800K English queries (unlabeled), 100 queries of labeled training data and 500 queries of test data • Presumably this log is still available for use in a new query parsing task. • Use same set, but generate new training and test • One disadvantage: the CLEF community is already familiar with this data set
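The "use same set, but generate new training and test" idea on slide 8 could be sketched roughly as follows. This is only an illustration: the function name `draw_train_test`, the placeholder queries, and the choice to de-duplicate before sampling are assumptions, not part of the presentation. The default sizes (100 and 500) are the ones quoted on the slide, and the drawn queries would still need to be labeled by annotators.

```python
import random

def draw_train_test(queries, n_train=100, n_test=500, seed=None):
    """Draw disjoint training and test query sets from an unlabeled log.

    Hypothetical helper: the slide only says to generate new sets
    from the same log, not how to do it.
    """
    # De-duplicate (keeping first-seen order) so that no query string
    # can end up in both the training and the test set.
    unique = list(dict.fromkeys(queries))
    if n_train + n_test > len(unique):
        raise ValueError("log has too few distinct queries")
    rng = random.Random(seed)
    picked = rng.sample(unique, n_train + n_test)
    return picked[:n_train], picked[n_train:]

# Placeholder data standing in for the real query log.
placeholder_log = [f"placeholder query {i}" for i in range(1000)]
train, test = draw_train_test(placeholder_log, seed=1)
```

Fixing the seed makes the draw repeatable, which may help if the same split has to be regenerated later.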

  9. Can Easily Obtain the Following Query Log Data Sets, But … • Can easily obtain a number of data-sets, but • They are old, and so may have been already seen by the CLEF community • Problems getting permissions to use these • Anticipate problems, or • Been asked not to use

  10. Query Log Data Sets that are Easy to Obtain • KDD Cup 2005: Ying Li, a co-chair, asked us not to use • AlltheWeb_2001.gz, AlltheWeb_2002.gz, AltaVista_2002.zip: Jim Jansen: the data sharing agreement has expired • Excite_1997_small.zip, Excite_1997_large.zip, Excite_1999.zip, Excite_2001.gz: from Jim Jansen. Need Excite's permission?

  11. Query Log Data Sets that are Easy to Obtain (Concluded) • AOL query log: from http://gregsadetsky.com/aol-data/ • Was made available to the public for a while • Created a controversy about privacy • But all these data sets will have similar privacy issues

  12. A Way to 'Use' these Data-Sets (John Burger): • Use the existing logs as 'inspiration' for a made-up log corpus • May have been done by others, like NIST • Will not need permission • Will not have been seen before • Can ensure no privacy disclosures • But will take time to produce the made-up data

  13. Privacy Concerns • Though the AOL query logs are the best-known case, all these data sets may contain private data • One way to 'remove' it: use the existing logs as 'inspiration' for a made-up log corpus (mentioned above) • A fast, incomplete way to remove private data: remove the query timestamps and the links indicating which queries came from the same site, and randomize the order of the queries • A lot of the 'disclosures' come from grouping the queries by a common source • But the removed information is then no longer available to a query parser
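The fast, incomplete method on slide 13 might look something like the sketch below. The record layout (`source_id`, `timestamp`, `query`) and the sample entries are made up for illustration; real logs will use different field names. As the slide notes, this does nothing about private data inside the query text itself.

```python
import random

def strip_and_shuffle(records, seed=None):
    """Keep only the query text, then randomize the order.

    Dropping the (hypothetical) 'source_id' and 'timestamp' fields
    removes the links between queries from the same source; shuffling
    removes the ordering that could be used to regroup them.
    """
    queries = [record["query"] for record in records]
    rng = random.Random(seed)
    rng.shuffle(queries)
    return queries

# Invented sample records, not taken from any real log.
sample_log = [
    {"source_id": "a", "timestamp": "2008-10-01T09:00:00", "query": "hotels near lake geneva"},
    {"source_id": "a", "timestamp": "2008-10-01T09:02:10", "query": "train zurich to basel"},
    {"source_id": "b", "timestamp": "2008-10-01T09:05:45", "query": "swiss cantons bordering germany"},
]
released = strip_and_shuffle(sample_log, seed=0)
```

The output is a plain list of strings, so anything a query parser might have learned from session context or timing is no longer available, which is the trade-off the slide points out.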

  14. Privacy Concerns (Concluded) • A slower, more complete way to remove private data: review the data (perhaps as it is annotated) and flag any queries that contain private data • Either substitute fictional information for the flagged data or remove the flagged queries from the data sets
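Once a human reviewer has flagged the problem queries, the two options on slide 14 (substitute or remove) could be applied along these lines. The field names `flagged` and `replacement`, the `mode` parameter, and the sample records are all assumptions for illustration; the flagging itself remains a manual review step.

```python
def scrub_flagged(records, mode="remove"):
    """Apply a reviewer's flags to a list of query records.

    Each record is a dict with a 'query' string, an optional boolean
    'flagged', and, for flagged records in substitute mode, a
    reviewer-written fictional 'replacement'. All names are hypothetical.
    """
    if mode not in ("remove", "substitute"):
        raise ValueError("mode must be 'remove' or 'substitute'")
    cleaned = []
    for record in records:
        if not record.get("flagged", False):
            cleaned.append(record["query"])
        elif mode == "substitute":
            cleaned.append(record["replacement"])
        # In 'remove' mode a flagged query is simply dropped.
    return cleaned

# Invented reviewed records, including an invented 'private' query.
reviewed = [
    {"query": "weather in lisbon"},
    {"query": "jane example 12 sample street", "flagged": True,
     "replacement": "john placeholder 1 fictional road"},
    {"query": "ferry to macau"},
]
removed = scrub_flagged(reviewed, mode="remove")
substituted = scrub_flagged(reviewed, mode="substitute")
```

Substitution keeps the corpus size and the shape of the flagged queries, while removal is simpler but shrinks the set.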
