Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group
Department of Computer Science and Technology
Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn
Presented at ICDM 2010, Sydney, Australia
the number of times a term occurs in a document
How can we overcome the disadvantage?
[Figure: Doc 1 and Doc 2, each containing the terms A, A, B]
the number of occurrences of a superterm in a document
A transformed representation for document after merging
Can be chosen with different methods: winner-take-all, average-occurrence, etc.
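The two merging methods named above can be sketched as follows. This is only an illustration under stated assumptions: "average-occurrence" is read as the per-document mean of the member terms' counts, and "winner-take-all" is read as reusing the counts of a single member term (here, the one with the largest total count). The function names `merge_average` and `merge_winner` are hypothetical, and the talk may define both methods differently.

```python
from typing import Dict, List

# Occurrence counts: term -> one count per document (Doc 1, Doc 2).
Counts = Dict[str, List[int]]


def merge_average(counts: Counts, members: List[str]) -> List[float]:
    """Average-occurrence (assumed reading): mean of the member counts per document."""
    n_docs = len(counts[members[0]])
    return [sum(counts[t][d] for t in members) / len(members) for d in range(n_docs)]


def merge_winner(counts: Counts, members: List[str]) -> List[int]:
    """Winner-take-all (assumed reading): reuse the counts of the member term
    with the largest total count; ties go to the earliest member."""
    winner = max(members, key=lambda t: sum(counts[t]))
    return list(counts[winner])


# Toy corpus: both documents contain A, A, B.
counts = {"A": [2, 2], "B": [1, 1]}
superterm_avg = merge_average(counts, ["A", "B"])
superterm_win = merge_winner(counts, ["A", "B"])
```

Either result can serve as the superterm's occurrence vector in the transformed document representation; which one loses less information depends on how the error is measured.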
[Figure: Doc 1 and Doc 2 (each containing A, A, B) shown with the terms A and B and the superterm S]
(1) ,
(2) for all, ,
(3) for all , .
Mapping function from terms to superterms
Bounded by user-specified errors
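A minimal sketch of checking whether a given term-to-superterm mapping stays within a user-specified error. Everything specific here is an assumption made for illustration: the error is taken to be the Euclidean distance between a term's occurrence vector and its superterm's vector, and the names `max_term_error` and `eps_term` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two occurrence vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def max_term_error(
    term_vecs: Dict[str, Vector],
    superterm_vecs: Dict[str, Vector],
    mapping: Dict[str, str],
) -> float:
    """Largest distance between any term and the superterm it is mapped to."""
    return max(euclidean(term_vecs[t], superterm_vecs[mapping[t]]) for t in mapping)


# Terms A and B are both mapped to superterm S.
term_vecs = {"A": [2.0, 2.0], "B": [1.0, 1.0]}
superterm_vecs = {"S": [1.5, 1.5]}
mapping = {"A": "S", "B": "S"}

eps_term = 1.0  # hypothetical user-specified bound
error = max_term_error(term_vecs, superterm_vecs, mapping)
is_bounded = error <= eps_term
```

With these toy vectors each term lies about 0.71 from S, so the mapping would be accepted for a bound of 1.0 and rejected for a bound of 0.5.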
a projection vector (drawn from the Gaussian distribution)
Efficient Candidate Superterm Generation
the width of the buckets
(assigned empirically)
[Datar et al., 2004].
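The hash family of Datar et al. (2004) has the form h(v) = floor((a · v + b) / w), where a is the Gaussian projection vector and w is the bucket width. A small sketch using only the Python standard library follows; the offset b (uniform in [0, w]) comes from that scheme rather than from the slide text, and the width of 4.0 and the name `make_hash` are placeholders.

```python
import math
import random
from typing import Callable, Sequence


def make_hash(dim: int, width: float, rng: random.Random) -> Callable[[Sequence[float]], int]:
    """Build one hash h(v) = floor((a . v + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection vector
    b = rng.uniform(0.0, width)                     # random offset within one bucket

    def h(v: Sequence[float]) -> int:
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h


rng = random.Random(0)  # fixed seed so runs are repeatable
h = make_hash(dim=2, width=4.0, rng=rng)
bucket_a = h([2.0, 2.0])
bucket_a_again = h([2.0, 2.0])
```

A larger width puts more vectors into the same bucket (more candidates, fewer misses); a smaller width does the opposite, which is presumably why the width is tuned empirically.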
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
Same hash value
(with several randomized hash projections)
Now search for near neighbors in all the buckets!
Close to each other
(in the original space)
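The bucketing idea can be sketched end to end: hash every term vector with several random projections, group terms whose hash values all agree, and then compare only terms that share a bucket. This is an illustrative sketch, not the authors' implementation; the parameters `n_tables`, `n_hashes`, `width`, and `radius` and their default values are assumptions.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def make_hash(dim, width, rng):
    """One hash of the form floor((a . v + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection vector
    b = rng.uniform(0.0, width)                     # random offset
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / width)


def distance(u, v):
    """Euclidean distance in the original space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def candidate_pairs(vectors, radius, width=4.0, n_tables=8, n_hashes=2, seed=0):
    """Pairs of terms that share a bucket in at least one table and lie
    within `radius` of each other in the original space."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    pairs = set()
    for _ in range(n_tables):
        hashes = [make_hash(dim, width, rng) for _ in range(n_hashes)]
        buckets = defaultdict(list)
        for term, vec in vectors.items():
            buckets[tuple(h(vec) for h in hashes)].append(term)
        # Only terms inside the same bucket are ever compared.
        for members in buckets.values():
            for s, t in combinations(sorted(members), 2):
                if distance(vectors[s], vectors[t]) <= radius:
                    pairs.add((s, t))
    return pairs


# A and B have identical vectors, so they always share a bucket; C is far away.
vectors = {"A": [2.0, 2.0], "B": [2.0, 2.0], "C": [50.0, -40.0]}
pairs = candidate_pairs(vectors, radius=1.0)
```

Because distances are checked only within buckets, most far-apart pairs are never compared. Close pairs can still be missed by a single table, which is the reason for repeating the hashing over several tables.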
* Optimal resulting from different choice of tuned