Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn, lwthucs@gmail.com
the number of times that term occurs in document
How can we overcome the disadvantage?
[Figure: Doc 1 and Doc 2, each containing the terms A, A, B]
the number of occurrences of superterm in document
A transformed representation for document after merging
Can be chosen with different methods: winner-take-all, average-occurrence, etc.
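A minimal sketch of how the occurrence count of a superterm might be derived from its member terms. The slide names the two strategies but does not define them, so both definitions here are assumptions: winner-take-all is taken to mean "copy the counts of the member term with the largest total count", and average-occurrence is taken to mean "the per-document mean over the member terms". The function names and the toy corpus (two documents, each containing A twice and B once) are hypothetical.

```python
from typing import Dict, List

# counts[term][i] = number of times `term` occurs in document i.
Counts = Dict[str, List[int]]


def merge_winner_take_all(counts: Counts, members: List[str]) -> List[int]:
    """ASSUMED definition: the superterm inherits the count vector of the
    member term whose total count over all documents is largest."""
    winner = max(members, key=lambda term: sum(counts[term]))
    return list(counts[winner])


def merge_average_occurrence(counts: Counts, members: List[str]) -> List[float]:
    """ASSUMED definition: the superterm count in each document is the mean
    of the member terms' counts in that document."""
    n_docs = len(counts[members[0]])
    return [
        sum(counts[term][i] for term in members) / len(members)
        for i in range(n_docs)
    ]


# Hypothetical toy corpus: Doc 1 = "A A B", Doc 2 = "A A B".
counts = {"A": [2, 2], "B": [1, 1]}

# Merge terms A and B into one superterm S under each strategy.
s_wta = merge_winner_take_all(counts, ["A", "B"])
s_avg = merge_average_occurrence(counts, ["A", "B"])
```

Under these assumed definitions, winner-take-all keeps integer counts but ignores the weaker term entirely, while average-occurrence spreads the error across both members.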
[Figure: Doc 1 and Doc 2 (terms A, A, B) before and after merging terms A and B into superterm S]
(1) ,
(2) for all, ,
(3) for all , .
Mapping function from terms to superterms
Bounded by user-specified errors
#iterations << |V|
a projection vector (drawn from the Gaussian distribution)
Efficient Candidate Superterm Generation
the width of the buckets
(assigned empirically)
[Datar et al., 2004].
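The hash family of Datar et al. maps a vector v to the bucket index floor((a · v + b) / w), where a is the Gaussian projection vector and w is the bucket width. The sketch below illustrates that scheme in plain Python. The random offset b, drawn uniformly from [0, w), comes from the standard formulation of the scheme and is not visible on the slide; the function name `make_hash`, the width value, and the sample vector are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def make_hash(dim: int, width: float, rng: random.Random) -> Callable[[Sequence[float]], int]:
    """Build one hash function h(v) = floor((a . v + b) / width)."""
    # Projection vector: each component drawn from a standard Gaussian.
    a: List[float] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Random offset in [0, width), as in the standard formulation (assumed here).
    b: float = rng.uniform(0.0, width)

    def h(v: Sequence[float]) -> int:
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
h = make_hash(dim=3, width=4.0, rng=rng)  # width is an arbitrary illustrative value

bucket_x = h([2.0, 2.0, 1.0])
bucket_same = h([2.0, 2.0, 1.0])          # identical vectors always share a bucket
```

A larger width puts more vectors into each bucket (more candidates, fewer misses), while a smaller width does the opposite, which is presumably why the slide notes that the width is assigned empirically.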
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
Same hash value
(with several randomized hash projections)
Now search -near neighbors in all the buckets!
Close to each other
(in the original space)
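The bucket lookup described above can be sketched as follows: each term vector is hashed with several independent projections, terms that share a bucket with the query in any table become candidates, and the candidates are then checked against the true distance. This is an illustration under stated assumptions, not the authors' implementation: Euclidean distance, the `radius` parameter (standing in for the near-neighbor radius whose symbol is missing on the slide), the helper names, and the toy vectors are all hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Hash = Callable[[Vector], int]


def make_hash(dim: int, width: float, rng: random.Random) -> Hash:
    """One Gaussian-projection hash: floor((a . v + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / width)


def build_tables(
    vectors: Dict[str, Vector], n_tables: int, dim: int, width: float, seed: int = 0
) -> Tuple[List[Hash], List[Dict[int, Set[str]]]]:
    """Hash every term vector into one bucket per table."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, width, rng) for _ in range(n_tables)]
    tables: List[Dict[int, Set[str]]] = [defaultdict(set) for _ in range(n_tables)]
    for name, v in vectors.items():
        for h, table in zip(hashes, tables):
            table[h(v)].add(name)
    return hashes, tables


def near_neighbors(
    query: str,
    vectors: Dict[str, Vector],
    hashes: List[Hash],
    tables: List[Dict[int, Set[str]]],
    radius: float,
) -> Set[str]:
    """Collect candidates from the query's bucket in every table, then keep
    only those within `radius` in the original space (Euclidean, assumed)."""
    q = vectors[query]
    candidates: Set[str] = set()
    for h, table in zip(hashes, tables):
        candidates |= table[h(q)]
    candidates.discard(query)
    return {c for c in candidates if math.dist(q, vectors[c]) <= radius}


# Hypothetical term vectors: x and y are identical, z is far away.
vectors = {"x": [2.0, 2.0], "y": [2.0, 2.0], "z": [50.0, -40.0]}
hashes, tables = build_tables(vectors, n_tables=4, dim=2, width=4.0)
result = near_neighbors("x", vectors, hashes, tables, radius=1.0)
```

Only terms that collide with the query in at least one table are ever compared exactly, so the number of distance computations depends on bucket occupancy rather than on the full vocabulary size.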
Avg. errors
Max errors
* Optimal result obtained over different choices of the tuned parameter
