
Collection Synthesis

This research examines collection synthesis in digital libraries, including the concept of clusters, the document vector space model, and the use of centroids. It also explores the process of building seed URL sets and crawl control. The evaluation of collections is discussed, along with possible future developments in machine learning.


Presentation Transcript


  1. Collection Synthesis Donna Bergmark Cornell Digital Library Research Group March 12, 2002

  2. Collection – what is it? • For a digital library, it could be a set of URLs • The documents pointed to are about the same topic • They may or may not be archived • They may be collected by hand or automatically

  3. Collections and Clusters • Clusters are collections of items • The items within the cluster are closer to each other than to items in other clusters • There exist many statistical methods for cluster identification • If clusters are pre-existing, then collection synthesis is a “classification problem”

  4. The Document Vector Space • Classic approach in IR

  5. Document Vector Space Model • Classic “Saltonian” theory • Originally based on collections • Each word is a dimension in N-space • Each document is a vector in N-space • Best to use normalized weights • Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
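A minimal sketch of how such a normalized vector might be computed, assuming raw term counts as the starting weights (the dictionary and document here are invented for illustration):

```python
import math

def document_vector(words, dictionary):
    """Build a unit-length term-count vector over a fixed dictionary ordering."""
    counts = [words.count(term) for term in dictionary]
    norm = math.sqrt(sum(c * c for c in counts))
    # Normalizing to unit length makes cosine comparisons fair across documents.
    return [c / norm if norm else 0.0 for c in counts]

dictionary = ["algebra", "equation", "graph", "polynomial"]
doc = "graph the equation then graph the polynomial".split()
print(document_vector(doc, dictionary))  # [0.0, 0.408..., 0.816..., 0.408...]
```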

  6. Distance in DV Space • How similar are two documents, or a document and a query? • Look at their vectors in N-space • If there is overlap, the documents are similar • If there is no overlap, the documents are orthogonal (i.e., totally unrelated)

  7. Cosine Correlation • Correlation ranges between 0 and 1 • 0 → nothing in common at all (orthogonal) • 1 → all terms in common (complete overlap) • Easy to compute • Intuitive

  8. Cosine Correlation • Given vectors x, y both consisting of real numbers x1, x2, … xN and y1, y2, … yN • Compute the cosine correlation by: cos(x, y) = (x1*y1 + x2*y2 + … + xN*yN) / (sqrt(x1^2 + … + xN^2) * sqrt(y1^2 + … + yN^2))
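A direct Python rendering of that formula (a sketch; vectors are plain lists of weights):

```python
import math

def cosine_correlation(x, y):
    """Cosine of the angle between two term vectors in N-space."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an all-zero vector overlaps with nothing
    return dot / (norm_x * norm_y)

print(cosine_correlation([0.0, 0.7, 0.7], [0.0, 1.0, 0.0]))  # ~0.707: partial overlap
```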

  9. The Dictionary • Usual to keep a dictionary of actual words (or their stems) for efficient word lookup • Common words are left out • Each word is stored with its document frequency df(i) and its discrimination value idf(i)
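A sketch of how such a dictionary might be built, assuming the common definition idf(i) = log(N / df(i)), which the slide does not spell out:

```python
import math

def build_dictionary(documents, stopwords):
    """Map each non-common word to (document frequency, idf)."""
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):            # count each word once per document
            if term not in stopwords:
                df[term] = df.get(term, 0) + 1
    # idf(i) = log(N / df(i)): words in fewer documents discriminate better.
    return {term: (count, math.log(n / count)) for term, count in df.items()}

docs = [["graph", "equation"], ["equation", "polynomial"], ["graph", "algebra"]]
print(build_dictionary(docs, stopwords={"the", "a"}))
```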

  10. Computing the Document Vector • Download a document, extract its words, and look each one up in the dictionary • For each word that is actually in the dictionary, compute a weight for it: W(i) = tf(i) * idf(i)
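Continuing the dictionary sketch above, with tf(i) taken to be the raw count of word i in the document (an assumption; the slide does not define tf):

```python
def term_weights(doc_words, dictionary):
    """Weight each in-dictionary word by W(i) = tf(i) * idf(i)."""
    weights = {}
    for term in set(doc_words):
        if term in dictionary:           # words outside the dictionary are skipped
            _df, idf = dictionary[term]
            weights[term] = doc_words.count(term) * idf
    return weights
```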

  11. Assembling a Collection • Download a document • Compute its term vector • Add it to the collection it is most like, based on its vector and the collection’s vector • How to get the collection vectors?
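The assignment step might look like the following sketch, reusing cosine_correlation from the earlier block; centroids is a topic-to-vector map whose construction is the subject of the next slides:

```python
def assign_to_collection(doc_vector, centroids):
    """Return the topic whose centroid correlates best with the document."""
    best_topic, best_score = None, -1.0
    for topic, centroid in centroids.items():
        score = cosine_correlation(doc_vector, centroid)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic, best_score
```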

  12. Collections: virtual to real

  13. The Centroids • “Centroid” is what I called the collection’s document vector • It is critical to the quality of the collection that is assembled • Where do the centroids come from? • How to weight the terms?
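The slides leave the construction open; one common choice (an assumption here, not necessarily the author's method) is the normalized mean of the seed documents' vectors:

```python
import math

def centroid(seed_vectors):
    """Normalized mean of the seed document vectors (one common construction)."""
    n = len(seed_vectors[0])
    mean = [sum(v[i] for v in seed_vectors) / len(seed_vectors) for i in range(n)]
    norm = math.sqrt(sum(c * c for c in mean))
    return [c / norm if norm else 0.0 for c in mean]
```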

  14. The Topic Hierarchy
  0 Algebra
    1 Basic Algebra
      2 Equations
        3 Graphing Equations
      2 Polynomials
    1 Linear Algebra
      2 Eigenvectors/Eigenvalues
      :

  15. Building a seed URL set • Given topic “T” • Find hubs/authorities on that topic • Exploit a search engine to do this • How many results to keep? I chose 7; Kleinberg chooses 200. • Google does not allow automated searches without prior permission
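In outline, the step looks like this sketch; search_engine is a hypothetical stand-in for whatever (permitted) search API is used, and KEEP follows the slide's choice of 7:

```python
KEEP = 7  # the slide keeps 7 results; Kleinberg keeps 200

def seed_urls(topic, search_engine):
    """Ask a search engine for pages on the topic; keep the top few as seeds."""
    ranked_urls = search_engine(topic)   # hypothetical: returns URLs, best first
    return ranked_urls[:KEEP]
```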

  16. Query: Graphing Basic Algebra…
  accessone.com/~bbunge/Algebra/Algebra.html
  library.thinkquest.org/20991/prealg/eq.html
  library.thinkquest.org/20991/prealg/graph.html
  sosmath.com/algebra/algebra.html
  algebrahelp.com/
  archives.math.utk.edu/topics/algebra.html
  purplemath.com/modules/modules.htm

  17. Results: Centroids • 26 centroids (from about 30 topics) • Seed sets must have at least 4 URLs • All terms from seed URL documents were extracted and weighted • Kept the top 40 words in each vector • Union of the vectors became our dictionary • Centroid evaluation: 90% of seed URLs classified with “their” centroid

  18. Three Knobs for Crawl Control • “On topic”: downloaded page correlates with the nearest centroid at least “Q”, where 0 < Q <= 1.0 • Cutoff – how many off-topic pages to travel through before cutting off this search line? 0 <= Cutoff <= D • Time limit – how many hours to crawl
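A sketch of a crawl loop wired to those three knobs; fetch and nearest_centroid_score are hypothetical stand-ins for the downloader and the centroid comparison:

```python
import time
from collections import deque

def crawl(seeds, fetch, nearest_centroid_score, q=0.3, cutoff=0, hours=5.0):
    """Focused crawl with on-topic threshold Q, off-topic cutoff, and time limit."""
    deadline = time.time() + hours * 3600
    frontier = deque((url, 0) for url in seeds)   # (url, off-topic steps so far)
    seen, collection = set(seeds), []
    while frontier and time.time() < deadline:    # knob 3: the time limit
        url, off_topic = frontier.popleft()
        vector, links = fetch(url)                # hypothetical downloader
        if nearest_centroid_score(vector) >= q:   # knob 1: the "on topic" test
            collection.append(url)
            off_topic = 0                         # back on topic: reset the count
        elif off_topic >= cutoff:                 # knob 2: cut off this search line
            continue
        else:
            off_topic += 1
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, off_topic))
    return collection
```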

  19. Results: Some Collections • Built 26 collections in Math • Kept 20-50 of the best-correlating URLs for each class • The best Cutoff turned out to be 0 • Crawled (for math) about 5 hours • Some collections are larger than others

  20. Collection “Evaluation” • The only automatic evaluation method is the correlation value, i.e., how close an item is to the collection • With human relevance assessments, one can also compute a “precision” curve • Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n
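For the precision curve, P(n) reduces to a one-liner over the human judgments for the top-n ranked items:

```python
def precision_at_n(relevance, n):
    """P(n): fraction of the n most highly ranked items judged relevant."""
    return sum(relevance[:n]) / n

# e.g. judgments for the top 5 items, 1 = relevant, 0 = not:
print(precision_at_n([1, 1, 0, 1, 0], 5))  # 0.6
```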

  21. Results: Class 14
  mathforum.org/dr.math/problems/keesha.12.18.01.html
  mathforum.org/dr.math/problems/kmiller.9.2.96.html
  mathforum.org/dr.math/problems/santiago.10.14.98.html
  www.geom.umn.edu/docs/education/build-icos
  :
  mtl.math.uiuc.edu/message_board/messages/326.html

  22. Conclusions We are still working on the collections and picking parameters; machine learning will be added next. Discussion? Questions?
