Very Large-Scale Incremental Clustering

Berk Berker, Mumin Cebe, Ismet Zeki Yalniz. 27 March 2007.


Presentation Transcript


  1. Very Large-Scale Incremental Clustering • Berk Berker, Mumin Cebe, Ismet Zeki Yalniz • 27 March 2007

  2. Table of Contents • Why Clustering? • Why Incremental Clustering? • Related Work • Incremental C3M (C2ICM) • A Former Implementation of C2ICM for very large datasets • Conclusion

  3. Why Clustering? • It is an effective tool for managing information overload • It lets users browse large document collections quickly • It makes the distinct topics and subtopics (concept hierarchies) easy to grasp • It allows search engines to query large document collections efficiently

  4. Types of Clustering • Hierarchical vs. Non-hierarchical • Partitional vs. Agglomerative • Deterministic vs. Probabilistic algorithms • Incremental vs. Batch algorithms

  5. Why Incremental Clustering? • The current information explosion • Popular sources of informational text documents, such as newswire and blogs, are updated continuously • A delay in clustering new documents would be unacceptable in several important areas

  6. Related Work • The cluster-splitting approach • Adaptive clustering based on user queries • The Cobweb algorithm • Hierarchical clustering in an incremental manner

  7. C2ICM Algorithm • C3M is known as an efficient, effective and robust algorithm for clustering documents • C3M is well developed for initial clustering, but the clusters must also be maintained as the collection changes

  8. C2ICM Algorithm • Like C3M, the C2ICM algorithm is based on the cover coefficient concept • C2ICM is suitable for dynamic environments where documents are added and deleted • With C2ICM, reclustering after each update is avoided

  9. C2ICM Algorithm Details • First, we compute the number of clusters and the cluster seed powers in the updated database • Then, we determine the newly added documents and the documents of falsified clusters
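The first step above can be sketched in Python. The slides do not give the formulas, so this sketch assumes the cover coefficient definitions from the Can and Ozkarahan (1990) reference for a binary document-term matrix: the decoupling coefficient of document i is delta_i = c_ii, the number of clusters is the sum of all delta_i, and the seed power is delta_i * (1 - delta_i) * (number of terms in document i). The names `seed_statistics` and `pick_seeds`, the rounding of the cluster count, and the tie handling are illustrative assumptions, not the authors' implementation.

```python
def seed_statistics(D):
    """Return (number of clusters, seed powers) for a document-term matrix.

    D[i][k] is the weight of term k in document i (binary in this sketch).
    Assumes every document contains at least one term.
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    deltas = []
    for i in range(m):
        # Decoupling coefficient: delta_i = c_ii
        #   = (1 / row_sum_i) * sum_k (d_ik^2 / col_sum_k)
        total = sum(D[i][k] ** 2 / col_sums[k] for k in range(n) if D[i][k])
        deltas.append(total / row_sums[i])
    n_c = sum(deltas)
    # Seed power (binary case, as assumed above): delta * (1 - delta) * row sum.
    powers = [d * (1.0 - d) * rs for d, rs in zip(deltas, row_sums)]
    return n_c, powers


def pick_seeds(D):
    """Pick the round(n_c) documents with the highest seed power (assumed rule)."""
    n_c, powers = seed_statistics(D)
    k = max(1, round(n_c))
    ranked = sorted(range(len(D)), key=lambda i: powers[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 4-document, 6-term collection with two obvious topics.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
n_c, powers = seed_statistics(D)
print(round(n_c, 2), pick_seeds(D))  # 2.33 [0, 2]
```

In this toy collection the estimated number of clusters is about 2.33, so two seeds are chosen, one from each topic. After an update, the same computation would be rerun on the updated matrix to obtain the new seed list.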

  10. C2ICM Algorithm Details • How do clusters become false? • When the seed document of a cluster becomes a non-seed or is deleted • When one or more non-seed documents of a cluster become seeds

  11. C2ICM Algorithm Details • We cluster these documents by assigning each one to the cluster of the seed that covers it most • Documents that do not belong to any cluster are grouped into the ragbag cluster
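The maintenance steps described on these slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `coverage(seed, doc)` is a hypothetical stand-in for the cover coefficient between a document and a seed, the new seed list is taken as given, and treating old ragbag documents as candidates for reassignment is also an assumption.

```python
def falsified_seeds(old_clusters, new_seeds, deleted):
    """Return the old seeds whose clusters are no longer valid.

    A cluster counts as falsified when its seed was deleted or lost seed
    status, or when one of its non-seed members became a seed.
    """
    seed_set = set(new_seeds)
    falsified = set()
    for seed, members in old_clusters.items():
        seed_gone = seed in deleted or seed not in seed_set
        member_promoted = any(d in seed_set for d in members if d != seed)
        if seed_gone or member_promoted:
            falsified.add(seed)
    return falsified


def incremental_update(old_clusters, ragbag, new_seeds, new_docs, deleted, coverage):
    """Recluster only the new documents and the members of falsified clusters."""
    seed_set = set(new_seeds)
    bad = falsified_seeds(old_clusters, new_seeds, deleted)
    # Valid clusters are kept as they are, minus any deleted documents.
    clusters = {s: [d for d in members if d not in deleted]
                for s, members in old_clusters.items() if s not in bad}
    # Every seed without a surviving cluster starts a fresh one.
    for s in new_seeds:
        clusters.setdefault(s, [s])
    # Documents to be clustered (old ragbag members included by assumption).
    pending = [d for s in bad for d in old_clusters[s]]
    pending += list(ragbag) + list(new_docs)
    pending = [d for d in pending if d not in deleted and d not in seed_set]
    new_ragbag = []
    for d in pending:
        best = max(new_seeds, key=lambda s: coverage(s, d))
        if coverage(best, d) > 0:
            clusters[best].append(d)   # seed that covers the document most
        else:
            new_ragbag.append(d)       # covered by no seed
    return clusters, new_ragbag


# Hypothetical data: term sets per document, with term overlap as coverage.
terms = {"a": {"x", "y"}, "b": {"x"}, "c": {"p", "q"}, "d": {"p"},
         "e": {"q"}, "f": {"z"}, "g": {"x", "y"}}
overlap = lambda s, d: len(terms[s] & terms[d])
clusters, rag = incremental_update(
    {"a": ["a", "b"], "c": ["c", "d", "e"]}, [], ["a", "d"], ["f", "g"], set(), overlap)
print(clusters, rag)
```

Here the cluster of seed "a" is untouched, while the cluster of "c" is falsified because "c" lost seed status and "d" was promoted. Only its members and the two new documents are reassigned.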

  12. C2ICM: An example • Current state of the clusters • Seed list: d1, d6, d12 • [Figure: documents d1 to d19 arranged into clusters, plus a ragbag cluster]

  13. C2ICM: CASE 1 • When a seed document becomes nonseed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d13, d19 • New documents arrived: d20, d21, d22 • [Figure: existing documents d1 to d19 and the new documents, marking the set of documents to be clustered]

  14. C2ICM: CASE 1 • Seed document d12 becomes nonseed • New seed list: d1, d6, d13, d19 • [Figure: documents d1 to d22, marking the set of documents to be clustered]

  15. C2ICM: CASE 1 • Final clusters • New seed list: d1, d6, d13, d19 • No elements remaining in the ragbag cluster • [Figure: documents d1 to d22 arranged into the final clusters]

  16. C2ICM: CASE 2 • When a nonseed document in a cluster becomes seed • Old seed list: d1, d6, d12 • New seed list: d1, d6, d12, d14 • New documents arrived: d20, d21, d22 • [Figure: existing documents d1 to d19 and the new documents, marking the set of documents to be clustered]

  17. C2ICM: CASE 2 • Nonseed document d14 becomes seed • New seed list: d1, d6, d12, d14 • [Figure: documents d1 to d22 with a "Becomes new seed" marker, marking the set of documents to be clustered]

  18. C2ICM: CASE 2 • Final clusters • New seed list: d1, d6, d12, d14 • No elements remaining in the ragbag cluster • [Figure: documents d1 to d22 arranged into the final clusters, with a "Becomes new seed" marker]
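The seed lists from the two cases can be compared directly to see which event triggers the update. The sketch below uses only the seed lists shown on the slides; the helper name `seed_changes` is hypothetical, and cluster membership is left out because it cannot be read reliably from the flattened figures.

```python
def seed_changes(old_seeds, new_seeds):
    """Return (documents that lost seed status, documents that became seeds)."""
    old, new = set(old_seeds), set(new_seeds)
    return sorted(old - new), sorted(new - old)


old_seeds = ["d1", "d6", "d12"]

# Case 1: a seed document becomes nonseed.
case1 = seed_changes(old_seeds, ["d1", "d6", "d13", "d19"])
print(case1)  # (['d12'], ['d13', 'd19'])

# Case 2: a nonseed document becomes seed.
case2 = seed_changes(old_seeds, ["d1", "d6", "d12", "d14"])
print(case2)  # ([], ['d14'])
```

In Case 1 the first list is non-empty, so the cluster of d12 is falsified. In Case 2 no seed is lost, but the cluster that contains d14 is falsified because one of its non-seed members was promoted.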

  19. A Former Implementation of C2ICM for Very Large Datasets • C2ICM was implemented as two programs (VS Pascal) • Program I selects the seeds • Program II clusters the documents using the C2ICM algorithm • The programs communicate by exchanging files • [Figure: Program I (Seed Selector) and Program II (C2ICM) connected through text files; labels: documents, clusters]

  20. Former Experiments • C2ICM was tested in 1995 with a subset of the MARIAN database (~43K documents) • Six experiments were run; each incremental update added ~6K documents to an initially clustered collection of a different size

  21. Results for the Former Experiments • C2ICM provided time savings over reclustering • The clusters formed with C2ICM were very similar to the clusters formed with C3M

  22. Conclusion • The cluster maintenance problem is challenging • Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e. millions of documents) • The HARD dataset will be used for evaluation, and information retrieval performance will be measured • The implementation of C2ICM must be time and memory efficient

  23. References • Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December 1990), pp. 483-517. • Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April 1993), pp. 143-164. • Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114. • Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September 1999), pp. 264-323.

  24. Questions?