
JJE: INEX XML Competition



Presentation Transcript


  1. JJE: INEX XML Competition • Bryan Clevenger, James Reed, Jon McElroy

  2. Introduction • Deal with the large size of the internet by using better categorization techniques • Goal: Optimize search time by grouping pages into clusters • Wikipedia is the data source

  3. Problem • Take the Wikipedia data and create an algorithm that groups it into clusters. • Clustering reduces the search space for related information.

  4. Solution • If documents contain several similar links, they likely contain similar data. • Focused on the link data set. • Link data: 39484 2039 4952 1029 39 1920 10233 30197
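
The intuition on this slide can be illustrated by counting the link targets two documents have in common. This is only a sketch of the idea, not the method the presenters used (they cluster with max-flow/min-cut); the class name `LinkOverlap`, the method `sharedLinks`, and the list-of-ids input are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not from the slides.
class LinkOverlap {
    // Returns how many distinct link ids appear in both documents' link lists.
    static int sharedLinks(List<Integer> linksA, List<Integer> linksB) {
        Set<Integer> shared = new HashSet<>(linksA);
        shared.retainAll(new HashSet<>(linksB)); // keep only ids present in both
        return shared.size();
    }
}
```

A higher count suggests two documents point at the same neighborhood of pages, which is the signal the slide relies on.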

  5. Overall solution • Determine sub-communities in the graph using Max-Flow/Min-Cut community discovery • Heuristics are used to find relevant seeds

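  6. Max Flow – Min Cut • Edge capacity – similar to an edge weight. Represents the “amount” of information that can be pushed along an edge. • Flow – the amount pushed from one node to another. A single path carries at most its minimum edge capacity, and paths that share an edge must share that edge's capacity. By the max-flow min-cut theorem, the maximum flow equals the capacity of the smallest cut separating the two nodes.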

  7. Max Flow – Min Cut (cont.) • The flow between two nodes in the same cluster should be larger than the flow between two nodes in separate clusters.
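
The capacity and flow ideas above can be sketched with a small breadth-first augmenting-path (Edmonds-Karp style) max-flow routine. The slides do not say which max-flow algorithm the presenters implemented, so this is an assumption, as are the class name `MaxFlowSketch` and its methods. After the flow is computed, the nodes still reachable from the source in the residual graph form the source side of a minimum cut, which is one way to read off a community.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the presenters' code.
class MaxFlowSketch {
    // Residual capacities: node -> (neighbor -> remaining capacity).
    final Map<Integer, Map<Integer, Integer>> residual = new HashMap<>();

    void addEdge(int from, int to, int capacity) {
        residual.computeIfAbsent(from, k -> new HashMap<>()).merge(to, capacity, Integer::sum);
        // The reverse edge starts at 0 so that pushed flow can later be undone.
        residual.computeIfAbsent(to, k -> new HashMap<>()).merge(from, 0, Integer::sum);
    }

    // Breadth-first search over edges with remaining capacity.
    // Returns a map from each reached node to the node it was reached from.
    private Map<Integer, Integer> bfs(int source) {
        Map<Integer, Integer> parent = new HashMap<>();
        parent.put(source, source);
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            Map<Integer, Integer> out = residual.get(u);
            if (out == null) continue; // node has no edges
            for (Map.Entry<Integer, Integer> e : out.entrySet()) {
                if (e.getValue() > 0 && !parent.containsKey(e.getKey())) {
                    parent.put(e.getKey(), u);
                    queue.add(e.getKey());
                }
            }
        }
        return parent;
    }

    int maxFlow(int source, int sink) {
        if (source == sink) return 0;
        int total = 0;
        while (true) {
            Map<Integer, Integer> parent = bfs(source);
            if (!parent.containsKey(sink)) return total; // no augmenting path left
            // A path carries at most its minimum remaining edge capacity.
            int bottleneck = Integer.MAX_VALUE;
            for (int v = sink; v != source; v = parent.get(v)) {
                bottleneck = Math.min(bottleneck, residual.get(parent.get(v)).get(v));
            }
            // Push the bottleneck along the path and credit the reverse edges.
            for (int v = sink; v != source; v = parent.get(v)) {
                int u = parent.get(v);
                residual.get(u).merge(v, -bottleneck, Integer::sum);
                residual.get(v).merge(u, bottleneck, Integer::sum);
            }
            total += bottleneck;
        }
    }

    // Call after maxFlow: nodes still reachable from the source lie on the
    // source side of a minimum cut.
    Set<Integer> sourceSide(int source) {
        return new HashSet<>(bfs(source).keySet());
    }
}
```

On a graph made of two tightly linked groups joined by a single low-capacity edge, the flow between nodes in different groups is capped by that edge, while the flow inside a group is not, which matches the claim on this slide.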

  8. Max Flow – Min Cut (cont.)

  9. Max-Flow Community Discovery

  10. Implementation

  11. Implementation (Parsing) • Links parsed into a Graph. • Graph: HashMap<Integer, HashMap<Integer,Integer>> • Maps each Document Id to a HashMap of Link Ids to Capacity. • A Links structure was also created: Links[0] = 3244,2645,791; Links[1] = 10293,432,2,1230; ...; Links[max] = 1012
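
A minimal sketch of how such a graph might be built is shown below. The slides do not give the input file format or say how capacities were assigned, so two assumptions are made here: each input line is a document id followed by whitespace-separated link ids, and every occurrence of a link adds 1 to that edge's capacity. The class name `LinkGraphParser` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;

// Hypothetical sketch, not the presenters' parser.
class LinkGraphParser {
    // Assumed line format (not confirmed by the slides): "<docId> <linkId> <linkId> ..."
    static HashMap<Integer, HashMap<Integer, Integer>> parse(List<String> lines) {
        HashMap<Integer, HashMap<Integer, Integer>> graph = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens[0].isEmpty()) continue; // skip blank lines
            int docId = Integer.parseInt(tokens[0]);
            HashMap<Integer, Integer> links = graph.computeIfAbsent(docId, k -> new HashMap<>());
            for (int i = 1; i < tokens.length; i++) {
                // Assumed capacity rule: each occurrence of a link adds 1.
                links.merge(Integer.parseInt(tokens[i]), 1, Integer::sum);
            }
        }
        return graph;
    }
}
```

The nested map keeps capacity lookups for a given document and link at constant expected time, which matters on a graph with millions of links.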

  12. Implementation (Initialization of Community Seeds) • Using the Links structure, a percentage of the nodes with the most links is chosen as seeds
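
Seed selection along these lines could look like the sketch below. The slide does not state the percentage used or whether "links" means outgoing or incoming links, so the code assumes outgoing link counts and takes the fraction as a parameter. `SeedSelector` and `topSeeds` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Hypothetical sketch, not the presenters' code.
class SeedSelector {
    // Returns the given fraction (0.0 to 1.0) of document ids with the most outgoing links.
    static List<Integer> topSeeds(HashMap<Integer, HashMap<Integer, Integer>> graph, double fraction) {
        List<Integer> ids = new ArrayList<>(graph.keySet());
        // Most links first; ties are broken by smaller id so the result is deterministic.
        ids.sort((a, b) -> {
            int byLinks = Integer.compare(graph.get(b).size(), graph.get(a).size());
            return byLinks != 0 ? byLinks : Integer.compare(a, b);
        });
        int count = (int) Math.ceil(ids.size() * fraction);
        count = Math.max(0, Math.min(count, ids.size())); // clamp to a valid range
        return new ArrayList<>(ids.subList(0, count));
    }
}
```

Sorting every node is acceptable at the scale reported in the results (tens of thousands of nodes); a bounded priority queue would avoid the full sort on much larger graphs.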

  13. Implementation (Finding Communities) • Idea, why it didn’t work? • robots

  14. Implementation (Visualization) • Walrus is an interactive 3D visualization tool that works on large directed graphs. • Input and output parsing. • Clusters are grouped by color.

  15. Results • The INEX links data was composed of 54,000 nodes and 15 million links • Average running time to retrieve one cluster on a Dell dual-core 2.0 GHz Pentium laptop was 5.9 hours • Cluster sizes ranged from 2K to 2.5K

  16. Results • Visual Images of clusters

  17. Conclusion • It worked... kinda. • Looks great! • See pretty pictures.

  18. References [1] INEX 2009 mining track. http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp, October 2009. [2] The standard maximum flow problem. http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=maxFlow, November 2009. [3] Walrus - graph visualization tool. http://www.caida.org/tools/visualization/walrus, December 2009. [4] Mark C. Chu-Carroll. Maximum flow and minimum cut. http://scienceblogs.com/goodmath/2007/08/maximum_flow_and_minimum_cut_1.php, December 2009. [5] Ford-Fulkerson algorithm. http://en.wikipedia.org/wiki/FordFulkersos_algorithm, October 2009. [6] Max-flow min-cut theorem. http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem, November 2009.

  19. Questions? • O really?
