1 / 1

4.2 User Interface GUI interface to select a data source, set a parameter.

MapDupReducer : Detecting Near Duplicates over Massive Datasets. Chaokun Wang 1 Jianmin Wang 1 Xuemin Lin 2 Wei Wang 2 Haixun Wang 3 Hongsong Li 3 Wanpeng Tian 1 Jun Xu 41 Rui Li 1. 1 School of Software, Tsinghua University

shamus
Download Presentation

4.2 User Interface GUI interface to select a data source, set a parameter.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MapDupReducer: Detecting Near Duplicates over Massive Datasets Chaokun Wang1Jianmin Wang1Xuemin Lin2 Wei Wang2Haixun Wang3Hongsong Li3Wanpeng Tian1 Jun Xu41Rui Li1 1School of Software, Tsinghua University Key Laboratory for Information System Security, Ministry of Education Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China 2School of Computer Science and Engineering, University of New South Wales & NICTA, Sydney, Australia 3Microsoft Research Asia 4Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China • 1 Introduction • Challenges • Very expensive to detect near duplicates exactly in a large data set. • Non-trivial to deal with NDD in parallel fashion. • 4.2 User Interface • GUI interface to select a data source, set a parameter. • The beginning of a document is shown as the tooltip. • The differences between two duplicates are shown in color. • 4.3 An Example • Given a set of documents and a Jaccardthreshold t = 0.8. • Job1: • IN: (x, “C D E F”) (y, “B C D E F”) (z, “A B C D F”), • OUT: (C,1) (D,1) (E,1) (F,1) (B,1) (C,1) (D,1) (E,1) (F,1) (A,1) (B,1) (C,1) (D,1) (F,1) • IN: (A, {1}) (B, {1 1})(C, {1 1 1}) (D, {1 1 1}) (E, {1 1}) (F, {1 1 1}) • OUT: (A,1) (B,2) (C,3) (D,3) (E,2) (F,3) • Job2: • IN: (x, “C D E F”) (y, “B C D E F”) (z, “A B C D F”) • OUT: (E,x@1) (B,y@1) (E,y@2) (A,z@1) (B,z@2) • IN: (A,{z@1}) (B,{y@1 z@2}) (E,{x@1 y@2}) • OUT: ((y,z,01), 1#2) ((x,y,01), 1#2) • Job3:Mapper: A special IdentityMapper • IN: ((y,z), 01), {1#2}; ((x,y), 01), {1#2} • OUT: (x,y) • Job4: fetches the corresponding pre-processed documents and compares them to generate the final NDD results. X = x1; x2; . . . ; xn is a document set where each xi is a document. The task of Near Duplicate Detection (NDD) is to find all of the pairs (xi; xj) such that the similarity between xi and xj is not smaller than a given threshold (typically close to 1). • 2 Related Work • The MapReduce framework is a mature platform with distinctive features such as ease of use and high fault-tolerance. • map(k,v)  list(k1,v1) • reduce(k1, list(v1))  v2 • Approximate String Join • Tables R1 and R2 with string attributes Ai and Aj respectively, an threshold k {(t, t΄)|(t, t΄)R1R2, • dist (R2.Ai(t), R2.Aj(t΄)) ≤k}. • PPJoin • an extension to the All-Pairs algorithm • combines positional filtering and prefix-filtering { 3 System Architecture { { • The system MapDupReducer consists of two parts: • the user application • the server engine • MapDupReducer is implemented on top of the Hadoopv?.?.?. { { 4 Our Approach 4.1 Near Duplicate Detection by the PPJoin Paradigm in MapReduce 5 Preliminary Experimental Results Four MapReduce jobs for NDD • Job1: Compute the total order • Job2: Prefix filter • Job3: Position filter • Job4: Verification (Jaccard similarity) Time Cost (s) The size of the input document set

More Related