
Large-Scale Data Processing / Cloud Computing, Lecture 4 – Word Co-occurrence Matrix



  1. Large-Scale Data Processing / Cloud Computing, Lecture 4 – Word Co-occurrence Matrix. Peng Bo, School of EECS, Peking University. 7/10/2014. http://net.pku.edu.cn/~course/cs402/. Jimmy Lin, University of Maryland. SEWM Group. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

  2. WordCount Review

  3. Problems & Solutions • Mac OS • Configuration notes for Mac OS X, by Xin Lv • Eclipse • Notes on connecting Eclipse 3.7 Indigo to Hadoop, by 朱瑜坚 • Linux • Notes on manually configuring and running Hadoop under Linux, by Haoyan Huo • VMPlayer • Not yet available :)

  4. Homework Submission • What to hand in • Please pack the ACCEPTED source code from the online evaluation into a single rar/tar.gz file, name it "assign1-YourPinYinName.rar" or "assign1-YourPinYinName.tar.gz", and send the package to our TA by email (cs402.pku AT gmail.com) with "CS40214-Assign1-YourPinYinName" as the subject.

  5. Changping11 Usage Rules • hadoop.job.ugi = YourName, cs402 • Input data is under /public • Data you upload yourself goes in your own personal directory • Output data must go in your personal directory under /cs402: /cs402/YourName • Do not use the default /user/YourName

  6. Streaming for Python Programmers

  7. Hadoop Streaming • Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

  8. How Does Streaming Work • Both the mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout. • By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
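Because the contract is simply "read lines from stdin, write key<TAB>value lines to stdout", a streaming mapper or reducer can be written in any language. Purely as an illustrative sketch of that contract (the class name and word-count logic are not part of the course code), a mapper obeying it looks like this:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal streaming-style mapper: reads text lines from stdin and
// emits one "word<TAB>1" record per token on stdout.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    System.out.println(token + "\t1");  // key, tab, value
                }
            }
        }
    }
}
```

Hadoop Streaming would run such an executable as the mapper, feed each input split to it on stdin, and split its stdout back into key/value pairs using the tab rule above.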

  9. More Features • Specifying Other Plugins for Jobs • -inputformat JavaClassName • -outputformat JavaClassName • -partitioner JavaClassName • -combiner JavaClassName • Specifying Additional Configuration Variables for Jobs • Customizing the Way to Split Lines into Key/Value Pairs • A Useful Partitioner Class • A Useful Comparator Class • Working with the Hadoop Aggregate Package

  10. Debug in Hadoop

  11. What Constitutes Progress in MapReduce? • Hadoop will not fail a task that's making progress (a task that reports no progress for longer than the task timeout, mapred.task.timeout, 10 minutes by default, is marked as failed). The following all count as progress: • Reading an input record (in a mapper or reducer) • Writing an output record (in a mapper or reducer) • Setting the status message (using Context's setStatus() method) • Incrementing a counter (using Context's getCounter().increment() method) • Calling Reporter's progress() method

  12. Counters & Status Messages • Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics. • Status messages let a task report what it is currently doing; they are visible in the web UI, and setting one also counts as progress (see the sketch below).
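A hedged sketch of how a mapper might use both facilities with the new (org.apache.hadoop.mapreduce) API; the counter enum and the counting logic are illustrative, not taken from the assignment:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Application-level counters appear in the job's counter table in the web UI.
    enum Records { EMPTY_LINES, TOKENS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            context.getCounter(Records.EMPTY_LINES).increment(1);  // counts as progress
            return;
        }
        for (String token : line.split("\\s+")) {
            context.getCounter(Records.TOKENS).increment(1);
            context.write(new Text(token), new LongWritable(1));
        }
        // Status messages show up in the web UI and also count as progress.
        context.setStatus("processed line at offset " + key.get());
    }
}
```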

  13. Hadoop Logs • MapReduce task logs • Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). • accessible through the web UI

  14. 'wordcount': How does it work?

  15. [Figure: WordCount data flow – each mapper emits (key, value) pairs, Shuffle and Sort aggregates values by key, and the reducers produce the final counts.]

  16. “Hello World”: Word Count

  17. But, in a real system... • How is user code injected into a running system? • Job submission • Mapper & reducer class instantiation • Reading/writing data in the mapper & reducer

  18. Implementation in Hadoop • Job submission • Mapper & reducer class instantiation • Reading/writing data in the mapper & reducer

  19. Hadoop Cluster

  20. Job Submission Process

  21. Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2). • Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program. • Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

  22. Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. • The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3). • Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
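All of the steps above happen inside the job client once the driver program submits the job. For orientation, here is a minimal driver sketch that would trigger this process, written against the new (org.apache.hadoop.mapreduce) API; WordCountMapper and WordCountReducer are hypothetical class names, not the course's reference code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);              // the job JAR copied in step 3
        job.setMapperClass(WordCountMapper.class);             // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);           // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // used to compute input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        // waitForCompletion() submits the job (steps 2-4 above) and polls its progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```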

  23. InputSplits • An input split is a chunk of the input that is processed by a single map. • Each map processes a single split. • Each split is divided into records, and the map processes each record—a key-value pair—in turn. • Splits and records are logical: nothing requires them to correspond to files or HDFS blocks.

  24. InputFormat • An InputFormat is responsible for creating the input splits and dividing them into records.
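In the new API this responsibility is captured by two abstract methods; the sketch below paraphrases the org.apache.hadoop.mapreduce.InputFormat contract rather than reproducing the full source:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Paraphrased sketch of the InputFormat contract.
public abstract class InputFormat<K, V> {

    // Logically split the job's input; each InputSplit is later
    // assigned to one map task.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Create a RecordReader that turns one split into (key, value)
    // records for the map task to consume.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
```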

  25. Mapper’s run() method
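The run() method drives each map task: setup() once, then map() for every record the RecordReader delivers, then cleanup(). A sketch paraphrased from the body of org.apache.hadoop.mapreduce.Mapper (newer releases wrap the loop in try/finally):

```java
// Paraphrased from org.apache.hadoop.mapreduce.Mapper.run().
public void run(Context context) throws IOException, InterruptedException {
    setup(context);                    // called once per task
    while (context.nextKeyValue()) {   // pulls records from the RecordReader
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);                  // called once at the end of the task
}
```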

  26. InputFormat Class Hierarchy

  27. Serialization • Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. • Deserialization is the reverse process of turning a byte stream back into a series of structured objects. • In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

  28. The Writable Interface
  public interface Writable {
      void write(DataOutput out) throws IOException;
      void readFields(DataInput in) throws IOException;
  }
  public interface WritableComparable<T> extends Writable, Comparable<T>
  • A Writable which is also Comparable.
  • public int compareTo(WritableComparable w) {}
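To see the contract in action, here is a hedged round-trip helper (the class and method names are illustrative): it serializes a Writable to a byte array via write() and reads it back via readFields(), which is essentially what Hadoop does when it ships keys and values between nodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {

    // Serialize a Writable into a byte array using its write() method.
    static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        w.write(new DataOutputStream(out));
        return out.toByteArray();
    }

    // Populate a Writable from a byte array using its readFields() method.
    static void deserialize(Writable w, byte[] bytes) throws IOException {
        w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(163);
        byte[] bytes = serialize(original);   // 4 bytes for an IntWritable
        IntWritable copy = new IntWritable();
        deserialize(copy, bytes);
        System.out.println(copy.get());        // prints 163
    }
}
```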

  29. Word Co-occurrence

  30. Tasks • Do word co-occurrence analysis on the Shakespeare Collection and the AP Collection, which are under the directories /public/Shakespeare and /public/AP of our sewm cluster (or your own virtual cluster). By default you will get one line of text as the input to process in the map function. (80 points) • Try to optimize your program, and find the fastest version. Write up your approaches and evaluation in your report. (20 points) • Analyze the resulting data matrix and find something interesting. (10 points bonus) • Write a report describing your approach to each task, the problems you met, etc.

  31. co-occurrence • Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.[1] From Wikipedia, the free encyclopedia

  32. Input Data

  33. Pairs vs. Stripes
  • Pairs representation: (a, b) → 1; (a, c) → 2; (a, d) → 5; (a, e) → 3; (a, f) → 2
  • Equivalent stripe: a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
  • Element-wise sum of stripes: a → { b: 1, d: 5, e: 3 } + a → { b: 1, c: 2, d: 2, f: 2 } = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
  • Key: a cleverly-constructed data structure brings together partial results
  • Idea: group pairs into an associative array. Each mapper takes a sentence, generates all co-occurring term pairs, and for each term emits a → { b: count_b, c: count_c, d: count_d, … }. Reducers perform an element-wise sum of associative arrays.

  34. Pairs • Customized KEY • (a, b) as a TextPair that implements WritableComparable<> (see the sketch below) • Customized Partitioner • all (a, b), (a, c), (a, f), i.e. (a, *), go to the same Reducer • The default partitioner, HashPartitioner, uses the key's hashCode() method
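A hedged sketch of such a key class. TextPair here is illustrative, modeled on the common pattern of wrapping two Text fields in a WritableComparable; it is not the course's reference solution:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// A pair of words usable as a MapReduce key: serializable via Writable,
// sortable via Comparable (first by left word, then by right word).
public class TextPair implements WritableComparable<TextPair> {
    private Text first = new Text();
    private Text second = new Text();

    public TextPair() {}

    public TextPair(String first, String second) {
        this.first.set(first);
        this.second.set(second);
    }

    public Text getFirst()  { return first; }
    public Text getSecond() { return second; }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int compareTo(TextPair other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public int hashCode() {             // used by the default HashPartitioner
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TextPair)) return false;
        TextPair other = (TextPair) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }
}
```

Because compareTo() orders first by the left word and then by the right word, all (a, *) pairs arrive at the reducer sorted together once the partitioner below routes them to the same partition.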

  35. [Figure: co-occurrence data flow with combiners and partitioners – on each mapper, map → combine → partition; Shuffle and Sort aggregates values by key; then reduce.]

  36. Partitioner
  public abstract class Partitioner<KEY, VALUE> {
      public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }
  • Job setup: job.setPartitionerClass(UserPartitioner.class);
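For the pairs approach, a hedged partitioner sketch (assuming the illustrative TextPair above) that routes every (a, *) key to the same reducer by hashing only the left word:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the left element only, so (a,b), (a,c), ..., (a,*) all
// reach the same reducer regardless of the right element.
public class FirstWordPartitioner extends Partitioner<TextPair, IntWritable> {
    @Override
    public int getPartition(TextPair key, IntWritable value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be enabled with job.setPartitionerClass(FirstWordPartitioner.class), exactly as in the slide above (UserPartitioner there is just a placeholder name).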

  37. Comparators • Comparable<T> • compareTo() • Comparator • RawComparator<T> • WritableComparator • job.setSortComparatorClass • controls how map() output keys are sorted • job.setGroupingComparatorClass • controls which keys of the shuffled data are grouped together into a single call to reduce()
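A hedged grouping-comparator sketch for the pairs pattern (again assuming the illustrative TextPair): comparing only the left word makes all (a, *) keys fall into one reduce() call, while the full-key sort comparator still controls the order in which they arrive.

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Group keys by their left word only; enabled with
// job.setGroupingComparatorClass(FirstWordGroupingComparator.class).
public class FirstWordGroupingComparator extends WritableComparator {

    protected FirstWordGroupingComparator() {
        super(TextPair.class, true);   // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TextPair p1 = (TextPair) a;
        TextPair p2 = (TextPair) b;
        return p1.getFirst().compareTo(p2.getFirst());
    }
}
```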

  38. Stripes • Associative array: map → MapWritable? • Caution: • make sure the JVM heap is big enough to hold the per-term associative arrays • set mapred.child.java.opts (the default is -Xmx200m)
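A hedged stripes sketch using MapWritable; the class names and the simple whitespace tokenization are illustrative. The mapper emits one stripe per term per line, and the reducer performs the element-wise sum (the reducer can also be reused as a combiner, which is one of the usual optimizations worth trying for the assignment):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesCooccurrence {

    // For each term in a line, emit term -> { neighbor: count, ... }.
    public static class StripesMapper
            extends Mapper<LongWritable, Text, Text, MapWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().trim().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                MapWritable stripe = new MapWritable();
                for (int j = 0; j < terms.length; j++) {
                    if (i == j) continue;                  // skip the term itself
                    Text neighbor = new Text(terms[j]);
                    IntWritable count = (IntWritable) stripe.get(neighbor);
                    stripe.put(neighbor,
                            new IntWritable(count == null ? 1 : count.get() + 1));
                }
                if (!stripe.isEmpty()) {
                    context.write(new Text(terms[i]), stripe);
                }
            }
        }
    }

    // Element-wise sum of all stripes for a term.
    public static class StripesReducer
            extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<MapWritable> values, Context context)
                throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();
            for (MapWritable stripe : values) {
                for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                    IntWritable current = (IntWritable) sum.get(e.getKey());
                    int add = ((IntWritable) e.getValue()).get();
                    sum.put(e.getKey(),
                            new IntWritable(current == null ? add : current.get() + add));
                }
            }
            context.write(key, sum);
        }
    }
}
```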

  39. Q&A
