Explore practical applications and strategies for utilizing Hadoop in various tasks, from web crawling to text annotation and more. Learn tricks, pitfalls, and unresolved challenges in efficiently harnessing the power of Hadoop in real-world scenarios.
Hadoop (MapReduce) in the Wild: Our Current Understandings & Uses of Hadoop
Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan
Presenter: Le Zhao
2008-05-08
The REAP Project
• REAP is an intelligent tutor for English language learning.
• Intelligent tutors often use student models to generate individualized instruction for each student.
• REAP cannot generate texts, but it can recognize texts:
  • POS tagging, NE tagging, search, categorization, …
• Example document-filtering query (Indri syntax):
  #filreq( #band( #greater( textquality 85 ) #greater( readinglevel 2 )
                  #less( readinglevel 9 ) #less( doclength 1001 ) )
           #combine( advocate evidence destroy propose … acceptance ) )
• So, REAP needs a very large database of annotated texts.
• The previous approach to collecting texts didn't scale well:
  • Gathered ~1 M high-quality docs in ~1 year.
  • Typical yield rate <1% for docs fitting the tutoring criteria.
Tasks Done on Hadoop
• Web crawling of 200 million Web documents:
  • Two web crawls of 100 million web pages each.
• Text annotation and categorization of Web pages:
  • Part-of-speech, named-entity, sentence breaking.
  • Reading level, text quality, topic classification.
  • Output: 6 TB + 42 GB of offset annotations.
• Filtering documents according to text quality:
  • Output: 6 million high-quality HTML documents (114 GB).
• Generating graph structure & PageRank:
  • Class project.
Getting Started With Hadoop Quickly
• Hadoop Streaming has been our most important tool for porting legacy tasks and tools to Hadoop.
• It runs any program that reads STDIN and writes STDOUT.
  • No need to recompile or relink against Hadoop libraries.
• For one-file-per-record streaming, it is not the most efficient implementation:
  • Poor data locality.
• But it is very efficient in human time:
  • A day or two to get something running on 100 nodes.
• A minimal invocation is sketched below.
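A minimal map-only invocation might look like the following (a sketch, not our exact job: the streaming jar path varies by Hadoop version, and the input/output paths are illustrative):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -input   filename-lists/  \
      -output  annotation-logs/ \
      -mapper  map.pl           \
      -reducer NONE             \
      -file    map.pl

Here -reducer NONE runs a map-only job, and -file ships the script to every node.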
Hadoop in the Wild: Trick #1
• Q: My annotator takes file input, not STDIN.
• Solution: still Hadoop Streaming.
  • Prepare a list of filenames.
  • Distribute the filenames instead of the file contents.
  • map.pl (see the sketch after this list):
    • takes one filename
    • downloads the file from HDFS (Hadoop Distributed File System)
    • applies the annotator
    • uploads the resulting files to HDFS
  • No reducer needed.
• Any data-distributive program can be ported onto Hadoop in a day.
• Efficient enough for computation-intensive tasks, even with low data locality.
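A sketch of such a map.pl, assuming the 0.x-era "hadoop dfs" CLI; the annotator name and the output path are placeholders, not our actual tool:

  #!/usr/bin/perl -w
  # map.pl: each input record is one HDFS path; no reducer needed.
  use strict;

  while ( my $path = <STDIN> ) {
      chomp $path;
      next unless $path;
      my ($name) = $path =~ m{([^/]+)$};    # local file name

      # Pull the file from HDFS onto the node's local disk.
      system("hadoop dfs -get $path $name") == 0 or next;

      # Run the legacy, file-based annotator (name is a placeholder).
      system("./annotate $name $name.tagged") == 0 or next;

      # Push the result back into HDFS.
      system("hadoop dfs -put $name.tagged /annotated/$name.tagged") == 0
          or warn "upload failed for $path\n";

      print "$path\tdone\n";    # map output doubles as a progress log
  }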
Trick #2
• Q: My annotator is a directory of programs, but Hadoop Streaming only accepts files.
• Solution: still Hadoop Streaming.
  • Make a tar ball of your directory of programs.
  • map.pl extracts the tar ball and launches the program (see the sketch below).
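The unpacking step is a few lines at the top of map.pl (the archive and entry-point names are placeholders; the tar ball is shipped with the job, e.g., via the -file option):

  # Before the input loop: unpack the directory of programs once.
  system("tar -xzf annotator.tar.gz") == 0
      or die "cannot unpack annotator.tar.gz\n";
  # Then launch the extracted entry point as before:
  # system("./annotator/run.sh $name $name.tagged");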
Trick #3
• Q: Hadoop programs run on backend nodes and are difficult to debug.
• Use STDERR for debugging (see the fragment below).
• Also, if using HOD for managing the cluster:
  • View STDERR through the Web monitoring interface.
  • See the time spent on each Map/Reduce task.
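In a streaming script this is just a matter of writing to STDERR, which is captured in the task logs separately from the job's output (a fragment; $path comes from the surrounding loop):

  print STDERR "map.pl: processing $path\n";   # appears in the task's log
  warn "skipping $path: download failed\n";    # warn also writes to STDERR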
Pitfall #1: It's All a Matter of Balance
• For higher performance, it is important to have the right balance between Map & Reduce tasks.
• The default number of Map/Reduce processes per node is 2.
  • But some multicore / multiprocessor nodes can easily handle more (e.g., 6 on M45).
• There is no good way to determine the right balance, except by parameter sweeps (the per-node limits are set as shown below).
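In 0.x-era Hadoop, the per-node task limits live in hadoop-site.xml (the values here are illustrative, not a recommendation):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>    <!-- default is 2 -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>6</value>
  </property>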
Pitfall #2: Things Die, No Idea Why
• Fault tolerance and diagnosis:
  • If a Reduce task becomes unresponsive, it is killed.
    • E.g., if it is overwhelmed with work.
    • E.g., if its sort step is overwhelmed with work.
  • Diagnosing the cause of an unresponsive Reduce process is not always easy.
  • Sometimes solved by increasing the number of reducers.
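One mitigation in streaming jobs: a long-running reducer can report that it is still alive by writing reporter:status lines to STDERR, a Hadoop Streaming convention that updates the task's status (a sketch; the record-counting logic is illustrative):

  # Inside a streaming reducer's input loop:
  my $count = 0;
  while (<STDIN>) {
      $count++;
      print STDERR "reporter:status:processed $count records\n"
          if $count % 100_000 == 0;
      # ... actual reduce work ...
  }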
Unsolved Problems
• Monitoring the cluster for diagnostics:
  • CPU, network, disk I/O, swap, etc.
  • There is a Simon web interface, but it is not working.
• HOD (or Torque?) does not allow scheduling and prioritizing jobs.
• Reduce happens on a few nodes; the waiting time for the other idle nodes can be long.
• Shuffle & sort is opaque: yet another black box.
Thanks! Comments? Ideas?