Explore practical applications and strategies for utilizing Hadoop in various tasks, from web crawling to text annotation and more. Learn tricks, pitfalls, and unresolved challenges in efficiently harnessing the power of Hadoop in real-world scenarios.
Hadoop (MapReduce) in the Wild: Our Current Understandings & Uses of Hadoop
Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan
Presenter: Le Zhao
2008-05-08
The REAP Project
• REAP is an intelligent tutor for English language learning.
• Intelligent tutors often use student models to generate individualized instruction for each student.
• REAP cannot generate texts, but it can recognize texts:
  • POS tagging, NE tagging, search, categorization, …
• Example document-filtering query (Indri syntax):
  #filreq( #band( #greater( textquality 85 ) #greater( readinglevel 2 )
                  #less( readinglevel 9 ) #less( doclength 1001 ) )
           #combine( advocate evidence destroy propose … acceptance ) )
• So, REAP needs a very large database of annotated texts.
• The previous approach to collecting texts didn't scale well:
  • Gathered ~1 M high-quality docs in ~1 year.
  • Typical yield rate <1% for docs fitting the tutoring criteria.
Tasks Done on Hadoop
• Web crawling of 200 million Web documents:
  • Two web crawls of 100 million web pages each.
• Text annotation and categorization of Web pages:
  • Part-of-speech, named-entity, sentence breaking.
  • Reading level, text quality, topic classification.
  • Output: 6 TB + 42 GB of offset annotations.
• Filtering documents according to text quality:
  • Output: 6 million high-quality HTML documents (114 GB).
• Generating graph structure & PageRank:
  • Class project.
Getting Started With Hadoop Quickly
• Hadoop Streaming has been our most important tool for porting legacy tasks and tools to Hadoop.
• It runs any program that reads STDIN and writes STDOUT.
  • No need to recompile or relink against Hadoop libraries.
• For one-file-per-record streaming, it is not the most efficient implementation:
  • Poor data locality.
• But it is very efficient in human time:
  • A day or two to get something running on 100 nodes.
• A minimal invocation is sketched below.
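A minimal map-only invocation might look like the following (a sketch, not our exact job: the streaming jar path varies by Hadoop version, and the input/output paths are illustrative):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -input   filename-lists/  \
      -output  annotation-logs/ \
      -mapper  map.pl           \
      -reducer NONE             \
      -file    map.pl

Here -reducer NONE runs a map-only job, and -file ships the script to every node.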
Hadoop in the Wild: Trick #1
• Q: My annotator takes file input, not STDIN.
• Solution: still Hadoop Streaming.
  • Prepare a list of filenames.
  • Distribute the filenames instead of the file contents.
  • map.pl (see the sketch after this list):
    • takes one filename
    • downloads the file from HDFS (Hadoop Distributed File System)
    • applies the annotator
    • uploads the resulting files to HDFS
  • No reducer needed.
• Any data-distributive program can be ported onto Hadoop in a day.
• Efficient enough for computation-intensive tasks, even with low data locality.
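A sketch of such a map.pl, assuming the 0.x-era "hadoop dfs" CLI; the annotator name and the output path are placeholders, not our actual tool:

  #!/usr/bin/perl -w
  # map.pl: each input record is one HDFS path; no reducer needed.
  use strict;

  while ( my $path = <STDIN> ) {
      chomp $path;
      next unless $path;
      my ($name) = $path =~ m{([^/]+)$};    # local file name

      # Pull the file from HDFS onto the node's local disk.
      system("hadoop dfs -get $path $name") == 0 or next;

      # Run the legacy, file-based annotator (name is a placeholder).
      system("./annotate $name $name.tagged") == 0 or next;

      # Push the result back into HDFS.
      system("hadoop dfs -put $name.tagged /annotated/$name.tagged") == 0
          or warn "upload failed for $path\n";

      print "$path\tdone\n";    # map output doubles as a progress log
  }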
Trick #2
• Q: My annotator is a directory of programs, but Hadoop Streaming only accepts files.
• Solution: still Hadoop Streaming.
  • Make a tar ball of your directory of programs.
  • map.pl extracts the tar ball and launches the program (see the sketch below).
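The unpacking step is a few lines at the top of map.pl (the archive and entry-point names are placeholders; the tar ball is shipped with the job, e.g., via the -file option):

  # Before the input loop: unpack the directory of programs once.
  system("tar -xzf annotator.tar.gz") == 0
      or die "cannot unpack annotator.tar.gz\n";
  # Then launch the extracted entry point as before:
  # system("./annotator/run.sh $name $name.tagged");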
Trick #3
• Q: Hadoop programs run on backend nodes and are difficult to debug.
• Use STDERR for debugging (see the fragment below).
• Also, if using HOD for managing the cluster:
  • View STDERR through the Web monitoring interface.
  • See the time spent on each Map/Reduce task.
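In a streaming script this is just a matter of writing to STDERR, which is captured in the task logs separately from the job's output (a fragment; $path comes from the surrounding loop):

  print STDERR "map.pl: processing $path\n";   # appears in the task's log
  warn "skipping $path: download failed\n";    # warn also writes to STDERR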
Pitfall #1: It's All a Matter of Balance
• For higher performance, it is important to have the right balance between Map & Reduce tasks.
• The default number of Map/Reduce processes per node is 2.
  • But some multicore / multiprocessor nodes can easily handle more (e.g., 6 on M45).
• There is no good way to determine the right balance, except by parameter sweeps (the per-node limits are set as shown below).
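In 0.x-era Hadoop, the per-node task limits live in hadoop-site.xml (the values here are illustrative, not a recommendation):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>    <!-- default is 2 -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>6</value>
  </property>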
Pitfall #2: Things Die, No Idea Why
• Fault tolerance and diagnosis:
  • If a Reduce task becomes unresponsive, it is killed.
    • E.g., if it is overwhelmed with work.
    • E.g., if its sort step is overwhelmed with work.
  • Diagnosing the cause of an unresponsive Reduce process is not always easy.
  • Sometimes solved by increasing the number of reducers.
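One mitigation in streaming jobs: a long-running reducer can report that it is still alive by writing reporter:status lines to STDERR, a Hadoop Streaming convention that updates the task's status (a sketch; the record-counting logic is illustrative):

  # Inside a streaming reducer's input loop:
  my $count = 0;
  while (<STDIN>) {
      $count++;
      print STDERR "reporter:status:processed $count records\n"
          if $count % 100_000 == 0;
      # ... actual reduce work ...
  }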
Unsolved Problems
• Monitoring the cluster for diagnostics:
  • CPU, network, disk I/O, swap, etc.
  • There is a Simon web interface, but it is not working.
• HOD (or Torque?) does not allow scheduling and prioritizing jobs.
• Reduce happens on a few nodes; the waiting time for the other idle nodes can be long.
• Shuffle & sort is opaque: yet another black box.
Thanks! Comments? Ideas?