  1. Last Time • Parallelism within a machine: • Run lots of programs • Break into multiple programs or multiple data sets • Fork multiple processes → fork() • Spawn multiple threads → pthreads • Parallelism across machines: • MPI • Queuing system • Scheduler

  2. Lecture Overview • Lecture: • More on Clusters, Grids & Schedulers • Distributed Filesystems • Hadoop Building Blocks • Hands On: • Clusters/Grids • Sun Grid Engine • Distributed Filesystems • Lustre

  3. Next Week • From Grids to Clouds: • Hadoop • Hadoop Streaming • Map/Reduce • HDFS

  4. News • Cluster is now accessible from home! • ssh -p 60000 student#@jordan-test.genome.ucla.edu • # corresponds to your machine in lab • NOTE: • See me to set up a password for outside access! • Don't store important files on it, as it is not backed up! • It is now larger, and has more queues! • Lab 1 will be discussed later. It is due on April 27. • Now in the syllabus • UCLA Extension has agreed to offer advanced cloud courses • Learn to set up, tune and administer the tools from this class • Will likely be offered in 2-3 quarters, after we have more students • Feedback is desired to shape the additional courses • More details as they become available • The syllabus is being updated with more references as we go along

  5. Queue Selection • How does SGE select a queue? • Job submitted to one or more queues • Queues have a sequence number (seq_no) • qconf -mq all.q • Users are only authorized for certain queues • Does the desired queue have slots available? If not, continue to the next • Examples • echo sleep 100 | qsub -N all • echo sleep 100 | qsub -N all -q small.q • echo sleep 100 | qsub -N both -q all.q,small.q
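A quick way to see what SGE is choosing between before you submit (both are standard SGE commands; exact output varies by version):

     qconf -sql    # list all configured cluster queues (e.g. all.q, small.q)
     qstat -g c    # per-queue summary of total, used and available slots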

  6. Queues: Advanced • Subordinate Queues • Maximum Job Length • Do we kill cheaters? • Queues contain hosts • qsub -q q1@compute-0,q2@compute-1,q3 • Host Groups (a group of hosts) • qconf -ahgrp • qconf -shgrpl • echo sleep 1000 | qsub -q all.q@@highmem2 • Run on any @highmem2 node, within the all.q queue
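For reference, the host group definition opened by qconf -ahgrp is just two attributes; a minimal sketch, with hypothetical host names matching the @highmem2 example above:

     group_name @highmem2
     hostlist   compute-1-1 compute-1-2 compute-1-3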

  7. Scheduling • How do we efficiently map jobs to nodes? • Parallel Environments • Multiple Slots/Cores • qsub -pe serial N • Consumable Resources (Memory, IB, etc) • Setup • qconf -sc (show) • qconf -mc (setup) • qconf -me compute-1-1 • complex_values virtual_free=32.409G • Usage • qsub -l vf=2G • Load-based scheduling • How do we handle cheaters?
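For reference, the virtual_free entry edited via qconf -mc is one line with the columns name, shortcut, type, relation, requestable, consumable, default and urgency (exact defaults vary by install):

     virtual_free   vf   MEMORY   <=   YES   YES   0   0

A combined request might then look like this (the serial PE is the one from the slide; for parallel jobs, consumable requests are typically counted per slot):

     qsub -pe serial 4 -l vf=2G job.sh   # 4 slots, each reserving 2G of virtual_free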

  8. Fair Scheduling • Which user/project/department gets the next resource? • FIFO (First In First Out) • Tickets • Functional • Fair usage right now, history irrelevant • Share-based • Fair usage over time, with amortized half-life • Policy, Politics • Should power users have low priority? • If I run 1 job a month, does that mean it is important? • How long am I penalized for after heavy usage? • Does the Lead Developer get more shares than me? What about the Pipeline? • Priorities • Are priorities legitimate between users, or just within? • How do we weigh priority, tickets, wait time, etc?
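The relative weight given to the functional and share-based policies lives in the scheduler configuration; on stock SGE you can at least inspect it (parameter names may differ on a tuned cluster):

     qconf -ssconf | grep -i ticket   # e.g. weight_tickets_functional, weight_tickets_share
     qconf -sstree                    # show the share tree, if one has been configured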

  9. Fair Scheduling • How do we prevent a user/group/dept from dominating? • Quotas • qconf -srqsl (list), qconf -arqs (add), e.g.:
     name         limitUsers
     description  rules to avoid user domination
     enabled      TRUE
     limit        users {guest,test} to slots=4
     limit        users * hosts @mainframes to slots=1
     limit        users * hosts to slots=100
• Once scheduled, a job can still run forever (or do nothing)! • Do we kill it or put it to sleep? Do we have enough swap? • Subordinate Queues, Time Limits
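To see how much of each quota is currently consumed, and by whom, stock SGE provides qquota; a minimal example:

     qquota -u '*'   # current resource quota usage for all users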

  10. Advanced Jobs • Array jobs • Input File (array.jobs.in):
     /home/jordan/file1
     /home/jordan/file2
• Execution Script (runner.sh):
     #!/bin/bash
     INPUT_FILE=/home/jordan/array.jobs.in
     LINE=$(head -n $SGE_TASK_ID $INPUT_FILE | tail -n 1)
     gzip $LINE
• Submission • qsub -N array -t 1-`cat /home/jordan/array.jobs.in | wc -l` ~/runner.sh • Job Dependencies • JOB1_ID=`qsub -N job1 ./job1.sh | awk '{print $3}'` • qsub -N job2 -hold_jid $JOB1_ID ./job2.sh
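After submission, the array shows up as a single job with one entry per pending/running task; a quick way to watch it with plain qstat:

     qstat -u $USER   # array tasks appear with their task IDs in the ja-task-ID column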

  11. Advanced Jobs • Email notification • qsub -m e -M jmendler@ucla.edu /tmp/foo.sh • Job length • Advanced Reservations • Lets us preallocate execution hosts • Too advanced for now
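Job length is usually expressed as a wall-clock request at submission time; a minimal sketch assuming the standard h_rt limit is enforced on this cluster (whether it is depends on the queue configuration):

     qsub -m e -M jmendler@ucla.edu -l h_rt=01:00:00 /tmp/foo.sh   # email on end, at most 1 hour of wall clock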

  12. Scaling Storage • As we add computers, what bottlenecks? • Network • Depends on application • Infiniband, Myrinet • Storage • Direct Attached Storage • Network Attached Storage (NAS) • Storage Area Network (SAN)

  13. DAS vs NAS vs SAN

  14. Scaling Storage • Direct Attached Storage • Not shared • Network Attached Storage (NAS) • Head node bottlenecks • Storage Area Network (SAN) • Every node talks to every disk • Doesn't scale as well and is very expensive • Clustered NAS, NAS + SAN, etc • Clients are load-balanced amongst NFS servers • A single request goes to one server at a time • All head nodes must present the same storage

  15. Distributed Filesystems • POSIX Compliant (sometimes) • Appears to the client the same way a local disk or NFS volume would • Global • A single namespace across multiple storage servers/volumes • All clients see a single volume regardless of the back-end • Distributed • Files are distributed across storage servers for load balancing • Storage servers talk to one or more sets of disks at a time • Parallel • A single file can be distributed across one or more servers • Client talks to one or more fileservers at a time

  16. Distributed Filesystems • Expandable • Add storage servers or disk arrays to grow (online) • Multi-Petabyte installations • Fast and scalable • Support tens of thousands of clients • Support hundreds of gigabytes per second • Reliability • Automatic failover when a server or disk dies

  17. Lustre • Lustre is a POSIX-compliant global, distributed, parallel filesystem • Lustre is fast, scalable and live expandable • Lustre is licensed under GPL • Lustre was acquired by Sun/Oracle

  18. Lustre: Flagship Deployment • Oak Ridge National Lab (2009) • 1 center-wide Lustre-based storage volume • Over 10 PB of RAID6 storage (13,400 SATA disks!) • Over 200GB/s of throughput (240GB/s theoretical) • 192 Lustre servers over Infiniband • Over 26,000 clients simultaneously performing I/O • ORNL 2012 Projections • 1.5 TB/s aggregate disk bandwidth • 244 PB of SATA disk storage or 61 PB of SAS • 100,000 clients Source: http://wiki.lustre.org/images/a/a8/Gshipman_lug_2009.pdf

  19. Lustre: Downsides • Complexity • Reliability is improving but not quite there • Unexplained slowdowns, hangs, weird debug messages • An occasional corruption bug pops up on the mailing list • Fast scratch space at best • Copy raw data to Lustre, process, copy results back • No High Availability at the Lustre level (yet) • Regardless, Lustre is surprisingly robust to failures • Reboot any number of OSSes and/or the MDS during a read/write • The client simply waits around for the target to return • When the cluster comes back online, I/O generally resumes cleanly • Client timeouts are tunable to wait or to return a file-unavailable error

  20. Lustre Components: Servers • Metadata Server (MDS) • Manages filesystem metadata, but stores no actual data • Ideally, enough RAM to fit all of the metadata in memory • Object Storage Servers (OSS) • Analogous to the head node(s) for each storage server • Performs the disk I/O when prompted by clients • Server-side caching • Management Server (MGS) • Stores configuration information about the filesystem • Servers require a custom kernel

  21. Lustre Components • Metadata Target (MDT) • Disk back-end to the MDS, tuned for small files • Object Storage Target (OST) • One or more per OSS; each is a disk or array that stores the actual file data • ldiskfs (modified ext3/ext4), being ported to ZFS • Clients • The Lustre client runs as a kernel module to directly mount Lustre • The client asks the MDS where to read/write a file or directory • The client makes requests directly to the OSS(es) • The OSS talks to the appropriate OST(s) • Clients cache when possible
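For context, a client mount ties these pieces together: the client only needs the address of the MGS and learns about the MDS and OSSes from it. A minimal sketch, with a hypothetical node and filesystem name:

     # on a client with the Lustre kernel modules loaded
     mount -t lustre mgsnode@tcp0:/lustrefs /mnt/lustre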

  22. Lab time • Cluster is now accessible from home! • ssh -p 60000 student#@jordan-test.genome.ucla.edu • # corresponds to your machine number in lab • In lab you can ssh directly in. From home you need a password! • New cluster is bigger, and has more bells and whistles • It is not backed up, so do not store important files on it! • Problems: • Forking and Threading (from Last Week) • http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture2/lab_problems.txt • Sun Grid Engine • http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture3/lab_problems.txt • The two combined will make up Lab 1, due April 27

  23. Introduction to Hadoop • Sun Grid Engine + Lustre • A job runs in Rack1, writes to storage in Rack5 • All writes go across the network, possibly far away • The SGE nodes' disks do nothing, the Lustre servers' CPUs do nothing • Wasted resources, and we need to grow these systems independently • Integration points? • Combine storage servers and compute nodes • CPUs for computation, disks for storage • Minimize network traffic by writing to our local disk when possible • Each added server speeds up both processing and data throughput/capacity • Combine job scheduler and filesystem metadata (data locality) • Run jobs on the node, or rack, that has the desired input files • Cheaper to move computation than data! • Stripe across the local rack, not across uplinks!

  24. Introduction to Hadoop • How else can we optimize? • Run duplicate computation on empty nodes • Amongst 3 computers, 1 is likely to be a little faster and 1 may fail • Replicate data • Copies on different racks to improve read speeds • But no reason to copy intermediate temp files • Also safer, so we can use cheap/commodity hardware • Compression • CPUs are faster than disks and networks • How else can we simplify? • Automate and optimize splits and merges • Integrate the whole system, so the user doesn't worry about internals • How can we hide all this detail from the user? • An API providing simple functions/data structures that the system can scale

  25. Introduction to MapReduce • What primitives must this API provide? • Get Input and Split (InputReader) • Efficiently read in 1 dataset with 1,000,000 records • Split into N groups for N nodes • Computation (Map) • Take a group of data from the split or a prior computation • Run some algorithm on that data • Output result(s) for each computation • Merges (Reduce) • Take some group of data and combine it into 1 or more values • Store Results (OutputWriter) • Take our result and efficiently/safely write it to storage
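Since next week covers Hadoop Streaming, here is a minimal word-count sketch of those four roles using nothing but shell pipes: stdin plays the InputReader, two small scripts play Map and Reduce, sort plays the shuffle, and a redirect plays the OutputWriter. This only illustrates the primitives; it is not how Hadoop itself runs them:

  mapper.sh (emit one "word<TAB>1" record per word on stdin):
     #!/bin/bash
     tr -s '[:space:]' '\n' | grep -v '^$' | awk '{ print $1 "\t" 1 }'

  reducer.sh (sum the counts for each word; input arrives grouped by key):
     #!/bin/bash
     awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }'

  Running the pipeline on one machine (InputReader | Map | shuffle/sort | Reduce | OutputWriter):
     cat input.txt | ./mapper.sh | sort | ./reducer.sh > word_counts.txt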
