
Python In The Cloud


Presentation Transcript


  1. Python In The Cloud PyHou MeetUp, Dec 17th 2013 Chris McCafferty, SunGard Consulting Services

  2. Overview • What is the Cloud? • What is Big Data? • Big Data Sources • Python and Amazon Web Services • Python and Hadoop • Other Pythonic Cloud providers • Wrap-up

  3. What Is The Cloud • I want 40 servers and I want them NOW • I want to store 100TB of data cheaply and reliably • We can do this with Cloud technologies

  4. What is Big Data • “Three Vs” • Volume • Variety • Velocity • Genome: sequencing machines throw off several TB per day. Each. • Hard drive performance is often the killer bottleneck, both reading and writing

  5. What is NOT Big Data • Anything where the whole data set can be held in memory on a single standard instance • Data that can be held straightforwardly in a traditional relational database • Problems where most of the data can be trivially excluded • There are many challenging problems in the world – but not all need Cloud or Big Data tools to solve them

  6. To The Cloud! • Amazon Web Services is the 800lb gorilla in this space • Start here if in doubt • Other options are Rackspace, Microsoft Azure, (PiCloud/Multyvac?) • You can also spin up some big iron very cheaply • Current AWS big-memory spec is cr1.8xlarge • 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network • $3.50 per hour

  7. Geo Big Data Sources • NASA SRTM data is on the large side • NASA recently released a huge set of data directly into the cloud: NEX • Earth Sciences data sets • Made available on Amazon Web Services public datasets • Available on S3 at: • s3://nasanex/NEX-DCP30 • s3://nasanex/MODIS • s3://nasanex/Landsat • There are many, many geo data sets available now (NOAA Lidar, etc)

  8. Time for some code • Example - use the S3 browser to look at the new NASA NEX data • Let's download some with the boto package • Quickest to do this from an Amazon data centre • See DemoDownloadNasaNEX.py
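
DemoDownloadNasaNEX.py itself isn't reproduced in the deck; as a minimal sketch of the approach, assuming boto 2.x and anonymous access to the public nasanex bucket:

    import boto

    # nasanex is a public bucket, so connect anonymously (no AWS keys needed)
    conn = boto.connect_s3(anon=True)
    bucket = conn.get_bucket('nasanex', validate=False)

    # List a few NEX-DCP30 objects, then download the first one
    for key in bucket.list(prefix='NEX-DCP30/'):
        print(key.name)
        key.get_contents_to_filename(key.name.split('/')[-1])
        break

Run from an EC2 instance in the same region as the bucket, the transfer never leaves Amazon's network, which is the point about downloading from an Amazon data centre being quickest.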

  9. Weather & Big Data Sources • Good public weather and energy data • It's hard to move data around for free: just try! • Power grids shed many GB of public data a day • Historical data sets form many Terabytes • Weather data available from NOAA • QCLD: Hourly, daily, and monthly summaries for approximately 1,600 U.S. locations. • ASOS data contains sensor data at one-minute intervals. 5 min intervals available too. • 900 stations, 3-4MB per day, 12 years of data = 11-15TB data set.

  10. Why go to the cloud • Cheap - see AWS pricing here • spot pricing of m1.medium is normally ~1c/hr • The cloud is increasingly where the (public) data will reside • Pay as you go, less bureaucracy • Support for Big Data technologies out of the box • Amazon Elastic MapReduce (EMR) gives you a Hadoop cluster with minimal setup • Host a big web server farm or video streaming cluster

  11. Python on AWS EC2 • AWS = Amazon Web Services. The Big Cloud • EC2 = Elastic Compute Cloud • Let's run up an instance and see what we have available • See this script as one way to upgrade to Python 2.7 • Note the absence of high-level packages like NumPy, matplotlib and Pandas • It would be very useful to have a very high-level Python environment…

  12. StarCluster • Cluster management in AWS, written by a group at MIT • Convenient package to spin up clusters (Hadoop or other) and copy across files • Machine images (AMIs) for high-level Python environments (NumPy, matplotlib, Pandas, etc) • Not every high-level library is there • No sklearn (scikit-learn, machine learning) • But easier to pip-install with most prerequisites already there • Sun Grid Engine: job management • Hadoop • Boto plugin • dumbo… and much more

  13. Python's Support for AWS • boto - interface to AWS (Amazon Web Services) • Hadoop Streaming - use Python in MapReduce tasks • mrjob - framework that wraps Hadoop Streaming and uses boto • pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce • Write Python in User-Defined Functions in Pig, Hive • Essentially wraps MapReduce and Hadoop Streaming
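
To give a flavour of mrjob, here is the classic word-count job; a minimal sketch for illustration, not a script from the talk:

    # wordcount_mrjob.py - the classic mrjob word count.
    # Runs locally (python wordcount_mrjob.py input.txt) or on
    # Elastic MapReduce with the -r emr switch.
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Called once per input line; emit (word, 1) pairs
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # All the counts for one word arrive together
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()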

  14. Boto - Python Interface to AWS • Support for EMR (Elastic MapReduce) • Upload/download from Amazon S3 and Glacier • Start/stop EC2 instances • Manage users through IAM • Virtually every API available from AWS is supported • django-storages uses boto to present an S3 storage option • See http://docs.pythonboto.org/en/latest/ • Make sure you keep your AWS key-pair secure
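
For instance, starting and stopping an instance from boto 2.x looks roughly like this; the AMI id and key pair name are placeholders, not values from the talk:

    import boto.ec2

    # Credentials are picked up from ~/.boto or environment variables
    conn = boto.ec2.connect_to_region('us-east-1')

    reservation = conn.run_instances(
        'ami-00000000',             # placeholder AMI id
        instance_type='m1.medium',
        key_name='my-keypair')      # placeholder key pair name
    instance = reservation.instances[0]

    # ...do some work, then stop the meter running
    conn.stop_instances(instance_ids=[instance.id])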

  15. Another Code Example – upload • Example where we merge many files together and upload to S3 • Merge files to avoid the Small Files Problem • Note the use of a retry decorator (exponential backoff) • See CopyToCloud.py and MergeAndUploadTxOutages.py
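
The retry decorator mentioned above follows a standard pattern; a sketch of the idea, which may differ in detail from the one in CopyToCloud.py:

    import time
    from functools import wraps

    def retry(tries=5, delay=1, backoff=2):
        """Retry the wrapped function with exponential backoff."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                wait = delay
                for _ in range(tries - 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        time.sleep(wait)       # back off before the next attempt
                        wait *= backoff
                return func(*args, **kwargs)  # final try: let the error surface
            return wrapper
        return decorator

    @retry(tries=5)
    def upload(key, path):
        key.set_contents_from_filename(path)  # S3 uploads can fail transiently

Exponential backoff matters with S3: transient errors are common at scale, and hammering the service with immediate retries only makes them worse.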

  16. What is Hadoop? • A scalable data and job manager suitable for MapReduce jobs • Core technologies date from the early 2000s at Google • Retries failed tasks, redundant data, good for commodity hardware • Rich ecosystem of tools including NoSQL databases, good Python support • Example: let's spin up a cluster of 30 machines with StarCluster

  17. Hadoop Scales Massively

  18. Hadoop Streaming • Hadoop passes incoming data in rows on stdin • Any program (including Python) can process the rows and emit to stdout • Logging and errors go to stderr

  19. Hadoop Streaming - Echo • A useful example for debugging • Tells you what Hadoop is actually passing your task • See echo.py • A similar example, firstten.py, peeks at the first ten lines then stops
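
echo.py isn't reproduced in the deck, but given the description it is presumably little more than this:

    #!/usr/bin/env python
    # echo.py - write every line Hadoop Streaming gives us straight back
    # out, so you can see exactly what your task receives.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

firstten.py would be the same loop with a counter that breaks after ten lines, so the job finishes quickly when you only want a peek at the input.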

  20. Hadoop Parsing Example • Python's regex support makes it very good for parsing unstructured data • One of the keys to working with Hadoop and Big Data is getting it into a clean row-based format • Apply 'schema on read' • Transmission Data from PJM is updated here every 5 mins: https://edart.pjm.com/reports/linesout.txt • Needs cleaning up before we can use it for detailed analysis - note the multi-line format • Script split_transmission.py • Watch out for Hadoop splitting input blocks in the middle of a multi-line record
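
split_transmission.py isn't shown in the deck; here is a sketch of the general shape of such a clean-up mapper, using a hypothetical record-start pattern (the real linesout.txt layout would dictate the actual regex):

    #!/usr/bin/env python
    # Stitch multi-line records back into one tab-separated row each.
    # The record-start regex is a hypothetical placeholder.
    import re
    import sys

    RECORD_START = re.compile(r'^\d{2}/\d{2}/\d{4}')  # e.g. a leading date

    record = []
    for line in sys.stdin:
        line = line.strip()
        if RECORD_START.match(line) and record:
            print('\t'.join(record))   # emit the completed record as one row
            record = []
        if line:
            record.append(line)
    if record:
        print('\t'.join(record))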

  21. Alternatives to AWS • PiCloud offers open source software enabling you to run large computational clusters • Just acquired by Dropbox • Pay for what you use: 1 core and 300MB of RAM costs $0.05/hr • Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud • Disco is MapReduce implemented in Python • Started life at Nokia • Has its own distributed filesystem (like HDFS) • Or roll your own cluster in-house with pp (parallel python) • Or run StarCluster's Sun Grid Engine setup on another vendor or in-house • Google App Engine…?
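
As a flavour of Disco, here is its canonical word-count job, reproduced from memory of the Disco documentation; it assumes a running Disco cluster:

    from disco.core import Job, result_iterator

    def map(line, params):
        # Emit (word, 1) for every word on the line
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        # kvgroup groups the sorted (word, count) pairs by word
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=['http://discoproject.org/media/text/chekhov.txt'],
                        map=map,
                        reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print('%s %d' % (word, count))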

  22. PiCloud • Acquired by Dropbox, Nov 2013 • Dropbox will probably come out with its own cloud compute offering in 2014 • As of Dec 2013, no new sign-ups • Existing customers encouraged to migrate to Multyvac • On Feb 25th 2014, PiCloud will switch off • The underlying PiCloud software is still open source

  23. Conclusions • For cheap compute power and cheap storage, look to the cloud • Python is well-supported in this space • Consider being close to your data: in the same cloud • Moving data is expensive and slow • Leverage AWS with tools like boto and StarCluster • Beware setting up complex environments: installing packages takes time and effort • Ideally, think Pythonically – use the best tools to get the job done

  24. Links • Good rundown on the Python ecosystem around Hadoop from Jan 2013: • http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ • Early vision for PiCloud (YouTube, Mar 2012) • http://www.youtube.com/watch?v=47NSfuuuMfs • Disco MapReduce Framework from PyData • http://www.youtube.com/watch?v=YuLBsdvCDo8 • PuTTY tool for Windows • Some AWS & Python war stories: • http://nz.pycon.org/schedule/presentation/12

  25. Thank you • Chris McCafferty • http://christophermccafferty.com/blog • Slides will be at: • http://christophermccafferty.com/slides • Contact me at: • public@christophermccafferty.com • Chris.McCafferty@sungard.com
