
Python In The Cloud


Presentation Transcript


  1. Python In The Cloud PyHou MeetUp, Dec 17th 2013 Chris McCafferty, SunGard Consulting Services

  2. Overview • What is the Cloud? • What is Big Data? • Big Data Sources • Python and Amazon Web Services • Python and Hadoop • Other Pythonic Cloud providers • Wrap-up

  3. What Is The Cloud • I want 40 servers and I want them NOW • I want to store 100TB of data cheaply and reliably • We can do this with Cloud technologies

  4. What is Big Data • “Three Vs” • Volume • Variety • Velocity • Genome: sequencing machines throw off several TB per day. Each. • Hard drive performance is often the killer bottleneck, both reading and writing

  5. What is NOT Big Data • Anything where the whole data set can be held in memory on a single standard instance • Data that can be held straightforwardly in a traditional relational database • Problems where most of the data can be trivially excluded • There are many challenging problems in the world – but not all need Cloud or Big Data tools to solve them

  6. To The Cloud! • Amazon Web Services is the 800lb gorilla in this space • Start here if in doubt • Other options are Rackspace, Microsoft Azure, (PiCloud/Multyvac?) • You can also spin up some big iron very cheaply • Current AWS big-memory spec is cr1.8xlarge • 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network • $3.50 per hour

  7. Geo Big Data Sources • NASA SRTM data is on the large side • NASA recently released a huge set of data directly into the cloud: NEX • Earth Sciences data sets • Made available on Amazon Web Services public datasets • Available on S3 at: • s3://nasanex/NEX-DCP30 • s3://nasanex/MODIS • s3://nasanex/Landsat • There are many, many geo data sets available now (NOAA Lidar, etc)

  8. Time for some code • Example - use the S3 browser to look at the new NASA NEX data • Let's download some with the boto package • Quickest to do this from an Amazon data centre • See DemoDownloadNasaNEX.py
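
DemoDownloadNasaNEX.py itself isn't reproduced in the deck; as a minimal sketch of the approach, assuming boto 2.x and anonymous access to the public nasanex bucket:

    import boto

    # nasanex is a public bucket, so connect anonymously (no AWS keys needed)
    conn = boto.connect_s3(anon=True)
    bucket = conn.get_bucket('nasanex', validate=False)

    # List a few NEX-DCP30 objects, then download the first one
    for key in bucket.list(prefix='NEX-DCP30/'):
        print(key.name)
        key.get_contents_to_filename(key.name.split('/')[-1])
        break

Run from an EC2 instance in the same region as the bucket, the transfer never leaves Amazon's network, which is the point about downloading from an Amazon data centre being quickest.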

  9. Weather & Big Data Sources • Good public weather and energy data • It's hard to move data around for free: just try! • Power grids shed many GB of public data a day • Historical data sets form many Terabytes • Weather data available from NOAA • QCLD: Hourly, daily, and monthly summaries for approximately 1,600 U.S. locations. • ASOS data contains sensor data at one-minute intervals. 5 min intervals available too. • 900 stations, 3-4MB per day, 12 years of data = 11-15TB data set.

  10. Why go to the cloud • Cheap - see AWS pricing here • spot pricing of m1.medium is normally ~1c/hr • The cloud is increasingly where the (public) data will reside • Pay as you go, less bureaucracy • Support for Big Data technologies out of the box • Amazon Elastic MapReduce (EMR) gives you a Hadoop cluster with minimal setup • Host a big web server farm or video streaming cluster

  11. Python on AWS EC2 • AWS = Amazon Web Services. The Big Cloud • EC2 = Elastic Compute Cloud • Let's run up an instance and see what we have available • See this script as one way to upgrade to Python 2.7 • Note the absence of high-level packages like NumPy, matplotlib and Pandas • It would be very useful to have a very high-level Python environment…

  12. StarCluster • Cluster management in AWS, written by a group at MIT • Convenient package to spin up clusters (Hadoop or other) and copy across files • Machine images (AMIs) for high-level Python environments (NumPy, matplotlib, Pandas, etc) • Not every high-level library is there • No sklearn (scikit-learn, machine learning) • But easier to pip-install with most prerequisites already there • Sun Grid Engine: job management • Hadoop • Boto plugin • dumbo… and much more

  13. Python's Support for AWS • boto - interface to AWS (Amazon Web Services) • Hadoop Streaming - use Python in MapReduce tasks • mrjob - framework that wraps Hadoop Streaming and uses boto • pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce • Write Python in User-Defined Functions in Pig, Hive • Essentially wraps MapReduce and Hadoop Streaming
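
To give a flavour of mrjob, here is the classic word-count job; a minimal sketch for illustration, not a script from the talk:

    # wordcount_mrjob.py - the classic mrjob word count.
    # Runs locally (python wordcount_mrjob.py input.txt) or on
    # Elastic MapReduce with the -r emr switch.
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Called once per input line; emit (word, 1) pairs
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # All the counts for one word arrive together
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()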

  14. Boto - Python Interface to AWS • Support for EMR (Elastic MapReduce) • Upload/download from Amazon S3 and Glacier • Start/stop EC2 instances • Manage users through IAM • Virtually every API available from AWS is supported • django-storages uses boto to present an S3 storage option • See http://docs.pythonboto.org/en/latest/ • Make sure you keep your AWS key-pair secure
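
For instance, starting and stopping an instance from boto 2.x looks roughly like this; the AMI id and key pair name are placeholders, not values from the talk:

    import boto.ec2

    # Credentials are picked up from ~/.boto or environment variables
    conn = boto.ec2.connect_to_region('us-east-1')

    reservation = conn.run_instances(
        'ami-00000000',             # placeholder AMI id
        instance_type='m1.medium',
        key_name='my-keypair')      # placeholder key pair name
    instance = reservation.instances[0]

    # ...do some work, then stop the meter running
    conn.stop_instances(instance_ids=[instance.id])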

  15. Another Code Example – upload • Example where we merge many files together and upload to S3 • Merge files to avoid the Small Files Problem • Note the use of a retry decorator (exponential backoff) • See CopyToCloud.py and MergeAndUploadTxOutages.py
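
The retry decorator mentioned above follows a standard pattern; a sketch of the idea, which may differ in detail from the one in CopyToCloud.py:

    import time
    from functools import wraps

    def retry(tries=5, delay=1, backoff=2):
        """Retry the wrapped function with exponential backoff."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                wait = delay
                for _ in range(tries - 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        time.sleep(wait)       # back off before the next attempt
                        wait *= backoff
                return func(*args, **kwargs)  # final try: let the error surface
            return wrapper
        return decorator

    @retry(tries=5)
    def upload(key, path):
        key.set_contents_from_filename(path)  # S3 uploads can fail transiently

Exponential backoff matters with S3: transient errors are common at scale, and hammering the service with immediate retries only makes them worse.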

  16. What is Hadoop? • A scalable data and job manager suitable for MapReduce jobs • Core technologies date from the early 2000s at Google • Retries failed tasks, redundant data, good for commodity hardware • Rich ecosystem of tools including NoSQL databases, good Python support • Example: let's spin up a cluster of 30 machines with StarCluster

  17. Hadoop Scales Massively

  18. Hadoop Streaming • Hadoop passes incoming data in rows on stdin • Any program (including Python) can process the rows and emit to stdout • Logging and errors go to stderr

  19. Hadoop Streaming - Echo • A useful example for debugging • Tells you what Hadoop is actually passing your task • See echo.py • A similar example, firstten.py, peeks at the first ten lines then stops
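
echo.py isn't reproduced in the deck, but given the description it is presumably little more than this:

    #!/usr/bin/env python
    # echo.py - write every line Hadoop Streaming gives us straight back
    # out, so you can see exactly what your task receives.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

firstten.py would be the same loop with a counter that breaks after ten lines, so the job finishes quickly when you only want a peek at the input.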

  20. Hadoop Parsing Example • Python's regex support makes it very good for parsing unstructured data • One of the keys to working with Hadoop and Big Data is getting it into a clean row-based format • Apply 'schema on read' • Transmission Data from PJM is updated here every 5 mins: https://edart.pjm.com/reports/linesout.txt • Needs cleaning up before we can use it for detailed analysis - note the multi-line format • Script split_transmission.py • Watch out for Hadoop splitting input blocks in the middle of a multi-line record
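
split_transmission.py isn't shown in the deck; here is a sketch of the general shape of such a clean-up mapper, using a hypothetical record-start pattern (the real linesout.txt layout would dictate the actual regex):

    #!/usr/bin/env python
    # Stitch multi-line records back into one tab-separated row each.
    # The record-start regex is a hypothetical placeholder.
    import re
    import sys

    RECORD_START = re.compile(r'^\d{2}/\d{2}/\d{4}')  # e.g. a leading date

    record = []
    for line in sys.stdin:
        line = line.strip()
        if RECORD_START.match(line) and record:
            print('\t'.join(record))   # emit the completed record as one row
            record = []
        if line:
            record.append(line)
    if record:
        print('\t'.join(record))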

  21. Alternatives to AWS • PiCloud offers open source software enabling you to run large computational clusters • Just acquired by Dropbox • Pay for what you use: 1 core and 300MB of RAM costs $0.05/hr • Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud • Disco is MapReduce implemented in Python • Started life at Nokia • Has its own distributed filesystem (like HDFS) • Or roll your own cluster in-house with pp (parallel python) • Or run StarCluster's Sun Grid Engine setup on another vendor or in-house • Google App Engine…?
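
As a flavour of Disco, here is its canonical word-count job, reproduced from memory of the Disco documentation; it assumes a running Disco cluster:

    from disco.core import Job, result_iterator

    def map(line, params):
        # Emit (word, 1) for every word on the line
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        # kvgroup groups the sorted (word, count) pairs by word
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=['http://discoproject.org/media/text/chekhov.txt'],
                        map=map,
                        reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print('%s %d' % (word, count))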

  22. PiCloud • Acquired by Dropbox, Nov 2013 • Dropbox will probably come out with its own cloud compute offering in 2014 • As of Dec 2013, no new sign-ups • Existing customers encouraged to migrate to Multyvac • On Feb 25th 2014, PiCloud will switch off • The underlying PiCloud software is still open source

  23. Conclusions • For cheap compute power and cheap storage, look to the cloud • Python is well-supported in this space • Consider being close to your data: in the same cloud • Moving data is expensive and slow • Leverage AWS with tools like boto and StarCluster • Beware setting up complex environments: installing packages takes time and effort • Ideally, think Pythonically – use the best tools to get the job done

  24. Links • Good rundown on the Python ecosystem around Hadoop from Jan 2013: • http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ • Early vision for PiCloud (YouTube, Mar 2012) • http://www.youtube.com/watch?v=47NSfuuuMfs • Disco MapReduce Framework from PyData • http://www.youtube.com/watch?v=YuLBsdvCDo8 • PuTTY tool for Windows • Some AWS & Python war stories: • http://nz.pycon.org/schedule/presentation/12

  25. Thank you • Chris McCafferty • http://christophermccafferty.com/blog • Slides will be at: • http://christophermccafferty.com/slides • Contact me at: • public@christophermccafferty.com • Chris.McCafferty@sungard.com
