Jimmy Lin The iSchool University of Maryland Wednesday, September 3, 2008 - PowerPoint PPT Presentation

Jimmy lin the ischool university of maryland wednesday september 3 2008 l.jpg
Download
1 / 44

Cloud Computing Lecture #1 What is Cloud Computing? (and an intro to parallel/distributed processing). Jimmy Lin The iSchool University of Maryland Wednesday, September 3, 2008.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Jimmy Lin The iSchool University of Maryland Wednesday, September 3, 2008

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Jimmy lin the ischool university of maryland wednesday september 3 2008 l.jpg

Cloud Computing Lecture #1What is Cloud Computing?(and an intro to parallel/distributed processing)

Jimmy Lin

The iSchool

University of Maryland

Wednesday, September 3, 2008

Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creation Commons Attribution 3.0 License)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details


Slide2 l.jpg

Source: http://www.free-pictures-photos.com/


What is cloud computing l.jpg

What is Cloud Computing?

  • Web-scale problems

  • Large data centers

  • Different models of computing

  • Highly-interactive Web applications


1 web scale problems l.jpg

1. Web-Scale Problems

  • Characteristics:

    • Definitely data-intensive

    • May also be processing intensive

  • Examples:

    • Crawling, indexing, searching, mining the Web

    • “Post-genomics” life sciences research

    • Other scientific data (physics, astronomers, etc.)

    • Sensor networks

    • Web 2.0 applications


How much data l.jpg

How much data?

  • Wayback Machine has 2 PB + 20 TB/month (2006)

  • Google processes 20 PB a day (2008)

  • “all words ever spoken by human beings” ~ 5 EB

  • NOAA has ~1 PB climate data (2007)

  • CERN’s LHC will generate 15 PB a year (2008)

640Kought to be enough for anybody.


Slide6 l.jpg

Maximilien Brice, © CERN


Slide7 l.jpg

Maximilien Brice, © CERN


There s nothing like more data l.jpg

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

There’s nothing like more data!

s/inspiration/data/g;


What to do with more data l.jpg

What to do with more data?

  • Answering factoid questions

    • Pattern matching on the Web

    • Works amazingly well

  • Learning relations

    • Start with seed instances

    • Search for patterns on the Web

    • Using patterns to find more instances

Who shot Abraham Lincoln?  Xshot Abraham Lincoln

Wolfgang Amadeus Mozart (1756 - 1791)

Einstein was born in 1879

Birthday-of(Mozart, 1756)

Birthday-of(Einstein, 1879)

PERSON (DATE –

PERSON was born in DATE

(Brill et al., TREC 2001; Lin, ACM TOIS 2007)

(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )


2 large data centers l.jpg

2. Large Data Centers

  • Web-scale problems? Throw more machines at it!

  • Clear trend: centralization of computing resources in large data centers

    • Necessary ingredients: fiber, juice, and space

    • What do Oregon, Iceland, and abandoned mines have in common?

  • Important Issues:

    • Redundancy

    • Efficiency

    • Utilization

    • Management


Slide11 l.jpg

Source: Harper’s (Feb, 2008)


Slide12 l.jpg

Maximilien Brice, © CERN


Key technology virtualization l.jpg

Key Technology: Virtualization

App

App

App

App

App

App

OS

OS

OS

Operating System

Hypervisor

Hardware

Hardware

Traditional Stack

Virtualized Stack


3 different computing models l.jpg

3. Different Computing Models

  • Utility computing

    • Why buy machines when you can rent cycles?

    • Examples: Amazon’s EC2, GoGrid, AppNexus

  • Platform as a Service (PaaS)

    • Give me nice API and take care of the implementation

    • Example: Google App Engine

  • Software as a Service (SaaS)

    • Just run it for me!

    • Example: Gmail

“Why do it yourself if you can pay someone to do it for you?”


4 web applications l.jpg

4. Web Applications

  • A mistake on top of a hack built on sand held together by duct tape?

  • What is the nature of software applications?

    • From the desktop to the browser

    • SaaS == Web-based applications

    • Examples: Google Maps, Facebook

  • How do we deliver highly-interactive Web-based applications?

    • AJAX (asynchronous JavaScript and XML)

    • For better, or for worse…


What is the course about l.jpg

What is the course about?

  • MapReduce: the “back-end” of cloud computing

    • Batch-oriented processing of large datasets

  • Ajax: the “front-end” of cloud computing

    • Highly-interactive Web-based applications

  • Computing “in the clouds”

    • Amazon’s EC2/S3 as an example of utility computing


Amazon web services l.jpg

Amazon Web Services

  • Elastic Compute Cloud (EC2)

    • Rent computing resources by the hour

    • Basic unit of accounting = instance-hour

    • Additional costs for bandwidth

  • Simple Storage Service (S3)

    • Persistent storage

    • Charge by the GB/month

    • Additional costs for bandwidth

  • You’ll be using EC2/S3 for course assignments!


This course is not for you l.jpg

This course is not for you…

  • If you’re not genuinely interested in the topic

  • If you’re not ready to do a lot of programming

  • If you’re not open to thinking about computing in new ways

  • If you can’t cope with uncertainly, unpredictability, poor documentation, and immature software

  • If you can’t put in the time

Otherwise, this will be a richly rewarding course!


Slide19 l.jpg

Source: http://davidzinger.wordpress.com/2007/05/page/2/


Cloud computing zen l.jpg

Cloud Computing Zen

  • Don’t get frustrated (take a deep breath)…

    • This is bleeding edge technology

    • Those W$*#T@F! moments

  • Be patient…

    • This is the second first time I’ve taught this course

  • Be flexible…

    • There will be unanticipated issues along the way

  • Be constructive…

    • Tell me how I can make everyone’s experience better


Slide21 l.jpg

Source: Wikipedia


Slide22 l.jpg

Source: Wikipedia


Slide23 l.jpg

Source: Wikipedia


Slide24 l.jpg

Source: Wikipedia


Things to go over l.jpg

Things to go over…

  • Course schedule

  • Assignments and deliverables

  • Amazon EC2/S3


Web scale problems l.jpg

Web-Scale Problems?

  • Don’t hold your breath:

    • Biocomputing

    • Nanocomputing

    • Quantum computing

  • It all boils down to…

    • Divide-and-conquer

    • Throwing more hardware at the problem

Simple to understand… a lifetime to master…


Divide and conquer l.jpg

Divide and Conquer

“Work”

Partition

w1

w2

w3

“worker”

“worker”

“worker”

r1

r2

r3

Combine

“Result”


Different workers l.jpg

Different Workers

  • Different threads in the same core

  • Different cores in the same CPU

  • Different CPUs in a multi-processor system

  • Different machines in a distributed system


Choices choices choices l.jpg

Choices, Choices, Choices

  • Commodity vs. “exotic” hardware

  • Number of machines vs. processor vs. cores

  • Bandwidth of memory vs. disk vs. network

  • Different programming models


Flynn s taxonomy l.jpg

Flynn’s Taxonomy

Instructions

Single (SI)

Multiple (MI)

Single (SD)

Data

Multiple (MD)


Slide31 l.jpg

SISD

Processor

D

D

D

D

D

D

D

Instructions


Slide32 l.jpg

SIMD

Processor

D0

D0

D0

D0

D0

D0

D0

D1

D1

D1

D1

D1

D1

D1

D2

D2

D2

D2

D2

D2

D2

D3

D3

D3

D3

D3

D3

D3

D4

D4

D4

D4

D4

D4

D4

Dn

Dn

Dn

Dn

Dn

Dn

Dn

Instructions


Slide33 l.jpg

MIMD

Processor

Processor

D

D

D

D

D

D

D

D

D

D

D

D

D

D

Instructions

Instructions


Memory typology shared l.jpg

Memory Typology: Shared

Processor

Processor

Memory

Processor

Processor


Memory typology distributed l.jpg

Memory Typology: Distributed

Processor

Memory

Processor

Memory

Network

Processor

Memory

Processor

Memory


Memory typology hybrid l.jpg

Memory Typology: Hybrid

Processor

Memory

Processor

Memory

Processor

Processor

Network

Processor

Memory

Processor

Memory

Processor

Processor


Parallelization problems l.jpg

Parallelization Problems

  • How do we assign work units to workers?

  • What if we have more work units than workers?

  • What if workers need to share partial results?

  • How do we aggregate partial results?

  • How do we know all the workers have finished?

  • What if workers die?

What is the common theme of all of these problems?


General theme l.jpg

General Theme?

  • Parallelization problems arise from:

    • Communication between workers

    • Access to shared resources (e.g., data)

  • Thus, we need a synchronization system!

  • This is tricky:

    • Finding bugs is hard

    • Solving bugs is even harder


Managing multiple workers l.jpg

Managing Multiple Workers

  • Difficult because

    • (Often) don’t know the order in which workers run

    • (Often) don’t know where the workers are running

    • (Often) don’t know when workers interrupt each other

  • Thus, we need:

    • Semaphores (lock, unlock)

    • Conditional variables (wait, notify, broadcast)

    • Barriers

  • Still, lots of problems:

    • Deadlock, livelock, race conditions, ...

  • Moral of the story: be careful!

    • Even trickier if the workers are on different machines


Patterns for parallelism l.jpg

Patterns for Parallelism

  • Parallel computing has been around for decades

  • Here are some “design patterns” …


Master slaves l.jpg

Master/Slaves

master

slaves


Producer consumer flow l.jpg

Producer/Consumer Flow

P

C

P

C

P

C

P

C

P

C

P

C


Work queues l.jpg

Work Queues

P

C

shared queue

P

C

W

W

W

W

W

P

C


Rubber meets road l.jpg

Rubber Meets Road

  • From patterns to implementation:

    • pthreads, OpenMP for multi-threaded programming

    • MPI for clustering computing

  • The reality:

    • Lots of one-off solutions, custom code

    • Write you own dedicated library, then program with it

    • Burden on the programmer to explicitly manage everything

  • MapReduce to the rescue!

    • (for next time)


  • Login