Welcome to cis 455 555 internet and web systems

Welcome to CIS 455 / 555 – Internet and Web Systems. Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 13, 2010. What this Course Is About. How do we build services like Google, Akamai, iTunes, Facebook, EBAY, …?


Welcome to CIS 455 / 555 – Internet and Web Systems

Zachary G. Ives

University of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

January 13, 2010


What this Course Is About

  • How do we build services like Google, Akamai, iTunes, Facebook, EBAY, …?

    • What are the principles behind them? (This is NOT a course on building Web sites!)

    • How do “cloud computing,” P2P, and Web services relate?

  • The main themes of the course:

    • Distributed systems concepts, with emphasis on data, scalability and interoperability (including “the cloud”)

    • Data representation fundamentals, with emphasis on XML

    • Information retrieval concepts, including ranking and indexing

  • It’s a course that involves building software using the principles learned, evaluating it, and programming in teams


How Does this Relate to Other CIS Courses?

CIS 330/550

  • Data representation and management

  • Relational querying with SQL; XML querying with XQuery

  • DBMS-backed web sites

  • 455/555 focuses on data with respect to interoperability

CIS 350/573: software engineering and mashups

CIS 505: focuses on distributed systems and algorithms

  • CIS 505 is less project-oriented than CIS 555

  • CIS 555 covers Web services, cloud architectures in more detail


Some Things We’ll Look at

  • What are the principles behind building systems that work on the Internet?

  • How do these relate to many of today’s hot technologies?

    • Web servers, DHTML, Servlets, JSP, …

    • XML

    • Web services

    • Peer-to-peer

    • Application servers

    • Cloud computing environments

    • Content distribution networks

    • Web search

    • Mash-ups

    • The cloud


Staff

  • Instructor: Zack Ives, zives@cis

    • Office: 576 Levine North

    • Office hours Th 3:30-4:30 (and by arrangement)

  • TA: Katie Gibson, gibsonk@seas

    • Office hours TBA

  • Discussion group:

    • cis-455-555-spring10@googlegroups.com

    • http://groups.google.com/cis-455-555-spring10


Textbooks

  • Distributed Systems: Principles and Paradigms, 2nd ed, Tanenbaum and van Steen

    • We’ll read from the book ~50% of the time

  • Frequent supplementary handouts

    • Excerpts from several books

    • Many recent research papers

  • Your first one, which you should read by Wed: http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf (linked off the CIS 555 “Schedule & Slides” page)

    • Send me mail if it’s difficult for you to find a way of printing the paper yourself


Prerequisites, Workload, etc.

Necessary skills:

  • Ability to code in Java: there is a substantial implementation project

  • Good debugging skills – this will be the biggest time sink!

  • The ability to work as a team with classmates (towards the end)

  • A willingness to learn how to read API documentation

  • Some exposure to threads and concurrent programming

  • A willingness to “push the envelope”

Workload:

  • Several programming/debugging-based homework assignments

  • A substantial term project with experimental evaluation and a report

  • Two midterms

Payoff:

  • Lots of practical development and debugging experience

  • A good working knowledge of the fundamentals behind scalable systems

  • A working “academic clone of Google,” hosted on Amazon EC2!

WARNING: this course should be considered 1.5 CU!


A Disclaimer…

  • This remains a “bleeding edge” course!

    • Goal 0: an understanding of scalable distributed data-centric systems

    • Goal 1: a look under the covers of today’s hottest topics – in lectures and in projects

    • Goal 2: a level of comfort in managing large, complex software development with others’ code

    • Part of this means doing a substantial implementation project

      • As in the real world: learning APIs, dealing with inadequate tools

      • Most of you will find this a struggle! You’ll spend many hours debugging!

  • We will be using some immature technology

    • Not everything has been tested and validated ahead of time

      • e.g., this will be the first year we are using Amazon Elastic Compute Cloud

    • We’ll do the best we can to smooth over the bugs

  • We hope it will be a fun course, though…

    … And an interesting one!



What Exactly Is the Web?

  • The Web consists of HTTP servers that publish HTML, XML, and a few other content types

    • These are hyperlinked via URLs (a subset of URIs)

    • Plus there are a huge number of web clients

  • The Web is built on a number of Internet protocols:

    • DNS, TCP, IP

  • Other Internet services use other protocols

    • SMTP, IMAP, POP, AIM, FTP, …

    • Streaming media, music swapping protocols, …

  • Web services and custom applications may also use HTTP in ways it wasn’t designed for
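
To make the layering concrete, here is a minimal sketch of what an HTTP client hands to TCP once DNS has resolved the host: a few lines of plain text. The class name `HttpRequestSketch` and the host `example.org` are illustrative assumptions, not part of the course material, and a real client would send more headers.

```java
import java.net.URI;

final class HttpRequestSketch {
    /** Builds the text of an HTTP/1.1 GET request for the given URL. */
    static String buildGet(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";                      // an empty path means the server root
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        // Request line, headers, then a blank line that ends the header block.
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + uri.getHost() + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(buildGet("http://example.org/index.html"));
    }
}
```

The resulting string is what would be written to a TCP socket opened to the host, typically on port 80.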


The Internet is Built in Layers

Layers, from top to bottom, with examples of each:

  • Your Application

  • Middleware: Web Services, distributed transactions, …

  • Session: SSH, FTP, HTTP, IM, P2P, …; lightweight streaming, etc.

  • Transport: TCP (session-based), UDP (sessionless)

  • IP: IPv4, IPv6 unicast, (multicast)

  • Link: WiFi, ZigBee, Ethernet, WiMax


What Is an Internet System?

  • Not just a web server or web application…

  • An application built over the Internet, whose functionality is distributed across more than one machine

    • Typically, at least in a client-server or server-to-server fashion, but may have many more participants

    • Typically, data and/or code must be exchanged in distributed fashion for the functioning of the application

    • Often, the data must be partitioned, replicated, translated, etc. (“shards” in Google-speak)

    • Often, the code is written in multiple different environments, languages, etc.

    • Often, there are concerns about handling failures, firewalls, attacks, …


Why Are Internet System Topics Interesting?

  • Understanding what’s underneath today’s Web

    • How does it work?

    • What are its shortcomings?

    • What are its strengths?

  • Understanding distributed algorithms

  • Using the right approach when designing new protocols and web systems

  • Being able to anticipate what’s actually possible in the future


Example: Web Search, a Cloud Service

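[Diagram: three clients exchange HTML forms, queries, and results with the Search Interface Servers. The Search Interface Servers pass each query to the Index Servers and get results back; an annotation notes that a model of document/word similarity is used to rank matches. Crawlers fetch pages from the Web and feed keywords + locations to the Index Servers.]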
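
The "keywords + locations" that the index servers hold can be sketched as an inverted index: a map from each keyword to the documents and word offsets where it occurs. The class name `InvertedIndexSketch` is hypothetical, and ranking by raw occurrence count is an assumed stand-in for the document/word similarity model mentioned in the diagram.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InvertedIndexSketch {
    /** One occurrence of a keyword: which document, and at which word offset. */
    static final class Posting {
        final String docId;
        final int position;
        Posting(String docId, int position) {
            this.docId = docId;
            this.position = position;
        }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    /** Records every word of the document together with its location. */
    void add(String docId, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) continue;
            index.computeIfAbsent(words[i], k -> new ArrayList<>())
                 .add(new Posting(docId, i));
        }
    }

    /** Returns ids of documents containing the keyword, most occurrences first. */
    List<String> search(String keyword) {
        Map<String, Integer> counts = new HashMap<>();
        List<Posting> postings =
            index.getOrDefault(keyword.toLowerCase(), Collections.<Posting>emptyList());
        for (Posting p : postings) {
            counts.merge(p.docId, 1, Integer::sum);
        }
        List<String> docs = new ArrayList<>(counts.keySet());
        // Higher count first; ties broken by document id for a stable order.
        docs.sort((a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return docs;
    }
}
```

In a real deployment the index would be partitioned across many index servers; this sketch keeps everything in one in-memory map.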


Example: Social Networking (Facebook / Twitter), a Cloud Service

[Diagram: three clients send clicks to the User Page Servers and receive pages & notifications. The User Page Servers send updates and posts to a store of users & entities. A Recommender uses common properties, usage logs, … to produce suggestions.]


Example: Information Integration

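[Diagram: three clients send queries to a Mediator System and receive results in the “mediated schema.” The mediator maps all data into a single format and virtual schema. Between the mediator and the sources, the diagram labels XQuery + XPath over XML; SQL via ODBC; HTTP POST; and returned results, HTML, and XML. The sources are XML sources, relational sources, and HTML sources.]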


Example: SETI@home

[Diagram: a Problem Partitioning component breaks the computation into many parts and distributes them to the clients as new sub-problems. Three clients return computed subresults to a Data Aggregation component.]
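
The partition-and-aggregate pattern can be sketched on a single machine, with worker threads standing in for the remote clients. The class name `PartitionAggregateSketch` and the choice of summing an array as the "computation" are assumptions made for illustration; the real SETI@home workload is signal analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionAggregateSketch {
    /** Splits data into up to 'parts' chunks, sums each chunk on a worker, then adds the subresults. */
    static long parallelSum(int[] data, int parts) {
        ExecutorService workers = Executors.newFixedThreadPool(parts);
        try {
            // Problem partitioning: each sub-problem is a contiguous slice.
            int chunk = (data.length + parts - 1) / parts;
            List<Future<Long>> subresults = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int lo = start;
                final int hi = Math.min(start + chunk, data.length);
                Callable<Long> subProblem = () -> {
                    long sum = 0;
                    for (int i = lo; i < hi; i++) sum += data[i];
                    return sum;
                };
                subresults.add(workers.submit(subProblem));
            }
            // Data aggregation: combine the computed subresults.
            long total = 0;
            for (Future<Long> f : subresults) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            workers.shutdown();
        }
    }
}
```

A distributed version would also have to cope with clients that never return a result, which this sketch ignores.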


Example: P2P File Sharing

[Diagram: four clients connected peer-to-peer, exchanging request and data messages. The network processes name-based requests for data; each node can make requests, forward requests, and return data.]
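
A minimal in-process sketch of name-based request forwarding follows. Each peer answers from its own data if it can and otherwise forwards the request to its neighbors. The class `PeerSketch` and the simple forward-to-all-neighbors strategy are assumptions for illustration; real P2P systems differ widely in how they route requests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PeerSketch {
    final String id;
    final Map<String, String> localData = new HashMap<>();
    final List<PeerSketch> neighbors = new ArrayList<>();

    PeerSketch(String id) {
        this.id = id;
    }

    /** Makes the two peers neighbors of each other. */
    static void connect(PeerSketch a, PeerSketch b) {
        a.neighbors.add(b);
        b.neighbors.add(a);
    }

    /** Entry point: this node makes a request on behalf of its user. */
    String request(String name) {
        return request(name, new HashSet<String>());
    }

    /** Returns the data if this node has it, else forwards to neighbors not yet asked. */
    private String request(String name, Set<String> visited) {
        if (!visited.add(id)) return null;        // already asked this node
        String hit = localData.get(name);
        if (hit != null) return hit;              // return data
        for (PeerSketch n : neighbors) {          // forward request
            String found = n.request(name, visited);
            if (found != null) return found;
        }
        return null;                              // nobody reachable has it
    }
}
```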


What Are the Hard Problems?

  • Disclaimer: most of the hard problems AREN’T solved (or solvable) – and there often isn’t any single BEST solution

    Much of systems design is about finding the right compromise for each specific problem

  • We can divide them into:

    • Scalability

    • Availability / reliability

    • Consistency

    • Interoperability

    • Location and resource discovery


Scalability

  • How do we support a large number of clients or requests?

    • Distribute work!

    • Challenges:

      • Coordination – takes significant overhead in the general case

      • Load balancing – avoid having bottlenecks

    • Parts of the solution:

      • Client-server, multi-tier, P2P architectures

      • Restricted programming models, e.g., MapReduce

      • Data partitioning, replication, remote procedure calls, …
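
To show what a "restricted programming model" looks like, here is a single-process sketch in the style of MapReduce word counting. The programmer supplies only `map` and `reduce`; the driver handles grouping, which a real framework would also distribute across machines. The class name `MapReduceSketch` and all method names are hypothetical and are not the API of any actual MapReduce implementation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {
    /** Map: one input line becomes a list of (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(w, 1));
            }
        }
        return out;
    }

    /** Reduce: all counts emitted for one word become a single total. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    /** Driver: runs map on every line, groups values by key, then runs reduce per key. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Because `map` sees one line at a time and `reduce` sees one key at a time, both phases can run on many machines without coordinating with each other.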


Availability/Reliability

  • How do we ensure the system is “up” when we want it to be, and doing the “right” thing?

    • Replication and redundancy

    • Security measures against attacks

    • Ability to undo/redo

    • Challenges:

      • Keeping things consistent

      • Performance vs. security

      • Acknowledgments

    • Parts of the solution:

      • Data partitioning, replication, …

      • Logging, transactions, …

      • Redundant hardware, multiple sites, …

      • Quorum and consensus algorithms
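
As a small worked example of the quorum idea: with N replicas, a read quorum of size R and a write quorum of size W are guaranteed to share at least one replica when R + W > N. The class name `QuorumSketch` and its method names are illustrative assumptions.

```java
final class QuorumSketch {
    /** True when every read quorum of size r must share a replica with every write quorum of size w. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    /** True when any two write quorums of size w must overlap, so conflicting writes meet at some replica. */
    static boolean writesOverlap(int n, int w) {
        return 2 * w > n;
    }

    /** Smallest quorum size that is a strict majority of n replicas. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        int n = 5;
        int q = majority(n);
        System.out.println("majority of " + n + " = " + q);
        System.out.println("reads see latest write: " + readsSeeLatestWrite(n, q, q));
    }
}
```

Choosing smaller R makes reads cheaper at the cost of larger W, and vice versa; majority quorums for both are one common compromise.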


Consistency / Consensus

  • Replication, distribution, and failures make it difficult to keep a unified, consistent view of the world – how do we combat this?

    • Locking, concurrency control, and invalidation schemes

    • Clock synchronization

    • Challenges:

      • Locking has huge performance overhead

      • Network partitions, disconnected operation

    • Parts of the solution:

      • Optimistic concurrency control, 2-phase locking

      • Distributed clock sync

      • Conflict resolvers
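
Optimistic concurrency control avoids holding locks: a writer reads a versioned value, computes its update, and commits only if nobody else has committed in the meantime. The sketch below shows the idea on one shared value inside a single JVM. The class `OptimisticCellSketch` is hypothetical; a distributed system would validate versions at the data store instead.

```java
import java.util.concurrent.atomic.AtomicReference;

final class OptimisticCellSketch {
    /** An immutable snapshot: the value plus the version at which it was written. */
    static final class Versioned {
        final long version;
        final int value;
        Versioned(long version, int value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Versioned> state =
        new AtomicReference<>(new Versioned(0, 0));

    /** Reads the current snapshot without taking any lock. */
    Versioned read() {
        return state.get();
    }

    /**
     * Commits newValue only if 'seen' is still the current snapshot.
     * Returns false on conflict, in which case the caller re-reads and retries.
     */
    boolean tryCommit(Versioned seen, int newValue) {
        return state.compareAndSet(seen, new Versioned(seen.version + 1, newValue));
    }
}
```

When conflicts are rare, almost every commit succeeds on the first try and no writer ever waits on a lock; when conflicts are frequent, the retries can cost more than locking would.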


Interoperability

  • How do we coordinate the efforts of components that have different data formats and/or source languages, and are on different machines?

    • Standardization!

    • Challenges:

      • Everything has a different semantics!

    • Parts of the solution:

      • Standard data formats: XML, XML schemas

      • “Schema mediation” and data translation

      • Remote procedure calls: CORBA, XML-RPC, …
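
A standard data format lets a component consume data without knowing how the producer was written. As a minimal sketch, the snippet below uses the JDK's built-in XPath support to pull one field out of an XML message. The class name `XmlExchangeSketch` and the sample `<course>` document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XmlExchangeSketch {
    /** Evaluates an XPath expression against an XML string and returns the result as text. */
    static String extract(String xml, String xpathExpr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpathExpr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XML or XPath", e);
        }
    }

    public static void main(String[] args) {
        String message = "<course><id>CIS 555</id><term>Spring 2010</term></course>";
        System.out.println(extract(message, "/course/id"));
    }
}
```

Agreeing on the syntax is only half of the problem: the two sides must still agree on what `id` and `term` mean, which is where schemas and schema mediation come in.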


Location & Resource Discovery

  • How do you find what you’re looking for?

    • Naming

    • Declarative queries over standard schemas

    • Advertisements

    • Challenges:

      • Naming has implicit semantics

      • What do you do when you don’t know what to call something?

    • Parts of the solution:

      • Directory systems – DNS, LDAP, etc.

      • Resource discovery and advertising protocols

      • Overlay networks, sharding schemes

      • Standardized schemas
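
One common sharding scheme used by overlay networks is consistent hashing: nodes and keys are hashed onto the same ring, and each key belongs to the first node at or after its position. The sketch below is a simplified, assumed version. The class name `ConsistentHashSketch` is hypothetical, and `String.hashCode` is used only for brevity; a real system would use a stronger hash and multiple virtual nodes per machine.

```java
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashSketch {
    // Ring position -> node name, kept sorted so we can find the next node clockwise.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    /** Returns the node responsible for the key: the first node at or after the key's ring position. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        // Past the last node, wrap around to the first one.
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Maps a string to a non-negative ring position. */
    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}
```

The benefit over `hash(key) % numNodes` is that adding or removing one node only moves the keys adjacent to it on the ring, rather than reshuffling almost everything.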


Our First Focus: Single Machines, aka Servers

  • How do you handle large numbers of concurrent users?

    • Processes

    • Threads

    • Events

    • Hybrids (e.g., thread pools)

    • Staged architectures
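
As a minimal sketch of the thread-pool hybrid: rather than one thread per user, a fixed set of worker threads pulls requests off a shared queue. The class name `ThreadPoolSketch` is hypothetical, and the "request" here is just a counter increment standing in for real request handling.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadPoolSketch {
    /** Handles 'requests' simulated requests on 'threads' workers; returns how many completed. */
    static int serve(int requests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < requests; i++) {
            // Each submitted task waits in the pool's queue until a worker is free.
            pool.execute(() -> handled.incrementAndGet());
        }
        pool.shutdown();                       // accept no new work, finish what is queued
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for workers");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println("handled " + serve(100, 4) + " requests");
    }
}
```

The pool size caps how many requests run at once, which bounds memory and context-switch cost even when thousands of requests arrive together.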


Next Time (Wed due to MLK Day)…

  • We’ll look under the covers of an HTTP server

    • Key ideas in building scalable systems

    • Principles of HTTP and web servers

    • Management of concurrent sessions

  • To read by next Wednesday:

    • Lampson and Saltzer paper: http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf

    • Tanenbaum Ch. 3.1

    • If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar OS book on interprocess communication

