Three Talks • Scalability Terminology • Gray (with help from Devlin, Laing, Spix) • What Windows is doing re this • Laing • The M$ PetaByte (as time allows) • Gray
Terminology for Scalability
Bill Devlin, Jim Gray, Bill Laing, George Spix
paper at: ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc
(Diagram: a GeoPlex of Farms; each Farm holds Clones (shared-nothing or shared-disk) and Partitions grouped into Packs (shared-nothing, active-active or active-passive).)
• Farms of servers:
  • Clones: identical
    • Scalability + availability
  • Partitions:
    • Scalability
    • Packs
      • Partition availability via fail-over
• GeoPlex for disaster tolerance.
Unpredictable Growth • The TerraServer Story: • Expected 5 M hits per day • Got 50 M hits on day 1 • Peak at 20 M hpd on a “hot” day • Average 5 M hpd over last 2 years • Most of us cannot predict demand • Must be able to deal with NO demand • Must be able to deal with HUGE demand
Web Services Requirements • Scalability: Need to be able to add capacity • New processing • New storage • New networking • Availability: Need continuous service • Online change of all components (hardware and software) • Multiple service sites • Multiple network providers • Agility: Need great tools • Manage the system • Change the application several times per year. • Add new services several times per year.
Premise: Each Site is aFarm • Buy computing by the slice (brick): • Rack of servers + disks. • Functionally specialized servers • Grow by adding slices • Spread data and computation to new slices • Two styles: • Clones: anonymous servers • Parts+Packs: Partitions fail over within a pack • In both cases, GeoPlex remote farm for disaster recovery
Scalable Systems: Scale UP and Scale OUT
• ScaleUP: grow by adding components to a single system.
• ScaleOut: grow by adding more systems.
ScaleUP and ScaleOUT
• Everyone does both.
• Choices:
  • Size of a brick
  • Clones or partitions?
  • Size of a pack
  • Whose software?
• ScaleUp and ScaleOut both have a large software component.
• Price per slice: 1 M$ (IBM S390? Sun E10,000?), 100 K$ (Wintel 8x), 10 K$ (Wintel 4x), 1 K$ (Wintel 1x).
Clones: Availability + Scalability • Some applications are • Read-mostly • Low consistency requirements • Modest storage requirement (less than 1 TB) • Examples: • HTML web servers (IP sprayer/sieve + replication) • LDAP servers (replication via gossip) • Replicate app at all nodes (clones) • Load Balance: • Spray & sieve requests across nodes. • Route requests across nodes. • Grow: adding clones • Fault tolerance: stop sending to that clone.
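A minimal sketch (not from the talk) of the clone model just described: identical servers behind a sprayer, requests routed round-robin, a failed clone simply dropped from rotation, growth by adding another clone. All class and server names here are made up for illustration.

```python
import itertools

class Clone:
    """An anonymous, identical server; any clone can serve any request."""
    def __init__(self, name):
        self.name = name
    def handle(self, request):
        return f"{self.name} served {request}"

class CloneFarm:
    """Spray requests round-robin across live clones; drop failed clones."""
    def __init__(self, clones):
        self.clones = list(clones)
        self._cycle = itertools.cycle(self.clones)

    def route(self, request):
        if not self.clones:
            raise RuntimeError("no live clones")
        return next(self._cycle).handle(request)   # spray requests across nodes

    def mark_failed(self, clone):
        self.clones.remove(clone)                  # fault tolerance: stop sending to that clone
        self._cycle = itertools.cycle(self.clones)

    def grow(self, clone):
        self.clones.append(clone)                  # scale out: just add another clone
        self._cycle = itertools.cycle(self.clones)

farm = CloneFarm([Clone("web1"), Clone("web2"), Clone("web3")])
print(farm.route("GET /index.html"))
farm.mark_failed(farm.clones[1])                   # web2 fails; traffic routes around it
print(farm.route("GET /index.html"))
```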
Two Clone Geometries
• Shared-Nothing: exact replicas
• Shared-Disk (state stored in server)
• If clones have any state, make it disposable.
• Manage clones by reboot; failing that, replace.
• One person can manage thousands of clones.
Clone Requirements • Automatic replication (if they have any state) • Applications (and system software) • Data • Automatic request routing • Spray or sieve • Management: • Who is up? • Update management & propagation • Application monitoring. • Clones are very easy to manage: • Rule of thumb: 100’s of clones per admin.
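The "manage clones by reboot, failing that replace" rule can be pictured as a simple management loop. This is an illustrative sketch only; the ping/reboot/replace callbacks are hypothetical stand-ins for whatever monitoring and deployment tooling a farm actually uses.

```python
def manage_clones(clones, ping, reboot, replace, max_reboots=2):
    """One management pass: healthy clones are left alone, sick clones are
    rebooted, and clones that stay sick are replaced (they hold no precious state)."""
    for clone in clones:
        if ping(clone):
            clone["strikes"] = 0          # clone is up; nothing to do
        elif clone["strikes"] < max_reboots:
            clone["strikes"] += 1
            reboot(clone)                 # first remedy: reboot the brick
        else:
            replace(clone)                # last resort: swap in a fresh clone
            clone["strikes"] = 0

clones = [{"name": "web1", "strikes": 0}, {"name": "web2", "strikes": 0}]
manage_clones(
    clones,
    ping=lambda c: c["name"] != "web2",                 # pretend web2 is down
    reboot=lambda c: print("rebooting", c["name"]),
    replace=lambda c: print("replacing", c["name"]),
)
```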
Partitions for Scalability • Clones are not appropriate for some apps: • Stateful apps do not replicate well • High update rates do not replicate well • Examples: • Email • Databases • Read/write file servers… • Cache managers • Chat • Partition state among servers • Partitioning: • must be transparent to the client • split & merge partitions online
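To make "partition state among servers, transparently to the client" concrete, here is a toy hash-partitioned store (the slide does not prescribe hash vs. range partitioning; hashing is just the simplest to show). The mailbox example and all names are illustrative.

```python
import hashlib

class PartitionedStore:
    """Each key (e.g. a mailbox name) is owned by exactly one partition.
    Clients call get/put and never see which server holds the state."""

    def __init__(self, servers):
        self.servers = servers                  # one dict per partition server

    def _owner(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def put(self, key, value):
        self._owner(key)[key] = value           # an update touches exactly one partition

    def get(self, key):
        return self._owner(key).get(key)

mail = PartitionedStore([{}, {}, {}])           # three partition servers
mail.put("alice@example.com", ["msg 1"])
print(mail.get("alice@example.com"))
```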
Packs for Availability • Each partition may fail (independently of the others) • Partitions migrate to a new node via fail-over • Fail-over in seconds • Pack: the nodes supporting a partition • VMS Cluster, Tandem, SP2 HACMP, … • IBM Sysplex™ • WinNT MSCS (wolfpack) • Partitions typically grow in packs. • Active-Active: all nodes provide service • Active-Passive: hot standby is idle • Cluster-in-a-box now a commodity
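A minimal active-passive pack, sketched below: one node of the pack hosts the partition, and on failure the partition migrates to a surviving pack member while clients keep addressing the same partition. The node names and the two-node pack size are illustrative.

```python
class Pack:
    """The set of nodes that can host one partition.
    Active-passive: one node serves; the others stand by for fail-over."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.active = self.nodes[0]       # node currently hosting the partition

    def fail_over(self):
        """The partition migrates to a surviving node; the partition itself
        keeps its identity, so the move is transparent to clients."""
        survivors = [n for n in self.nodes if n != self.active]
        if not survivors:
            raise RuntimeError("pack has no surviving node")
        self.nodes.remove(self.active)
        self.active = survivors[0]
        return self.active

pack = Pack(["node-a", "node-b"])         # a two-node pack for one partition
print(pack.active)                        # node-a serves the partition
print(pack.fail_over())                   # node-a dies: node-b takes over in seconds
```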
Partitions and Packs
• Partitions: scalability
• Packed partitions: scalability + availability
Parts+Packs Requirements • Automatic partitioning (in dbms, mail, files, …) • Location transparent • Partition split/merge • Grow without limits (100x10TB) • Application-centric request routing • Simple fail-over model • Partition migration is transparent • MSCS-like model for services • Management: • Automatic partition management (split/merge) • Who is up? • Application monitoring.
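The partition split/merge requirement is easiest to see with range partitioning, where a split is a local operation on one partition and the routing map stays consistent for clients. A toy sketch, illustrative only; a real system would also migrate the moved keys to another node.

```python
import bisect

class RangePartitions:
    """Keys map to partitions via sorted range boundaries, so splitting one
    partition does not disturb routing to any of the others."""

    def __init__(self):
        self.bounds = []              # lower bound of every partition but the first
        self.parts = [{}]             # one key->value dict per partition

    def _index(self, key):
        return bisect.bisect_right(self.bounds, key)

    def put(self, key, value):
        self.parts[self._index(key)][key] = value

    def get(self, key):
        return self.parts[self._index(key)].get(key)

    def split(self, i):
        """Split partition i at its median key; requests keep routing correctly."""
        mid = sorted(self.parts[i])[len(self.parts[i]) // 2]
        low = {k: v for k, v in self.parts[i].items() if k < mid}
        high = {k: v for k, v in self.parts[i].items() if k >= mid}
        self.parts[i:i + 1] = [low, high]
        self.bounds.insert(i, mid)

store = RangePartitions()
for user in ["ann", "bob", "carol", "dave"]:
    store.put(user, f"{user}'s mailbox")
store.split(0)                        # the partition grew too big: split it online
print(store.get("ann"), "|", store.get("dave"))
```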
GeoPlex: Farm Pairs • Two farms (or more) • State (your mailbox, bank account) stored at both farms • Changes from one sent to the other • When one farm fails, the other provides service • Masks • Hardware/software faults • Operations tasks (reorganize, upgrade, move) • Environmental faults (power fail, earthquake, fire)
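A toy sketch of the GeoPlex idea above: every update is applied at the local farm and shipped to the partner farm, so either site can serve when the other fails. The farm names, the synchronous forwarding, and the simple primary/secondary rule are all illustrative simplifications.

```python
class Farm:
    """One site in a toy GeoPlex: applies updates locally and ships each
    change to its partner so either site holds the state."""

    def __init__(self, name):
        self.name = name
        self.state = {}               # e.g. mailboxes, bank accounts
        self.partner = None
        self.up = True

    def apply(self, key, value, forward=True):
        self.state[key] = value
        if forward and self.partner and self.partner.up:
            self.partner.apply(key, value, forward=False)   # send change to the other farm

def serve(primary, secondary, key):
    """Directory-style fail-over: use the primary farm unless it is down."""
    farm = primary if primary.up else secondary
    return f"{farm.name}: {farm.state.get(key)}"

east, west = Farm("east"), Farm("west")
east.partner, west.partner = west, east
east.apply("alice/mailbox", ["msg 1"])       # the update reaches both farms
east.up = False                              # disaster at the east site
print(serve(east, west, "alice/mailbox"))    # the west farm still has the state
```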
Directory: Fail-Over and Load Balancing • Routes request to the right farm • Farm can be clone or partition • At the farm, routes request to the right service • At the service, routes request to • Any clone • The correct partition • Routes around failures.
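The routing path this slide describes (request to the right farm, then the right service, then any clone or the one correct partition) can be condensed into a few lines. This is a hedged sketch with made-up farm and service names, not the actual directory logic.

```python
def route(request, farms):
    """Pick a live farm, then the named service, then a clone (any one will do)
    or the owning partition (chosen by the request's key)."""
    farm = next(f for f in farms if f["up"])                 # route around failed farms
    service = farm["services"][request["service"]]
    if service["kind"] == "clone":
        return service["nodes"][hash(request["key"]) % len(service["nodes"])]
    idx = sum(request["key"] >= b for b in service["boundaries"])
    return service["nodes"][idx]                             # the one correct partition

farms = [
    {"up": False, "services": {}},                           # failed farm: skipped
    {"up": True, "services": {
        "web":  {"kind": "clone", "nodes": ["web1", "web2", "web3"]},
        "mail": {"kind": "partition", "boundaries": ["m"],
                 "nodes": ["mbox-a-to-l", "mbox-m-to-z"]},
    }},
]
print(route({"service": "web", "key": "alice"}, farms))
print(route({"service": "mail", "key": "zoe"}, farms))
```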
Availability: more nines at each level
• Well-managed nodes: masks some hardware failures
• Well-managed packs & clones: masks hardware failures, operations tasks (e.g. software upgrades), and some software failures
• Well-managed GeoPlex: masks site failures (power, network, fire, move, …) and some operations failures
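Those levels can be made roughly quantitative with the usual redundancy arithmetic: if a single node is up with probability a, then n independently failing clones (or pack members, or farms) are up with probability 1 - (1-a)^n. The inputs below are hypothetical, not figures from the talk, and the formula ignores correlated failures, which is exactly what the operations and GeoPlex bullets are about.

```python
import math

def nines(availability):
    """Express availability as a count of leading nines (0.999 -> ~3)."""
    return -math.log10(1 - availability)

node = 0.99                         # a single well-managed node: ~2 nines (assumed)
pack = 1 - (1 - node) ** 2          # two nodes, either can serve: ~4 nines
geoplex = 1 - (1 - pack) ** 2       # two independent farms: ~8 nines (if truly independent)

for label, a in [("node", node), ("pack/clones", pack), ("geoplex", geoplex)]:
    print(f"{label:12s} {a:.8f}  ~{nines(a):.1f} nines")
```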
Cluster Scale Out Scenarios: The FARM, Clones and Packs of Partitions
(Diagram: Web clients → load balance → cloned front ends (firewall, sprayer, web server) → cloned, packed file servers (Web File Store A and B, kept in sync by database replication) and packed SQL partitions (SQL Partitions 1-3 plus SQL temp state), giving packed-partition database transparency.)
Some Examples
• TerraServer:
  • 6 IIS clone front-ends (wlbs)
  • 3-partition 4-pack backend: 3 active, 1 passive
  • Partition by theme and geography (longitude)
  • 1/3 sysadmin
• Hotmail:
  • 1000 IIS clone HTTP login
  • 3400 IIS clone HTTP front door
  • + 1000 clones for ad rotator, in/out bound, …
  • 115 partition backend (partition by mailbox)
  • Cisco local director for load balancing
  • 50 sysadmin
• Google (Inktomi is similar but smaller):
  • 700 clone spider
  • 300 clone indexer
  • 5-node geoplex (full replica)
  • 1,000 clones/farm do search
  • 100 clones/farm for http
  • 10 sysadmin
• See Challenges to Building Scalable Services: A Survey of Microsoft’s Internet Services, Steven Levi and Galen Hunt, http://big/megasurvey/megasurvey.doc
Acronyms • RACS: Reliable Arrays of Cloned Servers • RAPS: Reliable Arrays of Partitioned and Packed Servers (the first "p" is silent).
Emissaries and Fiefdoms • Emissaries are (nearly) stateless; emissaries are easy to clone. • Fiefdoms are stateful; fiefdoms get partitioned.
Summary
(Diagram: a GeoPlex of Farms; each Farm holds Clones (shared-nothing or shared-disk) and Partitions grouped into Packs (shared-nothing, active-active or active-passive).)
• Terminology for scalability
• Farms of servers:
  • Clones: identical
    • Scalability + availability
  • Partitions:
    • Scalability
    • Packs
      • Partition availability via fail-over
• GeoPlex for disaster tolerance.
References:
• Architectural Blueprint for Large eSites, Bill Laing, http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp
• Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS, Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR-99-85, ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc
Three Talks • Scalability Terminology • Gray (with help from Devlin, Laing, Spix) • What Windows is doing re this • Laing • The M$ PetaByte (as time allows) • Gray
What Windows is Doing • Continued architecture and analysis work • AppCenter, BizTalk, SQL, SQL Service Broker, ISA, … all key to Clones/Partitions • Exchange is an archetype • Front ends, directory, partitioned, packs, transparent mobility. • NLB (clones) and MSCS (Packs) • High Performance Technical Computing • Appliances and hardware trends • Management of these kinds of systems • Still need good ideas on…
Architecture and Design work • Produced an architectural Blueprint for large eSites published on MSDN • http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp • Creating and testing instances of the architecture • Team led by Per Vonge Neilsen • Actually building and testing examples of the architecture with partners. (sometimes known as MICE) • Built a scalability “Megalab” run by Robert Barnes • 1000 node cyber wall, 315 1U Compaq DL360s, 32 8ways, 7000 disks
Clones and Packs aka Clustering • Integrated the NLB and MSCS teams • Both focused on scalability and availability • NLB for Clones • MSCS for Partitions/Packs • Vision is a single communications and group membership infrastructure and a set of management tools for Clones, Partitions, and Packs • Unify management for clones/partitions at BOTH: OS and app level (e.g. IIS, Biztalk, AppCenter, Yukon, Exchange…)
Clustering in Whistler Server • Microsoft Cluster Server • Much improved setup and installation • 4 node support in Advanced server • Kerberos support for Virtual Servers • Password change without restarting cluster service • 8 node support in Datacenter • SAN enhancements (Device reset not bus reset for disk arbitration, Shared disk and boot disk on same bus) • Quorum of nodes supported (no shared disk needed) • Network Load Balancer • New NLB manager • Bi-Directional affinity for ISA as a Proxy/Firewall • Virtual cluster support (Different port rules for each IP addr) • Dual NIC support
Geoclusters • AKA geographically dispersed packs • Essentially the nodes and storage are replicated at 2 sites, and the disks are remotely mirrored • Being deployed today; helping vendors get certified; we still need better tools • Working with EMC, Compaq, NSI Software, StorageApps • Log shipping (SQL) and extended VLANs (IIS) are also solutions
High Performance Computing
• No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$ … we will.
• Last year (CY2000):
  • This work is a part of server scale-out efforts (BLaing)
  • Web site and HPC Tech Preview CD late last year
  • A W2000 "Beowulf" equivalent w/ 3rd-party tools
  • Better than the competition:
    • 10-25% faster than Linux on SMPs (2, 4 & 8 ways)
    • More reliable than SP2 (!)
    • Better performance & integration w/ IBM peripherals (!)
  • But it lacks MPP debugger, tools, evangelism, reputation
  • See ../windows2000/hpc and \\jcbach\public\cornell*
• This year (CY2001):
  • Partner w/ Cornell/MPI-Soft/+
  • Unix to W2000 projects
  • Evangelism of commercial HPC (start w/ financial svcs)
  • Showcase environment & apps (EBC support)
  • First Itanium FP "play-offs"
  • BIG tools integration / beta
  • Dell & Compaq offer web HPC buy and support experience (buy capacity by-the-slice)
  • Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux)
  • Gain on Sun in the www.top500.org list
  • Address the win-by-default assumption for Linux in HPC
Appliances and Hardware Trends • The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices • Working with OEMs to adopt WindowsXP • Ultradense servers are on the horizon • 100s of servers per rack • Manage the rack as one • Infiniband and 10 Gbps Ethernet change things.
Operations and Management • Great research work done in MSR on this topic • The Mega services paper by Levi and Hunt • The follow on BIG project developed the ideas of • Scale Invariant Service Descriptions with • automated monitoring and • deployment of servers. • Building on that work in Windows Server group • AppCenter doing similar things at app level
Still Need Good Ideas on… • Automatic partitioning • Stateful load balancing • Unified management of clones/partitions at both app and OS level
Three Talks • Scalability Terminology • Gray (with help from Devlin, Laing, Spix) • What Windows is doing re this • Laing • The M$ PetaByte (as time allows) • Gray
We're Building Petabyte Stores
(Scale diagram, Kilo up through Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for a book, a photo, a movie, all LoC books (words), all books multimedia, and "everything recorded".)
• Soon everything can be recorded and indexed
• Hotmail 100 TB now
• MSN 100 TB now
• List price is 800 M$/PB (including FC switches & brains)
• Must GeoPlex it.
• Can we get it for 1 M$/PB?
• Personal 1 TB stores for 1 k$
Building a Petabyte Store • EMC: ~500 k$/TB = 500 M$/PB, plus FC switches, plus… 800 M$/PB • TPC-C SANs (Dell 18GB/…): 62 M$/PB • Dell local SCSI, 3ware: 20 M$/PB • Do it yourself: 5 M$/PB • A billion here, a billion there, soon you're talking about real money!
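The per-petabyte figures are straight unit arithmetic from the per-terabyte prices, as the quick check below shows (prices are the slide's roughly year-2000 numbers, restated as $/TB).

```python
TB_PER_PB = 1000

dollars_per_tb = {
    "EMC (before FC switches etc.)": 500_000,
    "TPC-C style SAN":                62_000,
    "Dell local SCSI + 3ware":        20_000,
    "Do it yourself":                  5_000,
}
for name, per_tb in dollars_per_tb.items():
    print(f"{name:32s} {per_tb * TB_PER_PB / 1e6:5.0f} M$/PB")
```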
320 GB for 2 k$ (now): ~6 M$/PB
• 4 x 80 GB IDE (2 hot-pluggable): 1,000$
• SCSI-IDE bridge: 200$
• Box (500 MHz cpu, 256 MB RAM, fan, power, Enet): 500$
• Ethernet switch: 150$/port
• Or 8 disks/box: 640 GB for ~3 K$ (or 300 GB RAID)
Hot Swap Drives for Archive or Data Interchange • 25 MBps write (so can write N x 80 GB in 3 hours) • 80 GB/overnight = ~N x 2 MB/second @ 19.95$/night • Compare to 1$/GB via the Internet
A Storage Brick
• Per brick: 2 x 80 GB disks, 500 MHz cpu (intel/amd/arm), 256 MB RAM, 2 eNet RJ45, fan(s), current disk form factor, 30 watt, 600$ (?)
• Per rack (48U, 3U/module, 16 units/U): 400 disks, 200 Whistler nodes, 32 TB, 100 billion instructions per second, 120 K$/rack, 4 M$/PB
• Per petabyte (33 racks): 4 M$, 3 TeraOps (6,600 nodes), 13 k disk arms (1/2 TBps IO)
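The rack and petabyte lines follow from the per-brick numbers; a quick check (using the slide's figures, with 1 PB taken as 1000 TB) lands on the same order of magnitude:

```python
import math

bricks_per_rack = 200
tb_per_brick = 2 * 80 / 1000              # two 80 GB disks per brick
dollars_per_brick = 600
mips_per_brick = 500                      # one 500 MHz cpu per brick

tb_per_rack = bricks_per_rack * tb_per_brick                    # 32 TB
kdollars_per_rack = bricks_per_rack * dollars_per_brick / 1000  # 120 K$
racks_per_pb = math.ceil(1000 / tb_per_rack)                    # ~32 (the slide budgets 33)
mdollars_per_pb = racks_per_pb * kdollars_per_rack / 1000       # ~4 M$
bips_per_rack = bricks_per_rack * mips_per_brick / 1000         # ~100 billion instructions/s

print(tb_per_rack, kdollars_per_rack, racks_per_pb, mdollars_per_pb, bips_per_rack)
```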
What Software Do The Bricks Run?
• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other: COM+, SOAP, BizTalk
• Huge leverage in high-level interfaces.
• Same old distributed system story.
(Diagram: two node stacks, Applications over RPC / streams / datagrams over the CLR, linked by Infiniband / Gbps Ethernet.)
Storage Rack in 2 Years?
• 300 arms, 50 TB (160 GB/arm)
• 24 racks, 48 storage processors, 2x6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports = 500 MBps IO
• My suggestion: move the processors into the storage racks.
Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 60k$ today, 10k$ in a few years. • Admin cost >> storage cost??? • Challenge: • Automate ALL storage admin tasks
It's Hard to Archive a Petabyte: it takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?). A geo-plex.
• Scrub it continuously (look for errors)
• On failure:
  • use other copy until failure repaired,
  • refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
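The "12 days" figure is simple bandwidth arithmetic, which is the whole argument for keeping the second copy online rather than on tape:

```python
petabyte = 10 ** 15                 # bytes
restore_rate = 1e9                  # 1 GBps, as above
days = petabyte / restore_rate / 86_400
print(f"restoring 1 PB at 1 GBps takes ~{days:.1f} days")   # ~11.6 days
```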
Call To Action
• Let's work together to make storage bricks:
  • Low cost
  • High function
  • NAS (network attached storage), not SAN (storage area network)
• Ship NT8/CLR/IIS/SQL/Exchange/… with every disk drive
Three Talks • Scalability Terminology • Gray (with help from Devlin, Laing, Spix) • What Windows is doing re this • Laing • The M$ PetaByte (as time allows) • Gray
Cheap Storage • Disks are getting cheap: • 3 k$/TB disks (12 80 GB disks @ 250$ each)
All Device Controllers Will Be Super-Computers
(Diagram: central processor & memory and smart device controllers on a terabyte backplane.)
• TODAY:
  • Disk controller is a 10 mips risc engine with 2 MB DRAM
  • NIC is similar power
• SOON:
  • Will become 100 mips systems with 100 MB DRAM.
  • They are nodes in a federation (can run Oracle on NT in the disk controller).
• Advantages:
  • Uniform programming model
  • Great tools
  • Security
  • Economics (cyberbricks)
  • Move computation to data (minimize traffic)
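The last bullet, moving computation to the data, is the economic heart of the cyberbrick argument. A toy contrast (illustrative only) between shipping every record to the central processor and shipping a small predicate to the smart controller:

```python
records = [{"id": i, "hot": i % 1000 == 0} for i in range(100_000)]

def central_scan(disk):
    """Classic path: move every record across the backplane, filter centrally."""
    moved = list(disk)                            # all 100,000 records cross the wire
    return [r for r in moved if r["hot"]], len(moved)

def smart_controller_scan(disk, predicate):
    """Cyberbrick path: the controller runs the predicate next to the data."""
    matches = [r for r in disk if predicate(r)]
    return matches, len(matches)                  # only the 100 matches cross the wire

_, moved_central = central_scan(records)
_, moved_smart = smart_controller_scan(records, lambda r: r["hot"])
print(f"records moved: central={moved_central}, smart controller={moved_smart}")
```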