Session D: Tashi

Presentation Transcript


  1. Session D: Tashi

  2. Tashi. Michael Ryan, Intel

  3. Agenda. Schedule: Introduction 8.30-9.00, Hadoop 9.00-10.45, Break 10.45-11.00, Pig 11.00-12.00, Lunch 12.00-1.00, Tashi 1.00-3.00, Break 3.00-3.15, PRS 3.15-5.00. Tashi session outline: Overview, User view, Administration, Installation, Internals, Summary

  4. Overview

  5. Tashi An infrastructure through which service providers are able to build applications that harness cluster computing resources to efficiently access repositories of Big Data

  6. Example Applications

  7. Cluster Computing: A User’s Perspective. The job-submission spectrum runs from tight to loose environment coupling: runtime-specific (e.g. Hadoop), queue-based (e.g. Condor or Torque), and virtual machine-based (e.g. EC2 or COD…)

  8. Tashi System Requirements • Provide high-performance execution over Big Data repositories → many spindles, many CPUs, co-location • Enable multiple services to access a repository concurrently • Enable low-latency scaling of services • Enable each service to leverage its own software stack → virtualization, file-system protections • Enable slow resource scaling for growth • Enable rapid resource scaling for power/demand → scaling-aware storage

  9. Tashi High-Level Architecture (diagram): remote cluster users and remote cluster owners interact with the Cluster Manager, which provisions logical clusters on top of distributed storage system(s). Note: the Tashi runtime and distributed storage systems do not necessarily run on the same physical nodes as the logical clusters

  10. Tashi Components (diagram: Cluster Manager, Scheduler, Virtualization Service, Storage Service, and cluster nodes). Services are instantiated through virtual machines. Most decisions happen in the Scheduler, which manages compute and storage in concert. Data location information is exposed to the scheduler and to services. Cluster nodes are assumed to be commodity machines. The Cluster Manager maintains databases and routes messages; its decision logic is limited

  11. Tashi Operation (diagram: Cluster Manager, Scheduler, Virtualization Service, Storage Service, and cluster nodes). A query arrives at answers.opencirrus.net, a web server running in one VM. The web server converts the query into a parallel data processing request and, acting as a Tashi client, submits a request for additional VMs (e.g. create 4 VMs to handle files 5, 13, 17, and 26). The request is forwarded, the scheduler receives the file mapping information from the storage service, and VMs are requested on the appropriate nodes. After the data objects are processed, the results are collected and forwarded to Alice. The VMs can then be destroyed

  12. Why Virtualization? • Ease of deployment • Boot 100 copies of an operating system in 2 minutes • Cluster lubrication • Machines can be migrated or even restarted very easily in a different location • Overheads are going down • Even workloads that tax the virtual memory subsystem can now run with a very small overhead • I/O intensive workloads have improved dramatically, but still have some room for improvement

  13. User View

  14. Tashi in a Nutshell • Tashi is primarily a system for managing Virtual Machines (VMs) • Virtual Machines are software containers that provide the illusion of real hardware, enabling • Physical resource sharing • OS-level isolation • User specification of custom software environments • Rapid provisioning of services • Users will use Tashi to request the creation, destruction, and manipulation of VMs

  15. Tashi Native Interface • Users invoke Tashi actions through a Tashi client • The client will have been configured by an administrator to communicate with the Tashi Cluster Manager • Example client actions include: • tashi createVm • tashi destroyVm • tashi createMany • etc.

  16. Tashi AWS-compatibility • Tashi also has a client interface that is compatible with a subset of Amazon Web Services* • Parts of the SOAP and QUERY interfaces

  17. Tashi AWS-compatibility (diagram): clients such as Elastic Fox and ec2-api-tools issue QUERY and SOAP requests; an Apache cgi-bin front end translates QUERY into SOAP for the Tashi Agent, which communicates with the Cluster Manager (CM) and its VM instance DB and Node Manager DB
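
  Because only parts of the QUERY and SOAP interfaces are implemented, support varies by tool, but as a minimal sketch the standard ec2-api-tools can be pointed at the cluster by overriding their service endpoint (the URL below is hypothetical and depends on where the site's Apache front end runs; credential setup is omitted):

  export EC2_URL=http://tashi-frontend.example.org/   # hypothetical endpoint for the Tashi AWS-compatibility agent
  ec2-describe-instances                              # list instances via the EC2-compatible interface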

  18. Tashi Organization • Each cluster contains one Tashi Cluster Manager (CM) • The CM maintains a database of: • Available physical resources (nodes) • Active virtual machines • Pending requests for virtual machines • Virtual networks • Users submit requests to the CM through a Tashi Client • The Tashi Scheduler uses the CM databases to invoke actions, such as VM creation, through the CM • Each node contains a Node Manager that carries out actions, such as invoking the local Virtual Machine Manager (VMM) to create a new VM, and monitors the performance of VMs

  19. Tashi Software Architecture (diagram). Centralized cluster administration: the Cluster Manager (CM) with its VM instance DB and Node Manager DB, the Scheduling Agent, site-specific plugin(s), a DFS proxy, and the Client API used by clients. Each compute node: a Node Manager (NM), reached over the CM-NM API, with Resource Controller Plugins (VMM, DFS, power, etc.) and Sensor Plugins (e.g. Ganglia), hosting the VMs; non-Tashi components and system software on the node include the VMM, the DFS and DFS Metadata Server, nmd, iptables/vlan, and sshd

  20. Tashi Native Client Interface (I) • VM Creation/Destruction Calls (Single Version) • createVm [--userId <value>] --name <value> [--cores <value>] [--memory <value>] --disks <value> [--nics <value>] [--hints <value>] • destroyVm --instance <value> • shutdownVm --instance <value> • VM Creation/Destruction Calls (Multiple Version) • createMany [--userId <value>] --basename <value> [--cores <value>] [--memory <value>] --disks <value> [--nics <value>] [--hints <value>] --count <value> • destroyMany --basename <value>

  21. Creating a VM: tashi createVm --name mikes-vm --cores 4 --memory 1024 --disks hardy.qcow2. Here --name specifies the DNS name to be created and --disks specifies the disk image. Advanced: [--nics <value>] [--hints <value>]
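
  When the instance is no longer needed it can be shut down or removed with the corresponding calls; a sketch, assuming the client accepts the instance name for --instance (otherwise use the numeric id shown by getMyInstances):

  tashi shutdownVm --instance mikes-vm   # ask the guest OS to shut down cleanly
  tashi destroyVm --instance mikes-vm    # remove the instance outright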

  22. Tashi: Instances • An instance is a running VM • Each disk image may be used for multiple VMs if the ‘persistent’ bit is not set • A VM may be booted in persistent mode to make modifications without building an entirely new disk image

  23. getMyInstances Explained tashi getMyInstances • This lists all VMs belonging to your userId • This is a good way to see what you’re currently using

  24. getVmLayout Explained tashi getVmLayout • This command displays the layout of currently running VMs across the nodes in the cluster
  id   name     state    instances                   usedMemory  memory  usedCores  cores
  ---------------------------------------------------------------------------------------
  126  r3r2u42  Normal   ['bfly3', 'bfly4']          14000       16070   16         16
  127  r3r2u40  Normal   ['mpa-00']                  15360       16070   8          16
  128  r3r2u38  Normal   ['xren1', 'jpan-vm2']       15480       16070   16         16
  129  r3r2u36  Normal   ['xren3', 'collab-00']      14800       16070   16         16
  130  r3r2u34  Normal   ['collab-02', 'collab-03']  14000       16070   16         16
  131  r3r2u32  Drained  []                          0           16068   0          16
  132  r3r2u30  Normal   ['collab-04', 'collab-05']  14000       16070   16         16
  133  r3r2u28  Normal   ['collab-06', 'collab-07']  14000       16070   16         16

  25. Tashi Native Client Interface (II) • VM Management Calls • suspendVm --instance <value> • resumeVm --instance <value> • pauseVm --instance <value> • unpauseVm --instance <value> • migrateVm --instance <value> --targetHostId <value> • vmmSpecificCall --instance <value> --arg <value>
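
  For example, the VM created earlier could be checkpointed, resumed, or moved to another host; a sketch in which the target host id is illustrative (real host ids come from getHosts or getVmLayout):

  tashi suspendVm --instance mikes-vm                      # checkpoint the VM to storage
  tashi resumeVm --instance mikes-vm                       # restore it from the checkpoint
  tashi migrateVm --instance mikes-vm --targetHostId 127   # move it to host 127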

  26. Tashi Native Client Interface (III) • Bookkeeping Calls • getMyInstances • getInstances • getVmLayout • getUsers • getNetworks • getHosts

  27. Creating Multiple VMs: tashi createMany --count 10 --basename mikes-vm --cores 4 --memory 1024 --disks hardy.qcow2. Here --basename specifies the base for the DNS names to be created and --disks specifies the disk image. Advanced: [--nics <value>] [--hints <value>]

  28. Example cluster: Maui/Torque • Configure a base disk image from an existing Maui/Torque cluster (or set up a new one) • We’ve done this - amd64-torque_node.qcow2 • Ask the Cluster Manager (CM) to create <N> VMs using this image (see the sketch below) • Have one preconfigured to be the scheduler and queue manager • Or set it up once the VMs have booted • Or have a separate image
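
  A hedged sketch of such a request, using the createMany syntax shown earlier (the basename, count, cores, and memory values are illustrative):

  tashi createMany --count 8 --basename torque-node --cores 4 --memory 1024 --disks amd64-torque_node.qcow2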

  29. Example cluster: Web Service • Configure a base image for a web server, and whatever other tiers (database, etc) you need for your service • Variable numbers of each can be created by requesting them from the CM • Conventional architecture for a web service
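
  As a sketch (the image names and counts are hypothetical), issuing a separate createMany per tier lets each tier scale independently:

  tashi createMany --count 4 --basename web --disks webserver.qcow2   # front-end web servers
  tashi createMany --count 2 --basename db --disks database.qcow2     # database tier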

  30. Example cluster: Hadoop • Configure a base image including Hadoop • Ask the CM to create instances • Note: Hadoop wants memory • Two options: • Let HDFS reside in the VMs • Not ideal for availability/persistence • Use HDFS from the hosts • Upcoming topic
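
  A sketch, assuming a prepared image named hadoop.qcow2 (the name, count, and sizes are illustrative; the larger memory reflects the note above):

  tashi createMany --count 10 --basename hadoop-node --cores 4 --memory 4096 --disks hadoop.qcow2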

  31. Appliances • Not surprisingly, this set of examples makes one think of VM appliances • Certainly not a new concept • We’ve built several of these from the software configuration of common systems at our site • Configuration of old physical nodes • Clean images after an OS install (Ubuntu)

  32. Where are we today? • Tashi can reliably manage virtual machines spread across a cluster • In production use for over a year • Still some opportunities to add features • Security • Intelligent scheduling • Additional opportunities for research • Power management • Alternative distributed file systems • Other

  33. Where are we today? (cont) • Our deployment of Tashi has managed ~500 VMs across ~150 hosts • Primary access mechanism for the Big Data cluster • Maui/Torque and Hadoop have been pulled into VMs and are running on top of Tashi

  34. Tashi Deployment at Intel Labs Pittsburgh (ILP) • Tashi is used on the Open Cirrus site at ILP • Majority of the cluster • Some nodes run Maui/Torque, Hadoop • Primary source of computational power for the lab • Mix of preexisting batch users, HPC workloads, Open Cirrus customers, and others

  35. Storage

  36. Storing the Data – Choices (diagram). Model 1: Separate compute/storage (compute servers plus storage servers): compute and storage can scale independently; many opportunities for reliability. Model 2: Co-located compute/storage (combined compute/storage servers): no compute resources are under-utilized; potential for higher throughput

  37. How is this done currently? (diagram placing systems along two axes: separate vs. co-located compute/storage, and single vs. multiple cluster users.) HPC: fine-grained parallelism with separate compute/storage. Amazon EC2/S3: virtualized compute with separate compute/storage, serving multiple cluster users. Hadoop/Google: coarse-grained parallelism with co-located compute/storage. Tashi: co-located compute/storage for multiple cluster users. See also: Usher, CoD, Eucalyptus, SnowFlock, …

  38. Example cluster hardware (diagram): racks connected by 48-port Gbps switches with 4/8 Gbps uplinks. 1U rack: 30 servers, 2 disks/server. Blade rack: 40 servers, 2 disks/server. 2U rack: 15 servers, 6 disks/server

  39. Far vs Near • With co-located compute/storage: • Near: data consumed on the node where it is stored • Far: data consumed across the network • System software must enable near access for good performance • MapReduce provides near access • HPC typically provides far access, unless function shipping is used

  40. Far vs Near Analysis – Methodology: assume an I/O-bound (scan) application, one task per spindle, and no CPU load. In the far system, data is consumed on a randomly selected node; in the near system, data is consumed on the node where it is stored. Average throughput, no queueing model. Scenario 1: 11 racks @ 4 Gbps. Scenario 2: 5 racks @ 8 Gbps. Scenario 3: 5 pods @ 8 Gbps of 11 racks @ 4 Gbps

  41. Far vs Near Access Throughput (chart comparing far and near access throughput for the three scenarios; the labeled ratios range from 2.4x to 11.3x)

  42. Storage Service • Many options possible • HDFS, PVFS, pNFS, Lustre, JBOD, etc. • A standard interface is needed to expose location information

  43. Data Location Service
  struct blockInfo {
      encodingType type;
      byteRange range;
      list<hostId> nodeList;
  };
  list<blockInfo> getBlockInfoByteRange(fileId f, byteRange r);
  How do we know which data server is the best?

  44. Resource Telemetry Service
  typedef double metricValue;
  metricValue getMetric(hostId from, hostId to, metricType t);
  list< list<metricValue> > getAllMetrics(list<hostId> fromList, list<hostId> toList, metricType t);
  Example metrics include latency, bandwidth, switch count, fault-tolerance domain, …

  45. Putting the Pieces Together (diagram): location-aware (LA) applications and LA runtimes consult the Data Location Service and the Resource Telemetry Service in two configurations: (a) non-virtualized, where the LA runtime and DFS run directly on the host OS, and (b) virtualized, where the application and LA runtime run in virtual machines on a guest OS above the VM runtime and VMM, while the DFS runs on the host OS

  46. DFS Performance

  47. Administration

  48. Key Configuration Options • Tashi uses a series of configuration files • TashiDefaults.cfg is the most basic and is included in the source tree • Tashi.cfg overrides this for site-specific settings • Agent.cfg, NodeManager.cfg, ClusterManager.cfg, and Client.cfg override those settings based on which app is launched • Files in ~/.tashi/ override everything else

  49. Key Configuration Options (CM hostname) • You need to set the hostname used for the CM by Node Managers • Some example settings are listed below • Tashi.cfg:
  [Client]
  clusterManagerHost = merkabah
  [NodeManagerService]
  clusterManagerHost = merkabah

  50. Key Configuration Options (VFS) • You need to set the directory that serves disk images • We’re using NFS for this at the moment • Some example settings are listed below • Tashi.cfg:
  [Vfs]
  prefix = /mnt/merkabah/tashi/
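
  With this setting, a disk argument such as hardy.qcow2 from the earlier createVm example would be expected to resolve to a file under the exported prefix; checking for it is a quick sanity test (this path layout is an assumption about how the Vfs prefix is applied):

  ls /mnt/merkabah/tashi/hardy.qcow2   # the image referenced by --disks should exist under the Vfs prefix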
