
Presentation Transcript


  1. e-mail pawelw@man.poznan.pl http://www.man.poznan.pl/

  2. POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER
Homogeneous and heterogeneous environments
• Homogeneous environment:
  • uniform
  • the components share the same values and characteristics
  • scalable
• Heterogeneous environment:
  • diverse components
  • a varied set of parameters and characteristics
  • scalable
  • difficult to manage
• Different operating systems • Different architectures • Different vendors

  3. POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER
Resources
• processor (CPU, kind)
  • clock frequency (different CPU boards)
  • type, e.g. scalar, vector, graphics
• RAM (type, size)
• I/O
  • network interfaces
  • disks
  • 'graphics engines'
• mass storage
• individual systems (nodes in a network)
• specialized systems (compute, graphics, archiving, etc.)

  4. POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER
Resource requirements 1/2 (Compute / Data / Visualize)
• BIG Compute problems: computing, visualization, data handling
• BIG Visualization problems: computing, visualization, data handling
• BIG Data problems: computing, visualization, data handling

  5. POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER
Resource requirements 2/2: scale in any and all dimensions (CPU, storage, I/O)
• Example workloads: weather simulation, traditional big supercomputer, repository / archive, signal processing, web serving, media streaming

  6. Cluster types
• High-availability clusters, whose task is to keep the system running continuously and to shift the load onto spare nodes in case of failure (e.g. WWW servers, e-commerce).
• Capability (compute) clusters, whose task is parallel processing of applications for scientific, engineering or design purposes. Efficient inter-node communication mechanisms are required so that a high degree of parallelism (fine-grain granularity) can be exploited. Compute clusters are usually dedicated to a specific application, and programs are executed sequentially and do not compete with one another for access to resources.
• Scalability clusters, whose task is to improve the efficiency of program execution by assigning nodes to applications appropriately. Management software is required that provides job launching, load balancing, load analysis and job management. Distributed jobs, if any, may exploit parallelism at the level of procedures and modules.

  7. Single system image
• Single Point of Entry: A user can connect to the cluster as a single system (like telnet beowulf.myinstitute.edu), instead of connecting to individual nodes as in the case of distributed systems (like telnet node1.beowulf.myinstitute.edu).
• Single File Hierarchy (SFH): On entering the system, the user sees the file system as a single hierarchy of files and directories under the same root directory. Examples: xFS and Solaris MC Proxy.
• Single Point of Management and Control: The entire cluster can be monitored or controlled from a single window using a single GUI tool, much like an NT workstation managed by the Task Manager or PARMON monitoring the cluster resources.
• Single Virtual Networking: Any node can access any network connection throughout the cluster domain, even if the network is not physically connected to all nodes in the cluster.
• Single Memory Space: The illusion of shared memory built on top of the memories associated with the nodes of the cluster.
• Single Job Management System: A user can submit a job from any node using a transparent job submission mechanism. Jobs can be scheduled to run in either batch, interactive, or parallel modes (discussed later). Example systems include LSF and CODINE.
• Single User Interface: The user should be able to use the cluster through a single GUI. The interface must have the same look and feel as an interface available for workstations (e.g., Solaris OpenWin or the Windows NT GUI).

  8. Single system image: availability support functions
• Single I/O Space (SIOS): Allows any node to perform I/O operations on locally or remotely located peripheral or disk devices. In the SIOS design, disks associated with cluster nodes, RAIDs, and peripheral devices form a single address space.
• Single Process Space: Processes have a unique cluster-wide process id. A process on any node can create child processes on the same or a different node (through a UNIX fork) or communicate with any other process (through signals and pipes) on a remote node. The cluster should support globalized process management and allow processes to be managed and controlled as if they were running on local machines.
• Checkpointing and Process Migration: Checkpointing mechanisms allow a process state and intermediate computing results to be saved periodically. When a node fails, processes on the failed node can be restarted on another working node (see the sketch below).
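The checkpoint/restart idea can be illustrated with a minimal, application-level sketch in C. This is an illustration only, not the cluster-provided mechanism described above; the state structure, the file name and the workload are all hypothetical.

```c
/* Application-level checkpointing sketch (illustration only; not the
 * cluster-provided mechanism described above). The state structure,
 * the file name and the workload are hypothetical. */
#include <stdio.h>

struct state { long iteration; double partial_result; };

static void checkpoint(const struct state *s) {
    FILE *f = fopen("checkpoint.dat", "wb");    /* hypothetical file name */
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
}

static int restore(struct state *s) {
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;                           /* no checkpoint: start fresh */
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void) {
    struct state s = { 0, 0.0 };
    if (restore(&s))                            /* resume after a failure */
        printf("resuming at iteration %ld\n", s.iteration);
    for (; s.iteration < 1000000; s.iteration++) {
        if (s.iteration % 100000 == 0)
            checkpoint(&s);                     /* save state periodically */
        s.partial_result += 1.0 / (s.iteration + 1);
    }
    printf("result = %f\n", s.partial_result);
    return 0;
}
```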

  9. Degree of complexity
• C-brick: CPU module
• R-brick: router interconnect
• I-brick: base I/O module
• P-brick: PCI expansion
• X-brick: XIO expansion
• D-brick: disk storage
• G-brick: graphics expansion

  10. POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER
Homogeneous clusters
• GigaRing, SuperCluster (T3E)
• PowerChallengeArray
• POE
• DCE
• Management of large volumes of data
• Archiving systems

  11. Poznań Supercomputing and Networking Center
Massively Parallel Processing (MPP)
• Massively parallel approaches achieve high processing rates by assembling large numbers of relatively slow processors
• Traditional approaches focus on improving the speed of individual processors and assemble only a few of these powerful processors into a complete machine
• Improving network speed and reducing communication overheads
• Examples:
  • Thinking Machines (CM-2, CM-5)
  • Intel Paragon
  • Kendall Square (KSR-1)
  • SGI Origin 2000
  • Cray T3D, T3E

  12. Poznań Supercomputing and Networking Center
MPP network topologies
[Figure: some commonly used network topologies and their connectivity]

  13. Poznań Supercomputing and Networking Center
Cray T3E, T3D
• The Cray MPP system contains four types of components: processing element nodes, the interconnect network, I/O gateways and a clock
• Network topology: 3D mesh
[Figure: Cray T3D system components: interconnect network, processing element node, I/O gateway]

  14. Poznań Supercomputing and Networking Center
Cray T3E Processing Element Nodes (PE)
• Each PE contains a microprocessor, local memory and support circuitry
• 64-bit DEC Alpha RISC processor
• Very high scalability (8 ... 2048 CPUs)

  15. Poznań Supercomputing and Networking Center
Cray T3E Interconnect Network
• The interconnect network provides communication paths between PEs
• The links form a three-dimensional matrix of paths that connects the nodes in the X, Y and Z dimensions
• A communication link transfers data and control information between two network routers and connects two nodes in one dimension. A communication link is actually two unidirectional channels; each channel in the link carries data, control and acknowledge signals.
• Dimension-order routing (a predefined way in which information travels through the mesh); see the sketch below
• Fault tolerance
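As a rough illustration of dimension-order routing, the sketch below walks a packet through a 3D mesh by fully correcting the X coordinate first, then Y, then Z. It is a toy model under assumed coordinates; the T3E's torus wrap-around links and router hardware are not represented.

```c
/* Dimension-order routing sketch for a 3D mesh (illustration only:
 * coordinates and mesh size are hypothetical, and the T3E's torus
 * wrap-around links and router hardware are not modelled). */
#include <stdio.h>

/* Move one coordinate toward its destination, printing each hop. */
static void walk(int *cur, int dst, char dim) {
    while (*cur != dst) {
        *cur += (dst > *cur) ? 1 : -1;
        printf("  hop: %c -> %d\n", dim, *cur);
    }
}

int main(void) {
    int src[3] = {0, 0, 0};   /* source node (x, y, z)      */
    int dst[3] = {2, 1, 3};   /* destination node (x, y, z) */

    /* Dimension order: fully correct X first, then Y, then Z. */
    walk(&src[0], dst[0], 'x');
    walk(&src[1], dst[1], 'y');
    walk(&src[2], dst[2], 'z');
    printf("arrived at (%d,%d,%d)\n", src[0], src[1], src[2]);
    return 0;
}
```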

  16. Poznań Supercomputing and Networking Center
Cray T3E Distributed operating system (Unicos/mk)
In CRAY T3E systems, the local memory of each PE must contain a copy of the microkernel and one or more servers. Under Unicos/mk each PE is configured as one of the following types:
• Support PEs: The local memory of support PEs contains a copy of the microkernel and servers. The exact number and type of servers vary depending on configuration tuning.
• User PEs: The local memory of user PEs contains a copy of the microkernel and a minimum number of servers. Because it contains a limited amount of operating system code, most of a user PE's local memory is available to the user. User PEs include command and application PEs.
• Redundant PEs: A redundant PE is not configured into the system until an active PE fails.

  17. Poznań Supercomputing and Networking Center
Cray T3E Distributed operating system (Unicos/microkernel)
• Unicos/mk does not require a common memory architecture. Unlike Unicos, the functions of Unicos/mk are divided between a microkernel and numerous servers. For this reason, Unicos/mk is referred to as a serverized operating system.
• Serverized operating systems offer a distinct advantage for the Cray T3E system because of its distributed memory architecture: the local memory of each PE is not required to hold the entire set of OS code.
• The operating system can be distributed across the PEs of the whole system.
• Under Unicos/mk, traditional UNICOS processes are implemented as actors. An actor represents a resource allocation entity. The microkernel views all user processes, servers and daemons as actors.
• A multi-PE application has one actor per PE. User and daemon actors reside in user address space; server actors reside in supervisory (kernel) address space.

  18. T3EMS – PE configuration

  19. T3E – job scheduling

  20. Modules of the psched daemon
• Gang scheduler: Provides application CPU and memory residency control by enabling you to schedule all members of an application together. This guarantees that the application members are synchronized across all PEs spanning the application.
• Load balancer: Measures how well processes and applications are acted upon and serviced in each scheduling domain. Based on this information, the load balancer may decide to move commands and applications among eligible PEs in each domain.
• MUSE: Implements a scheduling strategy similar to the fair-share scheduler in UNICOS. MUSE allows the system to be shared among groups in an organized way by assigning resources to the most deserving process.
• Resource manager: Collects and analyzes information about resource usage within the machine for internal and external use. The object manager then makes this information available in a uniform way to service providers such as NQE.

  21. Gang scheduling
All processes of an application are assigned to resources at the same time (see the sketch below).
• Parameters:
  • Heartbeat – length of the time quantum assigned to an application
  • Partial – allows partial scheduling when resources are free
  • Variation – variation of the time quantum
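The sketch below is a toy model of the gang-scheduling idea only, not of psched itself: in each heartbeat-long time slice, all members of exactly one application occupy the PEs together while the other gangs are swapped out. The application names are taken from the psview listing a few slides below; the slice count and quantum value are arbitrary.

```c
/* Toy gang-scheduling loop (illustration only; not psched). In each
 * time slice all members of one application run together on the PEs;
 * the other gangs are swapped out. */
#include <stdio.h>

#define NUM_GANGS 2

int main(void) {
    const char *gang[NUM_GANGS] = { "a.out", "nel186_4.exe" };
    int heartbeat = 800;                  /* length of one time slice (arbitrary units) */

    for (int slice = 0; slice < 6; slice++) {
        int active = slice % NUM_GANGS;   /* simple round-robin over the gangs */
        printf("slice %d (%d units): all PEs run %s, other gangs swapped out\n",
               slice, heartbeat, gang[active]);
    }
    return 0;
}
```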

  22. Load balancing in the interactive domain
Processes are moved between processors depending on the resources they use. The cost of moving tasks is taken into account.

  23. Load balancing in the application domain
• Minimize swapping
• Minimize migration cost
• Perform expensive migrations only when necessary
• Minimize the number of parties
• Maximize the number of contiguously allocated PEs per party
Parameters (a decision sketch follows below):
• Heartbeat – frequency
• MigrationDelay – minimum time between migrations of the same application
• MigrationGravity – the direction in which applications are moved (down, up, or both)
• NoPreemptiveMigration – migrate only if there are further applications waiting to run
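The decision the load balancer has to make, migrating only when it pays off, can be sketched as below. This is a hedged toy model, not psched's actual algorithm: the load metric, the migration cost and the PE count are made up.

```c
/* Load-balancing decision sketch (illustration only; the load metric,
 * migration cost and PE count are hypothetical, not psched's). */
#include <stdio.h>

#define NUM_PES 4

int main(void) {
    double load[NUM_PES] = { 3.0, 0.5, 1.0, 2.5 };  /* hypothetical per-PE load */
    double migration_cost = 0.8;                    /* assumed cost of moving a task */

    int busiest = 0, idlest = 0;
    for (int i = 1; i < NUM_PES; i++) {
        if (load[i] > load[busiest]) busiest = i;
        if (load[i] < load[idlest])  idlest  = i;
    }

    /* Expected reduction of the imbalance if one unit of load is moved. */
    double gain = (load[busiest] - load[idlest]) / 2.0;

    if (gain > migration_cost)
        printf("migrate: PE %d -> PE %d (gain %.2f > cost %.2f)\n",
               busiest, idlest, gain, migration_cost);
    else
        printf("stay put: gain %.2f <= cost %.2f\n", gain, migration_cost);
    return 0;
}
```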

  24. MUSE scheduler
Allocates a fixed percentage of CPU time regardless of the number of processes a user runs (see the sketch below).
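A small sketch of the fair-share idea, as an illustration only and not the real MUSE algorithm: each user's entitlement is a fixed share of CPU time, divided among that user's processes however many there are. The user names and the 0.5/0.5 relative shares are taken from the psview listing on the next slide; the process counts are made up.

```c
/* Fair-share sketch (illustration only; not the real MUSE algorithm).
 * A user's total CPU share stays fixed regardless of how many
 * processes the user runs. Process counts are hypothetical. */
#include <stdio.h>

int main(void) {
    const char *user[]   = { "komasa", "pawelw" };  /* names from the psview example */
    double entitlement[] = { 0.5, 0.5 };            /* fixed relative shares */
    int processes[]      = { 4, 1 };                /* hypothetical process counts */

    for (int u = 0; u < 2; u++)
        printf("%s: %2.0f%% of the CPU in total, %.1f%% per process (%d processes)\n",
               user[u], entitlement[u] * 100.0,
               entitlement[u] * 100.0 / processes[u], processes[u]);
    return 0;
}
```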

  25. psview – MUSE
lotus 9% psview -m APP
Status of MUSE Domain: APP
  PE Range    : 0 - 0x4
  Mode        : Active
  Share by    : UID
  Heartbeat   : 600 seconds
  Decay       : 3600 seconds
  OsHeartbeat : 60 seconds

                Entitlement           MUSE    LongTerm  Interval
  Name          Absolute  Relative    Factor  Usage     Usage     Type
  ------------  --------  --------  --------  --------  --------  --------
  root                 1    1.0000         -         -         -  Root
  Users              100    0.5000         -    0.9630         -  Group
  komasa             100    0.5000    0.2596    0.9630    0.6824  Active
  Staff              100    0.5000         -    0.0370         -  Group
  pawelw             100    0.5000    1.0000    0.0370    0.3176  Active

  26. psview – gang scheduler
lotus 12% psview -g APP
Status of Gang Scheduler Domain: APP
  PE Range   : 0 - 0x4
  Mode       : Full Gang Scheduling
  Gangs      : 3
  Parties    : 2
  Time Slice : 50 - 800; Current: 800; Minimum: 5
  Status     : schedule change pending

  Rank  Command Name      User      PE-Range     Id     Status
  ====  ================  ========  ===========  =====  ================
     0  a.out             pawelw    0x003-0x004  19415  -
        a.out             pawelw    000-0x002    19087  -
     1  nel186_4.exe      komasa    000-0x003    81257  swapped (1 of 4)

  27. Poznań Supercomputing and Networking Center
GigaRing Channel
• The GigaRing channel architecture is a modification of the Scalable Coherent Interface (SCI) specification and is designed to be the common channel that carries information between input/output nodes (IONs)
• The channel consists of a pair of 500 MB/s channels configured as counter-rotating rings
• The two rings form a single logical channel with a maximum bandwidth of 1.0 GB/s; protocol overhead lowers the channel rate to 920 MB/s
• A client connects to the GigaRing channel through the ION via a 64-bit full-duplex interface
• Detection of lost packets and cyclic redundancy checksums

  28. Poznań Supercomputing and Networking Center
GigaRing Channel
The counter-rotating rings provide two forms of system resiliency:
• Ring folding
• Ring masking
[Figure: GigaRing node interface]

  29. Poznań Supercomputing and Networking Center
GigaRing Channel: Ring Folding
• The GigaRing channel can be configured in software to map one or more IONs out of the system. Ring folding converts the counter-rotating rings into a single ring.
• The maximum channel bandwidth for a folded ring is approximately 500 MB/s.

  30. Poznań Supercomputing and Networking Center
GigaRing Channel: Ring Masking
• Ring masking removes one of the counter-rotating rings from the system, which results in one fully connected, unidirectional ring.
• The maximum channel bandwidth is 500 MB/s.

  31. Poznań Supercomputing and Networking Center
GigaRing Channel: Input/Output Nodes (IONs)
• All devices that connect directly to the GigaRing channel are considered IONs
• There are three types of IONs:
  • single-purpose node (SPN)
  • multipurpose node (MPN)
  • mainframe node
• Available mainframe nodes: Cray T90, Cray T3E, Cray J90se

  32. Poznań Supercomputing and Networking Center GigaRing Channel

  33. Poznań Supercomputing and Networking Center
SuperCluster Environment
[Figure: heterogeneous workstation servers connected by HIPPI]

  34. Poznań Supercomputing and Networking Center
SuperCluster Software Components
• Job distribution and load balancing: Cray NQX (NQE for Unicos)
• Open-systems remote file access: NFS
• Standard, secured distributed file system: DCE DFS Server
• Client/server based distributed computing: DCE Client Services
• Cray Message Passing Toolkit (MPT): PVM, MPI
• High-performance, resilient file sharing: optional Shared File System (SFS)
• Client/server hierarchical storage management: optional Data Migration Facility (DMF)

  35. Poznań Supercomputing and Networking Center
SuperCluster Software Components: Network Queuing Environment (NQE)
• NQE consists of four components: the Network Queuing System (NQS), the Network Load Balancer (NLB), the File Transfer Agent (FTA), and the NQE clients
• NQE is a batch queuing system that automatically load balances jobs across heterogeneous systems on a network. It runs each job submitted to the network as efficiently as possible on the resources available.
• This provides faster turnaround for users and automatic load balancing to ensure that all systems on the network are used effectively.

  36. Poznań Supercomputing and Networking Center
POWER CHALLENGEarray
• Consists of up to eight Power Challenge or Power Onyx (POWERnode) supercomputing systems connected by a high-performance HIPPI interconnect
• Two-level communication hierarchy: CPUs within a POWERnode communicate via a fast shared-bus interconnect, and CPUs across POWERnodes communicate via the HIPPI interconnect

  37. Poznań Supercomputing and Networking Center
POWER CHALLENGEarray
Parallel programming models supported:
• Shared memory with n processes inside a POWERnode
• Message passing with n processes inside a POWERnode
• Hybrid model with n processes inside a POWERnode, using a combination of shared memory and message passing
• Message passing with n processes over p POWERnodes
• Hybrid model with n processes over p POWERnodes, using a combination of shared memory within a POWERnode and message passing between POWERnodes (a sketch of this model follows below)
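A minimal sketch of the last (hybrid) model, assuming an MPI library and POSIX threads are available: threads share memory inside one POWERnode, MPI messages cross POWERnodes. This is an illustration only; real POWER CHALLENGEarray codes would typically use the vendor's MPI or PVM libraries and compiler-directive parallelism inside a node, and the thread count here is arbitrary.

```c
/* Hybrid-model sketch: shared memory (POSIX threads) inside a node,
 * message passing (MPI) between nodes. Illustration only; the thread
 * count is arbitrary and the "work" is a stand-in. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS 4                       /* CPUs sharing memory in one POWERnode */

static double partial[THREADS];         /* shared within the node */

static void *work(void *arg) {
    int t = *(int *)arg;
    partial[t] = t + 1.0;               /* stand-in for real shared-memory work */
    return NULL;
}

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Shared-memory part: threads inside this POWERnode. */
    pthread_t th[THREADS];
    int id[THREADS];
    for (int t = 0; t < THREADS; t++) {
        id[t] = t;
        pthread_create(&th[t], NULL, work, &id[t]);
    }
    double node_sum = 0.0;
    for (int t = 0; t < THREADS; t++) {
        pthread_join(th[t], NULL);
        node_sum += partial[t];
    }

    /* Message-passing part: combine per-node results across POWERnodes. */
    double total = 0.0;
    MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total over %d processes = %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}
```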

  38. Poznań Supercomputing and Networking Center
[Figure: message passing (MPI model) vs. multiparallel memory sharing]

  39. Poznań Supercomputing and Networking Center
POWER CHALLENGEarray Software:
• Native POWERnode tools: IRIX 6.x, XFS, NFS, MIPSpro compilers, scientific and math libraries, development environment
• Array services: allow the array to be managed and administered as a single system
• Distributed program development tools: HPF, MPI and PVM libraries, tools for distributed program visualization and debugging (Upshot, XPVM)
• Distributed batch processing tools: LSF, CODINE
• Distributed system management tools: IRIXPro, Performance Co-Pilot (PCP)

  40. Poznań Supercomputing and Networking Center
An array session is a set of processes, possibly running across several POWERnodes, that are related to one another by a single, unique identifier called the Array Session Handle (ASH). A local ASH is assigned by the kernel and is guaranteed to be unique within a single POWERnode, whereas a global ASH is assigned by the array services daemon and is unique across the entire POWER CHALLENGEarray.

  41. Poznańskie Centrum Superkomputerowo-Sieciowe
Parallel Operating Environment
• Parallel Operating Environment – an environment for parallel work
• Simplifies launching parallel programs
• A single point of management – a console shared by all processes
• Simple configuration via environment variables (or parameters)
• MPL, MPI, your own parallel or even serial programs

  42. Parallel Operating Environment
• POE consists of parallel compiler scripts, POE environment variables, parallel debuggers and profilers, MPL, and parallel visualization tools. These tools allow one to develop, execute, profile, debug, and fine-tune parallel code.
• The Partition Manager controls a partition, i.e. a group of nodes on which you wish to run your program. The Partition Manager requests the nodes for your parallel job, acquires the nodes necessary for that job (if the Resource Manager is not used), copies the executables from the initiating node to each node in the partition, loads the executables on every node in the partition, and sets up standard I/O.
• The Resource Manager keeps track of the nodes currently processing a parallel task and, when nodes are requested by the Partition Manager, allocates nodes for use. The Resource Manager attempts to enforce a "one parallel task per node" rule.
• The Processor Pools are sets of nodes dedicated to a particular type of processing (such as interactive, batch, or I/O intensive) which have been grouped together by the system administrators.

  43. What is POE?
POE encompasses a collection of software tools designed to provide an environment for developing, executing, debugging and profiling parallel C, C++ and Fortran programs.
• Facilities to manage your parallel execution environment (environment variables and command line flags)
• Message Passing Interface (MPI) library for interprocess communications
• Subset of MPI-2
• Low-level Application Programming Interface (LAPI)
• Parallel compiler scripts
• Parallel file copy utilities
• Authentication utilities
• Parallel debuggers
• Parallel profiling tools
• Dynamic probe class library (DPCL) parallel tools development API

  44. What is POE?
Much of what POE does is designed to be transparent to the parallel user. Some of these tasks include:
• Linking the necessary parallel libraries during compilation (via parallel compiler scripts)
• Finding and acquiring machines (nodes) for your parallel job
• Loading your executable onto all nodes acquired for your parallel job
• Handling all stdin, stderr and stdout between the nodes of your parallel job
• Signal handling for all tasks in your job
• Providing intertask communication facilities
• Managing the use of processor and network adapter resources
• Retrieving system and job status information when requested
• Error detection and reporting
• Providing support for run-time profiling and analysis tools
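For a sense of what POE launches, here is a minimal MPI program of the kind it loads onto every acquired node. The compile and run details are installation-dependent (typically the parallel compiler scripts and the poe launcher, with the task count coming from MP_PROCS), so treat the surrounding commands as assumptions rather than a recipe.

```c
/* Minimal parallel program of the kind POE loads onto every node it
 * acquires. Compile/run details (parallel compiler scripts, the poe
 * launcher, MP_PROCS) are installation-dependent. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int task, ntasks, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &task);     /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* number of tasks, e.g. from MP_PROCS */
    MPI_Get_processor_name(node, &len);       /* the node this task was loaded onto */

    printf("task %d of %d running on %s\n", task, ntasks, node);

    MPI_Finalize();
    return 0;
}
```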

  45. Basic POE Environment Variables
• MP_PROCS: The number of task processes for your parallel job. May be used alone or in conjunction with MP_NODES and/or MP_TASKS_PER_NODE to specify how many tasks are loaded onto a physical SP node. The maximum value for MP_PROCS depends on the version of the PE software installed (currently ranges from 128 to 2048). If not set, the default is 1.
• MP_NODES: Specifies the number of physical nodes on which to run the parallel tasks. May be used alone or in conjunction with MP_TASKS_PER_NODE and/or MP_PROCS.
• MP_TASKS_PER_NODE: Specifies the number of tasks to be run on each of the physical nodes. May be used in conjunction with MP_NODES and/or MP_PROCS.
• MP_RESD: Specifies whether or not LoadLeveler should be used to allocate nodes. Valid values are "yes" (non-specific node allocation) or "no" (specific node allocation). If not set, the default value is context sensitive to other POE variables. Batch systems typically override or ignore user settings for this environment variable.

  46. Basic POE Environment Variables
• MP_RMPOOL: Specifies the SP system pool number that should be used for non-specific node allocation. This is only valid if you are using LoadLeveler for non-specific node allocation (from a single pool) without a host list file. Batch systems typically override or ignore user settings for this environment variable.
• MP_HOSTFILE: Used only if you wish to explicitly select which nodes will be allocated for your POE job (specific node allocation). If you prefer to let LoadLeveler allocate nodes automatically, this variable should be set to NULL or "". If used, this variable specifies the name of a file which contains the actual machine (domain) names of the nodes you wish to use. It can also be used to specify which pools should be used. The default filename is "host.list" in the current directory.
• MP_EUILIB: Specifies which of two protocols should be used for task communications. Valid values are "ip" for Internet Protocol or "us" for User Space protocol. The default is "ip", while "us" is faster.
• MP_EUIDEVICE: A node may be physically connected to different networks. This environment variable specifies which network adapter should be used for communications. Valid values are "en0" (Ethernet), "fi0" (FDDI), "tr0" (token ring), or "css0" (high-performance switch). Note that valid values also depend on the actual physical network configuration of the node.
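Since these are ordinary environment variables, a task can inspect its own POE settings at run time. The tiny sketch below only calls the standard getenv; outside a POE job the variables are normally unset.

```c
/* Print a few of the POE environment variables described above.
 * Illustration only; outside a POE job these are normally unset. */
#include <stdio.h>
#include <stdlib.h>

static void show(const char *name) {
    const char *value = getenv(name);
    printf("%-18s = %s\n", name, value ? value : "(not set)");
}

int main(void) {
    show("MP_PROCS");
    show("MP_NODES");
    show("MP_TASKS_PER_NODE");
    show("MP_HOSTFILE");
    show("MP_EUILIB");
    show("MP_EUIDEVICE");
    return 0;
}
```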

  47. System Status Array
• The leftmost area shows a list of the POE jobs the Resource Manager knows about. Clicking on one of these jobs selects it.
• The rightmost area provides a list of node names; nodes are listed in order from left to right and from top to bottom.
• The central area provides a grid of squares, each square representing a machine/node. Pink squares represent lightly utilized nodes; yellow squares represent highly utilized nodes. Gray squares are nonexistent nodes or nodes that are not available for monitoring. Squares with green boxes indicate which nodes are associated with the selected POE job number.

  48. DCE
1. DCE provides tools and services that support distributed applications (DCE RPC, DCE Threads, the DCE Directory Service, the Security Service and the Distributed Time Service).
2. DCE's set of services is integrated and comprehensive.
3. DCE provides interoperability and portability across heterogeneous platforms.
4. DCE supports data sharing.
5. DCE participates in a global computing environment (X.500 and the Domain Name Service (DNS)).
