Origin 2000

Origin 2000 CCH: Study and optimization of parallel I/O. S. Descamps, S. Nussbaum, O. Rochel, S. Vialle. With technical support from A. Filbois

Study and optimization of parallel I/O 1. Working environment 2. Description of the study 3. Direct mode 4. Indirect mode 5. Problems encountered 6. Conclusion

1. Working environment • 1.1 The Origin2000 • 1.2 The disk array • Composition • Partitions • Striping • Partitioning trap • 1.3 Usage modes of the Origin2000 • 1.4 Human resources of the study

1.1 The Origin2000: Internal structure (from SGI documentation)

1.1 The Origin2000: Link to the disk array (from SGI documentation)

1.2 The disk array • 110 standard disks: • 9.1 GB x 110 ≈ 1 TB • Individual bandwidth of 7 to 9 MB/s • Individual disk cache of 1 MB • 11 disk controllers (Fibre Channel) • Bandwidth of 80 MB/s per controller

1.2 The disk array: The 110 disks

1.2 The disk array: 10 disks per controller

1.2 The disk array: 11 controllers in total

1.2 The disk array: Partitions • Example: creating 15 partitions • 110/15 = 7 + 1/3 • The resulting distribution is: • partition 1: 7 disks • partition 2: 8 disks • partition 3: 7 disks • then the pattern 7, 8, 7, … repeats • Each partition spans as many controllers as possible
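The 7, 8, 7 pattern can be reproduced with a short C sketch. The slide does not state the rule the partitioning tool actually applies; the rounding rule below (cumulative boundaries rounded to the nearest disk) is an assumption, chosen only because it yields 7, 8, 7 for 15 partitions and always sums to 110. The names `disks_up_to` and `disks_in_partition` are hypothetical.

```c
#include <assert.h>

#define TOTAL_DISKS 110   /* disks in the bay, from the slide */

/* Disks covered by partitions 1..k when tp partitions exist:
 * round(k * TOTAL_DISKS / tp), in integer arithmetic.
 * ASSUMPTION: round-to-nearest boundaries; the real tool may differ. */
static int disks_up_to(int k, int tp)
{
    assert(tp > 0 && k >= 0 && k <= tp);
    return (2 * k * TOTAL_DISKS + tp) / (2 * tp);
}

/* Disks assigned to partition k (1-based) out of tp partitions. */
int disks_in_partition(int k, int tp)
{
    assert(k >= 1);
    return disks_up_to(k, tp) - disks_up_to(k - 1, tp);
}
```

Under this assumption, `disks_in_partition(1, 15)`, `(2, 15)` and `(3, 15)` give 7, 8 and 7, and the 15 partitions together hold all 110 disks.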

1.2 The disk array: Partitions

1.2 The disk array: Striping • Striping across partitions: • Each partition is spread over all of its disks, and therefore over all of its controllers • Data stored on one disk are not consecutive • Optimal use of the disk caches

1.2 The disk array: Striping

1.2 The disk array: Partitioning trap • Creating 60 partitions • Single-disk partitions • Load piles up on a single controller • => Observed bandwidths (Bw) more than 10 times lower than usual

1.2 The disk array: Partitioning trap • Creating 61 partitions • The single-disk partitions are evenly distributed • => Results remain good in this configuration • => One partition more or less can make the difference between good and very bad results

1.3 Usage modes • Usage modes: • Batch mode • Interactive mode: • multi-user mode • exclusive mode • For the disk array: • Access restricted to our group • Tests in exclusive and multi-user interactive modes

2. Description of the study • 2.1 Objectives • 2.2 Methodology • Programming approach • Measurement method • Parameters studied • Working method

2.1 Objectives • Study the different write modes • Sequential / Parallel • Direct / Indirect synchronous • Indirect asynchronous (overlap) • Find the best bandwidths (Bw) • Publish the results on the web => http://www.ese-metz.fr/~vialle/parallel/par-io/index.html

2.4 Methodology: Programming approach • Threads / Processes • Files of equal size • 1 file per process • Each process writes an allocated, fixed (pinned) memory area • Synchronization barrier before and after • Tw = max(Tend-write) - min(Tstart-write)
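As a minimal sketch of the Tw formula, the C functions below take the per-process start and end timestamps (in seconds) and return the write window, then turn it into an aggregate bandwidth. The function names and the choice of 1 MB = 2^20 bytes are assumptions made for illustration; the slides do not say which convention was used.

```c
#include <assert.h>
#include <stddef.h>

/* Tw: latest end of write minus earliest start of write, in seconds. */
double write_window(const double *t_start, const double *t_end, size_t nproc)
{
    assert(nproc > 0);
    double first = t_start[0];
    double last = t_end[0];
    for (size_t i = 1; i < nproc; i++) {
        if (t_start[i] < first) first = t_start[i];
        if (t_end[i] > last) last = t_end[i];
    }
    return last - first;
}

/* Aggregate bandwidth in MB/s (ASSUMPTION: 1 MB = 2^20 bytes). */
double bandwidth_mb_per_s(double total_bytes, double tw)
{
    assert(tw > 0.0);
    return total_bytes / (1024.0 * 1024.0) / tw;
}
```

Taking the maximum end time and the minimum start time means the slowest process sets the measured bandwidth, which is the figure that matters for a parallel job waiting on a barrier.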

2.4 Methodology: Measurement methods • External measurement: timex • Internal measurement: gettimeofday() • Uncertainty problem: • Resolution: 1 ms • Observed variation ≈ 500 ms
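An internal measurement with gettimeofday() can be sketched as follows. `now_seconds` and `relative_uncertainty` are hypothetical helper names; the second one only illustrates how a variation of about 500 ms weighs on a run of a given duration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, built from the gettimeofday() fields. */
double now_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Relative uncertainty of a duration for a given absolute variation. */
double relative_uncertainty(double variation_s, double duration_s)
{
    assert(duration_s > 0.0);
    return variation_s / duration_s;
}
```

For instance, a 0.5 s variation on a 10 s write is a 5% uncertainty, which suggests writing enough data for each run to last well beyond a few seconds.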

2.4 Methodology: Parameters studied • 4 explicit parameters: • Number of partitions present: TP • Size of the data to write: Q • Number of processes used: P • Number of partitions used: SP (TP, P) • 2 implicit parameters: • Number of disks per partition = 110 / TP • Number of disks used = 110 / TP * SP
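The two implicit parameters follow directly from TP and SP. The sketch below computes them as averages (a real partition holds a whole number of disks); the function names are hypothetical.

```c
#include <assert.h>

#define TOTAL_DISKS 110   /* disks in the bay, from the slides */

/* Average number of disks per partition when tp partitions exist. */
double disks_per_partition(int tp)
{
    assert(tp > 0);
    return (double)TOTAL_DISKS / tp;
}

/* Average number of disks in use when sp of the tp partitions are written. */
double disks_used(int tp, int sp)
{
    assert(sp >= 0 && sp <= tp);
    return disks_per_partition(tp) * sp;
}
```

With TP = 16 and SP = 16, `disks_used` returns 110: using every partition is the only way to reach every disk.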

3. Direct mode 3.1 Description 3.2 Results 3.3 Summary

3.1 Description • Handled by the user • Always synchronous • The DMA characteristics must be taken into account • Structure imposed on the data: • Minimum size: 16 KB • Maximum size = page size: 16 MB • Total size = K x minimum size • Alignment: 32 KB • A granularity of 16 MB is optimal
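The data constraints of direct mode can be checked with plain arithmetic before any I/O call is made. The helpers below are a sketch with hypothetical names; the sizes are passed as parameters because the real values are reported by the system at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round size up to the next multiple of min_io (total size = K x minimum size). */
size_t round_up_to_multiple(size_t size, size_t min_io)
{
    assert(min_io > 0);
    size_t rem = size % min_io;
    return rem == 0 ? size : size + (min_io - rem);
}

/* Nonzero if ptr is aligned on `alignment` bytes. */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Number of write() calls needed when one call moves at most max_io bytes. */
size_t write_calls_needed(size_t total, size_t max_io)
{
    assert(max_io > 0);
    return (total + max_io - 1) / max_io;
}
```

With a 16 KB minimum, a 100000-byte buffer has to be padded to 114688 bytes (7 x 16 KB), and 40 MB of data written in 16 MB chunks takes 3 calls.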

file = open("nom_fichier.ext", O_RDWR|O_TRUNC|O_CREAT|O_SYNC|O_DIRECT, 0644);
if (file < 0) … //error
if (statvfs("nom_fichier.ext", &fsdesc)) … //error
if (strcmp(fsdesc.f_basetype, "xfs")) … //error
if (fcntl(file, F_DIOINFO, &dioI) != 0) … //error
datasize = (size_t)10*1024*1024*1024;   // 10 GB, needs a 64-bit size_t
if (datasize % dioI.d_miniosz != 0)
    datasize += dioI.d_miniosz - datasize % dioI.d_miniosz;
data = (double*)memalign(dioI.d_mem, dioI.d_maxiosz);
if (mpin(data, dioI.d_maxiosz) == -1) … //error
dataondisk = 0;
while (dataondisk < datasize) {
    if (datasize - dataondisk >= dioI.d_maxiosz)
        datawewant = dioI.d_maxiosz;
    else
        datawewant = datasize - dataondisk;
    datawrite = write(file, data, datawewant);
    if (datawrite != datawewant) … //error
    dataondisk += datawrite;
}
close(file);

3.1 Description: Code overhead

• Direct mode

#include <stdio.h>
#include <stdlib.h>
#include <string.h>        /* strcmp() */
#include <malloc.h>        /* memalign() */
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>      /* fchmod() */
#include <sys/statvfs.h>

int example(void)
{
    int file;                 /* The file descriptor */
    double *data;             /* The data to write */
    size_t datasize;          /* The size of the data to be written */
    size_t dataondisk = 0;    /* The size of the data already on the disk bay */
    ssize_t datawrite;        /* The size returned by the last write */
    size_t datawewant;        /* The size requested from the last write */
    /* Information about the DMA, made of 3 fields: */
    /*   d_miniosz: the smallest size accepted for one write; every size must be a multiple of it */
    /*   d_maxiosz: the biggest size accepted for one write */
    /*   d_mem: the alignment required for the memory */
    struct dioattr dioinfo;
    struct statvfs fsdesc;    /* Information about the filesystem */

    /* Create a new file in direct mode. O_DIRECT is always synchronous; */
    /* O_SYNC only states it again. */
    file = open("nom_fichier.ext", O_RDWR | O_TRUNC | O_CREAT | O_SYNC | O_DIRECT, 0644);
    if (file < 0) {
        perror("Open error");
        return -1;
    }
    /* The file exists now: check that it is on an xfs filesystem. open() may already */
    /* check this when O_DIRECT is given, but a double check is better. */
    if (statvfs("nom_fichier.ext", &fsdesc)) {
        perror("Information about the filesystem of the file can't be accessed");
        return -1;
    }
    if (strcmp(fsdesc.f_basetype, "xfs")) {
        fprintf(stderr, "That's not xfs!\n");   /* errno is not set here, so no perror() */
        return -1;
    }
    /* Force the access rights of the file (octal 0644); not strictly needed. */
    fchmod(file, 0644);

    /* Get the information about the DMA, to know how to allocate the memory. */
    if (fcntl(file, F_DIOINFO, &dioinfo) != 0) {
        perror("Internal fail of fcntl + F_DIOINFO");
        return -1;
    }

    /* Write 10 GB. A real program would write its own data; here the same block */
    /* is written again and again up to 10 GB. */
    datasize = (size_t)10 * 1024 * 1024 * 1024;   /* 10 GB, needs a 64-bit size_t */
    /* First condition on the data: the size must be a multiple of d_miniosz. */
    if (datasize % dioinfo.d_miniosz != 0)
        datasize += dioinfo.d_miniosz - datasize % dioinfo.d_miniosz;
    /* Allocate memory aligned on d_mem. Only d_maxiosz bytes are allocated because there */
    /* is no real data to write; a real program would allocate datasize. */
    /* Remark: d_maxiosz is assumed to be a multiple of d_miniosz, so it is not checked. */
    data = (double *)memalign(dioinfo.d_mem, dioinfo.d_maxiosz);
    if (data == NULL) {
        perror("Memalign error");
        return -1;
    }
    /* Pin the memory so that it is not moved because of rescheduling. It is not */
    /* necessary, but it increases the bandwidth a bit. A real program would pin datasize. */
    if (mpin(data, dioinfo.d_maxiosz) == -1) {
        perror("Mpin error");
        return -1;            /* exit because pinning was wanted; optional otherwise */
    }

    /* The write loop: only blocks of at most d_maxiosz can be written at a time. */
    while (dataondisk < datasize) {
        /* Write a whole DMA page if possible, else only what is left. */
        if (datasize - dataondisk >= dioinfo.d_maxiosz)
            datawewant = dioinfo.d_maxiosz;
        else
            datawewant = datasize - dataondisk;
        datawrite = write(file, data, datawewant);
        /* Check that write() really wrote what was requested. */
        if (datawrite < 0 || (size_t)datawrite != datawewant) {
            perror("Write error");
            return -1;
        }
        dataondisk += (size_t)datawrite;
    }
    close(file);              /* Do not forget... */
    free(data);
    return 0;
}

• Indirect mode

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int example(void)
{
    int file;                 /* The file descriptor */
    double *data;             /* The data to write */
    size_t datasize;          /* The size of the data to be written */

    /* Create a new file in indirect, synchronous mode. */
    /* Asynchronous mode is also possible. */
    file = open("nom_fichier.ext", O_RDWR | O_TRUNC | O_CREAT | O_SYNC, 0644);
    if (file < 0) {
        perror("Open error");
        return -1;
    }
    /* Force the access rights of the file (octal 0644); not strictly needed. */
    fchmod(file, 0644);

    datasize = (size_t)10 * 1024 * 1024 * 1024;   /* We will write 10 GB */
    data = (double *)malloc(datasize);            /* We allocate 10 GB */
    if (data == NULL) {
        perror("Malloc error");
        return -1;
    }
    /* Now a normal write, so easy. */
    if (write(file, data, datasize) != (ssize_t)datasize) {
        perror("Write error");
        return -1;
    }
    close(file);              /* Never forget... */
    free(data);
    return 0;
}

3.2 Results • How many partitions should be used? • How many disks should be used? • How many processes should be used? • Measurement errors and correction • Summary of direct I/O

3.2 Results: How many partitions should be used? • Study of: Bw(P,SP) with TP = 16 • ∀ P, Bw(SP) is increasing • => Use as many of the available partitions as possible: • SP = TP

3.2 Results: How many disks should be used? • Study of: Bw(NbDisk) with TP = 16 • Slope: 7 MB/s/disk • Bw of one disk ≈ 7 MB/s • => Good parallelization • => Use as many disks as possible

3.2 Results: How many disks should be used? • Generalized study: • Bw(NbDisk,TP) • Independent of TP • => Little waste • => Good parallelization • => Use as many disks as possible

3.2 Results: How many processes should be used? • Study of: Bw(P) with SP = TP (optimum) • Best Bw: P = K x TP • Each process writes Q/P • => Balance the load and the accesses to the partitions
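The balance argument can be made concrete with a small C sketch. The slide does not say how processes were mapped onto partitions; the round-robin mapping below (process p writes to partition p mod TP) is an assumption, and all function names are hypothetical.

```c
#include <assert.h>

/* Partition (0-based) written by process p.
 * ASSUMPTION: processes are dealt round-robin over the tp partitions. */
int partition_of_process(int p, int tp)
{
    assert(tp > 0 && p >= 0);
    return p % tp;
}

/* Nonzero if nproc processes load every partition equally, i.e. P = K x TP. */
int is_balanced(int nproc, int tp)
{
    assert(tp > 0 && nproc > 0);
    return nproc % tp == 0;
}

/* Bytes written by each process when q bytes are split evenly (Q/P). */
unsigned long long bytes_per_process(unsigned long long q, int nproc)
{
    assert(nproc > 0);
    return q / (unsigned long long)nproc;
}
```

With TP = 16, 32 processes put exactly two writers on each partition, whereas 20 processes leave four partitions with twice the load of the others.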

3.2 Results: How many processes should be used? • Study of: Bw(P = K x TP, TP) • Optimal zone: • 8 ≤ Popt = K x TP ≤ 16 • => Whatever the partitioning, the disk array can be used at its optimum by adjusting P

3.2 Results: Error of the external measurements • Medium Q (< 6 GB): Bw(P ≥ 32) decreases • Large Q (> 6 GB): Bw(P ≥ 32) is constant • Hypothesis: measurement error • Process initialization • A few other instructions (allocation, ...) • => Internal measurement

3.2 Results: Internal measurements • P ≤ 50: no loss of bandwidth • P > 50: bandwidth decreases • Revised version of the previous rule: 8 ≤ Popt ≤ 50

3.3 Summary of direct I/O: For the administrator • Give everyone access to all partitions • Create more than one partition (results are worse with a single partition)

3.3 Summary of direct I/O: For the user • Rule 1: • SP = TP ≥ 2 • Rule 2: • P = K x TP • Rule 3: • 8 ≤ P ≤ 50 • => Bandwidth of about 700 MB/s
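The three user rules can be combined into one selection function. The sketch below returns the smallest process count that satisfies them; picking the smallest is an assumption, since the slides only give the admissible range, and `pick_process_count` is a hypothetical name.

```c
#include <assert.h>

/* Smallest P that is a multiple of tp (rule 2) with 8 <= P <= 50 (rule 3),
 * for tp >= 2 (rule 1, with SP = TP). Returns -1 if no such P exists.
 * ASSUMPTION: the smallest admissible P is preferred. */
int pick_process_count(int tp)
{
    assert(tp > 0);
    if (tp < 2)
        return -1;
    for (int p = tp; p <= 50; p += tp) {
        if (p >= 8)
            return p;
    }
    return -1;
}
```

For TP = 16 this gives P = 16, for TP = 3 it gives P = 9, and for TP = 55 no value fits the range.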

4. Indirect mode 4.1 Description and differences 4.2 Synchronous indirect mode 4.3 Asynchronous indirect mode

4.1 Description and differences: Comparison with direct mode • Writes are managed by the system: • Simpler • Safer • No constraint on the data • 2 modes available: • Synchronous indirect mode • Asynchronous indirect mode

4.1 Description and differences: Synchronous or asynchronous • Synchronous indirect mode: • Waits until the write has completed • Slower but safer • Asynchronous indirect mode: • Returns control much sooner • The write is managed entirely by the system and may be deferred • Problem: maintaining data integrity • Problem: less safe if the machine reboots

4.2 Synchronous indirect mode • Procedure to follow • Granularity and DMA • How many partitions should be used? • Influence of the data size • How many processes should be used? • Summary

4.2 Synchronous indirect mode: Procedure to follow • Usual I/O procedure: • open (…) • write (…) • close (…) • => Simple and well known
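The open / write / close sequence can be sketched with standard POSIX calls only. The function name `write_file_sync` and the 0644 permissions are assumptions made for illustration; `O_SYNC` is what makes each write wait for the data to reach the disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes of buf to path in synchronous (O_SYNC) buffered mode.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_file_sync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    const char *p = buf;
    size_t done = 0;
    while (done < len) {
        /* write() may transfer less than requested, so loop until done. */
        ssize_t n = write(fd, p + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    if (close(fd) != 0)
        return -1;
    return (ssize_t)done;
}
```

Unlike direct mode, no alignment, size multiple or pinning is required: any buffer of any length can be passed.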

4.2 Synchronous indirect mode: Granularity and DMA • Study of: Bw(Q1-write) • Indirect: DMA hidden • Asymptote beyond 16 MB • => Q1-write ≥ 16 MB

4.2 Synchronous indirect mode: How many partitions should be used? • Study of: Bw(P,SP) with TP = 8 • Same as direct mode: • Bw(SP) increases • SPopt = TP • => Use as many of the available partitions as possible: • SP = TP

4.2 Synchronous indirect mode: How many partitions should be created? • Study of: Bw(TP) • 1 ≤ TP ≤ 55: • No major traps • Bw(TP) increases • 55 ≤ TP: • Many traps • Bw(TP) decreases • => Create 45 to 55 partitions

4.2 Synchronous indirect mode: Influence of the data size • Study of: Bw(Q) • Bw(Q > 1.6 GB) is constant • => Bw is independent of Q
