
System installation & updates

A. Manabe (KEK), LSCCW


Presentation Transcript


  1. System installation & updates. A. Manabe (KEK), LSCCW

  2. Installation & update • System (SW) installation and update is boring, hard work for me. • Question: How do you install or update the system for a cluster of more than 100 nodes? • Question: Have you ever postponed a system upgrade because the work was too much?

  3. Installation & update methods • Pre-installed, pre-configured system • You can postpone your work, but sooner or later... • Manual installation, one PC at a time • Many operators working in parallel with many duplicated installation CDs. • It requires many CRTs, many days and extra cost (to hire operators). • Network installation • With an NFS/FTP server and automated 'batch' installation. • The server becomes too busy when installing to many nodes. • A lot of work still remains (utility SW installation...).

  4. Installation & update methods • Duplicated disk images • Attach many disks to one PC, duplicate the installed disk, then distribute the duplicated disks to the nodes. • The hardware work is hard (unless disk units can be attached/detached easily). • Diskless PCs • Local disks are used only for swap and the /var directory; all other directories are mounted from an NFS server. • A powerful server is necessary. • A node can do nothing by itself (troubleshooting may become difficult).

  5. An idea • Make one installed host and clone its disk image to the nodes via the network. • Install 100 PCs in 10 min. (target value). • Keep the necessary operator intervention as small as possible.

  6. Our planned method (1) • Network disk cloning software: dolly+ • For cloning the disk image. • Network booting: PXE (Preboot Execution Environment) with an Intel NIC • For starting the installer. • Batch installer: modified RedHat kickstart • For disk formatting, network setup and starting the cloning software; it also creates the node-specific /etc/fstab, /etc/sysconfig/network, etc.
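For illustration only (the file names, server name and paths below are assumptions, not taken from the talk), a PXE-booted kickstart installation of this kind is typically driven by a pxelinux configuration that points the installer kernel at a kickstart file:

  # /tftpboot/pxelinux.cfg/default  (hypothetical paths and server name)
  default ks
  prompt 0
  label ks
      kernel vmlinuz                                      # RedHat installer kernel
      append initrd=initrd.img ks=nfs:installserver:/kickstart/node-ks.cfg

The kickstart file itself can handle disk formatting and network setup, and the cloning software can then be started, e.g. from its %post section.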

  7. Our method (2) • Remote power controller • A network-controlled power tap (hardware). • For remote system reset (replaces pushing the reset buttons one by one). • Console server using the serial console feature of Linux. • For watching that everything goes well.
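As a rough sketch (the port, speed and runlevels are assumptions), a Linux serial console of this kind is usually enabled with a kernel argument plus a getty on the serial line:

  # kernel command line (e.g. in the boot loader's append line)
  console=ttyS0,9600 console=tty0

  # /etc/inittab: respawn a login on the first serial port
  S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100

The console server then collects the serial lines of all nodes, so boot and installation messages can be watched remotely.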

  8. Dolly+: 100-PC installation in 10 min. • Software to copy/clone files and/or disk images among many PCs over a network. • Runs on Linux as a user program. • Free software. • Dolly was developed by the CoPs project at ETH (Switzerland).

  9. Dolly+ • Sequential file & block file transfer. • RING network connection topology. • Pipeline mechanism. • Fail recovery mechanism.

  10. Config file • Needed only on the server host (server = the host having the original images or files).

    iofiles 3
    /data/image_hda1 > /dev/hda1
    /data/image_hda5 > /dev/hda5
    /dev/hda6 > /dev/hda6
    server dcpcf001
    clients 10
    n001
    n002
    (listing of all nodes)
    endconfig

  Each iofiles line maps a source on the server (an image file or device, left of '>') to the target device it is written to on every client.

  11. Ring topology (diagram: S = server = the host having the original image) • Utilizes the maximum performance of full-duplex switch ports. • Good for networks built from multiple switches (because connections are only needed between adjacent nodes).

  12. Server bottleneck in the one-server/many-clients topology (diagram: S = server = the host having the original image) • The server becomes a bottleneck, both in the network and in the server itself. • Broadcast or multicast (UDP) • Difficulty in making a reliable transfer over multicast.

  13. Pipelining & multithreading (diagram: the file is split into 4 MB chunks numbered 1, 2, 3, ...; the server streams chunks over the network to Node 1, which streams them on to Node 2 and the next node, with 3 threads running in parallel on each node).
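To make the pipeline concrete, here is a minimal sketch (not the dolly+ source; names and details are illustrative) of how one node can overlap the three activities, receiving a chunk, writing it to disk and forwarding it to the next node, using three threads and two queues:

  import queue
  import threading

  CHUNK = 4 * 1024 * 1024   # 4 MB chunks, as in the dolly+ diagram

  def run_node(recv_sock, send_sock, out_path):
      """One node of the ring: three threads run in parallel, so a chunk
      can be received, written to disk and forwarded at the same time."""
      to_disk = queue.Queue(maxsize=4)
      to_next = queue.Queue(maxsize=4)

      def receiver():
          while True:
              chunk = recv_sock.recv(CHUNK)   # a real implementation loops
              to_disk.put(chunk)              # until a full chunk is read
              to_next.put(chunk)
              if not chunk:                   # empty read marks end of stream
                  return

      def writer():
          with open(out_path, "wb") as f:
              while True:
                  chunk = to_disk.get()
                  if not chunk:
                      return
                  f.write(chunk)

      def sender():                           # the last node in the ring
          while True:                         # simply has no send_sock
              chunk = to_next.get()
              if not chunk:
                  return
              if send_sock is not None:
                  send_sock.sendall(chunk)

      threads = [threading.Thread(target=t) for t in (receiver, writer, sender)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()

Because every node forwards each chunk as soon as it has received it, all links of the ring are busy at the same time and the total time grows only slowly with the number of nodes.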

  14. Performance (measured) • 1 server - 1 node (Pentium III 500 MHz) • IDE disk / 100BaseT network: ~4 MB/s • SCSI U2W / 100BaseT network: ~9 MB/s • 4 GB image copy: ~17 min. (IDE), ~8 min. (SCSI) • 1 server - 7 nodes • IDE / 100BaseT • 4 GB image copy: ~17 min. (IDE) (+8 sec. compared with 1 node) • Plus the time for the booting process.

  15. Expected performance • 1 server - 100 nodes • IDE / 100BaseT: ~19 min. (+2 min. overhead) • SCSI / 100BaseT: ~9 min. (+1 min. overhead)
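These figures are consistent with a simple pipeline model (my assumption, not stated in the talk): the total time is the time to stream the image over one link plus a small fill delay per downstream node,

    T(N) ≈ S/B + (N - 1) * C/B

With image size S = 4 GB, per-link bandwidth B ≈ 4 MB/s (IDE) and chunk size C = 4 MB, S/B ≈ 1000 s ≈ 17 min and the second term adds roughly 1 s per node, i.e. about 2 min for 100 nodes, matching the overhead quoted above; with SCSI at ~9 MB/s both terms shrink accordingly.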

  16. How many minutes to install 1000 nodes? (chart; curves labelled +50% and +100%)

  17. Fail recovery mechanism • In my experience, ~2% of nodes have initial HW problems. • Dolly+ provides an automatic 'short cut' mechanism when a node fails: after a timeout the failed node is bypassed (diagram: 'time out', 'short cutting'). • The RING topology makes this easy to implement.
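A minimal sketch of the idea (illustrative only; the port number, timeout and function name are assumptions, not the dolly+ implementation): the upstream node tries its downstream neighbours in ring order and skips any node that does not answer within a timeout:

  import socket

  def connect_downstream(candidates, port=9998, timeout=5.0):
      """Try each downstream host in ring order; skip ('short cut')
      hosts that do not answer within the timeout."""
      for host in candidates:
          try:
              return socket.create_connection((host, port), timeout=timeout), host
          except OSError:
              print(f"{host} not responding, short-cutting to the next node")
      raise RuntimeError("no reachable downstream node")

Because each node only talks to its neighbours, skipping a dead node just means connecting to the node after it; the rest of the ring is unaffected.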

  18. Cascade topology • The server bottleneck could be overcome. • Weak against a node failure: a failure spreads down the cascade as well and is difficult to recover from.

  19. A beta version will be available from corvus.kek.jp/~manabe/pcf/dolly after this workshop.

