
Atlas Canada Lightpath

Data Transfer Trial

Corrie Kost, Steve McDonald (TRIUMF)

Bryan Caron (UofAlberta), Wade Hong (Carleton)


Brownie 2.5 TeraByte RAID array

  • 16 x 160 GB IDE disks (5400 rpm 2MB cache)

    • hot swap capable

  • Dual ultra160 SCSI interface to host

  • Maximum transfer ~65 MB/sec

  • Triple hot swap power supplies

  • CAN ~$15k

  • Arrives July 8th 2002


What to do while waiting for the server to arrive

  • IBM PRO6850 Intellistation (Loan)

  • Dual 2.2 GHz Xeons

  • 2 PCI 64-bit/66 MHz slots

  • 4 PCI 32-bit/33 MHz slots

  • 1.5 GB RAMBUS

  • Add 2 Promise Ultra100 IDE controllers and 5 disks

  • Each disk on its own IDE controller for maximum IO

  • Began Linux software RAID performance tests: ~170/130 MB/s read/write (see the benchmark sketch below)
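
  A rough way to reproduce this kind of sequential read/write check (a minimal sketch; the mount point, file size and md device name are illustrative, not taken from the slides):

    # sequential write test: 2 GB of zeros onto the RAID filesystem
    time dd if=/dev/zero of=/raid0/testfile bs=1M count=2000
    # remount so the read test is not served from the page cache
    umount /raid0 && mount /raid0
    # sequential read test; divide bytes moved by elapsed time for MB/s
    time dd if=/raid0/testfile of=/dev/null bs=1M
    # quick kernel-level read benchmark of the md device itself
    hdparm -tT /dev/md0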


The Long Road to High Disk IO

  • IBM cluster x330s, RH7.2: disk I/O ~15 MB/s (slow??)

    • expect ~45 MB/s from any modern single drive

  • Need the 2.4.18 Linux kernel to support >1 TB filesystems

  • IBM cluster x330s, RH7.3: disk I/O ~3 MB/s

    • What is going on?

  • Red Hat's modified serverworks driver broke DMA on the x330s

  • The x330s have ATA 100 drives, BUT the controller is only UDMA 33

  • Promise controllers capable of UDMA 100 but need latest kernel patches for 2.4.18 before drives recognise UDMA100

  • Finally drives/controller both working at UDMA100 = 45MB/sec

  • Linux software raid0 2 drives 90MB/sec, 3 drives 125 MB/sec

    • 4 drives 155MB/sec, 5 drives 175 MB/sec

  • Now we are ready to start network transfers
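
  To confirm which transfer mode a drive and controller have actually negotiated, hdparm is the usual tool (a sketch; /dev/hde is just an example device on one of the Promise controllers):

    # show the drive's identification data, including supported/selected UDMA modes
    hdparm -i /dev/hde
    # check whether DMA is currently enabled (1 = on)
    hdparm -d /dev/hde
    # if needed, enable DMA and select mode 69 = 64 + 5, i.e. UDMA mode 5 (UDMA 100)
    hdparm -d1 -X69 /dev/hde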


    So what are we going to do (did we)?

    • Demonstrate a manually provisioned “e2e” lightpath

    • Transfer 1TB of ATLAS MC data generated in Canada from TRIUMF to CERN

    • Test out 10GbE technology and channel bonding (see the bonding sketch after this list)

    • Establish a new benchmark for high performance disk to disk throughput over a large distance
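
    For the channel-bonding part, the usual Linux 2.4 recipe looks roughly like this (a sketch only; the interface names, address and bonding mode are illustrative, the slides do not give the actual configuration):

      # load the bonding driver (mode 0 = round-robin across the slave links)
      modprobe bonding mode=0 miimon=100
      # bring up the bond interface, then enslave the two GbE ports
      ifconfig bond0 192.168.2.1 netmask 255.255.255.0 up
      ifenslave bond0 eth1 eth2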


    Comparative Results (TRIUMF to CERN)


    What is an e2e Lightpath

    • Core design principle of CA*net 4

    • Ultimately to give control of lightpath creation, teardown and routing to the end user

    • Hence, “Customer Empowered Networks”

    • Provides a flexible infrastructure for emerging grid applications

    • Alas, can only do things manually today



    The Chicago Loopback

    • Needed to test TCP/IP and Tsunami protocols over long distances, so an optical loop was arranged via StarLight

      • ( TRIUMF-BCNET-Chicago-BCNET-TRIUMF )

      • ~91ms RTT

    • TRIUMF - CERN RTT is ~200 ms, so we told Damir we really needed a double loopback

      • “No problem”

      • The double loopback was set up a few days later (RTT = 193 ms)

      • (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)


    TRIUMF Server

    SuperMicro P4DL6 (Dual Xeon 2GHz)

    400 MHz front side bus

    1 GB DDR2100 RAM

    Dual Channel Ultra 160 onboard SCSI

    SysKonnect 9843 SX GbE

    2 independent PCI buses

    6 PCI-X 64-bit/133 MHz capable

    3ware 7850 RAID controller

    2 Promise Ultra 100 Tx2 controllers


    CERN Server

    SuperMicro P4DL6 (Dual Xeon 2GHz)

    400 MHz front side bus

    1 GB DDR2100 RAM

    Dual Channel Ultra 160 onboard SCSI

    SysKonnect 9843 SX GbE

    2 independent PCI buses

    6 PCI-X 64-bit/133 MHz capable

    2 3ware 7850 RAID controllers

    6 IDE drives on each 3ware controller

    RH7.3 on 13th drive connected to on-board IDE

    WD Caviar 120 GB drives with 8 MB cache

    RMC4D from HARDDATA


    TRIUMF Backup Server

    SuperMicro P4DL6 (Dual Xeon 1.8GHz)

    Supermicro 742I-420 17” 4U Chassis 420W Power Supply

    400 MHz front side bus

    1 GB DDR2100 RAM

    Dual Channel Ultra 160 onboard SCSI

    SysKonnect 9843 SX GbE

    2 independent PCI buses

    6 PCI-X 64bit/133 MHz capable

    2 Promise Ultra 133 TX2 controllers & 1 Promise Ultra 100 TX2 controller



    Operating System

    • Red Hat 7.3 based, Linux kernel 2.4.18-3

      • Needed to support filesystems > 1TB

    • Upgrades and patches

      • Patched to 2.4.18-10

      • Intel Pro 10GbE Linux driver (early stable)

      • SysKonnect 9843 SX Linux driver (latest)

      • Ported Sylvain Ravot’s tcp tune patches


    Intel 10GbE Cards

    • Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their Alpha program

      • based on Intel® 82597EX 10 Gigabit Ethernet Controller

        Note length of card!


    Extreme Networks

    (photos: the Extreme Networks equipment at TRIUMF and at CERN)



    IDE Disk Arrays

    TRIUMF Send Host

    CERN Receive Host


    Disk Read/Write Performance

    • TRIUMF send host:

      • 1 3ware 7850 and 2 Promise Ultra 100TX2 PCI controllers

      • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)

      • Tuned for optimal read performance (227/174 MB/s)

    • CERN receive host:

      • 2 3ware 7850 64-bit/33 MHz PCI IDE controllers

      • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)

      • Tuned for optimal write performance (295/210 MB/s)


    THUNDER RAID DETAILS

    raidstop /dev/md0

    mkraid -R /dev/md0

    mkfs -t ext3 /dev/md0

    mount -t ext2 /dev/md0 /raid0

    /root/raidtab

    raiddev /dev/md0

    raid-level 0

    nr-raid-disks 12

    persistent-superblock 1

    chunk-size 512 kbytes

    device /dev/sdc

    raid-disk 0

    device /dev/sdd

    raid-disk 1

    device /dev/sde

    raid-disk 2

    device /dev/sdf

    raid-disk 3

    device /dev/sdg

    raid-disk 4

    device /dev/sdh

    raid-disk 5

    device /dev/sdi

    raid-disk 6

    device /dev/sdj

    raid-disk 7

    device /dev/hde

    raid-disk 8

    device /dev/hdg

    raid-disk 9

    device /dev/hdi

    raid-disk 10

    device /dev/hdk

    raid-disk 11

    (sdc-sdj: 8 drives on the 3ware controller; hde, hdg, hdi, hdk: 4 drives on the 2 Promise controllers)
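
    For reference, the same layout expressed with the newer mdadm tool rather than the raidtools/mkraid used in the trial (a sketch only; device names follow the raidtab above):

      # 12-disk RAID0, 512 KB chunks, matching the raidtab above
      mdadm --create /dev/md0 --level=0 --chunk=512 --raid-devices=12 \
          /dev/sd[c-j] /dev/hde /dev/hdg /dev/hdi /dev/hdk
      # ext2 was used throughout (see the noatime slide later)
      mkfs -t ext2 /dev/md0
      mount -t ext2 -o noatime /dev/md0 /raid0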


    Black Magic

    • We are novices in the art of optimizing system performance

    • It is also time consuming

    • We followed most conventional wisdom, much of which we don’t yet fully understand


    Testing Methodologies

    • Began testing with a variety of bandwidth characterization tools

      • pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.

    • Evaluated high performance file transfer applications

      • bbftp, bbcp, tsunami, pftp

    • Developed scripts to automate and to scan parameter space for a number of the tools
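
    A minimal sketch of that kind of parameter scan, using iperf (the window sizes, stream counts and host name are illustrative, not the actual scripts):

      # on the receive host
      iperf -s -w 4M

      # on the send host: scan TCP window size and number of parallel streams
      for w in 256K 1M 4M 8M; do
        for p in 1 2 4 8; do
          echo "window=$w streams=$p"
          iperf -c cern-10g -w $w -P $p -t 30
        done
      done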


    Disk I/O Black Magic

    • min/max read-ahead on both systems

      sysctl -w vm.min-readahead=127

      sysctl -w vm.max-readahead=256

    • bdflush on receive host

      sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"

      or

      echo 2 500 0 0 500 1000 60 20 0 >/proc/sys/vm/bdflush

    • bdflush on send host

      sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"

      or

      echo 30 500 0 0 500 3000 60 20 0 >/proc/sys/vm/bdflush
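
    These sysctl settings do not survive a reboot; one simple way to reapply them at boot (a sketch, assuming a local startup script such as /etc/rc.d/rc.local; receive-host values shown):

      # /etc/rc.d/rc.local additions
      sysctl -w vm.min-readahead=127
      sysctl -w vm.max-readahead=256
      sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"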


    Misc. Tuning and other tips

    /sbin/elvtune -r 512 /dev/sdc (same for other 11 disks)

    /sbin/elvtune -w 1024 /dev/sdc (same for other 11 disks)

    -r sets the max latency that the I/O scheduler will provide on each read

    -w sets the max latency that the I/O scheduler will provide on each write

    When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):

    umount -l /raid (the -l is a lazy unmount; then mount & umount again)


    Disk I/O Black Magic

    • Disk I/O elevators (minimal impact noticed)

      • /sbin/elvtune

      • Allows some control of latency vs throughput

        Read_latency set to 512 (default 8192)

        Write_latency set to 1024 (default 16384)

    • noatime

      • Disables updating the last time a file was accessed (typically used on file servers)

        mount -t ext2 -o noatime /dev/md0 /raid

        Typically, ext3 writes at ~90 MB/s while ext2 writes at ~190 MB/s

        Reads are minimally affected. We always used ext2 (see the check sketched below).
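
    A quick way to confirm that noatime took effect (a sketch; the test file name is made up):

      # verify that reads no longer update the access time
      touch /raid/atime_test
      ls -lu /raid/atime_test        # note the access time
      sleep 61                       # wait past a minute so a change would be visible
      cat /raid/atime_test > /dev/null
      ls -lu /raid/atime_test        # unchanged when mounted with noatime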


    Disk I/O Black Magic

    We would also like process (CPU) affinity, but that requires the 2.5 kernel

    • IRQ Affinity

      [root@thunder root]# more /proc/interrupts

      CPU0 CPU1

      0: 15723114 0 IO-APIC-edge timer

      1: 12 0 IO-APIC-edge keyboard

      2: 0 0 XT-PIC cascade

      8: 1 0 IO-APIC-edge rtc

      10: 0 0 IO-APIC-level usb-ohci

      14: 22 0 IO-APIC-edge ide0

      15: 227234 2 IO-APIC-edge ide1

      16: 126 0 IO-APIC-level aic7xxx

      17: 16 0 IO-APIC-level aic7xxx

      18: 91 0 IO-APIC-level ide4, ide5, 3ware Storage Controller

      20: 14 0 IO-APIC-level ide2, ide3

      22: 2296662 0 IO-APIC-level SysKonnect SK-98xx

      24: 2 0 IO-APIC-level eth3

      26: 2296673 0 IO-APIC-level SysKonnect SK-98xx

      30: 26640812 0 IO-APIC-level eth0

      NMI: 0 0

      LOC: 15724196 15724154

      ERR: 0

      MIS: 0

    echo 1 >/proc/irq/18/smp_affinity    # use CPU0 only

    echo 2 >/proc/irq/18/smp_affinity    # use CPU1 only

    echo 3 >/proc/irq/18/smp_affinity    # use either CPU

    cat /proc/irq/prof_cpu_mask >/proc/irq/18/smp_affinity    # reset to the default
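
    Putting this together, one plausible split for the box above keeps the disk-controller interrupts on CPU0 and the SysKonnect GbE interrupts on CPU1 (a sketch; the slides give the IRQ numbers but not the assignment actually used):

      # IRQ 18 = 3ware + ide4/ide5, IRQ 20 = ide2/ide3: keep on CPU0
      echo 1 > /proc/irq/18/smp_affinity
      echo 1 > /proc/irq/20/smp_affinity
      # IRQs 22 and 26 = SysKonnect SK-98xx GbE: move to CPU1
      echo 2 > /proc/irq/22/smp_affinity
      echo 2 > /proc/irq/26/smp_affinity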


    TCP Black Magic

    • Typically suggested TCP and net buffer tuning

      sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"

      sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"

      sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"

      sysctl -w net.core.rmem_default=65535

      sysctl -w net.core.rmem_max=8388608

      sysctl -w net.core.wmem_default=65535

      sysctl -w net.core.wmem_max=8388608
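
    To make these settings persistent across reboots they can also go into /etc/sysctl.conf (a sketch; values identical to the commands above):

      # /etc/sysctl.conf
      net.ipv4.tcp_rmem = 4096 4194304 4194304
      net.ipv4.tcp_wmem = 4096 4194304 4194304
      net.ipv4.tcp_mem = 4194304 4194304 4194304
      net.core.rmem_default = 65535
      net.core.rmem_max = 8388608
      net.core.wmem_default = 65535
      net.core.wmem_max = 8388608
      # loaded at boot, or reloaded by hand with: sysctl -p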


    TCP Black Magic

    • Sylvain Ravot’s tcp tune patch parameters

      sysctl -w net.ipv4.tcp_tune="115 115 0"

    • Linux 2.4 retentive TCP

      • Caches TCP control information for a destination for 10 mins

      • To avoid caching

        sysctl -w net.ipv4.route.flush=1


    We are live continent to continent!

    • e2e lightpath up and running Friday Sept 20 21:45 CET

    traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets

    1 cern-10g (192.168.2.2) 161.780 ms 161.760 ms 161.754 ms


    BBFTP Transfer

    (Throughput graphs: Vancouver ONS ons-van01, GbE channels enet_15/1 and enet_15/2)


    BBFTP Transfer

    (Throughput graphs: Chicago ONS, GigE ports 1 and 2)
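
    The bbftp runs above were client-driven; a typical invocation looks roughly like this (an illustrative sketch only, not the exact command line used in the trial; the user name, paths and stream count are made up):

      # push a file from the TRIUMF send host to the CERN receive host
      # -p 10 opens 10 parallel TCP streams, -u gives the remote account, -V is verbose
      bbftp -u atlas -p 10 -V -e "put /raid0/mcdata.tar /raid0/mcdata.tar" cern-10g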


    Tsunami Transfer

    (Throughput graphs: Vancouver ONS ons-van01, GbE channels enet_15/1 and enet_15/2)


    Tsunami Transfer

    (Throughput graphs: Chicago ONS, GigE ports 1 and 2)



    Exceeding 1 Gbit/s … (using tsunami)
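
    For reference, a tsunami session (the Indiana University UDP-based tool acknowledged later) looks roughly like this; command names can vary between versions, and the host and file names here are made up:

      # on the send host: serve the files in the current directory
      cd /raid0 && tsunamid *

      # on the receive host: interactive client session
      tsunami
      tsunami> connect triumf-10g
      tsunami> get mcdata.tar
      tsunami> quit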


    What does it mean for TRIUMF in the long term?

    • Established a relationship with a ‘grid’ of people for future networking projects

    • Upgraded the WAN connection from 100 Mbit/s to

      • 4 x 1 Gb Ethernet connections directly to BCNET

        • Canarie – educational/research network

        • Westgrid GRID computing

        • Commercial Internet

        • Spare (research & development)

  • Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers needed by upcoming scientific programs (ATLAS, WestGrid, etc.)


    Lessons Learned - 1

    • Linux software RAID faster than most conventional SCSI and IDE hardware RAID based systems.

    • One controller per drive; the more disk spindles the better

    • More than 2 Promise controllers per machine is possible (Ultra 100/133)

    • Unless programs are multi-threaded or the kernel permits pinning processes to a CPU, dual CPUs will not give the best performance.

      • A single 2.8 GHz is likely to outperform dual 2.0 GHz for a single-purpose machine like our fileservers.

    • The more memory the better


    Misc. comments

    • No hardware failure – even for the 50 disks!

    • Largest file transferred: 114 Gbytes (Sep 24)

    • Tarring, compressing, etc. take longer than the transfer itself

    • Deleting files can take a lot of time

    • Low cost of project - $20,000 with most of that recycled


    (Chart: 220 MB/s and 175 MB/s)


    Acknowledgements

    • Canarie

      • Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian

    • Atlas Canada

      • Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebrve, Richard Keeler

    • HEPnet Canada

      • Dean Karlen

    • TRIUMF

      • Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / Westgrid),

    • BCNET

      • Mike Hrybyk, Marilyn Hay, Dennis O’Reilly, Don McWilliams


    Acknowledgements

    • Extreme Networks

      • Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner

    • Intel Corporation

      • Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg


    Acknowledgements

    • Indiana University

      • Mark Meiss, Stephen Wallace

    • Caltech

      • Sylvain Ravot, Harvey Newman

    • CERN

      • Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin

    • SURFnet/Universiteit van Amsterdam

      • Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat


    Acknowledgements

    • Yotta Yotta

      • Geoff Hayward, Reg Joseph, Ying Xie, E. Siu

    • BCIT

      • Bill Rutherford

    • Jalaam

      • Loki Jorgensen

    • Netera

      • Gary Finley


    ATLAS Canada

    (Map of ATLAS Canada member institutions: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto)


    LHC Data Grid Hierarchy

    (Diagram, slide courtesy H. Newman, Caltech. CERN/outside resource ratio ~1:2; Tier0 : sum(Tier1) : sum(Tier2) ~ 1:1:1. The experiment's online system produces ~PByte/s and feeds ~100-400 MB/s into the CERN Tier 0+1 centre (700k SI95, ~1 PB disk, tape robot, HPSS). Tier 1 centres (FNAL: 200k SI95, 600 TB; IN2P3; INFN; RAL; each with HPSS) connect at ~2.5 Gbps, as do the Tier 2 centres. Tier 3 institutes (~0.25 TIPS each, with physics data caches) connect at 0.1-1 Gbps; each institute has ~10 physicists working on one or more analysis "channels". Tier 4: workstations.)


    The ATLAS Experiment

    (Diagram of the ATLAS detector, with the Canadian contributions labelled)
