Atlas Canada Lightpath Data Transfer Trial

Corrie Kost, Steve McDonald (TRIUMF)

Bryan Caron (University of Alberta), Wade Hong (Carleton)

Brownie 2.5 TeraByte RAID Array
  • 16 x 160 GB IDE disks (5400 rpm, 2 MB cache)
    • hot-swap capable
  • Dual Ultra160 SCSI interface to host
  • Maximum transfer ~65 MB/sec
  • Triple hot-swap power supplies
  • ~CAN $15k
  • Arrives July 8th, 2002
What to do while waiting for the server to arrive
  • IBM IntelliStation Pro (type 6850, on loan)
  • Dual 2.2 GHz Xeons
  • 2 PCI 64-bit/66 MHz slots
  • 4 PCI 32-bit/33 MHz slots
  • 1.5 GB RAMBUS memory
  • Added 2 Promise Ultra100 IDE controllers and 5 disks
  • Each disk on its own IDE controller for maximum I/O
  • Began Linux software RAID performance tests: ~170/130 MB/sec read/write (see the sketch below)
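A sketch of the kind of raw-throughput test used (assuming the software RAID is assembled as /dev/md0 and mounted on /raid; the file name and sizes are illustrative only):

time dd if=/dev/zero of=/raid/testfile bs=1M count=4096     (write ~4 GB to the array)
umount /raid && mount -t ext2 /dev/md0 /raid                (remount to drop the page cache)
time dd if=/raid/testfile of=/dev/null bs=1M                (read it back)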
The Long Road to High Disk I/O
  • IBM cluster x330s, RH7.2: disk I/O ~15 MB/sec (slow??)
      • expect ~45 MB/sec from any modern single drive
  • Need the 2.4.18 Linux kernel to support >1 TB filesystems
  • IBM cluster x330s, RH7.3: disk I/O ~3 MB/sec
          • What is going on?
  • Red Hat's modified ServerWorks driver broke DMA on the x330s
  • The x330's drive is ATA 100, BUT the onboard controller is only UDMA 33
  • Promise controllers are capable of UDMA 100, but the latest kernel patches for 2.4.18 are needed before the drives are recognised at UDMA 100
  • Finally drives and controllers both working at UDMA 100 = 45 MB/sec
  • Linux software RAID0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec
  • Now we are ready to start network transfers (a quick DMA check is sketched below)
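Since much of the pain above came from drives silently falling back to UDMA 33, a quick per-drive check with hdparm is worth sketching here (/dev/hde is just an example device):

hdparm -i /dev/hde | grep -i udma    (the mode marked with '*' is the one in use)
hdparm -d1 /dev/hde                  (make sure DMA is enabled)
hdparm -tT /dev/hde                  (cached and buffered read timings)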
So what are we going to do? ... did we?
  • Demonstrate a manually provisioned “e2e” lightpath
  • Transfer 1TB of ATLAS MC data generated in Canada from TRIUMF to CERN
  • Test out 10GbE technology and channel bonding
  • Establish a new benchmark for high performance disk to disk throughput over a large distance
What is an e2e Lightpath?
  • Core design principle of CA*net 4
  • Ultimately to give control of lightpath creation, teardown and routing to the end user
  • Hence, “Customer Empowered Networks”
  • Provides a flexible infrastructure for emerging grid applications
  • Alas, can only do things manually today
The Chicago Loopback
  • Need to test TCP/IP and the Tsunami protocol over long distances, so an optical loop was arranged via StarLight
    • (TRIUMF - BCNET - Chicago - BCNET - TRIUMF)
    • ~91 ms RTT
  • TRIUMF - CERN RTT is ~200 ms, so we told Damir we really needed a double loopback
    • “No problem”
    • The double loopback was set up a few days later (RTT = 193 ms)
    • (TRIUMF - BCNET - Chicago - BCNET - Chicago - BCNET - TRIUMF)
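For reference, the bandwidth-delay product gives the TCP window needed to fill a 1 Gbit/sec path: roughly 125 MB/sec x 0.091 s ≈ 11 MB over the single loop and 125 MB/sec x 0.193 s ≈ 24 MB over the double loop. The RTT itself is easy to confirm (the host name here is hypothetical):

ping -c 5 far-end-of-loop     (expect ~91 ms for the single loop, ~193 ms for the double)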

TRIUMF Server
  • SuperMicro P4DL6 (Dual Xeon 2 GHz)
  • 400 MHz front side bus
  • 1 GB DDR2100 RAM
  • Dual Channel Ultra160 onboard SCSI
  • SysKonnect 9843 SX GbE
  • 2 independent PCI buses
  • 6 PCI-X 64-bit/133 MHz capable slots
  • 3ware 7850 RAID controller
  • 2 Promise Ultra 100 TX2 controllers


CERN Server
  • SuperMicro P4DL6 (Dual Xeon 2 GHz)
  • 400 MHz front side bus
  • 1 GB DDR2100 RAM
  • Dual Channel Ultra160 onboard SCSI
  • SysKonnect 9843 SX GbE
  • 2 independent PCI buses
  • 6 PCI-X 64-bit/133 MHz capable slots
  • 2 3ware 7850 RAID controllers
  • 6 IDE drives on each 3ware controller
  • RH7.3 on a 13th drive connected to the onboard IDE
  • WD Caviar 120 GB drives with 8 MB cache
  • RMC4D from HARDDATA


TRIUMF Backup Server
  • SuperMicro P4DL6 (Dual Xeon 1.8 GHz)
  • SuperMicro 742I-420 17" 4U chassis, 420 W power supply
  • 400 MHz front side bus, 1 GB DDR2100 RAM
  • Dual Channel Ultra160 onboard SCSI
  • SysKonnect 9843 SX GbE, 2 independent PCI buses
  • 6 PCI-X 64-bit/133 MHz capable slots
  • 2 Promise Ultra 133 TX2 controllers & 1 Promise Ultra 100 TX2 controller

Operating System
  • Red Hat 7.3, based on Linux kernel 2.4.18-3
    • Needed to support filesystems > 1TB
  • Upgrades and patches
    • Patched to 2.4.18-10
    • Intel Pro 10GbE Linux driver (early stable)
    • SysKonnect 9843 SX Linux driver (latest)
    • Ported Sylvain Ravot’s tcp tune patches
Intel 10GbE Cards
  • Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their Alpha program
    • based on Intel® 82597EX 10 Gigabit Ethernet Controller

Note length of card!

IDE Disk Arrays
[Photos: TRIUMF send host and CERN receive host disk arrays]

Disk Read/Write Performance
  • TRIUMF send host:
    • 1 3ware 7850 and 2 Promise Ultra 100TX2 PCI controllers
    • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
    • Tuned for optimal read performance (227/174 MB/s)
  • CERN receive host:
    • 2 3ware 7850 64-bit/33 MHz PCI IDE controllers
    • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
    • Tuned for optimal write performance (295/210 MB/s)

THUNDER RAID DETAILS

raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0

/root/raidtab:

raiddev /dev/md0
    raid-level            0
    nr-raid-disks         12
    persistent-superblock 1
    chunk-size            512          # kbytes

    # 8 drives on the 3ware controller
    device    /dev/sdc
    raid-disk 0
    device    /dev/sdd
    raid-disk 1
    device    /dev/sde
    raid-disk 2
    device    /dev/sdf
    raid-disk 3
    device    /dev/sdg
    raid-disk 4
    device    /dev/sdh
    raid-disk 5
    device    /dev/sdi
    raid-disk 6
    device    /dev/sdj
    raid-disk 7

    # 4 drives on the 2 Promise controllers
    device    /dev/hde
    raid-disk 8
    device    /dev/hdg
    raid-disk 9
    device    /dev/hdi
    raid-disk 10
    device    /dev/hdk
    raid-disk 11

Black Magic
  • We are novices in the art of optimizing system performance
  • It is also time consuming
  • We followed most conventional wisdom, much of which we don’t yet fully understand
Testing Methodologies
  • Began testing with a variety of bandwidth characterization tools (an iperf sketch follows below)
    • pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.
  • Evaluated high performance file transfer applications
    • bbftp, bbcp, tsunami, pftp
  • Developed scripts to automate testing and to scan the parameter space for a number of the tools
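A representative iperf invocation for the raw TCP tests (the window size, duration, and host name are illustrative, not the exact parameters used):

iperf -s -w 4M                       (on the receiving host)
iperf -c cern-10g -w 4M -t 60 -i 5   (on the sending host: 4 MB window, 60 s test, 5 s reports)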
Disk I/O Black Magic
  • min/max readahead on both systems

sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

  • bdflush on receive host

sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  or
echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush

  • bdflush on send host

sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  or
echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush


Misc. Tuning and Other Tips

/sbin/elvtune -r 512 /dev/sdc     (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc    (same for the other 11 disks)

-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write

When the /raid disk refuses to unmount (works for kernels 2.4.11 or later):

umount -l /raid     (-l = lazy unmount; then mount & umount)

Disk I/O Black Magic
  • Disk I/O elevators (minimal impact noticed)
    • /sbin/elvtune
    • Allows some control of latency vs throughput
    • Read latency set to 512 (default 8192)
    • Write latency set to 1024 (default 16384)
  • noatime
    • Disables updating the last time a file was accessed (typical for file servers)

mount -t ext2 -o noatime /dev/md0 /raid

Typically ext3 writes at ~90 Mbytes/sec while ext2 writes at ~190 Mbytes/sec.
Reads are minimally affected. We always used ext2.
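To keep noatime across reboots it can also go in /etc/fstab (a minimal sketch, assuming the same /dev/md0 device and /raid mount point):

/dev/md0    /raid    ext2    defaults,noatime    0 0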

Disk I/O Black Magic
  • Process affinity would be needed, but this requires the 2.5 kernel
  • IRQ Affinity

[root@<host> root]# more /proc/interrupts
           CPU0       CPU1
  0:   15723114          0   IO-APIC-edge    timer
  1:         12          0   IO-APIC-edge    keyboard
  2:          0          0   XT-PIC          cascade
  8:          1          0   IO-APIC-edge    rtc
 10:          0          0   IO-APIC-level   usb-ohci
 14:         22          0   IO-APIC-edge    ide0
 15:     227234          2   IO-APIC-edge    ide1
 16:        126          0   IO-APIC-level   aic7xxx
 17:         16          0   IO-APIC-level   aic7xxx
 18:         91          0   IO-APIC-level   ide4, ide5, 3ware Storage Controller
 20:         14          0   IO-APIC-level   ide2, ide3
 22:    2296662          0   IO-APIC-level   SysKonnect SK-98xx
 24:          2          0   IO-APIC-level   eth3
 26:    2296673          0   IO-APIC-level   SysKonnect SK-98xx
 30:   26640812          0   IO-APIC-level   eth0
NMI:          0          0
LOC:   15724196   15724154
ERR:          0
MIS:          0

echo 1 > /proc/irq/18/smp_affinity                        (use CPU0)
echo 2 > /proc/irq/18/smp_affinity                        (use CPU1)
echo 3 > /proc/irq/18/smp_affinity                        (use either)
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity   (reset to default)
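To confirm the pinning took effect (assuming IRQ 18 as in the listing above), the mask can be read back and the per-CPU counters watched:

cat /proc/irq/18/smp_affinity     (current CPU mask, in hex)
grep 3ware /proc/interrupts       (the count should now grow only on the chosen CPU)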

TCP Black Magic
  • Typically suggested TCP and net buffer tuning

sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
sysctl -w net.core.rmem_default=65535
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_default=65535
sysctl -w net.core.wmem_max=8388608
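These settings are lost on reboot; one way to make them persistent (a sketch using the same values as above) is to add them to /etc/sysctl.conf and reload with sysctl -p:

net.ipv4.tcp_rmem = 4096 4194304 4194304
net.ipv4.tcp_wmem = 4096 4194304 4194304
net.ipv4.tcp_mem = 4194304 4194304 4194304
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608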

TCP Black Magic
  • Sylvain Ravot’s tcp tune patch parameters

sysctl -w net.ipv4.tcp_tune="115 115 0"

  • Linux 2.4 retentive TCP
    • Caches TCP control information for a destination for 10 mins
    • To avoid caching

sysctl -w net.ipv4.route.flush=1

We are live continent to continent!
  • e2e lightpath up and running Friday, Sept 20, 2002 at 21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets

1 cern-10g (192.168.2.2) 161.780 ms 161.760 ms 161.754 ms

BBFTP Transfer
[Traffic plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]

BBFTP Transfer
[Traffic plots: Chicago ONS, GigE ports 1 and 2]

Tsunami Transfer
[Traffic plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]

Tsunami Transfer
[Traffic plots: Chicago ONS, GigE ports 1 and 2]

Exceeding 1 Gbit/sec … (using tsunami)

What does it mean for TRIUMF in the long term?
  • Established a relationship with a ‘grid’ of people for future networking projects
  • Upgraded WAN connection from 100 Mbit to 4 x 1 GbE connections directly to BCNET
        • CANARIE (educational/research network)
        • WestGrid (grid computing)
        • Commercial Internet
        • Spare (research & development)
  • Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers needed by upcoming scientific programs (ATLAS, WestGrid, etc.)
Lessons Learned - 1
  • Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems
  • One controller channel per drive; the more disk spindles the better
  • More than 2 Promise controllers per machine is possible (100/133 MHz)
  • Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance
    • A single 2.8 GHz CPU is likely to outperform dual 2.0 GHz CPUs for a single-purpose machine like our file servers
  • The more memory the better

Misc. Comments
  • No hardware failures, even for the 50 disks!
  • Largest file transferred: 114 Gbytes (Sep 24)
  • Tar, compression, etc. take longer than the transfer itself
  • Deleting files can take a lot of time
  • Low cost of project: ~$20,000, with most of that recycled
[Throughput plot: 220 Mbytes/sec and 175 Mbytes/sec]

Acknowledgements
  • Canarie
    • Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
  • Atlas Canada
    • Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebrve, Richard Keeler
  • HEPnet Canada
    • Dean Karlen
  • TRIUMF
    • Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / Westgrid),
  • BCNET
    • Mike Hrybyk, Marilyn Hay, Dennis O’Reilly, Don McWilliams
Acknowledgements
  • Extreme Networks
    • Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
  • Intel Corporation
    • Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
Acknowledgements
  • Indiana University
    • Mark Meiss, Stephen Wallace
  • Caltech
    • Sylvain Ravot, Harvey Neuman
  • CERN
    • Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
  • SURFnet/Universiteit van Amsterdam
    • Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
Acknowledgements
  • Yotta Yotta
    • Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
  • BCIT
    • Bill Rutherford
  • Jalaam
    • Loki Jorgensen
  • Netera
    • Gary Finley
ATLAS Canada
[Map of ATLAS Canada sites: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto]

LHC Data Grid Hierarchy
[Diagram, slide courtesy H. Newman (Caltech): Experiment / Online System (~PByte/sec raw; ~100-400 MBytes/sec to CERN), Tier 0+1 at CERN (700k SI95, ~1 PB disk, tape robot, HPSS), Tier 1 centres (FNAL 200k SI95 and 600 TB; IN2P3, INFN, RAL; HPSS) linked at ~2.5 Gbps, Tier 2 centres at ~2.5 Gbps, Tier 3 institutes (~0.25 TIPS) at 0.1-1 Gbps, Tier 4 workstations. CERN/Outside resource ratio ~1:2; Tier0/(Tier1)/(Tier2) ~1:1:1. Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, with a physics data cache.]
