


Monitoring and operational management in USLHCNet

Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa

CHEP09 - March 2009 Prague


Outline

  • MonALISA Framework
    • Architecture
    • Data handling
    • Automatic actions
  • USLHCNet
    • Network topology
    • Monitoring modules
    • Reliable monitoring & accounting
    • Alarms & triggers
  • Conclusions


Ramiro Voicu CHEP09 Prague March 2009

The MonALISA Architecture

  • Regional or Global High Level Services, Repositories & Clients (HL services)
  • Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
  • MonALISA services (agents): distributed system for gathering and analyzing information based on mobile agents; customized aggregation, triggers, actions
  • Network of JINI Lookup Services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events

Fully Distributed System with no Single Point of Failure
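The lease mechanism mentioned for registration and discovery can be illustrated with a minimal sketch. This is not the JINI API or MonALISA code; the class `LeaseRegistry`, its methods, and the service names are hypothetical, and the only idea taken from the slide is that a registration stays valid only while its lease is renewed.

```python
import time
from typing import Callable, Dict, List


class LeaseRegistry:
    """Toy lookup service (hypothetical): an entry lives only while its lease is renewed."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds
        self.clock = clock                    # injectable clock, so tests need no sleeping
        self._expiry: Dict[str, float] = {}   # service name -> lease expiry time

    def register(self, service_name: str) -> None:
        """Grant a fresh lease to a service."""
        self._expiry[service_name] = self.clock() + self.lease_seconds

    def renew(self, service_name: str) -> None:
        """Extend the lease of a service that is still registered."""
        if service_name not in self._expiry:
            raise KeyError(f"{service_name} is not registered")
        self._expiry[service_name] = self.clock() + self.lease_seconds

    def discover(self) -> List[str]:
        """Return the services whose lease is still valid, dropping expired ones."""
        now = self.clock()
        for name in [n for n, t in self._expiry.items() if t <= now]:
            del self._expiry[name]
        return sorted(self._expiry)
```

A service that stops renewing simply disappears from `discover()` once its lease runs out, so clients never need an explicit "unregister" message from a failed service.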

MonALISA Service & Data Handling

[Diagram: structure of a MonALISA service]
  • Monitoring Modules: collect any type of information (push and pull)
  • Dynamic (re)loading
  • Agents, filters / triggers
  • Data Cache Service & DB; Postgres data store
  • Lookup Services: registration and discovery
  • Web Service (WSDL, SOAP) for WS clients and services
  • Data (via ML Proxy), predicates & agents for clients or higher level services
  • Configuration control (SSL)
  • Applications

Local and Global Decision Framework

Two levels of decisions:
  • local (autonomous),
  • global (correlations).

Actions triggered by:
  • values above/below given thresholds,
  • absence/presence of values,
  • correlations between any values.

Action types:
  • alerts (emails/instant msg/atom feeds),
  • running an external command,
  • automatic chart annotations in the repository,
  • running custom code, like securely ordering a ML service to (re)start a site service.

[Diagram: local and global decisions]
  • Inputs to the ML Services: traffic, jobs, hosts, apps; sensors (temperature, humidity, A/C power)
  • ML Services: actions based on local information (local decisions)
  • Global ML Services: actions based on global information (global decisions)
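The three kinds of trigger conditions listed on this slide (thresholds, absence of values, correlations) can be sketched as small composable predicates. This is only an illustration under assumptions: the helper names (`above`, `below`, `absent`, `correlated`, `evaluate`) and the sensor parameters are hypothetical and do not come from the MonALISA API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Latest value per parameter; None (or a missing key) means no value was received.
Values = Dict[str, Optional[float]]
Condition = Callable[[Values], bool]


def above(name: str, limit: float) -> Condition:
    """True when the parameter is present and above the threshold."""
    return lambda v: v.get(name) is not None and v[name] > limit


def below(name: str, limit: float) -> Condition:
    """True when the parameter is present and below the threshold."""
    return lambda v: v.get(name) is not None and v[name] < limit


def absent(name: str) -> Condition:
    """True when no value was received for the parameter."""
    return lambda v: v.get(name) is None


def correlated(*conditions: Condition) -> Condition:
    """True only when all the given conditions hold at the same time."""
    return lambda v: all(c(v) for c in conditions)


def evaluate(triggers: List[Tuple[Condition, str]], values: Values) -> List[str]:
    """Return the alert message of every trigger whose condition fires."""
    return [message for condition, message in triggers if condition(values)]


# Hypothetical local triggers for a machine room.
triggers = [
    (above("temperature", 30.0), "room overheating"),
    (absent("humidity"), "humidity sensor silent"),
    (correlated(above("temperature", 30.0), below("ac_power", 1.0)), "A/C failure suspected"),
]
```

In this sketch the action is just an alert string; a real deployment would map each fired trigger to one of the action types listed above (alert, external command, annotation, custom code).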

Monitoring architecture in ALICE

[Diagram: ApMon instruments the AliEn components and reports to the MonALISA services]
  • Instrumented with ApMon: AliEn CE, Cluster Monitor, AliEn IS, AliEn Optimizers, AliEn Job Agents, AliEn Brokers, AliEn TQ, AliEn SE, MySQL Servers, CastorGrid Scripts, API Services
  • MonALISA services: MonALISA @Site, MonALISA LCG Site, MonALISA @CERN
  • Monitored parameters: job slots, net in/out, run time, cpu time, free space, processes, load, jobs status, vsz, rss, sockets, migrated mbytes, active sessions, nr. of files, open files, queued JobAgents, job status, cpu ksi2k, disk used, MyProxy status
  • MonALISA Repository: aggregated data, long history DB, alerts, actions, LCG Tools

See Costin Grigoras’ poster (067): Automated agents for management and control of the ALICE Computing Grid

USLHCNet
  • USLHCNet provides transatlantic connections between the Tier1 computing facilities at Fermilab and Brookhaven and the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
  • Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
  • The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits make it possible to develop efficient data transfer services with support for QoS and priorities.
  • Hybrid network: uses both Ciena CD and Force10 routers
  • 4 transatlantic 10G links at the moment (6 links in the second part of this year)*

* See Harvey Newman’s talk [502] from Monday: “Status and outlook of the HEP network”

USLHCNet ML weather map

Monitoring modules

We developed a set of monitoring modules for USLHCNet network devices:

  • Force10 (SNMP & sFlow)
    • Traffic per interface
    • sFlow traffic
    • Link status monitoring
  • Ciena Core Director (TL1 – Transaction Language 1)
    • ETTP (Ethernet Termination Point) traffic
    • EFLOW (Ethernet Flow) traffic
    • OSRP (routing protocol) topology
    • Dynamic circuits inside the optical core of the network
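
As an illustration of what a "traffic per interface" measurement involves, the sketch below turns two successive readings of an octet counter into a bit rate. SNMP interface counters are monotonically increasing 32-bit or 64-bit values that wrap around, so the difference is taken modulo the counter size. The function name `rate_bps` and its parameters are hypothetical; this is not the code of the USLHCNet modules.

```python
def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             counter_bits: int = 64) -> float:
    """Average bit rate between two readings of a wrapping octet counter.

    Assumes the counter wrapped at most once between the readings; a device
    reset cannot be told apart from a wrap by this calculation.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus  # handles a single wrap
    return delta_octets * 8 / interval_s
```

For example, `rate_bps(1000, 2250, 10)` gives 1000.0 bits per second, and with `counter_bits=32` a reading that went from `2**32 - 100` to `150` in one second is treated as 250 octets transferred, not as a negative value.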

USLHCNet monitoring

[Diagram: MonALISA services at GVA, AMS, NYC and CHI monitor the network devices via SNMP and TL1]

USLHCNet redundant monitoring

Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.

[Diagram: MonALISA services at GVA, AMS, NYC and CHI]

Local and global filters
  • Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
  • The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
    • The link status is computed as a logical “AND” between both end points of a link. This also cross-checks the status reported by the hardware equipment.
  • We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
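
The logical “AND” rule for link status can be sketched as follows. The slide only states that the status is an AND between both end points; how redundant reports for the same end point are merged is not described, so this sketch assumes that an end point counts as up if at least one monitoring service reports it up, and that an end point with no reports counts as down. The function `link_is_up` and the service and site names in the usage are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (monitoring service, link end point, end point reported up?)
Report = Tuple[str, str, bool]


def link_is_up(reports: Iterable[Report], endpoints: Sequence[str]) -> bool:
    """Link status as a logical AND over the end points of the link."""
    flags: Dict[str, List[bool]] = {endpoint: [] for endpoint in endpoints}
    for _service, endpoint, is_up in reports:
        if endpoint in flags:
            flags[endpoint].append(is_up)
    # Assumption: redundant reports for one end point are merged with OR;
    # any([]) is False, so an end point nobody reported on counts as down.
    return all(any(endpoint_flags) for endpoint_flags in flags.values())
```

With reports from two services per end point, the link is reported up only when both ends have at least one "up" report, which is what lets the filter cross-check the status reported by the equipment at either side.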

USLHCNet: Precise measurements for the Operational Status on the WAN Link
  • Operations & management assisted by agent-based software
  • Used on the new CIENA equipment for network management

USLHCNet: Traffic on different segments

USLHCNet: Accounting for Integrated Traffic

USLHCNet: Ciena alarms monitoring

The Need for Planning and Scheduling for Large Data Transfers

[Chart: two reading tasks run in parallel vs. sequentially]

It is 2.5x faster to perform the two reading tasks sequentially.

Monitoring Optical Switches

Dynamic restoration of lightpath if a segment has problems

Controlling Optical Planes: Automatic Path Recovery

[Diagram: FDT transfer between CERN (Geneva) and CALTECH (Pasadena) over USLHCNet, Internet2, Starlight and Manlan]
  • FDT transfer at 200+ MBytes/sec from a 1U node
  • 4 “fiber cut” simulations: the traffic moves from one transatlantic line to the other one
  • The FDT transfer (CERN – CALTECH) continues uninterrupted
  • TCP fully recovers in ~20 s

For more details, see Iosif Legrand’s poster (054): A High Performance Data Transfer Service

Conclusions
  • The MonALISA framework provides a flexible and reliable monitoring infrastructure
    • 350+ installed services, 1.5M+ unique parameters, 25kHz value updates
    • Truly distributed architecture with no single points of failure
    • Highly modular platform
    • Automatic decision-making capability at both local and global levels
  • USLHCNet provides a state-of-the-art hybrid network with support for circuit oriented network services
    • Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime
    • We are investigating dynamic provisioning of circuits from collaborating agents

http://monalisa.caltech.edu

http://repository.uslhcnet.org
