Data Center Maintenance & Cleaning Practice Ahmad Naufal Jamri, Assistant Director (Penolong Pengarah) TU2, Operations & Technical Services Sector, Information Management Division, Public Service Department (Jabatan Perkhidmatan Awam, JPA)
Lesson Contents • Maintenance • Types of Maintenance • Tiered Infrastructure Maintenance Standards (TIMS) • Infrastructure Maintenance Checklist • Facility – power, cooling, spatial • IT Infrastructure – server, storage, network • Environmental Testing • Cleaning Practices • Standard • Types
Maintenance Why is maintenance an important aspect of data center operation? It sustains the availability of critical systems and the operational performance of the facility.
Maintenance An activity used to inspect, repair, replace, or improve a component at a facility, in order to proactively prevent unexpected interruptions and ensure timely remediation.
Types of Maintenance • Preventive Maintenance (PM) • Corrective Maintenance (CM) • Emergency Maintenance (EM) • Proactive Maintenance • Predictive Maintenance
Preventive Maintenance (PM) A work type for maintenance that is conducted on a regular frequency to prolong the life of equipment and prevent premature failures. PM includes equipment inspection, lubrication, adjustment, cleaning, and testing, and replacing incidental parts, such as filters and simple belts. PM is a combination of proactive maintenance and scheduled overhauls and may be augmented by techniques such as reliability-centered maintenance, predictive maintenance, and condition monitoring.
Corrective Maintenance (CM) A work type for maintenance activity which restores an asset to a preserved operating condition. It is normally initiated because of a scheduled inspection or a routine check that finds the asset or component is no longer within a prescribed tolerance or in an acceptable operating condition.
Emergency Maintenance (EM) A work type for maintenance that is not anticipated but must be performed immediately to ensure reliable operation of the facility or component in the facility.
Proactive Maintenance Maintenance that is structured to anticipate equipment problems and prevent or reduce Corrective or Emergency Maintenance.
Predictive Maintenance Maintenance that is performed based on a known set of conditions, such as number of operations or length of operation.
Tiered Infrastructure Maintenance Standards (TIMS) • The Uptime Institute introduced a tiered classification approach to data center design. • It provides an important first step in helping data centers achieve increased reliability. • The more critical the mission, the more intensive the maintenance program needs to be.
Tiered Infrastructure Maintenance Standards (TIMS) • TIMS was created by Lee Technologies • to provide an evaluation of an organization's maintenance programs • to understand its level of risk • to effectively allocate the organization's resources (people, budget, spare parts) • Every facility and organization is unique, and TIMS will need to be adapted to the specific environment.
Tiered Infrastructure Maintenance Standards (TIMS) • According to TIMS, the most important step is to determine the level of risk acceptable to the organization. • Four Maintenance Service Tiers have been established: • TIMS-1: Run to Fail • TIMS-2: Unstructured • TIMS-3: Structured • TIMS-4: Facilitated
TIMS-1 Run to Fail • "If it isn't broken, don't fix it." • Maintenance at this level is essentially reactive. Only when a problem develops is a vendor called to perform the repair. • Lack of preventive maintenance often results in overall system weakness and overloading.
TIMS-2 Unstructured Maintenance • TIMS-2 involves the performance of basic preventative maintenance on critical infrastructure equipment by a qualified vendor or in-house technical staff. • The lack of maintenance structure typically found in this approach can create a false sense of security - undetected trouble spots.
TIMS-2 Unstructured Maintenance • A common characteristic of Unstructured Maintenance is an over-reliance on individual effort, which creates a high degree of risk when an organization's facility maintenance knowledge resides inside the heads of individual technicians. • Unstructured, under-documented maintenance programs create an environment in which maintenance is more haphazard and the risk of human error is elevated.
TIMS-3 Structured Maintenance • The goal of Structured Maintenance is to maximize uptime by eliminating guesswork and minimizing human error. • Every part of the maintenance process is closely evaluated. Programs are created to identify, train, supervise and evaluate qualified personnel. Procedures are developed to precisely manage how and when work is performed.
TIMS-3 Structured Maintenance • Structured Maintenance brings together best practices for each maintenance element and integrates them into a program to systematically eliminate variables that can introduce errors. • This maintenance level is extremely proactive.
TIMS-3 Structured Maintenance • Characteristics of Structured Maintenance include: • formal staff training program • document library that includes a scope of service and Standard Operating Procedure (SOP) for all site equipment • change management program that utilizes methods of procedure for all maintenance activities • robust vendor management program • quality control procedures • specialized support systems such as a Computerized Maintenance Management System (CMMS) and a Document Management System (DMS).
TIMS-4 Facilitated Maintenance • The highest level of maintenance. • Provides multiple power and cooling distribution paths with redundant components, allowing individual equipment to be isolated and maintained without a disruption in services. • Uses a Building Management System (BMS), which continually monitors the critical infrastructure and provides a controlled means for bringing equipment online and offline for maintenance.
TIMS-4 Facilitated Maintenance • Minimizes the risk of downtime. • Automated systems take much of the risk of human error out of the equation, and can respond more quickly and accurately to sudden changes. • Predictive maintenance.
FACILITY A. Power Maintenance • Conduct an emergency shutdown test and a power-up from complete power failure. • Use motion detectors to turn off lighting when nobody is working. • For batteries, check, measure, and record voltage, current, and conductance readings, and look for trends in the recordings. • Load test generators once per year (checklist). • UPS maintenance checklist & tools
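The advice to look for trends in battery recordings can be illustrated with a small sketch. It fits a least-squares line through equally spaced readings and flags a string whose readings fall faster than a chosen limit. The function names, the sample voltages, and the 0.02 V per interval limit are all hypothetical values chosen for illustration, not figures from a battery vendor or standard.

```python
from typing import Sequence


def trend_slope(readings: Sequence[float]) -> float:
    """Least-squares slope of equally spaced readings (units per interval)."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_declining(readings: Sequence[float], max_drop_per_interval: float) -> bool:
    """True when the readings fall faster than the allowed drop per interval."""
    return trend_slope(readings) < -max_drop_per_interval


# Hypothetical monthly float-voltage readings for one battery string (volts).
monthly_voltage = [13.50, 13.48, 13.47, 13.44, 13.40, 13.35]

if is_declining(monthly_voltage, max_drop_per_interval=0.02):
    print("Voltage is trending down: schedule a capacity test")
else:
    print("No significant downward trend")
```

The same helper could be applied to conductance or current readings, with a limit chosen per measurement type.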
Power Usage Effectiveness (PUE) • Used to measure energy efficiency. • The ideal value is 1.0 (indicating that all power is used by the IT equipment, i.e. 100% efficiency). • UPS Batteries • Keep batteries in well-ventilated air that is as close as possible to 25°C. • Be aware of the battery's discharge status. • Do not leave a battery uncharged for more than 48 hours. • Perform capacity testing on batteries. • Consider using a flywheel UPS in conjunction with a battery UPS. • Buy the right UPS battery for your data center.
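PUE is the ratio of total facility power to the power delivered to IT equipment, so a value of 1.0 means no overhead for cooling, power distribution, or lighting. A minimal sketch of the calculation follows; the function name and the sample meter readings are hypothetical.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT equipment power")
    return total_facility_kw / it_equipment_kw


# Hypothetical readings: 1500 kW at the utility meter, 1000 kW at the IT load.
print(f"PUE = {pue(1500.0, 1000.0):.2f}")
```

Both readings should cover the same time window, since PUE varies with load and outside temperature.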
FACILITY B. Cooling Maintenance Temperature & Humidity • Check for airflow blockages under the floor. • Raise the temperature a few degrees. • If the weather outside is cold, design air-conditioning systems that can take advantage of external air. • Use dynamic fans, which use temperature sensors to increase and decrease fan speeds as needed. • Use recycled water collection systems for backup cooling. • Initiate a cooling system maintenance schedule.
A cooling system checkup should include the following items: • maximum cooling capacity • CRAC (computer room air conditioning) units • chilled water/condenser loop • room temperatures • rack temperatures • tile air velocity • condition of sub floors • airflow within racks • aisle and floor tile arrangement – hot & cold aisle
• Align air handling units with hot aisles to optimize cooling efficiency. • Perform thermal scanning of racks and breaker panels. • Airflow within the rack is also affected by unstructured cabling arrangements, which can restrict the exhaust air from IT equipment. (Images: air handler, unstructured cabling, breaker panels)
• Track cooling, humidification, and heating (dehumidifier) load on a monthly basis. • Return temperatures approaching or greater than 24°C are a danger sign that there is too much heat load in a specific area, and temperatures below 21°C are a sign of inefficient cooling. • Filters need to be changed with proper replacement types on a monthly basis. • Changing your air conditioning system's filters every 1 to 2 months and inspecting your ducts annually will prevent dust and debris from accumulating in your air handler.
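The return-temperature guidance can be expressed as a simple check, sketched below. The 24°C and 21°C limits come from the slide; the 1°C margin used to interpret "approaching" and the status labels are assumptions made for illustration.

```python
HIGH_RETURN_C = 24.0      # from the slide: too much heat load at or near this value
LOW_RETURN_C = 21.0       # from the slide: below this suggests inefficient cooling
APPROACH_MARGIN_C = 1.0   # assumption: how close counts as "approaching" 24°C


def classify_return_temp(temp_c: float) -> str:
    """Classify a CRAC return-air temperature reading in degrees Celsius."""
    if temp_c < LOW_RETURN_C:
        return "inefficient_cooling"
    if temp_c >= HIGH_RETURN_C - APPROACH_MARGIN_C:
        return "excess_heat_load"
    return "ok"


# Hypothetical readings from three zones of the computer room.
for zone, reading in {"A": 20.5, "B": 22.0, "C": 23.4}.items():
    print(zone, reading, classify_return_temp(reading))
```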
FACILITY C. Spatial Maintenance • Remove under-floor obstructions. • Distribute high-density racks across the entire floor area. • Manage perforated tiles: place them as closely as possible to equipment intakes. • Inspect for dust, leaks, or corrosion. • Keep cables above the raised floor, if the rack configuration allows it, to optimize airflow below the raised floor.
IT INFRASTRUCTURE A. Server • Conduct preventive server maintenance • (sample schedule) • Test in segments rather than starting everything up at once (if problems occur, they are easier to locate) • Server monitoring tools
IT INFRASTRUCTURE B. Network • Network checking, adding and removing IP addresses, subnet maintenance, domain setup and maintenance, troubleshooting and resolving problems on the network, troubleshooting related hardware attached to the network, setting up and changing security policies, adding and removing users, implementing filtering and anti-spam solutions • Network maintenance checklist
IT INFRASTRUCTURE C. Storage • Create clustered storage solutions • Backup • Data de-duplication • Archiving • Recovery • Migration • Virtualization • Storage Tiering (FAST by EMC2)
Best Practices of DC Maintenance • 24x7 dedicated security staff with multiple layers of physical security • Regular cleaning of the DC • Quarterly preventive maintenance includes power system checks, HVAC and generator servicing • All service visits must be coordinated and tracked • A physical site inspection should be done every month • A copy of all documents and SOPs must be maintained at a central website
Data Center Cleaning Cleaning Standards • ISO Class 8 or Class 9 • Fed. Std. 209E – Class 100,000
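The two standards listed above describe roughly the same cleanliness level. ISO 14644-1 sets the maximum particle concentration for class N as 10^N × (0.1 / D)^2.08 particles per cubic metre, where D is the particle size in micrometres, while Fed. Std. 209E Class 100,000 allows 100,000 particles of 0.5 µm or larger per cubic foot. The sketch below compares the two at 0.5 µm; the function names are hypothetical, and the published ISO tables round the result to three significant figures.

```python
CUBIC_FEET_PER_M3 = 35.3147  # unit conversion factor


def iso_class_limit(iso_class: float, particle_um: float) -> float:
    """ISO 14644-1 limit in particles per m^3 at or above the given size (µm)."""
    return (10 ** iso_class) * (0.1 / particle_um) ** 2.08


def fs209e_limit_per_m3(fs_class: int) -> float:
    """Fed. Std. 209E class (particles >= 0.5 µm per ft^3) converted to per m^3."""
    return fs_class * CUBIC_FEET_PER_M3


print(f"ISO Class 8 at 0.5 µm: {iso_class_limit(8, 0.5):,.0f} particles/m^3")
print(f"ISO Class 9 at 0.5 µm: {iso_class_limit(9, 0.5):,.0f} particles/m^3")
print(f"FS 209E Class 100,000: {fs209e_limit_per_m3(100_000):,.0f} particles/m^3")
```

A particle counter reading taken at 0.5 µm can then be compared directly against the ISO Class 8 limit.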
Approved Cleaning Supplies • Triple filtration high-efficiency particulate air (HEPA) vacuums • Electrical cords in good condition with 3-pin ground configuration • Cleaning chemicals pH neutral • Mops lint-free with non-metal handles and sewn ends • Lint-free and antistatic wipes
Cleaning: Guidelines for Contamination Control in DC Establish Computer Room Protocols • No food or drink inside the computer room. • Do not unpack or uncrate equipment or other items inside the computer room. A staging area outside of the computer room should be established for unpacking or uncrating activity. • Do not store cardboard, wood or paper type products inside the computer room. These items continuously shed large amounts of contamination. • Do not prop open doors that lead to non-computer room areas. • Do not allow any work to occur in the computer room until the environmental impact of the work is known AND protocols for contamination control have been reviewed and approved. • Any tools and/or materials brought into the computer room by vendors or employees should be reasonably clean and contaminant free.
Guidelines for Contamination Control in Computer Room Environments • Limit access to the computer room. In addition to security issues, unnecessary personnel can add to contamination levels. People generate contamination through clothing fibers, dead skin, hair, and dirt on their shoes. • Place contamination control mats at all entrances to the computer room. Contamination control mats help ensure that dust, carpet fibers and other small particulates are not tracked into the room via people's shoes or wheels on carts. • Maintain positive air pressurization in the computer room relative to surrounding areas. Positive pressurization helps prevent contaminated air from entering the room.
Guidelines for Contamination Control in Computer Room Environments Maintain a disciplined computer room cleaning program • Clean top of floor surfaces quarterly or more frequently. • Clean equipment and environmental surfaces quarterly or more frequently. • Clean the underfloor plenum at least once per year, or twice per year if the plenum is delivering pressurized air. • Maintain a consistent cleaning schedule. • Increase cleaning frequency during construction or other contamination-producing events.
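The cleaning intervals above can be turned into due dates with a short sketch. It treats "quarterly" as every 3 months and takes these as the longest allowed intervals; the task names and the helper functions are hypothetical, not part of any cleaning standard.

```python
import calendar
from datetime import date

# Longest allowed interval in months for each task, taken from the guidelines.
INTERVAL_MONTHS = {
    "floor_surfaces": 3,
    "equipment_surfaces": 3,
    "underfloor_plenum": 12,
}


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(task: str, last_done: date, pressurized_plenum: bool = False) -> date:
    """Latest date by which the task should be repeated."""
    months = INTERVAL_MONTHS[task]
    if task == "underfloor_plenum" and pressurized_plenum:
        months = 6  # twice per year when the plenum delivers pressurized air
    return add_months(last_done, months)


print(next_due("floor_surfaces", date(2024, 11, 30)))
print(next_due("underfloor_plenum", date(2024, 1, 15), pressurized_plenum=True))
```

During construction or other contamination-producing events, the intervals in the table would be shortened rather than the code changed.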