This comprehensive guide explores the critical aspects of achieving high availability and security in modern distributed systems. It covers the interconnectedness of resiliency and high availability, defines downtime, and identifies its causes, including planned maintenance, human mistakes, and software failures. The document emphasizes measuring availability using MTBF and MTTR, while discussing redundant system designs such as RAID and SAN. It also examines failover configurations, management, and essential security measures for data management in storage networks, ensuring systems remain operational and secure.
Modern Distributed Systems Design – Security and High Availability • Measuring Availability • Highly Available Data Management • Redundant System Design
Measuring Availability • How are resiliency and high availability interconnected? • Define downtime and what causes it. • How to measure availability?
Define Downtime • Downtime can be defined as follows: “If a user cannot get his job done on time, the system is down.”
What causes downtime? • Planned: the easiest kind to reduce; includes scheduled system maintenance, hot-swappable hard drives, cluster upgrades, and even failovers. Usually 30% of all downtime; • People, or the human factor: careless mistakes, plus increasingly complex IT equipment, software, and protocols that demand greater knowledge from engineers. Usually 15% of all downtime; • Software failures: caused by software bugs and viruses. Usually 40% of all downtime.
How to measure availability? Availability = MTBF / (MTBF + MTTR), where MTBF is the “mean time between failures” and MTTR is the “mean time to repair”.
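As a worked illustration of the availability ratio, here is a minimal Python sketch. The function names, the choice of hours as the unit, and the sample figures (999 hours between failures, 1 hour to repair) are assumptions made for the example and do not come from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return MTBF / (MTBF + MTTR) as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF + MTTR must be greater than zero")
    return mtbf_hours / total


def expected_downtime_hours_per_year(avail: float) -> float:
    """Translate an availability fraction into expected hours of downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
a = availability(999.0, 1.0)
yearly_downtime = expected_downtime_hours_per_year(a)
print(f"availability: {a:.3%}")
print(f"expected downtime: {yearly_downtime:.2f} hours/year")
```

With these sample numbers the ratio is 0.999 ("three nines"), which corresponds to roughly 8.76 hours of downtime per year. Reducing MTTR improves availability just as effectively as increasing MTBF.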
What can go wrong? • Hardware • Environmental and Physical Failures • Network Failures • Database System Failures • Web Server Failures • File and Print Server Failures
Levels of Availability: • Regular Availability • Increased Availability • High Availability • Disaster recovery • Fault-Tolerant System
Highly Available Data Management • Data management is the most sensitive area of modern distributed systems. • Quick overview of existing data topologies
Redundant System Design • Redundant storage (RAID, Multi-hosting, Multi-Pathing, DiskArray, JBOD, etc) • Failover Configurations and Management • Introduction to SAN and Fibre Channel protocol • Security aspects of data management in Storage Area Networks
Failover Configurations and Management Failover must meet the following requirements: • Transparent to the client; • Quick (no more than 5 min, ideally 0-2 min); • Minimal manual intervention and guaranteed data access.
Failover components: • Two servers, one primary and one takeover; • Two network connections; a third is highly recommended; • All disks on a failover pair should have some form of redundancy; • Application portability; • No single point of failure.
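The timing requirement can be turned into a simple check on measured failover durations. This is only a sketch: the thresholds come from the slide (0-2 min ideal, 5 min maximum), while the function name, the labels, and the use of seconds are assumptions for illustration.

```python
IDEAL_MAX_SECONDS = 2 * 60  # "ideally 0-2 min"
HARD_MAX_SECONDS = 5 * 60   # "no more than 5 min"


def classify_failover(duration_seconds: float) -> str:
    """Classify a measured failover duration against the slide's targets."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    if duration_seconds <= IDEAL_MAX_SECONDS:
        return "ideal"
    if duration_seconds <= HARD_MAX_SECONDS:
        return "acceptable"
    return "too slow"


# Hypothetical durations, in seconds, from three failover drills.
results = [classify_failover(d) for d in (45, 240, 400)]
print(results)
```

A check like this could be run after each failover drill to confirm the pair still meets its target.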
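One way to reason about the "no single point of failure" rule is to count the instances of each component in a failover pair and flag anything that exists only once. The sketch below assumes a hypothetical inventory format (component name mapped to instance count); it is an illustration, not a real cluster manager's API.

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, int]) -> List[str]:
    """Return the names of components that have fewer than two instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical failover pair: two servers and two network links,
# but only one path to the shared disks.
pair = {"servers": 2, "network_connections": 2, "disk_paths": 1}
spof = single_points_of_failure(pair)
print(spof)
```

Here the single disk path is reported, so the pair in this example would not yet satisfy the last requirement on the slide.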
Security in IP Storage Networks • Security in Fibre Channel SANs • Security Options for IP Storage Networks
Fibre Channel SAN Security • Port or hard zoning • WWN Zoning • LUN Masking
Security Options for IP Storage Networks • iSNS • LUN Masking as in Fibre Channel and VLAN tagging • IP Security or IPSec • ACL