
Book Drawing

Make sure you leave me a business card or a piece of paper with your name on it for the drawing at the end of the session. Book Drawing. Exchange High Availability Without Clustering. Jim McBee ITCS Hawaii jim@somorita.com. Setting the stage….


Presentation Transcript


  1. Make sure you leave me a business card or a piece of paper with your name on it for the drawing at the end of the session. Book Drawing

  2. Exchange High Availability Without Clustering Jim McBee ITCS Hawaii jim@somorita.com

  3. Setting the stage…. “Approximately 80 percent of unplanned downtime is caused by people and process issues, while the remainder is caused by technology failures and disasters” -Gartner Group study, March 16, 1999

  4. Who is Jim McBee!!?? • Consultant, Writer, MCSE, MVP and MCT – Honolulu, Hawaii (Aloha!) • Principal clients • USPACOM J2 • USARPAC G6 • Author – Exchange 2003 24Seven (Sybex) • Contributor – Exchange and Outlook Administrator • Blog • http://mostlyexchange.blogspot.com • Free eBook • http://nexus.realtimepublishers.com/ttgsm.htm

  5. This session’s coverage • Introduction to me and the topic • Presentation – About 60 minutes • Book give away – Drop off your business card or write your name on a slip of paper • Questions and answers – 10 - 15 minutes

  6. Audience Assumptions • You have at least a few months' experience running Exchange 5.5, 2000, or 2003. • You have worked with Active Directory • You can install and configure a Windows 2000 / 2003 server

  7. Presentation coverage • Defining… • Availability, reliability, fault tolerance • Estimated costs of clustering • Common causes of downtime • Your friend, the SLA • Preventing disasters • Configuration recommendations • Minimizing the effects of downtime • Daily operations • Backup plans • This presentation will be posted to my blog after April 30, 2006 – http://mostlyexchange.blogspot.com

  8. If you take nothing else from this session, take this: Formula for better availability • Get good training and have good reference material • Set yourself up for predictable operations • Monitor your system to ensure it stays within the boundaries you establish

  9. High Availability - 101 • Determine the causes of unplanned downtime • Focus on preventing ‘disasters’ • Predictable daily operations • Catch problems before they affect the users

  10. Myths of high availability • Failure to meet 24x7x365 is a technical problem • More hardware = better availability • Training is not necessary • Existing procedures and processes are good enough • High availability can be bought off the shelf • Can be achieved without ‘investment’

  11. In search of 5 nines (99.999%) • The percentage of uptime you have during your scheduled hours of operation • Stated hours of operation 24x7x365? • 99% up time = 3.7 days of downtime per year • 99.7% up time = 1 day • 99.9% up time = 8.8 hours • 99.99% up time = 52 minutes • 99.999% up time = 5.3 minutes • Hopefully you are not promising 24x7x365!
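
The downtime figures on this slide can be reproduced with a little arithmetic. The following is a minimal sketch, assuming a full 24x7x365 schedule (8,760 hours per year); the function name and the `scheduled_hours` parameter are illustrative and not from the presentation.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a 24x7x365 schedule


def downtime_hours_per_year(uptime_percent, scheduled_hours=HOURS_PER_YEAR):
    """Return the hours of downtime allowed by a given uptime percentage."""
    return scheduled_hours * (100.0 - uptime_percent) / 100.0


for pct in (99.0, 99.7, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Passing a smaller `scheduled_hours` value (for example, a business-hours-only schedule) shows how much less downtime the same percentage allows when the stated hours of operation shrink.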

  12. Availability and Reliability… • Availability… • The percent of time that Exchange is accessible to the user community within the stated schedule of operations • The proportion of time that a system can be used for productive work • Lets you keep your job • Reliability… • An application or service provides the same results under similar load • Provides consistent, correct results • Lets you sleep a little better at night

  13. Availability and Reliability… • Don’t sacrifice reliability for availability!!! • Don’t put off service pack application or critical system maintenance (e.g. replacing a dead disk) just so your availability numbers look good • In general, 8 hours of scheduled, off-peak downtime or degraded service is more acceptable to users than 1 hour of unplanned downtime in the middle of the business day.

  14. Fault Tolerance versus High Availability • Fault tolerance • Components that keep an application functioning in the event of a component failure • Disks (RAID 1, 5, 0+1) • Redundant Power Supplies • UPS • High Availability • Does not necessarily guarantee 100% availability, just higher availability • Moving an application to an alternate server

  15. So, what are WE talking about today? • We are going to focus on: • Reliability • Fault tolerance • Preventing ‘disasters’ • Increasing availability through better reliability, fault tolerance, and procedures

  16. What is an Exchange disaster? • Answers vary from organization to organization • Typically loss of data • Loss of messaging services for more than one or two hours during scheduled operations? • Loss of a single mailbox? • Failure of a specific service? • Microsoft measures downtime based on the number of users affected! • 1000 users on a server that is down for 5 minutes would be 5000 minutes of downtime! • That kind of downtime does NOT look good on a resume
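
The user-weighted downtime measure described on this slide is simply users affected multiplied by minutes down. A small sketch, where the function name and the list-of-pairs input format are assumptions made for illustration:

```python
def user_minutes_lost(outages):
    """Sum users_affected * minutes_down over (users, minutes) pairs."""
    return sum(users * minutes for users, minutes in outages)


# The slide's example: 1000 users on a server that is down for 5 minutes.
print(user_minutes_lost([(1000, 5)]))  # 5000
```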

  17. Appraise the cost of downtime • User productivity • Missed contractual obligations • Missed sales or customer contact • Loss of customer confidence • Loss of end user good will • Loss of credibility • Loss of your job! 

  18. Clustering 101 • Provides higher availability • Clustering does exactly what it claims to do; it protects your organization against hardware failures. • Clustering gets a bad rap for a number of reasons: • Improper operations • Lofty expectations or assumptions • Allows the passive node to be shut down or rebooted for maintenance

  19. Non-clustered configuration costs • Possible configuration: • Dell Dual Xeon 2.8GHz • 4GB RAM • 700GB disks • 160/320GB SDLT Tape • Windows 2003 Standard Server • Exchange 2003 Enterprise Edition • 1,500 Exchange CALs • Veritas Backup Exec w/Exchange Agent • Cost = approximately $91,000

  20. Clustered configuration costs • Possible configuration: • 2 Dell Dual Xeon 2.8GHz • 4GB RAM • 700GB disks • 2 copies Windows 2003 Advanced Server • 1 copy Exchange 2003 Enterprise Edition • 1,500 Exchange CALs • Veritas Backup Exec w/Exchange Agent • Veritas SAN Option • Dell rack • Dell fiber-based SAN and SAN connected 160/320GB SDLT Tape Drive • Cost = approximately $190,000

  21. To cluster or not to cluster…. • Price potentially doubles! • Complexity triples! • You must understand Windows / Active Directory / Exchange / Clustering / SANs • Layer 8 problems – The Political layer • Management expectations are higher! • Danger Will Robinson! Danger! • Layer 9 problems - The Bozone layer • Snuffy the Network Admin • Fail-over is NOT instantaneous (at best 2 – 3 minutes) • Still have single points of failure (the SAN, the network infrastructure)

  22. To cluster or not to cluster… • If you don’t have 99.7% (1 day of downtime) availability right NOW, clustering won’t help. • People and procedures are the highest sources of failures. “High availability starts from within, grasshopper”

  23. Downtime Common Causes: 13 customers and 25 outages • 4 virus outbreaks requiring a shutdown • 4 SAN failures • 4 Shutdowns due to insufficient disk space • 1 Exceeded 16GB limit on Exchange standard • 1 File based A/V software corrupted EDB • 1 Admin applied wrong security template • 1 Operator could not restore database – 5 days! • 1 Database corrupt, 1018 error (device driver) • 1 Database corrupt, operator plugged external SCSI subsystem in while live • 1 Loss of organization’s only global catalog • 1 Loss of organization’s only DNS server • 1 Administrator incorrectly configured directory replication – loss of GAL • 1 Server blue screening every few hours (service pack / firmware issue) • 1 Motherboard failure • 1 SCSI controller failure • 1 Power to the campus data center failed

  24. Ooops… • All but 3 of these outages could have been prevented with better procedures, training, and reliability preparedness. • Only 2 of these could have been prevented with clustering. • Many of these were prolonged or made worse due to insufficient training or procedures. • Exchange was not directly to blame

  25. Change and Configuration Control • Never make changes without a process in place: • Document the changes to be made or patches to be applied. • Test the change in your lab • Responsible parties should review / approve • Notify affected parties • Schedule and give notice to the users • Implement • “Process” is going to become omnipresent for IT

  26. Service Level Agreements (SLA) • Many types of SLAs • From vendor to customer • From IT Department to management/users • For an IT department, the SLA may provide: • Published hours of operation • Expected system responsiveness • Guidelines for operation and recovery • Sets expectations for the user community • Guideline for planning server hardware and configuration • May provide mechanism for reporting and accountability

  27. SLA: Defining Recovery Time • SLA states that in the event of corruption, it takes 4 hours to get a mailbox store back online • Largest store size is 75GB • DLT tape restores at 10GB per hour • The BEST restore time you can expect for the largest store is at least 7.5 hours (75GB at 10GB per hour)! • It is time to re-think store sizes, backup / restore devices, the distribution of mailboxes, or the SLA! • Estimated recovery time may not accurately estimate transaction log replay, either.
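
The mismatch between the SLA and the hardware can be checked with the slide's own figures (75GB store, 10GB per hour restore rate, 4-hour SLA). This is a rough sketch that counts raw tape transfer time only; it ignores transaction log replay and any other overhead, and the function names are illustrative.

```python
def restore_hours(store_gb, restore_gb_per_hour):
    """Raw transfer time to restore a store, ignoring log replay."""
    return store_gb / restore_gb_per_hour


def max_store_gb(sla_hours, restore_gb_per_hour):
    """Largest store that can be restored within the SLA window."""
    return sla_hours * restore_gb_per_hour


print(restore_hours(75, 10))  # 7.5 hours for the largest store
print(max_store_gb(4, 10))    # 40 GB is the most a 4-hour SLA allows
```

Under these assumptions, either the stores have to shrink to about 40GB, the restore device has to get faster, or the SLA has to change.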

  28. Sample SLAs and information • Intermedia • http://www.intermedia.net/legal/shared_sla • http://www.service-level-agreement.net • http://servicelevelbooks.com • http://www.oakland.edu/uts/helpdesk/docs/emailservicelevel.pdf

  29. An ounce of prevention… • Eliminate single points of failure • Reliable servers / server configuration • UPS capacity - 30 minutes • Exchange configuration • Monitoring • Virus protection • Regular, reliable backups • Documentation

  30. Where are your single points of failure? • DNS • Domain controllers • Global catalog servers • Front-end servers • Storage redundancy • Network infrastructure • Backbone • WAN links • Inbound / outbound SMTP mail

  31. Server Configuration • Environment factors • Potential heat or water damage? • Physically secure • It should be really hard to hit the power button • Flash BIOS updates / firmware / device driver updates • Motherboard, disk controllers, tape devices, SANs • Check with your hardware vendor – The latest is not always the greatest • Use good quality cables for networking, fiber, and SCSI connections • Label and neatly tie-wrap them down! • Caching controllers • Use write caching only if battery backup exists; disable entirely otherwise • Budget for a ‘cold standby’ server with identical hardware

  32. Server Configuration - Disks • SCSI disks provide better performance than IDE! • Disk redundancy • All disks should have redundancy (RAID 1, 5, 0+1) • On database disks, keep the disks less than 50% full • Improves restore performance • Provides capacity for unexpected growth • Allows for ESEUTIL repair • Don’t forget enough disk space for RSG • On transaction log disks, plan for at least a week of transaction logs • Never compress Exchange logs or databases!
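
The two sizing rules on this slide (database disks under 50% full, a week of transaction logs) translate into simple capacity estimates. The sketch below assumes 5MB transaction log files, which is the log size used by Exchange 2000 and 2003; the logs-per-day figure is a made-up example and should be measured on your own servers.

```python
def database_disk_gb(database_gb, max_fill=0.5):
    """Disk size needed so the database occupies at most max_fill of it."""
    return database_gb / max_fill


def log_disk_gb(logs_per_day, days=7, log_file_mb=5):
    """Space for `days` worth of logs; 5MB per log file is an assumption."""
    return logs_per_day * days * log_file_mb / 1024


print(database_disk_gb(75))  # 150.0 GB for a 75GB store
print(log_disk_gb(2000))     # about 68.4 GB for a hypothetical 2000 logs/day
```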

  33. Server Configuration - Software • Latest service pack, critical fixes, and updates • Device drivers – consult manufacturer • Buggy disk device drivers (and controller firmware) are a common cause of corrupt databases • Monitor security fixes • Evaluate each security / critical update to see if it applies to you and how quickly it should be applied.

  34. Server Configuration - Batteries Go Bad! • Consult manufacturer for recommended schedule to replace: • UPS batteries • Caching controller batteries

  35. Server Configuration - Consistency • Organize Exchange servers into OUs • Use OU policy for • Auditing policy • Event log sizes and overwrite configuration • Security options • Disabled services • Custom registry settings • Information Store MAPI ports • System Attendant DS MAPI ports • W3SVC service dependencies • These can be included in the SCEREGVL.INF file – See KB 214752 • Avoid server-by-server registry changes if possible • Avoid security templates that overly restrict the local security settings or make file system permission changes.

  36. Server Configuration – Gold Build • Get your servers, software, and configuration to a ‘gold build’ • Except for critical updates, don’t change the configuration frequently • “Change is the enemy of availability, grasshopper!”

  37. Exchange Configuration • Necessary to limit Exchange usage to prevent out-of-control or unexpected growth, viruses spreading, as well as system abuse. • Limit: • Message sizes • Recipient limits • Mailbox sizes.

  38. Exchange Configuration – Message Delivery

  39. Exchange Configuration – Mailbox Limits

  40. Exchange Configuration – Misc. • Configure deleted item recovery on all stores • Configure deleted mailbox recovery • Teach help desk how to recover ‘hard deleted’ items – KB 178630 • Direct Exchange databases to RAID 5 or RAID 0+1 volumes • Direct Exchange transaction logs to RAID 1 or RAID 0+1 volumes (preferably on a separate disk controller from the databases) • Do not rely on PSTs as primary mechanism for mail storage. • PST = BAD

  41. Exchange Configuration: Role Segmentation • Dedicate Exchange servers to specific tasks: • Mailbox servers • Public folder servers • Routing group / Internet / X.400 bridgehead • Foreign mail system connectors (MS Mail, Notes) • Wireless, fax, SMS, and pager gateways • Front-end servers • Segmentation can: • Simplify the complexity of your environment • Minimize impact of a server failure • Reduce recovery times • Often not practical in the ‘age of consolidation’ • If consolidating, keep mailbox servers separate from everything else

  42. We can’t all be clairvoyant .. • …but we can monitor… • Implement some type of monitoring even if you can’t afford NetIQ, OmniAnalyzer, MOM, etc… - You will be glad you did! • Exchange System Manager’s Status and Notifications is free! Recommend monitoring: • Critical services • Disk space • Queue growth • CPU usage
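
Even without a monitoring product, a basic disk space check is easy to script. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the 50% default threshold echoes the database disk guidance elsewhere in this presentation, and the paths in the usage line are placeholders, not paths from the slides. It is not a substitute for Exchange System Manager's Status and Notifications.

```python
import shutil


def free_percent(path):
    """Percentage of free space on the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def low_disk_paths(paths, min_free_percent=50.0):
    """Return the paths whose volume has less free space than the threshold."""
    return [p for p in paths if free_percent(p) < min_free_percent]


# Placeholder path; substitute your database and transaction log volumes.
print(low_disk_paths(["."]))
```

A scheduled task could run a check like this and send an alert whenever the returned list is not empty.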

  43. Operational Procedures • Follow standardized and documented procedures • Keep logs of all changes, updates, and problems with Exchange servers • Whenever possible, do not work at the Exchange server console. Do office administration and automation tasks at your desktop! • Never use beta software from any vendor • Never install an e-mail client on the Exchange server. • Perform complete backups before any changes • Do not apply service packs or updates immediately after release • Do not delete user accounts and mailboxes right away. Set account expiration to the day the user left and wait a month or two. • Never set file-based virus scanning software to scan the M:\ drive or any Exchange data or transaction log directories. • If enabled, never use backup software to back up the M:\ drive

  44. I just gotta defrag! • Squash the urge to ‘over administer’ Exchange. • Rarely a reason to perform offline maintenance or offline defrags • Deleted or moved many mailboxes • Users have recently performed a ‘purge’ • If you need to get away from your kids/spouse and come in on weekends, use that time to test your restoration or disaster recovery procedures on a test network.

  45. Daily operations • The Big 5 daily tasks • Perform and verify successful backups • Check available disk space • Update virus signatures / scanning engine • Check the SMTP and X.400 queues • Check the event logs

  46. Events to watch for… • Anything that indicates a problem or error must be investigated. • Nightly successful backups • NTBackup # 8001 – SG backed up • ESE # 213 – SG backed up • ESE # 224 – Log files being purged for SG • Online maintenance (daily) • ESE # 701 – Completed online defrag • MSExchangeIS Mailbox # 1207 – Purged deleted items • MSExchangeIS Mailbox # 9535 – Purged deleted mailboxes • MSExchangeIS # 1221 – White space report • Performance suffers if online maintenance does not complete. • Make sure that online backups do not overlap online maintenance
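
One way to make this check routine is to compare the events actually logged overnight against the list on this slide. The sketch below assumes the event log has already been exported to a list of (source, event ID) pairs by some other tool; that export step and the function name are hypothetical.

```python
# Event sources and IDs taken from the slide above.
EXPECTED_NIGHTLY = {
    ("NTBackup", 8001),              # SG backed up
    ("ESE", 213),                    # SG backed up
    ("ESE", 224),                    # log files being purged for SG
    ("ESE", 701),                    # completed online defrag
    ("MSExchangeIS Mailbox", 1207),  # purged deleted items
    ("MSExchangeIS Mailbox", 9535),  # purged deleted mailboxes
    ("MSExchangeIS", 1221),          # white space report
}


def missing_events(seen):
    """Return expected (source, event_id) pairs that were not seen."""
    return sorted(EXPECTED_NIGHTLY - set(seen))


# Example: only the backup events were logged last night.
print(missing_events([("NTBackup", 8001), ("ESE", 213), ("ESE", 224)]))
```

Any pair in the result is a cue to check whether the backup or online maintenance actually completed.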

  47. Weekly or monthly operations • If enabled, purge the BADMAIL directory • Check the log file generation • Purge / archive the protocol logging directories • Archive event logs

  48. Virus protection • Virus protection is mandatory in Exchange environments! • On the Exchange server, use a VSAPI 2.0 / 2.5 enabled virus scanner • Keep the signatures up-to-date – daily! • Client-side antivirus scanning is important, too • Publish a ‘forbidden attachment list’

  49. Forbidden Attachment List – Minimal • EXE • COM • CMD • BAT • CHM • REG • SCR • VBS • VB • ASP • EML • HTM • PIF • HTML • JS • SHS • WSH • WSC

  50. Other Forbidden Attachments • MPG • MPEG • MP3 • AVI • WAV • WMV • And other file types that are large and / or possibly unbusiness-like.
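
A forbidden attachment list is ultimately an extension check. The sketch below uses the extensions from the two slides above; it is an illustration only, since in practice the blocking would be configured in your antivirus or message filtering product, and a filename check alone does not catch renamed files.

```python
import os

# Extensions from the "Minimal" list.
FORBIDDEN_MINIMAL = {
    "exe", "com", "cmd", "bat", "chm", "reg", "scr", "vbs", "vb",
    "asp", "eml", "htm", "pif", "html", "js", "shs", "wsh", "wsc",
}
# Extensions from the "Other Forbidden Attachments" list.
FORBIDDEN_MEDIA = {"mpg", "mpeg", "mp3", "avi", "wav", "wmv"}


def is_forbidden(filename, blocked=FORBIDDEN_MINIMAL | FORBIDDEN_MEDIA):
    """True if the file's final extension is on the blocked list."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension in blocked


print(is_forbidden("invoice.pdf.EXE"))  # True: the final extension is what counts
print(is_forbidden("report.doc"))       # False
```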
