
Microsoft Exchange Server 2010 SP1: High Availability Deep Dive

Scott Schnoll (scott.schnoll@microsoft.com), Principal Technical Writer, Microsoft Corporation. Agenda: Deep Dive on Exchange 2010 High Availability; High Availability Improvements in Service Pack 1.


Presentation Transcript


  1. Microsoft Exchange Server 2010 SP1: High Availability Deep Dive • Scott Schnoll (scott.schnoll@microsoft.com), Principal Technical Writer, Microsoft Corporation

  2. Agenda • Deep Dive on Exchange 2010 High Availability • High Availability Improvements in Service Pack 1

  3. Database Availability Group (DAG) [Diagram: RPC Client Access and Address Book services; three DAG members, each running Active Manager and hosting copies of DB1, DB2, and DB3]

  4. Deep Dive on Exchange 2010 High Availability • Quorum • Witness • DAG Networks • Active Manager • Best Copy Selection

  5. Quorum

  6. Quorum • Quorum is used by systems that use Windows Failover Clustering to ensure that only one subset of members is functioning at any given time • Dual Usage • Data shared between the voters representing configuration, etc. • Represents a shared view of members (voters and some resources) • Number of voters required for the solution to stay running (majority); quorum is a consensus of voters • When a majority of voters can communicate with each other, the cluster has quorum • When a majority of voters cannot communicate with each other, the cluster does not have quorum

  7. Quorum • Quorum is necessary not only for cluster functions but also for DAG functions • Exchange 2010 uses only two of the four available cluster quorum models • Node Majority (DAGs with an odd number of members) • Node and File Share Majority (DAGs with an even number of members) • Quorum = (N/2) + 1, where N is the number of DAG members and N/2 is rounded down to a whole number • In an even-numbered DAG the witness supplies one extra vote, so there are N + 1 votes in total • 6 members: (6/2) + 1 = 4 votes for quorum (7 votes including the witness; can lose 3 votes) • 9 members: (9/2) + 1 = 5 votes for quorum (can lose 4 votes) • 13 members: (13/2) + 1 = 7 votes for quorum (can lose 6 votes) • 15 members: (15/2) + 1 = 8 votes for quorum (can lose 7 votes)
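The quorum arithmetic above can be checked with a short Python sketch. It assumes, as the 6-member example implies, that an even-numbered DAG gets one extra vote from the witness. The function names are illustrative only and are not part of any Exchange or Windows clustering API.

```python
def quorum_model(members: int) -> str:
    """Quorum model Exchange 2010 uses for a DAG of the given size."""
    return "Node Majority" if members % 2 else "Node and File Share Majority"


def total_votes(members: int) -> int:
    """Every member votes; an even-numbered DAG also counts the witness vote."""
    return members if members % 2 else members + 1


def votes_for_quorum(members: int) -> int:
    """Quorum = (N/2) + 1, with N/2 rounded down (integer division)."""
    return members // 2 + 1


def votes_that_can_be_lost(members: int) -> int:
    return total_votes(members) - votes_for_quorum(members)


def has_quorum(reachable_votes: int, members: int) -> bool:
    return reachable_votes >= votes_for_quorum(members)


if __name__ == "__main__":
    for n in (6, 9, 13, 15):
        print(f"{n} members ({quorum_model(n)}): "
              f"{votes_for_quorum(n)} votes for quorum, "
              f"can lose {votes_that_can_be_lost(n)}")
```

Running it reproduces the four examples on the slide.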

  8. Witness and Witness Server

  9. Witness • A witness is a file share on a server (witness server) • It participates in quorum by providing a weighted vote for the DAG member that has a lock on the witness.log file • Used only by DAGs that have an even number of members (Node and File Share Majority quorum model) • Member that locks the witness.log and retains the weighted vote is referred to as the locking node • DAG members in contact with locking node are in majority and maintain quorum • Witness server does not maintain a full copy of quorum data and is not a member of the DAG or cluster
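The role of the locking node is easiest to see in an even split. The following is a simplified Python model, not how the cluster service is implemented: it assumes the witness vote simply counts for whichever side of a partition holds the witness.log lock, and the server names MBX1 to MBX4 are hypothetical.

```python
def partition_votes(side, members, locking_node):
    """Votes held by one side of a network partition (simplified model).

    Every DAG member on this side contributes one vote. In an even-numbered
    DAG, the witness vote also counts for the side holding the locking node.
    """
    votes = len(set(side) & set(members))
    if len(members) % 2 == 0 and locking_node in side:
        votes += 1
    return votes


def side_has_quorum(side, members, locking_node):
    needed = len(members) // 2 + 1
    return partition_votes(side, members, locking_node) >= needed


# Hypothetical four-member DAG split 2/2; MBX1 holds the witness.log lock.
members = ["MBX1", "MBX2", "MBX3", "MBX4"]
print(side_has_quorum(["MBX1", "MBX2"], members, "MBX1"))  # True: 2 + 1 = 3 votes
print(side_has_quorum(["MBX3", "MBX4"], members, "MBX1"))  # False: 2 votes
```

Without the witness vote, neither half of a 2/2 split would reach the 3 votes needed.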

  10. Witness • Represented by File Share Witness resource • File share witness cluster resource, directory, and share automatically created and removed as needed • Uses Cluster IsAlive check for availability • If witness server is not available, cluster core resources are failed and moved to another DAG member • If other DAG member does not bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes • See http://support.microsoft.com/kb/978790 for details

  11. Witness • When the witness is needed for quorum, behavior depends on the resource state • If in a Failed state and needed for quorum, the cluster will try once to bring the File Share Witness resource online • If the witness cannot be restarted, it is left failed and quorum is lost • If the witness can be restarted, it is brought online and quorum is maintained • If in an Offline state and needed for quorum, the cluster will not try to restart it – quorum is lost • When the witness is no longer needed to maintain quorum, the lock on witness.log is released
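The state rules above reduce to a small decision function. This Python sketch is a hypothetical model of the slide's rules only; `restart_succeeds` stands in for the outcome of the single restart attempt, and the model assumes the witness vote is the deciding vote.

```python
from enum import Enum


class WitnessState(Enum):
    ONLINE = "Online"
    FAILED = "Failed"
    OFFLINE = "Offline"


def quorum_kept_when_witness_needed(state: WitnessState, restart_succeeds: bool) -> bool:
    """Whether quorum survives when the witness vote is needed (hypothetical model).

    restart_succeeds models the single restart attempt the cluster makes for a
    Failed File Share Witness resource; it is ignored for the other states.
    """
    if state is WitnessState.ONLINE:
        return True
    if state is WitnessState.FAILED:
        # One attempt to bring the resource online; failure means quorum is lost.
        return restart_succeeds
    # Offline: the cluster does not try to restart it, so quorum is lost.
    return False
```

The asymmetry is the point: a Failed witness gets one more chance, an Offline one does not.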

  12. Witness Server • No pre-configuration typically necessary • Exchange Trusted Subsystem must be a member of the local Administrators group on the Witness Server if the Witness Server is not running Exchange 2010 • Must be in the same Active Directory forest as the DAG • Can be Windows Server 2003 or later • File and Printer Sharing for Microsoft Networks must be enabled • Replicating the witness directory/share with DFS is not supported • Not necessary to cluster the Witness Server • If you do cluster the Witness Server, you must use Windows Server 2008 • A single witness server can be used for multiple DAGs • Each DAG requires its own unique directory/share

  13. DAG Networks

  14. DAG Networks • A DAG network is a collection of subnets • There are two types of DAG networks • MAPI Network - connects DAG members to network resources (Active Directory, other Exchange servers, DNS, etc.) • Registered in DNS / DNS configured • Uses default gateway • Client for Microsoft Networks/File and Print Sharing enabled • Replication Network - used for/by continuous replication only (log shipping and seeding) • Not registered in DNS / DNS not configured • Typically no default gateway • Client for Microsoft Networks/File and Print Sharing disabled

  15. DAG Networks • All DAGs must have: • Exactly one MAPI network • Zero or more Replication networks • Separate network(s) on separate subnet(s) • LRU determines which replication network is used when there are multiple replication networks • DAG networks are automatically created when a Mailbox server is added to the DAG • Initially created DAG networks are based on the cluster’s enumeration of networks • Cluster enumeration is based on subnet • One cluster network is created for each subnet
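The two structural rules (exactly one MAPI network, any number of replication networks) and the LRU choice can be sketched in Python. The slide does not describe how Exchange implements LRU, so the picker below is a generic least-recently-used sketch, and the network names are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class DagNetwork:
    name: str
    is_mapi: bool  # True for the MAPI network, False for a replication network


def validate(networks):
    """A DAG must have exactly one MAPI network and zero or more replication networks."""
    mapi_count = sum(1 for n in networks if n.is_mapi)
    if mapi_count != 1:
        raise ValueError(f"expected exactly one MAPI network, found {mapi_count}")


class LruReplicationPicker:
    """Generic least-recently-used choice among replication networks (illustrative)."""

    def __init__(self, networks):
        validate(networks)
        self._order = OrderedDict((n.name, n) for n in networks if not n.is_mapi)

    def pick(self):
        if not self._order:
            return None  # no dedicated replication network in this model
        name = next(iter(self._order))  # front of the order = least recently used
        self._order.move_to_end(name)   # mark it as most recently used
        return name


nets = [DagNetwork("MapiNet", True), DagNetwork("ReplA", False), DagNetwork("ReplB", False)]
picker = LruReplicationPicker(nets)
print([picker.pick() for _ in range(3)])  # ['ReplA', 'ReplB', 'ReplA']
```

With two replication networks, LRU behaves like round-robin; it differs only when some network is skipped or used out of turn.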

  16. DAG Networks

  17. DAG Networks

  18. DAG Networks • To collapse subnets into two DAG networks and disable replication for the MAPI network:
      Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork01 -Subnets 192.168.0.0,192.168.1.0 -ReplicationEnabled:$false
      Set-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork02 -Subnets 10.0.0.0,10.0.1.0
      Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork03
      Remove-DatabaseAvailabilityGroupNetwork -Identity DAG2\DAGNetwork04

  19. DAG Networks

  20. DAG Networks • Automatic network detection occurs only when members are added to the DAG • If networks are added after a member is added, you must perform discovery • Set-DatabaseAvailabilityGroup -DiscoverNetworks • DAG network configuration is persisted in the cluster registry • HKLM\Cluster\Exchange\DAG Network • DAG networks include built-in encryption and compression • Encryption: Kerberos SSP EncryptMessage/DecryptMessage APIs • Compression: Microsoft XPRESS, based on the LZ77 algorithm • DAGs use a single TCP port for replication and seeding • Default is TCP port 64327 • If you change the port, you must manually update the Windows Firewall rules
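After changing the replication port or the firewall rules, a plain TCP probe from one DAG member to another is a quick sanity check. This Python sketch only confirms that something accepts TCP connections on the port; it does not confirm that the listener is the Microsoft Exchange Replication service. The host name in the usage line is hypothetical.

```python
import socket

DEFAULT_REPLICATION_PORT = 64327  # Exchange 2010 DAG default, per the slide


def replication_port_reachable(host: str, port: int = DEFAULT_REPLICATION_PORT,
                               timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timeout), or name resolution failure.
        return False


# Usage (hypothetical member name):
# replication_port_reachable("MBX2")
```

A timeout usually points at a firewall dropping packets, while an immediate refusal usually means nothing is listening on that port.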

  21. Active Manager

  22. Active Manager • Exchange component that manages *overs (switchovers and failovers) • Runs on every server in the DAG • Selects the best available copy on failovers • Is the definitive source of information on where a database is active • Stores this information in the cluster database • Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport)

  23. Active Manager • Active Manager roles • Standalone Active Manager • Primary Active Manager (PAM) • Standby Active Manager (SAM) • Active Manager client runs on CAS and Hub

  24. Active Manager • Primary Active Manager (PAM) • Runs on the node that owns the cluster core resources (cluster group) • Gets topology change notifications • Reacts to server failures • Selects the best database copy on *overs • Detects failures of local Information Store and local databases

  25. Active Manager • Standby Active Manager (SAM) • Runs on every other node in the DAG • Detects failures of local Information Store and local databases • Reacts to failures by asking PAM to initiate a failover • Responds to queries from CAS/Hub about which server hosts the active copy • Both roles are necessary for automatic recovery • If the Microsoft Exchange Replication service is stopped, automatic recovery will not happen
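The PAM/SAM split can be modeled in a few lines of Python. This is a hypothetical sketch of the rules on the last three slides, not Active Manager's code. It relies on the fact that Active Manager runs inside the Microsoft Exchange Replication service, and it simplifies "both roles are necessary" to "the service must be running on both the detecting member and the PAM". Server names are hypothetical.

```python
def active_manager_role(server, dag_members, cluster_group_owner):
    """Standalone outside a DAG; PAM on the member that owns the cluster core
    resources (cluster group); SAM on every other DAG member."""
    if server not in dag_members:
        return "Standalone"
    if server == cluster_group_owner:
        return "PAM"
    return "SAM"


def failover_initiator(detecting_server, dag_members, cluster_group_owner,
                       replication_running):
    """Who initiates automatic recovery after detecting_server sees a local failure.

    A SAM asks the PAM to fail over; the PAM acts itself. If the Replication
    service is stopped on the detecting member or the PAM, return None.
    """
    if active_manager_role(detecting_server, dag_members, cluster_group_owner) == "Standalone":
        return None
    for server in (detecting_server, cluster_group_owner):
        if not replication_running.get(server, False):
            return None  # no automatic recovery in this simplified model
    return cluster_group_owner


members = {"MBX1", "MBX2", "MBX3"}
print(active_manager_role("MBX2", members, "MBX1"))  # SAM
print(failover_initiator("MBX2", members, "MBX1", {"MBX1": True, "MBX2": True}))  # MBX1
```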

  26. Best Copy Selection

  27. Best Copy Selection • Process of finding the best copy to activate for an individual database given a list of status results of potential copies for activation • Active Manager selects the “best” copy to become the new active copy when the existing active copy fails • Any servers that are unreachable or “activation blocked” are ignored • Behavior difference between RTM and SP1 • List of potential passive copies is sorted differently when AutoDatabaseMountDial is set to Lossless

  28. Best Copy Selection – RTM • Sorts potential passive copies by copy queue length to minimize data loss, using activation preference as a secondary sorting key if necessary • Selects from the sorted list based on which set of criteria each copy meets • Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy

  29. Best Copy Selection – SP1 • Sorts potential passive copies by activation preference when AutoDatabaseMountDial is set to Lossless • Otherwise, sorts copies by copy queue length, with activation preference used as a secondary sorting key if necessary • Selects from the sorted list based on which set of criteria each copy meets • Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy
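The RTM/SP1 difference is only in the sort key. This Python sketch models the ordering step alone. The copy data is invented to reproduce the orderings on the example slides that follow, and `CopyStatus` and `candidate_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CopyStatus:
    server: str
    activation_preference: int  # lower value = more preferred
    copy_queue_length: int      # logs not yet copied to this passive copy
    reachable: bool = True
    activation_blocked: bool = False


def candidate_order(copies, version, mount_dial):
    """Order copies for activation using the RTM/SP1 rules from the slides."""
    # Unreachable or activation-blocked servers are ignored.
    eligible = [c for c in copies if c.reachable and not c.activation_blocked]
    by_preference = version == "SP1" and mount_dial == "Lossless"

    def sort_key(c):
        if by_preference:
            return (c.activation_preference,)
        # Copy queue length first; activation preference breaks ties.
        return (c.copy_queue_length, c.activation_preference)

    return [c.server for c in sorted(eligible, key=sort_key)]


# Invented numbers chosen to reproduce the orderings on the example slides.
copies = [
    CopyStatus("Server1", 1, 0, reachable=False),  # failed active copy
    CopyStatus("Server2", 2, 5),
    CopyStatus("Server3", 3, 2),
    CopyStatus("Server4", 4, 30),
]
print(candidate_order(copies, "RTM", "Lossless"))  # ['Server3', 'Server2', 'Server4']
print(candidate_order(copies, "SP1", "Lossless"))  # ['Server2', 'Server3', 'Server4']
```

Sorting by activation preference under Lossless means the administrator's preferred copy is tried first, and ACLL is relied on to close any log gap.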

  30. Best Copy Selection

  31. Best Copy Selection – RTM • Four copies of DB1 • DB1 currently active on Server1 [Diagram: Server1, Server2, Server3, and Server4 each host a copy of DB1; Server1 is marked as failed]

  32. Best Copy Selection – RTM • Sort list of available copies by copy queue length (using activation preference (AP) as a secondary sort key if necessary): • Server3\DB1 • Server2\DB1 • Server4\DB1

  33. Best Copy Selection – RTM • Only two copies meet the first set of criteria for activation (copy queue length (CQL) < 10; replay queue length (RQL) < 50; content index (CI) = Healthy): • Server3\DB1 • Server2\DB1 • Server4\DB1 [Diagram callout: lowest copy queue length – tried first]
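Applying the first criteria set to the already sorted list keeps the sort order and drops copies that fail any test. The queue lengths below are invented (the slide does not give them), and later passes with other criteria sets, which these slides do not list, are not modeled.

```python
from dataclasses import dataclass


@dataclass
class CopyHealth:
    server: str
    copy_queue_length: int    # CQL
    replay_queue_length: int  # RQL
    content_index: str        # CI state, e.g. "Healthy"


def meets_first_criteria(c):
    """First criteria set from the slide: CQL < 10, RQL < 50, CI = Healthy."""
    return (c.copy_queue_length < 10
            and c.replay_queue_length < 50
            and c.content_index == "Healthy")


def first_pass(sorted_copies):
    """Filter an already sorted list, preserving its order."""
    return [c.server for c in sorted_copies if meets_first_criteria(c)]


# Invented values, listed in the RTM copy-queue-length order from the slide.
ranked = [
    CopyHealth("Server3", 2, 10, "Healthy"),
    CopyHealth("Server2", 5, 20, "Healthy"),
    CopyHealth("Server4", 30, 5, "Healthy"),
]
print(first_pass(ranked))  # ['Server3', 'Server2']
```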

  34. Best Copy Selection – SP1 • Four copies of DB1 • DB1 currently active on Server1 • AutoDatabaseMountDial set to Lossless [Diagram: Server1, Server2, Server3, and Server4 each host a copy of DB1; Server1 is marked as failed]

  35. Best Copy Selection – SP1 • Sort list of available copies by activation preference: • Server2\DB1 • Server3\DB1 • Server4\DB1 [Diagram callout: lowest preference value – tried first]

  36. Best Copy Selection (RTM and SP1) • After Active Manager determines the best copy to activate • The Replication service on the target server attempts to copy missing log files from the source (ACLL) • If successful, then the database will mount with zero data loss • If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting • If data loss is outside of dial setting, next copy will be tried
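The mount decision after ACLL can be sketched as a walk over the ordered candidates. The thresholds are the documented Exchange 2010 AutoDatabaseMountDial values (Lossless 0, GoodAvailability 6, BestAvailability 12 logs); treating "logs still missing after ACLL" as the quantity compared against the dial is a simplification, and `choose_copy_to_mount` is a hypothetical name.

```python
# Exchange 2010 AutoDatabaseMountDial thresholds (number of logs that may be lost).
MOUNT_DIAL_MAX_LOST_LOGS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def choose_copy_to_mount(candidates, mount_dial):
    """Walk candidates in best-copy order and return the first one that may mount.

    candidates: (server, logs_missing_after_acll) pairs. A value of 0 means ACLL
    recovered every log, so the mount is lossless. Returns None if no copy fits.
    """
    limit = MOUNT_DIAL_MAX_LOST_LOGS[mount_dial]
    for server, missing in candidates:
        if missing <= limit:
            return server
        # Data loss is outside the dial setting: try the next copy.
    return None


print(choose_copy_to_mount([("Server3", 3), ("Server2", 0)], "Lossless"))          # Server2
print(choose_copy_to_mount([("Server3", 3), ("Server2", 0)], "GoodAvailability"))  # Server3
```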

  37. Best Copy Selection (RTM and SP1) • After Active Manager determines the best copy to activate • The mounted database will generate new log files (using the same log generation sequence) • Transport Dumpster requests will be initiated for the mounted database to recover lost messages • When original server or database recovers, it will run through divergence detection and either perform an incremental resync or require a full reseed

  38. Improvements in Service Pack 1 Replication and Copy Management enhancements in SP1

  39. Improvements in Service Pack 1 • Continuous replication changes • Enhanced to reduce data loss • Eliminates log drive as single point of failure • Automatically switches between modes: • File mode (original, log file shipping) • Block mode (enhanced log block shipping) • Switching process: • Initial mode is file mode • Block mode triggered when target needs Exx.log file • All healthy passives processed in parallel • File mode triggered when block mode falls too far behind (e.g., copy queue length > 0)
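The switching rules can be expressed as a two-state machine. This Python sketch is a hypothetical simplification of the slide, not the Replication service's logic: "needs Exx.log" is modeled as a boolean (a copy that asks for the current log already has every closed log), and "too far behind" uses the slide's example threshold of copy queue length > 0.

```python
from enum import Enum


class Mode(Enum):
    FILE = "file"    # ship whole, closed log files
    BLOCK = "block"  # ship log blocks as they are written


def next_mode(mode: Mode, needs_current_log: bool, copy_queue_length: int) -> Mode:
    """Simplified switching rule from the slide (hypothetical model).

    - Replication starts in file mode.
    - Switch to block mode when the target needs the current log (Exx.log).
    - Fall back to file mode when block mode falls behind (CQL > 0).
    """
    if mode is Mode.FILE and needs_current_log:
        return Mode.BLOCK
    if mode is Mode.BLOCK and copy_queue_length > 0:
        return Mode.FILE
    return mode
```

Each healthy passive copy would run this independently, which matches "all healthy passives processed in parallel".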

  40. Continuous Replication [Diagram comparing Continuous Replication – File Mode and Continuous Replication – Block Mode, showing log files 1 through 7 and Exx.log on the active and passive sides. Callouts: "Send me the latest log files … I have log 2"; ESE Log Buffer; Replication Log Buffer; Database copy up to date; Log is built and inspected; Log fragment detected and converted to complete log]

  41. Improvements in Service Pack 1 • SP1 introduces RedistributeActiveDatabases.ps1 to keep database copies balanced across DAG members • Moves databases to the most preferred copy • If cross-site, tries to balance between sites • Target-less switchover altered for stronger activation preference affinity • First pass of best copy selection is sorted by activation preference, not copy queue length • This trades a potentially longer activation time for a more even distribution of active databases: the selected copy might have more logs to replay, but databases end up better distributed
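The "move databases to the most preferred copy" step can be sketched as computing a move list. This Python model is hypothetical and covers only that step: it does not reproduce RedistributeActiveDatabases.ps1, and the cross-site balancing the script also attempts is not modeled. Database and server names are hypothetical.

```python
def redistribution_moves(databases, healthy_servers):
    """Return {database: target} for databases not active on their most
    preferred healthy copy.

    databases maps a name to {"active": server, "preferences": {server: activation_preference}}.
    """
    moves = {}
    for db, info in sorted(databases.items()):
        prefs = info["preferences"]
        candidates = [s for s in prefs if s in healthy_servers]
        if not candidates:
            continue
        best = min(candidates, key=prefs.get)  # lowest preference value wins
        if best != info["active"]:
            moves[db] = best
    return moves


# Hypothetical two-member DAG after MBX1 has returned from an outage.
dbs = {
    "DB1": {"active": "MBX2", "preferences": {"MBX1": 1, "MBX2": 2}},
    "DB2": {"active": "MBX2", "preferences": {"MBX2": 1, "MBX1": 2}},
}
print(redistribution_moves(dbs, {"MBX1", "MBX2"}))  # {'DB1': 'MBX1'}
```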

  42. Improvements in Service Pack 1 • *over Performance Improvements • In RTM, a *over immediately terminated replay on the copy that was becoming active, and the mount operation performed the necessary log recovery • In SP1, a *over drives the database to a clean shutdown by replaying all logs on the passive copy, so no recovery is required on the new active copy

  43. Improvements in Service Pack 1 • DAG Maintenance Scripts • StartDAGServerMaintenance.ps1 • It runs Suspend-MailboxDatabaseCopy for each database copy hosted on the DAG member • It pauses the node in the cluster, which prevents it from being or becoming the PAM • It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Blocked • It moves all active databases currently hosted on the DAG member to other DAG members • If the DAG member currently owns the default cluster group, it moves the default cluster group (and therefore the PAM role) to another DAG member

  44. Improvements in Service Pack 1 • DAG Maintenance Scripts • StopDAGServerMaintenance.ps1 • It runs Resume-MailboxDatabaseCopy for each database copy hosted on the DAG member • It resumes the node in the cluster, which enables full cluster functionality for the DAG member • It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Unrestricted
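The two maintenance scripts are easiest to compare as state changes on one member. The Python model below is hypothetical: `MemberState` and its fields are illustrative stand-ins for the settings the slides list, not properties of any Exchange object, and it moves everything to a single other member for simplicity. The stop slide does not mention moving databases back, so the model leaves them where they are.

```python
from dataclasses import dataclass, field


@dataclass
class MemberState:
    """Hypothetical model of the per-member settings the SP1 scripts change."""
    name: str
    copies_suspended: bool = False
    node_paused: bool = False
    auto_activation_policy: str = "Unrestricted"
    active_databases: list = field(default_factory=list)
    owns_cluster_group: bool = False


def start_maintenance(member: MemberState, others: list) -> None:
    member.copies_suspended = True             # Suspend-MailboxDatabaseCopy per copy
    member.node_paused = True                  # cannot be or become the PAM
    member.auto_activation_policy = "Blocked"  # DatabaseCopyAutoActivationPolicy
    target = others[0]                         # simplification: a single target member
    target.active_databases.extend(member.active_databases)
    member.active_databases.clear()            # all active databases moved away
    if member.owns_cluster_group:              # default cluster group (and PAM role) moves
        member.owns_cluster_group = False
        target.owns_cluster_group = True


def stop_maintenance(member: MemberState) -> None:
    member.copies_suspended = False                 # Resume-MailboxDatabaseCopy per copy
    member.node_paused = False                      # node resumed in the cluster
    member.auto_activation_policy = "Unrestricted"
```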

  45. Improvements in Service Pack 1 • Exchange Management Console enhancements in SP1 • Manage DAG IP addresses • Manage witness server/directory and alternate witness server/directory

  46. Session Evaluations Tell us what you think, and you could win! All evaluations submitted are automatically entered into a daily prize draw*  Sign-in to the Schedule Builder at http://europe.msteched.com/topic/list/ * Details of prize draw rules can be obtained from the Information Desk.

  47. © 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
