Role-Based High Availability with Exchange 2007

Jim McBee http://www.ithicos.com

Who is Jim McBee!!?? • Consultant, Writer, MCSE, MVP and MCT – Honolulu, Hawaii • Principal clients (Dell, Microsoft, SAIC, Servco Pacific) • Author – Exchange 2003 Advanced Administration (Sybex) • Contributor – Exchange and Outlook Administrator • Blog: http://mostlyexchange.blogspot.com • http://www.directory-update.com

Agenda • High availability versus fault tolerance • Resiliency versus high availability • Server roles • Providing higher availability • Continuous replication technologies

Fault tolerance • Designing and building a server that is resistant to failure • All servers should be fault tolerant • RAID disks • ECC memory • Redundant power supplies • UPS systems • Active Directory and DNS

High availability • Components of your system that allow quicker recovery from a failure • Examples include… • Clustering • Load balanced hosts • Built-in redundancy or load balancing • DNS / application redundancy or load balancing

Resiliency • Solutions that allow for continuity of operations • Recovery in the event of a serious disaster • Not solutions that are invoked when applying a service pack or riding out a quick power outage • Usually not automatic failover • Examples include… • Standby Continuous Replication • Local Continuous Replication

Server roles

Roles configured at installation • Simplify installation • Optimize the server for the jobs it performs • Increase availability through the most efficient and economical means • Manage the servers more intuitively

Exchange 2007 Server Roles • By defining well-described roles, we can: • Remove unnecessary functionality • Reduce the attack surface • Benefit: optimized server performance • Benefit: reduced exposure in the perimeter

Server Roles 1/5: Edge Transport • Must be on its own separate physical machine • No other roles installed • May be a workgroup member or joined to an Active Directory domain • Uses Active Directory Application Mode (ADAM) for configuration and recipient information • Perimeter policy enforcement • Message hygiene • Anti-spam • Transport anti-virus • Not required (optional role)

Server Roles 2/5: Client Access Server (CAS) • Supports Outlook Web Access, Exchange ActiveSync, Outlook Anywhere (formerly RPC/HTTPS), POP3 and IMAP4 protocols, Autodiscover, Availability, and Web services • At least one CAS in each Active Directory site and domain where mailbox servers exist • Requires a good network connection (low latency) to mailbox servers • Uses RPC communication to the mailbox server

Server Roles 3/5: Hub Transport • Handles message delivery and routing (see EX03) • Applies policies to incoming and outgoing mail • Can handle message hygiene functions • Reduces cost and complexity • Provides more predictable routing • Reduces downtime

Server Roles 4/5: Mailbox • Responsible for serving mailbox databases and public folders • Mailbox access through MAPI • Possible to require MAPI encryption • Possible to run without public folders

Server Roles 5/5: Unified Messaging • Placed in the protected corporate network • Requires that Mailbox and Hub Transport roles exist • Check with your phone vendor to see if their phone system will work with the UM server • May require a PBX gateway

Things to Consider • Interdependencies • Mailbox servers require the Hub Transport role for message delivery – even to the same database • The CAS role provides OWA, ActiveSync, RPC over HTTP, the Availability Service, Autodiscover, and more • The Edge role requires a Hub Transport server • Fault tolerance • Mailbox servers can only talk to Hub Transport servers in the same Active Directory site • Mailbox servers will talk to Hubs on the same server before other Hubs in the same Active Directory site • For proxy and redirect scenarios, CAS connects to the "best" CAS • CAS is not the same as front-end (FE) servers

High availability

Focus on Availability and Resiliency • Improve data availability and resiliency • Protect mailbox data from failures and corruptions • Reduce time required to restore mailbox data • Provide data redundancy • Service availability • Make mailbox data more available • Make cluster failover less painful • Make cluster management easier • Support for ‘stretch’ or ‘geo-clusters’ • Allow large mailboxes inexpensively

Hub Transport Server High Availability Options • Use redundant hardware • Automatically load balanced and redundant with multiple Hub Transport servers • Inbound SMTP mail • Direct delivery to Hub Transport from Internet • Direct delivery to Hub Transport from 3rd party SMTP system • Load balancing • Third party load balancing • Windows Network Load Balancing (NLB) • Server failure will result in failure of current connections • May result in some data loss for any messages in the Hub Transport server queue database

Client Access Server High Availability Options • Redundant hardware • Windows NLB or third party load balancing • Round robin DNS (not the best solution) • Server failure will result in current connections being lost • User may need to re-establish connection

Unified Messaging Server High Availability Options • Redundant hardware • Windows NLB or third party load balancing • Round robin DNS • PBX or gateway redundancy • Some PBXs may have load balancing options for multiple UM servers • Server failure will result in the loss of any current connections or call transfers in progress

Mailbox Server High Availability and Resiliency Options • Resiliency and recoverability • Local continuous replication (LCR) • Standby continuous replication (SCR), requires Exchange 2007 SP1 • High availability • Cluster continuous replication (CCR) • Single copy clusters (SCC) • CCR and SCC require dedicated servers • No other roles can exist on a clustered node except Mailbox • Other roles must be on their own hardware • Changes to transaction log files • 1MB in size • Log file is completely written after 15 minutes • Checkpoint depth is still 20MB / storage group

Single Copy Clusters (SCC) • Requires Microsoft Cluster Services • Benefits • Improved Exchange cluster setup • Traditional clustering used today • Failovers use the same data copy • Exchange Virtual Server = Clustered Mailbox Server • 2 to 8 node Active / Passive clusters • Diagram: shared storage holding a single copy of the logs and database (labels Q, MB, Logs, DB)

SCC Caveats • Requires expensive hardware with shared storage • Can be complicated for admins to learn • Doesn’t protect from storage/data issues • Servers must be on same IP subnet • Data redundancy provided through partners • Hardware must be in the Windows Server Catalog

Local Continuous Replication • Additional copy of the logs and database • On the same server • On a different volume • Benefits • Easy configuration • Single datacenter • Doesn’t require expensive hardware • Online backups • Very quick restoration of service • Caveats • Adds additional CPU/memory/disk overhead • Initial seeding required • Manual activation • Additional storage requirements • One database per storage group

Local Continuous Replication (diagram) • Enable LCR • Copy and verify logs from D:\SG1\Logs (E00.log, E0000000011.log, E0000000012.log) to D:\SG1\Copy\Logs (E0000000011.log, E0000000012.log) • Advance database by playing logs • Updated database
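
The file names on this slide follow a prefix-plus-generation pattern: E00.log is the log currently being written, and files such as E0000000011.log are closed generations. A minimal Python sketch of picking which closed logs still need to be copied is shown here. It assumes an eight-hex-digit generation suffix (as the names on the slide suggest), and the helper names are hypothetical, not part of Exchange.

```python
import re

# Assumption: closed logs are named <prefix><8 hex digits>.log (e.g. E0000000011.log)
# and the log still being written is <prefix>.log (e.g. E00.log).
CLOSED_LOG = re.compile(r"^(E\d{2})([0-9A-Fa-f]{8})\.log$")


def generation(file_name):
    """Return the generation number of a closed log, or None for any other file."""
    match = CLOSED_LOG.match(file_name)
    return int(match.group(2), 16) if match else None


def logs_to_copy(source_files, last_copied):
    """Closed logs newer than the last copied generation, oldest first."""
    closed = [(generation(name), name) for name in source_files]
    pending = [(gen, name) for gen, name in closed
               if gen is not None and gen > last_copied]
    return [name for gen, name in sorted(pending)]


files = ["E00.log", "E0000000012.log", "E0000000011.log"]
print(logs_to_copy(files, last_copied=0x10))
```

The open log (E00.log) is skipped on purpose: in this sketch only closed generations are eligible for copying.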

Local Continuous Replication Tips • One database per storage group • Plan for additional hardware resources • Minimum 20% additional CPU overhead • Additional 1GB of RAM • Will more than double IOPS requirements • Maximum database size approximately 200GB • Separate storage into LUNs • Do not break LUNs into separate partitions • Put each database on a separate LUN • Isolate active and passive LUNs • Use battery backed up storage controllers • Configure caching controllers for 75% write / 25% read • LCR activation is manual • Use Restore-StorageGroupCopy cmdlet • Use backup copy “in place” or move it
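
The resource tips above amount to simple planning arithmetic. A rough Python sketch follows. It assumes the 20% CPU figure is relative to the current load and treats "more than double" IOPS as a 2x floor; the function name and sample numbers are illustrative only.

```python
def lcr_minimums(cpu_percent, ram_gb, iops):
    """Lower-bound resource figures after enabling LCR, per the tips on the slide."""
    return {
        "cpu_percent": cpu_percent * 120 / 100,  # at least 20% additional CPU (assumed relative)
        "ram_gb": ram_gb + 1,                    # additional 1GB of RAM
        "iops": iops * 2,                        # "more than double": 2x is only a floor
    }


plan = lcr_minimums(cpu_percent=40, ram_gb=8, iops=1500)
print(plan)
```

These are floors, not targets; real sizing should come from measuring the server under load.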

Local continuous replication demo

Cluster Continuous Replication (CCR) • Benefits • Potentially no single point of failure • Two copies of the data on separate servers • No need for shared / SAN storage • Full redundancy with automatic recovery • Back up mailboxes without disturbing production • Doesn’t require validation for clustered configuration • Diagram: two nodes, each with its own DB and logs, plus a file share witness (KB 921181)

CCR Advantages • No single point of failure • Fast recovery • Simplified hardware and storage requirements • Simplified deployment • Out-of-the-box replication solution • Can “stretch” the cluster to a second data center • Ability to offload VSS-based backups to passive node • Can integrate with SCR

CCR Caveats • Requires Microsoft Cluster Services • Majority Node Set cluster • Requires a third “voting” node - uses a shared folder • Two-node, Active/Passive only • Backup • Streaming backup against production storage groups • VSS backup against production and replica storage groups • Limit of one database per storage group • Can be used for a PF database if it is the only PF database in the organization • Initial database seeding required • Servers must be on same IP subnet • Transaction logs pulled over SMB shares • Some scenarios require log validation and replay • Database failure does not cause failover
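
The third "voting" node matters because a majority-based cluster only stays up while more than half of its voters can see each other. A minimal Python sketch of that majority rule, assuming three voters (two nodes plus the shared-folder witness) and a hypothetical function name:

```python
def has_quorum(voters_online, total_voters=3):
    """Majority rule: more than half of all voters must be reachable."""
    return voters_online > total_voters // 2


# Two nodes plus the witness share: any two of the three voters keep the cluster up.
print(has_quorum(2))  # one node + witness, or both nodes
print(has_quorum(1))  # a single isolated node
```

With only two voters, losing either one would drop the cluster below a majority, which is why the witness is needed.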

Standby Continuous Replication • Replication to a standby server • Coming in Service Pack 1 • Source and target machines can be • Stand-alone • In two different MSCS clusters • On different subnets • Controlled per storage group • Many-to-one and one-to-many supported • Manually activated

LCR versus CCR versus SCR • LCR • Focused towards resiliency • Improve restore time • Administrator has to initiate restore manually • Single data-center solution • Implements log shipping and replay out of the box • Log files are copied locally and replayed • CCR • Targeted towards site resiliency • Automatic failovers • Single or two-data center solution • Supports “stretch” option • Implements log shipping and replay out of the box • Log files are copied to remote server and replayed • Simplifies cluster deployment • No SAN or shared storage • SCR • Provides site and server resiliency • “Cold spare” approach cuts hardware costs • Can be combined with LCR, CCR, and SCC for maximum flexibility

Continuous Replication Basics • Exchange store runs normally • Replication service keeps a copy of the database up-to-date • Copies, inspects, and replays log files • In CCR, Cluster service provides failover • Moves network identity (client transparency) • LCR activation is manual • Restore-StorageGroupCopy task

Continuous Replication Basics • A ‘pull’ model • Exchange server creates log files normally • Log files are copied by Replication service • Exxnnnnnnnn.log files copied as they appear • Exx.log is copied for handoff/failover • If it can’t be copied, the loss setting (AutoDatabaseMountDial) is consulted • Lossless (0 logs lost) • GoodAvailability (3 logs lost) • BestAvailability (6 logs lost – default setting)
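
The loss setting is essentially a threshold check. A minimal Python sketch of that decision, using the three values and log counts listed on the slide; the function name is hypothetical and the real mount logic may consider more than the log count:

```python
# Maximum number of lost logs tolerated for each setting, as listed on the slide.
MOUNT_DIAL = {"Lossless": 0, "GoodAvailability": 3, "BestAvailability": 6}


def can_auto_mount(logs_lost, setting="BestAvailability"):
    """True if the copy may mount automatically given how many logs were lost."""
    return logs_lost <= MOUNT_DIAL[setting]


print(can_auto_mount(4))                      # within the default tolerance
print(can_auto_mount(4, "GoodAvailability"))  # exceeds 3 logs, needs an administrator
```

A stricter setting trades availability for data protection: the database stays dismounted until an administrator decides what to do.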

Continuous Replication (diagram) • Store writes to SourceDB and the source log directory • Replication service copies logs to the Inspector directory (LastLogCopyNotified, LastLogCopied) • Replication service inspects logs and moves them to the target log directory (LastLogInspected) • Replication service replays logs into DBCopy (LastLogReplayed)

Continuous Replication Monitoring • LastLogCopyNotified: last generation seen in the source directory • LastLogCopied: last generation copied to Inspector directory by Replication service • LastLogInspected: last generation inspected and moved to log file directory • LastLogReplayed: last generation replayed into the database copy • Available through Performance Monitor
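
Because each counter is a log generation number, the gap between neighboring counters shows where replication is falling behind. A minimal Python sketch, assuming the four values have already been read (for example from Performance Monitor); the class and key names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class CopyStatus:
    last_log_copy_notified: int
    last_log_copied: int
    last_log_inspected: int
    last_log_replayed: int

    def backlog(self):
        """Number of log generations waiting at each stage of the pipeline."""
        return {
            "waiting_to_copy": self.last_log_copy_notified - self.last_log_copied,
            "waiting_to_inspect": self.last_log_copied - self.last_log_inspected,
            "waiting_to_replay": self.last_log_inspected - self.last_log_replayed,
        }


status = CopyStatus(120, 118, 118, 110)
print(status.backlog())
```

A growing copy backlog points at the network or source disk, while a growing replay backlog points at the target's disk or CPU.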

Divergence • When the copy has information not in the original, it is diverged • Divergence may be in database or log files • Lossy failover will produce a divergence • ‘Split-brain’ on a cluster also causes divergence • Even if clients can’t connect, background maintenance still modifies the database • Administrator error can cause divergence! e.g. running eseutil /r
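
The definition above (the copy holds something the original does not) can be illustrated with a simple comparison. This Python sketch is only a model: it assumes each side can be summarized as a map of log generation to checksum, which is a hypothetical representation and not how the Replication service actually detects divergence.

```python
def diverged_generations(source_logs, copy_logs):
    """Generations where the copy holds data the source does not.

    Both arguments map generation number -> checksum (hypothetical representation).
    """
    return sorted(
        gen for gen, checksum in copy_logs.items()
        if source_logs.get(gen) != checksum
    )


# After a lossy failover the old active node (now the copy) still has logs
# 12 and 13 that never reached the node that took over.
source = {10: "a1", 11: "b2", 12: "c3"}
copy = {10: "a1", 11: "b2", 12: "zz", 13: "dd"}
print(diverged_generations(source, copy))
```

Note that generation 12 exists on both sides but with different content, which is why comparing generation numbers alone would miss it.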

Recovering from Divergence • Re-seed will always work • Expensive for large databases • Look at the common case • Lossy failover • Only a few log files are lost • Built-in solutions • Decreased log file size to reduce data loss • Lost Log Resilience (LLR)

Transport Dumpster • Feature built into the Hub Transport server role • Runs to redeliver mail to clustered mailbox servers (CMS) in its site • Uses the creation time of the last log file copied • CCR only in RTM • Use Set-TransportConfig to change default settings (setting is organization-wide) • Set MaxDumpsterSizePerStorageGroup to 1.5 times the size of the maximum message that can be sent (default value is 18MB) • Recommend MaxDumpsterTime be 7.00:00:00, which is seven days (default value)
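
The sizing rule is simple arithmetic, sketched here in Python. The cmdlet and the two setting names come from the slide; passing them as parameters in exactly this form, and the 10MB maximum message size, are assumptions to check against the cmdlet's help before use. The helper names are hypothetical.

```python
import math


def dumpster_size_mb(max_message_mb):
    """1.5 times the largest message size, rounded up to a whole megabyte."""
    return math.ceil(max_message_mb * 1.5)


def transport_config_command(max_message_mb, retention="7.00:00:00"):
    """Build the Exchange Management Shell command line as a string (not executed)."""
    size = dumpster_size_mb(max_message_mb)
    return (
        "Set-TransportConfig"
        f" -MaxDumpsterSizePerStorageGroup {size}MB"
        f" -MaxDumpsterTime {retention}"
    )


print(transport_config_command(10))
```

Because the setting is organization-wide, the size should be based on the largest message any mailbox in the organization can send.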

Backups from Passive Database • Back up the active or the passive? • Backing up the passive moves the performance hit off the active • Remember, they can change designations • Passive backup is VSS only • Data Protection Manager v2 • Active backup can be VSS or streaming ESE

Questions? Thanks for attending!

Book giveaway and e-mail notice • Please give me a piece of paper with your name for drawing • Include your e-mail address or give me a business card if you want: • 20% discount code for Directory Update software • Notification e-mail when Mastering Exchange Server 2007 is available
