Real-World Site Resilience Design in Microsoft Exchange Server 2010

EXL327 Real-World Site Resilience Design in Microsoft Exchange Server 2010 • Robert Gillies • Solution Architect, US Public Sector • Microsoft Corporation

Agenda (the most honest agenda ever) • Make a quick joke • Talk about some stuff, hopefully at a technical level that everyone appreciates • Ask if there are any questions (especially ones that I know the answer to…) • Remind attendees to fill out the surveys!! • I really need to beat Scott Schnoll’s numbers, so I’m offering a *free* “Facebook Friendship with Ross Smith IV” to everyone who gives me top marks! • Thank attendees and wish them a great TechEd North America!

What is “Site Resilience”? • From TechNet: “A manual disaster recovery process used to recover from a complete site failure.” • Making sure that your data is in (a minimum of) two sites • Making sure that you can bring the second site online to provide services • Not necessarily only used for a “complete site failure” • Second site is expected to be an “active” site – a “hot” site • Can’t be servers not booted waiting for you to come over and turn them on • We need those servers up and accepting replication • Not automatic, but can be automated • This means that most of the process can be scripted, but since there are some fairly significant changes made (where users connect, how data flows, etc), we expect a human to make a decision to perform the activation

Technology Alone is NOT the Answer • I know you are here for the technology part, but I must make this clear… • Site Resilience, just like High Availability, requires People, Process and Technology • Technology is the EASY part! • What happens if you don’t properly train your people? • What happens if you don’t properly pay your people after you train them? • What happens when your people leave if you haven’t properly documented the processes for SR?

Requirements that Drive SR • My customers seem to all think they need SR for 100% of users • SR does add complexity to the system, and to the management of the system • 200,000 mailboxes, all with SR across 1800 miles / 2900 km • I did this with a customer, but first I tried to talk them out of it • What is the requirement that drives the need for SR? • What is the cost of SR vs the cost of lost functionality/data? • Cost of SR is not just the network and the servers and the other systems (BES, etc.) • Cost of your people and processes as well! • Is SR required for every user? • Could you just replicate the data for VIPs and have everyone else come up with a “dial tone” mailbox? • How many of you think Microsoft has SR for all mailboxes?

SR Technology in Exchange 2010 • SR in Exchange 2010 is not just about the DAG!! • Don’t worry – most of this presentation is about the DAG • CAS planning is more difficult than DAG planning • More on this in the next few slides • Remember all of the Exchange dependencies and 3rd-party systems • Active Directory • DNS • BES servers • Unified Messaging • Integrated FAX solutions • Compliance solutions

Overview of Robert’s Rules’ Exchange 2010

CAS Namespace and Certificate Planning • “The Big Three” • mail.robertsrules.ms • autodiscover.robertsrules.ms • legacy.robertsrules.ms • The other site • maillfh.robertsrules.ms • The “Failback” URL • failbackhsv.robertsrules.ms • failbacklfh.robertsrules.ms

SAN Certificate • SAN = “Subject Alt Name” • Also called a “UC Certificate” by vendors • Allows you to have multiple valid names on a single certificate • We will have the following names on our example certificate • Subject Name • mail.robertsrules.ms • Subject Alt Name • mail.robertsrules.ms • autodiscover.robertsrules.ms • legacy.robertsrules.ms • maillfh.robertsrules.ms • failbackhsv.robertsrules.ms • failbacklfh.robertsrules.ms • This one certificate will go on all Exchange servers and any other SSL endpoints
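One misspelled name on a SAN certificate means a certificate warning for every client that uses that name, so it is worth checking the list before you send the request to your vendor. Here is a minimal sketch in Python, not an Exchange tool: the function name `missing_san_names` and the draft list are hypothetical, and the required names simply restate the namespaces planned above.

```python
# Namespaces from the planning slide that the certificate must cover.
REQUIRED_NAMES = [
    "mail.robertsrules.ms",
    "autodiscover.robertsrules.ms",
    "legacy.robertsrules.ms",
    "maillfh.robertsrules.ms",
    "failbackhsv.robertsrules.ms",
    "failbacklfh.robertsrules.ms",
]


def missing_san_names(required, san_entries):
    """Return the required host names that are absent from the SAN list.

    The comparison is case-insensitive because DNS names are.
    Wildcard SAN entries are deliberately not handled in this sketch.
    """
    present = {name.strip().lower() for name in san_entries}
    return [name for name in required if name.lower() not in present]


# Hypothetical draft request with one typo ("robertsrule" is missing its "s").
draft_san = [
    "mail.robertsrules.ms",
    "autodiscover.robertsrule.ms",
    "legacy.robertsrules.ms",
    "maillfh.robertsrules.ms",
    "failbackhsv.robertsrules.ms",
    "failbacklfh.robertsrules.ms",
]

print(missing_san_names(REQUIRED_NAMES, draft_san))
# prints ['autodiscover.robertsrules.ms']
```

An empty list back means every planned namespace is covered by the draft.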

OA, The “ClientAccessArray”, and DNS • Since we have a single certificate in this case, and OA clients will be accessing a site with the URL of maillfh.robertsrules.ms, Outlook has to be told which name to expect on the certificate • Set-OutlookProvider EXPR -CertPrincipalName "msstd:mail.robertsrules.ms" • This applies to the entire Exchange organization! • The “ClientAccessArray” is the MAPI connection point • Used for the RpcClientAccessServer value on mailbox DBs • Robert’s Rules will use OutlookHSV, OutlookLFH • We’ll have DNS entries for all of these • Internally, the ClientAccessArray names and all certificate names point at the appropriate Load Balancing Cluster VIP • Externally, the names from the certificates point at the appropriate reverse proxy or Load Balancing Cluster VIP

ExternalURLs and FailbackURLs • Remember that we have multiple v-dirs to set these on • OWA – Set-OwaVirtualDirectory • This is also where we set our FailbackURL! • EWS – Set-WebServicesVirtualDirectory • EAS – Set-ActiveSyncVirtualDirectory • Etc. • In HSV • All ExternalURLs set to mail.robertsrules.ms • Don’t change any InternalURLs for OWA or ECP • Set InternalURLs for all other v-dirs to mail.robertsrules.ms • In LFH • All ExternalURLs set to maillfh.robertsrules.ms • Don’t change any InternalURLs for OWA or ECP • Set InternalURLs for all other v-dirs to maillfh.robertsrules.ms

Global Load Balancers / Global Traffic Managers • Scenario: Single Namespace Worldwide for OWA • Global DNS looks at client IP address, directs to closest OWA • “Works around” the idea of a re-authentication on redirect • If you went to Greg Taylor’s SP2 talk, you’ll see that this is planned as a fix in SP2 anyway • That’s right folks, silent redirect for OWA clients! FINALLY!!! • There are also some deployment architectures that would avoid the redirection altogether, but they require at least one more site, more Client Access servers, and more hardware load balancers • Scenario: RPC MAPI Outlook Access • Basically same thing – closest CAS based on client IP address • Recommendation is to connect to the CAS closest to the mailbox • Basically, this is an expensive, complex solution to a somewhat non-issue (that becomes a real non-issue with SP2)

Site Resilience of Data • First two questions to ask: • How many failures do you want to be able to sustain in normal operating conditions and not cause a site *over of some sort? • How many failures do you want to be able to sustain when you have lost a datacenter already? • This could drive the answer to some other questions: • How many servers do you need in each datacenter? • How many HA copies do you need in each datacenter? • What about lagged copies? • How many lagged copies do you need? • If there is a single lagged copy in a given datacenter, do you RAID it? • Where do you put your lagged copies? • What if there is only one? What if you lose that copy? What is that impact?

Maintaining Quorum - Scenarios

What If We Had “Extra” Voters?
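The voter arithmetic behind these scenarios is simple enough to sketch. This is a simplified model in Python, assuming the usual cluster rule that quorum needs a strict majority of configured voters and that a DAG with an even number of members also counts its file share witness. The function names and the 2+2 layout with the witness in HSV are illustrative assumptions, not something taken from the slides.

```python
def total_voters(dag_members):
    """A DAG with an even member count also counts the file share witness."""
    return dag_members + (1 if dag_members % 2 == 0 else 0)


def has_quorum(voters_up, voters_total):
    """Quorum needs a strict majority of all configured voters."""
    return voters_up > voters_total // 2


def failures_tolerated(dag_members):
    """How many voters can be lost while a strict majority still survives."""
    voters = total_voters(dag_members)
    return voters - (voters // 2 + 1)


# Assumed layout: 4-member DAG, 2 members plus the witness in HSV, 2 members in LFH.
voters = total_voters(4)              # 5 voters in total
hsv_votes, lfh_votes = 3, 2

print(has_quorum(hsv_votes, voters))  # LFH is lost, HSV keeps running: True
print(has_quorum(lfh_votes, voters))  # HSV is lost, LFH is left alone: False
print(failures_tolerated(4))          # 2
```

The second result is the reason a datacenter switchover is a manual process: the surviving minority cannot form quorum on its own. Plugging "extra" voters into the same functions shows how many additional failures each site could absorb.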

Sizing the Network • Exchange 2010 Mailbox Server Role Requirements Calculator • Tell the calculator… • …that you are doing a “Site Resilient Deployment” • …whether you are looking at “Active/Passive” or “Active/Active” • …what your RPO is in hours (RPO = How much data can be lost in site failure situation) • …your network link type (how fast is your network) • …your network latency (round trip latency, in milliseconds) • “Logs Generated / Hour Percentage” • This means “What percentage of all logs generated in 24 hours happened in this one hour period?” • How do you figure out this percentage? • Collectlogs VBS Script • Manually count log files and the hour that they happened • The calculator can determine how many logs will be generated per hour based on the percentages, and whether your replication will meet your RPO goals
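If you count log files by hand, the percentage math looks like the sketch below. It is an illustration in Python, not the calculator's actual logic: the function names are hypothetical, the sample counts are made up, and the bandwidth estimate ignores compression and protocol overhead. The one hard fact it relies on is that Exchange 2010 transaction log files are 1 MB each.

```python
from collections import Counter

LOG_FILE_MB = 1  # Exchange 2010 transaction log files are 1 MB each


def hourly_log_percentages(log_hours):
    """Map each hour (0-23) to its share of the day's log files, in percent.

    log_hours holds one entry per log file: the hour in which it was created.
    """
    counts = Counter(log_hours)
    total = sum(counts.values())
    if total == 0:
        return {hour: 0.0 for hour in range(24)}
    return {hour: 100.0 * counts.get(hour, 0) / total for hour in range(24)}


def required_mbps(logs_in_hour):
    """Average megabits per second needed to ship one hour of logs within that hour."""
    return logs_in_hour * LOG_FILE_MB * 8 / 3600.0


# Made-up sample: 1,000 logs in a day, clustered around three busy hours.
sample = [9] * 300 + [10] * 500 + [14] * 200
percentages = hourly_log_percentages(sample)

print(percentages[10])               # 50.0
print(round(required_mbps(500), 2))  # 1.11
```

The peak-hour percentage is the number the calculator asks for. The bandwidth figure is only a rough sanity check against your link type.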

Latency Impacts to the WAN • Latency has a large impact on network throughput • Maximum supported latency is 500ms round trip between DAG nodes • Higher latency raises the risk that your replication will not be up to date • DAGs with higher latency might require special tuning of DAG, replication and network parameters • For instance, you could add more databases • More databases means more TCP connections, which could give a higher overall throughput than fewer TCP connections
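The reason more TCP connections can help is that a single connection can never move more than one window of data per round trip. A rough sketch of that bound in Python follows. The 64 KB window is an assumption for illustration (real window sizes depend on OS tuning and window scaling), and the function name is hypothetical.

```python
def max_tcp_throughput_mbps(window_bytes, rtt_ms, connections=1, link_mbps=None):
    """Upper bound on aggregate TCP throughput in megabits per second.

    Each connection is limited to window / round-trip time. The total is
    optionally capped by the physical link speed.
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    per_connection_bps = window_bytes * 8 / (rtt_ms / 1000.0)
    total_mbps = connections * per_connection_bps / 1_000_000
    if link_mbps is not None:
        total_mbps = min(total_mbps, link_mbps)
    return total_mbps


WINDOW = 64 * 1024  # assumed 64 KB TCP window, for illustration only

print(round(max_tcp_throughput_mbps(WINDOW, 50), 2))                    # 10.49
print(round(max_tcp_throughput_mbps(WINDOW, 500), 2))                   # 1.05
print(round(max_tcp_throughput_mbps(WINDOW, 500, connections=10), 2))   # 10.49
```

Under this assumption, going from 50ms to 500ms cuts the per-connection ceiling by a factor of ten, and ten connections win it back, as long as the link itself has the capacity.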

How Many WAN Connections Do I Need? • We understand that not all customers will have multiple WAN connections between datacenters • We do support a DAG with a single network • We do support a DAG with two networks and the replication network over a single WAN shared with the MAPI network • Recommendation is that you use, at a minimum, VPN technologies or router ACLs to separate your MAPI traffic from your replication traffic • You don’t want cross talk between MAPI network NICs and replication network NICs • This applies no matter your physical network configuration – you don’t want your MAPI NICs able to resolve or access the replication NICs • When you have multiple networks, remember that the replication of your index data goes over the MAPI network

Active/Active vs Active/Passive • What do you mean by “Active/Active vs Active/Passive”? • DAGs? • Datacenters? • What if you have 2 datacenters connected at LAN speeds? • What if you have more than 2 datacenters? • None of this is “right” vs “wrong” – that is the wrong question – you should ask: • What is the impact to my site resilience stance? • Can I do something different and have a higher site resilience stance?

Two Datacenters, LAN Speed & Latency • Can you treat them as a “single datacenter”? • Can you treat this as a single AD site? • Single ClientAccessArray • Single set of URLs • What is the impact of failures? • Do you still need your extra voters, or will they play a role? • What about extra network traffic under normal conditions? • CAS access is “random” across either datacenter • HT selection is “random” across either datacenter • Fully half of your client access and mail delivery will traverse the network between datacenters – can you support this?
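The "fully half" figure falls out of simple probability. The sketch below is a back-of-the-envelope model in Python, assuming a request is handed to a uniformly random CAS and counts as cross-datacenter when that CAS is not in the same datacenter as the mailbox. The function name and the server counts are hypothetical.

```python
def cross_site_fraction(cas_a, cas_b, mailbox_share_a):
    """Expected fraction of CAS-to-mailbox traffic that crosses datacenters.

    cas_a, cas_b: number of CAS servers in datacenters A and B
    mailbox_share_a: fraction of active mailboxes hosted in A (0.0 to 1.0)
    """
    total = cas_a + cas_b
    if total == 0:
        raise ValueError("at least one CAS server is required")
    p_cas_a = cas_a / total
    p_cas_b = cas_b / total
    # Mailbox in A served by a CAS in B, plus mailbox in B served by a CAS in A.
    return mailbox_share_a * p_cas_b + (1 - mailbox_share_a) * p_cas_a


print(round(cross_site_fraction(4, 4, 0.5), 3))  # 0.5
print(round(cross_site_fraction(4, 4, 0.7), 3))  # 0.5
print(round(cross_site_fraction(6, 2, 1.0), 3))  # 0.25
```

With an even CAS split the answer is one half no matter where the mailboxes live. Only an uneven CAS split that matches where the databases are active changes it.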

Activating After a Datacenter Failure • DAC Mode – Database Activation Coordination • DAC Mode is designed to prevent “split brain syndrome” where two DAG members in separate datacenters mount the same DB • DACP (DAC Protocol) is a method whereby, when a DAG member is booted… • …the Active Manager comes up with a “mommy may I” bit set to 0 • …the Active Manager cannot mount a database until that bit is a 1 • …the Active Manager cannot set that bit to 1 until it contacts EVERY OTHER NODE of the DAG (actually it tries to contact every other node that is not marked as stopped, and it tries until it finds a node where the DACP bit is set to 1 or until every other non-stopped server has been contacted) • DAC Mode Activation consists of marking the “down” servers as “stopped” so that the DAG nodes can communicate with every other node that is not stopped • With SP1, even 2-node DAGs support DAC • If you have a DAG with nodes in more than 1 datacenter, then that DAG should be in DAC mode (SP1 only)
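The “mommy may I” rule is easier to see as a toy model. The Python below simulates only the decision described above and is not how Active Manager is implemented. The class, the function name and the four-server layout are hypothetical, and the special handling for 2-node DAGs is left out.

```python
from dataclasses import dataclass


@dataclass
class DagMember:
    name: str
    dacp_bit: int = 0        # always 0 when the member boots
    stopped: bool = False    # marked as stopped by the administrator
    reachable: bool = True   # can other members talk to it?


def try_set_dacp_bit(booting, members):
    """Apply the DACP rule for one booting member and report whether it may mount."""
    others = [m for m in members if m is not booting and not m.stopped]
    # Permission is granted by any reachable member whose bit is already 1...
    found_permission = any(m.reachable and m.dacp_bit == 1 for m in others)
    # ...or by successfully contacting every member that is not stopped.
    contacted_everyone = all(m.reachable for m in others)
    if found_permission or contacted_everyone:
        booting.dacp_bit = 1
    return booting.dacp_bit == 1


# HSV has failed. Both LFH members have just booted with their bits at 0.
hsv1 = DagMember("HSV1", reachable=False)
hsv2 = DagMember("HSV2", reachable=False)
lfh1 = DagMember("LFH1")
lfh2 = DagMember("LFH2")
dag = [hsv1, hsv2, lfh1, lfh2]

before = try_set_dacp_bit(lfh1, dag)   # HSV members cannot be contacted
hsv1.stopped = hsv2.stopped = True     # the administrator marks them stopped
after = try_set_dacp_bit(lfh1, dag)    # only LFH2 is left to contact

print(before, after)  # False True
```

Nothing in LFH mounts until a human marks the HSV members as stopped, which is exactly the decision point that keeps a network partition from turning into split brain.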

Why Is DAC Mode Always Recommended?
DAC Mode Activation • Mark the nodes from the failed datacenter as stopped • Stop-DatabaseAvailabilityGroup -ActiveDirectorySite <FailedSiteName> -ConfigurationOnly • Stop the cluster service on all remaining DAG nodes • net stop clussvc • Activate remaining DAG nodes • Restore-DatabaseAvailabilityGroup -ActiveDirectorySite <RemainingSiteName> • Activate the databases • Might be automatic, might need to remove activation blocks, etc.
Non-DAC Mode Activation • Force eviction of down nodes from cluster • net stop clussvc • cluster <DAGName> node <DAGMemberName> /forcecleanup • Stop the cluster service on all DAG nodes • net stop clussvc • On one DAG node, force quorum • net start clussvc /forcequorum • Modify quorum type • cluster <DAGName> /quorum /nodemajority • Set-DatabaseAvailabilityGroup <DAGName> -WitnessServer <ServerName> • Start the cluster service on remaining nodes • net start clussvc • Perform database switchovers • Move-ActiveMailboxDatabase -Server <DAGMemberInFailedSite> -ActivateOnServer <DAGMemberInSecondSite> • Mount the mailbox databases • Get-MailboxDatabase -Server <DAGMemberInSecondSite> | Mount-Database

Recovery Time Objective • Recovery Time Objective – How long it takes to get the services operational, and available to the user • Can you even start this clock until someone makes the decision that the datacenter is down? • I’m a big fan of the Montgomery Scott method of engineering • I tell them it will take 12 hours, and then when I get it done in an hour, I’m a hero! • What is really reasonable? • You have to activate the DAGs and get the databases mounted • You have to change the DNS records for the URLs (for Internet protocol access) and the ClientAccessArray (for MAPI access) • Set your TTL to a low value such as 5 minutes • Remember that IE has a 20 minute cache – that’s what the FailbackURL on the OWA virtual directory is for! • You can script some of the process, but you have to watch for errors and understand what the error messages mean so you can take action

Testing Site Resilience • Site Resilience is not “easy” • You must have the process documented • You must have people that understand how the DAG activation works and what to do when errors are encountered • You must practice this • Lab practice is good and can be done much more often than production without impacting users • Real world practice is a necessity • What if router ACLs in the secondary datacenter were changed for some reason? • What if the file server you were using for the Alternate FSW share has been retired? • What if a certificate in the secondary datacenter goes out of date? (NOTE: I had this recently!) • Maybe a schedule like: • Monthly failover exercises for each staff member in the test lab • Quarterly failover exercises in production with 25% of the staff involved in the activation of services each time – this guarantees that each staff member does this once per year • After-action meeting every single time – what went wrong and does it need to be documented or fixed in the process itself? • Real-world activation tests during a patch process? • Activate LFH, patch HSV, activate HSV, patch LFH, redistribute databases to normal operating designs

Testing While in Production • Assume we have designed for HSV to be the “primary” datacenter and LFH the “secondary” (only for failure scenarios) • But we take a small subset of users – say 10% of our databases – and run them out of LFH • Change the RpcClientAccessServer on those databases to OutlookLFH where all others are set to OutlookHSV • Don’t necessarily have to change the activation preferences, but you could – this would be an operations decision • Note that if you do want to change the activation preference to 1, when you do the Set-MailboxDatabaseCopy, the RpcClientAccessServer will be automatically set to OutlookLFH (SP1 only!)

Questions?

Resources • Connect. Share. Discuss. • http://northamerica.msteched.com • Learning • Sessions On-Demand & Community: www.microsoft.com/teched • Microsoft Certification & Training Resources: www.microsoft.com/learning • Resources for IT Professionals: http://microsoft.com/technet • Resources for Developers: http://microsoft.com/msdn

Complete an evaluation on CommNet and enter to win!

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
