Designing For Load

Designing For Load by Peter Hamlen Vice-President of Systems Mail.com

What does Mail.com do? • We provide “vanity email addresses”, like peter@mail.com, peter@mad.scientist.com, etc. • We provide a web-based email client, allowing you to read your mail on the web. • We also host mail services for corporations.

The Growth of Mail.com 3 years ago: • 100,000 signups, 5000 daily logins, 50,000 messages a day. Today: • 12 million signups, 1 million daily logins, 4 million messages a day.

What am I going to talk about? • The design of our system to store our users’ mail (known as the “mailstore”) • The choices we made. • The mistakes we made. • The evolution of our mailstore design to handle high-capacity.

Why talk about load? • Load directly affects our speed and reliability, the two most important features a website can have. • Fact: Load causes 95% of the failures we encounter.* • There are several tactics that allow you to dramatically reduce the effect load has on your system. * 83.7% of all statistics are made up. - Steven Wright

Load decisions can be tricky • It’s not always clear, especially in the low-traffic stages, how to design the system correctly for load. • Intuition is rarely accurate, unless you’re a genius.

Mail.com practical examples The point of these examples is the following: • I’m not qualified to make generalizations on the “right way” to design for load - we’re still learning... • But I can show you the mistakes that we made - and hopefully you can avoid them. • Some of the problems we’ve solved - on the others, the jury is still out.

Group Exercise #1 For our webserver, we have a parameter MAX_CONN, which specifies how many open HTTP connections we can have at any given time. We’re in the process of tuning our webservers to give the best user experience. Is it better to have a high MAX_CONN (and accept many connections) or a low MAX_CONN (and refuse some connections)? Why?

What really happened: • We set our MAX_CONN very high, so that we would never refuse a connection. After all, we wanted to service all our requests. • At one point, we had 10 webservers in production. Average response time was under 4 seconds. • We removed one webserver for maintenance. Response time increased to 6 seconds. • We had to remove another webserver, bringing us to 8. Response time increased to 65 seconds!!!!
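To see how non-linear that was, compare the extra load each remaining webserver picked up with the change in response time. This is only a rough calculation: it assumes total traffic stayed constant and was spread evenly across the servers (the slide does not say so), and it treats "under 4 seconds" as exactly 4.

```python
# Response times reported on the slide, keyed by webservers in production.
# "Under 4 seconds" is rounded to 4.0 for the arithmetic (an assumption).
observed = {10: 4.0, 9: 6.0, 8: 65.0}

def relative_change(servers, baseline=10):
    """Return (per-server load factor, response-time factor) versus the baseline."""
    load_factor = baseline / servers  # assumes constant traffic, evenly spread
    time_factor = observed[servers] / observed[baseline]
    return load_factor, time_factor

for n in (9, 8):
    load, slow = relative_change(n)
    print(f"{n} servers: {load:.2f}x load per server, {slow:.2f}x response time")
```

Under those assumptions, going from 10 to 8 servers raised per-server load by 25% while response time grew more than 16-fold.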

Our answer: • The MAX_CONN example illustrates what I call “the death spiral”. • If your system becomes heavily loaded and you build up queues, the system will “churn” as it’s handling the queues. • Thus, under heavy load, the system throughput drops! • This causes more requests to queue, which drops the throughput more. • Very quickly, this will bring your system to its knees.

Tactic: Build a cut-off • The simplest solution to avoiding the “death spiral” is to always build a cut-off into your system. • In our case, we have a threshold of open connections (about 170) above which our response times go non-linear. • The solution to this problem is to build a “cut-off” - once we have 150 open connections, every incoming request gets a simple “System too busy” reply. That reply is extremely fast, and therefore we don’t get into the non-linear range.
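A cut-off of this kind can be sketched in a few lines. This is a minimal illustration, not our webserver's code: the class name `CutOff`, the `handle` function and the `do_work` callback are all hypothetical, and only the limit of 150 and the “System too busy” reply come from the slide.

```python
import threading

class CutOff:
    """Admit at most `limit` concurrent requests; refuse the rest at once."""

    def __init__(self, limit=150):
        self.limit = limit
        self.active = 0
        self._lock = threading.Lock()

    def try_enter(self):
        # Check and increment under one lock so two threads cannot both
        # take the last free slot.
        with self._lock:
            if self.active >= self.limit:
                return False
            self.active += 1
            return True

    def leave(self):
        with self._lock:
            self.active -= 1

def handle(request, cutoff, do_work):
    """Run do_work(request) if there is room, else reply cheaply."""
    if not cutoff.try_enter():
        return "System too busy"  # fast path: no queueing, no real work
    try:
        return do_work(request)
    finally:
        cutoff.leave()  # free the slot even if do_work raised
```

The important property is that the refusal path does almost nothing, so refused requests never join a queue and never add to the churn.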

Mailstore examples The next few examples focus mainly on our mailstore design, and problems we’ve had dealing with it. Fundamentally, the storage of mail is the key to all our activities. Hence, much of our time has been spent tuning our mailstore system and redesigning it when we get it wrong. Let’s look at our first design:

Mailstore, Design #1 • We keep the message headers, folder information, etc. in Oracle. • We store the message body in NFS (one directory per user, one file per message). • We write a set of C++ classes to represent a Mailbox, a Folder, a Message. • The classes use SQL to access Oracle • The classes read from the file system to get the message body.

Group Exercise, #2 Discuss this with your neighbor. We’ll take a vote in a couple of minutes. Is it good or bad to store the headers in Oracle and the message bodies in NFS?

What really happened: • Incident: We run out of NFS space much earlier than expected. Messages cannot be saved. • Incident: Oracle runs out of space earlier than expected. Messages cannot be saved. • Incident: NFS volume crashes due to high load. (We have to tune the system.) • Incident: Oracle’s internal indexes become too large to rebuild within 4 hours.

Our answer • We think it’s wrong to split message storage between Oracle and NFS. • It adds too much operational complexity - that is, you have to keep NFS and Oracle in sync, you have to manage the capacity of each system, you have to monitor each system. • Backups are exceptionally tricky because you have to coordinate your restores. • You are vulnerable to TWO load bottlenecks - your database AND your NFS both have to be fast enough to handle the load.

Tactic: Operational Simplicity • We’ve found that the systems that scale the best are the ones that are operationally simplest. • The fewer components you have, the fewer chances for a performance bottleneck. • The simpler it is, the easier it is to measure capacity. • The simpler it is to monitor and maintain, the more time you can spend on scaling. • This does not work in all cases.

Group Exercise, #3 Let’s go back to that first incident with the Mailstore: We ran out of room much sooner than expected. It turns out that I accidentally left in debugging code that wrote an extra copy of every message in the user’s directory. Thus, we used twice as much disk space as we needed. Is this a serious mistake or not? Do you fire me?

Our answer It’s actually a good thing, for two reasons: • We got rid of the extra messages by running a “find” command - the naming convention was such that we could find all of them. • In a company that didn’t have good planning processes, we were suddenly aware that we needed to buy more space. We ordered more space immediately. We also started tracking how much space we were using.

Tactic: Monitor your capacity The real mistake we made was that we failed to monitor our capacity, particularly our disk space capacity. People forget to measure capacity for two reasons: • It takes time and effort to measure capacity, and people are often rushed for time. • Measuring capacity only works if you do it before you have a problem. So many companies don’t realize the need until they are bitten by it once.
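Tracking disk space does not have to be elaborate. The sketch below is one possible starting point, not what we actually ran: the function names, the 80% warning level and the straight-line growth projection are all assumptions made for illustration.

```python
import shutil

def disk_report(path, warn_fraction=0.8):
    """Report how full the volume holding `path` is (warn level is arbitrary)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,
    }

def days_until_full(used_then, used_now, days_between, total):
    """Project days left from two measurements, assuming linear growth."""
    growth_per_day = (used_now - used_then) / days_between
    if growth_per_day <= 0:
        return None  # usage flat or shrinking: no projection possible
    return (total - used_now) / growth_per_day
```

For example, a volume that grew from 40 to 50 units in 10 days, with 100 units in total, would be projected to fill in 50 more days. Real growth is rarely linear, so treat the number as a prompt to order more space early, not as a forecast.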

What about Oracle capacity? Another problem we encountered was running out of capacity on our Oracle database. Because every message’s headers were stored in the database, the database was running extremely hot. We upgraded our Oracle server from 8 processors to 12 processors to 28 processors (the maximum we could place in the box). How do we expand beyond 28 processors?

Our solution • At the time, we were running 30 websites. We split the largest websites off, and placed them on their own Oracle server. • We had a difficult time splitting the NFS information. (Remember, bodies in NFS, headers in Oracle. We had to go through user by user, and move their NFS data to a new volume.) • Load immediately was cut in half on each box.

Why that solution failed... That solution had some serious drawbacks, all of them because of one problem. • We didn’t have good failure handling if the database server went down. (If the server crashed, our website crashed.) • It was just as unacceptable to have half our sites down as it was to have all of them down. • Therefore, we had to keep two machines constantly running, constantly under capacity, constantly monitored, constantly backed up. • Splitting the database nearly doubled our work in keeping the site up.

Tactic: Never place load on a single point of failure. • Avoid single points of failure whenever possible - but my experience says that people create them all the time. • Settle for never putting load on a single point of failure. • A corollary: Always make sure your loaded systems aren’t single points of failure. • As soon as you realize that you’ve made this mistake, start planning a redesign.

Group Exercise, #4 We’re up against another scaling issue. We’ve split the databases, and the website is running fine. But we’re discovering that mail delivery is slow, because inserting into Oracle is a bottleneck. This is causing mail queues. We can avoid splitting the databases again by delivering mail to a flat file in the user’s directory and inserting the messages into Oracle upon login. Is this a good idea?

Answer This is one of the trickiest questions we’ve got. It requires a lot of understanding of our mail and user traffic. Here are some of our thoughts: • We want our website to be as fast as possible. If inserting mail is slow, don’t do it in a cgi - the user will notice the slowdown. • But mail is delivered constantly - it’s high-load. If we can’t deliver mail, it will queue - and the queues will be large. Large queues generally cause death spirals!

Answer, continued... We decided to have a system where mail is not delivered into Oracle until you log in. We store it in a flat file (mbox format, actually) until you log in to the website. The biggest influence on our decision was our “inactive users”. We have a lot of users who sign up just to receive mailing lists. They then stop logging in when they get tired of the mailing list - but don’t stop the mail. By delivering to a flat file, we actually reduce the load on Oracle dramatically because we only deliver mail that will be read.
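The two halves of that scheme, cheap delivery and import at login, can be sketched as follows. This is an illustration only: it uses Python's standard `mailbox` module for the mbox file, a plain list stands in for the Oracle insert, and the function names are hypothetical. File locking against concurrent deliveries is left out to keep the sketch short; a real system would need it.

```python
import mailbox
import os
import tempfile

def deliver(spool_path, raw_message):
    """Delivery-time step: append to the user's mbox file, no database work."""
    box = mailbox.mbox(spool_path)  # creates the file if it does not exist
    try:
        box.add(raw_message)
        box.flush()
    finally:
        box.close()

def import_on_login(spool_path, store):
    """Login-time step: move spooled mail into the main store, empty the spool."""
    if not os.path.exists(spool_path):
        return 0
    box = mailbox.mbox(spool_path)
    try:
        keys = box.keys()
        for key in keys:
            store.append(box.get_string(key))  # stand-in for the Oracle insert
            box.discard(key)
        box.flush()  # rewrite the spool without the imported messages
    finally:
        box.close()
    return len(keys)

# Usage: two messages arrive, nothing reaches the store until login.
spool = os.path.join(tempfile.mkdtemp(), "peter.mbox")
store = []
deliver(spool, "From: a@example.com\nSubject: one\n\nhello\n")
deliver(spool, "From: b@example.com\nSubject: two\n\nworld\n")
imported = import_on_login(spool, store)
```

A user who never logs in again costs only appends to a flat file; the expensive store is touched only for mail somebody is about to read.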

So where are we today? • We have a new design for our mailstore, obviously. • It’s supposed to be a secret... but its principles are based on much of what we’ve talked about.

Mailstore, V2 • The new design is based entirely on NFS, not Oracle (adhering to Operational Simplicity). • There are multiple mailstores (avoiding a single point of failure). • The file format is basically a flat-file per user (again, Operational Simplicity). • We access the Mailstore via a TCP/IP socket. This allows the Mailstore to implement a cutoff (like our webserver which could say “server too busy”). It also allows our mailstore to be on separate machines from our webservers.

Review There are several tactics that we use to avoid being killed by load issues. They are: • Always build in “cut-offs” • Always keep it operationally simple. • Always monitor your capacity. • Never place load on a single point of failure.

Questions • Fire away!
