A paper by Yi-Bing Lin IEEE Transactions on Mobile Computing Vol. 4, No. 2, March/April ’05

Per-User Checkpointing For Mobility Database Failure Restoration • A paper by Yi-Bing Lin • IEEE Transactions on Mobile Computing • Vol. 4, No. 2, March/April ’05 • Presented by Derek Pennington

In GPRS & UMTS networks, the Home Location Register (HLR) maintains the central database of user information. For a given user, an HLR record might contain information such as… • Mobile Station (MS) Information • telephone number • International Mobile Subscriber Identity • Service Information • subscription info • service restrictions • supplementary services • Location Information • address of Serving GPRS Support Node (SGSN)

But what happens in the event of an HLR failure? • Luckily, we periodically back up all of this user data (each backup is called a “checkpoint”). • However, the paper’s author argues that the established backup practices have room for improvement.

Various approaches to checkpointing: • All-record checkpoint • back up all users at once (e.g., at midnight) • costly (bottleneck effect) • Per-user checkpoint • each user has its own timing mechanism for backups • The paper’s author discusses the existing per-user checkpointing algorithm (henceforth referred to as “Algorithm 1”), and then proposes a new, improved one (“Algorithm 2”)

Order of Presentation Coverage • Introduction • Algorithm 1 vs. Algorithm 2 • Modeling the Algorithms (the math part) • Performance Evaluation of Algorithms 1 & 2 • Conclusions / Comments

Algorithm 1 • checkpoints happen at random intervals (tc) • a checkpoint may occur whether or not any user registrations have taken place • In the event of an HLR failure, if the user updated the HLR database but that update was not backed up, the restored record is obsolete • When the user’s record is obsolete, the user will lose calls until the user performs a registration with the HLR.

Algorithm 2 • checkpoint timers are scheduled for random intervals (tp) • However, checkpoints will only take place when BOTH of the following are true: • tp timer expires • a registration has taken place • Like Algorithm 1, the user will lose calls if the record is obsolete • [State diagram: a checkpoint occurs whenever we return to State 0]

RECAP: For each scenario, will the user’s record(s) be valid after the HLR recovers from its failure?

Scenario   Algorithm 1   Algorithm 2
1          YES           YES
2          YES           YES
3          NO            YES
4          NO            NO

LEGEND (timeline symbols in the scenario diagrams): CP timer fires, registration, failure
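The recap table can be reproduced by replaying short event sequences through both algorithms. The sketch below is illustrative only: the function names, the event constants, and in particular the four event sequences are assumptions chosen to be consistent with the YES/NO table, since the scenario diagrams themselves are not part of the text. Algorithm 2 is modeled with three states (0: timer running, no registration yet; 1: timer running, registration seen; 2: timer expired, waiting for a registration), following the slide's description.

```python
TIMER, REGISTRATION, FAILURE = "timer", "registration", "failure"


def backup_valid_alg1(events):
    """Algorithm 1: every timer expiry triggers a checkpoint."""
    dirty = False  # True when the record changed since the last checkpoint
    for event in events:
        if event == REGISTRATION:
            dirty = True
        elif event == TIMER:
            dirty = False  # checkpoint saves the current record
        elif event == FAILURE:
            return not dirty
    raise ValueError("event sequence must contain a failure")


def backup_valid_alg2(events):
    """Algorithm 2: checkpoint only once the timer has expired AND a registration occurred."""
    state = 0  # each sequence starts right after a checkpoint
    for event in events:
        if event == REGISTRATION:
            if state == 0:
                state = 1      # record changed, timer still running
            elif state == 2:
                state = 0      # timer already expired: checkpoint now
        elif event == TIMER:
            if state == 0:
                state = 2      # nothing to save yet: wait for a registration
            elif state == 1:
                state = 0      # save the changed record: checkpoint now
        elif event == FAILURE:
            return state != 1  # only State 1 holds an unsaved change
    raise ValueError("event sequence must contain a failure")


# Hypothetical event sequences, one per scenario in the table.
SCENARIOS = {
    1: [REGISTRATION, TIMER, FAILURE],
    2: [TIMER, FAILURE],
    3: [TIMER, REGISTRATION, FAILURE],
    4: [REGISTRATION, FAILURE],
}

results = {n: (backup_valid_alg1(ev), backup_valid_alg2(ev)) for n, ev in SCENARIOS.items()}
```

Scenario 3 is the case that separates the two algorithms: Algorithm 1 spends its checkpoint before the registration arrives, while Algorithm 2 defers the checkpoint until there is something new to save.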

Two metrics are used to measure checkpoint algorithm performance: • E[tc]: the expected checkpoint interval • the larger the interval, the less frequently checkpoints occur • essentially, checkpoint cost is proportional to checkpoint frequency • the obsolete-record probability: the probability that the user’s HLR record is obsolete after an HLR failure/recovery • the smaller this probability is, the better the checkpoint algorithm’s performance

Setting the checkpoint timer tp: • typical approaches have a fixed tp • However, this can lead to congestion with large numbers of users • Thus, in Algorithms 1 & 2, tp is a random variable with exponential distribution • Density function: rate × exp(−rate × tp), where the rate is the number of checkpoints per unit time • …and, in Algorithm 1, since tc = tp from checkpoint to checkpoint, the expected checkpoint interval is E[tc] = E[tp] = 1/rate, the time between checkpoints

Finding the obsolete-record probability for Algorithm 1: [Figure: timeline marking, for the registration interval tm and for the checkpoint timer tp, the “residual time” and the “reverse residual time” of each.]

Finding the obsolete-record probability for Algorithm 1 (cont’d): Consider random variable t: • probability density function: f(t) • probability distribution function: • expected value: E[t] • Laplace transform: Consider the residual time of t: • probability density function: • probability distribution function: • Laplace transform:
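For reference, the standard renewal-theory forms of these quantities are given below. The symbols F, τ, r and R are notation assumed here and may differ from the paper's; only f(t), E[t] and r*(s) appear by name in the slides.

```latex
\begin{align*}
F(t) &= \int_0^t f(x)\,dx, &
E[t] &= \int_0^\infty t\,f(t)\,dt, &
f^*(s) &= \int_0^\infty f(t)\,e^{-st}\,dt,\\
r(\tau) &= \frac{1 - F(\tau)}{E[t]}, &
R(\tau) &= \int_0^\tau r(x)\,dx, &
r^*(s) &= \frac{1 - f^*(s)}{s\,E[t]}.
\end{align*}
```

The last identity follows by transforming the numerator of r term by term: the constant 1 contributes 1/s and F contributes f*(s)/s.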

Finding the obsolete-record probability for Algorithm 1 (cont’d): • Also, the reverse residual time has the same density function as the residual time • In Algorithm 1, we say the backup record is obsolete if, at the moment of HLR failure, the time since the last checkpoint is greater than the time since the last registration • In other words, the probability is obtained from integrals of the two density functions, using r*(s) defined earlier
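As a sketch of how that condition turns into a formula, assume the failure occurs at a random time, the timer and the registrations are independent, and the timer tp is exponential with rate λ (the symbol λ, the bar notation for reverse residual times, and r_m for the residual density of tm are all assumed here). Because the exponential timer is memoryless, the time since the last checkpoint is itself exponential with rate λ, so:

```latex
\begin{align*}
\Pr[\text{record obsolete under Algorithm 1}]
  &= \Pr[\bar{\tau}_p > \bar{\tau}_m]
   = \int_0^\infty r_m(x)\,e^{-\lambda x}\,dx \\
  &= r_m^*(\lambda)
   = \frac{1 - f_m^*(\lambda)}{\lambda\,E[t_m]} .
\end{align*}
```

As a sanity check, if tm is also exponential with an assumed rate μ, then f_m*(λ) = μ/(λ+μ) and E[tm] = 1/μ, and the expression reduces to μ/(λ+μ): the more frequent registrations are relative to checkpoints, the more likely the backup is stale.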

Finding E[tc] and the obsolete-record probability for Algorithm 2: • One difference between Algorithm 2 and Algorithm 1 is that the checkpoint timer will be reset based on how the previous checkpoint took place • If the previous checkpoint happened due to a timeout event, then the next checkpoint interval is: • If the previous checkpoint happened due to a registration event, then the next checkpoint interval is: • Thus, in our state machine example, we actually have two “State 0”s…
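One compact way to write the two cases, derived here from the slides' description of Algorithm 2 rather than copied from the paper: a checkpoint needs both a timer expiry and a registration, so the interval is whichever of the two arrives later. The symbol τ_m (time remaining until the next registration at the moment of a timeout-caused checkpoint) and the rates λ and μ are assumed names.

```latex
\begin{align*}
t_c &= \max(t_p,\ \tau_m) && \text{after a timeout-caused checkpoint},\\
t_c &= \max(t_p,\ t_m)    && \text{after a registration-caused checkpoint},\\
E[\max(t_p, t_m)] &= \frac{1}{\lambda} + \frac{1}{\mu} - \frac{1}{\lambda+\mu}
  && \text{if } t_p,\ t_m \text{ are exponential with rates } \lambda,\ \mu .
\end{align*}
```

In the exponential special case the two intervals have the same distribution, because the remaining time to the next registration is again exponential; the distinction between the two “State 0”s only matters for non-exponential registration intervals.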

[Figure: state-transition diagram for Algorithm 2 with two checkpoint states, “checkpoint occurring due to registration” and “checkpoint occurring due to timeout event”. Edge labels: probability that a registration will occur after a registration-caused checkpoint; probability that a timeout will occur after a registration-caused checkpoint; probability that a timeout will occur after a timeout-caused checkpoint; probability that a registration will occur after a timeout-caused checkpoint.]

The random variable tc is now essentially a combination of the probability that the last checkpoint happened due to a timeout and the probability that the last checkpoint happened due to a registration: • …where: (x is the probability of being in State “x”) and

Therefore, we can say: [Equation with one term annotated “registration-caused checkpoint” and one term annotated “timeout-caused checkpoint”, plus a reminder of where one of its quantities comes from.]

Based on the figure, we can deduce some limiting probabilities: • …which means we know more about p1 and p2: and

From… • …the density function for tc is: • …where: • …thus:

The relationships between tp, tm, and the residual time of tm allow us to split fc(tc) into two different pieces: • fc1(tc): the situation where tp > tm • fc2(tc): the situation where tp < tm • We can re-express fc(tc) as: • …where:

Expected checkpoint interval for Algorithm 2: • take the integral of the density function • plug in p1 and p2 • What are A1 and A2? (see next slide)

So we can also express the expected checkpoint interval for Algorithm 2 as: • (result after plugging in A1 and A2)

Now we need to find II… • To find the probability of getting an obsolete record, there is no close-form expression when arbitrary fm(tm) is used • The paper uses a mix-Erlang density function • proven as a good approximation to other functions as well as measured data • …and, as a comparison, the regular Erlang density function:

Now we need to find II (cont’d) • Continuing, we have the Erlang distribution function: • …and the Laplace transform expressed as:

Now we need to find II (cont’d) • The reverse residual time m of tm has: • density function: • distribution function: • Laplace transform: • And, since E[tm]=n/, we can say the following:

Now we need to find II (cont’d) • For Algorithm 2, consider the two scenarios: • A registration happens before the timeout • Checkpoint happens at the time of the timeout • A registration does NOT happen before the timeout • In this case, we wait until the next registration to checkpoint • To derive II, we only need to consider the first case… • …where:

Now we need to find II (cont’d) • Then, the density function for the reverse residual time corresponding to g(i,,t) is: • If we say that c is the reverse residual time of tc, then the density function for c is:

Now we need to find II (cont’d) • Finally, we can derive II: • …where……… (next slide)

Now we need to find II (cont’d) • …where:

Algorithm 2: Checkpoint Freq. vs. Registration Freq. • as registration frequency increases, so does checkpoint frequency • this is what we’d expect
Algorithm 2’s Checkpoint Cost Improvement over Algorithm 1 • According to the graph, as registrations increase, Algorithm 2 further improves over Algorithm 1. • ??? This is not what I would expect • p. 189: “If registration activities are very frequent, then Alg. 2 behaves exactly the same as Alg. 1”

Algorithm 2: Probability of Obsolete Records After HLR Failure • X axis is the ratio of two registration-interval parameters (subscripts 1 and 2), where parameter 1 represents intervals with few registrations and parameter 2 represents intervals with many registrations • Thus, as we move right, registrations become less frequent • Fewer registrations means less chance of obsolete records; thus, this cost decreases as we move right
Algorithm 2’s Obsolete Record Cost Improvement over Algorithm 1 • Shows that Alg. 2 has a 20-55% improvement over Alg. 1 • This makes sense, because Alg. 2 will checkpoint when a registration occurs if the checkpoint timer has expired… whereas Alg. 1 will have obsolete records in those situations.
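Both metrics can be estimated with a small event-driven simulation. The sketch below is not the paper's evaluation: it assumes an exponential timer and exponential registration intervals (the paper uses a mix-Erlang model), treats the failure as arriving at a uniformly random time, and uses hypothetical names throughout (`simulate`, `closed_forms`, `timer_rate`, `reg_rate`). The closed forms are derived here for this exponential special case only, from a three-state Markov chain, and are not taken from the paper.

```python
import random


def simulate(algorithm, timer_rate, reg_rate, horizon, seed=0):
    """Return (fraction of time the backup is obsolete, mean checkpoint interval)."""
    rng = random.Random(seed)
    now = 0.0
    next_timer = rng.expovariate(timer_rate)
    next_reg = rng.expovariate(reg_rate)
    dirty = False          # record changed since the last checkpoint
    timer_running = True   # Algorithm 2 stops the timer while awaiting a registration
    obsolete_time = 0.0
    checkpoints = 0

    while True:
        timer_first = timer_running and next_timer <= next_reg
        t = next_timer if timer_first else next_reg
        if dirty:
            obsolete_time += min(t, horizon) - now
        if t >= horizon:
            break
        now = t
        if timer_first:
            if algorithm == 1 or dirty:
                checkpoints += 1       # timer expiry triggers a checkpoint
                dirty = False
                next_timer = now + rng.expovariate(timer_rate)
            else:
                timer_running = False  # Algorithm 2: nothing new to save yet
        else:
            next_reg = now + rng.expovariate(reg_rate)
            if timer_running:
                dirty = True           # change is not backed up yet
            else:
                checkpoints += 1       # Algorithm 2: checkpoint at the registration
                timer_running = True
                next_timer = now + rng.expovariate(timer_rate)
    return obsolete_time / horizon, horizon / checkpoints


def closed_forms(lam, mu):
    """Exponential-case values: {algorithm: (obsolete probability, E[tc])}."""
    denom = lam * lam + lam * mu + mu * mu
    return {
        1: (mu / (lam + mu), 1.0 / lam),
        2: (mu * mu / denom, 1.0 / lam + 1.0 / mu - 1.0 / (lam + mu)),
    }


estimates = {alg: simulate(alg, 1.0, 1.0, 100_000.0, seed=1) for alg in (1, 2)}
expected = closed_forms(1.0, 1.0)
```

With equal rates the closed forms give an obsolete probability of 1/2 for Algorithm 1 against 1/3 for Algorithm 2, and an expected checkpoint interval of 1 against 1.5, so under these assumptions Algorithm 2 checkpoints less often and still leaves fewer stale backups.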

Conclusion • Per the analytical results, Algorithm 2 improves upon Algorithm 1 in the following ways: • 50+% savings in checkpoint cost (E[tc]) • 20-55% improvement in terms of reducing occurrences of obsolete records • Note that this paper does NOT discuss SGSN / VLR failure and/or recovery • all SGSN-based mobile user records are temporary and not backed up • other papers discuss SGSN failure restoration (see paper’s references)

Comments • This paper was heavy on the math, light on the explanations from step to step • Granted, maybe IEEE gave the author a requirement to fit within 6 pages of their journal? 2x or 3x as long would make it much easier to follow. • Derek’s recommended prerequisites: • know the difference between probability density functions and probability distribution functions • know what a Laplace transform is • refresh your memory on integrals and derivatives • If, in fact, simulations were performed, include the details! He apparently omitted them on purpose. Maybe they’re included in his dissertation, thesis, etc…?

Thanks! Any questions?
