The Waiting Time Paradox and biases in infectious disease observational data


The Waiting Time Paradox and biases in infectious disease observational data. Ping Yan Lecture at Summer School on Mathematics of Infectious Diseases Program Centre for Disease Modelling, York University. Data in infectious disease studies are often observational , not following


### The Waiting Time Paradox and biases in infectious disease observational data

Ping Yan

Lecture at Summer School on Mathematics of Infectious Diseases Program

Centre for Disease Modelling, York University

• hypothesis → design of experiment → repetition and randomization

Outline

• Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiology aspects (transmission process), statistical models focus on the stochastic mechanism from which data are generated (data generating process).
• The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.
• Conversely, without statistical modelling for the data generation process, important input parameters in mathematical models may be severely biased based on naïve (or not-so-naïve) statistical methods.
• Buses arrive at a constant rate; the inter-arrival times X are independently and identically distributed (iid.) with mean $\mu$.

Question: what is the expected waiting time $E[W]$ to the next bus arrival?

Argument 1:

The inspection time t is uniformly distributed between two buses, so by symmetry $E[W] = \mu/2$.

Argument 2:

Because buses arrive at a constant rate, the X's are iid. exponentially distributed.

• The “memoryless” property of the exponential distribution implies that the remaining time to the next bus follows the same exponential distribution, thus $E[W] = \mu$.
• If $E[W] = \mu$, shouldn't the interval that contains the inspection time have mean $2\mu$?

Paradox! Didn't we assume that the X's are independently and identically distributed with mean $\mu$?

The Waiting Time Paradox (Feller 1966)

• An “inspector” (or a customer), inspects “at random” so that the inspection time t is uniformly distributed between the last bus and the next bus.
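The two arguments can be checked by simulation. The following is a minimal sketch, not part of the lecture: the function name `mean_wait`, the mean gap of 10 minutes and the sample sizes are illustrative assumptions. It builds a long record of bus arrivals from iid. gaps, drops inspectors uniformly in time, and averages the wait to the next bus.

```python
import bisect
import random


def mean_wait(draw_gap, n_gaps=100_000, n_inspections=100_000, seed=1):
    """Average wait until the next bus for inspectors arriving uniformly in time.

    draw_gap(rng) must return one positive inter-arrival time.
    """
    rng = random.Random(seed)

    # Bus arrival times are cumulative sums of iid. inter-arrival gaps.
    arrivals = []
    t = 0.0
    for _ in range(n_gaps):
        t += draw_gap(rng)
        arrivals.append(t)

    total = 0.0
    for _ in range(n_inspections):
        # The inspector shows up at a time chosen uniformly over the whole record.
        s = rng.uniform(0.0, arrivals[-1])
        # First bus arriving at or after the inspection time.
        nxt = arrivals[bisect.bisect_left(arrivals, s)]
        total += nxt - s
    return total / n_inspections


if __name__ == "__main__":
    mean_gap = 10.0  # hypothetical mean inter-arrival time, in minutes
    print("exponential gaps:", mean_wait(lambda rng: rng.expovariate(1.0 / mean_gap)))
    print("constant gaps:   ", mean_wait(lambda rng: mean_gap))
```

With exponential gaps the average wait should come out close to the full mean gap, and with perfectly regular gaps close to half of it, so each argument is right for one particular arrival pattern.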


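Variation matters

X(B) = duration from the last bus to the next bus seen by the inspector

The distribution of X(B) is different from that of X (a paradox, as the X's are assumed to be iid.).

X(B) has a length-biased distribution with p.d.f. $x f(x)/\mu$ and mean $\mu(1 + CV^2)$, where $f$, $\mu$ and $\sigma^2$ are the p.d.f., mean and variance of X, and $CV = \sigma/\mu$ is the coefficient of variation.

Question: what is the expected waiting time to the next bus arrival?

The waiting time W has p.d.f. $[1 - F(w)]/\mu$, where $F$ is the c.d.f. of X. By symmetry with respect to X(B), $E[W] = \frac{1}{2} E[X^{(B)}] = \frac{\mu}{2}(1 + CV^2)$.

Argument 1, $E[W] = \mu/2$, is correct if there is no variation in inter-arrival times, $CV = 0$

(if buses are as punctual as Swiss trains).

Argument 2, $E[W] = \mu$, is correct if the variance of inter-arrival times satisfies $\sigma^2 = \mu^2$, i.e. $CV = 1$

(TTC seems to be worse than this.)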
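The steps behind these two means can be written out as follows. This is a sketch in assumed notation, since the symbols on the original slide did not survive extraction: $f$, $F$, $\mu$ and $\sigma^2$ denote the p.d.f., c.d.f., mean and variance of X, and $CV = \sigma/\mu$.

```latex
\begin{align*}
f_{X^{(B)}}(x) &= \frac{x\,f(x)}{\mu}, &
E\big[X^{(B)}\big] &= \int_0^\infty x\,\frac{x\,f(x)}{\mu}\,dx
  = \frac{E[X^2]}{\mu} = \frac{\sigma^2 + \mu^2}{\mu} = \mu\,(1 + CV^2), \\
f_W(w) &= \frac{1 - F(w)}{\mu}, &
E[W] &= \int_0^\infty w\,\frac{1 - F(w)}{\mu}\,dw
  = \frac{E[X^2]}{2\mu} = \frac{\mu}{2}\,(1 + CV^2).
\end{align*}
```

Setting $CV = 0$ gives $E[W] = \mu/2$ and setting $CV = 1$ gives $E[W] = \mu$; for $CV > 1$ the expected wait exceeds the mean inter-arrival time.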

The Waiting Time Paradox (Feller 1966)


The duration X is iid. with p.d.f. $f(x)$ and mean $\mu$.

• At a snapshot, only those who have experienced the initiating event but not the subsequent event are included in a sample, with observed duration X(B) .

A sample containing only observations made of X(B) is called a “prevalence cohort”.

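The distribution from a prevalence cohort corresponds to the p.d.f. $x f(x)/\mu$ and mean $\mu(1 + CV^2)$, where $f$, $\mu$ and CV are the p.d.f., mean and coefficient of variation of X,

because those with longer duration have a greater chance of being included in the data.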
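A small simulation illustrates the size of the bias. This is a hedged sketch rather than anything from the lecture: the gamma duration model, the function name `prevalence_cohort_mean` and all numerical settings are assumptions chosen for illustration. Onsets occur at a constant rate, and only individuals whose subsequent event has not yet happened at the snapshot are sampled.

```python
import random


def prevalence_cohort_mean(mean_duration, shape, n_onsets=300_000, lookback=50.0, seed=2):
    """Mean full duration among individuals still in progress at a snapshot at time 0.

    Onsets are uniform on [-lookback, 0] (constant incidence); durations are gamma
    distributed with the given mean and shape, so CV^2 = 1 / shape.
    """
    rng = random.Random(seed)
    scale = mean_duration / shape
    sampled = []
    for _ in range(n_onsets):
        onset = rng.uniform(-lookback, 0.0)
        duration = rng.gammavariate(shape, scale)
        # Included only if the subsequent event has not yet happened at time 0.
        if onset + duration > 0.0:
            sampled.append(duration)
    return sum(sampled) / len(sampled)


if __name__ == "__main__":
    # True mean 2.0 and CV^2 = 0.25, so the length-biased mean is 2.0 * 1.25 = 2.5.
    print(prevalence_cohort_mean(mean_duration=2.0, shape=4.0))
```

The sampled mean should sit near $\mu(1 + CV^2)$ rather than near $\mu$, which is the over-estimation a naïve analysis of prevalence cohort data would inherit.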

The Waiting Time Paradox and bias in observational data

A different way of looking at the same problem:

• Occurrence of the initiating event has a constant rate (i.e. the time of occurrence is uniformly distributed over any given time interval).
• The duration X is independent of the random process that generates the initiating event.

Assume the duration X is iid. with p.d.f. $f(x)$, c.d.f. $F(x)$ and mean $\mu$.

X(B) has p.d.f. $x f(x)/\mu$.

W has p.d.f. $[1 - F(w)]/\mu$.

• Size biased estimation for prevalence estimate

i.e. the sample is size-biased in favor of cohorts with larger prevalence

prevalence = # or % { individuals experienced the initial event but not the subsequent event }

Under equilibrium, prevalence = incidence × duration.

The length-bias in observed duration leads to “size-bias” in sampled prevalence.

Observational data arising in a prevalence cohort

Under equilibrium: the incidence of the initiating event occurs at constant rate

• The observed duration is length-biased

Naïve estimation for the distribution of X (e.g. incubation time, survival time, etc.) based on such prevalence cohort data leads to over-estimation.

• Replace the “inspector” with sero-conversion, which, under equilibrium, occurs at a constant rate, so that if a sero-conversion occurs in a given time interval (between two tests), the sero-conversion time is uniformly distributed in the interval.

• Replace buses with repeated testing: the inter-testing intervals X are iid. with mean $\mu$.

X(B)= duration from the last (neg.) test to the next (pos.) test covering a sero-conversion

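X(B) has a length-biased distribution, with mean $\mu(1 + CV^2)$, where $\mu$ and CV are the mean and coefficient of variation of the inter-testing intervals.

The average waiting time from sero-conversion to the next (pos.) test: $E[W] = \frac{\mu}{2}(1 + CV^2)$.

If we add an average “window period” from infection to sero-conversion, the mean duration from infection to the next (pos.) test is the window period plus $E[W]$, and the prevalence of infected but not yet tested (queue) is

prev. = incidence × mean duration (under equilibrium conditions).

Keeping the incidence and the window period unchanged, the testing strategy determines the size of the queue through $\mu$ and CV.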
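The equilibrium relation can be turned into a small calculator. The sketch below assumes the length-biased waiting time formula $E[W] = \frac{\mu}{2}(1 + CV^2)$; the function names, the units (years) and the example numbers are hypothetical and not taken from the lecture.

```python
def mean_wait_to_next_test(mean_interval, cv):
    """Expected time from sero-conversion to the next test, E[W] = mu (1 + CV^2) / 2."""
    return 0.5 * mean_interval * (1.0 + cv ** 2)


def undiagnosed_queue(incidence, mean_interval, cv, window=0.0):
    """Equilibrium prevalence of infected but not yet tested = incidence x mean duration.

    incidence     : new infections per unit time
    mean_interval : mean time between tests
    cv            : coefficient of variation of the inter-testing intervals
    window        : average window period from infection to sero-conversion
    """
    return incidence * (window + mean_wait_to_next_test(mean_interval, cv))


if __name__ == "__main__":
    # Hypothetical setting: 100 infections per year, tests on average once a year.
    for cv in (0.0, 1.0, 2.0):
        print(cv, undiagnosed_queue(incidence=100.0, mean_interval=1.0, cv=cv))
```

For a fixed mean interval, the queue grows with the variability of the testing schedule, which is why irregular testing undermines an otherwise adequate average frequency.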

Waiting Time Paradox in disease screening via repeated testing

Each test is associated with a cost κ.

Each infected but not yet tested individual (in the queue) may be associated with a cost c to society.

Both costs are determined according to different contexts.

Objective: under different scenarios of infection incidence, determine the optimal testing frequency so that the queue of infected but untested individuals is reduced to satisfy a cost-effectiveness criterion.

Generally, the larger the incidence rate, the more cost-effective more frequent testing becomes.

Cost-effectiveness is compromised if there is large variation between inter-testing intervals or among individuals.

Waiting Time Paradox in disease screening via repeated testing

The prevalence of infected but not yet tested (queue),

(under equilibrium conditions)

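An infected individual produces new infections according to a counting process with intensity $k(x)$.

R0 = mean value of N, can be expressed as $R_0 = \int_0^\infty k(x)\,dx$.

The premises:

If $R_0 > 1$, there is a Malthusian number $r > 0$, describing the early exponential growth, such that $\int_0^\infty e^{-rx} k(x)\,dx = 1$.

Re-write: $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$; then

$\frac{1}{R_0} = \int_0^\infty e^{-rx} g(x)\,dx$ = Laplace transform of $g(x)$, evaluated at $r$.

Postulate: g(x) is the p.d.f. of a well defined random variable with epidemiologic meaning.

(Ref: Wallinga and Lipsitch, 2007; Heesterbeek and Roberts, 2007)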

The Waiting Time Paradox as seen in R0 formulation

N = # of infections produced by a typical infectious individual when seeded into an infinitely large susceptible population

Postulate: g(x) is the p.d.f. of a well defined random variable with epidemiologic meaning. This leads to the following steps:

1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate the Malthusian number separately (usually via curve fitting);
5. evaluate the Laplace transform of g(x), analytically or numerically.

In the above, there is no assumption about the integrand k(x), the instantaneous rate at time x, i.e. about the model.

However, in order to assess 1., we put k(x) into a structured model framework: the SEIR.

The Waiting Time Paradox as seen in R0 formulation


• In the SIR model (S → I → R), with exponentially distributed infectious period with mean $\mu_I$,

$\frac{1}{R_0} = \frac{1}{1 + r\mu_I}$ is the Laplace transform of the exponential distribution with mean $\mu_I$, evaluated at the Malthusian number $r$.

In this case, g(x) is the p.d.f. of the infectious period.

• In the SEIR model (S → E → I → R), with both the latent period (mean $\mu_E$) and the infectious period (mean $\mu_I$) being exponentially distributed,

$\frac{1}{R_0} = \frac{1}{1 + r\mu_E} \cdot \frac{1}{1 + r\mu_I}$ = product of two Laplace transforms of the exponential distributions.

In this case, g(x) is the p.d.f. of the sum of the latent period and the infectious period.

Anderson and May (1991): generation time = latent period + infectious period.
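The relation "1/R0 = Laplace transform of g(x) at the Malthusian number" can be evaluated directly for these two cases. The sketch below is illustrative only: the function names, the growth rate r = 0.1 per day and the mean periods are assumptions. It computes R0 from the closed-form exponential transforms and cross-checks the SEIR case by numerically integrating the density of the sum of two exponential periods.

```python
import math


def laplace_exponential(r, mean):
    """Laplace transform at r of an exponential density with the given mean."""
    return 1.0 / (1.0 + r * mean)


def r0_sir(r, mean_infectious):
    """R0 when g(x) is the exponential infectious-period density."""
    return 1.0 / laplace_exponential(r, mean_infectious)


def r0_seir(r, mean_latent, mean_infectious):
    """R0 when g(x) is the density of latent + infectious periods, both exponential."""
    return 1.0 / (laplace_exponential(r, mean_latent) * laplace_exponential(r, mean_infectious))


def laplace_numeric(pdf, r, upper, steps=200_000):
    """Trapezoidal approximation of the integral of exp(-r x) pdf(x) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (pdf(0.0) + math.exp(-r * upper) * pdf(upper))
    for i in range(1, steps):
        x = i * h
        total += math.exp(-r * x) * pdf(x)
    return total * h


if __name__ == "__main__":
    r = 0.1  # hypothetical Malthusian number, per day
    print("SIR: ", r0_sir(r, mean_infectious=4.0))
    print("SEIR:", r0_seir(r, mean_latent=2.0, mean_infectious=4.0))

    # Density of the sum of exponential periods with means 2 and 4 (rates 0.5 and 0.25).
    def g(x):
        return 0.5 * (math.exp(-0.25 * x) - math.exp(-0.5 * x))

    print("SEIR, numeric:", 1.0 / laplace_numeric(g, r, upper=200.0))
```

For the same observed growth rate, adding a latent period raises the implied R0, because the Laplace transform of a longer generation time is smaller.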

= Laplace transform of

The Waiting Time Paradox as seen in R0 formulation

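Assuming R0 > 1:

In the SEIR model, if the latent period and the infectious period are arbitrarily distributed,

$R_0 = \frac{r\mu_I}{\mathcal{L}_E(r)\,[1 - \mathcal{L}_I(r)]}$ (Yan, 2007)

where $\mathcal{L}_E$ and $\mathcal{L}_I$ are the Laplace transforms of the latent and infectious period distributions and $r$ is the Malthusian number. With specific distributions, this includes:

• $R_0 = 1 + r\mu_I$ (no latent period, exponentially distributed infectious period)
• $R_0 = (1 + r\mu_E)(1 + r\mu_I)$ (exponentially distributed latent and infectious periods)
• $R_0 = \frac{r\mu_I\,(1 + r\mu_E CV_E^2)^{1/CV_E^2}}{1 - (1 + r\mu_I CV_I^2)^{-1/CV_I^2}}$ (gamma distributed latent and infectious periods, Anderson and Watson, 1980)

where $\mu_E, \mu_I$ are the mean values of the latent and infectious periods and $CV_E, CV_I$ are coefficient of variation parameters.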

The Waiting Time Paradox as seen in R0 formulation


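g(x) is the convolution of the p.d.f. of the latent period and the p.d.f. of W in the length-biased infectious period, from a snapshot point of view.

g(x) is the p.d.f. of the latent period plus W, with mean value $\mu_E + \frac{\mu_I}{2}(1 + CV_I^2)$, where $\mu_E$ is the mean latent period and $\mu_I$, $CV_I$ are the mean and coefficient of variation of the infectious period. Call it the generation time:

• if $CV_I = 1$, consistent with that by Anderson and May (1991): mean latent period + mean infectious period;
• if $CV_I = 0$, consistent with that by Gani and Daly (2001): mean latent period + half of the mean infectious period.

Fine (2003): the latent period + part of the infectious period ...

• not exactly, need to emphasize the length-biased infectious period
• the part could be even longer than the “natural” infectious period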
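The link between this generation time and the R0 relation can be sketched in a few lines. The notation is assumed, not recovered from the slides: $f_E$ and $F_I$ are the p.d.f. of the latent period and the c.d.f. of the infectious period, $\mathcal{L}_E$ and $\mathcal{L}_I$ their Laplace transforms, $T_E$ the latent period, and W the time from the start of the infectious period to a uniformly placed snapshot within the length-biased infectious period.

```latex
\begin{align*}
g(x) &= (f_E * f_W)(x), \qquad f_W(w) = \frac{1 - F_I(w)}{\mu_I}, \\
\int_0^\infty e^{-rw} f_W(w)\,dw &= \frac{1 - \mathcal{L}_I(r)}{r\,\mu_I}, \\
\frac{1}{R_0} &= \mathcal{L}_E(r)\,\frac{1 - \mathcal{L}_I(r)}{r\,\mu_I}, \\
E[T_E + W] &= \mu_E + \frac{\mu_I}{2}\,\bigl(1 + CV_I^2\bigr).
\end{align*}
```

As a check, for an exponential infectious period $\mathcal{L}_I(r) = 1/(1 + r\mu_I)$, so the transform of $f_W$ reduces to $1/(1 + r\mu_I)$ and W has the same distribution as the infectious period itself; whenever $CV_I > 1$ the mean of W exceeds $\mu_I$.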

The Waiting Time Paradox as seen in R0 formulation

In the SEIR model, if the latent period and the infectious period are arbitrarily distributed, the steps are:

1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate the Malthusian number separately (usually via curve fitting);
5. evaluate the Laplace transform of g(x), analytically or numerically.

For 1. above, g(x) is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \frac{\mu_I}{2}(1 + CV_I^2)$, where $\mu_E$ is the mean latent period and $\mu_I$, $CV_I$ are the mean and coefficient of variation of the infectious period.

In the above definition, the generation time does not involve:

• another individual
• the transmission process

The “snapshot” may be thought of as the time of infection of an infectee in relation to the infectious period of its infector; in theory, however, it could be a snapshot by any “inspector”.

= Laplace transform of

The Waiting Time Paradox as seen in R0 formulation

satisfying

g(x) is p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value

From Wallinga and Lipsitch (2007):

• (i) from the infection time of the infector looking forward to the infection time of the infectee;
• (ii) from the infection time of the infectee looking back to the infection time of the infector.

= Laplace transform of

The Waiting Time Paradox as seen in R0 formulation

It seems that, if we assign the “snapshot” as the time of infection of an infectee in relation to the infectious period of its infector, then the generation interval in Wallinga and Lipsitch (2007) should be understood in the sense of (ii) in Svensson (2007).

But ... there are strings attached ...

If we associate g(x) with the generation interval in Wallinga and Lipsitch (2007) and understand it in the sense of (ii) in Svensson (2007), there are hidden assumptions (with $\mu_I$, $CV_I$ and $F_I$ the mean, coefficient of variation and c.d.f. of the infectious period).

• The infectious period contains an infectee, hence it is length-biased, with mean $\mu_I(1 + CV_I^2)$.
• The infection times of infectees must be exchangeable so that any randomly chosen infectee (if more than one), while looking back, gives the same distribution for W.
• The infection times of infectees must be uniformly distributed in the (length-biased) infectious period so that W has p.d.f. $[1 - F_I(w)]/\mu_I$ with mean $\frac{\mu_I}{2}(1 + CV_I^2)$, i.e. symmetry.

Things that I don’t understand:

• the generation time distribution g(x) has a valid interpretation at equilibrium;
• yet its Laplace transform is evaluated at the Malthusian number, which describes growth far from equilibrium.

= Laplace transform of

The Waiting Time Paradox as seen in R0 formulation

Svensson (2007):

• from the infection time of the infectee looking back to the infection time of the infector.
• The system is at equilibrium so that infectors arrive at constant rate.

This puzzle further leads to the observation problem: can we collect data at the early phase of an outbreak and use the above theory ?

• Individuals who have experienced the initial event are enrolled at time
• Individuals are followed until an endpoint event, taking place at time

• assumed equilibrium
• treated “enrolment” as the “snapshot”, assumed to be uniformly distributed in any fixed time interval
• the time from initial event to the observed endpoint, X(B), follows p.d.f. $x f(x)/\mu$
• the time from initial event to enrolment, E, follows the distribution with p.d.f. $[1 - F(e)]/\mu$

(here $f$, $F$ and $\mu$ are the p.d.f., c.d.f. and mean of X)

Generalization

• enrolment is random, independent from the random process of the initiating event.
• is not constant;
• This observation scheme is subject to left-truncation.

A generalization of the Waiting Time Paradox: left-truncation

Moving away from for general observation bias without being in equilibrium

The same issue: the observed X(B) is length-biased.

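Not-so-naïve method through conditioning:

X(B) arises from the conditional distribution of X given $X > E$, where E is the time from the initial event to enrolment, because eligibility for enrolment requires not having experienced the endpoint event at the time of enrolment.

Statistical methods are based on the conditional distribution $f(x)/[1 - F(e)]$, $x > e$, rather than $f(x)$, where $F$ is the c.d.f. of X and $e$ is the observed value of E.

Such a method provides a length-bias adjusted estimation, but is only able to estimate part of the distribution. Some information is lost in the data, unless the incidence of the initiating event is explicitly modelled.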
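A toy comparison of the naïve and conditional analyses is sketched below. Everything specific in it is an assumption made for illustration: onsets are uniform in time, durations are exponential (chosen only because the conditional likelihood, the product of $f(x_i)/[1 - F(e_i)]$, then has a closed-form maximiser), and the function names are hypothetical. A parametric model like this supplies information that a nonparametric analysis of left-truncated data would not have.

```python
import random


def simulate_prevalent_cohort(mean_duration, n_onsets=400_000, lookback=200.0, seed=4):
    """Return (e, x) pairs for individuals enrolled at a snapshot.

    e = time from the initial event to enrolment, x = full duration (x > e).
    Assumes uniform onset times and exponential durations, for illustration only.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_onsets):
        e = rng.uniform(0.0, lookback)            # onset happened e time units before enrolment
        x = rng.expovariate(1.0 / mean_duration)  # full duration
        if x > e:                                 # endpoint not yet reached at enrolment
            cohort.append((e, x))
    return cohort


def naive_mean(cohort):
    """Treats the observed durations as if they were an iid. sample of X."""
    return sum(x for _, x in cohort) / len(cohort)


def conditional_mle_mean(cohort):
    """Exponential mean maximising prod f(x_i) / S(e_i), which is the average of x_i - e_i."""
    return sum(x - e for e, x in cohort) / len(cohort)


if __name__ == "__main__":
    cohort = simulate_prevalent_cohort(mean_duration=5.0)
    print("naive mean:      ", naive_mean(cohort))
    print("conditional mean:", conditional_mle_mean(cohort))
```

In this setting the naïve mean should be roughly twice the true mean, while the conditional estimate should land close to it.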

A generalization of the Waiting Time Paradox: left-truncation

The objective: estimating the distribution of the duration X between the two events.

The observed X(B) is length-biased: in favor of longer durations.

Naïve analyses, treating X(B) as if it were X from designed experiments, lead to over-estimation.

Call for joint modelling: a transmission model for how the epidemiology generates data and a statistical model for how data are observed.

Example:

Initiating event = diagnosis of a disease

Subsequent event = the disease is reported and entered into a registry

Objective: assessment of the reporting delay X.

Right-truncation: length-bias in favour of observing short durations

Previously, left-truncation, in favour of observing long durations

Very common in surveillance: the inclusion criterion is the occurrence of the subsequent event prior to the time of data analysis.

Bias: the case has to be reported before the time of analysis, so data with short delays are systematically observed.


Reporting delay is a very important issue in all disease surveillance

Annual AIDS incidence in Canada as seen in 1992 and 1999

Reporting delay adjusted trend based on 1992 data presented in April 1993 along with the AIDS surveillance report.

[Figure: annual AIDS incidence in Canada, vertical axis from 0 to 2500 cases, comparing counts as reported by Dec. 31, 1992 with counts as reported by Dec. 31, 1999.]

Right-truncation: length-bias in favour of observing short durations

The gap between reported (bars) and projected (lines) trends implied long delay between diagnosis and data entry (into national registry).

Then:

Naïve analysis always leads to severe under-estimation of reporting delay.

Naïve analysis:

• median reporting delay 1.6 months
• 95% completeness within 14 months

Not-so-naïve analysis:

• median delay approx. 9 months
• 85% completeness within 5 years

N(t) = # cases diagnosed at time t (to be estimated)

Observed: # cases diagnosed at time t and reported by time C (as a proportion of N(t))

All we need to do is to estimate this proportion, which is $P(X \le C - t)$, the probability that the reporting delay X does not exceed $C - t$.

By adequately accounting for right-truncation and other administrative processes, useful tools can be developed to reflect real-time trends and be built into surveillance.
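The adjustment can be illustrated with a toy simulation. This is a sketch under strong assumptions that are not in the lecture: diagnoses are uniform in time, the reporting delay is exponential with a known mean (in practice the delay distribution must itself be estimated by a method that accounts for right-truncation), and all function names and numbers are hypothetical.

```python
import math
import random


def simulate_cases(n_cases, mean_delay, C, seed=3):
    """(diagnosis time, reporting delay) pairs; diagnoses are uniform on [0, C]."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, C), rng.expovariate(1.0 / mean_delay)) for _ in range(n_cases)]


def naive_mean_delay(cases, C):
    """Mean delay among cases already reported by time C (right-truncated sample)."""
    observed = [x for t, x in cases if t + x <= C]
    return sum(observed) / len(observed)


def reported_fraction(a, b, C, mean_delay):
    """Average of P(X <= C - t) for t uniform on [a, b), with exponential delay X."""
    m = mean_delay
    return 1.0 - (m / (b - a)) * (math.exp(-(C - b) / m) - math.exp(-(C - a) / m))


def adjusted_count(cases, a, b, C, mean_delay):
    """Reported count for diagnoses in [a, b), scaled up by the expected reported fraction."""
    reported = sum(1 for t, x in cases if a <= t < b and t + x <= C)
    return reported / reported_fraction(a, b, C, mean_delay)


if __name__ == "__main__":
    C, mean_delay = 24.0, 9.0  # months; hypothetical values
    cases = simulate_cases(100_000, mean_delay, C)
    print("true mean delay:", mean_delay, "naive:", naive_mean_delay(cases, C))
    true_n = sum(1 for t, _ in cases if 18.0 <= t < C)
    print("diagnosed in last 6 months:", true_n,
          "adjusted estimate:", adjusted_count(cases, 18.0, C, C, mean_delay))
```

The naïve mean delay falls well below the true mean because long delays are not yet in the registry, whereas dividing the reported count by the expected reported fraction recovers the number of recent diagnoses approximately.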

SARS outbreak in Toronto, 2003

Premature declaration that SARS was over

As it turned out:

Recall the strong protest against WHO’s travel advisory on Apr. 23?

H1N1 during the spring of 2009

As it turned out:

May 14: Is the worst over?

Other examples of reporting delay in disease surveillance: did we learn the lesson?

Based on the above data, at C = June 30, 1988 (Brookmeyer and Gail, 1994):

• naïve estimation: as if the data were from a random experiment, iid.
• not-so-naïve: right-truncation, with data from the conditional distribution; uncertainty subject to a constant of proportionality
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning)

[Figure: estimated incubation distributions labelled “naïve” and “not-so-naïve”, with the 0.5 level marked.]

Kalbfleisch and Lawless (1989):

• with analysis by conditioning, the larger the C, the longer are the estimated mean and median
• without knowing the AIDS incidence, there is a loss of information in the data, so that one can only estimate, up to a constant of proportionality, the early part of the incubation period distribution.

Right-truncation: length-bias in favour of observing short durations

Initiating event = HIV infection via transfusion

Subsequent event = onset of AIDS illnesses

Another example:

Objective: estimate the incubation period X.

Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion


Lui, et al. (1986): C = April 30, 1985

Kalbfleisch and Lawless (1989):

• with analysis by conditioning, the larger the C, the longer are the estimated mean and median
• conditioning: mean years
• naïve analysis: mean years

Lagakos, et al. (1988): C = June 30, 1986

• conditioning: median 8.5 years

By the 1990s when large scale multi-center cohort data became available, it turned out that the median incubation period is approximately 10 years.


The underlying disease trend matters.

Right-truncation: length-bias in favour of observing short durations

Kalbfleisch and Lawless (1989):

• with analysis by conditioning, the larger the C, the longer are the estimated mean and median
• without knowing the incidence of the initiating event, there is a loss of information, so that one can only estimate, up to a constant of proportionality, the early part of the duration distribution.

The above statements are very important for observational data of an emerging disease with respect to retrospectively ascertained durations (incubation period, serial interval, etc.): later analyses suggest longer durations than earlier analyses.

This calls for jointly modelling the disease process (e.g. transmission model) and the data generation process.

Example: stochastic versus deterministic models for

Assume

Deterministic ↔ what must happen:

• a bell-shaped curve determined by mathematical law.

Stochastic ↔ what might happen:

• even if R0 > 1, there is a positive probability (1/3 in the above cases) that very few transmissions occur, followed by extinction;
• otherwise, after “simmering” for a short random period of time, it takes off;
• if , the path is bell-shaped resemble but the origin is random.

A general topic related to statistics issues and disease models

Every deterministic compartment model, such as SIR, has a stochastic counterpart.

In these graphs, R0 =1.5

n= population size

• models are built on unobservable events (e.g. time at infection, the passing of infection from one individual to another, duration of latency (not infectious), duration of infectiousness, duration of immunity, etc.)

• data are based on observable events (e.g. clinical onset of illness, stages of illness, duration of illness, death, physical recovery, etc.)
• some seemingly “large” data (in terms of large population) arise from a single (or few) realization of a random phenomenon (i.e. an outbreak) … extremely small “sample size”

A general topic related to statistics issues and disease models

Statistical challenges in estimating parameters in transmission models

• Observational data subject to length-bias, size bias, missing values, etc.
• large number of parameters
• hypothesis → design of experiment → repetition and randomization

Summary

Data in most introductory statistics textbooks are from repetition of random experiment by design. Naïve adaptation of these models and methods may lead to severe bias.

• Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiology aspects (transmission process), statistical models focus on the stochastic mechanism from which data are generated (data generating process).

Although the transmission process is part of the data generation mechanism, the observer sees data through additional filters, such as data management and administrative processes.

• The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.

The gap is identified. A lot of work still needs to be done.

• Conversely, without statistical modelling for the data generation process, important input parameters in mathematical models may be severely biased based on naïve (or not-so-naïve) statistical methods.

Ditto.