The Waiting Time Paradox and biases in infectious disease observational data

Ping Yan
Lecture at the Summer School on Mathematics of Infectious Diseases Program, Centre for Disease Modelling, York University

Outline
• Data in infectious disease studies are often observational, not following the sequence of hypothesis, design of experiment, repetition and randomization.
• Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiological aspects (the transmission process), statistical models focus on the stochastic mechanism from which data are generated (the data generating process).
• The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) designed to capture the length-bias, important information is still lost without modelling the underlying transmission process.
• Conversely, without statistical modelling of the data generating process, important input parameters in mathematical models may be severely biased when based on naïve (or not-so-naïve) statistical methods.

The Waiting Time Paradox (Feller 1966)
(just to show how classic it is in introductory probability textbooks)
• Buses arrive at a constant rate; the inter-arrival times X are independently and identically distributed (iid.), with mean $\mu$.
• An “inspector” (or a customer) inspects “at random”, so that the inspection time t is uniformly distributed between the last bus and the next bus.
Question: what is the expected waiting time W to the next bus arrival?
Argument 1: The inspection time t is uniformly distributed between two buses, so by symmetry $E[W] = \mu/2$.
Argument 2: Because buses arrive at a constant rate, the X's are iid. exponentially distributed.
• The “memoryless” property of the exponential distribution implies that the remaining time to the next bus follows the same exponential distribution, thus $E[W] = \mu$.
• If $E[W] = \mu$, shouldn't the interval between the last bus and the next bus have mean $2\mu$ by symmetry? Paradox! Didn't we assume that the X's are independently and identically distributed, with mean = $\mu$?

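The Waiting Time Paradox (Feller 1966)
Variation matters: let $\gamma = \sigma/\mu$ be the coefficient of variation of the inter-arrival times X, with p.d.f. $f(x)$, c.d.f. $F(x)$, mean $\mu$ and variance $\sigma^2$.
X(B) = duration from the last bus to the next bus seen by the inspector.
The distribution of X(B) is different from that of X (a paradox, as X is assumed to be iid.). It is the length-biased distribution, with p.d.f. $x f(x)/\mu$ and mean $E[X^{(B)}] = \mu(1+\gamma^2)$.
Question: what is the expected waiting time to the next bus arrival?
The waiting time W has p.d.f. $[1 - F(x)]/\mu$ and, by symmetry with respect to X(B), $E[W] = E[X^{(B)}]/2 = \mu(1+\gamma^2)/2$, where
• $E[W] = \mu/2$ is correct if there is no variation in inter-arrival times, $\gamma = 0$ (if buses are as punctual as Swiss trains);
• $E[W] = \mu$ is correct if the variance of inter-arrival times satisfies $\sigma^2 = \mu^2$, i.e. $\gamma = 1$ (TTC seems to be worse than this).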
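The role of variation can be checked by simulation. The sketch below is illustrative only: it assumes gamma distributed inter-arrival times (a choice made here, not taken from the lecture) so that the coefficient of variation can be set freely, and the function name `simulate_inspection` is hypothetical. It compares the simulated mean of X(B) and of W with $\mu(1+\gamma^2)$ and $\mu(1+\gamma^2)/2$.

```python
import bisect
import random


def simulate_inspection(mean, cv, n_buses=200_000, n_inspections=20_000, seed=1):
    """Return (mean covering interval, mean wait) seen by random inspectors.

    Inter-arrival times are gamma distributed (an assumption of this sketch)
    with the given mean and coefficient of variation cv > 0.
    """
    rng = random.Random(seed)
    shape = 1.0 / cv ** 2          # a gamma with shape k has CV = 1 / sqrt(k)
    scale = mean / shape           # so that shape * scale = mean
    arrivals, t = [], 0.0
    for _ in range(n_buses):
        t += rng.gammavariate(shape, scale)
        arrivals.append(t)

    total_interval = total_wait = 0.0
    for _ in range(n_inspections):
        s = rng.uniform(arrivals[0], arrivals[-1])   # random inspection time
        i = bisect.bisect_right(arrivals, s)         # index of the next bus
        i = min(max(i, 1), n_buses - 1)              # guard the two end points
        total_interval += arrivals[i] - arrivals[i - 1]   # X(B)
        total_wait += arrivals[i] - s                     # W
    return total_interval / n_inspections, total_wait / n_inspections


if __name__ == "__main__":
    mu = 10.0
    for cv in (0.1, 0.5, 1.0, 1.5):
        x_b, w = simulate_inspection(mu, cv)
        print(f"cv={cv}: mean X(B) ~ {x_b:.2f} (theory {mu * (1 + cv ** 2):.2f}), "
              f"mean W ~ {w:.2f} (theory {mu * (1 + cv ** 2) / 2:.2f})")
```

With a small coefficient of variation the mean wait should be close to half the mean inter-arrival time; with a coefficient of variation above 1 it should exceed the mean inter-arrival time itself.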

The Waiting Time Paradox and bias in observational data
A different way of looking at the same problem:
• Occurrence of the initiating event has a constant rate (i.e. the time of occurrence is uniform over any given time interval).
• The duration X is independent of the random process that generates the initiating event.
• The duration X is iid. with p.d.f. $f(x)$ and mean $\mu$.
• At a snapshot, only those who have experienced the initiating event but not the subsequent event are included in a sample, with observed duration X(B). A sample containing only observations of X(B) is called a “prevalence cohort”.
• The distribution from a prevalence cohort corresponds to the p.d.f. $x f(x)/\mu$ and mean $\mu(1+\gamma^2) \ge \mu$ (with $\gamma$ the coefficient of variation of X), because those with longer durations have a greater chance of being included in the data.

Observational data arising in a prevalence cohort
Under equilibrium: the incidence of the initiating event occurs at a constant rate.
Assume the duration X is iid. with p.d.f. $f(x)$ and mean $\mu$. X(B) has p.d.f. $x f(x)/\mu$; W has p.d.f. $[1 - F(x)]/\mu$, where $F(x)$ is the c.d.f. of X.
• The observed duration is length-biased. Naïve estimation of the distribution of X (e.g. incubation time, survival time, etc.) based on such prevalence cohort data leads to over-estimation.
• Size-biased estimation for the prevalence estimate: prevalence = # or % { individuals who have experienced the initial event but not the subsequent event }. Under equilibrium, prevalence = incidence x duration. The length-bias in observed duration leads to “size-bias” in sampled prevalence, i.e. the sample is size-biased in favor of cohorts with larger prevalence.
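For reference, the two densities on this slide and their means follow from a short calculation. The notation ($\sigma^2$ for the variance of X and $\gamma = \sigma/\mu$ for its coefficient of variation) is assumed here for illustration; the snapshot is taken to be uniformly distributed within the length-biased interval.

```latex
\begin{aligned}
f_{X^{(B)}}(x) &= \frac{x\,f(x)}{\mu}, &
E\big[X^{(B)}\big] &= \frac{E[X^2]}{\mu} = \frac{\sigma^2+\mu^2}{\mu} = \mu\,(1+\gamma^2),\\
f_W(w) &= \int_w^\infty \frac{1}{x}\,\frac{x\,f(x)}{\mu}\,dx = \frac{1-F(w)}{\mu}, &
E[W] &= \tfrac12\,E\big[X^{(B)}\big] = \frac{\mu\,(1+\gamma^2)}{2}.
\end{aligned}
```

The factor $1/x$ in the second line is the uniform density of the snapshot position within an interval of length $x$.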

Waiting Time Paradox in disease screening via repeated testing
• Replacing buses with repeated testing: the inter-testing intervals X are iid., with mean = $\mu$.
• Replacing the “inspector” by sero-conversion, which, under equilibrium, occurs at a constant rate, such that given any time interval (between two tests) in which a sero-conversion occurs, the sero-conversion time is uniformly distributed in the interval.
X(B) = duration from the last (neg.) test to the next (pos.) test covering a sero-conversion.
X(B) has the length-biased distribution, with mean $\mu(1+\gamma^2)$, where $\gamma$ is the coefficient of variation of the inter-testing intervals.
The average waiting time from sero-conversion to the next (pos.) test: $E[W] = \mu(1+\gamma^2)/2$.
If we add an average “window period” from infection to sero-conversion, the prevalence of infected but not yet tested (the queue) is prev. = incidence x mean duration, with mean duration = mean window period + $\mu(1+\gamma^2)/2$.
Keeping the incidence and the window period unchanged, the testing strategy (through $\mu$ and $\gamma$) determines the queue (under equilibrium conditions).

Waiting Time Paradox in disease screening via repeated testing
The prevalence of infected but not yet tested (queue), under equilibrium conditions:
• Each infected but not yet tested individual (in the queue) may be associated with a cost c to society.
• Each test is associated with a cost κ. Both costs are determined according to context.
Objective: under different scenarios of infection incidence, determine the optimal testing frequency so that the queue of infected but untested individuals is reduced to satisfy a cost-effectiveness criterion.
• Generally, the larger the incidence rate, the more cost-effective more frequent testing becomes.
• Cost-effectiveness is compromised if there is large variation between inter-testing intervals or among individuals.
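One way to make the trade-off concrete is the toy cost model below. It is only a sketch under assumptions that are not in the lecture: the total cost per unit time is taken to be c times the queue size plus κ times the number of tests per unit time, `n_tested` (the number of people under repeated testing) is a hypothetical parameter, and the queue uses the length-biased waiting time $\mu(1+\gamma^2)/2$ plus a window period.

```python
import math


def queue_size(incidence, mean_interval, cv, window=0.0):
    """Equilibrium number infected but not yet tested: incidence x mean duration."""
    mean_wait = mean_interval * (1.0 + cv ** 2) / 2.0   # length-biased waiting time
    return incidence * (window + mean_wait)


def cost_rate(mean_interval, incidence, cv, c, kappa, n_tested, window=0.0):
    """Hypothetical total cost per unit time: queue cost plus testing cost."""
    queue_cost = c * queue_size(incidence, mean_interval, cv, window)
    testing_cost = kappa * n_tested / mean_interval     # tests per unit time x kappa
    return queue_cost + testing_cost


def optimal_interval(incidence, cv, c, kappa, n_tested):
    """Mean inter-testing interval that minimises cost_rate.

    Obtained by setting the derivative of cost_rate with respect to
    mean_interval to zero; the window period does not affect the optimum.
    """
    return math.sqrt(2.0 * kappa * n_tested / (c * incidence * (1.0 + cv ** 2)))


if __name__ == "__main__":
    for incidence in (10.0, 40.0):
        for cv in (0.0, 1.0):
            m = optimal_interval(incidence, cv, c=100.0, kappa=5.0, n_tested=1000.0)
            total = cost_rate(m, incidence, cv, 100.0, 5.0, 1000.0)
            print(f"incidence={incidence}, cv={cv}: interval={m:.2f}, cost={total:.1f}")
```

Under this toy model a higher incidence shortens the optimal interval, and a larger variation between inter-testing intervals raises the minimum achievable cost, in line with the two remarks on the slide.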

The Waiting Time Paradox as seen in R0 formulation
N = # of infections produced by a typical infectious individual when seeded into an infinitely large susceptible population.
An infected individual produces new infections according to a counting process with intensity $k(x)$.
R0 = mean value of N, can be expressed as $R_0 = \int_0^\infty k(x)\,dx$.
The premises: if r is the Malthusian number describing the early exponential growth, such that $\int_0^\infty e^{-rx} k(x)\,dx = 1$.
Re-write: $k(x) = R_0\, g(x)$, then $g(x)$ is a p.d.f. satisfying $1/R_0 = \int_0^\infty e^{-rx} g(x)\,dx$ = Laplace transform of $g(x)$.
Postulate: g(x) is the p.d.f. of a well defined random variable with epidemiologic meaning.
(Ref: Wallinga and Lipsitch, 2007; Heesterbeek and Roberts, 2007)

The Waiting Time Paradox as seen in R0 formulation
Postulate: g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of a well defined random variable with epidemiologic meaning.
If true, the tasks:
1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate r separately (usually via curve fitting);
5. evaluate the Laplace transform, analytically or numerically.
In the above, there is no assumption about the integrand k(x), i.e. the instantaneous rate at time x in the model. However, in order to assess 1., we put it into a structured model framework: the SEIR.

The Waiting Time Paradox as seen in R0 formulation
Postulate: g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of a well defined random variable with epidemiologic meaning.
Assuming R0 > 1:
• In the SIR model (S → I → R), with exponentially distributed infectious period with mean $\mu_I$, $1/R_0 = 1/(1 + r\mu_I)$ is the Laplace transform of the exponential distribution with mean $\mu_I$. In this case, g(x) is the p.d.f. of the infectious period.
• In the SEIR model (S → E → I → R), with both the latent period (mean $\mu_E$) and the infectious period (mean $\mu_I$) being exponentially distributed, $1/R_0 = \frac{1}{1 + r\mu_E} \cdot \frac{1}{1 + r\mu_I}$ = product of two Laplace transforms of the exponential distributions. In this case, g(x) is the p.d.f. of the sum of the latent period and the infectious period.
• Anderson and May (1991): generation time = latent period + infectious period.

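The Waiting Time Paradox as seen in R0 formulation
Postulate: g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of a well defined random variable with epidemiologic meaning.
In the SEIR model, if the latent period and the infectious period are arbitrarily distributed (with specific distributions) (Yan, 2007):
$R_0 = \dfrac{r\,\mu_I}{L_E(r)\,[1 - L_I(r)]}$,
where $\mu_E$, $\mu_I$ are the mean values of the latent and infectious periods and $L_E(r)$, $L_I(r)$ are the Laplace transforms of their distributions, including:
• $R_0 = 1 + r\mu_I$ (no latent period, exponentially distributed infectious period);
• $R_0 = (1 + r\mu_E)(1 + r\mu_I)$ (exponentially distributed latent and infectious periods);
• $R_0 = \dfrac{r\mu_I\,(1 + \gamma_E^2 r\mu_E)^{1/\gamma_E^2}}{1 - (1 + \gamma_I^2 r\mu_I)^{-1/\gamma_I^2}}$ (gamma distributed latent and infectious periods, Anderson and Watson, 1980), where $\gamma_E$, $\gamma_I$ are coefficient of variation parameters.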
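As a numerical illustration of how the growth rate r maps to R0, the sketch below evaluates $R_0 = r\mu_I / \{L_E(r)[1 - L_I(r)]\}$, where $L_E$ and $L_I$ denote the Laplace transforms of the latent and infectious period distributions. This form of the relation, the gamma parameterisation by mean and coefficient of variation, and the function names are assumptions made here for illustration; the expression reduces to the exponential special cases listed on the slide.

```python
import math


def gamma_laplace(r, mean, cv):
    """E[exp(-r X)] for a gamma distributed X with the given mean and CV."""
    if mean == 0.0:
        return 1.0                       # the stage is absent
    if cv == 0.0:
        return math.exp(-r * mean)       # fixed (non-random) duration
    return (1.0 + cv ** 2 * r * mean) ** (-1.0 / cv ** 2)


def r0_from_growth_rate(r, mean_latent, cv_latent, mean_infectious, cv_infectious):
    """R0 implied by the Malthusian number r > 0 in an SEIR-type model
    with gamma distributed latent and infectious periods (sketch)."""
    lap_latent = gamma_laplace(r, mean_latent, cv_latent)
    lap_infectious = gamma_laplace(r, mean_infectious, cv_infectious)
    return r * mean_infectious / (lap_latent * (1.0 - lap_infectious))


if __name__ == "__main__":
    r = 0.1
    print(r0_from_growth_rate(r, 0.0, 1.0, 5.0, 1.0))   # no latent period, exponential
    print(r0_from_growth_rate(r, 2.0, 1.0, 5.0, 1.0))   # exponential latent and infectious
    print(r0_from_growth_rate(r, 2.0, 0.5, 5.0, 0.5))   # less variable gamma periods
```

For the same r and the same means, changing the coefficients of variation changes the implied R0, which is why the distributional assumptions matter when R0 is inferred from an observed growth rate.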

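The Waiting Time Paradox as seen in R0 formulation
Postulate: g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of a well defined random variable with epidemiologic meaning.
In the SEIR model, if the latent period and the infectious period are arbitrarily distributed, $1/R_0 = L_E(r) \cdot \dfrac{1 - L_I(r)}{r\,\mu_I}$, where
• $L_E(r)$ is the Laplace transform of the p.d.f. of the latent period;
• $[1 - L_I(r)]/(r\mu_I)$ is the Laplace transform of the p.d.f. of W in the length-biased infectious period, from a snapshot point of view.
g(x) is the p.d.f. of (latent period + W), with mean value $\mu_E + \mu_I(1+\gamma_I^2)/2$, where $\mu_E$, $\mu_I$ are the mean latent and infectious periods and $\gamma_I$ is the coefficient of variation of the infectious period. Call it generation time:
• if $\gamma_I = 1$, consistent with that by Anderson and May (1991);
• if $\gamma_I = 0$, consistent with that by Gani and Daly (2001): mean latent period + half of the mean infectious period.
Fine (2003): the latent period + part of the infectious period ….
• not exactly, need to emphasize the length-biased infectious period;
• could be even longer than the “natural” infectious period.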
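The link between the factor contributed by the infectious period and the waiting time W can be written out as follows. The symbols $f_I$, $F_I$, $L_I$ (p.d.f., c.d.f. and Laplace transform of the infectious period) and $L_E$ (Laplace transform of the latent period) are notation assumed here for illustration; the first line is an integration by parts.

```latex
\begin{aligned}
\int_0^\infty e^{-rx}\,\frac{1-F_I(x)}{\mu_I}\,dx
  &= \frac{1}{\mu_I}\left[\frac{1}{r} - \frac{1}{r}\int_0^\infty e^{-rx} f_I(x)\,dx\right]
   = \frac{1-L_I(r)}{r\,\mu_I},\\[4pt]
\frac{1}{R_0} &= L_E(r)\cdot\frac{1-L_I(r)}{r\,\mu_I}
  \quad\Longleftrightarrow\quad
  g = f_E * f_W, \qquad f_W(x) = \frac{1-F_I(x)}{\mu_I}.
\end{aligned}
```

Since a product of Laplace transforms corresponds to a sum of independent random variables, g is the density of the latent period plus W, with mean $\mu_E + \mu_I(1+\gamma_I^2)/2$.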

The Waiting Time Paradox as seen in R0 formulation
Postulate: g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of a well defined random variable with epidemiologic meaning.
If true, the tasks:
1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate r separately (usually via curve fitting);
5. evaluate the Laplace transform, analytically or numerically.
For 1. above, g(x) is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \mu_I(1+\gamma_I^2)/2$ ($\mu_E$, $\mu_I$: mean latent and infectious periods; $\gamma_I$: coefficient of variation of the infectious period).
In the above definition, the generation time does not involve:
• another individual;
• the transmission process.
The “snapshot” may be thought of as the time of infection of an infectee in relation to the infectious period of its infector; whereas in theory, it could be a snapshot by any “inspector”.

The Waiting Time Paradox as seen in R0 formulation
g(x), satisfying $1/R_0$ = Laplace transform of $g(x)$, is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value equal to the mean latent period plus half of the mean length-biased infectious period.
Wallinga and Lipsitch (2007) refer to g(x) as the distribution of the generation interval.
Svensson (2007) made the distinction:
(i) from the infection time of the infector looking forward to the infection time of the infectee;
(ii) from the infection time of the infectee looking back to the infection time of the infector.
It seems that, if we assign the “snapshot” as the time of infection of an infectee in relation to the infectious period of its infector, then the generation interval in Wallinga and Lipsitch (2007) should be understood in the sense of (ii) in Svensson (2007). But …. there are strings attached ….

The Waiting Time Paradox as seen in R0 formulation
Svensson (2007): (ii) from the infection time of the infectee looking back to the infection time of the infector.
If we associate g(x) with the generation interval in Wallinga and Lipsitch (2007) and understand it in the sense of (ii) in Svensson (2007), there are hidden assumptions:
• The infectious period contains an infectee, hence it is length-biased, with mean $\mu_I(1+\gamma_I^2)$ ($\mu_I$: mean infectious period; $\gamma_I$: its coefficient of variation).
• The infection times of infectees must be exchangeable so that any randomly chosen infectee (if more than one), while looking back, gives the same distribution for W.
• The infection times of infectees must be uniformly distributed in the (length-biased) infectious period so that W has p.d.f. $[1 - F_I(x)]/\mu_I$ with mean $\mu_I(1+\gamma_I^2)/2$, i.e. symmetry.
• The system is at equilibrium so that infectors arrive at a constant rate.
Things that I don’t understand:
• W with mean $\mu_I(1+\gamma_I^2)/2$ is a valid interpretation at equilibrium;
• r is the Malthusian number, far from equilibrium.
This puzzle further leads to the observation problem: can we collect data at the early phase of an outbreak and use the above theory?

A generalization of the Waiting Time Paradox: left-truncation
Moving away from the Waiting Time Paradox toward general observation bias, without being in equilibrium.
• The initial event occurs over time t following a random process with a given intensity.
• Individuals who have experienced the initial event are enrolled at a later time.
• Individuals are followed until an endpoint event takes place.
Previously (the Waiting Time Paradox), we
• assumed equilibrium;
• called “enrolment” a “snapshot”, assumed to be uniformly distributed in any fixed time interval;
• had the time from the initial event to the observed endpoint, X(B), following the p.d.f. $x f(x)/\mu$;
• had the time from the initial event to enrolment, E, following the distribution with p.d.f. $[1 - F(x)]/\mu$.
Generalization:
• enrolment is random, independent of the random process of the initiating event;
• the intensity of the initiating event is not constant;
• this observation scheme is subject to left-truncation.
The same issue: the observed X(B) is length-biased.

A generalization of the Waiting Time Paradox: left-truncation
The objective: estimating the distribution of the duration X between the two events.
The observed X(B) is length-biased: in favor of longer durations.
Naïve analyses: treating X(B) as if it were X from designed experiments leads to over-estimation.
Not-so-naïve method through conditioning: X(B) arises from the conditional distribution of X given X > E, because the eligibility for enrolment is not having experienced the endpoint event at the time of enrolment. Statistical methods are based on the conditional distribution $f(x)/[1 - F(e)]$, $x > e$, rather than $f(x)$, where $F$ is the c.d.f. of X and e is the observed time from the initial event to enrolment.
Such a method provides a length-bias adjusted estimation, but is only able to estimate part of the distribution. Some information is lost in the data, unless the process generating the initiating event is explicitly modelled.
Call for joint modelling: a transmission model for how the epidemiology generates data and a statistical model for how the data are observed.
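The contrast between the naïve and the conditioning approach can be sketched by simulation. The example below makes strong simplifying assumptions that are not in the lecture: equilibrium (initial events uniform in time), exponentially distributed durations, and a fully parametric conditional likelihood; all function names are hypothetical. Under the exponential model the conditional likelihood $f(x)/[1 - F(e)]$ has a closed-form maximiser, the average of x - e.

```python
import random


def prevalence_cohort(mean_duration, n_events=200_000, window=20.0, seed=2):
    """Simulate a snapshot of durations whose initial events were uniform in time.

    Returns (x, e) pairs for individuals still in progress at the snapshot:
    x is the full duration X(B), e the time from the initial event to enrolment.
    Exponential durations are an assumption of this sketch.
    """
    rng = random.Random(seed)
    horizon = window * mean_duration
    sample = []
    for _ in range(n_events):
        e = rng.uniform(0.0, horizon)                # time since the initial event
        x = rng.expovariate(1.0 / mean_duration)     # duration X
        if x > e:                                    # endpoint not yet reached: enrolled
            sample.append((x, e))
    return sample


def naive_mean(sample):
    """Treat the observed X(B) as if they were an iid. sample of X."""
    return sum(x for x, _ in sample) / len(sample)


def conditional_mean_exponential(sample):
    """MLE of the mean under the conditional likelihood f(x) / [1 - F(e)], x > e.

    For the exponential model this reduces to the average of x - e.
    """
    return sum(x - e for x, e in sample) / len(sample)


if __name__ == "__main__":
    cohort = prevalence_cohort(mean_duration=5.0)
    print("enrolled:", len(cohort))
    print("naive mean:", round(naive_mean(cohort), 2))
    print("conditional mean:", round(conditional_mean_exponential(cohort), 2))
```

The parametric assumption is what allows the whole distribution to be recovered here; without it, conditioning alone identifies only part of the distribution, as stated on the slide.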

Right-truncation: length-bias in favour of observing short durations
Previously: left-truncation, in favour of observing long durations.
Example:
• Initiating event = diagnosis of a disease
• Subsequent event = the disease is reported and entered into a registry
• Objective: assessment of the reporting delay X.
Very common in surveillance: the inclusion criterion is the occurrence of the subsequent event prior to the time of data analysis.
Bias: the case has to be reported before the time of analysis; systematically observing data with short delay.

Right-truncation: length-bias in favour of observing short durations
Example:
• Initiating event = diagnosis of a disease
• Subsequent event = the disease is reported and entered into a registry
• Objective: assessment of the reporting delay X.
Bias: systematically observing data with short delay.
Reporting delay is a very important issue in all disease surveillance.
[Figure: Annual AIDS incidence in Canada as seen in 1992 and 1999. Bars: cases as reported by Dec. 31, 1992 and as reported by Dec. 31, 1999. Line: reporting delay adjusted trend. Vertical axis from 0 to 2500.]
The reporting delay adjusted trend is based on 1992 data, presented in April 1993 along with the AIDS surveillance report.
The gap between the reported (bars) and projected (lines) trends implied a long delay between diagnosis and data entry (into the national registry).

Adjustment of reporting delay
• N(t) = # cases diagnosed at time t (to be estimated)
• # cases diagnosed at time t and reported by time C (as a proportion of N(t))
All we need to do is to estimate this proportion, which is the probability that the reporting delay X does not exceed C - t.
Then:
• Naïve analysis: median reporting delay 1.6 months; 95% completeness within 14 months.
• Not-so-naïve analysis: median delay approx. 9 months; 85% completeness within 5 years.
Naïve analysis always leads to severe under-estimation of reporting delay.
By adequately accounting for right-truncation and other administrative processes, useful tools can be developed to reflect the real-time trend and be built into the surveillance.
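The adjustment amounts to dividing each reported count by the estimated probability of having been reported by time C. The sketch below is illustrative only: the exponential delay distribution, its 9-month median (borrowed from the not-so-naïve figure on the slide purely as an example value) and the function names are assumptions, and in practice the delay distribution would itself be estimated from right-truncated data.

```python
import math


def exponential_cdf(median):
    """Return the c.d.f. of an exponential delay with the given median (assumed form)."""
    rate = math.log(2.0) / median

    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

    return cdf


def adjust_counts(reported, diagnosis_times, analysis_time, delay_cdf):
    """Estimate N(t) as reported count / P(delay <= C - t) for each diagnosis time t."""
    adjusted = []
    for n, t in zip(reported, diagnosis_times):
        p = delay_cdf(analysis_time - t)
        # No information when nothing could have been reported yet.
        adjusted.append(n / p if p > 0 else float("nan"))
    return adjusted


if __name__ == "__main__":
    cdf = exponential_cdf(median=9.0)          # delay measured in months
    months = [0, 12, 24, 30]                   # diagnosis times
    reported = [90, 75, 37, 5]                 # hypothetical counts reported by month 36
    for t, n_hat in zip(months, adjust_counts(reported, months, 36, cdf)):
        print(f"diagnosed at month {t}: adjusted count ~ {n_hat:.1f}")
```

The most recent diagnosis periods receive the largest correction, which is why unadjusted surveillance curves tend to show an artificial decline near the time of analysis.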

Other examples of reporting delay in disease surveillance: did we learn the lesson?
• SARS outbreak in Toronto, 2003: premature declaration that SARS was over. Recall the strong protest against WHO’s travel advisory on Apr. 23?
• H1N1 during the spring of 2009: May 14: “Is the worst over?” As it turned out, it was not.

Right-truncation: length-bias in favour of observing short durations
Another example:
• Initiating event = HIV infection via transfusion
• Subsequent event = onset of AIDS illnesses
• Objective: estimate the incubation period X.
• Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion.
Based on the above data, at C = June 30, 1988:
• naïve estimation: as if the data were from a random experiment, iid.
• not-so-naïve: right-truncated data from the conditional distribution; uncertainty subject to a constant of proportionality.
[Figure: naïve versus not-so-naïve estimates of the incubation period distribution, with the 0.5 level marked.]
Brookmeyer and Gail (1994):
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning).
Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median;
• without knowing the AIDS incidence, there is a loss of information in the data, so that one can only estimate, up to a constant of proportionality, the early part of the incubation period distribution.

Right-truncation: length-bias in favour of observing short durations
Another example:
• Initiating event = HIV infection via transfusion
• Subsequent event = onset of AIDS illnesses
• Objective: estimate the incubation period X.
• Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion.
Two analyses:
• naïve estimation: as if the data were from a random experiment, iid.
• not-so-naïve: right-truncated data from the conditional distribution.
Brookmeyer and Gail (1994):
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning).
Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median.
Lui, et al. (1986): C = April 30, 1985
• conditioning: mean years
• naïve analysis: mean years
Lagakos, et al. (1988): C = June 30, 1986
• conditioning: median 8.5 years
By the 1990s, when large scale multi-center cohort data became available, it turned out that the median incubation period is approximately 10 years.

Right-truncation: length-bias in favour of observing short durations
The underlying disease trend matters.
Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median;
• without knowing the incidence of the initiating event, there is a loss of information, so that one only estimates, up to a constant of proportionality, the early part of the duration distribution.
The above statements are very important for observational data of an emerging disease with respect to retrospectively ascertained durations (incubation period, serial interval, etc.): later analyses suggest longer durations than earlier analyses.
This calls for jointly modelling the disease process (e.g. a transmission model) and the data generation process.
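The dependence on the analysis time C and on the disease trend can be sketched by simulation. The set-up below is hypothetical and not from the lecture: initiating events occur on (0, C) with density proportional to exp(r t), mimicking an emerging epidemic; durations are gamma distributed with shape 2 and scale 5 (true mean 10); only cases whose subsequent event occurs before C are observed.

```python
import math
import random


def naive_mean_duration(analysis_time, growth_rate=0.2, shape=2.0, scale=5.0,
                        n_events=200_000, seed=3):
    """Naive mean of durations observed under right-truncation at analysis_time."""
    rng = random.Random(seed)
    top = math.exp(growth_rate * analysis_time) - 1.0
    observed = []
    for _ in range(n_events):
        # Inverse-c.d.f. draw of the initiating time, density proportional to exp(r t).
        t = math.log(1.0 + rng.random() * top) / growth_rate
        x = rng.gammavariate(shape, scale)       # duration X, true mean shape * scale
        if t + x <= analysis_time:               # subsequent event seen before C
            observed.append(x)
    return sum(observed) / len(observed)


if __name__ == "__main__":
    for c in (5.0, 10.0, 20.0):
        print(f"C = {c}: naive mean duration ~ {naive_mean_duration(c):.2f} (true mean 10)")
```

In this sketch the naive mean grows with C but stays well below the true mean while incidence keeps growing, because recent initiating events dominate the sample and only their short durations are seen.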

A general topic related to statistics issues and disease models
Every deterministic compartment model, such as SIR, has a stochastic counterpart.
Example: stochastic versus deterministic models. In these graphs, R0 = 1.5 and n = population size.
Deterministic ↔ what must happen:
• a bell-shaped epidemic curve determined by mathematical law.
Stochastic ↔ what might happen:
• even if R0 > 1, there is a positive probability (1/3 in the above cases) that very few transmissions occur, followed by extinction;
• otherwise, after “simmering” for a short random period of time, it takes off;
• if n is large, the path is bell-shaped and resembles the deterministic curve, but the origin is random.

A general topic related to statistics issues and disease models
Statistical challenges in estimating parameters in transmission models:
• models are built on unobservable events (e.g. time at infection, the passing of infection from one individual to another, duration of latency (not infectious), duration of infectiousness, duration of immunity, etc.);
• data are based on observable events (e.g. clinical onset of illness, stages of illness, duration of illness, death, physical recovery, etc.);
• some seemingly “large” data (in terms of large population) arise from a single (or few) realization of a random phenomenon (i.e. an outbreak) … extremely small “sample size”;
• observational data are subject to length-bias, size bias, missing values, etc.;
• large number of parameters.

Summary
• Data in infectious disease studies are often observational, not following the sequence of hypothesis, design of experiment, repetition and randomization. Data in most introductory statistics textbooks are from repetition of random experiments by design. Naïve adaptation of these models and methods may lead to severe bias.
• Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiological aspects (the transmission process), statistical models focus on the stochastic mechanism from which data are generated (the data generating process). Although the transmission process is part of the data generation mechanism, the observer sees data through additional filters, such as data management and administrative processes.
• The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) designed to capture the length-bias, important information is still lost without modelling the underlying transmission process. The gap is identified. Still lots of work needs to be done.
• Conversely, without statistical modelling of the data generation process, important input parameters in mathematical models may be severely biased when based on naïve (or not-so-naïve) statistical methods. Ditto.
