Reasonable Safeguards against Contamination in mtDNA Testing, And Some Database Issues

Dr. Frederika Kaestle (Depts. of Anthropology and Biology, IU Bloomington) and Dr. Jason Eshleman (Trace Genetics)

Ancient DNA and Forensic DNA Analysis Similarities: • Template DNA unknown • Often limited template • Template often highly degraded • Far from ideal sources for biological research

Ancient DNA and Forensic DNA Analysis Differences: • Ancient DNA still largely limited to mtDNA analysis. • Ancient DNA analysis is rarely searching for an exact match. • Audience familiarity with genetics (justice system vs. academic community) differs significantly. • Level of skepticism remains high with ancient DNA. • Hypothesis testing is overt

Limitations of Ancient DNA • Minimal template. • PCR inhibitors co-extracted. • Sources have often been heavily handled. • The property of mtDNA that makes it possible to analyze (high copy number) likewise makes contamination a more significant problem.

Contamination Sources • Lab-ware and reagents come ‘pre-contaminated’ from the manufacturer • PCR tubes are particularly notorious (Schmidt et al. [1995] estimate a high rate of mtDNA contamination in lab disposables) • Taq (the enzyme that catalyzes PCR) has also been shown to be contaminated with mtDNA • Other reagents shown to be contaminated include nucleotides, buffer, primers

Contamination Sources • “Sample” surface • Contamination may have existed prior to sample (e.g. bone, hair) being deposited • Contamination may have occurred after deposition but before collection • Contamination may have occurred during the collection of the sample • Contamination may have occurred during the storage and transport of the sample • Contamination may have occurred at the laboratory after sample delivery

Contamination Sources • Carryover PCR • Contamination may occur due to residues from previous PCR amplifications in the lab remaining on lab ware, lab surfaces, lab clothing, lab equipment, lab air • This is particularly problematic due to the high copy number of the PCR amplicons

Contamination Sources • People • With or without access to facilities • Lab personnel shed DNA throughout the day • Lab personnel carry DNA from others into the lab every day on the surfaces of their clothing and body

Identifying contamination (or “how to have your paper rejected upon submission”) • Does sample match lab personnel? • But you can’t rule out everyone. Identifying contamination requires some serendipity. • Understand negative controls are at times insufficient • Low level in background can be mistaken for sample. • Neg. control might not be contaminated, but sample is (e.g. PCR tube) • The sequence doesn’t “make sense.” • But now we run the risk of only finding what we’re looking for.

What IS contamination? • Unwanted DNA • Why is it unwanted? • Analyst did not intend to extract it • DNA is DNA • (It makes for a messy story? The study would be so much nicer if we didn’t find it?) • Contamination does not answer the question at hand. • But what’s the question?

What Questions Can We Answer? • Are two samples different? • What can we infer from this? • Not from the same source (but remember heteroplasmy, and possibility of contamination) • Are two samples the same? • What can we infer from this? • MIGHT be from the same source (identity by descent, possibility of contamination) • Note that anthropologists often are asking questions about FREQUENCIES of mtDNA types, not about single samples • Allows us to make population-level inferences

What IS contamination? • Thus, one way to view contamination is as DNA that leads to a false inference *if* we do not know our data are compromised.

So How Do We Detect Contamination? • Protocols designed to detect • Negative controls • Reagent/Extraction blank • Amp Negative/No Template Control • Controls alert us to the presence of DNA in a tube • We infer what this means vis a vis our samples • Comparison to sequences of probable contaminating individuals • Lab personnel • Excavators (evidence collection team) • Museum Curators and Researchers (staff with access to evidence storage)
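The comparison to probable contaminating individuals can be sketched as a lookup against an elimination database. This is only a minimal illustration in Python: the roles and haplotypes below are hypothetical, and a haplotype is represented simply as a set of differences from the reference sequence.

```python
# Hypothetical elimination database: role -> differences from the reference sequence.
ELIMINATION_DB = {
    "lab analyst": frozenset({"16223T", "16362C"}),
    "excavator": frozenset({"16189C", "16217C"}),
    "museum curator": frozenset({"16224C", "16311C"}),
}

def possible_contaminators(sample, database=ELIMINATION_DB):
    """Return everyone in the database whose haplotype is identical to the sample's."""
    sample = frozenset(sample)
    return sorted(name for name, hap in database.items() if hap == sample)

hits = possible_contaminators({"16189C", "16217C"})
print(hits)
```

An empty result does not show that a sample is clean. It only shows that none of the people who happen to be in the database match.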

Case Study: contamination and multiple inferences • 1990- bones and teeth found in the desert • 1996- bones identified as teenage boy, missing in 1979 • 2002- mtDNA identifies match between bones and boy’s mother • NEGATIVE CONTROLS CLEAN THROUGHOUT

Initial mtDNA results • It is *possible* that the bone was really from the missing boy • If this is a match there is also contamination • This is still merely a suggestive result as there are at least four 2-source mixtures that can produce this result • Result requires confirmation
Mother: 16224, 16287, 16311
Bone: 16224T/C, 16287C/T, 16311T/C
Conclusion: mixture, possible match
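The count of "at least four" two-source mixtures can be checked by enumeration. The sketch below is in Python and rests on simplifying assumptions that are not in the original slides: each mixed call is one reference base plus one variant base, the variant is the one the mother carries, and the two sources agree at every other position.

```python
from itertools import product

MIXED_POSITIONS = (16224, 16287, 16311)   # positions where the bone showed two bases
MOTHER = frozenset(MIXED_POSITIONS)        # the mother differs from the reference at all three

def two_source_mixtures(positions):
    """Enumerate unordered pairs of haplotypes that differ at every mixed position."""
    pairs = set()
    for flags in product((False, True), repeat=len(positions)):
        # Source A carries the variant wherever its flag is set.
        source_a = frozenset(p for p, has_variant in zip(positions, flags) if has_variant)
        # Source B must carry the other base at each mixed position.
        source_b = frozenset(positions) - source_a
        pairs.add(frozenset({source_a, source_b}))
    return pairs

mixtures = two_source_mixtures(MIXED_POSITIONS)
with_mother = [pair for pair in mixtures if MOTHER in pair]
print(len(mixtures), len(with_mother))   # prints "4 1"
```

Under these assumptions only one of the four candidate mixtures contains the mother's type, which is why the initial result is suggestive rather than confirmatory.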

Subsequent mtDNA Results
Mother: 16224, 16287, 16311
2nd extraction: ND (extraction failed)
3rd extraction: 16085, 16111, 16223, 16257, 16261, 16286, 16311
4th extraction: 16082A/C, 16183A/C, 16189T/C, 16217T/C, 16223, 16290, 16291, 16319, 16362
5th extraction: 16183A/c, 16223, 16319G/a, 16325C/t, 16362T/C
6th extraction: 16183, 16187, 16189, 16217
7th + 8th extractions: 16224, 16287, 16311

mtDNA Results
Mother: 16224, 16287, 16311
Bone: 16224T/C, 16287C/T, 16311T/C
7th + 8th extractions: 16224, 16287, 16311
Match confirmed!

Timeline
2nd extraction (no result): 7/19/02
3rd extraction (no match): 8/6/02
4th extraction (no match): 8/13/02
5th extraction (no match): 8/22/02
6th extraction (no match): 9/3/02
7th extraction (matching): 9/10/02
8th extraction (matching): 9/28/02

Timeline
2nd extraction (no result): 7/19/02
3rd extraction (no match): 8/6/02
4th extraction (no match): 8/13/02
5th extraction (no match): 8/22/02
6th extraction (no match): 9/3/02
New references: 9/10/02
7th extraction (matching): 9/10/02
8th extraction (matching): 9/28/02
Benchnotes reveal 9/10 references and extractions performed at same time by same analyst.
Extractions 6, 7 and 8 performed on one bone sample.
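The pattern in the timeline can be made explicit with a short date check. This Python sketch uses the dates listed above, with two-digit years read as 2002; the function names are illustrative only.

```python
from datetime import date

# Extraction number -> date, as listed in the timeline (all 2002).
EXTRACTIONS = {
    2: date(2002, 7, 19), 3: date(2002, 8, 6), 4: date(2002, 8, 13),
    5: date(2002, 8, 22), 6: date(2002, 9, 3), 7: date(2002, 9, 10),
    8: date(2002, 9, 28),
}
MATCHING = {7, 8}                          # extractions reported as matching the mother
REFERENCE_HANDLED = date(2002, 9, 10)      # new reference samples processed

def on_or_after(extractions, reference_date):
    """Extractions performed once the reference sample was in the lab."""
    return sorted(n for n, d in extractions.items() if d >= reference_date)

def same_day(extractions, reference_date):
    """Extractions performed on the very day the reference was handled."""
    return sorted(n for n, d in extractions.items() if d == reference_date)

at_risk = on_or_after(EXTRACTIONS, REFERENCE_HANDLED)
print(at_risk, same_day(EXTRACTIONS, REFERENCE_HANDLED))   # prints "[7, 8] [7]"
print(set(at_risk) == MATCHING)                            # prints "True"
```

Every matching result dates from the day the new references were handled or later, which is the overlap the benchnotes point to.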

Inferences? • Possible that bones really are from missing child. • Possible that errant handling contaminated bone with reference sample. • CERTAIN that improper lab practices were used. • Contamination existed, and was ignored when it did not fit with the desired result.

Contamination Control Tools of the aDNA Trade
Keep it Clean • Surface decontamination of bone helps (Bouwman et al. 2006) • DNase I (Eshleman and Smith 2001) cleanup of reagents and tubes • Positive Pressure HEPA-filtered air in the lab • Regular UV-irradiation of surfaces • Controlled and limited access to the lab • Dedicated and disposable laboratory clothing and shoes
Prevent Carryover • Uni-directional travel between extraction and PCR laboratories • Use of dUTP (Uracil) and pre-digestion of subsequent PCR reactions
(and ideally) Independent confirmation (in temporally separate extraction/amplification procedures and possibly at another laboratory) • Split the samples first! Why analyze contamination twice?

Kennewick Man Case Study • 3 independent ancient DNA laboratories, utilizing standard contamination controls • 3 independent samples of an ancient Native American individual discovered near Kennewick, WA • Results after multiple extraction and amplification attempts (all negative controls clean): • Lab 1: multiple failures at amplification, followed by sequence identical to lab director • Lab 2: multiple failures at amplification, followed by sequence identical to student who had not been in the ancient DNA laboratory (or in town) for approximately 2 years • Lab 3: multiple failures at amplification, followed by sequence identical to lab manager who had never entered the ancient DNA laboratory

Neanderthal Case Study • Neanderthal remains had been curated in a European museum for several years • Ancient DNA laboratory personnel extracted DNA from the tooth of a Neanderthal (39,900 years old), using standard precautions • 2 independent extractions performed • Both sequences identical to each other • Neanderthal? • Sequences compared to potential contaminating individuals • Sequence identical to paleontologist who had studied remains extensively

Reality Check There is no magic bullet • Contamination is a reality in DNA work. • Negative controls are an indicator, not a solution, not a guarantee. • Thinking clearly, asking the right questions and posing the alternatives is essential.

mtDNA Database • Assuming you have mtDNA results, what do they mean? • Previous presenters discussed many issues associated with the mtDNA database, but I would like to concentrate on an issue that has emerged from the majority of anthropological genetic research on mtDNA

Relevant mtDNA Basics • mtDNA inherited maternally, and does not recombine – distribution of sequences is determined by movement of females • mtDNA has very fast mutation rate compared to nuclear DNA – new mutations crop up all the time • This creates a situation in which new mtDNA lineages are very rare, and generally of limited distribution, and lineages in general are not randomly distributed

Relevant Anthropological Basics • People do not move randomly on the landscape • Tend not to move large distances • Tend to follow family and friends, members of their religion/language group/caste etc. • Tend to follow paths of ‘least resistance’ (valleys, rivers) • Tend to follow jobs/game/other economically important resources • Are occasionally moved against their will

African American Example • Africans have highest level of mtDNA variation in the world, and highest level of rare mtDNA sequences • Slaves transported non-randomly to Americas • South Carolina plantation owners grew mostly rice, so preferred slaves from West Africa who knew how to grow rice • Virginia plantation owners had to deal with malaria from mosquitoes that thrived in surrounding swamps, so preferred slaves from the “Gold Coast” of Africa who were resistant to malaria • Louisiana slave owners tended to purchase slaves from Portuguese and French traders, who took slaves from Angola

African American Example • During the “Great Migration” (1910-1930), large numbers of African Americans moved out of the rural south (for better jobs in light of WWI and a boll weevil crop infestation) • Those from Mississippi, Alabama, Louisiana followed the Mississippi River north to large cities of the Midwest • Those from Carolinas and Virginia followed the coastline to D.C., Philly, NYC. • But the majority of African Americans remain in the SE even today.

African American Example • In addition to actual population movement, the genetic make-up of African American groups across the US varies significantly in the level of admixture with non-Africans • Level of admixture with European Americans is higher in the West, and in large northern cities (likely due to different social mores) • Level of admixture with Native Americans is higher in the West and Southwest (probably due to a combination of social mores and the higher number of Native Americans resident in the West)

mtDNA Database • Does the database take into account this regional variation in African American mtDNA sources? • NO • Of ~1148 samples, approximately 800 are from Houston • The samples overall are convenience samples from “blood banks, paternity-testing laboratories, laboratory personnel, clients in genetic-counseling centers, law-enforcement officers, and people charged with crimes” (NRC II (1996) supra note 5, at 30), and are thus in no way randomized with regard to geographic origin or census data on the distribution of African Americans in the US.

mtDNA database • The other subsets of the database suffer from the same problem of non-random sampling • E.g. all of the Native American samples are Apache and Navajo. You could not pick a more non-representative tribal distribution if you tried, other than one composed entirely of Eskimo.

Native American mtDNA variation

mtDNA database • ‘Asian’ subset is also not representative of the geographic origin of Asian Americans, based on the 2000 Census

But I thought they validated it? • For the most part, validation of the various subsets of the mtDNA database involved confirming that the major haplogroups of mtDNA variation present in the ‘source’ population were also present in their American ‘relatives’, sometimes in somewhat similar frequencies • Ummmm. What’s a haplogroup?

Haplogroups? • mtDNA lineages can be divided into major subgroups (haplogroups) based on shared ancestry resulting in shared sets of mutations • Within each haplogroup are individual haplotypes of unique combinations of mutations (this is what is utilized in the inclusion statistics)

Analogy • Haplogroups are like last names • Validating the database is basically like asserting that there are ‘Smiths’ in Europe, and there are ‘Smiths’ here too, so we’ve got a representative sample • BUT the FREQUENCY of ‘Smiths’ will vary across the US, as well as across Europe, just as the frequencies of haplogroups do. So just asserting that there are ‘Smiths’ in both places, and even showing that their AVERAGE frequencies are the same across the continents, doesn’t tell you much about how representative your sample is • To identify a person more uniquely, you need their whole name (their haplotype) • The frequencies of Wilmut G. Smith and Jesus A. Smith are going to be significantly different across the US (and across Europe) • Because the haplotype is what the inclusion statistic is based on, this is really the information we need
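The analogy can be illustrated with a toy calculation in Python. The two regional samples are entirely hypothetical, and the haplotype labels ("L2-a" and so on) are placeholders rather than real sequence types. The sketch shows two samples with identical haplogroup frequencies but very different haplotype frequencies.

```python
from collections import Counter

# Hypothetical samples: one (haplogroup, haplotype) pair per individual.
REGION_A = [("L2", "L2-a")] * 6 + [("L2", "L2-b")] * 2 + [("L3", "L3-a")] * 2
REGION_B = [("L2", "L2-a")] * 1 + [("L2", "L2-b")] * 7 + [("L3", "L3-a")] * 2

def frequencies(sample, level):
    """Relative frequencies at level 0 (haplogroup) or level 1 (haplotype)."""
    counts = Counter(person[level] for person in sample)
    return {name: n / len(sample) for name, n in counts.items()}

same_haplogroups = frequencies(REGION_A, 0) == frequencies(REGION_B, 0)
freq_a = frequencies(REGION_A, 1)["L2-a"]
freq_b = frequencies(REGION_B, 1)["L2-a"]
print(same_haplogroups, freq_a, freq_b)   # prints "True 0.6 0.1"
```

A validation that compared only haplogroups would call these two samples equivalent, yet the frequency of the haplotype that enters an inclusion statistic differs sixfold.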

But doesn’t the 95% confidence interval take all this into account? • NO! • The calculation of a 95% CI ASSUMES that you have a random sample of the population, and that the population is not subdivided • It is TOTALLY INVALID if your sample is not random, and/or the population is subdivided
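A small numerical sketch in Python shows why the interval fails under subdivision. It uses one common way of computing an upper 95% bound from a database count: the normal approximation, plus the usual zero-count formula. The slides do not say which formula the database uses, so this is an assumption. The observed count and the regional frequency are hypothetical; only the database size of about 1148 comes from the slides.

```python
import math

def upper_95_bound(count, n):
    """Upper 95% bound on a haplotype frequency; assumes a random, unstructured sample."""
    if count == 0:
        return 1 - 0.05 ** (1 / n)                   # bound when the type is never observed
    p = count / n
    return p + 1.96 * math.sqrt(p * (1 - p) / n)     # normal approximation

DATABASE_SIZE = 1148     # approximate size quoted for the African American subset
observed = 10            # hypothetical count of one haplotype in the database
regional_truth = 0.05    # hypothetical true frequency in an under-sampled region

bound = upper_95_bound(observed, DATABASE_SIZE)
print(bound < regional_truth)   # prints "True"
```

With these made-up numbers the database bound is well below the haplotype's frequency in the under-sampled region. An inclusion statistic based on that bound would overstate the rarity of the match for anyone from that region.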

Han Chinese Example (Yao et al. 2002) • 263 unrelated individuals from 13 locations were typed for mtDNA haplotype, and sorted into haplogroups

Han Chinese Example (Yao et al. 2002) • “The comparison of the regional Han mtDNA samples revealed an obvious geographic differentiation in the Han Chinese….Hence, the grouping of different Han populations into just “Southern Han” and “Northern Han”…or the use of one or two Han regional populations to stand for all Han Chinese…does not appropriately reflect the genetic structure of the Han. Intriguingly, despite numerous historically recorded migrations and substantial gene flow across China from the Bronze Age to the present time…differences between geographic regions have been maintained” (p. 649).

mtDNA Database • We simply don’t have the data to examine the significance of phylogeographic substructure within the US populations at this point • However, it is reasonable to assume that the same kind of substructure that exists in all studies of other populations that investigate haplotype phylogeography, and many that investigate haplogroup phylogeography, is also present in the US • Thus, given the small sample size and non-random sampling strategy of the mtDNA database, it is unreasonable to assume it can provide meaningful estimates of sequence frequencies for the calculation of inclusion statistics. • For more info on these issues, see Kaestle, FA, RA Kittles, AL Roth & EJ Ungvarsky (2006) Database Limitations on the Evidentiary Value of Forensic Mitochondrial DNA Evidence. Amer. Criminal Law Rev. 43:53-88.

Conclusions? • Requirements for preventing and detecting contamination within ancient DNA research are generally more strict than in forensic applications • Even with these requirements, contamination ‘slips through’ • The federal mtDNA database is currently inadequate for use in inclusion statistics calculations