STRUCTURE

http://pritch.bsd.uchicago.edu

Riccardo Negrini

riccardo.negrini@unicatt.it

A model-based clustering method that uses molecular markers to:

- Infer the properties of populations starting from single individuals

- Demonstrate the presence of population structure

- Detect “cryptic” population structure

- Classify individuals of unknown origin

- Identify immigrants

- Identify admixed individuals

Distance-based methods

Easy to apply and visually appealing

but

- The clusters identified depend heavily on the distance measure and on the graphical representation chosen

- Difficult to assess the level of confidence in the clusters obtained

- Difficult to incorporate additional information

- More suited to exploratory data analysis than to fine statistical inference

Dice similarity and multivariate analysis

[Figure: multivariate analysis of Italian Limousine, Marchigiana and Romagnola individuals, with the distribution of Dice similarity within breeds and between breeds (dotted line) for the pairs ROM/FRI, ROM/LMI, ROM/ROM, ROM/MCG and ROM/CHI]
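For dominant band profiles such as AFLP, the Dice similarity between two individuals is commonly computed as 2a / (2a + b + c), where a is the number of bands shared and b and c are the numbers of bands present in only one of the two individuals. A minimal sketch, assuming each profile is a list of 0/1 band scores; the function name and the two profiles are illustrative and not taken from the slides:

```python
def dice_similarity(profile_a, profile_b):
    """Dice similarity 2a / (2a + b + c) between two 0/1 band profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same set of bands")
    pairs = list(zip(profile_a, profile_b))
    shared = sum(1 for x, y in pairs if x == 1 and y == 1)   # a
    only_a = sum(1 for x, y in pairs if x == 1 and y == 0)   # b
    only_b = sum(1 for x, y in pairs if x == 0 and y == 1)   # c
    denom = 2 * shared + only_a + only_b
    # Two profiles without any band carry no information; return 0.0 by convention
    return 2 * shared / denom if denom else 0.0


# Hypothetical band profiles for two animals
ind1 = [1, 1, 0, 1, 0, 1]
ind2 = [1, 0, 0, 1, 1, 1]
print(dice_similarity(ind1, ind2))  # 3 shared, 1 + 1 unshared -> 0.75
```

Computing this value for every pair of individuals gives the within-breed and between-breed distributions shown in the figure.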

STRUCTURE main assumptions:

Hardy-Weinberg (H-W) equilibrium within populations

Linkage equilibrium (LE) between loci within populations

- STRUCTURE does not assume a particular mutation process, so it can be used with the most common molecular markers (STR, RFLP, SNP, AFLP). Sequence data and Y chromosome or mtDNA haplotypes have to be recoded as a single locus with many alleles

- STRUCTURE accounts for Hardy-Weinberg disequilibrium and linkage disequilibrium (LD) by introducing population structure, and attempts to find population groupings that (as far as possible) are not in disequilibrium

STRUCTURE adopts a Bayesian approach:

Let X denote the genotypes of the sampled individuals

Let Z denote the unknown populations of origin of the individuals

Let P denote the unknown allele frequencies in all populations

Under H-W and LE, each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution

Having observed X, the knowledge about Z and P is given by the posterior distribution, which by Bayes' theorem is proportional to the prior times the likelihood:

Pr(Z, P | X) ∝ Pr(Z) Pr(P) Pr(X | Z, P)

It is not possible to compute this distribution exactly, but it is possible to obtain approximate samples of Z and P using Markov chain Monte Carlo (MCMC) and then make inferences based on summary statistics of these samples
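The sampling scheme can be illustrated with a toy Gibbs sampler for the no-admixture case: alternately draw P given X and Z, then Z given X and P. This is a minimal sketch under stated assumptions (a Dirichlet(1, ..., 1) prior on allele frequencies, a uniform prior on Z, no missing data, plain Python loops). It is not the STRUCTURE implementation, and the function names and synthetic genotypes are illustrative.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet vector by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_no_admixture(genotypes, n_alleles, k, n_iter, burn_in, seed=0):
    """genotypes[i][l] is a tuple of allele indices of individual i at locus l.

    Returns, for each individual, the fraction of post-burn-in iterations
    spent in each of the k clusters (an estimate of Pr(z_i = k | X)).
    """
    rng = random.Random(seed)
    n, n_loci = len(genotypes), len(genotypes[0])
    z = [rng.randrange(k) for _ in range(n)]          # random initial assignment
    visits = [[0] * k for _ in range(n)]
    for it in range(n_iter):
        # Step 1: sample P | X, Z from the Dirichlet posterior of each population
        p = []
        for pop in range(k):
            p_pop = []
            for l in range(n_loci):
                counts = [1.0] * n_alleles[l]         # Dirichlet(1, ..., 1) prior
                for i in range(n):
                    if z[i] == pop:
                        for a in genotypes[i][l]:
                            counts[a] += 1
                p_pop.append(sample_dirichlet(counts, rng))
            p.append(p_pop)
        # Step 2: sample Z | X, P; uniform prior, likelihood assumes H-W and LE
        for i in range(n):
            logw = []
            for pop in range(k):
                lw = 0.0
                for l in range(n_loci):
                    for a in genotypes[i][l]:
                        lw += math.log(max(p[pop][l][a], 1e-300))
                logw.append(lw)
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]
            z[i] = rng.choices(range(k), weights=weights)[0]
            if it >= burn_in:
                visits[i][z[i]] += 1
    kept = n_iter - burn_in
    return [[c / kept for c in row] for row in visits]


# Synthetic diploid data: 8 individuals fixed for allele 0, 8 fixed for allele 1
n_loci = 15
genotypes = [[(0, 0)] * n_loci for _ in range(8)] + [[(1, 1)] * n_loci for _ in range(8)]
q = gibbs_no_admixture(genotypes, [2] * n_loci, k=2, n_iter=400, burn_in=200, seed=1)
for row in q:
    print([round(v, 2) for v in row])
```

Cluster labels are arbitrary, so only the grouping of individuals is meaningful, not which cluster is called 0 or 1.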

Bayesian inferences: basic principles

- No logical distinction between parameters and data. Both are random variables: data are “observed” and parameters are “unobserved”

- The PRIOR encapsulates information about the values of a parameter before observing the data

- The LIKELIHOOD is a conditional distribution that specifies the probability of the data at any particular value of the parameters

- The aim of Bayesian inference is to calculate the POSTERIOR distribution of the parameters (the conditional distribution of the parameters given the data)
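As a worked example of prior, likelihood and posterior, consider the frequency of one allele at a biallelic locus. With a Beta(a, b) prior and a binomial likelihood, the posterior is again a Beta distribution, so the update reduces to adding the observed counts to the prior parameters. The function names and the counts below are invented for illustration.

```python
def beta_posterior(prior_a, prior_b, n_allele, n_other):
    """Conjugate update: Beta prior + binomial allele counts -> Beta posterior."""
    return prior_a + n_allele, prior_b + n_other


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Uniform prior Beta(1, 1); hypothetical sample with 7 copies of the allele out of 10
post_a, post_b = beta_posterior(1, 1, 7, 3)
print(post_a, post_b)              # 8 4
print(beta_mean(post_a, post_b))   # posterior mean 8 / 12
```

The posterior mean (8/12) lies between the prior mean (1/2) and the sample frequency (7/10), and moves towards the sample frequency as more alleles are observed.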

FORMAT OF THE DATA FILE:

Indicate learning samples

Alleles in rows

Missing data

File in plain-text (.txt) format, tab-delimited

Dominant data: code band presence (AA or Aa) as 1 and band absence (aa) as 2

Enter the second allele as missing data (-9)
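The coding rules above can be sketched as a small helper that turns one individual's band scores into the two tab-delimited rows of the data file. This is a sketch under assumptions: the column order (label, population identifier, learning-sample flag, then one column per marker) is one possible layout and must match the project settings, and the helper name and sample values are hypothetical.

```python
MISSING = -9  # missing-data code used on the slide


def dominant_rows(label, pop, popflag, bands):
    """Build the two data-file rows for one individual scored with dominant markers.

    bands holds 1 (band present), 0 (band absent) or None (not scored).
    Row 1 codes presence as 1 and absence as 2; row 2 is entirely missing.
    """
    first = [MISSING if b is None else (1 if b else 2) for b in bands]
    second = [MISSING] * len(bands)
    head = [label, str(pop), str(popflag)]
    return ["\t".join(head + [str(v) for v in row]) for row in (first, second)]


# Hypothetical individual: population 1, used as a learning sample (flag 1)
for line in dominant_rows("ROM01", 1, 1, [1, 0, None, 1]):
    print(line)
```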

BUILDING A PROJECT:

Step 1

Step 2

Step 3

Step 4

... if everything goes well:

MODELLING DECISIONS:

Ancestry model:

No-admixture model: each individual comes purely from one of the K populations. The output is the posterior probability that individual i comes from population k. The prior probability for each population is 1/K. Appropriate for discrete populations and for dominant data

Admixture model: individuals may have mixed ancestry, i.e. individual i has inherited some fraction of its genome from ancestors in population k. The output is the posterior mean estimate of these proportions

Linkage model: if, t generations in the past, an admixture event mixed the K populations, any individual chromosome is composed of “chunks” inherited as discrete units from ancestors at the time of admixture

Using prior population information: this is optional; by default STRUCTURE clusters individuals using genetic information alone. Not recommended in the exploratory, preliminary analysis of the data. PopFlag specifies which samples are to be used as learning samples to assist clustering
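Under the admixture model, the probability of observing a given allele copy in an individual is the average of the population allele frequencies weighted by that individual's ancestry proportions. A minimal sketch, with hypothetical names and values (q for the ancestry proportions of one individual, p_locus for the allele frequencies of each population at one locus):

```python
def allele_probability(q, p_locus, allele):
    """Pr(allele) for one individual: sum over populations of q[k] * p_locus[k][allele]."""
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    return sum(qk * pk[allele] for qk, pk in zip(q, p_locus))


# Hypothetical values: 2 populations, one locus with 2 alleles
p_locus = [[0.9, 0.1],   # allele frequencies in population 1
           [0.2, 0.8]]   # allele frequencies in population 2
admixed = [0.75, 0.25]   # three quarters of the genome from population 1
pure = [1.0, 0.0]        # the no-admixture case is the special case q = (1, 0)

print(allele_probability(admixed, p_locus, 0))
print(allele_probability(pure, p_locus, 0))
```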

Frequency model:

Allele frequencies independent: assumes that the allele frequencies in each population are independent draws from a distribution specified by a parameter λ. The prior says that we expect the allele frequencies in each population to be reasonably different from each other.

Allele frequencies correlated: assumes that the allele frequencies in the different populations are likely to be correlated, probably due to migration or shared ancestry. The K populations represented in the dataset have each undergone independent drift away from the ancestral allele frequencies
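The difference between the two priors can be sketched by simulating from them. The independent model is taken here as a symmetric Dirichlet with parameter λ. The correlated model is written in the form used by Falush et al. (2003, in the reading list at the end), where population k has frequencies drawn from a Dirichlet centred on the ancestral frequencies with concentration (1 - F_k) / F_k. Function names and numerical values are illustrative assumptions.

```python
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet vector by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def independent_frequencies(n_alleles, k, lam, rng):
    """Each population gets its own draw from Dirichlet(lam, ..., lam)."""
    return [dirichlet([lam] * n_alleles, rng) for _ in range(k)]


def correlated_frequencies(ancestral, f_values, rng):
    """Population k drifts from the ancestral frequencies by an amount set by F_k."""
    return [dirichlet([pa * (1.0 - f) / f for pa in ancestral], rng) for f in f_values]


rng = random.Random(42)
ancestral = [0.5, 0.3, 0.2]                         # hypothetical ancestral frequencies
print(independent_frequencies(3, 2, 1.0, rng))      # unrelated populations
print(correlated_frequencies(ancestral, [0.001, 0.2], rng))
```

A small F_k keeps a population close to the ancestral frequencies, while a larger F_k allows more drift.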

How long to run the program?

Length of burn-in period: the number of MCMC iterations needed to reach the “stationary distribution”, i.e. the point from which the states visited by the chain follow the probability distribution of interest (e.g. Pr(Z, P|X)) and no longer depend on the number of iterations or on the initial state of the variables.

Number of MCMC iterations after burn-in: the number of iterations run after the burn-in to obtain accurate parameter estimates

Loosely speaking, a burn-in of 10,000 to 100,000 iterations is usually adequate.

Good estimates of the parameters P (allele frequencies) and Q (ancestry proportions) can be obtained with fairly short runs (100,000 iterations).

Accurate estimation of Pr(X|K) needs quite long runs (10^6 iterations)
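When STRUCTURE is run from the command line rather than from the front end, the run length is set in the mainparams file through the BURNIN and NUMREPS entries, and K through MAXPOPS. A sketch that writes just these three lines; the helper name is hypothetical, and a real mainparams file needs many more entries describing the data file.

```python
def run_length_lines(k, burn_in, n_reps):
    """Return the mainparams lines that set K, the burn-in and the MCMC length."""
    if k < 1 or burn_in < 0 or n_reps <= 0:
        raise ValueError("k and n_reps must be positive, burn_in non-negative")
    return [
        f"#define MAXPOPS {k}",
        f"#define BURNIN {burn_in}",
        f"#define NUMREPS {n_reps}",
    ]


# Settings in the range suggested above for estimating P and Q
print("\n".join(run_length_lines(k=3, burn_in=10000, n_reps=100000)))
```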

How to choose k (number of populations)?

There are no rules, only an iterative approach: try different values of K, different lengths of the burn-in period and different numbers of MCMC iterations after burn-in.

- To fully resolve all the groups in your dataset, test increasing values of K until the highest likelihood values are reached

- To determine only the rough relationships, use low values of K

Be careful to:

- Run several independent runs for each K in order to verify the consistency of the estimates across runs

- Remember that population structure leads to LD among unlinked loci and to departures from H-W equilibrium. These are the signals used by STRUCTURE, but inbreeding, genotyping errors or null alleles can produce the same effect.
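One way to organise this comparison is to collect the estimated ln Pr(X|K) from each independent run, check its spread across runs for every K, and, assuming a uniform prior on K, convert the mean values into approximate posterior probabilities Pr(K|X) ∝ Pr(X|K). This is an ad hoc guide rather than a formal test; the function names and the run values below are invented for illustration.

```python
import math
from statistics import mean, stdev


def summarise_runs(ln_prob_by_k):
    """Mean and standard deviation of ln Pr(X|K) across independent runs, per K."""
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for k, v in ln_prob_by_k.items()}


def posterior_over_k(mean_ln_prob):
    """Approximate Pr(K|X) under a uniform prior on K (computed stably in log space)."""
    top = max(mean_ln_prob.values())
    weights = {k: math.exp(v - top) for k, v in mean_ln_prob.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical ln Pr(X|K) estimates from three independent runs per K
runs = {
    1: [-4356.0, -4357.0, -4355.0],
    2: [-3984.0, -3983.0, -3985.0],
    3: [-3982.0, -3981.0, -3983.0],
    4: [-3990.0, -3983.0, -3997.0],   # large spread: runs disagree at this K
}
summary = summarise_runs(runs)
post = posterior_over_k({k: m for k, (m, _) in summary.items()})
for k in sorted(runs):
    print(k, summary[k], round(post[k], 3))
```

A large standard deviation across runs for a given K is a sign that the runs were too short or that the chain is visiting different solutions.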

INTERPRETING THE OUTPUT:

The screen during a run

Number of MCMC iterations

Log likelihood of the data given the current values of P and Q

Divergence between populations calculated as Fst

Current estimate of ln Pr(X|K), averaged over all the iterations since the end of the burn-in period

The output file

Estimate of ln Pr(X|K), averaged over all the iterations since the end of the burn-in period

Q output without using prior information

Estimated membership in the clusters (K=3) and 90% probability interval (ANCESTDIST turned on)

Q output using prior information

Estimated probability of belonging to the second population, or of having a parent or grandparent that belongs to the second population

Posterior probability of belonging to the presumed population

PLOT THE RESULTS

- one vertical line/individual

- color = cluster

- more than one color per line: genetic components of the individual

INFERRING POPULATION STRUCTURE

RESGEN PROJECT: Towards a strategy for the conservation of the genetic diversity of European cattle

THE DATASET

More than 60 cattle breeds from Europe

5 African Bos indicus breeds

20 individuals per breed

30 microsatellites

Structure parameters:

Admixture model

Allele frequencies correlated

No prior information

Germ. Simmental

Simmental

Hinterwaelder

German Yellow

Evolene

Eringer

Piemontese

Grigio Alpina

Rendena

Cabannina

Swiss HF

British HF

Jutland 1950

Dutch Belted

German BP-W

Friesian-Holland

Belgian Blue

Germ. Shorthorn

Maine-Anjou

Normande

Podolica

Romagnola

Chianina

N'Dama

Somba

Lagunaire

Borgou

Zebu Peul

Bretonne BP

Charolais

Ayrshire

Highland

Hereford

Dexter

Aberdeen Angus

Jersey

Guernsey

Betizu A

Betizu B

Pirenaica

Blonde d'Aquitaine

Limousin

Bazadais

Gasconne

Aubrac

Salers

Montbéliard

Pezzata Rossa Ital.

Swiss Brown

Germ. Br. Württemberg

Germ. Br. Bavaria

Germ. Br. Orig

Bruna Pirineds

Menorquina

Mallorquina

Retinta

Morucha

Avilena

Sayaguesa

Alistano

Rubia Gallega

Asturiana Valles

Asturiana Montana

Tudanca

Toro de Lidia

Casta Navarra

Hungarian Grey

Istrian

Swedish Red Polled

Bohemian Red

Polish Red

Red Danish

Angeln

MRY

Red HF dual

Red HF dairy

Groningen WH

[Figure: STRUCTURE plot at k=2 with clusters labelled EUR and AFR; breeds labelled on the plot: Podolica, Hungarian Grey, Istrian, Romagnola, Chianina, N'Dama, Zebu Peul, Somba, Lagunaire, Borgou]

- k=2: Europe – Africa

Model-based clustering: European cattle

- Zebu influence in Podolian breeds

[Figure: STRUCTURE plots of the same breeds at k=2, k=5, k=7 and k=9. Clusters: Nordic, Lowland Pied, British, French Brown, Alpine Spotted, Iberian, Podolian, Alpine Brown, Baltic Red. Intermediate zones: North-West Intermediates, Alpine Intermediates]

Model-based clustering: European cattle

- 9 homogeneous clusters + 2 intermediate zones.

Courtesy of Dr. J. A. Lenstra, Dr. I. Nijman and the Resgen Consortium

INTRABIODIV: Tracking surrogates for intraspecific biodiversity: towards efficient selection strategies for the conservation of natural genetic resources using comparative mapping and modelling approaches

- 59 localities
- 177 samples
- ≈80 polymorphic AFLP markers

[Figure legend: high diversity / low diversity]

- 127 localities
- 381 samples
- 123 polymorphic AFLP markers

[Figure legend: high diversity / low diversity]

Courtesy of Dr. P. Taberlet and the Intrabiodiv Consortium

PERFORMING AN ASSIGNMENT TEST

[Map of the sampled breeds: Bruna, Grigio Alpina, Rendena, Valdostana Pezzata Rossa, Frisona, Pezzata Rossa Italiana, Piemontese, Romagnola, Limousine, Cabannina, Marchigiana, Chianina, Calvana, Podolica, Mucca Pisana, Maremmana]

- 16 breeds reared in Italy
- 416 individuals
- 3 AFLP primer combinations
- 132 polymorphisms

- Information on origins

THE REFERENCE DATASET

[Figure: two bar-chart panels of assignment probability (y axis: Probability, 0 to 1) for the reference breeds. Panel 1: LMI, MCG, BRU, FRI, MMA, CHI, ROM, MUP. Panel 2: CAL, GAL, VPR, PRI, PIM, POD, CAB, REN]

Checking the reference dataset

90% threshold

20,000 burn-in iterations + 50,000 MCMC iterations; 8 independent runs

98% of individuals correctly assigned with p>90% (91% with p>99%)

100% of Romagnola individuals from the genetic center assigned with p>99%
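The threshold rule used to check the reference dataset can be sketched as follows: an individual is assigned to the breed with the highest probability only when that probability exceeds the threshold, and is otherwise left unassigned. The function name and the probability values are hypothetical.

```python
def assign_breed(probabilities, threshold=0.90):
    """Return the most probable breed if its probability exceeds the threshold, else None."""
    breed, p = max(probabilities.items(), key=lambda item: item[1])
    return breed if p > threshold else None


# Hypothetical assignment probabilities for two animals
animal_a = {"ROM": 0.995, "CHI": 0.003, "MCG": 0.002}
animal_b = {"ROM": 0.55, "CHI": 0.30, "MCG": 0.15}

print(assign_breed(animal_a))                   # ROM
print(assign_breed(animal_b))                   # None: no breed exceeds 90%
print(assign_breed(animal_a, threshold=0.99))   # ROM: also passes the stricter threshold
```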

THE BLIND TEST

- 44 Romagnola individuals randomly selected
- 3 AFLP primer combinations; 132 polymorphisms
- No prior information

Assignment probability to the different breeds of the reference dataset

[Figure: assignment probability of the ROMAGNOLA test individuals to the reference breeds REN, CAB, POD, PIM, PRI, VPR, GAL, CAL, MUP, CHI, MMA, FRI, BRU, MCG, LIM]

36 Romagnola cattle assigned with p>99%

4 Romagnola cattle assigned with 90%<p<99%

4 Romagnola cattle not assigned

THE RESULTS

For those who are very interested

- Yang BZ, Zhao H, Kranzler HR, Gelernter J. Practical population group assignment with selected informative markers: characteristics and properties of Bayesian clustering via STRUCTURE. Genet Epidemiol. 2005 May;28(4):302-12.
- Sullivan PF, Walsh D, O'Neill FA, Kendler KS. Evaluation of genetic substructure in the Irish Study of High-Density Schizophrenia Families. Psychiatr Genet. 2004 Dec;14(4):187-9.
- Lucchini V, Galov A, Randi E. Evidence of genetic distinction and long-term population decline in wolves (Canis lupus) in the Italian Apennines. Mol Ecol. 2004 Mar;13(3):523-36
- Peever TL, Salimath SS, Su G, Kaiser WJ, Muehlbauer FJ. Historical and contemporary multilocus population structure of Ascochyta rabiei (teleomorph: Didymella rabiei) in the Pacific Northwest of the United States. Mol Ecol. 2004 Feb;13(2):291-309.
- Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003 Aug;164(4):1567-87.
- Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB. Human population genetic structure and inference of group membership. Am J Hum Genet. 2003 Mar;72(3):578-89. Epub 2003 Jan 28.
- Koskinen MT. Individual assignment using microsatellite DNA reveals unambiguous breed identification in the domestic dog. Anim Genet. 2003 Aug;34(4):297-301.
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science. 2002 Dec 20;298(5602):2381-5.
- Rosenberg NA, Burke T, Elo K, Feldman MW, Freidlin PJ, Groenen MA, Hillel J, Maki-Tanila A, Tixier-Boichard M, Vignal A, Wimmers K, Weigend S. Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics. 2001 Oct;159(2):699-713
- Randi E, Pierpaoli M, Beaumont M, Ragni B, Sforzi A. Genetic identification of wild and domestic cats (Felis silvestris) and their hybrids using Bayesian clustering methods. Mol Biol Evol. 2001 Sep;18(9):1679-93