The Barcode of Life

The barcode of life integrating machine learning techniques for species prediction and discovery barcodinglife

The Barcode of Life
Integrating machine learning techniques for species prediction and discovery
www.barcodinglife.com


  • Welcome to the Meeting

  • Barcode of Life --

  • Great opportunity to contribute to a fast growing area of research.

  • Some questions to keep in mind for the afternoon…


  • Open Questions

  • Species discovery vs. prediction

  • Data structure, missing data, sample sizes

  • New visualization tools for both discovery and prediction

  • Confidence measures for species discovery and for individual specimen assignments: controlling the number of false discoveries, power of detection


Barcoding Data – A first look at the data

DIMACS BOL Data Analysis Working Group meeting

September 26, 2005

Rebecka Jornsten

Department of Statistics, Rutgers University

http://www.stat.rutgers.edu/~rebecka/DIMACSBOL/DimacsMeetingDATA/

Thanks to Kerri-Ann Norton


  • Outline

  • Data Structure and Data Retrieval

  • Sequencing and Base Calling

  • Distance metrics – Sequence information

  • Clustering

  • Classification

  • Open Questions – Discussion

  • Questions to think about are highlighted in red.


What do the data look like?



Sequencing

  • Peak finding

  • Deconvolution

  • Denoising

  • Normalization

  • Base calling

  • Quality assessment

  • (ABI base caller, Phred)


Sample Data (www.barcodinglife.com)

  • Leptasterias data – six-rayed sea stars

  • Astraptes data – skipper butterflies

  • Collembola data – springtails


Sample Data

  • Leptasterias data – six-rayed sea stars: 5 species, 21 specimens; sample sizes 3-7; sequence length 1644

  • Astraptes data – skipper butterflies: 12 species, 451 specimens; sample sizes 3-96, 8 with more than 20; sequence length 594

  • Collembola data – springtails: 18 species, 54 specimens; sample sizes 1-5; sequence length 635


  • Sequence Information

  • We can compute the information content for each nucleotide.

  • Is there a lot of variability between species at locus j?

  • Is there a lot of variability within a species at locus j?

  • Are the same loci discriminating between multiple species?
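One way to make "information content at locus j" concrete is the Shannon entropy of the bases observed at that locus, computed within each species and over the pooled specimens. The sketch below is only an illustration: the toy sequences, the species names and the function name `locus_entropy` are made up and are not the workshop data.

```python
import math
from collections import Counter

def locus_entropy(sequences, j):
    """Shannon entropy (in bits) of the bases observed at locus j."""
    counts = Counter(seq[j] for seq in sequences)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical aligned toy sequences, grouped by species label.
species = {
    "A": ["acgt", "acgt", "acga"],
    "B": ["ccgt", "ccgt", "ccgt"],
}

# Within-species entropy per locus: 0 means the locus is constant in that species.
for name, seqs in species.items():
    print(name, [round(locus_entropy(seqs, j), 3) for j in range(4)])

# Pooled entropy per locus: variability across all specimens.
pooled = [s for seqs in species.values() for s in seqs]
print("pooled", [round(locus_entropy(pooled, j), 3) for j in range(4)])
```

A locus with low within-species entropy but high pooled entropy (locus 0 in the toy data) is a candidate for discriminating between species.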


Astraptes: within-species entropy for the 9 species with 20+ specimens (figure; annotation: “pure”)


Mutual information of each locus for the 9 species
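A per-locus mutual information score can be sketched as I(X_j; S) = H(X_j) − Σ_s p(s) H(X_j | S = s), where X_j is the base at locus j and S is the species label. The code below assumes empirical frequencies and uses made-up toy data; whether the slides used exactly this estimator is not stated.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of symbols."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mutual_information(sequences, labels, j):
    """Empirical mutual information between the base at locus j and the label."""
    column = [seq[j] for seq in sequences]
    n = len(column)
    conditional = 0.0
    for s in set(labels):
        sub = [b for b, lab in zip(column, labels) if lab == s]
        conditional += len(sub) / n * entropy(sub)  # p(s) * H(X_j | S = s)
    return entropy(column) - conditional

# Hypothetical toy data: six aligned specimens from two species.
sequences = ["acgt", "acgt", "acga", "ccgt", "ccgt", "ccgt"]
labels = ["A", "A", "A", "B", "B", "B"]
mi = [mutual_information(sequences, labels, j) for j in range(4)]
print([round(v, 3) for v in mi])
```

Restricting `sequences` and `labels` to two species at a time gives a pairwise version of the same score.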


Pairwise mutual information (figure panels: 10 vs. 11, 10 vs. 12, 10 vs. 12, 2 vs. 10, 2 vs. 12)


  • Distance Metric

  • To group the specimens in an unsupervised fashion, we need a distance metric.

  • Without prior information about which loci are informative, we compute distances using the entire sequence (594 bases for Astraptes).

  • The 0-1 distance (the number of loci at which two sequences differ) is the most commonly used.

  • However, some bases are ‘uncalled’ – usually denoted by a letter other than a, c, g, t (for example n).

  • How should we take this into account?
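One possible answer to the question above, offered only as a sketch: skip every locus where either base is uncalled and divide the mismatch count by the number of loci actually compared. This convention is an assumption, not something the slides prescribe; other choices (counting uncalled bases as mismatches, or as half-mismatches) are equally defensible.

```python
CALLED = set("acgt")

def zero_one_distance(a, b):
    """Fraction of mismatching loci among loci where both bases are called.

    Loci with an uncalled base in either sequence are skipped
    (an assumed convention, one of several possible).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.lower(), b.lower()):
        if x in CALLED and y in CALLED:
            compared += 1
            mismatches += x != y
    # No comparable locus at all: the distance is undefined.
    return mismatches / compared if compared else float("nan")

print(zero_one_distance("acgt", "acga"))  # one mismatch in four loci
print(zero_one_distance("acnt", "acga"))  # locus 3 skipped: one mismatch in three
```

Normalizing by the number of compared loci keeps distances comparable between specimens with different numbers of uncalled bases.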


Hierarchical clustering: Astraptes
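For readers who want to reproduce this kind of analysis on small data, here is a naive agglomerative clustering sketch on 0-1 distances. Average linkage is an assumption (the slides do not say which linkage was used), and the sequences are made-up toy data.

```python
def hamming(a, b):
    """Number of loci at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def average_linkage(seqs, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(seqs))]

    def dist(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(hamming(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(dist(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)          # closest pair; ties broken by index
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Hypothetical toy sequences.
seqs = ["aaaa", "aaat", "cccc", "cccg", "ggtt"]
clusters = average_linkage(seqs, 3)
print(clusters)
```

This version recomputes all pairwise cluster distances at every step, so it is only practical for a few hundred specimens at most.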


PAM Clustering: Astraptes


Hierarchical clustering: Collembola


Another example: Leptasterias – 5 species, 21 specimens (figure panels: all groups; groups 3 vs. 4)


Hierarchical clustering: Leptasterias (figure annotations: Group 1, Group 2, “Problem…”)


PAM Clustering: Leptasterias

Selecting the number of clusters via silhouette width, cross-validation, etc. leads to combining species 3 and 4: these data do not support treating them as separate species.
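The selection of the number of clusters by average silhouette width can be sketched as follows. Note the substitution: instead of PAM's build/swap heuristic, this toy version finds the medoids by exhaustive search over the same k-medoids objective, which is feasible only for very small data. The sequences and function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def k_medoids(D, k):
    """Exhaustive k-medoids: each point is labeled by the index of its medoid."""
    n = len(D)
    best = min(combinations(range(n), k),
               key=lambda meds: sum(min(D[i][m] for m in meds) for i in range(n)))
    return [min(best, key=lambda m: D[i][m]) for i in range(n)]

def mean_silhouette(D, assign):
    """Average silhouette width; singleton clusters contribute 0."""
    n = len(D)
    widths = []
    for i in range(n):
        own = [j for j in range(n) if assign[j] == assign[i] and j != i]
        if not own:
            widths.append(0.0)
            continue
        a = sum(D[i][j] for j in own) / len(own)      # mean distance to own cluster
        b = min(sum(D[i][j] for j in range(n) if assign[j] == c)
                / sum(1 for j in range(n) if assign[j] == c)
                for c in set(assign) if c != assign[i])  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / n

# Hypothetical toy data with two well-separated groups.
seqs = ["aaaa", "aaat", "aata", "cccc", "cccg", "ccgc"]
D = [[hamming(s, t) for t in seqs] for s in seqs]
scores = {k: mean_silhouette(D, k_medoids(D, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

When two labeled species are not well separated, the silhouette criterion tends to prefer the smaller number of clusters that merges them, which is the behavior described for species 3 and 4.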


  • Classification

  • A classifier that in principle closely resembles the hierarchical clustering approach is k-nearest neighbors (kNN).

  • Leave-one-out cross-validation:

  • On the Leptasterias data, 1-2 specimens are misallocated with this classifier.

  • Both of these specimens are in group 4 (and mislabeled as 3).

  • Via cross-validation we see that one observation is labeled as 4 only if it is in the training set, and otherwise as 3.

  • The other mislabeled observation fails in 15 out of 20 training scenarios. Could both of these specimens have been mislabeled?
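The leave-one-out procedure for a kNN classifier on 0-1 distances can be sketched as below. The data, labels and function names are hypothetical, and ties between equally distant neighbors are broken by training order, which is an arbitrary choice.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """train is a list of (sequence, label); majority vote among the k nearest."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_errors(data, k=1):
    """Indices of specimens misclassified when each is held out in turn."""
    return [i for i, (seq, label) in enumerate(data)
            if knn_predict(data[:i] + data[i + 1:], seq, k) != label]

# Hypothetical toy data: the last specimen sits closer to sp1 than to its own label.
data = [("aaaa", "sp1"), ("aaat", "sp1"), ("aata", "sp1"),
        ("cccc", "sp2"), ("cccg", "sp2"), ("aaca", "sp2")]
errors = leave_one_out_errors(data)
print(errors)
```

A specimen that is misclassified whenever it is held out, as the last one is here, is either a genuinely ambiguous case or a candidate for a labeling error.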


  • Classification

  • A simple alternative is to use a centroid-based classifier.

  • Assign each new specimen to the species whose consensus sequence is closest to it.

  • We can match specimens to a consensus sequence based on the 0-1 distance, or

  • use the position weights of each base in the consensus sequence.
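Both variants of the consensus-sequence classifier can be sketched in a few lines. The assumptions here are that the consensus is the per-locus majority base and that the "position weight" of a base is its relative frequency at that locus within the species; the slides do not define either precisely, and the data are made up.

```python
from collections import Counter

def consensus(seqs):
    """Per-locus majority base of a set of aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def position_weights(seqs):
    """Per-locus relative frequency of each base within one species."""
    n = len(seqs)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*seqs)]

def classify(query, training, weighted=False):
    """training maps species name -> aligned sequences; returns (species, scores).

    Scores are similarities (higher is better): the number of loci matching
    the consensus, or the summed position weights of the query's bases.
    """
    scores = {}
    for name, seqs in training.items():
        if weighted:
            pw = position_weights(seqs)
            scores[name] = sum(w.get(b, 0.0) for w, b in zip(pw, query))
        else:
            scores[name] = sum(c == b for c, b in zip(consensus(seqs), query))
    return max(scores, key=scores.get), scores

# Hypothetical toy training data.
training = {"sp1": ["aaaa", "aaat", "aata"], "sp2": ["cccc", "cccg", "ccgc"]}
print(classify("aaca", training))
print(classify("aaca", training, weighted=True))
```

The weighted variant gives partial credit at loci where a species is itself variable, instead of treating the consensus base as the only acceptable one.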


  • Classification

  • On the Leptasterias data, the consensus sequence (CS) based classifier makes 2 errors under leave-one-out cross-validation (LOO CV).

  • “Vote of confidence” = weighted 0/1 distance to the CS


  • Classification

  • Relative voting (RV) strength illustrates that species 3 and 4 are difficult to separate, and the misallocated specimens are associated with low relative votes

  • $\mathrm{RV} = \dfrac{s_{(1)} - s_{(2)}}{s_{(2)}}$, where $s_{(1)}$ = max(weighted similarity) and $s_{(2)}$ = (max-1)(weighted similarity), i.e. the second-largest weighted similarity.
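Reading the relative vote as the gap between the best and second-best weighted similarity, divided by the second-best, it can be computed as in this sketch. The similarity values and the function name are made up for illustration.

```python
def relative_vote(similarities):
    """similarities maps species -> weighted similarity; returns (best species, RV)."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    if s2 == 0:
        # No competing species has any similarity: the vote is unbounded.
        return best, float("inf")
    return best, (s1 - s2) / s2

# Hypothetical similarities of one specimen to three species consensus sequences.
best, rv = relative_vote({"sp3": 9.0, "sp4": 8.0, "sp1": 2.0})
print(best, rv)
```

A small RV means the runner-up species fits almost as well as the winner, so the assignment deserves little confidence.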


Base calling is not perfect: errors are made, and programs such as Phred can analyze the ABI traces and assign a confidence measure to each base.

An interesting question: can we obtain similar error rates for species prediction and discovery with smaller sample sizes if quality measures are incorporated into the analysis?
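As one hypothetical way to incorporate quality measures, the sketch below converts Phred quality scores to error probabilities (Q = −10 log10 p) and weights each locus of the 0-1 distance by the probability that both base calls are correct. The weighting scheme is an assumption made for illustration, not a method proposed in the slides.

```python
def error_probability(q):
    """Phred quality Q corresponds to a base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def quality_weighted_distance(a, qa, b, qb):
    """Weighted mismatch fraction between two aligned sequences.

    Each locus is weighted by the probability that both base calls are
    correct (an assumed scheme); qa and qb are per-base Phred scores.
    """
    num = 0.0
    den = 0.0
    for x, q1, y, q2 in zip(a, qa, b, qb):
        w = (1 - error_probability(q1)) * (1 - error_probability(q2))
        num += w * (x != y)
        den += w
    return num / den if den else float("nan")

# Hypothetical example: the only mismatch falls on a base with quality 0.
high = [40, 40, 40, 40]
d_plain = quality_weighted_distance("acgt", high, "acga", high)
d_low = quality_weighted_distance("acgt", [40, 40, 40, 0], "acga", high)
print(d_plain, d_low)
```

With uniformly high quality the result matches the ordinary 0-1 distance; a mismatch at a low-quality base contributes little, so unreliable calls no longer inflate the distance between specimens.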


  • Before the Discussion Session

  • Try out some clustering techniques on the sample data

  • Number of uncalled bases?

  • Length of sequences?

  • Sample sizes – effect on clustering?

  • Sample sizes – effect on classification?

