5 Open Problems in Bioinformatics

PowerPoint Slideshow about '5 Open Problems in Bioinformatics' - niveditha


Presentation Transcript
Slide 1

5 Open Problems in Bioinformatics

  • Pedigrees from Genomes

  • Comparative Genomics of Alternative Splicing

  • Viral Annotation

  • Evolving Turing Patterns

  • Protein Structure Evolution


Slide 2

From genomes to pedigrees

Coalescent Recombination process

Sequence/Individual Boundary

Pedigree process

Three Processes

  • Recombination

  • Choosing Parents

  • The Mutational Process

From Yun Song


Slide 3

Probability of Data given a pedigree.

Elston-Stewart (1971) - Temporal Peeling Algorithm:

Father

Mother

Condition on parental states

Recombination and mutation are Markovian

Lander-Green (1987) - Genotype Scanning Algorithm:

Father

Mother

Condition on paternal/maternal inheritance

Recombination and mutation are Markovian

Comment: Obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson’s 1983 algorithm
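The idea of conditioning on parental states can be illustrated on the smallest pedigree, a father-mother-child trio at one biallelic locus. This is only a minimal sketch, not the Elston-Stewart implementation: the allele frequency `q`, the Hardy-Weinberg prior for founders, the error-free genotype observations and all function names are assumptions made for the example.

```python
from itertools import product

def founder_prior(g, q):
    """Assumed Hardy-Weinberg prior; g = number of copies of allele a (0, 1 or 2)."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2][g]

def transmit(g):
    """Probability that a parent with genotype g passes on allele a."""
    return g / 2

def child_given_parents(gc, gf, gm):
    """Mendelian probability of child genotype gc given the parental genotypes."""
    pf, pm = transmit(gf), transmit(gm)
    return [(1 - pf) * (1 - pm), pf * (1 - pm) + (1 - pf) * pm, pf * pm][gc]

def trio_likelihood(child, q, father=None, mother=None):
    """P(observed genotypes): sum over every unobserved parental state."""
    total = 0.0
    for gf, gm in product(range(3), repeat=2):
        if father is not None and gf != father:
            continue
        if mother is not None and gm != mother:
            continue
        total += (founder_prior(gf, q) * founder_prior(gm, q)
                  * child_given_parents(child, gf, gm))
    return total
```

Once the parents are fixed, the child's genotype is independent of everything above them in the pedigree, so larger pedigrees can be handled one nuclear family at a time.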


Slide 4

Benevolent Mutation and Recombination Process

Genomes with r --> infinity and m/r --> infinity

r - recombination rate, m - mutation rate

  • Counting mutations within a small interval would reveal the length of the path connecting the two segments.

  • Siblings are readily revealed, since they will have segments with 2m density of mutations.

  • The distribution of path lengths is readily observable between two sequences.

  • All embedded phylogenies are observable.
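The counting argument above can be sketched numerically. The snippet assumes that every site mutates independently with probability m per step along the path, and that repeated hits at a site are still counted as separate mutations; the function names and parameter values are illustrative only.

```python
import random

def count_mutations(path_length, m, n_sites, rng):
    """Simulate the number of mutations accumulated along a path of
    `path_length` steps over an interval of `n_sites` sites."""
    return sum(1 for _ in range(path_length * n_sites) if rng.random() < m)

def estimate_path_length(count, m, n_sites):
    """Invert the expected density: E[count] = path_length * m * n_sites."""
    return count / (m * n_sites)

rng = random.Random(1)
m, n_sites = 0.01, 50_000
for true_length in (2, 6):
    observed = count_mutations(true_length, m, n_sites, rng)
    print(true_length, round(estimate_path_length(observed, m, n_sites), 2))
```

A path of length 2, as between siblings, gives an expected density of 2m, matching the second bullet.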


Slide 5

[Figure: Pedigree 1 and Pedigree 2, with nodes labelled 1 and 2]

From Phylogenies to Pedigrees

Mike’s counterexample, linkage and individuals

Gluing Phylogenies together

Sibling Sequences come from different parents.

Different Pedigrees

Same Phylogenies

Individual 1

?

A recombinant’s parents are sister sequences.

grandparents

Individual 2



Slide 7

From Transcripts to the AS-Graph

[Figure: AS-graph with nodes labelled S and E]

  • How well is the AS-graph known as a function of the number of transcripts?

  • Given a family and distribution of transcripts, can they be explained by an AS-graph with probabilities at donor sites, or do we need probabilities for (donor, acceptor) pairs, or possibly even more complicated situations? And is sampling transcripts good enough to distinguish these situations?


Slide 8

Mini-project: reliability of AS-detection.

  • Choose Idealized AS-Graph:

  • Genome

  • Choose donor and acceptor sites in random pairs.

  • For each possible splice pair assign probability for choosing it.

  • This should define a probability for all transcripts.

  • Generate a set of transcripts.

  • Reconstruct AS-Graph.

  • Key questions:

  • How many transcripts must be sampled to detect AS?

  • How well will the AS-Graph be recovered?
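A minimal sketch of the mini-project, under strong simplifying assumptions: the AS-graph below, its splice-pair probabilities, and the recovery measure (the fraction of true splice pairs seen in at least one sampled transcript) are all hypothetical choices made for illustration.

```python
import random

# Hypothetical idealized AS-graph: nodes are exons in genomic order,
# node 0 is the start (S) and node 4 the end (E). Each edge is a
# (donor, acceptor) splice pair with an assumed usage probability.
AS_GRAPH = {
    0: {1: 0.9, 2: 0.1},
    1: {2: 1.0},
    2: {3: 0.8, 4: 0.2},
    3: {4: 1.0},
}
START, END = 0, 4

def sample_transcript(graph, rng):
    """Walk from START to END, choosing each splice pair by its probability."""
    path = [START]
    while path[-1] != END:
        targets = list(graph[path[-1]])
        weights = [graph[path[-1]][t] for t in targets]
        path.append(rng.choices(targets, weights=weights)[0])
    return tuple(path)

def reconstruct(transcripts):
    """Reconstructed AS-graph: the union of all observed splice pairs."""
    edges = set()
    for t in transcripts:
        edges.update(zip(t, t[1:]))
    return edges

def recovery_rate(graph, n_transcripts, seed=0):
    """Fraction of true splice pairs recovered from n_transcripts samples."""
    rng = random.Random(seed)
    true_edges = {(u, v) for u, out in graph.items() for v in out}
    observed = reconstruct(sample_transcript(graph, rng)
                           for _ in range(n_transcripts))
    return len(observed & true_edges) / len(true_edges)

for n in (1, 5, 20, 100):
    print(n, recovery_rate(AS_GRAPH, n))
```

Rarely used splice pairs dominate the number of transcripts needed, which is one way to approach the first key question.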


Slide 9

Optimal DAG (directed acyclic graph) under restrictions

Optimal Paths:

Sub-optimal Paths:

  • Finding a set of annotations:

  • Find set of paths, maximizing sum of scores.

  • The score of the lowest-scoring path must be above a threshold.

  • Two paths must differ significantly: within an enclosed area, the maximal height must be d higher than the boundary defining it. Height(i,j) = di,j + di,j

  • Do known AS genes have more CTO structure than non-AS genes?

  • Does the AS correspond to the CTO structure?

  • Is the CTO structure evolutionarily conserved?
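Finding a single optimal path in a scored DAG can be sketched with dynamic programming. The sketch assumes that nodes are numbered in topological order (for example by genomic position) and that edge scores are given; it does not implement the threshold or the "differ significantly" restriction, and the example scores are made up.

```python
def best_path(n_nodes, edges):
    """Highest-scoring path from node 0 to node n_nodes - 1.

    `edges` maps (u, v) with u < v to a score. Sorting the edges
    guarantees every edge into u is handled before any edge out of u.
    """
    best = [float("-inf")] * n_nodes
    back = [None] * n_nodes
    best[0] = 0.0
    for (u, v), score in sorted(edges.items()):
        if best[u] + score > best[v]:
            best[v] = best[u] + score
            back[v] = u
    # Trace back from the end node to recover the path itself.
    path, node = [], n_nodes - 1
    while node is not None:
        path.append(node)
        node = back[node]
    return best[-1], path[::-1]

example = {(0, 1): 1.0, (1, 3): 1.0, (0, 2): 5.0, (2, 3): -1.0, (0, 3): 2.5}
print(best_path(4, example))
```

Sub-optimal paths could then be sought by re-running the search with the edges of the best path penalized, though that is only one of several possible strategies.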


Slide 10

Phylogenetically related ASGs

[Figure: several AS-graphs, each with nodes labelled S and E]

  • Is ASG conserved?

  • What is conserved?

  • How does selection at a position depend on its splicing status?


Slide 11

Virus Annotation

Classes of Gene Structures

Illustrating the 3 main classes of gene structures: Unidirectional, Convergent and Divergent.

Diarrhoea Causing Arrangements: http://www.tulane.edu/~dmsander/WWW/335/Diarrhoea.html

Retroviridae Arrangements: http://www.tulane.edu/~dmsander/WWW/335/Retroviruses.html

Papovaviridae Arrangement: http://www.tulane.edu/~dmsander/WWW/335/Papovaviruses.html


Slide 12

The Problems of Viral Annotation

  • HMM gene structure generator (McCauley)

  • Gene Structure Evolution (de Groot)

  • Alignment (Caldeira, Lunter, Rocco)

  • Recombination (Lyngsø, Song)

  • Multiple constraints: RNA secondary structure, gene conservation, binding/transcriptional instructional sites.


Slide 13

HMM States

Non-coding

Coding RF1

Coding RF2

Coding RF3

Coding RF1,2

Coding RF1,3

Coding RF2,3

Coding RF1,2,3

Our 8-state HMM, which allows for unidirectional overlapping gene structures
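The eight states are exactly the subsets of the three reading frames, with the empty subset as the non-coding state. A short sketch that enumerates them (the function names are illustrative only):

```python
from itertools import combinations

FRAMES = (1, 2, 3)

def hmm_states():
    """All subsets of the reading frames; the empty subset is non-coding."""
    states = []
    for k in range(len(FRAMES) + 1):
        states.extend(combinations(FRAMES, k))
    return states

def label(state):
    """Human-readable state name in the style used on the slide."""
    if not state:
        return "Non-coding"
    return "Coding RF" + ",".join(str(frame) for frame in state)

for state in hmm_states():
    print(label(state))
```

With three frames on one strand this gives 2^3 = 8 states; transition and emission probabilities are not specified here.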


Slide 14

Combining Levels of Selection.

Assume multiplicativity: $f_{A,B} = f_A \cdot f_B$

Protein-Protein

Hein & Støvlbæk, 1995

Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001

Contagious Dependence

Protein-RNA

Singlet

Doublets

Contagious Dependence
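One possible reading of the multiplicativity assumption: if a change at a site is subject to two independent constraints A and B, its rate is the neutral rate scaled by the product of the two selection factors. The sketch below assumes that reading; the neutral rate and the factor values are made up for illustration.

```python
from math import prod

def combined_factor(factors):
    """Multiplicative combination: f_{A,B,...} = f_A * f_B * ..."""
    return prod(factors)

def constrained_rate(neutral_rate, factors):
    """Rate of a change that must satisfy every constraint at once."""
    return neutral_rate * combined_factor(factors)

# A site constrained by two overlapping features (hypothetical values):
# f_A = 0.5 for the first, f_B = 0.2 for the second.
print(constrained_rate(1.0, [0.5, 0.2]))
```

A site under no constraint has an empty factor list and keeps the neutral rate, since the empty product is 1.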


Slide 15

Table illustrating the gain in sensitivity obtained by using a phylogenetic HMM. The HMM is extended to include evolutionary information from 13 aligned HIV-2 sequences.


Slide 16

GenBank: Centralized resource for publicly available viral sequence data.

Entrez Genomes currently contains 2120 Reference Sequences for 1510 viral genomes and 36 Reference Sequences for viroids.

http://www.ncbi.nlm.nih.gov/Genbank/

http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html

Properties of overlapping genes are conserved across microbial genomes. Genome Res. 2004 Nov;14(11):2268-72.

Within microbial genomes, one third of annotated genes contain some degree of overlap, and one third of these are either Convergent or Divergent.

Krakauer, D.C. Stability and evolution of overlapping genes. Evolution 54: 731-739 (2000); Genome Res. 2004 Nov;14(11):2268-72.

Overlapping gene structures split roughly 90:9:1 across Unidirectional, Convergent and Divergent arrangements.


Slide 17

Turing Patterns


Slide 18

Mathematical models to understand biological patterns

From Maini’s Home Page: http://www.maths.ox.ac.uk/~maini

Turing Model
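For a generic two-species reaction-diffusion system, the standard linear-stability conditions for a Turing (diffusion-driven) instability can be checked directly. The sketch below assumes a system u_t = f(u,v) + Du ∇²u, v_t = g(u,v) + Dv ∇²v linearized around a homogeneous steady state; `fu, fv, gu, gv` are the Jacobian entries there. It is not the specific model shown on the slide, and the numerical values are made up.

```python
def turing_unstable(fu, fv, gu, gv, Du, Dv):
    """True if the steady state is stable without diffusion but
    unstable to some spatial perturbation once diffusion is added."""
    trace = fu + gv
    det = fu * gv - fv * gu
    stable_without_diffusion = trace < 0 and det > 0
    # det(J - k^2 D) = Du*Dv*k^4 - h*k^2 + det must go negative for some k.
    h = Dv * fu + Du * gv
    unstable_with_diffusion = h > 0 and h * h > 4 * Du * Dv * det
    return stable_without_diffusion and unstable_with_diffusion

# Hypothetical activator-inhibitor Jacobian; only the diffusion ratio changes.
print(turing_unstable(1.0, -1.0, 2.0, -1.5, Du=1.0, Dv=10.0))
print(turing_unstable(1.0, -1.0, 2.0, -1.5, Du=1.0, Dv=1.0))
```

With these values the instability appears only when the inhibitor diffuses sufficiently faster than the activator.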


Slide 19

Different parameters lead to different patterns

Stripes: p small

Spots: p large

[From: Leppanen et al. Dimensionality effects in Turing pattern formation, Int. J. Mod. Phys. B 17, 5541-5553 (2003)]


Slide 20

3 suggestions

1. Networks and Turing Patterns

2. Stochastic Partial Differential Equations

3. Phylogenetically related Turing Patterns


Slide 21

Evolutionary Models of Protein Structure Evolution

[Figure: α-globin and myoglobin, 1.4 Gyr apart: 300 amino acid changes, 800 nucleotide changes, 1 structural change. Regions marked Known, Unknown, Known; unknown states marked "?"]

1. Given a structure, what are the possible events that could happen?

2. What are their probabilities? Old-fashioned substitution + indel process with bias.

Bias: Folding(Sequence → Structure) & Fitness of Structure

3. Summation over all paths.
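Summation over all paths between two known endpoints can be illustrated with a small discrete-time Markov chain, where powers of the transition matrix perform the sum implicitly. The three states and their transition probabilities below are hypothetical stand-ins for structures; the brute-force enumeration is included only to show that both computations agree.

```python
from itertools import product

# Hypothetical one-step transition probabilities between three states.
P = [
    [0.90, 0.08, 0.02],
    [0.10, 0.85, 0.05],
    [0.01, 0.09, 0.90],
]

def mat_mul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, steps):
    """p raised to the given number of steps (steps >= 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result

def path_sum(p, start, end, steps):
    """Explicit sum over every path of the given length from start to end."""
    total = 0.0
    for middle in product(range(len(p)), repeat=steps - 1):
        path = (start, *middle, end)
        prob = 1.0
        for u, v in zip(path, path[1:]):
            prob *= p[u][v]
        total += prob
    return total

print(mat_pow(P, 4)[0][2], path_sum(P, 0, 2, 4))
```

The explicit enumeration grows exponentially with the number of steps, while the matrix power does not, which is why the sum is tractable only when the state space can be kept small.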


Slide 22

2 suggestions

A. Structure (Homology Modelling, Topology)

Folding(Sequence → Structure)

As a first approximation similar structures should be compared and the problem could be solved by comparative modelling.

Fast Homology Modelling

Using Protein Topology as Hidden Variable

Fitness of Structure: such functions are commonplace in guiding prediction programs.

B. MCMC


Slide 23

Questions to be asked

Negative Note:

Protein Structure Analysis is much harder than Sequence Analysis. Much of the first-hand impression will remain: “Structures are either trivially similar or highly dissimilar”; the middle ground is empty.

At Gyr scale other rearrangements occur.

Positive Note: If it works

Test of smooth/catastrophic structure evolution

Separation of analogous/homologous similarities

Protein Evolution in General

How closely linked are homologous and structurally equivalent sites?

http://www.biochem.ucl.ac.uk/bsm/cath/

http://scop.mrc-lmb.cam.ac.uk/scop/


Slide 24

Summary

Pedigrees from Genomes

  • Do infinite genomes determine pedigrees?

  • How many pedigrees are there?

Comparative Genomics of Alternative Splicing

  • How well do you know the ASG?

  • How do you measure selection on the ASG?

Viral Annotation

  • How well can you annotate viruses from observed evolution?

Evolving Turing Patterns

  • Turing Patterns and Networks

  • Stochastic Turing Patterns

  • Phylogenetically Related Turing Patterns

Protein Structure Evolution

  • Full Model of Structure Evolution

  • Model of Protein Topology Evolution