BioInformatics Consultation
BioInformatics Consultation Practice 9 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu


BioInformatics Consultation

Practice 9

Gábor Pauler, Ph.D.

Tax.reg.no: 63673852-3-22

Bank account: 50400113-11065546

Location: 1st Széchenyi str., 7666 Pogány, Hungary

Tel: +36-309-015-488

E-mail: pauler@t-online.hu


Content of the Practice

  • Genome Browsers

    • Basic terms

      • GUI

      • Database engine: On-Line Analytical Processing

    • Methodology: On-Line Analytical Processing

      • Relational Database System

        • The Star Schema

      • Data Cubes

        • Dimensions

        • Hierarchies: Aggregate/Drill

        • Measures

        • Formulas

    • Software: NCBI Genome

  • References


Genome Browsers: Basic terms 1

  • Genome Browsers (Genom Böngésző) are integrated software tools containing:

    • The visible part: a Graphical User Interface (GUI) for:

      • Displaying entire genomes with annotated data in graphical form, organized along a genomic nucleotide position coordinate axis:

        • Gene prediction and structure

        • Proteins, domains, motifs

        • Expression regulation

        • Alternative splicing variants

      • Displaying comparisons of multiple genomes to facilitate Comparative Genomic Analysis (Összehasonlító Genom Elemzés)


Genome Browsers: Basic terms 2

  • Hidden inside the software: genome browsers (should) contain a Database Engine (Adatbázis Motor) enabled for On-Line Analytical Processing (OLAP) (On-Line Analitikus Feldolgozás), which:

    • Stores genomic data and annotations from multiple (even partially conflicting) sources in a Data Warehouse (Adattárház): a Relational Database (Relációs Adatbázis) with a standardized design

    • Can generate multi-dimensional (Többdimenziós) reports

    • Offers an easy-to-use Graphical User Interface (GUI), where the breakup levels and the aggregation of genomic data can be modified with a mouse click, without any manual coding

  • Why does this matter to a biologist (beyond looking nice to IT people)?

    • Genomes are large data sets stored as 1-dimensional data (nucleotide sequences on the 5’-3’ or 3’-5’ DNA strands), but their logical structure is a multi-dimensional (Több dimenziós) data set packed into 1-dimensional storage, where each dimension is hierarchic (Dimenzionális hierarchia) with multiple breakup levels (Dimenzió Szint):

      • Phylogenetic groups: Kingdoms > Families > Clusters > Species

      • Physical layout: Genome > Chromosome > DNA Strand > Gene

      • Gene structure: Gene > Expression factors > Introns/Exons

      • Proteomics: Protein > Domain > Motifs

      • Sequencing+Assembly: Sample > Dictionary > Contig > Fragment > EST

    • A biotech researcher deals with Ill-structured Problems (Rosszul Strukturált Probléma): neither the algorithm nor the input/output data structures of the problem are clearly defined

    • Therefore he/she may need to view genomic data aggregated (Aggregálva) at any combination of breakup levels of the different dimensions, and very quickly (e.g. which final protein products can be formed by alternative splicing of gene x, or what are all the possible alternatively spliced cDNAs of gene x across all paralogous sequences)

    • The point is that the researcher usually cannot specify „standard breakup level combinations” in advance: they depend on the iterative research process


Genome Browsers: Analogy with dirty business: Managerial reporting

  • This problem has an interesting analogy in a distant field: top managerial reporting of corporate business data.

  • The first Corporate Data Management applications went into service at large multinational companies in the 1970s:

    • They were incredibly expensive (both the software and the mainframe hardware of the day)

    • They made it possible to serve customers faster and more reliably at the operative level

    • But they were incredibly useless at the strategic level: they flooded top managers with tons of predefined, standardized reports every week, which went straight into the garbage can, followed by the complaint: „It was so expensive, and it cannot give me even the simplest information I need in time!”

  • The reason was that early relational database systems could do only static structured reporting, which conflicted with the ill-structured nature of top managerial decisions:

    • A top manager needs only 5-6 numbers every week, but several million dollars can live or die on those numbers

    • Therefore the deadline is yesterday: he needs them damn quickly!

    • He cannot tell a week in advance what he will need next week: it depends on the situation

    • He has no time to read lengthy standard reports containing every possible breakup of the data: an average top manager can read 4 pages/day (if he can read at all…), and all other time is invested in lobbying for the company through social relations: tennis or dinner with politicians of the governing party, group orgies with representatives of the opposition party (just in case…)

  • The introduction of OLAP dynamic structured reporting tools solved this problem at the beginning of the 1990s, and OLAP will probably be heavily involved in genome browsers in the future. Therefore, let us look at its theoretical basics:


Basics of OLAP: Relational Database Management (RDBM) 1

  • Genome browsers translate and store annotations of genome data in a Relational Database (Relációs Adatbázis) instead of simply using the original FASTA or EBI records. Why?

    • Large data sets can be searched fast if they are stored on the hard drive in data tables with fixed record length: the start of the nth record can be computed instead of reading through all the data

    • The data structure of genomes, called the Empirical Data Structure (EDS), is not of fixed length by definition: 1 chromosome can contain many genes, 1 gene can contain many exons, etc.

  • RDBM resolves this conflict by decomposing (Szétbont) the EDS into Entities (Egyedek): an entity is an object that can have numerous occurrences (Előfordulás), all described with the very same attributes (Tulajdonság), so each occurrence can be stored in a fixed-length space: e.g. Exon: StartPos, EndPos, SpliceNum

  • Decomposition is done by Cardinality Analysis, CA (Számosság elemzés): for each pair of entities it examines, in both directions, how many occurrences of one entity can be related to one occurrence of the other:

    • 1 Gene can contain many Exons, but 1 Exon belongs to 1 Gene = 1:many relation of 2 entities (on the slide, entities, attributes, relations and cardinalities are color-coded)

    • 1 Gene can code many Proteins, and 1 Protein can be coded by many Genes = many:many relation of 2 separate entities

    • 1 Exon has only 1 StartPos, 1 EndPos, 1 SpliceSequenceNumber = 1:1 relation: these are attributes of the very same entity
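
The cardinality analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cardinality` and the sample identifiers (G001, E001, P001, the StartPos values) are hypothetical and not part of the original practice material. The function looks at observed pairs of occurrences and checks, in both directions, whether one occurrence is ever related to more than one occurrence on the other side.

```python
from collections import defaultdict


def cardinality(pairs):
    """Classify the relation between two entities from observed (left, right) pairs."""
    right_per_left = defaultdict(set)
    left_per_right = defaultdict(set)
    for left, right in pairs:
        right_per_left[left].add(right)
        left_per_right[right].add(left)
    # Direction 1: can one left occurrence relate to many right occurrences?
    left_has_many = any(len(v) > 1 for v in right_per_left.values())
    # Direction 2: can one right occurrence relate to many left occurrences?
    right_has_many = any(len(v) > 1 for v in left_per_right.values())
    if left_has_many and right_has_many:
        return "many:many"
    if left_has_many:
        return "1:many"
    if right_has_many:
        return "many:1"
    return "1:1"


# Hypothetical sample observations (identifiers are made up for illustration)
gene_exon = [("G001", "E001"), ("G001", "E002"), ("G002", "E003")]
gene_protein = [("G001", "P001"), ("G001", "P002"), ("G002", "P001")]
exon_startpos = [("E001", 100), ("E002", 450), ("E003", 90)]

print(cardinality(gene_exon))      # 1:many    -> Gene and Exon are separate entities
print(cardinality(gene_protein))   # many:many -> needs a relation entity
print(cardinality(exon_startpos))  # 1:1       -> StartPos is an attribute of Exon
```

Note that a result inferred from sample data is only as good as the sample: a 1:1 result may simply mean that no counterexample has been observed yet.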


Basics of OLAP: Relational Database Management (RDBM) 2

  • To preserve the original data of the EDS, the decomposed Entities should be connected by relations (Relációk): referential connections with cardinality 1:many between the following attributes:

    • Primary key (Elsődleges kulcs) attribute: uniquely identifies the occurrences of an entity, therefore it is the 1 side of the relation. It is denoted with orange, e.g. GeneID

    • Foreign key (Idegen kulcs) attribute: a reference to the primary key of another entity, with the same name and type. It is the many side of the relation, denoted with olive, e.g. GeneID

    • For example: 1 Gene can contain many Exons, but 1 Exon belongs to 1 Gene → the many side (Exon) always references the uniquely identified 1 side (Gene)

    • Many:many relations are assembled from two 1:many relations and a Relation entity: 1 Gene can code many Proteins, and 1 Protein can be coded by many Genes
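
As a minimal sketch of the primary key / foreign key mechanism above, the following uses Python's standard `sqlite3` module with an in-memory database. The Gene, Exon and Protein entities and their ID/Name attributes follow the slides; the relation entity name `GeneCodesProtein` and all sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to

conn.executescript("""
CREATE TABLE Gene (GeneID INTEGER PRIMARY KEY, GeneName TEXT NOT NULL);
-- 1:many: the many side (Exon) carries a foreign key to the 1 side (Gene)
CREATE TABLE Exon (
    ExonID   INTEGER PRIMARY KEY,
    ExonName TEXT NOT NULL,
    GeneID   INTEGER NOT NULL REFERENCES Gene(GeneID)
);
CREATE TABLE Protein (ProteinID INTEGER PRIMARY KEY, ProteinName TEXT NOT NULL);
-- many:many: a relation entity assembled from two 1:many relations (name is hypothetical)
CREATE TABLE GeneCodesProtein (
    GeneID    INTEGER NOT NULL REFERENCES Gene(GeneID),
    ProteinID INTEGER NOT NULL REFERENCES Protein(ProteinID),
    PRIMARY KEY (GeneID, ProteinID)
);
""")

# Hypothetical sample rows
conn.executemany("INSERT INTO Gene VALUES (?, ?)", [(1, "Gene1"), (2, "Gene2")])
conn.executemany("INSERT INTO Exon VALUES (?, ?, ?)",
                 [(1, "Exon1", 1), (2, "Exon2", 1), (3, "Exon3", 2)])
conn.executemany("INSERT INTO Protein VALUES (?, ?)", [(1, "Protein1"), (2, "Protein2")])
conn.executemany("INSERT INTO GeneCodesProtein VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])
conn.commit()

# Follow the 1:many relation from Gene to Exon
exons_per_gene = conn.execute("""
    SELECT g.GeneName, COUNT(e.ExonID)
    FROM Gene g LEFT JOIN Exon e ON e.GeneID = g.GeneID
    GROUP BY g.GeneID ORDER BY g.GeneID
""").fetchall()

# Follow the many:many relation through the relation entity
genes_per_protein = conn.execute("""
    SELECT p.ProteinName, COUNT(*)
    FROM Protein p JOIN GeneCodesProtein gp ON gp.ProteinID = p.ProteinID
    GROUP BY p.ProteinID ORDER BY p.ProteinID
""").fetchall()

# An Exon that references a non-existent Gene violates referential integrity
try:
    conn.execute("INSERT INTO Exon VALUES (4, 'Exon4', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(exons_per_gene)     # [('Gene1', 2), ('Gene2', 1)]
print(genes_per_protein)  # [('Protein1', 2), ('Protein2', 1)]
print(orphan_rejected)    # True
```

The composite primary key on the relation entity keeps the same Gene and Protein pair from being recorded twice.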


Basics of OLAP: Relational Database Management (RDBM) 3

[ERD figure: entities with their attributes]

  • Legend: EntityName (EntityNameID; data types Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie; ReqForeignKey, OptForeignKey; Modifier, Modified, Status); MasterEntity (MasterID, MasterName)

  • Sequence (SequenceID, SeqenString, StartPos, EndPos, Strand, Date, ESTID, MotiveID, ExonID, GeneID, SpeciesID)

  • Sample (SampleID, SampleName)

  • Dictionary (DictionaryID, Restrictase, SampleID)

  • Contig (ContigID, ContigName, DictionaryID)

  • Fragment (FragmentID, FragmentName, ContigID)

  • EST (ESTID, ESTName, FragmentID)

  • Kingdom (KingdomID, KingdomName)

  • Family (FamilyID, FamilyName, KingdomID)

  • Cluster (ClusterID, ClusterName, FamilyID)

  • Species (SpeciesID, SpeciesName, ClusterID)

  • Genome (GenomeID, GenomeName)

  • Chromosome (ChromosomeID, ChromosomeName, GenomeID)

  • Strand (StrandID, Direction, ChromosomeID)

  • Gene (GeneID, GeneName, StrandID)

  • Exon (ExonID, ExonName, GeneID)

  • Protein (ProteinID, ProteinName)

  • Domain (DomainID, DomainName, ProteinID)

  • Motive (MotiveID, MotiveName, DomainID)

  • It is hard to get an overview of the relations in a complex database with dozens of entities from small sample tables. Therefore a relational database design is represented in an Entity Relationship (Egyedkapcsolati) Diagram, ERD:

  • Entities are rounded-corner boxes with the EntityName at the top. Blue background denotes code table/master entities whose data change little over time; yellow denotes relational/transaction entities with rapid, irrevocable data changes over time

  • Attributes are listed with their data type icons (Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie) and names: italic means an optional, normal a required, and bold an auto-filled attribute

  • Data attributes are purple; primary keys are orange and foreign keys are olive, each marked with its own icon; auto-filled system logging attributes are black

  • 1:many relations are drawn as connectors between primary and foreign keys

  • OLAP systems typically work with a database design called the Star (Csillag) Schema:

  • In the „center” there is the transaction entity, here the observed sequences (Sequence)

  • In the „arms” there are the master data entities of the dimension levels:

    • Dimension: Sequencing+Assembly

    • Dimension: Phylogeny

    • Dimension: Gene structure

    • Dimension: Proteomics
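
A heavily reduced sketch of such a schema, again with Python's standard `sqlite3` module: the Sequence transaction entity sits in the center and references two arms, Gene (gene structure) and Species > Cluster (phylogeny). Entity and attribute names follow the ERD; the sample rows (ClusterA, SpeciesX, and so on) are invented for illustration. The same count of sequences is reported at two breakup levels of the phylogeny dimension. As a side note, designs whose arms are chains of normalized level tables, as here, are often called snowflake schemas in the data warehousing literature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Arms: master data entities of the dimension levels
CREATE TABLE Cluster (ClusterID INTEGER PRIMARY KEY, ClusterName TEXT);
CREATE TABLE Species (SpeciesID INTEGER PRIMARY KEY, SpeciesName TEXT,
                      ClusterID INTEGER REFERENCES Cluster(ClusterID));
CREATE TABLE Gene (GeneID INTEGER PRIMARY KEY, GeneName TEXT);
-- Center: transaction entity with one foreign key per dimension
CREATE TABLE Sequence (SequenceID INTEGER PRIMARY KEY, StartPos INTEGER, EndPos INTEGER,
                       GeneID INTEGER REFERENCES Gene(GeneID),
                       SpeciesID INTEGER REFERENCES Species(SpeciesID));
-- Hypothetical sample rows
INSERT INTO Cluster VALUES (1, 'ClusterA');
INSERT INTO Species VALUES (1, 'SpeciesX', 1), (2, 'SpeciesY', 1);
INSERT INTO Gene VALUES (1, 'Gene1'), (2, 'Gene2');
INSERT INTO Sequence VALUES
    (1, 10, 90, 1, 1), (2, 15, 80, 1, 1), (3, 5, 70, 1, 2), (4, 40, 95, 2, 2);
""")

# Drilled down: sequence count per Species x Gene
by_species = conn.execute("""
    SELECT s.SpeciesName, g.GeneName, COUNT(*)
    FROM Sequence q
    JOIN Species s ON s.SpeciesID = q.SpeciesID
    JOIN Gene g ON g.GeneID = q.GeneID
    GROUP BY s.SpeciesID, g.GeneID
    ORDER BY s.SpeciesID, g.GeneID
""").fetchall()

# Aggregated one level up the phylogeny hierarchy: per Cluster x Gene
by_cluster = conn.execute("""
    SELECT c.ClusterName, g.GeneName, COUNT(*)
    FROM Sequence q
    JOIN Species s ON s.SpeciesID = q.SpeciesID
    JOIN Cluster c ON c.ClusterID = s.ClusterID
    JOIN Gene g ON g.GeneID = q.GeneID
    GROUP BY c.ClusterID, g.GeneID
    ORDER BY c.ClusterID, g.GeneID
""").fetchall()

print(by_species)  # [('SpeciesX', 'Gene1', 2), ('SpeciesY', 'Gene1', 1), ('SpeciesY', 'Gene2', 1)]
print(by_cluster)  # [('ClusterA', 'Gene1', 3), ('ClusterA', 'Gene2', 1)]
```

Changing the breakup level here means rewriting the JOIN and GROUP BY clauses by hand, which is exactly the manual coding an OLAP front end is meant to replace.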


OLAP: Terminology

[Figure: data cube of sequence counts (Seqn. Count) with the dimensions Protein (Protein1, Protein2), Gene (Gene1, Gene2) and Splice (Splice1, Splice2)]

  • OLAP systems import data from star schema based relational databases, but use slightly modified basic terms and more advanced data storage tools, which consume far more computing resources:

  • Dimension (Dimenzió): a variable by which data can be grouped: e.g. GeneStructure

  • Level (Szint): the hierarchic internal structure of a dimension, based on a chain of 1:many relationships: e.g. Genome > Chromosome > Strand > Gene > Exon

  • Position (Pozíció): the possible values of a dimension level: e.g. Strand: (5’-3’, 3’-5’)

  • Data Cube (Adatkocka): a multi-dimensional data storage and aggregation object consisting of:

    • Cells (Cella): data storage units formed by the Cartesian product of the positions of the selected levels of the dimensions: e.g. Protein(P001, P002) × Gene(G001, G002) = (P001, G001), (P001, G002), (P002, G001), (P002, G002). They store:

    • Measures (Mérték): data aggregated or computed from the transaction data records: e.g. count of sequences

  • View (Nézet): as data cubes are multi-dimensional objects, they cannot be fully represented on a 2D display (screen or printout). A view is a part of the cube selected dynamically by the user with:

    • Row Dimension|Level

    • Column Dimension|Level

    • For any further dimensions: page filter positions

    • Cell data content measures:

      • Aggregated from the original data (Count, Sum, Avg, Min, Max)

      • Calculated by a mathematical formula (Ln, Sin, Cos, Exp, ...)
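
The terms above can be tied together in a small pure-Python sketch. It is an illustration only, not how any particular OLAP engine stores its cubes: the `view` helper, the Splice positions S1/S2 and the transaction records are hypothetical. Cells are built as the Cartesian product of the positions, the measure is the count of sequences, and a view picks a row dimension, a column dimension and optional page filter positions.

```python
from collections import Counter
from itertools import product

# Hypothetical transaction records: one (Protein, Gene, Splice) tuple per observed sequence
records = [
    ("P001", "G001", "S1"), ("P001", "G001", "S1"), ("P001", "G001", "S2"),
    ("P002", "G001", "S1"), ("P002", "G002", "S2"), ("P002", "G002", "S2"),
]

# Positions of the selected level of each dimension
positions = {
    "Protein": ["P001", "P002"],
    "Gene": ["G001", "G002"],
    "Splice": ["S1", "S2"],
}

# Cells: Cartesian product of the positions; measure: count of sequences per cell
counts = Counter(records)
cube = {cell: counts.get(cell, 0) for cell in product(*positions.values())}


def view(cube, positions, row, col, page_filter):
    """Return a 2D slice {(row_pos, col_pos): count}.

    Dimensions named in page_filter are fixed to one position;
    any remaining dimensions are aggregated (summed) over.
    """
    dims = list(positions)
    table = {}
    for r in positions[row]:
        for c in positions[col]:
            total = 0
            for cell, value in cube.items():
                coords = dict(zip(dims, cell))
                if coords[row] == r and coords[col] == c and all(
                    coords[d] == p for d, p in page_filter.items()
                ):
                    total += value
            table[(r, c)] = total
    return table


# Protein x Gene view, page filter Splice = S1
print(view(cube, positions, "Protein", "Gene", {"Splice": "S1"}))
# Same rows and columns, aggregated over all Splice positions
print(view(cube, positions, "Protein", "Gene", {}))
```

Swapping the row and column arguments or changing the page filter gives a different view of the same cube without touching the underlying records, which is the interactive behavior described above.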


Software: NCBI Genome

  • http://www.ncbi.nlm.nih.gov/sites/genome


References

  • Theory of Genome Projects and Annotation:

    • http://www.plosone.org/article/info:doi/10.1371/journal.pone.0006291

    • http://en.wikipedia.org/wiki/Genome_project

    • http://www.arabidopsis.org/portals/genAnnotation/genome_annotation_tools/index.jsp

    • http://vega.sanger.ac.uk/index.html

    • http://www.ensembl.org/index.html

  • Genome Browser Software:

    • http://www.ncbi.nlm.nih.gov/sites/genome

    • http://genome.ucsc.edu/

    • http://genome.ucsc.edu/cgi-bin/hgGateway

    • http://www.bioviz.org/igb/

    • http://genoviz.sourceforge.net/

    • http://www.affymetrix.com/partners_programs/programs/developer/tools/download_igb.affx

    • http://apollo.berkeleybop.org/current/index.html

