Genomic Database

Genomic Database By Xuehua Liu Simran Chhugani

Topics covered • What’s Genomic Database & Why it is important • Database Examples and General Characteristics • Data Models: RDBMS, OODBMS, ORDBMS • Algorithm Study for Sequence Analysis • Programming Languages • Deep research on popular similarity searching tools: FASTA, BLAST • Problem Existed and Future Development direction • Summary

What is a Genomic Database • Genomic Database is to organize all of the information on an organism into a computer database. • brought about by the Human Genome Project (HGP) • The subject of genomic databasing goes beyond the databasing of a single species. • It is a logically derived single database from the available data sets • Is in the field of bioinformatics and computational biology.

Why it is an important topic • It works with more than one database—integration is important. • It has a capability beyond the relational database searching. • Popular searching tool BLASTA is first developed for it. • Others like: • Genomic database is being used by researchers for drug design, medical care, phylogenetic analysis, evolutionary analysis, personalized medicine, and many other applications. • The need to rapidly analyze large-scale genomic sequences, keeping up with the flood of data from HGP. • It is an important topic you should know for your future. News On Post Bulletin

Examples and the General Characteristics--Examples • GDB: The Genome Database is a unique public repository of confirmed information about human genes—human genomic mapping data and supporting information, created by HGP. • Its goal is to elucidate the information that makes up the genetic blueprint of human beings. • Comprises descriptions of three types of objects from humans: Genomic Segments, maps, and Variations. • started at Johns Hopkins University in 1990, since 1998 the Hospital for Sick Children in Toronto, Canada, has maintained it, and later it has found a new home at RTI International. • NCBI: The National Center for Biotechnology Information • Its organization behind is a US Government agency • NCBI uses four core data elements: DNA sequences, protein sequences, three-dimensional structures, and bibliographic citations. • Facilitate not only data retrieval but also discovery, and it’s stable. • Has developed many useful resources and tools: • Tools: Genomes Division of Entrez; the NCBI Human Genome map Viewer, e-PCR, BLAST, the GeneMap ’99, • Databases: the LocusLink, OMIM, dbSTS, dbSNP, dbEST, and UniGene databases.

Examples and the General Characteristics—Examples (cont’) • MGI: The Mouse Genome Initiative Database is the primary public mouse genomic catalogue resource. • Encompassing three databases: MGD, GXD, and MGS. • MGD: The Mouse Genome Database is one of the most comprehensively curated genetic databases. • Other Genomic Databases: • http://www.ebi.ac.uk/2can/databases/genomic.html • http://biobase.dk/Embnetut/Gdb5omim/g5opt7.html

Examples and the General Characteristics--Characteristics • Genomic database contains complete genome sequences. They are long. • More types of data included: • information on DNA and protein sequences • Genetic, physical, and cytological maps • Genes and their mutant phenotypes • Protein and RNA products and their properties • Biochemical or developmental pathways • Regulatory circuits • Literature references • And many other types of data, as needed to describe the organism fully

Examples and the General Characteristics— Characteristics (cont’) Example: GOBASE integrates diverse data types.

Examples and the General Characteristics— Characteristics (cont’) • Data formats in all interested species databases used by Genomic Database should be standard • So that the data are readily retrievable by database programs, and interspecies exchanges can be facilitated. • What it should be able to be done with Genomic Database : • Makes relational types of database searches, consisting of retrieving complete data entries. • Goes beyond relational database searching to search for matches within a data entry • Supports different types of interactive displays. • All the information should be appropriately cross-referenced so that the sets of data behave effectively as a single database.

Data Models --RDBMS • RDBMS • Advantages: • This model is well supported and readily available. • Best meet the immediate need to enter data concurrently and to maintain the data in a consistent format. • Permits use of existing databases and provides reliable support for large dtabases. • Disadvantages: • Not very satisfactory for support of the functionality required for a genomic database, like supporting map type. • Supports value rather than identity semantics, so the interconnections between arbitrary groups of items are not possible. • Difficult to achieve a fast, interactive response, in particular to browse the interconnections across various data sets. • RDBMS's usually store complex data as uninterpreted BLOBs (Binary Large Objects)

Data Models --OODBMS • OODBMS • Advantages: • Every data item is represented as a typed “objects” and different operations can be implemented for each object type. • Related objects can be easily interconnected. • So, this model can provide a powerful system for data manipulation and cross-reference. • Disadvantages: • Many ODBMS systems seem to lack very important high level features, including scalability, security, server-side functions and concurrency. • Suffered from the lack of a common query language.

Data Models -- ORDBMS • ORDBMS: The Next Wave • They are relational in nature because they support SQL; they are object-oriented in nature because they support complex data. • In essence they are a marriage of the SQL from the relational world and the modeling primitives from the object world. • Vendors of Object-Relational DBMSs include Illustra, UNISQL, Omniscience, and Hewlett-Packard (with their Odaptor for Allbase and more recently for Oracle). • Advantages: • The whole goal is to allow users to ask more complex queries on more complex data than could be done with previous DBMS's. • It combines the best of both worlds.

Data Models -- Summary • ORDBMS is an application that is query-oriented and requires complex data.

Algorithm Study for Sequence Analysis • Parallel algorithms for sequence analysis. • The recognition of important features in a sequence, such as genes, must be highly automated to eliminate the need for time-consuming manual gene model building. • Five distinct types of algorithms must be combined into a coordinated toolkit to synthesize the complete analysis: • Pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining. • Similarity Sequence Searching using sequence comparison algorithm • High-speed sequence comparison is used to compare one DNA or protein sequence with another in a way that extracts how and where the two sequences are similar. • Many organisms share many of the same basic genes and proteins, and information about a gene or protein in one organism provides insight into the function of its “relatives” or “homologs” in other organisms. • Experiments in simpler organisms often provide insight into the importance of a gene in humans, so sequence comparison is a very important tool.

Algorithm Study for Sequence Analysis (cont’) • Popular Similarity Searching Tools: • FASTA: The first widely-used program for database similarity searching. • To achieve a high degree of sensitivity, this program performs optimized searches for local alignments using a substitution matrix. • The FASTA program does not investigate every word hit encountered, but instead looks initially for segments containing several nearby hits. • BLAST: Introduced a number of refinements to database searching that improved overall search speed and put database searching on a firm statistical foundation. • There are several variants of BLAST, all of which make sure of sequence databases located on server machines, which obviates the need for any local database maintenance.

Programming Languages • Bio*-Programming tools: http://www.bioinformatics.ca/links_directory/?subcategory_id=46 • BioJava: The BioJava Project is an open-source project dedicated to providing Java tools for processing biological data. • This will include objects for manipulating sequences, file parsers, CORBA interoperability, DAS, access to ACeDB, dynamic programming, and simple statistical routines to name just a few things. • BioPerl: The BioPerl Project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research. • Bioperl is a collection of more than 500 Perl modules for bioinformatics. • BioPython: The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

Programming Languages (cont’) • Application Languages: • Perl: Among the many computer languages available to biologists, Perl--with its highly developed capacities in string handling, text processing, networking, and rapid prototyping--has emerged as the programming language of choice for biological data analysis. • Python, an interpreted, interactive, object-oriented programming language used widely in Bioinformatics.

Programming Languages (cont’)

Similarity Sequence Searching

The end Any Questions? Thank you!

References • Baxevanis, Andreas D. Ouellette, B.F. Francis. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Second Edition. Wiley-interscience. • Smith Douglas W. Biocomputing: Informatics and Genome Projects. Academic Press. • Tisdall, James D. Master Perl for Bioinformatics. O’Reilly. • Microbial Genome Program. By Office of Science Office of biological and Environmental Research. • www.gdb.org • http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml • http://s2k-ftp.cs.berkeley.edu:8000/postgres/papers/Informix/www.informix.com/informix/corpinfo/zines/whitpprs/illuswp/wave.htm • http://genome-www.stanford.edu/ • http://www.bioinformatics.ca/links_directory/?category_id=4 • http://www.well.ox.ac.uk/~johnb/genomic.html#databases • http://weedsworld.arabidopsis.org.uk/Vol3ii/Cherry-Flanders-Petel.WW.html • http://megasun.bch.umontreal.ca/gobase/gobase2.html • http://www.ebi.ac.uk/2can/databases/index.html • http://www.msi.umn.edu/cgl/software/ • http://db.cs.berkeley.edu/papers/ERL/erl00.html • http://www.ncbi.nlm.nih.gov/BLAST/

Similarity Sequences searching • Comparison of two sequences in biological data • Similarity sequence searching is the solution to this problem • It discards most irrelevant sequences and perform the exact algorithm on the small group of remaining sequences.

How does similarity sequence work • Set of algorithms like FASTA and BLAST programs are used to compare the query sequence to all sequence in the specified database • Comparisons are made in pair wise • Each comparison is given a score reflecting the degree of similarity between the query and the sequence compared • Similarity is shown by aligning the two sequences • There are two types of alignments global and local

FASTA • Developed by Lipman and Pearsonin 1988 . It has following stages: • Stage I: • The algorithm looks up for ktup(short for k respective tuples). The standard ktup 6 for DNA and 2 for protein sequence matching. The matching ktup length substrings are referred to as hotspots. All the ktup substrings are stored in lookup table or hash. • A positive score is given for each hotspot and a negative score is given for the space between the consecutive hot spots. • The single best sub alignment found is called init1where indel are not allowed as it derived from single diagonal, and low scoring sub alignments are discarded.

FASTA (Cont ‘) • Stage II • Now the diagonal run allows the indel and then subalignment retrieved is combined with the subalignment derived from the previous run. This way the subalignments whose scoring is above some cutoff level are combined together into a single high scoring alignment where spaces are allowed this alignment is called initn. • Stage III • In the last stage the database sequence are ranked according to initn scores and the full dynamic programming algorithm is used to align the query sequence against each of the highest ranking result sequences.

BLAST (Basic Local Alignment Search Tool ) • BLAST introduced in 1990 is a set of program which uses an algorithm developed by (NCBI) that seeks out local alignment (alignment of some portion of two sequences) as opposed to global alignment (alignment of two sequences over their entire length). By searching for local alignment, BLAST can identify regions of similarity in two sequences. • BLAST algorithm works as follows : • BLAST searches for Ktup words for protein 3 letter words, and for DNA 11 letter words. In BLAST the words can be similar, not necessarily identicalExample:. • Blast finds the matching pairs and then extend the word pairs as much as possible ie, as long as the total weight increases. The result is the high scoring segment pairs (HSPs).

BLAST (Cont’d) • Connects the HSPs by aligning the sequence in between them. The quality of each pair-wise alignment is represented as a sore and the scores are ranked. • Scoring matrices are used to calculate the score of the alignment base by base a unitary matrix is used for DNA pairs and substitution matrices are used for amino acid alignments. Positions at which a letter is paired with a null are called gaps. Gap scores are negative. • The significance of each alignment is computed as a P value or an E value. For example high scoring pairs whose similarity is based on repeated amino acid stretches (e.g. poly glutamine) are unlikely to reflect meaningful similarity between the query and the match.

BLAST VS FASTA • While BLAST and FASTA essentially perform the same tasks, each tool performs better in certain queries than in other. • One of the biggest differences between BLAST and FASTA is time. • Another big difference between the two tools is the database access that each provides. • BLAST offers the flexibility of various search modes which FASTA does not provide. • BLAST's algorithm allows for gapped matches. This allows BLAST to include sequences that, while not being a perfect match, may be a significant match for the query. • FASTA's algorithm requires a perfect match during the first stage of the search that can force the tool too overlook some significant matches. • Looking at the various advantages over FASTA, BLAST has become the dominant search engine for biological sequence database.

BLAST (Cont’d) • There are several BLAST services : • BLASTP – comparing an amino acid query sequence with other stored in protein sequence databases. • BLASTN – comparing a nucleotide query sequence against a nucleotide sequence database. • BLASTX – comparing a nucleotide query sequence translated in all reading frames with other amino acid sequence stored in protein sequence databases. • TBLASTN – compares a protein query sequence against a nucleotide sequence database, dynamically translated in all six reading frames (both strands). • TBLASTX – compares six frame conceptual translation products of a nucleotide query sequence (both strands) against a nucleotide sequence database dynamically translated in all six reading frames (both strands). • Links :http://www.ncbi.nlm.nih.gov/BLAST/

Genomic database the future problems • Major challenge is to describe complete biological systems using integrated approaches. • The problem is further extended by the large heterogeneity of the underlying data and storage formats that make it difficult for scientists to retrieve and analyze the information that that they need. • Organizing, combining and interpreting biological information has become more complex. • Equally essential for the success of a database is the quality of the stored information. • Most relevant data are not actually stored in databases but in the scientific literature. • In the future, information- extraction and data-mining techniques will need to fill the gaps by providing additional features directly retrieved from texts and other sources.

Summary • Importance of Genomic Database in the field of drug design and medical care. • Importance of accurate and fast retrieval of the data from the genomic database • Genomic Database needs to support relational database searching, match searching within a data entry, interactive display, and cross-referencing. • Popular Similarity Sequence Searching tools such as FASTA, BLAST using sequence comparison algorithm. • Perl is the most popular programming language used in Bioinformatics. • The ORDBMS is the next wave for Genomic Database.

K-tup

Init1

Initn

Blast words

HSP’s

Connecting the HSPs

Genomic Database

Genomic Database

Presentation Transcript

Genomic Database Update

Genomic Instability

Genomes on Line Database The GOLD genomic standards

Realized Genomic Relationships and Genomic BLUP

Genomic Complexity

Genomic CDS

Genomic Imprinting

Genomic Imprinting

Genomic Firsts

Cow Adjustment and Genomic Database Update

Genomic Databases

Genomic Circuits

Genomes on Line Database The GOLD genomic standards

Genomic Sequencing

GENOMIC IMPRINTING

Genomic Analysis

Genomic Screening

Genomic Evaluations

Results Genomic Analysis:

Genomic Database - Ensembl

PCR Genomic technologies

Biological (genomic) information