Stuart M. Brown, Ph.D. Director: NYU Bioinformatics Core

Presentation Transcript

Bioinformatics Data and Databases

Stuart M. Brown, Ph.D.

Director: NYU Bioinformatics Core

Biologists Collect Lots of Data
  • Hundreds of thousands of species
  • Millions of articles in scientific journals
  • Genetic information:
    • gene names
    • phenotype of mutants
    • location of genes/mutations on chromosomes
    • linkage (distances between genes)
High-Throughput Lab Technology
    • PCR
    • Rapid inexpensive DNA sequencing
    • Many methods of collecting genotype data
      • Assays for specific polymorphisms
      • Genome-wide SNP chips
  • Must have data quality assessment prior to analysis
What is a Database?
  • Organized data
  • Information is stored in "records" and "fields"
  • Fields are categories
    • Must contain data of the same type
  • Records contain data that is related to one object
A Spreadsheet can be a Database
  • Columns are Fields
  • Rows are Records
  • Can search for a term within just one field
  • Or combine searches across several fields
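
The spreadsheet view above can be sketched in a few lines of Python. This is only an illustration: each row is a record (a dict), each key is a field, and the gene names and values are invented, not taken from any real database.

```python
# Toy "spreadsheet": each dict is one record (row); each key is a field (column).
# All values are made-up illustration data.
records = [
    {"gene": "geneA", "organism": "mouse", "chromosome": "1"},
    {"gene": "geneB", "organism": "human", "chromosome": "1"},
    {"gene": "geneC", "organism": "human", "chromosome": "X"},
]

def search(rows, **criteria):
    """Return rows whose fields equal every given field=value pair (an AND search)."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]

# Search for a term within just one field
human_genes = search(records, organism="human")

# Combine searches across several fields
human_chr1 = search(records, organism="human", chromosome="1")

print([r["gene"] for r in human_genes])
print([r["gene"] for r in human_chr1])
```

Exact matching keeps the sketch short; a real query tool would typically also offer substring or pattern matching within a field.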
Data Formats
  • How to organize various types of genetic data?
  • Need standard formats
  • DNA sequence = GATC, but what about gaps, unknown letters, etc.
    • How many letters per line
    • ?? Spaces, numbers, headers, etc.
    • Store as a string, code as binary numbers, etc.
  • Use a completely different format for proteins?
FASTA Format
  • In the process of writing a similarity searching program (in 1985), William Pearson designed a simple text format for DNA and protein sequences
  • The FASTA format is now used by virtually all databases and software that handle DNA and protein sequences

One header line, which starts with > and ends with a [return]

All other characters are part of the sequence. Most software ignores spaces and carriage returns; some also ignores numbers.

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..

CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA

ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT

GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC

CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG

TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA

GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT

CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA

TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
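
A minimal sketch of reading one such record in Python, assuming the lenient rules described above (drop whitespace and digits from the sequence lines). The function name and the short demo record are hypothetical; this is not any particular tool's parser.

```python
def read_fasta_record(text):
    """Split one FASTA record into (header, sequence).

    The first line must start with '>'; everything after it is sequence.
    Whitespace and digits in the sequence lines are dropped, which mirrors
    the lenient behaviour described in the slide (an assumption, not a spec).
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must begin with a '>' header line")
    header = lines[0][1:].strip()
    sequence = "".join(
        ch for line in lines[1:] for ch in line
        if not ch.isspace() and not ch.isdigit()
    )
    return header, sequence

# Hypothetical record with position numbers and spaces mixed into the sequence
example = """>demo_seq hypothetical example record
  1 GATCGATC GGAT
 13 CCTA
"""
header, sequence = read_fasta_record(example)
print(header)
print(sequence, len(sequence))
```

Dropping digits is safe for DNA and protein letters, but it is a design choice: a stricter reader would reject unexpected characters instead of silently discarding them.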

Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507 PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;

MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ

PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA

SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ

YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR

DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE

IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL

>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;

MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV

RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS

>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;

MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW

NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER

RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY

QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN

>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;

MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK

NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE

LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN

LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC

ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD

GPVNNNY
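
A file like the one above can be read record by record. The sketch below assumes that every line starting with > begins a new record and that headers follow the "ID key=value; key=value;" pattern seen in these FlyBase entries. The helper names are hypothetical, the first header is shortened from the slide, and the second record is invented.

```python
def iter_fasta(lines):
    """Yield (header, sequence) for each record in a multi-sequence FASTA input."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between sequence lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def parse_header(header):
    """Split 'ID key=value; key=value; ...' into (ID, dict of attributes)."""
    identifier, _, rest = header.partition(" ")
    attributes = {}
    for item in rest.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            attributes[key] = value
    return identifier, attributes

# First record: header shortened from the slide. Second record: invented.
fasta_text = """\
>FBpp0082232 type=protein; name=mRpS21-PA; species=Dmel; length=87;

MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV

RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS

>demo_protein type=protein; species=hypothetical; length=5;

MKLVL
"""

for header, seq in iter_fasta(fasta_text.splitlines()):
    identifier, attrs = parse_header(header)
    # Compare the declared length field with the actual sequence length
    print(identifier, attrs["length"], len(seq))
```

Checking the declared length against the parsed sequence is a cheap sanity test that a record was read completely.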

Other Standards?
  • Other types of important medical and genetic data may not have universal standards:
    • Genotype/haplotype
    • Clinical records
    • Gene expression
    • Protein structure
    • Alignments
    • Phylogenetic trees
Reformatting Data Files
  • Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software
  • Then messing around with the results produced by that software to create a useful summary…
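
As a small example of this kind of reformatting, the sketch below rewrites one sequence either as FASTA with a fixed line width or as a single tab-delimited line. The function names, the default width of 60 and the demo record are assumptions chosen for illustration.

```python
def wrap_sequence(sequence, width=60):
    """Break a sequence string into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]

def to_fasta(header, sequence, width=60):
    """Format one record as FASTA text with a fixed line width."""
    return "\n".join([">" + header] + wrap_sequence(sequence, width)) + "\n"

def to_tab_delimited(header, sequence):
    """Format one record as a single 'header<TAB>sequence' line."""
    return f"{header}\t{sequence}\n"

record = ("demo_seq", "GATC" * 5)  # hypothetical 20-base sequence
print(to_fasta(*record, width=8), end="")
print(to_tab_delimited(*record), end="")
```

The tab-delimited form loads directly into a spreadsheet, while the wrapped FASTA form is what most sequence software expects.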
Public Databases
  • In addition to your own experimental data, access to public data is essential for epidemiology
    • Complete genome sequences (human and pathogens/vectors)
    • SNPs
    • Genotypes
    • Population Sets
    • Supplemental data for specific Journal articles
GenBank is a Database
  • Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
  • Flatfile: Composed entirely of text
    • you could print the whole thing out
  • Each submitted sequence is a record
  • Has fields for Organism, Date, Author, etc.
  • Unique identifier for each sequence
    • Locus and Accession #
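
To make the "text records with fields" idea concrete, here is a rough sketch that reads a simplified flatfile-style record into a dict and indexes it by accession. The record is invented, and the real GenBank flatfile format has more structure (feature tables, sub-keywords) than this toy parser handles.

```python
def parse_flatfile_record(text):
    """Parse 'KEYWORD   value' lines into a dict of fields.

    Lines that start with whitespace are treated as continuations of the
    previous field. A line starting with '//' ends the record.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if not line.strip():
            continue
        if line[0].isspace():
            if current is not None:
                fields[current] += " " + line.strip()
        else:
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
    return fields

# Invented record for illustration only (not a real database entry)
record_text = """\
LOCUS       DEMO0001     20 bp    DNA
DEFINITION  Hypothetical example record for illustration
            only.
ACCESSION   DEMO0001
//
"""

fields = parse_flatfile_record(record_text)

# Index records by their unique identifier so lookups are unambiguous
by_accession = {fields["ACCESSION"]: fields}
print(by_accession["DEMO0001"]["DEFINITION"])
```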
Accession Numbers!!
  • Databases are designed to be searched by accession numbers (and locus IDs)
  • These are guaranteed to be non-redundant, accurate, and not to change.
  • Searching by gene names and keywords is doomed to frustration and probable failure
  • Neither scientists nor computers can be trusted to accurately and consistently annotate database entries
  • If only scientists would refer to genes by accession numbers in all published work!
http://www.ncbi.nlm.nih.gov/Genbank
  • GenBank is managed by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine at the NIH
  • Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.
  • Now GenBank is over 100 billion bases
  • Scientists access GenBank directly over the Web at www.ncbi.nlm.nih.gov

What is GenBank?

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2007 Jan;35(Database issue):D21-5).

There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006.

Relational Databases
  • Databases can be more complex than a single spreadsheet
  • GenBank has proteins and SNPs as well as DNA
  • Some fields (e.g. phosphorylation sites) apply to protein, but not DNA
  • Better to create a separate spreadsheet format for Protein records
  • Each different spreadsheet is called a Table
  • Different Tables are linked by key fields
    • (e.g. DNA and protein for the same gene)
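
The two-table idea can be sketched with Python's built-in sqlite3 module. The table names, column names and rows below are invented for illustration and do not reflect how GenBank is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dna (
        gene_id   TEXT PRIMARY KEY,   -- key field shared by both tables
        organism  TEXT,
        sequence  TEXT
    );
    CREATE TABLE protein (
        protein_id TEXT PRIMARY KEY,
        gene_id    TEXT REFERENCES dna(gene_id),
        phosphorylation_sites TEXT    -- a field that only applies to protein
    );
""")

# Made-up rows: one DNA record and the protein record for the same gene
conn.execute("INSERT INTO dna VALUES ('gene1', 'demo organism', 'ATGGCC')")
conn.execute("INSERT INTO protein VALUES ('prot1', 'gene1', 'S12,T40')")

# Link the two tables through the shared key field
rows = conn.execute("""
    SELECT dna.gene_id, dna.sequence, protein.protein_id,
           protein.phosphorylation_sites
    FROM dna JOIN protein ON protein.gene_id = dna.gene_id
""").fetchall()
print(rows)
conn.close()
```

Keeping protein-only fields in their own table avoids empty columns on every DNA record, and the key field is what lets a query bring the two back together.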
Many Tables at NCBI
  • The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
    • Journal Articles (PubMed)
    • Genetic Diseases (OMIM)
    • Polymorphisms (dbSNP)
    • Cytogenetics (CGH/SKY/FISH & CGAP)
    • Gene Expression (GEO)
    • Taxonomy
    • Chemistry (PubChem)
Database Design

A database can only be searched in ways that it was designed to be searched

You can search within a specific Field in a specific Table - and sometimes can combine searches from different Fields and/or Tables

(Boolean: "AND" and "OR" searches)

Bad to search for "human hemoglobin" in a 'Description' field

Much better to search for "Homo sapiens" in 'Organism' AND "HBB" in 'gene name'
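
A field-restricted Boolean query like this is often written as term[Field] joined by AND / OR. The sketch below only builds such a query string; the helper names are hypothetical, and the exact field names accepted depend on the database being searched.

```python
def field_term(term, field):
    """Format one search term restricted to a named field: term[Field]."""
    if " " in term:
        term = f'"{term}"'  # quote multi-word terms so they stay together
    return f"{term}[{field}]"

def combine(terms, operator="AND"):
    """Join field-restricted terms with a Boolean operator (AND / OR)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(terms)

# Field names here are assumptions; check the target database's field list
query = combine([
    field_term("Homo sapiens", "Organism"),
    field_term("HBB", "Gene Name"),
])
print(query)
```

Building the query from (term, field) pairs makes it harder to fall back on a free-text search of a description field by accident.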

Web Query
  • Most Scientific databases have a web-based query tool
  • It may be simple…
ENTREZ has pre-computed links between Tables
  • Relationships between sequences are computed with BLAST
  • Relationships between articles are computed with MeSH terms (Medical Subject Headings: shared keywords)
  • Relationships between DNA and protein sequences rely on accession numbers
  • Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.
Other Important Databases
  • Genomes
  • Proteins
  • Biochemical & Regulatory Pathways
  • Gene Expression
  • Genetic Variation (mutants, SNPs)
  • Protein-Protein Interactions
  • Gene Ontology (Biological Function)

UCSC Genome Browser

Search by gene name:

or by sequence:


Lots of additional data can be added as optional "tracks"

- anything that can be mapped to locations on the genome

SNPs (Single Nucleotide Polymorphisms)
  • Genetic variation
  • Can be alleles of genes
  • Also differences in non-coding regions, collected from genome sequencing of different individuals
  • dbSNP at the NCBI - all public SNP data
  • SNP Consortium at CSHL - high quality set
KEGG: Kyoto Encyclopedia of Genes and Genomes
  • Enzymatic and regulatory pathways
  • Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)

  • Parallel maps of regulatory pathways
Protein-Protein Interactions
  • Metabolic and regulatory pathways
  • Transcription factors
  • Co-expression
  • Biochemical data
    • crosslinking
    • yeast 2-hybrid
    • affinity tagging
  • Useful feedback to genome annotation/protein function and gene expression
Gene Ontology
  • Genetics is a messy science
  • Scientists have been working in isolation on individual species for many years - naming genes, mutants, odd phenotypes
    • “sonic hedgehog”
  • Now that we have complete genome sequences, how to reconcile the names across all species?
  • Gene Ontology uses a single 3-part system
    • Molecular function (specific tasks)
    • Biological process (broad biological goals - e.g. cell division)
    • Cellular component (location)
Database Search Strategies
  • General search principles - not limited to sequence (or to biology)
  • Use accession numbers whenever possible
  • Start with broad keywords and narrow the search using more specific terms
  • Try variants of spelling, numbers, etc.
  • Search all relevant databases
  • Be persistent!!
Bioinformatics Paradigm
  • Find the data
  • Download the data
  • Reformat the data
  • Collect the samples
  • Run molecular analysis
  • Filter the data
  • Run analysis software
  • Collect and sort results
  • Publish / Data sharing