Comprehensive Introduction to Biostatistics and Bioinformatics

Introduction to Biostatistics and Bioinformatics http://fenyolab.org/ibb2015/ ibb2015@fenyolab.org

Introduction to Biostatistics and Bioinformatics • Lectures: Tuesdays and Thursdays 2-4 pm in different locations: Skirball 3rd and 4th Floor Seminar Rooms and Smilow 1st Floor Seminar Room. Check web site: http://fenyolab.org/ibb2015. • Tutorials: Thursdays 4 pm in Room 718 in the Translational Research Building (227 E 30th St between 2nd and 3rd). • Homework: Given on Tuesdays and due on Sunday. • Course Assessment: Participation (20%), Assignments (40%), Exam (40%)

Introduction to Biostatistics and Bioinformatics Learning Objective • Good experimental design. • Automatically process data generated in the lab and combine it with public data. • Statistical methods for data analysis.

Introduction to Biostatistics and Bioinformatics Lectures • Introduction to Biological Data • Introduction to Python I • Introduction to Python II • Introduction to Python III • Introduction to Python IV • Exploring Data & Descriptive Statistics • Sequence Alignment Concepts • Sequence Database Searching • Probability • Distributions • Estimation • Hypothesis Testing • Analysis of Variance • Regression & Correlation • Experimental Design & Analysis

This Lecture IBB_2015 Data types and representations in Molecular Biology

Learning Objectives • text formats for some common genomics data types • formatting text with tag:value pairs • basic database concepts • details of the FASTA format • Data formats in public molecular biology databases • Genbank, dbSNP • Genome Browsers: BED format • Database queries: field specific queries

Biologists Collect Lots of Data • Hundreds of thousands of species • Millions of articles in scientific journals • Genetic information: • gene names • phenotype of mutants • location of genes/mutations on chromosomes • linkage (distances between genes)

High Throughput lab technology • PCR • Gene expression microarrays • Rapid inexpensive DNA sequencing • Many methods of collecting genotype data • Assays for specific polymorphisms • Genome-wide SNP chips • Must have data quality assessment prior to analysis

Data files • Various assay technologies/machines collect raw data in custom formats • Images • Trace files • Machine specific binary formats • Convert to text to share scientific data • Why text? • Does not require custom software to read the data • Stable for long periods of time across different computing systems (ASCII is universal) • Can be smoothly shared across many different computing systems • The WWW is built with text (html)

Text has many different formats
GFF3
##gff-version 3
#!gff-spec-version 1.20
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425
NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
FASTA
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
FASTQ
@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG

tag:value pairs • A very common way to organize text is with tag:value pairs • address: Publisher's address (usually just the city, but can be the full address for lesser-known publishers) • annote: An annotation for annotated bibliography styles (not typical) • author: The name(s) of the author(s) (in the case of more than one author, separated by and) • booktitle: The title of the book, if only part of it is being cited • chapter: The chapter number • HTML is a tag system to display text in web browsers. <b>This text is bold</b> <a href="http://www.w3schools.com">This is a link</a> <h1 style="font-family:verdana">This is a heading</h1> <p style="color:green;margin-left:20px;">This is a paragraph.</p>
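As a rough illustration of how tag:value text can be read by a program, the sketch below turns lines of the form "tag: value" into a Python dictionary. The function name parse_tag_value_lines and the example record are made up for this illustration; real formats usually add their own rules (continuation lines, repeated tags) that this sketch does not handle.

```python
def parse_tag_value_lines(text):
    """Parse lines of the form 'tag: value' into a dictionary."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a tag
        # Split only on the first colon so values may contain colons (e.g. URLs).
        tag, value = line.split(":", 1)
        record[tag.strip()] = value.strip()
    return record


# Hypothetical bibliography-style record, invented for this example.
example = """author: Jane Doe and John Roe
booktitle: An Example Book
chapter: 3
address: New York"""

record = parse_tag_value_lines(example)
print(record["author"])
print(record["chapter"])
```

Note that every value comes back as text; the chapter number is the string "3" until the program converts it.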

What is a Database? • Structured data • Information is stored in "records" and "fields" • Fields are categories • Must contain data of the same type • Records contain data that is related to one object across fields • A record does not need to have data in every field • A record is a series of tag-value pairs where fields are the tags Unique Identifier

A Spreadsheet can be a Database • Columns are Fields • Rows are Records • Can search for a term within just one field • Or combine searches across several fields
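One possible way to picture records and fields in code is a list of dictionaries, where each dictionary is a record and each key is a field. The sketch below searches within one field or combines several fields with AND. The records, their id values and descriptions, and the function name search are invented for illustration and are not taken from any real database.

```python
# Each dictionary is one record; keys are fields. A record need not have every field.
records = [
    {"id": "rec1", "organism": "Homo sapiens", "gene": "HBB",
     "description": "hemoglobin subunit beta"},
    {"id": "rec2", "organism": "Mus musculus", "gene": "Hbb-b1",
     "description": "hemoglobin beta chain"},
    {"id": "rec3", "organism": "Homo sapiens", "gene": "HBA1"},  # no description field
]


def search(records, **criteria):
    """Return records whose fields match every given value (a Boolean AND)."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]


# Search within a single field.
print([r["id"] for r in search(records, organism="Homo sapiens")])

# Combine searches across two fields.
print([r["id"] for r in search(records, organism="Homo sapiens", gene="HBB")])
```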

Spreadsheet data can be saved as tab or comma separated values
Tab delimited
Ovary      embryo (0-3hrs)  embryo (3-6hrs)  embryo (6-9hrs)  Pupal
0.130666   0.0178557        0.476863         6.40536          7.35556
0.443061   0.0144366        0.412562         3.87577          4.73383
0.273747   0.0752579        1.44697          22.5721          16.4197
0.0643887  0.208803         39.1049          58.8561          0.798709
8.93599    0.208607         5.28872          77.217           0.692811
0.229979   0.0298946        0.219502         2.07522          10.2703
0.0277432  0.0220337        0.738405         1.34702          6.14537
1.46693    0.0361666        3.4654           2.5313           3.20737
0.408315   0.0392186        2.53851          4.1273           3.655
0.108006   0.0572734        2.08545          10.0762          3.29876
0.151759   3.82547          485.993          530.451          1.24837
0.0793942  0.129111         5.38445          27.6188          0.23297
0.139144   0.180263         1.06842          35.8966          3.07092
Comma separated (csv)
gene,Ovary,embryo(0-3hrs),embryo(3-6hrs),embryo(6-9hrs),Pupal
LOC100118025,0.04541333,0.006205798,0.165735055,2.226200589,2.556445228
LOC100122637,0.233690353,0.007614514,0.217603805,2.044255893,2.496835435
LOC100116733,0.033557481,0.009225546,0.177377903,2.76701782,2.012821249
LOC100120954,0.003250874,0.010542103,1.974338817,2.971542769,0.040325437
LOC100122540,0.483847049,0.01129521,0.286362403,4.180982477,0.037512862
LOC100119626,0.089661159,0.01165491,0.085576525,0.809059218,4.004048189
Scr,0.016751983,0.013304455,0.445865943,0.813361695,3.710715923
LOC100119924,0.685022497,0.016888969,1.618261922,1.182058753,1.49776786
LOC100121348,0.18959044,0.018210136,1.178691029,1.916404302,1.697104093
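A minimal sketch of reading such a table with Python's standard csv module follows. It uses two rows copied from the csv example on this slide and assumes the first column is named "gene"; the function name read_expression_table is hypothetical. Passing delimiter="\t" reads a tab-delimited file the same way.

```python
import csv
import io

# Two rows copied from the csv example; io.StringIO stands in for an open file.
csv_text = """gene,Ovary,embryo(0-3hrs),embryo(3-6hrs),embryo(6-9hrs),Pupal
LOC100118025,0.04541333,0.006205798,0.165735055,2.226200589,2.556445228
Scr,0.016751983,0.013304455,0.445865943,0.813361695,3.710715923
"""


def read_expression_table(handle, delimiter=","):
    """Return {gene: {column name: float value}} from a delimited text table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    table = {}
    for row in reader:
        gene = row.pop("gene")  # assumes the identifier column is called 'gene'
        table[gene] = {column: float(value) for column, value in row.items()}
    return table


table = read_expression_table(io.StringIO(csv_text))
print(table["Scr"]["Pupal"])
```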

Data Formats • How to organize various types of genetic data? • Need standard formats • DNA sequence = GATC, but what about gaps, unknown letters, etc. • How many letters per line • ?? Spaces, numbers, headers, etc. • Store as a string, code as binary numbers, etc. • Use a completely different format for proteins?

FASTA Format • In the process of writing a similarity searching program (in 1985), William Pearson designed a simple text format for DNA and protein sequences • The FASTA format is now universal for all databases and software that handles DNA and protein sequences • One header line: starts with > and ends with a [return] • All other characters are part of the sequence • Most software ignores spaces and carriage returns; some also ignores numbers
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
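The rules above (a header line starting with >, everything else is sequence, whitespace ignored) can be sketched as a small reader. This is one simple way to do it, not a reference implementation; the function name read_fasta and the short example sequences are invented for illustration, and digits are not stripped here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove spaces inside the line and join the pieces into one string.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


# Made-up sequences, just to exercise the reader.
example = """>seq1 first example
GATTACA GATTACA
GAT
>seq2 second example
MKV
"""

fasta_records = list(read_fasta(example.splitlines()))
for name, sequence in fasta_records:
    print(name, len(sequence))
```

Because the function is a generator, it can walk through a large multi-sequence file one record at a time without holding the whole file in memory.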

Multi-Sequence FASTA file
>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQPKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLASLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQYHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLRDYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPEIVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINWNLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFYQVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELKNCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPELFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKNLDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCCECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
How many fields? What is the key, what is the value? Does the header have tag:value encoded data?
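One way to explore the questions above is to split a header into its tag=value pairs. The sketch below assumes the layout seen in these example headers: an identifier, a space, then "tag=value;" items separated by semicolons. That layout is an observation about these examples, not a rule of the FASTA format, and the function name parse_header is hypothetical. The header used here is an abbreviated copy of the second record.

```python
def parse_header(header):
    """Split a header into (identifier, {tag: value}) using 'tag=value;' items."""
    identifier, _, rest = header.lstrip(">").partition(" ")
    fields = {}
    for item in rest.split(";"):
        item = item.strip()
        if "=" in item:
            # Split on the first '=' only; values may contain ':' and ','.
            tag, value = item.split("=", 1)
            fields[tag] = value
    return identifier, fields


# Abbreviated copy of the second header in the multi-sequence example.
header = (">FBpp0082232 type=protein; "
          "loc=3R:complement(9207109..9207225,9207285..9207431); "
          "ID=FBpp0082232; name=mRpS21-PA; species=Dmel; length=87;")

identifier, fields = parse_header(header)
print(identifier)
print(sorted(fields))
print(fields["length"])
```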

Other Standards? • Other types of important medical and genetic data may or may not have universal standards: • Genotype/haplotype • Clinical records • Gene expression • Genome annotation • Protein structure • Alignments • Phylogenetic trees

Where/How are Data Formats Defined? • Unfortunately, there is no single repository of standards for important, widely used bioinformatics data formats. • Each file type has its own peculiar history, and may or may not have a home database or an official group that maintains and/or enforces a standard. • GenBank format is defined by the NCBI GenBank database: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html • The BED format (for genome intervals) is defined by the UCSC Genome Browser: https://genome.ucsc.edu/FAQ/FAQformat.html#format1 • The GFF3 format is defined on the Sequence Ontology website: http://www.sequenceontology.org/resources/gff3.html • FASTA and FASTQ formats are “de facto” standards that are not formally defined or enforced by anyone: Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71. • I typically Google search “xyz file format definition”

Software File Formats • Within a single programming language (e.g. Python, Perl, Java), file formats are rigorously defined as data types. • Thus, when we are writing a program, we can know with certainty where in the file to find numbers, text, gene IDs, chromosome locations, etc. • There may be challenges when reading data into software from public sources that do not obey the same rigorous standard.

SNPStats HapStat

Reformatting Data Files • Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software • Then messing around with the results produced by that software to create a useful summary…
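As a small example of this kind of reformatting, the sketch below converts one GFF3 feature line into a six-column BED line. The key detail is the coordinate system: GFF3 positions are 1-based and inclusive, while BED starts are 0-based and the end is not included, so only the start is shifted by one. The function name gff3_to_bed is hypothetical, the score is simply set to 0, and the example line is taken from the GFF3 example earlier in this lecture with shortened attributes.

```python
def gff3_to_bed(line):
    """Convert one tab-delimited GFF3 feature line to a 6-column BED line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols
    # Attributes are 'tag=value' pairs separated by semicolons.
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
    name = attrs.get("ID", ftype)  # fall back to the feature type if there is no ID
    bed_start = int(start) - 1     # 1-based inclusive start -> 0-based start
    # The end is unchanged: 1-based inclusive end equals 0-based exclusive end.
    return "\t".join([seqid, str(bed_start), end, name, "0", strand])


gff_line = ("NC_015867.2\tRefSeq\tcDNA_match\t66086\t66146\t.\t-\t.\t"
            "ID=aln0;Target=XM_008204328.1 1 61 +")
bed_line = gff3_to_bed(gff_line)
print(bed_line)
```

The feature covers 61 bases in both files: 66146 - 66086 + 1 in GFF3 and 66146 - 66085 in BED.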

Public Databases • In addition to your own experimental data, access to public data is essential for epidemiology • Complete genome sequences (human and pathogens/vectors) • SNPs • Genotypes • Population Sets • Supplemental data for specific Journal articles

GenBank is a Database • Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research • Flatfile: Composed entirely of text • you could print the whole thing out • Each submitted sequence is a record • Has fields for Organism, Date, Author, etc. • Unique identifier for each sequence • Locus and Accession #

Fields

dbSNP record Reference SNP (refSNP) Cluster Report: rs1042574 Organism: human (Homo sapiens) Molecule Type: Genomic Created/Updated in build: 86/141 Map to Genome Build: 106/Weight Validation Status: byCluster Variation Class: SNV: single nucleotide variation RefSNP Alleles: C/T (FWD) Allele Origin: Ancestral Allele: C Variation Viewer: unknown Clinical Significance: NA MAF/MinorAlleleCount: NA MAF Source: HGVS Names: NC_000014.9:g.24166518C>T, NM_006084.4:c.*322C>T, NT_026437.13:g.5942995C>T >gnl|dbSNP|rs1042574|allelePos=140|totalLen=345|taxid=9606|snpclass=1|alleles='C/T'|mol=Genomic|build=138 CCTTTTTTTT TTTTWADTTT GAGATATACG CCCTCTTTCA TCTGTAAGGG ACTAGGAAAT TCCAAATGGT GTGAACCCAG GGGGCCTTTC CCTCTTCCCT GACCTCCCAA CTCTAAAGCC AAGCACTTTA TATTTTCCT Y TTAGATATTC MCTAAGGACT TAAMATAAAA TTTTATTGAA AGAGGAATCA GTATCTGATT TTCTGGGAGA AGAAGGTAGC AGTGGTCACA GATAGAGATG TAAACTTAAG AGTGGGGCAC TGGGGTTCTC TTCCTGCTGA CATCTCCAGC CTCTTTCCTC TCCTCTGCCC ACAGGTTCTG GCTAAGAKGC TGCCTGGGCC CTGTG

Accession Numbers!! • Databases are designed to be searched by accession numbers (and locus IDs) • These are guaranteed to be non-redundant, accurate, and not to change. • Searching by gene names and keywords is doomed to frustration and probable failure • Neither scientists nor computers can be trusted to accurately and consistently annotate database entries • If only scientists would refer to genes by accession numbers in all published work!

http://www.ncbi.nlm.nih.gov/Genbank • GenBank is managed by the National Center for Biotechnology Information (NCBI) at the NIH (part of the U.S. National Library of Medicine) • Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year. • Now GenBank holds over 150 billion bases • Scientists access GenBank directly over the Web at www.ncbi.nlm.nih.gov

What is GenBank? GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2007 Jan ;35(Database issue):D21-5). There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006.

Relational Databases • Databases can be more complex than a single spreadsheet • GenBank has proteins and SNPs as well as DNA • Some fields (e.g. phosphorylation sites) apply to protein, but not DNA • Better to create a separate spreadsheet format for Protein records • Each different spreadsheet is called a Table • Different Tables are linked by key fields • (e.g. DNA and protein for the same gene)
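The idea of separate Tables linked by a key field can be sketched with Python's built-in sqlite3 module. Everything in this example is invented for illustration: the table and column names, the identifiers such as DNA_0001, and the values. It is not a picture of how GenBank is actually organized.

```python
import sqlite3

# An in-memory database with two tables linked by the key field dna_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dna (
    dna_id TEXT PRIMARY KEY,
    gene   TEXT,
    length INTEGER
);
CREATE TABLE protein (
    protein_id    TEXT PRIMARY KEY,
    dna_id        TEXT REFERENCES dna(dna_id),
    phospho_sites INTEGER  -- a field that only makes sense for proteins
);
INSERT INTO dna VALUES ('DNA_0001', 'geneA', 1200);
INSERT INTO dna VALUES ('DNA_0002', 'geneB', 900);
INSERT INTO protein VALUES ('PROT_0001', 'DNA_0001', 3);
""")

# Follow the key field from the DNA table to the protein table.
rows = conn.execute("""
    SELECT dna.gene, dna.dna_id, protein.protein_id, protein.phospho_sites
    FROM dna
    JOIN protein ON protein.dna_id = dna.dna_id
""").fetchall()
print(rows)
conn.close()
```

Only geneA appears in the result, because geneB has no linked protein record in this toy data.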

Many Tables at NCBI • The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes: • Journal Articles (PubMed) • Genetic Diseases (OMIM) • Polymorphisms (dbSNP) • Cytogenetics (CGH/SKY/FISH & CGAP) • Gene Expression (GEO) • Taxonomy • Chemistry (PubChem)

Database Design • A database can only be searched in ways that it was designed to be searched. • You can search within a specific Field in a specific Table, and can sometimes combine searches from different Fields and/or Tables (Boolean "AND" and "OR" searches). • Bad: searching for "human hemoglobin" in a 'Description' field. • Much better: searching for "Homo sapiens" in 'Organism' AND "HBB" in 'Gene Name'.

Web Query • Most Scientific databases have a web-based query tool • It may be simple…

… or complex

ENTREZ is the GenBank web query tool

Advanced query interface:

Web API • In addition, many public databases have a specific query language that can be used by any software to create automated queries. • This is usually known as an Application Programming Interface (API). • If the interface communicates over the http protocol (used by web browsers), then it is a Web API (the simplest to work with as a novice programmer)
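To make the idea concrete, the sketch below builds (but does not send) a query URL for the NCBI E-utilities esearch service, combining two field-specific terms with AND. The endpoint and the db, term and retmax parameters are written from memory of the E-utilities documentation, and the field names "Organism" and "Gene Name" are assumed to be valid for the nucleotide database, so check the current NCBI documentation before relying on them. The function name build_esearch_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, field_terms, retmax=20):
    """Build a search URL from (field, value) pairs joined with Boolean AND."""
    # Entrez-style fielded terms look like: value[Field]
    term = " AND ".join(f"{value}[{field}]" for field, value in field_terms)
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


url = build_esearch_url(
    "nucleotide",
    [("Organism", "Homo sapiens"), ("Gene Name", "HBB")],
)
print(url)
```

A program could then fetch this URL (for example with urllib.request.urlopen) and parse the response; public services usually publish usage limits, so automated queries should be sent sparingly.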

ENTREZ has pre-computed links between Tables • Relationships between sequences are computed with BLAST • Relationships between articles are computed with MeSH terms (shared keywords) • Relationships between DNA and protein sequences rely on accession numbers • Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.

NCBI Databases contain more than just DNA & protein sequences

Other Important Databases • Genomes • Proteins • Biochemical & Regulatory Pathways • Gene Expression • Genetic Variation (mutants, SNPs) • Protein-Protein Interactions • Gene Ontology (Biological Function)

UCSC Genome Browser Search by gene name: or by sequence:

BED format • Genome Browsers use a BED format that defines a genomic interval as positions on a reference genome. • An interval can be anything with a location: gene, exon, binding site, region of low complexity, etc. • BED files can also specify color, width, and some other formatting.
chromosome start end
chr1 213941196 213942363
chr1 213942363 213943530
chr1 213943530 213944697
chr2 158364697 158365864
chr2 158365864 158367031
chr3 127477031 127478198
chr3 127478198 127479365
chr3 127479365 127480532
track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0

http://genome.ucsc.edu/FAQ/FAQformat.html#format1
The first three required BED fields are:
1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
The 9 additional optional BED fields are:
4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
In BED files with block definitions, the first blockStart value must be 0, so that the first block begins at chromStart. Similarly, the final blockStart position plus the final blockSize value must equal chromEnd. Blocks may not overlap.
Example: Here's an example of an annotation track that uses a complete BED definition:
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
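The field rules above can be checked in a few lines of code. The sketch below reads one BED line, computes the feature length from the 0-based, end-excluded coordinates, and, for 12-column lines, verifies the block rules (matching counts, first block at chromStart, last block ending at chromEnd, no overlaps). The function name parse_bed and the returned dictionary layout are choices made for this illustration only.

```python
def parse_bed(line):
    """Parse one BED line; return chrom, start, end, length and absolute blocks."""
    cols = line.split()
    if len(cols) < 3:
        raise ValueError("a BED line needs chrom, chromStart and chromEnd")
    chrom, start, end = cols[0], int(cols[1]), int(cols[2])
    # chromEnd is not included, so the length is simply end - start.
    feature = {"chrom": chrom, "start": start, "end": end,
               "length": end - start, "blocks": []}
    if len(cols) >= 12:
        count = int(cols[9])
        # Lists may carry a trailing comma, e.g. '567,488,'.
        sizes = [int(x) for x in cols[10].split(",") if x]
        starts = [int(x) for x in cols[11].split(",") if x]
        if count < 1 or not (len(sizes) == len(starts) == count):
            raise ValueError("blockCount does not match blockSizes/blockStarts")
        # blockStarts are relative to chromStart; convert to genome positions.
        blocks = [(start + s, start + s + size) for s, size in zip(starts, sizes)]
        if starts[0] != 0 or blocks[-1][1] != end:
            raise ValueError("blocks must begin at chromStart and finish at chromEnd")
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
            if next_start < prev_end:
                raise ValueError("blocks may not overlap")
        feature["blocks"] = blocks
    return feature


clone_a = parse_bed("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(clone_a["length"])
print(clone_a["blocks"])
```

For cloneA the last block starts at 1000 + 3512 = 4512 and is 488 bases long, so it ends at 5000, which is chromEnd as the rule requires.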

Lots of additional data can be added as optional "tracks" - anything that can be mapped to locations on the genome

Ensembl at EBI/EMBL
