Fundamentals in Sequence Analysis 1.(part 1). Review of Basic biology + database searching in Biology. Hugues Sicotte NCBI. The Flow of Biotechnology Information. Gene. Function. > DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
Review of Basic biology + database searching in Biology.
> DNA sequence
> Protein sequence
1) The concept of a gene was historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix whose backbone is made of deoxyribose sugars and phosphates; each sugar carries one of four bases:
Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
[Note that only 4 values need to be encoded (A, C, G, T), which can be done using 2 bits. But to allow ambiguous letter codes (e.g. N means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
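The 4-bit idea in the note above can be sketched as follows. This is only one possible encoding, assumed here for illustration: one bit per base, with an ambiguity code stored as the OR of the bases it may stand for. The names `BASE_BITS`, `encode` and `compatible` are made up for this sketch, and only three ambiguity codes are listed.

```python
# One bit per base (an assumed bit assignment, chosen for illustration).
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# A few ambiguity codes (not the full table).
AMBIGUITY = {
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any of the 4 nucleotides
}

def encode(symbol: str) -> int:
    """Return the 4-bit mask for a base or for one of the ambiguity codes above."""
    mask = 0
    for base in AMBIGUITY.get(symbol, symbol):
        mask |= BASE_BITS[base]
    return mask

def compatible(x: str, y: str) -> bool:
    """True if the two symbols could stand for the same base."""
    return (encode(x) & encode(y)) != 0

print(encode("N"))           # 15: all four bits set
print(compatible("R", "A"))  # True
print(compatible("R", "C"))  # False
```

With this layout, testing whether two symbols can match is a single bitwise AND, which is one reason a 4-bit alphabet is convenient.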
3) Each base on one side of the double helix faces its complementary base on the other side:
A pairs with T, and G pairs with C.
4) Biochemical processes that copy DNA (replication and transcription) always build the new strand from its 5' end towards its 3' end, so sequences are conventionally read 5' to 3'.
5) A gene can be located on either the 'plus' strand or the 'minus' strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base.
E.g., if the sequence on the + strand is ACGTGATCGATGCTA, the - strand must be read off by reading the complement of this sequence going 'backwards': TAGCATCGATCACGT.
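Rules 3 to 5 amount to taking the reverse complement. A minimal sketch, assuming an uppercase sequence containing only A, C, G and T (the function name is chosen for this example):

```python
# Map each base to its complement (rule 3).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

plus = "ACGTGATCGATGCTA"
print(reverse_complement(plus))  # TAGCATCGATCACGT
```

Applying the function twice gives back the original sequence, which is a handy sanity check.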
6) DNA information is copied over to mRNA that acts as a template to produce proteins.
We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules (but let's not forget the various RNA genes).
Prokaryotes (intronless protein coding genes)
Transcription (the gene is encoded on the minus strand, and its reverse complement is read into mRNA)
CoDing Sequence (CDS)
Translation: tRNAs read off the codons, 3 bases at a time, starting at the start codon and continuing until a STOP codon is reached.
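Translation as described above can be sketched in a few lines. The sketch assumes the standard genetic code, ATG as the only start codon, and an uppercase DNA-alphabet sequence (T instead of U). The function name and the test sequence are made up for illustration.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_cds(seq: str) -> str:
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # STOP codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_cds("CCATGAAATGGTAGCC"))  # MKW
```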
Prokaryotes (operon structure)
In prokaryotes, genes that are part of the same operational pathway are sometimes grouped together under a single promoter. They are then transcribed together into a single polycistronic mRNA, from which each gene (3 in the operon illustrated) is translated separately.
Note the degeneracy of the genetic code. Each amino acid might have up to six codons that specify it.
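The degeneracy can be checked by counting codons per amino acid. A small sketch, assuming the standard genetic code written as a 64-letter string in T, C, A, G codon order ('*' marks a stop codon):

```python
from collections import Counter

# Standard genetic code, one letter per codon, codons enumerated in T, C, A, G order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons_per_amino_acid = Counter(AMINO)
most = max(n for aa, n in codons_per_amino_acid.items() if aa != "*")
six_fold = sorted(aa for aa, n in codons_per_amino_acid.items() if n == most and aa != "*")

print(most)      # 6
print(six_fold)  # ['L', 'R', 'S']
print(codons_per_amino_acid["M"], codons_per_amino_acid["W"])  # 1 1
```

Leucine, arginine and serine are the six-codon cases; methionine and tryptophan each have a single codon.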
tRNA (transfer RNA) is a small RNA that has a very specific secondary and tertiary structure such that it can bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA.
Three-dimensional Tertiary structure
Secondary structure of tRNA
Most of the consensus sequences are known from E. coli studies, so the exact consensus will differ for each bacterium.
Most modern gene prediction programs need to be “trained”, i.e. they derive their own consensus and assembly rules from a few example genes.
A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates.
E.g. selfid program([email protected])
On a given piece of DNA, there are 6 possible reading frames. An ORF can be on either the + or the - strand, and in any of 3 possible frames on that strand.
Frame 1: 1st base of the start codon is at base 1, 4, 7, 10, ...
Frame 2: 1st base of the start codon is at base 2, 5, 8, 11, ...
Frame 3: 1st base of the start codon is at base 3, 6, 9, 12, ...
(Frames -1, -2, -3 are on the minus strand.)
Some programs have other conventions for naming frames (0..5, 1-6, etc.).
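The six-frame search can be sketched as below. This is a toy ORF scanner, not the method of any particular program: it assumes ATG as the only start codon, the three standard stop codons, and an uppercase ACGT sequence. Frames are named 1, 2, 3 and -1, -2, -3 as above. Coordinates are 0-based with an exclusive end, and for minus-strand frames they are given on the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_strand(seq):
    """Yield (frame, start, end) for each ATG...stop ORF in the 3 frames of seq."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                yield frame + 1, start, i + 3
                start = None

def six_frame_orfs(seq):
    """Scan both strands; minus-strand frames are reported as negative numbers."""
    orfs = list(orfs_in_strand(seq))
    minus = seq.translate(COMPLEMENT)[::-1]
    orfs += [(-frame, start, end) for frame, start, end in orfs_in_strand(minus)]
    return orfs

print(six_frame_orfs("CATGAAATAGG"))  # [(2, 1, 10)]
print(six_frame_orfs("CCTATTTCATG"))  # [(-2, 1, 10)]
```

The second call is the reverse complement of the first sequence, so the same ORF shows up in frame -2.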
Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.
Try with gi=41 (or your own piece of DNA).
In eukaryotes (cells where the DNA is sequestered in a separate nucleus), the DNA does not contain a contiguous copy of the coding sequence; rather, exons must be spliced together. (Many eukaryotic genes contain no introns! This is particularly true in 'lower' organisms.)
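Splicing can be pictured as joining exon intervals and dropping what lies between them. A toy sketch: the genomic string and the exon coordinates are invented for illustration, with the intron written in lowercase.

```python
def splice(genomic: str, exons) -> str:
    """Join exon intervals (0-based, end-exclusive) into the spliced sequence."""
    return "".join(genomic[start:end] for start, end in exons)

# Made-up example: two exons separated by one intron (lowercase).
genomic = "ATGGCCgtaagtttcagGATTAA"
exons = [(0, 6), (17, 23)]

print(splice(genomic, exons))  # ATGGCCGATTAA
```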
mRNA – (messenger RNA) Contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm where the ribosomes can make it into protein.
~6-12% of human DNA encodes proteins (higher fraction in nematode)
~10% of human DNA codes for UTR
~90% of human DNA is non-coding.
Pseudogene: DNA sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of a promoter, for example), in translation, or both.
Processed pseudogene: gene retroposed back into the genome after being processed by the splicing apparatus. Thus it is fully spliced and has a polyA tail.
The insertion process flanks the mRNA sequence with short direct repeats.
Thus it has no promoter, unless it is accidentally retroposed downstream of a promoter sequence.
Do not confuse with single-exon genes.
Because the mRNA is copied off the minus (template) strand of the DNA, its sequence is that of the plus strand; nucleotide sequences are always quoted in the orientation of the mRNA.
In bioinformatics, the sequence format does NOT distinguish between uracil and thymine. There is no symbol for uracil; it is always represented by a 'T'.
Even genomic sequence follows that convention. A gene on the 'plus' strand is quoted so that it is on the same strand as its product mRNA.
Flat-file version of a database.
GI: changes with each update of the sequence record.
Accession number: secondary key; points to the same locus and sequence despite sequence updates.
Accession + version number is equivalent to a GI.
Relational database: normalizing a database for repeated sub-elements means splitting it into smaller databases and relating the sub-databases to the first one using the primary key.
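A minimal sketch of that normalization, using Python's built-in sqlite3 module. The two-table schema (`record`, `feature`) and the feature rows are hypothetical, invented for this example; only the GI, accession and description come from the example record used in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    gi          INTEGER PRIMARY KEY,   -- primary key of the main table
    accession   TEXT,
    description TEXT
);
CREATE TABLE feature (                 -- repeated sub-elements live in their own table
    gi        INTEGER REFERENCES record(gi),
    kind      TEXT,
    begin_pos INTEGER,
    end_pos   INTEGER
);
""")

conn.execute(
    "INSERT INTO record VALUES (?, ?, ?)",
    (41, "X63129.1", "B.taurus mRNA for alpha-1-anti-trypsin"),
)
# Feature coordinates below are made up for illustration.
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?, ?)",
    [(41, "CDS", 10, 200), (41, "polyA_signal", 300, 305)],
)

# Relate the sub-table back to the main table through the primary key.
rows = conn.execute(
    "SELECT r.accession, f.kind FROM record r "
    "JOIN feature f ON f.gi = r.gi ORDER BY f.begin_pos"
).fetchall()
print(rows)  # [('X63129.1', 'CDS'), ('X63129.1', 'polyA_signal')]
```

Each record is stored once, however many features it has, and the join reassembles the full view on demand.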
But Google only indexes about 40% of the web, so you may have to use other web spiders.
(Disclaimer: I don't own stock in that company, but I'd like to.)
EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez EMBL records are patched up to add that information.
FASTA: the MOST important data format!!!
>identifier descriptive text
nucleotide or amino-acid
sequence on multiple lines if needed.
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
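Reading this format takes only a few lines. The sketch below is a minimal parser, not a replacement for a full library. The function names are chosen for this example, and the sequence lines in `example` are placeholders, not the real X63129 sequence.

```python
def _finish(header, chunks):
    """Split the header at the first space and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records

# Header taken from the example above; the sequence lines are placeholders.
example = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
ACGTACGTAC
GTACGT
"""
identifier, description, sequence = parse_fasta(example)[0]
fields = identifier.split("|")
print(fields)    # ['gi', '41', 'emb', 'X63129.1', 'BTA1AT']
print(sequence)  # ACGTACGTACGTACGT

# Accession and version are separated by the last dot.
accession, _, version = fields[3].rpartition(".")
print(accession, version)  # X63129 1
```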
Restriction enzymes have a pattern recognition sequence, and the actual cutting site lies within that pattern or a few bases away from it.
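A simple digest can be sketched as a pattern search plus an offset to the cut. EcoRI (recognition sequence GAATTC, cut after the first G) serves as the example. The sketch looks at one strand only, does not handle ambiguity codes in the recognition sequence, and the function names and test sequence are made up.

```python
def cut_positions(seq, site, offset):
    """Return 0-based cut positions; offset = bases of the site that lie before the cut."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i + offset)
        i = seq.find(site, i + 1)
    return positions

def digest(seq, site, offset):
    """Cut seq at every occurrence of site and return the fragments in order."""
    bounds = [0] + cut_positions(seq, site, offset) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

# EcoRI: G^AATTC, i.e. the cut falls 1 base into the recognition sequence.
print(cut_positions("TTGAATTCAAGGGAATTCTT", "GAATTC", 1))  # [3, 13]
print(digest("TTGAATTCAAGGGAATTCTT", "GAATTC", 1))
# ['TTG', 'AATTCAAGGG', 'AATTCTT']
```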
I prefer the bairoch format (SWISSPROT format)
ID enzyme name
ET enzyme type
OS microorganism name
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RL jour, vol, pages, year, etc.
REBASE (Restriction enzymes dataBASE)
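Two-letter line codes like these are easy to parse. The sketch below assumes each line starts with its code followed by spaces and the value, and that every record begins with an ID line. The sample record was typed by hand for illustration (its ET value in particular is a guess) and is not copied from a REBASE file.

```python
def parse_bairoch(text):
    """Parse 'XX   value' lines into a list of {code: [values]} records."""
    records = []
    current = None
    for line in text.splitlines():
        code = line[:2]
        # Skip anything that is not a two-letter uppercase code followed by a space.
        if len(code) < 2 or not (code.isalpha() and code.isupper()):
            continue
        if len(line) > 2 and line[2] != " ":
            continue
        if code == "ID":          # an ID line opens a new record
            current = {}
            records.append(current)
        if current is not None:
            current.setdefault(code, []).append(line[2:].strip())
    return records

# Hand-typed illustrative record, not real REBASE output.
sample = """ID   EcoRI
ET   R2
OS   Escherichia coli RY13
RS   GAATTC, 1;
"""
record = parse_bairoch(sample)[0]
print(record["ID"])  # ['EcoRI']
print(record["RS"])  # ['GAATTC, 1;']
```

Values are kept in lists because a code such as RL may appear on several lines of the same record.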