Data: Sequences and Other Stuff



Nucleic Acid and Protein Sequences

  • Sources of Genetic Sequences

    • User

    • GCG supplied databases

      • Flat File

      • Oracle Relational Database

    • NCBI supplied databases

    • Other databases


Sequence Databases

  • Genbank

    • EMBL

    • DDBJ

  • NCBI

  • PIR

  • Swiss-Prot

  • Swiss-Prot TrEMBL


Genbank

  • Primary nucleic acid sequence database

  • Maintained by NCBI

    • National Center for Biotechnology Information

    • http://www.ncbi.nlm.nih.gov

  • Current Release 122, 2/2001

  • 11,720,120,326 bases

  • 10,896,781 sequences


How Many Organisms Are in the Sequence Databases? (April 1, 2001)

Species       1995    1996    1997    1998    1999    2000    2001   Increase (since 1995)   Increase (12 months)
all:         16109   23119   32880   43516   61952   87751   95168   490%                    40.9%
Viruses:      1845    2122    2678    2968    3573    4428    4857   163%                    32.4%
Bacteria:     2939    3847    6091    8711   14322   22758   24878   746%                    53.3%
Archaea:       162     235     385     555    1015    1709    1906   1076%                   68.8%
Eukaryota:   10366   15901   22596   29926   41420   56961   61571   493%                    37.4%
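The "Increase (since 1995)" column can be reproduced from the 1995 and 2001 counts. The sketch below is a minimal Python illustration, assuming the slide truncates the percentage rather than rounding it; the names `COUNTS` and `percent_increase` are illustrative and do not come from the presentation.

```python
# Species counts copied from the table: (1995 count, 2001 count).
COUNTS = {
    "all": (16109, 95168),
    "Viruses": (1845, 4857),
    "Bacteria": (2939, 24878),
    "Archaea": (162, 1906),
    "Eukaryota": (10366, 61571),
}


def percent_increase(old: int, new: int) -> int:
    """Return the percentage increase from old to new as a whole number.

    The result is truncated, not rounded. Truncation is an assumption
    that happens to reproduce the figures shown on the slide.
    """
    return (new - old) * 100 // old


if __name__ == "__main__":
    for group, (count_1995, count_2001) in COUNTS.items():
        print(f"{group}: {percent_increase(count_1995, count_2001)}%")
```

Integer arithmetic is used so the result does not depend on floating-point rounding.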


Other NCBI Databases

  • HTGS

  • EST

  • STS

  • GSS

  • RefSeq

  • Unigene

  • Genomic


HTGS

High Throughput Genomic Sequences

  • 'Unfinished' DNA sequences generated by the high-throughput sequencing centers

  • Phase 0

    • Single-few pass reads of a single clone (not contigs)

  • Phase 1

    • Unfinished, may be unordered, unoriented contigs, with gaps

  • Phase 2

    • Unfinished, ordered, oriented contigs, with or without gaps

  • Phase 3

    • Primary division (Genbank)

    • Finished, no gaps (with or without annotations)


EST

  • Expressed Sequence Tags

    • “Single-pass” cDNA sequences

    • Generally representative of the 3’ ends of cDNAs

    • More “full-length” ESTs now available


STS

  • Sequence Tagged Sites

    • Sequence and mapping data

    • Short genomic landmark sequences


GSS

  • Genome Survey Sequences

  • Similar to the EST division, except that its sequences are genomic in origin, rather than cDNA

    • Random “single pass read” genome survey sequences.

    • Cosmid/BAC/YAC end sequences

    • Exon trapped genomic sequences

    • Alu PCR sequences


RefSeq

  • NCBI Reference Sequence project

  • Provides reference sequence standards for the naturally occurring molecules from chromosomes to mRNAs to proteins

  • Stable reference point for:

    • mutation analysis

    • gene expression studies

    • polymorphism discovery


RefSeq…

  • Curated RefSeq

    • transcripts and proteins

  • Genome Annotation

    • contigs, transcripts, and proteins

  • Complete Genomes

    • genomes, chromosomes, and proteins


Unigene

  • Experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters

    • Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

  • Includes EST and cDNA sequences

  • Includes human, rat, mouse, cow and zebrafish


HomoloGene

  • Curated and calculated orthologs and homologs for genes represented in UniGene and LocusLink

  • Includes human, mouse, rat, zebrafish, cow and drosophila


LocusLink

  • Provides a single query interface to curated sequence and descriptive information about genetic loci

    • Nomenclature

    • Aliases

    • Sequence accessions

    • Phenotypes

    • EC numbers

    • MIM numbers

    • UniGene clusters

    • Homology

    • Map locations

    • Web sites


EMBL and DDBJ

  • European Molecular Biology Laboratory

    • Hinxton, UK

    • http://www.ebi.ac.uk/

  • DNA Data Bank of Japan

    • Mishima, Japan

    • http://www.ddbj.nig.ac.jp/


Coordination with Genbank

  • Prevents duplication

  • Genbank enters sequences from U.S. journals and researchers

  • EMBL handles European data

  • DDBJ handles Asian data

  • Data exchanged daily


Sequence submissions

  • Sequences entered from journals

  • Sequences submitted by individual researchers

    • BankIt

      • NCBI WWW Site

    • Sequin

      • Multi-platform program


Sequence Names

  • DO NOT rely on names to find particular sequences

  • Few conventions

  • Organism

    • Hum: Human

    • Mus: mouse

    • Eco: E. coli

    • Syn: synthetic


Last Letter(s)

  • Sometimes gives useful information

    • cg: Complete genome

    • Viruses


Other Letters

  • Specifies a particular sequence

  • vsvcg

    • Vesicular stomatitis virus (Indiana serotype) complete genome


EMBL File Names

  • Ec: E. coli

  • Hs: Human


Locus name

  • Names are short, fairly non-descriptive, and can change from one release to another

    • vsvcg

      • The complete sequence for the virus VSV

  • Most “mnemonic” names already taken

  • Genbank now using accession numbers as locus names


Accession Numbers

  • Each sequence submitted to a database is assigned a unique primary accession number

  • Accession numbers do not change

  • If a sequence is merged with another, a new accession number is assigned, and the original number becomes a secondary accession number

  • Accession numbers may include version numbers

    • A02428.2


Accession Numbers

  • Using GCG to access sequences via their accession number

  • Data Library:Accession Number

    • Flatfile - vi:J02428

    • RDB - gcgnuc:J02428
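The accession conventions above (an optional ".version" suffix, and the "Data Library:Accession Number" form) can be sketched in Python. This is only an illustration: the helper names are hypothetical, the version suffix on the example is made up, and dropping the version before building the specification is an assumption rather than documented GCG behaviour.

```python
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    number: str             # primary accession, e.g. "M28668"
    version: Optional[int]  # None when no ".N" suffix is present


def parse_accession(text: str) -> Accession:
    """Split 'ACCESSION.VERSION' into its two parts."""
    number, dot, version = text.strip().partition(".")
    return Accession(number.upper(), int(version) if dot else None)


def gcg_spec(library: str, accession: str) -> str:
    """Build a 'library:accession' specification.

    Any version suffix is dropped (an assumption made for this sketch).
    """
    return f"{library}:{parse_accession(accession).number}"


if __name__ == "__main__":
    print(parse_accession("M28668.1"))        # illustrative version suffix
    print(gcg_spec("gcgnuc", "M28668.1"))
```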


The Sequence Record

  • Different for each database

  • Locus (Name)

  • Accession Number

  • Keywords

  • Description

  • Properties

  • References

  • The Sequence
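As a rough illustration of how these fields appear in a GenBank-style record, the Python sketch below collects the top-level keyword lines (LOCUS, DEFINITION, ACCESSION, and so on) into a dictionary. It assumes the usual flat-file layout, where a keyword starts in the first column and continuation lines are indented; indented sub-keywords such as ORGANISM or AUTHORS are simply folded into the preceding field. The function name `parse_header` is hypothetical.

```python
SAMPLE = """\
LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
"""


def parse_header(record: str) -> dict:
    """Map each top-level keyword to its (joined) value."""
    fields = {}
    current = None
    for line in record.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: continuation of the previous keyword.
            if current is not None:
                fields[current] += " " + line.strip()
        else:
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
    return fields


if __name__ == "__main__":
    for key, value in parse_header(SAMPLE).items():
        print(f"{key}: {value}")
```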


analyze% typedata ge:humcftrm
!!NA_SEQUENCE 1.0
LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
NID         g180331
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
SOURCE      Human, cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 6129)
  AUTHORS   Riordan,J.R., Rommens,J.M., Kerem,B., Alon,N., Rozmahel,R.,
            Grzelczak,Z., Zielenski,J., Lok,S., Plavsic,N., Chou,J.-L.,
            Drumm,M.L., Iannuzzi,M.C., Collins,F.S. and Tsui,L.-C.
  TITLE     Identification of the cystic fibrosis gene: Cloning and
            characterization of complementary DNA
  JOURNAL   Science 245, 1066-1073 (1989)
  MEDLINE   89368940
COMMENT     A three base-pair deletion spanning positions 1654-1656 is observed
            in cDNAs from cystic fibrosis patients.
FEATURES             Location/Qualifiers
     source          1..6129
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             133..4575
                     /note="cystic fibrosis transmembrane conductance
                     regulator"
                     /codon_start=1
                     /db_xref="PID:g180332"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     LNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLR
                     AYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTAN
                     WFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWA
                     VNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKKDDIW
                     PSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLN
                     TEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVAD
                     EVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDP
                     VTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL
                     FRQAISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL"
BASE COUNT     1886 a   1181 c   1330 g   1732 t
ORIGIN

HUMCFTRM  Length: 6129  April 13, 1998 13:00  Type: N  Check: 6781  ..

       1  AATTGGAAGC AAATGACATC ACAGCAGGTC AGAGAAAAAG GGTTGAGCGG
      51  CAGGCACCCA GAGTAGTAGG TCTTTGGCAT TAGGAGCTTG AGCCCAGACG
     101  GCCCTAGCAG GGACCCCAGC GCCCGAGAGA CCATGCAGAG GTCGCCTCTG
     151  GAAAAGGCCA GCGTTGTCTC CAAACTTTTT TTCAGCTGGA CCAGACCAAT
     201  TTTGAGGAAA GGATACAGAC AGCGCCTGGA ATTGTCAGAC ATATACCAAA
     251  TCCCTTCTGT TGATTCTGCT GACAATCTAT CTGAAAAATT GGAAAGAGAA
     301  TGGGATAGAG AGCTGGCTTC AAAGAAAAAT CCTAAACTCA TTAATGCCCT
     351  TCGGCGATGT TTTTTCTGGA GATTTATGTT CTATGGAATC TTTTTATATT
     401  TAGGGGAAGT CACCAAAGCA GTACAGCCTC TCTTACTGGG AAGAATCATA
     451  GCTTCCTATG ACCCGGATAA CAAGGAGGAA CGCTCTATCG CGATTTATCT


analyze% typedata -ref GB_PR:HUMIFNRF1A
!!NA_SEQUENCE 1.0
LOCUS       HUMIFNRF1A   7721 bp    DNA             PRI       10-NOV-1992
DEFINITION  Homo sapiens interferon regulatory factor 1 gene, complete cds.
ACCESSION   L05072
NID         g184648
KEYWORDS    interferon regulatory factor 1.
SOURCE      Homo sapiens Placenta DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 7721)
  AUTHORS   Cha,Y., Sims,S.H., Romine,M.F., Kaufmann,M. and Deisseroth,A.B.
  TITLE     Human interferon regulatory factor 1: intron/exon organization
  JOURNAL   DNA Cell Biol. 11, 605-611 (1992)
  MEDLINE   93000481
FEATURES             Location/Qualifiers
     source          1..7721
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /tissue_type="Placenta"
                     /map="5q23-q31"
     exon            1..219
                     /gene="IRF1"
                     /note="putative"
                     /number=1
     5'UTR           join(1..219,1279..1287)
                     /gene="IRF1"
     gene            join(1..219,1279..1287)
                     /gene="IRF1"
     intron          220..1278
                     /gene="IRF1"
                     /number=1
     exon            1279..1374
                     /gene="IRF1"
                     /number=2
     CDS             join(1288..1374,2738..2837,3630..3806,3916..3965,
                     4073..4202,4386..4508,5040..5089,6248..6383,6670..6794)
                     /gene="IRF1"
                     /codon_start=1
                     /product="interferon regulatory factor 1"
                     /db_xref="PID:g184649"
                     /translation="MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKH
                     GWDINKDACLFRSWAIHTGRYKAGEKEPDPKTWKANFRCAMNSLPDIEEVKDQSRNKG
                     SSAVRVYRMLPPLTKNQRKERKSKSSRDAKSKAKRKSCGDSSPDTFSDGLSSSTLPDD
                     HSSYTVPGYMQDLEVEQALTPALSPCAVSSTLPDWHIPVEVVPDSTSDLYNFQVSPMP
                     STSEATTDEDEEGKLPEDIMKLLEQSEWQPTNVDGKGYLLNEPGVQPTSVYGDFSCKE
                     EPEIDSPGGDIGLSLQRVFTDLKNMDATWLDSLLTPVRLPSIQAIPCAP"
     intron          1375..2737
                     /gene="IRF1"
                     /number=2
     exon            2738..2837
                     /gene="IRF1"
                     /number=3
     intron          2838..3629
                     /gene="IRF1"
                     /number=3
     exon            3630..3806
                     /gene="IRF1"
                     /number=4
     intron          3807..3915
                     /gene="IRF1"
                     /number=4
     exon            3916..3965
                     /gene="IRF1"
                     /number=5
     intron          3966..4072
                     /gene="IRF1"
                     /number=5
     ...
     exon            5040..5089
                     /gene="IRF1"
                     /number=8
     intron          5090..6247
                     /gene="IRF1"
                     /number=8
     exon            6248..6383
                     /gene="IRF1"
                     /number=9
     intron          6384..6669
                     /gene="IRF1"
                     /number=9
     exon            6670..7656
                     /gene="IRF1"
                     /number=10
     3'UTR           6795..7656
BASE COUNT     1750 a   1946 c   2253 g   1772 t
ORIGIN
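Feature locations such as the IRF1 CDS `join(...)` above lend themselves to a quick sanity check: the spans of a complete coding sequence should add up to a multiple of three. The Python sketch below sums the `start..end` spans in a location string. It is a simplification that ignores single-base locations and strand, and the function name `location_length` is hypothetical.

```python
import re

# CDS location of the IRF1 record shown above.
IRF1_CDS = (
    "join(1288..1374,2738..2837,3630..3806,3916..3965,"
    "4073..4202,4386..4508,5040..5089,6248..6383,6670..6794)"
)


def location_length(location: str) -> int:
    """Sum the lengths of all start..end spans in a feature location.

    Partial-end markers (< and >) are tolerated and ignored.
    """
    total = 0
    for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location):
        total += int(end) - int(start) + 1
    return total


if __name__ == "__main__":
    bases = location_length(IRF1_CDS)
    print(bases, "bases,", bases // 3, "codons")
```

For the IRF1 CDS this gives 978 bases, or 326 codons, which is consistent with the 325-residue translation in the record plus a stop codon.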


analyze% typedata -ref est:hum091226f
!!NA_SEQUENCE 1.0
LOCUS       HUM091226F    152 bp    mRNA            EST       02-APR-1996
DEFINITION  Homo sapiens retinal fovea EST HFV091226 sequence.
ACCESSION   L48850
NID         g1254959
KEYWORDS    EST; expressed sequence tag.
SOURCE      Homo sapiens (clone: EST HFV091226) age normalized retinal foveae
            cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Adams,M.D., Kerlavage,A.R., Fields,C. and Venter,J.C.
  TITLE     3,400 new expressed sequence tags identify diversity of transcripts
            in human brain
  JOURNAL   Nature Genet. 4 (3), 256-267 (1993)
  MEDLINE   93364420
REFERENCE   2  (sites)
  AUTHORS   Liew,C.C., Hwang,D.M., Fung,Y.W., Laurenssen,C., Cukerman,E.,
            Tsui,S. and Lee,C.Y.
  TITLE     A catalogue of genes in the cardiovascular system as identified by
            expressed sequence tags
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 91 (22), 10645-10649 (1994)
  MEDLINE   95024171
REFERENCE   3  (bases 1 to 152)
  AUTHORS   Bernstein,S.L., Borst,D.E., Neuder,M.E. and Wong,P.
  TITLE     Characterization of a human fovea cDNA library and regional
            differential gene expression in the human retina
  JOURNAL   Genomics 32 (3), 301-308 (1996)
FEATURES             Location/Qualifiers
     source          1..152
                     /organism="Homo sapiens"
                     /note="Expressed sequence tags (first pass sequencing)
                     from randomly selected bacteriophage clones (mRNA-cDNA)
                     from human retinal fovea. The library is age normalized
                     from ten sets of donor foveae 2-79 years old."
                     /db_xref="taxon:9606"
                     /clone="EST HFV091226"
                     /dev_stage="age normalized"
                     /tissue_type="retinal foveae"
     mRNA            <1..>152
                     /standard_name="EST HFV091226"
BASE COUNT       31 a     42 c     41 g     36 t      2 others
ORIGIN


analyze% typedata -ref sts:humswx153
!!NA_SEQUENCE 1.0
LOCUS       HUMSWX153     192 bp    DNA             STS       24-MAY-1993
DEFINITION  Human chromosome X STS sWXD153; single read.
ACCESSION   L15212
NID         g292645
KEYWORDS    STS; primer; sequence tagged site.
SOURCE      Homo sapiens DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 192)
  AUTHORS   Kere,J., Nagaraja,R., Mumm,S.R., Ciccodicola,A., D'Urso,M. and
            Schlessinger,D.
  TITLE     Mapping human chromosomes by walking with sequence-tagged sites
            from end fragments of yeast artificial chromosome inserts
  JOURNAL   Genomics 14, 241-248 (1992)
  MEDLINE   93052321
COMMENT     Submitted by: David Schlessinger,
            Center for Genetics in Medicine,
            Washington University School of Medicine, Box 8232 4566 Scott
            Avenue, St. Louis, MO 63110, USA
            e-mail: davids@wugenmail.wustl.edu
            Primer A: TAAAGGGATCGCCAAGGAC
            Primer B: CTTACTCATTTGCTGGATTCTC
            STS size: 85bp
            Template: 600 ng/100ul
            Primer: 40 pmoles/100ul
            dNTPs: 100 uM
            MgCl2: 1.5 mM
            KCl: 100 mM
            TrisHCl: 10 mM
            Taq Polymerase: 0.125 U
            NH4Cl: 5 mM
            pH: 8.6
            Total Vol: 5 ul
            PCR Profile:
            Denaturation: 94 degrees C for 1.00 minute(s)
            Annealing: 55 degrees C for 2.00 minute(s)
            Polymerization: 72 degrees C for 2.00 minute(s)
            PCR Cycles: 35
            Thermal Cycler: P-E.
FEATURES             Location/Qualifiers
     source          1..192
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /map="Xq13-q24"
     STS             60..144
                     /standard_name="sWXD153"
     primer_bind     60..78
     primer_bind     complement(123..144)
BASE COUNT       72 a     26 c     60 g     29 t      5 others
ORIGIN
analyze%


Swiss-Prot

http://www.expasy.ch/sprot/

  • Protein Database

  • University of Geneva

  • Arranged by protein function

  • Release 39.15

  • March 19, 2001

  • 94,152 entries

  • Provides annotated protein records


Swiss-Prot Names

  • Protein_Species

  • Allows easier comparisons when studying evolutionary relationships

  • H1b_Human

    • Human histone 1b


Swiss-Prot Names

  • Vgl*_*

    • Viral glycoproteins

  • VGLG_HRSVL

    • Viral GLycoprotein G

    • Human Respiratory Syncytial Virus Long strain
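Because every entry name follows the Protein_Species pattern, the two halves can be separated mechanically, which is what makes cross-species comparison convenient. A minimal Python sketch, with hypothetical helper names and entry names taken from elsewhere in this presentation:

```python
from collections import defaultdict


def split_entry_name(name: str) -> tuple[str, str]:
    """Split a Swiss-Prot style entry name into (protein, species) codes."""
    protein, _, species = name.upper().partition("_")
    return protein, species


def group_by_protein(names: list[str]) -> dict[str, list[str]]:
    """Collect the species codes seen for each protein code."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        protein, species = split_entry_name(name)
        groups[protein].append(species)
    return dict(groups)


if __name__ == "__main__":
    print(split_entry_name("VGLG_HRSVL"))
    print(group_by_protein(["Hs70_Human", "Hs70_Mouse", "H1b_Human"]))
```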


analyze% typedata swp:H1b_Human

!!AA_SEQUENCE 1.0

ID H1B_HUMAN STANDARD; PRT; 218 AA.

AC P10412;

DT 01-MAR-1989 (REL. 10, CREATED)

DT 01-MAR-1989 (REL. 10, LAST SEQUENCE UPDATE)

DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)

DE HISTONE H1B (H1.4).

GN H1F4.

OS HOMO SAPIENS (HUMAN).

OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;

OC EUTHERIA; PRIMATES.

RN [1]

RP SEQUENCE FROM N.A.

RX MEDLINE; 92009931.

RA ALBIG W., KARDALINOU E., DRABENT B., ZIMMER A., DOENECKE D.;

RL GENOMICS 10:940-948(1991).

RN [2]

RP SEQUENCE.

RC TISSUE=SPLEEN;

RX MEDLINE; 87057092.

RA OHE Y., HAYASHI H., IWAI K.;

RL J. BIOCHEM. 100:359-368(1986).



CC -!- FUNCTION: HISTONES H1 ARE NECESSARY FOR THE CONDENSATION OF

CC NUCLEOSOME CHAINS INTO HIGHER ORDER STRUCTURES.

CC -!- SUBCELLULAR LOCATION: NUCLEAR.

CC -!- THIS VARIANT ACCOUNTS FOR 60% OF HISTONE H1.

DR EMBL; M60748; G184074; -.

DR PIR; A24413; HSHU1B.

DR PIR; C40335; C40335.

DR HSSP; P08287; 1GHC.

KW CHROMOSOMAL PROTEIN; NUCLEAR PROTEIN; DNA-BINDING; MULTIGENE FAMILY;

KW ACETYLATION; METHYLATION.

FT INIT_MET 0 0

FT MOD_RES 1 1 ACETYLATION.

FT MOD_RES 25 25 METHYLATION (PARTIAL).

FT DOMAIN 35 113 GLOBULAR.

SQ SEQUENCE 218 AA; 21734 MW; 5A277FB0 CRC32;



H1B_HUMAN Length: 218 April 13, 1998 13:19 Type: P Check: 2701 ..

1 SETAPAAPAA PAPAEKTPVK KKARKSAGAA KRKASGPPVS ELITKAVAAS

51 KERSGVSLAA LKKALAAAGY DVEKNNSRIK LGLKSLVSKG TLVQTKGTGA

101 SGSFKLNKKA ASGEAKPKAK KAGAAKAKKP AGAAKKPKKA TGAATPKKSA

151 KKTPKKAKKP AAAAGAKKAK SPKKAKAAKP KKAPKSPAKA KAVKPKAAKP

201 KTAKPKAAKP KKAAAKKK

analyze%


Swiss-Prot TrEMBL

  • Translation of all EMBL Nucleic Acid coding sequences not yet present in Swiss-Prot

  • Allows rapid availability without immediate annotation

  • Release 16.3

  • March 30, 2001

  • 436,896 entries


TrEMBL Divisions

  • Everything in TrEMBL: spt

  • sp_bacteria

  • sp_fungi

  • sp_human

  • sp_invertebrate

  • sp_mammal

  • sp_mhc

  • sp_organelle

  • sp_phage

  • sp_plant

  • sp_rodent

  • sp_unclassified

  • sp_vertebrate


Protein Identification Resource - PIR

http://pir.georgetown.edu/

  • National Biomedical Research Foundation

  • Georgetown University

  • Current Release 67.05

  • March 23, 2001

  • 219,178 Entries


National Biomedical Research Foundation

  • Database begun over twenty years ago by Margaret O. Dayhoff

  • Originally published sequences in book form

  • Started with sequences derived from direct amino acid sequencing


analyze% typedata -ref PIR1:HSHU1B

!!AA_SEQUENCE 1.0

P1;HSHU1B - histone H1-4 - human

N;Alternate names: histone H1.4; histone H1b

C;Species: Homo sapiens (man)

C;Date: 31-Dec-1988 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997

C;Accession: C40335; A24413

R;Albig, W.; Kardalinou, E.; Drabent, B.; Zimmer, A.; Doenecke, D.

Genomics 10, 940-948, 1991

A;Title: Isolation and characterization of two human H1 histone genes within

clusters of core histone genes.

A;Reference number: A40335; MUID:92009931

A;Accession: C40335

A;Status: preliminary

A;Molecule type: DNA

A;Residues: 1-219 <ALB>

A;Cross-references: GB:M60748; NID:g184073; PID:g184074

A;Experimental source: blood

R;Ohe, Y.; Hayashi, H.; Iwai, K.

J. Biochem. 100, 359-368, 1986

A;Title: Human spleen histone H1. Isolation and amino acid sequence of a main

variant, H1b.

A;Reference number: A24413; MUID:87057092

A;Accession: A24413

A;Molecule type: protein

A;Residues: 2-219 <OHE>

A;Experimental source: spleen


C;Comment: This variant accounts for 60% of histone H1.

C;Genetics:

A;Gene: GDB:H1F4

A;Cross-references: GDB:120030; OMIM:142220

A;Map position: 12q11-12q21

C;Superfamily: histone H1

C;Keywords: acetylated amino end; chromosomal protein; DNA binding; methylated

amino acid; nucleosome; spleen

F;2-219/Product: histone H1-4 #status experimental <MAT>

F;2-32/Domain: amino-terminal <NH2>

F;33-110/Domain: globular <GLB>

F;111-219/Domain: carboxyl-terminal <END>

F;2/Modified site: acetylated amino end (Ser) (in mature form) #status

experimental

F;26/Modified site: N6-methyllysine (Lys) (partial) #status experimental


iProClass Database - PIR

http://pir.georgetown.edu/iproclass/

  • Comprehensive family relationships and structural/functional classifications and features of proteins

    • Superfamilies

    • Families

    • Domains


GCG Supplied Databases

  • GCG sequence database files are NOT normal UNIX files.

    • UNIX commands cannot be used to manipulate sequences in these databases

  • Stored as Data Libraries

  • Stored in Oracle RDB


Sequence Data Updates

  • Genbank

    • Daily

  • GCG Flat file

    • No longer updated

    • Last update June, 2000

  • GCG SeqStore

    • Oracle RDB

    • Daily updates


Database listing - GCG-FF

Databases available:

GenBank Release 118.0 (06/2000)

EMBL (Abridged) Release 62.0 (03/2000)

PIR-Protein Release 65.0 (06/2000)

NRL_3D Release 27.0 (03/2000)

SWISS-PROT Release 39.0 (06/2000)

SP-TREMBL Release 14.0 (06/2000)

PROSITE Release 16.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)


Database listing - SeqStore

Databases available:

GCGNUC updated nightly by DATASERVE

GCGPROT updated weekly by DATASERVE

GCGEST updated nightly by DATASERVE

PROSITE Release 15.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)


Data Libraries

  • Allows rapid searches

  • Sequences organized into groups

  • Each data library can be referred to by a logical name

  • Individual sequences can be extracted from the data library.


Logical Names: GCG Sequence Databases

http://www.microbio.uab.edu/seqCourse/datalib.htm


GCG SeqStore (Oracle-based Sequences)

Data Library Names


GCG Flat-file

Data Library Names




Sequence Tag Databases


Protein Databases


NCBI Blast Databases




Specifying Sequences

  • Filename

  • Data library specification

  • Accession number specification


Sequences within your own directories

  • Use the normal file specification:

    lefkowit/sequences/vsvcg.seq


Sequences within a Data Library

  • Flatfile Data Library:Sequence Name

    • sw:vglg_vsvsj - VSV G protein in the SwissProt library

    • primate:humada

      • The sequence for human adenosine deaminase mRNA

  • SeqStore

    • gcgprot:vglg_vsvsj

    • gcgnuc:humada


Sequence Formats

  • GCG requires a specific sequence format

  • Sequences entered from outside GCG must be reformatted

    • analyze% reformat

      • GCG program

    • analyze% readseq

      • Non-GCG addition


Non-GCG Sequence File

analyze% cat seq.txt

ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC

AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG

TCAAGAGAATCATTGACAACACAG

analyze%


analyze% reformat

analyze% reformat -check seq.txt

Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme

data file(s) so that they can be read by GCG programs.

Minimal Syntax: % reformat [-INfile=]reformat.txt -Default

Prompted Parameters: None

Local Data Files:

-DATa=translate.txt three-letter to one-letter codes


Optional Parameters:

[-OUTfile=]NewSeqName names the output file

-EXTension=.seq specifies a file name extension for the output

-LIStfile[=reformat.list] writes a list file of output sequence names

-MSF reformats sequences into an MSF output file

-RSF reformats sequences into an RSF output file

-PROtein or -NUCleotide insists that the sequences are reformatted as

protein or nucleotide sequences

-DEGap removes gap characters (. and ~) from the sequence

-LINesize=50 sets number of characters per line

-BLOcksize=10 sets number of characters per block

-BLAnklines=1 puts blank lines between the sequence lines

-NONUMbering suppresses numbering

-NOCOMments suppresses comments

-DNA changes U into T

-RNA changes T into U

-UPPer makes all sequence characters uppercase

-LOWer makes all sequence characters lowercase

-ONEIntothree translates one-letter peptides into three-letter

-THReeintoone translates three-letter peptides into one-letter

-NOHEAding input sequence from stdin contains no header

information



-COMparison reformats a scoring matrix instead of a sequence

(used with -PROtein or -NUCleotide, insists

that the matrix is reformatted as a protein

or nucleotide scoring matrix)

-GAPweight=12 specifies the gap creation penalty associated

with the scoring matrix

-LENgthweight=4 specified the gap extension penalty associated

with the scoring matrix

-SCAle=10 multiplies each value in the scoring matrix

by 10 (use any number from .01 to 100.0)

-EQUALSformat writes the scoring matrix in a form that may be

more easily read

-OLDCMPformat converts a pre-Version 9 scoring matrix into

a Version 9 scoring matrix (all options used

with -COMparison can also be used with

-OLDCMPformat. -PROtein or -NUCleotide must be

specified with -OLDCMPformat

-TRANSlate=filename.txt lets you name the translation table

-NOMONitor suppresses the screen trace showing each output

file

Add what to the command line ?

No ".." divider

seq.txt length: 100 bp

analyze%


Reformatted Sequence

analyze% cat seq.txt

!!NA_SEQUENCE 1.0

REFORMAT of: seq.txt check: 3430 from: 1 to: 100 April 9, 1998 14:31

(No documentation)

seq.txt Length: 100 April 9, 1998 14:31 Type: N Check: 3430 ..

1 ACGAAGACAA ACAAACCATT ATTATCATTA AAAGGCTCAG GAGAAACTTT

51 AACAGTAATC AAAATGTCTG TTACAGTCAA GAGAATCATT GACAACACAG

analyze%
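The visible effect of reformat on the sequence body is the layout: 50 characters per line in blocks of 10, with the position of the first base at the start of each line (the -LINesize and -BLOcksize defaults listed above). The Python sketch below reproduces only that layout for the seq.txt example. It does not compute the GCG checksum or write the header line, and the function name `layout` and the width of the number column are assumptions.

```python
# The three lines of seq.txt as shown before reformatting.
RAW_LINES = [
    "ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC",
    "AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG",
    "TCAAGAGAATCATTGACAACACAG",
]


def layout(sequence: str, line_size: int = 50, block_size: int = 10) -> list[str]:
    """Return numbered lines of space-separated blocks."""
    lines = []
    for start in range(0, len(sequence), line_size):
        chunk = sequence[start:start + line_size]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        # Positions are 1-based; the 8-character number column is a guess.
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return lines


if __name__ == "__main__":
    sequence = "".join(RAW_LINES)
    print(f"length: {len(sequence)}")
    print("\n".join(layout(sequence)))
```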


GCG Sequence Import Programs

  • fromstaden

  • fromembl

  • fromgenbank

  • frompir

  • fromig

  • fromfasta

  • fromtrace


GCG Sequence Export Programs

  • tostaden

  • topir

  • toig

  • tofasta


ReadSeq

General reformatting program


analyze% readseq

readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):

seq.fasta

1. IG/Stanford 10. Olsen (in-only)

2. GenBank/GB 11. Phylip3.2

3. NBRF 12. Phylip

4. EMBL 13. Plain/Raw

5. GCG 14. PIR/CODATA

6. DNAStrider 15. MSF

7. Fitch 16. ASN.1

8. Pearson/Fasta 17. PAUP/NEXUS

9. Zuker (in-only) 18. Pretty (out-only)

Choose an output format (name or #):

8


ReadSeq Formatted Sequence

Name an input sequence or -option:

seq.txt

Name an input sequence or -option:

analyze% cat seq.fasta

>seq.txt, 100 bases, D66 checksum.

ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTCAGGAGAAACTTT

AACAGTAATCAAAATGTCTGTTACAGTCAAGAGAATCATTGACAACACAG

analyze%


Sequence File Utilities

  • Chopup

    • Break up long lines in a text file prior to running reformat

  • Breakup

    • Break up long sequences into individual, overlapping sequence files


>uunt, 751719 bases, 1F08 checksum.

ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCC

ATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAA

CTCATAAAGTTTGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTT

GAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCT

AAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAG

ACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAAT

ATTAAAAATCGTTTTAATTTTAGTGATGAACTTTTACGTTACAATTTTAA

CAATTTAGTAATTAGTAATTTTAATCAAAAAGCGATTAAGGCGATTGAAA

ATTTATTTTCAAATAACTATGATAATAGTTCAATGTGTAACCCTTTATTT

TTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGTGGCTGCTGCTGG

TAATCGTTTTGCTAATAGTAATCCTAATTTAAAAATTTATTATTATGAAG

GGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTAAAAGGGACTAGT

TATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGATTTATTAATTTT

TGAAGATATTCAAAATATCCAATCACGTGATTCAACGGCTGAATTGTTTT

TTAATATCTTTAATGATATAAAATTAAATGGTGGAAAAATTATCTTAACA

TCTGACCGTACACCAAACGAACTTAATGGTTTTCATAATCGAATTATTTC

GAGATTAGCGTCAGGTTTGCAGTGTAAAATTTCTCAACCCGACAAAAATG

AAGCTATTAAAATTATTAATAATTGGTTTGAATTCAAAAAAAAATATCAA

ATTACTGACGAAGCTAAAGAATATATTGCTGAAGGTTTTCACACTGATAT

TAGACAGATGATtGGTAATCTAAAACAAATTTGTTTTTGAGCGGACAATG

ATACTAATAAAGATTTAATAATCACAAAAGATTATGTAATTGAGTGTTCA

GTTGAAAACGAAATTCCACTAAATATTGTTGTTAAAAAACAATTTAAACC


analyze% readseq

readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):

uunt.seq

1. IG/Stanford 10. Olsen (in-only)

2. GenBank/GB 11. Phylip3.2

3. NBRF 12. Phylip

4. EMBL 13. Plain/Raw

5. GCG 14. PIR/CODATA

6. DNAStrider 15. MSF

7. Fitch 16. ASN.1

8. Pearson/Fasta 17. PAUP/NEXUS

9. Zuker (in-only) 18. Pretty (out-only)

Choose an output format (name or #):

5

Name an input sequence or -option:

uunt

Name an input sequence or -option:


analyze% more uunt.seq

uunt

uunt, Length: 751719 (today) Check: 7944 ..

1 ATGGCTAATA ATTATCAAAC TTTATATGAT TCAGCAATAA AAAGGATTCC

51 ATACGATCTT ATTTCTGATC AAGCTTATGC AATTCTACAA AATGCTAAAA

101 CTCATAAAGT TTGCGATGGT GTTTTATATA TAATTGTAGC CAATGCCTTT

151 GAAAAAAGTA TTATTAACGG TAATTTTATT AACATTATTT CTAAATATCT

201 AAGCGAAGAA TTCAAAAAGG AAAATATTGT TAATTTTGAA TTTATTATAG

251 ACAATGAAAA ATTATTAATT AATAGCAATT TTTTAATTAA AGAAACTAAT

301 ATTAAAAATC GTTTTAATTT TAGTGATGAA CTTTTACGTT ACAATTTTAA

351 CAATTTAGTA ATTAGTAATT TTAATCAAAA AGCGATTAAG GCGATTGAAA

401 ATTTATTTTC AAATAACTAT GATAATAGTT CAATGTGTAA CCCTTTATTT

451 TTATTTGGTA AAGTTGGTGT TGGTAAAACG CATATCGTGG CTGCTGCTGG

501 TAATCGTTTT GCTAATAGTA ATCCTAATTT AAAAATTTAT TATTATGAAG

551 GGCAAGATTT TTTTCGAAAG TTTTGTTCTG CTTCGTTAAA AGGGACTAGT

...

751301 GAAAATAAAC TACGATTTGA TTAGAATGAA TTTTTTGTTG TTTCTTAATT

751351 GTATCAAGTA TATCTTCATT TTTTTTTAGA CTAATAAAAT TAGCCATAAA

751401 AATTATTTTT CACTAGAAAC TGTTAGACTA TGACGCCCTT TAAGTCTTCT

751451 TCTAGCTAAA ACATTACGCC CATTTTTTGT TTTCATGCGT GCACGAAAAC

751501 CATGCACTTT TGCTCTTTTA CGATTATTAG GTTGAAACGT TCTTTTCATA

751551 AATCCACCGC CCTCTTACTT TTTTGAAAAC ATAATATGGA TTATTATAAC

751601 ATTTTAGTTA TTTTTTATTT AATATATTTT TTTAAAAAAG TCAATGATAT

751651 CTTTTTAAAA ATAAACATAT ATAATATGAT AATAGGACAA AGATTATTTA

751701 TAAAAAATAG AGGTTACTA


analyze% map uunt.seq

Map maps a DNA sequence and displays both strands of the mapped sequence

with restriction enzyme cut points above the sequence and protein

translations below. Map can also create a peptide map of an amino acid

sequence.

***Error: Sequence "uunt.seq" could not be read or is not in GCG format

analyze% breakup uunt.seq

BreakUp reads a GCG-format sequence file containing more than 350,000

sequence characters and writes it as a set of separate, shorter,

overlapping sequence files that can be analyzed by Wisconsin Package programs.

uunt_0.seq length: 110000 bp

uunt_1.seq length: 110000 bp

uunt_2.seq length: 110000 bp

uunt_3.seq length: 110000 bp

uunt_4.seq length: 110000 bp

uunt_5.seq length: 110000 bp

uunt_6.seq length: 110000 bp

uunt_7.seq length: 51719 bp

analyze%
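The file lengths in the BreakUp output are consistent with 110,000-base pieces that overlap their neighbours by 10,000 bases: seven full pieces plus a 51,719-base remainder cover all 751,719 bases. The Python sketch below reproduces those lengths. The piece size and overlap are inferred from this one output, not taken from GCG documentation, and the function name is hypothetical.

```python
def chunk_lengths(total: int, size: int = 110_000, overlap: int = 10_000) -> list[int]:
    """Return the lengths of overlapping pieces that cover `total` bases.

    Each piece starts (size - overlap) bases after the previous one.
    The default size and overlap are inferred values, not GCG constants.
    """
    lengths = []
    start = 0
    while True:
        end = min(start + size, total)
        lengths.append(end - start)
        if end == total:
            break
        start += size - overlap
    return lengths


if __name__ == "__main__":
    for index, length in enumerate(chunk_lengths(751_719)):
        print(f"uunt_{index}.seq length: {length} bp")
```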



Multiple sequences

  • If the program prompts with: sequence(s), file(s), or file name(s), then it can accept more than one input file


Specifying Multiple Sequences

  • Wild Card Specification

  • File of File Names

    • List Files

  • Multiple Sequence Format File


Wild Card Specification (flat file)

  • GenEMBL:*

    • All sequences in GenBank and EMBL

  • Primate:*

    • All primate sequences in GenBank

  • Primate:Hum*

    • All Human sequences in GenBank

    • EMBL uses HS for human


Wild Card Specification (SeqStore)

  • gcgnuc:*

    • All sequences in GenBank and EMBL

  • Must create a query or list for most groupings


File of Sequence Names

  • List Files

  • You or certain GCG programs can construct a file containing any number of sequence names.


Specify as @Sequence_names.fil

  • The @ tells the program that Sequence_names.fil is a file of sequence names

  • The program uses all listed sequences


Contents of a File of Sequence Names

  • Begin with a comment

  • Sequence file names follow a double period at the end of a line: ..

  • Other comments can be included if preceded by a !

  • One sequence name per line


File of Sequence Names (continued)

  • Put an ! in front of a name to have the program ignore that particular entry.

  • A sequence name may include a wild card

  • The file can contain another file of sequence names as a listing

    • It must be preceded by an @


hsp70.fil File

January 21, 1998 ..

SWP:Hs70_Brelc

SWP:Hs70_Chick

SWP:Hs70_Human

SWP:Hs70_Leido

SWP:Hs70_Leima

SWP:Hs70_Maize

SWP:Hs70_Mouse

SWP:Hs70_Pethy

SWP:HS77_Yeast

SWP:GR78_Yeast -BEGin=43 -END=682

sequences/hsp70/ssa4.pep

ob0/users/lefkowit/sequences/hsp70/ssa1.pep

SWP:DNAK_EColi
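The list-file rules above (heading text up to a line ending in `..`, one name per line, `!` to skip an entry, `@` for a nested list) can be sketched as a small parser. This is an illustration of the rules as stated on the slides, not GCG's own reader; `parse_list_file` and the nested file name `more_hsp70.fil` are hypothetical.

```python
def parse_list_file(text):
    """Parse list-file text into (name, qualifiers) entries and nested list files."""
    entries, nested = [], []
    in_heading = True
    for raw in text.splitlines():
        line = raw.strip()
        if in_heading:
            # Everything up to and including the line ending in ".." is heading text.
            if line.endswith(".."):
                in_heading = False
            continue
        if not line or line.startswith("!"):
            continue  # blank line, comment, or entry the user disabled
        name, *qualifiers = line.split()
        if name.startswith("@"):
            nested.append(name[1:])  # another file of sequence names
        else:
            entries.append((name, qualifiers))
    return entries, nested


example = """\
January 21, 1998 ..
SWP:Hs70_Human
!SWP:Hs70_Mouse
SWP:GR78_Yeast -BEGin=43 -END=682
@more_hsp70.fil
"""
entries, nested = parse_list_file(example)
print(entries)
print(nested)
```

Qualifiers such as `-BEGin=43` are kept as plain strings here; how a real program interprets them is outside the scope of this sketch.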


Multiple Sequence Files (msf)

  • File containing multiple sequences that are related and have been aligned

  • Specifying msf files:

    • filename.msf{*}

    • The {*} indicates which sequences are to be used

  • You can exclude a sequence in subsequent analyses by preceding its name within the msf file with an ! sign.


hsp70.msf

PileUp of: @Hsp70.Fil

Symbol comparison table: GenRunData:NWSGapPep.Cmp CompCheck: 1254

GapWeight: 3.0

GapLengthWeight: 0.1

Pileup.Msf MSF: 738 Type: P December 26, 1990 13:39 Check: 288 ..

Name: Hs70_Plafa Len: 738 Check: 9820 Weight: 1.00

Name: Hs70_Thean Len: 738 Check: 120 Weight: 1.00

!Name: Hs70_Leido Len: 738 Check: 7985 Weight: 1.00

//

1 50

Hs70_Plafa .......... .....MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE

Hs70_Thean .......... .......... .......MTG PAIGIDLGTT YSCVAVYKDN

Hs70_Leido .......... .......... ......MTFD GAIGIDLGTT YSCVGVWQNE

51 100

Hs70_Plafa NVDIIANDQG NRTTPSYVAF T.DTERLIGD AAKNQVARNP ENTVFDAKRL

Hs70_Thean NVEIIPNDQG NRTTPSYVAF T.DTERLIGD AAKNQEARNP ENTIFDAKRL

Hs70_Leido RVDIIANDQG NRTTPSYVAF TSDSERLIGD AAKNQVAMNP HNTVFDAKRL


rsf Files

  • Rich Sequence Format

  • Allows entry of additional information about each sequence

  • File can contain multiple sequences

    • Allows gaps

  • Different sequences do not need to be related

  • Create and Edit rsf files within SeqLab


rsf Sequence Information

  • Creator/author of the sequence

  • Sequence weight

  • Creation date

  • One-line description of the sequence

  • Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project

  • Known sequence features


rsf File Specification

  • Similar to msf files

  • hsp70.rsf{*}

    • Use all the sequences in the file

  • hsp70.rsf{hs70_human}

    • Only use this single sequence

  • hsp70.rsf{hs70*}

    • Only use sequences whose name starts with hs70
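The brace notation above can be read as a wildcard filter over the sequence names stored in the file. The following sketch assumes matching is case-insensitive, since the slides write `hs70_human` for an entry named `Hs70_Human`; `split_spec` and `select_sequences` are hypothetical helper names, and the name list is illustrative.

```python
import re
from fnmatch import fnmatchcase


def split_spec(spec):
    """Split 'file{pattern}' into (file, pattern); the pattern defaults to '*'."""
    match = re.fullmatch(r"([^{}]+)(?:\{([^{}]*)\})?", spec)
    if match is None:
        raise ValueError(f"unrecognized specification: {spec!r}")
    return match.group(1), match.group(2) or "*"


def select_sequences(spec, names):
    """Return the names selected by the brace pattern (case-insensitive by assumption)."""
    _, pattern = split_spec(spec)
    return [name for name in names if fnmatchcase(name.lower(), pattern.lower())]


names = ["Hs70_Human", "Hs70_Mouse", "DnaK_Ecoli"]
print(select_sequences("hsp70.rsf{*}", names))
print(select_sequences("hsp70.rsf{hs70_human}", names))
print(select_sequences("hsp70.rsf{hs70*}", names))
```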


Data

analyze% more rsb.rsf

!!RICH_SEQUENCE 1.0

..

{

name dc-62-18537

descrip Description: PileUp of: *.seq

type DNA

longname dc-62-18537

checksum 8717

creation-date 4/10/98 15:45:50

strand 1

sequence

TCCACCGTGCTCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA

ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA

TTAAATCCTAAT

}

{

name swed-60-860

descrip Description: PileUp of: *.seq

type DNA

longname swed-60-860

checksum 8595

creation-date 4/10/98 15:45:50

strand 1

sequence

TCCACCGTGATCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA

ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA

TCAAATCCTACT

}



List Refinement

  • Run search program 1

  • Create a list of file names

  • Use as input to search program 2

  • Create a second list of file names

  • Edit the listfile at each step as necessary.

  • etc.


Programs Which Create a List of Sequences

  • Names

  • Blast

  • Lookup

  • StringSearch

  • FindPatterns

  • FastA

  • TFastA


Names

  • Searches sequence names for a match

    • analyze% names primate:Hum*

      • Will create a file listing all human sequences present in GenBank

  • Depends on knowing each database's naming conventions

    • GenBank:Hum*

    • EMBL:Hs*


Data

analyze% names -check pr:huma*

Names identifies GCG data files and sequence entries by name. It can

show you what set of sequences is implied by any sequence specification.

Minimal Syntax: % names [-INfile=]GenEMBL:Humhb* -Default

Prompted Parameters:

[-OUTfile=]Term output file name (defaults to your terminal)

Options:

-SHOwfiles=132 limits documentation in the output file to column 132

-NOHEAding suppresses the heading at the top of the file.

-NOMONitor suppresses the screen monitor

Add what to the command line ?

What (file of filenames) output file (* TERM *) ?

gb_pr1:

huma1aadr huma1acm huma1acmb huma1ar1

huma1ar2 huma1at huma1ata huma1atb


Data

analyze% more list.file

!!SEQUENCE_LIST 1.0

! NAMES from: pr:huma* April 13, 1998 14:55 ..

gb_pr1:huma1aadr LOCUS HUMA1AADR 2002 bp mRNA PRI 04-NOV-1991 DEFINITION Human a

lpha-A1-adrenergic receptor mRNA, complete cds. ACCE

gb_pr1:huma1acm LOCUS HUMA1ACM 1520 bp mRNA PRI 30-OCT-1994 DEFINITION Human alp

ha-1-antichymotrypsin (AACT) mRNA, complete cds. ACC

gb_pr1:huma1acmb LOCUS HUMA1ACMB 559 bp DNA PRI 30-OCT-1994 DEFINITION Human alp

ha-1-antichymotrypsin gene, exon 1. ACCESSION M18035

gb_pr1:huma1ar1 LOCUS HUMA1AR1 890 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha

-1-antitrypsin-related protein gene, exon 2. ACCESSI

gb_pr1:huma1ar2 LOCUS HUMA1AR2 3758 bp DNA PRI 30-OCT-1994 DEFINITION Human alph

a-1-antitrypsin-related protein gene, exons 3, 4 and

gb_pr1:huma1at LOCUS HUMA1AT 143 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-

1-antitrypsin (alpha-1-AT) mRNA, 3' end. ACCESSION M

gb_pr1:huma1ata LOCUS HUMA1ATA 322 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha

-1-antitrypsin gene, exon 1 (unexpressed). ACCESSION

gb_pr1:huma1atb LOCUS HUMA1ATB 1345 bp mRNA PRI 30-OCT-1994 DEFINITION Human alp

ha-1-antitrypsin mRNA, complete cds. ACCESSION M1146


StringSearch

  • Old search method

  • Searches for a particular text pattern in the sequence documentation.

    • Definition Search

    • Record Search

      • Complete search for possible text occurrences

  • Very Slow!!


Lookup (gcgff only)

  • Rapid Text Pattern Searching

  • Uses an index of sequence file documentation

  • Allows field-specific searches

  • Allows AND, OR, and NOT matching


Lookup Considerations

  • Be sure that analyze is set to use a vt100 terminal:

    • analyze% setenv TERM vt100

  • Lookup may miss some sequences

    • Dependent on the annotation

    • Spelling counts

  • Searches are case-insensitive


Logical Operators Within a Field

  • AND: &

    • A & B means find all entries that contain both A and B.

  • OR: |

    • A | B means find all entries that contain either A or B.

  • BUT-NOT: !

    • A ! B means find all entries that contain A but do not contain B.
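The three operators can be illustrated with a small evaluator over the words of one annotation field. This is a sketch of the semantics described above, not LookUp's implementation: it assumes single-word terms, exact word matches (no automatic wildcard extension), and strict left-to-right evaluation. `matches` is a hypothetical name.

```python
import re


def matches(query, field_text):
    """Evaluate a query such as 'globin & region ! pseudogene' left to right."""
    words = set(re.findall(r"\w+", field_text.lower()))
    # Splitting on a captured operator keeps the operators in the token list.
    tokens = re.split(r"\s*([&|!])\s*", query.strip().lower())
    result = tokens[0] in words
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        present = term in words
        if operator == "&":      # AND: both must be present
            result = result and present
        elif operator == "|":    # OR: either may be present
            result = result or present
        else:                    # "!": BUT-NOT, the term must be absent
            result = result and not present
    return result


definition = "Human beta globin region"
print(matches("globin & region", definition))
print(matches("globin ! region", definition))
print(matches("myosin | globin", definition))
```

Matching is done on lowercased text, in line with the note that searches are case-insensitive.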


Data

analyze% lookup -check

LookUp identifies sequence database entries by name, accession number,

author, organism, keyword, title, reference, feature, definition,

length, or date. The output is a list of sequences.

The LookUp program is experimental in this release. LookUp sometimes

crashes or produces incorrect results if you query a nucleic acid

database and request fragment output. Please look carefully at your

results.

Minimal Syntax: % lookup [-ALLtext=]Globin -Default


Data

Prompted Parameters:

-LIBrary=SwissProt[,...] lookup in specified data libraries

-ALLtext=Globin searches all text indices for globin

-DEFInition=Globin words indexed independently "Globin & Region"

-AUThor=Smithies for more than one "Smithies,O. & Slightom,J.L."

-KEYword=Globin see document before using keywords

-NAMe=hsggl3 entry name

-ACCessionnumber=S12345 accession number

-ORGanism="Homo Sapiens" genus and species

-REFerence=Cell&1981 complete reference: "Cell & 26 & 191- & 1981"

-TITle=History title of citation "History & Duplication"

-FEAture=Gamma any word in a feature table

-SHOrtest=100 find only sequences of length 100 or more

-LONgest=400 find only sequences of length 400 or less

-EARliest=01-apr-1992 sequences modified on or after April 1, 1992

-LATest=30-apr-1992 sequences modified on or before April 30, 1992

-MATch=OR specifies inter-field logic (AND is default)

-OUTfile=lookup.list output file for list of sequences


Data

Optional Parameters:

-NOWILdcardextension turns off automatic wildcard extension

-INfile=@lookup.list searches in lookup.list instead of libraries

-ANNotate=FEAture[,...] shows fields from original annotation in output

acceptable values include: ACCession, AUThor,

DATe, DEFinition, FEAture, NAMe, KEYword,

ORGanism, REFerence, and TITle

-FRAgments shows features as fragments instead of whole

entries

-COMplete shows only features with unambiguous coordinates

-MONitor shows databases searched and how many hits found

Add what to the command line ?


Data

LOOKUP in what sequence libraries:

a) swissprot

b) sptrembl

c) pir

d) embl

e) genbank

f) em_tags

g) gb_tags

h) All libraries

q) quit

Please choose one or more (* h *):


Data

Complete the query form below:

All text:

Definition:

Author:

Keyword:

Sequence name:

Accession number:

Organism:

Reference:

Title:

Feature:

On or after (dd-mmm-yy): On or before (dd-mmm-yy):

Shortest sequence length: Longest sequence length:

Inter-field operator: AND Form of output list: Whole Entries

Press <Ctrl>D to continue.


SeqStore

Sequence searching


Lookup_rdb (gcgrdb)

  • SeqStore command-line sequence searching

  • Bare-bones; use the SeqStore Web interface instead


SeqStore Web Searching

  • Set up multiple criteria for selecting sets of sequences

  • Save as a query or list

    • Query: Active list. Changes as new sequences are added

    • List: Static list. No change with database updates

  • Save to SeqWeb

  • Powerful but can be slow


NCBI Sequence Services

  • Obtain sequences directly from NCBI

    • Sequence Searches

    • Sequence Retrieval

  • Other services

    • BLAST Searches

    • Sequence Submission

    • PubMed Searches


Entrez

  • NCBI Databases on the Web

    • Sequence retrieval

    • Text pattern searches

  • GenBank is updated on a daily basis

  • Web Site: http://www.ncbi.nlm.nih.gov


Finding Sequences by Similarity

Using GCG


Sequence Similarities

  • What other sequences have some primary sequence similarity to my query sequence?

  • The time and cost of a search depend on the size of the database

    • Restrict the size of the database


FindPatterns

  • Look for sequence patterns within sequence files

  • Allows complex pattern definitions

    • Ambiguous sequence specifications


BLAST; NetBlast

  • All search combinations possible

  • nt vs. nt database

    • blastn

  • protein vs. protein database

    • blastp

  • translated nt vs. protein database

    • blastx

  • protein vs. translated nt database

    • tblastn

  • translated nt vs. translated nt database

    • tblastx
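The five combinations above amount to a lookup keyed on the query type, the database type, and whether translation is involved. The sketch below encodes only what the slide lists; `choose_blast` and its argument spellings are hypothetical.

```python
def choose_blast(query, database, translated=False):
    """Pick the BLAST program for a query/database pairing ('nt' or 'protein')."""
    programs = {
        ("nt", "nt", False): "blastn",            # nt vs. nt database
        ("protein", "protein", False): "blastp",  # protein vs. protein database
        ("nt", "protein", True): "blastx",        # translated nt vs. protein database
        ("protein", "nt", True): "tblastn",       # protein vs. translated nt database
        ("nt", "nt", True): "tblastx",            # translated nt vs. translated nt database
    }
    try:
        return programs[(query, database, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query} vs. {database} (translated={translated})"
        ) from None


print(choose_blast("nt", "protein", translated=True))
print(choose_blast("protein", "nt", translated=True))
```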


FastA

  • Search nucleotide sequences with a nucleotide query

  • Search protein sequences with a peptide query


TFastA

  • Translates nucleotide sequences in all 6 reading frames

  • Search the translated sequences with a peptide query
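Translating "in all 6 reading frames" means three offsets on the given strand plus three on its reverse complement. The sketch below shows only that step, using the standard genetic code. It is not TFastA's implementation, it handles only unambiguous A/C/G/T bases, and `six_frames` is a hypothetical name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, peptide in six_frames("ATGGCCTAA").items():
    print(frame, peptide)
```

A peptide query would then be compared against each of the six translated strings.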


Displaying Data

  • analyze% typedata

    • Displays on your screen the contents of any GCG data file

    • -REF

      • Display documentation only


Copying Data

  • analyze% fetch

    • Will copy any GCG data or sequence file to your directory


Sequence Symbols

  • Sequence symbols

    • Handout lists the sequence symbols recognized by GCG

      • Ambiguity codes are as proposed by the IUB nomenclature committee

        • Used by GenBank, EMBL, and NBRF


Nucleotide Symbols

IUB/GCG   Meaning                Complement   Staden/Sanger
A         A                      T            A
C         C                      G            C
G         G                      C            G
T/U       T                      A            T
M         A or C                 K            5
R         A or G                 Y            R
W         A or T                 W            7
S         C or G                 S            8
Y         C or T                 R            Y
K         G or T                 M            6
V         A or C or G            B            not supported
H         A or C or T            D            not supported
D         A or G or T            H            not supported
B         C or G or T            V            not supported
X/N       G or A or T or C       X            -/X
. (gap)   not G or A or T or C   .            not supported
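The Complement column above is enough to reverse-complement a sequence that contains ambiguity codes. A minimal sketch follows, with two assumptions: N is kept as N (the table lists X/N together with complement X), and U is complemented to A like T. `reverse_complement` is a hypothetical name.

```python
# Symbol -> complement, following the IUB/GCG and Complement columns.
# Assumptions: N maps to N and X maps to X; "." (gap) stays ".".
IUB_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBXN.",
    "TGCAAKYWSRMBDHVXN.",
)


def reverse_complement(sequence):
    """Complement every symbol, then reverse the result."""
    return sequence.upper().translate(IUB_COMPLEMENT)[::-1]


print(reverse_complement("GAATTC"))    # EcoRI site, its own reverse complement
print(reverse_complement("ACGTRYMK"))
```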


Amino Acid Symbols

IUB Symbol   3-letter   Meaning                Codons                     Depiction
A            Ala        Alanine                GCT,GCC,GCA,GCG            !GCX
B            Asp,Asn    Aspartic, Asparagine   GAT,GAC,AAT,AAC            !RAY
C            Cys        Cysteine               TGT,TGC                    !TGY
D            Asp        Aspartic               GAT,GAC                    !GAY
E            Glu        Glutamic               GAA,GAG                    !GAR
F            Phe        Phenylalanine          TTT,TTC                    !TTY
G            Gly        Glycine                GGT,GGC,GGA,GGG            !GGX
H            His        Histidine              CAT,CAC                    !CAY
I            Ile        Isoleucine             ATT,ATC,ATA                !ATH
K            Lys        Lysine                 AAA,AAG                    !AAR
L            Leu        Leucine                TTG,TTA,CTT,CTC,CTA,CTG    !TTR,CTX,YTR;YTX
M            Met        Methionine             ATG                        !ATG
N            Asn        Asparagine             AAT,AAC                    !AAY
P            Pro        Proline                CCT,CCC,CCA,CCG            !CCX
Q            Gln        Glutamine              CAA,CAG                    !CAR
R            Arg        Arginine               CGT,CGC,CGA,CGG,AGA,AGG    !CGX,AGR,MGR;MGX
S            Ser        Serine                 TCT,TCC,TCA,TCG,AGT,AGC    !TCX,AGY;WSX
T            Thr        Threonine              ACT,ACC,ACA,ACG            !ACX
V            Val        Valine                 GTT,GTC,GTA,GTG            !GTX
W            Trp        Tryptophan             TGG                        !TGG
X            Xxx        Unknown                                           !XXX
Y            Tyr        Tyrosine               TAT,TAC                    !TAY
Z            Glu,Gln    Glutamic, Glutamine    GAA,GAG,CAA,CAG            !SAR
*            End        Terminator             TAA,TAG,TGA                !TAR,TRA;TRR
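Each entry in the Depiction column is a codon written with nucleotide ambiguity codes, so it can be expanded back into the codons it stands for. The sketch below does that expansion with the IUB meanings from the nucleotide table; `expand` is a hypothetical name.

```python
from itertools import product

# Bases represented by each IUB/GCG nucleotide symbol.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand(depiction):
    """Return the set of concrete codons matched by an ambiguous codon."""
    choices = (IUB[symbol] for symbol in depiction.upper())
    return {"".join(codon) for codon in product(*choices)}


print(sorted(expand("RAY")))   # B: Asp or Asn
print(sorted(expand("ATH")))   # Ile
print(sorted(expand("TRR")))   # listed after the semicolon for End
```

Expanding `TRR` yields TGG (Trp) in addition to the three stop codons, which suggests that the depictions after a semicolon are broader approximations; that reading is an inference from the table, not something the slide states.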


Other Stuff

Non-sequence Data


Non-Sequence Data

  • Non-Sequence Data

    • Data required to run a program

    • Copy to your directory with Fetch


Local Data Files

  • Copies of GCG Data files stored in your own directory.

  • May be altered as desired.


Using Local Data Files

  • Programs look first in your current (default) directory for a data file with the expected name.

    • If it is not found there, the public data file is used.

    • A user may specify a new name for the data file when running a program.
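The search order described above (your own directory first, then the public copy) can be sketched as a two-step lookup. The directories here are temporary stand-ins created for the demonstration, not actual GCG locations, and `find_data_file` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def find_data_file(name, local_dir, public_dir):
    """Return the local copy of a data file if present, else the public one."""
    for directory in (Path(local_dir), Path(public_dir)):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(name)


with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as public:
    (Path(public) / "enzyme.dat").write_text("public copy")
    print(find_data_file("enzyme.dat", local, public).read_text())

    # After copying the file into your own directory (as Fetch would do)
    # and editing it, the local copy takes precedence.
    (Path(local) / "enzyme.dat").write_text("edited local copy")
    print(find_data_file("enzyme.dat", local, public).read_text())
```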


Restriction Enzyme Files

  • REBASE (enzyme.dat)

    • REBASE 6/2000

    • Dr. Richard J. Roberts

    • Cold Spring Harbor Laboratory

  • Used by:

    • Map, MapSort, MapPlot


Prosite

  • Dictionary of sequence motifs

    • Dr. Amos Bairoch, University of Geneva

  • Release 16, 7/1999

    • over 1300 patterns

  • Used by:

    • Motifs


Profiles

  • Database of peptide profiles

    • Drs. Michael Gribskov and Amos Bairoch

  • Over 600 Profiles

  • Used by ProfileScan


Eukaryotic Transcription Factor Recognition Sites

  • Transcription Factor Database

  • Dr. David Ghosh, NCBI

  • Release 7.5, 3/96

  • genmoredata:tfsites.dat

  • Used by:

    • FindPatterns

    • Map, MapSort, MapPlot


Codon Frequency Tables

  • Frequency of particular codon usage

  • Look in genmoredata

  • Organism

    • Human

    • E. coli

    • Drosophila

  • Used by:

    • BackTranslate, CodonPreference


Translation Tables

  • Standard Table for translating nucleotide sequences into amino acid sequences

  • Look in genmoredata

  • Alternate translation tables

    • Mitochondria

    • Mycoplasma

  • Used by:

    • Translate, Map, Frames


Symbol Comparison Tables

  • Amino acid similarities

  • What is the chance that one amino acid can substitute for another without affecting function?

  • Used by all sequence comparison programs

    • FastA, TFastA, Blast

    • Gap, BestFit

    • PileUp


Protein Analysis Data

  • Amino acid properties

    • Charge, hydrophobicity, molecular weight, secondary structure predictions, etc.

  • Protease digestion sites

  • Used by:

    • PepPlot; PlotStructure


Free Energy Values

  • RNA secondary structure prediction

  • Used by:

    • Mfold, FoldRNA