Database Searching


Searching for Data

  • Text Patterns

    • LookUp

  • Sequence Patterns

    • FindPatterns

    • ProfileSearch

  • Sequence Similarity

    • FastA, TFastA

    • BLAST, NetBLAST


Introduction to Database Searching

  • What are you looking for?


"Exact" matches

  • "Have I cloned something that someone else has already worked on?"


"Related" sequences

  • Is there something similar to my sequence?

    • Evolutionary relationships

    • Convergent function


Search Program Considerations

  • Sensitivity

  • Stringency

  • Speed

  • Cost


Speed and Cost

  • The time and cost of a search depend on the size of the database and the size of the query

    • Restrict the size of the database

  • Use the -batch qualifier to save money

  • Use GenBank's Services


Results

  • Histogram

    • Plot of "match scores" vs. number of sequences

    • Allows you to distinguish background noise from significant matches

  • Sequence Names

  • Alignments


FindPatterns

  • Locate short sequence patterns in sequences

  • Nucleic acid or Protein

  • Searches both strands of a nucleic acid sequence
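
Searching both strands can be reduced to two scans of the top strand: one for the pattern and one for its reverse complement. The following is a minimal Python sketch of that idea, not the FindPatterns implementation. The function names are hypothetical, and it assumes an exact pattern made only of A, C, G, and T.

```python
import re

# Complement table for unambiguous DNA bases (assumption: only A, C, G, T occur).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(seq: str, pattern: str):
    """Yield (strand, 1-based start on the top strand, matched text) per hit."""
    seq = seq.upper()
    pattern = pattern.upper()
    # A zero-width lookahead reports overlapping occurrences as well.
    for m in re.finditer(f"(?={re.escape(pattern)})", seq):
        yield ("+", m.start() + 1, pattern)
    # A bottom-strand hit is a top-strand occurrence of the reverse complement.
    rc = reverse_complement(pattern)
    if rc != pattern:  # a palindromic site would otherwise be reported twice
        for m in re.finditer(f"(?={re.escape(rc)})", seq):
            yield ("-", m.start() + 1, rc)
```

A palindromic site such as GAATTC reads the same on both strands, so the sketch reports it once.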


Pattern Definitions

  • FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions

  • Expressions can include any legal GCG sequence character

  • Expressions can also specify:

    • OR and NOT matching

    • Begin and end constraints

    • Repeat counts


Repeats

  • Parentheses () enclose one or more symbols that can be repeated

  • Braces {} enclose numbers that tell how many times the symbol(s) must be found

    • (GA){2,10} - GA repeated 2 to 10 times

    • G{2,} - G repeated 2 to 350,000 times

    • (GAT){,10} - GAT repeated 0 to 10 times


TAATA(N){20,30}ATG

  • TAATA, followed by 20 to 30 of any base, followed by ATG


OR Matching

  • Enclose the different choices in parentheses and separate the choices with commas

  • RGF(Q,A)S

    • RGF followed by either Q or A followed by S.

  • GAT(TG,T,G){1,4}A means

    • GAT followed by any combination of TG, T, or G repeated from 1 to 4 times followed by A


NOT Matching

  • Use the ~ symbol

  • GC~CAT

    • GC, followed by any symbol except C followed by AT

  • GC~(A,T)CC

    • GC followed by any symbol except A or T, followed by CC.


Begin and End Constraints

  • The pattern <GACCAT can only be found at the beginning of the sequence

  • The pattern GACCAT> can only be found at the end of the sequence
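
The pattern syntax described on the preceding slides (repeat counts, OR lists, NOT, and begin/end constraints) maps fairly directly onto regular expressions. The translator below is a rough Python sketch of that mapping, not how GCG parses patterns. The name `gcg_to_regex` and its `nucleic` flag are hypothetical. It assumes no nested parentheses, and that `~` applies to a single symbol or a list of single symbols.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def gcg_to_regex(pattern: str, nucleic: bool = True) -> str:
    """Translate a FindPatterns-style expression into a Python regex string."""

    def members(symbol: str) -> str:
        # Characters that a single pattern symbol may match.
        return IUPAC_BASES[symbol] if nucleic else symbol

    def atom(symbol: str) -> str:
        chars = members(symbol)
        return chars if len(chars) == 1 else f"[{chars}]"

    text = pattern.upper()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "<":                      # begin constraint
            out.append("^")
        elif ch == ">":                    # end constraint
            out.append("$")
        elif ch == "~":                    # NOT: next symbol or (a,b,...) list
            if text[i + 1] == "(":
                close = text.index(")", i)
                excluded = text[i + 2:close].split(",")
                i = close
            else:
                excluded = [text[i + 1]]
                i += 1
            out.append("[^" + "".join(members(s) for s in excluded) + "]")
        elif ch == "(":                    # repeat unit or OR list
            close = text.index(")", i)
            choices = text[i + 1:close].split(",")
            out.append(
                "(?:" + "|".join("".join(atom(s) for s in c) for c in choices) + ")"
            )
            i = close
        elif ch == "{":                    # repeat count; {,n} means 0 to n
            close = text.index("}", i)
            low, comma, high = text[i + 1:close].partition(",")
            out.append("{" + (low or "0") + comma + high + "}")
            i = close
        else:
            out.append(atom(ch))
        i += 1
    return "".join(out)
```

For example, `re.search(gcg_to_regex("RGF(Q,A)S", nucleic=False), protein)` would look for RGF followed by Q or A followed by S in a hypothetical string `protein`.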


analyze% findpatterns -check

FindPatterns identifies sequences that contain short patterns like

GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow

mismatches. You can provide the patterns in a file or simply type them

in from the terminal.

Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default

Prompted Parameters:

-PATterns=GAATTC,RGGAY patterns to be found

[-OUTfile=]findpatterns.find the output file name

Local Data Files:

-DATa=pattern.dat a file with a set of patterns


Optional Parameters:

-MISmatch=1 allows mismatches in the search for your subsequence

-NAMes writes the output as a list file

-ONEstrand searches only the top strand of nucleotide sequences

-SIXbase searches only for patterns with six or more symbols

-CIRcular searches all sequences as if they were circular

-ALL does an "overlapping-set" search in nucleotide sequences

-PERFect looks only for perfect matches

-APPend appends the pattern data file to the output file

-SHOw shows every file searched even if there are no finds

-TERminal writes output to the terminal screen instead of a file

-NOMONitor suppresses the screen trace showing each file

-ONCe limits finds to patterns found a maximum of 1 time

-MINCuts=1 limits finds to patterns found a minimum of 1 time

-MAXCuts=3 limits finds to patterns found a maximum of 3 times

-EXCLude=n1,n2 excludes patterns found between positions n1 and n2

-SINce=6.90 limits search to sequences dated on or after June 1990

-BATch Submits the program to run in the batch queue

Add what to the command line ?


FINDPATTERNS in what sequence(s) ? swp:*

Enter patterns individually, one per line.

End the list with a blank line.

Pattern 1: ygdd

Pattern 2:

What should I call the output file (* findpatterns.find *) ? ygdd.find

** findpatterns will run as a batch or at job.

** findpatterns was submitted using the command:

" atnow "

Job class000.894911339.a will be run at Mon May 11 13:28:59 CDT 1998.

analyze%


! FINDPATTERNS on swp:* allowing 0 mismatches

! 1 YGDD May 11, 1998 11:02 ..

AAC1_PSEAE ck: 7052 len: 177

! P23181 pseudomonas aeruginosa. gentamicin 3'-acetyltransferase (ec 2.3.1.6

1 YGDD

148: YVQAD YGDD PAVAL

AMDZ_YEAST ck: 8601 len: 464

! Q03557 saccharomyces cerevisiae (baker's yeast). probable amidase ymr293c

1 YGDD

450: QVVGQ YGDD STVLD

AMOB_NITEU ck: 4649 len: 420

! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96

1 YGDD

227: RVLLA YGDD LLMDP

AMYM_BACST ck: 5976 len: 717

! P19531 bacillus stearothermophilus. maltogenic alpha-amylase precursor (ec


POLG_HRV1B VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYK

POLG_HRV2 VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYI

POLG_HRV89 MPSGCAGTSI FNTIINNIII RTLVLDAYKN IDLDKLKILA YGDDVIFSYN

POLG_CXA16 MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP

POLG_HE71M MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP

POLG_HE71B MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELKMVA YGDDVLASYP

POLG_SVDVU MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_SVDVH MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_COXB5 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_COXB3 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_CXA9 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_COXB4 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_COXB1 MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_EC11G MPSGYSGTSM FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP

POLG_FMDV1 MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD

POLG_FMDVO MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD

POLG_FMDVZ MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD

POLG_FMDVA MPSDCSATGI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD

POLG_FMDVS MPSGCSATSI VNTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD

POLG_TMEVB LPSGCAATSM LNTIMNNVII RAALYLTYSN FDFDDIKVLS YGDDLLIGTN

POLG_TMEVG LPSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN

POLG_TMEVD LLSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN

POLG_EMCVD LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN

POLG_EMCVB LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN

POLG_EMCV LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN


! FINDPATTERNS on swp:* allowing 0 mismatches

! 1 (L,I,V)(S,A)YGDD(L,I,V){2} May 11, 1998 11:31 ..

AMOB_NITEU ck: 4649 len: 420

! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96

1 (L,I,V)(S,A)YGDD(L,I,V){2}

(L)(A)YGDD(L){2}

225: RSRVL LAYGDDLL MDPMD

POLG_BOVEV ck: 7260 len: 2,175

! P12915 bovine enterovirus (strain vg-5-27) genome polyprotein (coat protei

1 (L,I,V)(S,A)YGDD(L,I,V){2}

(I)(A)YGDD(L,V){2}

2,038: DDLKI IAYGDDVL ASYPY

POLG_COXB1 ck: 4153 len: 2,182

! P08291 coxsackievirus b1. genome polyprotein (coat proteins vp1 to vp4; co

1 (L,I,V)(S,A)YGDD(L,I,V){2}

(I)(A)YGDD(I,V){2}

2,045: DQFRM IAYGDDVI ASYPW

POLG_COXB3 ck: 7699 len: 2,185

! P03313 coxsackievirus b3. genome polyprotein (coat proteins vp1 to vp4; co


FastA

  • Search nucleotide sequences with a nucleotide query

  • Search protein sequences with a peptide query


FastA Algorithm

  • Uses a word search algorithm

  • Breaks the search into steps

  • Only the sequences with the best scores are searched in subsequent steps

  • Relatively fast

  • Sensitive


Step 1

  • Scan the sequence database for the best hits

  • Uses a word-match type search

    • Looks for runs of short, perfect matches

  • Essentially a dotplot-like search

  • Then finds the 10 best diagonals for each sequence pair
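
The word-match scan in Step 1 can be sketched as a lookup table of query words plus a tally of hits per diagonal. The Python below is a simplified illustration under stated assumptions, not the FastA code. The function names are hypothetical. It ranks diagonals by raw hit count, whereas the actual program also accounts for the spacing between hits.

```python
from collections import defaultdict


def word_hits_by_diagonal(query: str, subject: str, word_size: int = 2):
    """Count exact word matches on each diagonal (subject index - query index)."""
    # Index every word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        positions[query[i:i + word_size]].append(i)

    # Scan the database sequence once; each shared word adds a hit to a diagonal.
    diagonals = defaultdict(int)
    for j in range(len(subject) - word_size + 1):
        for i in positions.get(subject[j:j + word_size], ()):
            diagonals[j - i] += 1
    return diagonals


def best_diagonals(query: str, subject: str, word_size: int = 2, keep: int = 10):
    """Return up to `keep` (diagonal, hit_count) pairs, best first."""
    counts = word_hits_by_diagonal(query, subject, word_size)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
```

Runs of hits on the same diagonal correspond to the diagonal lines seen in a dot plot.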


Initial Scan (DotPlot): Database Sequences vs. Query Sequence(s)


Best Diagonals


Step 2

  • Rescore the initial diagonals

    • Conservative replacements

    • Uses the Blosum50 symbol comparison table


Initial regions

  • Initial regions are the stretches of diagonals with the highest scores

  • Score reported as Init1
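
Rescoring a diagonal amounts to finding its best-scoring ungapped stretch under a substitution table. The Python sketch below illustrates this with a deliberately tiny toy scoring function standing in for Blosum50. The scores, the conservative groups, and the function names are all assumptions made for illustration.

```python
# Toy stand-in for a symbol comparison table (NOT Blosum50): identities score
# highest, members of the same illustrative group score as conservative.
CONSERVATIVE_GROUPS = ("ILVM", "KR", "DE", "ST", "FYW")


def toy_score(a: str, b: str) -> int:
    if a == b:
        return 5
    if any(a in group and b in group for group in CONSERVATIVE_GROUPS):
        return 2
    return -3


def best_region_on_diagonal(query: str, subject: str, diagonal: int,
                            score=toy_score) -> int:
    """Best ungapped segment score on one diagonal (subject index - query index)."""
    i = max(0, -diagonal)      # first query position that lies on the diagonal
    j = i + diagonal           # matching subject position
    best = running = 0
    while i < len(query) and j < len(subject):
        # Restart the segment whenever the running score drops below zero.
        running = max(0, running + score(query[i], subject[j]))
        best = max(best, running)
        i += 1
        j += 1
    return best
```

With this scheme a conservative replacement such as V for L extends a region instead of ending it, which is the point of rescoring with a comparison table.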


Diagonals with the Highest Scores


Step 3

  • Join adjacent diagonals.

    • Find the optimal subset of initial regions which can be joined together.

    • Corrects for gaps

  • Score reported as Initn
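
Joining regions can be viewed as chaining: pick a subset of initial regions that appear in the same order in both sequences, add their scores, and charge a penalty for each join. The Python below is a small dynamic-programming sketch of that idea, not the FastA implementation. The `Region` type, the function name, and the penalty of 20 are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    q_start: int   # start in the query (0-based)
    q_end: int     # end in the query (exclusive)
    diagonal: int  # subject index minus query index
    score: int     # rescored value of the region


def joined_score(regions, join_penalty: int = 20) -> int:
    """Best total score of a chain of compatible regions."""
    ordered = sorted(regions, key=lambda r: r.q_start)
    best_ending = []  # best chain score ending at each region in `ordered`
    for k, r in enumerate(ordered):
        best = r.score
        for prev_best, p in zip(best_ending, ordered[:k]):
            follows_in_query = p.q_end <= r.q_start
            follows_in_subject = p.q_end + p.diagonal <= r.q_start + r.diagonal
            if follows_in_query and follows_in_subject:
                # Joining across diagonals implies a gap, hence the penalty.
                best = max(best, prev_best + r.score - join_penalty)
        best_ending.append(best)
    return max(best_ending, default=0)
```

When the penalty outweighs the gain, the best chain is a single region and the joined score equals the best individual region score.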


Joined Diagonals


Step 4

  • Align the sequences with the best matches

  • Uses BestFit Algorithm

    • Aligns the joined diagonals from step 3 with the query sequence

  • Score reported as Opt


FastA Summary

Sequence Alignment


Specifying the Word Size

  • 1 to 6 for nt

  • 1 or 2 for aa

  • Smaller words

    • Increased sensitivity

    • Decreased stringency

      • Higher backgrounds

    • Increased CPU time


Output

  • Histogram

    • Shows number of sequences falling in a particular score range

  • Sequence names with scores

  • Alignment of sequences to the query


Features

  • More Sensitive than BLAST (?)

  • Slower than BLAST


analyze% fasta -check

FastA does a Pearson and Lipman search for similarity between a query

sequence and a group of sequences of the same type (nucleic acid or

protein). For nucleotide searches, FastA may be more sensitive than BLAST.

Minimal Syntax: % fasta [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]pir:* specifies the search set

[-OUTfile=]ggamma.fasta specifies the output file name

-BEGin=1 -END=148 sets the range of interest

-WORdsize=2 sets the word size

-EXPect=2.0 lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=fastadna.cmp assigns the scoring matrix for nucleic acids

-MATRix=blosum50.cmp assigns the scoring matrix for proteins


Optional Parameters:

-PROCessors=2 sets the number of threads devoted to the analysis

on a multiprocessor computer

Press q to quit or <Return> for more:

-MINLength=1000 searches only sequences of 1000 or more residues

-MAXLength=5000 searches only sequences of 5000 or fewer residues

-SINce=6.90 limits search to sequences dated on or after June 1990

-ONEstrand searches using only the top strand of nucleotide queries

-PAMfactor uses scoring matrix to calculate initial diagonal scores

-GAPweight=16 sets gap creation penalty (12 is protein default)

-LENgthweight=4 sets gap extension penalty (2 is protein default)

-OPTall=20 computes opt score when the initn score is 20

or higher; sorts on opt score

-NOOPTall doesn't compute opt score during search; sorts on initn

-SWalign creates final alignment as unlimited Smith-Waterman for nuc

-LIStsize=40 shows the best 40 scores (overrides EXPect)

-ALIgn=20 shows the best 20 alignments

-NOALIgn suppresses sequence alignments

-SHOWall shows complete sequences in alignment, not just overlaps

-MARKx=3 sets the alignment display mode

-NOHIStogram suppresses printing the histogram

-LINesize=60 sets number of sequence symbols per line of the alignment

-NODOCLines suppresses sequence documentation in the alignment

-BATch submits the program to run in the batch queue

-NOMONitor suppresses the screen trace for each search set sequence

Add what to the command line ?


FASTA with what query sequence ? pol.pep

Begin (* 1 *) ?

End (* 461 *) ?

Search for query in what sequence(s) (* SwissProt:* *) ?

What word size (* 2 *) ?

Don't show scores whose E() value exceeds: (* 10.0 *):

What should I call the output file (* pol.fasta *) ?

** fasta will run as a batch or at job.

** fasta was submitted using the command:

" atnow "

job class000.894911721.a at Mon May 11 13:35:21 1998


!!SEQUENCE_LIST 1.0

(Nucleotide) FASTA of: pol.seq from: 1 to: 1383 May 11, 1998 12:44

TO: GenEMBL:* Sequences: 436,425 Symbols: 769,709,871 Word Size: 6

Sequences too short to analyze: 25 (113 symbols)

Databases searched:

GenBank, Release 105.0, Released on 15Feb1998, Formatted on 19Feb1998

EMBL, Release 53.0, Released on 16Dec1997, Formatted on 20Feb1998

Searching with both strands of the query.

Scoring matrix: GenRunData:fastadna.cmp

Constant pamfactor used

Gap creation penalty: 16 Gap extension penalty: 4

Histogram Key:

Each histogram symbol represents 1771 search set sequences

Each inset symbol represents 16 search set sequences

z-scores computed from opt scores


z-score obs exp

(=) (*)

< 20 216 0 :*

22 57 0 :*

24 75 1 :*

26 114 18 :*

28 100 197 :*

30 433 1195 :*

32 1343 4619 := *

34 6372 12526 :==== *

36 17948 25726 :=========== *

38 37141 42515 :===================== *

40 65424 59305 :=================================*===

42 82308 72494 :========================================*======

44 106229 79967 :=============================================*==============

46 85717 81448 :=============================================*===

48 80192 77978 :============================================*=

50 64551 71155 :===================================== *

52 54078 62557 :=============================== *

54 48895 53435 :============================ *

56 37786 44634 :====================== *

58 33037 36644 :=================== *

60 26505 29684 :=============== *

62 22308 23798 :=============*

64 18406 18926 :==========*

66 15179 14959 :========*

68 13671 11766 :======*=

70 10703 9221 :=====*=

72 7870 7205 :====*

74 6143 5618 :===*

76 5397 4372 :==*=

78 3856 3399 :=*=

80 3153 2639 :=*

82 2636 2019 :=*

84 2119 1599 :*=

86 1556 1237 :*

88 1236 957 :*

90 1075 741 :*

92 755 573 :* :===================================*====

94 635 443 :* :===========================*============

96 442 343 :* :=====================*======

98 395 266 :* :================*========

100 276 205 :* :============*=====

102 218 159 :* :=========*====

104 164 123 :* :=======*===

106 121 95 :* :=====*==

108 105 74 :* :====*==

110 67 57 :* :===*=

112 61 44 :* :==*=

114 57 34 :* :==*=

116 73 26 :* :=*===

118 50 20 :* :=*==

>120 353 16 :* :*======================


Results sorted and z-values calculated from opt score

1614 scores saved that exceeded 99

471579 optimizations performed

Joining threshold: 62, optimization threshold: 47, opt. width: 16

The best scores are: init1 initn opt z-sc E(867084)..

GB_VI:POL1 Begin: 5987 End: 7369

! J02281 Poliovirus type 1 (Mahoney s... 6915 6915 6915 6938.1 0

GB_VI:POLIO1B Begin: 5987 End: 7369

! V01149 Genome of human poliovirus t... 6915 6915 6915 6938.1 0

GB_VI:POL1B31B Begin: 889 End: 2271

! M17494 Poliovirus type 1 (Mahoney) ... 6879 6879 6879 6908.0 0

GB_VI:POLIO1A Begin: 5979 End: 7361

! V01148 Genome of human poliovirus t... 6879 6879 6879 6901.9 0

GB_VI:POLIOS1 Begin: 5987 End: 7369

! V01150 Genome of human poliovirus, ... 6825 6825 6825 6847.6 0

GB_PAT:I00480 Begin: 1 End: 1227

! I00480 Sequence 9 from Patent US 47... 6135 6135 6135 6162.7 0

GB_VI:PIPOLS2 Begin: 5986 End: 7368

! X00595 Poliovirus type 2 genome (st... 5275 5275 5340 5353.8 0

GB_VI:CXA24CG Begin: 6010 End: 7392

! D90457 Coxsackievirus A24, complete... 5142 5142 5142 5154.6 0

GB_PAT:I22065 Begin: 5978 End: 7360

! I22065 Sequence 1 from patent US 55... 5124 5124 5124 5136.5 0

GB_VI:PIPO3119 Begin: 5978 End: 7360

! X01076 Poliovirus type 3 complete s... 5115 5115 5115 5127.5 0

GB_VI:POL3L37 Begin: 5978 End: 7360

! K01392 Poliovirus P3/Leon/37 (type ... 5115 5115 5115 5127.5 0


FastA Results (nt search): Polio Polymerase vs. GenEMBL

pdf file


FastA Results (protein search): Polio Polymerase vs. SwissProt, word=2

pdf file


FastA Results (protein search): Polio Polymerase vs. SwissProt, word=1

pdf file


TFastA

  • Translates the nucleotide sequence database in all 6 reading frames

  • Searches the translated sequences with a peptide query

  • Algorithm is the same as for FastA
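
The translation step can be sketched in a few lines of Python. This is an illustration only, not the TFastA code. It assumes the standard genetic code and unambiguous A/C/G/T input, and the function names are hypothetical. Frames 1 to 3 are read from the top strand and frames -1 to -3 from the reverse complement.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq: str) -> dict:
    """Map frame number (1, 2, 3, -1, -2, -3) to its translated peptide."""
    top = seq.upper()
    bottom = top.translate(_COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(top[offset:])
        frames[-(offset + 1)] = translate(bottom[offset:])
    return frames
```

Each of the six peptides would then be compared with the query exactly as a protein database entry would be.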


analyze% tfasta -check

TFastA does a Pearson and Lipman search for similarity between a

query peptide sequence and any group of nucleotide sequences. TFastA

translates the nucleotide sequences in all six reading frames before

performing the comparison. It is designed to answer the question, "What

implied peptide sequences in a nucleotide sequence database are similar

to my peptide sequence?"

Minimal Syntax: % tfasta [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]GenEMBL:* search set (all of GenEMBL)

[-OUTfile=]ggamma.tfasta output file name

-BEGin=1 -END=148 range of interest

-WORdsize=2 word size

-EXPect=2.0 lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=blosum50.cmp scoring matrix for peptides


Optional Parameters:

-GAPweight=16 gap creation penalty

-LENgthweight=4 gap extension penalty

-SINce=6.90 limits search to sequences dated on or after June 1990

-THREEFrames translates and searches only the three forward reading

frames

-FRAme=1 translates and searches only the frame specified.

-NOPAMfactor uses a constant factor to calculate initial diagonal scores

-LIStsize=40 shows the best 40 scores (overrides EXPect)

-NOATTRibutes suppresses writing the Begin, End, and Strand

list attributes to the list of best scores

-ALIgn=20 shows the best 20 alignments

-NOALIgn suppresses sequence alignments

-OPTall=20 immediately computes opt score when the initn score is 20

or higher; sorts on opt score

-NOOPTall doesn't compute opt score during search; sorts on initn

-SWalign does final alignment as Smith-Waterman

-SHOWall shows complete sequences in alignment, not just overlaps

-MARKx=3 determines the alignment display mode

-NOHIStogram suppresses printing the histogram

-LINEsize=60 number of sequence symbols per line of the alignment

-NODOCLines suppresses sequence documentation in the alignment

-NOMONitor suppresses the screen trace for each search set sequence

-BATch submits the program to run in the batch queue

-MINLength=1000 searches only sequences of 1000 or more residues

-MAXLength=5000 searches only sequences of 5000 or fewer residues

Add what to the command line ?


TFASTA with what query sequence ? pol.pep

Begin (* 1 *) ?

End (* 461 *) ?

Search for query in what sequence(s) (* GenEMBL:* *) ?

What word size (* 2 *) ?

Don't show scores whose E() value exceeds: (* 10.0 *):

What should I call the output file (* pol.tfasta *) ?

** tfasta will run as a batch or at job.

** tfasta was submitted using the command:

" atnow "

job class000.894911765.a at Mon May 11 13:36:05 1998


TFastA Results: Polio polymerase vs. GenEMBL

pdf file


BLAST

  • Basic Local Alignment Search Tool

  • Altschul, Gish, Miller, Myers, and Lipman

    • NCBI

    • J. Mol. Biol. 215:403 (1990)


BLAST Searches

  • Locate regions of similarity between a query sequence and database sequences

    • High Scoring Segment Pair

    • Starts with a word search comparison

    • Current versions will introduce gaps as necessary

  • Will find multiple regions of similarities between the query and any one database sequence

  • Provides statistical data for similarity significance
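
The significance figure BLAST reports is an expected number of chance hits. Under Karlin-Altschul statistics this is E = K·m·n·e^(-λS) for raw score S, query length m, and database length n. The Python sketch below evaluates that formula. The function name is hypothetical, and `lam` and `k` must be supplied by the caller because they depend on the scoring system. No particular values are implied here.

```python
import math


def expected_chance_hits(raw_score: float, query_len: int, db_len: int,
                         lam: float, k: float) -> float:
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

The formula shows why E grows with database size and falls off exponentially with score, which is how a cutoff such as "ignore hits expected to occur by chance more than 10 times" is applied.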


BLAST Options

  • BLAST

    • Uses local, GCG-supplied databases

    • Uses your own BLAST-formatted databases

      • Format with GCGToBLAST

  • NetBLAST

    • Uses NCBI's BLAST Server

    • Uses most recent version of GenBank

      • Updated daily


NetBLAST

  • NetBLAST automatically submits the sequence to NCBI's BLAST network server

  • Results are returned to a file in your directory


Flavors of BLAST

  • BLASTN

    • nt query vs. nt database

  • BLASTP

    • protein query vs. protein database

  • BLASTX

    • nt query vs. protein database

    • nt query translated in all six frames

  • TBLASTN

    • protein query vs. translated nt database

  • TBLASTX

    • translated nt query vs. translated nt database
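
The choice of flavor follows from the query and database types alone, which the small Python lookup below makes explicit. The function name, the `"nt"`/`"protein"` labels, and the `translate_both` argument are hypothetical conveniences for this sketch.

```python
def choose_blast_flavor(query_type: str, database_type: str,
                        translate_both: bool = False) -> str:
    """Pick a BLAST flavor from sequence types ("nt" or "protein")."""
    table = {
        # nt vs. nt is BLASTN unless both sides are to be translated.
        ("nt", "nt"): "TBLASTX" if translate_both else "BLASTN",
        ("protein", "protein"): "BLASTP",
        ("nt", "protein"): "BLASTX",
        ("protein", "nt"): "TBLASTN",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown sequence types: {query_type!r}, {database_type!r}"
        ) from None
```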


analyze% gcgff

analyze% blast pol.pep

BLAST searches one or more nucleic acid or protein databases

for sequences similar to one or more query sequences of any

type. BLAST can produce gapped alignments for the matches it

finds.

Begin (* 1 *) ?

End (* 461 *) ?

*** ERROR: no databases found!

analyze%


analyze% blast -batch -check pol.pep

BLAST searches one or more nucleic acid or protein databases

for sequences similar to one or more query sequences of any

type. BLAST can produce gapped alignments for the matches it

finds.

Minimal Syntax: % blast [-INfile1=]pir:mywhp -Default

Prompted Parameters:

[-INfile2=]pir specifies database(s) to search

-EXPect=10.0 ignores scores that would occur by chance

more than 10 times

-LIStsize=500 sets maximum number of sequences listed in the output

[-OUTfile=]mywhp.blastp names the output file

Local Data Files:

[-DATa2=blast.ldbs] names the list of available local databases

[-DATa3=blast.sdbs] names the list of available site-specific databases


Optional Parameters:

-PROCessors=1 sets the number of processors to use

-TBLASTX if query and database are both nucleotide,

translates both and does protein comparisons

-DBNucleotideonly searches only nucleic databases

-DBProteinonly searches only protein databases

-WORdsize=0 sets word size (0 selects program default)

-MATch=1 sets nucleotide match reward

-MISmatch=-3 sets nucleotide mismatch penalty

-MATRix=BLOSUM62 assigns the scoring matrix for proteins

-GAPweight=0 sets gap creation penalty

-LENgthweight=0 sets gap extension penalty

-HITEXTTHRESHold=0 sets minimum score to extend hits

-NOFILter suppresses filtering of low complexity segments

out of nucleotide and protein query sequences

-TRANSlate=1 names genetic code for translating query

-DBTRANSlate=1 names genetic code for translating database

-EFFdbsize=0 sets effective database size (0 selects

program default)

-NOFRAgments suppresses showing list file entries as fragments

-ALIgnments=250 sets number of sequences for which to show

alignments

-VIEW=0 selects alignment view type (0-6 allowed)

-NOGAPS suppresses gapped alignments

-XDRopoff=0 sets X dropoff value for gapped alignments

-NATive produces unmodified BLAST2 output

-APPend="string" appends "string" to pass-through command line

-BATch submits program to batch queue


Add what to the command line ?

Begin (* 1 *) ?

End (* 461 *) ?

Search for query in what sequence database:

1) GCGPROT p GCG SeqStore Protein Database

2) GCGNUC n GCG SeqStore Nucleotide Database

3) GCGEST n GCG SeqStore EST Database

Please choose one (* 1 *):

Ignore hits expected to occur by chance more than (* 10.0 *) times?

Limit the number of sequences in my output to (* 500 *) ?

What should I call the output file (* pol.blastp *) ?

** blast will run as a batch or at job.

** blast was submitted using the command:

" atnow "

commands will be executed using /bin/csh

job 989254200.a at Mon May 7 11:50:00 2001

analyze%


Local BLAST Results

text file


analyze% netblast -check pol.pep

NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination.

NetBLAST can search only databases maintained at the National Center for

Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

Minimal Syntax: % netblast [-INfile1=]pir:zizm99 -Default

Prompted Parameters:

[-INfile2=]nr specifies database to search

-EXPect=10.0 ignores scores that would occur by chance

more than 10 times

-LIStsize=250 sets maximum number of sequences listed in

the output

[-OUTfile=]zizm99.netblastp names the output file

Local Data Files:

[-DATa1=netblast.rdbs] names the list of available remote databases

-MATRix=blosum62 assigns a scoring matrix for proteins


Optional Parameters:

-NOFILter suppresses filtering of low complexity

regions out of nucleotide and protein query sequences

-GAPweight=11 sets gap creation penalty

-LENgthweight=1 sets gap extension penalty

-TBLASTX if query and database are both nucleotide,

translates both and does protein comparisons

-TRANSlate=1 names genetic code for translating query

-DBNucleotideonly searches only nucleic databases

-DBProteinonly searches only protein databases

-URL=www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-blast_report

sends HTTP query to NCBI's net server (default)

-MAIL[[email protected]] sends email to NCBI's email server

-ALIgnments=100 sets number of sequences for which to show alignments

-NOGAPS produce ungapped alignments using sum statistics

-BATch submits program to batch queue

-PROXY="gateway.company.com:99/" specifies the host and port of a proxy

server

-APPend="string;string..." appends each string, on a separate line,

to the query (NCBI's email format)

Add what to the command line ?


Search for query in what sequence database:

1) nr p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR

2) pdb p PDB protein sequences

3) swissprot p SwissProt sequences

4) yeast p Saccharomyces cerevisiae protein sequences

5) kabat p Kabat Sequences of Proteins of Immunological Interest

6) alu p Translations of Select Alu Repeats from REPBASE

7) month p All new or revised GenBank CDS translation+PDB+SwissProt+PI

8) ecoli p E. coli genomic CDS translations

9) nr n Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's

10) pdb n PDB nucleotide sequences

11) vector n Vector subset of GenBank

12) yeast n Saccharomyces cerevisiae genomic nucleotide sequences

13) est n Non-redundant Database of GenBank+EMBL+DDBJ EST Division

14) sts n Non-redundant Database of GenBank+EMBL+DDBJ STS Division

15) htgs n High Throughput Genomic Sequences

16) mito n Database of mitochondrial sequences, Rel. 1.0, July 1995

17) kabat n Kabat Sequences of Nucleic Acid of Immunological Interest

18) epd n Eukaryotic Promotor Database

19) alu n Select Alu Repeats from REPBASE

20) month n All new or revised GenBank+EMBL+DDBJ+PDB sequences released

21) gss n Genome Survey Sequence, includes single_pass genomic data,

22) ecoli n E. coli genomic nucleotide sequences.

Please choose one (* 1 *):


Ignore hits expected to occur by chance more than (* 10.0 *) times?

Limit the number of sequences in my output to (* 250 *) ?

What should I call the output file (* pol.netblastp *) ?

Sending query...

Awaiting results...

Done. Wrote search results to pol.netblastp

analyze%


NetBLAST Results: Polio polymerase vs. GenBank nr

text file


Web-based BLAST Searches

  • http://www.ncbi.nlm.nih.gov/BLAST/

    • Gapped BLAST with graphic summary

  • PSI-BLAST

    • Gapped BLAST followed by BLAST using a position-specific scoring matrix

    • More sensitive

    • Repeat as many times as desired


Running NCBI BLAST

  • http://www.ncbi.nlm.nih.gov/blast

    >POL ID POLG_POL1M STANDARD; PRT; 2206 AA.

    GEIPWMRPSKDAGYPIINAPSKTKLEPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFS

    KYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPY

    VAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASS

    LNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDAS

    LSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMI

    NNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKS

    ATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLL

    AWHNGEEEYNKFLAKIRSVPIGRALLLPEYSTLYRRWLDSF


SSearch

Very sensitive database searching


SSearch

  • Rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type

  • This may be the most sensitive method available for similarity searches

  • VERY slow!
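
A Smith-Waterman search fills a full dynamic-programming matrix for every database sequence, which explains both the sensitivity and the running time. The Python below is a minimal score-only sketch, not the SSearch implementation. It assumes simple match/mismatch scores and a single linear gap penalty instead of a comparison table with separate gap creation and extension penalties. The default values are illustrative.

```python
def smith_waterman_score(a: str, b: str, match: int = 5, mismatch: int = -4,
                         gap: int = -6) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous matrix row
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below zero, which makes the alignment local.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

The work is proportional to the product of the two sequence lengths for each comparison, with no word-match prefilter to skip unpromising sequences.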


analyze% ssearch -batch -check pol.pep

SSearch does a rigorous Smith-Waterman search for similarity between

a query sequence and a group of sequences of the same type (nucleic acid

or protein). This may be the most sensitive method available for

similarity searches. Compared to BLAST and FastA, it can be very slow.

Minimal Syntax: % ssearch [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]pir:* specifies the search set

[-OUTfile=]ggamma.ssearch names the output file

-BEGin=1 -END=148 sets the range of interest

-EXPect=2.0 lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=fastadna.cmp assigns the scoring matrix for nucleic acids

-MATRix=blosum50.cmp assigns the scoring matrix for proteins


Optional Parameters:

-PROCessors=2 sets the number of threads devoted to the analysis

on a multiprocessor computer

-MINLength=1000 searches only sequences of 1000 or more residues

-MAXLength=5000 searches only sequences of 5000 or fewer residues

-SINce=6.90 limits search to sequences dated on or after June 1990

-ONEstrand searches using only the top strand of nucleotide queries

-GAPweight=16 sets the gap creation penalty (12 is protein default)

-LENgthweight=4 sets the gap extension penalty (2 is protein default)

-LIStsize=40 shows the best 40 scores (overrides EXPect)

-ALIgn=20 shows the best 20 alignments

-NOALIgn suppresses sequence alignments

-SHOWall shows complete sequences in alignment, not just overlaps

-MARKx=3 sets the alignment display mode

-NOHIStogram suppresses printing the histogram

-LINesize=60 sets number of sequence symbols per line of the alignment

-NODOCLines suppresses sequence documentation in the alignment

-BATch submits the program to run in the batch queue

-NOMONitor suppresses the screen trace for each search set sequence
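
The two gap parameters above combine into a single cost per gap. The sketch below assumes the common convention in which a gap of n symbols costs the creation penalty plus n times the extension penalty. Some programs charge the extension penalty for only n - 1 symbols, so treat the formula as an assumption to check against the program documentation.

```python
def gap_penalty(gap_length, gap_weight=12, length_weight=2):
    """Cost of one gap under an assumed 'creation + length * extension' rule.

    The defaults mirror the protein defaults quoted in the option list
    (creation 12, extension 2); the formula itself is an assumption.
    """
    if gap_length <= 0:
        return 0
    return gap_weight + length_weight * gap_length


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, gap_penalty(n), gap_penalty(n, gap_weight=16, length_weight=4))
```

Raising -GAPweight discourages opening new gaps, while raising -LENgthweight discourages long ones.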


Database searching

Add what to the command line ?

Begin (* 1 *) ?

End (* 461 *) ?

Search for query in what sequence(s) (* PIR:* *) ? gcgprot:*

Don't show scores whose E() value exceeds: (* 10.000000 *):

Maximum number of alignments (* 40 *) ?

What should I call the output file (* pol.ssearch *) ?

** ssearch will run as a batch or at job.

** ssearch was submitted using the command:

" atnow "

commands will be executed using /bin/csh

job 989256600.a at Mon May 7 12:30:00 2001

analyze%


Ssearch output

SSearch Output

Text file


Framesearch

FrameSearch

Optimal alignments including reading frame shifts


Framesearch1

FrameSearch

  • Finds similarities between a protein sequence and a nucleotide sequence database

  • Finds similarities between a nucleotide sequence and a protein sequence database

  • Aligns amino acids to nucleotide codons

  • Allows for frameshifts in the nucleotide sequence(s)
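
As a rough illustration of what "all possible codons on each strand" means, the following sketch translates a nucleotide sequence in all six reading frames. It is not how FrameSearch works internally: FrameSearch aligns the protein to the codons directly and can switch frames inside one alignment, which this sketch does not attempt. The function names are hypothetical. The codon table is the standard genetic code, and the Mycoplasma-style variant, which reads TGA as tryptophan as in NCBI translation table 4, is included as an assumption about what a local translate.txt might encode.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order (first, second, third base).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Assumed variant: TGA read as Trp instead of stop.
MYCOPLASMA_CODE = dict(STANDARD_CODE, TGA="W")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, code=STANDARD_CODE):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(code.get(codon, "X") for codon in codons)


def six_frames(dna, code=STANDARD_CODE):
    """Return translations keyed '+1'..'+3' (top strand) and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:], code)
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCTAATAATTAT").items():
        print(name, protein)
```

A plain six-frame translation breaks down at a sequencing error that inserts or deletes a base, because the correct protein then continues in a different frame. Allowing frameshifts inside the alignment is what lets FrameSearch follow the protein across such errors.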


Running framesearch

Running FrameSearch

  • Takes a very long time to run

  • Run in Batch mode

  • Limit the size of the database


Database searching

analyze% framesearch -check -batch

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

Minimal Syntax: % framesearch [-INfile1=]EST:Atts0012 -Default

Prompted Parameters:

-BEGin1=1 -END1=286 range of interest for a single query sequence

[-INfile2=]SwissProt:* search set

-GAPweight=12 gap creation penalty

-LENgthweight=4 gap extension penalty

-FRAmeweight=0 frameshift gap penalty

[-OUTfile=]atts0012.framesearch output file name

Local Data Files:

-MATRix=blosum62.cmp amino acid substitution matrix

-TRANSlate=translate.txt contains the genetic code


Database searching

Optional Parameters:

-BEGin1=1 -END1=100 range of interest for each query sequence

-ONEstrand searches only the top strand of nucleotide seqs

-LIStsize=40 number of scores to show

-ALIgn=40 number of alignments to show

(-NOALIgn suppresses alignments)

-GLObal searches by global alignment

-ENDWeight penalizes end gaps in global alignments like other gaps

-HIGhroad among equally optimal alignments, shows one with maximum gaps in protein sequence

-LOWroad among equally optimal alignments, shows one with maximum gaps in nucleotide sequence

-LINesize=70 length of documentation for each sequence in the output list

-PAIr=x,2,1 thresholds for displaying '|', ':', and '.'

-WIDth=50 the number of sequence symbols per line

-PAGe=60 adds a line with a form feed every 60 lines

-NOBIGGaps suppresses abbreviation of large gaps with '.'s

-NOPLOt suppresses the plot of the search score distribution

-BATch submits program to the batch queue

-NOMonitor suppresses the screen trace of program progress

-NOSUMmary suppresses the screen summary


Database searching

FRAMESEARCH with what query sequence(s) ? uu001a.seq

Begin (* 1 *) ?

End (* 1371 *) ?

Search for query in what sequence(s) (* SwissProt:* *) ?

*** I read your local translation table "translate.txt"

What is the gap creation penalty (* 12 *) ?

What is the gap extension penalty (* 4 *) ?

What is the frameshift penalty (* 0 *) ?

What should I call the output file (* uu001a.framesearch *) ?

** framesearch will run as a batch or at job.

** framesearch was submitted using the command:

" atnow "

commands will be executed using /bin/csh

job 894913723.a at Mon May 11 14:08:43 1998


Database searching

!!SEQUENCE_LIST 1.0

FRAMESEARCH of: /export/home/lefkowit/temp/uu001a.seq

UU001

TO: sw:dnaa_* Sequences: 31 Total-length: 13,393 May 12, 1998 11:08

Databases searched:

SWISS-PROT, Release 35.0, Released on 13Dec97, Formatted on 13Dec1997

Scoring matrix: GenRunData:blosum62.cmp

Translation table: translate.txt

Gap creation penalty: 12

Gap extension penalty: 4

Frameshift penalty: 0

The best scores are: ..

SW:DNAA_MYCCA P24116 mycoplasma capricolum. chromosomal replicatio... 346

SW:DNAA_MYCGE P35888 mycoplasma genitalium. chromosomal replicatio... 316

SW:DNAA_SPICI P34028 spiroplasma citri. chromosomal replication in... 308

SW:DNAA_BORBU P33768 borrelia burgdorferi (lyme disease spirochete... 297

SW:DNAA_MYCPN Q59549 mycoplasma pneumoniae. chromosomal replicatio... 275

SW:DNAA_MYCMY P35889 mycoplasma mycoides. chromosomal replication ... 264
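
The score list above is easy to post-process once the run finishes. The sketch below is a hypothetical helper, not part of GCG, and it assumes the layout seen in these lines: entry name first, accession second, a free-text description, and an integer score as the final field. Output produced with other options may not fit this layout.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    name: str
    accession: str
    description: str
    score: int


def parse_hit(line: str) -> Optional[Hit]:
    """Parse one 'best scores' line, or return None if it does not fit.

    Assumed layout: name, accession, description words, integer score.
    """
    fields = line.split()
    if len(fields) < 3 or not fields[-1].isdigit():
        return None
    name, accession, *description, score = fields
    return Hit(name, accession, " ".join(description), int(score))


if __name__ == "__main__":
    sample = ("SW:DNAA_MYCCA P24116 mycoplasma capricolum. "
              "chromosomal replicatio... 346")
    print(parse_hit(sample))
```

Lines that do not end in an integer, such as the "The best scores are: .." header, come back as None and can be skipped.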


Database searching

uu001a.seq

DNAA_MYCCA

Quality: 346 Length: 882

Ratio: 1.197 Gaps: 5

Percent Similarity: 44.406 Percent Identity: 33.916

. . . . .

438 AACCCTTTATTTTTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGT 487

||||||||||||::::::|||... |||...||||||||||||:::..

142 AsnProLeuPheIleTyrGlyGluSerGlyMetGlyLysThrHisLeuLe 158

. . . . .

488 GGCTGCTGCTGGTAATCGTTTTGCTAATAGTAA.TCCTAATTTAAAATTT 536

. |||||| ||| ...... ||| |||:::

159 uLysAlaAlaLysAsnTyrIleGluSerAsnPheSerAspLeuLysValS 175

. . . . .

537 ATTATTATGAAGGGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTA 586

||| ||| :::||| ||||||

176 erTyrMetSerGlyAspGluPheAlaArgLysAlaValAspIleLeuGln 191

. . . . .

587 AAAGGGACTAGTTATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGA 636

||| :::|||:::|||||| |||::: ||

192 LysThrHisLysGluIleGluGlnPheLysAsnGluValCysGlnAsnAs 208

. . . . .

637 TTTATTAATTTTTGAAGATATTCAAAATATCCAATCACGTGATTCAACGG 686

|...|||||| :::|||:::||| ::: :::::: |||

209 pValLeuIleIleAspAspValGlnPheLeuSerTyrLysGluLysThrA 225

. . . . .

687 CTGAATTGTTTTTTAATATCTTTAATGATATAAAATTAAATGGTGGAAAA 736

|||:::|||||| |||||||||... ||| ...

226 snGluIlePhePheThrIlePheAsnAsnPheIleGluAsnAspLysGln 241


Framealign

FrameAlign

  • Align a protein sequence to the codons in all possible reading frames of a nucleotide sequence

  • Allows for frameshifts

  • Local or Global alignment


Database searching

analyze% framealign -check

FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames of a nucleotide sequence. Optimal alignments may include reading frame shifts.

Minimal Syntax: % framealign [-INfile1=]EST:Atts0012 \

[-INfile2=]SW:G3pc_Arath -Default

Prompted Parameters:

-BEGin1=1 -END1=286 range of interest for first sequence

-BEGin2=1 -END2=338 range of interest for second sequence

-REVerse strand for nucleotide sequence

-GAPweight=12 gap creation penalty

-LENgthweight=4 gap extension penalty

-FRAmeweight=0 frameshift gap penalty

[-OUTfile1=]gamma.pair output file for alignment

Local Data Files:

-MATRix=blosum62.cmp amino acid substitution matrix

-TRANSlate=translate.txt contains the genetic code


Database searching

Optional Parameters:

-GLObal creates global alignment (default is local)

-ENDWeight penalizes end gaps in global alignments like other gaps

-LIMit1=337 gap shift limit for nucleotide sequence

-LIMit2=285 gap shift limit for protein sequence

-HIGhroad among equally optimal alignments, shows one with maximum gaps in protein sequence

-LOWroad among equally optimal alignments, shows one with maximum gaps in nucleotide sequence

-PAIr=x,2,1 thresholds for displaying '|', ':', and '.'

-WIDth=50 the number of sequence symbols per line

-PAGe=60 adds a line with a form feed every 60 lines

-NOBIGGaps suppresses abbreviation of large gaps with '.'s

-OUTfile2[=atts0012.gap] new file for nucleotide sequence with gaps added

-OUTfile3[=g3pc_arath.gap] new file for protein sequence with gaps added

-BATch submits program to the batch queue

-NOMonitor suppresses the screen trace of program progress

-NOSUMmary suppresses the screen summary

Add what to the command line ?


Database searching

Local alignment of what sequence 1 ? uu001.pep

Begin (* 1 *) ?

End (* 457 *) ?

to what nucleotide sequence ? uu001a.seq

Begin (* 1 *) ?

End (* 1371 *) ?

Reverse (* No *) ?

*** I read your local translation table "translate.txt"

What is the gap creation penalty (* 12 *) ?

What is the gap extension penalty (* 4 *) ?

What is the frameshift penalty (* 0 *) ?

What should I call the paired output display file (* uu001.pair *) ? uu001.fran

Aligning ......................-.......................

Gaps: 3

Quality: 2285

Quality Ratio: 5.011

% Similarity: 100.000

Length: 1371


Database searching

Local alignment of: uu001a.seq check: 9730 from: 1 to: 1371

UU001

to: uu001.pep check: 6522 from: 1 to: 457

Scoring matrix: /export/home0/gcg/gcgcore/data/rundata/blosum62.cmp

Translation table: /export/home/lefkowit/temp/translate.txt

This file contains the Mold, Protozoan, and Coelenterate Mitochondrial and the Mycoplasma/Spiroplasma Code translation table, specified in the Feature Definition, Version 1.08, formatted for use with GCG programs. It names amino acids in both one and three-letter form and lists the codons which should translate into . . .

Gap Weight: 12 Average Match: 2.912

Length Weight: 4 Average Mismatch: -2.003

Frameshift Weight: 0

Quality: 2285 Length: 1371

Ratio: 5.011 Gaps: 3

Percent Similarity: 100.000 Percent Identity: 100.000

Match display thresholds for the alignment(s):

| = IDENTITY

: = 2

. = 1
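
The display thresholds above can be read as a small decision rule for the symbol printed between each aligned pair. The sketch below is an interpretation, not GCG code. It assumes that identity is checked first and that the numeric thresholds are compared with "greater than or equal" against the pair's score from the substitution matrix. The scores in the example are made up for illustration and are not taken from blosum62.cmp.

```python
def pair_symbol(residue_a, residue_b, score, strong=2, weak=1):
    """Choose the match-line symbol for one aligned pair of residues.

    Assumed rule: '|' for identical residues, ':' when the comparison
    score is at least `strong`, '.' when it is at least `weak`, and a
    blank otherwise. The defaults follow the -PAIr=x,2,1 setting.
    """
    if residue_a == residue_b:
        return "|"
    if score >= strong:
        return ":"
    if score >= weak:
        return "."
    return " "


if __name__ == "__main__":
    # Hypothetical (residue, residue, score) triples for illustration only.
    pairs = [("L", "L", 4), ("I", "L", 2), ("S", "T", 1), ("A", "W", -3)]
    print("".join(pair_symbol(a, b, s) for a, b, s in pairs))
```

In the codon alignments shown in this section each symbol appears three times in a row, once per nucleotide of the codon aligned to the amino acid.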


Database searching

. . . . .

1 ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCC 50

||||||||||||||||||||||||||||||||||||||||||||||||||

1 MetAlaAsnAsnTyrGlnThrLeuTyrAspSerAlaIleLysArgIlePr 17

. . . . .

51 ATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAA 100

||||||||||||||||||||||||||||||||||||||||||||||||||

18 oTyrAspLeuIleSerAspGlnAlaTyrAlaIleLeuGlnAsnAlaLysT 34

. . . . .

101 CTCATAAGTT.TGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTT 149

|||||||| |||||||||||||||||||||||||||||||||||||||

35 hrHisLysValCysAspGlyValLeuTyrIleIleValAlaAsnAlaPhe 50

. . . . .

150 GAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCT 199

||||||||||||||||||||||||||||||||||||||||||||||||||

51 GluLysSerIleIleAsnGlyAsnPheIleAsnIleIleSerLysTyrLe 67

. . . . .

200 AAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAG 249

||||||||||||||||||||||||||||||||||||||||||||||||||

68 uSerGluGluPheLysLysGluAsnIleValAsnPheGluPheIleIleA 84

. . . . .

250 ACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAAT 299

||||||||||||||||||||||||||||||||||||||||||||||||||

85 spAsnGluLysLeuLeuIleAsnSerAsnPheLeuIleLysGluThrAsn 100


Database searching

Next

Multiple Sequence Analysis

