Pipelines

paxton
Presentation Transcript


  1. Pipelines

  2. Input (keyboard, file, or pipe) → Program → Output (screen, file, or pipe)

  3. The “echo” program takes the text given as its arguments and writes it to the output.
     Input (keyboard, file, or pipe) → echo → Output (screen, file, or pipe)

  4. The “cat” program reads text from the input and writes it to the output.
     Input (keyboard, file, or pipe) → cat → Output (screen, file, or pipe)

  5. $ echo uniprot_sprot_plants.fasta
     uniprot_sprot_plants.fasta

  6. $ cat uniprot_sprot_plants.fasta
     >sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
     MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP
     TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN
     >sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
     MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSLRIQSEGGTTELWDE
     RQEQFQCAGIVAMRSTIRPNGLSLPNYHPSPRLVYIERGQGLISIMVPGCAETYQVHRSQ
     RTMERTEASEQQDRGSVRDLHQKVHRLRQGDIVAIPSGAAHWCYNDGSEDLVAVSINDVN
     HLSNQLDQKFRAFYLAGGVPRSGEQEQQARQTFHNIFRAFDAELLSEAFNVPQETIRRMQ
     SEEEERGLIVMARERMTFVRPDEEEGEQEHRGRQLDNGLEETFCTMKFRTNVESRREADI
     FSRQAGRVHVVDRNKLPILKYMDLSAEKGNLYSNALVSPDWSMTGHTIVYVTRGDAQVQV
     VDHNGQALMNDRVNQGEMFVVPQYYTSTARAGNNGFEWVAFKTTGSPMRSPLAGYTSVIR
     AMPLQVITNSYQISPNQAQALKMNRGSQSFLLSPGGRRS
     >sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
     MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEALEPIEVIQAEAGVTEIWDAYD
     QQFQCAWSILFDTGFNLVAFSCLPTSTPLFWPSSREGVILPGCRRTYEYSQEQQFSGEGG
     RRGGGEGTFRTVIRKLENLKEGDVVAIPTGTAHWLHNDGNTELVVVFLDTQNHENQLDEN
     QRRFFLAGNPQAQAQSQQQQQRQPRQQSPQRQRQRQRQGQGQNAGNIFNGFTPELIAQSF
     NVDQETAQKLQGQNDQRGHIVNVGQDLQIVRPPQDRRSPRQQQEQATSPRQQQEQQQGRR
     GGWSNGVEETICSMKFKVNIDNPSQADFVNPQAGSIANLNSFKFPILEHLRLSVERGELR
     PNAIQSPHWTINAHNLLYVTEGALRVQIVDNQGNSVFDNELREGQVVVIPQNFAVIKRAN
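The difference between the two commands can be tried on a small scale. The following is a minimal sketch, assuming a made-up two-entry file named sample.fasta (the file name and its contents are hypothetical and only stand in for uniprot_sprot_plants.fasta):

```shell
# Create a small, hypothetical FASTA file to experiment with.
# (The ">" before the file name sends printf's output into that file.)
printf '%s\n' '>seq1 demo protein' 'MASVKSSS' \
              '>seq2 another demo protein' 'MVAFKFLL' > sample.fasta

# echo writes its arguments (here just the file name) to the output.
echo sample.fasta

# cat reads the file itself and writes its contents to the output.
cat sample.fasta
```

Here echo prints only the name sample.fasta, while cat prints the four lines stored in the file.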

  7. The “grep” program filters the input for given terms and writes the filtered text to the output.
     Input (keyboard, file, or pipe) → grep → Output (screen, file, or pipe)

  8. $ grep --help
     Usage: grep [OPTION]... PATTERN [FILE]...
     Search for PATTERN in each FILE or standard input.
     Example: grep -i 'hello world' menu.h main.c
     Regexp selection and interpretation:
       -E, --extended-regexp     PATTERN is an extended regular expression
       -F, --fixed-strings       PATTERN is a set of newline-separated strings
       -G, --basic-regexp        PATTERN is a basic regular expression
       -P, --perl-regexp         PATTERN is a Perl regular expression
       -e, --regexp=PATTERN      use PATTERN as a regular expression
       -f, --file=FILE           obtain PATTERN from FILE
       -i, --ignore-case         ignore case distinctions
       -w, --word-regexp         force PATTERN to match only whole words
       -x, --line-regexp         force PATTERN to match only whole lines
       -z, --null-data           a data line ends in 0 byte, not newline

  9. $ grep sp uniprot_sprot_plants.fasta
     >sp|Q43495|108_SOLLC Protein 108 OS=Solanum lycopersicum PE=2 SV=1
     >sp|Q9XHP0|11S2_SESIN 11S globulin seed storage protein 2 OS=Sesamum indicum PE=2 SV=1
     >sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus annuus GN=HAG3 PE=3 SV=1
     >sp|P13744|11SB_CUCMA 11S globulin subunit beta OS=Cucurbita maxima PE=1 SV=1
     >sp|Q05349|12KD_FRAAN Auxin-repressed 12.5 kDa protein OS=Fragaria ananassa PE=2 SV=1
     >sp|O23878|13S1_FAGES 13S globulin seed storage protein 1 OS=Fagopyrum esculentum GN=FA02 PE=2 SV=1
     >sp|O23880|13S2_FAGES 13S globulin seed storage protein 2 OS=Fagopyrum esculentum GN=FA18 PE=2 SV=1
     >sp|Q9XFM4|13S3_FAGES 13S globulin seed storage protein 3 OS=Fagopyrum esculentum GN=FAGAG1 PE=1 SV=1
     >sp|P83004|13SB_FAGES 13S globulin basic chain OS=Fagopyrum esculentum PE=1 SV=1
     >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
     >sp|P93207|14310_SOLLC 14-3-3 protein 10 OS=Solanum lycopersicum GN=TFT10 PE=2 SV=2
     >sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
     >sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
     >sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
     >sp|P49106|14331_MAIZE 14-3-3-like protein GF14-6 OS=Zea mays GN=GRF1 PE=1 SV=1
     >sp|Q84J55|14331_ORYSJ 14-3-3-like protein GF14-A OS=Oryza sativa subsp. japonica GN=GF14A PE=2 SV=1
     >sp|P85938|14331_PSEMZ 14-3-3-like protein 1 (Fragments) OS=Pseudotsuga menziesii PE=1 SV=1
     >sp|P93206|14331_SOLLC 14-3-3 protein 1 OS=Solanum lycopersicum GN=TFT1 PE=3 SV=2
     >sp|Q41418|14331_SOLTU 14-3-3-like protein OS=Solanum tuberosum PE=2 SV=1
     >sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=
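Two of the options from the help text, -i and -w, can be tried in the same way. This is a small sketch on a hypothetical file sample.fasta; the accession numbers X1 and X2 and the protein names are invented for illustration:

```shell
# Hypothetical two-entry FASTA file in the UniProt header style.
printf '%s\n' '>sp|X1|A_ARATH Demo protein A OS=Arabidopsis thaliana' 'MASV' \
              '>sp|X2|B_MAIZE Demo protein B OS=Zea mays' 'MVAF' > sample.fasta

# Plain pattern: keeps only lines containing "sp". The sequence lines are
# upper case and grep is case sensitive, so only the headers match.
grep sp sample.fasta

# -i ignores case, so THALIANA also matches "thaliana".
grep -i THALIANA sample.fasta

# -w matches whole words only.
grep -w Zea sample.fasta
```

The first command prints both header lines, the second only the Arabidopsis header, and the third only the Zea mays header.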

  10. Redirection: by placing a “>” followed by a file name at the end of the command line, the output can be redirected to a file.

  11. $ grep sp uniprot_sprot_plants.fasta > out.txt

  12. The “wc” program counts lines or characters in the input and writes the count to the output.
      Input (keyboard, file, or pipe) → wc → Output (screen, file, or pipe)

  13. $ wc -l uniprot_sprot_plants.fasta
      250177 uniprot_sprot_plants.fasta
      $ wc -l out.txt
      33851 out.txt
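Redirection and counting can be combined on a small test file before running them on the full database. A minimal sketch, again assuming a hypothetical sample.fasta with invented entries:

```shell
# Hypothetical two-entry FASTA file (4 lines in total).
printf '%s\n' '>sp|X1|A_ARATH Demo protein A OS=Arabidopsis thaliana' 'MASV' \
              '>sp|X2|B_MAIZE Demo protein B OS=Zea mays' 'MVAF' > sample.fasta

# Redirect the filtered header lines into out.txt instead of the screen.
grep sp sample.fasta > out.txt

# Running the command again overwrites out.txt; ">" does not append.
grep sp sample.fasta > out.txt

# Count the lines in the original file and in the filtered file.
wc -l sample.fasta
wc -l out.txt
```

The two counts are 4 and 2: the file holds four lines, of which two are headers.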

  14. Creating a pipeline: with the “|” character, the output of one program can be linked to the input of another program.

  15. Pipeline: input → grep → (output of grep = input of wc) → wc → output

  16. $ grep sp uniprot_sprot_plants.fasta | wc -l
      33851

  17. $ grep sp uniprot_sprot_plants.fasta | grep thaliana
      >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=2 SV=1
      >sp|Q9S9Z8|14311_ARATH 14-3-3-like protein GF14 omicron OS=Arabidopsis thaliana GN=GRF11 PE=2 SV=1
      >sp|Q9C5W6|14312_ARATH 14-3-3-like protein GF14 iota OS=Arabidopsis thaliana GN=GRF12 PE=2 SV=1
      >sp|P42643|14331_ARATH 14-3-3-like protein GF14 chi OS=Arabidopsis thaliana GN=GRF1 PE=1 SV=3
      >sp|Q01525|14332_ARATH 14-3-3-like protein GF14 omega OS=Arabidopsis thaliana GN=GRF2 PE=1 SV=2
      >sp|P42644|14333_ARATH 14-3-3-like protein GF14 psi OS=Arabidopsis thaliana GN=GRF3 PE=1 SV=2
      >sp|P46077|14334_ARATH 14-3-3-like protein GF14 phi OS=Arabidopsis thaliana GN=GRF4 PE=1 SV=2
      >sp|P42645|14335_ARATH 14-3-3-like protein GF14 upsilon OS=Arabidopsis thaliana GN=GRF5 PE=1 SV=2
      >sp|P48349|14336_ARATH 14-3-3-like protein GF14 lambda OS=Arabidopsis thaliana GN=GRF6 PE=1 SV=1
      >sp|Q96300|14337_ARATH 14-3-3-like protein GF14 nu OS=Arabidopsis thaliana GN=GRF7 PE=1 SV=1
      >sp|P48348|14338_ARATH 14-3-3-like protein GF14 kappa OS=Arabidopsis thaliana GN=GRF8 PE=2 SV=2
      >sp|Q96299|14339_ARATH 14-3-3-like protein GF14 mu OS=Arabidopsis thaliana GN=GRF9 PE=1 SV=2
      >sp|Q9LQ10|1A110_ARATH Probable aminotransferase ACS10 OS=Arabidopsis thaliana GN=ACS10 PE=2 SV=1
      >sp|Q9S9U6|1A111_ARATH 1-aminocyclopropane-1-carboxylate synthase 11 OS=Arabidopsis thaliana GN=ACS11 PE=1 SV=1
      >sp|Q8GYY0|1A112_ARATH Probable aminotransferase ACS12 OS=Arabidopsis thaliana GN=ACS12 PE=2 SV=2
      >sp|Q06429|1A11_ARATH 1-aminocyclopropane-1-carboxylate synthase-like protein 1 OS=Arabidopsis thaliana GN=ACS1 PE=1 SV=2
      >sp|Q06402|1A12_ARATH 1-aminocyclopropane-1-carboxylate synthase 2 OS=Arabidopsis thaliana GN=ACS2 PE=1 SV=1
      >sp|Q43309|1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 OS=Arabidopsis thaliana GN=ACS4 PE=1 SV=1
      >sp|Q37001|1A15_ARATH 1-aminocyclopropane-1-carboxylate synthase 5 OS=Arabidopsis thaliana GN=ACS5 PE=1 SV=1
      >sp|Q9SAR0|1A16_ARATH 1-aminocyclopropane-1-carboxylate synthase 6 OS=Arabidopsis thaliana GN=ACS6 PE=1 SV=2
      >sp|Q9STR4|1A17_ARATH 1-aminocyclopropane-1-carboxylate synthase 7 OS=Arabidopsis thaliana GN=ACS7 PE=1 SV=1
      >sp|Q9T065|1A18_ARATH 1-aminocyclopropane-1-carboxylate synthase 8 OS=Arabidopsis thaliana GN=ACS8 PE=1 SV=1
      >sp|Q9M2Y8|1A19_ARATH 1-aminocyclopropane-1-carboxylate synthase 9 OS=Arabidopsis thaliana GN=ACS9 PE=1 SV=1
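A pipeline is not limited to two programs. The sketch below chains the two grep filters with wc to count the matching entries; sample.fasta and its three entries are hypothetical stand-ins for the real database file:

```shell
# Hypothetical three-entry FASTA file: two Arabidopsis entries, one maize entry.
printf '%s\n' '>sp|X1|A_ARATH Demo protein A OS=Arabidopsis thaliana' 'MASV' \
              '>sp|X2|B_ARATH Demo protein B OS=Arabidopsis thaliana' 'MKGP' \
              '>sp|X3|C_MAIZE Demo protein C OS=Zea mays' 'MVAF' > sample.fasta

# Stage 1 keeps the header lines, stage 2 keeps the thaliana headers,
# stage 3 counts what is left. No intermediate file is needed.
grep sp sample.fasta | grep thaliana | wc -l
```

The pipeline reports 2, the number of Arabidopsis thaliana entries in the sample file.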

  18. Pipe or Keyboard → stdin → Program → stdout → Pipe or Screen

  19. Special output channel for error messages: stderr.
      Pipe or Keyboard → stdin → Program → stdout → Pipe or Screen
      Program → stderr (error messages)

  20. $ grep sp uniprot_sprot_plants.fas > out.txt
      grep: uniprot_sprot_plants.fas: No such file or directory
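The error message above still reaches the screen because “>” redirects only stdout. The sketch below reproduces this with a deliberately wrong, hypothetical file name (no_such_file.fasta). It also goes one step beyond the slide by using the standard shell operator “2>” to send stderr to a file named err.txt (also a made-up name):

```shell
# Only stdout is redirected: out.txt is created but stays empty, and the
# error message is still written to the screen via stderr.
# (grep exits with an error status here; "|| true" only keeps scripts
# that stop on errors running.)
grep sp no_such_file.fasta > out.txt || true
wc -c out.txt

# "2>" redirects stderr as well, so nothing is shown on the screen.
grep sp no_such_file.fasta > out.txt 2> err.txt || true

# The error message is now stored in err.txt.
cat err.txt
```

Keeping the two channels separate means that error messages do not end up mixed into the data passed on to a file or to the next program in a pipeline.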

  21. EMBOSS: "European Molecular Biology Open Software Suite", a toolbox with bioinformatics applications. http://emboss.sourceforge.net/

  22. http://emboss.bioinformatics.nl/

  23. $ wossname "open reading frame"
      Finds programs by keywords in their short description
      SEARCH FOR 'OPEN READING FRAME'
      getorf   Finds and extracts open reading frames (ORFs)
      plotorf  Plot potential open reading frames in a nucleotide sequence

  24. $ wossname documentation
      Finds programs by keywords in their short description
      SEARCH FOR 'DOCUMENTATION'
      tfm  Displays full documentation for an application

  25. $ tfm getorf
      getorf
      Function
        Finds and extracts open reading frames (ORFs)
      Description
        This program finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF may be defined as a region of a specified minimum size between two STOP codons, or between a START and a STOP codon. The ORFs can be output as the nucleotide sequence or as the protein translation. Optionally, the program will output the region around the START codon, the first STOP codon, or the final STOP codon of an ORF. The START and STOP codons are defined in a Genetic Code table; a suitable table can be selected for the organism you are investigating. The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).

  26. Command line options: all EMBOSS programs have a number of command line options. To get started:
      -help    Get help
      -stdout  Write to standard output
      -filter  Read stdin, write stdout

  27. $ getorf -help
      Standard (Mandatory) qualifiers:
        [-sequence]  seqall     Nucleotide sequence(s) filename and optional format, or reference (input USA)
        [-outseq]    seqoutall  [<sequence>.<format>] Protein sequence set(s) filename and optional format (output USA)
      Additional (Optional) qualifiers:
        -table       menu       [0] Code to use (Values: 0 (Standard); 1 (Standard (with alternative initiation codons)); 2 (Vertebrate Mitochondrial); 3 (Yeast Mitochondrial); 4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma); 5 (Invertebrate Mitochondrial); 6 (Ciliate Macronuclear and Dasycladacean); 9 (Echinoderm Mitochondrial); 10 (Euplotid Nuclear); 11 (Bacterial); 12 (Alternative Yeast Nuclear); 13 (Ascidian Mitochondrial); 14 (Flatworm Mitochondrial); 15 (Blepharisma Macronuclear); 16 (Chlorophycean Mitochondrial); 21 (Trematode Mitochondrial); 22 (Scenedesmus obliquus); 23 (Thraustochytrium Mitochondrial))
        -minsize     integer    [30] Minimum nucleotide size of ORF to report (Any integer value)

  28. $ cat example1.fasta | getorf -filter -find 1
      >BTBSCRYR_1 [72 - 110] Bovine mRNA for lens beta-s-crystallin...
      MTAIATVQISTCT
      >BTBSCRYR_2 [11 - 544] Bovine mRNA for lens beta-s-crystallin...
      MSKAGTKITFFEDKNFQGRHYDSDCDCADFHMYLSRCNSIRVEGGTWAVYERPNFAGYMY
      ILPRGEYPEYQHWMGLNDRLSSCRAVHLSSGGQYKLQIFEKGDFNGQMHETTEDCPSIME
      QFHMREVHSCKVLEGAWIFYELPNYRGRQYLLDKKEYRKPVDWGAASPAVQSFRRIVE
      >BTBSCRYR_3 [159 - 590] Bovine mRNA for lens beta-s-crystallin...
      MKGPILLGTCTSYPGASILSTSTGWASTTASAPAGLFTCLVEASISFRSLRKGILMVRCM
      RPRKTALPSWSSSTCGRSTPVRCWRAPGSSMSCPTTEAGSTCWTRRSTGSPSTGVQLPQL
      SSLSAALWSDDTDAAKRWLALSSK
      >BTBSCRYR_4 [547 - 603] Bovine mRNA for lens beta-s-crystallin...
      MIQMRPNAGWPCHPNKHYK
      >BTBSCRYR_5 [618 - 445] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
      MPIVLFIMLIWMTRPASVWPHLYHHSTMRRKDWTAGEAAPQSTGFRYSFLSSRYCLPR
      >BTBSCRYR_6 [381 - 331] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
      MWNCSMMEGQSSVVSCI
      >BTBSCRYR_7 [337 - 197] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
      MHLTIKIPFLKDLKLILASTRQVNSPAGAEAVVEAHPVLVLRILAPG
      >BTBSCRYR_8 [192 - 73] (REVERSE SENSE) Bovine mRNA for lens beta-s-crystallin...
      MYMYPAKLGLSYTAQVPPSTLMELQRLRYMWKSAQSQSLS

  29. Exercise: make a pipeline that reports (only) the size in residues of the longest protein in this file: uniprot_sprot_plants.fasta
      It can be done using these applications as building blocks: sizeseq, nthseq, pepstats, grep, cut

  30. http://main.g2.bx.psu.edu/
