Unix for Bioinformaticists: Unix Tools, Emacs, and Perl


  1. Unix for Bioinformaticists: Unix Tools, Emacs, and Perl. helpdesk at stat.rice.edu, Aug 2004. Some slides are borrowed from Dr. Worley's (BCM) presentation.

  2. Do I Have to Know/Use Unix? • Simple answer: no. • Windows can do almost everything. • Complicated answer: yes, if you • are lazy (would like to automate things) • are good at reading manuals and writing scripts • want to make better use of your machine • are as poor as I am (cannot afford pricey Windows software) • especially if you will be a bioinformaticist

  3. Why Unix Is Useful in Bioinformatics • Many tasks involve processing large text-based datasets. In many cases Unix tools handle these better than their Windows counterparts. • You may need several tools to accomplish a task. Windows is not particularly good at gluing them together. • When you need more CPU power, servers and clusters are usually *nix-based. • Many tools are available only under Unix-like systems.

  4. Outline • Unix in general • Unix tools • Emacs • Perl

  5. Unix Commands
     Single command:
       > sort -k1 file.txt
     Combined with other commands:
       > sort -k1 file.txt | grep "Tag=Mouse" > output.txt
     Operating on multiple files (csh):
       > foreach file (*.txt)
           sort -k1 $file > ${file:r}_sorted.txt
         end

  6. More commands
       > rename .html .htm *.html
     There are many such convenient tools. If you cannot find one, a short script does the job (csh):
       > foreach f (*.html)
           mv $f ${f:r}.htm
         end
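The foreach loops on the two slides above are csh syntax. As a minimal sketch, here is the same rename loop in POSIX sh. The scratch directory and the file names a.html and b.html are assumptions made for illustration, not part of the original slides.

```shell
#!/bin/sh
# Work in a throwaway directory so that nothing real gets renamed.
workdir=$(mktemp -d)
touch "$workdir/a.html" "$workdir/b.html"   # hypothetical sample files

# Rename every .html file to .htm.
for f in "$workdir"/*.html; do
    # ${f%.html} strips the suffix, playing the role of :r in csh
    mv "$f" "${f%.html}.htm"
done

renamed=$(ls "$workdir")    # capture the resulting file names
echo "$renamed"
rm -rf "$workdir"
```

The `${f%.html}` expansion removes only the named suffix, so a file such as `notes.v2.html` would become `notes.v2.htm`.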

  7. More commands
       > wget -r -l1 --no-parent -A.tar.gz -Ppackages http://cran.r-project.org/src/contrib/PACKAGES.html
     Downloads all .tar.gz files linked from that page into the packages directory. wget can do what site downloaders such as Teleport do under Windows.
       > convert -rotate 90 file.jpg file.png
     Rotates file.jpg by 90 degrees and saves the result in .png format.

  8. A shell script: lyx2pdf
       > lyx2pdf myfile.lyx
     Contents of lyx2pdf:
       #!/bin/csh
       # strip the extension from the first argument
       set file = $1:r
       lyx --export latex $file.lyx
       latex $file.tex
       dvips -o $file.ps $file.dvi
       ps2pdf $file.ps

  9. A Makefile (in the real file, each command line must begin with a tab)
       %.html: %.tex
               latex2html -local_icons -no_subdir -split 0 $*.tex
       %.tex: %.lyx
               lyx2tex $*.lyx
       %.dvi: %.tex
               latex $*.tex
       %.ps: %.dvi
               dvips -o $*.ps $*.dvi
       %.pdf: %.ps
               ps2pdf $*.ps
     Usage:
       > make file.dvi
       > make file.ps
       > make file.pdf

  10. A Perl Script
       #!/usr/bin/perl
       # read the whole input at once instead of line by line
       undef $/;
       # read in the file and capture the text between /* and */
       ($comm) = <> =~ /.*\/\*(.*)\*\//ms;
       # print the comment
       print $comm, "\n";

  11. crontab
       # do not forget to renew your library books (runs at midnight on July 15)
       0 0 15 7 * mail bpeng@rice.edu %subject reminder Renew all the books!
       # back up your files to the server every day at 6 AM
       0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile

  12. Graphviz
       > dot -Tps try.dot -o try.eps
     File: try.dot
       digraph G {
         A->B->C
         B->D->C
       }

  13. Useful (and free) tools • Servers: Apache, OpenSSH, OpenLDAP • Web: Mozilla/Firefox, Konqueror, Lynx • Mail clients: Pine, Mutt, Mozilla/Thunderbird, KMail, Evolution • Text processing: teTeX/LyX, OpenOffice, KOffice • Languages: gcc, Perl, Python, gmake, KDevelop • Scientific libraries and tools: GNU Scientific Library, Biopython, BioPerl, R, Graphviz, gnuplot, Octave • Misc: VNC, wget,

  14. Unix text-processing tools • Access to Unix • Mac OS X + developers kit • Linux • Stat and ruf/owlnet servers (Solaris) • Windows + Cygwin • Tools: compared with Excel, they are faster and operate on larger files • grep, pipes, sort, comm, diff, join • sed: an editor for regular-expression substitution, replaced by Perl in most contexts • man: shows the manual page, with options, for most commands (if the pages are installed and match the installed version)

  15. Grep • Grab lines that match a text phrase • Only the line that matches • Lines before or after the matched line • Lines that do not match • Piping multiple searches

  16. GenBank Files

  17. Grab the Locus, Definition and Keyword lines phase2.txt.out temp

  18. Select Non-Human Definition Lines and Use Pipe kworley% grep -v Homo temp | grep DEF

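  19. Specify Lines to Return • grep -1: one line of context before and after each match • grep -B1: one line before each match • grep -A1: one line after each match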
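The grep slides can be tried on a small file. The sketch below assumes a made-up file named temp holding GenBank-style LOCUS, DEFINITION and KEYWORDS lines; the records are invented for illustration. The -A and -B options are not part of POSIX but are available in GNU and BSD grep.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Hypothetical sample in the spirit of the temp file from the slides.
cat > "$workdir/temp" <<'EOF'
LOCUS       AB000001
DEFINITION  Homo sapiens mRNA for example protein.
KEYWORDS    .
LOCUS       AB000002
DEFINITION  Mus musculus mRNA for example protein.
KEYWORDS    .
EOF

# Non-human definition lines: the second grep reads the output of the
# first one through the pipe, so it takes no file argument.
nonhuman=$(grep -v Homo "$workdir/temp" | grep DEF)

# The matching line plus one line before it (-B1) or after it (-A1).
before=$(grep -B1 Mus "$workdir/temp")
after=$(grep -A1 Mus "$workdir/temp")

printf '%s\n' "$nonhuman" "--" "$before" "--" "$after"
rm -rf "$workdir"
```

Giving the second grep its own file name would make it ignore the pipe and search the whole file again, which is why only the first command names temp.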

  20. Sort • In dictionary (-d), month (-M), or numerical (-n) order • Ignore case (-f) • Specify output file (-o) • Specify the separator between fields (-t) • Unique lines only (-u) • Specify the field on which to sort (-k POS1[,POS2]); fields are numbered starting from 1, and a character position within the field can be given (field.char) • Merge more than one sorted file (-m)
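A short sketch of how the sort options combine, assuming a made-up colon-separated file of name:score records (the file name and its contents are illustrative only):

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL      # fixed collation so the order is predictable
workdir=$(mktemp -d)
# Hypothetical records: name:score
cat > "$workdir/scores.txt" <<'EOF'
geneB:10
geneA:9
geneC:100
geneA:9
EOF

# -t: sets the field separator, -k2,2 sorts on field 2 only,
# -n compares that field as a number, -u keeps one line per distinct key.
by_score=$(sort -t: -k2,2 -n -u "$workdir/scores.txt")

# Without -n the same field is compared as text, so "100" sorts before "9".
as_text=$(sort -t: -k2,2 -u "$workdir/scores.txt")

printf '%s\n' "$by_score" "--" "$as_text"
rm -rf "$workdir"
```

Writing the key as -k2,2 rather than -k2 limits the comparison to the second field; -k2 alone would sort on everything from field 2 to the end of the line.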

  21. Comm • Select or reject lines in common between two sorted files • Options suppress printing of columns • comm [-123] file1 file2 • Column 1 is lines only in file 1 • Column 2 is lines only in file 2 • Column 3 is lines in both files
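The three columns are easiest to see on two tiny lists. This sketch assumes two made-up, already sorted files of gene names (list1 and list2 are hypothetical):

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL      # comm expects input sorted in the current collation
workdir=$(mktemp -d)
printf '%s\n' BRCA1 MYC TP53 > "$workdir/list1"
printf '%s\n' EGFR MYC TP53  > "$workdir/list2"

# Each digit suppresses one column, so two digits leave a single column.
only1=$(comm -23 "$workdir/list1" "$workdir/list2")   # lines only in list1
only2=$(comm -13 "$workdir/list1" "$workdir/list2")   # lines only in list2
both=$(comm -12 "$workdir/list1" "$workdir/list2")    # lines in both files

printf 'only in list1: %s\n' "$only1"
printf 'only in list2: %s\n' "$only2"
printf 'in both:\n%s\n' "$both"
rm -rf "$workdir"
```

If the inputs are not sorted, run them through sort first; comm compares the files line by line and does not search.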

  22. Diff • Compares two files (or the files in two directories) and outputs the lines that differ • Compare as text (-a) • Ignore changes in white space (-b), blank lines (-B), or case (-i) • For directory comparisons • Report only which files differ, not the details (-q) • Compare subdirectories recursively (-r)
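A minimal sketch of a file comparison and a directory comparison, assuming two made-up directories named old and new with invented contents. The -q option is not in POSIX but is supported by GNU and BSD diff.

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir "$workdir/old" "$workdir/new"
printf 'ACGT\nTTGA\n' > "$workdir/old/seq.txt"
printf 'ACGT\nTTGC\n' > "$workdir/new/seq.txt"     # second line changed
printf 'lab notes\n'  > "$workdir/old/notes.txt"
printf 'lab  notes\n' > "$workdir/new/notes.txt"   # extra space only

# diff exits with 0 when it finds no differences and 1 when it finds some.
# -b ignores changes in the amount of white space, so the notes compare equal.
if diff -b "$workdir/old/notes.txt" "$workdir/new/notes.txt" > /dev/null; then
    notes=same
else
    notes=different
fi

# -q -r: walk both directories and only name the files that differ.
# "|| true" keeps a calling script that uses "set -e" from stopping here.
report=$(diff -b -q -r "$workdir/old" "$workdir/new") || true

echo "notes.txt: $notes"
echo "$report"
rm -rf "$workdir"
```

Because the exit status already says whether anything differs, the if form is convenient in scripts that only need a yes or no answer.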

  23. Join • Combines lines from two files based on a common field (-1 field -2 field) • Specify the fields from each file and the order to output (-o file_number.field file_number.field file_number.field)
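As a sketch of both bullets, assume two made-up whitespace-separated tables that share a gene ID in their first field (annot.txt and expr.txt and their contents are hypothetical). Both inputs must already be sorted on the join field.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL
workdir=$(mktemp -d)
printf '%s\n' 'g1 kinase' 'g2 transporter' 'g3 protease' > "$workdir/annot.txt"
printf '%s\n' 'g1 12.5' 'g3 0.7' > "$workdir/expr.txt"

# Join on field 1 of each file; only IDs present in both files are printed.
joined=$(join -1 1 -2 1 "$workdir/annot.txt" "$workdir/expr.txt")

# -o chooses the output fields and their order as file_number.field:
# here the value from file 2 followed by the description from file 1.
picked=$(join -o 2.2,1.2 "$workdir/annot.txt" "$workdir/expr.txt")

printf '%s\n' "$joined" "--" "$picked"
rm -rf "$workdir"
```

The ID g2 has no partner in expr.txt, so it is dropped from both results.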

  24. What is Emacs? • A Unix text editor with additional functionality • Column functions • Settings for DNA mode • Settings for programming mode • Seamless integration with MATLAB, R, S-Plus, SAS, etc.

  25. Emacs Demonstrations • Search and replace • By query • All • New lines • Counting things • Column functions • Select • Kill • Copy • Paste

  26. Query replace • Esc % • Replace phrase • With phrase • Designate carriage return with control Q control J • Y or N • ! To replace all

  27. Starting File

  28. Query Replace

  29. End file

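  30. Rectangle functions • Set the mark, then move the cursor to select the rectangle • Control x r starts the rectangle and register commands • Control x r r a • To copy the rectangle into register a • Control x r k • To kill the rectangle • Control x r i a • To insert the rectangle previously saved in register a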

  31. Select Rectangle, Kill

  32. Select Rectangle, Mark, Insert

  33. What is Perl? • A general purpose programming language. • Invented to replace awk, sed, and sh. • A scripting language. • Practical Extraction and Report Language • Pathologically Eclectic Rubbish Lister • Motto: "There is more than one way to do it" (TIMTOWTDI)

  34. How to Use Perl • Perl "scripts" (programs) are text and are interpreted by the perl program. • TIMTOWTDI: • You can put the script on the command line: > perl -e 'print "Hello, world!\n";' • You can pass it as an argument to perl: > perl my_program.pl • You can make the script self-executing: > my_program.pl

  35. print, ", ', \n 'print "Hello, world!\n";' • In most programming languages, "print" means "display" or "output". • The single and double quote characters ( " ' ) are used to set apart blocks of "text". In this example, the single quotes set apart the perl script, and the double quotes set apart the text to display. (Perl has other ways to quote.) • The backslash, '\', is used to change the meaning of a character, e.g. to generate special characters. \n means "start a new line" (like pressing Return or Enter).

  36. Example of a One Liner (Thanks to Dr. Wheeler) perl -nle '@f=split/\t/; print if ($f[2] > 95);' blast_tbl_in.txt > blast_tbl_out.txt • Prints only the lines whose third tab-separated field is greater than 95.

  37. A One Liner: TIMTOWTDI • perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out1.txt • perl -ne '@f=split; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out2.txt • perl -ane 'print if ($F[2] > 95 );' blast_tbl_in.txt > blast_tbl_out3.txt

  38. split, if, variables @f=split/\t/; print if ($f[2] > 95); • split is a function. It can be written with parens like in most languages, and takes UP TO three arguments:split( where_to_split, what_to_split, how_many_to_split) • split, like many Perl statements, uses defaults for missing arguments. • Special characters mark @whole_arrays, $array_members[1], %whole_hashes, $hash_members{'one'}, $simple_variables. • if acts like its common English meaning. It can go before a block or at the end of a statement (as above). • Perl converts between numbers and text. '>' is a numeric operator so 95 and $f[2] are treated as numbers. If gt replaced >, they would be treated as strings.

  39. FASTA to XML perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa

  40. [localhost:~/test] steffen% ls test.fa test.fa.bak [localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa [localhost:~/test] steffen% ls test.fa test.fa.bak [localhost:~/test] steffen% more test.fa </seq><title>CSTAP1E0101A</title><seq> gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGT ACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTT GTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTT TCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCC AAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCA TCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTA AAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTG cacaacaatttttggtatgctaacttatacttgtgcctaatccttaagga aaagaaagagccatatacctaaaactgactttatttttcaaaaggta </seq><title>CSTAP1E0102A</title><seq> tttttgctggcgaactatcaggagactacagxaactacttttcagtxcga actcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaacc ctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagc tggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcg cagcctgaatggcgaatggcgcctgatgcggtattttctccttacgcttt caatgatgagcacttxtaaaggtctgx </seq><title>CSTAP1E0103A</title><seq> atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCAC CCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTC TCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAA GTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGG ATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACC ACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAG actgatgaactacttgcagtcgaactccaatcattactggccgtcgtttt aa

  41. Executing a Perl Script in a File
       $line = <>;
       $line =~ s">(.*)"<title>\1</title><seq>";
       print $line;
       while( $line = <> ) {
           $line =~ s">(.*)"</seq><title>\1</title><seq>";
           print $line;
       }
       print "</seq>\n";

  42. File Reading, Binding, while
       $line = <>;
     • <> reads one line from the "current file" (the files named on the command line, or standard input if none are given)
       $line =~ s">(.*)"<title>\1</title><seq>";
     • =~ binds the substitution to $line, so it works on $line instead of the default variable $_ (Binding)
       while( $line = <> ) { print $line; }
     • Repeats the statements between { and } while there is another line.

  43. Self-executing Perl Scripts • You need to know the path to your Perl program: > which perl prints, for example, /usr/bin/perl • The first line of your script must be #! followed by that path: #!/usr/bin/perl • Permissions need to allow execution: > chmod 755 my_program.pl

  44. FASTA to XML Fleshed Out
       #!/usr/bin/perl
       #
       # fasta2xml by David Steffen 6/2/2004
       # - Converts fasta file to mini-xml format
       $inpfile = shift( @ARGV );
       if( not( $inpfile =~ m/^(.*)\.fa$/ ) ) {
           die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
       }
       $basefile = $1;
       open( INPFILE, $inpfile ) or die( "Can't open $inpfile: $!\n" );
       $outfile = '>' . $basefile . '.xml';
       open( OUTFILE, $outfile ) or die( "Can't open $outfile: $!\n" );
       $line = <INPFILE>;
       $line =~ s">(.*)"<title>\1</title><seq>";
       print OUTFILE $line;
       while( $line = <INPFILE> ) {
           $line =~ s">(.*)"</seq><title>\1</title><seq>";
           print OUTFILE $line;
       }
       print OUTFILE "</seq>\n";

  45. Running Other Programs from Perl
       $files = `ls`;
     The "backtick" (` `) characters execute the text in between as a command to the operating system and return the output of that command (e.g. to the $files variable).
       $error = system( "mv $file ${basefile}.abi" );
     The system statement executes its argument as a command to the operating system and returns the command's exit status: zero on success, non-zero on failure. It does not return the error messages or the output, which are printed as usual. There are other, subtle differences between ` ` and system.
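The same distinction exists in the shell itself, which may make the Perl behaviour easier to remember. The sketch below is a shell analogue, not Perl: command substitution plays the role of backticks (capture the output) and the exit status plays the role of the value returned by system. The file names sample.seq, sample.abi and missing.seq are made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
touch "$workdir/sample.seq"        # hypothetical input file

# Like Perl backticks: capture what the command prints.
files=$(ls "$workdir")

# Like Perl's system: run the command and look only at its exit status.
if mv "$workdir/sample.seq" "$workdir/sample.abi"; then
    status=0
else
    status=$?
fi

# A failing command reports a non-zero status; its error text goes to
# stderr (discarded here) and is not part of the status.
if mv "$workdir/missing.seq" "$workdir/missing.abi" 2>/dev/null; then
    fail_status=0
else
    fail_status=$?
fi

echo "captured output: $files"
echo "status of good mv: $status, status of bad mv: $fail_status"
rm -rf "$workdir"
```

In Perl the value returned by system is the raw wait status, so the exit code itself is obtained by shifting that value right by 8 bits; testing it against zero works either way.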
