Bioinformatics: Theory and Practice (生物信息学理论和实践). Jijun Tang (唐继军), jtang@cse.sc.edu, 13928761660

www.cse.sc.edu/~jtang/BJFU

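Homework
• GTTGCAGCAATGGTAGACTCAACGGTAGCAATAACTGCAGGACCTAGAGGAAAAACAGTAGGGATTAATAAGCCCTATGGAGCACCAGAAATTACAAAAGATGGTTATAAGGTGATGAAGGGTATCAAGCCTGAA
• Why does BLAST with the default settings return no results for this sequence? Which options should be chosen instead?
• What are the latest PubMed articles on the related species?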

Working with Directories
• Directories are a means of organizing your files on a Linux computer.
• They are equivalent to folders on Windows and Macintosh computers.
• Directories contain files, executable programs, and sub-directories.
• Understanding how to use directories is crucial to manipulating your files on a Linux system.

File & Directory Commands • This is a minimal list of Linux commands that you must know for file management: • All of these commands can be modified with many options. Learn to use Linux ‘man’ pages for more information.

Navigation
• pwd (print working directory) shows the name and location of the directory where you are currently working:
    > pwd
    /home/jtang
• This is a “pathname”; the slashes indicate sub-directories.
• The initial slash is the “root” of the whole filesystem.
• ls (list) gives you a list of the files in the current directory:
    > ls
    assembin4.fasta Misc test2.txt bin temp testfile
• Use the ls -l (long) option to get more information about each file:
    > ls -l
    total 1768
    drwxr-x--- 2 browns02 users   8192 Aug 28 18:26 Opioid
    -rw-r----- 1 browns02 users   6205 May 30  2000 af124329.gb_in2
    -rw-r----- 1 browns02 users 131944 May 31  2000 af151074.fasta

Sub-directories
• cd (change directory) moves you to another directory:
    > cd Misc
    > pwd
    /u/browns02/Misc
• mkdir (make directory) creates a new sub-directory inside of the current directory:
    > ls
    assembler phrap space
    > mkdir subdir
    > ls
    assembler phrap space subdir
• rmdir (remove directory) deletes a sub-directory, but the sub-directory must be empty:
    > rmdir subdir
    > ls
    assembler phrap space

Create new files • nano • vi/vim • emacs

Programming • perl • python • c/c++ • R • Java

more
• Use the command more to view the contents of a file one screen at a time:
    > more t27054_cel.pep
    !!AA_SEQUENCE 1.0
    P1;T27054 - hypothetical protein Y49E10.20 - Caenorhabditis elegans
    Length: 534 May 30, 2000 13:49 Type: P Check: 1278 ..
      1 MLKKAPCLFG SAIILGLLLA AAGVLLLIGI PIDRIVNRQV IDQDFLGYTR
     51 DENGTEVPNA MTKSWLKPLY AMQLNIWMFN VTNVDGILKR HEKPNLHEIG
    101 PFVFDEVQEK VYHRFADNDT RVFYKNQKLY HFNKNASCPT CHLDMKVTIP
    t27054_cel.pep (87%)
• Hit the spacebar to page down through the file
• b (or Ctrl-B) moves back up a page
• At the bottom of the screen, more shows how much of the file has been displayed
• Similar command: less

Copy & Move
• cp lets you copy a file from any directory to any other directory, or create a copy of a file with a new name in one directory:
    cp filename.ext newfilename.ext
    cp filename.ext subdir/newname.ext
    cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv allows you to move files to other directories, but it is also used to rename files.
• Filename and directory syntax for mv is exactly the same as for the cp command:
    mv filename.ext subdir/newfilename.ext
• NOTE: When you use mv to move a file into another directory, the file no longer exists in its original location.

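Delete
• Use the command rm (remove) to delete files.
• There is no way to undo this command!
• We have set the server to ask if you really want to remove each file before it is deleted.
• You must answer “Y” or else the file is not deleted.
• The -f (force) option skips this question: rm -f filename.ext
• rm -rf also deletes directories together with everything inside them, so use it with great care.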

View File Permissions
• Use the ls -l command to see the permissions for all files in a directory:
    $ ls -l
    total 2
    -rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
    -rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The username of the owner is shown in the third column. (The owner of the files listed above is jtang.)
• The owner belongs to the group “None”.
• The access rights for these files are shown in the first column. This column consists of 10 characters known as the attributes of the file: r, w, x, and -
    r indicates read permission
    w indicates write (and delete) permission
    x indicates execute (run) permission
    - indicates no permission for that operation

Change Protections
• Only the owner of a file can change its protections.
• To change the protections on a file use the chmod (change mode) command. [Beware, this is a confusing command.]
• Taken all together, it looks like this:
    > chmod 644 data.txt
• This gives the owner read and write permission, and gives the group and the world read permission only.
• Other common modes: 600, 755, 700.
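One way to read a mode such as 644 is digit by digit: each digit is the sum of 4 (read), 2 (write) and 1 (execute), given for the owner, the group and the world in that order. The following is a minimal sketch in Perl, the language introduced later in these slides. The name mode_to_string is a hypothetical helper written for this illustration, not a standard command, and the use strict and my declarations go beyond what the slides cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the slides): turn a three-digit
# octal mode such as "644" into the rwx notation printed by ls -l.
sub mode_to_string {
    my ($mode) = @_;
    my $result = '';
    foreach my $digit (split //, $mode) {
        # Each digit is a sum of read (4), write (2) and execute (1).
        $result .= ($digit & 4) ? 'r' : '-';
        $result .= ($digit & 2) ? 'w' : '-';
        $result .= ($digit & 1) ? 'x' : '-';
    }
    return $result;
}

# Decode the modes mentioned on the slide.
foreach my $mode ('644', '600', '755', '700') {
    print "$mode = ", mode_to_string($mode), "\n";
}
```

The strings it prints can be compared with the nine permission characters that follow the file type in the first column of ls -l.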

Commands for Files • Files are used to store information, for example, data or the results of some analysis. • You will mostly deal with text files • Files on the RCR Alpha are automatically backed up to tape every night. • cat dumps the entire contents of a file onto the screen. • For a long file this can be annoying, but it can also be helpful if you want to copy and paste (use the buffer of your telnet program)

FTP/SCP is Simple
• File Transfer Protocol is standard for all computers on any network.
• The best way to move lots of data to and from remote machines:
    • put raw data onto the server for analysis
    • get results back to the desktop for use in papers and grants
• Graphical FTP applications for desktop PCs:
    • On a Mac, use Fetch, Cyberduck (!)
    • On a Windows PC, use WS_FTP, FileZilla, WinSCP

Some More Advanced Linux Commands • grep: searches a file for a specific text pattern • cut: copies one or more columns from a tab-delimited text file • wc: word count • | : the pipe — sends output of one command as input to the next • > : redirect output to a file

Perl

Why Write Programs?
• Automate computer work that you do by hand: save time & reduce errors
• Run the same analysis on lots of similar data files = scale-up
• Analyze data, make decisions
• Sort Blast results by e-value and/or species of best match
• Build a pipeline
• Create new analysis methods

Why Perl? • Fairly easy to learn the basics • Many powerful functions for working with text: search & extract, modify, combine • Can control other programs • Free and available for all operating systems • Most popular language in bioinformatics • Many pre-built “modules” are available that do useful things

Get Perl • You can install Perl on any type of computer • Download and install Perl on your own computer: www.perl.org

Programming Concepts
• Program = a text file that contains instructions for the computer to follow
• Programming language = a set of commands that the computer understands (via a “command interpreter”)
• Input = data that is given to the program
• Output = something that is produced by the program

Programming
• Write the program (with a text editor)
• Run the program
• Look at the output
• Correct the errors (debugging)
• Repeat
(computers are VERY dumb - they do exactly what you tell them to do, so be careful what you ask for…)

Basic Concepts • Variables and Assignment • Conditions • Loop • Input/Output (I/O) • Procedures/functions

Strings
• Text is handled in Perl as a string.
• This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.
• Perl has two kinds of quotes, single ' and double ", and they behave differently: text in single quotes is printed exactly as written, while inside double quotes variables and escapes such as \n are interpreted.

Print
• Perl uses the term “print” to create output
• Without a print statement, you won’t know what your program has done
• You need to tell Perl to put a carriage return at the end of a printed line
• Use the "\n" (newline) escape sequence
• Include the quotes (double quotes, so that \n is interpreted)
• The “\” character is called an escape - Perl uses it a lot

Your First Perl Program
• Open a new text file:
    > nano prog1.pl
• Type:
    #!/usr/bin/perl
    #my first perl program
    print "Hello world\n";

Program details
• Perl programs always start with the line: #!/usr/bin/perl
• This tells the computer that this is a Perl program and where to get the Perl interpreter
• All other lines that start with # are considered comments, and are ignored by Perl
• Lines that are Perl commands end with a ;

Run your Perl program
    > perl prog1.pl      [use the perl interpreter to run your script]
    > chmod 755 *.pl     [make the file executable]
    > ./prog1.pl         [run it]

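    #!/usr/bin/perl
    $DNA = 'ACGT';
    # Next, we print the DNA onto the screen
    print $DNA, "\n";
    # Single quotes: prints the literal text $DNA\n, with no newline
    print '$DNA\n';
    # Double quotes: prints ACGT followed by a newline
    print "$DNA\n";
    exit;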
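The three print statements above differ only in how the text is quoted. As a rough illustration, the sketch below stores the single-quoted and double-quoted versions in variables and compares their lengths. The variable names $single and $double are made up for this example, and use strict and my are additions that the slides do not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $DNA = 'ACGT';

# Single quotes: nothing is interpreted, so this string holds the
# literal characters $DNA followed by a backslash and the letter n.
my $single = '$DNA\n';

# Double quotes: $DNA is replaced by its value and \n becomes a newline.
my $double = "$DNA\n";

print "single-quoted string has ", length($single), " characters\n";
print "double-quoted string has ", length($double), " characters\n";

# Print both strings; only the second ends its own line.
print $single, "\n";
print $double;
```

Comparing lengths is just one convenient way to see that the two strings really contain different characters.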

Numbers and Functions
• Perl handles numbers in most common formats:
    456
    5.6743
    6.3E-26
• Mathematical functions work pretty much as you would expect:
    4+7
    6*4
    43-27
    256/12
    2/(3-5)

Do the Math (your 2nd Perl program)
    #!/usr/bin/perl
    print "4+5\n";
    print 4+5 , "\n";
    print "4+5=" , 4+5 , "\n";
[Note: use commas to separate multiple items in a print statement; whitespace is ignored]

Variables
• To be useful at all, a program needs to be able to store information from one line to the next
• Perl stores information in variables
• A scalar variable name starts with the “$” symbol, and it can store a string or a number
• Variables are case sensitive
• Give them sensible names
• Use the “=” sign to assign values to variables:
    $one_hundred = 100;
    $my_sequence = "ttattagcc";

You can do Math with Variables
    #!/usr/bin/perl
    #put some values in variables
    $sequences_analyzed = 200 ;
    $new_sequences = 21 ;
    #now we will do the work
    $percent_new_sequences = ( $new_sequences / $sequences_analyzed ) * 100 ;
    print "% of new sequences = " , $percent_new_sequences , "\n";
Output:
    % of new sequences = 10.5

String Operations
• Strings (text) in variables can be used for some math-like operations
• Concatenate (join): use the dot . operator
    $seq1 = "ACTG";
    $seq2 = "GGCTA";
    $seq3 = $seq1 . $seq2;
    print $seq3;
Output:
    ACTGGGCTA

    #!/usr/bin/perl
    # Storing DNA in a variable, and printing it out
    # First we store the DNA in a variable called $DNA
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    # Next, we print the DNA onto the screen
    print $DNA;
    # Finally, we'll specifically tell the program to exit.
    exit;

    #!/usr/bin/perl -w
    $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';
    print "Here are the original two DNA fragments:\n\n";
    print $DNA1, "\n";
    print $DNA2, "\n\n";
    # Using "string interpolation"
    $DNA3 = "$DNA1$DNA2";
    print "Here is the concatenation of the first two fragments (version 1):\n\n";
    print "$DNA3\n\n";
    # An alternative way using the "dot operator":
    $DNA3 = $DNA1 . $DNA2;
    print "Here is the concatenation of the first two fragments (version 2):\n\n";
    print "$DNA3\n\n";
    exit;

    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "Here is the starting DNA:\n\n";
    print "$DNA\n\n";
    # Transcribe the DNA to RNA by substituting all T's with U's.
    $RNA = $DNA;
    $RNA =~ s/T/U/g;
    # Print the RNA onto the screen
    print "Here is the result of transcribing the DNA to RNA:\n\n";
    print "$RNA\n";
    # Exit the program.
    exit;

Exercises
• Create a dir named Exercises in your home dir
• Create a folder Class1 in your Exercises dir
• Create three perl programs
• Prog2: Concatenate three DNAs
• Prog3: Convert a DNA to one in lowercase
• A->a, C->c, G->g, T->t
• Chmod, Test and Debug

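    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "$DNA\n\n";
    $revcom = reverse $DNA;
    # NOTE: this first attempt gives the wrong answer. Each substitution
    # also changes bases produced by the one before it: s/A/T/g turns every
    # A into T, then s/T/A/g turns all of those (and the original T's) into A.
    # The tr/// operator, used in the next program, avoids this.
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;
    # Print the (incorrect) reverse complement DNA onto the screen
    print "Here is the reverse complement DNA:\n\n";
    print "$revcom\n";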

    #!/usr/bin/perl -w
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    print "$DNA\n\n";
    $revcom = reverse $DNA;
    # See the text for a discussion of tr///
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    # Print the reverse complement DNA onto the screen
    print "Here is the reverse complement DNA:\n\n";
    print "$revcom\n";
    exit;
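To see why the two programs above behave differently, it can help to run both approaches on a very short sequence. The sketch below wraps each approach in a subroutine. The names revcom_substitute and revcom_translate and the test sequence AACG are made up for this illustration, and subroutines, use strict and my are not covered in the slides so far.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approach 1: four separate substitutions, applied one after another.
# Later substitutions also hit bases changed by earlier ones.
sub revcom_substitute {
    my ($dna) = @_;
    my $revcom = reverse $dna;
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;
    return $revcom;
}

# Approach 2: tr/// translates every character exactly once.
sub revcom_translate {
    my ($dna) = @_;
    my $revcom = reverse $dna;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

my $dna = 'AACG';
print "Original:            $dna\n";
print "Substitutions (s///): ", revcom_substitute($dna), "\n";
print "Translation (tr///):  ", revcom_translate($dna), "\n";
```

With the substitution approach the result can only ever contain A and G, because every T is turned back into A and every C is turned back into G.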

Exercise
• Change your previous program so that it converts to lowercase more easily

More
• In your Exercises dir, create a dir named Class2
• Using nano, create a file named NM_021964fragment.pep
• Put some amino acid sequence into it
• Save and quit

    #!/usr/bin/perl -w
    # The filename of the file containing the protein sequence data
    $proteinfilename = 'NM_021964fragment.pep';
    # First we have to "open" the file, and stop with a message if that fails
    open(PROTEINFILE, $proteinfilename) or die "Cannot open $proteinfilename: $!";
    # Read the first line of the file into $protein
    $protein = <PROTEINFILE>;
    # Now that we've got our data, we can close the file.
    close PROTEINFILE;
    # Print the protein onto the screen
    print "Here is the protein:\n\n";
    print $protein;
    exit;

More • Using nano, add two more lines to NM_021964fragment.pep • Save and quit

    #!/usr/bin/perl -w
    $proteinfilename = 'NM_021964fragment.pep';
    open(PROTEINFILE, $proteinfilename) or die "Cannot open $proteinfilename: $!";
    # First line
    $protein = <PROTEINFILE>;
    print "\nHere is the first line of the protein file:\n\n";
    print $protein;
    # Second line
    $protein = <PROTEINFILE>;
    print "\nHere is the second line of the protein file:\n\n";
    print $protein;
    # Third line
    $protein = <PROTEINFILE>;
    print "\nHere is the third line of the protein file:\n\n";
    print $protein;
    close PROTEINFILE;
    exit;

Exercise
• Create a file named dna.fasta
• Add two lines to this file:
    >DNA1
    ATGCGGGATGGAGCGCGC
• Write a program that opens the file and prints the DNA name and the sequence
• How can you avoid printing the “>”?

    #!/usr/bin/perl -w
    # The filename of the file containing the protein sequence data
    $proteinfilename = 'NM_021964fragment.pep';
    # First we have to "open" the file, and stop with a message if that fails
    open(PROTEINFILE, $proteinfilename) or die "Cannot open $proteinfilename: $!";
    # Read the protein sequence data from the file, and store it
    # into the array variable @protein (one line per element)
    @protein = <PROTEINFILE>;
    # Print the protein onto the screen
    print @protein;
    # Close the file.
    close PROTEINFILE;
    exit;
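Once all lines are in an array, a common next step is to strip the newlines and join the lines into one sequence string. The sketch below shows one way this might look. The sequence text and the helper name lines_to_sequence are invented for the example, and the data is read from a string instead of NM_021964fragment.pep so that the sketch runs on its own. It also uses use strict, my and a lexical filehandle, which the slides do not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up file contents, held in a string so that the sketch runs
# without NM_021964fragment.pep being present.
my $file_contents = "MNIDDKLEGL\nFLKCGGIDEM\nQSSRTMVVMG\n";

# Opening a reference to a string gives a filehandle that reads from it.
open(my $fh, '<', \$file_contents) or die "Cannot open in-memory file: $!";
my @protein = <$fh>;    # list context: one array element per line
close $fh;

# Hypothetical helper: remove the trailing newlines and join the lines.
sub lines_to_sequence {
    my @lines = @_;
    chomp @lines;
    return join('', @lines);
}

my $sequence = lines_to_sequence(@protein);
print "Number of lines: ", scalar(@protein), "\n";
print "Sequence: $sequence\n";
print "Length: ", length($sequence), "\n";
```

To use a real file, the open line would take the file name in place of the string reference.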

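    #!/usr/bin/perl -w
    # "scalar context" and "list context"
    @bases = ('A', 'C', 'G', 'T');
    print "@bases\n";    # prints: A C G T
    $a = @bases;         # scalar context: $a gets the number of elements
    print $a, "\n";      # prints: 4
    ($a) = @bases;       # list context: $a gets the first element
    print $a, "\n";      # prints: A
    exit;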
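The same idea can be written with more descriptive variable names. This is only a sketch: the names $count, $first, $one and $two are made up here. It avoids $a because Perl also uses $a and $b as special variables in sort, and it adds use strict and my, which the slides do not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# Scalar context: assigning an array to a scalar gives the element count.
my $count = @bases;

# List context: the parentheses make the left side a list, so the
# variables are filled from the first elements of the array.
my ($first) = @bases;
my ($one, $two) = @bases;

print "count: $count\n";
print "first: $first\n";
print "pair:  $one $two\n";

# print supplies list context, so scalar() is needed to print the
# count directly without storing it in a variable first.
print "count again: ", scalar(@bases), "\n";
```

Any elements left over after a list assignment, here G and T in the last case, are simply not assigned.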
