1 / 37

Computational Skills Course - Week 1

This course covers basic concepts of computer-based data, the command line, and essential command line utilities. Topics include data and text files, simple command line tools, third-party bioinformatics tools, databases and SQL, 3GL's - industrial-strength computing, and scripts and pipelines.

linneac
Download Presentation

Computational Skills Course - Week 1

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computational Skills Courseweek 1 Mike Gilchrist NIMR May-July 2011

  2. Main topics The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines What will not appear and why...

  3. Axioms Google it You cannot break your computer Familiarity breeds content Bugs are always your fault (and can always be found - eventually) There are many ways of doing any give task (but one of them may be much better) Computational biologists tend to obfuscate GNU = Gnu is Not Unix

  4. WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.

  5. The terminal window

  6. At the command line D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt program = program.exe or -rwxr-xr-x 1 migilsysbio 18K Mar 12 16:16 program [migil@INFO-LINUX2 gcc]$ hts_utils

  7. PATH F:\data\projects\khokha\exon-capture\blast\data-12may11.txt Also the PATH environment variable which tells the computer where to look for programs

  8. Computer data a ‘byte’ 10001011010010110010001110110101011010100101000010100101... 128 64 32 16 8 4 2 1 = 35 (base 10) 00000000 = 0 11111111 = 255 256 possible characters in the simplest form of computer data

  9. The ascii ‘alphabet’ 00000000 = 0 \0 (the null terminator) 00001001 = 9 \t TAB 00001010 = 10 \n (line feed) 00001011 = 13 \r (return) 00100000 = 32 ‘ ‘ (space) 00110000 = 48 ‘0’ : : : : 00111001 = 57 ‘9’ 01000001 = 65 ‘A’ : : : : 01011010 = 90 ‘Z’

  10. A typical text data file @HWUSI-EAS582_0299:3:1:87:82/1 GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT +HWUSI-EAS582_0299:3:1:87:82/1 ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff @HWUSI-EAS582_0299:3:1:87:1949/1 CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT +HWUSI-EAS582_0299:3:1:87:1949/1 _ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce @HWUSI-EAS582_0299:3:1:87:1688/1 CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT +HWUSI-EAS582_0299:3:1:87:1688/1 ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd @HWUSI-EAS582_0299:3:1:87:1773/1 CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA +HWUSI-EAS582_0299:3:1:87:1773/1 gfgggggfggggfgggZggfddggggfag\fggafgdgggg @HWUSI-EAS582_0299:3:1:87:1000/1 ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC +HWUSI-EAS582_0299:3:1:87:1000/1 ggeggegeggcgggfdaaffdgggggfggggggggggdggd @HWUSI-EAS582_0299:3:1:87:842/1 GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA +HWUSI-EAS582_0299:3:1:87:842/1 gggggggggggggggggggggggfgggggggfggggggggg @HWUSI-EAS582_0299:3:1:88:1826/1 TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG +HWUSI-EAS582_0299:3:1:88:1826/1 cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg

  11. The line ending/platform issue DOS (Windows) Unix/Linux (newer Macs) 582_0299:3:1:87:82/1\n\r GATGTCTGCCCTATCAACTTTCGAT\n\r 582_0299:3:1:87:82/1\n\r fbee_M_f]UV^JYRXbSdf`bfff\n\r 582_0299:3:1:87:82/1\n GATGTCTGCCCTATCAACTTTCGAT\n 582_0299:3:1:87:82/1\n fbee_M_f]UV^JYRXbSdf`bfff\n (really) 582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87.. Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform... If you are moving a file from Linux -> Windows run unix2dos first. >unix2dos <FILENAME>

  12. Command line utilities for manipulating data files http://sourceforge.net/projects/gnuwin32/files/ >grep finds lines matching ‘patterns’ >sed swaps one pattern for another >awk operates on ‘fields’ in a record >cat >tail >more/less >unix2dos/dos2unix AND THEN THERE ARE PIPES... ‘|’ AND REDIRECTS... ‘>’

  13. grep A typical fasta sequence file (Xenopus tropicalis mRNAs) : ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT GCTCCGAGCAGT >ENSXETG00000025789|ENSXETT00000019729|nodal2 ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT >ENSXETG00000004295|ENSXETT00000014760|neurod4 AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA >ENSXETG00000006455|ENSXETT00000014172|camk2g AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT :

  14. grep grep reads the input file line by line and reports on each line containing the pattern F:dev\sql>grep ACGTACGT my-sequence-file.fasta GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA : : F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta 76 F:dev\sql>grep nodal my-sequence-file.fasta >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:dev\sql>grep –c “>” my-sequence-file.fasta 28937 F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt

  15. sed stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution. E.g. Replace groups of space with a TAB, etc. F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr

  16. sed F:dev\sql>more tmp.txt >ENSXETG00000025789|ENSXETT00000019729|nodal2 >ENSXETG00000009009|ENSXETT00000019730|nodal3 >ENSXETG00000025789|ENSXETT00000019728|nodal2 >ENSXETG00000009008|ENSXETT00000019726|nodal1 >ENSXETG00000017442|ENSXETT00000037932|nodal5.2 >ENSXETG00000016779|ENSXETT00000036596|nodal5 >ENSXETG00000016778|ENSXETT00000036593|nodal6 >ENSXETG00000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/" tmp.txt >ENSXETG_0000025789|ENSXETT00000019729|nodal2 >ENSXETG_0000009009|ENSXETT00000019730|nodal3 >ENSXETG_0000025789|ENSXETT00000019728|nodal2 >ENSXETG_0000009008|ENSXETT00000019726|nodal1 >ENSXETG_0000017442|ENSXETT00000037932|nodal5.2 >ENSXETG_0000016779|ENSXETT00000036596|nodal5 >ENSXETG_0000016778|ENSXETT00000036593|nodal6 >ENSXETG_0000023748|ENSXETT00000051228|nodal F:\dev\sql>sed "s/0/_/g" tmp.txt >ENSXETG______25789|ENSXETT______19729|nodal2 >ENSXETG_______9__9|ENSXETT______1973_|nodal3 >ENSXETG______25789|ENSXETT______19728|nodal2 >ENSXETG_______9__8|ENSXETT______19726|nodal1 >ENSXETG______17442|ENSXETT______37932|nodal5.2 >ENSXETG______16779|ENSXETT______36596|nodal5 >ENSXETG______16778|ENSXETT______36593|nodal6 >ENSXETG______23748|ENSXETT______51228|nodal

  17. The pipe F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt F:\dev\sql>sed "s/nodal/xnr/" tmp.txt >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr F:dev\sql>grep “>” my-sequence-file.fasta|sed "s/nodal/xnr/" >ENSXETG00000025789|ENSXETT00000019729|xnr2 >ENSXETG00000009009|ENSXETT00000019730|xnr3 >ENSXETG00000025789|ENSXETT00000019728|xnr2 >ENSXETG00000009008|ENSXETT00000019726|xnr1 >ENSXETG00000017442|ENSXETT00000037932|xnr5.2 >ENSXETG00000016779|ENSXETT00000036596|xnr5 >ENSXETG00000016778|ENSXETT00000036593|xnr6 >ENSXETG00000023748|ENSXETT00000051228|xnr

  18. (n)awk Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields. 3 06/04/2011 06/04/2011 Ashleigh Howes OK ahowes@nimr.mrc.ac.uk Immunoreg O'Garra 2 06/04/2011 06/04/2011 Leona Gabrysova OK lgabrys@nimr.mrc.ac.uk Immunoreg O'Garra 1 06/04/2011 06/04/2011 Eric Dang (ok) edang@nimr.mrc.ac.uk Immunoreg O'Garra 07/04/2011 07/04/2011 Guilherme Neves OK gneves@nimr.mrc.ac.uk Molecular Neuro Pachnis 5 08/04/2011 08/04/2011 Mary Wu OK mwu@nimr.mrc.ac.uk Systems Biol Smith 24 08/04/2011 08/04/2011 Jose Saldanha OK jsaldan@nimr.mrc.ac.uk Math Biol Goldstein 8 08/04/2011 08/04/2011 Laurent Dupays OK ldupays@nimr.mrc.ac.uk Dev Biol Mohun 7 08/04/2011 08/04/2011 Veronique Duboc OK vduboc@nimr.mrc.ac.uk Dev Biol Logan 22 09/04/2011 09/04/2011 Madhu Kadekoppala OK mkadeko@nimr.mrc.ac.uk Parasitology Holder 21 09/04/2011 09/04/2011 George Gentsch OK ggentsc@nimr.mrc.ac.uk Systems Biol Smith 13 10/04/2011 11/04/2011 Siggi Sato OK ssato@nimr.mrc.ac.uk Parasitology Holder 10/04/2011 10/04/2011 James Briscoe OK jbrisco@nimr.mrc.ac.uk Dev Neurobiol Briscoe 26 10/04/2011 17/04/2011 Alex Watson OK awatson@nimr.mrc.ac.uk Systems Biol Smith 10/04/2011 Mustafa Khokha OK mustafa.khokha@yale.edu YALE F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt Ashleigh ahowes@nimr.mrc.ac.uk Leona lgabrys@nimr.mrc.ac.uk Eric edang@nimr.mrc.ac.uk Guilherme gneves@nimr.mrc.ac.uk Mary mwu@nimr.mrc.ac.uk : F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“ Ashleigh ahowes AT nimr.mrc.ac.uk Leona lgabrys AT nimr.mrc.ac.uk Eric edang AT nimr.mrc.ac.uk Guilhermegneves AT nimr.mrc.ac.uk Mary mwu AT nimr.mrc.ac.uk :

  19. Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl).

  20. Catches and other platform specific stuff Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all. Unix/Linux is case sensitive: Hello != hello ^C (control-C) is your get out of jail free card!

  21. Next week Download and install MySQL…

  22. WEEK TWO Relational databases and the structured query language (SQL), and an introduction to MySQL.

  23. What is a database? A collection of data tables, or other structures, managed by a computer program and queryable by some sort of (usually) standard language. We will look at relational databases where the data are help in strict column based tables, and queried with SQL – the structured query langauge. This is a standard across a wide range of products: MySQL, Oracle, Sybase, MS-SQLServer PostGres, etc. Database server: the computer program that manages everything Database: a defined area which contains a set of related data tables and other entities: rules, datatypes, etc. Table: a single defined object with a fixed (by you) number of columns and any number of rows Datatypes: out of which data structures are built, i.e. you must define the ‘type’ of your data: integer, float, varchar(), text, etc.

  24. A table course_applicants

  25. Another table research_roles

  26. Creating a table [this is SQL!] create table course_applicants ( date_applied datetime, first_name varchar(32), last_name varchar(32), status varchar(4) null, email varchar(32), divison varchar(32), lab varchar(32) null, role_id integer )

  27. Creating a table create table course_applicants ( cap_date_applied datetime, cap_first_name varchar(32), cap_last_name varchar(32), cap_status varchar(4) null, cap_email varchar(32), cap_divison varchar(32), cap_lab varchar(32) null, cap_role_id integer )

  28. Creating a table: at the prompt F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >create table course_applicants ( date_applied datetime, etc… >go >(will now create table in the database)

  29. Creating a table: in a script (better) table_course_applicants.sql use mydb go create table course_applicants ( date_applied datetime, first_name varchar(32), last_name varchar(32), status varchar(4) null, email varchar(32), divison varchar(32), lab varchar(32) null, role_id integer ) go F:\dev\sql>MySQL –U myId –P myPass < table_course_applicants.sql F:\dev\sql>(table is now created)

  30. Entering data into a table insert_applicants.sql use mydb go insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id ) select getdate(), ‘Mike’, ‘Gilchrist’, ‘mgilchr@nimr.mrc.ac.uk’, ‘Sys Bio’, 8 insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id ) select getdate(), ‘Mustafa’, ‘Khokha’, ‘mustafa.khokha@yale.edu’, ‘YALE’, 8 <etc.> <etc.> <etc.> go F:\dev\sql>MySQL –U myId –P myPass < insert_applicants.sql F:\dev\sql>(data is now added to table)

  31. Loading a table from a text data file computing-course-applicants.txt F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >LOAD DATA LOCAL FILE computing-course-applicants.txt >INTO TABLE course_applicants >go

  32. Using SQL to query/analyse/alter the data F:\dev\sql>MySQL –U myId –P myPass >use mydb >go >select * from course_applicants >go >(outputs data to screen) select * from course_applicants order by date_applied select count(*) ‘applicants’, count(distinct division) ‘divisions’ from course_applicants select lab, count(*) from course_applicants group by lab order by count(*) desc select lab, count(*) from course_applicants group by lab having count(*) > 1 order by lab select * from course_applicants where first_name like ‘J%’ select * from course_applicants where status != ‘OK’

  33. Joining tables for more complex queries select first_name, last_name, role from course_applicants, research_roles where course_applicants.role_id = research_roles.role_id (JOIN) select first_name, last_name, role from course_applicants a, research_roles r (table aliases) where a.role_id = r.role_id (JOIN) select cap_first_name, cap_last_name, rro_role from course_applicants, research_roles where cap_role_id = rro_role_id (JOIN) select rro_role ‘role’, count(*) ‘on course’ from course_applicants, research_roles where cap_role_id = rro_role_id (JOIN) group by rro_role order by rro_role

  34. delete and update statements /* CHANGE THE ROLE DESCRIPTION */ update research_roles set role = ‘CDF (Post Doc)’ where role = ‘Post Doc’ update research_roles set role = ‘CDF (Post Doc)’ where role_id = 2 /* REMOVE PIs */ delete course_applicants where role_id = 8 delete course_applicants from research_roles where course_applicants.role_id = research_roles.role_id and research_roles.role = ‘Principal Investigator’ : and research_roles.role like ‘P% I%’

  35. Catches Never use sub-queries: select ... from tableA a, tableB b where a.id = b.id and a.id in (select id from tableC where id like ‘NIMR%’ ) SQL can give you the wrong answer in a plausible way... If your query is taking forever you have probably missed out a join condition. You can always add rows of data easily, but adding columns is hard work, and means you will have to re-write code... Long text data (e.g. sequences) can be tough to handle.

  36. Exercises Design a table, or tables, to hold personal details (names, etc) and contact information (email, phone numbers, etc.) for people in an organisation. Download the set of gene loci and exon data for your organism from BioMart, design some tables for this data, load the data from your data files, and then query the data for the number of multi-exon vs single exon genes, and find the gene with the most number of exons, etc.

  37. Next week Download and install BLAST from NCBI Prepare a two minute verbal presentation of you project

More Related