Computational skills course week 1
Download
1 / 37

Computational Skills Course week 1 - PowerPoint PPT Presentation


  • 112 Views
  • Uploaded on

Computational Skills Course week 1. Mike Gilchrist NIMR May-July 2011. Main topics. The command line Data and text files Simple command line tools Third party bioinformatics tools Databases and SQL 3GL’s – industrial strength computing Scripts and pipelines

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Computational Skills Course week 1' - clive


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Computational skills course week 1

Computational Skills Courseweek 1

Mike Gilchrist

NIMR May-July 2011


Main topics

The command line

Data and text files

Simple command line tools

Third party bioinformatics tools

Databases and SQL

3GL’s – industrial strength computing

Scripts and pipelines

What will not appear and why...


Axioms

Google it

You cannot break your computer

Familiarity breeds content

Bugs are always your fault

(and can always be found - eventually)

There are many ways of doing any give task

(but one of them may be much better)

Computational biologists tend to obfuscate

GNU = Gnu is Not Unix


WEEK ONE

Basic concepts of computer based data, the command line, and some essential command line utilities.



At the command line

D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt

program = program.exe

or

-rwxr-xr-x 1 migilsysbio 18K Mar 12 16:16 program

[[email protected] gcc]$ hts_utils


PATH

F:\data\projects\khokha\exon-capture\blast\data-12may11.txt

Also the PATH environment variable which tells the computer where to look for programs


Computer data

a ‘byte’

10001011010010110010001110110101011010100101000010100101...

128 64 32 16 8 4 2 1 = 35 (base 10)

00000000 = 0

11111111 = 255

256 possible characters in the simplest form of computer data


The ascii ‘alphabet’

00000000 = 0 \0 (the null terminator)

00001001 = 9 \t TAB

00001010 = 10 \n (line feed)

00001011 = 13 \r (return)

00100000 = 32 ‘ ‘ (space)

00110000 = 48 ‘0’

: : : :

00111001 = 57 ‘9’

01000001 = 65 ‘A’

: : : :

01011010 = 90 ‘Z’


A typical text data file

@HWUSI-EAS582_0299:3:1:87:82/1

GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT

+HWUSI-EAS582_0299:3:1:87:82/1

ffbffffff^UYbb\ffbee_M_f]UV^JYRXbSdf`bfff

@HWUSI-EAS582_0299:3:1:87:1949/1

CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT

+HWUSI-EAS582_0299:3:1:87:1949/1

_ebhhghghgedgceg\fggfghggg_]eZ\WWcacMb_ce

@HWUSI-EAS582_0299:3:1:87:1688/1

CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT

+HWUSI-EAS582_0299:3:1:87:1688/1

ffbdd`bfffbf`ZfZddddcbfZfff]ffaddYdeffedd

@HWUSI-EAS582_0299:3:1:87:1773/1

CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA

+HWUSI-EAS582_0299:3:1:87:1773/1

gfgggggfggggfgggZggfddggggfag\fggafgdgggg

@HWUSI-EAS582_0299:3:1:87:1000/1

ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC

+HWUSI-EAS582_0299:3:1:87:1000/1

ggeggegeggcgggfdaaffdgggggfggggggggggdggd

@HWUSI-EAS582_0299:3:1:87:842/1

GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA

+HWUSI-EAS582_0299:3:1:87:842/1

gggggggggggggggggggggggfgggggggfggggggggg

@HWUSI-EAS582_0299:3:1:88:1826/1

TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG

+HWUSI-EAS582_0299:3:1:88:1826/1

cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg


The line ending/platform issue

DOS (Windows)

Unix/Linux (newer Macs)

582_0299:3:1:87:82/1\n\r

GATGTCTGCCCTATCAACTTTCGAT\n\r

582_0299:3:1:87:82/1\n\r

fbee_M_f]UV^JYRXbSdf`bfff\n\r

582_0299:3:1:87:82/1\n

GATGTCTGCCCTATCAACTTTCGAT\n

582_0299:3:1:87:82/1\n

fbee_M_f]UV^JYRXbSdf`bfff\n

(really)

582_0299:3:1:87:82/1\n\rGATGTCTGCCCTATCAACTTTCGAT\n\r582_0299:3:1:87..

Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform...

If you are moving a file from Linux -> Windows run unix2dos first.

>unix2dos <FILENAME>


Command line utilities for manipulating data files

http://sourceforge.net/projects/gnuwin32/files/

>grep

finds lines matching ‘patterns’

>sed

swaps one pattern for another

>awk

operates on ‘fields’ in a record

>cat

>tail

>more/less

>unix2dos/dos2unix

AND THEN THERE ARE PIPES... ‘|’

AND REDIRECTS... ‘>’


grep

A typical fasta sequence file (Xenopus tropicalis mRNAs)

:

ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC

AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT

GCTCCGAGCAGT

>ENSXETG00000025789|ENSXETT00000019729|nodal2

ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC

ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG

AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG

ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT

CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG

GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA

TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT

>ENSXETG00000004295|ENSXETT00000014760|neurod4

AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA

TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC

AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA

AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT

TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG

AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA

GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG

TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT

GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA

>ENSXETG00000006455|ENSXETT00000014172|camk2g

AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA

ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA

GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG

CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC

AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC

ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT

GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT

:


grep

grep reads the input file line by line and reports on each line containing the pattern

F:dev\sql>grep ACGTACGT my-sequence-file.fasta

GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG

TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA

TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA

GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA

:

:

F:dev\sql>grep –c ACGTACGT my-sequence-file.fasta

76

F:dev\sql>grep nodal my-sequence-file.fasta

>ENSXETG00000025789|ENSXETT00000019729|nodal2

>ENSXETG00000009009|ENSXETT00000019730|nodal3

>ENSXETG00000025789|ENSXETT00000019728|nodal2

>ENSXETG00000009008|ENSXETT00000019726|nodal1

>ENSXETG00000017442|ENSXETT00000037932|nodal5.2

>ENSXETG00000016779|ENSXETT00000036596|nodal5

>ENSXETG00000016778|ENSXETT00000036593|nodal6

>ENSXETG00000023748|ENSXETT00000051228|nodal

F:dev\sql>grep –c “>” my-sequence-file.fasta

28937

F:dev\sql>grep “>” my-sequence-file.fasta > my-sequence-file-def-lines.txt


sed

stream editor: reads input line at a time, and operates on the line as requested – typically to make a substitution.

E.g. Replace groups of space with a TAB, etc.

F:dev\sql>grep nodal my-sequence-file.fasta > tmp.txt

F:dev\sql>more tmp.txt

>ENSXETG00000025789|ENSXETT00000019729|nodal2

>ENSXETG00000009009|ENSXETT00000019730|nodal3

>ENSXETG00000025789|ENSXETT00000019728|nodal2

>ENSXETG00000009008|ENSXETT00000019726|nodal1

>ENSXETG00000017442|ENSXETT00000037932|nodal5.2

>ENSXETG00000016779|ENSXETT00000036596|nodal5

>ENSXETG00000016778|ENSXETT00000036593|nodal6

>ENSXETG00000023748|ENSXETT00000051228|nodal

F:\dev\sql>sed "s/nodal/xnr/" tmp.txt

>ENSXETG00000025789|ENSXETT00000019729|xnr2

>ENSXETG00000009009|ENSXETT00000019730|xnr3

>ENSXETG00000025789|ENSXETT00000019728|xnr2

>ENSXETG00000009008|ENSXETT00000019726|xnr1

>ENSXETG00000017442|ENSXETT00000037932|xnr5.2

>ENSXETG00000016779|ENSXETT00000036596|xnr5

>ENSXETG00000016778|ENSXETT00000036593|xnr6

>ENSXETG00000023748|ENSXETT00000051228|xnr


sed

F:dev\sql>more tmp.txt

>ENSXETG00000025789|ENSXETT00000019729|nodal2

>ENSXETG00000009009|ENSXETT00000019730|nodal3

>ENSXETG00000025789|ENSXETT00000019728|nodal2

>ENSXETG00000009008|ENSXETT00000019726|nodal1

>ENSXETG00000017442|ENSXETT00000037932|nodal5.2

>ENSXETG00000016779|ENSXETT00000036596|nodal5

>ENSXETG00000016778|ENSXETT00000036593|nodal6

>ENSXETG00000023748|ENSXETT00000051228|nodal

F:\dev\sql>sed "s/0/_/" tmp.txt

>ENSXETG_0000025789|ENSXETT00000019729|nodal2

>ENSXETG_0000009009|ENSXETT00000019730|nodal3

>ENSXETG_0000025789|ENSXETT00000019728|nodal2

>ENSXETG_0000009008|ENSXETT00000019726|nodal1

>ENSXETG_0000017442|ENSXETT00000037932|nodal5.2

>ENSXETG_0000016779|ENSXETT00000036596|nodal5

>ENSXETG_0000016778|ENSXETT00000036593|nodal6

>ENSXETG_0000023748|ENSXETT00000051228|nodal

F:\dev\sql>sed "s/0/_/g" tmp.txt

>ENSXETG______25789|ENSXETT______19729|nodal2

>ENSXETG_______9__9|ENSXETT______1973_|nodal3

>ENSXETG______25789|ENSXETT______19728|nodal2

>ENSXETG_______9__8|ENSXETT______19726|nodal1

>ENSXETG______17442|ENSXETT______37932|nodal5.2

>ENSXETG______16779|ENSXETT______36596|nodal5

>ENSXETG______16778|ENSXETT______36593|nodal6

>ENSXETG______23748|ENSXETT______51228|nodal


The pipe

F:dev\sql>grep “>” my-sequence-file.fasta > tmp.txt

F:\dev\sql>sed "s/nodal/xnr/" tmp.txt

>ENSXETG00000025789|ENSXETT00000019729|xnr2

>ENSXETG00000009009|ENSXETT00000019730|xnr3

>ENSXETG00000025789|ENSXETT00000019728|xnr2

>ENSXETG00000009008|ENSXETT00000019726|xnr1

>ENSXETG00000017442|ENSXETT00000037932|xnr5.2

>ENSXETG00000016779|ENSXETT00000036596|xnr5

>ENSXETG00000016778|ENSXETT00000036593|xnr6

>ENSXETG00000023748|ENSXETT00000051228|xnr

F:dev\sql>grep “>” my-sequence-file.fasta|sed "s/nodal/xnr/"

>ENSXETG00000025789|ENSXETT00000019729|xnr2

>ENSXETG00000009009|ENSXETT00000019730|xnr3

>ENSXETG00000025789|ENSXETT00000019728|xnr2

>ENSXETG00000009008|ENSXETT00000019726|xnr1

>ENSXETG00000017442|ENSXETT00000037932|xnr5.2

>ENSXETG00000016779|ENSXETT00000036596|xnr5

>ENSXETG00000016778|ENSXETT00000036593|xnr6

>ENSXETG00000023748|ENSXETT00000051228|xnr


(n)awk

Awk: manipulates data in structured, tabular, format - reads input one line at a time, but operates on fields.

3 06/04/2011 06/04/2011 Ashleigh Howes OK [email protected] Immunoreg O'Garra

2 06/04/2011 06/04/2011 Leona Gabrysova OK [email protected] Immunoreg O'Garra

1 06/04/2011 06/04/2011 Eric Dang (ok) [email protected] Immunoreg O'Garra

07/04/2011 07/04/2011 Guilherme Neves OK [email protected] Molecular Neuro Pachnis

5 08/04/2011 08/04/2011 Mary Wu OK [email protected] Systems Biol Smith

24 08/04/2011 08/04/2011 Jose Saldanha OK [email protected] Math Biol Goldstein

8 08/04/2011 08/04/2011 Laurent Dupays OK [email protected] Dev Biol Mohun

7 08/04/2011 08/04/2011 Veronique Duboc OK [email protected] Dev Biol Logan

22 09/04/2011 09/04/2011 Madhu Kadekoppala OK [email protected] Parasitology Holder

21 09/04/2011 09/04/2011 George Gentsch OK [email protected] Systems Biol Smith

13 10/04/2011 11/04/2011 Siggi Sato OK [email protected] Parasitology Holder

10/04/2011 10/04/2011 James Briscoe OK [email protected] Dev Neurobiol Briscoe

26 10/04/2011 17/04/2011 Alex Watson OK [email protected] Systems Biol Smith

10/04/2011 Mustafa Khokha OK [email protected] YALE

F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt

Ashleigh [email protected]

Leona [email protected]

Eric [email protected]

Guilherme [email protected]

Mary [email protected]

:

F:dev\sql>nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /“

Ashleigh ahowes AT nimr.mrc.ac.uk

Leona lgabrys AT nimr.mrc.ac.uk

Eric edang AT nimr.mrc.ac.uk

Guilhermegneves AT nimr.mrc.ac.uk

Mary mwu AT nimr.mrc.ac.uk

:


Now let’s go and get some data...

Download the set of gene locus coordinates for your model organism from BioMart (Ensembl).


Catches and other platform specific stuff

Paths: folder dividers ‘/’ in mac/linux ‘\’ in windows

Quotes marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e. >sed ‘s/nodal/xnr/’ OR >sed “s/nodal/xnr/” or no quotes at all.

Unix/Linux is case sensitive: Hello != hello

^C (control-C) is your get out of jail free card!


Next week

Download and install MySQL…


WEEK TWO

Relational databases and the structured query language (SQL), and an introduction to MySQL.


What is a database?

A collection of data tables, or other structures, managed by a computer program and queryable by some sort of (usually) standard language.

We will look at relational databases where the data are help in strict column based tables, and queried with SQL – the structured query langauge. This is a standard across a wide range of products: MySQL, Oracle, Sybase, MS-SQLServer PostGres, etc.

Database server: the computer program that manages everything

Database: a defined area which contains a set of related data tables and other entities: rules, datatypes, etc.

Table: a single defined object with a fixed (by you) number of columns and any number of rows

Datatypes: out of which data structures are built, i.e. you must define the ‘type’ of your data: integer, float, varchar(), text, etc.


A table

course_applicants


Another table

research_roles


Creating a table

[this is SQL!]

create table course_applicants

(

date_applied datetime,

first_name varchar(32),

last_name varchar(32),

status varchar(4) null,

email varchar(32),

divison varchar(32),

lab varchar(32) null,

role_id integer

)


Creating a table

create table course_applicants

(

cap_date_applied datetime,

cap_first_name varchar(32),

cap_last_name varchar(32),

cap_status varchar(4) null,

cap_email varchar(32),

cap_divison varchar(32),

cap_lab varchar(32) null,

cap_role_id integer

)


Creating a table: at the prompt

F:\dev\sql>MySQL –U myId –P myPass

>use mydb

>go

>create table course_applicants ( date_applied datetime, etc…

>go

>(will now create table in the database)


Creating a table: in a script (better)

table_course_applicants.sql

use mydb

go

create table course_applicants

(

date_applied datetime,

first_name varchar(32),

last_name varchar(32),

status varchar(4) null,

email varchar(32),

divison varchar(32),

lab varchar(32) null,

role_id integer

)

go

F:\dev\sql>MySQL –U myId –P myPass < table_course_applicants.sql

F:\dev\sql>(table is now created)


Entering data into a table

insert_applicants.sql

use mydb

go

insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id )

select getdate(), ‘Mike’, ‘Gilchrist’, [email protected], ‘Sys Bio’, 8

insert course_applicants ( date_applied, first_name, last_name, email, divison, role_id )

select getdate(), ‘Mustafa’, ‘Khokha’, [email protected], ‘YALE’, 8

<etc.>

<etc.>

<etc.>

go

F:\dev\sql>MySQL –U myId –P myPass < insert_applicants.sql

F:\dev\sql>(data is now added to table)


Loading a table from a text data file

computing-course-applicants.txt

F:\dev\sql>MySQL –U myId –P myPass

>use mydb

>go

>LOAD DATA LOCAL FILE computing-course-applicants.txt

>INTO TABLE course_applicants

>go


Using SQL to query/analyse/alter the data

F:\dev\sql>MySQL –U myId –P myPass

>use mydb

>go

>select * from course_applicants

>go

>(outputs data to screen)

select * from course_applicants order by date_applied

select count(*) ‘applicants’, count(distinct division) ‘divisions’ from course_applicants

select lab, count(*) from course_applicants

group by lab order by count(*) desc

select lab, count(*) from course_applicants

group by lab having count(*) > 1

order by lab

select * from course_applicants where first_name like ‘J%’

select * from course_applicants where status != ‘OK’


Joining tables for more complex queries

select first_name, last_name, role

from course_applicants, research_roles

where course_applicants.role_id = research_roles.role_id (JOIN)

select first_name, last_name, role

from course_applicants a, research_roles r (table aliases)

where a.role_id = r.role_id (JOIN)

select cap_first_name, cap_last_name, rro_role

from course_applicants, research_roles

where cap_role_id = rro_role_id (JOIN)

select rro_role ‘role’, count(*) ‘on course’

from course_applicants, research_roles

where cap_role_id = rro_role_id (JOIN)

group by rro_role

order by rro_role


delete and update statements

/* CHANGE THE ROLE DESCRIPTION */

update research_roles set role = ‘CDF (Post Doc)’

where role = ‘Post Doc’

update research_roles set role = ‘CDF (Post Doc)’

where role_id = 2

/* REMOVE PIs */

delete course_applicants

where role_id = 8

delete course_applicants

from research_roles

where course_applicants.role_id = research_roles.role_id

and research_roles.role = ‘Principal Investigator’

:

and research_roles.role like ‘P% I%’


Catches

Never use sub-queries:

select ... from tableA a, tableB b

where a.id = b.id

and a.id in (select id from tableC where id like ‘NIMR%’ )

SQL can give you the wrong answer in a plausible way...

If your query is taking forever you have probably missed out a join condition.

You can always add rows of data easily, but adding columns is hard work, and means you will have to re-write code...

Long text data (e.g. sequences) can be tough to handle.


Exercises

Design a table, or tables, to hold personal details (names, etc) and contact information (email, phone numbers, etc.) for people in an organisation.

Download the set of gene loci and exon data for your organism from BioMart, design some tables for this data, load the data from your data files, and then query the data for the number of multi-exon vs single exon genes, and find the gene with the most number of exons, etc.


Next week

Download and install BLAST from NCBI

Prepare a two minute verbal presentation of you project


ad