Introduction to ACNUC: Querying and Building Biological Sequence Databases

Plan • Introduction • Querying sequence databases (60%) • Building your own sequence databases (30%) • Use of API (10%) • Going further

Introduction • History • A database system and a query tool • General principle of ACNUC • Access to programs and databases • Workshop progress

Introduction: History ACNUC is a database management system dedicated to biological sequences, in particular genomic sequences. • Its development began in 1980. • It serves both as a query tool and as a low-level layer for software development. • It remains the only software that lets the user transparently query the sub-sequences of the sequences present in the databases. • Recent developments with Stéphane Delmote make it possible to query the databases remotely through a socket server.

Introduction: Principle The general principle of ACNUC relies on indexing annotated sequence files (EMBL, GenBank, SwissProt, ...). The different annotation fields are indexed in index files (names, species, keywords, etc.) that are linked to one another through pointers.
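As a rough illustration of this principle, the sketch below builds small index tables from annotated entries and links them back to the sequences by record number. The entries, field names and dictionary layout are assumptions made for the example; they do not reproduce ACNUC's actual index file format.

```python
from collections import defaultdict

# Hypothetical annotated entries (stand-ins for EMBL/GenBank/SwissProt records).
entries = [
    {"name": "SEQ_A", "species": "mus musculus", "keywords": ["BTG1", "kinase"]},
    {"name": "SEQ_B", "species": "homo sapiens", "keywords": ["BTG1"]},
    {"name": "SEQ_C", "species": "mus musculus", "keywords": ["adenylate cyclase"]},
]


def build_indexes(records):
    """Return (names, species_index, keyword_index).

    names maps a record number to a sequence name; the two other indexes map
    a field value to the set of record numbers ("pointers") that carry it.
    """
    names = {}
    species_index = defaultdict(set)
    keyword_index = defaultdict(set)
    for number, record in enumerate(records):
        names[number] = record["name"]
        species_index[record["species"]].add(number)
        for keyword in record["keywords"]:
            keyword_index[keyword.lower()].add(number)
    return names, species_index, keyword_index


names, species_index, keyword_index = build_indexes(entries)


def lookup(index, value):
    """Follow the pointers stored for `value` back to sequence names."""
    return sorted(names[n] for n in index.get(value, set()))


print(lookup(species_index, "mus musculus"))
print(lookup(keyword_index, "btg1"))
```

A query on a species or a keyword then only reads the matching index entry instead of scanning every annotated record.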

Introduction: Access to programs and databases The programs, the databases and the documentation are available on the PBIL website: http://pbil.univ-lyon1.fr/

Introduction: Workshop progress Several exercises and example applications will be discussed during the workshop. This presentation and several scripts are available at: ftp://pbil.univ-lyon1.fr/pub/in2p3/formation_acnuc/ GENERAL DOCUMENTATION: http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html QUERY LANGUAGE DOCUMENTATION: http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE

Querying sequence databases • First steps with ‘QueryWin’ • The query language • simple queries • sequences and sub-sequences • complex queries • Data extraction • several formats • extracting specific parts of the sequences • Using ‘query’ • simple scripts • complex scripts • Using ‘seqinR’ • querying databases from R

First steps with QueryWin « QueryWin » works on all platforms: Unix/Linux, Mac, Windows. Two versions are available: • the « local version » works on local databases • the « client version » works on remote databases Available at PBIL: http://pbil.univ-lyon1.fr/software/query_win.html Documentation available at PBIL: http://pbil.univ-lyon1.fr/software/doclogi/docacnuc/acnucwin/acnwian/aquerywin.html

First steps with QueryWin • Launch Query_Win - Mac version: click on the application

First steps with QueryWin • Launch Query_Win - on the clusters (local version) • launch query_win on EMBL: >query_win embl

First steps with QueryWin • Launch Query_Win - on the clusters (local version) • launch query_win on EMBL: >query_win embl [Screenshot labels: command buttons; command window (query language)]

First steps with QueryWin • Two (non-exclusive) ways of querying the database: • using buttons and menus • using the query language Exercise 1: select mouse sequences in EMBL • Method 1: click on the button « select », then « species », and type « mus » in the window that opens. Choose the option « build query ». Have a look at the command window. Execute. Try again with the option « make list ».

First steps with QueryWin Exercise 1 (continued) • Method 2: type « sp=mus » in the command window. IMPORTANT: queries made with method 1 are displayed in the query language in the command window. This is an excellent way to learn the query language. From now on, try to answer the questions with the buttons and menus and observe how they are translated into the query language. Little by little, you may try to use the query language directly. One more thing: a « HELP » mode is available in Query_Win.

The query language: simple queries • All operations are possible with query_win (by clicking on buttons or using the query language) • Some simple examples: • retrieve a sequence by its name • retrieve a sequence by its accession number • retrieve sequences by species or taxon • retrieve sequences by keyword Other examples: • Which species is associated with this sequence? • Which keywords are associated with this sequence?

The query language: simple queries The ACNUC query language is described here: http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE Exercise 2: Query SwissProt Retrieve the sequences of the cat (Felis catus) using the buttons. Retrieve the sequences of the cat (Felis catus) using the query language. Compare the results. Exercise 2bis: Query SwissProt Retrieve the sequences with the taxonomic ID (TaxonID) of the genus Felis (tid=9682).

The query language: simple queries The ACNUC query language is described here: http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE Exercise 3: Query SwissProt Retrieve the sequences associated with the keyword « adenylate cyclase » using the buttons. Retrieve the sequences associated with the keyword « adenylate cyclase » using the query language. Check the different annotation fields. Where does « adenylate cyclase » appear? Do the same with GenBank.

The query language: simple queries The ACNUC query language is described here: http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE Exercise 4: Query GenBank Retrieve the sequences associated with the BTG1 gene. Check the different annotation fields. Where is the information on the gene? Do the same with SwissProt. Hint: the gene name is a keyword.

The query language: simple queries Use of the « wildcard »: @ To retrieve keywords beginning with « toto », search for toto@. Exercise 5: Retrieve the sequences associated with keywords beginning with BTG. Note: you may also use the wildcard for species and sequence names.
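To make the wildcard behaviour concrete, here is a minimal Python sketch that mimics « @ » matching on a list of keywords. It is only a model: the function names are hypothetical, and two points are assumptions rather than documented ACNUC behaviour, namely that « @ » stands for any run of characters (including none) and that matching ignores case.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern that uses '@' as wildcard into a compiled regex.

    Assumptions: '@' matches any run of characters, possibly empty, and
    matching is case-insensitive.
    """
    parts = (re.escape(part) for part in pattern.split("@"))
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


def match_keywords(pattern, keywords):
    """Return the keywords matched by the wildcard pattern, in input order."""
    regex = wildcard_to_regex(pattern)
    return [keyword for keyword in keywords if regex.match(keyword)]


keywords = ["BTG1", "BTG2", "btg3", "ABTG", "BT"]
print(match_keywords("BTG@", keywords))
```

With this model, « BTG@ » keeps BTG1, BTG2 and btg3 but rejects ABTG, because the wildcard only appears at the end of the pattern.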

The query language: sequences & sub-sequences One of the main strengths of ACNUC is the definition and use of sequences and sub-sequences.
ID   ESCOL3_3; SV 2; circular; genomic DNA; GRV; PRO; 5498450 BP.
XX
AC   BA000007_GR;
XX
blah blah blah
XX
CC   This Genome Reviews entry was created from entry BA000007.2 in the
CC   EMBL/Genbank/DDBJ databases on 03 March 2009.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5498450
FT                   /organism="GR Escherichia coli"
FT                   /strain="Sakai = O157:H7 = RIMD 0509952 = EHEC"
FT                   /mol_type="genomic DNA"
FT                   /chromosome="Chromosome"
FT                   /db_xref="taxon:386585"
FT   .5F1 5'ncr      1..189
FT                   /cds_name="ESCOL3_3.PE1 "
FT   .PE1 CDS        190..273
FT                   /codon_start=1
FT                   /gene_name="thrL"
FT                   /locus_tag="ECs0001"
FT                   /protein_id="BAB33424.1"
FT                   /transl_table=11
FT                   /translation="MKRISTTITTTITTTITITITTGNGAG"
FT   .3F1 3'ncr      274..353
FT                   /cds_name="ESCOL3_3.PE1 "
FT   misc_structure  215..328
FT                   /gene_name="Thr_leader"
FT                   /db_xref="Rfam:RF00506"
FT   .5F2 5'ncr      274..353
FT                   /cds_name="ESCOL3_3.PE2 "
FT   .PE2 CDS        354..2816
FT                   /codon_start=1
FT                   /gene_name="thrA"
FT                   /locus_tag="ECs0002"
FT                   /product="Aspartokinase I, homoserine dehydrogenase I "
FT                   /function="NADP or NADPH binding"
FT                   /function="amino acid binding"
FT                   biosynthetic process"
FT                   /protein_id="BAB33425.1"
FT                   /db_xref="GO:0004072"
FT                   /db_xref="UniProtKB/TrEMBL:Q8XA84"
FT                   /transl_table=11
etc etc
[Figure: 5' to 3' diagram of a sequence with its CDS, 5'ncr and 3'ncr sub-sequences numbered 1, 2 and 3]
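The way sub-sequences are defined by locations on their parent sequence can be sketched in a few lines of Python. The feature lines are taken from the ESCOL3_3 entry shown on this slide (whitespace simplified); the parser, the dictionary layout and the function names are hypothetical and only handle simple forward-strand ranges, so this is an illustration and not how ACNUC itself stores sub-sequences.

```python
import re

# Feature lines from the ESCOL3_3 entry (whitespace simplified).
FEATURE_LINES = """\
FT .5F1 5'ncr 1..189
FT .PE1 CDS 190..273
FT .3F1 3'ncr 274..353
FT .PE2 CDS 354..2816
"""

# Matches: FT .<suffix> <type> <start>..<end>
FEATURE_RE = re.compile(r"^FT\s+\.(\S+)\s+(\S+)\s+(\d+)\.\.(\d+)\s*$")


def parse_subsequences(text, parent="ESCOL3_3"):
    """Return one dict per sub-sequence found in the feature lines."""
    subsequences = []
    for line in text.splitlines():
        match = FEATURE_RE.match(line)
        if match:
            suffix, kind, start, end = match.groups()
            subsequences.append({
                "name": f"{parent}.{suffix}",
                "type": kind,
                "start": int(start),  # 1-based, inclusive
                "end": int(end),      # 1-based, inclusive
            })
    return subsequences


def extract(parent_sequence, sub):
    """Slice a sub-sequence out of its parent (forward-strand ranges only)."""
    return parent_sequence[sub["start"] - 1:sub["end"]]


subs = parse_subsequences(FEATURE_LINES)
cds = [s for s in subs if s["type"] == "CDS"]
print([(s["name"], s["end"] - s["start"] + 1) for s in cds])
```

Each sub-sequence keeps only a name, a type and coordinates; its residues are read from the parent sequence when needed.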

The query language: sequences & sub-sequences ACNUC defines sequences and sub-sequences. A sequence may contain many sub-sequences. For example, a chromosome is a sequence and its CDS are sub-sequences of it. Sub-sequences may be of several types. Exercise 6: Query HOGENOMDNA (complete genomes) Retrieve the sequences of Escherichia coli o157:h7 str. sakai. Question: what are these sequences? Retrieve the sub-sequences of chromosome ESCOL3_3. Question: of which type are these sequences? Retrieve the CDS of chromosome ESCOL3_3. Back to the sequence ESCOL3_3: check for the CDS in the annotations.

The query language: sequences & sub-sequences A sequence is associated with one species, and all its sub-sequences are associated with that species. This is not the case for keywords: a keyword may be associated with a sequence or with only one of its sub-sequences. Exercise 7: Query SwissProt Retrieve the sequences associated with the BTG1 gene. Do the same in GenBank. What are these sequences? Hint: the gene name is a keyword.

The query language: complex queries • Combinations of criteria: • Operations AND, OR, NOT, AND NOT • Use of parentheses • Crossing result lists: Exercise 8: Query SwissProt Retrieve mammalian sequences. Retrieve the sequences associated with BTG1. Cross these 2 lists: list1 AND list2. Retrieve the mammalian sequences associated with BTG1 in a single query. Retrieve the mammalian sequences associated with BTG1, BTG2, BTG3 and BTG4 in a single query. How many sequences do you obtain? Hint: be careful when combining OR and AND.
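Crossing lists behaves like set algebra, which the following Python sketch illustrates on made-up lists of sequence names (all names are hypothetical). It also shows why parentheses matter when OR and AND are mixed. How ACNUC groups operators when no parentheses are given is not stated here, so both groupings are written out explicitly.

```python
# Hypothetical result lists, as if returned by three separate queries.
mammals = {"BTG1_HUMAN", "BTG1_MOUSE", "BTG2_HUMAN", "ACTB_HUMAN"}
btg1 = {"BTG1_HUMAN", "BTG1_MOUSE", "BTG1_CHICK"}
btg2 = {"BTG2_HUMAN", "BTG2_XENLA"}

# list1 AND list2: sequences present in both lists.
mammalian_btg1 = mammals & btg1

# Two possible groupings of "mammals AND btg1 OR btg2".
and_first = (mammals & btg1) | btg2  # also keeps non-mammalian BTG2 sequences
or_first = mammals & (btg1 | btg2)   # mammalian sequences of either gene

# list1 AND NOT list2: mammalian sequences that are not BTG1.
mammals_not_btg1 = mammals - btg1

print(sorted(mammalian_btg1))
print(sorted(and_first))
print(sorted(or_first))
print(sorted(mammals_not_btg1))
```

Only the second grouping answers the question "mammalian sequences associated with BTG1 or BTG2"; the first one lets a non-mammalian sequence through.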

The query language: complex queries Other criteria: • year of publication, e.g. y<1986 • author of publication: au=marley • journal: likewise • molecule: m=mRNA • organelle: o=MITOCHONDRION • type: t=CDS • host: h=homo sapiens • status (not for GenBank): st=EST

The query language: complex queries Modify a list of sequences according to sequence dates or sequence lengths. Exercise 10: Query SwissProt Retrieve the sequences from mus. Select the sequences with more than 300 aa. Select the sequences that were added after the year 2000.

The query language: complex queries Exercise 11: Query SwissProt In which species is BTG1 found in the sequence annotations? (This does not mean that other species lack this gene.) Solution: retrieve the sequences associated with the gene, then retrieve the species associated with these sequences. Exercise 11bis: Do the same in one command line. Hint: projection onto species, ps. Exercise 12: Retrieve the names of all the strains of E. coli found in EMBL. Exercise 12bis: Retrieve the list of eukaryotes in HOGENOMDNA. Retrieve the list of fungi.

The query language: browsing taxonomy and keywords Both the taxonomy and the keywords are organised in a hierarchy. It is possible to browse these hierarchies with the « browse » button of Query_win. A keyword may have a « parent ». For example, EC numbers are keywords that all descend from the keyword « EC_Number ». This is very useful for sorting and selecting keywords. You may select a parent keyword in Query_Win by selecting the button « by name », then entering the word and clicking « exec » then « done ».
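Selecting every keyword below a parent amounts to walking a tree. The Python sketch below does this on a tiny, made-up fragment of a keyword hierarchy; the parent/child links, the `CHILDREN` table and the function name are assumptions for illustration, and the walk assumes the hierarchy contains no cycles.

```python
# Hypothetical parent -> children links between keywords.
CHILDREN = {
    "EC_NUMBER": ["1.-.-.-", "2.-.-.-"],
    "1.-.-.-": ["1.1.1.1", "1.1.1.3"],
    "2.-.-.-": ["2.7.2.4"],
}


def descendants(keyword, children=CHILDREN):
    """Return every keyword below `keyword`, in depth-first order."""
    found = []
    # The stack holds keywords still to visit; reversing keeps listing order.
    stack = list(reversed(children.get(keyword, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found


print(descendants("EC_NUMBER"))
print(len(descendants("1.-.-.-")))
```

Intersecting such a list of descendants with the keywords attached to a set of sequences is one way to keep only the keywords of interest.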

The query language: browsing taxonomy and keywords Exercise 13: Query SWISSPROT Retrieve all the keywords associated with human. There are too many keywords! We only want EC numbers: retrieve the keywords descending from « EC_NUMBERS ». How many are there? Exercise 13bis: Retrieve the EC_NUMBERS associated with human. Vocabulary: pk list, kd list (nk=)

The query language: complex queries Vocabulary: fk file, un list, ps list Using files: you may use files containing: • sequence names • sequence accession numbers • keywords • species Exercise 14: In UniProt, retrieve the human EC numbers from the file created in Exercise 13bis. Which mouse sequences are associated with these EC numbers?

The query language: scanning annotations It is possible to scan the annotations. This is useful if the word to search for is not indexed and if the list of sequences to scan is not too large.

Data extraction: several formats Exercise 15: Query HOGENOMDNA Select the yeast sequences (saccharomyces). Extract the chromosome sequences in FASTA format. Extract the CDS sequences, translated into protein, in FASTA format.

Data extraction: extracting parts of sequences Exercise 16: In HOGENOMDNA Select the yeast sequences (saccharomyces). Extract the CDS sequences in FASTA format. Extract the CDS sequences in EMBL format. Extract the 5’ non-coding sequences in FASTA format. Extract the first 1000 residues of each chromosome in FASTA format. Extract the 500 residues preceding each CDS in FASTA format.
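The last two extractions of Exercise 16 come down to slicing a sequence by coordinates and writing the result as FASTA. The following Python sketch shows the arithmetic on a toy chromosome; the function names and the toy sequence are hypothetical, coordinates are taken as 1-based, and only forward-strand CDS are handled.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + lines) + "\n"


def first_residues(sequence, count=1000):
    """Return the first `count` residues (fewer if the sequence is shorter)."""
    return sequence[:count]


def upstream(sequence, cds_start, length=500):
    """Return up to `length` residues preceding a forward-strand CDS.

    `cds_start` is the 1-based position of the first residue of the CDS.
    """
    begin = max(0, cds_start - 1 - length)
    return sequence[begin:cds_start - 1]


# Toy chromosome: 15 residues of flanking sequence, then a 9-residue CDS.
chromosome = "A" * 10 + "C" * 5 + "ATGAAATAA"
print(to_fasta("toy_upstream", upstream(chromosome, cds_start=16, length=5)), end="")
print(to_fasta("toy_start", first_residues(chromosome, 4)), end="")
```

When fewer residues than requested are available before the CDS, the sketch simply returns what exists rather than padding.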

Use of query • « query » is the command-line version of query_win • Its value lies in the possibility of using scripts. • This helps automate the processing, which is very useful in the following cases: • long series of queries that are tedious to retype each time: fewer errors, time saved • use of workflows • generic scripts reused for different purposes • use on clusters and farms.

Use of query: launching As with Query_Win, two versions are available: • local version (installed on pbil, pbil-dev and the pbil-debX workers) • client version (queries remote databases) Both are available for Linux/Unix, MacOS and Windows. Local version: query embl >query embl Client version: queryr embl >raa_query then choose the database, or directly: >raa_query pbil.univ-lyon1.fr:5558/embl

Use of query: instructions • « query » uses the same query language as query_win. • However, there are small differences, especially in the management of lists. • Do not hesitate to consult the help by typing HELP. Exercise 17: Query HOGENOMDNA (complete genomes) Retrieve the sequences of Escherichia coli o157:h7 str. sakai. Retrieve the sub-sequences of chromosome ESCOL3_3. Retrieve the CDS of chromosome ESCOL3_3.

Use of query: instructions Solution to Exercise 17: Query HOGENOMDNA (complete genomes) Retrieve the sequences of Escherichia coli o157:h7 str. sakai. Retrieve the sub-sequences of chromosome ESCOL3_3. Retrieve the CDS of chromosome ESCOL3_3. Save the CDS. Commands typed, each followed by its meaning:
    query hogenomdna
    sel                                       select a list (default: list1)
    sp=Escherichia coli O157:h7 str. sakai    selection criterion
    mod                                       modify a list
    list1                                     list to be modified
    5                                         type of modification
    sel                                       select a new list (default: list3)
    n=ESCOL3_3 et t=cds                       selection criterion
    save                                      save a list
    list3                                     list to be saved
    list_cds                                  file
    stop                                      exit query

Use of query: use of scripts • A script is used as follows:
query database << EOF
instructions
instructions
instructions
instructions
EOF
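The same idea, a list of instructions fed to the standard input of the program, can be sketched from Python. This is only an assumption-laden illustration: the helper names are hypothetical, the program name « query » and its reading of instructions from standard input are taken from the here-document form of this slide, and the call is skipped when the program is not installed.

```python
import shutil
import subprocess


def build_instructions(lines):
    """Join instructions into the text that sits between the EOF markers."""
    return "\n".join(lines) + "\n"


def run_query(database, lines, program="query"):
    """Feed the instructions to the program's standard input.

    Returns the program's output, or None when the program is not installed.
    """
    if shutil.which(program) is None:
        return None
    result = subprocess.run(
        [program, database],
        input=build_instructions(lines),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Instructions borrowed from the first steps of Exercise 17.
instructions = ["sel", "sp=Escherichia coli O157:h7 str. sakai", "stop"]
print(build_instructions(instructions), end="")
```

Keeping the instructions in a list makes it easy to generate them from arguments, which is what the csh scripts of the following slides do with shell variables.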

Use of query: use of scripts Exercise 18: Run the preceding exercise with a script and, in addition, extract the CDS in FASTA format. source exemple_script_1.csh or csh exemple_script_1.csh [Screenshot label: terminal no]

Use of query: use of scripts Exercise 19: source exemple_script_2.csh or csh exemple_script_2.csh This script selects the homologous gene families (HOGENOM families) shared by plants and cyanobacteria but not by animals. The Arabidopsis CDS present in these families are saved and extracted in FASTA format. sel/l=plant: giving a name to the list makes the script easier to write and to understand.

Use of query use of scripts Use of a script with arguments csh exemple_script_3_bis.csh viridiplantae cyanobacteria metazoa

Use of seqinR It is possible to query ACNUC databases from the R software. Use the seqinR package Exercise 17ter with R: Query HOGENOMDNA (complete genomes) Retrieve the CDS of Escherichia coli o157:h7 str. sakai Plot the histogram of CDS lengths

Use of seqinR Solution to Exercise 17ter
    install.packages("seqinr")   # only needed once
    library(seqinr)
    choosebank("hogenomdna")     # open the remote ACNUC database
    cds <- query("cds", "sp=Escherichia coli o157:h7 str. sakai et t=cds")
    lengths <- lapply(cds$req, getLength)
    hist(unlist(lengths))

Build your own ACNUC database Why? • To store and access sequences of interest: • selection and modification of a subset of a general-purpose database • sequencing • To allow complex queries • To create your own keywords and the associated hierarchy • To automate queries • To share and distribute data

Build your own ACNUC database How to select a local database: the indexes are in /ma_banque/index and the flat files are in /ma_banque/flat_files. Define the environment variables acnuc and gcgacnuc
    setenv mabase "/ma_banque/index /ma_banque/flat_files"
    query mabase

Build your own ACNUC database Build a database from annotated data Exercise 20: build a database in SWISSPROT format with the script build_uniprot.csh • initf: creates the indexes • acnucgener: indexes the sequences Documentation: http://pbil.univ-lyon1.fr/databases/acnuc/acnuc_gestion.html

Build your own ACNUC database Build a database from annotated data Exercise 21: build a database in EMBL format with the script build_embl.csh • initf: creates the indexes • acnucgener: indexes the sequences Documentation: http://pbil.univ-lyon1.fr/databases/acnuc/acnuc_gestion.html

Build your own ACNUC database Defining new keywords (EMBL/GenBank only) By default, many fields are used to define keywords. However, it is possible to specify additional fields to define keywords. Example: search for the keyword HBG298754 in the previously created EMBL database. The keyword is not found. However, the field /gene_family="HBG298754" exists (cf. sequence ECODH_1.PE2). Exercise 22: Rebuild the database with build_embl_customized.csh, then query for the keyword again.

Build your own ACNUC database Defining new keywords (EMBL/GenBank only) Use the file « custom_policy », which should be in the directory $acnuc (index).
File custom_policy:
    Qualifier = GENE_FAMILY
    Use_Value = True
    Parent_Keyword = GENE_FAMILY
    Qualifier = DB_XREF
    Use_Value = True
    Parent_Keyword = CROSS REFERENCES
    Qualifier = PROTEIN_ID
    Use_Value = True
    Parent_Keyword = PROTEIN IDS
    Qualifier = %(C+G)
    Use_Value = True
    Parent_Keyword = CG_CONTENTS
    Qualifier = LOCUS_TAG
    Use_Value = True
    Parent_Keyword = LOCUS_TAG
ECODH_1.PE2 Location/Qualifiers (length=2463 bp)
FT   CDS             337..2799
FT                   /codon_start=1
FT                   /gene_family="HBG298754"
FT                   /evidence="4: Predicted"
FT                   /gene_id="IGI03726849"
FT                   /gene_name="thrA"
FT                   /locus_tag="ECDH10B_0002"
FT                   /product="Fused aspartokinase I and homoserine
FT                   dehydrogenase I"
FT                   /function="NADP or NADPH binding"
FT                   /function="amino acid binding"
FT                   /function="homoserine dehydrogenase activity"
FT                   /biological_process="aspartate family amino acid
FT                   biosynthetic process"
FT                   /protein_id="ACB01207.1"
FT                   /db_xref="GO:0004072"
FT                   /db_xref="InterPro:IPR001048"
FT                   /db_xref="UniProtKB/TrEMBL:B1XBC7"
FT                   /transl_table=11
FT                   /%(C+G)="CG<60%"
FT                   /note="C+G content in third codon positions = 57.6 % "
//
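The structure of « custom_policy » (groups of three « key = value » lines, each group starting with Qualifier) can be read with a few lines of Python. This sketch is an illustration of the file layout only: the function names are hypothetical, and the interpretation that the qualifier value becomes a keyword under Parent_Keyword when Use_Value is True, as well as the case-insensitive match between the policy and the /qualifier names, are assumptions rather than documented ACNUC behaviour.

```python
# First two rules of the custom_policy file shown on this slide.
POLICY_TEXT = """\
Qualifier = GENE_FAMILY
Use_Value = True
Parent_Keyword = GENE_FAMILY
Qualifier = DB_XREF
Use_Value = True
Parent_Keyword = CROSS REFERENCES
"""


def parse_policy(text):
    """Group 'key = value' lines into one dict per Qualifier entry."""
    rules = []
    for raw in text.splitlines():
        if "=" not in raw:
            continue
        key, value = (part.strip() for part in raw.split("=", 1))
        if key == "Qualifier":
            rules.append({"Qualifier": value})  # a new rule starts here
        elif rules:
            rules[-1][key] = value
    return rules


def keyword_for(qualifier, value, rules):
    """Return (keyword, parent keyword) for a qualifier, or None if no rule applies.

    Assumption: qualifier names are compared case-insensitively.
    """
    for rule in rules:
        if rule["Qualifier"].lower() == qualifier.lower():
            if rule.get("Use_Value") == "True":
                return value, rule.get("Parent_Keyword")
    return None


rules = parse_policy(POLICY_TEXT)
print(keyword_for("gene_family", "HBG298754", rules))
print(keyword_for("note", "some text", rules))
```

Under these assumptions, the qualifier /gene_family="HBG298754" of ECODH_1.PE2 would yield the keyword HBG298754 below the parent GENE_FAMILY, whereas /note is left alone because no rule names it.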

Build your own ACNUC database Enrich annotations and create keywords You may enrich the annotations with suitable keywords. For example, the following lines FT /gene_family="HBG298754" FT /%(C+G)="CG<60%" FT /note="C+G content in third codon positions = 57.6 % " have been added so that the database can be queried by GC content or by gene family.

Build your own ACNUC database Enrich annotations and create keywords Exercise 23: Modify custom_policy to generate different keywords. Two examples: custom_qualifier_policy.hogenom, custom_qualifier_policy.tp Going further: modify the annotations and create an associated custom_qualifier_policy file.
