Component Review

Component Review: Sequence Retriever

DNA Retriever
• Retrieves up to 2k of sequence on either side of the marker (-2k to +2k)
• Input: Marker
• Output: Sequence from -2k to +2k

DNA cont’d …
• populateSequenceCache()
• Reads the file All.NC.-2k+2k.txt from: http://amdec-bioinfo.cu-genome.org/html/caWorkBench/data
• Who is curating this file? Do we want to be the curators? Can’t we pull the data from a known repository (e.g. UniGene)?

DNA cont’d…
• The only organism supported is Homo sapiens. Again, this is because the file All.NC.-2k+2k.txt is specific to Homo sapiens. Sample record:

>1000_at|MAPK3|-|chr16 30032926 30042039|mRNA:NM_002746|TSS:30042039|Homo sapiens
tccagctgctggggtcaagcaatcagcctgcctcagcctcccaaagtgct
ggaatcacagaggtgagccaccacgaccagccagcatcattttccagtta
aatgttaatagtatctacttcatatggctgctgtgaggattacaggaggt
agtgtacataaaagtgcctggcaAGCCAGGTGccgtaatctcagcactct
gggaggccaatgtgggtggatcacctgaggtcaagagtttgagaccagcc
tggccaacatggtgaaacctcgtctctactaaaaatacaaaaattagccg
ggcgtggtggctggcgcctgtaatcctagctacttgggaggctgaggtgg
gagaatgacttgaacccaggaggtaggggttgcagtgagcttgcactcct
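
The header line of the sample record looks pipe-delimited: probe ID, gene symbol, strand, chromosome with two coordinates, mRNA accession, TSS position, organism. The sketch below parses one such header, assuming every record in the file follows the layout of this single sample (the file format is not documented in this review). The class name PromoterHeader and all field names are hypothetical, and what the two coordinates denote is a guess.

```java
/** Hypothetical value object for one header line of All.NC.-2k+2k.txt. */
final class PromoterHeader {
    final String probeId;
    final String geneSymbol;
    final String strand;
    final String chromosome;
    final long start;           // first coordinate after the chromosome name (meaning assumed)
    final long end;             // second coordinate after the chromosome name (meaning assumed)
    final String mrnaAccession;
    final long tss;
    final String organism;

    private PromoterHeader(String[] f, String[] locus) {
        probeId = f[0];
        geneSymbol = f[1];
        strand = f[2];
        chromosome = locus[0];
        start = Long.parseLong(locus[1]);
        end = Long.parseLong(locus[2]);
        mrnaAccession = stripPrefix(f[4], "mRNA:");
        tss = Long.parseLong(stripPrefix(f[5], "TSS:"));
        organism = f[6].trim();
    }

    /** Parses a header such as ">1000_at|MAPK3|-|chr16 30032926 30042039|...". */
    static PromoterHeader parse(String headerLine) {
        if (headerLine == null || !headerLine.startsWith(">")) {
            throw new IllegalArgumentException("not a FASTA header: " + headerLine);
        }
        // Drop the leading '>' and split on the pipe; -1 keeps empty trailing fields.
        String[] fields = headerLine.substring(1).split("\\|", -1);
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + fields.length);
        }
        String[] locus = fields[3].trim().split("\\s+");
        if (locus.length != 3) {
            throw new IllegalArgumentException("expected 'chr start end', got: " + fields[3]);
        }
        return new PromoterHeader(fields, locus);
    }

    private static String stripPrefix(String value, String prefix) {
        String v = value.trim();
        return v.startsWith(prefix) ? v.substring(prefix.length()) : v;
    }
}
```

Parsing the header into named fields would also make the organism explicit per record, which is a prerequisite for supporting more than one organism.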

DNA cont’d…
• getCachedPromoterSequence(..)
    if (sequence != null) {
        return sequence.getSubSequence(UPSTREAM - upstream - 1,
                sequence.length() - DOWNSTREAM + fromStart - 1);
    }
• The maximum and minimum upstream and downstream regions are hard-coded to 2K (this is based on the file All.NC.-2k+2k.txt)

Protein Retriever
• Given a marker, find its protein sequences (there can be more than one).
• For each protein ID, make a SOAP call to http://www.ebi.ac.uk/ws/services/Dbfetch and retrieve the sequence.

Protein cont’d …
• getAffyProteinSequences(..)
• Input = affyId, bufferedWriter
• For each UniProt ID, get the sequence. The results are stored in a String[] and are not filtered.
    for (int count = 0; count < result.length; count++) {
        result[count] = result[count].replaceAll(">", ">" + affyid + "|");
        br.write(result[count]); // need to write to a sequenceDB.
        br.newLine();
    }

Proposed Fixes
• DNA
  • Investigate getting the data from a public repository.
  • Parameterize the UPSTREAM and DOWNSTREAM min and max values instead of hardcoding them (2000), and check for out-of-bounds requests in the code.
  • Support more than one organism.
  • In the interim, if we do plan on using the file All.NC.-2k+2k.txt, do we need to fetch it over the wire, since it is ours?
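
The parameterization suggested above could look roughly like the sketch below. It is not the existing getCachedPromoterSequence logic: it uses its own zero-based, half-open indexing on a plain String rather than the offsets in the current code, and it assumes the cached record holds the upstream flank immediately followed by the downstream flank. The class PromoterWindow and its method names are hypothetical.

```java
/** Hypothetical helper that slices a cached flank with configurable limits. */
final class PromoterWindow {
    private final int maxUpstream;   // bases cached before the reference point
    private final int maxDownstream; // bases cached after the reference point

    PromoterWindow(int maxUpstream, int maxDownstream) {
        if (maxUpstream < 0 || maxDownstream < 0) {
            throw new IllegalArgumentException("limits must be non-negative");
        }
        this.maxUpstream = maxUpstream;
        this.maxDownstream = maxDownstream;
    }

    /**
     * Returns the requested window from a cached sequence.
     * Assumed layout: [maxUpstream bases][maxDownstream bases].
     */
    String slice(String cached, int upstream, int downstream) {
        if (upstream < 0 || upstream > maxUpstream) {
            throw new IllegalArgumentException(
                    "upstream must be in [0, " + maxUpstream + "], got " + upstream);
        }
        if (downstream < 0 || downstream > maxDownstream) {
            throw new IllegalArgumentException(
                    "downstream must be in [0, " + maxDownstream + "], got " + downstream);
        }
        int start = maxUpstream - upstream;   // inclusive
        int end = maxUpstream + downstream;   // exclusive
        if (end > cached.length()) {
            throw new IllegalArgumentException(
                    "cached sequence is shorter than the requested window");
        }
        return cached.substring(start, end);
    }
}
```

With the limits passed to the constructor, a file with different flank sizes only needs a different pair of values, and an out-of-range request fails with a clear message instead of an index error deeper in the code.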

Proposed Fixes cont’d …
• Protein
  • Filter the results so that invalid entries do not end up in the result set. Otherwise an invalid entry can end in an exception (stack trace omitted here because it is too verbose).
• Code Style
  • Refactor “dead logic” (e.g. line 100 of PromoterSequenceFetcher).
  • Code reuse: the DNA and protein sequence retrievers solve similar tasks, and the code should reflect this by sharing logic. Something as simple as an abstract BaseSequenceRetriever class, with an abstract method such as connectToResource(Url url) implemented in the extending classes, could help here.
  • Hardcoded strings should be removed and put into a properties file. Can do something like:
      ResourceBundle rb = ResourceBundle.getBundle("sequenceRetriever");
      String upstreamRegion = rb.getString("upstream.region");
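
One possible shape for the shared base class is sketched below. It departs from the connectToResource(Url url) signature mentioned above by taking a marker ID and returning a java.io.Reader, purely so the sketch can run without network access; the retrieve method and the InMemorySequenceRetriever subclass are hypothetical stand-ins, not existing code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class holding the logic both retrievers share. */
abstract class BaseSequenceRetriever {

    /** Each subclass opens its own resource (file, HTTP, SOAP) for a marker. */
    protected abstract Reader connectToResource(String markerId) throws IOException;

    /** Shared step: read the resource and return its non-blank lines. */
    public final List<String> retrieve(String markerId) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(connectToResource(markerId))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}

/** Stand-in subclass that serves canned text instead of a remote resource. */
final class InMemorySequenceRetriever extends BaseSequenceRetriever {
    private final String payload;

    InMemorySequenceRetriever(String payload) {
        this.payload = payload;
    }

    @Override
    protected Reader connectToResource(String markerId) {
        return new StringReader(payload);
    }
}
```

A DNA subclass would open the promoter file and a protein subclass would wrap the Dbfetch call; reading, filtering and error handling would then live in one place.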

Proposed Fixes cont’d …
• Use standard Javadoc comments.
• Use commons logging. Do not do things like:
    if (result.length == 0) {
        System.out.println("hmm...something wrong :-(\n");
    }
  This should be something like:
    Log log = LogFactory.getLog(Classname.class);
    log.debug("no results: " + result.length);
• Optimize loops like the one below. In this case only the FASTA header entry will contain ">", so there is no need to run the replacement on every entry of result[]; for large arrays this is a potential performance hit:
    for (int count = 0; count < result.length; count++) {
        result[count] = result[count].replaceAll(">", ">" + affyid + "|");
        br.write(result[count]); // need to write to a sequenceDB.
        br.newLine();
    }
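
A minimal sketch of a cheaper loop follows, assuming each entry of result[] is a single line and that header lines are the ones starting with ">". Every entry still has to be written out, so the loop itself stays; what goes away is the regex-based replaceAll on lines that cannot match. Note that, unlike replaceAll, this only rewrites a leading ">". The class FastaHeaderTagger is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that prefixes FASTA headers with the Affy ID. */
final class FastaHeaderTagger {

    static List<String> tagHeaders(String[] lines, String affyId) {
        List<String> out = new ArrayList<>(lines.length);
        for (String line : lines) {
            if (line.startsWith(">")) {
                // Header line: insert "<affyId>|" right after the '>'.
                out.add(">" + affyId + "|" + line.substring(1));
            } else {
                // Sequence line: pass through untouched, no regex involved.
                out.add(line);
            }
        }
        return out;
    }
}
```

The caller can then write the returned lines with the existing BufferedWriter, or hand them to a sequenceDB as the original comment suggests.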

Proposed Fixes cont’d …
• Other
  • Need a way to kill the process of retrieving sequences (ProgressBarMonitor?)
  • Display the names of protein sequences (we currently display only the Affy ID for protein sequences)
  • Populate a drop-down with the databases we can search.
  • Add hyperlinks to information about the protein/DNA sequence.
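
One simple way to make retrieval stoppable is a cancel flag that the loop checks between markers, sketched below. This is only an illustration of the idea: CancellableRetrieval and the fetcher callback are hypothetical, and how the flag would be wired to a progress dialog is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

/** Hypothetical retrieval loop that can be stopped between markers. */
final class CancellableRetrieval {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    /** Safe to call from another thread, e.g. a Cancel button handler. */
    void cancel() {
        cancelled.set(true);
    }

    /** Fetches one result per marker until done or cancelled. */
    List<String> retrieveAll(List<String> markerIds, Function<String, String> fetcher) {
        List<String> results = new ArrayList<>();
        for (String id : markerIds) {
            if (cancelled.get()) {
                break; // stop before starting the next fetch
            }
            results.add(fetcher.apply(id));
        }
        return results;
    }
}
```

The flag is only checked between fetches, so a single slow remote call still runs to completion; interrupting a call in flight would need timeouts on the underlying connection.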
