# Problems on Strings and Strands

### Problems on Strings and Strands

• Find CGTC…CGTC, i.e., repeated sequence separated by specified number of nucleotides

• What about any four-letter sequence repeated?

• What about any four letters repeated?

• How do we do this?

• Find palindromic motif/sequence

• CTAGGATC or CTAGATC
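Both example motifs read the same forwards and backwards as plain strings. A minimal sketch of one way to find such motifs, assuming a fixed window length and a simple character-by-character comparison (the class and method names are illustrative, not from the slides):

```java
import java.util.ArrayList;
import java.util.List;

class PalindromeFinder {
    // true if s reads the same left-to-right and right-to-left
    static boolean isPalindrome(String s) {
        for (int i = 0, j = s.length() - 1; i < j; i++, j--) {
            if (s.charAt(i) != s.charAt(j)) {
                return false;
            }
        }
        return true;
    }

    // start indexes of every window of exactly len letters that is a palindrome
    static List<Integer> findPalindromes(String dna, int len) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + len <= dna.length(); i++) {
            if (isPalindrome(dna.substring(i, i + len))) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // CTAGGATC starts at index 2 of this made-up strand
        System.out.println(findPalindromes("AACTAGGATCTT", 8));
    }
}
```

This tests string palindromes only. A reverse-complement palindrome such as GAATTC would need a different check.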

### What's a regular expression?

• A way to specify a pattern of text

• A "machine" will recognize the pattern by matching it to a string

• The matching process is very efficient: it basically looks at every character once (can we do better?)

• Perl relies heavily on regexes; Java has a regular expression library (java.util.regex)

• We'll use it to do APT-like things

• Used in English to find "words"

• [aeiou]

• a.*b.*c.*

• [a]+

• ^a.*z$

• (...)(\1)
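The patterns above can be tried on English words with java.util.regex. This is a small sketch, not a definitive recipe: the helper `found` and the sample words are made up for illustration, and the last pattern is read as three dots followed by a back-reference, which is an assumption about the slide.

```java
import java.util.regex.Pattern;

class RegexWords {
    // true if the pattern matches somewhere inside the word
    static boolean found(String regex, String word) {
        return Pattern.compile(regex).matcher(word).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[aeiou]", "rhythm"));     // false: no vowel
        System.out.println(found("a.*b.*c.*", "abacus"));   // true: a, then b, then c
        System.out.println(found("[a]+", "banana"));        // true: at least one a
        System.out.println(found("^a.*z$", "abuzz"));       // true: starts with a, ends with z
        System.out.println(found("^a.*z$", "jazz"));        // false: does not start with a
        System.out.println(found("(...)(\\1)", "murmur"));  // true: "mur" twice in a row
    }
}
```

Here `find()` looks for the pattern anywhere in the word; the `^` and `$` anchors are what tie a pattern to the start and end. In Java source the backslash in `\1` has to be doubled.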

### Tagged regexp

• We identify part of a regexp by enclosing with parentheses

• (....)(xxxx)

• We can refer back to the tagged part by an escaped number, e.g., \1 or \2

• Entire regexp is usually represented by \0

• “Efficient” querying mechanism

• Who pays for efficiency?
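Tagged groups give one possible answer to the repeated-sequence problems. The sketch below assumes the DNA is a plain String of A, C, G, T; the class name, method names, and sample strands are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedRepeats {
    // CGTC, then exactly gap nucleotides, then CGTC again
    static boolean hasSpacedCgtc(String dna, int gap) {
        Pattern p = Pattern.compile("CGTC.{" + gap + "}CGTC");
        return p.matcher(dna).find();
    }

    // first four-letter sequence that occurs again later without overlapping,
    // or null if there is none; \1 refers back to the tagged group (....)
    static String firstRepeatedFour(String dna) {
        Matcher m = Pattern.compile("(....).*\\1").matcher(dna);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hasSpacedCgtc("AACGTCTTTCGTCAA", 3));
        System.out.println(firstRepeatedFour("ACGTTTACGT"));
    }
}
```

In Java the tagged parts are read with `m.group(1)`, `m.group(2)`, and so on, and `m.group(0)` returns the entire match.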

### Storing and Retrieving Values

• Read words from a file and store them somewhere

• Dictionary of words to analyze in English

• All dimers in a DNA sequence

• All occurrences of TATA in …

• All words that have 'e' before 'i'

• We don't want to print, we want to store, why?

• If we print…

• If we store …
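As a rough illustration of why storing beats printing, the sketch below collects every dimer of a strand in a java.util.ArrayList and then queries the stored values. The names `DimerStore`, `dimers`, and `count` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class DimerStore {
    // every two-letter substring (dimer) of dna, in order of appearance
    static List<String> dimers(String dna) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= dna.length(); i++) {
            result.add(dna.substring(i, i + 2));
        }
        return result;
    }

    // how many stored dimers equal target
    static int count(List<String> stored, String target) {
        int total = 0;
        for (String d : stored) {
            if (d.equals(target)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> stored = dimers("TATAG");
        System.out.println(stored);
        System.out.println(count(stored, "TA"));
    }
}
```

Had the dimers only been printed, a later question such as "how often does TA occur?" would mean reading the strand again.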

### Collections of String, int, Sequences,…

• We can create an array of String values, but we must know how big to make the array

    String[] list = new String[20]; // store how many?
    for (int k = 0; k < list.length; k++) {
        list[k] = "hello";
    }

• What if we're not sure how many objects to store?

• Create a "big" array and don't fill it: problems?

• Create one array; when it fills, create another

### Arrays that grow

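    // read file and store words
    // (sc is a java.util.Scanner already opened on the file)
    int count = 0;
    String[] list = new String[20];
    while (sc.hasNext()) {
        String word = sc.next();
        if (count == list.length) {
            // array is full: double its size and copy the old words over
            String[] copy = new String[count * 2];
            for (int k = 0; k < count; k++) {
                copy[k] = list[k];
            }
            list = copy;
        }
        // always store the word, including the one that triggered the growth
        list[count] = word;
        count++;
    }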

### Ask not what you can do…

• The ArrayList class is a growable array

• Add things to the end, it grows when needed

• Useful when we don't know how big to make things

• What's the cost of doubling?

• If it costs \$0.01 to create a place for a string and \$0.01 to copy a string, how much does it cost to start with an array of size 1 and then read and store 500 strings?
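One way to explore the cost question is to simulate it. The sketch below assumes a particular cost model that the slide does not spell out: 1 cent for every array slot created (including slots that stay empty) and 1 cent for every string copied during a resize, with nothing charged for placing a new string into a free slot. Costs are kept in whole cents to avoid rounding.

```java
class DoublingCost {
    // total cost in cents of storing n strings, starting from an array of size 1
    // assumed model: 1 cent per slot created, 1 cent per string copied on resize
    static int costInCents(int n) {
        int capacity = 1;
        int cost = capacity;        // creating the first one-slot array
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (count == capacity) {
                capacity *= 2;
                cost += capacity;   // slots in the new, doubled array
                cost += count;      // copying the strings already stored
            }
            count++;
        }
        return cost;
    }

    public static void main(String[] args) {
        System.out.println(costInCents(500));
    }
}
```

Under this model 500 strings cost 1534 cents, about 3 cents per string: the arrays created have sizes 1, 2, 4, ..., 512 (1023 slots) and the copies add up to 1 + 2 + ... + 256 = 511. A different cost model gives a different total, but the per-string cost stays bounded by a small constant.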

### What is Information? http://dictionary.com

• Knowledge derived from study, experience, or instruction.

• Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news.

• A collection of facts or data: statistical information.

• The act of informing or the condition of being informed; communication of knowledge: Safety instructions are provided for the information of our passengers.

• Computer Science. Processed, stored, or transmitted data.

• A numerical measure of the uncertainty of an experimental outcome.

• Law. A formal accusation of a crime made by a public officer rather than by grand jury indictment.

### Data on Audio CD, Human Genome

• (see howstuffworks.com)

• Sample at 44.1 kHz, 16 bits (2 bytes) per sample, in stereo:

• 44,100*2*2 = 176,400 bytes/second

• We need some overhead on a CD, so we get approx.

• 747 megabytes on one CD, or about 10 MB/minute, uncompressed

• Human Genome is approx. 3.2 Billion base-pairs

• One base pair is one of C, G, A, T, so we need 2 bits for it

• 6.4 billion bits is about 0.8 billion bytes, call it 1 GByte

• This excludes annotations, which are definitely needed

• Lessons from this comparison?
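The arithmetic on this slide can be checked in a few lines. The sketch below also packs a strand at 2 bits per base. The particular coding (A=0, C=1, G=2, T=3) and all names are assumptions made for illustration.

```java
class DataSizes {
    // bytes per second of uncompressed audio
    static long cdBytesPerSecond(int sampleRateHz, int bytesPerSample, int channels) {
        return (long) sampleRateHz * bytesPerSample * channels;
    }

    // bytes needed to store basePairs at 2 bits each, rounded down to whole bytes
    static long genomeBytes(long basePairs) {
        return basePairs * 2 / 8;
    }

    // pack a strand into 2 bits per base, four bases per byte
    // assumes dna contains only the letters A, C, G, T
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = "ACGT".indexOf(dna.charAt(i));   // 0..3
            packed[i / 4] |= code << (2 * (i % 4));
        }
        return packed;
    }

    public static void main(String[] args) {
        long perSecond = cdBytesPerSecond(44_100, 2, 2);
        System.out.println(perSecond + " bytes/second");
        System.out.println(perSecond * 60 + " bytes/minute");
        System.out.println(genomeBytes(3_200_000_000L) + " bytes for the genome");
        System.out.println(pack("ACGTACGT").length + " bytes for 8 bases");
    }
}
```

At 176,400 bytes/second one minute of audio is 10,584,000 bytes, and 3.2 billion base pairs fit in 800,000,000 bytes, which matches the rough figures above.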

### Craig Venter http://www.sfgate.com (September 19, 2004)

Venter is one of the most creative minds in biology, and one of the most entertaining. Supremely immodest, he surges forth like a force of nature. The possibility that this man might push over the brink into Frankenstein territory is part of the fascination.

"Is it possible that you don't know what you're doing?" I ask.

"Sure, it's possible, but people have said that before about me. And they've been wrong."

### The Scientist http://www.the-scientist.com/yr1995/july/tigr_950724.html

Noting your continuing interest in ethics and policy, what are your thoughts on the relationship between scientific and bioethical concerns surrounding genomic research?

First, I think the worst thing for science and society is to leave the, quote, ethical issues up to the professional ethicist, because while there are some very good ones that have their feet on the ground and things based in reality, a lot of discussions that I've heard at ethical meetings are totally hypothetical situations that have nothing to do with facts or reality or even probability.

I feel that the scientists doing the work - and I seem to have somewhat of a unique philosophy in this - and leading the discoveries should be some of the people most involved in the ethical discussions, not the ones sitting on the sideline, leaving it up to the professional ethicist.

### Sir John Sulston (Nobel 2002)

… I think the important purpose of science is to explore, discover and understand. I’m glad if I’ve been able to contribute a little to that process, and hugely grateful to all my colleagues, both here and elsewhere, for their achievements and for the fun. And I hope that we can apply our ever increasing knowledge wisely, for the good of all.

### From Wired Interview, June 2003

If The Common Thread can be said to have a villain, it's Craig Venter.

SULSTON: … If there's a villain, it's an economic system that allows individuals to control information that belongs to everyone. The genome sequence is a discovery, not an invention.

Could genetic patents be a necessary evil? Venter has called them the best way to get drug company researchers to tackle certain diseases.

Diseases are not neglected for lack of patients, but for a lack of markets. Malaria has a thicket of patients but no market, because the victims are mostly poor. Thus 90 percent of the world's disease burden receives 10 percent of the research effort. Commercial incentives offer no way of redressing this; global funds from governments and charities are necessary to do so.

### Craig Venter at Woods Hole

• http://www.mblwhoilibrary.org/services/lecture_series/venter/index.html

• What is Woods Hole? Who is David Baltimore?

• What is the role of computing and Computer Science?

• What is the difference between shotgun techniques?

• What is the reference to sailboats about? Is this important?

• What kinds of jokes do scientists laugh at?

• What does it mean to interpret genetic code?

• What is cDNA? What is RNA? What is DNA? What are ESTs?