1 / 1

Haha  SSAHA

-SSAHA stores individual nucleotide base information as 2-bits of information per base: A is stored as 00 C is stored as 01 G is stored as 10 T is stored as 10 -This allows for fast and efficient storage and processing, using less memory and less processor power

chin
Download Presentation

Haha  SSAHA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. -SSAHA stores individual nucleotide base information as 2-bits of information per base: A is stored as 00 C is stored as 01 G is stored as 10 T is stored as 10 -This allows for fast and efficient storage and processing, using less memory and less processor power -However, the drawback is that only 4 characters are possible, and sometimes real DNA data has gaps and ambiguous characters -Other search engines treat these ambiguities as separate letters, like R or N (N means it could be any base) -SSAHA reverts all ambiguities to A, since AAAAA… is the most common ‘word’ in the genome, and the search can easily recognize it as uninformative ACT N N GCA 0001100000100100 OPTIONALLOGO HERE OPTIONALLOGO HERE Haha  SSAHA Kelvin Gu, Tiffany Lin, Nick Altemose, Kevin TaoDuke University, Trinity College of Arts and Sciences Introduction How Does a Hash Table Work? APT Sample Search -SSAHA is a search method with one goal in mind: speed. -SSAHA likes to keep things organized, like an old lady who puts everything in labeled Tupperware, which she then organizes into labeled drawers. -SSAHA takes a database of data, like the 3 Gigabase human genome reference sequence, and organizes it into a fast searchable index, known as a hash table. -The hash table is like a dictionary to the English language, or an index to an encyclopedia. -You only have to make a hash table once for a given database, kind of like you only have to tidy your room up once, then you can find things extremely easily thereafter. Ssaha uses hash tables, which is what makes it fast. It’s like a lookup table: phonebook, or bookshelf, basically insert in a key (piece of info) and get info related to it. This article explains hash tables. Hash tables are 2 dimensional tables represented by keys and values. Hash tables take information, assigns a key to it, and puts it in a certain location. The location is determined by hash-code or hashing function or mapping function or randomization technique, or key-to-address transform. A major problem with hash tables is collisions, because there are often more identifiers than addresses. Two pieces of information share the same address in the hash table. Luckily, we have solutions to collision, which are: 1) make a list of all data that all have the same home address, and then when the address is looked up, look through the list to find all the synonyms. 2) if the first address is already occupied, generate a new address, and put the data there. 3) each address is a bucket, with a capacity to hold many items. 4) use a function that does not make repeats. Hash tables are inflexible, if you want to increase the size, all hash values must be recalculated. Do a pictorial example of a SSAHA search. Given a collection of strands, and a target string, re-arrange the strands in the collection so that they are in order based on how they compare to the target string.  The measurement of similarity is the percentage of base pairs of the shorter string that match up with the longer string.  The shorter string(pattern string) is shifted left and right to find the best match, and the similarity is measured there.  The percentage is calculated based on the length of the shorter string.When two strings have the same percentage similarity, the longer string should come first.When two strings have the same percentage similarity, and length, alphabetical order should take precedence. Definition Class: StringMatcher Method: orderList Parameters: String, String[] Returns: String[] Method signature: String[] orderList(String target, String[] patterns) (be sure your method is public) Class public class StringMatcher {     public String[] orderList(String target, String[] patterns)     { // fill in code here     } } Constraints A strand’s similarity should be based on the percentage of matching base pairs The pattern strands must be shorter or equal to the length of the target strand The pattern strand should be allowed to find its best match at any point in the target strand If two or more strands share the same percentage similarity, list in order of length If two or more strands share the same percentage similarity AND length, list in alphabetical order Examples TARGET STRING = “actgcataccata” PATTERN STRINGS = “actgca”, “act”, “gga”, “cgta” Should return: String[] containing “actgca”, “act”, “cgta”, “gga” TARGET STRING = “aaaaagggttcccatatatata” PATTERN STRINGS = “aata”, “gggttc”, “tta”, “cccat” Should return: String[] containing “gggttc”, “cccat”, “aata”, “tta” SSAHA Step by Step Make a flowchart here Talk about the steps… fill the rest of this space. One Advantage of SSAHA Advantages and Conclusions In conclusion: Contact information

More Related