Python Mapping Report for SAM file Processing

The Assignment for Programming Training Fu Yu

Usage • Run “python mapping_report INPUT_FILE”, where INPUT_FILE is the SAM file that you want to process. • First change the current directory to the folder that contains the Python script, because the results are written to the current directory.

Results The script processes a SAM file of more than 2 GB in only 356.87 s. It finishes a SAM file of approximately 160 megabytes in about 18 s. In the current directory there is a file named “gross result” that contains all the results, and a “Distribution_of_scores.pdf” in which you can find the quality score distribution.

Multithreading • I also tried multithreading to speed up the program, but Python does not seem to be good at it. The multithreaded version takes about 40 s to finish the task, so I gave up on multithreading.

Data source • SRR037828.fastq was selected randomly from those .fastq files. It was mapped back to the

Q1: Mapping report • Generate a report with the number and percentage of tags that have been mapped back to the genome, and the total number of tags.

Step1 - About SAM files • Use the flag field to decide whether a tag is mapped back or not.
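As a minimal sketch of the flag test (not the original script): in the SAM format, FLAG is the second tab-separated field, and bit 0x4 marks an unmapped segment. The function name is_mapped and the two sample lines are made up for illustration.

```python
UNMAPPED_FLAG = 0x4  # SAM spec: bit 0x4 set means the segment is unmapped


def is_mapped(sam_line):
    """Return True if the alignment line's FLAG does not have bit 0x4 set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory field
    return (flag & UNMAPPED_FLAG) == 0


# Hypothetical alignment lines, for illustration only.
mapped_line = "tag1\t16\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII"
unmapped_line = "tag2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(is_mapped(mapped_line), is_mapped(unmapped_line))
```

Testing the bit with & rather than comparing the whole flag to 4 keeps the check valid when other flag bits are set as well.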

Step1 - Data • Use unmapped to record the number of tags that are not mapped back, and the dictionary chr_mapped_num to store how many tags have been mapped back to the genome. This dictionary might look redundant, but it actually helps in later steps.

Step1 - RegEx • The script uses regular expressions to get the name and the length of each chromosome.
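One way this could look, as a sketch rather than the original code: SAM header lines that start with @SQ carry the reference name in an SN: field and its length in an LN: field. The names SN_RE, LN_RE and parse_sq_line, and the sample header, are assumptions made for this example.

```python
import re

# @SQ header lines carry SN:<name> and LN:<length> as tab-separated fields.
SN_RE = re.compile(r"\tSN:([^\t\n]+)")
LN_RE = re.compile(r"\tLN:(\d+)")


def parse_sq_line(line):
    """Return (name, length) for an @SQ line, or None for any other line."""
    if not line.startswith("@SQ"):
        return None
    name = SN_RE.search(line)
    length = LN_RE.search(line)
    if name is None or length is None:
        return None
    return name.group(1), int(length.group(1))


# Hypothetical header lines, for illustration only.
header = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chr2\tLN:500\n",
]
chr_lengths = {}
for header_line in header:
    parsed = parse_sq_line(header_line)
    if parsed is not None:
        chr_lengths[parsed[0]] = parsed[1]
print(chr_lengths)
```

Two separate patterns are used so that the order of SN: and LN: within the line does not matter.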

Step1 - Getting the header • This part reads everything that the header contains. It also handles possible exceptions in case the SAM file is corrupted.
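A rough sketch of header reading with error handling follows. How the original script detects a corrupted file is not stated, so the checks here (read errors, and a header without any @SQ line) are assumptions, as are the names read_header and load_header.

```python
import io


def read_header(handle):
    """Collect the leading '@' lines; return (header_lines, first_alignment_line)."""
    header = []
    for line in handle:
        if line.startswith("@"):
            header.append(line.rstrip("\n"))
        else:
            return header, line
    return header, None


def load_header(handle):
    """Read the header and raise ValueError if it looks unusable."""
    try:
        header, first = read_header(handle)
    except (OSError, UnicodeDecodeError) as err:
        raise ValueError(f"could not read SAM header: {err}")
    # Assumed sanity check: a mapped SAM file should list its references.
    if not any(h.startswith("@SQ") for h in header):
        raise ValueError("no @SQ lines found; header may be missing or corrupted")
    return header, first


# Hypothetical in-memory SAM content, for illustration only.
sample = io.StringIO(
    "@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:1000\n"
    "tag1\t0\tchr1\t5\t37\t4M\t*\t0\t0\tACGT\tIIII\n"
)
header_lines, first_alignment = load_header(sample)
print(len(header_lines), first_alignment.split("\t")[0])
```

Returning the first alignment line together with the header means the caller does not lose the line that ended the header scan.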

Step1 - Read in all the tags
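The tallying for the mapping report might be sketched as below. It reuses the names unmapped and chr_mapped_num from the slides, but keying the dictionary by chromosome name is an assumption, and the function mapping_report and the sample lines are hypothetical.

```python
from collections import defaultdict


def mapping_report(alignment_lines):
    """Return (total, mapped, percent_mapped, per_chromosome_counts)."""
    unmapped = 0
    chr_mapped_num = defaultdict(int)  # assumed layout: chromosome -> mapped tags
    for line in alignment_lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:       # FLAG bit 0x4: tag is unmapped
            unmapped += 1
        else:
            chr_mapped_num[fields[2]] += 1  # RNAME is the third field
    mapped = sum(chr_mapped_num.values())
    total = mapped + unmapped
    percent = 100.0 * mapped / total if total else 0.0
    return total, mapped, percent, dict(chr_mapped_num)


# Hypothetical alignment lines, for illustration only.
lines = [
    "tag1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "tag2\t16\tchr2\t200\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "tag3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
total, mapped, percent, per_chr = mapping_report(lines)
print(f"{mapped}/{total} tags mapped ({percent:.2f}%)")
```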

Q2: Quality score report • With R, draw a graph of the FASTQ quality score distribution within all mapped tags.

Step2 - Loop Put the score of each tag into “f_out_quality_score”, so that I can use an R script to process the scores and draw the distribution.
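The slides do not say how the per-tag score is computed. The sketch below assumes the mean Phred score of the QUAL string (field 11) with the Phred+33 encoding, and writes one value per line to a file-like object named after f_out_quality_score. The helper names are hypothetical.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding


def mean_quality(qual_string, offset=PHRED_OFFSET):
    """Mean Phred score of one tag, decoded from its QUAL string."""
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores)


def write_scores(alignment_lines, f_out_quality_score):
    """Write one mean score per mapped tag; return how many were written."""
    written = 0
    for line in alignment_lines:
        fields = line.rstrip("\n").split("\t")
        qual = fields[10]                 # QUAL is the 11th mandatory field
        if int(fields[1]) & 0x4 or qual == "*":
            continue                      # skip unmapped tags and missing QUAL
        f_out_quality_score.write(f"{mean_quality(qual):.2f}\n")
        written += 1
    return written


# Hypothetical alignment lines, for illustration only.
lines = [
    "tag1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "tag2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
]
out = io.StringIO()
n_written = write_scores(lines, out)
print(n_written, out.getvalue().strip())
```

The inner pass over each QUAL string is what makes this step cost on the order of N*l rather than N.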

Step2 - R Here, the Python script creates an R script and calls it in the terminal, so that we do not have to run the R script ourselves.
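A sketch of generating and launching the R script from Python is shown below. The R commands and the file names draw_scores.r and quality_scores.txt are assumptions; only “Distribution_of_scores.pdf” comes from the slides. The run_r_script function requires Rscript to be installed and is defined but not called here.

```python
import os
import subprocess
import tempfile


def write_r_script(script_path, scores_path, pdf_path):
    """Write a small R script that plots a histogram of the scores; return its text."""
    r_code = (
        f'scores <- scan("{scores_path}")\n'
        f'pdf("{pdf_path}")\n'
        'hist(scores, main="Distribution of scores", xlab="Quality score")\n'
        "dev.off()\n"
    )
    with open(script_path, "w") as handle:
        handle.write(r_code)
    return r_code


def run_r_script(script_path):
    """Run the script with Rscript; check=True raises if R exits with an error."""
    subprocess.run(["Rscript", script_path], check=True)


# Demonstration: only write the script, so the example works without R installed.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "draw_scores.r")
    r_code = write_r_script(script, "quality_scores.txt", "Distribution_of_scores.pdf")
    print(r_code)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with paths that contain spaces.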

Step3&4 Steps 3 and 4 share a single loop because the loops they need are identical. This improves the efficiency of the script.

Q3: Unique mapped tag • Count the number of tags that are each mapped back to only one genomic location.

Step3

Step3 • This step uses a dictionary: the key is chr + symbol + loc (e.g. chr1+112233) and the value is the number of repeats. If a key has a value of 2 or more, it is excluded from the count. All the keys that have a value of 1 are totaled, and this total is the result. The image above shows how the program handles + strands. In the try block, if the line does not have a 19th field, the program goes into the exception handler (which actually does nothing). If it does have one, the field is kept in the dictionary for later use.
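The dictionary counting described here could be sketched as follows. Deriving the strand symbol from FLAG bit 0x10 is an assumption about how the key is built, and the function names and sample lines are hypothetical.

```python
from collections import Counter


def location_key(fields):
    """Build a key such as 'chr1+112233' from chromosome, strand and position."""
    strand = "-" if int(fields[1]) & 0x10 else "+"  # bit 0x10: reverse strand
    return f"{fields[2]}{strand}{fields[3]}"


def count_single_hit_keys(alignment_lines):
    """Count the keys that occur exactly once among the mapped tags."""
    counts = Counter()
    for line in alignment_lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue  # unmapped tags have no location
        counts[location_key(fields)] += 1
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical alignment lines: two tags share chr1+112233, one is alone.
lines = [
    "tag1\t0\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "tag2\t0\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "tag3\t16\tchr2\t445566\t37\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_single_hit_keys(lines))
```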

Q4: Unique mapping location • Count the number of genomic locations that only have one tag mapped.

Step4 - using the XA field Use the XA field to decide how many genomic locations a tag maps to and exactly where those locations are.
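As a sketch, the XA field can be parsed like this, assuming the BWA layout XA:Z:chr,pos,CIGAR,NM; repeated once per alternative hit, with the strand as the sign of pos. The function parse_xa and the sample line are hypothetical. The field is located by its XA:Z: prefix instead of a fixed column, because the SAM format does not fix the order of optional fields.

```python
def parse_xa(fields):
    """Return a list of (chrom, strand, pos, cigar, nm) alternative hits."""
    for optional in fields[11:]:          # optional fields start at column 12
        if optional.startswith("XA:Z:"):
            hits = []
            for entry in optional[5:].split(";"):
                if not entry:
                    continue              # the value ends with a trailing ';'
                chrom, pos, cigar, nm = entry.split(",")
                hits.append((chrom, pos[0], int(pos[1:]), cigar, int(nm)))
            return hits
    return []                             # no XA field: no alternative hits


# Hypothetical alignment line with two alternative locations.
line = (
    "tag1\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\t"
    "NM:i:0\tXA:Z:chr2,+112233,4M,0;chr3,-445566,4M,1;"
)
alternatives = parse_xa(line.split("\t"))
print(1 + len(alternatives), alternatives)
```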

Step4 If a line has a flag of ‘0’ or ‘16’, together with the 19th field, then it is a tag that fulfills the given condition. Count these lines and we get the result.

Time complexity • This script uses several loops. Step One relies on a loop that repeats N times (N is the number of tags), so its complexity is O(N); • Similarly, Step Three’s complexity is O(N); • However, Step Two and Step Four need N*l operations (l is the number of bases in each tag). So the time complexity of the script is O(N*l).

Time • I use the “time” module to time the whole process. It takes about 20 s for my script to process a SAM file that is approximately 160 megabytes.
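Timing a run with the “time” module might look like the sketch below. The helper timed is hypothetical, and time.perf_counter is chosen here because it is intended for measuring elapsed intervals. The original script may use a different call from the same module.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the script this would be the whole SAM processing run.
result, elapsed = timed(sum, range(1000))
print(f"result={result}, finished in {elapsed:.4f}s")
```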

All in a single run • All 4 steps are done within the Python script, so we do not have to run “Rscript xxx.r” outside the script.

Summary • Multithreading • Identifying the meaning of each optional field • Using dictionaries to count the number of tags • Using RegEx to capture the necessary information • Loops: decreasing the number of nested loops as much as possible.

Thank you!
