The Assignment for Programming Training
Fu Yu

Presentation Transcript
Usage
  • “python mapping_report INPUT_FILE”. The INPUT_FILE is the SAM file that you want to process.
  • Please keep in mind that you should first change the current directory to the folder that contains the Python script because the results will be put in the current directory.
Results

The script also processes a SAM file larger than 2 GB in only 356.87 s.

Ultrafast!!!

In about 18 s, the script finishes processing a SAM file of approximately 160 megabytes. The current directory then contains a file named “gross result” that includes all the results, and a “Distribution_of_scores.pdf” that shows the quality scores.

Multithreading
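  • I also tried multithreading to speed up the program, but Python does not seem well suited to it (most likely because of the global interpreter lock in CPython, which keeps threads from executing Python code in parallel). The multithreaded version takes about 40 s to finish the task, so I gave up on multithreading.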
Data source
  • SRR037828.fastq was selected randomly from those .fastq files. It was mapped back to the
Q1: Mapping report
  • Generate a report on the total number of tags and on the number and percentage of tags that have been mapped back to the genome.
Step1 - About SAM files
  • Use the flag field to decide whether a tag is mapped back or not.
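The flag check can be sketched as follows. This is a minimal sketch, assuming the standard SAM layout in which the flag is the second tab-separated field and bit 0x4 marks an unmapped read. The function name is_mapped and the two sample lines are hypothetical, not taken from the original script.

```python
def is_mapped(sam_line: str) -> bool:
    """Return True if the alignment line describes a mapped tag."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])        # FLAG is the 2nd column of a SAM line
    return (flag & 0x4) == 0     # bit 0x4 set means "segment unmapped"


# Two made-up alignment lines, for illustration only.
mapped_line = "tag1\t0\tchr1\t112233\t37\t4M\t*\t0\t0\tACGT\tIIII"
unmapped_line = "tag2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(is_mapped(mapped_line), is_mapped(unmapped_line))
```

Testing the bit instead of comparing the whole flag keeps the check valid when other bits, such as the reverse-strand bit 0x10, are also set.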
Step1 - Data
  • Use unmapped to record the number of tags that are not mapped back, and the dictionary chr_mapped_num to store how many tags have been mapped back to the genome. This dictionary might look redundant, but it actually helps in later steps.
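A possible shape for these two counters is sketched below. It assumes that chr_mapped_num is keyed by chromosome name (the third SAM column), which the slides do not state explicitly. The helper names count_tags and mapping_report are hypothetical.

```python
from collections import defaultdict


def count_tags(lines):
    """Count unmapped tags and mapped tags per chromosome."""
    unmapped = 0
    chr_mapped_num = defaultdict(int)   # assumed: chromosome name -> count
    for line in lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:        # FLAG bit 0x4: unmapped
            unmapped += 1
        else:
            chr_mapped_num[fields[2]] += 1   # RNAME is the 3rd column
    return unmapped, dict(chr_mapped_num)


def mapping_report(unmapped, chr_mapped_num):
    """Return (total tags, mapped tags, percentage mapped)."""
    mapped = sum(chr_mapped_num.values())
    total = mapped + unmapped
    percentage = 100.0 * mapped / total if total else 0.0
    return total, mapped, percentage
```

Under that assumption, the per-chromosome dictionary gives the mapped total for the report by summing its values.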
Step1 - RegEx
  • It uses regular expressions to get the name and the length of each chromosome.
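The exact patterns are not shown in the transcript, so the following is only a sketch of one way to do it. It assumes the chromosome names and lengths come from the @SQ header lines, where the SAM format stores them in the SN: and LN: tags. The names SN_RE, LN_RE and parse_sq_line are hypothetical.

```python
import re

# In an @SQ header line, SN:<name> and LN:<length> are tab-separated tags.
SN_RE = re.compile(r"\tSN:([^\t\n]+)")
LN_RE = re.compile(r"\tLN:(\d+)")


def parse_sq_line(line):
    """Return (chromosome name, length) for an @SQ line, else None."""
    if not line.startswith("@SQ"):
        return None
    name = SN_RE.search(line)
    length = LN_RE.search(line)
    if name is None or length is None:
        return None
    return name.group(1), int(length.group(1))


print(parse_sq_line("@SQ\tSN:chr1\tLN:197195432"))
```

Searching for the two tags separately means the sketch does not depend on the order in which they appear on the line.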
Step1 - Getting the header

This step reads everything that the header contains. It also handles possible exceptions in case the SAM file is corrupted.
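Reading the header defensively might look like the sketch below. It assumes the header is the run of lines starting with “@” at the top of the file, and it treats an unreadable or undecodable file as “corrupted”. The original script may catch different exceptions. The function name read_header is hypothetical.

```python
def read_header(path):
    """Collect the leading '@' lines of a SAM file.

    Returns the header lines, or an empty list if the file cannot be
    opened or decoded.
    """
    header = []
    try:
        with open(path, "r", encoding="ascii") as handle:
            for line in handle:
                if not line.startswith("@"):
                    break          # header ends at the first alignment line
                header.append(line.rstrip("\n"))
    except (OSError, UnicodeDecodeError) as err:
        print(f"Could not read header from {path}: {err}")
        return []
    return header
```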

Q2: Quality score report
  • Use R to draw the distribution of FASTQ quality scores across all mapped tags.
Step2 - Loop

Write the score of each tag to “f_out_quality_score” so that an R script can process the scores and draw the distribution.
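The slides do not say how the score of a tag is computed. The sketch below rests on two assumptions: the score is the mean of the per-base qualities in the QUAL column (the 11th field), and the qualities are Phred+33 encoded. The function names mean_phred and write_scores are hypothetical. Only f_out_quality_score comes from the slide.

```python
def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean per-base quality of one tag (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def write_scores(lines, f_out_quality_score):
    """Write one score per mapped tag; return how many were written."""
    written = 0
    for line in lines:
        if line.startswith("@"):                 # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4 or fields[10] == "*":
            continue                             # unmapped or no qualities
        f_out_quality_score.write(f"{mean_phred(fields[10]):.2f}\n")
        written += 1
    return written
```

With one number per line, the output file can be read by R with a single call.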

Step2 - R

Here, the Python script creates an R script and calls it from the terminal, so that we do not have to run Rscript ourselves.
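One way to generate and launch an R script from Python is sketched here. The R code, the script file name and the helper names write_r_script and run_r_script are assumptions for illustration. Only the output name “Distribution_of_scores.pdf” comes from the slides. The sketch also assumes the scores file holds one number per line.

```python
import shutil
import subprocess

# Hypothetical R script: read the scores and draw a histogram into a PDF.
R_TEMPLATE = """scores <- scan("{scores}")
pdf("{pdf}")
hist(scores, main = "Distribution of scores", xlab = "Quality score")
dev.off()
"""


def write_r_script(script_path, scores_path, pdf_path):
    """Write the R script to disk and return the command that runs it."""
    with open(script_path, "w") as handle:
        handle.write(R_TEMPLATE.format(scores=scores_path, pdf=pdf_path))
    return ["Rscript", script_path]


def run_r_script(command):
    """Run the command if the program is installed; True on success."""
    if shutil.which(command[0]) is None:
        return False
    return subprocess.run(command, check=False).returncode == 0
```

A call such as run_r_script(write_r_script("plot_scores.r", "scores.txt", "Distribution_of_scores.pdf")) would then do both steps in one go.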

Step3&4

Steps 3 and 4 share one loop because they would otherwise iterate over the data in exactly the same way. This improves the efficiency of the script.

Q3: Unique mapped tag
  • Count the number of tags that are each mapped back to only one genomic location.
Step3
  • This step uses a dictionary: the key is chr + symbol + loc (e.g. chr1+112233) and the value is the number of times that key occurs. Any key with a value of 2 or more is excluded, and the keys with a value of 1 are totaled to give the result. The image on the slide shows how the program handles + strands. In the try block, if the line does not have a 19th field, the program raises an exception, which the handler deliberately ignores. If the line does have one, it is kept in the dictionary for later use.
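The dictionary counting can be sketched as below. The sketch assumes the strand symbol is derived from flag bit 0x10 and the location from the POS column. It leaves out the check on the 19th field described above. The names location_key and count_singleton_keys are hypothetical.

```python
from collections import Counter


def location_key(fields):
    """Build a key such as 'chr1+112233' from the fields of a SAM line."""
    strand = "-" if int(fields[1]) & 0x10 else "+"   # bit 0x10: reverse strand
    return f"{fields[2]}{strand}{fields[3]}"


def count_singleton_keys(lines):
    """Count the keys that occur exactly once among mapped tags."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):                     # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:                     # unmapped tag
            continue
        counts[location_key(fields)] += 1
    return sum(1 for n in counts.values() if n == 1)
```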
Q4: Unique mapping location
  • Count the number of genomic locations that only have one tag mapped.
Step4 – using the XA field

Use the XA field to determine how many genomic locations there are and exactly where the tags map back.
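Parsing the XA field might look like the sketch below. It assumes the BWA convention for this optional tag, XA:Z:chr,±pos,CIGAR,NM; repeated once per alternative hit. It also looks the tag up by its prefix rather than by column position, which may differ from the original script. The function name parse_xa is hypothetical.

```python
def parse_xa(fields):
    """Return alternative hits as (chromosome, strand, position) tuples."""
    for field in fields[11:]:                # optional fields start at column 12
        if field.startswith("XA:Z:"):
            hits = []
            for entry in field[5:].split(";"):
                if not entry:                # trailing ';' leaves an empty entry
                    continue
                chrom, signed_pos, _cigar, _nm = entry.split(",")
                hits.append((chrom, signed_pos[0], int(signed_pos[1:])))
            return hits
    return []                                # no XA tag: no alternative hits
```

Under that assumption, the number of reported locations for a tag is the primary hit plus len(parse_xa(fields)).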

Step4

If a line has a flag of ‘0’ or ‘16’ and also has the 19th field, then it is a tag that fulfills the given condition. Counting these lines gives the result.

Time complexity
  • This script uses several loops. Step One relies on a loop that repeats N times, where N is the number of tags, so its complexity is O(N).
  • Similarly, Step Three’s complexity is O(N).
  • However, Step Two and Step Four need N*l operations, where l is the number of bases in each tag, so the time complexity of the whole script is O(N*l).
Time
  • I use the “time” module to time the whole process. It takes about 20 s for my script to process a SAM file of approximately 160 megabytes.
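Timing a run with the “time” module can be sketched as follows. The wrapper name timed is hypothetical, and the use of time.perf_counter() is an assumption. The original script may call a different function from the same module.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; the real script would time the whole SAM processing.
result, seconds = timed(sum, range(1_000_000))
print(f"Finished in {seconds:.2f}s")
```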
All in a single run
  • All 4 steps are done within the Python script, so we do not have to run “Rscript xxx.r” outside the script.
Summary
  • Multithreading
  • Identifying the meaning of each optional field
  • Using dictionaries to count the number of tags
  • Using RegEx to capture the necessary information.
  • Loop: trying to decrease the number of nested loops as much as possible.