
Work with fastq


millerr

Presentation Transcript


  1. Work with fastq

  2. Manufacturers of DNA sequencers • Roche • Illumina • Life Technologies • Beckman Coulter • Pacific Biosciences • Oxford Nanopore

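  3. FASTQ • Line 1 starts with @, followed by descriptive information: @SEQ_ID • Line 2 is the sequence: GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT • Line 3 starts with +, optionally followed by descriptive information: + • Line 4 is the quality string: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 • The quality value has several definitions; the commonly used Sanger score is Q = -10 * log10(p), where p is the probability that the corresponding base call is wrong. For example, p = 0.01 means an error rate of 1%, which converts to a quality value of 20: -10 * log10(0.01) = -10 * (-2) = 20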
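The conversion on this slide can be checked with a few lines of Python. This is only a minimal sketch; the function names `phred_quality` and `error_probability` are made up for illustration and do not come from the slides.

```python
import math


def phred_quality(p_error: float) -> float:
    """Convert a per-base error probability into a Phred quality value."""
    return -10 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Inverse conversion: Phred quality value back to an error probability."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # The slide's example: p = 0.01 (1% error rate)
    print(phred_quality(0.01))
    # Q30, which the later slides describe as a 0.1% error rate
    print(error_probability(30))
```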

  4. FASTQ Sanger/Illumina 1.8 format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126. (Figure labels: 0, 1, 2, 3, 4, 20.)
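As a rough illustration of this encoding, the sketch below converts between quality characters and Phred scores by subtracting or adding the ASCII offset 33. The names `SANGER_OFFSET`, `decode_quality`, and `encode_quality` are hypothetical helpers, not part of any library.

```python
from typing import Iterable, List

SANGER_OFFSET = 33  # ASCII code that represents Phred score 0 in this format


def decode_quality(qual: str) -> List[int]:
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - SANGER_OFFSET for ch in qual]


def encode_quality(scores: Iterable[int]) -> str:
    """Turn Phred scores (0 to 93) back into a FASTQ quality string."""
    return "".join(chr(q + SANGER_OFFSET) for q in scores)


if __name__ == "__main__":
    # '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73
    print(decode_quality("!5I"))
    print(encode_quality([0, 93]))
```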

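  5. FastQC - quality visualization tool • Figure labels: 10th percentile; mean; median (the 50% point after sorting in ascending order); Q1 (25%); Q3 (75%); 90th percentile • A quality of 20 corresponds to a 1% error rate, and 30 to 0.1%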

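  6. # X-axis: Position in read (bp) • # Y-axis: Q = -10*log10(error P), so 20 corresponds to a 1% error rate and 30 to 0.1% • # Each boxplot summarizes the sequencing quality of all reads at that position: • the upper whisker is the 90th percentile, the lower whisker is the 10th percentile, the line inside the box is the 50th percentile, the top of the box is the 75th percentile, and the bottom is the 25th percentile • # The thin blue line in the plot connects the mean values at each position https://zhuanlan.zhihu.com/p/20731723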
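The numbers behind such a plot can be sketched as follows: for each read position, collect the scores of all reads and compute the 10th, 25th, 50th, 75th, and 90th percentiles plus the mean. This is an assumption-laden sketch, not FastQC's own code; in particular, the linear-interpolation `percentile` helper and the function name `per_position_stats` are illustrative choices, and FastQC may compute its percentiles differently.

```python
from statistics import mean


def percentile(sorted_vals, pct):
    """Percentile of an ascending list, using linear interpolation (an assumption)."""
    rank = (len(sorted_vals) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def per_position_stats(quality_strings, offset=33):
    """Return one dict of summary statistics per read position."""
    stats = []
    longest = max((len(q) for q in quality_strings), default=0)
    for pos in range(longest):
        # Only reads long enough to have a base at this position contribute
        scores = sorted(ord(q[pos]) - offset for q in quality_strings if len(q) > pos)
        stats.append({
            "p10": percentile(scores, 10),     # lower whisker
            "q1": percentile(scores, 25),      # bottom of the box
            "median": percentile(scores, 50),  # line inside the box
            "q3": percentile(scores, 75),      # top of the box
            "p90": percentile(scores, 90),     # upper whisker
            "mean": mean(scores),              # point on the blue line
        })
    return stats


if __name__ == "__main__":
    for pos, s in enumerate(per_position_stats(["II5", "I5!", "I55"]), start=1):
        print(pos, s)
```

The resulting list can be handed to any plotting library to draw one box per position.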

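  7. Q20 filtering (final project requirement) • Filter by quality score: delete every read that contains a Phred quality score below 20 (that is, a read containing any character with ASCII decimal code 33 to 52), and write out a new FASTQ file. • Note: the reads come as two paired files (R1 and R2); if R1 meets the deletion criterion and R2 does not, both must still be deleted. • Remark: this is a simplified version. In practice, Q20 filtering means trimming the read: scanning from right to left, the bases with codes DEC 33 to 52 are cut off and the rest of the read is kept.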
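One possible way to implement the simplified paired filter is sketched below. It assumes that R1 and R2 list their reads in the same order, and it applies the deletion rule symmetrically (a pair is dropped if either mate fails), which goes slightly beyond the slide's wording. The names `read_fastq`, `passes_quality`, and `filter_pairs` are hypothetical.

```python
def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Zipping the same iterator four times groups consecutive lines into records
    yield from zip(stripped, stripped, stripped, stripped)


def passes_quality(qual, min_q=20, offset=33):
    """True if every base in the read has a Phred score of at least min_q."""
    return all(ord(ch) - offset >= min_q for ch in qual)


def filter_pairs(r1_records, r2_records, min_q=20):
    """Yield only the read pairs in which both R1 and R2 pass the filter."""
    for rec1, rec2 in zip(r1_records, r2_records):
        if passes_quality(rec1[3], min_q) and passes_quality(rec2[3], min_q):
            yield rec1, rec2


if __name__ == "__main__":
    r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "ACGT", "+", "II4I"]
    r2 = ["@a/2", "TTTT", "+", "IIII", "@b/2", "TTTT", "+", "IIII"]
    for rec1, rec2 in filter_pairs(read_fastq(r1), read_fastq(r2)):
        print("\n".join(rec1))
        print("\n".join(rec2))
```

With real data, the two arguments of `read_fastq` would be open file handles for the R1 and R2 files, and the surviving records would be written to two new FASTQ files. Changing `min_q` gives the other thresholds.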

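  8. Supplement: real-world Q20 filtering • Remark: this is a simplified version. In practice, Q20 filtering means trimming the read: scanning from right to left, the bases with codes DEC 33 to 52 are cut off and the rest of the read is kept. • Before trimming: @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + BCDBCABCDBACACBBaBBBBBBBBBBBBmcdaBBAACDA54321123121111111111 • After trimming: @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCA + BCDBCABCDBACACBBaBBBBBBBBBBBBmcdaBBAACDA5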
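The trimming in this example can be sketched as a scan from the right end of the quality string. The sketch assumes trimming stops at the first base (counting from the right) whose score is at least 20, which matches the slide's example; real trimming tools use more elaborate rules. The function name `trim_low_quality_tail` is made up for illustration.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Cut bases off the 3' end until the last remaining base has score >= min_q."""
    keep = len(qual)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    # Sequence and quality string copied from the slide's example
    seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
    qual = "BCDBCABCDBACACBBaBBBBBBBBBBBBmcdaBBAACDA54321123121111111111"
    trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
    print(trimmed_seq)
    print(trimmed_qual)
```

In the quality string, `5` is ASCII 53 (score 20) and is kept, while the characters `1` to `4` after it (scores 16 to 19) are removed.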

  9. http://pipe-tuxedo.readthedocs.io/en/latest/explain_qc.html

  10. Final project option 1: database system for big data management • A database system that offers sequence retrieval, quality display, and visualization. • A web interface. • Data retrieval (enter one ID, return the R1 and R2 sequences) • Four quality filters must be offered (Q10, Q20, Q30, and Q40); the system must generate filtered FASTQ files for download and plots for visualization. • Plots: • A FastQC-style (or functionally equivalent) quality visualization plot • Di-nucleotide content (%) plot (AA, AC, AG, AT, CC, CG, CT, GG, GT, …) • Tri-nucleotide content (%) plot (AAA, AAC, AAG, …)
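For the di- and tri-nucleotide plots, the underlying percentages could be computed roughly as below. This sketch makes several assumptions the slide does not state: it counts overlapping windows, treats each ordered k-mer separately (so AC and CA are different), and skips windows containing characters other than A, C, G, T (such as N). The function name `kmer_content` is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_content(sequences, k=2):
    """Return the percentage of each A/C/G/T k-mer over all overlapping windows."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {km: (100 * counts[km] / total if total else 0.0) for km in all_kmers}


if __name__ == "__main__":
    di = kmer_content(["GATTTGGGGTTCAAAGCAGT"], k=2)   # 16 di-nucleotides
    tri = kmer_content(["GATTTGGGGTTCAAAGCAGT"], k=3)  # 64 tri-nucleotides
    print(di["TT"], tri["GGG"])
```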

  11. Hints • AWS cloud (EC2) or multiple computers can be used to reduce data retrieval time. • If your system cannot handle a file this large, you may use a smaller subset that fits the specification; however, the score will be capped accordingly. • Any database system (MySQL, SQLite, file-based, …), programming language (SQL, Python, R, …), and data management strategy is welcome. • A pre-processing step is allowed. • A compressed file (tar.gz, tgz, or zip) is recommended for data download. • Please copy the data files from the instructor as early as possible.

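  12. Final project option 2: PTT user activity • Collect at least 100,000 records, excluding boards with too few posts • Select a board and a time period (year or month) • Plot the distribution of user activity (posts and replies) on the board • Active accounts on each board (most active in posting, most active in replying) • Given an account and a time range, look up its history (summary of posts and replies) • Detect accounts suspected of having changed hands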

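  13. Teams of two • Oral presentation (slides) • Presentation of results and degree of completion • Problems encountered and how they were overcome • Division of work within the team • Written deliverables (slides, source code, other related files) • Demo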
