Constrained Multiple Structure Feature Alignment (CMSFA)

Dr. Tun-Wen Pai, Dept. of Computer Science and Engineering, National Taiwan Ocean University, 2006.10.30

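Goals • Perform fast three-dimensional multiple structure alignment of several proteins. • Find the features shared within a protein family and the features unique to individual members.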

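Motivation • Protein structure databases are growing rapidly. • The PDB (Protein Data Bank) already holds more than 39,600 entries. • A protein structure carries more information than its one-dimensional sequence. • Proteins with the same function can differ greatly in sequence yet keep similar structures, with variation confined mainly to loop regions. • Comparing protein structures helps predict protein function. • It supports the analysis of protein homology. • It helps identify the unique regions of the members of a protein family.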

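Background • Existing protein structure alignment algorithms: • Pairwise: DALI, FAST, CE, LGA, K2SA, SuperPose, etc. • Multiple: MASS, CE-MC, MultiProt, etc. • All of them require complex computation. • This system offers a new approach: it performs multiple structure alignment quickly, based on important features of the one-dimensional sequences.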

System flowchart • Figure labels: Clustering and Combinatorial Feature Finding; PDB Sequences; Subgroup Sequences; Consensus Motif Searching; Hierarchical Clustering; Combinatorial Feature Analysis; CMSFA Alignment; Key Residue Analysis; Constrained Multiple Structure Feature Alignment (CMSFA); Unique Peptide Searching

Consensus motif searching • Uses the Ladderlike Interval Jumping Searching Algorithm (LIJSA). • Encode -> sort -> compare. • Pai, T.W., M.D.T. Chang, J.H. Chu, and H.L. Tai, Ladderlike Stepping and Interval Jumping Searching Algorithms for DNA Sequences. APBC, p.93-98, 2004. • Quickly finds the consensus substrings shared by multiple protein sequences, within a specified length range and with a tolerance for mismatches.
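The encode -> sort -> compare idea can be sketched as below. This is a simplified stand-in and not LIJSA itself: it handles exact matches of one fixed length only, with no mismatch tolerance and no length range. The function name `common_kmers` and the toy sequences are assumptions made for illustration.

```python
from itertools import groupby

def common_kmers(sequences, k):
    """Return the length-k substrings that occur in every sequence (exact matches)."""
    # Encode: one (k-mer, sequence index) record per sliding window.
    records = [(seq[i:i + k], idx)
               for idx, seq in enumerate(sequences)
               for i in range(len(seq) - k + 1)]
    # Sort: identical k-mers become adjacent.
    records.sort()
    # Compare: keep the k-mers whose run covers all sequences.
    shared = []
    for kmer, run in groupby(records, key=lambda r: r[0]):
        if len({idx for _, idx in run}) == len(sequences):
            shared.append(kmer)
    return shared

# Toy sequences (made up for illustration, not real proteins).
toy = ["ACDEFGH", "XXCDEFY", "CDEFZZZ"]
shared = common_kmers(toy, 4)
```

Sorting turns the search for shared substrings into a single pass over adjacent records, which is the reason for encoding before comparing.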

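Hierarchical clustering (optional) • The input proteins may include sequences of lower similarity, so they are grouped first. • The similarity between two sequences is measured by the number of consensus substrings they share. • An agglomerative clustering algorithm with single linkage is used. • William H. Day and Herbert Edelsbrunner. 1984. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification. Volume 1, pp. 1-24 • Each resulting cluster contains protein sequences of high one-dimensional similarity; the consensus substring search is then rerun within each cluster.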
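A minimal sketch of single-linkage agglomerative clustering driven by shared-motif counts follows. It is a naive version for illustration, not the efficient algorithm from the cited paper. The stopping `threshold`, the function name, and the toy similarity matrix are assumptions; the slides do not say when merging stops.

```python
def single_linkage(similarity, threshold):
    """Merge clusters while the closest pair shares at least `threshold` motifs.

    similarity[i][j] is the number of consensus substrings shared by
    sequences i and j (symmetric matrix given as a list of lists).
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > 1:
        best, pair = -1, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster similarity is that of the most similar member pair.
                s = max(similarity[i][j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Toy shared-motif counts for four sequences (made up for illustration).
counts = [[0, 8, 7, 1],
          [8, 0, 6, 0],
          [7, 6, 0, 2],
          [1, 0, 2, 0]]
groups = single_linkage(counts, threshold=5)
```

With these counts, sequences 0, 1 and 2 end up in one group and sequence 3 stays alone.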

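Combinatorial feature analysis • Extract the important combinatorial features from the consensus substrings. • Step 1, multiple sequence alignment: align the encoded substrings, similar within the tolerance, that LIJSA found earlier. The method builds on pairwise alignment: one sequence is taken as the center, aligned pairwise with each of the others, and the results are combined. • Step 2, obtain the combinatorial features: merge the aligned segments wherever they overlap.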
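The merge in step 2 can be sketched as a standard interval merge. The representation of an aligned segment as an inclusive `(start, end)` pair of sequence positions is an assumption made for illustration.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) segments; both ends are inclusive."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Four length-5 segments; three of them overlap and chain into one feature.
features = merge_segments([(10, 14), (12, 16), (30, 34), (15, 19)])
```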

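Example (Combinatorial feature analysis) • For the RNase A superfamily • Base length of the consensus substring search: 5 • Tolerance: 2 variable positions • Combinatorial features with 100% occurrence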

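Key residue analysis • Compute a weighted score for every residue in a combinatorial feature. • The score is based on amino acid properties: homology, charge, and hydrophilicity. • Homology: identical amino acids at the same aligned position of the combinatorial feature. • Charge: Asp, Glu, His, Lys, Arg. • Hydrophilicity: Ser, Thr, Tyr, Glu, Gln, Asn, Asp, Arg, Gly, Cys, Lys, His. • In each combinatorial feature, the highest-scoring amino acids are taken as the key residues. • These key residues serve as the anchor points for structure alignment.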
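The scoring can be sketched as below. The slides do not give the actual weights, so this sketch assumes one point per property and requires every protein in a column to satisfy the property. The function name and the toy feature are also assumptions.

```python
# Residue sets from the slide, in one-letter codes.
CHARGED = {"D", "E", "H", "K", "R"}   # Asp, Glu, His, Lys, Arg
HYDROPHILIC = {"S", "T", "Y", "E", "Q", "N", "D", "R", "G", "C", "K", "H"}

def key_columns(aligned_feature):
    """Score each column of an aligned feature; return scores and top columns.

    aligned_feature: equal-length, gap-free strings, one per protein.
    Assumed scoring: +1 if the column is identical in all proteins (homology),
    +1 if all its residues are charged, +1 if all are hydrophilic.
    """
    scores = []
    for column in zip(*aligned_feature):
        residues = set(column)
        score = (int(len(residues) == 1)
                 + int(residues <= CHARGED)
                 + int(residues <= HYDROPHILIC))
        scores.append(score)
    top = max(scores)
    return scores, [i for i, s in enumerate(scores) if s == top]

# Toy aligned feature (made up for illustration).
scores, keys = key_columns(["CKNQF", "CKNGY", "CKNQL"])
```

Here the conserved lysine column scores on all three properties and is the only key position.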

Example (Key residue analysis) • For the RNase A superfamily • Key residues

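Constrained Multiple Structure Feature Alignment • Within each combinatorial feature, compute the geometric center of all its key residues in the corresponding three-dimensional structure. • Choose any three of these combinatorial feature centers as the base points for structure alignment. • Structure alignment: three points define a plane. One protein structure is held fixed, and every other structure is aligned directly onto the fixed protein's three geometric centers using its own three points.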
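A minimal sketch of the two geometric ingredients is given below: the center of a set of key-residue coordinates, and a check that three chosen centers are not collinear, since collinear points do not define a plane. Which atom represents a residue is not stated on this slide; the sketch simply takes 3D points, and the function names and tolerance are assumptions.

```python
def geometric_center(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def defines_plane(a, b, c, tol=1e-9):
    """True if the three points are not collinear (cross product is non-zero)."""
    u = tuple(b[i] - a[i] for i in range(3))
    v = tuple(c[i] - a[i] for i in range(3))
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return sum(x * x for x in cross) > tol

# Toy coordinates (made up for illustration).
center = geometric_center([(0, 0, 0), (2, 0, 0), (1, 3, 0)])
```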

Constrained Multiple Structure Feature Alignment (cont.) • Three-point alignment algorithm (pseudocode)
Let C0, C1, C2 be the three selected points of the target protein (held fixed), in sequence order, and C0', C1', C2' the corresponding points of the protein being aligned (rotated and translated).
Step 1: Translate C0' to C0.
Step 2: Keeping C0' (= C0) fixed, rotate C0'C1' onto C0C1 within the plane C0C1'C1.
Step 3: Translate C1' to C1.
Step 4: Keeping C1' (= C1) fixed, rotate C1'C2' onto C1C2 within the plane C1C2'C2.
(Each translation and rotation moves the whole protein rigidly.)
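The four steps can be sketched in plain Python as below. This is a literal reading of the pseudocode under stated assumptions, not the authors' implementation: all function names are hypothetical, and each rotation is done with Rodrigues' formula about the axis perpendicular to the named plane. Read literally, step 4 rotates about C1, so it can move C0' away from C0 again; the sketch does not try to correct this.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotation_onto(u, v):
    """Unit axis and angle of the in-plane rotation taking direction u onto v."""
    axis = cross(u, v)
    n = math.sqrt(dot(axis, axis))
    if n < 1e-12:
        if dot(u, v) >= 0:
            return (1.0, 0.0, 0.0), 0.0           # already aligned: no rotation
        axis = cross(u, (1.0, 0.0, 0.0))          # opposite: any perpendicular axis
        if dot(axis, axis) < 1e-12:
            axis = cross(u, (0.0, 1.0, 0.0))
        n = math.sqrt(dot(axis, axis))
        return tuple(a / n for a in axis), math.pi
    return tuple(a / n for a in axis), math.atan2(n, dot(u, v))

def rotate(p, origin, axis, angle):
    """Rodrigues rotation of point p about a unit axis through origin."""
    v = sub(p, origin)
    c, s = math.cos(angle), math.sin(angle)
    kxv, kv = cross(axis, v), dot(axis, v)
    r = tuple(v[i] * c + kxv[i] * s + axis[i] * kv * (1 - c) for i in range(3))
    return add(r, origin)

def three_point_align(moving_pts, target_pts, atoms):
    """Move the anchor points and all atoms rigidly, following Steps 1 to 4."""
    c0, c1, c2 = target_pts
    pts = list(moving_pts) + list(atoms)
    t = sub(c0, pts[0])                                    # Step 1: C0' -> C0
    pts = [add(p, t) for p in pts]
    axis, ang = rotation_onto(sub(pts[1], c0), sub(c1, c0))
    pts = [rotate(p, c0, axis, ang) for p in pts]          # Step 2: about C0
    t = sub(c1, pts[1])                                    # Step 3: C1' -> C1
    pts = [add(p, t) for p in pts]
    axis, ang = rotation_onto(sub(pts[2], c1), sub(c2, c1))
    pts = [rotate(p, c1, axis, ang) for p in pts]          # Step 4: about C1
    return pts[:3], pts[3:]

# Toy data: the moving points are a rotated and shifted copy of the target.
target = [(0, 0, 0), (1, 0, 0), (2, 1, 0)]
moving = [(5, 5, 5), (5, 6, 5), (4, 7, 5)]
new_pts, new_atoms = three_point_align(moving, target, [(5, 5, 6)])
```

Because only translations and rotations are applied, all distances inside the moving protein are preserved; the quality of the superposition depends entirely on how well the three anchor triangles match.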

Time complexity • m protein structures; N is the maximum structure length (number of C-alpha atoms) • LIJSA + clustering: O(mN + N^2) • Combinatorial feature finding: O(mN^2) • Key residue analysis + CMSFA: O(mN + 4) • Total: O(mN^2) in the worst case
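The total follows from the three stage costs because, for m >= 1 and N >= 1, every other term is bounded by mN^2. Written out (a worked step added for clarity, using only the bounds listed on the slide):

```latex
\[
O(mN + N^2) + O(mN^2) + O(mN + 4) = O(mN^2),
\qquad \text{since } mN \le mN^2,\; N^2 \le mN^2,\; 4 \le 4\,mN^2 \ \text{ for } m, N \ge 1 .
\]
```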

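Limits on sequence similarity • When sequence similarity is low, alignment based on combinatorial features becomes less accurate. • System limit: sequence similarity above 30%. • To mitigate this, the system uses tolerant searching and hierarchical clustering, and in addition applies a molecular surface comparison method that extracts similar surface feature points for multiple alignment.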

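Surface Comparison • Use a morphology operation to obtain the atoms on the molecular surface. • By classification, select three feature points that have the same amino acid content, similar secondary structure, and similar morphological distances across the proteins, then search to obtain a better alignment.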

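Results and discussion • Using the RNase A, P450, and Ricin A protein families as examples, we show the RMSD after structure alignment, the number of aligned atoms, and the three-dimensional view of the aligned structures. • Comparison with other algorithms • Unique peptide motifs found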

Sequence similarity (for the RNase A superfamily) • Sequence identity/similarity percentages (%)

Alignment results (for the RNase A superfamily) • RMSD (unit: angstrom)/(number of aligned atoms/target sequence length) • The average RMSD lies between 0.9 and 1.2.
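For reference, RMSD over paired atoms after superposition can be computed as in this minimal sketch. It assumes the two coordinate lists are already superposed and paired index by index; the function name and the toy coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((a[i] - b[i]) ** 2
                for a, b in zip(coords_a, coords_b)
                for i in range(3))
    return math.sqrt(total / len(coords_a))

# Two atoms, each displaced by 1 unit along z.
value = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
```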

Alignment results (for the RNase A superfamily, cont.) • After alignment • Before alignment • Shown as ribbon

Sequence similarity (for P450) • Sequence identity/similarity percentages (%)

Alignment results (for P450) • RMSD/(number of hits/sequence length) • The average RMSD lies between 0.8 and 1.3.

Alignment results (for P450, cont.) • After alignment • Before alignment • Shown as backbone

Alignment results (for P450, cont.)

Alignment results (for 1 Ricin A + 31 near-neighbor proteins)

Alignment results (for Ricin A group 1, cont.) • After alignment • Before alignment

Comparison with existing algorithms (pairwise) • For 10 'difficult' protein pairs for structural analysis:

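Comparison with existing algorithms (multiple) • Distance cutoff: 3.0 Å • Note: the other methods' definition of an aligned residue does not require its distance to be below the distance cutoff, which is why the alignment length reported for CE-MC is especially long.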
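The note contrasts two ways of counting aligned residues. A minimal sketch of the cutoff-based count is shown below; the function name, the pairing of residues, and the toy coordinates are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

def count_within_cutoff(pairs, cutoff=3.0):
    """Count paired residues whose distance after superposition is below the cutoff."""
    return sum(1 for a, b in pairs if math.dist(a, b) < cutoff)

# Three paired residues at distances 1.0, 4.0 and 2.5 (made-up coordinates).
pairs = [((0, 0, 0), (1, 0, 0)),
         ((0, 0, 0), (0, 4, 0)),
         ((0, 0, 0), (0, 0, 2.5))]
n_aligned = count_within_cutoff(pairs)
```

A method that reports all three pairs as aligned, regardless of distance, gives a longer alignment length than the two pairs counted here.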

Unique Peptide Motifs Found for the Human RNase A Superfamily • Comparison of the unique peptides identified by CMSFA with the epitopes identified by DNAStar in RNase3.
