Protein localization & drug subcellular distribution ---- Data Integration

Lab Meeting of Rosania Lab
Jingyu, Yu
03/17/2008

Things about Proteins
1. Sequences
2. Structure
3. Expression level
4. Function
5. Partners
6. Location: critical to understanding function
• Many groups are working in this field using images, e.g. the Protein Subcellular Location Image Database: http://murphylab.web.cmu.edu/

What are we asking?
• Certain compounds/drugs bind specific proteins.
• Cheminformatics, SAR/QSAR, CADD, etc.
• Proteins are NOT distributed evenly in the cell.
• Question: Will protein localization play a role in the subcellular distribution and PK of a drug/ligand? How? Can we model it?

Starting Point
Map/crosslink these two databases:
• Ligand <-> Protein: Binding MOAD
• Protein <-> Localization: Organelle DB ?

Organelle DB
• Anuj Kumar Lab (MCDB, LSI)
• Protein localization data
• 30188 genes, 138 organisms, with emphasis on the major model systems (including human proteins)
• Gene names have been obtained from the appropriate model organism database (e.g., SGD, MGI, FlyBase, WormBase, TAIR)

Binding MOAD
• Heather A. Carlson Lab (Medchem)
• The largest database of protein-ligand complexes
• All relevant entries from the Protein Data Bank (PDB)
• 9837 entries, 3151 unique protein systems, and binding affinity data for 2950 complexes

Data Overview
A. Components of Organelle DB:
1. Sequence data (FASTA)
2. Protein information (Accession_ID, Standard Name, Systematic Name, etc.)
3. Localization information (GO Term, Localization information)
B. Components of Binding MOAD:
1. Sequence data (FASTA)
2. PDB file (structure information, etc.)

How to map/crosslink the two databases?
• Structure-based methods: DALI, SSM, Combinatorial Extension
• Sequence-based method: BLAST √

Importance of sequence
• The sequence of each protein determines where it is localized in cells.
• Subsequences (“motifs”) within a protein’s sequence are responsible for targeting it to one (or more) locations or organelles.
• PSORT: predicts a probable localization site for a protein given its amino acid sequence alone (Kenta Nakai, 1991)

Sequence Alignment
• Stand-alone BLAST (most people use the web-based search: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi)
• Set up a local database on a Windows platform (UNIX would be faster) using the Organelle DB FASTA file.
• Used the MOAD sequences (FASTA) as queries to search the Organelle database.
• Over 30188*20270 pairwise alignments done in 4 hrs.
• Output text file: 180 MB
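The slides do not name the exact commands used for the local search. As a minimal sketch, assuming the current BLAST+ command-line tools (`makeblastdb`, `blastp`) rather than whatever stand-alone version was installed in 2008, the two steps (build a protein database from the Organelle DB FASTA, then search it with the MOAD FASTA) could be scripted as below. All file and database names are hypothetical placeholders, and tabular output (`-outfmt 6`) is an assumption chosen to make later parsing easy.

```python
import shutil
import subprocess


def build_blast_commands(db_fasta, query_fasta, db_name, out_path, evalue=10.0):
    """Return two argv lists: one to build the protein database, one to search it.

    Assumes BLAST+ tools; file names are supplied by the caller (hypothetical here).
    """
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    search = [
        "blastp",
        "-query", query_fasta,   # MOAD sequences used as queries
        "-db", db_name,          # database built from Organelle DB sequences
        "-outfmt", "6",          # tab-separated output, one hit per line
        "-evalue", str(evalue),  # E-value cutoff (assumed; not given on the slide)
        "-out", out_path,
    ]
    return make_db, search


def run_blast(db_fasta, query_fasta, db_name, out_path):
    """Run both steps; raises if BLAST+ is not installed or a step fails."""
    if shutil.which("makeblastdb") is None or shutil.which("blastp") is None:
        raise RuntimeError("BLAST+ executables not found on PATH")
    for command in build_blast_commands(db_fasta, query_fasta, db_name, out_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in build_blast_commands(
        "organelle.fasta", "moad.fasta", "organelle_db", "moad_vs_organelle.tsv"
    ):
        print(" ".join(command))
```

Keeping command construction separate from execution makes it possible to inspect the command lines on a machine where BLAST is not installed.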

Data process
• A C++ program was written to automate extracting the information we need from the output text file (180 MB).
• Only the best match (highest identity and lowest E-value) was selected.
• Merged the Organelle DB with the result information.
• Sample output:
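The extraction and merge steps described above can be sketched as follows. This is not the C++ program from the talk: it is a short Python illustration that assumes BLAST tabular output with the 12 standard columns (query ID, subject ID, % identity, ..., E-value, bit score). The slide does not say how the two criteria were combined, so the sketch assumes highest identity first with ties broken by lowest E-value. All IDs and localization labels in the demo are made up.

```python
import csv


def best_hits(tabular_lines):
    """Keep one hit per query: highest % identity, ties broken by lowest E-value."""
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        rank = (-identity, evalue)  # smaller tuple means better hit
        if query not in best or rank < best[query][0]:
            best[query] = (rank, subject, identity, evalue)
    return {
        query: {"subject": subject, "identity": identity, "evalue": evalue}
        for query, (_, subject, identity, evalue) in best.items()
    }


def merge_localization(hits, localization):
    """Attach the localization of the matched Organelle DB entry to each query."""
    merged = []
    for query, hit in sorted(hits.items()):
        record = {"query": query, **hit}
        # None means the matched entry has no localization in the lookup table.
        record["localization"] = localization.get(hit["subject"])
        merged.append(record)
    return merged


if __name__ == "__main__":
    # Made-up hits and localization table, for illustration only.
    demo_lines = [
        "q1\torgA\t100.00\t200\t0\t0\t1\t200\t1\t200\t1e-100\t400",
        "q1\torgB\t45.00\t180\t99\t3\t1\t180\t5\t184\t1e-30\t120",
        "q2\torgC\t20.00\t150\t120\t5\t1\t150\t1\t150\t0.5\t30",
    ]
    demo_localization = {"orgA": "mitochondrion"}
    for record in merge_localization(best_hits(demo_lines), demo_localization):
        print(record)
```

Reading the file line by line and keeping only the current best hit per query avoids loading the whole 180 MB output into memory.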

Summary of alignment

Benchmark of alignment
• Randomly selected 10-30 PDB proteins from each similarity-level group.
• PDB: http://www.rcsb.org/pdb/home/home.do
• Looked up information about each PDB protein (function, source organism, etc.)
• Identical at 100%; similar from 99% down to 20% similarity.
• Differences are trivial at all levels (except for 2-12%).
• Differences seen: different source organism, same function with different substrate, and same family but different subtype.
• Example (20% similarity): PDB 1cy0: DNA topoisomerase I; Organelle 141797: DNA topoisomerase III
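The random selection per similarity group can be sketched as a stratified sample. The slides do not give the group boundaries, so the 10%-wide bins below are an assumption, with 100% identity kept as its own group. The function name, the fixed seed, and the input format (pairs of protein ID and % identity) are likewise illustrative choices.

```python
import random
from collections import defaultdict


def sample_per_bin(hits, bin_width=10, per_bin=10, seed=0):
    """Draw up to `per_bin` proteins at random from each % identity group.

    hits: iterable of (protein_id, percent_identity) pairs.
    Returns {bin_lower_bound: [protein_id, ...]}; 100 is its own group.
    """
    bins = defaultdict(list)
    for protein_id, identity in hits:
        if identity >= 100:
            label = 100  # identical matches form a separate group
        else:
            label = int(identity // bin_width) * bin_width
        bins[label].append(protein_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    sampled = {}
    for label in sorted(bins):
        members = bins[label]
        sampled[label] = rng.sample(members, min(per_bin, len(members)))
    return sampled


if __name__ == "__main__":
    # Made-up identities, for illustration only.
    demo = [(f"p{i}", float(i)) for i in range(0, 101)]
    for label, chosen in sample_per_bin(demo, per_bin=3).items():
        print(label, chosen)
```

Groups with fewer members than requested are returned whole, which matters for sparse groups such as the lowest-identity matches.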

About Database
• No specific protein localization information is given in PDB.
• Functional information for certain proteins in Organelle DB is putative, based on sequence- or structure-based alignment, and should not be considered in the benchmark.
• Things to be done:
• Identify how many proteins overlap, using ClustalW2
• Extract ligand information from MOAD (automate by programming): SMILES -> Structures -> ChemAxon/MOE

Perspective
• Model subcellular distribution of compounds based on structure of ligand and protein, 1CellPK and the protein localization.
• Challenges:
1. No quantitative distribution data available
2. A and B are mitochondrial proteins. Do they have the same distribution in the outer membrane, inner membrane, and other compartments?

Acknowledgement
• Dr. Gus Rosania
• Graduate Students: Xinyuan Zhang, Jason Baik, Nan Zheng
