1 / 23

Evidence for exon-mediated domain shuffling in the formation of complex eukaryotic genes

Evidence for exon-mediated domain shuffling in the formation of complex eukaryotic genes. Albert DG de Roos, Syncyte BioIntelligence albert.de.roos@syncyte.com December 2008. Introduction.

thyra
Download Presentation

Evidence for exon-mediated domain shuffling in the formation of complex eukaryotic genes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evidence for exon-mediated domain shuffling in the formation of complex eukaryotic genes Albert DG de Roos, Syncyte BioIntelligence albert.de.roos@syncyte.com December 2008

  2. Introduction • From an engineering perspective, a simple way to build a complex protein is to use small building blocks and recombine them into larger functional units • Exons would potentially make a good molecular candidate to form the basis of a modular gene building system • Exons could be recombined into functional protein modules (=exon shuffling) • Functional modules made up of exons could be recombined into multimodular genes (=module shuffling) • Exon contain splice site recognition sequences at the ends of exons that define them as a distinct unit (de Roos, 2005) • Splice site sequences could act as concensus sites for recombination of exons • This study investigates whether the formation of complex genes may reflect an assembly using exons as building blocks

  3. Materials & Methods • ExInt database (Sakharkar 2002) was used to search for genes with duplicated exons using standard query language SQL • An SQL query was made that compared the first and the last residues of an exon with all other exons in the same gene. • Exons were considered to be duplicate when at least 5 out of 10 residues on both ends of the exon were identical • The high degree of similarity (10 out of 20 residues in total) ensured that there were no false positives. • From the genes that contained duplicate exons, genes were analyzed that contained a conserved domain in the conserved domain database (CDD) • From this set, 10 random genes were analyzed in detail for their intron-exon structure • Intron-exon structure was determined from ExInt and aligned with conserved domains from CDD (conserved domain database). • In general, a default expectation value of 0.01 was used in the CDD search, which identified most conserved domains in the gene • Database and queries can be obtained through syncyte.com

  4. 1. Chicken ovoinhibitor A 1 472 100 200 300 400 KAZAL KAZAL KAZAL KAZAL KAZAL KAZAL KAZAL intron phase 1 0 2 0 2 0 2 0 2 0 2 0 2 0 2 B Non conserved region close to external exon MRTARQFVQVALALCCFA|dIAFGIE| vNCSLYASGIGKDGTSWVACPRNLKPVCGTDGSTYSNECGICLYN|rEHGANVEKEYDGECRPKHVM| iDCSPYLQVVRDGNTMVACPRILKPVCGSDSFTYDNECGICAYN|aEHHTNISKLHDGECKLEIGS| vDCSKYPSTVSKDGRTLVACPRILSPVCGTDGFTYDNECGICAHN|aEQRTHVSKKHDGKCRQEIPE| iDCDQYPTRKTTGGKLLVRCPRILLPVCGTDGFTYDNECGICAHN|aQHGTEVKKSHDGRCKERSTP| lDCTQYLSNTQNGEAITACPFILQEVCGTDGVTYSNDCSLCAHN|iELGTSVAKKHDGRCREEVPE| lDCSKYKTSTLKDGRQVVACTMIYDPVCATNGVTYASECTLCAHN|lEQRTNLGKRKNGRCEEDITK| eHCREFQKVSPICTMEYVPHCGSDGVTYSNRCFFCNAY|vQSNRTLNLVSMAAC 0 2 Non conserved region close to external exon External exons conserved position and same phase Internal introns same relative position and phase

  5. 2. Histone deacetylase 6 A B Many similar exons and shared introns

  6. 3. l(2)01289 gene long form A 1 1500 1703 250 500 750 1000 1250 PDI TRX PDI PDI TRX PDI PDI PDI HyaE PDI PDI PDI Introns of repeating unit same relative position and phase B 1 1 2 2 1 1 1 3 1 0 4 1 Intron sliding event? 5 1 6 1 7 Repeating unit roughly corresponds to the isomerase domains 8 1 1 1 9 10 1 1 11 1 12 1 1 13 1 1 2 14 1 2 15 1 1

  7. 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene A 1 3000 4370 1000 2000 4000 LamB SEA EGF_Lam LDLa LamG IGcam HSPG2 is made out of many repeating units in various forms

  8. 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene (contd) B |dDLGSGDLGSGDFQM|vYFRALVNFTRSIEYSPQLEDAGSREFREVSEAVVDT|lESEYLKIPGDQVVSVVFI|kELDGWVFVELDVGSEGNADGAQIQEMLLRVISSGSVASYVTSPQGFQFRRLGT|vPQFPRFPRACTEAEFA* Overlap between domains on exons: rare C (171) (382) 1 150 212 50 100 200 LDLa LDLa LDLa intron phase 1 1 1 1 1 All external introns phase 1 Single exon domain D cd00112:LDLa2CPPGEFQCKN-GRCIPLSWVCDGVDDCGDGSDEENC36 exon 6: |vPQFPRACTEAEFACHSYNECVALEYRCDRRPDCRDMSDELNC exon 7: |eEPV(46)RPLPCGPQEAACRN-GHCIPRDYLCDGQEDCEDGSDELDC exon 9: |pTKRPEEVCGPTQFRCVSTNMCIPASFHCDEESDCPDRSDEFGC|

  9. 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene (contd) E All domains part of duplication (1671) (552) 1 750 1120 250 500 1000 LamB LamB LAmB : EGFLam (smart000180) : EGFLam (pfam0053)

  10. 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene (contd) F Single intron gain, or multiple loss? Many introns conserved intron phase 0 1 0 2 0 0 2 0 0 0 0 0 0 0 Intron sliding event? 0 2 2 0 0 2 0 0 1 2 0 0 2 1 2 2 2 1 Many introns conserved 1 2 2

  11. 2 1 |mPPQVVTPPRESIQASRGQTVTFTCVAIGVPTPIINWRLNWGHIPSHP|rVTVTSEGGRGTLIIRDVKESDQGAYTCEAMNARGMVFGIPDGVLELVPQR| 2 |tNQAPLVVEVHPARSIVPQGGSHSLRCQVSGSPPHYFYWSREDGRPVPSGTQQRHQ|gSELHFPSVQPSDAGVYICTCRNLHQSNTSRAELLVT| 3 |eAPSKPITVTVEEQRSQSVRPGADVTFICTAKSK|sPAYTLVWTRLHNGKLPTRAMDFNGILTIRNVQLSDAGTYVCTGSNMFAMDQGTATLHVQ| 4 |aSGTLSAPVVSIHPPQLTVQPGQLAEFRCSATGSPTPTLEWT|gGPGGQLPAKAQIHGGILRLPAVEPTDQAQYLCRAHSSAGQQVARAVLHVH| 5 |gGGGPRVQVSPERTQVHAGRTVRLYCRAAGVPSATITWRKEGGSLPPQ|aRSERTDIATLLIPAITTADAGFYLCVATSPAGTAQARIQVVVLS| 6 |aSDASPPPVKIESSSPSVTEGQTLDLNCVVAGSAHAQVTWYRRGGSLPPHTQ|vHGSRLRLPQVSPADSGEYVCRVENGSGPKEASITVSVLHGTHSGPSYTP| 7 |vPGSTRPIRIEPSSSHVAEGQTLDLNCVVPGQAHAQVTWHKRGGSLPARHQ|tHGSLLRLHQVTPADSGEYVCHVVGTSGPLEASVLVTIEASVIP| 8 |gPIPPVRIESSSSTVAEGQTLDLSCVVAGQAHAQVTWYKRGGSLPARHQ|vRGSRLYIFQASPADAGQYVCRASNGMEASITVTVTGTQGANLAY| 9 |pAGSTQPIRIEPSSSQVAEGQTLDLNCVVPGQSHAQVTWHKRGGSLPVRHQ|tHGSLLRLYQASPADSGEYVCRVLGSSVPLEASVLVTIEPAGSVP| 10 |aLGVTPTVRIESSSSQVAEGQTLDLNCLVAGQAHAQVTWHKRGGSLPARHQ|vHGSRLRLLQVTPADSGEYVCRVVGSSGTQEASVLVTIQQRLSGSH| 11 |sQGVAYPVRIESSSASLANGHTLDLNCLVASQAPHTITWYKRGGSLPSRHQ|iVGSRLRIPQVTPADSGEYVCHVSNGAGSRETSLIVTIQGSGSSH| 12 |vPSVSPPIRIESSSPTVVEGQTLDLNCVVARQPQAIITWYKRGGSLPSRHQ|tHGSHLRLHQMSVADSGEYVCRANNNIDALEASIVISVSPSAGSPS| 13 |aPGSSMPIRIESSSSHVAEGETLDLNCVVPGQAHAQVTWHKRGGSLPSHHQ|tRGSRLRLHHVSPADSGEYVCRVMGSSGPLEASVLVTIEASGSSAVHVP| 14 |aPGGAPPIRIEPSSSRVAEGQTLDLKCVVPGQAHAQVTWHKRGGNLPARHQ|vHGPLLRLNQVSPADSGEYSCQVTGSSGTLEASVLVTIEPSSPGPIP| 15 |aPGLAQPIYIEASSSHVTEGQTLDLNCVVPGQAHAQVTWYKRGGSLPARHQ|tHGSQLRLHLVSPADSGEYVCRAASGPGPEQEASFTVTVPPSEGSSY| 16 |rLRSPVISIDPPSSTVQQGQDASFKCLIHDGAAPISLEWKTRNQELE|dNVHISPNGSIITIVGTRPSNHGTYRCVASNAYGVAQSVVNLSVH| 17 |gPPTVSVLPEGPVWVKVGKAVTLECVSAGEPRSSARWTRISSTPAKLEQRTYGLMDSHAVLQ|iSSAKPSDAGTYVCLAQNALGTAQKQVEVIVDTGAMA 18 PGAPQVQAEEAELTVEAGHTATLRCSAT|gSPAPTIHWSKLRSPLPWQHRLEGDTLIIPRVAQQDSGQYICNATSPAGHAEATIILHVE| 19 |sPPYATTVPEHASVQAGETVQLQCLAHGTPPLTFQWSRVGSSLPGRATARNELLHFERAAPEDSGRYRCRVTNKVGSAEAFAQLLVQ| 20 |gPPGSLPATSIPAGSTPTVQVTPQLETKSIGASVEFHCAVPSDRGTQLRWFKEGGQLPPGHSVQDGVL|rIQNLDQSCQGTYICQAHGPWGKAQASAQLVIQ| 21 |aLPSVLINIRTSVQTVVVGHAVEFECLALGDPKPQVTWSKVGGHLRPGIVQSGGVVRIAHVELADAGQYRCTATNAAGTTQSHVLLLVQ| 22 |aLPQISMPQEVRVPAGSAAVFPCIASGYPTPDISWSK|lDGSLPPDSRLENNMLMLPSVRPQDAGTYVCTATNRQGKVKAFAHLQVP| 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene (contd) Many duplications G (383) (481) (1651) (3638) 1 98 1 1500 1988 500 1000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 intron phase 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 All external introns conserved in phase and position H Sliding events with phase shift? Many introns conserved

  12. 4. Human Heparan Sulfate Proteoglycan (HSPG2) gene (contd) I Domain duplications (3639) (4370) 1 500 1120 125 250 375 625 LamG LamG LamG E No apparent intron conservation J intron phase 1 1 2 0 1 0 1 1 1 0 E 1 2 0 2 2 1

  13. |kHEAYAE |cYKMTCLEIAGPAGPKGYRGQK |gAKGNMGEPGSPGLKGRQ |gDPGIEGPIGYPGPK |gVPGLKGEK |gEIGSDGRR |gAAGLAGRNGTDGQK |gKLGRIGPPGCKGDRGDK |gPDGYPGDAGDQGERGDEGMK |gDPGRPGRSGPPGPPGEKGSP |gIPGNPGAQGPGGTKGRKGETGPPGPKGEP |gRRGDPGTKGSKGGPGAKGER |gDPGPEGPRGLPGEVGNKGAR |gDQGLPGPRGPTGAVGEPGNI |gSRGDPGDLGPRGDAGPPGPK |gDRGRPGFS-YPGPRGPQ |gDKGEKGQPGPK |gGRGELGPKGTQGTKGEKGEP |gDPGPRGEPGTRGPPGEAGPE |gTPGPPGDPGLT |dCDVMTYVRETCGCC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5. Col6A2 gene for type VI collagen subunit alpha2 Domain on single exon A 1 1022 250 500 750 vWFA vWFA vWFA 1 0 1 1 1 1 GXY repeat over 21 exons all introns phase 0 0 Single intron gain, or multiple loss? B All introns phase 0 intron phase

  14. I 1 *QPGNFSADEAGAQLFAQSYNSSAEQVLFQSVAASWAHDTNITAENARRQ II 1 |dLVTDEAEASKFVEEYDRTSQVVWNEYAEANWNYNTNITTETSKIL I 2 |eEAALLSQEFAEAWGQKAKELYEPIWQNFTDPQLRRIIGAVRTLGSANLPLAKRQQ II 2 |lQKNMQIANHTLKYGTQARKFDVNQLQNTTIKRIIKKVQDLERAALPAQELEE I 3 |yNALLSNMSRIYSTAKVCLPNKTATCWSLDP II 3 |yNKILLDMETTYSVATVCHPNGSCLQLEP I 4 |dLTNILASSRSYAMLLFAWEGWHNAAGIPLKPLYEDFTALSNEAYKQD II 4 |dLTNVMATSRKYEDLLWAWEGWRDKAGRAILQFYPKYVELINQAARLN I 5 |gFTDTGAYWRSWYNSPTFEDDLEHLYQQLEPLYLNLHAFVRRALHRRYGDRYINLRGPIPAHLL II 5 |gYVDAGDSWRSMYETPSLEQDLERLFQELQPLYLNLHAYVRRALHRHYGAQHINLEGPIPAHLL I 6 |gDMWAQSWENIYDMVVPFPDKPNLDVTSTMLQQ II 6 |gNMWAQTWSNIYDLVVPFPSAPSMDTTEAMLKQ I 7 |gWNATHMFRVAEEFFTSLELSPMPPEFWEGSMLEKPADGREVVCHASAWDFYNRKDF II 7 |gWTPRRMFKEADDFFTSLGLLPVPPEFWNKSMLEKPTDGREVVCHASAWDFYNGKDF I 8 |rIKQCTRVTMDQLSTVHHEMGHIQYYLQYKDLPVSLRRGANPGFHEAIGDVLALSVSTPEHLHKIGLLDRVTNDT II 8 |rIKQCTTVNLEDLVVAHHEMGHIQYFMQYKDLPVALREGANPGFHEAIGDVLALSVSTPKHLHSLNLLSSEGGSD I 9 |eSDINYLLKMALEKIAFLPFGYLVDQWRWGVFSGRTPPSRYNFDWWYL I 9 |eHDINFLMKMALDKIAFIPFSYLVDQWRWRVFDGSITKENYNQEWWSL I 10 |rTKYQGICPPVTRNETHFDAGAKFHVPNVTPYI II 10 |rLKYQGLCPPVPRTQGDFDPGAKFHIPSSVPYI I 11 |rYFVSFVLQFQFHEALCKEAGYEGPLHQCDIYRSTKAGAKL II 11 |rYFVSFIIQFQFHEALCQAAGHTGPLHKCDIYQSKEAGQRL I 12 |rKVLQAGSSRPWQEVLKDMVGLDALDAQPLLKYFQPVTQWLQEQNQQNGEVLGWPEYQWHPPLPDNYPEGI II 12 |aTAMKLGFSRPWPEAMQLITGQPNMSASAMLSYFKPLLDWLRTENELHGEKLGWPQYNWTPNS| 1 0 0 0 0 1 1 1 1 1 1 0 0 2 1 2 1 1 2 2 2 2 2 1 2 6. Angiotensin I converting enzyme precursor (DCP1) gene One gene with duplicated domain A 1 1306 250 500 750 1000 1250 Peptidase M2 Peptidase M2 II I 12 exons 12 exons B All exons similar in size and phase: complete duplication

  15. 7. Hexokinase Duplicated hexokinse domain, consisting itself of two domains (1 and 2) A 1 921 125 250 375 500 625 750 875 Hexokinase_1 Hexokinase_2 Hexokinase_1 Hexokinase_2 COG5026 hexokinase COG5026 hexokinase 0 0 1 0 0 0 1 2 2 2 1 0 0 0 1 2 2 2 No intron betwen hexokinase domains. Intron lost? B intron phase 1 0 0 0 1 2 2 2 On introns between hexokinase subdomains: lost ancient introns? Overall intron-exon structure conserved

  16. 8. Werner syndrome gene (WRN) No overlap between domains on exon A 1 1432 250 500 750 1000 1250 35EXOc DEXHc HELICc RQC HRDC intron phase 0 2 1 0 0 1 2 0 0 0 1 2 1 2 2 1 0 2 0 2 2 2 0 0 2 0 0 2 0 0 1 0 B |sMSLSDGDVVGFDMEWPPLYNRGKLGKVALIQLCVSESKCYLFHVSSMS|vFPQGLKMLLENKAVKKAGVGIEGDQWKLLRDFDIKLKNFVELTDVANKK|lKCTETWSLNSLVKHLLGKQLLKDKSIRCSNWSKFPLTEDQKLYAATDAY|aGFIIYRNLEILDDTVQRFAINK| Duplicated exon (no clear function |hLSPNDNENDTSYVIESDEDLEMEMLK|hLSPNDNENDTSYVIESDEDLEMEMLK| |pVQWKVIHSVLEERRDNVAVMATg|YGKSLCFQYPPVYVGKIGLVISPLISLMEDQVLQL|kMSNIPACFLGSAQSENVLTDIK|lGKYRIVYVTPEYCSGNMGLLQQLEADI|gITLIAVDEAHCISEWGHDFRDSFRKLGSLKTALPM|vPIVALTATASSSIREDIVRCLNLRNPQITCTGFDRPNLYLEVRRKTGNILQDLQPFLVKT| |sSHWEFEGPTIIYCPSRKMTQQVTGELRKLNLSCGTYHAGMSFSTRKDIHHRFVRDEIQ|cVIATIAFGMGINKADIRQVIHYGAPKDMESYYQEIGRAGRDGLQSSCHVLWAPADINLN| |rLDHCYSMDDSEDTSWDFGPQAFKLLSAVDILGEKFGIGLPILFLRGS|nSQRLADQYRRHSLFGTGKDQTESWWKAFSRQLITEGFLVEVSRYNKFMKICALTKK|gRNWLHKANTESQSLILQANEELCPKKFLLP| |sIMVQSPEKAYSSSQPVISAQEQETQ|iVLYGKLVEARQKHANKMDVPPAILATNKILVDMAKM|rPTTVENVKRIDGVSEGKAAMLAPLLEVIKHFCQTNSVQ|

  17. All introns phase except indicated 1 9. DMBT1 candidate tumour suppressor gene Many SR domains, similar intron-exon structure A 1 2426 500 1000 1500 2000 SR SR SR SR SR SR SR SR SR SR SR SR SR CUB SR CUB Zona_pellucida 2 2 0 2 B MGISTVILEMCLLWGQVLST|gGWIPRTTDY|aSLIPSEVPLDTTVAE|gSPFPSELTLESTVAE|gSPISLESTLESTVAE|gSLIPSESTLESTVAE|gSDSGLALRLVNGDGRCQGRVEILYRGSWGTVCDDSWDTNDANVVCRQLGCGWAMSAPGNAWFGQGSGPIALDDVRCSGHESYLWSCPHNGWLSHNCGHGEDAGVICS|aAQPQSTLRP|eSWPVRISPPVPTE|gSESSLALRLVNGGDRCRGRVEVLYRGSWGTVCDDYWDTNDANVVCRQLGCGWAMSAPGNAQFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLTHNCGHSEDAGVICS|aPQSRPTPSP|dTWPTSHASTA|gPESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDSWDTSDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGYESYLWSCPHNGWLSHNCQHSEDAGVICS|aAHSWSTPSP|dTLPTITLPASTV|gSESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDSWDTNDANVVCRQLGCGWAMLAPGNARFGQGSGPIVLDDVRCSGNESYLWSCPHNGWLSHNCGHSEDAGVICS|dTLPTTTLPASTV|gPESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDSWDTNDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPNNGWLSHNCGHHEDAGVICS|aAQSRSTPRP|dTLSTITLPPSTV|gSESSLTLRLVNGSDRCQGRVEVLYRGSWGTVCDDSWDTNDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHHEDAGVICS|vSQSRPTPSP|dTWPTSHASTA|gSESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDSWDTSDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGYESYLWSCPHNGWLSHNCQHSEDAGVICS|aAHSWSTPSP|dTLPTITLPASTV|gSESSLALRLVNGGDRCQGRVEVLYQGSWGTVCDDSWDTNDANVVCRQLGCGWAMSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHSEDAGVICS|aSQSRPTPSP|dTWPTSHASTA|gSESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDYWDTNDANVVCRQLGCGWAMSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHHEDAGVICS|aSQSQPTPSP|dTWPTSHASTA|gSESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDYWDTNDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHHEDAGVICS|aSQSQPTPSP|dTWPTSHASTA|gSESSLALRLVNGGDRCQGRVEVLYRGSWGTVCDDYWDTNDANVVCRQLGCGWATSAPGNARFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHHEDAGVICS|aSQSQPTPSP|dTWPTSRASTA|gSESTLALRLVNGGDRCRGRVEVLYQGSWGTVCDDYWDTNDANVVCRQLGCGWAMSAPGNAQFGQGSGPIVLDDVRCSGHESYLWSCPHNGWLSHNCGHHEDAGVICS|aAQSQSTPRP|dTWLTTNLPALTV|gSESSLALRLVNGGDRCRGRVEVLYRGSWGTVCDDSWDTNDANVVCRQLGCGWAMSAPGNARFGQGSGPIVLDDVRCSGNESYLWSCPHKGWLTHNCGHHEDAGVICS|aTQINSTTT|dWWHPTTTTTA|rPSSNCGGFLFYASGTFSSPSYPAYYPNNAKCVWEIEVNSGYRINLGFSNL|kLEAHHNCSFDYVEIFDGSLNSSLLLGKICNDTRQIFTSSYNRMTIHFRSDISFQNTGFLAWYNSFPS|dATLRLVNLNSSYGLCAGRVEIYHGGTWGTVCDDSWTIQEAEVVCRQLGCGRAVSALGNAYFGSGSGPITLDDVECSGTESTLWQCRNRGWFSHNCNHREDAGVICS|gNHLSTP|aPFLNITRPN|tDYSCGGFLSQPSGDFSSPFYPGNYPNNAKCVWDIEVQNNYRVTVIFRDV|qLEGGCNYDYIEVFDGPYRSSPLIARVCDGARGSFTSSSNFMSIRFISDHSITRRGFRAEYYSSPSNDST|nLLCLPNHMQASVSRSYLQSLGFSASDLVISTWNGYYECRPQITPNLVIFTIPYSGCGTFKQ|aDNDTIDYSNFLTAAVSGGIIKRRTDLRIHVSCRMLQNTWVDTMYIANDTIHVANNTIQVEEVQYGNFDVNISFYTSSSFLYPVTSRPYYVDLNQDLYVQAEILHSDAVLTLFVDTCVASPYSNDFTSLTYDLIRS|gCVRDDTYGPYSSPSLRIARFRFRAFHFLNRFPSVYLRCKMVVCRAYDPSSRCYRGCVLRSKRDVGSYQEKVDVVLGPIQLQTPPRREEEPR No overlap between domains on exon

  18. 10. E-selectin gene All domains on a single exon A 1 609 125 250 375 500 CCP CCP CCP CCP CLECT_selectins_like EGF.. CCP CCP intron phase 1 1 1 1 1 1 1 1 1 1 2 All (external) exons same phase B |vLLLFKEGGAWSYNASTEAMTFDEASTYCQQRYTHLVAIQNQEEIKYLNSMFSYTPTYYWIGIRKVNKKWTWIGTQKLLTEEAKNWAPGEPNNKQNDEDCVEIYIKRDKDSGKWNDERCDKKKLALCYT| |aACTPTSCSGHGECVETVNNYTCKCHPGFRGLRCEQ| |vVTCQAQEAPEHGSLVCTHPLGTFSYNSSCFVSCDKGYLPSSTEATQCTSTGEWSASPPACN |vVECSALTNPCHGVMDCLQSSGNFPWNMTCTFECEEGFELMGPKRLQCTSSGNWDNRKPTCK |aVTCGAIGHPQNGSVSCSHSPAGEFSVRSSCNFTCNEGFLMQGPAQIECTAQGQWSQQVPVCK |aSQCKALSSPERGYMSCLPGASGSFQSGSSCEFFCEKGFVLKGSKTLQCGLTGKWDSEEPTCE |aVKCDAVQQPQDGLVRCAHSSTGEFTYKSSCAFSCEEGFELHGSAQLECTSQGRWSQEVPSCQ |vVQCSSLAVSGKMNISCSGEPVFGAVCAFACPEGWTLNGSAALMCDATGHWSGMLPTCE

  19. Discussion • In general, the analyzed genes structure could be represented by exon-limited building blocks spanning one or more exons. • Duplicated domains showed frequent conservation of intron position (and phase), indicating duplication as a unit. • Most single-exon domains had symmetrical and conserved exon phase when duplicated, indicating transfer as a unit. • Domain boundaries showed a good correlation with external exons with most domains ending at or just before an external exon. • Different functional modules did not share exons, an important prerequisite for behavior of modules as a unit. • The short sequence of non conserved domain sequences at either end of the exon module suggest that it functions as a spacer • Data fits generally well in a model in which intradomain intron-loss would be the main process • If we look at the number of identical internal introns in most domains, it seems that intron-loss is not a common phenomenon. • Intron shifts within a module was often accompanied by a change in phase, suggesting intron-sliding events.

  20. Conclusion • The data is in line with a general gene structure in which domains are located on a set of exons that can be transposed as a whole • The genome seems to show all the hallmarks of modular design • Potential for active recombination of exon modules by site-specific (splice-site) recombination enzymes • Intra-domain intron loss or the loss of recombination sites would prevent some exons to recombine, thereby defining groups of exon as a unit • Enzyme-mediated exon and domain shuffling could represent a powerful design pattern for directed or active evolution Exons represent the building blocks of the genome

  21. Addendum: Legends Figures 1-3 Fig. 1. Ovoinhibitor gene (chicken). A. The ovoinhibitor gene consists of 7 Kazal-type proteinase inhibitor domains (CDD cd01327; taxonomy Euteliostomi). Numbers indicate amino acid residue number, introns are indicated by the vertical lines. B. The sequence of the ovoinhibitor gene with Kazal domains as identified by the conserved domain database (CDD) colored red. All domains are flanked by by phase 0 introns and contain a single internal intron. Phase is indicated (circles). Fig. 2. Histone deacetylase 6 (Homo sapiens). A. The intron-exon and domain structure of histone deacetylase protein, also known as GATA-binding protein 1, with two histone deacetylase domains (CDD id: pfam00850; taxonomy: cellular organisms) and a single C-terminal ubiquitin-hydrolase Zn-finger (pfam02148, root). B. Alignment of the histone deacetylase domains (red), introns are shown by vertical lines. C. The amino acid sequence of the gene with the domain colored according to A., showing the domains relative to the exons in detail. Fig. 3. The gene ‘lethal (2) 01289’ (Drosophila). A. Overall intron-exon structure with domains. The lethal (2) 01289 gene consists of multiple PDI-a (protein disulfide isomerase family) domains (CDD id: cd02961; taxonomy: Eukaryota) and the related redox active TRX domains (cd02947; cellular organisms). Domains shown in white with red lining show up in a CDD search with an expectation value of 1 (instead of the standard value used of 0.01) B. Alignment of the conserved domains with intron position and phase indicated. Putative exon components with conserved sequence of the domains are shown with different colors, no color indicates a sequence with no other similarity within the gene. Vertical lines indicate introns. Intron phase is as indicated and the entire continuous gene is represented.

  22. Addendum: Legends Figures 4-6 Fig. 4.A. Heparan Sulfate Proteoglycan (HSPG2) gene (Homo sapiens). The HSPG2 gene contains at least 36 conserved domains from 6 different families. These comprise SEA enterokinase (SEA; CDD id: smart00200; taxonomy Bilateria), low density lipoprotein receptor class A domain (LDLa; cd00112; Metazoa), immunoglobulin domain cell adhesion molecule subfamily (IGcam; cd00931; Bilateria), laminin B domain (LamB; smart00281; Bilateria), laminin-type epidermal growth factor-like domain (EGF_Lam; cd00055; Eumetazoa) and laminin G domain (LamG; cd00110; cellular organisms). Numbers indicate amino acid residue. B. The SEA domain, its sequence and intron-exonj structure. C. The three LDL domains including intron phase. Numbers in brackets show the absolute position within the HSPG2 gene. D. Alignment of the domains show the sequence conservation and intron positions. Colored sequences indicate conserved domain, color code according to the CDD. E. Relative position of the domains and introns. Numbers in brackets show the absolute position within the HSPG2 gene. F. Alignment of the different domains and their introns and intron phase. Colored characters indicate the conserved domain sequences. G. The relation between introns an IGcam domains. Longer vertical lines indicate external introns, while the shorter represent internal introns. The first IGcam domain is a single domain near the N-terminus site of the protein, while the remaining 21 domains lie head-to-toe. All domains contain flanking phase 1 external introns and contain a single internal introns. H. Internal introns, intron phase and sequence of the IGcam domains. Colored (blue) amino acid sequences indicate the conserved sequences that were determined by a CDD query. I. The relation between domains and exons. J. Aligned domains and their intron-exon structure. Intron phase is indicated in the circles. The encircled e indicates the end of the HSPG2 gene. Fig. 5. The Col6A2 gene for type VI collagen subunit alpha2 (chicken). A. General structure of the vWA domains (CDD id: cd01480); taxonomy: Bilateria) and the intron-exon structure. The gene contains a GXY repeat over 21 exons (B) with all introns phase 0. Fig. 6. Conserved domains of the angiotensin I converting enzyme precursor (DCP1) gene (Homo sapiens). A. the DCP1 gene consists of two Peptidase M2 domains (CDD id: pfam1401; taxonomy: Euarchontoglires) indicated in red with each 12 exons, and separated by an intron. B. Alignment of the two domains shows a similar intron structure.  

  23. Addendum: Legends Figure 7-10 Fig. 7. Hexokinase gene (Homo sapiens) A. General structure of the human hexokinase with two hexokinase domains (CDD id: COG5026; taxonomy: cellular organisms) that each consist of two structurally similar domains, hexokinase_1 (pfam00349, Eukaryota) and hexokinase_2 (pfam03727, Eukaryota). There are no introns between the two main hexokinase domains nor between the subdomains. B. Alignment of the two hexokinase domains showing internal intron position and phase. Fig. 8. The WRN gene (Homo sapiens). A. Overall structure of the WRN gene with 5 different single domains: a 3'-5' exonuclease (35EXOc; CDD id: cd00007; taxonomy: root), a helicase superfamily domain (HELICc; cd00079, root), a DEXH-box helicases (DEXHc, cd00269, root), a RecQ family enzyme domain (RQC; pfam09382, cellular organisms) and a helicase and RNase D domain (HRDC; smart00341, cellular organisms). Introns with their phase are shown below. B. Detailed look at the introns and exons in the different domains. Colored sequences indicate a CDD domain, the grey sequences were not represented by a CDD domain and reflect the duplicated exon on which the query was based. Fig. 9. DMBT1 candidate tumour suppressor gene (Homo sapiens). A. Overal intron-exon structure of the DMBT1 gene with 14 scavenger receptor domains (SR; CDD id: smart0020, taxonomy: metazoan), two CUB domains (cd00041, root) and a zona pellucida domain (pfam00100, Bilateria) B. Complete sequence of the DMBT1 gene with sequences of domains colored and introns. Fig. 10. E-selectin gene (Canis familiaris). A. Intron-exon structure of the E-selectin gene related to its protein domains. The gene has a selectins-like domain (CLECT; CDD id: cd03592, taxonomy: Bilateria), a calcium-binding EGF-like domain (EGF_CA; cd00054, Eukaryota), and a complement control protein domain (CCP; cd00033, root). B. Domains in the direct context of their sequence with introns indicated. Sequence shown is continuous.

More Related