Annotation Parsing

Affymetrix file format: a comma-separated file containing many kinds of annotation data, including UniGene, Ensembl, Entrez and SwissProt IDs; genome version, chromosomal location and alignment info; Gene Ontology info; pathway membership; and protein families and domains.

orli

Presentation Transcript


  1. Annotation Parsing

  2. Affymetrix File Format
     • Comma-separated file containing many kinds of annotation data:
       • UniGene, Ensembl, Entrez and SwissProt IDs
       • Genome version, chromosomal location, alignment info
       • Gene Ontology info
       • Pathway membership
       • Protein families and domains
     • A single record looks like:

     "1000_at","Human Genome U95Av2 Array","Homo sapiens","Dec 18, 2005","Exemplar sequence","GenBank","X60188mRNA","X60188 /FEATURE=mRNA /DEFINITION=HSERK1 Human ERK1 mRNA for protein serine/threonine kinase","X60188","---","Hs.861","May 2004 (NCBI 35)","chr16:30032927-30042040 (-) // 93.03 // p11.2","mitogen-activated protein kinase 3","MAPK3","chr16p12-p11.2","full length","ENSG00000102882","5595","P27361 /// Q9BWJ1 /// Q7Z3H5 /// Q8NHX0 /// Q8NHX1","EC:2.7.1.-","601795","NP_002737.1","NM_002746","---","---","---","---","---","---","74 // regulation of progression through cell cycle // non-traceable author statement /// 6468 // protein amino acid phosphorylation // inferred from direct assay /// 6468 // protein amino acid phosphorylation // inferred from electronic annotation /// 7049 // cell cycle // inferred from electronic annotation","---","166 // nucleotide binding // inferred from electronic annotation /// 4674 // protein serine/threonine kinase activity // inferred from electronic annotation /// 4707 // MAP kinase activity // non-traceable author statement /// 4713 // protein-tyrosine kinase activity // inferred from electronic annotation /// 5515 // protein binding // inferred from physical interaction /// 5524 // ATP binding // non-traceable author statement /// 16740 // transferase activity // inferred from electronic annotation /// 4672 // protein kinase activity // inferred from electronic annotation /// 4707 // MAP kinase activity // inferred from electronic annotation /// 5524 // ATP binding // inferred from electronic annotation /// 16301 // kinase activity // inferred from electronic annotation","MAPK_Cascade // GenMAPP /// S1P_Signaling // GenMAPP /// TGF_Beta_Signaling_Pathway // GenMAPP","ec // A2S7_HUMAN // (Q96Q40) Serine/threonine-protein kinase ALS2CR7 (EC 2.7.1.37) (Amyotrophic lateral sclerosis 2 chromosomal region candidate gene protein 7) // 1.0E-77 /// ec // A2S7_HUMAN // (Q96Q40) Serine/threonine-protein kinase ALS2CR7 (EC 2.7.1.37) (Amyotrophic lateral sclerosis 2 chromosomal region candidate gene protein 7) // 2.0E-85 /// hanks // 3.1.1 // CMCG Group; CMGC I Cyclin-dependent (CDKs) and close relatives; CDC2Hs // 1.0E-85 /// hanks // 3.1.1 // CMCG Group; CMGC I Cyclin-dependent (CDKs) and close relatives; CDC2Hs // 1.0E-79","---","IPR000719 // Protein kinase","---","---","This probe set was annotated using the Matching Probes based pipeline to a Entrez Gene identifier using 3 transcripts. // false // Matching Probes // A","BC000205(15),BX537897(15),NM_002746(16)","NM_002746 // Homo sapiens mitogen-activated protein kinase 3 (MAPK3), mRNA. // refseq // 16 // --- /// CR603463 // full-length cDNA clone CS0DN005YA14 of Adult brain of Homo sapiens (human). // gb // 15 // --- /// ENSESTT00000097559 // --- // ensembl_est // 15 // --- /// ENST00000263025 // cdna:known-ccds chromosome:NCBI35:16:30032928:30042042:-1 gene:ENSG00000102882 CCDS10672.1 // ensembl_transcript // 15 // --- /// BC000205 // Homo sapiens, clone IMAGE:3350666, mRNA, partial cds. // gb // 15 // --- /// BX537897 // Homo sapiens mRNA; cDNA DKFZp686O0215 (from clone DKFZp686O0215). // gb // 15 // ---","ENSESTT00000097558 // ensembl_est // 4 // Cross Hyb Matching Probes /// AK096992 // gb // 1 // Cross Hyb Matching Probes"
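The record above suggests three levels of structure: quoted, comma-separated fields (some of which contain commas themselves, such as "Dec 18, 2005"), entries within a field separated by "///", and sub-fields within an entry separated by "//", with "---" apparently marking an empty value. The following is a minimal sketch of splitting a line along those lines. The class and method names are hypothetical, and the separator rules are inferred from this one sample rather than taken from a format specification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any Affymetrix or WorkBench API.
class AffyRecordSketch {

    /** Splits one CSV line into fields, honoring double-quoted fields. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // comma outside quotes ends a field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Splits a multi-valued field into entries, each a list of sub-fields. */
    static List<List<String>> splitEntries(String field) {
        List<List<String>> entries = new ArrayList<>();
        if (field.trim().equals("---")) {
            return entries; // assumption: "---" means "no value"
        }
        for (String entry : field.split("\\s*///\\s*")) {
            List<String> parts = new ArrayList<>();
            for (String part : entry.split("\\s*//\\s*")) {
                parts.add(part.trim());
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

For the pathway field of the sample record, `splitEntries` would yield three entries, each pairing a pathway name with its source (GenMAPP).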

  3. WorkBench Model
     • Automatically identify the chip type by the presence of specific markers.
     • Parse and filter the appropriate annotation file to produce a smaller version of the annotations, called the idx file.
     • Store all annotations in a Map from marker ID to annotation line.
     • For future accesses, skip the filtering in step 2.
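As a rough illustration of the filter-and-store steps in this model, the sketch below reads annotation lines, keeps only those whose marker ID is wanted, and stores the raw line in a Map keyed by marker ID. It is not the WorkBench code: `MarkerIndexSketch` and its `load` method are hypothetical names, and the sketch assumes the marker ID is the first quoted field of each line.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the model described on this slide.
class MarkerIndexSketch {

    /** Keeps the raw annotation line for every marker ID in wantedMarkers. */
    static Map<String, String> load(Reader source, Set<String> wantedMarkers) {
        Map<String, String> byMarker = new HashMap<>();
        // BufferedReader.lines() reports I/O failures as UncheckedIOException.
        new BufferedReader(source).lines().forEach(line -> {
            if (!line.startsWith("\"")) {
                return; // assumption: data lines start with a quoted marker ID
            }
            int end = line.indexOf('"', 1);
            if (end < 0) {
                return; // unterminated first field, skip the line
            }
            String markerId = line.substring(1, end);
            if (wantedMarkers.contains(markerId)) {
                byMarker.put(markerId, line); // the line is stored unparsed
            }
        });
        return byMarker;
    }
}
```

Note that the values in the map are whole unparsed lines, so every lookup of a single annotation still has to split the line again.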

  4. Issues
     • Many hardcoded values in the parser: chip names, annotation names, etc. (100, 147, 393)
     • Hardcoded list of included annotations. (393)
     • Chip type map is fragile: it depends on specific markers being present.
     • Annotations are stored in memory in an unparsed state, which forces the annotation line to be parsed again on every element access. (368, 511)
     • All included annotations are stored in memory. (42)
     • Would benefit from a Singleton pattern: file access could then move out of the static constructor, methods would not need to be static, etc.
     • Includes GUI elements, which makes test cases and programmatic usage difficult. (108)
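One way to address the re-parsing issue is to split each line on first access and reuse the result afterwards. The sketch below shows that idea only. `ParsedAnnotationSketch` is a hypothetical class, and its parser assumes every field is double-quoted and that no field contains the sequence `","` (which holds for the sample record but is not guaranteed in general).

```java
// Hypothetical sketch: parse an annotation line once, then serve fields from the cache.
class ParsedAnnotationSketch {
    private final String rawLine;
    private String[] fields;     // filled on first access, then reused
    private int parseCount = 0;  // only present to make the caching observable

    ParsedAnnotationSketch(String rawLine) {
        this.rawLine = rawLine;
    }

    /** Returns the field at the given zero-based index, parsing at most once. */
    String field(int index) {
        if (fields == null) {
            fields = parse(rawLine);
            parseCount++;
        }
        return fields[index];
    }

    /** Number of times the raw line has been parsed so far. */
    int timesParsed() {
        return parseCount;
    }

    private static String[] parse(String line) {
        String body = line.trim();
        if (body.length() >= 2 && body.startsWith("\"") && body.endsWith("\"")) {
            body = body.substring(1, body.length() - 1); // drop the outer quotes
        }
        // Assumption: fields are separated by the exact sequence quote-comma-quote.
        return body.split("\",\"", -1);
    }
}
```

Caching trades memory for speed, so it does not help with the separate concern that all included annotations are held in memory.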

  5. Annotation Sizes

  6. Proposed fixes
     • Determine and specify the relationship between Microarray data objects and annotation information. What will the impact be if annotations are not available?
     • Make annotation loading a separate step, requested by the user.
     • Allow multiple annotation formats; support non-Affymetrix and custom formats.
     • Do not create a custom index file.
     • Allow user-specified filtering of annotations.
     • Explore open-source disk-based indexes and databases, for example Berkeley DB and hsqldb.
     • Proper MVC structure: the AnnotationParser class is used only for loading and parsing data and does not cause GUI events. (Although this can be said of many Data classes; see CSExprMicroarraySet.java:125.)
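Several of these proposals (multiple formats, user-specified filtering, no GUI events from the parser) can be pictured together as a small format interface plus a loader that takes a column filter. This is only a sketch under stated assumptions: every type and method name below is hypothetical, the column names are supplied by the caller, and the quoted-CSV parser assumes no field contains the sequence `","`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical abstraction: one implementation per annotation file format. */
interface AnnotationFormatSketch {
    /** Turns one data line into column name -> value. Pure data, no GUI calls. */
    Map<String, String> parseLine(String line);
}

/** Handles lines in which every field is double-quoted, as in the sample record. */
class QuotedCsvFormatSketch implements AnnotationFormatSketch {
    private final List<String> columns;

    QuotedCsvFormatSketch(List<String> columns) {
        this.columns = columns;
    }

    @Override
    public Map<String, String> parseLine(String line) {
        String body = line.trim();
        if (body.length() >= 2 && body.startsWith("\"") && body.endsWith("\"")) {
            body = body.substring(1, body.length() - 1);
        }
        // Assumption: fields are separated by the exact sequence quote-comma-quote.
        String[] values = body.split("\",\"", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.size() && i < values.length; i++) {
            record.put(columns.get(i), values[i]);
        }
        return record;
    }
}

class AnnotationLoaderSketch {
    /** Loads only the columns the caller asks for, keyed by the ID column. */
    static Map<String, Map<String, String>> load(Iterable<String> lines,
                                                 AnnotationFormatSketch format,
                                                 String idColumn,
                                                 Predicate<String> keepColumn) {
        Map<String, Map<String, String>> byId = new LinkedHashMap<>();
        for (String line : lines) {
            Map<String, String> record = format.parseLine(line);
            String id = record.get(idColumn);
            if (id == null) {
                continue; // skip lines without an ID rather than raising a dialog
            }
            Map<String, String> kept = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : record.entrySet()) {
                if (keepColumn.test(e.getKey())) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            byId.put(id, kept);
        }
        return byId;
    }
}
```

A non-Affymetrix or custom format would then be another `AnnotationFormatSketch` implementation, and a disk-based store such as Berkeley DB or hsqldb could replace the in-memory map behind the same loader signature.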
