Learn about the basics of DNA, collecting data, and the human genome project. Explore the power of Perl for data extraction and manipulation. Dive into applying AI for exon identification and splice junctions. Join us on this educational journey!
Applying AI to the Human Genome • Part 1 : Collecting data • Prof. M. Embrechts • Robert Bress • Bram Heyns
Overview • Basics of DNA • Collecting the data • Collection : my application • Perl • Goal
Basics of DNA • DNA = polymer of 4 molecules : bases or nucleotides • A = Adenine , C = Cytosine , G = Guanine , T = Thymine • Double helix : bases pair A-T and G-C , which makes replication ( copying ) possible • 3-letter combination = codon • RNA : U = Uracil in place of T => transcription ( DNA copied to RNA ) • Translation ( reading ) : the codons of the RNA are read into amino acids • Protein = polymer built from 20 different amino acids => more complex structure than DNA
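The pairing, transcription, and codon rules above can be tried out on a string. The sketch below is a minimal illustration in Python, not code from the original slides; the function names and the example sequence are made up for this purpose.

```python
# Base-pairing rule from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Return the base-paired partner of each letter, in the same order."""
    return dna.upper().translate(COMPLEMENT)


def transcribe(dna):
    """Write the DNA string in the RNA alphabet: U takes the place of T."""
    return dna.upper().replace("T", "U")


def codons(seq):
    """Split a sequence into complete 3-letter codons, dropping leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    dna = "ATGGCCTTTAA"  # made-up example sequence
    print(complement(dna))
    print(transcribe(dna))
    print(codons(dna))
```

Note that `codons` simply reads from the first letter onward; choosing the correct reading frame in real data is a separate problem that this sketch does not address.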
Intron – Exon – Splice junction • Exon : about 200 characters ; intron : thousands of characters • Splice junction = boundary between an exon and an intron • About 30,000 genes identified out of a possible 100,000 • Gene identification => patent
Summary • Human : 23 pairs of chromosomes • Chromosome => up to thousands of genes • Gene => information : exons , "comments" : introns • Exons and introns => sequences of bases ; exons are read as codons • Codon => 3 bases
Data collection • Human Genome Project • NCBI website : http://www.ncbi.nlm.nih.gov • Entrez-Nucleotide.htm • NCBI Sequence Viewer.htm
Perl • Practical Extraction and Report Language • POD files -> web • Portability • Free , with CPAN modules • String manipulation • Extremely powerful regex engine • Glue language designed for short and simple tasks , which does not mean it lacks power or “serious” features • Tutorial : http://www.netcat.co.uk/rob/perl/win32perltut.html
Regular Expression – Pattern Matching • Practical Extraction and Report Language • Scan through data and extract useful information • Match : m/PATTERN/ , substitute : s/PATTERN/REPLACEMENT/ • 1 line of Perl can do the work of many lines ( up to ~100 ) of C or Java • Complex syntax , but easy to use
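The match and substitute operations can be illustrated with a short sketch. It is written in Python, whose `re` module uses Perl-style pattern syntax, so it is an analogue and not the Perl code behind these slides. The annotation line and both function names are hypothetical.

```python
import re

# Hypothetical annotation line, made up for this example.
LINE = "gene=demo1 exon=3 start=1200 end=1399"


def extract_fields(text):
    """Rough analogue of m/(\\w+)=(\\w+)/g: collect every key=value pair."""
    return dict(re.findall(r"(\w+)=(\w+)", text))


def rename_key(text, old, new):
    """Rough analogue of s/PATTERN/REPLACEMENT/g: rename a key wherever it occurs."""
    return re.sub(r"\b" + re.escape(old) + r"=", new + "=", text)


if __name__ == "__main__":
    print(extract_fields(LINE))
    print(rename_key(LINE, "start", "begin"))
```

One pattern does the scanning and the extraction at once, which is the point of the "scan through data and extract useful information" bullet.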
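Regex examples • /[KCZ]arl^sa/ • /<I>(.*?)<\/I>/i • $1 , $2 , … ( captured groups ) • i , g , c , … ( modifiers ) • . , * , + , ? ( metacharacters and quantifiers ) • /([0-9a-zA-Z])+/ or /([\w])+/ ( \w also matches _ ) • s/us[^a-z]/them/g or s/us\W/them/g • /((acc|act)(ttt|ttc|att))/ • TIMTOWTDI ( There Is More Than One Way To Do It )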
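Two of the ideas in this list, capture groups and alternatives such as acc|act, can be checked with a small sketch. It uses Python's `re` module as a stand-in for Perl, and the input strings and function names are made up. Alternatives need parentheses: inside square brackets the same text is a character class that matches a single character.

```python
import re

# Made-up input with italic tags in mixed case.
HTML = "<i>Homo sapiens</i> and <I>Mus musculus</I>"

# Non-greedy capture between italic tags. re.IGNORECASE plays the role of /i,
# and findall plays the role of /g; each hit is what Perl would place in $1.
ITALIC = re.compile(r"<I>(.*?)</I>", re.IGNORECASE)

# Alternation inside parentheses: "acc or act", followed by "ttt, ttc or att".
CODON_PAIR = re.compile(r"(acc|act)(ttt|ttc|att)")

# The same text inside square brackets is a character class:
# it matches one character from each set, not a 3-letter codon.
CHAR_CLASS = re.compile(r"[acc|act][ttt|ttc|att]")


def italic_texts(text):
    """Return the text found inside each pair of italic tags."""
    return ITALIC.findall(text)


def codon_pairs(seq):
    """Return every 6-letter match of the two-codon pattern."""
    return [m.group(0) for m in CODON_PAIR.finditer(seq)]


if __name__ == "__main__":
    print(italic_texts(HTML))
    print(codon_pairs("ggaccttcgaactatt"))
    print(bool(CHAR_CLASS.fullmatch("ct")), bool(CODON_PAIR.fullmatch("ct")))
```

The last line shows the difference: the bracket version accepts the two-letter string "ct", while the parenthesized version does not.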
Part 2 : Applying AI • Our choice : evolutionary computing • First part : identify the exon parts • Second part : identify the splice junctions • Third part : combine the previous parts • Hope to reach more than 90% accuracy