Understanding Data-Intensive Computing in Bioinformatics: Insights from Human Genetics

Data-intensive Computing: Case Study Area 1: Bioinformatics B. Ramamurthy

Human Genetics • Genomics • Human Genome project • Proteomics • Diseasome • Tree of life project • Phylogenetics

Human cell • Base pairs of DNA: C-G and A-T • C – cytosine, G – guanine, A – adenine, T – thymine • Each human cell contains approximately 3 billion base pairs. • The DNA of a single cell contains so much information that, printed as text, simply listing the first letter of each base would require over 1.5 million pages! • If laid end-to-end, the DNA strand measures about 2–3 meters. • DNA is a very large molecule in the nucleus of the cell • It is coiled as a double helix • Each strand of the DNA molecule is made of A, C, G and T; for example, AAAGTTCTTAATTA on one strand is matched on the other strand by the complementary bases TTTCAAGAATTAAT • These strings of letters contain • Ref: www.ehd.org • Ref text: Bioinformatics: Databases, Tools and Algorithms, by O. Bosu and S.K. Thukral
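The base-pairing rule on this slide (C with G, A with T) can be checked mechanically. The following is a minimal sketch; the function name `complement` is an illustrative choice, not something from the slides, and it pairs bases position by position without reversing the strand, exactly as the slide's example does.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (same reading order)."""
    return "".join(PAIR[base] for base in strand)

# The example strand from the slide and its matching strand.
print(complement("AAAGTTCTTAATTA"))  # TTTCAAGAATTAAT
```

Applying `complement` twice returns the original strand, which is why either strand is enough to reconstruct the other.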

More details • Sequences of base pairs are grouped into meaningful units: genes • When a gene needs to be activated, the DNA molecule in the cell nucleus uncoils and unfurls just enough to expose that gene • From the exposed section of the DNA, an RNA is formed. • This mRNA, or messenger RNA, carries with it the “print” of the open DNA section (Map process?) • RNA differs from DNA in its bases: RNA does not contain thymine (T); it has uracil (U) instead. RNA is short-lived (like intermediate data in MapReduce) • Once the mRNA is formed, the open sections of the DNA close off.
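The T-to-U difference described above can be illustrated in a few lines. This is a simplified sketch, assuming the input is the coding (non-template) strand of the gene, so the mRNA has the same sequence with U in place of T; the function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same sequence, U instead of T."""
    return coding_strand.replace("T", "U")

# Reusing the example strand from the "Human cell" slide.
print(transcribe("AAAGTTCTTAATTA"))  # AAAGUUCUUAAUUA
```

In the spirit of the slide's MapReduce analogy, the output is intermediate data: it is derived from the fixed DNA input, consumed by the next stage, and then discarded.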

Protein formation • mRNA travels to the cytoplasm, where it meets the ribosome (built from rRNA) • The ribosome reads the code in the mRNA three bases at a time (codons) and joins the corresponding amino acids. • Twenty amino acids are prevalent in human cells. Ex: the codons GCU, GCC and GCA all correspond to alanine • In effect, the ribosome is a process control computer that takes codons as input and produces amino acids as output. • Amino acids polymerize and form polypeptide chains, which make up proteins • Proteins fold and form basic structures such as skin and hair. • Even though the brain controls major human functions, at the cell level it is the DNA that has command and control. • DNA is fixed code for a given human. (WORM characteristics: write once, read many)
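The "ribosome as a computer" view above can be sketched as a lookup over codons. The table below is deliberately partial: it holds only the alanine codons from the slide plus the standard start codon (AUG) and the three stop codons, and the names `CODON_TABLE` and `translate` are illustrative assumptions.

```python
# Partial codon table (assumption: only the entries this example needs).
CODON_TABLE = {
    "AUG": "Met",                                            # start codon
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # alanine
    "UAA": None, "UAG": None, "UGA": None,                   # stop codons
}

def translate(mrna: str) -> list:
    """Read mRNA three bases at a time and return the amino acids produced."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]  # KeyError if codon is not in the partial table
        if amino is None:                   # a stop codon ends the chain
            break
        peptide.append(amino)
    return peptide

print(translate("AUGGCUGCCGCAUAA"))  # ['Met', 'Ala', 'Ala', 'Ala']
```

Three different codons map to the same amino acid here, which shows that the code is redundant: input codons outnumber the twenty output amino acids.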

Life’s processes • DNA is the “program” that controls the functions, operations and structure of a cell and, in turn, our life processes. • Life processes in fact depend on the program in the DNA and on the hundreds of millions of ribosomes. • Life in this context appears as an immense distributed system.

Bioinformatics • Can we study, understand and analyze this immensely complex system? Its structure and programs? • University of Arizona’s Tree of Life project (ToL): http://tolweb.org • Human Genome Project (NIH and DOE): identifying the approximately 30,000 genes in human DNA and determining the sequence of the three billion bases that make up human DNA. • Of the 30,000 genes, we do not know the functions of more than 50%. • 99.9% of the nucleotide sequence is the same for all of us • 0.1% is attributed to individual differences such as race, skin color and disposition to diseases • High-throughput sequencing is generating ultra-scale biological data: how do we analyze this data? • That is a data-intensive problem.
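The figures on this slide give a rough sense of scale. The back-of-the-envelope sketch below uses the slide's numbers (three billion bases, 0.1% variation); the 2-bits-per-base packing is an added assumption, chosen because four letters fit in two bits, and real sequencing files are typically far larger than this lower bound.

```python
GENOME_BASES = 3_000_000_000  # three billion bases, from the slide
VARIABLE_PER_THOUSAND = 1     # the 0.1% attributed to individual differences
BITS_PER_BASE = 2             # assumption: A, C, G, T packed into 2 bits each

# Integer arithmetic avoids floating-point rounding on 0.001.
variable_bases = GENOME_BASES * VARIABLE_PER_THOUSAND // 1000
packed_bytes = GENOME_BASES * BITS_PER_BASE // 8

print(variable_bases)  # 3000000
print(packed_bytes)    # 750000000
```

So the 0.1% that differs between individuals is still about three million positions per person, and one tightly packed genome is on the order of 750 MB before any metadata.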

Existing solutions? • Traditional databases: store, retrieve, analyze and/or predict huge biological data • Software tools for implementing algorithms, and developing applications for in-silico experiments • Visualization tools, user interfaces, web accessibility for search through data • Machine learning and data mining methodologies.

Databases • Taxonomy DB • Genomics • Sequence db • Structure db • Proteomic database (PDB) • Micro-array db • Expression db • Enzyme db • Disease db • Molecular biology db

Tools • Data analysis tools • MySQL • Perl • Prediction tools • Clustering • Modeling tools • Surface prediction, predicting area of interest, protein-protein interaction • Alignment tools

How can we help? • How can we leverage our knowledge of large-scale data management to address bioinformatics problems? DC methods. • Large number of tools and data: how do we standardize the efforts so that they are complementary rather than repetitive? Cloud computing.
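One way to picture how data-intensive methods might carry over is to treat sequence fragments like documents in a word count. The sketch below is an illustration chosen for this purpose, not a method from the slides: it counts overlapping substrings of length k (k-mers) in a map step and merges the partial counts in a reduce step, in plain Python rather than on an actual MapReduce cluster. The reads are made-up sample data.

```python
from collections import Counter
from functools import reduce

def map_kmers(read: str, k: int) -> Counter:
    """Map step: count every overlapping k-mer in one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right

reads = ["AAAGTT", "GTTCTT", "CTTAAT"]       # made-up sample reads
partials = [map_kmers(r, 3) for r in reads]  # each read could be mapped on a different node
totals = reduce(reduce_counts, partials, Counter())

print(totals["GTT"], totals["CTT"])  # 2 2
```

Because each read is mapped independently and the merge is associative, the same structure could in principle be distributed across many machines.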

Text Mining vs Genetic Sequence Mining (Dot plot)
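A dot plot compares two sequences by marking every grid cell where the letter in one sequence equals the letter in the other; diagonal runs of marks indicate shared stretches. The slide only names the technique, so the following is a minimal text-based sketch with a hypothetical function name `dot_plot`, without the windowing or filtering that practical tools usually add.

```python
def dot_plot(seq_a: str, seq_b: str) -> list:
    """Return one string per letter of seq_a: '*' where it equals the letter of seq_b, else '.'."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]

# Comparing a sequence with itself always produces a full main diagonal.
for row in dot_plot("GATTACA", "GATTACA"):
    print(row)
```

The same function works on words or characters of ordinary text, which is the parallel with text mining that the slide title draws.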
