Data Structures and Visualization

Introduction • We’re drowning in information • Genetics are viewed as a commodity • We need to get better data from fewer cows • Do we have the resources we need?

U.S. dairy population

We need to do more with less • 47% of U.S. dairy cows are enrolled in DHIA testing • Class III milk is $17/cwt • Grain prices are very high • Corn averaged $6/bu in May • Soybeans averaged $13/bu in May • Enrollment and cow numbers are unlikely to increase

Major topics • Different sources of data • Data source integration and quality • Data mining models • Visualization examples • Computational resources

Data currently in national database • Identification and registration • Conformation scores • Milk production and composition • Fertility and reproduction • Longevity • Some genotypes

Data not routinely available • Farm and herd management • Geography and climate • Housing systems • Feed intake • Milk composition • Fatty acids, casein variants • Conductivity, lactose, MUN • DNA data • Cow SNP genotypes, DNA sequence data Photo: NOAA

Data “trapped” on the farm • Fertility and reproduction • Insemination information • Use of estrus synchronization • Cow health and longevity • Body condition scores • Birth weights and mature weights • Disease occurrence data

Electronic milk meters • Currently can provide— • Milk yield • Milking speed • Electrical conductivity • May possibly supply— • Progesterone levels • Milk temperature • Fat and protein concentrations Photo: afimilk

Other sources of data • RFID tags have lower ID error rates associated with meter data • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease Top: Allflex; Bottom: afimilk

Current sources of data • AIPL: Animal Improvement Programs Lab., USDA • CDCB: Council on Dairy Cattle Breeding • DHI: Dairy Herd Improvement (milk recording organizations) • NAAB: National Association of Animal Breeders (AI) • PDCA: Purebred Dairy Cattle Association (breed registries) • Universities

Sources of genomic data [Flow diagram linking requesters (e.g., AI, breeds), dairy producers, DNA laboratories, and the Genomic Evaluation Lab; the arrows are labeled nominations, samples, genotypes, and evaluations]

Data source integration • Incoming data from different sources are checked against one another • The AIPL edits system consists of ~64,000 SLOC • Mostly C, some Fortran 90 • Data stored in a relational database

Typical edits • Match birth date with dam’s calving • Compare with other sources (e.g. breed association) • Investigate maternal sibs born within 9 mo (may assume ET) • IDs within 100 with same sire, dam, and birth assumed to be twins
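Two of the edits listed above can be sketched in a few lines of Python. This is an illustration only, not the AIPL edits system (which the slides describe as C and Fortran 90). The `Animal` record, its field names, the reading of "within 100" as an absolute difference in ID number, and the 270-day stand-in for 9 months are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Animal:
    id_number: int   # numeric part of the animal ID (hypothetical field)
    sire: int
    dam: int
    birth: date


def assumed_twins(a, b, max_id_gap=100):
    """Same sire, dam, and birth date, with ID numbers within max_id_gap."""
    return (
        a.id_number != b.id_number
        and a.sire == b.sire
        and a.dam == b.dam
        and a.birth == b.birth
        and abs(a.id_number - b.id_number) <= max_id_gap
    )


def sibs_to_investigate(a, b, max_days=270):
    """Same dam, different birth dates fewer than max_days apart (possible ET)."""
    gap = abs((a.birth - b.birth).days)
    return a.dam == b.dam and 0 < gap < max_days


# Made-up records for illustration.
a = Animal(1000, sire=11, dam=22, birth=date(2011, 3, 1))
b = Animal(1040, sire=11, dam=22, birth=date(2011, 3, 1))
c = Animal(2500, sire=12, dam=22, birth=date(2011, 8, 15))

print(assumed_twins(a, b))        # a and b share sire, dam, and birth date
print(sibs_to_investigate(a, c))  # same dam, births about 5.5 months apart
```

In a real system these checks would run as queries against the relational database rather than pairwise in memory.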

How do we assess data quality? • Consistency • e.g., calving, progeny birth, breeding, dry dates • Parentage verification • Electronic ID • Within-herd heritability

Data mining • The discovery of useful, possibly unexpected patterns in data • Four principal tasks • Association • Clustering • Classification • Regression

Bonferroni’s principle • You will find interesting patterns if you look hard enough • Not all relationships are legitimate • You must have enough data to support the questions you’re asking
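A small worked example shows why looking hard enough always turns something up. The sketch below assumes independent tests at a significance level of 0.05 (an assumption, not a value from the slides) and borrows the 43,382 SNP count quoted later in this deck. It computes how many "hits" pure chance would produce and the per-test threshold a Bonferroni correction would require.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance findings when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that holds the family-wise error rate at alpha."""
    return alpha / n_tests


n_snps = 43_382   # SNP count quoted later in the deck
alpha = 0.05      # assumed significance level

chance_hits = expected_false_positives(n_snps, alpha)
per_test = bonferroni_threshold(n_snps, alpha)

print(f"Expected chance hits: {chance_hits:.0f}")
print(f"Bonferroni per-test threshold: {per_test:.2e}")
```

With no real signal at all, a scan of this size would still be expected to flag a couple of thousand markers at the uncorrected level.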

Association analysis • Discover interesting relationships among variables in large databases • e.g., predicting protein function and identifying SNP-disease associations • Not statistical association analysis! • Lots of algorithms, many based on counting attributes • Watch for false positives • Measures co-occurrence, not causality

Clustering • Place items into distinct groups such that • Items in a group are similar • Items in one group are dissimilar to those in other groups • Hierarchical or partitional approaches

Partitional clustering
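The original figure for this slide is not available. As a stand-in, here is a minimal sketch of one common partitional method, k-means (Lloyd's algorithm). The slides do not say which algorithm was shown, so treating k-means as the example is an assumption, and the points are made up.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of (x, y) tuples; returns (centers, labels)."""
    centers = list(points[:k])  # naive initialization: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centers, labels


# Two made-up, well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Every point ends up in exactly one group, which is what distinguishes partitional from hierarchical approaches.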

Hierarchical clustering • Nested clusters organized into hierarchical trees • Data objects may belong to multiple subsets • Examples • Relationships among species • Evolutionary history of proteins

Partners • Deep SNP Discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard • BFGL Genome Assemblies: Nelore, Water Buffalo • BFGL-Illumina Deep SNP Discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir • Pfizer Light SNP Discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu

Classification • Training set used to develop a rule for assigning individuals to classes • Validation set used to assess the accuracy of the classification rule • Examples • Identify cows with subclinical mastitis • Mate assignment

Classification methods • Bayesian belief networks • Decision trees • Nearest-neighbor classification • Neural networks • Rule-based classification • Support vector machines
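The training/validation idea can be illustrated with the simplest method on the list, nearest-neighbor classification. The sketch below is a toy: a single made-up numeric feature per cow, made-up class labels, and a 1-nearest-neighbor rule. None of the values come from real records.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training record whose feature is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]


def accuracy(train, validation):
    """Fraction of validation records whose predicted label matches the true one."""
    hits = sum(
        nearest_neighbor_predict(train, x) == label for x, label in validation
    )
    return hits / len(validation)


# Made-up (feature, class) records; the feature could be any single score.
train = [(1.0, "healthy"), (2.0, "healthy"), (6.0, "mastitis"), (7.0, "mastitis")]
validation = [(1.5, "healthy"), (6.5, "mastitis"), (3.0, "mastitis")]

print(f"Validation accuracy: {accuracy(train, validation):.2f}")
```

The rule is built from the training set only; the validation records are held back so that the accuracy estimate is not flattered by data the rule has already seen.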

Decision tree classification Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.

Rule-based classification • Classify records using a series of “if…then” rules • Rules come directly from the data, or from other classification models • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow) • Easy to generate and interpret
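The example rule above translates almost word for word into code. In this sketch the thresholds are taken from the slide; the function name, argument names, and the fallback label are hypothetical.

```python
def breeding_decision(pta_nm_dollars, efi):
    """Apply the slide's example rule: high net merit and low expected inbreeding."""
    if pta_nm_dollars >= 800 and efi <= 0.05:
        return "breed to cow"
    return "no rule fired"  # hypothetical fallback; a real rule set would continue


print(breeding_decision(850, 0.04))
print(breeding_decision(850, 0.08))
```

A full classifier would be an ordered list of such rules, with the first matching rule deciding the class.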

Regression models • Prediction of real-valued outputs • Given one or more attributes, we can predict, for example— • Breeding values • Feed intake • Milk and components yields • Very mature analytical tools
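As a minimal sketch of predicting a real-valued output from one attribute, the code below fits a straight line by ordinary least squares. The data are made up and lie exactly on a line so the result is easy to check. Breeding-value prediction in practice uses mixed models, not this simple form.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up attribute/response pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 5.0   # predict the response for a new attribute value
print(slope, intercept, prediction)
```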

Visualization • How do we present lots of numbers in a compact form? • “Graphical methods can retain the information in the data.” (Deming) • Complements numerical techniques • Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)

One image, millions of points 43,382 SNP solutions × 4,064 animals = 176,304,448 data points

Use size to denote importance Markers are proportional in area to SNP effect sizes.
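Scaling by area rather than radius matters: if the radius were proportional to the effect, a marker for an effect twice as large would look four times bigger. The helper below is a sketch with a hypothetical name and made-up effects. It maps absolute effect sizes to marker areas, which is the quantity that plotting functions such as matplotlib's `scatter` expect in the `s` argument.

```python
def marker_areas(effects, max_area=200.0):
    """Scale absolute effect sizes so the largest one gets max_area (area, not radius)."""
    largest = max(abs(e) for e in effects)
    if largest == 0:
        return [0.0 for _ in effects]
    return [max_area * abs(e) / largest for e in effects]


effects = [0.5, -2.0, 1.0]   # made-up SNP effects for illustration
areas = marker_areas(effects)
print(areas)
```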

O-Style Haplotypes (chromosome 15)

Correlations among calving traits

Provide multiple cues Lines are differentiated by color and pattern. Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.

Interstitial figures Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.

Computational capacity is abundant WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg

Supercomputer performance • Cray-1 (1976): 136 megaFLOPS (10^6) • Fujitsu K machine (2011): 8.16 petaFLOPS (10^15) • Commodity hardware has also experienced gains in performance Top: Sherwin Gooch; Bottom: Riken
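The two figures on this slide can be turned into a growth rate with a few lines of arithmetic. The sketch below uses only the slide's numbers and assumes smooth exponential growth between the two machines, which is a simplification.

```python
import math

cray_1_flops = 136e6          # 136 megaFLOPS (1976)
k_machine_flops = 8.16e15     # 8.16 petaFLOPS (2011)

speedup = k_machine_flops / cray_1_flops    # roughly 60 million-fold
doublings = math.log2(speedup)              # number of times performance doubled
months_per_doubling = (2011 - 1976) * 12 / doublings

print(f"Speedup: {speedup:.2e}")
print(f"Doublings: {doublings:.1f}")
print(f"Months per doubling: {months_per_doubling:.1f}")
```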

Storage costs are plummeting Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte

Data storage technologies • Storage costs are now as low as $100/TB • Quality costs! • Solid state disks are promising, but relatively low-capacity • What do you do about backups? Top: Snopes/IBM; Bottom: Tom’s Hardware

Memory is very cheap Lev Lafayette, http://www.organdi.net/article.php3?id_article=82

Random access memory • RAM is still much faster than disk (ns vs. ms access times) • A 64-bit OS can address 16 EiB (about 18.4 EB), in theory • How much can your motherboard hold? Top: Stan Yack; Bottom: Samsung
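The theoretical ceiling for 64-bit addressing is easy to check, assuming byte-addressable memory and a full 64-bit address (real processors and operating systems expose fewer bits).

```python
address_space_bytes = 2 ** 64

exbibytes = address_space_bytes / 2 ** 60   # binary units (EiB)
exabytes = address_space_bytes / 10 ** 18   # decimal units (EB)

print(f"{address_space_bytes:,} bytes")
print(f"{exbibytes:.0f} EiB, or about {exabytes:.1f} EB")
```

Either way, the practical limit is set by the motherboard and the budget long before the address space runs out.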

Software • Complexity is increasing • Parallelism is hard, and debugging is much harder • Productive developers are expensive and difficult to find • Top programmers are many times more productive than average ones

Conclusions • The more data we get, the more data we want • Relationships among traits may become as important as individual traits • Software may be more limiting than hardware

Questions?
