Data Structures and Visualization
Introduction • We’re drowning in information • Genetics are viewed as a commodity • We need to get better data from fewer cows • Do we have the resources we need?
We need to do more with less • 47% of U.S. dairy cows are enrolled in DHIA testing • The Class III milk price is $17/cwt • Grain prices are very high • Corn averaged $6/bu in May • Soybeans averaged $13/bu in May • Enrollment and cow numbers are unlikely to increase
Major topics • Different sources of data • Data source integration and quality • Data mining models • Visualization examples • Computational resources
Data currently in national database • Identification and registration • Conformation scores • Milk production and composition • Fertility and reproduction • Longevity • Some genotypes
Data not routinely available • Farm and herd management • Geography and climate • Housing systems • Feed intake • Milk composition • Fatty acids, casein variants • Conductivity, lactose, MUN • DNA data • Cow SNP genotypes, DNA sequence data Photo: NOAA
Data “trapped” on the farm • Fertility and reproduction • Insemination information • Use of estrus synchronization • Cow health and longevity • Body condition scores • Birth weights and mature weights • Disease occurrence data
Electronic milk meters • Currently can provide— • Milk yield • Milking speed • Electrical conductivity • May possibly supply— • Progesterone levels • Milk temperature • Fat and protein concentrations Photo: afimilk
Other sources of data • RFID tags have lower ID error rates associated with meter data • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease Top: Allflex; Bottom: afimilk
Current sources of data • AIPL: Animal Improvement Programs Lab., USDA • CDCB: Council on Dairy Cattle Breeding • DHI: Dairy Herd Improvement (milk recording organizations) • NAAB: National Association of Animal Breeders (AI) • PDCA: Purebred Dairy Cattle Association (breed registries) • Universities
Sources of genomic data • [Flow diagram linking requesters (e.g., AI, breeds), dairy producers, DNA laboratories, and the Genomic Evaluation Lab; arrows are labeled nominations, samples, genotypes, and evaluations]
Data source integration • Incoming data from different sources are checked against one another • The AIPL edits system consists of ~64,000 SLOC • Mostly C, some Fortran 90 • Data stored in a relational database
Typical edits • Match birth date with dam’s calving date • Compare with other sources (e.g., breed association) • Investigate maternal sibs born within 9 mo of each other (may assume ET) • Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
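The last two edits above can be sketched as a pairwise check on calf records. This is an illustration only, not the AIPL edits system: the record layout, the 274-day stand-in for "9 mo", and the function name `classify_pair` are all assumptions made for the example.

```python
from datetime import date
from itertools import combinations

# Hypothetical record layout: (numeric_id, sire, dam, birth_date).
calves = [
    (1001, "S1", "D1", date(2011, 3, 1)),
    (1002, "S1", "D1", date(2011, 3, 1)),   # same sire, dam, birth date; ID differs by 1
    (2500, "S2", "D1", date(2011, 8, 15)),  # same dam, born 167 days later
    (3000, "S3", "D2", date(2011, 4, 10)),  # unrelated dam
]

def classify_pair(a, b, id_window=100, sib_window_days=274):
    """Return 'twins', 'possible ET', or None for a pair of calf records.

    sib_window_days=274 is an assumed approximation of nine months.
    """
    id_a, sire_a, dam_a, born_a = a
    id_b, sire_b, dam_b, born_b = b
    if dam_a != dam_b:
        return None
    # Twin rule: close IDs, same sire, same dam, same birth date.
    if sire_a == sire_b and born_a == born_b and abs(id_a - id_b) <= id_window:
        return "twins"
    # Maternal sibs born too close together to be separate pregnancies.
    if abs((born_a - born_b).days) < sib_window_days:
        return "possible ET"
    return None

flags = {}
for a, b in combinations(calves, 2):
    result = classify_pair(a, b)
    if result is not None:
        flags[(a[0], b[0])] = result

for pair, result in flags.items():
    print(pair, result)
```

A production edit would also have to reconcile conflicting reports from different sources, which this sketch ignores.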
How do we assess data quality? • Consistency • e.g., calving, progeny birth, breeding, dry dates • Parentage verification • Electronic ID • Within-herd heritability
Data mining • The discovery of useful, possibly unexpected patterns in data • Four principal tasks • Association • Clustering • Classification • Regression
Bonferroni’s principle • You will find interesting patterns if you look hard enough • Not all relationships are legitimate • You must have enough data to support the questions you’re asking
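One way to make this concrete is to count how many "interesting" results pure chance would produce. The sketch below assumes one independent test per marker at a 5% significance level, and borrows the 43,382-marker figure quoted later in this deck as the number of tests. It uses the familiar Bonferroni correction as a simple stand-in for the principle; the variable names are illustrative.

```python
# Assumed setup: one independent significance test per SNP marker.
n_tests = 43_382   # marker count quoted later in the deck
alpha = 0.05       # assumed per-test significance level

# With no real effects at all, this many tests are still expected to look "significant".
expected_false_positives = n_tests * alpha

# Bonferroni correction: divide alpha by the number of tests so that the chance
# of even one false positive across all tests stays at or below alpha.
bonferroni_threshold = alpha / n_tests

print(f"Expected false positives at alpha={alpha}: {expected_false_positives:.1f}")
print(f"Bonferroni-corrected per-test threshold: {bonferroni_threshold:.3g}")
```

The correction is conservative when tests are correlated, as neighboring markers usually are, so it is best read as a worst-case bound.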
Association analysis • Discover interesting relationships among variables in large databases • e.g., predicting protein function and identifying SNP-disease associations • Not statistical association analysis! • Lots of algorithms, many based on counting attributes • Watch for false positives • Measures co-occurrence, not causality
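A counting-based approach can be sketched with two common measures, support and confidence. The event records below are invented for illustration, and the helper names `support` and `confidence` are assumptions rather than any particular library's API.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: the set of events noted for each cow in one lactation.
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lameness"},
    {"high_scc"},
    {"lameness"},
    {"mastitis", "high_scc"},
]

# Count how often each item and each (alphabetically sorted) pair of items occurs.
item_counts = Counter(item for r in records for item in r)
pair_counts = Counter(
    pair for r in records for pair in combinations(sorted(r), 2)
)

def support(pair):
    """Fraction of all records that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(records)

def confidence(antecedent, consequent):
    """Among records containing the antecedent, the fraction that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

print(support(("mastitis", "high_scc")))
print(confidence("mastitis", "high_scc"))
print(confidence("high_scc", "mastitis"))
```

The two confidence values differ, which is a reminder that these measures describe co-occurrence in one direction at a time and say nothing about cause.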
Clustering • Place items into distinct groups such that • Items in a group are similar • Items in one group are dissimilar to those in other groups • Hierarchical or partitional approaches
Hierarchical clustering • Nested clusters organized into hierarchical trees • Data objects may belong to multiple subsets • Examples • Relationships among species • Evolutionary history of proteins
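The nesting can be seen in a minimal agglomerative sketch. It uses single linkage on one-dimensional values, which is only one of several possible linkage rules, and the input numbers are made up.

```python
def single_linkage(points):
    """Agglomerative clustering of numbers.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of the original points it contains.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional measurements.
values = [10, 15, 50, 52, 90]
merges = single_linkage(values)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Each later merge contains earlier clusters whole, so a point such as 50 belongs to several nested subsets at once. Cutting the history at a chosen distance gives a flat grouping.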
Partners Deep SNP Discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard • BFGL Genome Assemblies: Nelore, Water Buffalo • BFGL-Illumina Deep SNP Discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir • Pfizer Light SNP Discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
Classification • Training set used to develop a rule for assigning individuals to classes • Validation set used to assess the accuracy of the classification rule • Examples • Identify cows with subclinical mastitis • Mate assignment
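The training/validation split can be illustrated with a one-nearest-neighbor classifier. The somatic cell counts and labels below are invented, and a single feature is used only to keep the sketch short.

```python
# Hypothetical data: (somatic cell count in 1,000 cells/mL, has subclinical mastitis).
training = [(50, False), (80, False), (120, False),
            (300, True), (450, True), (600, True)]
validation = [(70, False), (150, False), (250, True), (500, True), (190, True)]

def predict(scc, train):
    """1-nearest-neighbor rule: copy the label of the closest training record."""
    nearest = min(train, key=lambda rec: abs(rec[0] - scc))
    return nearest[1]

# Accuracy is measured only on records the rule has never seen.
correct = sum(predict(scc, training) == label for scc, label in validation)
accuracy = correct / len(validation)
print(f"Validation accuracy: {accuracy:.2f}")
```

The one validation cow that is misclassified (a count of 190) sits between the two groups in the training set, which is the kind of borderline case a validation set is meant to expose.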
Classification methods • Bayesian belief networks • Decision trees • Nearest-neighbor classification • Neural networks • Rule-based classification • Support vector machines
Decision tree classification Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
Rule-based classification • Classify records using a series of “if…then” rules • Rules come directly from the data, or from other classification models • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow) • Easy to generate and interpret
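The example rule translates almost directly into code. The bull names and values below are hypothetical; only the two thresholds come from the rule itself.

```python
def breed_to_cow(pta_nm_dollars, efi):
    """Apply the slide's rule: PTA NM$ >= 800 and EFI <= 0.05."""
    return pta_nm_dollars >= 800 and efi <= 0.05

# Hypothetical candidates: (name, PTA NM$, EFI).
bulls = [
    ("Bull A", 850, 0.04),   # passes both conditions
    ("Bull B", 900, 0.07),   # EFI too high
    ("Bull C", 650, 0.03),   # NM$ too low
]

selected = [name for name, nm, efi in bulls if breed_to_cow(nm, efi)]
print(selected)
```

Because each rule is a plain boolean expression, a rejected candidate can always be traced to the condition that failed.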
Regression models • Prediction of real-valued outputs • Given one or more attributes, we can predict, for example— • Breeding values • Feed intake • Milk and components yields • Very mature analytical tools
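As a minimal illustration, the sketch below fits a one-variable least-squares line and uses it to predict feed intake from milk yield. The four data points are invented, and real prediction models for these traits use many more attributes.

```python
# Hypothetical observations: milk yield (kg/d) and dry matter intake (kg/d).
milk_yield = [20, 30, 40, 50]
feed_intake = [15, 19, 21, 25]

n = len(milk_yield)
mean_x = sum(milk_yield) / n
mean_y = sum(feed_intake) / n

# Ordinary least squares for a single predictor.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(milk_yield, feed_intake))
s_xx = sum((x - mean_x) ** 2 for x in milk_yield)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict_intake(yield_kg):
    """Predicted feed intake for a given milk yield under the fitted line."""
    return intercept + slope * yield_kg

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"Predicted intake at 45 kg/d of milk: {predict_intake(45):.1f} kg/d")
```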
Visualization • How do we present lots of numbers in a compact form? • “Graphical methods can retain the information in the data.” ― Deming • Complements numerical techniques • Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
One image, millions of points 43,382 SNP solutions × 4,064 animals = 176,304,448 data points
Use size to denote importance Markers are proportional in area to SNP effect sizes.
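Scaling markers by area rather than radius means the radius grows with the square root of the effect. The sketch below shows that conversion; the effect sizes and the `area_per_unit` scale factor are arbitrary choices for illustration.

```python
import math

def marker_radius(effect, area_per_unit=100.0):
    """Radius of a circle whose area is area_per_unit * |effect|."""
    area = area_per_unit * abs(effect)
    return math.sqrt(area / math.pi)

# Hypothetical SNP effect sizes.
effects = [0.5, 1.0, 2.0, 4.0]
radii = [marker_radius(e) for e in effects]

for effect, radius in zip(effects, radii):
    print(f"effect={effect:>4}: radius={radius:.2f}")
```

An effect four times larger gets a marker with twice the radius and four times the area. Scaling the radius directly would instead make it look sixteen times as important.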
Provide multiple cues Lines are differentiated by color and pattern. Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Interstitial figures Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
Computational capacity is abundant WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
Supercomputer performance • Cray-1 (1976): 136 megaFLOPS (10^6) • Fujitsu K machine (2011): 8.16 petaFLOPS (10^15) • Commodity hardware also has experienced gains in performance Top: Sherwin Gooch; Bottom: Riken
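For a rough sense of scale, the two figures above can be turned into a speedup factor and an implied doubling time. The doubling-time calculation assumes smooth exponential growth between 1976 and 2011, which is a simplification.

```python
import math

cray1_flops = 136e6      # Cray-1, 1976
k_flops = 8.16e15        # Fujitsu K machine, 2011
years = 2011 - 1976

speedup = k_flops / cray1_flops
doublings = math.log2(speedup)
years_per_doubling = years / doublings

print(f"Speedup: {speedup:.3g}x")
print(f"Doublings: {doublings:.1f}, about one every {years_per_doubling:.2f} years")
```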
Storage costs are plummeting Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
Data storage technologies • Storage costs are now as low as $100/TB • Quality costs! • Solid state disks are promising, but relatively low-capacity • What do you do about backups? Top: Snopes/IBM; Bottom: Tom’s Hardware
Memory is very cheap Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
Random access memory • RAM is still much faster than disk (ns vs. ms access times) • A 64-bit OS can address 2^64 bytes (16 EiB, about 18.4 EB), in theory • How much can your motherboard hold? Top: Stan Yack; Bottom: Samsung
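The theoretical ceiling for 64-bit addressing follows from counting addresses: a 64-bit pointer can name 2^64 distinct bytes. The arithmetic below expresses that limit in both binary (EiB) and decimal (EB) units. It ignores the narrower address widths that real processors and operating systems implement.

```python
address_bits = 64
addressable_bytes = 2 ** address_bits

exbibytes = addressable_bytes / 2 ** 60   # 1 EiB = 2^60 bytes
exabytes = addressable_bytes / 10 ** 18   # 1 EB  = 10^18 bytes

print(f"{addressable_bytes:,} bytes")
print(f"{exbibytes:.0f} EiB, or about {exabytes:.1f} EB")
```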
Software • Complexity is increasing • Parallelism is hard and debugging is much harder • Productive developers are expensive and difficult to find • Top programmers are many times more productive than average programmers
Conclusions • The more data we get, the more data we want • Relationships among traits may become as important as individual traits • Software may be more limiting than hardware