
Data Structures and Visualization



Presentation Transcript


  1. Data Structures and Visualization

  2. Introduction • We’re drowning in information • Genetics are viewed as a commodity • We need to get better data from fewer cows • Do we have the resources we need?

  3. U.S. dairy population

  4. We need to do more with less • 47% of U.S. dairy cows are enrolled in DHIA testing • The Class III milk price is $17/cwt • Grain prices are very high • Corn averaged $6/bu in May • Soybeans averaged $13/bu in May • Enrollment and cow numbers are unlikely to increase

  5. Major topics • Different sources of data • Data source integration and quality • Data mining models • Visualization examples • Computational resources

  6. Data currently in national database • Identification and registration • Conformation scores • Milk production and composition • Fertility and reproduction • Longevity • Some genotypes

  7. Data not routinely available • Farm and herd management • Geography and climate • Housing systems • Feed intake • Milk composition • Fatty acids, casein variants • Conductivity, lactose, MUN • DNA data • Cow SNP genotypes, DNA sequence data Photo: NOAA

  8. Data “trapped” on the farm • Fertility and reproduction • Insemination information • Use of estrus synchronization • Cow health and longevity • Body condition scores • Birth weights and mature weights • Disease occurrence data

  9. Electronic milk meters • Currently can provide— • Milk yield • Milking speed • Electrical conductivity • May possibly supply— • Progesterone levels • Milk temperature • Fat and protein concentrations Photo: afimilk

  10. Other sources of data • RFID tags have lower ID error rates associated with meter data • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease Top: Allflex; Bottom: afimilk

  11. Current sources of data • PDCA • NAAB • DHI • AIPL • CDCB • Universities Abbreviations: AIPL, Animal Improvement Programs Lab., USDA; CDCB, Council on Dairy Cattle Breeding; DHI, Dairy Herd Improvement (milk recording organizations); NAAB, National Association of Animal Breeders (AI); PDCA, Purebred Dairy Cattle Association (breed registries)

  12. Sources of genomic data Requester (Ex: AI, breeds) nominations samples evaluations Genomic Evaluation Lab Dairy producers samples samples genotypes DNA laboratories

  13. Data source integration • Incoming data from different sources are checked against one another • The AIPL edits system consists of ~64,000 SLOC • Mostly C, some Fortran 90 • Data stored in a relational database

  14. Typical edits • Match birth date with dam’s calving date • Compare with other sources (e.g., breed association) • Investigate maternal sibs born within 9 mo of each other (may assume ET) • Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
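
The last two edits on this slide can be sketched in a few lines. This is only an illustration, not the AIPL edits system (which the previous slide describes as mostly C and Fortran 90): the record layout, the sample animals, and the 274-day stand-in for "9 mo" are all assumptions.

```python
from datetime import date

# Hypothetical record layout: (animal_id, sire_id, dam_id, birth_date).
records = [
    (1001, "S1", "D1", date(2010, 3, 1)),
    (1002, "S1", "D1", date(2010, 3, 1)),   # adjacent ID, same parents and date
    (2001, "S2", "D1", date(2010, 9, 15)),  # same dam, about 6.5 months later
    (3001, "S3", "D2", date(2011, 1, 10)),
]

MAX_ID_GAP = 100         # "IDs within 100" from the slide
MIN_SIB_GAP_DAYS = 274   # roughly 9 months; the exact cutoff is an assumption


def likely_twins(a, b):
    """Close IDs plus identical sire, dam, and birth date."""
    return (abs(a[0] - b[0]) <= MAX_ID_GAP
            and a[1] == b[1] and a[2] == b[2] and a[3] == b[3])


def suspicious_maternal_sibs(a, b):
    """Same dam, different birth dates less than about 9 months apart."""
    gap_days = abs((a[3] - b[3]).days)
    return a[2] == b[2] and 0 < gap_days < MIN_SIB_GAP_DAYS


# Compare every unordered pair of records once.
pairs = [(a, b) for i, a in enumerate(records) for b in records[i + 1:]]
twins = [(a[0], b[0]) for a, b in pairs if likely_twins(a, b)]
to_review = [(a[0], b[0]) for a, b in pairs if suspicious_maternal_sibs(a, b)]

print("assumed twins:", twins)
print("investigate (possible ET):", to_review)
```

An all-pairs comparison is fine for a toy list; a national database would presumably group by dam first so that only sibs are compared.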

  15. How do we assess data quality? • Consistency • e.g., calving, progeny birth, breeding, dry dates • Parentage verification • Electronic ID • Within-herd heritability

  16. Data mining • The discovery of useful, possibly unexpected patterns in data • Four principal tasks • Association • Clustering • Classification • Regression

  17. Bonferroni’s principle • You will find interesting patterns if you look hard enough • Not all relationships are legitimate • You must have enough data to support the questions you’re asking
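
A rough way to see why "you will find interesting patterns if you look hard enough" is to count how many hits pure chance would produce. The sketch below reuses the 43,382 SNP count quoted later in the talk; the 0.05 significance level is an assumption. The Bonferroni correction shown on the last line is a related standard remedy and does not appear on the slide.

```python
n_tests = 43_382   # SNP count quoted later in the talk, reused for illustration
alpha = 0.05       # assumed per-test significance level

# If no SNP has any real effect, this many tests are still expected
# to look "significant" by chance alone.
expected_false_positives = n_tests * alpha

# Bonferroni correction: divide the significance level by the number of tests.
bonferroni_threshold = alpha / n_tests

print(f"expected chance hits: {expected_false_positives:.0f}")
print(f"per-test threshold after correction: {bonferroni_threshold:.2e}")
```

If the number of chance hits is far larger than the number of real effects you expect to exist, most of what you find is likely to be spurious.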

  18. Association analysis • Discover interesting relationships among variables in large databases • e.g., predicting protein function and identifying SNP-disease associations • Not statistical association analysis! • Lots of algorithms, many based on counting attributes • Watch for false positives • Measures co-occurrence, not causality
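
As a minimal sketch of the "counting attributes" idea, the code below counts how often pairs of attributes occur together and derives the usual support and confidence measures. The cow records and attribute names are hypothetical, and no particular algorithm from the talk is implied.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-cow attribute sets.
records = [
    {"high_scc", "late_lactation"},
    {"high_scc", "late_lactation", "lame"},
    {"late_lactation"},
    {"high_scc", "lame"},
    {"high_scc", "late_lactation"},
]

# Count every pair of attributes that occurs together in a record.
pair_counts = Counter(
    frozenset(pair) for r in records for pair in combinations(sorted(r), 2)
)


def support(itemset):
    """Fraction of records that contain every attribute in itemset."""
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(antecedent, consequent):
    """Among records with the antecedent, the share that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(pair_counts.most_common(1))
print(support({"high_scc", "late_lactation"}))
print(confidence({"late_lactation"}, {"high_scc"}))
```

A high confidence only says the attributes tend to appear together in these records; it says nothing about which one, if either, causes the other.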

  19. Clustering • Place items into distinct groups such that • Items in a group are similar • Items in one group are dissimilar to those in other groups • Hierarchical or partitional approaches

  20. Partitional clustering
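
k-means is one common partitional method; the slides do not say which algorithm was used, so the sketch below is only an example of the family. The points and the choice of starting centers are made up.

```python
def kmeans(points, centers, n_iter=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Squared Euclidean distance from p to each current center.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # New center = mean of the assigned points (keep the old one if empty).
        centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical two-trait measurements forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])

print(centers)
print(clusters)
```

Each point ends up in exactly one group, which is what distinguishes partitional from hierarchical clustering.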

  21. Hierarchical clustering • Nested clusters organized into hierarchical trees • Data objects may belong to multiple subsets • Examples • Relationships among species • Evolutionary history of proteins

  22. Partners Deep SNP Discovery N’Dama Sahiwal Simmental Hanwoo Blonde d’Aquitaine Montbeliard BFGL Genome Assemblies Nelore Water Buffalo BFGL-Illumina Deep SNP Discovery Angus Holstein Limousin Jersey Nelore Brahman Romagnola Gir Pfizer Light SNP Discovery Angus Holstein Jersey Hereford Charolais Simmental Brahman Wagyu

  23. Classification • Training set used to develop a rule for assigning individuals to classes • Validation set used to assess the accuracy of the classification rule • Examples • Identify cows with subclinical mastitis • Mate assignment
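
The training/validation split can be illustrated with a one-nearest-neighbor rule, one of the methods listed on the next slide. The two features, their values, and the labels are hypothetical; real mastitis detection would need far more data and careful feature scaling.

```python
# Hypothetical (feature_1, feature_2) measurements with known status.
train = [
    ((2.0, 4.5), "healthy"),
    ((2.5, 4.8), "healthy"),
    ((5.5, 6.5), "mastitis"),
    ((6.0, 7.0), "mastitis"),
]
validation = [
    ((2.2, 4.6), "healthy"),
    ((5.8, 6.8), "mastitis"),
    ((4.5, 6.0), "healthy"),   # borderline cow the rule gets wrong
]


def classify(x):
    """Return the label of the closest training record (1-nearest neighbor)."""
    nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(x, t[0])))
    return nearest[1]


# Accuracy is measured only on records the rule has never seen.
correct = sum(classify(x) == label for x, label in validation)
accuracy = correct / len(validation)
print(f"validation accuracy: {correct}/{len(validation)} = {accuracy:.2f}")
```

Scoring on held-out records rather than on the training set is what keeps the accuracy estimate honest.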

  24. Classification methods • Bayesian belief networks • Decision trees • Nearest-neighbor classification • Neural networks • Rule-based classification • Support vector machines

  25. Decision tree classification Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.

  26. Rule-based classification • Classify records using a series of “if…then” rules • Rules come directly from the data, or from other classification models • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow) • Easy to generate and interpret
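
The example rule on this slide translates almost directly into code. The function name and the sample bulls are hypothetical; the thresholds are the ones on the slide, with EFI expressed as a fraction.

```python
def breed_to_cow(pta_nm_dollars, efi):
    """Slide rule: if PTA NM$ >= $800 and EFI <= 0.05, then breed to cow."""
    return pta_nm_dollars >= 800 and efi <= 0.05


# Hypothetical bulls: (name, PTA NM$, EFI).
bulls = [("Bull A", 850, 0.04), ("Bull B", 900, 0.07), ("Bull C", 750, 0.03)]

selected = [name for name, nm, efi in bulls if breed_to_cow(nm, efi)]
print(selected)
```

Because each rule is a readable condition, it is easy to check why a given animal was or was not selected.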

  27. Regression models • Prediction of real-valued outputs • Given one or more attributes, we can predict, for example— • Breeding values • Feed intake • Milk and components yields • Very mature analytical tools
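
As a minimal sketch of predicting a real-valued output from one attribute, the code below fits a straight line by ordinary least squares. The x and y values are invented and stand in for any predictor and trait; actual breeding-value prediction uses much larger mixed models.

```python
# Hypothetical predictor (x) and real-valued trait (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for y = intercept + slope * x.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(new_x):
    """Predicted trait value for a new predictor value."""
    return intercept + slope * new_x


print(f"y = {intercept:.2f} + {slope:.2f} * x")
print(f"prediction at x = 6: {predict(6.0):.2f}")
```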

  28. Visualization • How do we present lots of numbers in a compact form? • “Graphical methods can retain the information in the data.” (Deming) • Complements numerical techniques • Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)

  29. One image, millions of points 43,382 SNP solutions × 4,064 animals = 176,304,448 data points
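
The count on this slide is easy to verify, and it also gives a feel for the memory involved. The 8 bytes per value below assumes double-precision storage, which the slides do not state.

```python
n_snps = 43_382
n_animals = 4_064
n_points = n_snps * n_animals

BYTES_PER_VALUE = 8   # assumed: one 64-bit float per SNP solution
n_bytes = n_points * BYTES_PER_VALUE

print(f"{n_points:,} data points")
print(f"{n_bytes / 2**30:.2f} GiB if held in memory as doubles")
```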

  30. Use size to denote importance Markers are proportional in area to SNP effect sizes.
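
Scaling area, rather than radius, to the effect size keeps large effects from looking disproportionately big. The sketch below shows the arithmetic with made-up effect sizes and an arbitrary scale factor; plotting libraries such as matplotlib take marker area directly (the `s` argument of `scatter` is in points squared), so the radius is computed here only to show the relationship.

```python
import math

effects = [0.1, 0.4, 0.9]   # hypothetical absolute SNP effects
AREA_PER_UNIT = 100.0       # arbitrary scale factor (an assumption)

# Area grows linearly with the effect, so radius grows with its square root.
areas = [AREA_PER_UNIT * abs(e) for e in effects]
radii = [math.sqrt(a / math.pi) for a in areas]

for e, a, r in zip(effects, areas, radii):
    print(f"effect {e:.1f}: area {a:.0f}, radius {r:.2f}")
```

A fourfold larger effect gets four times the area but only twice the radius.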

  31. O-Style Haplotypes (chromosome 15)

  32. Correlations among calving traits

  33. Provide multiple cues Lines are differentiated by color and pattern. Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.

  34. Interstitial figures Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.

  35. Computational capacity is abundant WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg

  36. Supercomputer performance • Cray-1 (1976): 136 megaFLOPS (10^6) • Fujitsu K machine (2011): 8.16 petaFLOPS (10^15) • Commodity hardware also has experienced gains in performance Top: Sherwin Gooch; Bottom: Riken
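
The two figures on this slide imply a speedup and an average doubling time, worked out below. The calculation assumes smooth exponential growth between the two machines, which is a simplification.

```python
import math

cray1_flops = 136e6      # Cray-1, 1976
k_flops = 8.16e15        # Fujitsu K, 2011
years = 2011 - 1976

speedup = k_flops / cray1_flops
doublings = math.log2(speedup)
doubling_time_years = years / doublings

print(f"speedup: {speedup:.1e}x over {years} years")
print(f"average doubling time: {doubling_time_years:.2f} years")
```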

  37. Storage costs are plummeting Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte

  38. Data storage technologies • Storage costs are now as low as $100/TB • Quality costs! • Solid state disks are promising, but relatively low-capacity • What do you do about backups? Top: Snopes/IBM; Bottom: Tom’s Hardware

  39. Memory is very cheap Lev Lafayette, http://www.organdi.net/article.php3?id_article=82

  40. Random access memory • RAM is still much faster than disk (ns vs. ms access times) • A 64-bit address space covers 2^64 bytes (16 EiB, about 18.4 EB), in theory • How much can your motherboard hold? Top: Stan Yack; Bottom: Samsung
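
The theoretical limit of a flat 64-bit address space is a one-line calculation, shown here in both binary and decimal units. Real operating systems and processors expose far fewer address bits, so this is an upper bound only.

```python
address_bits = 64
n_bytes = 2 ** address_bits

exbibytes = n_bytes // 2 ** 60   # binary units (EiB)
exabytes = n_bytes / 10 ** 18    # decimal units (EB)

print(f"{n_bytes:,} bytes")
print(f"{exbibytes} EiB, or about {exabytes:.1f} EB")
```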

  41. Software • Complexity is increasing • Parallelism is hard and debugging is much harder • Productive developers are expensive and difficult to find • Top programmers are many times more productive than average ones

  42. Conclusions • The more data we get, the more data we want • Relationships among traits may become as important as individual traits • Software may be more limiting than hardware

  43. Questions?
