Lecture 3. Statistical Vocabulary & data management. Zihaohan Sang Sept 10, 2019. Week2 !. Basic statistical vocab + data management Exploratory graphics. Statistical vocab. Research topic: drought resistance in Trembling Aspen. Here is the distribution of lodgepole pine.
Lecture 3. Statistical Vocabulary & data management Zihaohan Sang Sept 10, 2019
Week 2! • Basic statistical vocab + data management • Exploratory graphics
Here is the distribution of lodgepole pine. Do these samples (the stars) represent the population?
Take-home message: • Make sure your samples fully represent the population you want to study; • To reduce the uncertainty caused by random chance, the broader the sampling, the better.
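The take-home message can be illustrated with a small simulation. This is only a sketch: the population, the elevation range, the height relationship and all variable names are made up for illustration and are not data from the lecture.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical population: 10,000 trees whose height decreases with elevation
elevation <- runif(10000, min = 500, max = 2000)
height    <- 30 - 0.01 * elevation + rnorm(10000, sd = 2)
pop_mean  <- mean(height)

# Biased sample: 50 trees taken only from low elevations (< 800 m)
low_idx     <- which(elevation < 800)
biased_mean <- mean(height[sample(low_idx, 50)])

# Random sample: 50 trees drawn from the whole population
random_mean <- mean(height[sample(length(height), 50)])

# Effect of random chance: spread of the sample mean for small vs. large samples
sd_n10  <- sd(replicate(1000, mean(sample(height, 10))))
sd_n100 <- sd(replicate(1000, mean(sample(height, 100))))

c(population = pop_mean, biased = biased_mean, random = random_mean)
c(n10 = sd_n10, n100 = sd_n100)
```

The biased sample misses the population mean because it covers only part of the range, while the larger random samples give sample means that vary less from draw to draw.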
Data types in R
• Numeric:
  • Discrete: integer (1, 5, 100)
  • Continuous: integer + decimal digits (1.1, 5.0, 100.3)
• Categorical:
  • Nominal: character or factor (species, locations)
  • Ordinal: ordered factor ('Good', 'Med', 'Poor'), levels: Poor < Med < Good
• Logical: TRUE/FALSE
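A minimal sketch of how these types look in R. The variable names and values are invented for illustration.

```r
# Numeric, discrete: integers (the L suffix stores them as integer)
count <- c(1L, 5L, 100L)

# Numeric, continuous: doubles
diameter <- c(1.1, 5.0, 100.3)

# Categorical, nominal: factor without an order
species <- factor(c("aspen", "pine", "aspen"))

# Categorical, ordinal: ordered factor with explicit levels Poor < Med < Good
condition <- factor(c("Good", "Poor", "Med"),
                    levels = c("Poor", "Med", "Good"),
                    ordered = TRUE)

# Logical
alive <- c(TRUE, FALSE, TRUE)

class(count)        # "integer"
class(diameter)     # "numeric"
class(species)      # "factor"
class(condition)    # "ordered" "factor"
condition < "Good"  # FALSE TRUE TRUE: comparisons respect the level order
```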
Notes: • Use as.factor() or as.numeric() to force a variable into the type you want; • In R versions before 4.0.0, read.csv() automatically reads character columns as factors (levels are ordered alphabetically); since R 4.0.0 they stay character unless stringsAsFactors = TRUE is set; • If a column contains one or more letters, R automatically classifies the whole column as character or factor.
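The notes above can be sketched as follows. The small CSV text and its column names are hypothetical; stringsAsFactors is set explicitly because its default differs between R versions.

```r
# Hypothetical CSV content; one height entry contains letters ("n/a")
csv_text <- "plot,species,height\n1,aspen,12.5\n2,pine,n/a\n3,aspen,14.1"

dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The single "n/a" entry turned the whole height column into text
raw_class <- class(dat$height)   # "character"

# Force the types you want
dat$species <- as.factor(dat$species)
levels(dat$species)              # "aspen" "pine" (alphabetical)

# Non-numeric entries become NA (R warns about this; the warning is silenced here)
dat$height <- suppressWarnings(as.numeric(dat$height))

# Caution: as.numeric() on a factor returns the level codes, not the labels
f      <- factor(c("10", "5", "20"))
codes  <- as.numeric(f)                 # 1 3 2
values <- as.numeric(as.character(f))   # 10 5 20
```

Converting a factor through as.character() first is the usual way to recover the original numbers.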
Golden rules for data tables 1. A row represents a unit • All measurements of a unit should normally be in the same row. • Different units must be in different rows. • It is important to think about what your units are.
Golden rules for data tables 2. If in doubt, add more rows • If possible, use categorical (character) variables to indicate the independent effects (treatments, environments). • Repeated measurements are normally added as rows, with two independent variables "Time" and "Individual". • It is always easy to convert a long table to a wide table (Excel Pivot), but not vice versa.
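As a sketch of the long-to-wide conversion in R rather than an Excel pivot table: the example below uses base R's reshape() on a made-up repeated-measures table. The column names follow the "Time" and "Individual" variables mentioned above; the Height values are invented.

```r
# Long table: one row per individual per measurement time
long <- data.frame(
  Individual = rep(c("A01", "A02", "A03"), each = 2),
  Time       = rep(c("t1", "t2"), times = 3),
  Height     = c(10.2, 12.5, 9.8, 11.9, 11.1, 13.4)
)

# Wide table: one row per individual, one Height column per time
wide <- reshape(long,
                idvar     = "Individual",
                timevar   = "Time",
                direction = "wide")

names(wide)  # "Individual" "Height.t1" "Height.t2"
wide
```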
Golden rules for data tables 3. Use strong IDs
Golden rules for data tables 4. Modify your raw data entries with R scripts • It is easy to change something and re-run the analysis (e.g. with or without outliers). • Hunting down and fixing errors is efficient, because the script leaves a complete trail of what you did. • You save yourself from repetitive tasks (which are likely to introduce errors).
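A minimal sketch of such a script, assuming a made-up table of tree heights. The TreeID and Height columns, the entry of 118.0 and the outlier threshold of 50 are all hypothetical.

```r
# Raw data as entered (118.0 is a made-up data-entry error)
raw <- data.frame(
  TreeID = c("T01", "T02", "T03", "T04", "T05"),
  Height = c(12.1, 11.8, 118.0, 12.6, 13.0)
)

# Work on a copy so the raw entries are never overwritten
clean <- raw

# Step 1: flag suspicious values (the threshold is an arbitrary assumption)
clean$Outlier <- clean$Height > 50

# Step 2: a single switch decides whether outliers are kept
drop_outliers <- TRUE
analysis_data <- if (drop_outliers) clean[!clean$Outlier, ] else clean

# Step 3: the analysis itself
mean_height <- mean(analysis_data$Height)
mean_height  # 12.375 without the outlier
```

Setting drop_outliers to FALSE and re-running the script repeats the same analysis with the outlier included, and every step stays documented in the script.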
Golden Rules - File Management • Keep all files you need for a particular analysis in one folder (.RData-shortcut, data.xls, data.csv, script.r, script.sas, documentation.txt) • Create new folders for new tasks or analyses (numbered and descriptive folder names are useful) • Use many folders but a shallow folder hierarchy (2-4 subdirectories deep) • Zip previous folders (analysis steps) for backup
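A folder layout along these lines could be created from R as sketched below. The project name and folder names are hypothetical, and the example writes into a temporary directory so that nothing on disk is touched permanently.

```r
# Hypothetical project root inside R's temporary directory
project <- file.path(tempdir(), "aspen_project")

# Numbered, descriptive folder names: one folder per task or analysis step
folders <- c("01_data_cleaning", "02_exploratory_graphics", "03_models")
for (f in folders) {
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}

# Keep everything a step needs, including its documentation, in the same folder
doc_file <- file.path(project, "01_data_cleaning", "documentation.txt")
writeLines("Notes for the data-cleaning step.", doc_file)

# List the folders that were created
list.dirs(project, full.names = FALSE, recursive = FALSE)
```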