
Everybody Learns: Faculty and Student Perspectives on Interdisciplinary Undergraduate Research

Center for Interdisciplinary Research • College Statistics Course -> Statistics Concentration • Statistics Concentration -> Graduate School • Create a Statistics Community • Promote connections between researchers on campus

Linguistics & Statistics I • Approached by linguistics faculty member Dr. Rika Ito for assistance with linguistics analysis software • Haley, a statistics student, was also pursuing a linguistics concentration

Learning about Linguistics • Initial meeting of faculty and students • Description of the field of linguistics • Description of the particular problem of interest • Readings in linguistics • Seminal reading concerning the overall field • Readings addressing the problem of interest • Linguistics Workshop • Weekly meetings of faculty & students

Learning about Statistics • Done on an as-needed basis • Faculty model interdisciplinary communication • Stating and re-stating the problem • Hypothetical results • Students eventually do the communicating • Weekly meetings

Example • Linguistic software outputs • Statisticians connected this to logistic regression • This connection allowed for increased flexibility in modeling linguistics data

Statistical Analysis of Phonetic Vowel Shifts
Haley Hedlin and Stacey Wood; Faculty Advisors Rika Ito and Julie Legler

Introduction
Linguists analyze speech characteristics by measuring the frequencies associated with the production of sound. These frequencies are decomposed into components called formants. In this study, we focus on the first and second formants, F1 and F2, which relate to the vertical and horizontal position of the tongue, respectively. Linguists plot (F1, F2) on a coordinate system that represents the individual's vowel space. Vowel spaces differ across geographic regions and over time. Of particular interest is the movement of one vowel, /æ/ as in “cat,” relative to a more stable vowel, /ɛ/ as in “bed.” To ascertain the existence of a shift, linguists typically use a two-sample t-test. We explore a more flexible approach, a random effects linear regression model, and compare the two methods.

Research Questions
• Is there evidence of NCVS in rural Northern Michigan?
• If so, which subpopulations exhibit this shift?

Background
• A new urban speech pattern, the Northern Cities Vowel Shift (NCVS), has received attention for spreading into rural areas [1]
• NCVS involves a chain shift of vowels within the vowel space (see Figure 1)
• The shift begins when the vowel /æ/ moves from below and behind the vowel /ɛ/ to above and in front of it
• /ɛ/ is assumed to be a more stable vowel because it is later in the chain shift
• For these reasons, we focus on the position of /æ/ relative to /ɛ/ to assess the existence of NCVS

Linguistic Methods
• 36 subjects from rural Northern Michigan pronounced 102 words, which focused on different vowel sounds [2]
• For each word, F1 and F2 frequencies were recorded for all individuals
• Frequencies were plotted to determine each individual’s vowel space (see Figure 2)
• Linguistics methods were used to normalize the vowel spaces of all subjects [3]
  • Calculate G (geometric mean) and S (individual speaker mean) from the measured frequencies F, where p = number of speakers, m = number of formants, and n = number of words for a particular speaker; subscripts i, j, and k refer to particular speakers, formants, and words, respectively
  • Calculate a uniform scaling factor, K, unique to each individual speaker

Statistical Methods
• The relative positions of /æ/ and /ɛ/ were compared to determine the existence of a shift
• Classic approach: t-tests on normalized data
  • Tests each speaker separately
• Proposed approach: random effects regression model predicting frequency using raw data
  • Each subject receives a random intercept to account for differences in individual vowel spaces
  • Model accounts for elements of design by including formant, vowel, and word variables
  • Model accounts for social factors such as gender, age, and economic class
  • Model includes terms representing the important interactions between factors

Results
• T-test results
  • No significant difference between the /æ/ and /ɛ/ vowels within individuals
  • Unable to make inferences across people
  • Does not control for other factors affecting variability
  • Lack of power resulting from repeated testing and limited data within subjects
• Random effects model results
  • Significant difference in individuals’ vowel space centers
  • Significant difference in frequency between genders, but not between ages or economic classes, after accounting for elements of study design and the uniqueness of individual vowel spaces (see Figure 3)
  • Significant difference in frequency between the /æ/ and /ɛ/ vowels after controlling for differences among social variables, formant, word, and the uniqueness of individual vowel spaces
  • Vowel shift more advanced among females than males

Conclusion
• Our research suggests that random effects models provide a more powerful and flexible option for linguists than t-tests

Future Directions
• Expand the model to include other vowels
• Explore the effect of different consonants surrounding the vowel
• Create confidence regions around point estimates with bootstrapping

Figures
Figure 1: Diagram of the Northern Cities Vowel Shift [1]. Each vowel gradually moves to the position indicated by the arrows in the specified order. /æ/ is the first to move. /ɛ/ is considered more stable because it is fourth to move.
Figure 2: Sample vowel space of a working class male. Note that the x and y axes are inverted to better reflect the relationship between formants and tongue position. A point at the top of the graph reflects a high tongue position. A point on the left of the graph reflects a forward tongue position.
Figure 3: Point estimates for mean vowel frequencies by subgroups of gender, age, and economic class. The plot for gender shows much larger differences than the plots for age and economic class.
Legend (vowel colors): Red /ɛ/ as in “pen” • Black /æ/ “apple” • Blue /ʌ/ “bun” • Gold /a/ “Bob” • Pink /I/ “tin” • Purple /U/ “good” • Green /i/ “sleep” • Gray /e/ “make” • Navy blue /aw/ “loud” • Orange /ai/ “bite” • Light blue /ɔ/ “caught” • Light green /oi/ “toy” • Magenta /o/ “hope” • Maroon /u/ “boot”
Figure 3 Legend: /ɛ/ vowel • /æ/ vowel • female • male • young • old • middle class • working class

References
1. Labov, William. (2006) http://www.ling.upenn.edu/phono_atlas/ICSLP4.html#Heading4
2. Ito, Rika. (1999) “Diffusion of Urban Sound Change in Rural Michigan: A Case of the Northern Cities Shift.” East Lansing, MI: Michigan State University dissertation.
3. Labov, William. (2003) Plotnik 7.0 Documentation.

Acknowledgements
Special thanks to Rika Ito for inviting us to join this research project and Julie Legler for all her statistical advice and guidance along the way. We would also like to thank the Center for Interdisciplinary Research and the National Science Foundation (Grant DMS-0354308) for providing us with the funding and the facilities to conduct our research.

Contact Information
Haley Hedlin: hedlin@stolaf.edu
Stacey Wood: wood@stolaf.edu
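The contrast this poster draws between testing each speaker separately and pooling speakers in one model can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' analysis: the data are simulated, the numbers (a 40 Hz shift, 5 tokens per vowel, the speaker and token spreads) are invented, and all function names are hypothetical. It also swaps in a simpler technique than the poster's full model: it centres on each speaker by taking within-speaker mean differences. In a balanced design like this one, that estimate should coincide with the vowel coefficient of a random-intercept model that has no other predictors.

```python
import math
import random
import statistics


def two_sample_t(a, b):
    """Classic pooled-variance two-sample t statistic for samples a and b."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


def simulate(n_speakers=36, n_tokens=5, shift=40.0, seed=1):
    """Simulate hypothetical frequencies (Hz) for two vowels per speaker.

    Each speaker gets an own vowel-space centre (the random intercept);
    the first vowel is displaced from the second by `shift`.
    """
    rng = random.Random(seed)
    speakers = []
    for _ in range(n_speakers):
        centre = rng.gauss(600.0, 80.0)
        ae = [centre + shift + rng.gauss(0.0, 60.0) for _ in range(n_tokens)]
        eh = [centre + rng.gauss(0.0, 60.0) for _ in range(n_tokens)]
        speakers.append((ae, eh))
    return speakers


def count_significant(speakers, critical=2.306):
    """Count speakers whose own t-test exceeds the critical value.

    2.306 is the two-sided 5% critical value of t with 8 degrees of
    freedom, which matches the default of 5 tokens per vowel.
    """
    return sum(abs(two_sample_t(ae, eh)) > critical for ae, eh in speakers)


def pooled_vowel_effect(speakers):
    """Pool speakers: average the within-speaker mean differences.

    Differencing inside each speaker removes that speaker's centre, so
    only token-level noise remains. Returns (estimate, t statistic).
    """
    diffs = [statistics.mean(ae) - statistics.mean(eh) for ae, eh in speakers]
    estimate = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return estimate, estimate / std_error


if __name__ == "__main__":
    speakers = simulate()
    estimate, t_stat = pooled_vowel_effect(speakers)
    print("speakers significant on their own:", count_significant(speakers))
    print("pooled estimate (Hz):", round(estimate, 1), "t:", round(t_stat, 2))
```

With so few tokens per speaker, each individual test has little power, while the pooled estimate draws on all speakers at once. A full analysis along the poster's lines would add formant, word, and social covariates, which this sketch leaves out.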

Linguistics & Statistics II • Linguist Dr. Maggie Broner received reviews asking her to use logistic regression to re-analyze the data in her manuscript • Previous year’s student educated the incoming student about the study of linguistics

Advanced Methods • Data structure required advanced methods in statistics - methods new to both the students and the faculty • Method suggested by one of the students from Summer Internship at NIH

Broadening the Use of Statistical Analysis in Second Language Research
Kirsten Eilertson, Haley Hedlin, Mark Holland, Maggie Broner and Julie Legler
St. Olaf College

Introduction
• Research Problem
  • Arose from an applied problem in second language research in children [1]
  • Examines when children in an immersion school use their native language or their second language, Spanish
• Data
  • Environmental and linguistic explanatory variables
  • Response variable is language of utterance

Purpose
• Consider the environmental factors influencing the use of Spanish
• Adjust for various phenomena in the data, such as interactions and complete separation of variables

Methods
• Linguistic Methods
  • Data collected from three fifth grade students in a full K-5 Spanish immersion school
  • Two were selected at random; the third was studied for his unusual propensity to speak Spanish
  • The children were provided with wireless lapel microphones for the taping sessions
  • Transcribed from thirteen 25 to 80 minute classroom sessions taped and annotated by an observer
  • Dependent variable coded as English, Spanish, Mix-English base or Mix-Spanish base
• Statistical Methods
  • Analysis was done using Stata 8.2 and R
  • Used a package within R that maximizes a penalized likelihood to adjust for monotone likelihood in one or more of the predictors [3]

Firth penalized likelihood method
• One of the assumptions behind logistic regression is that there are no empty or nearly empty cells (see Figure 1; note that Marvin speaks only Spanish to adults in our data set)
• With the instances of Spanish and English usage so unevenly distributed across interlocutors, it is difficult to find unique coefficients
• In our case the likelihood surface increases continually (a monotone likelihood), which implies that there is no maximum and thus estimation fails to converge (see Figure 2)
• The Firth penalized likelihood method uses the equation [2]:
  $$\log L^{*}(\beta) = \log L(\beta) + \tfrac{1}{2}\log\lvert I(\beta)\rvert$$
• This method adjusts the likelihood surface using the second derivative of the log likelihood with respect to the coefficients, $I(\beta)$, resulting in a new likelihood surface where the maximum exists and represents the best estimate for the coefficients (see Figure 3)

Results
• Using penalized log likelihood, we were able to build models that incorporated our data involving the teacher and other adult interlocutors
• New models confirmed the hypothesis that factors such as the interlocutor, type of task, and whether the student was on or off task all have varying effects on the students’ usage of Spanish and English
• Using Stata 8.2, we were able to explore potential interactions of the predictors used to model a student’s probability of using Spanish. This proved to be important for Leonard in particular
• A statistical summary like Leonard’s (Figure 4) was generated for each student, illustrating the probability of that student speaking Spanish in the situations described by the predictors interlocutor, on/off task, and content

Figures
Figure 1: The frequency of English and Spanish utterances by interlocutor for Marvin
Figure 2: Unadjusted maximum likelihood surface
Figure 3: Adjusted maximum likelihood surface
Figure 4: Histogram of combination of situations and predicted probabilities with confidence intervals for Leonard’s predicted probability of speaking Spanish in the 8 different situations

Combo  Interlocutor  On/off    Content
1      Peer          Off task  Non-language related
2      Peer          Off task  Language related
3      Marvin        Off task  Non-language related
4      Peer          On task   Non-language related
5      Marvin        Off task  Language related
6      Peer          On task   Language related
7      Marvin        On task   Non-language related
8      Marvin        On task   Language related

References
1. Broner, M.A. (In preparation) “A variationist view of first and second language use in full immersion contexts.”
2. Heinze, G. and Schemper, M. (2002) “A solution to the problem of separation in logistic regression.”
3. Ploner, M.; Dunkler, D.; Southworth, H.; and Heinze, G. (2005). logistf: Firth's bias reduced logistic regression. R package version 1.03. http://www.meduniwien.ac.at/msi/biometrie/programme/fl/index.html

Acknowledgements
Special thanks to Maggie Broner for inviting us to join this research project and Julie Legler for all her statistical advice and guidance along the way. We would also like to thank the Center for Interdisciplinary Research for providing us with the funding and the facilities to conduct our research.
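To make the separation problem described in this poster concrete, the following sketch fits a logistic regression with one binary predictor to invented counts in which one group (standing in for adult interlocutors) produces only Spanish utterances. It is an illustration under stated assumptions, not the poster's analysis, which used the logistf package in R: the counts are made up, the function name `fit_logistic` is hypothetical, and the model has a single predictor. The penalized fit solves Firth's modified score equations, in which each observation's residual y − π is replaced by y − π + h(½ − π), where h is that observation's hat-matrix diagonal.

```python
import math


def fit_logistic(groups, firth, max_iter=50, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Fisher scoring on grouped data.

    groups: list of (x, n, s) with x the predictor value, n the number
    of utterances and s the number of Spanish utterances.
    firth=True adds Firth's correction to the score; firth=False is
    ordinary maximum likelihood.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Fisher information I(b) = sum of n * w * [1, x][1, x]^T.
        i00 = i01 = i11 = 0.0
        rows = []
        for x, n, s in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            i00 += n * w
            i01 += n * w * x
            i11 += n * w * x * x
            rows.append((x, n, s, p, w))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse of I(b)

        # Score vector, with the Firth term n * h * (1/2 - p) if requested.
        u0 = u1 = 0.0
        for x, n, s, p, w in rows:
            h = w * (v00 + 2.0 * v01 * x + v11 * x * x) if firth else 0.0
            r = s - n * p + n * h * (0.5 - p)
            u0 += r
            u1 += r * x

        # Scoring step: b <- b + I(b)^{-1} U(b).
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    # Hypothetical counts: peers (x=0) 20 Spanish of 50 utterances,
    # adults (x=1) 10 Spanish of 10, so the adult cell for English is empty.
    counts = [(0, 50, 20), (1, 10, 10)]
    # Ordinary ML: the slope keeps growing, so stop after a few steps.
    print("ML slope after 15 steps:", fit_logistic(counts, False, max_iter=15)[1])
    print("Firth estimates:", fit_logistic(counts, True))
```

For a single binary predictor, the Firth estimate is the same as adding 0.5 to every cell of the 2×2 table before computing log odds, which gives a convenient hand check. The unpenalized fit is deliberately capped at a few iterations because its slope grows without bound on separated data.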

Linguistics & Statistics III • Back to Rika • Objective: Characterize the differences between vowel spaces • Communication challenge • Our understanding of the field • Lack of readily available statistical tools • Discipline-specific tradition

Comprehensive Program • Connecting with High Schools • Continue study of statistics as undergrad • Graduate School in Statistics or related field • Post-doc • Return to teach in undergraduate institution

Center for Interdisciplinary Research (CIR) • CIR Fellows awarded Stipend & Credit • Operon Prediction in the Tuberculosis Genome

Center for Interdisciplinary Research (CIR) • Physical Location • Adolescent Mothers & Infants in a School-based Intervention Program

Center for Interdisciplinary Research (CIR) • Weekly Research Skills Seminar with Meals • Assessing Baseball Performance using Hierarchical Models

Center for Interdisciplinary Research (CIR) • Interdisciplinary Research Teams • The Use of Moral Schemas in Decision-making

Comprehensive Program • Post-docs • Quantitative Analysis of Admission Trends

Everybody Learns! • Promotes interdisciplinary research involving faculty and students from across campus • Modeling Bluebird Predation
