Explore advanced text analysis methods using STATA, R, and Hyperbase. Learn how to process and analyze textual data, conduct lexicometry, text mining, factor & cluster analyses, and more. References and practical applications included for in-depth learning.
Instructor: Prof. Louis Chauvel Advanced Statistical Analysis: Text analysis with Stata txttool, R R.temis, and Hyperbase (15FEB2019)
This session • General references • What’s the matter? • Text(ual) analysis, lexicometry, text mining, … • STATA tools • R tools • Hyperbase
References: • Set of references • « As usual »: • STATA ADVANCED MANUAL: • http://www.louischauvel.org/stata_manuel_advanced.pdf • Plus more recent …
Main references Find them online at http://www.a-z.lu/ ALMOST NONE recently, apart from: Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications Authors: Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen Too much, too heavy, too general, but it is the reference …
SEE ALSO Find this online at: https://mhealth.jmir.org/2018/4/e101/
What is to be done? • Open, long answers to a question / issue / interview • Typical case: 30-100 interviews lasting from several minutes to 1 hour+ • In general, you personally know your sample • And you have some additional indicators on who they are • Description of the contents of speech / content / matter / style … • To understand major cleavages through what people say • A quantitative extension of qualitative research
Typical processing: • Data management • Clean (lower case, punctuation, quotes, ???) • Format the data (different in each software) • Import the data into the software (many issues) • “Stopwords” and lemmatization (removing grammatical inflection) • Stemming (see the Porter Stemmer) • Data processing • Dictionary and sub-counts of words: what they speak of, and who speaks • Concordance / correspondence of words: coherence of words / people • Factor & cluster analyses: contrasts and groupings of words / people • … interpretations … https://tartarus.org/martin/PorterStemmer/ https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
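The data-management steps above (cleaning, stopword removal, word counts per person) can be sketched in a few lines. This is only an illustrative Python sketch, not the course's Stata or R code: the stopword list, the function names `clean` and `bag_of_words`, and the two toy transcripts are all assumptions made up for the example, and stemming is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that"}

def clean(text):
    """Lower-case the text and replace every character that is not a-z or whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def bag_of_words(text, stopwords=STOPWORDS):
    """Return word counts for one text after cleaning and stopword removal."""
    tokens = clean(text).split()
    return Counter(t for t in tokens if t not in stopwords)

# Toy transcripts keyed by speaker (invented, not taken from the hearings).
speeches = {
    "speaker_1": "The Court is independent, and the law is the law.",
    "speaker_2": "Independent judges protect the people.",
}

# One bag of words per person: the raw material for the word-by-person table.
bags = {who: bag_of_words(txt) for who, txt in speeches.items()}
```

Note that this crude cleaning splits contractions ("don't" becomes "don" and "t") and drops accented letters, so a real project would need a more careful rule for each language.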
Typical processing: • Data processing • Main issue: proximity of words and of texts • We need a matrix of relations (proximity) • Many solutions (differences, ratios) • A relatively common solution: • Consider the frequency table with rows = words and columns = persons • Add 0.5 to the frequency in each cell (to handle zeros) • Take the log of the frequency • Compute the mean of log(fr) per row • Keep the difference between log(fr) and the mean of log(fr) per row • Positive and negative values are a good indicator of attraction / repulsion between word and person • Factor & cluster analyses
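The "relatively common solution" listed above can be written out step by step. The following is a minimal Python sketch under stated assumptions: the function name `log_deviation` and the toy counts are hypothetical, and the natural log is assumed since the slide does not specify a base.

```python
import math

def log_deviation(table):
    """table maps each word to its list of counts, one per person.

    Returns, for each word, log(count + 0.5) minus the row mean of those logs.
    """
    result = {}
    for word, counts in table.items():
        # Add 0.5 to every cell so that zero counts can be logged.
        logs = [math.log(c + 0.5) for c in counts]
        # Mean of log(fr) across persons for this word (one row).
        row_mean = sum(logs) / len(logs)
        # Positive = attraction, negative = repulsion between word and person.
        result[word] = [x - row_mean for x in logs]
    return result

# Toy word-by-person frequency table (invented counts, three persons).
counts = {"law": [4, 0, 1], "people": [0, 3, 3]}
scores = log_deviation(counts)
```

Each row of `scores` sums to zero by construction, so a value only says how much a person uses a word relative to the other persons, not how frequent the word is overall. The resulting matrix is the kind of input that factor and cluster analyses can then work on.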
Software • Qualitative-research oriented: NVivo, ATLAS.ti, etc. • Quantitative-method oriented: STATA and R commands, Hyperbase (FR), WordStat (for STATA), TextAnalyst
Neil Gorsuch Typical case https://www.youtube.com/watch?v=RlJEXiZONrQ Opening statements of 22 U.S. Senators (Republicans and Democrats) in the U.S. Associate Justice hearings https://www.congress.gov/115/chrg/shrg28638/CHRG-115shrg28638.htm
Neil Gorsuch Typical case • Opening statements of 22 U.S. Senators in the U.S. Associate Justice hearings • 22 extensive transcripts (10 minutes) • We know their names • https://eugdpr.org/ • GDPR issues? No, it is public… • In the dataset: name + d/r (political party) and transcript • You love U.S. politics? • https://en.wikipedia.org/wiki/Neil_Gorsuch_Supreme_Court_nomination#Committee • https://en.wikipedia.org/wiki/United_States_Senate_Committee_on_the_Judiciary#Members,_115th_Congress • See the texts here • http://www.louischauvel.org/Gorsuch.doc https://www.youtube.com/watch?v=RlJEXiZONrQ
PART 1 STATA and text analysis • You need Stata 13 at minimum … • Long string variables (strL) have almost no length limit • Copy-paste is a simple way to import data • So… • Important STATA module to install from SSC: • ssc install txttool • Provides a Porter stemming option (stem) and counts of words (bag) • The rest is the usual multidimensional descriptive analysis • (factor and cluster) • Example: STATA syntax http://www.louischauvel.org/gorsuch.do WE PROCEED NOW!
PART 2 R and text analysis • R.temis, a new (V2) R package (TExt MIning Solution) • https://cran.r-project.org/web/packages/R.temis/R.temis.pdf • First install the latest version of RStudio • (with the latest version of R) • Install the package R.temis • Additional formatting requirements • Example: R script http://www.louischauvel.org/gorsuch.R http://www.louischauvel.org/Rtemis_FR.docx https://rtemis.hypotheses.org/ https://cran.r-project.org/web/packages/R.temis/index.html WE PROCEED NOW!
PART 3 HYPERBASE and text analysis Available for free here: http://ancilla.unice.fr/ The +: free, robust, appropriate for multilingual contexts But: old, in French, and at some point you have to go back to Part 1 = STATA
Main references Find them online at http://www.a-z.lu/ https://www.stata-journal.com/sjpdf.html?articlenum=dm0077 http://ancilla.unice.fr/bases/manuel.pdf