
Instructor: Prof. Louis Chauvel

Instructor: Prof. Louis Chauvel. Advanced Statistical Analysis: Text analysis with Stata txttool, R R.temis and Hyperbase (15FEB2019). This session: General references. What’s the matter? Text(ual) analysis, lexicometry, text mining, … STATA tools. R tools. Hyperbase.



  1. Instructor: Prof. Louis Chauvel • Advanced Statistical Analysis: Text analysis with Stata txttool, R R.temis and Hyperbase (15FEB2019)

  2. This session • General references • What’s the matter? • Text(ual) analysis, lexicometry, text mining, … • STATA tools • R tools • Hyperbase

  3. References: • Set of references • « As usual »: • STATA ADVANCED MANUAL: • http://www.louischauvel.org/stata_manuel_advanced.pdf • Plus more recent …

  4. Main references • Find them online at http://www.a-z.lu/ • ALMOST NONE recently, apart from: Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications • Authors: Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen • Too much, too heavy, too general, but it is the reference …

  5. SEE ALSO • Find this online at: https://mhealth.jmir.org/2018/4/e101/

  6. What is to be done? • Open, long answers to a question / issue / interview • Typical case: 30-100 interviews lasting from several minutes to 1 hour+ • In general you personally know your sample • And you have some additional indicators about who the respondents are • Description of the contents of speech / content / matter / style … • to understand major cleavages through what people say • A quantitative extension of qualitative research

  7. Typical processing: • Data management • Clean (lower case, punctuation, quotes, ???) • Format the data (different in each software) • Import the data into the software (many issues) • “Stopwords” and lemmatization (remove grammatical inflection) • Stemming (see Porter Stemmer) • Data processing • Dictionary and sub-counts of words → what they speak of and who • Concordance / Correspondence of words → coherence of words / people • Factor & cluster analyses → contrasts and grouping of words / people • … interpretations … • Porter Stemmer: https://tartarus.org/martin/PorterStemmer/ • Stemming and lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
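The data management steps above (clean, drop stopwords, stem, count words) can be sketched in a few lines. This is only an illustration in plain Python, not the txttool or R.temis workflow: the stopword list, the two sample sentences and the speaker names are invented for the example, the cleaning step assumes unaccented English text, and `crude_stem` is a rough suffix stripper standing in for the Porter stemmer, not the Porter algorithm itself.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "that", "it"}


def clean(text):
    """Lower-case the text and replace anything that is not a-z or whitespace by a space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Clean, remove stopwords, stem, and count the remaining tokens."""
    tokens = [w for w in clean(text).split() if w not in STOPWORDS]
    return Counter(crude_stem(w) for w in tokens)


# Invented sample texts, one per (hypothetical) speaker.
docs = {
    "speaker_a": "The judges decided the case; the court is judging.",
    "speaker_b": "Courts and judges interpret the law.",
}
counts = {name: bag_of_words(text) for name, text in docs.items()}

for name, bag in counts.items():
    print(name, dict(bag))
```

The result is one word-count dictionary per person, which is the raw material for the word-by-person frequency table used in the data processing step.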

  8. Typical processing: • Data processing • Main issue → proximity of words and of texts • → we need a matrix of relations (proximity) • → many solutions (differences, ratios) • → a relatively common solution: • consider the frequency table with rows = words and columns = persons • add 0.5 to the frequency in each cell (to handle zeros) • take the log of the frequency • compute the mean of log(fr) per row • keep the difference between log(fr) and the mean of log(fr) per row • + and – values are a good indicator of attraction/repulsion between word and person • Factor & cluster analyses
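The "relatively common solution" above amounts to computing, for each word w and person p, the score log(fr + 0.5) minus the mean of log(fr + 0.5) over all persons for that word. A minimal sketch in plain Python follows; the function name `log_centered`, the two words and the counts are invented for illustration, and the natural log is an assumption since the slide does not name a base.

```python
import math


def log_centered(table, people):
    """Turn {word: {person: count}} into {word: {person: score}}.

    score = log(count + 0.5) - mean over persons of log(count + 0.5),
    so each row (word) of scores sums to zero.
    """
    result = {}
    for word, row in table.items():
        # Add 0.5 to every cell so that zero counts can be logged.
        logs = {p: math.log(row.get(p, 0) + 0.5) for p in people}
        mean = sum(logs.values()) / len(people)
        result[word] = {p: logs[p] - mean for p in people}
    return result


# Invented counts: rows = words, columns = persons.
people = ["speaker_a", "speaker_b"]
table = {
    "court": {"speaker_a": 4, "speaker_b": 0},
    "law": {"speaker_a": 2, "speaker_b": 2},
}
scores = log_centered(table, people)

for word, row in scores.items():
    print(word, {p: round(s, 3) for p, s in row.items()})
```

A positive score means the person uses the word more than the row average (attraction), a negative score means less (repulsion), and a word used equally by everyone scores zero everywhere. The resulting matrix is what would then be passed to factor or cluster analysis.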

  9. Software • Qualitative research oriented: NVivo, ATLAS.ti, etc. • Quantitative method oriented: STATA and R commands, Hyperbase (FR), WordStat (for STATA), TextAnalyst

  10. Typical case: Neil Gorsuch • https://www.youtube.com/watch?v=RlJEXiZONrQ • Opening statements of 22 U.S. Senators (Republicans & Democrats) in the U.S. Associate Justice hearings • https://www.congress.gov/115/chrg/shrg28638/CHRG-115shrg28638.htm

  11. Typical case: Neil Gorsuch • Opening statements of 22 U.S. Senators in the U.S. Associate Justice hearings • 22 extensive transcripts (10 minutes) • We know their names • https://eugdpr.org/ • GDPR issues? = NO, it is public… • In the dataset: name + d/r = political party, and transcript • You love U.S. politics? • https://en.wikipedia.org/wiki/Neil_Gorsuch_Supreme_Court_nomination#Committee • https://en.wikipedia.org/wiki/United_States_Senate_Committee_on_the_Judiciary#Members,_115th_Congress • See texts here • http://www.louischauvel.org/Gorsuch.doc • https://www.youtube.com/watch?v=RlJEXiZONrQ

  12. Raw Material

  13. PART 1 STATA and text analysis • You need Stata 13 or later … • Long text strings have almost no length limit • Copy-paste is a simple way to import data • So… • Important STATA module to install from SSC: • ssc install txttool • Provides a Porter stemming option (stem) and counts of words (bag) • The rest is the usual multidimensional descriptive analysis • (factor and cluster) • Example: STATA syntax: http://www.louischauvel.org/gorsuch.do • WE PROCEED NOW!

  14. PART 2 R and text analysis • R.temis, a new (V2) R package (R TExt MIning Solution) • https://cran.r-project.org/web/packages/R.temis/R.temis.pdf • First install the latest version of RStudio • (with the latest version of R) • Install the package R.temis • Additional formatting requirements • Example: R script: http://www.louischauvel.org/gorsuch.R • http://www.louischauvel.org/Rtemis_FR.docx • https://rtemis.hypotheses.org/ • https://cran.r-project.org/web/packages/R.temis/index.html • WE PROCEED NOW!

  15. PART 3 HYPERBASE and text analysis • Available for free here: http://ancilla.unice.fr/ • The +: free, robust, appropriate for multilingual contexts • But: old, French, and at some point you have to go back to Part 1 = STATA

  16. Main references • Find them online at http://www.a-z.lu/ • https://www.stata-journal.com/sjpdf.html?articlenum=dm0077 • http://ancilla.unice.fr/bases/manuel.pdf
