Bettina Berendt Department of Computer Science KU Leuven, Belgium

Text as data: with an introduction to text exploration with Voyant, and a motivation via Text Mining
http://people.cs.kuleuven.be/~bettina.berendt/
Information Structures and Implications
Last updated on 17 November 2015

Starting questions for today
• What is a text?
• What questions can we ask of a text?
• What kind of answers "make us happy"?
• In what sense are texts data?
  • Hint: Think of our two example databases.
• What does this have to do with relational databases?

Our goal for today
[Diagram linking a relational database and Voyant. Labels: SQL query + export; texts sorted & grouped into different ASCII files by criteria of interest; import; Voyant skin & command; export; info about the texts (CSV); import; Rel. database]

On using computers in text analysis: the goal need not be to revolutionise knowledge as such
"As long as there have been books there have been more books than you could read. … Knowing how to 'not-read' is just as important as knowing how to read" (Mueller, 2007).
"data mining and machine learning are best understood in terms of 'provocation'—the potential for outlier results to surprise a reader into attending to some aspect of a text not previously deemed significant—as well as 'not-reading' or 'distant reading,' the automated search for patterns across a much wider corpus than could be read and assimilated via traditional humanistic methods of 'close reading.'" (Kirschenbaum, 2007)

Close reading describes, in literary criticism, the careful, sustained interpretation of a brief passage of text. Such a reading places great emphasis on the single particular over the general, paying close attention to individual words, syntax, and the order in which sentences and ideas unfold as they are read. The technique […] is now a fundamental method of modern criticism. […] Close reading can be compared/contrasted to the concept of distant reading, [a term attributed to] literary scholar Franco Moretti, [defined] as "understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data."

Agenda
• You use text mining every day
• Text exploration with the DH tool Voyant
• Texts as strings and feature vectors / tables
• Voyant and relational databases
• More sophisticated analyses of texts-as-tables: An example and basics

Origins of text mining. Or: What is a text for information retrieval? Let's do some reverse engineering ...

Words, source relevance, and personalization

Words and knowledge bases (1): Metadata as output

Knowledge-based text processing (2): Metadata as input? Requires different search interfaces!

Trending topics: a form of summarization

Finding "similar" texts: Clustering (example: Google News)

Going further: What topics exist in a collection of texts, and how do they evolve? News texts, scientific publications, … Mei & Zhai (2005)

Guiding questions
• Information retrieval:
  • Given the current user's information need, which are the most relevant documents?
• Text mining:
  • What do the documents tell us? What's in the texts? What can we learn about the texts, their authors, ...
  • Many different subquestions
  • Summarization (of one text, of many texts) is just one of them
• Cf.
  • "Distant reading" (Moretti)
    • understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.
  • "Machine reading" (UCL Machine Reading Group)
    • machines that can read and "understand" this textual information, converting it into interpretable structured knowledge to be leveraged by humans and other machines alike

Speed-reading (Woody Allen)
"I took a course in speed reading and was able to read War and Peace in twenty minutes. It's about Russia."
... also quoted differently:
"I took a speed reading course and read 'War and Peace' in twenty minutes. It involves Russia."

A personal "experiment": deliberately a bit silly, more a gentle introduction to a great tool and to some pitfalls of "distant reading" (I haven't read War and Peace yet.)

Speed-reading with word clouds: The Voyant tool (single-digit number of seconds)

Note about "said": Compare Joyce's Dubliners

Word frequencies vs. Woody Allen
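A word cloud is essentially a picture of word frequencies. The following is a minimal sketch of that counting step, not Voyant's actual implementation: the stop-word list, the function names `tokenize` and `top_words`, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was", "he", "she", "that"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, k=5, stopwords=STOPWORDS):
    """Return the k most frequent non-stop-words with their counts."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return counts.most_common(k)


# Made-up sample sentence, not a quotation from the novel.
sample = ("Pierre said that Natasha said nothing. "
          "Prince Andrew said the war was near, and Pierre agreed.")
print(top_words(sample, k=2))
```

Note that "said" comes out on top here unless it is added to the stop-word list, which is the kind of pitfall the note about "said" points at.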

Can we find out more about the three (Pierre, Natasha, Andrew)?

Double-check in Wikipedia (method: string search)
• Count Pyotr Kirillovich (Pierre) Bezukhov: The large-bodied, ungainly, and socially awkward illegitimate son of an old Russian grandee. Pierre, educated abroad, returns to Russia as a misfit. His unexpected inheritance of a large fortune makes him socially desirable. Pierre is the central character and often a voice for Tolstoy's own beliefs or struggles.
• Prince Andrey Nikolayevich Bolkonsky: A strong but skeptical, thoughtful and philosophical aide-de-camp in the Napoleonic Wars.
  • Some searching needed ... Andrew ... Andrei ... Andrey
• Countess Natalya Ilyinichna (Natasha) Rostova: A central character, introduced as "not pretty but full of life" and a romantic young girl, although impulsive and highly strung, she evolves through trials and suffering and eventually finds happiness. She is an accomplished singer and dancer.
• ...
• Prince Anatole Vasilyevich Kuragin: Hélène's brother and a very handsome and amoral pleasure seeker who is secretly married yet tries to elope with Natasha Rostova.
• Vasily Dmitrich Denisov: Nikolai Rostov's friend and brother officer, who proposes to Natasha.
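The "some searching needed" step can be sketched as a string search that counts each spelling variant separately. This is only an illustration: the variant list is taken from the slide, while the function name `count_variants` and the sample text are assumptions.

```python
import re

# Spelling variants mentioned on the slide.
VARIANTS = ["Andrew", "Andrei", "Andrey"]
pattern = re.compile(r"\b(?:" + "|".join(VARIANTS) + r")\b")


def count_variants(text):
    """Count each spelling variant separately and return a dict."""
    counts = {v: 0 for v in VARIANTS}
    for match in pattern.finditer(text):
        counts[match.group(0)] += 1
    return counts


# Made-up sample text for illustration only.
sample = "Prince Andrew left. Andrei's sister wrote to Prince Andrey and to Andrew."
result = count_variants(sample)
print(result, sum(result.values()))
```

Summing the per-variant counts gives one figure for the character, which is what a single-spelling search would miss.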

From Wikipedia's plot summary (method: string search)
• ...
• Natasha is convinced that she loves Anatole and writes to Princess Maria, Andrei's sister, breaking off her engagement [with Andrei]. At the last moment, Sonya discovers her plans to elope and foils them. Pierre is initially horrified by Natasha's behavior, but realizes he has fallen in love with her. During the time when the Great Comet of 1811–2 streaks the sky, life appears to begin anew for Pierre.
• Prince Andrei coldly accepts Natasha's breaking of the engagement. He tells Pierre that his pride will not allow him to renew his proposal. Ashamed, Natasha makes a suicide attempt and is left seriously ill.
• ...
• Having lost all will to live, [Andrei] forgives Natasha in a last act before dying.
• Pierre's wife Hélène dies from an overdose of abortion medication (Tolstoy does not state it explicitly but the euphemism he uses is unambiguous). Pierre is reunited with Natasha, while the victorious Russians rebuild Moscow. Natasha speaks of Prince Andrei's death and Pierre of Karataev's. Both are aware of a growing bond between them in their bereavement. With the help of Princess Maria, Pierre finds love at last and, revealing his love after being released by his former wife's death, marries Natasha.
Total time: 29 mins since creation of word cloud, 17 mins since creation of Pierre-Natasha-Andrew chart (includes making these slides for you)

Questions
• How much of this was "really automatic"?
• What existing knowledge (in my head and in others') went into this analysis,
  • and how?
• Can you think of another reason why this (deliberately) turned out silly?

More interesting / serious examples (1) (from the summer school participants)
• Analysis of ego-shooter missions (thanks to Kathrin Trattner)

Comment (B. Berendt): compare this with an earlier text-mining analysis of reporting on the same events by CNN in comparison with Al Jazeera
• See next slide

Unsupervised learning of bias
What characterizes different news sources?
Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords
[Fortuna, Galleguillos, & Cristianini, 2009]

More interesting / serious examples (2) (from the summer school participants)
• Joseph Goebbels' Sportpalast speech (a famous propaganda speech from 1943: "Do you want the total war?")
• Frequencies of negatively connoted words ("bolshevism", "judaism / the Jews") vs. positively connoted words ("Germans") suggest:
  • The speech starts with a threat scenario and ends with a positive vision of the future
  • Remark B. Berendt: This is borne out by reading the full text, and it is also a classical rhetorical structure.
Text from: http://www.1000dokumente.de/index.html?c=dokument_de&dokument=0200_goe&object=translation&st=&l=de

More interesting / serious examples (4) (from others)
Examples of Voyant in Research: http://docs.voyant-tools.org/about/examples-gallery/

And now for your own "experiments"
http://docs.voyant-tools.org/start/
and inspirations from the new version: http://docs.voyant-tools.org/workshops/dh2015/

How did the representation of your text change?

Possible interactions between a relational database and a text analysis tool (1)
[Diagram linking a relational database and Voyant. Labels: SQL query + export; texts sorted & grouped into different ASCII files by criteria of interest; import; Voyant skin & command; Rel. database]

Export all speeches into 1 text file
SELECT `Spoken_text`
INTO OUTFILE 'filepath\\all_texts.csv'
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
FROM `speech`;

Differentiate by one criterion, e.g. language
Language   count(`Spoken_text`)
SL         2
LT         2
BG         3
LV         3
CS         4
SV         9
FI         12
DA         13
HU         14
EL         18
PT         18
ES         22
NL         23
SK         25
PL         30
RO         37
IT         39
FR         41
DE         85
en         228

Example: Export only speeches in English (into 1 text file)
SELECT `Spoken_text`
INTO OUTFILE 'filepath\\speeches_in_english.csv'
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
FROM `speech`
WHERE `Language` LIKE "%en%"
ORDER BY `Agenda_item_ID`;
(the ORDER BY is a simple way of ordering the speeches by time in the output)
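Repeating the export once per language by hand gets tedious. The sketch below shows one way to write "texts sorted & grouped into different files" in a single pass. It is an illustration under stated assumptions: an in-memory SQLite database stands in for the MySQL database of the slides, the table and column names (`speech`, `Spoken_text`, `Language`, `Agenda_item_ID`) follow the queries above, and the function name `export_by_language`, the output file naming and the sample rows are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_by_language(conn, out_dir):
    """Write one text file per language, speeches ordered by agenda item."""
    out_dir = Path(out_dir)
    rows = conn.execute(
        "SELECT Language, Spoken_text FROM speech "
        "ORDER BY Language, Agenda_item_ID"
    )
    written = {}
    for language, text in rows:
        key = language.lower()
        # Hypothetical naming scheme: speeches_<language code>.txt
        path = out_dir / f"speeches_{key}.txt"
        with path.open("a", encoding="utf-8") as f:
            f.write(text + "\n")
        written[key] = path
    return written


# Stand-in database with made-up rows (the real data live in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speech (Agenda_item_ID INTEGER, Language TEXT, Spoken_text TEXT)"
)
conn.executemany("INSERT INTO speech VALUES (?, ?, ?)", [
    (2, "en", "Second English speech."),
    (1, "en", "First English speech."),
    (1, "FR", "Premier discours."),
])
out_dir = tempfile.mkdtemp()  # fresh directory, so append mode starts from empty files
files = export_by_language(conn, out_dir)
print(sorted(files))
```

Each resulting file can then be loaded into Voyant as one document per language.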

Possible interactions between a relational database and a text analysis tool (2)
[Diagram linking a relational database and Voyant. Labels: SQL query + export; texts sorted & grouped into different ASCII files by criteria of interest; import; Voyant skin & command; export; info about the texts (CSV); import; Rel. database]

Task: How to answer this question with Voyant + SQL?
How similar are the speeches in the top-4 languages? E.g.: Are French speeches more similar to Romanian speeches than Italian speeches are to Romanian ones?
Recommendation: Exclude English and take the next 4 most frequent languages.
A simple similarity measure: the Jaccard similarity
Sim(t1, t2) = (number of words that occur both in t1 and t2) / (number of words that appear in t1 or in t2)
Hint: you need to simplify this measure to restrict yourself to the most frequent words in texts 1 and 2
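To make the measure concrete, here is a minimal sketch of the Jaccard similarity on word sets, including the simplification to the k most frequent words of each text. The function names `top_k_words` and `jaccard`, the whitespace-and-punctuation tokenisation, and the two toy texts are assumptions for illustration, not part of the exercise material.

```python
import re
from collections import Counter


def top_k_words(text, k):
    """Return the set of the k most frequent words in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(k)}


def jaccard(words1, words2):
    """|intersection| / |union| of two word sets (0.0 if both are empty)."""
    union = words1 | words2
    if not union:
        return 0.0
    return len(words1 & words2) / len(union)


# Made-up toy texts.
t1 = "the budget the budget the council votes"
t2 = "the budget the parliament the parliament votes"

sim_all = jaccard(set(t1.split()), set(t2.split()))       # all words
sim_top2 = jaccard(top_k_words(t1, 2), top_k_words(t2, 2))  # top-2 words only
print(sim_all, sim_top2)
```

With all words the two texts share 3 of 5 distinct words; restricted to the two most frequent words of each, they share 1 of 3, which shows how much the choice of k can change the result.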

Hint
Intersection: How can you get a list of the words that are in both of two tables? Think of joining the tables.
Union: How can you get a list of the words that are in either or both of two tables?
SELECT word FROM table1 UNION SELECT word FROM table2
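The two hints can be tried out as follows. This is a sketch, not the official solution: it uses an in-memory SQLite database as a stand-in for MySQL, keeps the table names `table1` and `table2` and the column `word` from the hint, and fills them with made-up words (in the exercise they would hold the most frequent words exported from Voyant).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (word TEXT PRIMARY KEY);
    CREATE TABLE table2 (word TEXT PRIMARY KEY);
    INSERT INTO table1 VALUES ('europe'), ('budget'), ('council');
    INSERT INTO table2 VALUES ('europe'), ('budget'), ('parliament'), ('vote');
""")

# Intersection: an inner join keeps only the words present in both tables.
n_both = conn.execute(
    "SELECT COUNT(*) FROM table1 JOIN table2 ON table1.word = table2.word"
).fetchone()[0]

# Union: UNION (without ALL) removes duplicates, so each word is counted once.
n_either = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT word FROM table1 UNION SELECT word FROM table2) AS u"
).fetchone()[0]

similarity = n_both / n_either
print(n_both, n_either, similarity)
```

Dividing the two counts gives the Jaccard similarity of the two word lists.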

(Solution to follow)

Outlook: Using classifier learning for literature analysis; here: a (Weka) decision tree (early example: MONK)
Sara Steger (2012). Patterns of Sentimentality in Victorian Novels. Digital Studies 3(2).

Background: Document Representation as Vectors = relational tables
• Starting point is the raw term frequency as term weights
• Other weighting schemes can generally be obtained by applying various transformations to the document vectors
Features (columns): nova, galaxy, heat, actor, film, role, diet
Document Ids (rows), non-empty cells only:
A: 1.0 0.5 0.3
B: 0.5 1.0
C: 0.4 1.0 0.8 0.7
D: 0.9 1.0 0.5
E: 0.5 0.7 0.9
F: 0.6 1.0 0.3 0.2 0.8
Each row is a document vector.

Some formalism: the vector-space model of text (basic model used in information retrieval and text mining)
• Basic idea:
  • Keywords are extracted from texts.
  • These keywords describe the (usually) topical content of Web pages and other text contributions.
• Based on the vector space model of document collections:
  • Each unique word in a corpus of Web pages = one dimension
  • Each page(view) is a vector with non-zero weight for each word in that page(view), zero weight for other words
→ Words become "features" (in a data-mining sense)
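As an illustration of the model, the sketch below builds a small document-term table from raw term frequencies and then applies one possible transformation (dividing each vector by its largest entry). The function names `build_term_matrix` and `max_normalize`, the whitespace tokenisation, and the three toy documents are assumptions; the slides do not say which weighting scheme produced their example table.

```python
from collections import Counter


def build_term_matrix(docs):
    """docs: dict id -> text. Returns (vocabulary, dict id -> raw tf vector)."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # One dimension per unique word in the whole collection.
    vocabulary = sorted(set().union(*counts.values()))
    # Words absent from a document get weight 0.
    vectors = {doc_id: [c[w] for w in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, vectors


def max_normalize(vector):
    """One possible transformation: divide by the largest weight in the vector."""
    largest = max(vector)
    return [x / largest for x in vector] if largest else list(vector)


# Made-up toy documents reusing some feature names from the example table.
docs = {"A": "nova galaxy nova heat", "B": "actor film role film", "C": "galaxy film"}
vocabulary, vectors = build_term_matrix(docs)
print(vocabulary)
print(vectors["A"], max_normalize(vectors["A"]))
```

Each row of `vectors` is one document vector, and the whole structure can be stored as a relational table with one column per word.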

Next week: More on text

References
Again, there's no real written version of today's lecture and exercise session. Apart from the hyperlinks cited on the slides, here's background reading:
Individual sources cited on the slides
• Fortuna, B., Galleguillos, C., & Cristianini, N. (2009). Detecting the bias in media with statistical learning methods. In Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC.
• Kirschenbaum, M. (2007). The Remaking of Reading: Data Mining and the Digital Humanities. In NGDM 07: National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation. http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf
• Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD 2005: 198-207.
• Mueller, M. (2007). Notes towards a user manual of MONK. https://apps.lis.uiuc.edu/wiki/display/MONK/Notes+towards+a+user+manual+of+Monk
• Steger, S. (2012). Patterns of Sentimentality in Victorian Novels. Digital Studies 3(2). http://www.digitalstudies.org/ojs/index.php/digital_studies/article/view/238/294

More DH-specific tools
Overviews of 71 tools for Digital Humanists
• Simpson, J., Rockwell, G., Chartier, R., Sinclair, S., Brown, S., Dyrbye, A., & Uszkalo, K. (2013). Text Mining Tools in the Humanities: An Analysis Framework. Journal of Digital Humanities, 2(3). http://journalofdigitalhumanities.org/2-3/text-mining-tools-in-the-humanities-an-analysis-framework/
• See also the link collection on the Voyant documentation Web page
