Close, Distant, and Scalable Reading Glenn Roe & Martin Wynne Digital.Humanities@Oxford Summer School July 10 2013

Close Reading
Close reading: “operates on the premise that literature, as artifice, will be more fully understood and appreciated to the extent that the nature and interrelations of its parts are perceived, and that that understanding will take the form of insight into the theme of the work in question. This kind of work must be done before you can begin to appropriate any theoretical or specific literary approach”.
“[A] finely detailed, very specific examination of a short poem or short selected passage from a longer work, in order to find the focus or design of the work [...] the meaning of the microcosm, containing or signaling the meaning of the macrocosm (the longer work of which it is a part). To this end "close" reading calls attention to all dynamic tensions, polarities, or problems in the imagery, style, literal content, diction, etc”.
http://theliterarylink.com/closereading.html

Close Reading as the paradigm for text-based humanities scholarship

But what do you do with a million books? There are only about 30,000 days in a human life -- at a book a day, it would take 30 lifetimes to read a million books and our research libraries contain more than ten times that number. Only machines can read through the 400,000 books already publicly available for free download from the Open Content Alliance. • Gregory Crane, “What do you do with a million books?” D-Lib Magazine, March 2006

And 5 million books? We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities. www.sciencexpress.org / 16 December 2010

Culturomics…

Distant Reading Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more. Franco Moretti, “Conjectures on World Literature” Distant Reading, 2013.

Distant Reading A canon of 200 novels, for instance, sounds very large for 19th-century Britain (and is much larger than the current one), but is still less than 1% of the novels that were actually published […] and close reading won’t help here, a novel a day every day of the year would take a century or so … And it’s not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History, 2005

Digital Humanities and Distant Reading
The Humanities discovers data (DH 1.0 → DH 2.0)
Quickly leads to a “data deluge” (ars longa, vita brevis)
Big Data approaches to Humanities collections (e-Research)
From accelerated research to new knowledge discovery
digital > digitisation

Big Data and the Humanities
• How Big is Big?
• The Complete Works of Voltaire (Voltaire Foundation): 1,077 individual works, 6.7 million words
• The Digital Encyclopédie of Diderot and d’Alembert (University of Chicago): 28 volumes in folio; 74,000 articles; 21.7 million words
• Electronic Enlightenment (University of Oxford): 60,000 letters, 23 million words
• ECCO-TCP (Oxford Text Archive): 2,300 volumes, 75 million words
• ARTFL-Frantext (University of Chicago): 3,500 volumes, 215 million words
• Early English Books Online EEBO (Northwestern University): 23,000 volumes, ~1 billion words

Matt Jockers, University of Nebraska-Lincoln Macroanalysis: Digital Methods and Literary History (University of Illinois Press, 2013)

Matt Jockers, Macroanalysis (2013).

Simon Raper, “Graphing the history of philosophy”

Distant Reading has a Long History:
• Annales School, Book History, etc.
• Counting, not reading:
  • After-death inventories
  • Library holdings/circulation records
  • Archives of publishers
  • Vocabulary of titles (Furet)
  • Censorship records
• Martin, Furet, Darnton, Chartier, etc.

Robert Darnton, The Forbidden Best-Sellers of Pre-Revolutionary France (New York, 1995), 189.

From “distant” (not) reading to close reading and back again... Digital Humanities as a locus for “scalable” reading practices DATA: digitally assisted text analysis Martin Mueller, Northwestern

Digital Humanities as locus for “Scalable Reading” By “not reading” we examine: concordances, frequency tables, feature lists, classifications, collocation tables, statistical models, networks, etc… We can track: Literary topoi (E.R. Curtius), concepts (R. Koselleck, Begriffsgeschichte), épistémès (M. Foucault) and other semantic patterns: over time, between categories, across genres. So that distant reading and data-driven analysis can provide larger contexts for close reading(s) and traditional scholarship.

Digital Humanities as locus for “Scalable Reading”
Three primary areas of Digitally Assisted Text Analysis:
1. Computational/Corpus Linguistics
2. Information Retrieval
3. Text Mining and Data Visualization

Corpus Linguistics and Scalable Reading Corpus Concordance Collocation Sinclair, John, Corpus, Concordance, Collocation, Oxford University Press, 1991
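To make the notion of collocation concrete, here is a minimal sketch in Python that counts the words occurring within a fixed window either side of a node word. The function names, the tokenizer, and the sample sentence are all hypothetical and invented for illustration; this is not Sinclair's procedure or any particular tool's implementation, and a real study would add a significance measure on top of the raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented sample text, for illustration only
sample = ("The state of the realm was poor. "
          "The king ruled the state, and the state obeyed the king.")

col = collocates(tokenize(sample), "state", window=2)
for word, n in col.most_common(3):
    print(word, n)
```

Raw window counts like these are dominated by function words such as "the", which is one reason collocation tables are usually filtered or weighted before interpretation.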

Some testable assertions
State
• “...no political writer before the middle of the sixteenth century used the word 'state' in anything like its modern political sense [referring to the machinery of government and social control]” (Skinner, Quentin, The Foundations of Modern Political Thought, Cambridge University Press, 1978).
Tudor
• “The idea of a "Tudor era" in history is a misleading invention, claims an Oxford University historian. Cliff Davies says his research shows the term "Tudor" was barely ever used during the time of Tudor monarchs.” (http://www.bbc.co.uk/news/education-18240901 May 2012)
Holocaust
• “I will argue that “The Holocaust” is an ideological representation of the Nazi holocaust...Until recently, however, the Nazi holocaust barely figured in American life. Between the end of World War II and the late 60s, only a handful of books and films touched on the subject”. (Norman Finkelstein, The Holocaust Industry. Verso, 2000.)

A new opportunity “It is not easy to justify assertions about the alleged frequency or infrequency of some particular belief or attitude in the past. How many examples does one need to cite in order to prove the point? Lacking any satisfactory method of quantifying these matters, all I can do is to record my impressions after long immersion in the period”. Keith Thomas, The Ends of Life, Oxford University Press, 2010.

“We cannot hope to understand the behaviour of people long dead, unless we can reconstruct the mental assumptions which led them to act as they did.” - Keith Thomas, The Ends of Life, Oxford University Press, 2010. Evidence: Writing Speech Thoughts Actions Artefacts (art, architecture, cooking, etc.) Other? Intellectual History

Isn't this just Googling stuff? or Isn't it just looking up words in online text collections? An objection (or two)

How do we interpret the results? We need to ask the questions:
What's in my corpus?
What's missing from the population of texts which the corpus is sampled from?
What claims can I make about results from this dataset?
What is the right tool for the job?
Will I successfully retrieve all occurrences of the word forms which I am looking for?
How can I make my search term more sophisticated?
What claims can I make about the significance of the frequencies?
How can I improve the process, and refine the results?
What do I need to investigate further?
The perils of interpretation…
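Two of these questions, retrieving every word form and judging what a frequency means, can be illustrated with a short Python sketch. It builds one search pattern from a list of variant forms and normalises the hit count to a rate per million tokens, so that sub-corpora of different sizes can be compared. The function names and the two toy "sub-corpora" are hypothetical and invented for illustration; the rates they produce say nothing about real usage.

```python
import re


def per_million(count, total_tokens):
    """Convert a raw count to a rate per million tokens."""
    if total_tokens == 0:
        return 0.0
    return count * 1_000_000 / total_tokens


def variant_pattern(variants):
    """Build one case-insensitive whole-word regex matching any listed form."""
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def relative_frequency(text, pattern):
    """Hits for `pattern` in `text`, expressed per million word tokens."""
    tokens = re.findall(r"\w+", text)
    hits = len(pattern.findall(text))
    return per_million(hits, len(tokens))


# Invented toy sentences standing in for two sub-corpora of different sizes
early = "The Tudor court met. The court was small."
late = ("Historians of the Tudor era and the Tudors write much "
        "about Tudor policy and Tudor courts today.")

pat = variant_pattern(["Tudor", "Tudors"])
early_rate = relative_frequency(early, pat)
late_rate = relative_frequency(late, pat)
print(round(early_rate), round(late_rate))
```

A normalised rate only answers the question of scale; whether a difference between two rates is meaningful still depends on what the corpus contains and what it leaves out.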

DH Research and Development: Full text search/retrieval Tool development Text mining approaches PhiloLogic search engine Distant > Scalable Reading

Information Retrieval: PhiloLogic search engine
Open source full-text search and analysis system based on traditional models of humanistic textual scholarship.
Used worldwide by a number of teams independently of its French roots:
• Perseus under PhiloLogic - Greek and Latin Library
• The École des Chartes in Paris - medieval charters, etc.
• Brown Women Writers Project - heavy TEI encoding (Early Modern Women's Studies and The Scholarly Technology Group of Brown University)

Information Retrieval: PhiloLogic search engine
• Maison de Balzac in Paris (scholarly on-line edition of Balzac's Comédie humaine)
• Abraham Lincoln Digitization Project at Northern Illinois University
• Indica et Buddhica - Sanskrit texts compiled by an independent scholar in New Zealand
• Alexander Street Press, a commercial on-line publisher: many collections of large data sets, including a large collection of Black drama (about 1,200 plays)

Information Retrieval: PhiloLogic search engine
PhiloLogic3's general features include:
Word and phrase searching:
• Proximity searches in sentences and paragraphs.
• Similarity searches - fuzzy matching (wildcards*)
Corpus definition using rich metadata at the document and sub-document level (Author, Title, Dates, Genre, etc.)
A variety of advanced reporting features:
• Concordances
• KWIC (Keyword in Context)
• Frequency distributions per period/work/author, etc.
• Collocations and collocation tables
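For readers who have not met a KWIC report before, the following Python sketch shows the general idea: every hit for a search term is listed with a few words of context on each side, and a trailing `*` acts as a wildcard. This is an illustration only and not PhiloLogic's code or query syntax; the function names `kwic` and `format_kwic`, the use of Python's `fnmatch` for wildcards, and the sample sentence are all assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase


def kwic(text, pattern, width=3):
    """Return (left context, hit, right context) for each token matching
    `pattern`, which may contain shell-style wildcards such as 'stat*'."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if fnmatchcase(tok.lower(), pattern.lower()):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def format_kwic(hits, pad=25):
    """Right-align the left context so the hits line up in one column."""
    return [f"{left:>{pad}}  [{tok}]  {right}" for left, tok, right in hits]


# Invented sample text, for illustration only
sample = "The state of the realm. A statesman serves the State well."

exact = kwic(sample, "state", width=2)
wild = kwic(sample, "stat*", width=2)
for line in format_kwic(wild):
    print(line)
```

Aligning the hits in a single column is what lets the eye scan for recurring neighbours, which is the step that connects a concordance to a collocation table.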

"From words to works": Extensions to PhiloLogic
PhiloMine: machine learning & text mining package. Open Source: http://code.google.com/p/philomine/
PhiloLine/PAIR: sequence alignment algorithms for text comparison. Open Source: http://code.google.com/p/text-pair/
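As a rough illustration of what text comparison across works can involve, the Python sketch below finds word sequences shared by two passages using overlapping n-grams ("shingles") and reports a simple overlap score. This is a generic technique swapped in for illustration; it is not the algorithm used by PhiloLine/PAIR, and the function names and the two sample passages are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(a, b, n=3):
    """Word n-grams that occur in both passages."""
    return shingles(a, n) & shingles(b, n)


def jaccard(a, b, n=3):
    """Share of all distinct n-grams that the two passages have in common."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Invented passages, for illustration only
source = "we must cultivate our garden said Candide"
reuse = "As the novel ends, we must cultivate our garden."

common = shared_ngrams(source, reuse)
score = jaccard(source, reuse)
for gram in sorted(common):
    print(" ".join(gram))
```

Exact n-gram matching misses reuse that involves spelling variation or small rewordings, which is one reason alignment tools typically add some tolerance for gaps and variants.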
