
Gramsci’s authorship attribution of anonymous newspaper articles

Gramsci’s authorship attribution of anonymous newspaper articles. Maurizio Lana, Histoire et informatique, Textométrie des sources historiques, 6.6.2014. Who we are: Maurizio Lana, Mirko Degli Esposti, Emanuele Caglioti, Dario Benedetto; 1 scholar and 3 mathematical physicists.



  1. Gramsci’s authorship attribution of anonymous newspaper articles • Maurizio Lana • Histoire et informatique • Textométrie des sources historiques • 6.6.2014

  2. who we are • Maurizio Lana • Mirko Degli Esposti • Emanuele Caglioti • Dario Benedetto • 1 scholar and 3 mathematical physicists

  3. it’s always data • once physical-world phenomena have been digitized, the same kind of analysis can work equally well on • CT imaging, • songs, • ECG, • texts, • …

  4. reason for the study • national edition of Gramsci’s works, by the Ministero dei Beni Culturali • new work on the newspaper articles • many anonymous articles in the journals and newspapers Gramsci wrote for: Il Grido del Popolo, Avanti!, La Città Futura • request from the Fondazione Gramsci to start anew the study of the anonymous articles, to find new evidence of Gramsci’s writings • this was in 2005

  5. a little background • the start is in 1847: V.J. Bunjakovskij, On the possibility to apply determining measures of confidence to the results of some observing sciences, particularly statistics • 1897-98, W. Lutosławski, “On Stylometry”; “Principes de stylométrie” • 1959, D. R. Cox and L. Brandwood, On a discriminatory problem connected with the works of Plato • 1962, A. Ellegård, Who was Junius? • 1964, F. Mosteller and D. Wallace, Inference and Disputed Authorship: The Federalist • 1978, A. Kenny, The Aristotelian ethics: a study of the relationship between the Eudemian and Nicomachean ethics of Aristotle • 1980, J.P. Benzécri, Pratique de l’analyse des données • 1987, J. F. Burrows, Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style, “LLC”, 2, 1987, pp. 61-70

  6. in common… • … they all work at the level of words

  7. the turning point • G. Ledger, Re-counting Plato: A Computer Analysis of Plato’s Style, Oxford, Clarendon Press, 1989 • what is counted: words containing a specified letter; words ending in a specified letter; words with a specified letter as penultimate • that is, semantically and linguistically meaningless parts of the words • “I have departed from the traditional approach of stylometry by ignoring entirely meanings and grammatical functions, measuring instead the frequencies of words according to their orthographic content”

  8. today, for me (for us) • the key is: a latent mathematical structure of the text • from: L. Doležel, A note on quantification in text theory, in: “Text Processing”, S. Allén ed., Stockholm, 1982, pp. 539-552 • an expression of the idea: D. Khmelev, F. Tweedie, Using Markov chains for identification of writers, “LLC”, 16, 4, 2001, pp. 299-307

  9. today, for me (for us) • another expression: D. Benedetto, E. Caglioti, V. Loreto, Language Trees and Zipping, “Phys. Rev. Lett.” 88, n. 4, 048702-1, 048702-4 (2002) • take one text and compress it with Zip; • then take another text and compress it with the compression dictionary of the first one; • measure the difference in size: this is an estimate of the relative entropy between the two texts
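
The zipping idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: instead of reusing the first text's compression dictionary directly, it appends the second text to the first and measures how many extra compressed bytes that costs, which is a common stand-in for the same quantity. The function names `compressed_size` and `cross_cost` are hypothetical, and `zlib` (DEFLATE) replaces the Zip program.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: str, sample: str) -> int:
    """Extra compressed bytes needed to encode `sample` after `reference`.

    Assumption: compressing the concatenation approximates "compressing the
    sample with the dictionary of the reference". DEFLATE only looks back
    32 KiB, so only the tail of a long reference actually contributes.
    """
    ref = reference.encode("utf-8")
    return compressed_size(ref + sample.encode("utf-8")) - compressed_size(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog. " * 20
    similar = "the lazy dog jumps over the quick brown fox. "
    different = "pack my box with five dozen liquor jugs, now!"
    # A lower cost means the sample is "closer" to the reference text.
    print(cross_cost(reference, similar), cross_cost(reference, different))
```

To compare candidate authors, one would compute `cross_cost` of the anonymous text against a reference corpus for each author and prefer the smallest value; dividing by the sample length makes texts of different sizes comparable.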

  10. then came the AAAC • in 2004 the American mathematician Patrick Juola proposed the Ad-hoc Authorship Attribution Competition (AAAC) to find experimentally the best method for correctly attributing anonymous works: http://www.mathcs.duq.edu/~juola/authorship_contest.html • second-best scorer: Vlado Keselj, with a method based on measuring n-gram frequencies

  11. the state of the QAA world in 2005 • in 2002 Jack Grieve, in his thesis “Quantitative Authorship Attribution: A History And An Evaluation Of Techniques”, counted at least 39 known and used methods, with 93 variants, for Quantitative AA • the aim of the AAAC: prune the useless methods • nevertheless, this continues to be craftsmanship rather than science

  12. in 2005 we started • we had to prove to the Fondazione Gramsci that Quantitative AA produced good results • we chose to use two QAA methods: • relative entropy (already described) • n-gram distances (which gave Keselj 2nd place in the AAAC)

  13. the protocol • phase 1: 50 surely Gramscian texts; 50 surely non-Gramscian texts; • do whatever you like to be able to recognize the Gramscian as Gramscian and the non-Gramscian as non-Gramscian • phase 2 (blind test): 40 unidentified texts, some Gramscian and some not: classify them correctly

  14. text preparation • deletion of: • citations of any length • proper nouns • numbers • no lemmatization: e.g. the choice of a given tense and person of a verb carries some quantity of information that we cannot evaluate well enough to justify discarding it
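
As a rough illustration of this preparation step, the Python sketch below strips quoted passages, numbers and a caller-supplied list of proper nouns, and leaves inflected word forms untouched (no lemmatization). It rests on assumptions not stated in the slides: citations are taken to be text between quotation marks, and proper nouns are not detected automatically but passed in as a set. The name `prepare` is hypothetical.

```python
import re

# Assumption: a "citation" is any span enclosed in straight, curly or angle quotes.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”|«[^»]*»')
# Digit runs, optionally with thousands or decimal separators (e.g. 1.200 or 3,5).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def prepare(text: str, proper_nouns: set) -> str:
    """Remove quoted spans, numbers and the given proper nouns; keep word forms."""
    text = QUOTED.sub(" ", text)
    text = NUMBER.sub(" ", text)
    if proper_nouns:
        # Longest names first, so multi-word names win over their parts.
        names = sorted(proper_nouns, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
        text = re.sub(pattern, " ", text)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "Nel 1917 Gramsci scrisse «la città futura» a Torino."
    print(prepare(raw, {"Gramsci", "Torino"}))
```

In practice the list of proper nouns would have to come from an index of names or from manual markup; the sketch only shows where that list plugs in.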

  15. n-grams • sequences of n entities that you must choose (we chose characters) • sliding n-grams: in “final” the 3-grams read fin, ina, nal • to find the right n you run tests • n-grams capture fragments of meaning, syntax, collocations/co-occurrences, etc. • you have a dictionary of Gramscian n-grams • you check the n-grams of your anonymous texts; you count the matches and the non-matches and take the algebraic sum: if it is positive the text is Gramscian, if negative it is not
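
The sliding n-grams and the signed match count described on this slide can be sketched as follows. This is a literal, simplified reading of the slide (each match counts +1, each non-match counts -1), not the weighting actually used in the study; the function names are hypothetical.

```python
def char_ngrams(text: str, n: int) -> list:
    """All overlapping character n-grams of `text`, in reading order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def signed_match_score(text: str, reference_ngrams: set, n: int) -> int:
    """Matches minus non-matches against a reference n-gram dictionary.

    Assumption: every n-gram occurrence weighs the same. A positive score
    suggests the reference author, a negative score suggests someone else.
    """
    grams = char_ngrams(text, n)
    matches = sum(1 for g in grams if g in reference_ngrams)
    return matches - (len(grams) - matches)


if __name__ == "__main__":
    print(char_ngrams("final", 3))  # ['fin', 'ina', 'nal']
    reference = set(char_ngrams("finale", 3))
    print(signed_match_score("final", reference, 3))
    print(signed_match_score("xyzzy", reference, 3))
```

The value of `n` is left as a parameter because, as the slide says, the right value has to be found by testing.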

  16. strategy • maximize the correct attributions • while at the same time avoiding false attributions • = some missed attributions are acceptable as long as you don’t produce false attributions • the institution that commissioned the work must be able to trust you

  17. strategy 2 • we don’t know if, how, and how much the “parole” of an author changes across subject matter, audience, genre, time, … • so we decided to work on well-defined periods, leaving the choice of their boundaries to the Gramsci experts • 1st period: 1914-1921

  18. a little maths • having two methods at work, we could build a Cartesian plane on which the results of the measures were plotted, after a normalization bringing them into the range -1 to +1
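
The slides do not say which normalization was used. Purely as an assumption, the Python sketch below divides each method's scores by their largest absolute value, which maps them into the range -1 to +1 while keeping their sign, and then pairs the two normalized scores of each text into a point on the plane. The names `scale_to_unit` and `to_points` are hypothetical.

```python
def scale_to_unit(values: list) -> list:
    """Divide by the largest absolute value so results lie in [-1, +1]."""
    largest = max((abs(v) for v in values), default=0)
    if largest == 0:
        return [0.0 for _ in values]
    return [v / largest for v in values]


def to_points(entropy_scores: list, ngram_scores: list) -> list:
    """One (x, y) point per text: x from the entropy method, y from n-grams."""
    return list(zip(scale_to_unit(entropy_scores), scale_to_unit(ngram_scores)))


if __name__ == "__main__":
    # Made-up scores for three texts, only to show the shape of the output.
    print(to_points([-4, 2, 0], [10, -5, 5]))
```

On such a plane, texts on which both methods agree fall in the same quadrant, while texts near an axis or in the mixed quadrants are the ones that need a closer look.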

  19. phase 1 - setup

  20. phase 2 – blind test

  21. the day after • we started to do the attributions, paid by the Fondazione Gramsci for it, without knowing anything about the texts, and giving periodic reports to the historians who were editors of the various volumes of the national edition of Gramsci’s works • we got the texts, normalized them, measured them, and produced a report that we sent to the Fondazione Gramsci • the historians’ evaluation of the QAA: no proposed attribution was unacceptable, even if not every proposed attribution was accepted • [example of report]

  22. now we have stopped • due to the cuts to research funds, the national edition is currently on hold

  23. some practical principles on AA • no tool can ‘read’ a text and tell you: this text was written by Francesco Stella • you can only classify the texts you chose to work on, as processed by the tool you use • all of the texts will be connected: you must interpret the results • you must mix anonymous or disputed works with “control works”: same period, same genre, same language, same author, similar authors, …

  24. be careful • when you have proper nouns in your works, it’s easy to classify them: • R. Clement and D. Sharp, Ngram and Bayesian Classification of Documents for Topic and Authorship, “LLC”, 2003, 18(4):423-447 • but you don’t really classify the texts, you classify the collections of proper nouns they contain

  25. why the Gramsci case was/is difficult and strange • articles are very short: between 300 and 1000/1200 words • all of these articles share: matters, ideology, context • there is no countercheck, and you work for a scientific and productive initiative (it’s not ‘simply’ an experiment) • the tables showing the matches are sparse tables, nevertheless these data work well

  26. now what • Patrick Juola, the mathematician who proposed the AAAC, has released JGAAP, a package offering various tools for QAA: • http://evllabs.com/jgaap/w/index.php/ • the R package stylo is impressive, and I wish we had had it when we started our work on the Gramsci texts

  27. some references to start from • C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, An example of mathematical authorship attribution, “Journal Of Mathematical Physics”, 2008, 49, pp. 1 – 20 • C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, L'attribuzione dei testi gramsciani: metodi e modelli matematici, “La Matematica nella Società e nella Cultura”, 2010, 3, pp. 235 – 269 • M. Lana, Come scriveva Gramsci? Metodi matematici per riconoscere scritti gramsciani anonimi, “Informatica Umanistica”, 2010, 3, 31-56

  28. some references (2) • M. Lana, Individuare scritti gramsciani anonimi in un “corpus” giornalistico. Il ruolo dei metodi quantitativi, “Studi storici: rivista trimestrale dell'Istituto Gramsci”, 52 (4), 859-880 • P. Juola, Authorship Attribution, “Foundations and Trends in Information Retrieval”, Vol. 1, No. 3 (2006) 233–334 http://www.conll.org/~walter/educational/material/fnt-aa.pdf • J. Grieve, Quantitative Authorship Attribution: An Evaluation of Techniques, LLC 22: 251-270 http://dl.dropboxusercontent.com/u/99161057/Grieve_authorshipattribution.pdf

  29. thanks!
