Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do

Alexei Lavrentiev (Alexei.Lavrentev@ens-lsh.fr), Ecole Normale Supérieure Lettres et Sciences humaines, Lyon, France. University of Kentucky, October 24, 2007.


Presentation Transcript


  1. Alexei Lavrentiev, Alexei.Lavrentev@ens-lsh.fr, Ecole Normale Supérieure Lettres et Sciences humaines, Lyon, France. Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do. University of Kentucky, October 24, 2007

  2. Two projects • Scholarly re-edition of an 1861 “Anonymous” folklore collection • Corpus of Medieval French manuscript transcriptions for the study of punctuation

  3. Folklore Project 1/14

  4. Folklore Project 2/14 Project Team • Vera Kuznetsova • Senior Researcher, Institute of Philology SB RAS • Specialist in Russian folklore • Olga Laguta • Professor, Novosibirsk State University • Linguist • Alexei Lavrentiev

  5. Folklore Project 3/14 Objectives • Verify the authenticity of folklore texts in the collection • Analyze linguistic features of the texts • Learn more about the author of the collection • Make these texts available to scholarly community

  6. Folklore Project 4/14 Challenges • Encode data in a sustainable format (TEI XML) using available tools • Microsoft Office (Word, Access) • XML processing software (XML Spy) • Perl • Configure the tools for users with virtually no IT experience

  7. Folklore Project 5/14 Workflow (diagram labels) • Metadata • Tokenized XML-TEI documents • Word documents • XSL stylesheets • Perl script • Lemmatized XML-TEI documents • Access database • Printed edition • Linguistic analysis • Vocabulary with contexts

  8. Folklore Project 6/14 Word document

  9. Folklore Project 7/14 Metadata file [1. File name] chtochelovekzakhochet ; [number] 20 ; [2. Title of the text (in the source)] Что человек захочет, то и сделает ("Whatever a person wants, that he will do") ; [3. Title of the text (working)] Что человек захочет ("Whatever a person wants") ; [4. Collective editor of the electronic version] Sector of the Russian Language in Siberia, Institute of Philology SB RAS ; [5. Responsible persons] : [role] Text input and preliminary markup ; [full name] Kuznetsova Vera Stanislavovna, Aleshina Olga Nikolaevna ; [role] Conversion to XML-TEI format, validation ; [full name] Lavrentiev Alexei Mikhailovich . [6. Project information] : Corpus of Russian folklore prose texts (legends) ; [7. Source information] : [Information on editor(s), compiler(s), etc.] : [role] preparation for publication ; [full name] Kuznetsova Vera Stanislavovna ; [role] compiler of the collection ; [full name] anonymous ; [role] author of the recording ; [full name] not specified . [Place of recording] not specified ; [Publisher] F. Ivanov printing house ; [Place of publication] Saint Petersburg ; [Year of publication] 1861 ; [ISBN] ???? .
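The bracketed-label layout of the metadata file lends itself to simple pattern matching. The following is a minimal sketch in Python (the project itself used Perl) that reads such a file into an ordered list of label/value pairs. The function name `parse_metadata`, the `FIELD` pattern, and the shortened sample are assumptions made for illustration, not the project's actual code.

```python
import re

# Hypothetical, shortened sample in the bracketed-label style of the slide.
SAMPLE = (
    "[1. File name] chtochelovekzakhochet ; [number] 20 ; "
    "[Place of publication] Saint Petersburg ; "
    "[Year of publication] 1861 ; [ISBN] ???? ."
)

# A field is a label in square brackets followed by everything up to the
# next opening bracket.
FIELD = re.compile(r"\[([^\]]+)\]([^\[]*)")


def parse_metadata(text):
    """Return a list of (label, value) pairs in document order."""
    pairs = []
    for label, raw in FIELD.findall(text):
        # Drop the separator that closes the field (";", ":" or ".").
        value = raw.strip().rstrip(";:.").strip()
        pairs.append((label.strip(), value))
    return pairs


if __name__ == "__main__":
    for label, value in parse_metadata(SAMPLE):
        print(f"{label}: {value}")
```

A list of pairs is used rather than a dictionary because labels such as the role and name fields repeat within one file. Group headers that are followed only by a colon come back with an empty value, and a value that legitimately ends in a period would lose it, so a production script would need a stricter grammar.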

  10. Folklore Project 8/14 Perl script • Takes a Word document saved in HTML (filtered) format • Takes the metadata • Produces an XML-TEI document • Tokenizes the text and assigns IDs to <w> and <s> elements • Transforms analytical markup into <seg type="…"> elements
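The tokenization step can be illustrated with a short sketch. It is written in Python rather than Perl, and the sentence-splitting rule, the ID scheme, and the `tokenize` function are assumptions chosen for illustration; the slides do not show how the original script decides on boundaries or names its IDs.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def tokenize(text, prefix="t"):
    """Wrap sentences in <s> and words in <w>, each with a generated xml:id."""
    body = ET.Element("p")
    # Naive rule (assumption): a sentence ends at ".", "!" or "?" followed
    # by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_count = 0
    for s_index, sentence in enumerate(sentences, start=1):
        s_el = ET.SubElement(body, "s", {XML_ID: f"{prefix}_s{s_index:03d}"})
        # \w+ matches Unicode letters, so Cyrillic words are handled too.
        # Punctuation is simply dropped in this sketch.
        for word in re.findall(r"\w+", sentence):
            word_count += 1
            w_el = ET.SubElement(
                s_el, "w", {XML_ID: f"{prefix}_w{word_count:04d}"}
            )
            w_el.text = word
    return body


if __name__ == "__main__":
    root = tokenize("Что человек захочет. То и сделает!", prefix="t20")
    print(ET.tostring(root, encoding="unicode"))
```

Word IDs run continuously across sentences, so that each word keeps a stable identifier that later steps such as lemmatization can refer to.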

  11. Folklore Project 9/14 XML Document

  12. Folklore Project 10/14 XSLT Stylesheets • Produce legible text for proofreading • Produce tables to be exported to the database
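The table export was done with XSLT in the project. As a rough equivalent, the sketch below uses Python's standard library to flatten tokenized XML into comma-separated rows that a database can import. The column names, the sample fragment, and the `words_to_csv` function are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical tokenized fragment; the project's real markup may differ.
SAMPLE = """<text>
  <s xml:id="s001"><w xml:id="w0001">Что</w> <w xml:id="w0002">человек</w></s>
  <s xml:id="s002"><w xml:id="w0003">захочет</w></s>
</text>"""


def words_to_csv(xml_text):
    """Return CSV text with one row per <w>: word id, sentence id, form."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word_id", "sentence_id", "form"])
    for s_el in root.iter("s"):
        for w_el in s_el.iter("w"):
            writer.writerow([w_el.get(XML_ID), s_el.get(XML_ID), w_el.text])
    return out.getvalue()


if __name__ == "__main__":
    print(words_to_csv(SAMPLE))
```

Keeping the word ID in every row is what would allow lemmas entered in the database to be matched back to the XML later.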

  13. Folklore Project 11/14 Access Database

  14. Folklore Project 12/14 Access Database

  15. Folklore Project 13/14 Access Database

  16. Folklore Project 14/14 Results • Printed edition • texts • linguistic analysis supplement • indexes • XML-TEI lemmatized text corpus • XSLT stylesheets • Access database • morphological table • forms for lemmatization and dictionary • Problem: no direct connection between the printed edition and the XML texts

  17. Punctuation Project 1/12 Challenges • Create an adequate representation of linguistically relevant data from a medieval manuscript • Multiple visualizations according to various editing traditions • Annotate and analyze the use of punctuation marks

  18. Punctuation Project 2/12 Project “History” • 1994-1999: first transcriptions using ASCII special characters • 2001: first annotation using Excel • 2003: XML-TEI (Charrette-style) transcriptions • 2005-2007: XML-TEI (Menota-style) transcriptions

  19. Punctuation Project 3/12 “Special” data to be encoded

  20. Punctuation Project 3/12 “Special” data to be encoded • Variant character glyphs

  21. Punctuation Project 3/12 “Special” data to be encoded • Variant character glyphs • Abbreviations

  22. Punctuation Project 3/12 “Special” data to be encoded • Variant character glyphs • Abbreviations • Large initials • “Abnormal” word spacing

  23. Punctuation Project 4/12 Multiple visualizations
      Extract from Ms. Lyon BM, P.A. 77, Queste del saint Graal. Photo: BM Lyon. Transcription: Graal Project.
      “Normalized” presentation: [§ 7] Endementres qu'il parloient einsi si entra laienz uns vaslez qui dist au roi: « Sire noveles vos aport mout merveilleuses. – Queles ?
      “Diplomatic” presentation: [§ 7] ENdementres qu'il parloient einsi si entralaienz uns uaslez qui dist au roi. Sire noueles uos aport mout merueilleuses. Queles
      “Imitative” presentation: [§ 7] ENdementreſ quıl parloıent eínſı ſı entͣ laıenz unſ uaſlez quı dıſt au roı . Sıre noueleſ uoſ apot mout merueılleuſeſ . Queleſ
      XML transcription: <p n="7"> <lb n="6"/> <w xml:id="w016_0251"> <norm>Endementres</norm> <dipl>ENdementres</dipl> <facs><mdv_dropcap letter="E" color="blue" size="2" sizeAct="2">E</mdv_dropcap>Ndementre&slong;</facs> </w> <w aggl="elision" xml:id="w016_0252"> <norm>qu</norm> <dipl>qu</dipl> <facs>qu</facs> </w>
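Because every word carries its normalized, diplomatic and facsimile forms side by side, a presentation can be produced by choosing one child element per word. The Python sketch below illustrates the idea on a simplified, hypothetical fragment: the drop-capital markup and the `&slong;` entity are replaced by plain characters so that the sample parses without a DTD. The treatment of `aggl="elision"` is an assumption inferred from the three presentations on the slide, not a documented rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the slide, simplified for parsing.
SAMPLE = """<p n="7">
  <w><norm>Endementres</norm><dipl>ENdementres</dipl><facs>ENdementreſ</facs></w>
  <w aggl="elision"><norm>qu</norm><dipl>qu</dipl><facs>qu</facs></w>
  <w><norm>il</norm><dipl>il</dipl><facs>ıl</facs></w>
  <w><norm>parloient</norm><dipl>parloient</dipl><facs>parloıent</facs></w>
</p>"""


def render(xml_text, level):
    """Join the chosen child (norm, dipl or facs) of every <w>."""
    root = ET.fromstring(xml_text)
    parts = []
    for w_el in root.iter("w"):
        form = w_el.findtext(level, default="")
        if w_el.get("aggl") == "elision":
            # Assumption: an elided word attaches to the next word, with an
            # apostrophe in every view except the imitative (facs) one.
            parts.append(form + ("" if level == "facs" else "'"))
        else:
            parts.append(form + " ")
    return "".join(parts).strip()


if __name__ == "__main__":
    for level in ("norm", "dipl", "facs"):
        print(level, "->", render(SAMPLE, level))
```

Storing the three forms in parallel means that no presentation has to be derived from another; each view is a straightforward selection.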

  24. Punctuation Project 5/12 Encoding choices • “Menota-style” TEI extension • Multiple representations at the word level (norm, dipl, facs, pal?) • Additional elements • punct, mdv_dropcap, mdv_lb… • Additional attributes • w/@aggl, punct/@force...

  25. Punctuation Project 6/12 Workflow • Compact syntax transcription • XML + “shortcut” characters (cf. Wiki) • Text description using an Access database • Manuscript description • Text typology • Expansion to a standard XML format using a Perl script • Export to tabular format for annotation • Re-integration of annotations into the XML documents • Export and analysis using Weblex software
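The export and re-integration steps of this workflow amount to a round trip keyed on element IDs. The Python sketch below shows one way such a round trip could work. The sample fragment, the function names, and the `ana` attribute that receives the annotation are all hypothetical; the slides do not say which attribute or table layout the project used.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: the ids and the "ana" attribute are illustrative.
SAMPLE = """<p>
  <w xml:id="w1">roi</w><punct xml:id="p1">.</punct>
  <w xml:id="w2">Sire</w><punct xml:id="p2">.</punct>
</p>"""


def export_rows(root):
    """One (id, mark) row per <punct>, ready for a spreadsheet."""
    return [(el.get(XML_ID), el.text) for el in root.iter("punct")]


def reintegrate(root, annotations, attribute="ana"):
    """Write annotation values back onto <punct> elements, matched by id."""
    updated = 0
    for el in root.iter("punct"):
        value = annotations.get(el.get(XML_ID))
        if value is not None:
            el.set(attribute, value)
            updated += 1
    return updated


if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE)
    print(export_rows(tree_root))
    # Pretend an annotator filled in a value for the first mark only.
    print(reintegrate(tree_root, {"p1": "speech-start"}))
    print(ET.tostring(tree_root, encoding="unicode"))
```

Matching on IDs rather than on position keeps the round trip safe even if rows are sorted or filtered during annotation; IDs that no longer exist in the XML are silently ignored in this sketch.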

  26. Punctuation Project 7/12 Compact syntax

  27. Punctuation Project 8/12 Manuscript description

  28. Punctuation Project 9/12 Expanded XML

  29. Punctuation Project 10/12 Annotation

  30. Punctuation Project 11/12 Weblex

  31. Punctuation Project 12/12 Results • 25 fragments of manuscripts transcribed and described • Encoding guidelines • Integrated database of text descriptors (editions and transcriptions) • Perl scripts for conversions • XSLT stylesheets

  32. Thank You!
