Alexei Lavrentiev
Alexei.Lavrentev@ens-lsh.fr
Ecole Normale Supérieure Lettres et Sciences humaines, Lyon, France
Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do
University of Kentucky, October 24, 2007
Two projects
• Scholarly re-edition of an 1861 “Anonymous” folklore collection
• Corpus of Medieval French manuscript transcriptions for the study of punctuation
Folklore Project 2/14 Project Team
• Vera Kuznetsova
  • Senior Researcher, Institute of Philology SB RAS
  • Specialist in Russian folklore
• Olga Laguta
  • Professor, Novosibirsk State University
  • Linguist
• Alexei Lavrentiev
Folklore Project 3/14 Objectives
• Verify the authenticity of the folklore texts in the collection
• Analyze linguistic features of the texts
• Learn more about the author of the collection
• Make these texts available to the scholarly community
Folklore Project 4/14 Challenges
• Encode data in a sustainable format (TEI XML) using available tools
  • Microsoft Office (Word, Access)
  • XML processing software (XML Spy)
  • Perl
• Configure the tools for users with virtually no experience in IT
Folklore Project 5/14 Workflow
[Workflow diagram. Boxes: Word documents; Metadata; Perl script; Tokenized XML-TEI documents; XSL stylesheets; Access database; Lemmatized XML-TEI documents; Printed edition; Linguistic analysis; Vocabulary with contexts]
Folklore Project 6/14 Word document
Folklore Project 7/14 Metadata file
[1. File name] chtochelovekzakhochet ;
[number] 20 ;
[2. Text title (in the source)] Whatever a person wants, that he will do ;
[3. Text title (working)] Whatever a person wants ;
[4. Group editing the electronic version] Russian Language in Siberia Section, Institute of Philology SB RAS ;
[5. Responsible persons] :
  [role] Text entry and preliminary markup ; [full name] Kuznetsova Vera Stanislavovna, Aleshina Olga Nikolaevna ;
  [role] Conversion to XML-TEI format, validation ; [full name] Lavrentiev Alexei Mikhailovich .
[6. Project information] : Corpus of Russian folklore prose texts (legends) ;
[7. Source information] :
  [Information on editor(s), compiler(s), etc.] :
    [role] preparation for publication ; [full name] Kuznetsova Vera Stanislavovna ;
    [role] compiler of the collection ; [full name] anonymous ;
    [role] author of the recording ; [full name] not specified .
  [Place of recording] not specified ;
  [Publisher] F. Ivanov printing house ;
  [Place of publication] Saint Petersburg ;
  [Year of publication] 1861 ;
  [ISBN] ???? .
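The bracketed-label layout of the metadata file lends itself to a simple reader. The following is a minimal Python sketch, not the project's Perl code: it assumes that every field is a `[label]` followed by a value, and that `;`, `:` and `.` act only as separators. The function name `parse_metadata` and the abridged, translated sample are illustrative assumptions.

```python
import re

# Assumption: a field is "[label]" followed by free text up to the next "[".
FIELD = re.compile(r"\[([^\]]+)\]\s*:?\s*([^\[]*)")

def parse_metadata(text):
    """Return a flat list of (label, value) pairs in file order.

    Group labels such as "[5. Responsible persons] :" come back with an
    empty value; the nesting itself is not reconstructed here.
    """
    pairs = []
    for label, raw in FIELD.findall(text):
        # Drop the trailing separators (";" or ".") and surrounding spaces.
        value = raw.strip().rstrip(" ;.")
        pairs.append((label.strip(), value))
    return pairs

# Abridged, translated sample in the same layout (hypothetical input).
SAMPLE = (
    "[1. File name] chtochelovekzakhochet ; [number] 20 ; "
    "[5. Responsible persons] : [role] Text entry and preliminary markup ; "
    "[Publisher] F. Ivanov printing house; [Year of publication] 1861 ; "
    "[ISBN] ???? ."
)

if __name__ == "__main__":
    for label, value in parse_metadata(SAMPLE):
        print(f"{label!r}: {value!r}")
```

A flat list keeps repeated labels (several `[role]` entries) in order, which a dictionary would overwrite.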
Folklore Project 8/14 Perl script
• Takes a Word document saved in HTML (filtered) format
• Takes the metadata
• Produces an XML-TEI document
• Tokenizes the text and assigns IDs to <w> and <s> elements
• Transforms analytical markup into <seg type="…"> elements
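The tokenization step can be illustrated in a few lines. This is a Python sketch standing in for the Perl script, which is not shown in the slides; the sentence-splitting rule, the ID scheme (`t_s1`, `t_w1`) and the function name `tokenize_to_tei` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id

def tokenize_to_tei(text, prefix="t"):
    """Wrap sentences in <s> and words in <w>, each with an xml:id."""
    p = ET.Element("p")
    # Naive rule (assumption): a sentence ends at ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    w_count = 0
    for s_num, sentence in enumerate(sentences, start=1):
        s_el = ET.SubElement(p, "s", {XML_ID: f"{prefix}_s{s_num}"})
        last = None
        # \w+ matches Unicode letters, so Cyrillic words are handled too.
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            if re.fullmatch(r"\w+", token):
                if last is not None:
                    last.tail = (last.tail or "") + " "
                w_count += 1
                last = ET.SubElement(s_el, "w", {XML_ID: f"{prefix}_w{w_count}"})
                last.text = token
            elif last is not None:
                # Punctuation stays as plain text after the preceding word.
                last.tail = (last.tail or "") + token
            else:
                s_el.text = (s_el.text or "") + token
    return p

if __name__ == "__main__":
    tei = tokenize_to_tei("Что человек захочет, то и сделает. Так и вышло.")
    print(ET.tostring(tei, encoding="unicode"))
```

Word IDs run across the whole text rather than restarting in each sentence, so a word keeps the same ID even if sentence boundaries are corrected later.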
Folklore Project 9/14 XML Document
Folklore Project 10/14 XSLT Stylesheets
• Produce legible text for proofreading
• Produce tables to be exported to the database
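The table export is done with XSLT in the project. As a rough equivalent, the Python sketch below turns tokenized <s>/<w> markup into rows that a database can import. The column names, the absence of a TEI namespace in the sample, and the function names are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def words_to_rows(tei_xml):
    """Return one (word_id, sentence_id, form) tuple per <w> element."""
    root = ET.fromstring(tei_xml)
    rows = []
    for s in root.iter("s"):
        for w in s.iter("w"):
            rows.append((w.get(XML_ID), s.get(XML_ID), w.text or ""))
    return rows

def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word_id", "sentence_id", "form"])
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical two-word sample in the tokenized layout.
SAMPLE = (
    '<p><s xml:id="s1"><w xml:id="w1">Что</w> '
    '<w xml:id="w2">человек</w></s></p>'
)

if __name__ == "__main__":
    print(rows_to_csv(words_to_rows(SAMPLE)))
```

Keeping the word ID in every row is what allows lemmas entered in the database to be matched back to the XML later.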
Folklore Project 11/14 Access Database
Folklore Project 12/14 Access Database
Folklore Project 13/14 Access Database
Folklore Project 14/14 Results
• Printed edition
  • texts
  • linguistic analysis supplement
  • indexes
• XML-TEI lemmatized text corpus
• XSLT stylesheets
• Access database
  • morphological table
  • forms for lemmatization and dictionary
• Problem: no direct connection between the printed edition and the XML texts
Punctuation Project 1/12 Challenges
• Create an adequate representation of linguistically relevant data from a medieval manuscript
• Multiple visualizations according to various editing traditions
• Annotate and analyze the use of punctuation marks
Punctuation Project 2/12 Project “History”
• 1994-1999: first transcriptions using ASCII special characters
• 2001: first annotation using Excel
• 2003: XML-TEI (Charrette-style) transcriptions
• 2005-2007: XML-TEI (Menota-style) transcriptions
Punctuation Project 3/12 “Special” data to be encoded
• Variant character glyphs
• Abbreviations
• Large initials
• “Abnormal” word spacing
Punctuation Project 4/12 Multiple visualizations
“Normalized” Presentation
[ § 7] Endementres qu'il parloient einsi si entra laienz uns vaslez qui dist au roi: « Sire noveles vos aport mout merveilleuses. – Queles ?
XML Transcription
<p n="7">
  <lb n="6"/>
  <w xml:id="w016_0251">
    <norm>Endementres</norm>
    <dipl>ENdementres</dipl>
    <facs><mdv_dropcap letter="E" color="blue" size="2" sizeAct="2">E</mdv_dropcap>Ndementre&slong;</facs>
  </w>
  <w aggl="elision" xml:id="w016_0252">
    <norm>qu</norm>
    <dipl>qu</dipl>
    <facs>qu</facs>
  </w>
“Diplomatic” Presentation
[ § 7] ENdementres qu'il parloient einsi si entralaienz uns uaslez qui dist au roi. Sire noueles uos aport mout merueilleuses. Queles
“Imitative” Presentation
[ § 7] ENdementreſ quıl parloıent eínſı ſı entͣ laıenz unſ uaſlez quı dıſt au roı . Sıre noueleſ uoſ apot mout merueılleuſeſ . Queleſ
Extract from Ms. Lyon BM, P.A. 77, Queste del saint Graal. Photo: BM Lyon. Transcription: Graal Project
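Because every word carries its own norm, dipl and facs forms, one transcription can produce each presentation by picking a level. The Python sketch below illustrates the idea and is not the project's stylesheet. Several things in it are assumptions: the third word and its ID, the literal ſ used in place of the `&slong;` entity, and the rule that `aggl="elision"` joins a word to the next one with an apostrophe (omitted at the facs level). That rule is inferred only from the three presentations on the slide.

```python
import xml.etree.ElementTree as ET

# Simplified sample; the w016_0253 entry is hypothetical.
SAMPLE = """<p n="7">
  <w xml:id="w016_0251">
    <norm>Endementres</norm>
    <dipl>ENdementres</dipl>
    <facs><mdv_dropcap letter="E">E</mdv_dropcap>Ndementreſ</facs>
  </w>
  <w aggl="elision" xml:id="w016_0252">
    <norm>qu</norm><dipl>qu</dipl><facs>qu</facs>
  </w>
  <w xml:id="w016_0253">
    <norm>il</norm><dipl>il</dipl><facs>ıl</facs>
  </w>
</p>"""

def render(p, level):
    """Join the chosen level ('norm', 'dipl' or 'facs') of every <w>."""
    out = []
    for w in p.iter("w"):
        # itertext() also collects text nested in mdv_dropcap.
        form = "".join(w.find(level).itertext()).strip()
        if w.get("aggl") == "elision":
            # Assumed rule: no space after an elided word; the apostrophe
            # is editorial and therefore absent from the facs level.
            out.append(form + ("" if level == "facs" else "'"))
        else:
            out.append(form + " ")
    return "".join(out).strip()

if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE)
    for level in ("norm", "dipl", "facs"):
        print(level, ":", render(paragraph, level))
```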
Punctuation Project 5/12 Encoding choices
• “Menota-style” TEI extension
• Multiple representations at the word level (norm, dipl, facs, pal?)
• Additional elements
  • punct, mdv_dropcap, mdv_lb…
• Additional attributes
  • w/@aggl, punct/@force...
Punctuation Project 6/12 Workflow
• Compact syntax transcription
  • xml + “shortcut” characters (cf. Wiki)
• Text description using Access Database
  • Ms Description
  • Text typology
• Expanding to a standard XML format using a Perl script
• Export to tabular format for annotation
• Re-integration of annotation into XML documents
• Export and analysis using Weblex software
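The re-integration step relies on each annotated element keeping its ID in the exported table. The following Python sketch shows one way this could work and is not the project's Perl script. The tab-separated layout, the column names, the `force` values ("strong", "weak") and the function name `reintegrate` are all assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def reintegrate(xml_text, table_text, attribute="force"):
    """Copy one annotation column from a tab-separated table into the XML.

    Returns the updated tree and the IDs found in the table but not in
    the XML, so that mismatches can be checked by hand.
    """
    root = ET.fromstring(xml_text)
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    missing = []
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        element = by_id.get(row["id"])
        if element is None:
            missing.append(row["id"])
        else:
            element.set(attribute, row[attribute])
    return root, missing

# Hypothetical inputs: three tokens and a two-row annotation table.
XML_SAMPLE = (
    '<p><w xml:id="w1">roi</w><punct xml:id="p1">.</punct>'
    '<w xml:id="w2">Sire</w></p>'
)
TABLE_SAMPLE = "id\tforce\np1\tstrong\np9\tweak\n"

if __name__ == "__main__":
    tree, not_found = reintegrate(XML_SAMPLE, TABLE_SAMPLE)
    print(ET.tostring(tree, encoding="unicode"))
    print("IDs not found:", not_found)
```

Reporting unmatched IDs instead of failing outright suits a workflow in which the table is edited by hand between export and re-import.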
Punctuation Project 7/12 Compact syntax
Punctuation Project 8/12 Manuscript description
Punctuation Project 9/12 Expanded XML
Punctuation Project 10/12 Annotation
Punctuation Project 11/12 Weblex
Punctuation Project 12/12 Results
• 25 fragments of manuscripts transcribed and described
• Encoding guidelines
• Integrated database of text descriptors (editions and transcriptions)
• Perl scripts for conversions
• XSLT stylesheets