
TEI for Interactive Concordances: The New Menota Search System Øyvind Eide and Vemund Olstad


Presentation Transcript


  1. TEI for Interactive Concordances: The New Menota Search System Øyvind Eide and Vemund Olstad Unit for Digital Documentation University of Oslo

  2. The Menota network • Menota is a network of institutions working with medieval texts • 18 member institutions so far, in four countries (Sweden, Norway, Denmark and Iceland) • Governed by a board with one representative from each country • All institutions meet annually for a council meeting to discuss status and challenges

  3. Text publication • Anyone can add texts to the archive, but: • Texts have to comply with Menota encoding standards • Extended TEI P5 • Detailed encoding manual available • Menota authorization required for repository access • Texts added to the repository are available for browsing instantly, but scripting/processing by the publisher is required to add them to the search corpus

  4. Search and display • Search form data filtered through Cocoon to CGI • CGI script queries Corpus Workbench (CWB) • CWB builds search result as TEI KWIC (P5 compliant) then returns it to Cocoon • Search result is filtered by Cocoon and various style sheets to a concordance list • Added concordance functions: Context-aware menu, link to text view/dictionary, etc.

  5. Search system architecture

  6. Web frontend (Apache Cocoon) • Views static files (handbook, minutes, news list, etc.) • Views dynamic files (Menotic texts as HTML/PDF, converted by XSL on the fly) • Searches Corpus Workbench (Menotic texts, converted to corpus format)

  7. Search system (Perl) • Receives the search parameters from CGI • Reformats them into a CQP query • Sends the query to the CWB API • Receives result objects • Reformats them into TEI KWIC (P5) • Returns the result as an HTTP reply
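The "reformats into a CQP query" step can be illustrated with a minimal sketch. It is written in Python rather than the Perl the system uses, and the function name `build_cqp_query` and the parameter names are hypothetical; only the resulting query shape, `[word="Sæm.*"]`, is taken from the header example on slide 11.

```python
def build_cqp_query(params):
    """Turn a dict of form parameters into a single-token CQP query.

    Keys are assumed to be CWB positional attribute names (e.g. "word",
    "lemma"); values are regular expressions typed into the search form.
    """
    constraints = []
    for attr, pattern in params.items():
        if not pattern:
            continue  # skip form fields the user left empty
        if '"' in pattern:
            # Quote escaping is left out of this sketch on purpose.
            raise ValueError("double quotes are not supported in this sketch")
        constraints.append(f'{attr}="{pattern}"')
    if not constraints:
        raise ValueError("no search constraints given")
    # Several constraints on the same token are joined with "&".
    return "[" + " & ".join(constraints) + "]"


print(build_cqp_query({"word": "Sæm.*", "lemma": ""}))  # [word="Sæm.*"]
```

A real implementation would also need to handle multi-token queries and quoting, which this sketch deliberately skips.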

  8. Corpus (CWB) • Uses a UTF-8 aware CWB 3.2 • Moderate size (currently about 1 million tokens) • Currently accessed through CGI only

  9. The query format • Search submitted through a standard web form • Form variables captured by Cocoon • New CGI search built by the Cocoon sitemap • The CGI format (parameters and arrays of parameters) is sent to Perl • Local query format (semantics not based on a standard)

  10. Reply format: Result sets in TEI

  11. TEI KWIC header (needs more work) <teiHeader><fileDesc><titleStmt><title type="main">Search result from Menota corpus</title><title type="sub">Searched for [word="Sæm.*" ] and had 28 hits.</title></titleStmt><publicationStmt><p>For internal system use in Menota system</p></publicationStmt><sourceDesc><p>Machine generated based on Menota corpus. More information about Menota can be found on the <ref target="http://www.menota.org/">Menota webpage</ref>.</p></sourceDesc><editionStmt><ab type="searchWord">Sæm.*</ab><ab type="numHits">28</ab></editionStmt></fileDesc><encodingDesc><tagsDecl><namespace name="http://www.tei-c.org/ns/1.0"><tagUsage gi="w">The attribute n is used for the identification of a word within the Menota file from which it is retrieved.</tagUsage></namespace></tagsDecl></encodingDesc></teiHeader>

  12. TEI KWIC body <body> <div type="corpus"><p>CGI hits: 10</p><p>Last: 10</p><p>Verkdel: </p><list><item><ref target="AM-63-fol"/><w ...>Gothormr</w><w ...>ſonr</w><w ...>Haralds</w><w ...>flettis</w><w ...>oc</w><w type="keyword" … > Sæmundr</w> <w ...>húsfreyia</w> <w ...>hann</w><w ...>atti</w><w ...>Jngibiorgu</w><w ...>dóttor</w> </item><item><ref target="AM-63-fol"/> ...</item><item><ref target="HolmPerg-17-4to"/> ...</item></list></div></body>
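A consumer of this body could turn each `<item>` into a concordance line roughly as follows. This is a sketch only: the sample document is a cut-down version of the slide, with the elided `<w>` attributes omitted; it assumes the elements are in the TEI namespace named in the header; and `kwic_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Cut-down sample modelled on the slide; attributes on <w> are omitted.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="corpus">
    <list>
      <item>
        <ref target="AM-63-fol"/>
        <w>flettis</w><w>oc</w>
        <w type="keyword">Sæmundr</w>
        <w>húsfreyia</w><w>hann</w>
      </item>
    </list>
  </div>
</body>"""


def kwic_lines(xml_text):
    """Return (source, left context, keyword, right context) per hit."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iterfind(".//tei:item", TEI_NS):
        ref = item.find("tei:ref", TEI_NS)
        source = ref.get("target") if ref is not None else None
        left, keyword, right = [], None, []
        for w in item.iterfind("tei:w", TEI_NS):
            text = (w.text or "").strip()
            if w.get("type") == "keyword" and keyword is None:
                keyword = text        # the first keyword splits the context
            elif keyword is None:
                left.append(text)     # words before the keyword
            else:
                right.append(text)    # words after the keyword
        lines.append((source, " ".join(left), keyword, " ".join(right)))
    return lines


for source, left, keyword, right in kwic_lines(SAMPLE):
    print(f"{source}: {left} [{keyword}] {right}")
```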

  13. The w element <w type="keyword" n="w49407" lemma="sem" me:msaX="CU" me:msaI="" me:msaG="" me:msaN="" me:msaC="" me:msaS="" me:msaR="" me:msaP="" me:msaT="" me:msaM="" me:msaV="" me:msaF="" me:msaE="" me:msaY="IN" context="[TEI][text xml:lang='onw'][body][div org='uniform' part='N' sample='complete' type='chapter'][p]">Sæm</w> <w n="w49328" lemma="félagskapr" me:msaX="NC" me:msaI="" me:msaG="M" me:msaN="S" me:msaC="D" me:msaS="I" me:msaR="" me:msaP="" me:msaT="" me:msaM="" me:msaV="" me:msaF="" me:msaE="" me:msaY="" context="[TEI][text xml:lang='onw'][body][div org='uniform' part='N' sample='complete' type='chapter'][p]">fælagskap</w>
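The `context` attribute records the element path of the word in its source file as a bracketed string. A client could unpack it along these lines; this is a sketch that assumes the format is exactly what the two examples show (one `[...]` group per ancestor, single-quoted attribute values, no nested brackets), and `parse_context` is a hypothetical name.

```python
import re

# One "[...]" group per ancestor element; assumes no nested brackets.
STEP_RE = re.compile(r"\[([^\[\]]+)\]")
# name='value' pairs inside a group; assumes single-quoted values.
ATTR_RE = re.compile(r"([\w:.-]+)='([^']*)'")


def parse_context(context):
    """Split a context string into a list of (element name, attributes)."""
    steps = []
    for inner in STEP_RE.findall(context):
        name = inner.split(None, 1)[0]       # first token is the element name
        attrs = dict(ATTR_RE.findall(inner)) # remaining tokens are attributes
        steps.append((name, attrs))
    return steps


CONTEXT = ("[TEI][text xml:lang='onw'][body]"
           "[div org='uniform' part='N' sample='complete' type='chapter'][p]")

for name, attrs in parse_context(CONTEXT):
    print(name, attrs)
```

For the string on the slide this yields five steps, from `TEI` down to `p`, with the language code `onw` attached to `text`.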

  14. Serving external systems • Currently, searches must be in CGI format to get a TEI KWIC reply • This can be used by external systems • Would like a better format for searches • Standardised TEI KWIC format as part of P5? • Wider interoperability: export to Open Annotation format? To other formalisms?

  15. Using external systems • If other systems replied in TEI KWIC, we could integrate them into our searches • Must define a merge operation on TEI KWIC • Include a proposal for a TEI KWIC format in the guidelines?
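The talk leaves the merge operation undefined. Purely as an illustration of one possible definition, the sketch below appends the hits of a second result to the first and drops duplicates, treating two hits as the same when they share the `<ref>` target and the `n` value of the keyword. That identity rule, the `n` values in the sample data, and the names `hit_key` and `merge_kwic` are all assumptions, not part of the Menota system.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def hit_key(item):
    """Identify a hit by (source text, keyword n); an assumed rule."""
    ref = item.find("tei:ref", NS)
    kw = item.find("tei:w[@type='keyword']", NS)
    return (ref.get("target") if ref is not None else None,
            kw.get("n") if kw is not None else None)


def merge_kwic(xml_a, xml_b):
    """Append hits from xml_b to the <list> of xml_a, skipping duplicates."""
    root_a = ET.fromstring(xml_a)
    root_b = ET.fromstring(xml_b)
    target = root_a.find(".//tei:list", NS)  # assumes a <list> is present
    seen = {hit_key(item) for item in target.findall("tei:item", NS)}
    for item in root_b.iterfind(".//tei:list/tei:item", NS):
        key = hit_key(item)
        if key not in seen:
            target.append(item)
            seen.add(key)
    return root_a


def sample(hits):
    """Build a tiny result body; the n values passed in are made up."""
    items = "".join(
        f'<item><ref target="{t}"/><w type="keyword" n="{n}">x</w></item>'
        for t, n in hits)
    return (f'<body xmlns="http://www.tei-c.org/ns/1.0">'
            f'<list>{items}</list></body>')


DOC_A = sample([("AM-63-fol", "w1")])
DOC_B = sample([("AM-63-fol", "w1"), ("HolmPerg-17-4to", "w2")])
merged = merge_kwic(DOC_A, DOC_B)
print(len(merged.findall(".//tei:item", NS)))  # 2
```

A real merge would also have to reconcile the headers (search word, hit counts) and decide on ordering, which is part of what a standardised TEI KWIC format would need to specify.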

  16. The TEI KWIC document • Storing and using the concordances over time • Well-documented link back to sources for each word • TEI KWIC document returned to user or available for download? • Versioning? • Publishing from TEI KWIC • WHY?

  17. Thank you! http://www.menota.org/
