
Release the power of your XML markup!



Presentation Transcript


  1. Release the power of your XML markup! XML Aware Indexing & Retrieval Architecture http://www.xaira.org

  2. What are digital resources actually for? • integration of disparate sources: texts, commentaries, sources, variations… multimedia, manuscripts, transcriptions, metadata… • a new way of preservation: media disappear, data remain; "multiplication beyond the reach of accident" • a huge expansion of accessibility: quantitative … and also qualitative

  3. … And are we delivering? • integration of disparate sources: Different user communities have different, and sometimes contradictory, agendas and priorities • a new way of preservation: The business model is still unclear: is digitization a public good? And the technical problems may be insuperable • a huge expansion of accessibility: Implies a huge expansion of metadata provision. But what about qualitative expansion?

  4. The majority view • Libraries collect and distribute Texts • What is a text? • It’s a book • Or an electronic simulation of one • Texts are cultural artefacts, which need to be treated like other cultural artefacts

  5. This is good news for … • Librarians and other cultural guardians • An emergent new class of textual editors • Students in search of easy dissertation topics • Antiquarians of all sorts (until the books run out)

  6. A minority view… • What is a Text? • The book is not the text! • Page images are not the text! • Markup is not the text! • The text is all IN YOUR HEAD • It’s an avatar for the language • Language is more interesting than texts

  7. Remember the concordance? • Defamiliarizes and decontextualizes the components of a text • Facilitates analysis of: lexis, syntax, and lexical patterns; co-occurrence, collocation, colligation • Informed by metadata categorization and accumulated interpretation • A way of reading a text in its context as a means of discovering its primings
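
As an illustration of what a concordance does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not Xaira code: the `kwic` function, the regex tokenizer and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    # Naive tokenization on word characters; a real tool would follow
    # Unicode or markup-driven rules instead.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = "The cat sat on the mat. The dog sat on the cat."
for left, hit, right in kwic(sample, "sat", width=2):
    # Align the hits in a column, as a printed concordance would.
    print(f"{left:>12} [{hit}] {right}")
```

Lining the hits up in a column is what pulls each word out of its running text and makes recurring neighbours visible.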

  8. From SARA to XAIRA… • SARA (SGML-Aware Retrieval Application) was BNC-specific • XAIRA (XML-Aware Indexing and Retrieval Architecture) is a generic XML corpus-searching toolkit • Currently available for MS Windows only • Open source version under development • Join our beta test at http://www.xaira.org • Object-oriented design

  9. XAIRA: the key features • Supports word search, concordance generation and manipulation, collocation, lexical analysis • Uses XML annotation to the max and supports XML-aware complex queries • Leverages existing standards: TEI/XCES, Unicode, CSS and XML Web services • Uses efficient and compact indexing appropriate to small or huge corpora

  10. First catch your corpus… • any collection of well-formed XML documents • if a DTD is supplied, the corpus must be valid • if no TEI header is present, one will be created • the more you put in, the more you get out • "texts" are defined independently of file structure, as are the relevant units within them • all indexing information is stored in the corpus header
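
The well-formedness requirement can be checked before indexing. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` is an assumption for this example, and the check covers well-formedness only (ElementTree does not validate against a DTD).

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<text><s>Hello</s></text>"))  # properly nested
print(is_well_formed("<text><s>Hello</text>"))      # <s> is never closed
```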

  11. Next, build your index… • Can be done simply by adding appropriate declarations to the TEI Header and running the indexer utility • But probably easier to do with the supplied Indextools utility, which: organizes and validates the files you are using; updates (or creates) the header with tokenization and indexing rules, tag and attribute usage, descriptive codebooks etc., "bibliographic" metadata, and default behaviour for character encoding, formats used, etc.; optionally runs and tests the indexer • And there is also a wizard…

  12. What goes in the index? • tokenization • implicit, following Unicode rules (locale-sensitive) • explicit, following mark up • supports lexical features (eg collocation) • lemmatization and POS tags • special case of "additional key" mechanism • generalized to provide fast context-specific searches • tag indexes • attribute values and codebooks
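
To make the idea of a word-form index plus an "additional key" index concrete, here is a minimal Python sketch. It does not reproduce Xaira's index format: the `<w>` elements with `pos` and `lemma` attributes, the sample sentence and the `build_indexes` function are assumptions chosen for illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical explicitly tokenized markup: one <w> element per token.
SAMPLE = (
    '<s><w pos="DET" lemma="the">The</w> '
    '<w pos="NOUN" lemma="fish">fish</w> '
    '<w pos="VERB" lemma="swim">swam</w></s>'
)

def build_indexes(xml_text):
    """Map word forms and (lemma, pos) keys to token positions."""
    form_index = defaultdict(list)  # word form -> positions
    key_index = defaultdict(list)   # (lemma, pos) -> positions
    root = ET.fromstring(xml_text)
    for position, w in enumerate(root.iter("w")):
        form_index[w.text.lower()].append(position)
        key_index[(w.get("lemma"), w.get("pos"))].append(position)
    return form_index, key_index

forms, keys = build_indexes(SAMPLE)
print(forms["fish"])            # positions of the form "fish"
print(keys[("swim", "VERB")])   # positions of the lemma "swim" used as a verb
```

Keeping the secondary keys in their own index is what lets a lemma or part-of-speech search avoid rescanning the text.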

  13. The architecture [diagram labels: corpus, TEI Header, index, lexica; Xairo (or χρ): the Xaira Object model; WSDL client, PC client, Web client]

  14. Hoorah for Unicode • All data is held internally as Unicode • this allows us to defer most problems (e.g. tokenization, case-folding, line-breaking, character normalization, glyph composition) to someone else! • User interface issues • For output, use one or more appropriate fonts • For input, we provide a keyboard definition utility • Unicode rules can also be modified…
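
Two of the problems listed above, case-folding and character normalization, can be handed to a standard Unicode library. The Python sketch below uses the standard `unicodedata` module; the `normalize_key` helper is an assumption for this example and is a simplification of full Unicode caseless matching.

```python
import unicodedata

def normalize_key(word):
    """Compose characters (NFC) and apply Unicode case folding."""
    return unicodedata.normalize("NFC", word).casefold()

# "e" + combining acute accent versus the precomposed capital letter.
decomposed = "Cafe\u0301"
precomposed = "CAF\u00c9"
print(normalize_key(decomposed) == normalize_key(precomposed))
print(normalize_key("Stra\u00dfe"))  # case folding expands the sharp s to "ss"
```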

  15. Client/protocols • The original SARA protocol: Corpus Query Language; ad-hoc ASCII strings • Now revised completely • Sara Object Model can be accessed: directly by the client; as a web service; using saraScript • The model defines: an XML-corpus Query Language (XQL); methods to manipulate XQL queries and results

  16. XML Query Language • XML vocabulary for searching • word, punctuation mark, substring • word + secondary keys (e.g. POS) • XML start- or end-tag, plus attributes • Unicode-compliant regular expressions • Including • The usual Boolean operations • sequence, disjunction, join • Scoping operators • Special lexical features • Not XPath, not XQuery…yet

  17. Client features • Word and lemma query • User-configurable display • plain, XML, user-defined stylesheets • Texts, Results, Browse windows • Results can be exported in XML • Scripting language • “visual interface” for complex queries

  18. Target queries • What is the most frequent noun in this corpus? • Find a random sample of 100 instances of "fish" followed by "chips" within 4 words • Find sentences beginning with a conjunction. • Show all inflected forms of the name "Winston". • Show sentences which begin with "well" and end with a question mark. • How often and in what contexts is the word "nature" used in different kinds of writing? • Which verbs collocate significantly with "bosom" at different periods of history? • Do men use colour vocabulary differently from women?
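
One of these target queries, "fish" followed by "chips" within 4 words, can be sketched in plain Python over a token list. This shows the logic only, not the Xaira query syntax; the function `followed_within`, the tokenizer and the sample sentence are assumptions for this example.

```python
import re

def followed_within(tokens, first, second, max_gap=4):
    """Return positions of `first` that have `second` among the next max_gap tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1:i + 1 + max_gap]:
            hits.append(i)
    return hits

text = "We had fish and chips, then more fish with no sign of any chips."
tokens = re.findall(r"\w+", text.lower())
print(followed_within(tokens, "fish", "chips"))
```

Drawing the random sample of 100 instances would then be a matter of passing the hit list to something like `random.sample`, provided at least 100 hits exist.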

  19. Phrase or simple query • search word or phrase • can be case sensitive • can include punctuation • can include anyword character • watch out for tokenization problems

  20. Sample stylesheet display

  21. Word Query • searches the lexicon for a word stem or pattern • returns matching word forms with frequencies • can restrict by frequency • can apply lemmatization rules • … then carries out a lookup to display hits
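
The lexicon lookup step can be pictured with a small Python sketch: build a frequency list, match a pattern against the word forms, and optionally filter by frequency. The `word_query` function and the toy token list are assumptions for this example and say nothing about how Xaira stores its lexicon.

```python
import re
from collections import Counter

def word_query(tokens, pattern, min_freq=1):
    """Return (form, frequency) pairs whose form fully matches the regex pattern."""
    lexicon = Counter(tokens)
    regex = re.compile(pattern)
    matches = [
        (form, freq)
        for form, freq in lexicon.items()
        if regex.fullmatch(form) and freq >= min_freq
    ]
    # Most frequent first; ties broken alphabetically.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

tokens = "love loves loved love lover glove love loved".split()
print(word_query(tokens, r"lov.*"))
print(word_query(tokens, r"lov.*", min_freq=2))
```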

  22. Collocation query • Compares frequency of a given word or lemma with frequencies of all others appearing within a specified context • Ranks results by a statistic (Mutual Information or Z-score) indicating the degree of collocational strength • Can be very suggestive…
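
For readers unfamiliar with the two statistics named above, the sketch below uses common textbook formulations: the expected co-occurrence count is derived from the two word frequencies, the window size and the corpus size; Mutual Information is the base-2 log of observed over expected; and the Z-score divides the difference by the square root of the expected count. Whether Xaira computes them exactly this way is not stated in the slides, and the counts in the example are invented for illustration.

```python
import math

def expected_cooccurrence(node_freq, colloc_freq, corpus_size, span):
    """Co-occurrences expected by chance within a window of `span` tokens."""
    return node_freq * colloc_freq * span / corpus_size

def mutual_information(observed, node_freq, colloc_freq, corpus_size, span):
    """log2 of observed over expected co-occurrence."""
    expected = expected_cooccurrence(node_freq, colloc_freq, corpus_size, span)
    return math.log2(observed / expected)

def z_score(observed, node_freq, colloc_freq, corpus_size, span):
    """(observed - expected) / sqrt(expected), a common approximation."""
    expected = expected_cooccurrence(node_freq, colloc_freq, corpus_size, span)
    return (observed - expected) / math.sqrt(expected)

# Invented counts: node word 100 times, collocate 400 times,
# corpus of 100,000 tokens, window of 5 tokens, 8 observed co-occurrences.
args = dict(node_freq=100, colloc_freq=400, corpus_size=100_000, span=5)
print(expected_cooccurrence(**args))
print(mutual_information(8, **args))
print(round(z_score(8, **args), 2))
```

Mutual Information tends to favour rare collocates, which is one reason tools offer more than one statistic to rank by.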

  23. Collocations of the lemma God

  24. Partitions • A partition is a way of grouping the texts making up a corpus: according to some explicit annotation or characterization (e.g. an attribute value); according to whether or not they match a query (a partition of two halves); by arbitrary manual classification • Each member of a partition is a discrete text • Analysis shows the rate of occurrence of hits within members of the partition • Partitions can be saved and re-used or defined dynamically • indextools generates a default partition using <catRef> element
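
The "rate of occurrence of hits within members of the partition" can be sketched as follows. Texts are grouped by a category label, standing in here for an attribute value, and the hit rate is computed per group. The function name, the category labels and the toy texts are assumptions for this example.

```python
from collections import defaultdict

def hit_rate_by_partition(texts, word):
    """texts: list of (category, tokens). Return {category: hits per token}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [hits, token count]
    for category, tokens in texts:
        totals[category][0] += tokens.count(word)
        totals[category][1] += len(tokens)
    return {
        category: (hits / size if size else 0.0)
        for category, (hits, size) in totals.items()
    }

texts = [
    ("fiction", "love is love".split()),
    ("spoken", "well i love it so".split()),
    ("fiction", "no more".split()),
]
print(hit_rate_by_partition(texts, "love"))
```

Normalizing by group size matters because the members of a partition are rarely the same length.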

  25. Love in BNC Baby

  26. Collocates of love [panels: spoken demographic texts; fiction]

  27. XML Query

  28. Building complex queries • visual interface • scope node defines where to look: an XML element; by span • query nodes define what to look for: word, phrase, addkey, pattern, XML • link types define sequence in which query node targets should occur: next, one-way, two-way
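
The scope-plus-query idea can be imitated in a few lines of Python: the scope is an XML element (here `<s>`), and the query tests the part-of-speech key of the first word inside it. This is an analogy rather than Xaira's query engine; the markup, the `pos` values and the function name are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical POS-tagged corpus fragment.
SAMPLE = """<text>
<s><w pos="CONJ">And</w> <w pos="PRON">we</w> <w pos="VERB">left</w></s>
<s><w pos="PRON">We</w> <w pos="VERB">stayed</w></s>
<s><w pos="CONJ">But</w> <w pos="PRON">they</w> <w pos="VERB">came</w></s>
</text>"""

def sentences_starting_with(xml_text, pos_value, scope="s"):
    """Return the text of each scope element whose first <w> has the given pos."""
    root = ET.fromstring(xml_text)
    hits = []
    for unit in root.iter(scope):
        words = unit.findall("w")
        if words and words[0].get("pos") == pos_value:
            hits.append(" ".join(w.text for w in words))
    return hits

print(sentences_starting_with(SAMPLE, "CONJ"))
```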

  29. Sentences beginning with conjunctions

  30. What is XAIRA's niche? • Web search engines: patchy and unknowable coverage; designed to recover content, not word forms; hard to cite, harder to process • XML display engines: expensive, geared to reader not searcher; focus on presentation rather than content • As a back end for your next generation web application

  31. BNC Baby • Demo CD available here and now • Contains • Four million-word samples from BNC • Nameless Shakespeare • Brooklyn Corpus of Old English • Xaira release 1.08 • Sign up here!
