What’s needed for lexical databases? Experiences with Kirrkirr

Christopher Manning and Kristen Parton • Depts of Computer Science and Linguistics, Stanford University • http://www.sultry.arts.usyd.edu.au/kirrkirrr/

Overview • Background on the Kirrkirr project • What’s needed for dictionary databases • Kirrkirr data structure and data access

Background: Kirrkirr • A dictionary browser/visualization tool • In use with a dictionary of Warlpiri, an Indigenous Australian language (large for such a dictionary: 10 Mb, with examples, cross-references, etc.) • Dictionary is maintained by linguists as text files, with a text editor, in an ad hoc format • We convert it automatically into validated XML (stack-based error-correcting Perl parser) • Kirrkirr software is written in Java (JDK1.1, any platform) and uses an XML text file “database”

[Figure labels: Alawa, Warlpiri, Warumungu]

Kirrkirr: Objectives • Exploit the power of a computer interface in mediating between users and dictionary data • Present a dictionary in a way which is flexible, interactive, customizable, and fun • Do visualization: networks of words, domains, activities, dictionary reversal (W-E → E-W) • Suitable for diverse users, with widely varying literacy levels: inter alia linguists, elementary school children, teachers, and native speakers • Aid linguistic science: for subtle linguistic judgments, one needs speaker involvement

Usability • We’ve been doing paper and electronic dictionary usability testing (Corris, Manning, Poetsch, and Simpson 1999, 2001) 10/6/00: Steve Patrick Jampijinpa, Jessie Patrick Nangala and Samara Napangardi • Steve started to look at it with the children, … taking them through the exercises in the dictionary worksheet, and getting them to do the typing and mousing. JP was keen to look up words, Samara, being younger, was more interested in flashing things and banging keys, but was also keen to be involved. They were keen to look up words which had pictures…. They were disappointed not to find puluku in the dictionary – Samara tried to look it up under cow as well. JP was a slow careful speller, and so could type in words she wanted to know without having them written in front of her. We used the rhyme sort to find rhymes. While rhyme is not a feature of Warlpiri songs, it is useful for teaching phonics. Steve asked whether the dictionary would be at the school, and was pleased to hear that when Carmel got some more RAM it would be.

The many aspects of databases • Three levels: a logical level specifying query semantics, sitting between the physical data level and the external views of/interfaces to the data • Data model; data integrity and consistency • Query language • Concurrency control, transaction management, and data recovery • We’re not doing this – like most XML work? (Abiteboul et al. 2000) – but some people need this • Storage and query optimization; indices

Choices for dictionary representation • A relational database (Nathan and Austin 1992, …) • The flexible, hierarchical, ordered text structure of dictionaries means that this is painful to do; retrieving dictionary entries may involve innumerable joins • A text file (“the document culture”) • Common in practice. No data integrity, etc. • But portable and tangible. Authors like it. • As semi-structured data  • Matches variable, non-rigid, and extensible hierarchical structure found in dictionaries

But semi-structured data is a continuum… • From highly structured data that could easily be represented in a relational or OO database (but isn’t for interchange or trendiness reasons) • To very unstructured text data, with occasional limited markup of basic structure • Linguistic databases tend to be at the unstructured end of the continuum • But (unfortunately for linguists) most work on semi-structured databases has focused on the quite structured end … with only very limited work aimed at text databases

Crucial observation for dictionary databases • In fairly unstructured databases, the contents of fields are also likely to be quite free-form • Desired querying is likely to involve flexible content-based queries • Current XML query language proposals don’t adequately support this style of usage • Even standard techniques for text, like word-based inverted file indices, often contain restrictions, such as allowing wildcards only at the end of words, which greatly limit their usefulness in text applications (e.g., PAT (Salminen and Tompa 1994) can’t search for ‘-isms’)

Ramifications for indexing • Pre-indexing is often not particularly useful or effective over text databases • Regular expressions are often more suitable • Linguists often want to ask pattern questions (words with a high vowel after a velar) • We can do “fuzzy spelling” correction without Soundex-style precomputation • In Kirrkirr, we’re working on doing online morphological analysis, which is again usefully viewed as a finite-state transduction
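The two regexp uses above can be sketched as follows. This is a minimal illustration, not Kirrkirr's code: the word list is an illustrative sample, the pattern assumes velars are spelled "k" or "ng" and high vowels "i" or "u", and the table of confusable letters is a hypothetical stand-in for whatever equivalences a real fuzzy speller would use.

```python
import re

# Illustrative sample only; a real list would come from the headword index.
HEADWORDS = ["kuyu", "ngurra", "karli", "wangka", "jarntu", "maliki", "pirli"]

# Assumption: velars are spelled "k" or "ng", high vowels "i" or "u".
VELAR_THEN_HIGH_VOWEL = re.compile(r"(?:ng|k)[iu]")

# Assumption: letters a user might confuse, written as regex character classes.
CONFUSABLE = {"p": "[pb]", "b": "[pb]", "t": "[td]", "d": "[td]",
              "k": "[kg]", "g": "[kg]", "u": "[uo]", "o": "[uo]",
              "i": "[ie]", "e": "[ie]"}

def pattern_query(pattern, words):
    """Return the words that contain a match for a compiled pattern."""
    return [w for w in words if pattern.search(w)]

def fuzzy_lookup(typed, words):
    """Match a possibly misspelled word by widening each confusable letter."""
    loose = "".join(CONFUSABLE.get(ch, re.escape(ch)) for ch in typed.lower())
    rx = re.compile(loose)
    return [w for w in words if rx.fullmatch(w)]
```

Nothing is precomputed here: both queries are a single pass over the word list, which is the point of the slide.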

Indexing • Indexing is not particularly needed: you can grep 10 Mb in 2–3 seconds on a standard PC (users are happy to wait) • XML indexing research has concentrated on the structured end of the problem: • Regular expressions over path structures are not of much use for textbases • We mainly need queries over textual content within XML entities • There are not complex join conditions but simple use of intersection or alternation • Realistic search needs do not add excessive combinatoric complexity: a linear search of the text is sufficient
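A linear content search with intersection and alternation might look like the sketch below. The element names (ENTRY, HW, DEF) and the sample entries are assumptions made for illustration; they are not the Warlpiri dictionary's actual markup or data.

```python
import re

# Illustrative data with hypothetical element names.
SAMPLE = """<DICTIONARY>
<ENTRY><HW>karli</HW><DEF>curved wooden weapon thrown when hunting</DEF></ENTRY>
<ENTRY><HW>kuyu</HW><DEF>animal hunted for food; meat</DEF></ENTRY>
<ENTRY><HW>pirli</HW><DEF>rock, stone, hill</DEF></ENTRY>
</DICTIONARY>"""

ENTRY_RE = re.compile(r"<ENTRY>.*?</ENTRY>", re.DOTALL)

def field_text(entry, tag):
    """Concatenate the text of every <tag>...</tag> inside one entry."""
    return " ".join(re.findall(rf"<{tag}>(.*?)</{tag}>", entry, re.DOTALL))

def search(xml_text, tag, all_of=(), any_of=()):
    """Scan every entry once; keep headwords whose field matches the patterns.

    all_of is an intersection (every pattern must match);
    any_of is an alternation (at least one must match, if any are given).
    """
    hits = []
    for m in ENTRY_RE.finditer(xml_text):
        text = field_text(m.group(0), tag)
        ok_all = all(re.search(p, text, re.IGNORECASE) for p in all_of)
        ok_any = not any_of or any(re.search(p, text, re.IGNORECASE) for p in any_of)
        if ok_all and ok_any:
            hits.append(field_text(m.group(0), "HW"))
    return hits
```

For example, `search(SAMPLE, "DEF", all_of=["hunt", "thrown"])` restricts the match to definition text, so a headword that merely happens to contain "hunt" would not be returned.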

Data models/schemas • Data consistency and correctness are vitally important • Even if authors like text editors, it’s a license to make errors and inconsistencies • Every kind of validation available has been useful (DTD, id/idref-style constraints) • One dictionary data model doesn’t fit all • E.g., Warlpiri dictionary has unusual organization via paradigm examples • I feel that exploring mediators will be more profitable than complex standards
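As one illustration of an id/idref-style constraint, the sketch below checks that every cross-reference points at an existing entry. A validating parser with a DTD would do this for real; the code only mimics the check, and the element and attribute names (ENTRY, XREF, id, ref) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative data; "missing-word" is a deliberately dangling reference.
SAMPLE = """<DICTIONARY>
  <ENTRY id="maliki"><HW>maliki</HW><XREF ref="jarntu"/></ENTRY>
  <ENTRY id="jarntu"><HW>jarntu</HW><XREF ref="maliki"/></ENTRY>
  <ENTRY id="kuyu"><HW>kuyu</HW><XREF ref="missing-word"/></ENTRY>
</DICTIONARY>"""

def check_references(xml_text):
    """Report duplicate entry ids and cross-references with no target."""
    root = ET.fromstring(xml_text)
    ids, problems = set(), []
    # First pass: collect ids, which must be unique.
    for entry in root.iter("ENTRY"):
        entry_id = entry.get("id")
        if entry_id in ids:
            problems.append(f"duplicate id: {entry_id}")
        ids.add(entry_id)
    # Second pass: every reference must name a collected id.
    for entry in root.iter("ENTRY"):
        for xref in entry.iter("XREF"):
            target = xref.get("ref")
            if target not in ids:
                problems.append(f"{entry.get('id')}: dangling reference to {target}")
    return problems
```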

Data structures and data access in Kirrkirr • Data maintained by lexicographers in text files • Backslash codes, but with end tags, nesting • Converted to XML via Perl parser • Result is guaranteed to be valid XML (though heuristic parser can make semantic errors) • This has involved a lot of work and revealed many inconsistencies in the data. Painful! • Automatic data consistency and integrity maintenance is really useful, I’d argue! • But text gives freedom, ease-of-use, tangibility (UI issues win: cf. Excel vs. Access)
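The stack-based, error-correcting conversion can be sketched roughly as below, in Python rather than the Perl of the real converter. The code names and the convention that \x opens a field and \ex closes it are assumptions for illustration, as is the mapping to XML element names; the actual source format is not shown in these slides.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical backslash codes and the XML elements they map to.
TAGS = {"me": "ENTRY", "gl": "GLOSS", "dm": "DOMAIN"}
TOKEN = re.compile(r"\\([a-z]+)")

def to_xml(source):
    """Convert backslash-coded text to well-formed XML, repairing bad nesting."""
    out, stack, warnings = ["<DICTIONARY>"], [], []
    pos = 0

    def emit_text(text):
        text = text.strip()
        if text:
            out.append(escape(text))

    for m in TOKEN.finditer(source):
        emit_text(source[pos:m.start()])
        pos = m.end()
        code = m.group(1)
        if code in TAGS:                       # opening code
            stack.append(code)
            out.append(f"<{TAGS[code]}>")
        elif code.startswith("e") and code[1:] in TAGS:   # closing code
            name = code[1:]
            if name not in stack:
                warnings.append(f"stray \\{code} ignored")
                continue
            while stack:                       # close anything left open inside
                top = stack.pop()
                out.append(f"</{TAGS[top]}>")
                if top == name:
                    break
                warnings.append(f"\\{top} auto-closed before \\{code}")
        else:                                  # unknown code: keep it as text
            warnings.append(f"unknown code \\{code} kept as text")
            out.append(escape(m.group(0)))
    emit_text(source[pos:])
    while stack:                               # close whatever is still open
        top = stack.pop()
        warnings.append(f"\\{top} auto-closed at end of input")
        out.append(f"</{TAGS[top]}>")
    out.append("</DICTIONARY>")
    return "".join(out), warnings
```

The output is always well-formed, but a repair can still be wrong about what the author meant, which is why the warnings are worth reading.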

Indices/tables • Kirrkirr builds and stores on disk two custom indices/tables over the XML • One indexes Warlpiri headwords to XML file positions, and holds a few extra bits of info (about pictures, subentry status, etc.) so the scroll list can be displayed quickly • The other indexes English glosses to Warlpiri words • Maintained in memory at runtime • (not that large, allows easy regexp-based fuzzy spelling matching)
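A headword index of the first kind could be built along these lines. This is a sketch under assumed markup (ENTRY, HW, IMAGE are hypothetical element names), and it records only one extra bit, whether the entry has a picture.

```python
import re

# Illustrative file contents, as bytes so positions are byte offsets.
SAMPLE = (b"<DICTIONARY>\n"
          b"<ENTRY><HW>maliki</HW><GL>dog</GL><IMAGE>maliki.jpg</IMAGE></ENTRY>\n"
          b"<ENTRY><HW>kuyu</HW><GL>meat</GL></ENTRY>\n"
          b"</DICTIONARY>\n")

ENTRY_RE = re.compile(rb"<ENTRY>(.*?)</ENTRY>", re.DOTALL)
HEADWORD_RE = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)

def build_headword_index(xml_bytes):
    """Map each headword to (byte offset of its entry, has_picture flag)."""
    index = {}
    for m in ENTRY_RE.finditer(xml_bytes):
        headword = HEADWORD_RE.search(m.group(1))
        if headword is None:
            continue  # an entry without a headword cannot be listed
        has_picture = b"<IMAGE>" in m.group(1)
        index[headword.group(1).decode("utf-8")] = (m.start(), has_picture)
    return index
```

With the offsets and flags in memory, a scroll list can be drawn without opening the XML file at all.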

Kirrkirr data access • [Diagram: the Kirrkirr Dictionary Browser sits on a dictionary interface; indices in memory (Warlpiri: word, position, bits; English to Warlpiri) point into the XML Warlpiri dictionary file (<DICTIONARY> <ENTRY> ... </ENTRY> ... </DICTIONARY>), which is accessed through an XML parser / XML Document Object Model and grep (Jakarta-ORO)] • Our “logical level” is Java code with hardwired methods for each query – though we have also experimented with XQL (for parts of it)

Data access • Scroll list display, simple lookups and searches over headwords and glosses done purely from in-memory indices • Getting cross-references for network display, semantic domains, pictures, HTML, etc. is done by using index to jump into XML file, and then parsing it (with SAX until end of entry) • Complex searches are done as entity-sensitive regexp search over either the whole dictionary file, or the entries that the search is restricted to (found via the headword index)
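The "jump into the file, then parse to the end of the entry" step might be sketched as below. Kirrkirr does this in Java with SAX; here ElementTree parses the sliced bytes instead, an in-memory BytesIO stands in for the dictionary file, and the element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative file contents with hypothetical element names.
SAMPLE = (b"<DICTIONARY>\n"
          b'<ENTRY><HW>maliki</HW><XREF type="syn">jarntu</XREF></ENTRY>\n'
          b'<ENTRY><HW>jarntu</HW><XREF type="syn">maliki</XREF></ENTRY>\n'
          b"</DICTIONARY>\n")

END_TAG = b"</ENTRY>"

def read_entry(f, position, chunk_size=4096):
    """Seek to an indexed offset and parse just the entry that starts there."""
    f.seek(position)
    buf = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            raise ValueError(f"no {END_TAG.decode()} found after offset {position}")
        buf += chunk
        end = buf.find(END_TAG)
        if end != -1:
            # Stop reading as soon as the entry is complete.
            return ET.fromstring(buf[:end + len(END_TAG)])

def cross_references(entry):
    """List (type, target) pairs for the typed cross-references of an entry."""
    return [(x.get("type"), x.text) for x in entry.findall("XREF")]
```

The offset would normally come from the headword index, so only one entry's worth of XML is read and parsed per lookup.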

Customizing Format with XSLT • XSLT stylesheets format dictionary entries in ways suited to the needs of different users • E.g., simple formats for low literacy users • The resulting HTML pages show typed cross-references in the dictionary as colored hyperlinks between different words • Since the XML is parsed at run-time, we can add extra information by “parameter passing” from the program to the XSLT • E.g. file locations for pictures, search titles

English-Warlpiri Dictionary • Source dictionary is only Warlpiri-English, but a bidirectional dictionary is needed by users • An English index was built from glosses so that glosses link to equivalent Warlpiri entries • Basis for English wordlist and fast search • Multiword glosses are indexed under every word except stopwords, giving easy lookup • One underlying dictionary: data consistency • The XML entries of all Warlpiri equivalents to an English word are merged, and passed to an XSLT stylesheet which produces merged HTML
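Building the English index from glosses can be sketched like this. The gloss data and the stopword list are illustrative assumptions, not the contents of the real dictionary.

```python
import re
from collections import defaultdict

# Assumed stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "of", "to", "for", "in", "on", "or", "and"}

# Illustrative glosses keyed by Warlpiri headword.
GLOSSES = {
    "maliki": ["dog"],
    "jarntu": ["dog"],
    "karli": ["boomerang"],
    "kuyu": ["meat", "animal hunted for food"],
}

def build_english_index(glosses):
    """Index every non-stopword of every gloss to its Warlpiri headwords."""
    index = defaultdict(set)
    for headword, gloss_list in glosses.items():
        for gloss in gloss_list:
            for word in re.findall(r"[a-z]+", gloss.lower()):
                if word not in STOPWORDS:
                    index[word].add(headword)
    # Sorted output gives a stable English wordlist for display.
    return {word: sorted(hws) for word, hws in sorted(index.items())}
```

Because the index is derived from the one Warlpiri-English source, the English side cannot drift out of step with it.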

Warlpiri Morphological Parsing • Warlpiri is an agglutinating language: • nyangulparnangku • nya -ngu -lpa =rna =ngku • see -PAST -IPFV =1SG.SUBJ =2SG.OBJ • ‘I was looking at you.’ • For lookup/linking, users or the program have to know the root/citation form • This is difficult for people with limited literacy • We have been developing a morphological analyzer so we can look up any form, and link words in examples, etc. (Finite state methods)
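To make the lookup problem concrete, here is a toy segmenter that recovers the root of the example word. It is a plain recursive prefix match, not a finite-state transducer, and its morpheme inventory is only the five forms shown on this slide, so it accepts suffixes in any order and knows nothing about real Warlpiri morphotactics.

```python
# Morphemes and glosses taken from the single example above; nothing else.
ROOTS = {"nya": "see"}
SUFFIXES = {"ngu": "-PAST", "lpa": "-IPFV", "rna": "=1SG.SUBJ", "ngku": "=2SG.OBJ"}

def analyses(word):
    """Return every way to split word into one known root plus known suffixes."""
    results = []

    def extend(rest, morphs):
        if not rest:
            results.append(morphs)
            return
        for form, gloss in SUFFIXES.items():
            if rest.startswith(form):
                extend(rest[len(form):], morphs + [(form, gloss)])

    for root, gloss in ROOTS.items():
        if word.startswith(root):
            extend(word[len(root):], [(root, gloss)])
    return results

def root_of(word):
    """Return the root of the first analysis, or None if the word is unparsed."""
    found = analyses(word)
    return found[0][0][0] if found else None
```

Once the root is known, the word can be linked to its dictionary entry even though the inflected form never appears as a headword.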

Conclusions • The data structuring and data integrity of a semi-structured database are great for dictionaries • A query language that supported textual content-based queries well would be great too • At present, though, we do not have many good options, and Kirrkirr gets by with limited ad hoc indices and text searches, done via a dictionary abstraction layer in the code • This hasn’t troubled us too much; UI issues have normally been much bigger challenges

Acknowledgements • Ken Hale, Mary Laughren, Robert Hoogenraad • Jane Simpson, David Nash • Nic Gambold, Kay Ross • Kevin Jansz, Nitin Indurkhya, Kevin Lim • Miriam Corris, Susan Poetsch • and many others….
