Discovery of Patterns in Digital Records
DESI III at ICAIL 2009 Global E-Discovery/E-Disclosure Workshop: A Pre-Conference Workshop at the 12th International Conference on Artificial Intelligence and Law
A. Shelly Spearing (email@example.com)
Jorge H. Román (firstname.lastname@example.org)
Los Alamos National Laboratory, High Performance Computing Division, Scientific Software Engineering Group (HPC-1)
June 8, 2009, Barcelona, Spain
Problem statement
• Repositories with terabytes of digital records are being created and updated daily!
• In the legal profession, it is imperative to identify "relevant" records from these large repositories. Currently, "relevance" is determined by a complex set of Boolean logic query strings used for full-text searching of these repositories.
• The Boolean logic search string approach has several weaknesses:
• 1) Documents may have to be categorized by humans. This is a time-consuming process, especially for larger sets. It may also be inaccurate, as different human experts may use different categorization terms or may be influenced by past experiences or current focus areas.
• 2) For many cases (e.g., tobacco litigation), the source documents are older paper documents that must first be digitized. After digital images of the original documents are created, the text is automatically recognized using Optical Character Recognition (OCR). However, OCR quality is directly related to the quality of the input: poor-quality documents (e.g., dark pages or barely legible text) generate fragmented words.
• 3) The Boolean logic may be complex enough that relevant documents are missed. This could be caused by faulty logic, incorrect term usage, or other factors that narrow the document set too early.
Background
• LANL's Digital Knowledge Discovery (DKD) team has a long history of working with digitized records in large archival repositories. The technology in use overcomes the above problems by automatically extracting important concepts from text. Fragmented words still contain patterns that can be used to identify relevant documents. The discovery interface also uses natural language and graphics to convey complex relationships. Some features of the approach are:
• Automatic identification of key concepts, and the relationships among them, in each document. This process is analogous to user-identified keywords/key-phrases in scientific publications, but it is performed consistently by an algorithm and is not biased toward temporary fads.
• Creation of indices from the key concepts, which facilitates identification of patterns in the documents. These indices are similar to the index in the back of a book used to find relevant pages; in our case, they identify relevant documents and are created without human intervention.
• The discovery process uses natural language to translate search concepts into complex target patterns, which can be used to rank documents for relevance and identify the best ones. In other words, important concepts can be extracted from natural-language problem statements and used to generate a scoring mechanism for assessing goodness-of-fit. This feedback loop can also be used to discover other relevant documents.
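As an illustrative sketch only (not the DKD algorithm, whose details are not given here), the simplest form of key-concept extraction scores terms by frequency after stopword filtering; a real system would also identify multi-word phrases and the relationships among concepts:

```python
from collections import Counter

# Hypothetical, minimal key-concept extraction: rank terms by frequency
# after removing a small stopword list. Stopword set and example text are
# illustrative, not taken from the LANL system.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "are", "for", "by"}

def key_concepts(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in the text."""
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

doc = ("Cigarette advertising to minors is prohibited. "
       "Cigarette advertising in film reaches minors.")
concepts = key_concepts(doc)
```

Even on OCR-fragmented text, frequency patterns like these can survive well enough to hint at a document's key concepts.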
Current Technology
• The search string "(television OR film OR theater) AND (children OR minors)" is used to narrow down the search.
• This example is from the "product placement" hypothetical complaint. It required a human to understand what the complaint means and how to formulate it for the system.
• This sample search was run against "Tobacco Institute" documents for this illustrative example.
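A minimal sketch (not the actual search system) of how such a Boolean query narrows a document set, and of the OCR weakness noted earlier, where a single fragmented word defeats exact matching; the sample documents are invented for illustration:

```python
# Hypothetical evaluation of the slide's Boolean query
# "(television OR film OR theater) AND (children OR minors)":
# a document matches if it contains at least one term from each group.

def matches(doc, group_a, group_b):
    """True if the document's words intersect both term groups."""
    words = set(doc.lower().split())
    return bool(words & group_a) and bool(words & group_b)

GROUP_A = {"television", "film", "theater"}
GROUP_B = {"children", "minors"}

docs = [
    "Proposed film placement aimed at minors",       # matches: film AND minors
    "Television advertising study of adult smokers",  # fails: no group-B term
    "Telev ision ads seen by child ren",              # OCR-fragmented: missed
]

hits = [d for d in docs if matches(d, GROUP_A, GROUP_B)]
```

The third document is clearly relevant to a human reader, but OCR fragmentation ("Telev ision", "child ren") makes it invisible to exact-match full-text search.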
Knowledge Signatures (kSigs)
• Key concepts (knowledge) are automatically extracted.
• The relationships between these concepts are also identified (hierarchy).
• Knowledge is annotated in the original context.
• Knowledge is hyperlinked to ease navigation by the end user.
(Figure: a sample document with its knowledge annotated and hyperlinked)
Automated Taxonomy Generation
• kSigs for a selected set are aggregated to create a taxonomy.
• Taxonomies can be sorted by frequency (as depicted here).
• Taxonomies can be compared against other taxonomies of kSigs.
• The comparison results can be used to rank the goodness-of-fit between them.
• Further knowledge filtering can reduce the amount of knowledge and highlight the most relevant concepts.
• Taxonomies allow the summarization of a large collection of digital content, thereby exposing the collective knowledge contained in the collection.
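The aggregation and comparison steps above can be sketched as follows, assuming (hypothetically) that each kSig reduces to a list of key concepts per document; the concepts shown are invented examples:

```python
from collections import Counter

# Hypothetical per-document key concepts, standing in for kSigs.
doc_concepts = [
    ["cigarette", "advertising", "minors"],
    ["cigarette", "prohibiting", "advertising"],
    ["cigarette", "minors"],
]

# Fuse the per-document concepts into one frequency-sorted taxonomy.
taxonomy = Counter(c for doc in doc_concepts for c in doc)
sorted_taxonomy = taxonomy.most_common()  # most frequent concepts first

# Compare against another taxonomy: shared concepts are a crude
# goodness-of-fit signal between the two collections.
other_taxonomy = Counter({"cigarette": 2, "television": 1})
overlap = set(taxonomy) & set(other_taxonomy)
```

Sorting by frequency surfaces the dominant concepts of the whole collection, which is what lets a taxonomy summarize content no one has read end to end.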
Knowledge Network (kNet)
• A taxonomy is created by fusing knowledge across documents.
• The size of a node denotes the frequency of the concept in the set.
• The width of an edge denotes the co-occurrence frequency of the two concepts it connects.
• Nodes and edges are color coded: red for top-level nodes, blue for second-level, and green for third-level. The levels denote the importance of a concept within a document, and a directed edge shows that one concept supports another. For example, "Prohibiting" is a supporting concept of "Cigarette."
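A sketch of how the node and edge weights of such a network could be computed, again assuming hypothetical per-document concept lists (the level/color assignment from the slide is not modeled here):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document key concepts.
doc_concepts = [
    ["cigarette", "prohibiting", "advertising"],
    ["cigarette", "prohibiting"],
    ["cigarette", "minors"],
]

# Node weight: frequency of the concept across the set (drives node size).
nodes = Counter(c for doc in doc_concepts for c in doc)

# Edge weight: how often two concepts co-occur in a document (drives edge
# width). Pairs are sorted so each undirected pair is counted once.
edges = Counter(
    tuple(sorted(pair))
    for doc in doc_concepts
    for pair in combinations(set(doc), 2)
)
```

Here "cigarette" gets the largest node and the "cigarette"/"prohibiting" edge the greatest width, mirroring the slide's example of "Prohibiting" supporting "Cigarette."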
Conclusion
• Automated derivation of digital knowledge is possible.
• Depending on the question, there are many other ways to display taxonomies and their comparisons:
• Subject-matter-expertise discovery: take the publications by one author and create a taxonomy sorted by concept frequency. The top N concepts give an indication of the author's knowledge.
• Display of unique/overlapping knowledge from a comparison of taxonomies.
• Goodness-of-fit of new material as it relates to a taxonomy (for categorization).
• The DKD Team will continue to search for new and innovative ways to discover and display knowledge to facilitate human consumption. The final goal is to facilitate knowledge transfer through digital content.