SmartXAutofill Intelligent Data Entry Assistant for XML Documents

SmartXAutofill Intelligent Data Entry Assistant for XML Documents Danico Lee April 7, 2005

Background – XML Technology • XML is a mark-up language for data representation and data exchange • Characteristics and advantages of XML: • Users from different professions can define their own tags and attribute names, which allows people in the same field to exchange data and information • XML document structures can be nested to any level of complexity • An XML document can contain an optional description of its grammar for performing structural validation • XML is important in today’s high-volume data-collection environments • As of 10/26/2004, 1,556,009 people worldwide were using a single XML application tool • Many software applications and technologies build on XML, e.g. SOAP, XML Spy, Microsoft Office 2003 • Major companies are putting their data in XML • Many professional groups have developed their own XML ontologies, e.g. OMF for meteorologists and CML for chemists

XML Document - Example
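The example document on this slide was an image and is not preserved in the transcript. As a stand-in, here is a minimal sketch using a hypothetical document: the element names "Title", "place", "room" and "bldg" are borrowed from the demo slide, while "seminar", "speaker" and all values are invented for illustration. It uses Python's standard xml.etree.ElementTree module to list the leaf fields that are still empty, which are the fields an autofill assistant would try to predict.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and values are illustrative only.
DOC = """
<seminar>
  <Title>Intelligent Data Entry</Title>
  <speaker></speaker>
  <place>
    <room>101</room>
    <bldg></bldg>
  </place>
</seminar>
"""


def empty_leaf_paths(xml_text):
    """Return slash-separated paths of leaf elements that have no text."""
    root = ET.fromstring(xml_text)
    empty = []

    def walk(node, path):
        children = list(node)
        if not children:
            # A leaf counts as empty when it has no text or only whitespace.
            if not (node.text or "").strip():
                empty.append(path)
            return
        for child in children:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return empty


print(empty_leaf_paths(DOC))
```

The nested "place" element illustrates the point made on the background slide: fields can sit at any depth, so an assistant has to address them by path rather than by a flat field name.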

Background - Autofill Technology • Currently 70 million workers, or 59% of working adults in the U.S., complete forms on a regular basis • The data entry process is tedious, error-prone, time-consuming and person-power intensive • Most businesses continue to process almost 80% of their forms manually (according to Verity Inc.) • Autofill and Auto-complete technologies ease the burden of data entry by automatically predicting and suggesting values for empty data fields • Problems with current autofill technologies: • Require a perfect match with the historical data, e.g. AOL • Or require previously stored templates, e.g. Roboform • Are mostly for web-based forms • Can only handle simple data, e.g. name and address in online shopping forms; no support for complex XML structures • Inaccurate

Motivation • XML is the primary standard for data representation and data exchange • Most businesses continue to process almost 80% of their forms manually • The data entry process for XML documents is tedious, error-prone, time-consuming and person-power intensive • Current software tools for XML only simplify the implementation process • Information for XML documents still needs to be entered manually • Previous software tools for assisting data entry: • Are inaccurate • Do not support complex XML grammars

Approach • Our goal: reduce the burden on the user by automating data entry into XML documents • SmartXAutofill - an intelligent data entry assistant for predicting and automating inputs for XML documents • Based on the contents of historical document collections in the same XML domain • Incorporates an ensemble classifier that integrates multiple internal classification algorithms into a single architecture • Each internal classifier uses approximate techniques from Machine Learning to predict and suggest a value for an empty XML field • Approximate match: predict the empty node values by comparing the values in a historical collection of XML documents with the values in a partially filled document, e.g. probabilistically • Very different from current autofill systems, which require a perfect match between the incomplete document and the values of stored documents
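To make the idea of an approximate match concrete, here is a minimal sketch of one possible predictor. It is an assumption for illustration, not one of the system's actual internal classifiers, which the slides do not describe. Documents are simplified to dictionaries mapping a field path to a value, and the function name suggest, the scoring rule and the sample values are all hypothetical.

```python
from collections import Counter


def suggest(history, partial, target, top_k=3):
    """Rank candidate values for `target` by similarity-weighted frequency.

    history: list of dicts mapping field path -> value (past documents)
    partial: dict holding the fields the user has filled in so far
    """
    scores = Counter()
    for doc in history:
        value = doc.get(target)
        if not value:
            continue
        # Count how many already-entered fields agree with this past document.
        overlap = sum(1 for k, v in partial.items() if doc.get(k) == v)
        # The +1 lets documents with no overlap still contribute raw frequency,
        # so no perfect match is ever required.
        scores[value] += 1 + overlap
    return [value for value, _ in scores.most_common(top_k)]


# Illustrative data only.
history = [
    {"place/bldg": "EMS", "place/room": "E180", "Title": "AI Seminar"},
    {"place/bldg": "EMS", "place/room": "E180", "Title": "XML Tools"},
    {"place/bldg": "Union", "place/room": "240", "Title": "Career Fair"},
]
partial = {"place/bldg": "EMS"}
print(suggest(history, partial, "place/room"))
```

Unlike an exact-match autofill, this still returns a ranked list when the partially filled document agrees with no stored document; the overlap only shifts the ranking.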

Overview • User enters data into an XML form and moves cursor to an empty field • SmartXAutofill examines the data entered • SmartXAutofill examines the historical XML collection • Machine Learning algorithms predict what the data value should be • Weighting System learns and improves from past performance by rewarding algorithms that make correct predictions • Voting System forms a consensus decision • SmartXAutofill returns one or more suggestions for the current field • User selects one of the SmartXAutofill suggestions or enters another value
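The weighting step in this overview can be sketched as follows. The slides do not give the actual update rule, so the additive reward, the starting weight and the class name WeightTable are assumptions chosen only to show the mechanism: each classifier keeps a separate weight per node, and a classifier is rewarded when its top suggestion matches the value the user finally accepted.

```python
from collections import defaultdict


class WeightTable:
    """Per-node weights for each classifier; all weights start out equal."""

    def __init__(self, classifier_names, initial=1.0, reward=0.1):
        self.names = list(classifier_names)
        self.reward = reward  # hypothetical additive reward
        self.weights = defaultdict(lambda: {n: initial for n in self.names})

    def update(self, node, suggestions, accepted_value):
        """Reward each classifier whose top suggestion matched the user's value."""
        for name in self.names:
            ranked = suggestions.get(name, [])
            if ranked and ranked[0] == accepted_value:
                self.weights[node][name] += self.reward


table = WeightTable(["A", "B", "C"])
table.update(
    "place/room",
    {"A": ["E180"], "B": ["240"], "C": ["E180", "240"]},
    accepted_value="E180",
)
print(table.weights["place/room"])
```

Keeping the weights per node rather than per classifier reflects the later observation in the results that different classifiers are preferred for different nodes.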

Underlying Technology – Ensemble Learning • Problem: impossible to predict which classification algorithm will work best for which type of document • Solution: Ensemble classifier • A collection of classification algorithms; each classifier provides predictions for the value of an XML node • Learns which individual algorithms provide better predictive accuracy for different XML domains and for different nodes in the XML documents in these domains • Adapts itself to the specific XML collection, and performs better than any individual predictive algorithm • Boosting is one of the most widely used ensemble methods

Underlying Technology – Ensemble Learning (cont’d) • Our ensemble boosts the internal classifiers based on their past performance by weighting the individual classifiers • Previous work in boosting combined the same type of classifier, learned by the same methodology, but trained on different examples • Our ensemble combines different types of classifiers into an integrated classification framework • Extra feature: the collection of XML documents used for prediction is constrained by a “time window” • Only the N latest documents are used • N is defined by the user • Allows the system to adapt itself to the type of documents being entered recently
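The “time window” can be sketched with a bounded queue. This is a minimal illustration under the assumption that documents are dictionaries of field values; the class name TimeWindow and its methods are hypothetical. Python's collections.deque with maxlen discards the oldest entry automatically once N documents are stored.

```python
from collections import deque


class TimeWindow:
    """Keep only the N most recently entered documents for prediction."""

    def __init__(self, n):
        self.docs = deque(maxlen=n)  # n is the user-defined window size

    def add(self, doc):
        # Appending to a full deque silently drops the oldest document.
        self.docs.append(doc)

    def values_for(self, field):
        """Historical values of `field`, oldest first, within the window."""
        return [d[field] for d in self.docs if d.get(field)]


window = TimeWindow(n=2)
for room in ["101", "202", "303"]:
    window.add({"place/room": room})
print(window.values_for("place/room"))  # the oldest document has been dropped
```

A small N makes suggestions track recent entries closely; a large N uses more history but reacts more slowly when the kind of document being entered changes.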

Ensemble Weighted Voting Example • Three classifiers provide three suggestions each • All classifiers have the same weight initially • Classifier weights are modified based on their performance for different nodes in the XML domain • Classifier A makes three suggestions: • the top one receives a rank value of 3 • the second one receives 2 • the third one receives 1 • Rank values are multiplied by the weight of the classifier and then normalized by the sum of the weights of all the classifiers • The suggestion with the highest score is the one selected by the ensemble and presented to the user

SmartXAutofill Demo • Node Information displays data about the currently selected element • Editor for the element “Title” • Drop-down box containing the best suggested values • Suggestion Information displays the top-ranked suggestions from each suggestor for the currently selected element • Voting Information displays a bar for each possible suggestion; colored components show the contribution of each suggestor in the vote • Editor for the element “place”, which has two child elements, “room” and “bldg” • Pop-up menu for adding new elements • Weight Information displays the historical accuracy of each suggestor for the currently selected element

Testing Approach • To span the size and complexity dimensions, XML document data were collected from 11 domains • APAIS, BioMed, CALL, iProClass, PSD, NASA, NREF, SPROT, UniProf, UWM, and WSU • Collection sizes ranged from around 50 to 5000 documents • Between 20 and 420 nodes per document • Document collections were randomly separated into two sets: a seed collection (10% of the collection or 100 documents) and a training collection • Seed collection - historical information for making predictions • Training collection - used to train the ensemble by modifying its weights based on the accuracy of its suggestions • The learning component was continuously trained and the system continuously tested • Documents were randomly selected and all nodes were suggested in random order • Documents from the training collection were added to the seed collection after they were used • Note: A classifier does not make a suggestion for a particular field if there is no historical data for it or if every previous value for the field was unique, e.g. abstracts of papers
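The test procedure can be sketched as an incremental evaluation loop. This is a simplified illustration under several assumptions: documents are flat dictionaries, the seed is taken as a 10% fraction only (the slide's “or 100 documents” alternative is not modeled), weight training is omitted, and a single most-frequent-value predictor stands in for the ensemble. All function names are hypothetical.

```python
import random
from collections import Counter


def split_collection(docs, seed_fraction=0.10, rng=None):
    """Randomly split a collection into (seed, training) lists."""
    rng = rng or random.Random(0)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    n_seed = max(1, int(len(shuffled) * seed_fraction))
    return shuffled[:n_seed], shuffled[n_seed:]


def most_frequent(history, partial, field):
    """Stand-in predictor: most common past value, or None to abstain."""
    counts = Counter(d[field] for d in history if d.get(field))
    if not counts or max(counts.values()) == 1:
        return None  # no history, or every previous value was unique
    return counts.most_common(1)[0][0]


def evaluate(seed, training, predict, rng=None):
    """Suggest every field of every training document in random order."""
    rng = rng or random.Random(0)
    history = list(seed)
    correct = attempted = 0
    for doc in training:
        fields = list(doc)
        rng.shuffle(fields)
        partial = {}
        for field in fields:
            guess = predict(history, partial, field)
            if guess is not None:
                attempted += 1
                correct += int(guess == doc[field])
            partial[field] = doc[field]  # the true value is then entered
        history.append(doc)  # used documents join the historical collection
    return correct, attempted


docs = [{"room": "101"} for _ in range(20)]  # toy collection
seed, training = split_collection(docs)
print(evaluate(seed, training, most_frequent))
```

Because each used document is appended to the history, later predictions draw on a growing collection, mirroring the continuous train-and-test cycle described on the slide.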

Test Results for Different Domains

Weights of selected XML nodes from iProClass domain

Test Result Discussion • Different classification algorithms perform better for different domains • Ensemble classifier performed at least as well as the best performing internal classification algorithm for a domain • Different classifiers are preferred for different nodes

Our Technology - SmartXAutofill • First methodology proven to intelligently predict, suggest and autofill data for XML documents • Learns and adapts itself to any XML domain without the need for custom algorithms • “Time window” allows the technology to adapt itself to the particular set of XML documents being filled at that time • Speeds up the data entry process for XML documents by 20% to 99%
