
SmartXAutofill Intelligent Data Entry Assistant for XML Documents

Danico Lee, April 7, 2005

Presentation Transcript


  1. SmartXAutofill Intelligent Data Entry Assistant for XML Documents. Danico Lee, April 7, 2005

  2. Background – XML Technology
     • XML is a markup language for data representation and data exchange
     • Characteristics and advantages of XML:
        • Users from different professions can define their own tags and attribute names, which allows people in the same field to exchange data and information
        • XML document structures can be nested to any level of complexity
        • An XML document can contain an optional description of its grammar for performing structural validation
     • XML is important in today’s high-volume data-collection environments
        • As of 10/26/2004, 1,556,009 people worldwide were using a single XML application tool
        • Lots of software applications for XML, e.g. SOAP, XML Spy, Microsoft Office 2003
        • Major companies are putting their data in XML
        • Many professional groups have developed their own XML ontologies, e.g. OMF for meteorologists and CML for chemists

  3. XML Document - Example

  4. Background - Autofill Technology
     • Currently 70 million workers, or 59% of working adults in the U.S., complete forms on a regular basis
     • The data entry process is tedious, error-prone, time-consuming and person-power intensive
     • Most businesses continue to process almost 80% of their forms manually (according to Verity Inc.)
     • Autofill and auto-complete technologies ease the burden of data entry by automatically predicting and suggesting values for empty data fields
     • Problems with current autofill technologies:
        • Require a perfect match with the historical data, e.g. AOL
        • Or require previously stored templates, e.g. Roboform
        • Mostly built for web-based forms
        • Can only handle simple data, e.g. name and address in online shopping forms; no support for complex XML structures
        • Inaccurate

  5. Motivation
     • XML is the primary standard of data representation and data exchange
     • Most businesses continue to process almost 80% of their forms manually
     • The data entry process for XML documents is tedious, error-prone, time-consuming and person-power intensive
     • Current software tools for XML only simplify the implementation process
        • Information for XML documents still needs to be manually entered
     • Previous software tools for assisting data entry:
        • Are inaccurate
        • Do not support complex XML grammars

  6. Approach
     • Our goal: reduce the burden on the user by automating the data entry into XML documents
     • SmartXAutofill - an intelligent data entry assistant for predicting and automating inputs for XML documents
        • Based on the contents of historical document collections in the same XML domain
        • Incorporates an ensemble classifier that integrates multiple internal classification algorithms into a single architecture
        • Each internal classifier uses approximate techniques from machine learning to predict and suggest a value for an empty XML field
     • Approximate match: predict empty node values by approximately (e.g. probabilistically) matching the values in a partially filled document against the values in a historical collection of XML documents
        • Very different from current autofill systems, which require a perfect match between the incomplete document and the values of stored documents

  7. Overview
     • User enters data into an XML form and moves the cursor to an empty field
     • SmartXAutofill examines the data entered
     • SmartXAutofill examines the historical XML collection
     • Machine learning algorithms predict what the data value should be
     • Weighting System learns and improves from past performance by rewarding algorithms that make correct predictions
     • Voting System forms a consensus decision
     • SmartXAutofill returns one or more suggestions for the current field
     • User selects one of the SmartXAutofill suggestions or enters another value

  8. Underlying Technology – Ensemble Learning
     • Problem: it is impossible to predict which classification algorithm will work best for which type of document
     • Solution: ensemble classifier
        • A collection of classification algorithms; each classifier provides predictions for the value of an XML node
        • Learns which individual algorithms provide better predictive accuracy for different XML domains and for different nodes in the XML documents in these domains
        • Adapts itself to the specific XML collection, and performs better than any individual predictive algorithm
     • Boosting is one of the most widely used ensemble methods

  9. Underlying Technology – Ensemble Learning (cont’d)
     • Our ensemble boosts the internal classifiers based on their past performance by weighting the individual classifiers
        • Previous work in boosting combined the same type of classifier, learned by the same methodology, but trained on different examples
        • Our ensemble combines different types of classifiers into an integrated classification framework
     • Extra feature: the collection of XML documents used for prediction is constrained by a “time window”
        • Only the N latest documents are used
        • N is defined by the user
        • Allows the system to adapt itself to the type of documents being entered recently
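The two ideas on this slide, a user-defined time window over the historical documents and per-classifier weights driven by past performance, can be sketched roughly as follows. This is a minimal illustration and not the SmartXAutofill implementation: the class names `TimeWindow` and `ClassifierWeights`, the representation of a document as a dict from node path to value, and the choice of "fraction of correct past suggestions" as the weight are all assumptions made for the example. The slides only say that correct predictions are rewarded.

```python
from collections import defaultdict, deque


class TimeWindow:
    """Keeps only the N most recently added documents (N chosen by the user)."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        # A bounded deque drops the oldest document automatically.
        self._docs = deque(maxlen=n)

    def add(self, doc):
        # Assumption: a document is a dict mapping node path -> value.
        self._docs.append(doc)

    def values_for(self, node_path):
        """Historical values of one node inside the window, oldest first."""
        return [doc[node_path] for doc in self._docs if node_path in doc]


class ClassifierWeights:
    """Tracks, per (classifier, node), how often past suggestions were correct."""

    def __init__(self, initial_weight=1.0):
        self._initial = initial_weight  # all classifiers start out equal
        self._correct = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, classifier, node_path, was_correct):
        key = (classifier, node_path)
        self._total[key] += 1
        if was_correct:
            self._correct[key] += 1

    def weight(self, classifier, node_path):
        # Assumed rule: weight = historical accuracy for this node.
        key = (classifier, node_path)
        total = self._total.get(key, 0)
        if total == 0:
            return self._initial
        return self._correct[key] / total


window = TimeWindow(n=2)
for title in ("Report A", "Report B", "Report C"):
    window.add({"report/title": title})
print(window.values_for("report/title"))  # ['Report B', 'Report C']
```

Keeping the weights per node rather than per classifier reflects the slide's point that different classifiers may be preferred for different nodes.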

  10. Ensemble Weighted Voting Example
     • Three classifiers provide three suggestions each
     • All classifiers have the same weight initially
     • Classifier weights are modified based on their performance for different nodes in the XML domain
     • Classifier A makes three suggestions:
        • the top one receives a rank value of 3
        • the second one a rank value of 2
        • the third one a rank value of 1
     • Rank values are multiplied by the weight of the classifier and then normalized by the sum of the weights of all the classifiers
     • The suggestion with the highest score is the one selected by the ensemble and presented to the user
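The voting rule on this slide can be written out as a short sketch. The function name `weighted_vote` and the sample classifier names and values are hypothetical. One further assumption is labeled in the code: when several classifiers suggest the same value, their contributions are added together. The slide does not state this explicitly, although the demo's voting bars, which show each suggestor's contribution to a suggestion, point in that direction.

```python
from collections import defaultdict


def weighted_vote(suggestions, weights):
    """Combine ranked suggestions from several classifiers.

    suggestions: classifier name -> list of suggested values, best first.
    weights:     classifier name -> non-negative weight.
    Returns (value, score) pairs sorted by score, highest first.
    """
    total_weight = sum(weights[name] for name in suggestions)
    if total_weight <= 0:
        raise ValueError("at least one classifier needs a positive weight")

    scores = defaultdict(float)
    for name, ranked in suggestions.items():
        n = len(ranked)
        for position, value in enumerate(ranked):
            rank_value = n - position  # 3, 2, 1 for three suggestions
            # Assumption: contributions for the same value are summed.
            scores[value] += rank_value * weights[name] / total_weight

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


suggestions = {
    "A": ["x", "y", "z"],
    "B": ["y", "x", "w"],
    "C": ["y", "z", "x"],
}
equal = {"A": 1.0, "B": 1.0, "C": 1.0}
print(weighted_vote(suggestions, equal)[0][0])  # y
```

With equal weights, "y" wins with (2 + 3 + 3) / 3. If classifier A's weight is raised to 4 while B and C stay at 1, "x" overtakes it with (12 + 2 + 1) / 6 = 2.5 against 14 / 6 for "y". This shows how a classifier with a stronger track record can override the majority.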

  11. SmartXAutofill Demo
     • Node Information: displays data about the currently selected element
     • Editor for the element “Title”
     • Drop-down box containing the best suggested values
     • Suggestion Information: displays the top-ranked suggestions from each suggestor for the currently selected element
     • Voting Information: displays a bar for each possible suggestion; colored components show the contribution of each suggestor in the vote
     • Editor for the element “place”, which has two child elements, “room” and “bldg”
     • Pop-up menu for adding new elements
     • Weight Information: displays historical accuracy of each suggestor for the currently selected element

  12. Testing Approach
     • To span the size and complexity dimensions, XML document data were collected from 11 domains
        • APAIS, BioMed, CALL, iProClass, PSD, NASA, NREF, SPROT, UniProf, UWM, and WSU
        • Size ranged from around 50 to 5000 documents
        • Between 20 and 420 nodes per document
     • Document collections were randomly separated into two sets: seed (10% of the collection or 100 documents) and training collections
        • Seed collection - historical information for making predictions
        • Training collection - trained the ensemble by modifying its weights based on the accuracy of the suggestions
     • Continuously trained the learning component and tested the system
        • Documents were randomly selected and all nodes were suggested in random order
        • Documents from the training collection were added to the seed after use
     • Note: a classifier does not make a suggestion for a particular field if there is no historical data for it or if every previous value for the field was unique, e.g. abstracts of papers
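The test procedure above can be outlined as a small loop. This is a hedged sketch only: documents are assumed to be dicts from node path to value, `most_frequent` is a stand-in for the real internal classifiers (which the slides do not describe), all function names are hypothetical, and only the "10% of the collection" reading of the seed size is implemented.

```python
import random
from collections import Counter


def split_collection(docs, seed_fraction=0.10, rng=None):
    """Randomly split a collection into (seed, training) lists."""
    rng = rng or random.Random()
    shuffled = list(docs)
    rng.shuffle(shuffled)
    seed_size = max(1, int(len(shuffled) * seed_fraction))
    return shuffled[:seed_size], shuffled[seed_size:]


def should_suggest(history_values):
    """Skip fields with no history or where every past value was unique."""
    if not history_values:
        return False
    return len(set(history_values)) < len(history_values)


def most_frequent(history_values):
    """Stand-in classifier: suggest the most common historical value."""
    return Counter(history_values).most_common(1)[0][0]


def run_evaluation(seed, training, rng=None):
    """Return (suggestions attempted, suggestions that were correct)."""
    rng = rng or random.Random()
    history = list(seed)
    order = list(training)
    rng.shuffle(order)  # documents are selected in random order
    attempted = correct = 0
    for doc in order:
        paths = list(doc)
        rng.shuffle(paths)  # nodes are suggested in random order
        for path in paths:
            values = [d[path] for d in history if path in d]
            if should_suggest(values):
                attempted += 1
                if most_frequent(values) == doc[path]:
                    correct += 1
        history.append(doc)  # a used document joins the seed
    return attempted, correct
```

In the full system, the result of each suggestion would also feed back into the classifier weights; that step is left out here to keep the loop short.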

  13. Test Results for Different Domains

  14. Weights of selected XML nodes from iProClass domain

  15. Test Result Discussion
     • Different classification algorithms perform better for different domains
     • The ensemble classifier performed at least as well as the best performing internal classification algorithm for a domain
     • Different classifiers are preferred for different nodes

  16. Our Technology - SmartXAutofill
     • First methodology proven to intelligently predict, suggest and autofill data for XML documents
     • Learns and adapts itself to any XML domain without the need for custom algorithms
     • “Time window” allows the technology to adapt itself to the particular set of XML documents being filled at that time
     • Speeds up the data entry process for XML documents from 20% to 99%
