
Classifying Spend Descriptions with off-the-shelf Learning Components

Saikat Mukherjee, Dmitriy Fradkin, Michael Roth. Integrated Data Systems Department, Siemens Corporate Research, Princeton, NJ.


Presentation Transcript


1. Saikat Mukherjee, Dmitriy Fradkin, Michael Roth
   Integrated Data Systems Department, Siemens Corporate Research, Princeton, NJ
   Classifying Spend Descriptions with off-the-shelf Learning Components

2. Spend Analytics
  • Organizations are engaged in direct and indirect procurement or spend activities
  • Indirect spend is diffuse and cuts across the spectrum of goods and services:
    • Proliferation of suppliers for similar goods and services
    • No purchasing synergies between units
    • Failure to stratify procurement
    • Inability to reach bargaining deals with suppliers
  • Hence, it is important for organizations to integrate spend activities
  • Integrating spend transactions involves associating each transaction with a hierarchy of commodity codes such as ESN or UNSPSC
  • Manually associating transactions with commodity codes is not scalable:
    • Large number of transactions
    • Large number of codes in any commodity scheme
  • Focus: automated techniques for spend classification to commodity codes

3. ESN Commodity Scheme
  • ESN is a 3-level hierarchy with increasing specialization down the hierarchy
  • Each code represents a particular class of product or service
  • In all, 2185 classes across 3 levels
  • Example: the description "10 KV UPS Battery" falls under M (Electrical Products) → MB (Power Supplies) → MBL (UPS Systems)
  • Challenge: automatically classifying transaction descriptions to ESN codes

4. Challenges in Spend Classification
  • Hierarchical text categorization: commodity codes are hierarchical systems (e.g. ESN)
  • Sparse text descriptions: most transaction descriptions have fewer than 5 words
  • Erroneous features in descriptions:
    • Spelling errors
    • Merge errors: different words are joined together into a single word (e.g. "MicrosoftOffice" instead of "Microsoft Office")
    • Split errors: a single word is broken up into multiple words (e.g. "lap top" instead of "laptop")
  • Descriptions in multiple languages, which makes it difficult to apply linguistic techniques
  • Large volume of transactions:
    • could easily be around 0.5 million transactions per month
    • makes classifier training computationally challenging
  • Periodic retraining of classifiers:
    • commodity coding structures undergo revisions
    • new samples, with different descriptions, are continually added

5. Classifier: Feature Representation
  • Term vector representation of samples
  • 2 representation schemes:
    • Binary: a term has weight 1 if it occurs in the sample, else weight 0
    • Weighted: a variation of TF-IDF that uses the hierarchical structure of the commodity scheme (see the code sketch after this slide)
  • weight(f, n) = weight of feature f at node n = (Nfn / Nf) x (1 + log(Nn / Nfn)), where
    • Nfn = number of leaf nodes in the sub-tree rooted at n that have at least one positive sample with feature f
    • Nf = total number of leaf nodes in the entire ESN tree that have at least one positive sample with feature f
    • Nn = total number of leaf nodes in the sub-trees rooted at n and its sibling nodes
  • Nfn / Nf -> relative importance of the sub-tree at n compared to the rest of the ESN tree
  • 1 + log(Nn / Nfn) -> penalizes features which occur in more leaf nodes under n
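A minimal Python sketch of this hierarchical weighting, assuming a toy tree representation; the Node class, its leaves() helper, and the per-leaf feature sets are illustrative stand-ins, not the authors' Java implementation:

```python
from math import log


class Node:
    def __init__(self, code, children=None, features=None):
        self.code = code
        self.children = children or []
        self.features = features or set()   # features of positive samples (leaves only)
        self.parent = None
        for c in self.children:
            c.parent = self

    def leaves(self):
        """All leaf nodes in the sub-tree rooted at this node."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


def weight(f, n, root):
    """weight(f, n) = (Nfn / Nf) * (1 + log(Nn / Nfn)), as defined on slide 5."""
    # Nfn: leaves under n with at least one positive sample containing feature f
    nfn = sum(1 for leaf in n.leaves() if f in leaf.features)
    if nfn == 0:
        return 0.0
    # Nf: leaves in the entire tree with at least one positive sample containing f
    nf = sum(1 for leaf in root.leaves() if f in leaf.features)
    # Nn: leaves in the sub-trees rooted at n and its siblings (i.e. under n's parent)
    scope = n.parent if n.parent is not None else root
    nn = len(scope.leaves())
    return (nfn / nf) * (1 + log(nn / nfn))
```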

6. Classifier: Methods
  • (A) Support Vector Machine classifiers (LIBSVM implementation)
    • binary classifier at each node (see the training sketch after this slide)
    • a multiclass approach is not feasible due to memory and time constraints
    • positive samples = all samples in the sub-tree
    • negative samples = all samples in the sibling sub-trees
    • C value = 1 (default)
    • SVM: linear support vector machines with binary features
    • SVM-W: linear support vector machines with weighted features
    • SVM-WB: linear support vector machines with weighted features and weight balancing
  • (B) Logistic Regression classifiers
    • binary classifiers as well as a multi-class classifier
    • default parameter settings
    • BBR-L1: Bayesian binary regression with Laplace prior (lasso regression)
    • BBR-L2: Bayesian binary regression with Gaussian prior (ridge regression)
    • BBR-L2W: Bayesian binary regression with Gaussian prior and weighted features
    • BMR-L1: multi-class version of BBR-L1
    • BMR-L2: multi-class version of BBR-L2
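A sketch of the per-node binary setup, using scikit-learn's LinearSVC as a stand-in for the LIBSVM linear SVM named on the slide; gathering positive descriptions (all samples in the node's sub-tree) and negative descriptions (all samples in its sibling sub-trees) is assumed to have been done by the caller:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def train_node_classifier(positive_texts, negative_texts):
    """Train the binary classifier for one ESN node."""
    texts = list(positive_texts) + list(negative_texts)
    labels = [1] * len(positive_texts) + [0] * len(negative_texts)
    vectorizer = CountVectorizer(binary=True)   # binary term features (the plain "SVM" run)
    X = vectorizer.fit_transform(texts)
    clf = LinearSVC(C=1.0)                      # C = 1, the default used on the slide
    clf.fit(X, labels)
    return vectorizer, clf


# Example (names hypothetical): the node "MB - Power Supplies" vs. its siblings under "M".
# vec, clf = train_node_classifier(mb_descriptions, sibling_descriptions)
```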

7. Classifier: Data for Experiments
  • ESN hierarchy:
    • 2185 nodes over all 3 levels
    • 17 first-level nodes, 192 second-level nodes, 1976 leaves
  • Training set size = 85,953 samples from 418 leaf nodes
  • Test set size = 42,742 samples from 380 leaf nodes
  • Feature space size from the training set = 69,429 terms
  • Evaluate precision-recall breakpoints at different levels of the tree

8. PRB Results (select first-level nodes). BBR-L1 performs best among the different classification methods at the top level.

9. PRB Results (select second-level nodes). BBR-L1 and BBR-L2W are competitive among the different classification methods at the second level.

10. Overall accuracy at different levels. BMR-L1 turns out to be the best classifier in overall leaf-level accuracy.

11. Feature Correction
  • Correct typos, merge errors, and split errors
  • Noisy channel model: P(O, C) = P(O|C) x P(C)
    • the intended sequence of characters C is generated with probability P(C) and, due to the noisy channel, converted into the observed sequence of characters O with probability P(O|C)
  • Source model P(C):
    • smoothed frequency counting of 5-gram character sequences
    • CMU-Cambridge Language Modeling toolkit used to create the 5-gram models
  • Channel model P(O|C):
    • semi-automatically created training set (candidate mining is sketched after this slide)
    • Typos: wi and wj are paired if both start with the same character and their normalized edit distance is less than 0.8
    • Split errors: bi-gram (wi, wj) and term wk such that wk is the concatenation of wi and wj and occurs 10 times more often than the bi-gram
    • Merge errors: split a word at all character positions and check if the resulting bi-gram occurs more frequently than the original word
    • These candidate training examples were manually verified
    • 717 unique training pairs
    • 159 unique test cases
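The candidate-mining rules above can be sketched as follows; the thresholds come from the slide, while the function names and the unigram/bi-gram count structures are assumptions rather than the authors' code:

```python
from itertools import combinations


def normalized_edit_distance(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / max(m, n)


def typo_candidates(vocab):
    """Pair words that start with the same character and are close in edit distance."""
    return [(wi, wj) for wi, wj in combinations(vocab, 2)
            if wi[0] == wj[0] and normalized_edit_distance(wi, wj) < 0.8]


def split_candidates(unigram_counts, bigram_counts):
    """Bi-grams (wi, wj) whose concatenation occurs at least 10x more often than the bi-gram."""
    out = []
    for (wi, wj), c in bigram_counts.items():
        wk = wi + wj
        if unigram_counts.get(wk, 0) >= 10 * c:
            out.append(((wi, wj), wk))
    return out


def merge_candidates(word, unigram_counts, bigram_counts):
    """Split `word` at every position; keep splits whose bi-gram is more frequent than the word."""
    return [(word[:i], word[i:]) for i in range(1, len(word))
            if bigram_counts.get((word[:i], word[i:]), 0) > unigram_counts.get(word, 0)]
```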

12. Feature Correction: Results
  • Testing: given a test sample, generate all possible 1-character correction variations, score each variation using the source model P(C) over the 5-gram character sequences in the sample, and combine that score with the channel model P(O|C) (sketched after this slide)
  • 1C-T = 1-character corrections, considering each word and bi-gram in the test sample separately
  • 1C-S = 1-character corrections, considering the whole test sample
  • 2C-T = 2-character corrections, considering each word and bi-gram in the test sample separately
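A hedged sketch of this test-time procedure, assuming placeholder interfaces for the source and channel models (the actual source model was built with the CMU-Cambridge toolkit, which is not shown here):

```python
import string


def one_char_variations(word):
    """All strings within one character edit (delete, substitute, insert) of `word`."""
    letters = string.ascii_lowercase
    deletes = [word[:i] + word[i + 1:] for i in range(len(word))]
    substitutions = [word[:i] + c + word[i + 1:] for i in range(len(word)) for c in letters]
    inserts = [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in letters]
    return set(deletes + substitutions + inserts + [word])


def best_correction(observed, source_model, channel_model):
    """Pick the candidate C maximizing P(C) * P(observed | C).

    `source_model.prob(c)` and `channel_model.prob(observed, c)` are assumed
    interfaces standing in for the 5-gram source model and the learned channel model.
    """
    candidates = one_char_variations(observed)
    return max(candidates,
               key=lambda c: source_model.prob(c) * channel_model.prob(observed, c))
```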

13. System
  • Java-based system for classifying spend descriptions to ESN codes
  • Users can browse training samples and ESN descriptions
  • Backend Derby database which stores the training data
  • Ability to train hierarchical SVM and BMR classifiers through the system
  • Test in batch mode by loading a database of samples
  • Corrected results from users can be fed back to the system as training data

14. Related Work
  • There are many formulations for combining binary classifiers into a multiclass classifier: error-correcting codes [Dietterich & Bakiri '95], 1-vs-each and 1-vs-all [Friedman '96; Weston & Watkins '98; Crammer & Singer '00].
  • There are several approaches to hierarchical classification:
    • build a single "flat" classifier that maps an instance directly to a leaf; or
    • build a hierarchical classifier by constructing a model at every node; or
    • use a method developed specifically for hierarchical classification.
  • The 1st approach ignores the hierarchical structure of the data and usually leads to worse results than the 2nd [Dumais & Chen '00; Koller & Sahami '97].
  • Some recent work [Cai & Hofmann '04; Rousu et al. '05] has focused on the 3rd approach. Both involve new formulations of SVM that take the hierarchical structure of the classes into account. While the results are encouraging, usable implementations are currently not available.
  • The noisy channel framework has previously been explored in computational linguistics [Brill & Moore '00; Kolak & Resnik '05].

15. Discussion
  • We have described how off-the-shelf learning tools can be applied to automated spend classification:
    • experimental results with SVM and BMR classifiers;
    • a noisy channel framework for feature correction.
  • Incremental algorithms are the only way to reliably handle frequent retraining and increasingly large datasets:
    • for SVM: [Cauwenberghs & Poggio '00; Laskov et al. '06]
    • for BBR: [Balakrishnan & Madigan '08]
    • however, their accuracy tends to be lower than that of batch methods, and off-the-shelf implementations are not readily available.
  • Improvements in accuracy can be achieved by careful selection of classifier parameters, but such tuning can only be performed if the classifiers are extremely fast and scalable.
  • Additional information such as supplier names and purchase volumes could be used when available, especially in standardized forms such as Dun & Bradstreet codes, which could then be mapped to product types and commodity codes.
  • Our feature correction techniques are currently language agnostic; they could be improved if transactions were geographically localized and linguistic cues of the corresponding languages incorporated.
