Ranked Recall: Efficient Classification by Learning Indices That Rank Omid Madani with Michael Connor (UIUC)
Many Category Learning (e.g. Y! Directory)
(Slide shows a fragment of the topic hierarchy: Arts&Humanities with Photography and History; Business&Economy with Contests and Magazines; Recreation&Sports with Sports, Amateur, College, Basketball; Education; ...)
Over 100,000 categories in the Yahoo! directory. Given a page, quickly categorize… Larger for vision, text prediction, ... (millions and beyond)
Supervised Learning • Often two phases: • Training • Execution/Testing
(Slide shows a small table of training instances x1, x2, x3 with sparse feature values and their class labels, a learnt classifier f (categorizer), and an unseen instance whose class f must predict.)
• Often learn binary classifiers
Massive Learning • Lots of ... • Instances (millions, unbounded..) • Dimensions (1000s and beyond) • Categories (1000s and beyond) • Two questions: • How to quickly categorize? • How to efficiently learn to categorize efficiently?
Efficiency • Two phases (combined when online): • Learning • Classification time/deployment • Resource requirements: • Memory • Time • Sample efficiency
Idea • Cues in the input may quickly narrow down the possibilities => "index" the categories • Like a search engine, but learn a good index • Goal: learn to strike a good balance between accuracy and efficiency
Summary Findings • Very fast: • Train time: minutes versus hours/days (compared against one-versus-rest and top-down) • Classification time: O(|x|)? • Memory efficient • Simple to use (runs on laptop..) • Competitive accuracy!
Input-Output Summary
Input: a tripartite graph (instances, features, categories)
Output (learned): an index = a sparse weighted directed bipartite graph from features to categories (a sparse matrix)
Scheme • Learn a weighted bipartite graph • Rank the categories retrieved • For category assignment, could use rank, or define thresholds, or map scores to probabilities, etc.
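As a concrete (hypothetical) picture of this scheme, the index can be held as a sparse mapping from each feature to a small set of weighted category edges. The Python sketch below is an illustration, not the authors' code; the names and the out-degree constant are assumptions (25 is the value used later in the smaller-domain experiments).

from collections import defaultdict

# The learned index: a sparse weighted bipartite graph stored as a mapping
# from each feature to its outgoing edges (category -> non-negative weight).
MAX_OUT_DEGREE = 25          # cap on every feature's out-degree (tunable)

index = defaultdict(dict)    # index[feature][category] = weight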
Three Parts to the Online Solution • How to use the index? • How to update (learn) it? • When to update it?
Retrieval (Ranked Recall)
(Slide shows features f1–f4 connected by weighted edges to categories c1–c5.)
1. Features are "activated"
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked
Notes: (1) like the use of inverted indices; (2) sparse dot products.
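A minimal retrieval sketch over the index structure above, assuming instances are sparse dicts of feature values (function and variable names are illustrative):

from collections import defaultdict

def retrieve(index, instance, top_k=5):
    """Ranked recall: instance is a sparse vector {feature: value}."""
    scores = defaultdict(float)
    for feature, value in instance.items():                      # 1. features activated
        for category, weight in index.get(feature, {}).items():  # 2. edges activated
            scores[category] += value * weight                   # 3. categories score up
    # 4. sort/rank the retrieved categories (one sparse dot product per category)
    return sorted(scores.items(), key=lambda cw: -cw[1])[:top_k]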
Computing the Index • Efficiency: Impose a constraint on every feature's maximum out-degree • Accuracy: Connect and compute weights so that some measure of accuracy is maximized
Measure of Accuracy: Recall • Measure average performance per instance • Recall: The proportion of instances for which the right category ended up in the top k • Recall at k = 1 (R1), 5 (R5), 10, … • R1 = "accuracy" in the multiclass (single-label) case
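Recall at k can then be measured by averaging over held-out instances. A short sketch, assuming one true category per instance and the retrieve function from the earlier sketch:

def recall_at_k(index, instances, labels, k=1):
    """Fraction of instances whose true category appears in the top k."""
    hits = 0
    for instance, true_category in zip(instances, labels):
        top = [c for c, _ in retrieve(index, instance, top_k=k)]
        hits += true_category in top
    return hits / len(instances)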
Computational Complexity • NP-Hard! • The problem: given a finite set of instances (Boolean features), exactly one category per instance, is there an index with max out-degree 1, such that R1 on training set is greater than a threshold t ? • Reduction from set cover • Approximation? (not known)
How About Practice? • Devised two main learning algorithms: • IND treats features independently. • Feature Normalize (FN) doesn’t make an independence assumption; it’s online. • Only non-negative weights are learned.
Feature Normalize (FN) Algorithm • Begin with an empty index • Repeat • Input an instance (features + categories), and retrieve and rank candidate categories • If the margin is not met, update the index
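The outer loop might look roughly as follows; margin_met and update are hypothetical helpers sketched on the later slides, and the constants are placeholders rather than the authors' settings:

from collections import defaultdict

def train_fn(stream, margin_threshold=0.1, passes=1):
    """Online FN-style loop: retrieve, check the margin, update if needed.
    stream: iterable of (instance, true_category) pairs."""
    index = defaultdict(dict)                 # begin with an empty index
    for _ in range(passes):
        for instance, true_category in stream:
            ranked = retrieve(index, instance, top_k=10)
            if not margin_met(ranked, true_category, margin_threshold):
                update(index, instance, true_category)
    return index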
Three Parts (Online Setting) • How to use the index? • How to update it? • When to update it?
Index Updating • For each active feature: • Strengthen weights between active feature and true category • Weaken the other connections to the feature • Strengthening = Increase weight by addition or multiplication
Updating
(Slide shows the same feature–category graph, f1–f4 and c1–c5.)
1. Identify the connection (active feature to true category)
2. Increase its weight
3. Normalize/weaken the other weights
4. Drop small weights
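A possible rendering of this update in the running sketch; the additive strengthening step, the constants, and the exact pruning order are assumptions, not the precise FN rule:

MIN_WEIGHT = 0.01        # minimum allowed weight (value used in the experiments)
LEARNING_RATE = 0.1      # additive strengthening step (an assumed constant)

def update(index, instance, true_category):
    """One FN-style update: strengthen, normalize per feature, prune."""
    for feature in instance:
        edges = index[feature]
        # 1-2. Identify the edge to the true category and increase its weight.
        edges[true_category] = edges.get(true_category, 0.0) + LEARNING_RATE
        # 3. Normalize from the feature's side; this weakens the other edges
        #    (no explicit demotion step is needed).
        total = sum(edges.values())
        for category in list(edges):
            edges[category] /= total
            # 4. Drop small weights to keep the index sparse.
            if edges[category] < MIN_WEIGHT:
                del edges[category]
        # Enforce the feature's maximum out-degree (MAX_OUT_DEGREE from above).
        while len(edges) > MAX_OUT_DEGREE:
            weakest = min(edges, key=edges.get)
            del edges[weakest]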
Three Parts • How to use an index? • How to update it? • When to update it?
A Tradeoff • To achieve stability (which helps accuracy), we need to keep updating (think of the single-feature scenario) • To "fit" more instances, we need to stop updates on instances that we get "right" • Use of a margin threshold strikes a balance.
Margin Definition • Margin = score of the true positive category MINUS score of the highest-ranked negative category • Choice of margin threshold: • Fixed, e.g. 0, 0.1, 0.5, … • Online average (e.g., average of the last 10,000 margins + 0.1)
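The margin test, with the fixed and online-average threshold variants from this slide (the window size and +0.1 offset follow the slide's example; everything else is an assumption of the sketch):

from collections import deque

recent_margins = deque(maxlen=10000)   # window for the online-average threshold

def margin_met(ranked, true_category, fixed_threshold=None):
    """Margin = score of the true category minus the best negative score."""
    scores = dict(ranked)
    true_score = scores.get(true_category, 0.0)      # 0 if it was not retrieved
    best_negative = max((s for c, s in ranked if c != true_category), default=0.0)
    margin = true_score - best_negative
    recent_margins.append(margin)
    if fixed_threshold is not None:                   # e.g. 0, 0.1, or 0.5
        return margin >= fixed_threshold
    # Online-average variant: average of the last 10,000 margins + 0.1.
    return margin >= sum(recent_margins) / len(recent_margins) + 0.1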
Salient Aspects of FN • "Differentially" updates, attempting to improve the retrieved ranking (in "context") • Normalizes, but from the "feature's side" • No explicit weight demotion/punishment! (normalization/weakening achieves demotion/reordering) • Memory/efficiency-conscious design from the outset • Very dynamic/adaptive: • edges added and dropped • weights adjusted, categories reordered • Extensions/variations exist (e.g. each feature's out-degree may adjust dynamically)
Domain statistics (|C| is the number of classes, L is the average vector length, Cavg is the average number of categories per instance):

Domain          # instances   # features   |C|     L      Cavg
Reuters 21578   9.4k          33k          10      80.9   1
20 Newsgroups   20k           60k          20      80     1
Industry        9.6k          69k          104     120    1
Reuters RCV1    23k           47k          414     76     2.08
Ads             369k          301k         12.6k   27     1.4
Web             70k           685k         14k     210    1
Jane Austen     749k          299k         17.4k   15.1   1

• Experiments are the average of 10 runs; each run is a single pass, with 90% for training and 10% held out.
Three Smaller Domains (results figures omitted)
• 10 categories, ~10k instances; 20 categories, 20k instances; 104 categories, ~10k instances
• Compared against a fast linear SVM (Keerthi and DeCoste, 2006)
• Max out-degree = 25, min allowed weight = 0.01
• Tested with margins 0, 0.1, and 0.5 and up to 10 passes
• 90-10 random splits
3 Large Data Sets (top-down comparisons; results figures omitted)
• ~500 categories, ~20k instances
• ~12.6k categories, ~370k instances
• ~14k categories, ~70k instances
Accuracy vs. Max Out-Degree
(Plots for RCV1, Ads, and Web page categorization; x-axis: max out-degree allowed, y-axis: accuracy.)
Accuracy vs. Passes and Margin
(Plot; x-axis: # of passes, y-axis: accuracy.)
Related Work and Discussion • Multiclass learning/categorization algorithms (top-down, nearest neighbors, perceptron, Naïve Bayes, MaxEnt, SVMs, online methods, ..), • Speed up methods (trees, indices, …) • Feature selection/reduction • Evaluation criteria • Fast categorization in the natural world • Prediction games! (see poster)
Summary • A scalable supervised learning method for huge class sets (and instances,..) • Idea: learn an index (a sparse weighted bipartite graph, mapping features to categories) • Online time/memory efficient algorithms • Current/future: more algorithms, theory, other domains/applications, ..