Exploring Massive Learning via a Prediction System

Omid Madani, Yahoo! Research, www.omadani.net

Goal Convey a taste of the: • motivations/considerations/assumptions/speculations/hopes,… • The game, a 1st system, and its algorithms

Talk Overview • Motivational part • The approach: • The game (categories, …) • Algorithms • Some experiments

Fill in the Blank(s)! Would ---- you like ------ ------- ----- ------ ? your coffee with sugar

What is this object?

Categorization is Fundamental! “Well, categorization is one of the most basic functions of living creatures. We live in a categorized world – table, chair, male, female, democracy, monarchy – every object and event is unique, but we act towards them as members of classes.” From an interview with Eleanor Rosch (Psychologist, a pioneer on the phenomenon of “basic level” concepts) “Concepts are the glue that holds our mental world together.” From “The Big Book of Concepts”, Gregory Murphy

“Rather, the formation and use of categories is the stuff of experience.” Philosophy in the Flesh, Lakoff and Johnson.

Two Questions Arise [Figure: classification system?] Repeated and rapid classification… • … in the presence of myriad classes • In the presence of myriad categories: • How to categorize efficiently? • How to efficiently learn to categorize efficiently?

Now, a 3rd Question .. • How can so many inter-related categories be acquired? • Programming them unlikely to be successful/scale: • Limits of our explicit/conscious knowledge • Unknown/unfamiliar domains • The required scale.. • Making the system operational..

Learn? … How? • “Supervised” learning (explicit human involvement) likely inadequate: • Required scale, or a good sign post: • ~millions of categories and beyond.. • Billions of weights, and beyond.. • Inaccessible “knowledge” (see last slide!) • Other approaches likely do not meet the needs (incomplete, different goals, etc): active learning, semi-supervised learning, clustering, density learning, RL, etc..

Desiderata/Requirements (or Speculations) • Higher intelligence, such as “advanced” pattern recognition/generation (e.g. vision), may require • Long term learning (weeks, months, years,…) • Cumulative learning (learn these first, then these, then these,…) • Massive Learning: Myriad inter-related categories/concepts • Systems learning • Autonomy (relatively little human involvement) What’s the learning task?

This Work: An Exploration • An avenue: “prediction games in infinitely rich worlds” • Exciting part: • World provides unbounded learning opportunity! (world is the validator, the system is the experimenter!.. and actively builds much of its own concepts) • World enjoys many regularities (e.g. “hierarchical”) • Based in part on “supervised” techniques!! (“discriminative”, “feedback driven”, supervisory signal doesn’t originate from humans )

In a Nutshell [Figure: a prediction system reads a stream (…. 0011101110000….) and repeatedly predicts, observes & updates. It starts from low level or “hard-wired” categories (Text: characters, .. Vision: edges, curves,…). After a while (much learning) it also holds higher level categories (bigger chunks) (e.g. words, digits, phrases, phone numbers, faces, visual objects, home pages, sites,…).]

The Game • Repeat • Hide part(s) of the stream • Predict (use context) • Update • Move on • Objective: predict better ... subject to efficiency constraints • In the process: categories at different levels of size and abstraction should be learned
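
The loop on this slide can be sketched as follows. This is a minimal illustration, not the system from the talk: it assumes a character stream, a one-character left context, and a plain count table as the predictor. The name `play_game` and its parameters are hypothetical.

```python
from collections import defaultdict

def play_game(stream, window=1):
    """One pass of the game: hide each symbol, predict it from context, update."""
    # counts[context][target] = how often `target` followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    correct = 0
    episodes = 0
    for t in range(window, len(stream)):
        context = stream[t - window:t]   # visible part of the stream
        target = stream[t]               # hidden part
        seen = counts[context]
        # Predict: most frequent continuation seen so far (None if unseen).
        guess = max(seen, key=seen.get) if seen else None
        correct += (guess == target)
        episodes += 1
        # Update: observe the hidden part and strengthen the association.
        seen[target] += 1                # then move on to the next position
    return correct, episodes

correct, episodes = play_game("abababababab")
```

On this toy stream the first two predictions fail because nothing has been learned yet, and every later one succeeds.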

Research Goals • Conjecture: There is much value to be attained from this task • Beyond language modeling: more advanced pattern recognition/generation • If so, should yield a wealth of new problems (=> Fun)

Overview • Goal: Convey a taste of the motivations/considerations, the system and algorithms,.. • Motivation • The approach: • The game (categories, …) • Algorithms • Some experiments

Upshot • Takes streams of text • Makes categories (strings) • Approx. three hours on 800k documents • Large-scale discriminative learning (evidence: better than language modeling)

Caveat Emptor! • Exploratory research • Many open problems (many I’m not aware of … ) • Chosen algorithms, system org, or objective/performance measures, etc., etc… are likely not even near the best possible

Categories • Building blocks (atoms!) of intelligence? • Patterns that frequently occur • External • Internal.. • Useful for predicting other categories! • They can have structure/regularities, in particular: • Composition (~conjunctions) of other categories (Part-Of) • Grouping (~disjunctions)(Is-A relations)

Categories • Low level “primitive” examples: 0 and 1 or characters (“a”, “b”, .. ,“0”, “-”,..) • Provided to the system (easy to detect) • Higher/composite levels: • Sequence of bits/characters • Words • Phrases • More general: Phone number, contact info, resume, ...

Example Concept • Area code is a concept that involves both composition and grouping: • Composition of 3 digits • A digit is a grouping, i.e., the set {0,1,2,…,9} ( 2 is a digit ) • Other example concepts: phone number, address, resume page, face (in visual domain), etc.
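
The area-code example can be written down directly. This is a hand-coded sketch for illustration only (the system is meant to learn such concepts, not be given them); `DIGIT` and `is_area_code` are hypothetical names.

```python
# Grouping (Is-A): "2" is a digit because it is a member of this set.
DIGIT = set("0123456789")

def is_area_code(s):
    """Composition (Part-Of): exactly three parts, each one a digit."""
    return len(s) == 3 and all(ch in DIGIT for ch in s)
```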

Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.

Questions for a First System • Functionality? Architecture? Org? • Would many-class learning scale to millions of concepts? • Choice of concept building methods? • How would various learning processes interact?

Expedition: a First System • Plays the game in text • Begins at character level • No segmentation, just a stream • Makes and predicts larger sequences, via composition • No grouping yet

Learning Episodes [Figure: a window over the stream “… New Jersey in …” containing context and target; the predictors (active categories) surround the target (category to predict), and the window moves on at the next time step.] In this example, context contains one category on each side
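
A learning episode can be pictured as a sliding window over a sequence of categories. The sketch below assumes the stream has already been segmented into categories (the talk does not say how this is done) and uses the hypothetical name `episodes`; the segmentation in the example is invented for illustration.

```python
def episodes(categories, side=1):
    """Yield (left context, target, right context) for each window position."""
    for i in range(side, len(categories) - side):
        left = categories[i - side:i]
        right = categories[i + 1:i + 1 + side]
        yield left, categories[i], right

# Hypothetical segmentation, with one context category on each side.
windows = list(episodes(["loves ", "New ", "York ", "life"]))
```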

.. Some Time Later .. [Figure: the window now covers “… loves New York life …”, with the predictors around the target (category to predict).] • In terms of supervised learning/classification, in this learning activity (prediction games): • The set of concepts grows over time • Same for features/predictors (concepts ARE the predictors!) • Instance representation (segmentation of the data stream) changes/grows over time ..

Prediction/Recall [Figure: bipartite graph with edges from features f1–f4 to categories c1–c5.] 1. Features are “activated” 2. Edges are activated 3. Receiving categories are activated 4. Categories sorted/ranked • Like use of inverted indices • Sparse dot products
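
The four steps above amount to a sparse dot product over an inverted index. A minimal sketch, assuming each feature stores its outgoing edges as a dictionary of category weights; `predict`, `edges`, and `top_k` are hypothetical names, and the weights are toy values.

```python
from collections import defaultdict

def predict(active_features, edges, top_k=3):
    """Rank categories by the summed weight they receive from active features."""
    scores = defaultdict(float)
    for f, activation in active_features.items():   # 1. features are activated
        for c, w in edges.get(f, {}).items():       # 2. their edges are activated
            scores[c] += activation * w             # 3. categories receive score
    # 4. sort by score (ties broken by name so the result is deterministic)
    ranked = sorted(scores.items(), key=lambda cw: (-cw[1], cw[0]))
    return ranked[:top_k]

edges = {"f1": {"c1": 0.5, "c2": 0.25}, "f2": {"c2": 0.5}, "f3": {"c3": 0.125}}
ranking = predict({"f1": 1.0, "f2": 1.0}, edges)
```

Only edges of active features are touched, so the cost depends on the out-degree of those features rather than on the total number of categories.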

Updating a Feature’s Connections [Figure: features f1–f4 connected to categories c1–c5; label: Kronecker delta.] 1. Identify connection 2. Increase weight 3. Normalize/weaken weights 4. Drop tiny weights • Degrees are constrained
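
One way to realize these update steps is an exponential moving average toward the observed category, w ← (1 − r)·w + r·δ, where δ is 1 for the observed category and 0 otherwise. This specific rule, the function name `update_feature`, and the values of `rate` and `min_weight` are assumptions for illustration; the slide does not give the exact formula.

```python
def update_feature(weights, target, rate=0.25, min_weight=0.05):
    """Return one feature's outgoing weights after observing `target`."""
    # Weaken every existing weight (normalization keeps the total at most 1).
    updated = {c: (1 - rate) * w for c, w in weights.items()}
    # Identify the connection to the observed category and increase it.
    updated[target] = updated.get(target, 0.0) + rate
    # Drop tiny weights, which keeps the feature's out-degree bounded.
    return {c: w for c, w in updated.items() if w >= min_weight}

new_weights = update_feature({"c1": 0.8, "c2": 0.04}, "c1")
```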

Example Category Node (from Jane Austen’s) [Figure: a category node with the categories appearing before it (“and ”, “nei”, “heart”, “toge”, “ther ”, “ far”, “love ”, “ bro”, “by ”) and prediction weights; values shown: 7.1, 0.41, 0.087, 0.13, 0.07, 0.11, 0.057, 0.10, 0.052. The node keeps local statistics.] A category node keeps track of various weights, such as edge (or prediction) weights, and predictiveness weights, and other statistics (e.g. frequency, first/last time seen), and updates them when it is activated as a predictor or target..

Network • Categories and their edges form a network (a directed weighted graph, with different kinds of edges ... ) • The network grows over time: millions of nodes and beyond

When and How to Compose? • Two major approaches: • Pre-filter: don’t compose if certain conditions are not met (simplest: only consider possibilities that you see) • Post-filter: compose and use, but remove if certain conditions are not met (e.g. if not seen recently enough, remove) • I expect both are needed …
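
The post-filter example (remove a category if it has not been seen recently enough) can be sketched as below. It assumes each category records the time point at which it was last seen; `post_filter` and `max_age` are hypothetical names, and the default age is only a placeholder.

```python
def post_filter(last_seen, now, max_age=1_000_000):
    """Keep only the categories seen within the last `max_age` time points."""
    return {c: t for c, t in last_seen.items() if now - t <= max_age}

kept = post_filter({"the ": 990, "bureoofi": 10}, now=1000, max_age=100)
```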

Some Composition (Prefilter) Heuristics • FRAC: If you see c1 then c2 in the stream, then, with some probability, add c=c1c2 • MU: use the pointwise mutual information between c1 and c2 • IMPROVE: take string lengths into account and see whether joining is better • BOUND: Generate all strings under length Lt.
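
Two of these heuristics are simple enough to sketch. The code assumes categories are strings and that occurrence counts are available; the function names, the default probability, and the use of base-2 logarithms are illustrative choices rather than details from the talk.

```python
import math
import random

def frac_compose(c1, c2, known, prob=0.1, rng=random):
    """FRAC: on seeing c1 then c2, add the composite c1c2 with probability `prob`."""
    if rng.random() < prob:
        known.add(c1 + c2)

def pmi(count_pair, count_c1, count_c2, total):
    """MU: pointwise mutual information, log2(p(c1, c2) / (p(c1) * p(c2)))."""
    p_pair = count_pair / total
    p_c1 = count_c1 / total
    p_c2 = count_c2 / total
    return math.log2(p_pair / (p_c1 * p_c2))

known = set()
frac_compose("New ", "York", known, prob=1.0)   # prob=1.0 always composes
```

A large positive PMI means c1 and c2 co-occur far more often than independence would predict, which makes the pair a candidate for composition.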

Prediction Objective • Desirable: learn higher level categories (bigger/abstract categories are useful externally) • Question: how does this relate to improving predictions? • Higher level categories improve “context” and can save memory • Bigger, save time in playing the game (categories are atomic)

Objective (evaluation criterion) • The Matching Performance: Number of bits (characters) correctly predicted per unit time or per prediction action • Subject to constraints (space, time,..) • How about entropy/perplexity? Categories are structured, so perplexity seems difficult to use..
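
A sketch of the matching performance per prediction action. It assumes a prediction earns credit only when it exactly matches the target category, in which case all of the target's characters count; the talk does not say whether partial matches are credited, so this scoring rule and the name `matching_performance` are assumptions.

```python
def matching_performance(predictions, targets):
    """Average number of characters correctly predicted per prediction action."""
    matched = sum(len(t) for p, t in zip(predictions, targets) if p == t)
    return matched / len(targets)

score = matching_performance(["the ", "cat", "of the "], ["the ", "dog", "of the "])
```

Under this rule, longer categories earn more per correct prediction, which is how the objective rewards learning bigger chunks.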

Linearity and Non-Linearity (a motivation for new concept creation) [Figure: aggregating the votes of “n”, “e”, and “w” to predict what comes next, versus using the single category “new”.] Which one predicts better? (better constrains what comes next)

Data • Reuters RCV1 800k news articles • Several online books of Jane Austen, etc. • Web search query logs

Some Observations • Ran on Reuters RCV1 (text body) ( simply zcat dir/file* ) • ~800k articles • >= 150 million learning/prediction episodes • Over 10 million categories built • 3-4 hours each pass (depends on parameters)

Observations • Performance on held-out data (one of the Reuters files): • 8-9 characters long to predict on average • Almost two characters correct on average, per prediction action • Can overfit/memorize! (long categories) • Current: stop category generation after first pass

Some Example Categories(in order of first time appearance and increasing length) cat name= "<" cat name= " t" cat name= ".</" cat name= "p>- " cat name= " the " cat name= "ation " cat name= "of the " cat name= "ing the " cat name= ""The " cat name= "company said " cat name= ", the company " cat name= "said on Tuesday" cat name= " said on Tuesday" cat name= "," said one " cat name= "," he said. cat name= "--------------------------------" cat name= "--------------------------------------------------------" cat name= "--------------------------------------------------------------- cat name= ". Reuters has not verified these stories and does not vouch for their accuracy. cat name= "press on Tuesday. Reuters has not verified these stories and does not vouch for their accuracy. cat name= "press on Thursday. Reuters has not verified these stories and does not vouch for their accuracy. cat name= "press on Wednesday. Reuters has not verified these stories and does not vouch for their accuracy. cat name= "within 10 percentage points in either direction of the key 225-share Nikkei average over the next six month" cat name= "ing and selling rates for leading world currencies and gold against the dollar on the London foreign exchange and bullion "

Example “Recall” Paths From processing one month of Reuters: "Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) " B" (0.0185) "rokers " **** The end: connection weight less than: 0.04 " Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) "," a " (0.0542) "spokeswo" (0.551) "man said " (0.044) "the idea" (0.0869) " was to " (0.144) "quot" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "s were " (0.253) "base" (0.264) "d. "" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) " party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end."" (0.239) " T" (0.104) "wedish " (0.125) "Export" (0.0211) " Credi" **** The end: connection weight less than: 0.04
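
Paths like these can be produced by repeatedly following an outgoing edge until the connection weight falls below a threshold. The sketch below is a greedy variant that always takes the strongest edge; the talk does not specify the traversal rule (the “Random Recall” label on a later slide suggests edges may also be sampled), so `recall_path`, its parameters, and the toy weights are assumptions.

```python
def recall_path(start, edges, min_weight=0.04, max_steps=50):
    """Follow the strongest outgoing edge until it is weaker than `min_weight`."""
    path = [start]
    current = start
    for _ in range(max_steps):           # max_steps guards against cycles
        successors = edges.get(current)
        if not successors:
            break
        nxt, w = max(successors.items(), key=lambda cw: cw[1])
        if w < min_weight:               # the end: connection weight too small
            break
        path.append(nxt)
        current = nxt
    return path

# Toy network with made-up weights.
toy_edges = {
    "Sinn Fei": {"n a seat": 0.5},
    "n a seat": {" in the ": 0.5, " at ": 0.1},
    " in the ": {"talks.": 0.5},
    "talks.": {" B": 0.01},
}
path = recall_path("Sinn Fei", toy_edges)
```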

Search Query Logs "bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." **** The end: this concept wasn't seen in last 1000000 time points. Random Recall: "bureoofi" (1) "migration" (0.0129) "dept.com" **** The end: this concept wasn't seen in last 1000000 time points.

Much Related Work! • Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, “hierarchical” learning, importance/ubiquity of predictions/anticipations in the brain (“On Intelligence”, “natural computations”,…), models of neocortex (“circuits of the mind”), concepts and conceptual phenomena (e.g. “big book of concepts”), compression, ….

Summary • Large-scale learning and classification (data hungry, efficiency paramount) • A systems approach: Integration of multiple learning processes • The system makes its own classes • Driving objective: Improve prediction (currently: “matching” performance) • The underlying goal: effectively acquire complex concepts • See www.omadani.net

Current/Future • Much work: • Integrate learning of groupings • Recognize/use “structural” categories? (learn to “parse”/segment?) • Prediction objective.. ok? • Control over input stream, etc.. • Category generation.. What are good methods? • Other domains (vision,…) • Compare: language modeling, etc
