Harnessing AI to Create Insight from Text Amanda Beedham, Data Scientist, RSA
Contents • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments • Q & A
All About Context “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
Why All the Investment in Word Embedding? • To use text data in predictive models, we must represent it numerically • What about one-hot encoding? • Problems: • High dimensionality reduces efficiency • Words lose their meaning - every pair of one-hot vectors is equally far apart (toy example below)
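As a toy illustration (not from the deck), one-hot encoding in base R makes both problems concrete:

```r
# Toy one-hot encoding: each distinct word gets its own column, so the
# matrix is as wide as the vocabulary and almost entirely zero.
vocab <- c("claim", "flood", "fire", "water", "leak")
one_hot <- diag(length(vocab))
dimnames(one_hot) <- list(vocab, vocab)
one_hot["flood", ]
# A real claims corpus with, say, 50,000 distinct words needs 50,000
# columns, and "flood" is no closer to "water" than it is to "fire".
```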
Benefits of Word Embedding • Embedding layer maps each word to a dense vector of numbers • Captures relationships between words • Finds different ways of saying the same thing • Recognises words that are opposite in meaning • No need to build dictionaries
GloVe Word Embedding Business problem - given a set of claims descriptions: • Can we group the claims? • For each group, can we understand the type of claim? Steps (sketch below) • Data cleaning - stringr, textTinyR • Word embedding - text2vec, textTinyR • Clustering - ClusterR, wordcloud
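A minimal sketch of this pipeline with text2vec (0.6+), assuming a character vector `claims` of cleaned claim descriptions; the hyperparameters are illustrative defaults, not the values used at RSA:

```r
library(text2vec)

tokens <- word_tokenizer(tolower(claims))   # simple whitespace/punct tokenisation
it     <- itoken(tokens, progressbar = FALSE)
vocab  <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)

# Term co-occurrence matrix: how often words appear within 5 words of each other
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

# Fit GloVe and combine the main and context vectors into one matrix,
# one row per word
glove     <- GlobalVectors$new(rank = 50, x_max = 10)
word_vecs <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)
```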
Word Associations • Which words are most likely to occur near a target word?
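A sketch of answering that question with cosine similarity on the `word_vecs` matrix from the GloVe sketch above; the target word "flood" is an invented example:

```r
# Rank every word in the vocabulary by cosine similarity to the target
target <- word_vecs["flood", , drop = FALSE]
sims   <- sim2(word_vecs, target, method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 10)   # ten most associated words
```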
Investigating Clusters of Claims [Word clouds for clusters 1, 2 and 3]
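The deck does not show the clustering code. A common approach (assumed here, not confirmed by the slides) is to average word vectors into claim-level embeddings, cluster with ClusterR, and summarise each cluster with wordcloud:

```r
library(ClusterR)
library(wordcloud)

# Claim embedding = mean of its word vectors (claims whose words are all
# out of vocabulary would need filtering out before clustering)
claim_vecs <- t(sapply(tokens, function(tk) {
  colMeans(word_vecs[intersect(tk, rownames(word_vecs)), , drop = FALSE])
}))

km <- KMeans_rcpp(claim_vecs, clusters = 3, num_init = 5, seed = 42)

# Word cloud of the most frequent terms among claims in cluster 1
freqs <- sort(table(unlist(tokens[km$clusters == 1])), decreasing = TRUE)
wordcloud(names(freqs), as.numeric(freqs), max.words = 50)
```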
GloVe - The Story so Far • Successes: • Determined word similarity • Clusters made sense and represented different events • Results are powerful, but: • Requires a few hundred lines of code • Does not build supervised embeddings • text2vec: http://text2vec.org/index.html • GloVe: http://nlp.stanford.edu/projects/glove/
StarSpace / Ruimtehol • StarSpace: developed by Facebook AI • R package ruimtehol allows: • Multi-label text classification • Word, sentence and document embeddings • Document and sentence similarity • Ranking web documents • Content-based / collaborative-filtering-based recommendation
Can we Predict Category using Text? • Pet invoice data • 44k invoice lines • Are any categories incorrectly assigned? • Can we predict unassigned categories?
Tag Embedding • Build Tag Embedding model to predict category • Input text - words separated by spaces • Response - list of all categories (labels contain no spaces) • Dataset split into train (35k), test (9k)
Model Build • Tag embedding runs in just a few lines of code
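A minimal sketch of that build with ruimtehol, assuming a training frame `train` with `text` and `category` columns (hypothetical names) and illustrative hyperparameters:

```r
library(ruimtehol)

set.seed(123)
model <- embed_tagspace(x = tolower(train$text),
                        y = train$category,      # label(s) per invoice line
                        dim = 50, epoch = 20, lr = 0.05,
                        loss = "softmax", similarity = "cosine",
                        negSearchLimit = 50, minCount = 2)
```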
Creating Word Embeddings • Simple code (sketch below) • How do we interpret the embeddings? • “inj” and “injection” have similar values - closely related
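A sketch of pulling the word embeddings out of the fitted `model` to compare the two terms mentioned on the slide:

```r
# One embedding row per term; rows with similar values suggest related words
emb <- starspace_embedding(model, c("inj", "injection"))
round(emb[, 1:5], 2)   # inspect the first few dimensions
```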
Finding Associated Words: “Anaesthesia” • Model found “anaesthetic”, “gen”, “anaesthesia”, “anaes”, “ga” relate to “anaesthesia” • No dictionaries provided
Finding Associated Words: “Imaging” • Model found “xray”, “radiography”, “radiograph”, “radiographic”, “exposures”, “radiology”, “plate” relate to “Imaging”
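A sketch of how such neighbours can be retrieved from the model: `starspace_knn` ranks the nearest terms, and `embedding_similarity` scores a hand-picked candidate list (k and the candidate words are arbitrary choices, not from the deck):

```r
# Nearest neighbours of a term in the embedding space
starspace_knn(model, "anaesthesia", k = 10)

# Cosine similarity between a target term and chosen candidates
target <- starspace_embedding(model, "anaesthesia")
cands  <- starspace_embedding(model, c("anaesthetic", "xray", "metacam"))
embedding_similarity(cands, target, type = "cosine")
```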
Predicting Category • “Metacam” predicted as “Drugs”:
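A sketch of scoring a new invoice line with the fitted model; the exact shape of `predict()`'s output varies across ruimtehol versions, so treat the commented expectation as an assumption:

```r
# Rank the k most likely categories for one line of text
predict(model, "metacam oral suspension ml", k = 3)
# Expected: the top-k candidate categories with a score for each,
# with "drugs" ranked first for a metacam line
```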
How Well Does our Model Predict? • t-SNE plot on unseen data • Embeddings reduced to 2D • Overlay with category • If the embeddings are good, categories should form clusters
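A sketch of that check, assuming the Rtsne package and a held-out frame `test` with `text` and `category` columns (hypothetical names):

```r
library(Rtsne)

# Embed the unseen lines, reduce to 2D, and colour points by category
test_emb <- starspace_embedding(model, tolower(test$text))
tsne     <- Rtsne(test_emb, dims = 2, check_duplicates = FALSE)
plot(tsne$Y, col = as.factor(test$category), pch = 19,
     main = "t-SNE of test-set embeddings, coloured by category")
```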
How Well Does our Model Predict? • Unseen data, misclassification rate = 4% [Confusion matrix: actual vs predicted categories]
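A sketch of the underlying confusion check (the 4% figure is the presenter's result, not reproduced here); the `$prediction$label` access assumes a recent ruimtehol and should be verified against your version:

```r
# Top predicted category per test line (field names may differ by version)
pred <- vapply(tolower(test$text), function(txt) {
  predict(model, txt, k = 1)$prediction$label[1]
}, character(1))

table(actual = test$category, predicted = pred)   # confusion matrix
mean(pred != test$category)                       # misclassification rate
```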
Finding Errors in Classification • Assigned = “Conditions”, model predicted = “Drugs”: • “metacam oral suspension ml give kg dose once daily with food stop if vomiting” • “prontosan wound gel ml” • Assigned = “Anaesthesia”, model predicted = “Imaging”: • “xray per extra plate without sedation”
Predicting Missing Category: “Anaesthesia” – Typos or Abbreviations • “general anaesthic extended” • “anaes isoflurane kg kg” • “gen anaestheticp kg”
Predicting Missing Category: “Drugs” – Typos • “metacarninj dogs cats mgml ml per ml” • “oomfortan lnj ml per ml q” • “comforlanlnj ml per ml”
Implementation Considerations • Word embedding is a powerful technique • The more data the better • Can be computationally intensive • Overcome with cloud servers and multi-core algorithms • Pre-trained embeddings are available, trained on large datasets • Embedding matrices are very large - may need a big machine to apply them
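For reference, the Stanford GloVe downloads are plain text (each line is a word followed by its vector), so loading one in R is straightforward, if memory-hungry; the file name below is one of the standard downloads:

```r
# Parse "word v1 v2 ... v50" lines into a matrix with one row per word
lines <- readLines("glove.6B.50d.txt")
parts <- strsplit(lines, " ", fixed = TRUE)

pretrained <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(pretrained) <- vapply(parts, `[[`, character(1), 1)
dim(pretrained)   # 400000 x 50: why a big machine may be needed
```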
Recent Developments OpenAI’s GPT-2 - “The AI That's Too Dangerous to Release” Input text: “In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains…” Model completion: “The scientist named the population, after their distinctive horn, Ovid’s Unicorn…”
Summary • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments