Harnessing AI to Create Insight from Text Amanda Beedham, Data Scientist, RSA
Contents • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments • Q & A
All About Context “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
Why All the Investment in Word Embedding? • To use text data in predictive models, we must represent it numerically • What about one-hot encoding? • Problems: • High dimensionality reduces efficiency • Words lose their meaning - every pair of one-hot vectors is equally far apart (toy example below)
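As a toy illustration (not from the deck), one-hot encoding in base R makes both problems concrete:

```r
# Toy one-hot encoding: each distinct word gets its own column, so the
# matrix is as wide as the vocabulary and almost entirely zero.
vocab <- c("claim", "flood", "fire", "water", "leak")
one_hot <- diag(length(vocab))
dimnames(one_hot) <- list(vocab, vocab)
one_hot["flood", ]
# A real claims corpus with, say, 50,000 distinct words needs 50,000
# columns, and "flood" is no closer to "water" than it is to "fire".
```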
Benefits of Word Embedding • Embedding layer maps each word to a dense vector of numbers • Captures relationships between words • Finds different ways of saying the same thing • Recognises words that are opposite in meaning • No need to build dictionaries
GloVe Word Embedding Business problem - given a set of claims descriptions: • Can we group the claims? • For each group, can we understand the type of claim? Steps (sketch below) • Data cleaning - stringr, textTinyR • Word embedding - text2vec, textTinyR • Clustering - ClusterR, wordcloud
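A minimal sketch of this pipeline with text2vec (0.6+), assuming a character vector `claims` of cleaned claim descriptions; the hyperparameters are illustrative defaults, not the values used at RSA:

```r
library(text2vec)

tokens <- word_tokenizer(tolower(claims))   # simple whitespace/punct tokenisation
it     <- itoken(tokens, progressbar = FALSE)
vocab  <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)

# Term co-occurrence matrix: how often words appear within 5 words of each other
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

# Fit GloVe and combine the main and context vectors into one matrix,
# one row per word
glove     <- GlobalVectors$new(rank = 50, x_max = 10)
word_vecs <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)
```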
Word Associations • Which words are most likely to occur near a target word?
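A sketch of answering that question with cosine similarity on the `word_vecs` matrix from the GloVe sketch above; the target word "flood" is an invented example:

```r
# Rank every word in the vocabulary by cosine similarity to the target
target <- word_vecs["flood", , drop = FALSE]
sims   <- sim2(word_vecs, target, method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 10)   # ten most associated words
```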
Investigating Clusters of Claims [Word clouds for clusters 1, 2 and 3]
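The deck does not show the clustering code. A common approach (assumed here, not confirmed by the slides) is to average word vectors into claim-level embeddings, cluster with ClusterR, and summarise each cluster with wordcloud:

```r
library(ClusterR)
library(wordcloud)

# Claim embedding = mean of its word vectors (claims whose words are all
# out of vocabulary would need filtering out before clustering)
claim_vecs <- t(sapply(tokens, function(tk) {
  colMeans(word_vecs[intersect(tk, rownames(word_vecs)), , drop = FALSE])
}))

km <- KMeans_rcpp(claim_vecs, clusters = 3, num_init = 5, seed = 42)

# Word cloud of the most frequent terms among claims in cluster 1
freqs <- sort(table(unlist(tokens[km$clusters == 1])), decreasing = TRUE)
wordcloud(names(freqs), as.numeric(freqs), max.words = 50)
```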
GloVe - The Story so Far • Successes: • Determined word similarity • Clusters made sense and represented different events • Results are powerful, but: • Requires a few hundred lines of code • Does not build supervised embeddings • text2vec: http://text2vec.org/index.html • GloVe: http://nlp.stanford.edu/projects/glove/
StarSpace / Ruimtehol • StarSpace: developed by Facebook AI • R package ruimtehol allows: • Multi-label text classification • Word, sentence and document embeddings • Document and sentence similarity • Ranking web documents • Content-based / collaborative-filtering-based recommendation
Can we Predict Category using Text? • Pet invoice data • 44k invoice lines • Are any categories incorrectly assigned? • Can we predict unassigned categories?
Tag Embedding • Build Tag Embedding model to predict category • Input text - words separated by spaces • Response - list of all categories (labels contain no spaces) • Dataset split into train (35k), test (9k)
Model Build • Tag embedding runs in just a few lines of code
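A minimal sketch of that build with ruimtehol, assuming a training frame `train` with `text` and `category` columns (hypothetical names) and illustrative hyperparameters:

```r
library(ruimtehol)

set.seed(123)
model <- embed_tagspace(x = tolower(train$text),
                        y = train$category,      # label(s) per invoice line
                        dim = 50, epoch = 20, lr = 0.05,
                        loss = "softmax", similarity = "cosine",
                        negSearchLimit = 50, minCount = 2)
```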
Creating Word Embeddings • Simple code (sketch below) • How do we interpret the embeddings? • “inj” and “injection” have similar values - closely related
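A sketch of pulling the word embeddings out of the fitted `model` to compare the two terms mentioned on the slide:

```r
# One embedding row per term; rows with similar values suggest related words
emb <- starspace_embedding(model, c("inj", "injection"))
round(emb[, 1:5], 2)   # inspect the first few dimensions
```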
Finding Associated Words: “Anaesthesia” • Model found “anaesthetic”, “gen”, “anaesthesia”, “anaes”, “ga” relate to “anaesthesia” • No dictionaries provided
Finding Associated Words: “Imaging” • Model found “xray”, “radiography”, “radiograph”, “radiographic”, “exposures”, “radiology”, “plate” relate to “Imaging”
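A sketch of how such neighbours can be retrieved from the model: `starspace_knn` ranks the nearest terms, and `embedding_similarity` scores a hand-picked candidate list (k and the candidate words are arbitrary choices, not from the deck):

```r
# Nearest neighbours of a term in the embedding space
starspace_knn(model, "anaesthesia", k = 10)

# Cosine similarity between a target term and chosen candidates
target <- starspace_embedding(model, "anaesthesia")
cands  <- starspace_embedding(model, c("anaesthetic", "xray", "metacam"))
embedding_similarity(cands, target, type = "cosine")
```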
Predicting Category • “Metacam” predicted as “Drugs”:
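A sketch of scoring a new invoice line with the fitted model; the exact shape of `predict()`'s output varies across ruimtehol versions, so treat the commented expectation as an assumption:

```r
# Rank the k most likely categories for one line of text
predict(model, "metacam oral suspension ml", k = 3)
# Expected: the top-k candidate categories with a score for each,
# with "drugs" ranked first for a metacam line
```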
How Well Does our Model Predict? • t-SNE plot on unseen data • Embeddings reduced to 2D • Overlay with category • If the embeddings are good, categories should form clusters
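A sketch of that check, assuming the Rtsne package and a held-out frame `test` with `text` and `category` columns (hypothetical names):

```r
library(Rtsne)

# Embed the unseen lines, reduce to 2D, and colour points by category
test_emb <- starspace_embedding(model, tolower(test$text))
tsne     <- Rtsne(test_emb, dims = 2, check_duplicates = FALSE)
plot(tsne$Y, col = as.factor(test$category), pch = 19,
     main = "t-SNE of test-set embeddings, coloured by category")
```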
How Well Does our Model Predict? • Unseen data, misclassification rate = 4% [Confusion matrix: actual vs predicted categories]
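A sketch of the underlying confusion check (the 4% figure is the presenter's result, not reproduced here); the `$prediction$label` access assumes a recent ruimtehol and should be verified against your version:

```r
# Top predicted category per test line (field names may differ by version)
pred <- vapply(tolower(test$text), function(txt) {
  predict(model, txt, k = 1)$prediction$label[1]
}, character(1))

table(actual = test$category, predicted = pred)   # confusion matrix
mean(pred != test$category)                       # misclassification rate
```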
Finding Errors in Classification • Assigned = “Conditions”, model predicted = “Drugs”: • “metacam oral suspension ml give kg dose once daily with food stop if vomiting” • “prontosan wound gel ml” • Assigned = “Anaesthesia”, model predicted = “Imaging”: • “xray per extra plate without sedation”
Predicting Missing Category: “Anaesthesia” – Typos or Abbreviations • “general anaesthic extended” • “anaes isoflurane kg kg” • “gen anaestheticp kg”
Predicting Missing Category: “Drugs” – Typos • “metacarninj dogs cats mgml ml per ml” • “oomfortan lnj ml per ml q” • “comforlanlnj ml per ml”
Implementation Considerations • Word embedding is a powerful technique • The more data the better • Can be computationally intensive • Overcome with cloud servers and multi-core algorithms • Pre-trained embeddings are available, trained on large datasets • Embedding matrices are very large - may need a big machine to apply them
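For reference, the Stanford GloVe downloads are plain text (each line is a word followed by its vector), so loading one in R is straightforward, if memory-hungry; the file name below is one of the standard downloads:

```r
# Parse "word v1 v2 ... v50" lines into a matrix with one row per word
lines <- readLines("glove.6B.50d.txt")
parts <- strsplit(lines, " ", fixed = TRUE)

pretrained <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(pretrained) <- vapply(parts, `[[`, character(1), 1)
dim(pretrained)   # 400000 x 50: why a big machine may be needed
```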
Recent Developments OpenAI’s GPT-2 - “The AI That's Too Dangerous to Release” Input text: “In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains…” Model completion: “The scientist named the population, after their distinctive horn, Ovid’s Unicorn…”
Summary • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments