Learning About Medicine by Applying Machine Learning to User Generated Content: The Case of Anorexia
Elad Yom-Tov
Microsoft Research Israel
Why medicine?
• People use the Internet extensively:
  • More than 77% of the US population uses the Internet
  • Every day, 55% of Americans use the Internet. They do so for an average of two hours.
  • More than 80% of Internet users search for medical information online, and significant medically related activities happen on the Internet
• Large-scale medical trials are expensive and time-consuming.
• Making sense of Internet data requires processing large amounts of data to produce meaningful insights
*Pew survey, 2010
A lifestyle choice?
“Thin is perfection, I'll die trying to achieve it”
“Anorexia is a lifestyle, not a diet”
“I only feel beautiful when I'm hungry”
Data: Users
• All users who posted at least two photographs with a relevant tag (“thinspo”, “thinspiration”, “pro-ana”)
  • 162 users
• All users who posted to eating disorder groups on Flickr
  • 71 users
• Users who commented on or favorited at least two of the above-mentioned photos
  • 683 users
Data: Photos and links
• Raw data:
  • 543,891 photographs
  • 2,229,489 comments
  • 642,317 favorite markings
  • 237,165 contact links
• Labeling:
  • Users were labeled on a 5-point scale.
  • Kappa = 0.51 (p < 10^-5)
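The slide reports a kappa agreement statistic for the user labels but does not say how many raters there were or which kappa variant was used. As a rough illustration only, here is a minimal sketch of unweighted Cohen's kappa, assuming exactly two raters who each labeled the same users. The function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items.

    Assumption: two raters, nominal labels. The slides do not state
    which kappa variant or how many raters were used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels on a 5-point scale (not data from the study).
rater_1 = [1, 2, 3, 4, 5, 1, 2, 3]
rater_2 = [1, 2, 3, 4, 5, 2, 2, 4]
kappa = cohen_kappa(rater_1, rater_2)
```

On an ordinal 5-point scale a weighted kappa may be more appropriate, since it penalizes near misses less than distant ones.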
[Figure panels: Contacts, Comments, Tags, Favorites]
Tag similarity
• Modeled users with a TF-IDF weighted bag-of-tags
• Average Cosine similarity:
  • Pro-anorexia: 0.259
  • Pro-recovery: 0.202
  • Pro-recovery to pro-anorexia: 0.225
• ROC: 0.52
• Tag usage:
  • “thinspiration”: 37% pro-anorexia, 7% pro-recovery
  • “pro-anorexia”: 1.7% pro-anorexia, 2.4% pro-recovery
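The tag-similarity measure above can be sketched as follows. This is a minimal illustration, assuming each user is represented by the list of tags they applied and using one common smoothed IDF variant; the slides do not specify the exact weighting. The names `tfidf_vectors`, `cosine`, and `mean_pairwise_similarity`, and the toy tags, are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(user_tags):
    """Map each user to a sparse TF-IDF vector over their tags.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1. The slides
    do not say which TF-IDF variant was used.
    """
    n_users = len(user_tags)
    doc_freq = Counter()
    for tags in user_tags.values():
        doc_freq.update(set(tags))  # count each tag once per user
    vectors = {}
    for user, tags in user_tags.items():
        counts = Counter(tags)
        vectors[user] = {
            tag: count * (math.log((1 + n_users) / (1 + doc_freq[tag])) + 1)
            for tag, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors, users):
    """Average cosine similarity over all unordered pairs in a group."""
    pairs = list(combinations(users, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy example with made-up tags (not data from the study).
user_tags = {
    "a": ["thinspo", "thin"],
    "b": ["thinspo", "thin"],
    "c": ["recovery"],
}
vectors = tfidf_vectors(user_tags)
```

Computing `mean_pairwise_similarity` within each labeled group, and `cosine` across groups, gives figures comparable in kind to the averages on the slide.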
Is exposing pro-anorexia users to pro-recovery comments an effective intervention?
Data
Toolbar data over a period of 5 months, in which we identified two types of behavior:
Anorexia queries
We define anorexic activity searching (AAS) as one of the following:
• Tips for proana or anorexia
• “how to … ” and proana or anorexia
• Proana buddy
Celebrity queries
• One of 3,640 known celebrities
• Each scored for the probability of them appearing in conjunction with the word “anorexia”
• We refer to this probability as the Perceived Anorexia Score (PAS).
A total of 5,800,270 users searched for at least one celebrity in the top 2.5% of PAS, of whom 3,615 also made AASs.
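The slide does not say which corpus or estimator was used to compute the Perceived Anorexia Score. One simple reading, offered only as a sketch, is the fraction of texts mentioning a celebrity that also contain the word “anorexia”. The function name `perceived_anorexia_score` and the sample texts are hypothetical.

```python
def perceived_anorexia_score(texts, celebrity, keyword="anorexia"):
    """Fraction of texts mentioning `celebrity` that also contain `keyword`.

    Assumption: PAS is estimated as a simple co-occurrence rate using
    case-insensitive substring matching. The actual estimator is not
    given in the slides.
    """
    name = celebrity.lower()
    mentions = [t.lower() for t in texts if name in t.lower()]
    if not mentions:
        return 0.0  # no evidence either way; treated as zero here
    return sum(keyword in t for t in mentions) / len(mentions)


# Placeholder texts with placeholder names (not data from the study).
texts = [
    "Celebrity A anorexia rumors",
    "Celebrity A new film",
    "Celebrity B interview",
]
score_a = perceived_anorexia_score(texts, "Celebrity A")
```

Ranking all celebrities by such a score and keeping the top 2.5% would give the high-PAS group referred to above.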
Clustering
• Start with a matrix of users by celebrities
  • 9,188,983 users by 3,640 celebrities
• Cluster using k-means with cosine similarity
• Clusters are statistically significant by PAS, but not by occupation.
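K-means with cosine similarity is often implemented as spherical k-means: rows are scaled to unit length, so the dot product with a unit-length centroid equals the cosine. The sketch below assumes that variant and a deterministic initialization (the first k rows) for reproducibility; the slides do not describe the initialization or implementation used. The function names and the toy matrix are hypothetical.

```python
import math


def normalize(row):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def spherical_kmeans(rows, k, iterations=20):
    """Cluster rows by cosine similarity; returns one cluster index per row.

    Assumption: centroids start as the first k rows, which is chosen
    here only to keep the sketch deterministic.
    """
    points = [normalize(r) for r in rows]
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine.
        new_assignments = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
    return assignments


# Toy users-by-celebrities query-count matrix (made-up values).
matrix = [
    [5, 0, 0],
    [0, 4, 0],
    [4, 1, 0],
    [0, 5, 1],
]
clusters = spherical_kmeans(matrix, k=2)
```

At the scale on the slide (millions of users), a sparse-matrix implementation with randomized or k-means++ style seeding would be needed; this dense pure-Python version is only for illustration.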
Adding the media effect
• The Spearman correlation between the number of queries for a celebrity and the number of tweets was 0.63, so the bigger the peak (the “media buzz”), the more searches will occur.
• When focusing on queries and tweets which mentioned anorexia, this correlation is 0.68.
• AAS searchers were 1.9 times more likely to query for a high PAS celebrity in the days following a media peak compared to all other people, and 2.4 times more likely when the peak was associated with anorexia.
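The Spearman correlation quoted above is the Pearson correlation of the ranks of the two series. A minimal sketch, assuming ties receive their average rank and using made-up counts rather than the study's data:

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical daily counts for one celebrity (not data from the study).
daily_queries = [12, 15, 40, 90, 30]
daily_tweets = [100, 130, 500, 700, 450]
rho = spearman(daily_queries, daily_tweets)
```

Because it depends only on ranks, the statistic is insensitive to the very different scales of query and tweet counts.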
Summary
• As people spend ever more time on the Internet, they generate content which we can use to understand (and later hopefully improve) health and healthcare
• This content is especially useful when:
  • People have less of an incentive to lie, compared to the real world
  • Collecting data in the real world is hard
  • Activity is largely web-driven
• BUT: Making sense of so much data requires integrating Machine Learning research with medical practice.