
Extracting User Profiles from Large Scale Data





Presentation Transcript


  1. Extracting User Profiles from Large Scale Data. David Konopnicki, IBM Haifa Research Lab. Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass.

  2. Motivating Example: User Browsing • Keywords modeling: for each user, report the most meaningful keywords that describe her profile (e.g., "san-francisco", "peer", "michael jackson alive") • Large-scale content analysis for a massive number of users • Update user profiles in a profiles database; a dashboard and an advertisement system track statistics about readers' interests

  3. Contributions • User profiling framework: user profile model; KL approach to weight user profiles • Large-scale implementation: MapReduce flow • Experiments: quality analysis; scalability analysis

  4. User Profiling Framework – Setting • Logging produces <userID, docID> pairs, e.g., <u1,d1>, <u1,d2>, <u2,d2> • Targeting uses <docID, content> pairs, e.g., <d1,{bla,bla,bla}>, <d2,{foo,foo}>

  5. User Profiling - Definitions • Bag of words model (BOW) • Profile maintenance • User snapshot • Community snapshot

  6. User Profiling – Intuition • Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot, e.g., { Travel, Tennis, Sport }

  7. User Profiling – Naïve approach • Term frequency tf(t,d): the number of times a term t appears in document d • Document frequency df(t,D): the number of documents in D containing the term t • Frequent: the average tf over the user snapshot estimates the probability of finding a term in the user snapshot • Separate: the inverse document frequency (idf) of a term in the community snapshot
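The naïve approach on slide 7 can be sketched as a tf-idf-style score: average term frequency over the user's documents, weighted by inverse document frequency in the community. This is a minimal illustration, not the deck's exact formula; the tokenized-document inputs and the `top_k` parameter are assumptions.

```python
import math
from collections import Counter

def naive_profile(user_docs, community_docs, top_k=3):
    """Rank terms by (average tf in the user snapshot) x (idf in the
    community snapshot). Documents are lists of tokens."""
    # average term frequency over the user's documents
    avg_tf = Counter()
    for doc in user_docs:
        for t, c in Counter(doc).items():
            avg_tf[t] += c / len(user_docs)
    # document frequency over the community snapshot
    n_docs = len(community_docs)
    df = Counter()
    for doc in community_docs:
        for t in set(doc):
            df[t] += 1
    scores = {t: avg_tf[t] * math.log(n_docs / df[t])
              for t in avg_tf if df[t] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A term that the user repeats but that is common across the community (high df) is discounted, which is exactly the "frequent but separating" intuition of slide 6.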

  8. Kullback-Leibler (KL) Divergence • Measures the difference between two probability distributions P1 and P2 • KL measures the distance between the community distribution and the user distribution • Each term is scored according to its contribution to the KL distance between the community and the user distributions • The top-scored terms are then selected as the user's important terms
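The formulas on slide 8 did not survive extraction; the standard form of the divergence and of a per-term contribution (consistent with the slide's description, though the deck's exact notation may differ) is:

```latex
D_{\mathrm{KL}}(P_u \,\|\, P_c) \;=\; \sum_{t} P_u(t)\,\log\frac{P_u(t)}{P_c(t)},
\qquad
s(t) \;=\; P_u(t)\,\log\frac{P_u(t)}{P_c(t)},
```

where $P_u$ is the user's term distribution, $P_c$ the community's, and $s(t)$ the contribution of term $t$ that is used to rank the user's important terms.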

  9. User Profiling – KL method • Community marginal term distribution P(t|Dj): the probability of finding a term t in the community snapshot, estimated from the average tf over the community snapshot with a probability normalization factor • User marginal term distribution: the relative initial weight of term t, smoothed with the community snapshot (λ = 0.001)
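Slide 9's smoothing and scoring can be sketched as follows. The interpolation form (Jelinek-Mercer-style mixing of the user's maximum-likelihood estimate with the community distribution) is an assumption based on the slide's "smoothing with the community snapshot" note; the paper's exact estimator may differ.

```python
import math

def kl_term_scores(p_user_mle, p_community, lam=0.001):
    """Smooth the user's maximum-likelihood term distribution with the
    community distribution, then score each term by its contribution
    to KL(user || community). lam is the community mixing weight
    (the slide's lambda = 0.001)."""
    scores = {}
    for t, p_c in p_community.items():
        # smoothed user probability: mostly the user's own estimate,
        # with a small community prior so unseen terms get mass > 0
        p_u = (1 - lam) * p_user_mle.get(t, 0.0) + lam * p_c
        if p_u > 0:
            scores[t] = p_u * math.log(p_u / p_c)
    return scores
```

Terms the user over-represents relative to the community get positive scores; terms she under-represents get negative ones, so ranking by score surfaces the distinguishing terms.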

  10. MapReduce Flow (intermediate results between jobs are stored in HDFS)
  • TF job – Mapper: input (d, text), output ({t,d}, 1); Reducer: output ({t,d}, tf(t,d)) // sum
  • |Dj(u)| job – Mapper: input (u, d), output (u, 1); Reducer: output (u, |Dj(u)|) // sum
  • TF aggregation job – Mapper: input (t, tf(t,d), |Dj|), output (t, {tf(t,d), |Dj|, 1}); Reducer: output (t, tf(t,Dj)) // avg
  • UDF job – Mapper: input ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}), output ({u,t,|Dj(u)|}, {1}); Reducer: output ({u,t}, {udf(t,Dj(u))})
  • DF job – Mapper: input ({t,d}, tf(t,Dj)), output (t, 1); Reducer: output (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)})
  • Nj job – Mapper: input ({t}, {tf(t,Dj), cdf(t,Dj)}), output (t, Nj); Reducer: identity
  • P(t|Dj) job – Mapper: input ({t}, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}), output (t, P(t|Dj)); Reducer: identity
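The first job in the flow (per-document term frequency) follows the classic word-count pattern. Below is a toy in-memory sketch of the map/shuffle/reduce cycle; `run_mapreduce` is an illustrative stand-in for Hadoop's shuffle, not part of any real API.

```python
from collections import defaultdict

def tf_map(doc_id, text):
    # emit ((term, doc), 1) for every token in the document
    for term in text.split():
        yield (term, doc_id), 1

def tf_reduce(key, values):
    # sum the counts to get tf(t, d)
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for the map -> shuffle -> reduce cycle
    (grouping mapper outputs by key, as Hadoop's shuffle phase does)."""
    grouped = defaultdict(list)
    for k, v in records:
        for mk, mv in mapper(k, v):
            grouped[mk].append(mv)
    return dict(reducer(k, vs) for k, vs in grouped.items())

tf = run_mapreduce([("d1", "bla bla foo"), ("d2", "foo foo")],
                   tf_map, tf_reduce)
# tf == {("bla", "d1"): 2, ("foo", "d1"): 1, ("foo", "d2"): 2}
```

The later jobs in the flow (|Dj(u)|, DF, Nj, P(t|Dj)) follow the same mapper/reducer shape with different keys, which is what lets the whole pipeline chain through HDFS.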

  11. MapReduce Flow – cont. • The remaining jobs compute, per user, the term weights w, their sum ∑w, and the smoothed user distribution P(t|Dj(u)), again passing intermediate results through HDFS
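The slide's "w / ∑w" step suggests a final normalization of the per-user term weights. A minimal sketch of that step, assuming the profile keeps only positive KL contributions and scales them to sum to 1 (an interpretation of the slide, not a confirmed detail):

```python
def normalize_weights(raw):
    """Clip negative KL contributions to zero and divide by the sum of
    the positive weights, so the user's profile weights sum to 1."""
    total = sum(w for w in raw.values() if w > 0)
    if total == 0:
        return raw
    return {t: max(w, 0.0) / total for t, w in raw.items()}
```

Normalizing makes profiles comparable across users with different snapshot sizes, which matters for the dashboard and advertisement use cases of slide 2.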

  12. Experimental Data – quality analysis • Open Directory Project (ODP): categories are associated with manual labels, considered "ground truth" in this work • Examples: ODP category Science/Technology/Electronics, manual label "Electronics"; ODP category Society/Religion and Spirituality/Buddhism, manual label "Buddhism" • Data collection: 100 categories randomly selected from ODP, 100 documents randomly selected per category, for a total collection of about 10,000 Web pages • Evaluation: a match is counted if the suggested label is identical to, an inflection of, or a WordNet synonym of the manual label

  13. Results • Metric: in how many cases at least one correct term appears among the top-K terms • KL outperforms all other approaches for feature selection

  14. Experimental Data – scalability analysis • Blogger.com • Data collection: 973,518 blog posts crawled from March 2007 until January 2009; total collection size of 5.45GB, with ~120,000 users (example blog entry: http://grannyalong.blogspot.com/) • Cluster setting: a 4-node cluster of commodity machines (each with 4GB RAM, 60GB HD, 4 cores), Hadoop 0.20.1

  15. Number of User Profiles (figure: document ratio, user-profile ratio, and time ratio) • The runtime ratio is correlated with the user-profile ratio

  16. Data Size • Users: 18,000 users chosen between March and April 2007 • Runtime increases linearly with data size

  17. Related Work • Content-based user profiling: the profile contains a taxonomic hierarchy for the long-term model; the taxonomy is taken from the ODP, and short-term activities update the hierarchy • Adaptive user profiles: use words that appear in the Web pages and combine them using tf-idf, considering a time window and weighting terms by the recency of the browsing • KL approach to user tasks: filter out new documents that are unrelated to the user based on her profile; annotate a URL with the most descriptive query term for a given user, based on her profile • User targeting in large-scale systems: a behavioral targeting system over Hadoop MapReduce; a large-scale collaborative-filtering technique for movie recommendations; an incremental algorithm that constructs a user profile from monitoring and user feedback, trading off complexity against profile quality

  18. Conclusions & Future Work • We proposed a scalable user-profiling solution implemented on top of Hadoop MapReduce • We showed quality and scalability results • We plan to extend the user model into a semantic model • We plan to extend the user profile to include structured data

  19. Thank You !
