
From Research of Social Media to Socially Mediated Research


Presentation Transcript


  1. From Research of Social Media to Socially Mediated Research. 2010 HCIL Symposium Workshop, UMD: Government Applications of Social Media Networks and Communities. May 28, 2010. Natasa Milic-Frayling, Microsoft Research Cambridge

  2. Outline
  • Microsoft Research: Integrated Systems team, research areas and approach
  • ‘Social’ as a research topic: modelling human-to-human interaction in technology-mediated communities
  • ‘Social’ as a facilitator of research: leveraging communities of practice

  3. Microsoft Research (MSR)
  MSR sites:
  • Redmond, Washington (September 1991)
  • San Francisco, California (June 1995)
  • Cambridge, United Kingdom (July 1997)
  • Beijing, China (November 1998)
  • Silicon Valley, California (July 2001)
  • Bangalore, India (January 2005)
  • Cambridge, Massachusetts (July 2008)

  4. Research Areas
  • Web and on-line communities • Content analysis and rich UI • Mobile and cross-platform media
  Academic disciplines: information retrieval & NLP, machine learning and statistics, HCI and design, mathematical modelling, graph theory and analysis

  5. Research Areas
  Team: Gabriella, Janez, Annika, Rachel, Gerard, Natasa, Eduarda, Gavin, Jamie

  6. Research Areas
  Team: Gabriella, Janez, Annika, Rachel, Gerard, Natasa, Eduarda, Gavin, Jamie, Vinay, Derek Hansen, Elizabeth Bonsignore, Dana Rotman, Ben Shneiderman, Marc Smith, Cody Dunne, Aleks Ignjatovic, Tom Lee

  7. Research Areas: Projects
  • InSite Live: Web site structure analysis and decomposition into subsites
  • Research Desktop: research in information management and tagging practices in the desktop environment
  • weConnect: investigating narrowcast of personalized content in close relationships and the potential for mobile advertising
  • Social Footprints: analysis of social interaction in online communities
  • Social IR: extension of IR models with social networks and models of approval, trust, and reputation
  • VideoSnaps: investigating concepts and services for cross-platform media editing and streaming
  • NodeXL: interactive graph analysis and visualization

  8. Research Areas: Research Platforms
  • Principles, mechanisms, and tools for knowledge management
  • Trust and reputation; shared summaries and overviews
  • Methodology: how to develop mobile and social applications
  • Integration with the ecosystem: prerequisites for adoption
  • Connecting quantitative analyses with qualitative analyses

  9. Social as a Research Topic: Interactions in Technology-Mediated Communities

  10. Community Question-Answering [timeline of QnA sites launched between 2002 and 2008]

  11. Community Question-Answering [example of a question and its answers]

  12. Content Organization, Browsing and Search [tags and topic categories]

  13. 100 Most Frequent Tags on Live QnA

  14. 100 Most Frequent Tags on Live QnA [highlighted tag: Politics]

  15. 100 Most Frequent Tags on Live QnA [highlighted tags: Fun, Life, People, Philosophy]

  16. Community Analysis and Health Index
  • 85% of new users start with a question
  • 72% never ask a question again
  • 5% will engage in answering
  • 61% of questions from new users get no more than 1 answer (23% get 0 answers)
  Towards a sustainable community:
  • Support novice users in becoming active community participants
  • Support frequent users in increasing the volume and quality of their content contributions
  • Promote high-quality contributions (for external exploitation through search)

  17. Example: Investigate QnA Voting Practice
  Social network activities: an answer A to a question Q, a comment C on an answer, and a vote V on the best answer.
  Approach:
  • Statistical analysis of the user logs
  • Manual inspection of the content
  • Taxonomy of the users’ intent, to be evolved by the community of practice
  • Define the basic features of the individuals and the governing assumptions
  • Derive a mathematical model of the voter metric
  • Observe its properties with regard to irregular voting behaviour: random voting or collusion

  18. Which Answer to Vote On?
  • Different ‘best answer’ connotations: the notion of the ‘best answer’ depends on the context and nature of the answers, from correctness and usefulness to entertainment value.
  • Social bias: the assignment of votes may be influenced by social and personal ties, the voter’s perception, familiarity, and preferential treatment of familiar community members. Example: “Microsoft or Apple? Feel free to argue and point out their good and bad points. Also feel free to rebut or debate on other people's standpoint. Best argument/answer will get my friends’ and my "best answer" reward.”
  • Self-promotion: individuals’ aspirations to raise their social status can adversely affect the quality of their contributions to the community.

  19. Reliability as Conformity?
  • Reliability of a voter: the relative reliability of two voters is determined by the proportion of all voters who made the same choice of the best answer.
  • The reliability scores form a fixed point of the scoring function F; existence follows from Brouwer’s Fixed Point Theorem.
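The slide does not give the actual scoring function, so below is a minimal sketch of one plausible scheme, assuming a voter's reliability is the reliability-weighted fraction of co-voters who made the same choice, iterated until the scores stabilize (the kind of fixed point whose existence Brouwer's theorem guarantees). The function name, data layout, and update rule are all illustrative, not the authors' model.

```python
# Illustrative only: iterative scheme whose stable scores form a fixed point.
# `votes` maps a question id to a ballot {voter: chosen answer}.

def fixed_point_reliability(votes, iterations=50):
    voters = {v for ballot in votes.values() for v in ballot}
    r = {v: 1.0 for v in voters}              # start from uniform reliability
    for _ in range(iterations):
        new_r = {}
        for v in voters:
            agree = total = 0.0
            for ballot in votes.values():
                if v not in ballot:
                    continue
                for u, choice in ballot.items():
                    if u != v:
                        total += r[u]          # weight co-voters by reliability
                        if choice == ballot[v]:
                            agree += r[u]
            new_r[v] = agree / total if total else 1.0
        top = max(new_r.values()) or 1.0       # renormalize to keep scores bounded
        r = {v: score / top for v, score in new_r.items()}
    return r
```

Under such a scheme the ‘best answer’ would be the one with the highest total reliability among its supporters, rather than the raw vote count.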

  20. Real Data Analysis [vote count vs fixed-point method on the ‘Fun’ and ‘Philosophy’ categories]

  21. Random Voting
  • Simulate random voting by substituting a uniform distribution for Zipf’s law.
  • We vary the percentage of affected questions (from 1% to 10%) and the percentage of voters who voted randomly (from 1% to 10%).
  • The number of best answers changed is lower for the fixed-point score (right) than for plurality voting (left).

  22. Ballot Stuffing
  • Simulate the collusion: fix the number of involved voters (‘stuffers’, here 4 and 10) and the percentage of questions affected (here 50%).
  • Both majority voting and fixed-point scoring are susceptible to ballot stuffing.
  • Fixed-point scoring flags the outliers and helps identify collusion.
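As a toy illustration of why agreement-based scoring resists collusion better than raw counts, the sketch below compares plurality voting with a one-round agreement score on a hand-built ballot-stuffing scenario. The data, names, and scoring rule are invented for the example; they are not the study's simulation.

```python
def agreement_reliability(votes):
    # One round of agreement scoring: a voter's score is the fraction of
    # co-voters, across all questions, who made the same choice.
    agree, total = {}, {}
    for ballot in votes.values():
        for v, choice in ballot.items():
            for u, other in ballot.items():
                if u != v:
                    total[v] = total.get(v, 0) + 1
                    agree[v] = agree.get(v, 0) + (other == choice)
    return {v: agree[v] / total[v] for v in total}

def plurality_winner(ballot):
    counts = {}
    for choice in ballot.values():
        counts[choice] = counts.get(choice, 0) + 1
    return max(counts, key=counts.get)

def weighted_winner(ballot, reliability):
    score = {}
    for voter, choice in ballot.items():
        score[choice] = score.get(choice, 0.0) + reliability[voter]
    return max(score, key=score.get)

honest = [f"h{i}" for i in range(5)]
stuffers = [f"s{i}" for i in range(6)]

# Honest voters agree on answer "A" across four ordinary questions; the
# colluders appear only on the stuffed question, all voting "B".
votes = {f"q{i}": {h: "A" for h in honest} for i in range(4)}
votes["q_target"] = {**{h: "A" for h in honest}, **{s: "B" for s in stuffers}}

r = agreement_reliability(votes)
print(plurality_winner(votes["q_target"]))    # "B": stuffers win by head count
print(weighted_winner(votes["q_target"], r))  # "A": colluders are downweighted
```

The colluders agree only with each other on a single question, so their scores stay low, while the honest voters' consistent history lifts theirs; the weighted tally then recovers the genuine best answer.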

  23. Detecting Sybil Attack: Leveraging Social Networks
  • Social networks are fast mixing: random walks quickly converge to the stationary distribution.
  • Sybil attacks induce a bottleneck cut, so fast mixing is disrupted.
  • Knowledge of an a priori honest node breaks the symmetry.
  [diagram: honest nodes linked to Sybil nodes by attack edges]
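A minimal sketch of the fast-mixing intuition, using made-up toy graphs rather than the actual Sybil-detection machinery: a random walk started at an honest node spreads quickly inside the honest region, but a single attack edge forms a bottleneck that keeps probability mass from reaching the Sybil region at the rate a well-mixed graph would.

```python
def walk_distribution(adj, start, steps):
    # Probability distribution of a simple random walk after `steps` steps.
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in adj}
        for v, mass in p.items():
            if mass:
                for u in adj[v]:
                    nxt[u] += mass / len(adj[v])
        p = nxt
    return p

def clique(names):
    return {v: [u for u in names if u != v] for v in names}

honest = [f"h{i}" for i in range(6)]
sybil = [f"s{i}" for i in range(6)]

# Bottlenecked graph: two cliques joined by a single attack edge h0--s0.
g = {**clique(honest), **clique(sybil)}
g["h0"] = g["h0"] + ["s0"]
g["s0"] = g["s0"] + ["h0"]

p = walk_distribution(g, "h1", steps=5)
sybil_mass = sum(p[v] for v in sybil)          # small: the cut throttles the walk

# Fast-mixing baseline: the same 12 nodes as one well-connected clique.
p_fast = walk_distribution(clique(honest + sybil), "h1", steps=5)
baseline_mass = sum(p_fast[v] for v in sybil)  # close to the stationary 0.5
```

The gap between `sybil_mass` and `baseline_mass` is the detectable signature: in a fast-mixing honest region, walks from a known honest node reach everywhere quickly, but rarely cross the bottleneck into the Sybil region.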

  24. Social as a Facilitator of Research: Leveraging Communities of Practice

  25. Issue: the Scale and the Limitations of Humans
  • We require user input in order to inform the systems’ design and verify our hypotheses.
  • In search we build test collections: a set of topics, a corpus of documents, and relevance judgements for documents in the corpus.
  • Question: how do we build test collections for books?
  • Search over Web pages involves a low cost of inspecting individual pages; search over book collections increases the cost due to the size of books and the coherence of topics across pages.

  26. Web scenario

  27. Book scenario

  28. Read’n Play
  • Architecture comprises four functional layers: social game support; user annotations; search and navigation support; data store and searchable index (text and metadata index, OCR text database, image database of scanned document pages).
  • Implemented using Web services; no client-based interaction with the content.
  • Can be repurposed for other research projects.

  29. Social game
  • Explorers: reward for finding relevant content.
  • Reviewers: reward for re-assessment (agreement is not necessary) and for finding mistakes in explorers’ work.
  • Conflicts: penalty.

  30. Explore

  31. Pilot Study
  Incentives for participation:
  • Tangible, e.g., monetary: winners received Microsoft hardware and software; all participants received access to the collected data.
  • Intangible, e.g., fun and social gain: a leader board conferring social status.
  Participants:
  • Open to everyone; 48 registered + 81 INEX participants; 17 contributed assessments (16 INEX participants).
  Collected data:
  • Relevance assessments: 3,478 judged books with 23,098 judged pages from 29 topics.
  • Log data: 32,112 navigational events, 45,126 judgement events, 2,970 ‘search inside a book’ events.

  32. Feasibility
  Averages across the 17 assessors:
  • 7.2 days with activity, out of 42
  • 11.4 hours of judging time
  • 220 judged books
  Average effort:
  • 7.3 minutes per relevant book, 2.7 minutes per irrelevant book (comparable to the INEX 2003 ad hoc track)
  • 37 seconds per relevant page, 22 seconds per irrelevant page
  Extrapolated statistics:
  • Judging 1000 books takes 52.7 hours, at a 1 : 9 relevant-to-irrelevant ratio
  • 33.3 days to judge one topic, at 95 minutes a day
  • 70 topics, 200 books per topic, with 20 judges takes 36.9 days
  • 737 judges to complete the task in one hour
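The extrapolations above can be sanity-checked with a little arithmetic, assuming the 1 : 9 relevant-to-irrelevant ratio and the per-book judging times from the slide (the variable names are mine):

```python
# Expected judging time per book under the assumed 1:9 relevance ratio.
min_per_book = 0.1 * 7.3 + 0.9 * 2.7        # 3.16 minutes on average

hours_1000 = 1000 * min_per_book / 60       # ~52.7 hours for 1000 books
days_per_topic = 1000 * min_per_book / 95   # ~33.3 days at 95 minutes/day
total_hours = 70 * 200 * min_per_book / 60  # 70 topics x 200 books: ~737 hours
per_judge = total_hours / 20                # ~36.9 per judge with 20 judges

print(round(hours_1000, 1), round(days_per_topic, 1),
      round(total_hours), round(per_judge, 1))
```

The ~737 total hours also matches the "737 judges to complete the task in one hour" figure; note that dividing by 20 judges yields 36.9 hours of judging per judge, which reads as days only if each judge contributes roughly an hour per day.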

  33. Productivity Games

  34. Summary
  • Understanding social media requires a cross-disciplinary approach and new methods of study.
  • Defining the characteristics and metrics of ‘healthy communities’ is a challenging task.
  • ‘Social’ is playing a growing role as an enabler of large-scale experiments.
  Generally, we need to be reflective about the methods and approaches we take when studying online communities.

  35. Thank you Microsoft Research Cambridge https://research.microsoft.com/is
