
Studying users behavior in chat rooms

Studying users behavior in chat rooms. DANSS, January 25, 2004. Michael Rochkind.


Presentation Transcript


  1. Studying users behavior in chat rooms DANSS January 25, 2004 Michael Rochkind

  2. Agenda • Motivation • Project goals • What was done • Results • Conclusions

  3. Motivation • Simulations of interactive end users are needed to evaluate algorithms and system designs (e.g., algorithms for estimating multicast group size) • Real data is difficult to obtain, for both technical and administrative reasons • Most researchers use a trace collected from the audio multicast of IETF conference talks in 1996

  4. Problems with the trace • An entire research field is based on a single trace • The trace is quite old (from 1996) • It was collected from one specific type of service (audio conference). The exact nature of the users is unknown, and their behavior is not necessarily the same as in other applications. • It is impossible to validate the data or to collect a new trace • Members show relatively little activity • The percentage of spurious joins/leaves is very high

  5. Statistical analysis of the trace • Different researchers obtained different statistical models for various parameters. • Ammar and Almeroth (the original trace creators) obtained an exponential model for most parameters and a Zipf distribution for long-session stay time. • Aluf, Altman, and Nain recently obtained, from the same long trace, a lognormal distribution for both inter-arrival times and stay times. For a short multicast session they obtained Weibull distributions for both inter-arrival and stay times. • A uniform spatial distribution of users was assumed

  6. Project goals • To find a publicly available system that reasonably approximates the behavior of multicast users. • To develop data-retrieval tools that anyone can run at any time. • To analyze the collected data

  7. Parameters of interest • Inter-arrival time • Session duration (on-time) • Number of logged in users (group size) • Users’ activity (messages, bytes) • Geographical distribution of users • Lifespan of multicast event (for short events) • Comparison with the “famous trace”

  8. First try - message boards (Yahoo) • Difficult to define a user session. Many users send just one message. • Only active users (writers) can be seen • A lot of information is missing (about 50%) • Activity peaks when major events happen

  9. Chat rooms • The model is similar to a multicast group • Users explicitly join the room and leave it • Join/leave time and stay time are well-defined. • Every message sent to the room is received by all room members

  10. IRC - Internet Relay Chat protocol • Runs over TCP/IP • Text-based teleconferencing • Client-server model • Can run in a distributed fashion • Five big networks with many tens of thousands of users and thousands of channels (rooms)

  11. IRC Servers • Form the backbone of an IRC network • Connected together without cycles (in the form of a spanning tree) • Handle client connections • Each server knows about all other servers and all clients. [Diagram: servers S1 to S6 with clients C1 to C4]

  12. IRC clients • An IRC client is anything connected to an IRC server that is not another IRC server. • Any TCP-enabled device can be an IRC client • Clients are distinguished by a unique nickname • Each IRC server holds the following information about each IRC client: • Nickname • Real name of the host where the client is running • Username of the client on that host • IRC server to which the client is connected

  13. IRC Channels • Equivalent to the term “chat room” • A named group of one or more users who all receive messages addressed to that channel. • Created when the first user joins the channel • Ceases to exist when the last user leaves it • During a network split, the channel on each side contains only the clients connected to servers on that side. After the network reconnects, the channel is merged again.

  14. IRC network example [Diagram: servers S1 to S6 with clients C1 to C4]

  15. IRC message sending [Diagram: servers S1 to S6 with clients C1 to C4]

  16. IRC – new member joins a channel • Channel X with members C1, C2, C3 • Client C4 joins channel X [Diagram: (1) C4 sends “Join X”; (2) C4 receives “names c1, c2, c3”; “join c4” is propagated across the servers to C1, C2 and C3]
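The join propagation on slide 16 can be sketched as a traversal of the server spanning tree. This is only an illustration: the link layout in `SERVER_LINKS` and the client-to-server mapping in `CLIENT_HOME` are hypothetical, since the original diagram is not recoverable from the transcript, and `join_channel` is a made-up helper, not part of any IRC implementation.

```python
from collections import deque

# Hypothetical spanning tree of servers (6 servers, 5 links, no cycles).
SERVER_LINKS = {
    "S1": ["S2", "S5"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S2", "S6"],
    "S4": ["S2"],
    "S5": ["S1"],
    "S6": ["S3"],
}
# Hypothetical mapping of each client to the server it is connected to.
CLIENT_HOME = {"C1": "S5", "C2": "S4", "C3": "S3", "C4": "S6"}


def join_channel(new_client, members):
    """Simulate a join: return (names reply, notified members, server hops)."""
    origin = CLIENT_HOME[new_client]
    names_reply = sorted(members)      # step 2: current member list for the joiner
    visited = {origin}
    queue = deque([origin])
    notified, hops = [], []
    while queue:                       # breadth-first walk over the tree
        server = queue.popleft()
        # Each server tells its own local channel members about the join.
        notified.extend(sorted(c for c in members if CLIENT_HOME[c] == server))
        for neighbour in SERVER_LINKS[server]:
            if neighbour not in visited:
                visited.add(neighbour)
                hops.append((server, neighbour))
                queue.append(neighbour)
    return names_reply, notified, hops


names, notified, hops = join_channel("C4", {"C1", "C2", "C3"})
print(names)      # ['C1', 'C2', 'C3']
print(notified)   # ['C3', 'C2', 'C1']
print(len(hops))  # 5
```

Because the servers form a tree, the join notice crosses each server link exactly once, so no duplicate suppression is needed.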

  17. IRC Channel Monitoring • Monitoring client written in Perl, run under cron • Three channels were chosen at random from all channels with more than 100 users: #israel, #canada, #bosnia • Channel activity data was collected over a period of about 6 weeks.

  18. Log file format • <time> START • <time> EXIT • <time> JOIN <nickname> <country> • <time> PART|QUIT|KICK <nickname> • <time> PUBLIC <nick> <size> <country> • <time> NICK <old nick> <new nick> • <time> NAMES <list of nicks>
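A minimal sketch of a parser for the record types listed on slide 18. It assumes that fields are separated by whitespace, that `<time>` is an integer timestamp, that `<size>` is an integer, and that a leading `@` or `+` on a nickname in a NAMES record is an IRC status prefix that can be stripped. The function name `parse_line` and the dictionary keys are illustrative choices, not taken from the original tools (which were written in Perl).

```python
def parse_line(line):
    """Parse one log record into a dict; raise ValueError on unknown records."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"malformed record: {line!r}")
    record = {"time": int(parts[0]), "kind": parts[1]}
    kind, args = record["kind"], parts[2:]
    if kind in ("START", "EXIT"):
        pass  # no further fields
    elif kind == "JOIN":
        record["nick"], record["country"] = args[0], args[1]
    elif kind in ("PART", "QUIT", "KICK"):
        record["nick"] = args[0]
    elif kind == "PUBLIC":
        record["nick"] = args[0]
        record["size"] = int(args[1])
        record["country"] = args[2]
    elif kind == "NICK":
        record["old_nick"], record["new_nick"] = args[0], args[1]
    elif kind == "NAMES":
        # Assumption: '@' and '+' are status prefixes, not part of the nick.
        record["nicks"] = [nick.lstrip("@+") for nick in args]
    else:
        raise ValueError(f"unknown record type: {kind!r}")
    return record


print(parse_line("1053587047 PUBLIC Certified 17 US"))
# {'time': 1053587047, 'kind': 'PUBLIC', 'nick': 'Certified', 'size': 17, 'country': 'US'}
```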

  19.
      1053586971 START
      1053587032 JOIN wponiw IL
      1053587032 NAMES wponiw Teo_ i-NA mr_shark ^_kNibAL_ kaye_22 Old-Man^ CHA_555 klent Leila19f [Dan] kalanko1 Manifa21f jennider1 eu_sunt mangko18 hot^guy holly20f sad_beaut swimgirl ghazde ^^swt_guy pseudonym bing_23 topgirl23 sexYica creatza sergio9 ZaRa glance cookie^^ aileen` Ugly-GirL AFNAN EclipseM laurra-f garden cai applej SHUNSY fatcock kikelph mhaelee16 aGaTa Ercko lonebabe shellaine juulia priti2 HuntI2ess
      1053587032 NAMES gienah Amanda^^ Jamali lishat18 cute_ashf jhen Horbit Sana18 AloneMan3 Errikka ext-ex Maysmile ynet02 poem_37M ann3 jelle love_less dreeve18 indai` adze LiWeiYi TokyoBoy blossom dummee man__ marichu earp danone jackdaw ^faraz^ ANGELA25 boby27 leah_ jossie shyrgil jade-17 kian arnulpo ally16 FiNG Carmina42 bangd sohail Janine33 anne--- joyce22 LUIE_M Travioli corn HOMBREJ2 sexybabes spyk2000 ^barbi3^
      1053587032 NAMES tumbleWED Gaby3 chynna^^ babyTH lenjie jherome Certified dj_france jane36 micay shah goerge24 bluediamo master_po Jypsy bassma Bobson^^ Fil24f dimple2 _THERE_ AloneGirL Naked_f shark_nyk morena23 Danniel_m Arwen_ ofw_park jimbern m40usa restie @PacZzZzZz blackstud davis He11razor +MultiMind mater Fearless Adnan_pk Er`mya Helena BrainDead CStrixAW` wooden birkof Cute_Girl Lisa_-- Megaframe barbara-
      1053587032 NAMES Simple Loren23 Diana27 Cozzo NateDogg legendh Angel19 Mariah19 fedfed SUNSEEKER PRONET7 bestofmi D0gGi3` +Don_Juan MrNylons teapot SkiPerZ +Br0Th4 Linu|tech ShowerMia JenJen Mariahhh optimist @X
      1053587032 JOIN D-A-D-I IN
      1053587045 JOIN sydneyguy AU
      1053587047 PUBLIC Certified 17 US
      1053587053 PUBLIC Certified 13 US
      1053587059 JOIN Mckay28 MT
      1053587063 NICK CHA_555 ^zHTe
      1053587068 PUBLIC Certified 31 US
      1053587076 PART ^zHTe
      1053587080 JOIN villain PH
      1053587082 JOIN cryn PH
      1053587095 JOIN static}x{ US
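As a small worked example, inter-arrival times can be derived from the JOIN records of a log like the one on slide 19. The sketch below uses a few records copied from that sample and assumes the timestamps are in seconds; `inter_arrival_times` is an illustrative helper name, not part of the original tools.

```python
SAMPLE = """\
1053587032 JOIN D-A-D-I IN
1053587045 JOIN sydneyguy AU
1053587047 PUBLIC Certified 17 US
1053587059 JOIN Mckay28 MT
1053587063 NICK CHA_555 ^zHTe
1053587076 PART ^zHTe
1053587080 JOIN villain PH
1053587082 JOIN cryn PH
1053587095 JOIN static}x{ US
"""


def inter_arrival_times(lines):
    """Return the gaps (in seconds) between consecutive JOIN records."""
    join_times = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "JOIN":
            join_times.append(int(parts[0]))
    join_times.sort()
    return [later - earlier for earlier, later in zip(join_times, join_times[1:])]


gaps = inter_arrival_times(SAMPLE.splitlines())
print(gaps)  # [13, 14, 21, 2, 13]
```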

  20. Inter-arrival time

  21. Inter-Arrival distribution – #bosnia [Two histograms: occurrences vs. time (in sec)]

  22. Inter-Arrival distribution – #israel [Two histograms: occurrences vs. time (in sec)]

  23. Inter-Arrival distribution – #canada [Two histograms: occurrences vs. time (in sec)]

  24. Inter-Arrival distribution • The distribution looks similar for all three channels • The distribution is heavy-tailed for two main reasons: • Network splits add zero values (during reconnection) and large values (during the split) • Periods of low activity add to the tail (more relevant for channels with a non-uniform geographical distribution, like #bosnia)

  25. Inter-arrival time fits • The LogNormal distribution gives the best fit in almost all cases • The only exception is #israel, where InvGauss fits best under the A-D and K-S tests • The exponential distribution is very far from optimal [Fit results shown for #israel, #canada, #bosnia]

  26. The audio trace – inter-arrival fits • The inter-arrival time distribution is similar to that of the IRC channels • LogNormal / InvGauss
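To illustrate how a lognormal model can be compared with an exponential one, the sketch below fits both by maximum likelihood and compares their log-likelihoods. This is a substitute technique chosen for brevity, not the slides' method (they report Chi-Square, K-S and A-D goodness-of-fit tests). The data is synthetic, and the sketch assumes all samples are strictly positive, so zero inter-arrival times would have to be dropped or shifted first.

```python
import math
import random


def lognormal_fit(samples):
    """Maximum-likelihood lognormal fit: return (mu, sigma, log-likelihood)."""
    logs = [math.log(x) for x in samples]  # requires x > 0
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in logs
    )
    return mu, sigma, loglik


def exponential_fit(samples):
    """Maximum-likelihood exponential fit: return (rate, log-likelihood)."""
    n = len(samples)
    total = sum(samples)
    rate = n / total
    loglik = n * math.log(rate) - rate * total
    return rate, loglik


# Synthetic heavy-tailed data standing in for measured inter-arrival times.
random.seed(1)
data = [random.lognormvariate(3.0, 1.5) for _ in range(2000)]
mu, sigma, ll_lognormal = lognormal_fit(data)
rate, ll_exponential = exponential_fit(data)
print(ll_lognormal > ll_exponential)  # True
```

A higher log-likelihood only indicates a better fit among the candidates tried; it does not replace a goodness-of-fit test.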

  27. Session Duration

  28. Session duration distribution – #israel [Two histograms: occurrences vs. duration (in 10^5 sec and in sec)]

  29. Session duration distribution – #canada [Two histograms: occurrences vs. duration (in 10^5 sec and in sec)]

  30. Session duration distribution – #bosnia [Two histograms: occurrences vs. duration (in 10^5 sec and in sec)]

  31. Session duration distribution • Very heavy tail for two reasons: • Many users spent a lot of time in the channel • Robots
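Session durations can be reconstructed from such logs by pairing each JOIN with the matching PART, QUIT or KICK while following NICK changes. The sketch below is one possible approach, not the original analysis code. It assumes whitespace-separated records in time order and skips users who were already present when logging started, because their join time is unknown. The demo records are made up for illustration.

```python
def session_durations(lines):
    """Return (final nick, duration in seconds) for each completed session."""
    joined_at = {}   # current nickname -> join timestamp
    durations = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # START / EXIT and malformed records carry no nickname
        time, kind, args = int(parts[0]), parts[1], parts[2:]
        if kind == "JOIN":
            joined_at[args[0]] = time
        elif kind == "NICK" and len(args) >= 2:
            # Carry the open session over to the new nickname.
            if args[0] in joined_at:
                joined_at[args[1]] = joined_at.pop(args[0])
        elif kind in ("PART", "QUIT", "KICK"):
            start = joined_at.pop(args[0], None)
            if start is not None:
                durations.append((args[0], time - start))
    return durations


# Made-up records in the slide 18 format.
demo = [
    "100 JOIN alice IL",
    "130 NICK alice alice_away",
    "160 JOIN bob CA",
    "400 PART alice_away",
    "700 QUIT bob",
    "800 PART carol",  # carol joined before logging started: skipped
]
print(session_durations(demo))  # [('alice_away', 300), ('bob', 540)]
```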

  32. Session duration fits • The BetaGeneral distribution gives the best fit under the Chi-Square and K-S tests whenever the data samples are limited • LogNormal is always in second place (and is the best fit under the A-D test) • When the data samples are not limited, LogNormal is the best fit. • The exponential distribution is very far from optimal [Fit results shown for #israel, #canada, #bosnia]

  33. The audio trace – session duration fits • Session durations are not similar: the tail is extremely heavy. • The 90th percentile is similar to that of the IRC channels [Histogram: occurrences vs. time (in sec)]

  34. The audio trace – session durations • Long sessions (> 1 min) are similar to those in the IRC channels • The phenomenon of short sessions (< 1 min) is unique to the audio trace, with no analog in the IRC channels [Panels: long sessions (> 1 min); short sessions (< 1 min)]

  35. Main affecting factors • Network failures (splits) • Robots and long staying users • Geographical distribution of users

  36. IRC network splits • Any IRC server failure or link failure causes a split. • To a channel member, a split looks like a mass departure of users, and the reconnection looks like a mass join. • Splits contribute a large number of zeros to the inter-arrival times (about 2 percent of joins come in groups) • Splits decrease session durations • Most splits last up to 20 minutes

  37. Short (temporary) splits • Heuristic: find a group of quits followed by a group of joins by the same users. • Finds only some of the failures

  38. Split durations [Histogram: occurrences vs. duration (sec)]
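The heuristic on slide 37 can be sketched as follows. The slides give no thresholds, so every parameter here (`burst_gap`, `min_group`, `max_wait`, `min_overlap`) is an assumption for illustration, as is the function name `find_splits`. The input is assumed to contain only QUIT and JOIN events, sorted by time.

```python
def find_splits(events, burst_gap=2, min_group=3, max_wait=3600, min_overlap=0.5):
    """Detect candidate splits in a list of (time, kind, nick) tuples.

    Returns (first quit time, first rejoin time, rejoined nicks) per candidate.
    """
    splits = []
    i, n = 0, len(events)
    while i < n:
        if events[i][1] != "QUIT":
            i += 1
            continue
        # Collect a burst of consecutive quits that are close in time.
        j = i
        quitters = {events[i][2]}
        last_quit = events[i][0]
        while (j + 1 < n and events[j + 1][1] == "QUIT"
               and events[j + 1][0] - last_quit <= burst_gap):
            j += 1
            quitters.add(events[j][2])
            last_quit = events[j][0]
        if len(quitters) >= min_group:
            # Look for the same users joining again within max_wait seconds.
            rejoined = {}
            for time, kind, nick in events[j + 1:]:
                if time - last_quit > max_wait:
                    break
                if kind == "JOIN" and nick in quitters and nick not in rejoined:
                    rejoined[nick] = time
            if len(rejoined) >= min_overlap * len(quitters):
                splits.append((events[i][0], min(rejoined.values()), sorted(rejoined)))
        i = j + 1
    return splits


# Made-up events: three users drop together and return 600 seconds later.
events = [
    (1000, "JOIN", "dana"),
    (2000, "QUIT", "ann"), (2000, "QUIT", "ben"), (2001, "QUIT", "cat"),
    (2300, "JOIN", "eve"),
    (2600, "JOIN", "ann"), (2600, "JOIN", "ben"), (2601, "JOIN", "cat"),
    (5000, "QUIT", "eve"),
]
print(find_splits(events))  # [(2000, 2600, ['ann', 'ben', 'cat'])]
```

As the slide notes, a heuristic of this kind finds only some of the failures: users who do not return, or who return under a different nickname, are not matched.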

  39. Robots • We define a robot as any client that is logged in for more than 8 hours per day on average. • Robots add a constant to the number of logged-in users • They add a heavy tail to session durations • They do not affect inter-arrival and join statistics

  40. Distribution of the number of logged-in robots [Histogram: occurrences vs. number of bots]
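The robot definition above translates into a simple filter over reconstructed sessions. The sketch below assumes that the average is taken over the whole observation window and that a nickname identifies a client; the slides specify neither. `find_robots` and the sample data are hypothetical.

```python
def find_robots(sessions, period_start, period_end, threshold_hours=8.0):
    """Return nicks logged in more than threshold_hours per day on average.

    sessions: iterable of (nick, start, end) with timestamps in seconds.
    """
    days = (period_end - period_start) / 86400
    total_seconds = {}
    for nick, start, end in sessions:
        # Clip each session to the observation window.
        start, end = max(start, period_start), min(end, period_end)
        if end > start:
            total_seconds[nick] = total_seconds.get(nick, 0) + (end - start)
    return sorted(
        nick for nick, secs in total_seconds.items()
        if secs / 3600 / days > threshold_hours
    )


# Made-up sessions over a two-day window (172800 seconds).
sessions = [
    ("logbot", 0, 80000), ("logbot", 90000, 170000),  # about 22 hours per day
    ("visitor", 1000, 11800),                         # 1.5 hours per day
]
print(find_robots(sessions, 0, 172800))  # ['logbot']
```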

  41. Robots session durations (channel #canada)

  42. Geographical distribution

  43. Geographical distribution during day hours

  44. Number of logged in users (channel size)

  45. Number of user joins per hour

  46. User traffic (#israel) [Two plots: joins per hour vs. hour of day; channel size vs. hour of day]

  47. User traffic (#bosnia) [Two plots: joins per hour vs. hour of day; channel size vs. hour of day]

  48. User traffic (#canada) [Two plots: joins per hour vs. hour of day; channel size vs. hour of day]

  49. User traffic as function of time of day – observations • The function is very stable across different days • The shape of the graph is determined mainly by the geographical distribution of users • It has a great influence on the distributions of other parameters, such as the number of online users and the number of joins per hour.

  50. Joins per hour distribution - #israel [Plots: occurrences vs. joins in hour]
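A time-of-day profile like the ones discussed above can be built by bucketing join timestamps by hour. The sketch below assumes Unix timestamps and uses UTC unless an offset is given, since the time zone used in the original plots is not stated. `joins_per_hour` is an illustrative helper name.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def joins_per_hour(join_times, utc_offset_hours=0):
    """Count joins in each hour of the day (list of 24 integers)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter(datetime.fromtimestamp(t, tz).hour for t in join_times)
    return [counts.get(hour, 0) for hour in range(24)]


# Join timestamps copied from the sample log on slide 19.
profile = joins_per_hour([1053587032, 1053587045, 1053587059])
print(profile[7])  # 3
```

Dividing each bucket by the number of observed days gives an average joins-per-hour curve that can be compared across days.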
