Studying users behavior in chat rooms

Studying users behavior in chat rooms. DANSS January 25, 2004 Michael Rochkind. Agenda. Motivation Project goals What was done Results Conclusions. Motivation.

Presentation Transcript
Studying users behavior in chat rooms

DANSS

January 25, 2004

Michael Rochkind


Agenda

  • Motivation

  • Project goals

  • What was done

  • Results

  • Conclusions


Motivation

  • Need for simulations of interactive end users to evaluate algorithms and system designs (e.g., algorithms for estimating multicast group size)

  • Real data is difficult to obtain (for both technical and administrative reasons)

  • Most researchers use a trace collected from the audio multicast of IETF conference talks in 1996


Problems with the trace

  • An entire research field is based on a single trace

  • The trace is quite old (from 1996)

  • Collected from one specific type of service (audio conference). The exact nature of the users is unknown, and their behavior is not necessarily the same as in other applications.

  • Impossible to validate the data or to collect a new trace

  • Relatively little member activity

  • The percentage of spurious joins/leaves is very high


Statistical analysis of the trace

  • Different researchers obtained different statistical models for various parameters.

  • Ammar and Almeroth (the original trace creators) obtained an exponential model for most parameters and a Zipf distribution for the stay time in long sessions.

  • Aluf, Altman and Nain recently obtained, from the same long trace, a lognormal distribution for both inter-arrival times and stay times. For a short multicast session they obtained Weibull distributions for both inter-arrival and stay times.

  • A uniform spatial distribution of users was assumed


Project goals

  • To find a publicly available system that reasonably approximates multicast user behavior.

  • To develop data retrieval tools that anyone can run at any time.

  • To analyze the collected data


Parameters of interest

  • Inter-arrival time

  • Session duration (on-time)

  • Number of logged in users (group size)

  • Users’ activity (messages, bytes)

  • Geographical distribution of users

  • Lifespan of multicast event (for short events)

  • Comparison with the “famous trace”


First try - message boards (Yahoo)

  • Difficult to define a user session: many users send just one message.

  • Only active users can be seen (writers)

  • A lot of information is missing (about 50%)

  • Activity peaks when outstanding events happen


Chat rooms

  • The model is similar to a multicast group

  • Users explicitly join the room and leave it

  • Join/leave time and stay time are well-defined.

  • Every message sent to the room is received by all room members


IRC - Internet Relay Chat protocol

  • Runs over TCP/IP

  • Text-based teleconferencing

  • Client-server model

  • Can run in a distributed fashion

  • Five big networks with tens of thousands of users and thousands of channels (rooms)


IRC Servers

  • Form the backbone of an IRC network

  • Connected to each other without cycles (forming a spanning tree)

  • Handle client connections

  • Each server knows about all other servers and all clients.

[Figure: example IRC network with servers S1 to S6 and clients C1 to C4]


IRC clients

  • An IRC client is anything connected to an IRC server that is not another IRC server.

  • Any TCP-enabled device can be an IRC client

  • Distinguished by a unique nickname

  • Each IRC server has the following information about each IRC client:

    • Nickname

    • Real name of the host where the client is running

    • Username of the client on that host

    • IRC server to which the client is connected


IRC Channels

  • Equivalent to the term “chat room”

  • A named group of one or more users, all of whom receive messages addressed to that channel.

  • Created when the first user joins the channel

  • Ceases to exist when the last user leaves it

  • In case of a network split, the channel on each side contains only the clients connected to servers on that side. After the network reconnects, the two parts of the channel are merged again.


IRC network example

[Figure: example IRC network with servers S1 to S6 and clients C1 to C4]


IRC message sending

[Figure: message sending in the example network (servers S1 to S6, clients C1 to C4)]


IRC – new member joins a channel

  • Channel X with members C1, C2, C3

  • Client C4 joins the channel X

[Figure: in step 1, C4 issues "Join X"; in step 2, it receives "names c1, c2, c3". A "join c4" message is forwarded along the links between servers S1 to S6 and delivered to the members C1, C2 and C3]
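The forwarding of "join c4" over the server backbone can be sketched as a flood over a tree. The sketch below is illustrative only: the topology in `tree` and the function name `propagate` are assumptions, since the actual links of the figure are not recoverable from the transcript. Because the servers form a spanning tree (no cycles), each server forwards a message on every link except the one it arrived on, and every server receives it exactly once.

```python
from collections import deque

def propagate(tree, origin):
    """Return the (from_server, to_server) hops used to flood a message over a tree."""
    hops = []
    queue = deque([(origin, None)])  # (server, link the message arrived on)
    while queue:
        server, came_from = queue.popleft()
        for peer in tree[server]:
            if peer != came_from:    # never send back where it came from
                hops.append((server, peer))
                queue.append((peer, server))
    return hops

# Hypothetical acyclic topology (an assumption, not the one in the slide)
tree = {
    "S1": ["S2", "S5"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S2"],
    "S4": ["S2", "S6"],
    "S5": ["S1"],
    "S6": ["S4"],
}

hops = propagate(tree, "S6")  # C4 is assumed to be connected to S6
for src, dst in hops:
    print(f"{src} -> {dst}: join c4")
```

With n servers the message crosses exactly n - 1 links, which is why no duplicate suppression is needed on a cycle-free backbone.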


IRC Channel Monitoring

  • Monitoring client written in Perl, running under cron

  • We randomly chose 3 channels from all channels with more than 100 users: #israel, #canada, #bosnia

  • Channel activity data was collected for a period of about 6 weeks.


Log file format

  • <time> START

  • <time> EXIT

  • <time> JOIN <nickname> <country>

  • <time> PART|QUIT|KICK <nickname>

  • <time> PUBLIC <nick> <size> <country>

  • <time> NICK <old nick> <new nick>

  • <time> NAMES <list of nicks>


1053586971 START

1053587032 JOIN wponiw IL

1053587032 NAMES wponiw Teo_ i-NA mr_shark ^_kNibAL_ kaye_22 Old-Man^ CHA_555 klent Leila19f [Dan] kalanko1 Manifa21f jennider1 eu_sunt mangko18 hot^guy holly20f sad_beaut swimgirl ghazde ^^swt_guy pseudonym bing_23 topgirl23 sexYica creatza sergio9 ZaRa glance cookie^^ aileen` Ugly-GirL AFNAN EclipseM laurra-f garden cai applej SHUNSY fatcock kikelph mhaelee16 aGaTa Ercko lonebabe shellaine juulia priti2 HuntI2ess

1053587032 NAMES gienah Amanda^^ Jamali lishat18 cute_ashf jhen Horbit Sana18 AloneMan3 Errikka ext-ex Maysmile ynet02 poem_37M ann3 jelle love_less dreeve18 indai` adze LiWeiYi TokyoBoy blossom dummee man__ marichu earp danone jackdaw ^faraz^ ANGELA25 boby27 leah_ jossie shyrgil jade-17 kian arnulpo ally16 FiNG Carmina42 bangd sohail Janine33 anne--- joyce22 LUIE_M Travioli corn HOMBREJ2 sexybabes spyk2000 ^barbi3^

1053587032 NAMES tumbleWED Gaby3 chynna^^ babyTH lenjie jherome Certified dj_france jane36 micay shah goerge24 bluediamo master_po Jypsy bassma Bobson^^ Fil24f dimple2 _THERE_ AloneGirL Naked_f shark_nyk morena23 Danniel_m Arwen_ ofw_park jimbern m40usa restie @PacZzZzZz blackstud davis He11razor +MultiMind mater Fearless Adnan_pk Er`mya Helena BrainDead CStrixAW` wooden birkof Cute_Girl Lisa_-- Megaframe barbara-

1053587032 NAMES Simple Loren23 Diana27 Cozzo NateDogg legendh Angel19 Mariah19 fedfed SUNSEEKER PRONET7 bestofmi D0gGi3` +Don_Juan MrNylons teapot SkiPerZ +Br0Th4 Linu|tech ShowerMia JenJen Mariahhh optimist @X

1053587032 JOIN D-A-D-I IN

1053587045 JOIN sydneyguy AU

1053587047 PUBLIC Certified 17 US

1053587053 PUBLIC Certified 13 US

1053587059 JOIN Mckay28 MT

1053587063 NICK CHA_555 ^zHTe

1053587068 PUBLIC Certified 31 US

1053587076 PART ^zHTe

1053587080 JOIN villain PH

1053587082 JOIN cryn PH

1053587095 JOIN static}x{ US

1053587098 PUBLIC Certified 31 US
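A minimal sketch of how lines in this format could be parsed, for readers who want to process the logs themselves. The original monitoring client was written in Perl; this Python version is not the author's code. The names `Event` and `parse_line` are hypothetical, and two interpretations are assumptions: that `<time>` is a Unix timestamp in seconds and that `<size>` in PUBLIC lines is the message size.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Event:
    time: int              # assumed to be a Unix timestamp in seconds
    kind: str              # START, EXIT, JOIN, PART, QUIT, KICK, PUBLIC, NICK or NAMES
    args: Tuple[str, ...]  # remaining whitespace-separated fields

def parse_line(line: str) -> Optional[Event]:
    """Parse one log line; return None for blank or malformed lines."""
    fields = line.split()
    if len(fields) < 2 or not fields[0].isdigit():
        return None
    return Event(int(fields[0]), fields[1], tuple(fields[2:]))

# A few lines copied from the sample log above
sample = """\
1053586971 START
1053587032 JOIN wponiw IL
1053587047 PUBLIC Certified 17 US
1053587063 NICK CHA_555 ^zHTe
1053587076 PART ^zHTe
"""

events = [e for e in map(parse_line, sample.splitlines()) if e is not None]
for e in events:
    print(e.time, e.kind, e.args)
```

Keeping the fields as a plain tuple avoids guessing more structure than the format description states; per-event interpretation (nickname, country, size) can be layered on top.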



Inter-Arrival distribution – #bosnia

[Figure: two histograms of inter-arrival times; x-axis: time (in sec), y-axis: occurrences]


Inter-Arrival distribution – #israel

[Figure: two histograms of inter-arrival times; x-axis: time (in sec), y-axis: occurrences]


Inter-Arrival distribution – #canada

[Figure: two histograms of inter-arrival times; x-axis: time (in sec), y-axis: occurrences]


Inter-Arrival distribution

  • The distribution looks similar for all three channels

  • The distribution is heavy-tailed for two main reasons:

    • Network splits add zero values (during reconnection) and large values (during the split)

    • Periods of low activity add to the tail (more relevant for channels with a non-uniform geographical distribution, such as #bosnia)
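As an illustration of how inter-arrival times follow from the log, the sketch below takes the differences between consecutive JOIN timestamps. The function name `inter_arrival_times` is hypothetical; the timestamps are the JOIN events from the sample log shown earlier. Zero gaps correspond to joins recorded in the same second, which is the effect attributed above to reconnection after a split.

```python
def inter_arrival_times(join_times):
    """Differences (in seconds) between consecutive JOIN timestamps."""
    ts = sorted(join_times)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

# JOIN timestamps taken from the sample log
joins = [1053587032, 1053587032, 1053587045, 1053587059,
         1053587080, 1053587082, 1053587095]

gaps = inter_arrival_times(joins)
zero_share = gaps.count(0) / len(gaps)  # fraction of joins arriving in the same second
print(gaps, zero_share)
```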


Inter-arrival time fits

  • LogNormal distribution is the best fit in almost all cases

  • The only exception is the InvGauss distribution under the A-D and K-S tests for #israel

  • Exponential distribution is very far from optimal

[Tables: fit results for #israel, #canada and #bosnia]
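The presentation ranks candidate distributions with goodness-of-fit tests (Chi-Square, K-S, A-D). As a simpler stand-in, and not the method used here, the sketch below compares maximum-likelihood log-likelihoods of an exponential and a lognormal model using only the standard library. The data is synthetic (drawn from a lognormal with arbitrarily chosen parameters), the function names are hypothetical, and zero-valued samples would have to be excluded before taking logarithms.

```python
import math
import random

def exp_loglik(xs):
    """Log-likelihood of xs under an exponential model at its MLE (rate = 1/mean)."""
    n = len(xs)
    rate = n / sum(xs)
    return n * math.log(rate) - rate * sum(xs)

def lognormal_loglik(xs):
    """Log-likelihood of xs (all > 0) under a lognormal model at its MLE."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

# Synthetic heavy-tailed sample; parameters are arbitrary assumptions
random.seed(0)
synthetic = [random.lognormvariate(2.0, 1.5) for _ in range(2000)]

ll_exp = exp_loglik(synthetic)
ll_logn = lognormal_loglik(synthetic)
print("exponential:", ll_exp, "lognormal:", ll_logn)
```

A higher log-likelihood indicates a better fit for models with a comparable number of parameters; a formal comparison would still use the tests named in the slides.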


The audio trace – inter-arrival fits

  • The inter-arrival time distribution is similar to that of the IRC channels

  • LogNormal / InvGauss



Session duration distribution – #israel

[Figure: two histograms of session durations; x-axes: duration (10^5 sec) and duration (in sec), y-axis: occurrences]


Session duration distribution – #canada

[Figure: two histograms of session durations; x-axes: duration (10^5 sec) and duration (in sec), y-axis: occurrences]


Session duration distribution – #bosnia

[Figure: two histograms of session durations; x-axes: duration (10^5 sec) and duration (in sec), y-axis: occurrences]


Session duration distribution

  • Very heavy tail for two reasons:

    • Many users spend a lot of time in the channel

    • Robots
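One way session durations could be derived from the log is to pair each JOIN with the matching PART, QUIT or KICK while following nickname changes. This is a sketch under assumptions, not the author's script: `session_durations` is a hypothetical name, events are simplified to `(time, kind, args)` tuples, and the nicknames in the example are invented. Users already present when monitoring started (listed only in NAMES) have no recorded join and are skipped.

```python
def session_durations(events):
    """Pair JOIN with PART/QUIT/KICK, following NICK changes.

    events: time-ordered iterable of (time, kind, args) tuples.
    Returns the completed session durations in seconds.
    """
    joined_at = {}  # current nickname -> join time
    durations = []
    for time, kind, args in events:
        if kind == "JOIN":
            joined_at[args[0]] = time
        elif kind == "NICK":
            old, new = args[0], args[1]
            if old in joined_at:
                joined_at[new] = joined_at.pop(old)  # same session, new name
        elif kind in ("PART", "QUIT", "KICK"):
            start = joined_at.pop(args[0], None)
            if start is not None:  # ignore leaves without a recorded join
                durations.append(time - start)
    return durations

# Invented example events
example = [
    (100, "JOIN", ("alice", "IL")),
    (130, "NICK", ("alice", "alice_")),
    (160, "JOIN", ("bob", "US")),
    (190, "PART", ("alice_",)),
    (400, "QUIT", ("carol",)),  # joined before monitoring started
]
print(session_durations(example))
```

Sessions still open at the end of the log (here `bob`) are not reported; how to treat such censored sessions is a modelling decision left open in this sketch.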


Session duration fits

  • BetaGeneral distribution gives the best fit under the Chi-Square and K-S tests whenever we limit the data samples

  • LogNormal is always in second place (and is the best fit under the A-D test)

  • When we don’t limit the data samples, LogNormal is the best.

  • Exponential is very far from optimal

[Tables: fit results for #israel, #canada and #bosnia]


The audio trace – session duration fits

  • The session duration distribution is not similar: it has an extremely heavy tail.

  • The 90th percentile is similar to that of the IRC channels

[Figure: histogram; x-axis: time (in sec), y-axis: occurrences]


The audio trace – session durations

  • Long sessions are similar to those in the IRC channels

  • The phenomenon of short sessions is unique to the audio trace; it has no analog in the IRC channels

[Figure panels: long sessions (> 1 min) and short sessions (< 1 min)]


Main affecting factors

  • Network failures (splits)

  • Robots and long staying users

  • Geographical distribution of users


IRC network splits

  • Any IRC server failure or link failure causes a split.

  • To a channel member, a split looks like a mass departure of users, and reconnection looks like a mass join.

  • Contribute a large number of zeros to the inter-arrival times (about 2 percent of joins come in groups)

  • Decrease session durations

  • Most splits last up to 20 minutes


Short (temporary) splits

  • Heuristic: find a group of quits followed by a group of joins by the same users.

  • Finds only some of the failures
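The heuristic can be sketched as follows. This is one possible reading, not the author's implementation: `find_splits` and its thresholds are assumptions (`burst_gap` and `min_size` are arbitrary, and `max_split` of 1200 seconds is borrowed from the observation that most splits last up to 20 minutes). A burst is taken to be a run of consecutive QUIT events close together in time, which is a simplification.

```python
def find_splits(events, burst_gap=2, min_size=3, max_split=1200):
    """Detect quit bursts followed by rejoins of the same users.

    events: time-ordered list of (time, kind, nick) tuples.
    Returns a list of (split_start, first_rejoin_time, rejoined_nicks).
    """
    splits = []
    i, n = 0, len(events)
    while i < n:
        if events[i][1] != "QUIT":
            i += 1
            continue
        # Collect a run of consecutive QUITs that are close in time
        j = i
        quitters = set()
        while (j < n and events[j][1] == "QUIT"
               and (j == i or events[j][0] - events[j - 1][0] <= burst_gap)):
            quitters.add(events[j][2])
            j += 1
        if len(quitters) >= min_size:
            start = events[i][0]
            rejoin = [(t, nick) for t, kind, nick in events[j:]
                      if kind == "JOIN" and nick in quitters
                      and t - start <= max_split]
            back = {nick for _, nick in rejoin}
            if len(back) >= min_size:
                splits.append((start, rejoin[0][0], back))
        i = j
    return splits

# Invented example: three users drop together and return 10 minutes later
example = [
    (1000, "JOIN", "dave"),
    (1500, "QUIT", "ann"), (1500, "QUIT", "ben"), (1501, "QUIT", "cat"),
    (1600, "QUIT", "eve"),
    (2100, "JOIN", "ann"), (2100, "JOIN", "ben"), (2100, "JOIN", "cat"),
]
splits = find_splits(example)
print(splits)
```

Users who return under a different nickname, or who do not return at all, are missed, which is consistent with the remark that the heuristic finds only some of the failures.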


Split durations

[Figure: histogram of split durations; x-axis: duration (sec), y-axis: occurrences]


Robots

We define a robot as any client that is logged in for more than 8 hours a day on average.

  • Add a constant to the number of logged-in users

  • Add a heavy tail to the session durations

  • Don’t affect inter-arrival and join statistics
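The definition above can be applied to a list of sessions as in the following sketch. The name `find_robots`, the `(nick, start, end)` session representation and the example data are assumptions; the sketch also ignores nickname changes, so a client that renames itself would be counted as separate clients.

```python
def find_robots(sessions, observed_days, threshold_hours=8.0):
    """Return nicknames logged in for more than threshold_hours per day on average.

    sessions: iterable of (nick, start, end) with times in seconds.
    observed_days: length of the observation period in days.
    """
    total = {}
    for nick, start, end in sessions:
        total[nick] = total.get(nick, 0) + (end - start)
    limit = threshold_hours * 3600 * observed_days
    return {nick for nick, seconds in total.items() if seconds > limit}

# Invented two-day observation period
sessions = [
    ("logbot", 0, 172800),          # online for the whole 48 hours
    ("alice", 3600, 10800),         # 2 hours in total
    ("night_owl", 0, 28800),        # 8 hours on day one
    ("night_owl", 86400, 115200),   # 8 hours on day two
]
robots = find_robots(sessions, observed_days=2)
print(robots)
```

In this example `night_owl` averages exactly 8 hours a day and is therefore not classified as a robot, since the definition requires more than 8 hours.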


Distribution of the number of logged-in robots

[Figure: histogram; x-axis: number of bots, y-axis: occurrences]


Robot session durations (channel #canada)






User traffic (#israel)

[Figure: joins per hour and channel size, each plotted against hour of day]


User traffic (#bosnia)

[Figure: joins per hour and channel size, each plotted against hour of day]


User traffic (#canada)

[Figure: joins per hour and channel size, each plotted against hour of day]


User traffic as a function of time of day – observations

  • The function is very stable across different days

  • The shape of the graph is mainly determined by the geographical distribution of users

  • Has great influence on the distributions of other parameters, such as the number of online users and the number of joins per hour.
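A joins-per-hour-of-day profile like the one discussed above could be computed from JOIN timestamps as sketched below. The function name is hypothetical, and two assumptions are made: that log timestamps are Unix seconds and that hours are bucketed in UTC (the time zone used for the original graphs is not stated).

```python
from datetime import datetime, timezone

def joins_per_hour_of_day(join_times):
    """Count JOIN events in each hour of the day (0..23), bucketed in UTC."""
    counts = [0] * 24
    for ts in join_times:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        counts[hour] += 1
    return counts

# JOIN timestamps taken from the sample log
joins = [1053587032, 1053587032, 1053587045, 1053587059,
         1053587080, 1053587082, 1053587095]

counts = joins_per_hour_of_day(joins)
for hour, count in enumerate(counts):
    if count:
        print(f"{hour:02d}:00 UTC  {count} joins")
```

Averaging these counts over the days of the observation period gives the per-hour join rate; for channels whose users are concentrated in one region, shifting to local time would make the daily pattern easier to read.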


Joins per hour distribution - #israel

[Figure: histograms of joins per hour; x-axis: joins in hour, y-axis: occurrences]


Joins per hour distribution - #bosnia

[Figure: histograms of joins per hour; x-axis: joins in hour, y-axis: occurrences]


Joins per hour distribution - #canada

[Figure: histograms of joins per hour; x-axis: joins in hour, y-axis: occurrences]


Data traffic (#israel)

[Figure: messages per hour and bytes per hour, each plotted against hour of day]


Data traffic (#bosnia)

[Figure: messages per hour and bytes per hour, each plotted against hour of day]


Data traffic (#canada)

[Figure: messages per hour and bytes per hour]


Data traffic observations

  • The two graphs (messages per hour and bytes per hour) are highly correlated due to the nature of the messages.

  • Some exceptions come from robots that violate the rules.

  • There is some correlation with the number of logged-in users, but the traffic curve is much flatter.




Short multicast event

  • 10 – start joining

  • 40 – most participants joined

  • 50 – last participant joins. Event starts.

  • 110 – event ends

  • 120 – participants leave

  • 190 – users leave

[Figure: event timeline; x-axis: time (minutes)]


Short multicast event (data traffic)

  • Time resolution – 5 min.

[Figure: msgs and bytes plotted against time (minutes)]


Conclusions

  • Modeling multicast group behavior through IRC users is possible.

  • It is difficult to fit empirical data to pure analytical models because of the combination of different factors (user types, system failures, etc.). The simulation process must take all these factors into account.

  • The famous audio log is inadequate with respect to some important parameters

  • The traditional assumption of a uniform spatial distribution is not always correct

  • Data logs and scripts are available for use

