

Things about Trace Analysis

Wei-jen Hsu

In class presentation for CIS6930

wjhsu@ufl.edu

(Advisor: Ahmed Helmy)

Objective
  • More background knowledge related to trace-based study
  • Details about the trace format – an intro for one of the assignments
  • Share the experience in trace analysis
Why trace analysis?
  • Traces provide the “realism” of how the system works
    • Verification of established system
    • Diagnosis of system operation (identify faults)
    • Identifying design flaws
    • Large-scale properties (e.g. self-similar traffic)
    • Understand how a new system works
    • Provide domain knowledge for analysis work
    • Verifying an idea
Typical Work Flow for Trace Analysis
  • Build the system
  • Identify point(s) of trace collection and the methodology used
  • Obtain the data
  • Clean-up and sanity check
  • Analyze the data and post processing
  • Explain the results
  • Apply the results to further study or modify the existing system
WLAN Traces Study
  • Started around 2000
    • WLAN was new, and researchers wanted to understand how people used it (usage study)
    • Surveys vs. traces
    • The studies by Tang and Baker (’00) and Kotz and Essien (’02) are pioneering examples
      • Statistics of usage (# of users, amount of traffic, etc.)
WLAN Traces Study
  • Mobility-related
    • MIT work (home location, prevalence, and persistence)
    • UCSD (PDA users)
    • WLAN mobility model (INFOCOM05, T-model, T++-model)
  • Other user properties
    • Handoff
    • Pause time distribution
Trace Format
  • For association
    • Usually in the format

(Node_id, start_time, location, end_time)

    • But there are various ways to collect the underlying data
      • Syslog: event-based
      • SNMP: polling
  • USC raw trace
    • Wireless association (time start/stop switch-port MAC)
    • DHCP log (time MAC IP)
    • Traffic log
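As a rough illustration of how event-based records can be turned into the (Node_id, start_time, location, end_time) format, the sketch below pairs each Start event with the next Stop for the same node and location. The tuple layout, the `Session` type, and the rule for unmatched events are assumptions made for this example, not part of the original trace tools.

```python
from typing import NamedTuple

class Session(NamedTuple):
    node_id: str
    start_time: float
    location: str
    end_time: float

def events_to_sessions(events):
    """Pair each Start with the next Stop for the same (node, location).

    `events` is an iterable of (time, kind, location, node_id) tuples sorted
    by time, where kind is "Start" or "Stop". Unmatched events are dropped
    (an assumption; a real clean-up step may handle them differently).
    """
    open_since = {}  # (node_id, location) -> start time
    sessions = []
    for time, kind, location, node_id in events:
        key = (node_id, location)
        if kind == "Start":
            # A repeated Start without a Stop keeps the earliest start time.
            open_since.setdefault(key, time)
        elif kind == "Stop" and key in open_since:
            sessions.append(Session(node_id, open_since.pop(key), location, time))
    return sessions

events = [
    (0.0, "Start", "port-1", "aa"),
    (5.0, "Start", "port-2", "bb"),   # never stopped: dropped
    (8.0, "Stop", "port-1", "aa"),
    (9.0, "Stop", "port-3", "cc"),    # Stop with no Start: ignored
]
sessions = events_to_sessions(events)
print(sessions)
```

Polling-based (SNMP) data would need a different approach, since it yields periodic snapshots rather than start/stop events.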

Trace Format Example

  • USC wireless association trace

(Time Start/Stop Switch_IP Switch_port MAC_of_node)

Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae

Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1

Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a

  • USC DHCP trace

(Time IP_of_node MAC_of_node)

Jan 27 00:21:19 207.151.229.50 0:18:f3:10:ea:4c

Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92

Jan 27 00:21:20 207.151.229.50 0:18:f3:10:ea:4c

  • USC traffic trace

(Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)

0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368

0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
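A minimal parsing sketch for two of the record layouts shown above, using whitespace splitting. The field names in the returned dictionaries are made up for this example, and `year=2005` is a hypothetical default, since the association records carry no year.

```python
from datetime import datetime

def parse_association(line, year=2005):
    """Parse 'Mon Oct 10 01:16:52 Start <switch_ip> <switch_port> <mac>'.

    The year is NOT in the trace; the default here is only an assumption.
    """
    _dow, month, day, clock, kind, switch_ip, port, mac = line.split()
    when = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return {"time": when, "kind": kind, "switch_ip": switch_ip,
            "switch_port": int(port), "mac": mac}

PROTOCOLS = {6: "TCP", 17: "UDP"}  # protocol numbers as given in the header

def parse_traffic(line):
    """Parse one traffic record; timestamps are kept as raw strings."""
    (start, end, dst_ip, dst_port, src_ip, src_port,
     proto, unknown, packets, size) = line.split()
    return {
        "start": start, "end": end,
        "dst": (dst_ip, int(dst_port)),
        "src": (src_ip, int(src_port)),
        "protocol": PROTOCOLS.get(int(proto), proto),
        "unknown": unknown,          # the field marked "?" in the header
        "packets": int(packets),
        "data_size": int(size),
    }

assoc = parse_association(
    "Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae")
flow = parse_traffic(
    "0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 "
    "207.151.239.208 1795 17 0 3 1368")
print(assoc["kind"], assoc["mac"], flow["protocol"], flow["packets"])
```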

Work with the Trace
  • An exercise:

“Does the Encounter-Relationship graph change with respect to time??”

  • From WLAN traces, we find “encounters” to measure inter-node relationships

Note: Is this a good assumption?
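One possible way to extract encounters from association sessions is sketched below. It assumes that two nodes "encounter" each other when their sessions at the same location overlap in time; this definition, and the session tuple layout, are assumptions of the sketch rather than something the slides spell out.

```python
from itertools import combinations

def find_encounters(sessions):
    """Return {(a, b): total overlap time} for node pairs whose sessions
    at the same location overlap. Sessions are
    (node_id, start_time, location, end_time) tuples."""
    by_location = {}
    for node, start, location, end in sessions:
        by_location.setdefault(location, []).append((node, start, end))

    encounters = {}
    for visits in by_location.values():
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            overlap = min(e1, e2) - max(s1, s2)
            if a != b and overlap > 0:
                pair = tuple(sorted((a, b)))
                encounters[pair] = encounters.get(pair, 0) + overlap
    return encounters

sessions = [
    ("n1", 0, "library", 10),
    ("n2", 5, "library", 20),
    ("n3", 12, "library", 15),
    ("n1", 30, "office", 40),
]
encounters = find_encounters(sessions)
print(encounters)
```

The pairwise comparison is quadratic in the number of sessions per location, which is acceptable for a small example but may need a sweep over sorted start times on a full trace.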

Encounter distribution
  • How many other nodes does a node encounter?
  • Not many for WLAN users: on average, only 2%~7% of the population

[Figure: Prob. (unique encounter fraction > x)]
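The quantity on the plot can be sketched as follows: for each node, the fraction of the other nodes it has ever encountered, and then the empirical probability that this fraction exceeds x. The function names and the toy population are illustrative assumptions.

```python
def unique_encounter_fraction(pairs, population):
    """Fraction of the other nodes that each node has encountered."""
    met = {node: set() for node in population}
    for a, b in pairs:
        met[a].add(b)
        met[b].add(a)
    others = len(population) - 1
    return {node: len(peers) / others for node, peers in met.items()}

def ccdf(values, x):
    """Empirical Prob(value > x)."""
    values = list(values)
    return sum(v > x for v in values) / len(values)

population = ["n1", "n2", "n3", "n4", "n5"]
pairs = [("n1", "n2"), ("n2", "n3")]
fractions = unique_encounter_fraction(pairs, population)
print(fractions)
print(ccdf(fractions.values(), 0.2))
```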

Encounter-Relationship graph
  • Imagine a link connecting every pair of nodes that ever encounter each other. What does the graph look like?
  • But is the ER graph a connected graph?
  • What are its properties?

[Figure labels: loner; group of good friends; cliques with random links to join them]

Encounter-Relationship graph
  • To our surprise, ER graphs are connected!
  • In most cases the disconnected ratio (DR) reaches close to its final value in less than 1 day.

[Figure: Disconnected Ratio (%)]
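The slides do not define the disconnected ratio. The sketch below assumes it is the percentage of nodes that cannot be reached from a chosen source node in the ER graph, computed with a breadth-first search; treat that definition as an assumption of this example.

```python
from collections import deque

def build_graph(pairs):
    """Undirected adjacency sets from encounter pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def disconnected_ratio(graph, nodes, source):
    """Percentage of `nodes` unreachable from `source` (assumed definition)."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    unreachable = [n for n in nodes if n not in seen]
    return 100.0 * len(unreachable) / len(nodes)

nodes = ["n1", "n2", "n3", "n4", "n5"]
graph = build_graph([("n1", "n2"), ("n2", "n3"), ("n4", "n5")])
print(disconnected_ratio(graph, nodes, "n1"))
```

Recomputing this value on graphs built from growing prefixes of the trace would show how quickly it settles.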

Encounter-Relationship graph
  • What are the graph properties of the relationship graphs?
    • Regular graph: high path length, high clustering
    • Random graph: low path length, low clustering
    • SmallWorld graph: high clustering, as in a regular graph, and low path length, as in a random graph

Encounter-Relationship graph
  • Relationship graphs are SmallWorld graphs
    • High clustering coefficient (CC), low average path length (PL)

[Figure: Normalized CC and PL]
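A small, unoptimized sketch of the two metrics on an undirected graph stored as adjacency sets. It uses the common definitions (per-node fraction of linked neighbor pairs, and mean hop count over reachable pairs); the normalization against regular and random graphs used for the plot is not reproduced here.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph):
    """Average over nodes of (links among neighbors) / (possible links).
    Nodes with fewer than two neighbors contribute 0."""
    total = 0.0
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
        total += links / (k * (k - 1) / 2)
    return total / len(graph)

def average_path_length(graph):
    """Mean hop count over all ordered, reachable node pairs (BFS per node).
    Assumes the graph has at least one edge."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

# Triangle a-b-c with one extra node d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(clustering_coefficient(graph), average_path_length(graph))
```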

Work with the Trace
  • An exercise:

“Does the Encounter-Relationship graph change with respect to time??”

    • Chop the trace into multiple segments
    • Analyze the average clustering coefficient and average path length of the resultant graph
    • How to deal with changing population?
    • Does the encounter duration matter?
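The first step of the exercise, chopping the trace into segments, could be sketched as below. The `(time, node_a, node_b)` event layout and the one-day window are assumptions chosen for illustration; each segment yields its own set of encounter pairs, and the population of a segment is taken to be the nodes that appear in it.

```python
def split_by_window(encounters, window):
    """Map window index -> set of node pairs that met in that window."""
    segments = {}
    for time, a, b in encounters:
        index = int(time // window)
        segments.setdefault(index, set()).add(tuple(sorted((a, b))))
    return segments

def population(pairs):
    """Nodes that appear in at least one encounter of a segment."""
    return {node for pair in pairs for node in pair}

DAY = 24 * 3600  # window length in seconds (illustrative choice)
encounters = [(3600, "n1", "n2"), (90000, "n2", "n3"), (100000, "n2", "n1")]
segments = split_by_window(encounters, DAY)
for index in sorted(segments):
    print(index, sorted(segments[index]), sorted(population(segments[index])))
```

Each segment's pair set can then be turned into a graph and fed to clustering-coefficient and path-length routines. Filtering events by a minimum duration before this step is one way to probe whether encounter duration matters.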
Work with the Trace
  • Ask questions! What to look for from the trace?
    • Its importance
    • Its implication
    • Its potential usage
    • Its alternative solutions
  • Apply new techniques to look into the data
  • Find/Create interesting data sets
Lessons Learned
  • You need a lot of patience and care
    • Exceptions in the data
    • Flaws in your assumption
  • You need a lot of hard-drive space too!
  • You need good questions
    • For each question there are multiple ways to come up with an answer
    • New questions require new data sets and tools
  • You need to read a lot of papers
More Potential Directions
  • Mobility modeling/prediction
  • Data mining and clustering
  • Behavior-aware service/advertisements
  • Behavior-aware routing
    • Caveat: Over-generalization from WLAN to futuristic networks (such as DTN)?
  • Re-examine assumptions in earlier work
Related Skills
  • General programming (C/C++)
  • Perl/shell script/awk
  • Matrix manipulation (MATLAB)
  • Statistics software (R)
    • http://www.r-project.org/
  • Clustering/Machine learning
  • Principal component analysis/ Singular value decomposition
    • http://www.cs.cmu.edu/~elaw/papers/pca.pdf
  • Data mining? Database analysis?
Good Online Resources
  • MobiLib

http://nile.cise.ufl.edu/MobiLib

    • Links to various traces; the USC trace and some processing tools are available for download
  • CRAWDAD

http://crawdad.cs.dartmouth.edu/

    • Downloadable traces and related papers
References
  • [Stanford] D. Tang and M. Baker, “Analysis of a Local-area Wireless Network”
  • [Stanford2] D. Tang and M. Baker, “Analysis of a Metropolitan-area Wireless Network”
  • [Dartmouth] D. Kotz and K. Essien, “Analysis of a Campus-wide Wireless Network”
  • [Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, “The Changing Usage of a Mature Campus-wide Wireless Network”
  • [MIT/IBM] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network”
References
  • [UCSD] M. McNett and G. Voelker, “Access and Mobility of Wireless PDA Users”
  • [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, “Characterizing Flows in Large Wireless Data Networks”
  • [USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, “Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus”
  • [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, “Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks”
References
  • [Dartmouth] M. Kim and D. Kotz, “Methodology for Classifying Mobile Users and Access Points”
  • [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating Location Predictors with Extensive Wi-Fi Mobility Data”
  • [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, “Characterizing User Behavior and Network Performance in a Public Wireless LAN”
  • [INFOCOM05] C. Tuduce and T. Gross, “A Mobility Model Based on WLAN Traces and its Validation”
  • [T++-model] D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, “Model T++: An Empirical Joint Space-Time Registration Model”
  • [T-model] R. Jain, D. Lelescu, and M. Balakrishnan, “Model T: An Empirical Model for User Registration Patterns in a Campus Wireless LAN”
Mobility Observations from WLANs
  • Skewed location visiting preferences
    • Nodes spend 95% of time at top 5 preferred locations.
    • Heavily visited “preferred spots”
  • Periodical re-appearance
    • Nodes show up repeatedly at the same location after integer multiples of days.
    • Periodical “daily/weekly schedules”
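The location-preference observation can be checked per node with a computation along these lines: sum the time spent at each location and see how much of the total the top k locations account for. The session layout and the toy numbers are assumptions for illustration only.

```python
from collections import Counter

def top_k_time_fraction(sessions, node_id, k=5):
    """Fraction of a node's online time spent at its k most-visited locations.
    Sessions are (node_id, start_time, location, end_time) tuples."""
    time_at = Counter()
    for node, start, location, end in sessions:
        if node == node_id:
            time_at[location] += end - start
    total = sum(time_at.values())
    top = sum(duration for _location, duration in time_at.most_common(k))
    return top / total

sessions = [
    ("n1", 0, "library", 60),
    ("n1", 60, "office", 90),
    ("n1", 90, "cafe", 100),
    ("n2", 0, "cafe", 50),
]
print(top_k_time_fraction(sessions, "n1", k=2))
```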
Mobility Observations from WLANs
  • Problems of simple random models (random walk, random waypoint, random direction)
    • No preferred locations in spatial domain (uniform nodal distribution across space)
    • No structure in time domain (homogeneous behavior across time)
    • Nodes behave statistically identically to one another
  • Benefit: Math analysis tractability
  • Can we improve realism and not sacrifice math tractability?
Time-variant Community Model
  • Skewed location visiting preferences
    • Create “communities” to be the preferred destination
    • Each node can have its own community
  • Periodical re-appearance
    • Create structure in time – Periods
    • Nodes move with different parameters in different periods
    • Repetitive structure
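To make the two ingredients concrete, here is a heavily simplified sketch that only models destination choice: in each period a node picks a location inside its own community with probability `p_local` and anywhere otherwise, and the list of periods repeats. The parameter names and values are hypothetical, and this is not the full model (it ignores speeds, pause times, and geometry).

```python
import random

def simulate_destinations(community, all_locations, p_local_by_period,
                          steps_per_period, cycles, seed=0):
    """Return the sequence of destinations chosen by one node.

    p_local_by_period: one probability per period (hypothetical parameter);
    the whole list repeats `cycles` times to give a repetitive time structure.
    """
    rng = random.Random(seed)
    visited = []
    for _ in range(cycles):                    # repetitive structure
        for p_local in p_local_by_period:      # different parameters per period
            for _ in range(steps_per_period):
                if rng.random() < p_local:
                    visited.append(rng.choice(community))      # preferred spot
                else:
                    visited.append(rng.choice(all_locations))  # roam anywhere
    return visited

locations = ["office", "library", "class", "cafe", "gym"]
trace = simulate_destinations(["office", "library"], locations,
                              p_local_by_period=[0.9, 0.5],
                              steps_per_period=4, cycles=3)
print(trace)
```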


Time-variant Community Model
  • Major trends of mobility characteristics preserved (extensions later)
  • In addition, mathematical tractability is retained

[Figures: probability of re-appearance vs. time gap (days); average fraction of online time]
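A re-appearance curve of this kind could be estimated from a trace roughly as follows. The sketch assumes the probability for a gap of g days is the fraction of (node, location, day) visits that are followed by a visit of the same node to the same location exactly g days later; that definition is an assumption of this example.

```python
def reappearance_probability(visits, gap, num_days):
    """visits: set of (node, location, day) with day in range(num_days).
    Only visits whose day + gap still falls inside the trace are counted."""
    eligible = [(n, loc, d) for n, loc, d in visits if d + gap < num_days]
    if not eligible:
        return 0.0
    hits = sum((n, loc, d + gap) in visits for n, loc, d in eligible)
    return hits / len(eligible)

visits = {
    ("n1", "library", 0), ("n1", "library", 7), ("n1", "library", 14),
    ("n1", "cafe", 3),
}
for gap in (1, 7):
    print(gap, reappearance_probability(visits, gap, num_days=15))
```

In this toy input the weekly library visits give a high value at a gap of 7 days and zero at a gap of 1 day.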
Introduction
  • Wide-spread WLAN deployments create large-scale infrastructures.
    • Large numbers of users lead to large-scale management and design issues.
  • We need methods to quantify, summarize, and compare long-run trends (on the order of months) of individual user associations
    • Usage model / association model
    • Personalized services
    • Behavior aware ads / monetization
    • Behavior-aware routing protocols
Questions
  • Q1. How to quantify user association consistency?
    • (Challenge) What is a proper representation of user association, and how do we measure consistency?
  • Q2. How do we summarize long run user association patterns?
    • (Challenge) How to utilize existing data reduction techniques?
  • Q3. How to group users with similar association patterns?
    • (Challenge) How to quantify the similarity of user association patterns?
    • How to reduce computational complexity?
  • Contribution: Generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.
Representation of User Association Patterns
  • We represent the summary of a user's associations in each day by a single vector.
  • For a given day d, the association vector of user i is an n-element vector a = {a_j : the fraction of online time user i spends at AP_j on day d}.
    • The elements of a vector sum to 1.
    • Use the zero vector for off-line users.
  • The elements of the vector quantify the relative importance (or attraction) of each AP to the user.
  • Example: a day with (office, 10AM-12PM), (library, 1:30PM-2:30PM), and (class, 6PM-8PM) gives

Association vector: (library, office, class) = (0.2, 0.4, 0.4)
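The example vector can be reproduced with a few lines of code. The session layout `(ap, start_hour, end_hour)` is an assumption made for this sketch.

```python
def association_vector(sessions, access_points):
    """Fraction of one day's online time spent at each AP.

    sessions: list of (ap, start_hour, end_hour) for one user on one day.
    Returns the zero vector if the user was never online that day.
    """
    time_at = {ap: 0.0 for ap in access_points}
    for ap, start, end in sessions:
        time_at[ap] += end - start
    online = sum(time_at.values())
    if online == 0:
        return [0.0] * len(access_points)
    return [time_at[ap] / online for ap in access_points]

aps = ["library", "office", "class"]
day = [("office", 10, 12), ("library", 13.5, 14.5), ("class", 18, 20)]
vector = association_vector(day, aps)
print(vector)
```

Here the user is online for 5 hours in total: 1 hour at the library and 2 hours each at the office and in class.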

Q1. User Association Consistency
  • User i is consistent if its daily association vectors can be grouped into a few clusters (e.g., fewer than 10% of the number of days).
  • Evaluation: use hierarchical clustering with Manhattan distance measure (L1)
    • Distance between two vectors is at most 2, since the elements of each non-zero vector are non-negative and sum to 1.
Q1. User Association Consistency
  • Hierarchical Clustering
    • Start: Each vector is a single-member cluster.
    • Iteration: The two closest clusters are merged.
    • End: Stop when all remaining clusters are farther apart than a threshold.
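The procedure above can be sketched in plain Python as follows, using Manhattan (L1) distance between vectors and complete linkage (cluster distance = distance between the furthest members). This is a naive version meant for small inputs, not the implementation used in the study.

```python
def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def complete_link(c1, c2):
    """Distance between clusters = distance between their furthest members."""
    return max(manhattan(u, v) for u in c1 for v in c2)

def hierarchical_clusters(vectors, threshold):
    clusters = [[v] for v in vectors]          # start: singleton clusters
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:                # end: everything is far apart
            break
        _d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest
        del clusters[j]
    return clusters

days = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
clusters = hierarchical_clusters(days, threshold=0.9)
print(len(clusters), clusters)
```

The four daily vectors collapse into two "behavior modes", because the two groups are at L1 distance close to 2, well above the threshold.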
Q1. User Association Consistency
  • 80% of users show at most 9 clusters of “behavior modes” during the 94-day trace
  • Observation: many users are multimodal, but with far fewer association modes than the total number of days in the trace period.

[Figure: distribution of the number of clusters under cut-off threshold 0.9, complete link*]

*complete link: distance between clusters = distance between the furthest components in the considered clusters

Q2. Summarizing user associations
  • Association matrix: concatenate a user's daily association vectors for all days into a matrix.
  • To summarize, perform SVD and store the top-k singular values/vectors.
  • What value of k do we need for a good representation of the matrix?
    • Measured by the captured matrix power
  • How large is the reconstruction error?
    • Measured by relative matrix norms ||X - X_k||_p / ||X||_p, with X_k the rank-k reconstruction of X
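The formulas on this slide did not survive in the transcript. The sketch below assumes the usual reading: captured power is the share of the sum of squared singular values held by the top k, and the error shown is the relative Frobenius-norm error, which for the rank-k truncation depends only on the discarded singular values. The singular values are passed in directly; they could come from, for example, `numpy.linalg.svd(X, compute_uv=False)`.

```python
import math

def captured_power(singular_values, k):
    """Share of total power (sum of squared singular values) in the top k.
    Assumed definition; the slide's own formula is missing."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    return sum(squares[:k]) / sum(squares)

def relative_frobenius_error(singular_values, k):
    """||X - X_k||_F / ||X||_F for the rank-k SVD truncation X_k."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    return math.sqrt(sum(squares[k:]) / sum(squares))

def smallest_k(singular_values, target=0.9):
    """Smallest k whose captured power reaches `target`."""
    for k in range(1, len(singular_values) + 1):
        if captured_power(singular_values, k) >= target:
            return k
    return len(singular_values)

sigma = [3.0, 2.0, 1.0, 0.5]   # made-up singular values
print(smallest_k(sigma), captured_power(sigma, 2),
      relative_frobenius_error(sigma, 2))
```

Other norms (other values of p) would need the reconstructed matrix itself rather than only the singular values.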
Q2. Summarizing user associations
  • Only the top 6 singular vectors are needed to capture at least 90% of the power for more than 95% of association matrices
  • The reconstruction error of the low-rank approximation is low (5 singular vectors give error < 0.05)
  • Observation: although users are multi-modal, a few major modes dominate their behavior


Q3. Similarity Metrics between Users
  • Naive method to compare similarity between users i and j:
    • Intuition: if for every daily association vector of i there is a similar association vector for j, then (i, j) have similar behavior.
    • Pick the association vector a_id of user i on day d.
    • Find the association vector of user j, denoted a_jd', that is nearest to a_id.
    • Average |a_jd' - a_id| over all days d.
  • Drawback: expensive
    • O(nd^2) for each pair
    • Lots of file reads of raw data for a large dataset
  • Need a faster method that reads summaries
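The naive method can be sketched directly from the steps above. The nested loops over both users' days, each comparing n-element vectors, are where the O(nd^2) cost per pair comes from. L1 distance is assumed for |a_jd' - a_id|.

```python
def manhattan(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def naive_distance(days_i, days_j):
    """Average, over user i's daily vectors, of the distance to the nearest
    daily vector of user j. Note that the result is not symmetric in general."""
    total = 0.0
    for a_id in days_i:
        total += min(manhattan(a_id, a_jd) for a_jd in days_j)
    return total / len(days_i)

user_i = [(1.0, 0.0), (0.0, 1.0)]
user_j = [(1.0, 0.0), (0.5, 0.5)]
print(naive_distance(user_i, user_j))
```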
Q3. Similarity Metrics between Users
  • Compare the similarity of the singular vectors obtained from SVD.
  • Similarity between users is determined by weighted inner products of their singular vectors.
    • w_i = proportion of power captured by singular vector i
    • D(U,V) = 1 - Sim(U,V)
  • Are the 2 metrics similar?
    • 0.911 correlation coefficient for studied users.
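The exact formula for Sim(U,V) is not in the transcript. One plausible reading, used here purely as an assumption, is a double sum of absolute inner products between the two users' unit-length singular vectors, each term weighted by the product of the corresponding power proportions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def similarity(vectors_u, weights_u, vectors_v, weights_v):
    """Assumed form: sum_i sum_j w_ui * w_vj * |u_i . v_j|.
    Vectors are unit-length singular vectors; weights are the proportion
    of power each vector captures."""
    return sum(wu * wv * abs(dot(u, v))
               for u, wu in zip(vectors_u, weights_u)
               for v, wv in zip(vectors_v, weights_v))

def distance(vectors_u, weights_u, vectors_v, weights_v):
    """D(U,V) = 1 - Sim(U,V)."""
    return 1.0 - similarity(vectors_u, weights_u, vectors_v, weights_v)

# User U has one dominant mode; user V splits power 80/20 over two modes.
u_vectors, u_weights = [(1.0, 0.0)], [1.0]
v_vectors, v_weights = [(1.0, 0.0), (0.0, 1.0)], [0.8, 0.2]
print(similarity(u_vectors, u_weights, v_vectors, v_weights))
print(distance(u_vectors, u_weights, v_vectors, v_weights))
```

Because only a handful of singular vectors per user are compared, this works from stored summaries rather than from raw daily vectors.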
Q3. Similarity Metrics between Users
  • Are we able to get clusters with similar users?
  • Compare the PDF/CDF for inter- and intra- cluster users (Example: 200 clusters).
Q3. Similarity Metrics between Users
  • Take users in the same cluster, concatenate their association matrices, perform SVD, and find the power captured by the top k singular vectors.
  • Also take random users, concatenate in the same way, and repeat the analysis.
  • There is a clear distinction between the 2 clustering methods.

*straight-forward = similarity decided based on pair-wise comparison of association vectors

*feature-based = similarity decided based on singular vectors

Q3. Similarity Metrics between Users
  • For all clusters, use a scatter plot to show the power captured by the top-4 singular vectors (distance-based clusters vs. random clusters).