### Things about Trace Analysis

Wei-jen Hsu

In class presentation for CIS6930

wjhsu@ufl.edu

(Advisor: Ahmed Helmy)

Objective

- More background knowledge related to trace-based study
- Details about the trace format – an intro for one of the assignments
- Share the experience in trace analysis

Why trace analysis?

- Traces capture the “realism” of how the system actually works
- Verification of established system
- Diagnosis of system operation (identify faults)
- Identifying design flaws
- Large-scale properties (e.g. self-similar traffic)
- Understand how a new system works
- Provide domain knowledge for analysis work
- Verifying an idea

Typical Work Flow for Trace Analysis

- Build the system
- Identify point(s) of trace collection and the methodology used
- Obtain the data
- Clean-up and sanity check
- Analyze the data and post processing
- Explain the results
- Apply the results to further study or modify the existing system

WLAN Traces Study

- WLAN trace studies started around 2000
- WLAN was new, and researchers wanted to understand how people used it (usage study)
- Surveys vs. traces
- The works by Tang and Baker (’00) and Kotz and Essien (’02) are pioneering examples
- Statistics of usage (# of users, amount of traffic, etc.)

WLAN Traces Study

- Mobility-related
- MIT work (home location, prevalence, and persistence)
- UCSD (PDA users)
- WLAN mobility model (INFOCOM05, T-model, T++-model)
- Other user properties
- Handoff
- Pause time distribution

Trace Format

- For association
- Usually with format

(Node_id, start_time, location, end_time)

- But there are various ways to get there:
- Syslog: Event-based
- SNMP: Polling
- USC raw trace
- Wireless association (time start/stop switch-port MAC)
- DHCP log (time MAC IP)
- Traffic log

- USC wireless association trace

(Time Start/Stop Switch_IP Switch_port MAC_of_node)

Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae

Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1

Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a
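A minimal sketch of how the Start/Stop events above could be turned into (Node_id, start_time, location, end_time) tuples. It assumes that a Stop closes the most recent Start with the same MAC, switch IP and switch port, and that "switch IP:port" is good enough as a location label. The function names and the `year` default are hypothetical; the trace lines carry no year, so it has to be supplied.

```python
from datetime import datetime

def parse_association_line(line, year=2005):
    """Parse one 'Time Start/Stop Switch_IP Switch_port MAC' line.

    The year is NOT in the trace; 2005 is only a placeholder assumption.
    """
    tokens = line.split()
    # tokens: weekday, month, day, hh:mm:ss, event, switch IP, switch port, MAC
    stamp = datetime.strptime(
        f"{year} {tokens[1]} {tokens[2]} {tokens[3]}", "%Y %b %d %H:%M:%S"
    )
    return {
        "time": stamp,
        "event": tokens[4],
        "switch_ip": tokens[5],
        "switch_port": int(tokens[6]),
        "mac": tokens[7],
    }

def pair_sessions(records):
    """Pair each Stop with the latest open Start of the same MAC and port."""
    open_starts = {}
    sessions = []
    for rec in sorted(records, key=lambda r: r["time"]):
        key = (rec["mac"], rec["switch_ip"], rec["switch_port"])
        if rec["event"] == "Start":
            open_starts[key] = rec["time"]
        elif rec["event"] == "Stop" and key in open_starts:
            location = f'{rec["switch_ip"]}:{rec["switch_port"]}'
            sessions.append((rec["mac"], open_starts.pop(key), location, rec["time"]))
        # A Stop without a matching Start is dropped: one of the
        # "exceptions in the data" that the clean-up step has to handle.
    return sessions
```

Unmatched Start events simply stay open here; a real clean-up pass would have to decide whether to discard them or close them with a timeout.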

- USC DHCP trace

(Time IP_of_node MAC_of_node)

Jan 27 00:21:19 207.151.229.50 0:18:f3:10:ea:4c

Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92

Jan 27 00:21:20 207.151.229.50 0:18:f3:10:ea:4c

- USC traffic trace

(Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)

0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368

0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
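A hedged sketch of a parser for the traffic records above. It assumes the ten whitespace-separated fields map onto the header in the order shown, that the timestamps read as MMDD.HH:MM:SS.mmm, and that the year (absent from the trace) is supplied by the caller. The field names, the `year` default and the function name are hypothetical, and the field marked “?” is kept as an opaque string.

```python
from datetime import datetime

PROTOCOLS = {6: "TCP", 17: "UDP"}  # protocol numbers given in the header

def parse_time(text, year):
    # Assumed layout: MMDD.HH:MM:SS.mmm (the trace does not record the year)
    return datetime.strptime(f"{year}.{text}", "%Y.%m%d.%H:%M:%S.%f")

def parse_traffic_line(line, year=2006):
    """Parse one flow record; 2006 is only a placeholder year."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    start, end = parse_time(parts[0], year), parse_time(parts[1], year)
    proto = int(parts[6])
    return {
        "start": start,
        "end": end,
        "duration_s": (end - start).total_seconds(),
        "dst_ip": parts[2],
        "dst_port": int(parts[3]),
        "src_ip": parts[4],
        "src_port": int(parts[5]),
        "protocol": PROTOCOLS.get(proto, str(proto)),
        "unknown": parts[7],          # the field shown as "?" in the header
        "packets": int(parts[8]),
        "data_size": int(parts[9]),   # unit is not stated in the slides
    }

record = parse_traffic_line(
    "0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 "
    "207.151.239.208 1795 17 0 3 1368"
)
print(record["protocol"], record["packets"], record["data_size"])
```

Flows that cross midnight on December 31 would need extra care under this year-less timestamp format.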

Work with the Trace

- An exercise:

“Does the Encounter-Relationship graph change with respect to time??”

- From WLAN traces,

We use “encounters” to measure inter-node relationships

Note: Is this a good assumption?

Encounter distribution: how many other nodes does a node encounter?

Not many for WLAN users. On average, a node encounters only 2% to 7% of the population.

Figure axis: Prob. (unique encounter fraction > x)

Group of good friends…

Cliques with random links to join them

Encounter-Relationship graph: imagine a link connecting every pair of nodes that ever encounter each other. What does the graph look like?
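As a rough illustration, the encounter-relationship graph can be built from association sessions in the (Node_id, start_time, location, end_time) form. The sketch below assumes that an "encounter" means two different nodes associated with the same location during overlapping time intervals, with times given as plain numbers; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_encounters(sessions):
    """Return the set of node pairs that overlap in time at the same location."""
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))
    pairs = set()
    for visits in by_location.values():
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            # Strict overlap: sessions that merely touch do not count.
            if a != b and max(s1, s2) < min(e1, e2):
                pairs.add(tuple(sorted((a, b))))
    return pairs

def er_graph(sessions):
    """Adjacency sets: one link per pair of nodes that ever encountered."""
    graph = {node: set() for node, _, _, _ in sessions}
    for a, b in find_encounters(sessions):
        graph[a].add(b)
        graph[b].add(a)
    return graph

def encounter_fraction(graph, node):
    """Fraction of the other nodes that this node has encountered."""
    return len(graph[node]) / (len(graph) - 1)

toy_sessions = [
    ("A", 0, "ap1", 10),
    ("B", 5, "ap1", 15),
    ("C", 10, "ap1", 20),
    ("D", 0, "ap2", 100),
]
graph = er_graph(toy_sessions)
print(graph["B"], encounter_fraction(graph, "B"))
```

The pairwise comparison is quadratic in the number of sessions per location, which is fine for a toy example but would need a sweep-line or sorted-interval approach on a full campus trace.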

But is the ER graph a connected graph?

What are its properties?

Encounter-Relationship graph: to our surprise, ER graphs are connected!

In most cases the disconnected ratio (DR) reaches close to its final value in less than 1 day.

Figure axis: Disconnected Ratio (%)

Random graph

- Low path length
- Low clustering

Small-world graph

Regular graph

- High path length
- High clustering

Encounter-Relationship graph: what are the graph properties of the relationship graphs?

High clustering, like a regular graph

Low path length, like a random graph

Encounter-Relationship graph

- Relationship graphs are small-world graphs
- High clustering coefficient, low avg. path length

Figure: normalized clustering coefficient (CC) and path length (PL)
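The two metrics behind the small-world claim can be computed with a few lines of plain Python. This is a sketch under stated assumptions: the graph is an undirected adjacency-set dictionary, the clustering coefficient is the average of the local coefficients (nodes with fewer than two neighbors count as 0), and the path length is averaged only over pairs that are actually connected. The normalization used in the slides is not shown in the transcript and is not reproduced here.

```python
from collections import deque

def clustering_coefficient(graph):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, neighbors in graph.items():
        nbrs = list(neighbors)
        k = len(nbrs)
        if k < 2:
            continue  # assumption: such nodes contribute 0
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in graph[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)

def average_path_length(graph):
    """Mean shortest-path hop count over all connected node pairs (BFS)."""
    total = 0
    pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(clustering_coefficient(toy), average_path_length(toy))
```

Running BFS from every node costs O(V(V+E)), so for large traces sampling source nodes may be a practical compromise.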

Work with the Trace

- An exercise:

“Does the Encounter-Relationship graph change with respect to time??”

- Chop the trace into multiple segments
- Analyze the average clustering coefficient and average path length of the resultant graph
- How to deal with changing population?
- Does the encounter duration matter?
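One possible way to chop the trace, sketched under the assumption that sessions are (Node_id, start_time, location, end_time) tuples with numeric times and that fixed-length windows are acceptable. Sessions that span a window boundary are clipped so that each segment only sees the part that falls inside it. The function name is hypothetical.

```python
def split_into_segments(sessions, segment_length):
    """Group sessions by fixed-length time window, clipping at the boundaries."""
    segments = {}
    for node, start, location, end in sessions:
        index = int(start // segment_length)
        while index * segment_length < end:
            lo = max(start, index * segment_length)
            hi = min(end, (index + 1) * segment_length)
            if hi > lo:  # skip zero-length pieces
                segments.setdefault(index, []).append((node, lo, location, hi))
            index += 1
    return segments

toy_sessions = [("A", 5, "ap1", 25), ("B", 12, "ap1", 18)]
for index, part in sorted(split_into_segments(toy_sessions, 10).items()):
    print(index, part)
```

Each segment can then be analyzed on its own (graph construction, clustering coefficient, path length). Note that the node population differs between segments, which is exactly the "changing population" question raised above.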

Work with the Trace

- Ask questions! What to look for from the trace?
- Its importance
- Its implication
- Its potential usage
- Its alternative solutions
- Apply new techniques to look into the data
- Find/Create interesting data sets

Lessons Learned

- You need a lot of patience and care
- Exceptions in the data
- Flaws in your assumption
- You need a lot of hard-drive space too!
- You need good questions
- For each question there are multiple ways to come up with an answer
- New questions require new data sets and tools
- You need to read a lot of papers

More Potential Directions

- Mobility modeling/prediction
- Data mining and clustering
- Behavior-aware service/advertisements
- Behavior-aware routing
- Caveat: Over-generalization from WLAN to futuristic networks (such as DTN)?
- Re-examine assumptions in earlier work

Related Skills

- General programming (C/C++)
- Perl/shell script/awk
- Matrix manipulation (MATLAB)
- Statistics software (R)
- http://www.r-project.org/
- Clustering/Machine learning
- Principal component analysis/ Singular value decomposition
- http://www.cs.cmu.edu/~elaw/papers/pca.pdf
- Data mining? Database analysis?

Good Online Resources

- MobiLib

http://nile.cise.ufl.edu/MobiLib

- Links to various traces; the USC trace and some processing tools are available for download
- CRAWDAD

http://crawdad.cs.dartmouth.edu/

- Various traces for download, plus related papers

References

- [Stanford] D. Tang and M. Baker, “Analysis of a Local-area Wireless Network”
- [Stanford2] D. Tang and M. Baker, “Analysis of a Metropolitan-area Wireless Network”
- [Dartmouth] D. Kotz and K. Essien, “Analysis of a Campus-wide Wireless Network”
- [Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, “The Changing Usage of a Mature Campus-wide Wireless Network”
- [MIT/IBM] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network”

References

- [UCSD] M. McNett and G. Voelker, “Access and Mobility of Wireless PDA Users”
- [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, “Characterizing Flows in Large Wireless Data Networks”
- [USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, “Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus”
- [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, “Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks”

References

- [Dartmouth] M. Kim and D Kotz, “Methodology for Classifying Mobile Users and Access Points”
- [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating location predictors with extensive Wi-Fi mobility data”
- [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, “Characterizing User Behavior and Network Performance in a Public Wireless LAN”
- [INFOCOM05] C. Tuduce and T. Gross, “A Mobility Model Based on WLAN Traces and its Validation”
- [T++-model] D Lelescu, UC Kozat, R Jain, M Balakrishnan, “Model T++: an empirical joint space-time registration model”
- [T-model] R Jain, D Lelescu, M Balakrishnan, “Model T: an empirical model for user registration patterns in a campus wireless LAN”

Skewed location visiting preferences

Nodes spend 95% of time at top 5 preferred locations.

Heavily visited “preferred spots”

Periodical re-appearance

Nodes show up repeatedly at the same location after integer multiples of days.

Periodical “daily/weekly schedules”

Mobility Observations from WLANs

- Problems of simple random models (random walk, random waypoint, random direction)
- No preferred locations in spatial domain (uniform nodal distribution across space)
- No structure in time domain (homogeneous behavior across time)
- Nodes behave statistically identically to one another
- Benefit: Math analysis tractability
- Can we improve realism and not sacrifice math tractability?

Time-variant Community Model

- Skewed location visiting preferences
- Create “communities” to be the preferred destination
- Each node can have its own community
- Periodical re-appearance
- Create structure in time – Periods
- Nodes move with different parameters in different periods
- Repetitive structure

Figure labels: 75%, 25%; Avg. fraction of online time; Time gap (days)

Time-variant Community Model

- Major trends of mobility characteristics preserved (extensions later)
- In addition, mathematical tractability is retained

Introduction

- Widespread WLAN deployments create large-scale infrastructures.
- Large numbers of users lead to large-scale management and design issues.
- We need methods to quantify, summarize, and compare long-run trends (in the order of months) of individual user associations
- Usage model / association model
- Personalized services
- Behavior aware ads / monetization
- Behavior-aware routing protocols

Questions

- Q1. How to quantify user association consistency?
- (Challenge) What is a proper representation of user association, and how do we measure consistency?
- Q2. How do we summarize long run user association patterns?
- (Challenge) How to utilize existing data reduction techniques?
- Q3. How to group users with similar association patterns?
- (Challenge) How to quantify the similarity of user association patterns?
- How to reduce computational complexity?
- Contribution: generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.

(office, 10AM-12PM)

(class, 6PM-8PM)

Representation of User Association Patterns

- We represent the summary of a user's associations on each day by a single vector.
- For a given day d, the association vector of user i is an n-element vector $a = (a_1, \dots, a_n)$, where $a_j$ is the fraction of online time that user i spends at AP j on day d.
- The elements of a vector sum to 1.
- Use the zero vector for off-line users.
- The elements of the vector quantify the relative importance (or attraction) of each AP to the user.

Association vector: (library, office, class) =(0.2, 0.4, 0.4)
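A small sketch of how such a daily vector could be computed, assuming the day's sessions are given as (AP, start, end) tuples with times in hours. The office and class times below follow the slide; the library hour is a made-up value chosen so that the result matches the (0.2, 0.4, 0.4) example. The function name is hypothetical.

```python
def association_vector(day_sessions, access_points):
    """Fraction of the day's online time spent at each AP (zero vector if offline)."""
    online = {ap: 0.0 for ap in access_points}
    for ap, start, end in day_sessions:
        online[ap] += end - start
    total = sum(online.values())
    if total == 0:
        return [0.0] * len(access_points)  # off-line user
    return [online[ap] / total for ap in access_points]

day = [
    ("office", 10, 12),   # 10AM-12PM, from the slide
    ("class", 18, 20),    # 6PM-8PM, from the slide
    ("library", 14, 15),  # hypothetical one-hour visit
]
print(association_vector(day, ["library", "office", "class"]))
```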

Q1. User Association Consistency

- User i is consistent if its daily association vectors can be grouped into a few clusters (e.g., fewer than 10% of the number of days).
- Evaluation: use hierarchical clustering with Manhattan distance measure (L1)
- Distance between two vectors is at most 2.
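The bound of 2 follows directly from the definition, assuming both vectors are association vectors with nonnegative entries that each sum to 1:

```latex
\| a - b \|_1 \;=\; \sum_{j=1}^{n} |a_j - b_j|
\;\le\; \sum_{j=1}^{n} a_j + \sum_{j=1}^{n} b_j \;=\; 1 + 1 \;=\; 2
```

Equality holds when the two vectors have no AP in common. The distance from any nonzero association vector to the zero (off-line) vector is 1.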

Q1. User Association Consistency

- Hierarchical Clustering
- Start: Each vector is a single-member cluster.
- Recursion: Two closest clusters are merged.
- End: Stop when all remaining clusters are separated by distances larger than a threshold
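The procedure can be sketched in plain Python as follows. This is an illustrative, unoptimized version that assumes complete-link distance between clusters (the largest Manhattan distance between any two members) and uses 0.9 as the default cut-off; function names are hypothetical.

```python
def manhattan(a, b):
    """L1 distance between two association vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_link(c1, c2):
    """Distance between the two furthest members of the clusters."""
    return max(manhattan(a, b) for a in c1 for b in c2)

def cluster_vectors(vectors, threshold=0.9):
    """Agglomerative clustering: merge closest clusters until none is within threshold."""
    clusters = [[v] for v in vectors]  # start: one cluster per vector
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # all remaining clusters are further apart than the cut-off
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

days = [
    [0.2, 0.4, 0.4],
    [0.25, 0.35, 0.4],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]
print(len(cluster_vectors(days)))
```

The number of resulting clusters, compared with the number of days, gives the consistency measure. The exhaustive pair search is slow for long traces; a library implementation of hierarchical clustering would be the practical choice.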

Q1. User Association Consistency

Distribution of number of clusters under cut-off threshold 0.9

80% of users show at most 9 clusters of “behavior modes” during the 94-day trace

*complete link: distance between clusters = distance between the furthest components in the considered clusters

Observation: many users are multimodal, but with far fewer association modes than the total number of days in the trace period.

Q2. Summarizing user associations

- Association matrix: concatenate user association vectors for all days into a matrix.
- To summarize, perform SVD and store the top-k singular values/vectors.
- What value of k do we need for a good representation of the matrix?
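- Captured matrix power = $\sum_{i=1}^{k} \sigma_i^2 \,/\, \sum_{i=1}^{\mathrm{rank}(X)} \sigma_i^2$, where $\sigma_i$ is the i-th singular value of X
- How large is the reconstruction error?
- Matrix norms $\|X - X_k\|_p / \|X\|_p$, where $X_k$ is the rank-k reconstruction of X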
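A hedged numerical sketch of both quantities using NumPy (a third-party dependency). It assumes that "captured power" means the share of squared singular values held by the top k, and it uses the Frobenius norm as one possible choice of matrix norm; the function names are hypothetical.

```python
import numpy as np

def captured_power(X, k):
    """Share of the squared singular values captured by the top-k components."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def relative_error(X, k, norm_order="fro"):
    """Relative reconstruction error ||X - X_k|| / ||X|| of the rank-k approximation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return float(np.linalg.norm(X - X_k, norm_order) / np.linalg.norm(X, norm_order))

# Toy association matrix: 20 days in one mode, 10 days in another (rows = days).
X = np.array([[0.2, 0.4, 0.4]] * 20 + [[1.0, 0.0, 0.0]] * 10)
for k in (1, 2):
    print(k, captured_power(X, k), relative_error(X, k))
```

With the Frobenius norm the two measures are tied together: the squared relative error equals one minus the captured power. Both functions divide by the norm of X, so an all-zero matrix (a user who is never online) needs to be filtered out first.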

Q2. Summarizing user associations

Only the top 6 singular vectors are needed to capture at least 90% of power for more than 95% of association matrices

Reconstruction error of low-rank approximation is low (5 singular vectors give error < 0.05)

Observation: although users are multi-modal, a few major modes dominate their behavior

Q3. Similarity Metrics between Users

- Naive method to compare similarity between users i and j:
- Intuition: if for every daily association vector of i there is a similar association vector of j, then (i, j) have similar behavior.
- For user i, pick the association vector $a_i^d$ on day d.
- Find the association vector of user j, denoted $a_j^{d'}$, that is nearest to $a_i^d$.
- Average $|a_j^{d'} - a_i^d|$ over all days d.
- Drawback: expensive
- O(nd^2) for each pair
- Requires many file reads of raw data for large data sets
- Need a faster method that reads summaries
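The naive comparison can be sketched as below, assuming each user is represented by a list of daily association vectors and that "nearest" is measured with the Manhattan distance. The function names are hypothetical.

```python
def manhattan(a, b):
    """L1 distance between two association vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def naive_distance(days_i, days_j):
    """Average, over the days of user i, of the distance to user j's nearest daily vector."""
    total = 0.0
    for a in days_i:
        total += min(manhattan(a, b) for b in days_j)
    return total / len(days_i)

user_i = [[1.0, 0.0], [0.0, 1.0]]
user_j = [[1.0, 0.0], [1.0, 0.0]]
print(naive_distance(user_i, user_j), naive_distance(user_j, user_i))
```

Every day of user i is compared against every day of user j, each comparison touching all n elements, which is where the O(nd^2) cost per pair comes from. As written the measure is not symmetric, so a symmetric variant (for example averaging both directions) may be preferable in practice.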

Q3. Similarity Metrics between Users

- Compare the similarity of the singular vectors obtained from SVD.
- Similarity between users is determined by weighted inner products of their singular vectors.
- w_i = proportion of power captured by singular vector i
- D(U,V) = 1 - Sim(U,V)
- Are the 2 metrics similar?
- 0.911 correlation coefficient for studied users.
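The transcript does not preserve the similarity formula, so the sketch below rests on an assumed form: Sim(U,V) is the sum over all pairs of singular vectors of w_ui * w_vj * |u_i . v_j|, using the right singular vectors of each user's association matrix (rows = days, columns = APs) and weights equal to each vector's share of the squared singular values. It uses NumPy, and the function names and the default k are hypothetical.

```python
import numpy as np

def summarize(X, k):
    """Top-k right singular vectors of X and their shares of the matrix power."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    power = s ** 2 / np.sum(s ** 2)
    return Vt[:k], power[:k]

def similarity(X_u, X_v, k=5):
    """Assumed form: sum_i sum_j w_ui * w_vj * |u_i . v_j|."""
    U, w_u = summarize(X_u, k)
    V, w_v = summarize(X_v, k)
    return float(w_u @ np.abs(U @ V.T) @ w_v)

def distance(X_u, X_v, k=5):
    """D(U,V) = 1 - Sim(U,V)."""
    return 1.0 - similarity(X_u, X_v, k)

office_user = np.array([[0.2, 0.4, 0.4]] * 10)
same_habits = np.array([[0.2, 0.4, 0.4]] * 7)
elsewhere = np.array([[0.0, 0.0, 0.0, ][:3]] * 0 + [[1.0, 0.0, 0.0]] * 10)
print(distance(office_user, same_habits), distance(office_user, elsewhere))
```

Only the k vectors and weights per user need to be stored and read, which is the point of the summary-based method. Under this assumed form, two single-mode users with identical habits get similarity 1, and users whose dominant vectors are orthogonal get similarity 0.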

Q3. Similarity Metrics between Users

- Are we able to get clusters with similar users?
- Compare the PDF/CDF for inter- and intra- cluster users (Example: 200 clusters).

Q3. Similarity Metrics between Users

- Take users in the same cluster, concatenate their association matrices, perform SVD, and find the power captured by the top k singular vectors.
- Do the same for randomly chosen users.
- There is a clear distinction between the 2 clustering methods.

*straight-forward = similarity decided based on pair-wise comparison of association vectors

*feature-based = similarity decided based on singular vectors

Q3. Similarity Metrics between Users

- For all clusters, use a scatter plot to show the power captured by top-4 eigenvectors. (distance-based cluster vs random cluster)
