
User Behavior Analysis in Wi-Fi network

User Behavior Analysis in Wi-Fi network. Anna Rosenberg Supervisor: Orly Avner. Overview. The goal of this project: to analyze a Wi-Fi network’s APs to model the wireless clients using the network The contributions of this project: analysis of Access Points


Presentation Transcript


  1. User Behavior Analysis in Wi-Fi network • Anna Rosenberg • Supervisor: Orly Avner

  2. Overview • The goal of this project: • to analyze a Wi-Fi network’s APs • to model the wireless clients using the network • The contributions of this project: • analysis of Access Points • the use of k-means and g-means algorithms for clustering the network’s users

  3. Previous Work • "Modeling client arrivals at access points in wireless campus-wide networks" (Maria Papadopouli, Haipeng Shen, Manolis Spanakis) • models the arrival process of clients at APs as a time-varying Poisson process with different arrival-rate functions • analyzes the traffic load characteristics (e.g., bytes, number of packets, associations, distinct clients, type of clients) • clusters the APs based on their visit arrivals and on the building type

  4. Previous Work • "Characterizing user behavior and network performance in a public wireless LAN" (Anand Balachandran, Geoffrey Voelker, Paramvir Bahl, and Venkat Rangan). In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 2002. • Their overall analysis of user behavior shows that: • Users are evenly distributed across all APs, and user arrivals are correlated in time and space • User arrivals into the network can be modeled by a two-state Markov-Modulated Poisson Process (MMPP) • There is an implicit correlation between session duration and average data rate: longer sessions typically have very low data requirements, and most of the sessions with a high average data rate are very short

  5. Previous Work • "Modeling users' mobility among Wi-Fi access points" (Minkyong Kim, David Kotz) • Network messages were collected on the Dartmouth campus • Models user movements between APs • Clusters the APs based on their peak hour

  6. Data • Router (Sniffer) • Packets: • MAC address of the access points • MAC address of the user • Source/Destination IP addresses • Size of the packet • The time it was received

  7. IEEE 802.11 Architecture • Cells (each called a Basic Service Set, or BSS) • Base station (called an Access Point, or AP) • Access Points are connected through a backbone (called the Distribution System, or DS) • The examined network: 16 APs

  8. Arrival Rate at APs • Examples shown: AP1 and AP8 • Observed patterns: active from midday till the evening; active only in the evening

  9. Analyzing the arrival rate with different averaging windows

  10. Analyzing the arrival rate with different averaging windows

  11. Users • 3273 users • The transmission rate: (figure)

  12. Coherence with the time of lectures and breaks • Users are active during the breaks and inactive during the lectures, which last 50-55 minutes.

  13. Visit duration • How to define a visit? We chose 30 minutes as the maximal inter-arrival time between two consecutive packets that still belong to the same visit. A longer gap starts a new visit.
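
A minimal sketch of how the 30-minute rule from this slide could be applied to one user's packet timestamps. The function name, the use of seconds, and the choice to treat a gap of exactly 30 minutes as the same visit are assumptions made for illustration, not details taken from the project.

```python
from typing import List, Tuple

MAX_GAP_SECONDS = 30 * 60  # the 30-minute threshold chosen on the slide


def split_into_visits(timestamps: List[float],
                      max_gap: float = MAX_GAP_SECONDS) -> List[Tuple[float, float]]:
    """Group one user's packet timestamps (in seconds) into (start, end) visits."""
    visits: List[Tuple[float, float]] = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][1] <= max_gap:
            # Gap to the previous packet is small enough: extend the current visit.
            visits[-1] = (visits[-1][0], t)
        else:
            # First packet, or the gap exceeded the threshold: open a new visit.
            visits.append((t, t))
    return visits
```

For example, packets at 0, 600, 1200, 4000 and 4100 seconds fall into two visits, because the gap between 1200 and 4000 exceeds 1800 seconds.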

  14. Features • The average characteristics: • Average visit duration • Average inter-arrival time between the visits • Average traffic • Number of visits • Total number of days in the system • The std of inter-arrival times • The std of traffic • The std of visit duration
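
The per-user features listed on this slide could be computed roughly as follows. This is only a sketch under stated assumptions: the `(start, end, bytes)` visit layout, measuring the inter-arrival time from the end of one visit to the start of the next, using the population standard deviation, and counting "days in the system" as distinct days on which a visit starts are all hypothetical choices, since the slides do not define them.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Visit = Tuple[float, float, int]  # (start_s, end_s, bytes); hypothetical layout


def user_features(visits: List[Visit]) -> Dict[str, float]:
    """Summarize one user's visits; expects at least one visit."""
    visits = sorted(visits)
    durations = [end - start for start, end, _ in visits]
    traffic = [float(b) for _, _, b in visits]
    # Assumed definition: time from the end of one visit to the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(visits, visits[1:])]
    # Assumed definition: distinct 24-hour day indices on which a visit starts.
    days = {int(start // 86400) for start, _, _ in visits}
    return {
        "avg_duration": mean(durations),
        "std_duration": pstdev(durations),
        "avg_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
        "avg_traffic": mean(traffic),
        "std_traffic": pstdev(traffic),
        "num_visits": float(len(visits)),
        "num_days": float(len(days)),
    }
```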

  15. Features • No typical clusters can be found among the network's users • Plots: Av. inter visit times vs. Number of visits; Av. inter visit times vs. Av. visit duration

  16. Features • Plots: Av. traffic vs. Av. visit duration; Av. traffic vs. Number of visits

  17. Clustering • Unsupervised learning problem • Finding a structure in a collection of unlabeled data • Collection of objects which are “similar” • Distance measure

  18. K-Means Clustering • Features: • Average visit duration • Average inter-visit time • Average traffic per packet • Maximal distance between visits • Minimal distance between visits
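
For reference, a minimal sketch of the standard k-means procedure (Lloyd's algorithm) with Euclidean distance, which is the distance the later slides mention. It is not the project's code: the random initialization from the data points, the `seed` parameter and the empty-cluster handling are illustrative assumptions. Each point would be one user's feature vector.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 100,
           seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Return (centers, labels) after at most `iters` Lloyd iterations."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

Because the features have very different scales (seconds, bytes, counts), the result of such a procedure depends heavily on how the features are scaled before the distance is computed.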

  19. Results of K-Means Clustering • K=2 • Plots: Av. inter visit times vs. Av. traffic per packet; Av. visit duration vs. Av. inter visit times; Max. distance between visits vs. Min. distance between visits

  20. Results of K-Means Clustering • K=3 • Plots: Av. inter visit times vs. Av. traffic per packet; Av. visit duration vs. Av. inter visit times; Max. distance between visits vs. Min. distance between visits

  21. Results of K-Means Clustering • K=4 • Plots: Av. inter visit times vs. Av. traffic per packet; Av. visit duration vs. Av. inter visit times; Max. distance between visits vs. Min. distance between visits

  22. K-Means Clustering: conclusion • The k-means algorithm applied to the average characteristics of the network's users does not produce any isolated clusters. We therefore conclude that clustering on average characteristics does not separate the network's users well. • Possible reasons for the unsuccessful clustering: • The feature set doesn't provide enough information about the system • Not enough samples • Use of the Euclidean distance

  23. G-Means Clustering Algorithm • The right number k of clusters to use is often not obvious • G-means is based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution; a cluster that fails the test is split • The statistical significance level α is the desired probability of incorrectly splitting a cluster
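
The Gaussianity check at the heart of G-means is usually described as an Anderson-Darling test on one-dimensional data. The sketch below computes the Anderson-Darling statistic for normality with mean and variance estimated from the sample; the small-sample correction factor is the one I recall from the G-means literature and should be treated as an assumption, as should the function names. The critical value that corresponds to a chosen α is not derived here and has to be supplied by the caller. The full algorithm also projects each cluster onto the axis between two candidate child centers before testing, which this sketch omits.

```python
import math
from typing import List


def anderson_darling_normal(xs: List[float]) -> float:
    """Corrected Anderson-Darling statistic for normality of a 1-D sample."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    # Standardize, sort, and map through the standard normal CDF.
    z = sorted((x - mu) / sd for x in xs)
    cdf = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    # Clamp away from 0 and 1 so the logarithms stay finite.
    eps = 1e-12
    cdf = [min(max(p, eps), 1.0 - eps) for p in cdf]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    # Assumed small-sample correction for estimated mean and variance.
    return a2 * (1.0 + 4.0 / n - 25.0 / (n * n))


def looks_gaussian(xs: List[float], critical_value: float) -> bool:
    """Keep the cluster (True) unless the statistic exceeds the critical value."""
    return anderson_darling_normal(xs) <= critical_value
```

A smaller α means a larger critical value, so fewer clusters are split and the final number of clusters goes down.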

  24. G-means • A different feature set provides more data points: each visit is one point • Each point consists of the following components: • The visit duration • The time between this visit and the previous visit • The number of packets that were sent during the visit • The average amount of data that was accessed during the visit • The components are normalized to get proper results even with a simple Euclidean distance metric • 50 users with the maximal number of visits: 3457 points • Users with more than 10 visits: 572 users, 15105 points
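
The slide does not say which normalization was used. One common choice, shown here purely as an assumption, is to z-score each component so that durations, packet counts and data volumes contribute on a comparable scale to the Euclidean distance.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_columns(points: List[Sequence[float]]) -> List[List[float]]:
    """Rescale every component to zero mean and unit (population) std."""
    columns = list(zip(*points))
    mus = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 so it maps to all zeros.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(p, mus, sds)] for p in points]
```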

  25. G-means • The dependence of number of clusters on α:

  26. G-means results • 70 clusters • α = 0.0001 • 58 visits, 8788 packets, 30 clusters; the most common clusters: 11, 20, 29 and 35 • 59 visits, 28777 packets, 31 clusters; the most common cluster: 30

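  27. Evaluation • Purity: $\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$, where $\Omega = \{\omega_1, \dots, \omega_K\}$ is the set of clusters, $C = \{c_1, \dots, c_J\}$ is the set of classes, and $N$ is the total number of points • Example: the majority class and the number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ◊, 3 (cluster 3). With 17 points in total, purity is $(1/17) \times (5+4+3) \approx 0.71$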
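
Purity counts, for every cluster, the members of its majority class, sums these counts, and divides by the total number of points. A short sketch of that computation; the function and argument names are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, List


def purity(cluster_ids: List[Hashable], class_ids: List[Hashable]) -> float:
    """Fraction of points that belong to the majority class of their cluster."""
    per_cluster: Dict[Hashable, Counter] = {}
    for cl, cls in zip(cluster_ids, class_ids):
        per_cluster.setdefault(cl, Counter())[cls] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)
```

With 17 points whose majority counts are 5, 4 and 3, as in the slide's example, the function returns 12/17 regardless of how the remaining five points are distributed among the minority classes.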

  28. Evaluation • The dependence of the purity on α

  29. Evaluation • New evaluation measure: the degree to which each user can be represented by one typical cluster: $E = \frac{1}{N} \sum_{i=1}^{N} \frac{m_i}{n_i}$, where $N$ is the total number of users, $m_i$ is the number of samples contained in the most common cluster of user $i$, and $n_i$ is the total number of samples of user $i$ • Example: there are 3 users: x, o and ◊. The number of samples contained in the most common cluster and the total number of samples for the three users are: 5, 8 (user x); 4, 5 (user o); and 3, 4 (user ◊). $E = (1/3) \times (5/8 + 4/5 + 3/4) = 0.725$
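
The measure E on this slide averages, over users, the share of each user's samples that fall into that user's most common cluster. A sketch of the computation, assuming the cluster label of every sample is available per user; the function name and the input layout are hypothetical, and users with no samples are skipped.

```python
from collections import Counter
from typing import Dict, Hashable, List


def single_cluster_score(user_clusters: Dict[Hashable, List[Hashable]]) -> float:
    """Average over users of (samples in most common cluster) / (all samples)."""
    ratios = [max(Counter(labels).values()) / len(labels)
              for labels in user_clusters.values() if labels]
    return sum(ratios) / len(ratios)
```

A value of 1 would mean every user's visits all land in a single cluster; values well below 1 mean a typical user is spread over several clusters.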

  30. Evaluation • The dependence of the evaluation measure E on α

  31. G-Means Clustering: conclusion • The g-means algorithm applied to points that consist of the 4 per-visit characteristics described earlier cannot represent each user by one typical cluster. We therefore conclude that this algorithm does not cluster the network's users well. • Possible reasons for the unsuccessful clustering: • The feature set doesn't provide enough information about the system • Not enough samples • Use of the Euclidean distance

  32. Conclusions • The Access Points' arrival rate is coherent with the times of lectures and breaks. • The APs show low activity during the lectures and high activity during the breaks. • The k-means algorithm applied to the average characteristics of the network's users does not produce any isolated clusters, so clustering on average characteristics does not separate the network's users well. • The g-means algorithm applied to points that consist of the 4 per-visit characteristics described earlier cannot represent each user by one typical cluster, so this algorithm does not cluster the network's users well either.

  33. Future work • Select another subset of features • Use another clustering algorithm • Try to collect more data samples

  34. Questions
