
Presentation Transcript


  1. Towards a Unified Index Scheme for Mobile Data and Customer Profiles in a Location-Based Service Environment Vijay Atluri Joint Work with Nabil R. Adam and Mahmoud Youssef Center for Information Management Integration and Connectivity (CIMIC) Rutgers University Partially supported by a grant from NSF

  2. Outline • Introduction to the Mobile commerce environment • Research Problem • Our proposed solution: Unified Index Scheme • Summary and Future Work

  3. M-Commerce: Market Opportunities for Location-based Services • The global subscriber base for mobile location services • Expected to exceed 680 million users by the end of 2006 • About 50% of all mobile subscribers • More than 70% of mobile Internet users • Location services revenue • $2 billion at the end of 2002 • More than $18.5 billion by the end of 2006 • Forecast suggests • 31% will be generated in Western Europe • 22% in the USA • 47% in Japan and rest of the world Source: European Location Based Advertising (ELBA) (2003)

  4. Location Based Service Environment: A Scenario • Customers are moving in the proximity of a shopping mall • A store in the mall wants to attract the customers at the point where they are likely to buy • Store offers are personalized according to the customer profile (e.g., age, sex, salary) and preferences • A Location Service (LS) tracks customers' locations • The store queries the LS to obtain a list of customers in the proximity • The store sends offers to the customers who satisfy the profile and location criteria • The customer receives the offer Source: ELBA Yellow Map

  5. Personalization • Mobile Commerce applications use mass personalization by: • tracking customer profiles • querying the LS about their location and projected path • However, personalization raises privacy concerns

  6. Addressing Privacy Concerns The problem of customer privacy has to be studied in various contexts: • Customers' interests • Enjoy m-commerce convenience • Protect their private data • Businesses' interests • Ideally would like to reach the customers only when they are likely to buy • Federal laws • Guarantee customers the right to protect their data • Guarantee businesses the right to collect legitimate data

  7. The Problem with the Current Environment and a Solution • The Problem: The customer has to trust too many merchants with her profile • A Proposed Solution: The customer trusts only one third party • It is plausible then to have the LS act as that third party

  8. Objective • Storing profiles in the Location Service • reduces privacy invasion risks, yet • the amount of data would be much larger and the queries would be more complex • Our goal is to enhance • performance • accuracy of query processing

  9. Location Data • Customers are modeled as moving objects • Location data is continuously updated as customers move • Traditional databases cannot handle such data for the following reasons: • The rate of update is unprecedented • The queries submitted are spatio-temporal in nature • These queries may address future locations • A new paradigm, “Moving Objects Databases”, has emerged to handle such data

  10. Queries on Moving Objects • A point-in-time query with a spatial window • retrieve customers who are currently in the shopping mall • Time interval query with a spatial window (future query) • retrieve customers who will pass by the motel in the next 30 minutes • Continuous query • retrieve all the customers who are within 300 feet of the store • All these queries are modeled as a time interval and a spatial window
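The common structure behind these query types can be made concrete with a small sketch. The following Python fragment is illustrative only; the class and field names (SpatialWindow, MovingObjectQuery) and the sample coordinates are assumptions, not part of the original scheme.

```python
from dataclasses import dataclass

@dataclass
class SpatialWindow:
    """Axis-aligned rectangle (x_min, y_min, x_max, y_max)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class MovingObjectQuery:
    """Every query type reduces to a time interval plus a spatial window."""
    t_start: float
    t_end: float
    window: SpatialWindow

# Point-in-time query: "customers currently in the shopping mall"
now = 0.0
point_in_time = MovingObjectQuery(t_start=now, t_end=now,
                                  window=SpatialWindow(0, 0, 500, 300))

# Time-interval (future) query: "customers passing the motel in the next 30 minutes"
future = MovingObjectQuery(t_start=now, t_end=now + 30 * 60,
                           window=SpatialWindow(800, 200, 900, 260))

# A continuous query can be issued as a sequence of such queries re-evaluated over time.
```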

  11. Data Modeling of Moving Objects • The Moving Objects Spatio-Temporal Model (MOST) [Sistla et al. 97] • Location is a linear function of time • No need for update unless the motion parameters change or a threshold is reached • This technique turns the problem into indexing of line segments (motion lines) • Buckets Approach [Song and Roussopoulos 99] • The hash index is updated only when the object changes its sub-area [Figure: region being tracked divided into sub-areas 1-10, indexed by a hash index]
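A minimal sketch of the MOST idea, assuming a 2-D setting and illustrative names (MotionLine, needs_update, the drift threshold): location is stored as a reference point plus a velocity, so the stored record needs an update only when the prediction drifts too far from the observed position.

```python
from dataclasses import dataclass

@dataclass
class MotionLine:
    """MOST-style record: location is a linear function of time."""
    x0: float          # reference position at time t0
    y0: float
    vx: float          # velocity components
    vy: float
    t0: float          # time at which (x0, y0) was recorded

    def location_at(self, t: float) -> tuple[float, float]:
        """Predicted location at time t (may be a future time)."""
        dt = t - self.t0
        return self.x0 + self.vx * dt, self.y0 + self.vy * dt

    def needs_update(self, observed_x: float, observed_y: float,
                     t: float, threshold: float) -> bool:
        """Update only if the predicted position drifts beyond the threshold."""
        px, py = self.location_at(t)
        return ((observed_x - px) ** 2 + (observed_y - py) ** 2) ** 0.5 > threshold
```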

  12. Customer Profiles • A Customer Profile is “a hierarchical collection of personal information” (OPS 97) • Example: • Personal Contact (Name, Address, Telephone, ID, Email, Language) • Demographic (Date of birth, Gender, Marital status, Income, Education) • Business (Profession, Title, Industry, Company details) • …..
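As an illustration only, a hierarchical profile of this kind could be held as nested mappings; the group and field names below follow the slide's example, and the sample values are made up.

```python
# Hierarchical profile as nested mappings (illustrative sample values).
profile = {
    "personal_contact": {
        "name": "Jane Doe", "address": "123 Main St", "telephone": "555-0100",
        "id": "C-1001", "email": "jane@example.com", "language": "en",
    },
    "demographic": {
        "date_of_birth": "1985-04-02", "gender": "F",
        "marital_status": "single", "income": "20-30K", "education": "BS",
    },
    "business": {
        "profession": "engineer", "title": "analyst",
        "industry": "retail", "company": "Acme Corp.",
    },
}
```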

  13. Queries on Moving Objects and Customer Profiles • A point-in-time query • retrieve customers who are currently in the shopping mall with age = [18-23], sex=female • Time interval • retrieve customers who will pass by the motel in the next 60 minutes with state  PA • Continuous query • retrieve all the customers who are within 300 feet of the store with kids_4_to_8=T, Salary < $20K • Many queries may include a large number of attributes

  14. Queries on Two Databases vs. the Unified Index Scheme • With two databases, the best query plan would take three steps • a multidimensional query on the profiles database (t1) • a spatio-temporal query on the moving objects database (t2) • a join operation (t3) • Pros of the unified index • The query is performed in one pass • The number of records in the unified index (M) is less than the number of records in the profiles database (N) • Cons of the unified index • The dimensionality in the unified index is higher (K+L)

  15. Current Work Index Techniques for Moving Objects • Hash-based indexes • Produce a high level of false positives • Significant work: • Song and Roussopoulos 1999 • Tree-based indexes • Suffer from the curse of dimensionality • Significant work: • Elbassioni et al. 2002 • Saltenis et al. 2000 • Kollios et al. 1999 • Tayeb et al. 1997 • Non-moving-object indexes such as the X-Tree (BKK96) and the Pyramid Tree (BBK98) have their own limitations

  16. Some Simple Solutions that do not Work • Experimental study of Euclidean distance and cosine distance as a hash function • The distance is calculated from the origin • The incoming query is transformed into a distance interval [Dmin, Dmax] • The problem with this approach is the dead space around the query window • Almost all the records are recalled even at a few dimensions • The geometric relations that hold in low dimensionality do not hold in high dimensionality • Improving the approach by adding a cosine distance yielded unsatisfactory results • The recalled area was [Dmin, Dmax] × [Amax, Amin] • Obtaining the exact Amin and Amax in high-dimensional space is very difficult (non-convex optimization)
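To make the hashing experiment concrete, here is a minimal sketch, assuming distances are measured from the origin and buckets have a fixed width; the function names are illustrative. It shows how a query rectangle maps to a distance interval [Dmin, Dmax], which also covers dead space far from the rectangle and thus recalls many false positives.

```python
import math

def distance_bucket(point, bucket_width):
    """Hash a point by its Euclidean distance from the origin."""
    d = math.sqrt(sum(c * c for c in point))
    return int(d // bucket_width)

def query_distance_interval(low, high):
    """Map a query hyper-rectangle [low, high] to a distance interval [Dmin, Dmax].

    Dmin is the distance from the origin to the closest point of the rectangle,
    Dmax the distance to the farthest corner. Every record whose distance falls
    in this interval is recalled, including records far outside the rectangle
    (the 'dead space' responsible for the false positives)."""
    d_min_sq = d_max_sq = 0.0
    for lo, hi in zip(low, high):
        nearest = 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))
        farthest = max(abs(lo), abs(hi))
        d_min_sq += nearest * nearest
        d_max_sq += farthest * farthest
    return math.sqrt(d_min_sq), math.sqrt(d_max_sq)
```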

  17. Some Observations Investigating the possibility of using existing indexes showed that: • Tree-based indexes do not scale up in dimensionality • Established in the literature (Otterman 1992, Berchtold et al. 1998, C. Aggarwal and Yu 2000) • Hash-based indexes using Euclidean/cosine distances have very little room for improvement • Investigated experimentally • A new indexing scheme that supports moving objects and multidimensional data is needed • The nature of the data, the types of queries, and the desired precision should be the design basis for an application-specific indexing scheme • Our experimental study and the literature on high-dimensional indexing in other domains (e.g., Data Mining, IR, Image Processing) show that a “one size fits all” approach is not a good solution

  18. Characteristics of Profiles Data • Very low rate of update (almost static) • A large portion of the attributes are binary or categorical • Continuous (interval-scaled) attributes are usually modeled as ordinal categorical attributes • Considerable correlations exist among many of the attributes

  19. Goals and Issues • Performance • Profile data seems to lend itself to clustering • A clustering-based approach can be much more efficient than other existing approaches • A clustering-based approach would be approximate • Accuracy • Which, and how many, clusters to select • How to achieve the desired level of accuracy

  20. Our Approach • Cluster the customers based on their profiles (categorical clustering algorithm) • Construct a TPR-tree for each cluster (see the sketch below) [Figure: profiles database clustered into Cluster 1 … Cluster n, each linked to the corresponding location data in trees TPR-1 … TPR-n]
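A high-level sketch of the approach, with an assumed placeholder class standing in for the per-cluster TPR-tree (its internals are not shown) and cluster_of standing in for the categorical clustering algorithm; all names are illustrative.

```python
from collections import defaultdict

class SpatialIndexStub:
    """Placeholder for a per-cluster TPR-tree; only the interface is sketched."""
    def __init__(self):
        self.objects = {}
    def insert(self, customer_id, motion_line):
        self.objects[customer_id] = motion_line
    def delete(self, customer_id):
        self.objects.pop(customer_id, None)

def build_unified_index(profiles, cluster_of, location_of):
    """profiles: {customer_id: profile}; cluster_of: profile -> cluster label
    (e.g., produced by a categorical clustering algorithm);
    location_of: customer_id -> motion data for the spatial index."""
    trees = defaultdict(SpatialIndexStub)
    assignments = {}
    for cid, prof in profiles.items():
        label = cluster_of(prof)
        assignments[cid] = label
        trees[label].insert(cid, location_of(cid))
    return trees, assignments
```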

  21. The Effect of Breaking a TPR-tree into Multiple Coinciding Trees [Figure: number of I/O's per query vs. number of points in the tree] • The single tree outperforms the 10 trees by a factor of about 2

  22. The Different Steps • Improving accuracy • Sort the attributes based on their contribution to the accuracy of the clustering • Prune the categories of each attribute • Re-cluster using the pruned scheme, then re-build the scheme from the new clusters • The classification process • Query processing

  23. Improving Accuracy • Step 1: Sort the attributes based on their contribution to the accuracy of the clustering • For each attribute, construct a pivot table: the number of customers in each category in each cluster • Compute the probability of each cell in the pivot table • Compute the classification factor for that attribute: Classification Factor = A / T, where T = sum(Combined) and A = sum(Assigned) • Sort the attributes based on the classification factor (see the sketch after the example tables)

Example pivot tables for the attribute Salary Group (clusters 1-5):

Frequencies
Category    Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Combined   Assigned
1             8,544       7,244       1,754      30,524       3,999      52,065     30,524
2            11,754      18,066      18,191          95       7,267      55,373     36,257
…

Probabilities
Category    Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Combined   Assigned
1            0.0211      0.0179      0.0043      0.0755      0.0099      0.1287     0.0755
2            0.0291      0.0447      0.0450      0.0002      0.0179      0.1548     0.0897
…
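A sketch of Step 1, assuming (consistently with the example tables) that "Combined" is the total count of a category across all clusters and "Assigned" is the count falling in the category's dominant cluster; the function names are illustrative.

```python
from collections import Counter, defaultdict

def classification_factor(records, clusters, attribute):
    """records: list of flat profile dicts; clusters: parallel list of cluster labels.

    Builds the pivot table of category-by-cluster counts, then computes A / T,
    where T sums the 'combined' (total) count of every category and A sums the
    'assigned' count, taken here as the count in the category's dominant cluster."""
    pivot = defaultdict(Counter)              # category -> {cluster: count}
    for rec, cl in zip(records, clusters):
        pivot[rec[attribute]][cl] += 1
    total = assigned = 0
    for counts in pivot.values():
        total += sum(counts.values())
        assigned += max(counts.values())
    return assigned / total if total else 0.0

def sort_attributes(records, clusters, attributes):
    """Sort attributes by decreasing classification factor."""
    return sorted(attributes,
                  key=lambda a: classification_factor(records, clusters, a),
                  reverse=True)
```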

  24. Improving Accuracy • Step 2: Prune the categories of each attribute to improve performance and reduce the memory requirement • Some attributes have values that contribute very little to the classification scheme • Removing these values reduces the amount of calculation during query processing • Set a pruning threshold h • A category x is pruned if its combined probability, combined(x), falls below a cutoff determined by h and the number of categories m

Example (Attribute: Salary Group, probabilities):
Category    Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Combined
1            0.0211      0.0179      0.0043      0.0755      0.0099      0.1287
2            0.0291      0.0447      0.0450      0.0002      0.0179      0.1548
…
8            0           0           0.00003     ~0          0           0.00003    (to be pruned)
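A sketch of Step 2. The exact pruning condition is given by a formula on the original slide; the cutoff used below (combined probability below h / m) is only a stand-in assumption to show the mechanics.

```python
def prune_categories(prob_table, h):
    """prob_table: {category: {cluster: probability}}; h: pruning threshold.

    Stand-in rule (assumption, not the slide's formula): prune a category whose
    combined probability is below h / m, where m is the number of categories."""
    m = len(prob_table)
    cutoff = h / m if m else 0.0
    return {cat: cols for cat, cols in prob_table.items()
            if sum(cols.values()) >= cutoff}
```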

  25. At 1% Pruning Threshold, 16% of the categories are eliminated

  26. The Classification Procedure • A record can be guided to a cluster (or clusters) as follows: • Start with a classification array of zero probabilities (one entry per cluster C1 … Cn) • Follow the order of the sorted attributes • For each pivot table of an attribute: • Find that attribute in the record and look up its value • Find the column corresponding to that value in the pivot table • Add that column to the classification array • After the last pivot table is visited, the cluster(s) with the maximum probability are selected [Figure: a record with attribute values (e.g., Age = 5, Salary = 1) accumulating pivot-table columns into a final array over clusters C1 … Cn, from which the target cluster(s) are chosen]
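A sketch of the classification procedure, assuming pivot tables are stored as {category: {cluster: probability}} and have already been sorted by classification factor; names are illustrative.

```python
from collections import defaultdict

def classify(record, sorted_pivot_tables):
    """sorted_pivot_tables: list of (attribute, {category: {cluster: prob}}),
    ordered by classification factor. Returns the cluster(s) with the highest
    accumulated probability for this record."""
    scores = defaultdict(float)                    # the classification array
    for attribute, table in sorted_pivot_tables:
        value = record.get(attribute)
        column = table.get(value)
        if column is None:                         # value pruned or missing
            continue
        for cluster, prob in column.items():       # add the column
            scores[cluster] += prob
    if not scores:
        return []
    best = max(scores.values())
    return [c for c, s in scores.items() if s == best]
```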

  27. Improving Accuracy • Step 3: Re-cluster the data using the pruned classification scheme and re-build the tables to eliminate mis-clustered records • Re-classify each record in the database using the pruned classification tables (the re-classification procedure is the same as the classification procedure) • Rebuild the pivot tables using the frequencies obtained in the re-classification • Re-compute the probabilities [Figure: profiles flow through the classification scheme into clusters, yielding an updated classification scheme]

  28. Query Processing • Break a query into: (1) a profiles query and (2) a location query • Process the profiles query; its answer is a pointer to one or more clusters • Process the location query on the moving-objects tree(s) corresponding to the selected cluster(s) • The resulting list of IDs can be further processed to eliminate false positives (see the sketch below) [Figure: the profiles database routes the query to Cluster 1 … Cluster n, each backed by a tree TPR-1 … TPR-n that returns a list of IDs]
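An end-to-end sketch of this flow, assuming the cluster list has already been produced by the profile-query step and that each per-cluster tree exposes a search(location_query) method (an assumed interface); the final filter eliminates false positives against the full profiles.

```python
def process_query(clusters, location_query, trees, profiles, profile_predicates):
    """clusters: cluster label(s) selected by the profile-query step;
    trees: {cluster: spatial index with .search(location_query) -> iterable of ids};
    profiles: {customer_id: flat profile dict};
    profile_predicates: {attribute: value or (low, high) range}."""
    candidate_ids = []
    for cl in clusters:                              # location query per selected cluster
        candidate_ids.extend(trees[cl].search(location_query))
    # optional post-filtering step to eliminate false positives
    return [cid for cid in candidate_ids
            if matches(profiles[cid], profile_predicates)]

def matches(profile, predicates):
    """Exact check of a flat profile against point or (low, high) range predicates."""
    for attr, pred in predicates.items():
        val = profile.get(attr)
        if isinstance(pred, tuple):
            if not (pred[0] <= val <= pred[1]):
                return False
        elif val != pred:
            return False
    return True
```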

  29. Processing a Point Query on Profiles • Query processing is closely similar to classification • A point query is guided to a cluster (or clusters) as follows: • Start with a Query Answer Array of zero probabilities • Follow the order of the sorted attributes • For each pivot table of an attribute (e.g., Age): • If that attribute appears in the query, look up its value (e.g., 5), find the column corresponding to that value in the pivot table, and add that column to the Query Answer Array • If the attribute is not in the query, ignore it • After the last pivot table is visited, the cluster(s) with the maximum probability are selected [Figure: a query with Age = 5 and Salary = 1 accumulating pivot-table columns into a final array over clusters C1 … Cn, from which the target cluster(s) are chosen]

  30. Processing a Range Query on Profiles • After retrieving a query key: • If the query key is a single value, proceed as in a point query • If the query key is a range (e.g., Salary = 1-2), add all the columns representing the categories in the range to the Query Answer Array (see the sketch below)

Example query: Age = 5, Sex = 1, Salary = 1-2, …

Attribute: Salary Group (pruned probability table)
Category    Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
1            0.0211      0.0179      0.0043      0.0755      0.0099
2            0.0291      0.0447      0.0450      0.0002      0.018
6            0.0003      0.0086      0.0378      0.0720      0.0003
…

Query Answer Array (cumulative):
Cluster 1: 0.01   Cluster 2: 0.03   Cluster 3: 0.45   Cluster 4: 0.08   Cluster 5: 0.0
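A sketch covering both point and range keys, assuming categories are ordinal codes so a range such as Salary = 1-2 can be expanded by comparison; names are illustrative.

```python
from collections import defaultdict

def classify_query(predicates, sorted_pivot_tables):
    """predicates: {attribute: single category OR (low_category, high_category)}.
    Accumulates pivot-table columns into a Query Answer Array and returns the
    cluster(s) with the highest cumulative probability."""
    scores = defaultdict(float)                    # the Query Answer Array
    for attribute, table in sorted_pivot_tables:
        if attribute not in predicates:            # attribute absent from the query: ignore
            continue
        pred = predicates[attribute]
        if isinstance(pred, tuple):                # range key: add every column in the range
            columns = [cols for cat, cols in table.items()
                       if pred[0] <= cat <= pred[1]]
        else:                                      # point key: add the single matching column
            columns = [table[pred]] if pred in table else []
        for cols in columns:
            for cluster, prob in cols.items():
                scores[cluster] += prob
    if not scores:
        return []
    best = max(scores.values())
    return [c for c, s in scores.items() if s == best]
```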

  31. Achieving Desired Accuracy • The final classification array includes the cumulative probabilities assigned by the attributes to each cluster • There is a tradeoff between the number of clusters to be considered (accuracy) and the performance of the scheme • We adopt the F-score as a measure of accuracy • The question becomes: find the number of clusters k to process the query on such that the desired F-score is achieved, where p_i is the i-th score from the top and n_i is the number of records in the i-th cluster

  32. Insertion and Deletion • The profile of a customer entering the service area of the Location Service must be inserted into the active database (from the reference profiles database) • Similarly, a customer leaving that area should be deleted from the active database • Both operations start by classifying the customer to a cluster based on her profile; the actual insertion or deletion is then performed on the corresponding TPR-tree (see the sketch below)
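A sketch of insertion and deletion, reusing the classify function and the per-cluster tree interface assumed in the earlier sketches.

```python
def insert_customer(customer_id, profile, motion_line, sorted_pivot_tables, trees):
    """Classify the entering customer, then insert into the matching per-cluster tree."""
    cluster = classify(profile, sorted_pivot_tables)[0]
    trees[cluster].insert(customer_id, motion_line)
    return cluster

def delete_customer(customer_id, profile, sorted_pivot_tables, trees):
    """Classify the leaving customer, then delete from the matching per-cluster tree."""
    cluster = classify(profile, sorted_pivot_tables)[0]
    trees[cluster].delete(customer_id)
```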

  33. Summary • Studied the TPR-tree behavior and Euclidean/cosine distance hashing • Presented a unified indexing scheme for a location-based service environment based on clustering and classification • The goal is to preserve customer privacy • The scheme overcomes the performance problems facing high-dimensional indexes • The accuracy of the output is controlled through the number of clusters to be processed • Presented an experimental study on the accuracy of the proposed scheme

  34. Future Work • Study the effect of changing the index parameters (e.g., buffer size, page size) on the performance of the TPR-tree • Enforce access control for selective exposure of profile information to merchants • Enhance the unified index with access authorizations
