
Multidimensional Data


royrobert


Presentation Transcript


  1. Multidimensional Data • Many applications of databases are "geographic" = 2-dimensional data. Others involve large numbers of dimensions. • Example: data mining information about sales. • A sale is described by (store, day, item, color, size, etc.). Sale = point in 5-dim space. • A customer is described by (age, salary, pcode, marital-status, etc.). Typical Queries • Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" • Nearest neighbor: "If I am at coordinates (x,y), what is the nearest McDonalds?" • They are expressible in SQL. Do you see how?

  2. SQL • Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" SELECT COUNT(*) FROM Customers WHERE age >= 45 AND age <= 55 AND sal < 100; (sal is in thousands) • Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?" Suppose we have a relation Points(x,y,name). SELECT * FROM Points p WHERE p.name = 'McDonalds' AND NOT EXISTS ( SELECT * FROM Points q WHERE (q.x-a)*(q.x-a)+(q.y-b)*(q.y-b) < (p.x-a)*(p.x-a)+(p.y-b)*(p.y-b) AND q.name = 'McDonalds' );
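The nearest-neighbor query on this slide can be tried out with Python's built-in sqlite3 module. This is only a small sketch: the sample rows, the query point (3, 4), and the helper name `nearest` are made up for illustration, and the coordinates a and b are passed as named parameters.

```python
import sqlite3

# In-memory database with the relation Points(x, y, name) from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Points (x REAL, y REAL, name TEXT)")
conn.executemany(
    "INSERT INTO Points VALUES (?, ?, ?)",
    [
        (1.0, 1.0, "McDonalds"),
        (4.0, 5.0, "McDonalds"),
        (9.0, 9.0, "McDonalds"),
        (3.0, 4.0, "Wendys"),  # at the query point itself, but excluded by name
    ],
)

# p is the answer if no other McDonalds q is strictly closer to (a, b).
NEAREST_SQL = """
SELECT p.x, p.y, p.name
FROM Points p
WHERE p.name = 'McDonalds'
  AND NOT EXISTS (
    SELECT * FROM Points q
    WHERE q.name = 'McDonalds'
      AND (q.x - :a) * (q.x - :a) + (q.y - :b) * (q.y - :b)
        < (p.x - :a) * (p.x - :a) + (p.y - :b) * (p.y - :b)
  )
"""


def nearest(connection, a, b):
    """Return all McDonalds rows at minimal squared distance from (a, b)."""
    return connection.execute(NEAREST_SQL, {"a": a, "b": b}).fetchall()


print(nearest(conn, 3.0, 4.0))
```

Without a suitable index, the engine has to compare every candidate p against every q, which is the cost problem the following slides address.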

  3. Big Impediment • For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE-clause. Approaches • Index on attributes independently. • Intersect pointers in main memory to save disk I/O. • Problem: Does this structure help with the nearest-neighbor? • Multiple key index: Index on one attribute provides pointer to an index on the other.

  4. Computing aggregates • Sales(day,store,item,color,size) • Such relations are called "data cubes" • Example query: • "Summarize the sales of white shirts by day and store." SELECT day, store, COUNT(*) AS totalSales FROM Sales s WHERE s.item = 'Shirt' AND s.color = 'white' GROUP BY day, store;

  5. Attempt at using B-trees for MD-queries • Database = 1,000,000 points evenly distributed in a 1000×1000 square. Stored in 10,000 blocks (100 recs per block) • Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550} • B-tree indexes with pointer lists on x and on y • 100,000 pointers (i.e. 1,000,000/10) for x, and same for y • 10,000 pointers for answer (found by pointer intersection) • Root of each B-tree in main memory • Suppose leaves have avg. 200 keys ⇒ 500 disk I/O's in each B-tree to get pointer lists ⇒ 1000 + 2 (for intermediate B-tree level) disk I/O's • Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O's • Total: 11,002 disk I/O's • Sequential scan of file = 10,000 I/O's (100 tuples per block), so the indexes do not beat a plain scan here
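The arithmetic on this slide can be re-derived in a few lines of Python. The constant and function names below are made up for this sketch; the numbers are the ones stated on the slide.

```python
import math

TOTAL_POINTS = 1_000_000   # points in the 1000 x 1000 square
SIDE = 1000                # side length of the square
RANGE_WIDTH = 100          # 450..550 in each dimension
RECORDS_PER_BLOCK = 100
KEYS_PER_LEAF = 200        # average keys per B-tree leaf


def index_plan_cost():
    """Disk I/Os for the two-B-tree plan, root of each tree in memory."""
    # Points are evenly distributed, so 1/10 of them fall in each 1-D range.
    pointers_per_dim = TOTAL_POINTS * RANGE_WIDTH // SIDE           # 100,000
    leaf_ios = math.ceil(pointers_per_dim / KEYS_PER_LEAF)          # 500
    index_ios = 2 * (leaf_ios + 1)     # two indexes, one intermediate node each
    # Expected size of the intersection: 1/100 of all points.
    answer_size = TOTAL_POINTS * RANGE_WIDTH * RANGE_WIDTH // (SIDE * SIDE)
    # Worst case: every answer record lives in a different block.
    return index_ios + answer_size


def scan_cost():
    """Disk I/Os for reading the whole file sequentially."""
    return TOTAL_POINTS // RECORDS_PER_BLOCK


print(index_plan_cost(), scan_cost())
```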

  6. Nearest Neighbor query using B-trees • Turn NN to (10,20) into a range-query {(x,y): 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d} • Possible problems: • No point in the selected range • The closest point inside the range may not be the answer (a point just outside the square can be nearer than one in a corner of it) • Solution: re-execute the range query with a slightly larger d
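A minimal sketch of this retry loop in Python follows. It is not taken from the slides: `range_query` is a stand-in that filters an in-memory list instead of using B-tree indexes, and the growth policy (double d when the square is empty, otherwise enlarge d to the distance of the best candidate) is one possible choice for "slightly larger d".

```python
import math


def range_query(points, cx, cy, d):
    """Stand-in for the index lookup: points in the square of half-width d."""
    return [(x, y) for (x, y) in points if abs(x - cx) <= d and abs(y - cy) <= d]


def nearest_neighbor(points, cx, cy, d=1.0):
    """Find the point nearest to (cx, cy) by repeated range queries."""
    if d <= 0:
        raise ValueError("d must be positive")
    if not points:
        return None
    while True:
        candidates = range_query(points, cx, cy, d)
        if not candidates:
            d *= 2  # empty square: widen the search and retry
            continue
        best = min(candidates, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        r = math.hypot(best[0] - cx, best[1] - cy)
        if r <= d:
            # Every point outside the square is farther than d, so best wins.
            return best
        # best sits in a corner region; a square of half-width r covers the
        # whole circle of radius r, so one more query settles the answer.
        d = r


pts = [(10.9, 20.9), (11.2, 20.0), (50.0, 50.0)]
print(nearest_neighbor(pts, 10, 20, d=1.0))
```

In the usage line, the first square contains only (10.9, 20.9), yet (11.2, 20.0) just outside it is closer, which is the second problem listed on the slide.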

  7. NN-queries, example • Same relation Points and its indexes on x and y as before, and Query: NN to (10,20) • Choose d = 1 ⇒ range-query = {(x,y): 9 ≤ x ≤ 11, 19 ≤ y ≤ 21} • 2000 points in [9,11], same in [19,21] ⇒ each dimension = 10+1 I/O's to get pointers (+1 is because points with x=9 may not start just at the beginning of the leaf) • With an extra I/O for the intermediate node for each index ⇒ 24 + 1 disk I/O's to get the answer, assuming 1 of the 4 points in the intersection (2000 × 2000 / 1,000,000 = 4 expected) is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points • However, if d is too small, we have to run another range query with a larger d

  8. Grid files (hash-like structure) • Divide data into stripes in each dimension • Each rectangle in the grid points to a bucket • Example: database records (age, salary) for people who buy gold jewelry, with salary in thousands. Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

  9. Grid file

  10. Operations • Lookup: find the coordinates of the point in each dimension; this gives you a bucket to search. • Nearest Neighbor: look up point P and consider the points in that bucket. • Problem: there could be points in adjacent buckets that are closer. • Example: NN of (45, 200). • Problem: there could be no points at all in the bucket: widen search? • Range Queries: ranges define a region of buckets. • Buckets on the border may contain points not in range. • Example: 35 < age <= 45; 50 < salary <= 100. • Queries Specifying Only One Attribute • Problem: must search a whole row or column of buckets.
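The lookup and range-query operations can be sketched in Python as below. The grid figure is not part of this transcript, so the stripe boundaries (age cut at 40 and 55, salary cut at 90 and 225) are assumptions, as are the class and method names and the rule that a value lying exactly on a boundary belongs to the upper stripe. The data points are the ones listed on the grid-file slide.

```python
from bisect import bisect_right

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]


class GridFile:
    """Toy 2-D grid file: a dict maps (age stripe, salary stripe) to a bucket."""

    def __init__(self, age_cuts, salary_cuts):
        self.age_cuts = age_cuts          # assumed stripe boundaries
        self.salary_cuts = salary_cuts
        self.buckets = {}

    def cell(self, age, salary):
        """Stripe numbers of the point in each dimension."""
        return (bisect_right(self.age_cuts, age),
                bisect_right(self.salary_cuts, salary))

    def insert(self, age, salary):
        self.buckets.setdefault(self.cell(age, salary), []).append((age, salary))

    def lookup(self, age, salary):
        """Exact match: only one bucket has to be searched."""
        return (age, salary) in self.buckets.get(self.cell(age, salary), [])

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo < age <= age_hi and sal_lo < salary <= sal_hi."""
        i_lo, j_lo = self.cell(age_lo, sal_lo)
        i_hi, j_hi = self.cell(age_hi, sal_hi)
        hits = []
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                # Border buckets may hold points outside the range: filter.
                for (a, s) in self.buckets.get((i, j), []):
                    if age_lo < a <= age_hi and sal_lo < s <= sal_hi:
                        hits.append((a, s))
        return hits


grid = GridFile(age_cuts=[40, 55], salary_cuts=[90, 225])
for point in DATA:
    grid.insert(*point)

print(grid.range_query(35, 45, 50, 100))
print(grid.buckets[grid.cell(45, 200)])
```

With these assumed boundaries, the bucket for (45, 200) holds only (50,100) and (50,120), while (30,260) and (60,260) in neighboring buckets are closer, so a nearest-neighbor search has to look beyond the home bucket.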

  11. Insertion • Use overflow buckets, or split stripes in one or more dimensions • Insert (52,200). Split central bucket, for instance by splitting central salary stripe • The blocks of 3 buckets are to be processed. • In general the blocks of n buckets are to be processed during a split. • n is the number of buckets in the chosen direction • Very expensive.

  12. Insertion • Insert (52,200). Split central bucket, for instance by splitting central salary stripe (One possibility)

  13. Grid files Advantages • Good for multiple-key search • Supports partial-match (PM), range (RQ), and nearest-neighbor (NN) queries Disadvantages • Space, management overhead • Need partitioning ranges that evenly split keys • Possibility of overflow buckets for insertion

  14. Partitioned hashing I • If we hash the concatenation of several keys, then such a hash table cannot be used in queries specifying only one dimension (key). • A preferable option is to design the hash function so it produces some number of bits, say k. These k bits are divided among n attributes. • I.e. the hash function h is a concatenation of n hash functions, one for each dimensional attribute. • h = (h1, …, hn) • The bucket for a tuple (v1, …, vn) is computed by concatenating the bit sequences h1(v1)…hn(vn).

  15. Partitioned hashing II • If we have a hash table with 10-bit bucket numbers (1024 buckets), we could devote four bits to attribute a and the remaining six bits to attribute b. • We hash using ha and hb. • If we are given a partial match query specifying only the value of a, we compute ha(a), which could be, say 0101. Then, we locate all the relevant buckets, which are: 0101000000 to 0101111111.
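The 4-bit/6-bit split can be sketched in Python as follows. The two hash functions (value modulo 16 and modulo 64) are placeholders chosen for this sketch, since the slides do not say what ha and hb are; the function names are also hypothetical.

```python
A_BITS = 4   # bits devoted to attribute a
B_BITS = 6   # bits devoted to attribute b


def h_a(a):
    """Placeholder hash for attribute a, yielding 4 bits."""
    return a % (1 << A_BITS)


def h_b(b):
    """Placeholder hash for attribute b, yielding 6 bits."""
    return b % (1 << B_BITS)


def bucket(a, b):
    """10-bit bucket number: h_a(a) followed by h_b(b)."""
    return (h_a(a) << B_BITS) | h_b(b)


def buckets_for_a(a):
    """Partial match on a: fix the 4 high bits, enumerate all 6 low bits."""
    prefix = h_a(a) << B_BITS
    return list(range(prefix, prefix + (1 << B_BITS)))


def buckets_for_b(b):
    """Partial match on b: fix the 6 low bits, enumerate all 4 high bits."""
    return [(high << B_BITS) | h_b(b) for high in range(1 << A_BITS)]


candidates = buckets_for_a(5)          # h_a(5) = 0101
print(format(candidates[0], "010b"), format(candidates[-1], "010b"))
```

A query on a alone inspects 64 of the 1024 buckets, and a query on b alone inspects 16, so giving more bits to an attribute makes partial-match queries on that attribute cheaper.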

  16. Partitioned hashing III • Example: Gold jewelry with • first bit = age mod 2 • bits 2 and 3: salary mod 4 • Works well for: • partial match (i.e. only some of the attributes specified) • Bad for: • range queries • Nearest Neighbor queries
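The gold-jewelry example can be played through in Python, reusing the (age, salary) points listed on the grid-file slide. The helper names are made up for this sketch.

```python
DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]


def bucket(age, salary):
    """3-bit bucket: first bit = age mod 2, bits 2 and 3 = salary mod 4."""
    return ((age % 2) << 2) | (salary % 4)


# Build the 8 buckets.
table = {b: [] for b in range(8)}
for age, salary in DATA:
    table[bucket(age, salary)].append((age, salary))


def partial_match_age(age):
    """Only the 4 buckets whose first bit equals age mod 2 are searched."""
    first_bit = (age % 2) << 2
    found = []
    for low_bits in range(4):
        found += [p for p in table[first_bit | low_bits] if p[0] == age]
    return found


print(sorted(partial_match_age(50)))
# Neighboring values hash to unrelated buckets, so ranges gain nothing:
print([bucket(age, 100) for age in (45, 46, 47, 48)])
```

The second print shows why range and nearest-neighbor queries fare badly: consecutive ages alternate between the two halves of the table, so a range over age touches every bucket.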

  17. Grid files vs. partitioned hashing • Many dimensions ⇒ many empty cells in a grid file, while partitioned hashing is OK. • Both support exact and partial match queries. • Grid files are good for range and Nearest Neighbor queries, while partitioned hashing is of no use for either.
