
Data Mining on Streams


Presentation Transcript


  1. Data Mining on Streams • We should use run lists for stream data mining, since a run list can be truncated at one end and appended to at the other very easily. A Ptree, even a 1-D Ptree, cannot accommodate such activity gracefully. The exception is data with spatial structure: if the continuity advantage of 2-D Ptrees is needed, then Ptrees (and spatially oriented techniques) should be used. • We begin with some slides reviewing Ptrees and run lists, then move to stream data mining.

  2. Ptrees. A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is normally processed vertically (by vertical scans). Ptrees vertically partition the table and compress each vertical bit slice into a basic Ptree. The Ptrees are then processed horizontally, using one multi-operand logical AND operation.

     R(A1 A2 A3 A4) --> R[A1] R[A2] R[A3] R[A4]

     A1  A2  A3  A4
     010 111 110 001
     011 111 110 000
     010 110 101 001
     010 111 101 111
     101 010 001 100
     010 010 001 101
     111 000 001 100
     111 000 001 100

     Each bit position of each attribute gives one vertical bit slice: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43. For example, R11 (the high-order bit of A1) is 0 0 0 0 1 0 1 1.

     1-dimensional Ptrees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11:
     1. Whole file is not pure1 → 0
     2. 1st half is not pure1 → 0
     3. 2nd half is not pure1 → 0
     4. 1st half of 2nd half is not pure1 → 0
     5. 2nd half of 2nd half is pure1 → 1
     6. 1st half of 1st half of 2nd half is pure1 → 1
     7. 2nd half of 1st half of 2nd half is not pure1 → 0

     E.g., to count the occurrences of 111 000 001 100, use "pure111000001100":
     P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43
     The result is 0 at the 2^3 level, 0 0 at the 2^2 level, and 0 1 at the 2^1 level, so the count is 2.
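The construction and the multi-operand AND on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the nested-tuple tree representation and the names `build_ptree`, `ptree_not`, `ptree_and`, `count_ones` and `count_pattern` are assumptions made for the example.

```python
from functools import reduce

def build_ptree(bits):
    """1-D Ptree: 1 for a pure-1 span, 0 for a pure-0 span,
    otherwise (0, left_subtree, right_subtree) built on the two halves."""
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def ptree_not(t):
    """Complement (P'): swap pure-0 and pure-1 leaves."""
    if t == 0:
        return 1
    if t == 1:
        return 0
    return (0, ptree_not(t[1]), ptree_not(t[2]))

def ptree_and(a, b):
    """AND two Ptrees built over spans of the same length."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[1], b[1]), ptree_and(a[2], b[2])
    if left == 0 and right == 0:
        return 0                      # collapse a span that became pure 0
    return (0, left, right)

def count_ones(t, size):
    """Number of 1 bits represented by a Ptree over `size` positions."""
    if t == 1:
        return size
    if t == 0:
        return 0
    return count_ones(t[1], size // 2) + count_ones(t[2], size - size // 2)

# The 8-row table from the slide, one 12-bit string per record.
ROWS = ["010111110001", "011111110000", "010110101001", "010111101111",
        "101010001100", "010010001101", "111000001100", "111000001100"]
SLICES = [[int(r[j]) for r in ROWS] for j in range(12)]   # R11 .. R43
TREES = [build_ptree(s) for s in SLICES]                  # P11 .. P43

def count_pattern(pattern):
    """AND P_ij where the pattern bit is 1 and P'_ij where it is 0."""
    operands = [t if bit == "1" else ptree_not(t)
                for t, bit in zip(TREES, pattern)]
    return count_ones(reduce(ptree_and, operands), len(ROWS))
```

With this sketch, `TREES[0]` reproduces the seven steps listed for P11, and `count_pattern("111000001100")` reproduces the count from the slide.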

  3. Run Lists: another way to handle vertical data. Run lists generalize Ptrees by using standard run-length compression of the vertical bit files (alternatively, Lempel-Ziv? Golomb? other?). As before, R(A1 A2 A3 A4) --> R[A1] R[A2] R[A3] R[A4], using the same 8-row table and the same bit slices R11 ... R43.

     Run lists record the type and start offset of pure runs (truth:start). E.g., for R11 = 0 0 0 0 1 0 1 1:
     1. 1st run is Pure0 → 0:000
     2. 2nd run is Pure1 → 1:100
     3. 3rd run is Pure0 → 0:101
     4. 4th run is Pure1 → 1:110
     so RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits).

     RL11  0:000 1:100 0:101 1:110
     RL12  1:000 0:100 1:101
     RL13  0:000 1:001 0:010 1:100 0:101 1:110
     RL21  1:000 0:100
     RL22  1:000 0:110
     RL23  1:000 0:010 1:011 0:100
     RL31  1:000 0:100
     RL32  1:000 0:010
     RL33  0:000 1:010
     RL41  0:000 1:011
     RL42  0:000 1:011 0:100
     RL43  1:000 0:001 1:010 0:100 1:101 0:110

     E.g., to count the occurrences of 111 000 001 100, use "pure111000001100":
     RL11^RL12^RL13^RL'21^RL'22^RL'23^RL'31^RL'32^RL33^RL41^RL'42^RL'43
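A small Python sketch of the truth:start encoding described on this slide. The function names (`to_runlist`, `complement`, `fmt`, `count_ones`) and the list-of-tuples representation are assumptions chosen for illustration.

```python
def to_runlist(bits):
    """Encode a bit slice as (type, start) pairs, one per pure run."""
    runs = []
    for i, b in enumerate(bits):
        if i == 0 or b != bits[i - 1]:
            runs.append((b, i))       # a new run starts wherever the bit changes
    return runs

def complement(runs):
    """Complementing a run list only flips the purity bits."""
    return [(1 - t, s) for t, s in runs]

def count_ones(runs, n):
    """1-count of a run list over n positions (each run ends where the next starts)."""
    total = 0
    for k, (t, s) in enumerate(runs):
        end = runs[k + 1][1] if k + 1 < len(runs) else n
        if t == 1:
            total += end - s
    return total

def fmt(runs, width=3):
    """Render as on the slide: type:start, with the start offset in binary."""
    return " ".join(f"{t}:{s:0{width}b}" for t, s in runs)

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
RL11 = to_runlist(R11)
print(fmt(RL11))               # 0:000 1:100 0:101 1:110
print(fmt(complement(RL11)))   # 1:000 0:100 1:101 0:110
```

Because only run starts are stored, appending at one end or truncating at the other touches just the first or last entries, which is the property slide 1 relies on for streams.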

  4. Other Indexes on RunLists. R11 = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1 (16 bits).
     RL11 = 00:0 11:100 00:101 11:110 01:1000 (type:start, where 00 = pure0, 11 = pure1, 01 = mixed)
     We could put Pure0-run, Pure1-run and Mixed-run indexes (start:length) on RLs:
     0RL11  0000:4 0101:1
     1RL11  0100:1 0110:2
     MRL11  1000:8
     Or, since we would not traverse the RL very often, make it a linked list and just concatenate the indexes:
     START → 0RL11 (0000:4 0101:1) → 1RL11 (0100:1 0110:2) → MRL11 (1000:8)

  5. Indexed RunLists ANDing. Index entries are start:length, and start:length:bits for mixed runs.
     R11 = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
     R34 = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
     RL11:     0RL11 0000:4 0101:1      1RL11 0100:1 0110:2      MRL11 1000:8:01010101
     RL34:     0RL34 0000:2 0101:4      1RL34 1011:5             MRL34 0010:3:101 1001:2:10
     RL11^34:  0RL11^34 0000:4 0101:4   1RL11^34 0100:1          MRL11^34 1001:7:1010101

  6. Indexed RunLists ANDing: AND the 0RLs first, then?
     R11 = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
     R34 = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
     RL11:     0RL11 0000:4 0101:1      1RL11 0100:1 0110:2      MRL11 1000:8:01010101
     RL34:     0RL34 0000:2 0101:4      1RL34 1011:5             MRL34 0010:3:101 1001:2:10
     RL11^34:  0RL11^34 0000:4 0101:4 . . .

  7. Indexed Pure RunLists (no mixed) ANDing. Only the 0RLs are needed! Of course, you need 1RLs to use as 0RL-complements (maintain the 1RLs, or construct 0RL-complements on the fly?). To get 1-counts, count the 0s and subtract from the total.
     R11 = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
     R34 = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
     0RL11     0000:4 0101:1 1000:1 1010:1 1100:1 1110:1
     0RL34     0000:2 0011:1 0101:4 1010:1
     0RL11^34  0000:4 0101:4 1010:1 1100:1 1110:1
     0-count = sum of lengths = 11; 1-count = 16 - 11 = 5
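Since a position of the AND is 0 whenever either operand is 0 there, ANDing two pure-0 run lists amounts to taking the union of their zero intervals. A minimal Python sketch under that observation follows; the (start, length) tuples and the name `and_zero_runs` are assumptions for illustration, not taken from the slides.

```python
def and_zero_runs(a, b):
    """AND two 0-run lists of (start, length): merge all zero intervals."""
    merged = []                                  # [start, end) intervals
    for start, length in sorted(a + b):
        end = start + length
        if merged and start <= merged[-1][1]:    # overlapping or adjacent run
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]

ZRL11 = [(0, 4), (5, 1), (8, 1), (10, 1), (12, 1), (14, 1)]   # 0RL11
ZRL34 = [(0, 2), (3, 1), (5, 4), (10, 1)]                     # 0RL34

result = and_zero_runs(ZRL11, ZRL34)
zero_count = sum(length for _, length in result)
one_count = 16 - zero_count
print(result)      # [(0, 4), (5, 4), (10, 1), (12, 1), (14, 1)]
print(one_count)   # 5
```

Sorting is used here only for brevity. Because both inputs are already ordered by start offset, a two-cursor merge would do the same work in a single pass.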

  8. Intra-Run cursors. Zero RunLists ANDing of 34 and 11' (with pure1 gaps). Each list holds alternating run lengths, starting with a 0-run (0RL11' is 0RL11 with a prefixed 0).
     R11  = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
     R11' = 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0
     R34  = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
     0RL34      2,1,1,1,4,1,1,5
     0RL11'     0,4,1,1,2,1,1,1,1,1,1,1,1
     0RL11'^34  2,1,9,1,1,1,1
     The 1-count of the result is the total minus the 0-count, or 16 - 13 = 3.
     So the coding of this AND program seems straightforward, following the animation: an intra-run cursor for each operand, a list cursor for each operand, and one for the result. We, of course, need the 1RLs too (e.g., for the 0RL of a complement). Next, let's allow the gaps (red on the slides) to be mixed, and insist that the gaps in a 0RL and its corresponding 1RL be compatible.
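The cursor scheme described on this slide might be coded roughly as follows. This is a sketch under the assumption that each list alternates run lengths, with even positions holding 0-runs and odd positions holding 1-runs; the name `and_alternating` and the helper `emit` are hypothetical.

```python
def and_alternating(a, b):
    """AND two alternating run-length lists (index 0, 2, ... are 0-runs,
    index 1, 3, ... are 1-runs) and return the result in the same format."""
    out = []

    def emit(bit, n):
        # Append n copies of `bit` to the result, extending the last run if possible.
        if n == 0:
            return
        if not out:
            out.append(0)             # a list always starts with a 0-run
        if (len(out) - 1) % 2 == bit:
            out[-1] += n
        else:
            out.append(n)

    i = j = 0                         # list cursors
    ra, rb = a[0], b[0]               # intra-run cursors: bits left in the current run
    while i < len(a) and j < len(b):
        if ra == 0:                   # current run of a is used up
            i += 1
            if i < len(a):
                ra = a[i]
            continue
        if rb == 0:                   # current run of b is used up
            j += 1
            if j < len(b):
                rb = b[j]
            continue
        step = min(ra, rb)            # both runs are pure for the next `step` bits
        emit((i % 2) & (j % 2), step)
        ra -= step
        rb -= step
    return out

ZRL34 = [2, 1, 1, 1, 4, 1, 1, 5]                       # 0RL34
ZRL11C = [0, 4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]       # 0RL11'

result = and_alternating(ZRL11C, ZRL34)
zero_count = sum(result[0::2])
print(result)            # [2, 1, 9, 1, 1, 1, 1]
print(16 - zero_count)   # 3
```

Each step consumes the shorter of the two current runs, so the total work is proportional to the number of runs rather than the number of bits.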

  9. 0rl ANDing of 34 and 11' with selected mixed gaps, differentiated by a prefix bit (pure gap = 1, mixed gap = 0; shown with colors on the slides). A mixed gap is written length:bits.
     R11  = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
     R11' = 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0
     R34  = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
     0rl34      2,3:101,4,1,1,5
     0rl11      4,12:101101010101
     1rl34      0,11:00101000010,5
     1rl11      0,6:000010,2,8:01010101
     0rl11'     0,6:111101,2,8:10101010 (note that we have to flip the mixed gaps)
     0rl11'^34  2,3:100,7,4:1010

  10. Take the philosophy that we will follow a pointer to long mixed runs only when necessary. Otherwise we will sequence straight across.
      R11  = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
      R11' = 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0
      R34  = 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1
      zmrl34      2,3,4,1,1,5 (3 → 101)
      zmrl11      4,1,1,2,0,8 (8 → 01010101)
      zmrl11'     0,4,1,1,2,0,8
      zmrl11'^34  2,3,6,5 (3 → 100, 5 → 01010)

  11. When the 16-bit window moves left (e.g., add 100 to 0rl11), with R11 = 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1:
      zmrl11  4,1,1,2,0,8 (8 → 01010101)    becomes  zmrl11  0,1,6,1,1,2,0,5 (5 → 01010)
      0rl11'  4,1,1,2,1,1,1,1,1,1,1,1      becomes  0rl11'  0,1,6,1,1,2,1,1,1,1,1
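The window move on this slide (new bits enter at the front, the same number of bits fall off the back) can be sketched on an alternating run-length list as below. The function names `prepend_bits`, `truncate_tail` and `slide_window` are assumptions for illustration, and the sketch assumes a non-empty list whose first entry is a 0-run.

```python
def prepend_bits(runs, new_bits):
    """Prepend new_bits to an alternating run-length list whose first entry is a 0-run."""
    runs = list(runs)
    for bit in reversed(new_bits):        # the last new bit sits next to the old data
        if bit == 0:
            runs[0] += 1                  # extend the leading 0-run
        elif runs[0] == 0 and len(runs) > 1:
            runs[1] += 1                  # extend the leading 1-run
        else:
            runs[:0] = [0, 1]             # new leading 1-run after an empty 0-run
    return runs

def truncate_tail(runs, window):
    """Drop bits from the far end until the list covers exactly `window` bits."""
    runs = list(runs)
    excess = sum(runs) - window
    while excess > 0 and runs:
        cut = min(excess, runs[-1])
        runs[-1] -= cut
        excess -= cut
        if runs[-1] == 0 and len(runs) > 1:
            runs.pop()                    # the last run was consumed entirely
    return runs

def slide_window(runs, new_bits, window=16):
    return truncate_tail(prepend_bits(runs, new_bits), window)

RL = [4, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1]      # 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1
print(slide_window(RL, [1, 0, 0]))             # [0, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1]
```

Only the first entries and the last entries of the list are touched, which is the ease of appending and truncating that motivates run lists for streams.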

  12. Network Security Application (network security through vertically structured data)
      • Network layers do their own partitioning
      • Packets, frames, etc. (usually independent of any intrinsic data structuring, e.g., record structure)
      • Fragmentation/Reassembly, Segmentation/Reassembly
      • Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level (in network)
      • A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.
      • Alt.: Vertically structure (decompose, partition) the data (e.g., basic Ptrees).
      • Send one Ptree per packet
      • Send intra-message packets separately
      • Trick flow classifiers into thinking the multiple packets associated with a particular message are unrelated.
      • The message is only meaningful after destination demux-ing
      • Note: the only basic Ptree that holds actual information is the high-order bit Ptree. Therefore encrypt it!
      • It seems like there ought to be a whole range of killer ideas associated with the concept of using vertically structured data within network transmission units
      • Active networking? (AND basic Ptrees, or just certain levels of them, at active net nodes?)
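As a rough illustration of the vertical alternative, the Python sketch below splits a message into bit slices, treats each slice as one packet, and reassembles the records at the destination. It is only a toy model: the `(slice_index, bits)` packet layout, the shuffling, and all function names are assumptions, and no compression or encryption of the slices is shown.

```python
import random

def to_bit_slices(records, width):
    """Vertical decomposition: slice j holds bit j (high-order first) of every record."""
    return [[(r >> (width - 1 - j)) & 1 for r in records] for j in range(width)]

def to_packets(slices):
    """One slice per packet, sent in arbitrary order (hypothetical packet layout)."""
    packets = list(enumerate(slices))
    random.shuffle(packets)
    return packets

def reassemble(packets, width):
    """Destination demux: order the slices again and rebuild the horizontal records."""
    slices = [bits for _, bits in sorted(packets)]
    n = len(slices[0])
    return [sum(slices[j][i] << (width - 1 - j) for j in range(width))
            for i in range(n)]

records = [0b010, 0b011, 0b101, 0b111]
packets = to_packets(to_bit_slices(records, 3))
print(reassemble(packets, 3) == records)   # True
```

No single packet contains a complete record, so an eavesdropper would need every slice of a message, plus the slice order, to recover the horizontal content.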
