Matúš Ondreička, supervised by Prof. Jaroslav Pokorný, Faculty of Mathematics and Physics

Efficient Top-k Searching According to User Preferences Based on Fuzzy Functions with Usage of Tree-Oriented Data Structures
Department of Software Engineering, Charles University in Prague, Czech Republic

Research outline
• introduction
  • top-k problem, user preferences, fuzzy functions
  • related work
• technical solutions
  • tree-oriented data structures
    • set of B+-trees
    • multidimensional B+-tree
    • multidimensional B+-tree with lists
  • MD-algorithm, MXT-algorithm
• experiments, current results
• motivation for future research

Top-k problem
• top-k searching
  • finding the (few) best k objects described by several attributes
  • the k objects with the highest rating
  • according to user preferences
  • based on fuzzy functions
• efficient top-k searching
  • without accessing all the objects
  • with full support for the model of user preferences
    • local preferences
    • global preferences

Model of user preferences
• local preferences
  • objects are preferred according to one attribute
  • if the attribute's domain is continuous
    • modeled with a fuzzy function fU(x): xA → [0, 1]
  • if the attribute's domain is discrete
    • each value is rated individually, e.g. ACER := 0.6, APPLE := 1.0, DELL := 0.9, SONY := 0.8
• global preferences
  • objects are preferred according to several attributes
  • modeled with an aggregation function @U(x): (f1U(x), ..., fmU(x)) → [0, 1]
  • e.g. weighted average: @U(x) = (w1 · f1U(x) + ... + wm · fmU(x)) / (w1 + ... + wm)
[Figure: example fuzzy function fU(x) over an attribute xA with values from 0 € to 1000 €; ratings range from 0 (0%) to 1 (100%)]
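The preference model above can be sketched in a few lines of Python. This is only an illustration: the linear shape of the price preference, the names `price_preference`, `MAKER_PREFERENCE` and `rate`, and the weights are assumptions; only the four maker ratings and the weighted-average formula come from the slide.

```python
from typing import Dict, Sequence

def price_preference(price_eur: float, worst: float = 1000.0) -> float:
    """Hypothetical continuous local preference: cheaper is better.

    Falls linearly from 1.0 at 0 EUR to 0.0 at `worst` EUR (assumed shape).
    """
    return max(0.0, min(1.0, 1.0 - price_eur / worst))

# Discrete local preference: every value of the domain gets its own rating
# (values taken from the slide).
MAKER_PREFERENCE: Dict[str, float] = {
    "ACER": 0.6, "APPLE": 1.0, "DELL": 0.9, "SONY": 0.8,
}

def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Global preference @U: weighted average of the local ratings."""
    if len(scores) != len(weights):
        raise ValueError("one weight per local rating is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * s for w, s in zip(weights, scores)) / total

def rate(obj: dict, weights: Sequence[float] = (1.0, 2.0)) -> float:
    """Overall rating of an object with 'maker' and 'price' attributes."""
    scores = (
        MAKER_PREFERENCE.get(obj["maker"], 0.0),  # unknown makers rate 0
        price_preference(obj["price"]),
    )
    return weighted_average(scores, weights)

if __name__ == "__main__":
    print(rate({"maker": "DELL", "price": 500.0}))
```

With weights (1, 2) the price counts twice as much as the maker, so a DELL at 500 € is rated (1 · 0.9 + 2 · 0.5) / 3.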

Motivation and related work
• XML, multimedia, the Web, etc.
• relational databases
  • Ilyas, Beskales, Soliman: A survey of top-k query processing techniques in relational database systems. 2008.
  • ranking functions
  • query optimization
• Fagin's algorithms
  • Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66, 2003.
  • support only monotone ranking functions
  • based on sorted lists
  • no support for local user preferences
  • BASIC MOTIVATION FOR OUR RESEARCH

Usage of B+-tree
• local user preference
  • given by a fuzzy function
  • split into monotone intervals
• moving along the leaf level
  • "ways" in the leaf level, one per monotone interval
  • advancing on all "ways" at the same time
  • comparing objects on different "ways"
  • choosing the highest-rated object among all the "ways"
• obtaining objects
  • during the computation of the algorithm
  • together with their ratings
  • in descending order by the fuzzy function fU
[Figure: fuzzy function fU plotted over the leaf level of a B+-tree (keys 0.0 to 1.0; objects C, Q, D, G, R, S, T, K, M, F, Y, E, N, U, H) with five "ways" w1 to w5]
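The idea of walking several "ways" at once can be sketched as a heap-based merge. This is a simplified illustration, not the authors' implementation: the leaf level is modeled as a plain sorted Python list, the caller is assumed to supply the breakpoints at which fU changes direction, and the name `descending_by_preference` is hypothetical.

```python
import bisect
import heapq
from typing import Callable, Iterator, List, Sequence, Tuple

def descending_by_preference(
    leaves: Sequence[Tuple[float, str]],   # (key, object), sorted by key
    f: Callable[[float], float],           # user's fuzzy function fU
    breakpoints: Sequence[float],          # keys where f changes direction
) -> Iterator[Tuple[str, float]]:
    """Yield (object, rating) pairs in descending order of f(key).

    Assumes f is monotone on each half-open interval between breakpoints.
    """
    keys: List[float] = [k for k, _ in leaves]
    cuts = [0] + [bisect.bisect_left(keys, b) for b in sorted(breakpoints)]
    cuts.append(len(keys))
    heap = []
    for way, (start, stop) in enumerate(zip(cuts, cuts[1:])):
        if start >= stop:
            continue  # no keys fall into this interval
        # Each way starts at the better end of its interval and walks
        # towards the worse end.
        if f(keys[start]) >= f(keys[stop - 1]):
            pos, step, end = start, 1, stop
        else:
            pos, step, end = stop - 1, -1, start - 1
        heapq.heappush(heap, (-f(keys[pos]), way, pos, step, end))
    while heap:
        neg_rating, way, pos, step, end = heapq.heappop(heap)
        yield leaves[pos][1], -neg_rating
        pos += step
        if pos != end:  # this way still has unvisited leaves
            heapq.heappush(heap, (-f(keys[pos]), way, pos, step, end))

if __name__ == "__main__":
    # Hypothetical data: prices as keys, preference peaks at 500.
    leaves = [(100, "A"), (400, "B"), (500, "C"), (700, "D"), (950, "E")]
    peak = lambda x: 1.0 - abs(x - 500) / 500
    for obj, rating in descending_by_preference(leaves, peak, [500]):
        print(obj, round(rating, 2))
```

Because the generator is lazy, a top-k algorithm can stop pulling objects as soon as it has enough, so the remaining leaves are never touched.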

Fagin's algorithms
• TA (threshold algorithm) and NRA (no random access)
  • search for the best k objects
  • according to a monotone aggregation function @
  • without accessing all objects
• preconditions
  • a set of objects X with values of m attributes A1, ..., Am
  • objects from the set X are stored in m lists L1, ..., Lm
  • lists contain pairs (x, ax)
  • lists are sorted in descending order
  • monotone aggregation function @
• multi-user solution
  • lists are based on B+-trees
  • the algorithm can get pairs (x, fU(x)) from a B+-tree sequentially
  • in descending order according to the user's fuzzy function fU(x)
[Figure: B+-trees over the attributes A1, A2, A3 feeding the sorted lists L1, L2, L3. The three lists, in the order of the extracted columns:
  (x1, 1.0), (x4, 0.8), (x3, 0.6), (x5, 0.4), (x2, 0.2), (x6, 0.0)
  (x1, 1.0), (x2, 0.8), (x3, 0.6), (x4, 0.4), (x5, 0.2), (x6, 0.0)
  (x3, 1.0), (x4, 0.8), (x6, 0.6), (x1, 0.4), (x2, 0.2), (x5, 0.0)]
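A minimal sketch of the threshold algorithm (TA) under the preconditions listed above. It assumes every object occurs in every list and that all lists have the same length; random access is simulated with dictionaries. The function name and the assignment of the example columns to particular lists are assumptions; the ratings follow the example lists pictured on the slide.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, float]

def threshold_algorithm(
    lists: Sequence[Sequence[Pair]],             # each sorted by rating, descending
    aggregate: Callable[[List[float]], float],   # monotone aggregation function @
    k: int,
) -> Tuple[List[Pair], int]:
    """Return (top-k pairs, number of distinct objects accessed)."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen = set()
    top: List[Tuple[float, str]] = []      # min-heap holding the k best so far
    for row in range(len(lists[0])):
        last = []                          # ratings seen at this depth
        for lst in lists:
            obj, rating = lst[row]         # sorted access
            last.append(rating)
            if obj not in seen:
                seen.add(obj)
                overall = aggregate([d[obj] for d in lookup])
                heapq.heappush(top, (overall, obj))
                if len(top) > k:
                    heapq.heappop(top)     # drop the current worst
        # No unseen object can be rated above the threshold.
        threshold = aggregate(last)
        if len(top) == k and top[0][0] >= threshold:
            break
    result = sorted(((o, r) for r, o in top), key=lambda p: -p[1])
    return result, len(seen)

if __name__ == "__main__":
    lists = [
        [("x1", 1.0), ("x4", 0.8), ("x3", 0.6), ("x5", 0.4), ("x2", 0.2), ("x6", 0.0)],
        [("x1", 1.0), ("x2", 0.8), ("x3", 0.6), ("x4", 0.4), ("x5", 0.2), ("x6", 0.0)],
        [("x3", 1.0), ("x4", 0.8), ("x6", 0.6), ("x1", 0.4), ("x2", 0.2), ("x5", 0.0)],
    ]
    best, accessed = threshold_algorithm(lists, sum, k=2)
    print(best, accessed)
```

With the sum as aggregation function and k = 2, the search stops after the third row of the lists, having accessed five of the six objects.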

Multidimensional B-tree
• MDB-tree
  • allows indexing a set of objects by m > 1 attributes in one data structure
  • m levels; the values of one attribute are stored in each level
  • nodes are B+-trees whose leaf nodes are linked in both directions

Object  A1   A2   A3
A       0.0  0.4  0.3
B       0.0  0.4  0.5
C       0.0  1.0  0.5
D       0.0  1.0  0.5
E       0.5  0.9  1.0
F       1.0  0.0  0.0
G       1.0  0.0  0.0
H       1.0  0.0  0.7
I       1.0  0.4  0.7
J       1.0  0.7  0.4
K       1.0  0.7  0.6

[Figure: the three-level MDB-tree built over this table. Level 1 holds the A1 keys 0.0, 0.5, 1.0; level 2 holds the A2 keys of each A1 value; level 3 holds the A3 keys pointing to the objects A to K, where C and D share one leaf entry, as do F and G.]
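The level structure of an MDB-tree can be imitated with nested dictionaries. This is a deliberately simplified sketch: a real MDB-tree uses a B+-tree with linked leaves at every node, whereas here each node is a plain `dict` that is sorted on demand. The function name `build_mdb_tree` is hypothetical; the data are the object table from the slide as read from the extracted text.

```python
from typing import Dict, Tuple

def build_mdb_tree(rows: Dict[str, Tuple[float, ...]]) -> dict:
    """Index objects by m attributes, one attribute per level.

    Inner levels map an attribute value to the next level; the last
    level maps the final attribute value to the list of objects.
    """
    root: dict = {}
    for obj, values in rows.items():
        node = root
        for value in values[:-1]:
            node = node.setdefault(value, {})        # descend or create a level
        node.setdefault(values[-1], []).append(obj)  # objects with equal keys share a leaf entry
    return root

ROWS = {
    "A": (0.0, 0.4, 0.3), "B": (0.0, 0.4, 0.5), "C": (0.0, 1.0, 0.5),
    "D": (0.0, 1.0, 0.5), "E": (0.5, 0.9, 1.0), "F": (1.0, 0.0, 0.0),
    "G": (1.0, 0.0, 0.0), "H": (1.0, 0.0, 0.7), "I": (1.0, 0.4, 0.7),
    "J": (1.0, 0.7, 0.4), "K": (1.0, 0.7, 0.6),
}

if __name__ == "__main__":
    tree = build_mdb_tree(ROWS)
    print(sorted(tree))            # A1 keys on the first level
    print(sorted(tree[1.0]))       # A2 keys below A1 = 1.0
    print(tree[1.0][0.0])          # A3 keys and objects below (1.0, 0.0)
```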

MD-algorithm
• searches for the best k objects in a multidimensional B-tree (MDB-tree) without retrieving all the objects
• principle of the MD-algorithm
  • the MD-algorithm searches the MDB-tree with a recursive procedure
  • it uses a temporary list TK of the k best objects found so far
  • analogously to Fagin's TA-algorithm
  • it uses the best rating B(S) of a B+-tree S
  • monotone aggregation function @
• definition
  • B(S) of a B+-tree S in the i-th level of the MDB-tree
  • B(S) = @(k1, ..., ki-1, 1, ..., 1), where k1, ..., ki-1 are the ratings of the keys on the path from the root to S and the remaining attributes are set to the best possible rating 1
• example: @(xA1, xA2) = xA1 + xA2
  • B(S) = 1 + 1 = 2.0
  • B(S) = 0.8 + 1 = 1.8
  • B(S) = 0.8 + 0.7 = 1.5
[Figure: two-level MDB-tree with objects A to H, annotated with the B(S) values above]
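The pruning idea behind B(S) can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the MDB-tree is modeled as nested dictionaries whose keys are already the user's ratings (so descending key order equals descending preference), and the tree contents and the name `md_top_k` are hypothetical.

```python
import heapq
from itertools import count
from typing import Callable, List, Tuple

def md_top_k(
    tree: dict,                                  # nested dicts, last level maps key -> [objects]
    aggregate: Callable[[List[float]], float],   # monotone aggregation function @
    m: int,                                      # number of levels (attributes)
    k: int,
) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k (object, rating) pairs, number of objects accessed)."""
    tk: list = []        # min-heap playing the role of the temporary list TK
    tie = count()        # tie-breaker so objects themselves are never compared
    accessed = 0

    def best_rating(path: tuple) -> float:
        # B(S) = @(k1, ..., ki-1, 1, ..., 1): known keys padded with 1s.
        return aggregate(list(path) + [1.0] * (m - len(path)))

    def search(node: dict, prefix: tuple) -> None:
        nonlocal accessed
        for key in sorted(node, reverse=True):   # best keys first
            path = prefix + (key,)
            bound = best_rating(path)
            if len(tk) == k and bound <= tk[0][0]:
                break   # later keys are smaller, so nothing below can improve TK
            if len(path) == m:
                for obj in node[key]:            # here the bound is the real rating
                    accessed += 1
                    heapq.heappush(tk, (bound, next(tie), obj))
                    if len(tk) > k:
                        heapq.heappop(tk)
            else:
                search(node[key], path)

    search(tree, ())
    ranked = sorted(tk, reverse=True)
    return [(obj, rating) for rating, _, obj in ranked], accessed

if __name__ == "__main__":
    tree = {
        0.8: {0.7: ["A"], 0.3: ["B"]},
        0.7: {1.0: ["C"]},
        0.4: {0.9: ["D"], 0.6: ["E"]},
    }
    print(md_top_k(tree, sum, m=2, k=2))
```

In this example the subtree under key 0.4 has B(S) = 0.4 + 1 = 1.4, which cannot beat the second-best rating 1.5 already in TK, so objects D and E are never accessed.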

Searching the best 3 objects
[Figure: the MD-algorithm searching a three-level MDB-tree (B+-trees S1 to S10, local preferences f1U(x), f2U(x), f3U(x)) for the best 3 objects, with the temporary list TK (object, rating; 1st, 2nd, 3rd). Bounds shown in the figure: B(S2) = 1.0 + 1 + 1 = 3.0; B(S3) = 1.0 + 0.6 + 1 = 2.6; on the third level, 1.0 + 0.6 + 0.5 = 2.1, 1.0 + 0.6 + 0.6 = 2.2 and 1.0 + 0.6 + 0.2 = 1.8.]

MXT-algorithm
• based on the integration of the MD-algorithm and the TA-algorithm
• uses a new data structure: multidimensional B+-tree with lists
• first n attributes (nominal)
  • stored and searched in the same way as in the MD-algorithm
• last m - n attributes (ordinal)
  • stored as groups of m - n Fagin's sorted lists
  • searched by instances of Fagin's TA-algorithm
[Figure: multidimensional B+-tree with lists for m = 4 and n = 2. Levels A1 and A2 form the tree; each (A1, A2) combination points to a group of two sorted lists for A3 and A4. For the group A1 = 1.0, A2 = 0.7 (objects x1 to x6) the lists are:
  A3: {x1, 1.0}, {x2, 0.8}, {x3, 0.5}, {x4, 0.4}, {x5, 0.2}, {x6, 0.0}
  A4: {x3, 1.0}, {x4, 0.7}, {x6, 0.6}, {x1, 0.3}, {x5, 0.1}, {x2, 0.0}]
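How the two algorithms might be combined can be sketched as follows. This is a rough illustration under several assumptions: the tree over the first n attributes is flattened into a dictionary keyed by tuples of ratings, every group's lists have equal length and contain the same objects, and the names `ta_in_group` and `mxt_top_k` as well as the data are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, float]
Agg = Callable[[List[float]], float]

def ta_in_group(lists: Sequence[Sequence[Pair]], prefix: tuple,
                aggregate: Agg, k: int, tk: list) -> int:
    """One TA instance over a group's sorted lists; updates the shared heap tk.

    Returns the number of distinct objects accessed in this group.
    """
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen = set()
    for row in range(len(lists[0])):
        last = []
        for lst in lists:
            obj, rating = lst[row]
            last.append(rating)
            if obj not in seen:
                seen.add(obj)
                overall = aggregate(list(prefix) + [d[obj] for d in lookup])
                heapq.heappush(tk, (overall, obj))
                if len(tk) > k:
                    heapq.heappop(tk)
        # Unseen objects of this group cannot be rated above the threshold.
        threshold = aggregate(list(prefix) + last)
        if len(tk) == k and tk[0][0] >= threshold:
            break
    return len(seen)

def mxt_top_k(groups: Dict[tuple, Sequence[Sequence[Pair]]],
              aggregate: Agg, tail: int, k: int) -> Tuple[List[Pair], int]:
    """groups maps the ratings of the first n attributes to m - n sorted lists;
    tail is m - n. Returns (top-k pairs, number of objects accessed)."""
    tk: list = []
    accessed = 0
    for prefix in sorted(groups, reverse=True):
        # Best rating any object of this group could reach (as B(S) in MD).
        bound = aggregate(list(prefix) + [1.0] * tail)
        if len(tk) == k and bound <= tk[0][0]:
            continue    # the whole group cannot improve TK
        accessed += ta_in_group(groups[prefix], prefix, aggregate, k, tk)
    result = sorted(((o, r) for r, o in tk), key=lambda p: -p[1])
    return result, accessed

if __name__ == "__main__":
    groups = {
        (1.0,): [[("x1", 1.0), ("x2", 0.8), ("x3", 0.5)],
                 [("x3", 1.0), ("x1", 0.3), ("x2", 0.0)]],
        (0.2,): [[("y1", 0.9), ("y2", 0.1)],
                 [("y2", 0.8), ("y1", 0.2)]],
    }
    print(mxt_top_k(groups, sum, tail=2, k=2))
```

Each call to `ta_in_group` is independent apart from the shared heap, which is what would make it a candidate for concurrent execution.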

An example of results
• implemented top-k algorithms
  • TA-algorithm, MD-algorithm, MXT-algorithm
  • using lists based on B+-trees
  • implementation in Java
  • data structures have been tested in memory (not on disk)
• test results
  • the number of obtained objects
• real data
  • 8 822 flats for rent in Prague
  • ||dom(District)|| = 10
  • ||dom(Type)|| = 10
  • ||dom(Area)|| = 229
  • ||dom(Price)|| = 411
• real user preferences
  • the user prefers flats of certain types
  • in specific districts
  • with lower prices and larger areas

Motivation, future research
• improving the performance of the algorithms
  • heuristics
    • monitoring the distribution of key values in nodes
  • improvement of data structures
    • automatic arrangement of levels in the MDB-tree with lists, handling of empty values
  • parallel computing
    • in the MXT-algorithm, instances of the TA-algorithm could be computed concurrently
• different models of user preferences
  • dependencies between several attributes
  • similarity measures
    • finding the k objects most similar to a given object can be a user preference
  • user feedback
    • after running the first top-k query, the user tunes his/her preferences and executes the next top-k query
• different data models
  • very large data sets
    • tree-oriented data structures make it possible to handle a dynamic environment while solving a top-k problem
  • data streams
    • tree-oriented data structure as a sliding window
  • approximations, uncertain data, heterogeneous data
  • web environment
    • more information resources distributed on the web

An application: TreeTopK

Thank you for your attention!
