Advancing Intelligent Querying with AI Techniques in Research

“Artificial Intelligence” in my research
Seung-won Hwang, Department of CSE, POSTECH

Recap
• Bridging the gap between under-/over-specified user queries
• We went through various techniques to support intelligent querying, implicitly/automatically, from data, prior users, a specific user, and domain knowledge
• My research shares the same goal, with some AI techniques applied (e.g., search, machine learning)

The Context
• Query: top-3 houses
• select * from houses order by [ranking function F] limit 3
• Rank Formulation → Rank Processing → ranked results (e.g., realtor.com)

Overview
• Usability: Rank Formulation
• Efficiency: Processing Algorithms
• Query: top-3 houses
• select * from houses order by [ranking function F] limit 3
• Rank Formulation → Rank Processing → ranked results (e.g., realtor.com)

Part I: Rank Processing
• Essentially a search problem (which you studied in AI)

Limitation of the Naïve Approach
• F = min(new, cheap, large), k = 1
• new (search predicate) x: a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
• cheap (expensive predicate) pc: d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
• large (expensive predicate) pl: b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
• Naïve algorithm: a sort step (evaluate each predicate on every object and sort) followed by a merge step; result: b:0.78
• Our goal is to schedule the order of probes to minimize the number of probes
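To make the cost of the naïve approach concrete, here is a minimal Python sketch using the scores on this slide. The function name `naive_top_k` and the dictionary layout are illustrative assumptions, not part of the original work; a dictionary lookup stands in for each expensive probe.

```python
# Scores from the slide: `new` is the search predicate,
# `cheap` (pc) and `large` (pl) are the expensive predicates.
NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
CHEAP = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
LARGE = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}


def naive_top_k(k):
    """Probe every expensive predicate on every object, then sort."""
    probes = 0
    scored = []
    for obj, x in NEW.items():
        pc = CHEAP[obj]  # stands in for probe pr(obj, pc)
        pl = LARGE[obj]  # stands in for probe pr(obj, pl)
        probes += 2
        scored.append((min(x, pc, pl), obj))  # F = min(new, cheap, large)
    scored.sort(reverse=True)
    return scored[:k], probes


top, probes = naive_top_k(1)
print(top, probes)  # [(0.78, 'b')] 10
```

All 10 probes are paid for even though only the top-1 answer is requested.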

[Figure: search from the initial state (a:0.9, b:0.8, c:0.7, d:0.6, e:0.5) toward the goal state under the global schedule H(pc, pl). The first probes are pr(a,pc) = 0.85 and pr(a,pl) = 0.75; probes that are not needed to reach the goal state are marked “Unnecessary probes”.]

Search Strategies?
• Depth-first
• Breadth-first
• Depth-limited / iterative deepening (try every depth limit)
• Bidirectional
• Iterative improvement (greedy/hill climbing)

Best First Search
• Determine which node to explore next, using an evaluation function
• Evaluation function: explore further the object with the highest “upper bound score”
• We can show that this evaluation function minimizes the number of evaluations, by evaluating only when “absolutely necessary”

Necessary Probes?
• A probe pr(u,p) is necessary if we cannot determine the top-k answers until we probe pr(u,p), where u is an object and p is a predicate
• Example: let the global schedule be H(pc, pl); the top-1 answer is b (0.78)
• Can we decide top-1 without probing pr(a,pc)? No: before that probe we only know that a's score is ≤ 0.90, which could exceed 0.78, so pr(a,pc) is necessary

Global schedule: H(pc, pl)
• Initial state: a:0.9, b:0.8, c:0.7, d:0.6, e:0.5
• pr(a,pc) = 0.85: a:0.85, b:0.8, c:0.7, d:0.6, e:0.5
• pr(a,pl) = 0.75: b:0.8, a:0.75, c:0.7, d:0.6, e:0.5
• pr(b,pc) = 0.78: b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
• pr(b,pl) = 0.90: b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
• Top-1: b:0.78; the remaining probes on c, d, and e are unnecessary
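The trace above can be reproduced with a short best-first sketch in Python. This is an illustrative reading of the slides, not the published implementation: the name `best_first_top_k`, the heap layout, and the use of dictionary lookups in place of real probes are all assumptions.

```python
import heapq

NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
PROBES = {
    "pc": {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70},
    "pl": {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80},
}
SCHEDULE = ["pc", "pl"]  # global schedule H(pc, pl)


def best_first_top_k(k):
    """Always probe the object with the highest upper-bound score."""
    # Heap entries: (-upper_bound, object, number of predicates probed so far).
    heap = [(-x, obj, 0) for obj, x in NEW.items()]
    heapq.heapify(heap)
    answers, trace = [], []
    while heap and len(answers) < k:
        neg_bound, obj, done = heapq.heappop(heap)
        if done == len(SCHEDULE):
            # Fully probed and still on top: its exact score is at least
            # every other object's upper bound, so it is an answer.
            answers.append((obj, -neg_bound))
            continue
        pred = SCHEDULE[done]
        value = PROBES[pred][obj]  # the (expensive) probe pr(obj, pred)
        trace.append((obj, pred, value))
        # With F = min, the new upper bound is the min of old bound and probe.
        heapq.heappush(heap, (-min(-neg_bound, value), obj, done + 1))
    return answers, trace


answers, trace = best_first_top_k(1)
print(answers)  # [('b', 0.78)]
print(trace)
```

On this data the sketch issues four probes, on a and b only, in the same order as the trace above.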

Generalization
• Random access cost: r = 1 (cheap), r = h (expensive), r = ∞ (impossible)
• Sorted access cost: s = 1 (cheap), s = h (expensive), s = ∞ (impossible)
• s = 1 (cheap): FA, TA, QuickCombine (r = 1); CA, SR-Combine (r = h); NRA, StreamCombine (r = ∞)
• s = h (expensive): FA, TA, QuickCombine; NRA, StreamCombine
• s = ∞ (impossible): MPro [SIGMOD02/TODS]
• Unified Top-k Optimization [ICDE05a/TKDE]

Just for Laughs (adapted from Hyountaek Yong’s presentation)
• Strong nuclear force, electromagnetic force, weak nuclear force, gravitational force → unified field theory

FA, TA, NRA, CA, MPro → Unified Cost-based Approach

Generality
• Across a wide range of scenarios
• One algorithm for all

Adaptivity
• Optimal for the specific runtime scenario

Cost-based Approach
• Cost-based optimization
• Finding the optimal algorithm Mopt for the given scenario: the one with minimum cost, chosen from a space of candidate algorithms
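As a rough illustration of picking a minimum-cost algorithm for a scenario, the Python sketch below scores a few candidate plans under unit costs s (sorted access) and r (random access). The plan names, the access counts, and the linear cost model are hypothetical assumptions made for this example; they are not taken from the cited papers.

```python
import math

# Hypothetical candidates: (number of sorted accesses, number of random accesses).
CANDIDATES = {
    "sorted-heavy": (100, 0),
    "balanced": (40, 20),
    "probe-only": (0, 80),
}


def total_cost(counts, s, r):
    """Assumed linear cost model: s per sorted access, r per random access."""
    n_sorted, n_random = counts
    cost = 0.0
    # Skip unused access types so an impossible one (cost = inf)
    # only rules out the plans that actually rely on it.
    if n_sorted:
        cost += s * n_sorted
    if n_random:
        cost += r * n_random
    return cost


def pick_optimal(s, r):
    """Return the candidate with minimum total cost in scenario (s, r)."""
    return min(CANDIDATES, key=lambda name: total_cost(CANDIDATES[name], s, r))


print(pick_optimal(1, 1))         # both access types cheap
print(pick_optimal(1, math.inf))  # random access impossible
print(pick_optimal(math.inf, 1))  # sorted access impossible
```

The same selection routine returns a different plan as the scenario changes, which is the sense in which a single cost-based framework can be both general and adaptive.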

Evaluation: Unification and Contrast (vs. TA)
• Unification: for a symmetric function, e.g., avg(p1, p2), framework NC behaves similarly to TA
• Contrast: for an asymmetric function, e.g., min(p1, p2), NC adapts with different behaviors and outperforms TA
[Figure: plots of cost over depth into p1 and depth into p2, with points marked N and T]

Part II: Rank Formulation
• Usability: Rank Formulation
• Efficiency: Processing Algorithms
• Query: top-3 houses
• select * from houses order by [ranking function F] limit 3
• Rank Formulation → Rank Processing → ranked results (e.g., realtor.com)

Learning F from implicit user interactions
Using machine learning techniques (which you will learn soon!) to combine a quantitative model for efficiency and a qualitative model for usability
• Quantitative model
  • Query condition is represented as a mapping F of objects into absolute numerical scores
  • DB-friendly, by attaining an absolute score on each object
  • Example: F(u1) = 0.9, F(u2) = 0.5
• Qualitative model
  • Query condition is represented as a relative ordering of objects
  • User-friendly, by relieving the user from specifying an absolute score on each object
  • Example: u1 > u2

A Solution: RankFP (RANK Formulation and Processing)
• For usability, a qualitative formulation front-end, which enables rank formulation by ordering samples
• For efficiency, a quantitative ranking function F, which can be efficiently processed
• Rank Formulation: sample S (unordered) → ranking R* over S → Function Learning: learn new F → over S: RF ≈ R*? If no, Sample Selection: generate new S; if yes, pass on the ranking function F
• Rank Processing: Q: select * from houses order by F limit k → processing of Q → ranked results

Task 1: Ranking → Classification
• Challenge: unlike a conventional learning problem of classifying objects into groups, we learn a desired ordering of all objects
• Solution: we transform ranking into classification on pairwise comparisons [Herbrich00], so the learning algorithm can be a binary classifier F (+/-)
• Ranking view: c > b > d > e > a
• Classification view (pairwise comparisons): a-b: -, b-c: -, c-d: +, d-e: +, a-c: -, …
[Herbrich00] R. Herbrich et al. Large margin rank boundaries for ordinal regression. MIT Press, 2000.
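The transformation from a ranking to pairwise classification examples can be sketched in Python as follows. The feature vectors in `FEATURES` are invented for illustration, and `ranking_to_pairs` is a hypothetical helper name; only the ranking c > b > d > e > a comes from the slide.

```python
from itertools import combinations

# Hypothetical 2-d feature vectors for the five objects.
FEATURES = {
    "a": [0.1, 0.2],
    "b": [0.7, 0.6],
    "c": [0.9, 0.8],
    "d": [0.5, 0.5],
    "e": [0.3, 0.2],
}


def ranking_to_pairs(ranking, features):
    """Turn an ordered list (best first) into labeled difference vectors."""
    position = {obj: i for i, obj in enumerate(ranking)}
    examples = []
    for u, v in combinations(sorted(ranking), 2):
        diff = [fu - fv for fu, fv in zip(features[u], features[v])]
        # Label +1 if u is ranked above v, otherwise -1.
        label = 1 if position[u] < position[v] else -1
        examples.append((f"{u}-{v}", diff, label))
    return examples


examples = ranking_to_pairs(["c", "b", "d", "e", "a"], FEATURES)
for name, diff, label in examples:
    print(name, label)
```

The labels for a-b, b-c, c-d, d-e, and a-c agree with the classification view above; any binary classifier can then be trained on the (difference vector, label) pairs.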

Task 2: Classification → Ranking
• Challenge: with the pairwise classification function, we need to process ranking efficiently, instead of evaluating F(a-b), F(a-c), F(a-d), … pair by pair
• Solution: develop a duality that lets F also serve as a global per-object ranking function, e.g., F(a) = 0.7
• Suppose function F is linear. Classification view ⇔ ranking view: F(ui-uj) > 0 ⇔ F(ui) - F(uj) > 0 ⇔ F(ui) > F(uj)
• Rank with F(.), e.g., F(c) > F(b) > F(d) > …
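The duality for a linear F can be checked numerically. In this Python sketch the weight vector `W` and the feature vectors are hypothetical values chosen for illustration; a learned F would supply its own weights.

```python
from itertools import permutations

W = [0.6, 0.4]  # hypothetical weights of a linear F(u) = W . u
FEATURES = {
    "a": [0.1, 0.2],
    "b": [0.7, 0.6],
    "c": [0.9, 0.8],
    "d": [0.5, 0.5],
    "e": [0.3, 0.2],
}


def F(u):
    """Linear scoring function: dot product of the weights and the vector."""
    return sum(w * x for w, x in zip(W, u))


def diff(u, v):
    return [x - y for x, y in zip(u, v)]


# Classification view and ranking view agree on every ordered pair.
agree = all(
    (F(diff(FEATURES[i], FEATURES[j])) > 0) == (F(FEATURES[i]) > F(FEATURES[j]))
    for i, j in permutations(FEATURES, 2)
)

# So one pass of per-object scores is enough to rank everything.
ranked = sorted(FEATURES, key=lambda obj: F(FEATURES[obj]), reverse=True)
print(agree, ranked)
```

Because the per-object scores are absolute numbers, they can be handed to a top-k processing algorithm directly, without comparing every pair.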

Task 3: Active Learning
• Finding samples that maximize learning effectiveness
• Selective sampling: resolving the ambiguity
• Top sampling: focusing on top results
• Achieving >90% accuracy in <=3 iterations (<=10 ms)
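One possible reading of the two sampling ideas is sketched below in Python. It assumes that "ambiguity" means a small score gap under the current F and that "top" means the current top-m objects; the slides do not spell out the exact criteria, so both the scores and the helper `select_samples` are hypothetical.

```python
from itertools import combinations

# Hypothetical scores under the current ranking function F.
SCORES = {"a": 0.14, "b": 0.66, "c": 0.86, "d": 0.50, "e": 0.26}


def select_samples(scores, top_m, n_pairs):
    """Pick the most ambiguous pairs among the current top-m objects."""
    # Top sampling: restrict attention to the current top-m objects.
    top = sorted(scores, key=scores.get, reverse=True)[:top_m]
    # Selective sampling: the smaller the score gap, the less certain the order.
    pairs = combinations(top, 2)
    return sorted(pairs, key=lambda p: abs(scores[p[0]] - scores[p[1]]))[:n_pairs]


picked = select_samples(SCORES, top_m=3, n_pairs=2)
print(picked)  # [('b', 'd'), ('c', 'b')]
```

Asking the user to order only these pairs targets the comparisons the current F is least sure about, among the results that matter most.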

Using Categorization for Intelligent Retrieval
• Category structure created a priori (typically a manual process)
• At search time: each search result is placed under its pre-assigned category
• Susceptible to skew → information overload

Categorization: Cost-based Optimization
• Categorize results automatically/dynamically
  • Generate a labeled, hierarchical category structure dynamically, based on the contents of the tuples in the result set
  • Does not suffer from the problems of a priori categorization
• Contributions:
  • Exploration/cost models to quantify the information overload faced by a user during an exploration
  • Cost-driven search to find low-cost categorizations
  • Experiments to evaluate the models/algorithms

Thank You!
