## Query-Based Data Pricing


**Query-Based Data Pricing**
Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, Dan Suciu
University of Washington
PODS 2012

**Motivation**
• Data is increasingly sold and bought on the web
• Websites that sell data:
  • AggData [www.aggdata.com]
  • Xignite (financial data) [www.xignite.com]
  • Gnip (social media) [www.gnip.com]
• Data marketplace services:
  • Windows Azure Marketplace (100+ datasets) [datamarket.azure.com]
  • Infochimps (15,000 datasets) [www.infochimps.com]
• Query-based pricing customized for buyers

**Current Pricing (1)**
• A fixed price for the whole dataset or for a specific set of views
• Example: CustomLists
  • USA Business Database for $399
  • Email addresses for $299
  • Businesses in WA for $199
• Limitations:
  • Restaurants in WA?
  • Businesses in cities with population > 100,000?

**Current Pricing (2)**
• API subscriptions (Azure Marketplace, Infochimps)
  • Allow queries over the data
  • Pay by number of transactions (page of results)

**Issues With Pricing**
• Buyers today need to buy a superset of the data they are interested in
• Sellers can’t easily anticipate all possible queries that buyers might ask
• Solution: we need a more flexible pricing scheme, parameterized by queries

**Outline**
• The Pricing Framework
• The Pricing Formula
• The Complexity of Pricing
• Dichotomy and Algorithms for Selections

**The Pricing Framework**
• The seller defines price points (view-price pairs): S = { (V1,p1), (V2,p2), … }
• A buyer can buy any query Q
• The system will compute price_D^S(Q)
[Figure: Buyer, Seller, and the Pricing System + Database D; labels: Q(D)?, price_D^S(Q), and the price points (V1,p1), (V2,p2), …]

**Instance-Based Determinacy**
Definition. V = V1, …, Vk determine Q given D, denoted D ⊢ V ↠ Q, if: for all D’, if V(D) = V(D’), then Q(D) = Q(D’)
Intuitively, “V1, …, Vk determine Q” means that Q(D) can be answered only from V1(D), …, Vk(D), without accessing the database instance D

**Arbitrage-Free**
• Axiom 1.
• Given D, the pricing function price_D(Q) is arbitrage-free if, for all views V1, …, Vk and every query Q where D ⊢ V1, …, Vk ↠ Q:
  price_D(Q) ≤ price_D(V1) + … + price_D(Vk)
• Suppose V determines Q and price_D(Q) > price_D(V). Then we can:
  • buy V(D) for price_D(V)
  • compute Q(D) from V(D)
  • now we have answered Q at some price p < price_D(Q)

**Discount-Free**
• Axiom 2. The pricing function price_D(Q) should not offer any additional discounts beyond the explicit price points defined by the seller.
• The intuition is that the price points represent discounts that the seller offers relative to the price of the whole database
• A pricing function is discount-free if it is maximal

**Example: Origami Database**
• Database: S
• Price points:
  • Get all dragon origami for $2
  • Get all red origami for $3
• What is the price of the entire database? Q(x,y,z) :- S(x,y,z)
• Exhausts the active domain:
  • V1, V2, V3, V4 determine Q: price(Q) ≤ $8
  • W1, W2, W3 determine Q: price(Q) ≤ $9
• price(Q) = $8

**Example: Origami Database**
• Relations: R, T, S
• Price points: p(σcolor) = $50, p(σshape) = $99, p(σshape) = $2, p(σcolor) = $5
• What is the price of the full join? Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v)

**Outline**
• The Pricing Framework
• The Pricing Formula
• The Complexity of Pricing
• Dichotomy and Algorithms for Selections

**The Query Pricing Formula**
• Given:
  • Price points S = {(V1,p1), …, (Vk,pk)}
  • Database instance D
  • Query Q
• Compute: price_D^S(Q)
• Properties: (a) arbitrage-free, (b) discount-free, (c) price_D^S(Vi) = pi
• If it exists, we say that the price points are consistent
• Method:
  • Consider all subsets of V = {V1, …, Vk} that determine Q
  • Let C be the subset with the minimum price, Σi pi, for Vi in C
  • Define p_D(Q) = Σi pi
Theorem.
(a) The price points are consistent iff p_D(Vi) = pi for every price point i = 1, …, k
(b) price_D^S(Q) = p_D(Q) is the unique arbitrage-free, discount-free pricing function that agrees with the price points

**Discussion**
• If the result of Q1 is always a subset of Q2, should Q1 be priced less than Q2? No!
• Example:
  • V(x,y) :- Fortune500(x,y)
  • Q(x,y) :- Fortune500(x,y), StrongBuyRec(x)
  • price(Q) >> price(V)
• We ignore computation costs in our framework
  • Cost of computing query Q
  • Q(D) = f(V(D)), but f can be hard to compute

**Outline**
• The Pricing Framework
• The Pricing Formula
• The Complexity of Pricing
• Dichotomy and Algorithms for Selections

**Determinacy**
Definition. [Instance-dependent] V determines Q given D, denoted D ⊢ V ↠ Q, if: for all D’, if V(D’) = V(D), then Q(D) = Q(D’)
[Nash, Segoufin, Vianu ‘07]
Definition. [Instance-independent] V determines Q, denoted V ↠ Q, if: for all D, D’, if V(D) = V(D’), then Q(D) = Q(D’)
V ↠ Q iff there exists a function f such that Q(D) = f(V(D)) for all D, iff for every D we have D ⊢ V ↠ Q

**Complexity Of Determinacy**
Open question: is the bound on the combined complexity tight?

**Complexity Of Pricing**
• Corollary. Deciding whether price_D^S(Q) ≤ k is:
  • Combined complexity [input S, D]: Σ₂ᵖ
  • Data complexity [input D]: coNP-hard
• Proposition. Pricing is at least as hard as determinacy
• How do we deal with the hardness of computation?

**Outline**
• The Pricing Framework
• The Pricing Formula
• The Complexity of Pricing
• Dichotomy and Algorithms for Selections

**Restricting Price Points to Selections**
• A seller can specify only the prices of selection queries of the form σ_{R.X=a}: prices on columns
• The domain of each column is finite and known to buyers and sellers
• Price points on selections are how prices are set in most cases today

**Dichotomy Theorem**
Theorem.
Assuming selection views only, for any conjunctive query Q without self-joins, one of the following holds (data complexity):
• price_Q^S(D) is in PTIME
• checking whether price_Q^S(D) ≤ k is NP-complete
Examples:
• PTIME:
  • Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v) [Chains]
  • Q(x1,…,xk) :- R1(x1,x2), …, Rk(xk,x1) [Cycles]
• NP-complete:
  • Q(x) :- R(x,y) [Projections]
  • Q(x,y,z) :- R(x,y,z), S(x), T(y), U(z)

**Algorithm For PTIME Cases**
• The algorithm uses a reduction to maximum flow
• Edges of finite capacity represent price points
• A set of edges of finite cost is a cut iff they determine the query
• Example:
  • Chain query Q(x,y) :- R(x), S(x,y), T(y)
  • Relations S, R, T with Dom(X) = {a1,a2,a3,a4} and Dom(Y) = {b1,b2,b3}

**Flow Graph**
[Figure: flow graph for the chain query over R, S, T, with nodes a1–a4 and b1–b3]
• A set of edges of finite cost is a cut iff they determine the query

**Conclusions**
• Summary:
  • The seller sets prices for some views, while the system computes the price of any query
  • Interesting application of query determinacy
  • Complexity: dichotomy for CQs without self-joins
• Future Work:
  • Pricing in the presence of updates
  • How do we overcome pricing for intractable queries?
  • Connection of pricing and privacy
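
**Sketch: The Pricing Formula By Brute Force**
The method on the "Query Pricing Formula" slide (take the cheapest subset of price points that determines Q) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the slides: the two-column relation, its column domains, the instance `D`, the prices, and the helper names `selection`, `determines`, and `price`. Instance-based determinacy is checked by enumerating every database D’ over the finite column domains, so the sketch is exponential and only suitable for toy inputs.

```python
from itertools import combinations, product

# Hypothetical toy data: one relation S(shape, color) over small, known domains.
DOM_SHAPE = ("dragon", "crane")
DOM_COLOR = ("red", "blue")
D = frozenset({("dragon", "red"), ("crane", "blue")})


def selection(column, value):
    """Return the selection view sigma_{column=value} as a function on instances."""
    return lambda db: frozenset(t for t in db if t[column] == value)


def full(db):
    """The query Q that asks for the entire relation."""
    return db


# Hypothetical price points: name -> (view, price).
PRICE_POINTS = {
    "shape=dragon": (selection(0, "dragon"), 2),
    "shape=crane": (selection(0, "crane"), 2),
    "color=red": (selection(1, "red"), 3),
    "color=blue": (selection(1, "blue"), 3),
}


def all_instances():
    """Enumerate every possible instance D' over the column domains."""
    tuples = list(product(DOM_SHAPE, DOM_COLOR))
    for size in range(len(tuples) + 1):
        for subset in combinations(tuples, size):
            yield frozenset(subset)


def determines(views, query, db):
    """Instance-based determinacy: every D' that agrees with db on all
    views must also agree with db on the query."""
    seen = [view(db) for view in views]
    answer = query(db)
    return all(
        query(other) == answer
        for other in all_instances()
        if [view(other) for view in views] == seen
    )


def price(query, db, price_points):
    """Cheapest total price of a subset of price points that determines
    the query on db, or None if no subset does."""
    best = None
    names = list(price_points)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(price_points[name][1] for name in subset)
            if best is not None and cost >= best:
                continue  # cannot improve on the best price found so far
            views = [price_points[name][0] for name in subset]
            if determines(views, query, db):
                best = cost
    return best
```

On this toy input, `price(full, D, PRICE_POINTS)` returns 4: buying both shape selections covers the shape column and is cheaper than buying both color selections for 6. `price(selection(0, "dragon"), D, PRICE_POINTS)` returns 2, so the computed price of that view equals its own price point, which is the consistency condition in the theorem.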