Private Keyword Search on Streaming Data

Rafail Ostrovsky, William Skeith
UCLA
(patent pending)

Motivating Example
• The intelligence community collects data from multiple sources that might potentially be “useful” for future analysis.
  • Network traffic
  • Chat rooms
  • Web sites, etc.
• However, what is “useful” is often classified.

Current Practice
• Continuously transfer all data to a secure environment.
• After the data is transferred, filter it in the classified environment and keep only a small fraction of the documents.

[Figure: three document streams (D(1,1), D(1,2), D(1,3), …; D(2,1), D(2,2), D(2,3), …; D(3,1), D(3,2), D(3,3), …) are all transferred into the classified environment, where a filter selects documents for storage.]
Filter rules are written by an analyst and are classified!

Current Practice
• Drawbacks:
  • Communication
  • Processing

How to improve performance?
• Distribute work to many locations on a network
• Seemingly ideal solution, but…
• Major problem:
  • Not clear how to maintain privacy, which is the focus of this talk

[Figure: each stream (D(1,1), D(1,2), D(1,3), …; D(2,1), D(2,2), D(2,3), …; D(3,1), D(3,2), D(3,3), …) passes through its own filter with local storage. The storage holds encrypted matches such as E(D(1,2)), E(D(1,3)) and E(D(2,2)), which are sent to the classified environment and decrypted there to D(1,2), D(1,3), D(2,2).]

Example Filter:
• Look for all documents that contain special classified keywords, selected by an analyst
  • Perhaps an alias of a dangerous criminal
• Privacy
  • Must hide what words are used to create the filter
  • Output must be encrypted

More generally:
• We define the notion of Public Key Program Obfuscation
• Encrypted version of a program
  • Performs same functionality as un-obfuscated program, but:
    • Produces encrypted output
    • Impossible to reverse engineer
• A little more formally:

Public Key Program Obfuscation

Privacy

Related Notions
• PIR (Private Information Retrieval) [CGKS], [KO], [CMS]…
• Keyword PIR [KO], [CGN], [FIPR]
• Program Obfuscation [BGIRSVY]…
  • Here output is identical to un-obfuscated program, but in our case it is encrypted.
• Public Key Program Obfuscation
  • A more general notion than PIR, with lots of applications

What we want
[Figure: a stream D(1,1), D(1,2), D(1,3), … passes through a filter into storage.]

[Figure: a document stream in which matching documents #1, #2 and #3 are interleaved with non-matching documents.]

How to accomplish this?

Several Solutions Based on Homomorphic Encryption
• For this talk: Paillier Encryption
• Properties:
  • Plaintext set = $\mathbb{Z}_n$
  • Ciphertext set = $\mathbb{Z}^*_{n^2}$
  • Homomorphic, i.e., $E(x) \cdot E(y) = E(x+y)$
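The homomorphic property can be checked with a toy implementation. The following is a minimal sketch of textbook Paillier with generator n + 1, assuming Python 3.8+ (for `pow(x, -1, n)`); the function names and the tiny primes are illustrative choices, not taken from the talk, and the parameters are far too small to be secure.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Toy key generation. The tiny default primes are for illustration only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # decryption constant, valid for generator n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

n, sk = keygen()
cx, cy = encrypt(n, 20), encrypt(n, 22)

# Multiplying ciphertexts adds the plaintexts: E(x) * E(y) = E(x + y).
total = decrypt(n, sk, cx * cy % (n * n))

# As a consequence, raising a ciphertext to a power k multiplies the plaintext by k.
scaled = decrypt(n, sk, pow(cx, 5, n * n))
```

Neither homomorphic operation needs the secret key, which is what lets an untrusted party compute on encrypted values.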

Simplifying Assumptions for this Talk
• All keywords come from some poly-size dictionary
• Truncate documents beyond a certain length

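[Figure: a document D is processed against the dictionary to produce the pair (g, g^D), which is multiplied (*=) into several locations of the output buffer.]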
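Only the labels of this figure survive, so the following Python sketch is one plausible reading of it rather than the construction as presented. It assumes that the analyst publishes a table holding E(1) for each secret keyword and E(0) for every other dictionary word, that g is the product of the table entries for the words in D, and that the pair (g, g^D) is multiplied into a few random buffer slots. All function names, the number of copies, the decoding step and the fixed Mersenne primes are hypothetical; the primes are publicly known, so the toy Paillier here is not secure.

```python
import math
import random
import secrets

# Toy Paillier with fixed, publicly known primes: illustration only, NOT secure.
P, Q = 2**127 - 1, 2**107 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # secret key
MU = pow(LAM, -1, N)                               # valid for generator N + 1

def encrypt(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return pow(N + 1, m, N2) * pow(r, N, N2) % N2

def decrypt(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def make_filter(dictionary, keywords):
    """Analyst side: E(1) for secret keywords, E(0) for every other word."""
    return {w: encrypt(1 if w in keywords else 0) for w in dictionary}

def process(enc_dict, buffer, text, copies, rng):
    """Public side: uses neither the secret key nor the keyword list."""
    g = 1  # trivial encryption of 0
    for w in set(text.split()):
        if w in enc_dict:
            g = g * enc_dict[w] % N2      # g = E(number of keyword hits)
    d = int.from_bytes(text.encode(), "big")
    assert d < N, "document too long for this toy modulus"
    g_d = pow(g, d, N2)                   # g^D = E(hits * D)
    for i in rng.sample(range(len(buffer)), copies):
        a, b = buffer[i]
        buffer[i] = (a * g % N2, b * g_d % N2)  # the "*=" of the figure

def decode(buffer):
    """Analyst side: decrypt each slot and divide out the hit count."""
    found = set()
    for a, b in buffer:
        hits = decrypt(a)
        if hits == 0 or math.gcd(hits, N) != 1:
            continue  # empty slot (or an undecodable hit count)
        d = decrypt(b) * pow(hits, -1, N) % N
        try:
            found.add(d.to_bytes((d.bit_length() + 7) // 8, "big").decode())
        except UnicodeDecodeError:
            pass  # garbage, e.g. two different documents landed in this slot
    return found

dictionary = ["alpha", "bravo", "charlie", "delta", "meet", "at", "dawn"]
enc_dict = make_filter(dictionary, keywords={"charlie"})
buffer = [(encrypt(0), encrypt(0)) for _ in range(8)]
rng = random.Random(0)
for doc in ["meet alpha at dawn", "charlie at dawn", "bravo delta"]:
    process(enc_dict, buffer, doc, copies=2, rng=rng)
recovered = decode(buffer)
```

A non-matching document yields g = E(0), so it changes the ciphertexts in the buffer but not the plaintexts underneath, and the party running `process` cannot tell a match from a non-match. When two different matching documents share a slot, the decoded value is meaningless; the UTF-8 check above is only a crude stand-in for proper collision detection.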

Here’s another matching document
• Collisions cause two problems:
  1. Good documents are destroyed
  2. Non-existent documents could be fabricated
[Figure: matching documents #1, #2 and #3 already occupy buffer locations when another matching document arrives.]

We’ll make use of two combinatorial lemmas…

How to detect collisions?
• Append a highly structured (yet random) k-bit string to the message
• The sum of two or more such strings will be another such string with negligible probability in k
• Specifically, partition k bits into triples of bits, and set exactly one bit from each triple to 1

  100|001|100|010|010|100|001|010|010
⊕ 010|001|010|001|100|001|100|001|010
⊕ 010|100|100|100|010|001|010|001|010
= 100|100|010|111|100|100|111|010|010
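The check can be sketched in a few lines of Python. The strings in the example above combine by bitwise XOR within each triple, so the sketch assumes XOR as the "sum"; the function names are illustrative and not from the talk.

```python
import secrets

def random_tag(triples):
    """One 3-bit group per triple, each with exactly one bit set (1, 2 or 4)."""
    return [1 << secrets.randbelow(3) for _ in range(triples)]

def combine(*tags):
    """Bitwise XOR of several tags, triple by triple."""
    out = [0] * len(tags[0])
    for tag in tags:
        out = [a ^ b for a, b in zip(out, tag)]
    return out

def is_valid(tag):
    """A tag is well formed when every triple has exactly one bit set."""
    return all(group in (1, 2, 4) for group in tag)

def parse(text):
    """Read the slide notation, e.g. '100|001|010'."""
    return [int(group, 2) for group in text.split("|")]

a = parse("100|001|100|010|010|100|001|010|010")
b = parse("010|001|010|001|100|001|100|001|010")
c = parse("010|100|100|100|010|001|010|001|010")
combined = combine(a, b, c)
collision_detected = not is_valid(combined)
```

Under this XOR reading, two tags never combine into a valid tag, since each triple becomes either 000 or a pattern with two bits set. With three random tags a single triple stays valid with probability 7/9, so the whole tag survives with probability (7/9)^(k/3), which shrinks exponentially in k.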

Detecting Overflow > m
• Double buffer size from m to 2m
• If m < #documents < 2m, output “overflow”
• If #documents > 2m, then the expected number of collisions is large, thus output “overflow” in this case as well.
• Not yet in the ePrint version; it will appear soon, along with some other extensions.

More from the paper that we don’t have time to discuss…
• Reducing program size below dictionary size (using Φ-Hiding from [CMS])
• Queries containing AND (using [BGN] machinery)
• Eliminating negligible error (using perfect hashing)
• Scheme based on arbitrary homomorphic encryption

Conclusions
• Private searching on streaming data
• Public key program obfuscation, more general than PIR
• Practical, efficient protocols
• Many open problems

Thanks For Listening!
