Introduction to Research - PowerPoint PPT Presentation


Introduction to Research

Data Management and Database

http://www.cs.fsu.edu/~lifeifei

lifeifei@cs.fsu.edu

Feifei Li



Outline

  • Background

  • My Research Focus and Experience

  • Some Problems I have worked on

  • Current Interest and Activity

  • My Experience as a PhD student

  • Q&A



A Short History Class

  • Undergraduate study at Tsinghua University, China (1997) and Nanyang Technological University, Singapore (1998-2002)

    • Bachelor of Applied Science

  • PhD study at Boston University (2002-2007)

    • Computer Science, Research Area: Database

  • Now…



Research Focus

  • Data management in general, and (roughly in the order I worked on them):

    • Efficient indexing, querying, and management of large-scale or high-dimensional databases

    • Spatio-temporal databases and applications

    • Sensor and Stream databases

    • Privacy preservation issues for data management

    • Query security for various types of data models

    • Uncertain databases and data cleaning


Experience

  • SDE intern at Microsoft, SQL Server group, summer 2005 (Redmond, WA)

  • Research intern at IBM T. J. Watson Research Center (Hawthorne, NY), database research group, summer 2006

  • Visiting research student at AT&T Labs Research (Florham Park, NJ), database research group, winter 2006 and spring 2007

  • Research intern at MSR, database research group, summer 2007




Outline

  • Background

  • My Research Focus and Experience

  • Some Problems I have worked on

    • Retrieving structured data from Web

    • Spatio-temporal databases

    • Sensor databases

  • Current Interest and Activity

  • My Experience as a PhD student

  • Q&A


The First Step

  • My FYP (Final Year Project), around 2000-2001

    • Analyze and build structures of different websites

      • How to automate this process?

      • View a website as a tree structure and?

      • Given a group of similar websites, summarize a suitable schema…

    • Retrieve information from certain part(s) of a website as specified by the user

      • Using the structure information obtained in the first step

      • Why bother? Information integration: if BBC favors Bush and CNN ‘hates’ him, how does each respond to event A?

      • Another issue: semi-structured data (HTML) to structured data (XML)
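The idea of viewing a website as a tree can be sketched with Python's standard html.parser module. This is only an illustration, not the method used in the project: the Node and TreeBuilder classes, the tag_paths helper, and the sample page are all hypothetical. The set of root-to-leaf tag paths serves as a very crude structural summary of one page.

```python
from html.parser import HTMLParser


class Node:
    """One element of the page tree (hypothetical helper class)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without an end tag

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Return the set of root-to-leaf tag paths of the tree."""
    path = prefix + "/" + node.tag
    if not node.children:
        return {path}
    paths = set()
    for child in node.children:
        paths |= tag_paths(child, path)
    return paths


page = "<html><body><h1>News</h1><ul><li>Story A</li><li>Story B</li></ul></body></html>"
paths = tag_paths(parse(page))
```

Comparing such path sets across a group of similar sites is one simple starting point for the schema question raised above.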



So What Happened



Possible Research Problems

  • Automatic Schema Identification

    • Given a collection of data sources, find a common schema that maximally describes the dataset.

  • Information retrieval & search from Web

    • IR techniques (IR is a separate field by itself, unstructured data) + database techniques (structured data), how to combine the two?

    • Google

  • Information Integration

    • Given data source A and data source B that both refer to the same schema but hold (slightly) different instances, how do we link or combine the two?


Then Boston

  • Quite a pleasant transition in the summer: Singapore (90+°F year round) to Boston (80°F in the summer)

  • Winter:

    • Anyway…



What to Do Now?

  • Spatio-temporal databases and applications

    • Why? My advisor was in this area and…

  • Examples:

    • Indexing higher dimensional data:

      • 1-d: B+-tree; 2-d, 3-d, 4-d, …? kd-tree, R-tree

      • Space partitioning vs. data partitioning

    • Queries

      • e.g., continuous nearest neighbor query: continuously find the closest gas station while I am driving from Boston to NY.

    • Moving objects

      • In Euclidean space

      • On a road network
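To make the continuous nearest neighbor query concrete, here is a minimal sketch in Euclidean space. It assumes a straight-line route and simply samples positions along it with a brute-force search; real systems use an index and compute the exact split points. The station names and coordinates are hypothetical.

```python
import math


def nearest(point, sites):
    """Brute-force nearest neighbor by Euclidean distance."""
    return min(sites, key=lambda name: math.dist(point, sites[name]))


def continuous_nn(start, end, sites, steps=100):
    """Sample the segment start -> end and record (fraction, site)
    each time the nearest site changes."""
    changes = []
    for i in range(steps + 1):
        t = i / steps
        pos = (start[0] + t * (end[0] - start[0]),
               start[1] + t * (end[1] - start[1]))
        best = nearest(pos, sites)
        if not changes or changes[-1][1] != best:
            changes.append((t, best))
    return changes


# Hypothetical gas stations along a route from (0, 0) to (10, 0).
stations = {"g1": (1.0, 1.0), "g2": (5.0, -1.0), "g3": (9.0, 1.0)}
route = continuous_nn((0.0, 0.0), (10.0, 0.0), stations)
```

The result is a short list of route fractions at which the answer switches from one station to the next.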


Indexing High Dimensional Data: R-tree

  • e.g., with fanout 4: group nearby rectangles into parent MBRs (minimum bounding rectangles); each group is stored in one disk page

[Figure: data rectangles A through J.]


Example

  • F=4

[Figure: rectangles A through J grouped under parent MBRs P1 through P4.]




R-trees: Search

[Figure: R-tree with parent MBRs P1 through P4 over rectangles A through J.]


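A range search over an R-tree descends only into nodes whose MBR intersects the query rectangle. The sketch below shows that pruning on a tiny hand-built two-level tree. The coordinates are hypothetical (the figure gives none), the Rect and Node classes are illustrative, and insertion and node splitting are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Rect:
    x1: float
    y1: float
    x2: float
    y2: float

    def intersects(self, other):
        return not (self.x2 < other.x1 or other.x2 < self.x1 or
                    self.y2 < other.y1 or other.y2 < self.y1)


def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return Rect(min(r.x1 for r in rects), min(r.y1 for r in rects),
                max(r.x2 for r in rects), max(r.y2 for r in rects))


@dataclass
class Node:
    box: Rect
    children: list = field(default_factory=list)  # child nodes (internal node)
    entries: dict = field(default_factory=dict)   # name -> Rect (leaf node)


def make_leaf(entries):
    return Node(box=mbr(list(entries.values())), entries=entries)


def make_internal(children):
    return Node(box=mbr([c.box for c in children]), children=children)


def search(node, query, visited):
    """Return names of data rectangles intersecting the query.
    Subtrees whose MBR misses the query are never visited."""
    visited.append(node)
    hits = [name for name, r in node.entries.items() if r.intersects(query)]
    for child in node.children:
        if child.box.intersects(query):
            hits += search(child, query, visited)
    return hits


# Hypothetical layout: two leaf pages under one root.
p1 = make_leaf({"A": Rect(0, 0, 1, 1), "B": Rect(1, 2, 2, 3), "C": Rect(2, 0, 3, 1)})
p2 = make_leaf({"D": Rect(6, 6, 7, 7), "E": Rect(7, 8, 8, 9)})
root = make_internal([p1, p2])

visited = []
result = search(root, Rect(0.5, 0.5, 2.5, 2.5), visited)
```

Here the query touches only the first leaf, so the second page is never read; that skipped page is the disk access the index saves.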


Query in Spatio-Temporal Databases

  • Trip Planning Queries (TPQ):

    • Given a starting location, a destination, and arbitrary points of interest, find the best possible trip.

  • Example:

    • Minimize the total traveling time from Boston to Providence, while visiting a post office, a hardware store and a gas station.


Visual Example

  • We can minimize the total distance, time, etc.

  • We can have different categories of points of interest (gas stations, hotels, etc.).

[Figure: map showing Home, Work, and a gas station.]


The Nearest Neighbor Algorithm

[Figure: start S, destination D, and candidate points A1, A2, B1, B2, B3, C1, C2 from categories A, B, and C.]

  • Yields a $2^{m+1} - 1$ approximation, where m is the total number of categories.
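The slide does not spell out the algorithm, so the following is a sketch of the greedy idea its name suggests: from the current position, always move to the nearest point belonging to a category not yet visited, then finish at the destination. The function names and coordinates are hypothetical, and distances are Euclidean for simplicity.

```python
import math


def nn_trip(start, dest, categories):
    """Greedy trip: repeatedly visit the nearest point of any
    category that has not been covered yet, then go to dest."""
    path = [start]
    remaining = dict(categories)
    while remaining:
        cat, point = min(
            ((c, p) for c, pts in remaining.items() for p in pts),
            key=lambda cp: math.dist(path[-1], cp[1]),
        )
        path.append(point)
        del remaining[cat]
    path.append(dest)
    return path


# Hypothetical points of interest in three categories.
pois = {
    "A": [(1.0, 0.0), (6.0, 3.0)],
    "B": [(2.0, 1.0), (4.0, -2.0), (8.0, 0.0)],
    "C": [(3.0, 0.0), (9.0, 2.0)],
}
trip = nn_trip((0.0, 0.0), (10.0, 0.0), pois)
```

The greedy choice is cheap to compute but can be led far off course by a nearby point, which is consistent with the exponential approximation bound quoted above.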


The Minimum Distance Algorithm

[Figure: start S, destination D, and candidate points A1, A2, B1, B2, B3, C1, C2 from categories A, B, and C.]

  • Yields an m-approximation where m is the total number of categories.
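As with the previous slide, the steps are not listed here, so this is one plausible reading rather than the definitive algorithm: for each category pick the point with the smallest detour from start to destination, then visit the chosen points in nearest-neighbor order. The function names and coordinates are hypothetical.

```python
import math


def trip_length(path):
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def md_trip(start, dest, categories):
    # Step 1: per category, the point minimizing dist(start, p) + dist(p, dest).
    chosen = [
        min(pts, key=lambda p: math.dist(start, p) + math.dist(p, dest))
        for pts in categories.values()
    ]
    # Step 2 (assumed ordering rule): visit chosen points nearest-first.
    path = [start]
    while chosen:
        nxt = min(chosen, key=lambda p: math.dist(path[-1], p))
        chosen.remove(nxt)
        path.append(nxt)
    path.append(dest)
    return path


# Hypothetical points of interest in three categories.
pois = {
    "A": [(1.0, 0.0), (6.0, 3.0)],
    "B": [(2.0, 1.0), (4.0, -2.0), (8.0, 0.0)],
    "C": [(3.0, 0.0), (9.0, 2.0)],
}
trip = md_trip((0.0, 0.0), (10.0, 0.0), pois)
length = trip_length(trip)
```

Because every chosen point lies close to the straight line from start to destination, the total detour stays bounded by the number of categories.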


Search over R-tree and Road Network

  • R-tree:

    • Euclidean space: how can we use the R-tree to speed up the search?

  • Road network:

[Figure: road network with nodes labeled S, D, A, M, and p.]



Sensor Network Model

  • Large set of sensors distributed in a sensor field.

  • Communication via a wireless ad-hoc network.

  • Nodes and links are failure-prone.

  • Sensors are resource-constrained

    • Limited memory, battery-powered, messaging is costly.


Sensor Databases

  • Useful abstraction:

    • Treat sensor field as a distributed database

      • But: data is gathered, not stored or saved.

    • Express query in SQL-like language

      • COUNT, SUM, AVG, MIN, GROUP-BY

    • Query processor distributes query and aggregates responses

    • Exemplified by systems like TAG (Berkeley/MIT) and Cougar (Cornell)



A Motivating Example

  • Each sensor has a single sensed value.

  • Sink initiates one-shot queries such as: What is the…

    • maximum value?

    • mean value?

  • Continuous queries are a natural extension.

[Figure: sensor network with nodes A through J, each holding a single sensed value.]



AVG Aggregation (no losses)

  • Build spanning tree

  • Aggregate in-network

    • Each node sends one summary packet

    • Summary has SUM and COUNT of sub-tree

  • Reliability problem when there are losses (common in sensor networks)

[Figure: spanning tree over the sensor network; each node forwards a (SUM, COUNT) summary of its subtree toward the sink, e.g. (6,1), (9,2), (15,3), (26,4). At the sink: AVG=70/10=7.]
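The in-network aggregation above can be sketched as a recursive walk over the spanning tree, where every node returns a single (SUM, COUNT) pair for its subtree. The tree and values below are hypothetical and are not the ones in the figure.

```python
def aggregate(node, children, value):
    """Return the (SUM, COUNT) summary of the subtree rooted at node."""
    total, count = value[node], 1
    for child in children.get(node, []):
        s, c = aggregate(child, children, value)
        total += s
        count += c
    return total, count


# Hypothetical spanning tree rooted at the sink.
children = {"sink": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
value = {"sink": 4, "a": 6, "b": 3, "c": 2, "d": 10, "e": 5}

total, count = aggregate("sink", children, value)
avg = total / count
```

Each node sends one fixed-size packet no matter how large its subtree is, which is why a single lost packet can silently drop a whole subtree from the result.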



AVG Aggregation (naive)

  • What if redundant copies of data are sent?

  • AVG is duplicate-sensitive

    • Duplicating data changes aggregate

    • Increases weight of duplicated data

[Figure: the same network, but one node sends its summary (12,1) to two parents, so it is counted twice. At the sink: AVG=82/11≠7.]



AVG Aggregation (TAG++)

  • Can compensate for increased weight

    • Send halved SUM and COUNT instead

  • Does not change expectation!

  • Only reduces variance

[Figure: the same network, with the two-parent node sending a halved summary (6,0.5) to each parent. At the sink: AVG=70/10=7.]
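A small numeric check shows why full duplicates distort AVG while halved copies do not, assuming both copies arrive. The summaries below are hypothetical and the combine helper is illustrative.

```python
def combine(summaries):
    """Merge (SUM, COUNT) summaries by component-wise addition."""
    return (sum(s for s, _ in summaries), sum(c for _, c in summaries))


others = [(10, 2), (8, 1)]   # hypothetical summaries from the rest of the network
leaf = (12, 1)               # a node that reports to two parents

exact = combine(others + [leaf])               # each value counted once
naive = combine(others + [leaf, leaf])         # full copy sent to both parents
half = (leaf[0] / 2, leaf[1] / 2)
split = combine(others + [half, half])         # halved copy sent to both parents

avg_exact = exact[0] / exact[1]
avg_naive = naive[0] / naive[1]
avg_split = split[0] / split[1]
```

With no losses the halved copies add back up to the original contribution. When one copy is lost, only half of that node's weight disappears instead of all of it.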



AVG Aggregation (LIST)

  • Can handle duplicates exactly with a list of <id, value> pairs

  • Transmitting this list is expensive!

  • Lower bound: linear space is necessary if we demand exact results.

[Figure: the same network, with each node forwarding a list of <id, value> pairs for its subtree, e.g. B,6;D,3 and C,9;E,1. At the sink: AVG=70/10=7.]



COUNT Sketches

  • Problem: Estimate the number of distinct item IDs in a data set with only one pass.

  • Constraints:

    • Small space relative to stream size.

    • Small per item processing overhead.

    • Union operator on sketch results.

  • Exact COUNT is impossible without linear space.

  • First approximate COUNT sketch in [FM’85].

    • O(log N) space, O(1) processing time per item.



Counting Paintballs

  • Imagine the following scenario:

    • A bag of n paintballs is emptied at the top of a long stair-case.

    • At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way.

Looking only at the pattern of marked steps, what was n?


Counting Paintballs (cont)

  • What does the distribution of paintball bursts look like?

    • The number of bursts at each step follows a binomial distribution.

    • The expected number of bursts drops geometrically.

    • Few bursts after $\log_2 n$ steps

[Figure: staircase with the burst distribution at each step: 1st step $B(n, 1/2)$, 2nd step $B(n, 1/4)$, ..., S-th step $B(n, 1/2^S)$.]


Counting Paintballs (cont)

  • Many different estimator ideas [FM'85,AMS'96,GGR'03,DF'03,...]

  • Example: Let pos denote the position of the highest unmarked stair,

    $E(pos) \approx \log_2(0.775351\, n)$

    $\sigma^2(pos) \approx 1.12127$

  • Standard variance reduction methods apply

  • Either O(log n) or O(log log n) space


Application to Sensornets

  • Each sensor computes k independent sketches of itself using its unique sensor ID.

    • Coming next: sensor computes sketches of its value.

  • Use a robust routing algorithm to route sketches up to the sink.

  • Aggregate the k sketches via in-network bitwise OR.

    • Union via OR is duplicate-insensitive.

  • The sink then estimates the count.

  • Similar to gossip and epidemic protocols.

  • How about SUM and other aggregates?



COUNT vs Link Loss (grid)



Outline

  • Background

  • My Research Focus and Experience

  • Some Problems I have worked on

  • Current Interest and Activity

    • Privacy Preservation

    • Query Security

  • My Experience as a PhD student

  • Q&A


Privacy Preservation

It is not legal to query an individual person’s salary. However, we are interested in retrieving the average, which is often legal. What do you do?

Perturb the data… How?

Add random noise… in a particular way

Basic intuition: add independent and identically distributed (IID) random noise with zero mean

[Figure: original salaries (Sum=$7,000) plus noise (Sum=$0) give perturbed salaries (Sum=$7,000).]
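The intuition can be checked numerically: zero-mean noise changes individual values a lot while leaving the average nearly unchanged. This is a minimal sketch with made-up salaries and an arbitrary Gaussian noise level; the perturb function and all parameters are hypothetical.

```python
import random


def perturb(values, sigma, rng):
    """Add independent zero-mean Gaussian noise to every value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(7)  # fixed seed, only for repeatability
salaries = [rng.uniform(30_000, 120_000) for _ in range(10_000)]  # made-up data
published = perturb(salaries, sigma=5_000, rng=rng)

true_avg = sum(salaries) / len(salaries)
noisy_avg = sum(published) / len(published)
largest_change = max(abs(p - s) for p, s in zip(published, salaries))
```

The noise averages out over many records, so the aggregate stays useful even though no single published salary can be taken at face value.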


How about Multiple Attributes (multi-dimensional data)?

  • Does IID noise really preserve privacy?


Principal Component Analysis: PCA

[Figure: data perturbed with i.i.d. noise.]

Principal Component Analysis: PCA

[Figure: data perturbed with correlated noise.]


PCA Based Data Reconstruction

A: Original Data

A*: Perturbed Data

A~: Reconstructed Data

[Figure: A, A*, and A~ relative to the principal direction. Labels: Added Noise (Utility) with variance $\sigma^2$, Removed Noise, Projection Error, Remaining Noise (Privacy).]


PCA Based Data Reconstruction

Correlated Noise!

A: Original Data

A*: Perturbed Data

A~: Reconstructed Data

[Figure: A, A*, and A~ relative to the principal direction. Labels: Added Noise (Utility) with variance $\sigma^2$, Projection Error, Remaining Noise (Privacy).]


Data Perturbation: main idea

  • Observations

    • The amount of random noise controls the privacy/utility tradeoff

    • i.i.d. (independent and identically distributed) noise does not preserve privacy well enough!

  • Lesson learned

    • Noise should be correlated with original data
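Why i.i.d. noise falls short can be illustrated in two dimensions: when the original attributes are strongly correlated, projecting the perturbed data onto its principal direction removes part of the noise. The sketch below uses made-up data on the line y = x and a closed-form principal direction for a 2x2 covariance; the function names and parameters are hypothetical.

```python
import math
import random


def principal_direction(points):
    """Unit vector of the top principal component of centered 2-D points."""
    n = len(points)
    sxx = sum(x * x for x, _ in points) / n
    syy = sum(y * y for _, y in points) / n
    sxy = sum(x * y for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # closed form for 2x2 covariance
    return math.cos(theta), math.sin(theta)


def mean_sq_error(a, b):
    return sum((x1 - x2) ** 2 + (y1 - y2) ** 2
               for (x1, y1), (x2, y2) in zip(a, b)) / len(a)


rng = random.Random(1)  # fixed seed, only for repeatability
# Made-up, perfectly correlated original data: points on the line y = x.
original = [(t, t) for t in (rng.uniform(-10, 10) for _ in range(2000))]
perturbed = [(x + rng.gauss(0, 1), y + rng.gauss(0, 1)) for x, y in original]

# Attacker side: center the perturbed data, then project onto its principal direction.
mx = sum(x for x, _ in perturbed) / len(perturbed)
my = sum(y for _, y in perturbed) / len(perturbed)
centered = [(x - mx, y - my) for x, y in perturbed]
ux, uy = principal_direction(centered)
reconstructed = [(mx + (x * ux + y * uy) * ux, my + (x * ux + y * uy) * uy)
                 for x, y in centered]

err_perturbed = mean_sq_error(original, perturbed)
err_reconstructed = mean_sq_error(original, reconstructed)
```

The reconstruction ends up closer to the original than the published data, so part of the intended privacy is lost. Noise that follows the same correlation as the data would survive the projection.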


How about Streaming Data?

  • Streaming data: data arrives continuously and no global view of the data is available, so global trends cannot be computed.

[Figure: panels labeled Online Correlated Noise, Correlated Noise, and i.i.d. Noise.]




Example of Data Publishing

www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hadjieleftheriou:Marios.html

www.sigmod.org/dblp/db/indices/a-tree/h/Hadjieleftheriou:Marios.html


Outsourced Database for Better Query Services

Company with headquarters in US

Servers that are close to local clients and maintained by local business partners


Publishing Data and Outsourcing Query Service

[Figure: an IP traffic stream (0 1 1 0 0 1 … 1 1 0 …) from the network is fed to Gigascope, an analysis tool, which returns results and statistics.]


Data Publishing Model [HIM02]

Owner: publishes data

Servers: host (or monitor) the data and provide query services

Clients: query the owner’s data through servers (possibly = owner)

[Figure: owner, servers, and clients.]

H. Hacigumus, B. R. Iyer, and S. Mehrotra, ICDE02



Information Security Issues

  • The third-party (server) cannot be trusted

    • Malicious intent

    • Compromised equipment

    • Unintentional errors (e.g. bugs)


Problem 1: Injection

Client query: Select * from T where 5<A<11

Server returns 7, 8, 9

[Figure: owner, server, and client.]


Problem 2: Drop

Client query: Select * from T where 5<A<11

Server returns 7, 9

[Figure: owner, server, and client, with a record labeled $r_{i+1}$.]


Query Authentication: Goals

  • Query Correctness: results do exist in the owner's database

  • Query Completeness: no records have been omitted from the result

  • Query Freshness: latest available answer (in case of updates)


General Approach

[Figure: owner, servers, and clients, with arrows labeled Authenticated Structures, Query results, and Verification Object (VO).]


Merkle Hash Tree [M89]

Collision-resistant hash function: any change in the tree will lead to a different hash value for the root

Digital signature of the root: no one except the owner could produce the signature

Single signature to sign many messages

Hash function is publicly known

[Figure: binary hash tree over messages m1 through m8, with leaf hashes h1 through h8, internal hashes such as h12 = H(h1|h2), h34, h56, h78, h1..4, h5..8, and root h1..8. The owner computes σ = Sign(h1..8, SK); the client checks Ver(h1..8, σ, PK) = valid?]

R. C. Merkle. CRYPTO, 1989
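The hashing part of a Merkle tree is easy to sketch with Python's hashlib. The example below builds the tree over m1 through m8, extracts the sibling hashes needed to check m5, and recomputes the root from them. SHA-256 and the function names are choices made for this sketch, and the signature on the root is left out because it needs a public-key library.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(messages):
    """Return all tree levels, leaves first.
    Assumes len(messages) is a power of two."""
    levels = [[H(m) for m in messages]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def proof(levels, index):
    """Sibling hashes on the path from one leaf to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return path


def root_from_proof(message, path):
    h = H(message)
    for sibling, sibling_is_left in path:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h


messages = [f"m{i}".encode() for i in range(1, 9)]  # m1..m8
levels = build_levels(messages)
root = levels[-1][0]      # the single value the owner would sign
vo = proof(levels, 4)     # verification object for m5 (index 4)
```

A client holding the signed root needs only three hashes to check one of eight messages, and any tampering with the message changes the recomputed root.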


Merkle B (MB) Tree: Natural Extension for Range Query

  • Use a B+-tree instead of a binary search tree:

[Figure: B+-tree with index keys 410, 720 over leaf keys 250, 320, 410, 600, 720 and tuples t1 through t5.]

  • Extend it with hash information:

[Figure: leaf node with entries (Ki, hi=H(ti)), ..., (Kj, hj=H(tj)).]


Merkle B (MB) Tree: Natural Extension for Range Query

[Figure: internal node with entries (p0, h0), (p1, k1, h1), ..., (pf, kf, hf); the child under p1 holds entries (p10, h10), (p11, k11, h11), ...]

$h_1 = H(h_{10} | \ldots | h_{1f})$

For the root node, $\sigma = \mathrm{Sign}(h_0 | \ldots | h_f)$


How about Streaming Data? Outsourced stream model: stock trading monitoring

  • Provider: a stock broker

  • Servers (Bloomberg)

  • Clients

  • Register queries (Q): sliding window query and/or one-shot query


And Other Model? Revisiting the CISCO – AT&T Example

[Figure: an IP traffic stream (0 1 1 0 0 1 … 1 1 0 …) from the network is fed to Gigascope, which produces statistics.]

CISCO owns the network traffic data: it is both the data owner and the client!

Lawyers: sign the trust agreement

Could we (computer scientists) help?



Outline

  • Background

  • My Research Focus and Experience

  • Some Problems I have worked on

  • Current Interest and Activity

    • Uncertain databases and data cleaning. Talk to me if you’d like to learn more.

  • My Experience as a PhD student

  • Q&A


My Lessons

  • Have a strong motivation

    • You are doing a PhD for yourself, not for FSU, not for your advisor

  • Find a topic that attracts you the most

    • A PhD can be frustrating and boring, and you are almost broke as a student. So why not do something you have the greatest interest in?


My Lessons

  • Meet your advisor as often as possible

    • Your advisor is almost the only one who really cares about your PhD and knows what you are doing…

  • Make connections

    • Whenever possible: conferences, industry labs, etc. Also work on your communication skills (including paper writing)


My Lessons

  • Read Papers

    • As much as you can! It’s never enough

  • Finally

    • Work hard, enjoy your life, and good luck!



Questions?

  • Thank you!



Back to COUNT Sketches

  • The COUNT sketches of [FM'85] are equivalent to the paintball process.

    • Start with a bit-vector of all zeros.

    • For each item,

      • Use its ID and a hash function for coin flips.

      • Pick a bit-vector entry.

      • Set that bit to one.

  • These sketches are duplicate-insensitive:

    {x}      1 0 0 0 0

    {y}      0 0 1 0 0

    {x,y}    1 0 1 0 0

  • $\forall A, B:\ \mathrm{Sketch}(A)\ \mathrm{OR}\ \mathrm{Sketch}(B) = \mathrm{Sketch}(A \cup B)$
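The steps above can be sketched in a few lines. This is a minimal single-sketch version, assuming SHA-256 as the hash used for the coin flips and a 32-bit vector stored in a Python integer; the function names and sensor IDs are hypothetical, and the constant in the estimator is the one quoted earlier in the talk.

```python
import hashlib

BITS = 32


def position(item_id: str) -> int:
    """Bit index an item sets: the number of trailing zero bits of its hash,
    so bit i is chosen with probability 2**-(i+1)."""
    h = int.from_bytes(hashlib.sha256(item_id.encode()).digest()[:8], "big")
    pos = 0
    while pos < BITS - 1 and (h >> pos) & 1 == 0:
        pos += 1
    return pos


def sketch(items) -> int:
    bits = 0  # bit-vector of all zeros
    for item in items:
        bits |= 1 << position(item)
    return bits


def estimate(bits: int) -> float:
    pos = 0
    while (bits >> pos) & 1:  # find the lowest unset bit
        pos += 1
    return 2 ** pos / 0.775351  # constant as quoted in the talk


a = [f"sensor-{i}" for i in range(0, 600)]
b = [f"sensor-{i}" for i in range(400, 1000)]  # overlaps with a
union = sketch(a) | sketch(b)
approx = estimate(union)
```

Merging with bitwise OR gives exactly the sketch of the union, so an item reported along several paths is counted once. A single sketch is a coarse estimate; averaging over k independent sketches reduces the variance.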

