Lecture 3 data data structures and reasoning in science
Download
1 / 44

CSC 599: Computational Scientific Discovery - PowerPoint PPT Presentation


  • 100 Views
  • Uploaded on

Lecture 3: Data, Data Structures, and Reasoning in Science. CSC 599: Computational Scientific Discovery. Outline. Computers Helping Scientists Data in Science Data Catalogs Data structures for scientific knowledge Equations Symbolic knowledge Simulations and scientific reasoning

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' CSC 599: Computational Scientific Discovery' - rumer


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Lecture 3 data data structures and reasoning in science

Lecture 3: Data, Data Structures, and Reasoning in Science

CSC 599: Computational Scientific Discovery


Outline
Outline

Computers Helping Scientists

Data in Science

  • Data

  • Catalogs

    Data structures for scientific knowledge

  • Equations

  • Symbolic knowledge

    Simulations and scientific reasoning

  • Do simulations

  • Architectures for sci. reasoning

  • QSIM

    Next time:

  • Probabilistic reason and graphical models


Computers cooperating with scientists
Computers cooperating with Scientists

Play to the strengths of computers

  • Fast

  • Accurate (we hope!)‏

  • Don't get bored

  • Good for simulations of complex systems!

    Avoiding their weaknesses

  • No “common sense”

    • Garbage In, Garbage out

    • Calculation bugs

  • Time to develop/maintain code

    • Use someone else's program, if exists


Don t just code it up
Don't Just Code-It-Up!

Do you really need a computer?

  • Analytical solutions are cooler

  • Integrate! (If you can)‏

  • Ordinary differential equation

    • System of eqns & derivatives w.r.t. one independ var

      • Eg. time

    • Laplace and z-transformations

    • Perturbation expansions and discrete time eqns

  • Partial differential equations

    • Another variable can change

      • For fluids: pressure, temperature, etc.

    • Separation of variables

  • See Gershenfeld “The Nature of Mathematical Modeling”


Data in science
Data in Science

You make a measurement, should record?

  • Value (numeric or conceptual)‏

  • Precision

    • For measured numbers: Mean and std deviation

    • For concepts: set identity

      “car” vs. “Toyota” vs. “Prius”

  • Domain

    • System constraints

      • Range definition limits

      • System extent limits

      • Saturation limits

    • Instruments constraints

      • Range definition limits

      • Detection limits

      • Reliable limits

    • Data constraints

      • Observed min/max w/in system extent and detect limits


Data in science 2
Data in Science (2)

You make a measurement, should record?

  • Units and Dimensions

  • When and Where

    • Speed of light expected to be observer-independent

    • Count of robins? First bloom of dandelions?

  • With what equipment

    • Eg. camera model and film type

  • By whom?

    • Implicit paradigm

    • Yesterday's Chinese “guest stars” are today's supernovas

  • Why?

    • What are you trying to look for?

    • Might hint at what your data do not show


Laboratories where the data comes from
Laboratories: Where the Data Comes From

Places where observations are made

  • Astronomical Observatories

  • Particle Physics Accelerators

    • CERN

    • FermiLab

  • Biology Labs

    • Human Genome Project

      Issues

  • Distributed observations

    • Human Genome Project

  • Collaborative observations


Catalogs
Catalogs

What is (and is not) Recorded

Distributed Nature of Catalogs

Synonyms in Catalogs

Missing and Error Values in Catalogs

Permanence of Catalogs


What is and is not recorded
What is (and is not) recorded

Not record “outside of bounds”

  • Space

    • Outside of area of interest

  • Time (maybe Money!)‏

    • Budget ran out, no more money for sensing

  • Magnitude (strength)‏

    “Outside of bounds” may be a “soft” concept

  • USGS records earthquake

    • All Mw > 7

    • Most Mw > 6

    • Has Mw≈ 5 and below close to seismometers

      Q: Is it true that the only small earthquakes are close to seismometers?


Distributed nature of catalogs
Distributed Nature of Catalogs

Distributed databases

  • Hopefully designed properly

    • Redundancy

    • Reliably online up-time

      Central repository

  • Centralizes data available from other dbs

    • Example: some biochemical dbs

  • May use

    • inconsistent terms,

    • different missing value conventions, etc

  • May be out-of-date relative to source dbs

  • Format may be “least common denominator” of more specialized formats


Synonyms and different meanings for a given word
Synonyms and Different Meanings for a Given Word

Concepts

  • ASCII or UNICODE string

    Consider the term “robin”

  • In N. America:

    • “American robin”

    • “Turdus migratorius

  • In Europe

    • “European robin”

    • “Erithacus rubecula”

  • What about these terms?

    • “thrush family bird”

    • “songbird”

    • “bird”

      Ontologies can help (if used)‏


Missing and errors values
Missing and Errors Values

Missing values often coded as sentinel values

  • There might be more than one!

  • Esp. for “required” fields

    • Clerks were required to get customer telephone #

    • Most common tele number? (111) 111-1111

      Errors values

  • Probably a bigger problem before computers

  • Computers can solve some problems:

    • Sanity rules

    • Parity, checksum (corruption), nonce (security)‏

  • Computers can add problems too

    • NASA launched satellite to look at atmospheric ozone

    • Satellite reported very low O3 levels by south pole

    • Data was not believed, BUT WAS TRUE (Ozone hole)‏


Permanence of catalog
Permanence of Catalog

How long does the media last?

  • Is the data viewed as temporary or worth archiving?

    • Is more data generated than can be stored?

    • Proprietary?

    • Subject to privacy issues?

  • Backup policy?

  • Organization that holds the data

    • Tech savvy?

    • Have money?

    • Well-organized in general?

  • Redundant sites?

    • Banks do this, but they have $ and are protecting $


The importance of prediction
The Importance of Prediction

Scientists do a lot of things:

  • Prediction underlies much

    Scientific assertions

  • Numbers: Equations

  • Concepts: Rules, decision trees, production sys

  • Probabilities: Graphical models

    • Next time!


Equations
Equations

“Strongest” form of knowledge

Defines relationships between quantities

F = ma

  • Force and acceleration acting in same direction

  • Mass is scalar

    For N variables: N-1 knowns -> 1 unknown

    Scalar or Vector

    F = ma


Equations cont d
Equations, cont'd

Are they always generalizations?

The Drake Equation:

N = R* * fp * ne * fl * fi * fc * L

Where:

N = Number of civilizations emitting radio transmissions

R* = rate of Sun-like star formation

fp = fraction of stars with planets

ne = average number of inhabitable planets

fl,fi,fc = probability of life, intelligence and civilization

L = average duration of a civilization

So far N = 0, is this equation useful?

Do we know any of these terms well?

What if N was non-zero, would the eqn be useful?


Computing with symbols
Computing with symbols

Rules

  • If dog(X) and wet(X) then smelly(X)‏

    Decision trees

    Production systems

  • Like rules, but special memory to hold what is true


Simulations
Simulations

Dimensions

  • Deterministic or stochastic?

  • Discrete or continuous?

  • Steady-state or time-varying?

  • Linear or non-linear?

  • Batch or real-time?

    Phases of development

  • Real system to logical abstraction

  • Appropriate datastructs/algorithms/math technique

  • Implement algorithm as program

  • Validate implementation


Real system to logical abstraction
Real System to Logical Abstraction

  • Define the problem

    • What are the objectives?

    • What question(s) would you like to answer?

    • What do you anticipate answering in the future?

  • When in doubt, leave detail out

    • You may not need it

      • “A theory should be as simple as it needs to be, and no simpler” A. Einstein

    • You can always add it in future

  • Spend the time turning vague statement of goals into better defined one(s)‏

    • Revisit as necessary

    • (That's just good software engineering!)‏


Appropriate datastructure algorithm math technique
Appropriate Datastructure/Algorithm/Math Technique

Setting up the model

  • Start with what's known

  • Go simple -> complex

  • Iterate

  • Develop large models modularly

  • Model only what is necessary

  • Make (and state) assumptions and hypotheses

  • State constraints

  • Alternate between top-down & bottom up

    Best data structure/algorithm?

  • Data Structures and algorithms class

    Best math technique?

  • Numerical methods class


Validation
Validation

Define some “test cases” before hand

  • Define (input,output) pairings

  • Should range in complexity

  • Should be checked:

    • By scientists, or

    • By well-known (e.g. “textbook”) cases, or

    • By hand (and then double-checked by someone else)‏

      Just good software engineering!


Implementation
Implementation

Procedural languages

Object oriented languages

Artificial Intelligence Languages

Production systems

Simulation languages


Procedural languages
Procedural Languages

Examples

  • Fortran, C

    Fast

  • Compilers map language to machine code easily

  • Decades worth of optimizing compilers

    Large body of libraries for scientific

  • Among first programming languages

  • Decades worth of optimizing libraries

    Have to think in machine terms, not domain terms


Object oriented languages
Object Oriented Languages

Examples

  • C++, Java, C#, (Eiffel, Smalltalk)‏

    Object Oriented Nature

  • Classes can represent scientific objects

  • Methods can represent scientific transitions

  • C++ can link with C libraries

    Still may be a hassle to program objects


Artificial intelligence languages
Artificial Intelligence Languages

Examples

  • Prolog, Lisp, Scheme

    Prolog: Predicate Logic (Horn Clause) based

    Facts:

    sentiveNose(sniffer).

    dog(fido). wet(fido).

    Rules:

    smelly(X) :- dog(X), wet(X).

    offendedBy(Y,X) :- smelly(X), sensitiveNose(Y).

    Possible queries:

    Is sniffer offended by fido? offendedBy(sniffer,fido).

    Does fido offend anyone? offendedBy(A,fido).

    Is sniffer offended by anyone? offendedBy(sniffer,B).

    Is anyone offended by anyone else? offendedBy(A,B).


Artificial intelligence langs 2
Artificial Intelligence Langs (2)

Lisp

  • It's all functions operating on lists:

    (defun computeProperties (object newObj)‏

    (if (and (getProperties object dog)‏

    (getProperties object wet)‏

    )‏

    (addProperty object smelly newObj)‏

    )‏

    )‏


Artificial intelligence langs 3
Artificial Intelligence Langs (3)

Advantages

  • Encourage thinking in domain space

  • Prolog: modular rules

  • Lisp: functions operating on lists good for symbolic processing

    Disadvantages:

  • Prolog: not good for floating pt numbers

    • Exact matching not compatible with float pt rounding

  • Lisp:

    • Handling loops and variables somewhat contrived


Production systems
Production Systems

Examples

  • SOAR, OPS5, Mycin?

    Models human thought

    Production sys = Rules + Working Memory

  • Rules = if-then statements

    if dog(X) and wet(X) then smelly(X)‏

    if smelly(X) and sensitiveNose(Y) then offendedBy(Y,X)‏

  • Working memory = models human memory

    dog(fido), wet(fido), sensitiveNose(sniffer)‏

  • Computation

    1st round: compute smelly(fido)‏

    2nd round: compute offendedBy(sniffer,fido)‏


Production systems cont d
Production Systems, cont'd

Advantages

  • Productions are inherently modular

  • Can have constraints on working memory elements (“wme's”)

    • Better models human cognitive limitations

  • RETE algorithm matches wmes and rules efficiently

  • Rules = laws, WMEs = state

    When rules conflict, what happens?

  • Architecture has to do something

  • OPS5: More specific rule wins

  • SOAR: Create problem space to resolve issue

  • MYCIN: Rules w/certainty factors (weighted vote)‏

  • Is what the architecture does what you want?


Simulations languages
Simulations Languages

Examples

  • QSIM, SPICE (analog circuits), Scilab

    Qualitative simulations

  • Very precise, not very general:

    d2x/dt2(t) = -9.8 m/sec2, x(t0) = 2m, v(t0) = 0 m/sec

  • Less precise, more general

    d2x/dt2(t) = -g, x(t0) = x0, v(t0) = v0

  • Even less precise, even more general

    dx/dt = M- (monotonically dec. fnc),

    x(t0) = 0 .. x0 .. infinity


Qsim 1
QSIM (1)

Values

Either landmark values

0, MAX_CAPACITY, t0, inf

Or ranges between landmark values

Not full but not empty: (0,MAX_CAPACITY),

Time before t1 but after t0: (t0,t1)‏

Variables:

Have both a value and derivative

<0,inc> <(0,MAX_CAPACITY),inc> <MAX_CAPACITY,std>

Functions:

Either monotonically increasing or decreasing

M+, M-

Derivatives: either increasing, decreasing or steady:

inc, dec, std.


Qsim 2 classic example
QSIM (2): Classic Example

U-Tube: Two tubes connected by pipe

  • Both have

    • Some initial fluid level

      amtA: 0..AMAX..inf amtB: 0..BMAX..inf

    • Pressure dependent upon level

      pressureA = M+(amtA) pressureB = M+(amtB)‏


Qsim 3
QSIM (3)

Other constraints:

  • Total fluid only in A & B, and is constant:

    amtA + amtB = total constant(total)‏

  • Flow depends on pressure difference:

    pAB = pressureA – pressureB

    flowAB = M+(pAB)‏

    d(amtB)/dt = flowAB d(amtA)/dt = -flowAB

  • Knowledge of quantities:

    total: 0..inf

    amtA: 0..AMAX..inf amtB: 0..BMAX..inf

    pressureA: 0..inf pressureB: 0..inf

    pAB: -inf..0..inf

    flowAB: -inf..0..inf

  • Correspondence between values

    pressureA & amtA: (0,0), (inf,inf)‏

    pressureB & amtB: (0,0), (inf,inf)‏

    flowAB & pAB: (-inf, -inf), (0,0), (inf,inf)‏


Qsim 4
QSIM (4)

Note correspondence w/ordinary diffy eqn

d(amtB)/dt = f(g(amtA) – h(amtB))‏

f,g,h  M+

Qualitative state is dynamic

Imagine filling tank A (ignore tank B)‏

Need to represent amtA and its derivative:

amtA(t):

t0: <0,inc>

(t0,t1): <(0,AMAX),inc>

t1: <AMAX,std>


Qsim 5 predicting behavior
QSIM (5): Predicting behavior

1. Give initial (at least some) conditions:

“Tank A full”

t = t0: amtA = <AMAX,?>

“Tank B empty”

t = t0: amtB = <0,?>


Qsim 6
QSIM (6):

2. Correspondence propagation on init state

amtB = 0 therefore pressureB = 0

amtA = AMAX therefore pressureA = (0,inf)‏

amtA=AMAX && amtB = 0

therefore total = (0,inf)‏

pressureA = (0,inf) and pressureB = 0

therefore pAB = (0,inf)‏

constant(total) therefore d(total)/dt = std

pAB = (0,inf) therefore flowAB = (0,inf)‏

flowAB = (0,inf) therefore d(amtA)/dt = dec

d(amtB)/dt = inc

d(amtA)/dt = dec therefore d(pressureA)/dt = dec.

d(amtB)/dt = dec therefore d(pressureB)/dt = dec.


Qsim 7
QSIM (7)

(Init state correspondence propagate, cont'd)‏

Additional constraint

d(pressureA)/dt = dec && d(pressureB)/dt = inc

therefore d(pAB)/dt = dec

Last propagation

d(pAB)/dt = dec therefore d(flowAB)/dt = dec

So we have at t = t0:

amtA = <AMAX,dec> pressureA = <(0,inf),dec>

amtB = <0,inc> pressureB = <0,inc>

pAB = <(0,inf), dec> flowAB = <(0,inf), dec>

Total = <(0,inf), std>


Qsim 8
QSIM (8)

3. Predicting the next state:

What events can make the next state?

  • A variable can reach a limit or landmark

  • A variable may move off a landmark

  • A variable may start or stop moving

    6 variables moving to limits/landmarks

    (theoretically): 46 - 1 = 4095 possibilities

    (actual): constraints & correspondences reduce choices

    t = (t0,t1):

    amtA = <(0,AMAX),dec> pressureA = <(0,inf),dec>

    amtB = <(0,BMAX),inc> pressureB = <(0,inf),inc>

    pAB = <(0,inf), dec> flowAB = <(0,inf), dec>

    Total = <(0,inf), std>


Qsim 9
QSIM (9)

What happens after?

t = (t0,t1):

amtA = <(0,AMAX),dec> pressureA = <(0,inf),dec>

amtB = <(0,BMAX),inc> pressureB = <(0,inf),inc>

pAB = <(0,inf), dec> flowAB = <(0,inf), dec>

total = <(0,inf), std>

Well:

amtA is decreasing toward 0 (won't get there)‏

amtB is increasing toward BMAX.

flowAB is decreasing toward 0.

So either:

  • flowAB gets to 0 AND amtB gets to BMAX, or

  • flowAB gets to 0 BEFORE amtB gets to BMAX, or

  • amtB gets to BMAX BEFORE flowAB gets to 0

    (Tank B overflows)‏


Qsim 10
QSIM (10)

  • flowAB gets to 0 AND amtB gets to BMAX:

    t = t1(a):

    amtA = <(0,AMAX),std> pressureA = <(0,inf),std>

    amtB = <BMAX,std> pressureB = <(0,inf),std>

    pAB = <0,std> flowAB = <0,std>

    Total = <(0,inf), std>

    3. amtB gets to BMAX BEFORE flowAB gets to 0

    (Tank B overflows)‏

    t = t1(c):

    amtA = <(0,AMAX),dec> pressureA = <(0,inf),dec>

    amtB = <BMAX,inc> pressureB = <(0,inf),inc>

    pAB = <(0,inf),dec> flowAB = <(0,inf),dec>

    Total = <(0,inf),std>


Qsim 11
QSIM (11)

2. flowAB gets to 0 BEFORE amtB gets to BMAX

t = t1(b):

amtA = <(0,AMAX),std> pressureA = <(0,inf),std>

amtB = <(0,BMAX),std> pressureB = <(0,inf),std>

pAB = <0,std> flowAB = <0,std>

Total = <(0,inf), std>

amtB's level becomes a new landmark so:

amtA: 0..a0..AMAX..inf

pressureA: 0..p0..inf

amtB: 0..a1..BMAX..inf

pressureB: 0..p1..inf

pAB: -inf..0..inf

flowAB: -inf..0..inf

total: 0..to0..inf


Qsim 12
QSIM (12)

Output: graph of state transitions:


Qsim 13
QSIM (13)

QSIM algorithm:

  • Init queue w/QState(t0), complete its description

  • If queue empty or exceed resource limit then stop, otherwise pop state S from queue

  • For each var vi in S get successors from table

  • Determine all successor states consistent w/restrictions

    • If S is time-interval, delete copies of itself

  • For each successor Si, assert:

    successor(S,Si) predecessor(Si,S)‏

  • Apply global filters on successors

  • Add successors to queue

    Don't add inconsistent, quiescent, cycles, transition or t = inf states

  • Go to 2


Next time
Next Time

Probabilistic reasoning

  • Graphical Models

    Machine Learning and CSD


ad