
Data-Driven Decision-Making

The Good, the Bad, and the Ugly

Ruda Kulhavý

Honeywell International, Inc.

Automation and Control Solutions

Advanced Technology


Can We Generate More Value from Data?

  • Today, a typical “data mining” project is ad hoc, lengthy, costly, and knowledge-intensive, and it requires on-going maintenance.

  • Although the gross benefits of such a project can be quite significant, the resulting net profit is often marginal once these costs are taken into account.

  • The industry is searching for robust methods and reusable workflows that are easy to use, adapt readily to system and organizational changes, and require no special knowledge from the end user.

  • This is a tough target … What can we offer toward it today?



Learning from Data: Probabilistic Approach


Learning from Data

  • Data: d_i(k), i = 1, …, n, k = 1, …, N

    Independent variables

    • States (disturbance vars)

    • Actions (manipulated vars)

    Dependent variables

    • Responses (controlled vars)

    • Rewards (objective functions)

  • Goal: Learn from the data how the responses and rewards depend on actions and states.

[Figure: data matrix of n variables by N observations.]

From Data to Probability

[Figure: records k = 1, …, N of a relational database table with fields A, B, C are mapped onto a data hypercube whose dimensions are those fields; each cell i = 1, …, L collects a count N_i of records (e.g., 23, 9, 12), and the empirical probability of cell i is N_i / N.]
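As a concrete illustration of this step (not from the slides), the following minimal sketch bins a handful of records into cube cells and turns the counts into an empirical probability. The field names, values, and records are invented for the example.

```python
from collections import Counter

# Toy records: (time_of_day_bin, outdoor_temp_bin, demand_bin) -- hypothetical fields.
records = [
    ("morning", "cold", "high"), ("morning", "cold", "high"),
    ("morning", "mild", "medium"), ("evening", "mild", "medium"),
    ("evening", "warm", "low"), ("night", "cold", "medium"),
    ("night", "warm", "low"), ("morning", "cold", "high"),
]

# Each distinct combination of field values is one cell of the data cube.
counts = Counter(records)          # N_i: occupancy count of cell i
N = sum(counts.values())           # total number of records

# Empirical probability r_i(N) = N_i / N for every populated cell.
empirical = {cell: n_i / N for cell, n_i in counts.items()}

for cell, p in sorted(empirical.items(), key=lambda kv: -kv[1]):
    print(f"{cell}: N_i = {counts[cell]}, r_i(N) = {p:.3f}")
```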


Probabilistic Data Mining

[Figure: workflow — the database is summarized into a data cube, which yields an empirical probability; the empirical probability is smoothed, and queries are answered by operations on the resulting probability, possibly via Monte Carlo approximation.]


What Makes Up ‘Problem Dimensionality’?

Take a discrete perspective:

  • Number of data (N)

    • N = 10^5 five-minute samples per year

  • Number of cells (L)

    • L = d^n cells, assuming n dimensions, each divided into d cells

  • Number of models (M)

    • M = d^m models, assuming m model parameters, each divided into d cells

    • Can be cut down if strong prior info is available.
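To make these orders of magnitude concrete (a quick illustration, not from the slides), the snippet below compares one year of five-minute samples with the cell count L = d^n for a few choices of d and n.

```python
# How quickly the cell count L = d**n outruns a year of data (N ~ 1e5 samples).
N = 105_120                      # five-minute samples in one year
for d in (5, 10):                # cells per dimension
    for n in (2, 4, 6, 8):       # number of dimensions
        L = d ** n
        print(f"d={d}, n={n}: L={L:>12,d} cells, "
              f"average {N / L:.4f} samples per cell")
```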



Addressing Dimensionality: Macroscopic Prediction


Macroscopic Prediction

E. T. Jaynes, Macroscopic Prediction, 1985:

  • If any macrophenomenon is found to be reproducible, then it follows that all microscopic details that were not reproduced must be irrelevant for understanding and predicting it.

  • Gibbs’ variational principle is … "predict that final state that can be realized by Nature in the greatest number of ways, while agreeing with your macroscopic information."


Boltzmann’s Solution (1877)

  • To determine how N gas molecules distribute themselves in a conservative force field such as gravitation, Boltzmann divided the accessible 6-dimensional phase space of a single molecule into equal cells, with N_i molecules in the i-th cell.

  • The cells were considered so small that the energy E_i of a molecule did not vary appreciably within a cell, but at the same time so large that each cell could accommodate a large number N_i of molecules.


Boltzmann’s Solution (cont.)

  • Noting that the number of ways this distribution can be realized is the multinomial coefficient

    W = N! / (N_1! N_2! ⋯ N_L!),

    he concluded that the “most probable” distribution is the one that maximizes W subject to the known constraints of his prior knowledge; in this case, the total number of particles and the total energy.


Boltzmann’s Solution (cont.)

  • If the numbers N_i are large, the factorials can be replaced with the Stirling approximation, giving

    log W ≈ −N ∑_i (N_i / N) log (N_i / N),

    i.e., N times the Shannon entropy of the cell frequencies.

  • The solution maximizing log W can be found by Lagrange multipliers; it is an exponential distribution

    N_i / N = C exp(−λ E_i),

    where C is a normalizing factor and the Lagrange multiplier λ is chosen so that the energy constraint is satisfied.


Why Does It Work?

E. T. Jaynes, Where Do We Stand on Maximum Entropy?, 1979:

  • Information about the dynamics entered Boltzmann’s equations at two places: (1) the conservation of total energy; and (2) the fact that he defined his cells in terms of phase volume …

  • The fact that this was enough to predict the correct spatial and velocity distribution of the molecules shows that the millions of intricate dynamical details that were not taken into account, were actually irrelevant to the predictions …


Why Does It Work? (cont.)

E. T. Jaynes, Where Do We Stand on Maximum Entropy?, 1979:

  • Boltzmann’s reasoning was super-efficient …

  • Whether by luck or inspiration, he put into his equations only the dynamical information that happened to be relevant to the questions he was asking.

  • Obviously, it would be of some importance to discover the secret of how this came about, and to understand it so well that we can exploit it in other problems …


General Maximum Entropy

  • Empirical probability mass function r(N), with components r_i(N) = N_i / N, i = 1, …, L.

  • Equivalence of probability mass functions: p ∼ q whenever

    ∑_i h_i p_i = ∑_i h_i q_i

    for a given (vector) function h ≡ (h_1, …, h_L).

  • Equivalence class containing r(N):

    E(r(N)) = { p : ∑_i h_i p_i = ∑_i h_i r_i(N) }

General Maximum Entropy (cont.)

  • Relative entropy (a.k.a. Kullback–Leibler distance)

    D(p ‖ q) = ∑_i p_i log (p_i / q_i)

  • Minimum relative entropy w.r.t. a reference s(0): minimize D(p ‖ s(0)) over all p in the equivalence class of r(N).

  • Minimum relative entropy (maximum entropy) solution

    s_i = C s_i(0) exp(λ · h_i),

    where C is a normalizing factor and λ is chosen so that ∑_i h_i s_i = ∑_i h_i r_i(N).
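The slides give the form of the solution but no computation, so here is a minimal numeric sketch (all numbers invented) of the projection for one scalar statistic h: the Lagrange multiplier λ is found by bisection on the moment-matching condition.

```python
import math

# Minimum relative entropy projection onto the equivalence class defined by a
# scalar statistic h: find s_i = C * s0_i * exp(lam * h_i) whose mean of h
# matches the empirical mean under r(N).

s0 = [0.25, 0.25, 0.25, 0.25]      # reference distribution s(0)
h = [0.0, 1.0, 2.0, 3.0]           # statistic value h_i for each cell
r = [0.40, 0.30, 0.20, 0.10]       # empirical probability r(N)
target = sum(hi * ri for hi, ri in zip(h, r))   # empirical mean of h

def tilted(lam):
    """Exponentially tilted reference: proportional to s0_i * exp(lam * h_i)."""
    w = [s0i * math.exp(lam * hi) for s0i, hi in zip(s0, h)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_h(lam):
    return sum(hi * si for hi, si in zip(h, tilted(lam)))

# mean_h is increasing in lam, so bisection finds the Lagrange multiplier.
lam_lo, lam_hi = -20.0, 20.0
for _ in range(80):
    mid = 0.5 * (lam_lo + lam_hi)
    if mean_h(mid) < target:
        lam_lo = mid
    else:
        lam_hi = mid
lam = 0.5 * (lam_lo + lam_hi)

s = tilted(lam)
print(f"lambda = {lam:.4f}")
print("projection s =", [round(si, 4) for si in s])
print(f"check: mean of h = {mean_h(lam):.4f}, target = {target:.4f}")
```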



Addressing Dimensionality: Parametric Approximation


Probability Approximation

  • Approximate the empirical probability vector r(N) with a member s(θ) of a more tractable family parameterized by a vector θ.

  • Taking a geometric perspective, this can be regarded as a projection of the point r(N) onto a surface of a lower dimension.


Maximum Likelihood

  • Exponential family S(m) with a fixed “origin” s(0), canonical affine parameter θ, directional sufficient statistic h ≡ (h_1, …, h_L), and normalizing factor C:

    s_i(θ) = C(θ) s_i(0) exp(θ · h_i)

  • Minimize the relative entropy D(r(N) ‖ s(θ)) over θ.

  • By the definition of the relative entropy, the task is equivalent to maximizing ∑_i r_i(N) log s_i(θ), i.e., to maximum likelihood estimation of θ.

Maximum Likelihood (cont.)

  • Minimum relative entropy solution

    s_i(θ) = C(θ) s_i(0) exp(θ · h_i),

    where C is a normalizing factor and θ is chosen so that the expected sufficient statistic matches the empirical one: ∑_i h_i s_i(θ) = ∑_i h_i r_i(N).



Addressing Dimensionality: Information Geometry


Dual Projections

[Figure: the maximum entropy and maximum likelihood solutions viewed as dual projections in the space of probability distributions.]


Pythagorean Geometry

[Figure: dual parametrizations of the exponential family; the equivalence class and the exponential family intersect, yielding a Pythagorean relation among the relative entropies.]


Dual Geometry

Maximum Entropy

  • The empirical probability is known with precision only up to an equivalence class.

  • The solution is found within an exponential family through a reference point.

Maximum Likelihood

  • The approximating probability is sought within an exponential family.

  • The approximation is found by projecting the empirical probability.

[Figure: the exponential family intersected by the equivalence classes.]


Bayesian Estimation

  • Posterior probability vector for models i = 1, …, M:

    p(i | data) ∝ p(i) p(data | i),  normalized so that ∑_{i=1}^{M} p(i | data) = 1.
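A minimal sketch of this step (counts and candidate models invented for the example): the posterior over a finite set of candidate cell distributions is computed as prior times multinomial likelihood, normalized in log space for numerical stability.

```python
import math

counts = [23, 9, 12]                        # N_i observed in L = 3 cells

models = {                                  # candidate probability vectors (invented)
    "uniform":  [1/3, 1/3, 1/3],
    "skewed-A": [0.6, 0.2, 0.2],
    "skewed-C": [0.2, 0.2, 0.6],
}
prior = {name: 1 / len(models) for name in models}

# Log-likelihood of the counts under each model (multinomial, constant term dropped).
log_post = {}
for name, p in models.items():
    ll = sum(n_i * math.log(p_i) for n_i, p_i in zip(counts, p))
    log_post[name] = math.log(prior[name]) + ll

# Normalize in log space to avoid underflow.
m = max(log_post.values())
Z = sum(math.exp(v - m) for v in log_post.values())
posterior = {name: math.exp(v - m) / Z for name, v in log_post.items()}

for name, q in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{name}: posterior = {q:.4f}")
```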



Addressing Dimensionality: Relevance-Based Weighting


What If the Model Is Too Complex?

  • For some real-life problems, the level of detail that needs to be collected on the empirical probability (and, correspondingly, the dimension of the exponential family) is too high, possibly infinite.

  • In such a case, we can either

    • sacrifice the closed-form solution, or

    • take a narrower view of the data,

      • modeling only the part of system behavior relevant to the problem in question,

      • while using a simpler, lower-dimensional model.


Relevance-Based Weighting of Data

  • The general idea of relevance weighting is to modify the empirical probability through reweighting,

    r_i(N) → w_i r_i(N) / ∑_j w_j r_j(N),

    where the weight vector w = (w_1, …, w_L) reflects the relevance of particular cells to the case at hand.

  • A popular choice of the weights w_i for a given “query” vector x(0) is a kernel function of the distance between the cell location x_i and the query, e.g. w_i = K(‖x_i − x(0)‖).
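The following minimal sketch (not from the slides; cell centers, counts, query point, and bandwidth are all invented) shows a Gaussian-kernel relevance weighting of the empirical probability around a query point.

```python
import math

cells = [0.0, 1.0, 2.0, 3.0, 4.0]      # cell centers x_i (one predictor dimension)
counts = [10, 25, 40, 20, 5]           # occupancy counts N_i
query = 1.5                            # query point x(0)
bandwidth = 1.0                        # assumed kernel width

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

# Relevance weights w_i = K(|x_i - x(0)| / bandwidth).
w = [gaussian_kernel(abs(x - query) / bandwidth) for x in cells]

# Plain and relevance-weighted empirical probabilities.
N = sum(counts)
r = [n / N for n in counts]
wr = [wi * ri for wi, ri in zip(w, r)]
Z = sum(wr)
r_weighted = [v / Z for v in wr]

for x, ri, rwi in zip(cells, r, r_weighted):
    print(f"x_i={x}: r_i={ri:.3f}  weighted r_i={rwi:.3f}")
```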


Local Empirical Distributions

[Figure: query-specific empirical distributions of the response variable versus the predictor variable, obtained by relevance weighting; their projections onto an exponential family give a query-independent model family.]


Local Modeling

[Figure: a relational database is summarized into a multidimensional data cube; the forecasted variable (heat demand) is modeled locally around a query point (“What if?”) specified by the explanatory variables, such as time of day and outdoor temperature.]
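As an illustration of such a query-specific forecast (not the authors' implementation; observations, units, and bandwidths are invented), the sketch below uses a kernel-weighted average of past observations, a simple Nadaraya-Watson estimate, to answer a "What if?" query on heat demand.

```python
import math

history = [
    # (hour of day, outdoor temp [C], heat demand [MW]) -- invented observations
    (6, -5.0, 95.0), (6, 2.0, 80.0), (12, -3.0, 70.0),
    (12, 5.0, 55.0), (18, -8.0, 105.0), (18, 1.0, 85.0),
    (23, -2.0, 75.0), (23, 6.0, 60.0),
]

def predict(hour, temp, bw_hour=3.0, bw_temp=3.0):
    """Kernel-weighted (Nadaraya-Watson) estimate of demand at the query point."""
    num, den = 0.0, 0.0
    for h, t, d in history:
        u = ((h - hour) / bw_hour) ** 2 + ((t - temp) / bw_temp) ** 2
        w = math.exp(-0.5 * u)          # Gaussian relevance weight
        num += w * d
        den += w
    return num / den

# "What if?" query: 7 a.m. at -4 C outdoors
print(f"forecast: {predict(7, -4.0):.1f} MW")
```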


Multiple Forecasting Applications

  • Electricity Loads

  • Heat Loads

  • Gas Loads

  • Process Yields


Data-Centric Technology

  • Regression: continuous target variable (product demand, product property, performance measure) as a function of state and/or action; model built around a query point and its neighborhood.

  • Classification: categorical target variable (discrete event, system fault, process trip) as a function of state and/or action; model built around a query point and its neighborhood.

  • Novelty Detection: tested variable (corrupt values, unusual responses, new behavior) over state and/or action; past and new data compared around a tested point and its neighborhood.

  • Optimization: reward (operating profit, production cost, target matching) as a function of the current state (operating conditions) and the action (decision).


Increasingly Popular Approach

  • Statistical Learning

    • Locally-Weighted / Nonparametric Regression

      • Cleveland (Bell Labs)

      • Vapnik (AT&T Labs)

  • Artificial Intelligence

    • Lazy / Memory-Based Learning

      • Moore (Carnegie Mellon University)

      • Bontempi (University of Brussels)

  • System Identification

    • Just-in-Time / On-Demand Modeling

      • Cybenko (Dartmouth College)

      • Ljung & Stenman (Linköping University)


How Do Humans Solve Problems?

  • Sales Rep: “Focus on recent experience!”

  • Expert: “Take everything into account!”

  • Engineer: “Use relevant information!”


Corresponding Technologies

  • Adaptive Regression (focus on recent experience)

  • Neural Network (take everything into account)

  • Local Regression (use relevant information)


Pros and Cons

Adaptive Regression

  • Pros: simple adaptation, fast computation, data compression

  • Cons: no actual learning, local description only

Neural Network

  • Pros: global description, fast lookup, data compression

  • Cons: slow learning, interference problem, lack of adaptation, difficult to interpret

Local Regression

  • Pros: minimum bias, inherent adaptation, easy to interpret

  • Cons: no compact model, no data compression, slower lookup



Addressing Dimensionality: No Locality in High Dimension?


Limits of Local Modeling

  • As the cube dimension n increases, it becomes increasingly difficult to do relevance weighting, similarity search, neighborhood sizing …

  • The volume of a unit hypersphere becomes a vanishing fraction of the volume of a unit hypercube.

  • The length of the diagonal (√n) of a unit hypercube goes to infinity.

  • The hypercube increasingly resembles a spherical “hedgehog” (with 2^n spikes).

  • When uniformly distributed, most data appear near the cube edges.
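A quick numeric illustration of these geometric claims (not from the slides): the ball inscribed in the unit cube occupies a rapidly vanishing fraction of the cube's volume, while the cube diagonal √n keeps growing.

```python
import math

def inscribed_ball_volume(n):
    """Volume of a ball of radius 1/2 in n dimensions."""
    return (math.pi ** (n / 2)) / math.gamma(n / 2 + 1) * (0.5 ** n)

for n in (2, 3, 5, 10, 20):
    ratio = inscribed_ball_volume(n)          # the unit cube has volume 1
    diag = math.sqrt(n)
    print(f"n={n:>2}: ball/cube volume ratio = {ratio:.2e}, diagonal = {diag:.2f}")
```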


No “Local” Data in High Dimensions

[Figure: retrieved data ratio versus cube edge ratio (e.g., 10, 100) for data living on an embedded surface of dimension 1, 2, or 3.]

  • However, in most real-life problems, the data is anything but uniformly distributed.

  • Thanks to technology design, integrated control & optimization, and human supervision, the actual number of degrees of freedom is often quite limited.


Local Modeling Revisited

[Figure: query point neighborhood defined over an embedded manifold.]

  • Exploit data dependence structure.

    • “Divide and conquer” approach.

    • Compare p(x1) · p(x2) against p(x1, x2).

    • Make use of Markovian property.

  • Discover low-dimensional manifolds on which the data live.

    • Feature selection.

    • Cross-validation.
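One way to operationalize the comparison of p(x1) · p(x2) against p(x1, x2) mentioned above (an illustrative sketch with invented data, not the authors' method) is an empirical mutual information estimate from a two-way contingency table; it is zero exactly when the joint factorizes into the marginals.

```python
import math
from collections import Counter

# Toy paired observations of two discretized variables x1, x2 (invented data).
pairs = [("lo", "A"), ("lo", "A"), ("lo", "B"), ("hi", "B"),
         ("hi", "B"), ("hi", "B"), ("hi", "A"), ("lo", "A")]

N = len(pairs)
joint = Counter(pairs)                         # counts for p(x1, x2)
m1 = Counter(x1 for x1, _ in pairs)            # counts for p(x1)
m2 = Counter(x2 for _, x2 in pairs)            # counts for p(x2)

# Empirical mutual information between x1 and x2.
mi = 0.0
for (a, b), n_ab in joint.items():
    p_ab = n_ab / N
    p_a, p_b = m1[a] / N, m2[b] / N
    mi += p_ab * math.log(p_ab / (p_a * p_b))

print(f"empirical mutual information: {mi:.4f} nats")
```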


Local Modeling Revisited

  • Make use of multiple modes in data.

    • Tree of production or operating modes.

    • Definition of similar modes over the tree.

  • Analyze patterns in how the cube cells are populated with data, including the occupancy numbers.

    • Estimate the probabilities of symbols generated by an information source, given an observed sequence of symbols.

    • Symbols are defined by cube cell labels, in a proper encoding.


Cube Encoding

For every pair of populated cells i, i′, there exists a natural number n relating their labels: the labels differ by n times a fixed step (cf. the general “linear” case below).


General “Linear” Case

  • There exist m numbers D_1, D_2, …, D_m such that for every two populated cells i, i′, the absolute difference of the cell labels can be expressed as a weighted sum of the numbers D_1, D_2, …, D_m, where the corresponding weights n_1, n_2, …, n_m are natural numbers: |i − i′| = n_1 D_1 + n_2 D_2 + ⋯ + n_m D_m.

  • The number m defines the dimension of a “hyperplane” cutting the cube, on which the data live.
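A brute-force way to test this “linear” structure on a set of populated cell labels (an illustrative sketch with invented labels and candidate steps, not the authors' algorithm):

```python
from itertools import combinations

def representable(diff, steps, max_coeff=50):
    """Can diff be written as n1*D1 + ... + nm*Dm with non-negative integer weights?"""
    reachable = {0}
    for d in steps:
        new = set()
        for base in reachable:
            k, val = 0, base
            while val <= diff and k <= max_coeff:
                new.add(val)
                val += d
                k += 1
        reachable = new
    return diff in reachable

# Invented example: populated cell labels lying on a 2-D "hyperplane" through the cube.
labels = [3, 10, 17, 52, 59, 101]
steps = [7, 49]            # candidate D1, D2

ok = all(representable(abs(a - b), steps) for a, b in combinations(labels, 2))
print("all pairwise differences representable:", ok)
```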


Symbolic Forecasting

The cube-encoding condition acts as a sequence template: for every pair of populated cells i, i′, there exists n such that the cell labels differ by the admitted steps.


Symbolic Forecasting

More questions than answers at the moment:

  • What are proper “model” functions capturing population patterns and occupancy numbers?

  • What is a proper way of approaching the problem?

    • Coding theory?

    • Algebraic geometry?

    • Harmonic analysis?

  • Quantization error …

  • Discrete to continuous transition …



Hypothesis Formulation …

Two of the world’s leading economists present quite distinct views of globalization in their new books:

  • Joseph Stiglitz

    Globalization and Its Discontents

  • Jagdish Bhagwati

    In Defense of Globalization


Feature Selection …

The Wall Street Journal Europe, Dec 2, 2002

Globalization Stirs Debate at U.S. Universities:

  • In Latin America, Mr. Stiglitz says, growth in the 1990s was slower, at 2.9% a year, than it was during the days of trade protectionism in the 1960s, when the region’s annual growth rate was about 5.4%.

  • Mr. Bhagwati argues … that women’s wages in many developing countries have increased as multinational investment has risen.


Training Data Selection …

The Wall Street Journal Europe, Dec 2, 2002

Globalization Stirs Debate at U.S. Universities:

  • Mr. Stiglitz cites a World Bank study showing that the number of people living on less than $2 a day increased by nearly 100 million during the booming 1990s.

  • Mr. Bhagwati argues that the number of people living on less than $2 a day declined by nearly 500 million between 1976 and 1998.


Decision Support Rather Than Automation

  • Since there are more ways of phrasing a complex question, multiple answers are more likely than a single, “simple” one.

  • Is globalization a good or bad thing?

  • Should a company make an acquisition?

  • Should a vendor introduce a new product?

  • Should a production plant respond to a market opportunity?

  • What will the demand for natural gas in a country be 5 years from now?

[Figure: decision loop. The decision maker submits a hypothesis to the decision support system; the system confronts it with the data and returns plausible explanations with their goodness of fit, providing consistent feedback to the decision maker.]


Humans To Stay in Control

  • At the moment, computerized data analysis is more likely to be delivered as decision support than as closed-loop control.

  • Success depends to a large extent on effective interaction between humans and computers.

  • For the foreseeable future, the formulation of hypotheses and the interpretation of results are likely to stay with people.

  • Commercial decision support software should support a typical usage scenario.

