Project overview

Project Overview. Discovering Concepts Hidden in the Web Tsau Young (‘T. Y.’) Lin Computer Science Department, San Jose State University San Jose, CA 95192-0249, USA [email protected] Main results.



Project overview

Project Overview

Discovering Concepts Hidden in the Web

Tsau Young (‘T. Y.’) Lin

Computer Science Department, San Jose State University

San Jose, CA 95192-0249, USA

[email protected]


Main results

Main results

A set of documents is associated with a matrix, called the Latent Semantic Index (LSI). Treating the row vectors as points in Euclidean space (point = TFIDF values), the documents are then clustered (categorized).

A set of documents is associated with a polyhedron; the association is believed to be one-to-one.

Corollary: A set of English documents and their Chinese translations can be identified via their semantics automatically.


Main results1

Main results

A set of documents is associated with a polyhedron; the association is believed to be nearly one-to-one.

Corollary: A set of English documents and their Chinese translations can be identified via their semantics automatically.


Main results2

Main results

The identification is made by semantics, as there is no explicit correspondence between the two sets of documents.


Outline

Outline

1. Introduction

Domain: Information Ocean

Methodology: Granular Computing

Results

2. Intuitive View of Granular Computing

3. A Formal Theory

4.



Current state

Current State

  • Current search engines are syntax-based systems; they often return many meaningless web pages.

  • Cause: inadequate semantic analysis, and a lack of semantics-based organization of the information ocean.


Information ocean

Information Ocean

  • The Internet is an information ocean.

  • Navigating it requires a methodology.

  • A new methodology: Granular Computing


Granular computing a methodology

Granular Computing-a methodology

The term granular computing was first used to label a subset of Zadeh’s granular mathematics as my research area in BISC, 1996-97.

(Zadeh, L.A. (1998) Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Computing, 2, 23-25.)


Granular computing

Granular computing

Since then, it has grown into an active research area:

  • books, sessions, workshops

    (Zhong, Lin was the first independent conference using the name GrC; there have been several in JCIS)

  • IEEE task force


Granular computing1

Granular Computing

Granulation seems to be a natural problem-solving methodology deeply rooted in human thinking.

The human body has been granulated into head, neck, etc.


Granulating information ocean

Granulating Information Ocean

  • In this talk, we will explain how we granulate the semantic space of the information ocean, which consists of millions of web pages


Organizing information ocean

Organizing Information Ocean

  • How to organize the information ocean?

  • Considering the Semantics Space


Latent semantic space

Latent Semantic Space

  • A set of documents/web pages carries certain human thoughts. We will call the totality of these thoughts the latent semantic space (LSS).

  • (Recall the Latent Semantic Index (LSI).)


Classification clustering

Classification & clustering

In data mining,

  • Classification means identifying an unseen object with one of the known classes in a partition.

  • Clustering means classifying a set of objects into disjoint classes based on similarity, distance, etc.; the key point is that the classes are not known a priori.


Categorizing information

Categorizing Information

  • Multiple concepts can exist simultaneously in a single web page, so organizing web pages requires a powerful clustering method.

    (The number of concepts cannot be known a priori.)


Latent semantic space lss

Latent Semantic Space(LSS)

  • The simplest representations of LSS?

  • A Set of Keywords

  • LSI


Latent semantic index

Latent Semantic Index


Tfidf

TFIDF

Definition 1. Let $Tr$ denote a collection of documents. The significance of a term $t_i$ in a document $d_j$ in $Tr$ is its TFIDF value, calculated by the function $\mathrm{tfidf}(t_i, d_j)$, which is equivalent to the value $\mathrm{tf}(t_i, d_j) \cdot \mathrm{idf}(t_i, d_j)$. It can be calculated as

$$\mathrm{TFIDF}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \cdot \log \frac{|Tr|}{|Tr(t_i)|}$$


Tfidf1

TFIDF

where $|Tr(t_i)|$ denotes the number of documents in $Tr$ in which $t_i$ occurs at least once,

$$\mathrm{tf}(t_i, d_j) = \begin{cases} 1 + \log N(t_i, d_j) & \text{if } N(t_i, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $N(t_i, d_j)$ denotes the number of times the term $t_i$ occurs in document $d_j$, counting all its non-stop words.
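The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the toy collection `docs`, and the choice of the natural logarithm are assumptions (the slides do not state the base), and documents are assumed to be token lists with stop words already removed.

```python
import math

def tf(term, doc_tokens):
    """Log-scaled term frequency: 1 + log N if the term occurs N > 0 times, else 0."""
    n = doc_tokens.count(term)
    return 1.0 + math.log(n) if n > 0 else 0.0

def tfidf(term, doc_tokens, collection):
    """TFIDF(t, d) = tf(t, d) * log(|Tr| / |Tr(t)|), natural log assumed."""
    containing = sum(1 for d in collection if term in d)
    if containing == 0:
        return 0.0  # term occurs nowhere in the collection
    return tf(term, doc_tokens) * math.log(len(collection) / containing)

# Hypothetical toy collection; tokens are assumed to be stop-word filtered.
docs = [
    ["wall", "street", "bank", "wall"],
    ["white", "house", "street"],
    ["neural", "network", "network"],
]

wall_score = tfidf("wall", docs[0], docs)      # term unique to one document
street_score = tfidf("street", docs[0], docs)  # term shared by two documents
```

In this toy collection "wall" scores higher than "street" in the first document, because it occurs twice there and in no other document.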




Latent semantic index1

Latent Semantic Index

Treat each row as a point in Euclidean space. Clustering such a set of points is a common approach (using SVD).

Note that the points have very little to do with the semantics of the documents.


Topological space of lss

Topological Space of LSS

Euclidean space has many metrics but only one topology;

we will use this one.


Keywords 0 association

Keywords (0-Association)

1. Given by Experts

2. High TFIDF is a Keyword

  • “Wall”, “Door”. . ., “Street”, “Ave”


Keywords pairs 1 association

Keywords Pairs (1-Association)

  • 1-association

    (“Wall”, “Street”): a financial notion

    that has nothing to do with the two vertices, “Wall” and “Street”


Keywords pairs 1 association1

Keywords Pairs (1-Association)

  • 1-association

    (“White”, “House”): a notion

    that has nothing to do with the two vertices, “White” and “House”


Keywords pairs 1 association2

Keywords Pairs (1-Association)

  • 1-association

    (“Neural”, “Network”): a notion

    that has nothing to do with the two vertices, “Neural” and “Network”


Geometric analogy 1 simplex

Geometric Analogy-1- Simplex

  • (open) 1-simplex:

    (v0, v1): open segment

    (“Wall”, “Street”): financial notion

  • End points (boundaries) are not included


Keywords are abstract vertices

Keywords are abstract vertices

  • LSS of documents/web pages

     corresponds to a simplicial complex

  • A special hypergraph

  • Polyhedron: the geometric form of a simplicial complex


R association

r-Association

  • r-association

    Similarly, an r-association represents some semantics generated by a set of r+1 keywords (so that the 0- and 1-associations above are the single keywords and the keyword pairs); moreover, the semantics may have nothing to do with the individual keywords.

    There are mathematical structures that reflect such properties; see next.
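The slides do not spell out how associations are detected. One plausible sketch, assuming that a keyword set counts as a candidate association when its keywords co-occur in enough documents, is shown below; `candidate_associations`, `min_support`, and the toy documents are all hypothetical names and data.

```python
from itertools import combinations
from collections import Counter

def candidate_associations(docs, size, min_support):
    """Keyword sets of the given size that co-occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        # Sort the distinct keywords so each set has one canonical tuple form.
        for combo in combinations(sorted(set(doc)), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

# Hypothetical keyword lists, one per document.
docs = [
    ["wall", "street", "stock"],
    ["wall", "street", "bank"],
    ["white", "house", "street"],
]

# Keyword pairs (1-associations) supported by at least two documents.
pairs = candidate_associations(docs, size=2, min_support=2)
```

Here only the pair ("street", "wall") reaches the support threshold, even though "street" alone appears in all three documents.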


Topology open simplex

Topology:(Open) Simplex

  • 1-simplex: open segment (v0, v1)

  • 2-simplex: open triangle (v0, v1, v2)

  • 3-simplex: open tetrahedron (v0, v1, v2, v3)

  • No boundaries are included


Topology open simplex1

Topology: (Open) Simplex

  • An (open) r-simplex is the generalization of these low-dimensional simplexes (segment, triangle and tetrahedron) to the high-dimensional analogue in r-space (Euclidean space of dimension r)

  • Theorem. An r-simplex uniquely determines its r+1 linearly independent vertices, and vice versa


Project overview

Face

  • The convex hull of any m+1 vertices of the r-simplex is called an m-face.

  • The 0-faces are the vertices, the 1-faces are the edges, the 2-faces are triangles, and the single r-face is the whole r-simplex itself.
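Combinatorially, an m-face is a choice of m+1 of the simplex's vertices, so the faces can be enumerated directly. The sketch below works on abstract vertex labels only (no coordinates); the function name `faces` and the vertex names are illustrative assumptions.

```python
from itertools import combinations

def faces(simplex, m):
    """All m-faces of a simplex given as a tuple of vertices: subsets of m + 1 vertices."""
    return [frozenset(c) for c in combinations(simplex, m + 1)]

tetra = ("v0", "v1", "v2", "v3")  # a 3-simplex (tetrahedron)

vertices = faces(tetra, 0)   # 0-faces
edges = faces(tetra, 1)      # 1-faces
triangles = faces(tetra, 2)  # 2-faces
whole = faces(tetra, 3)      # the single 3-face
```

For the tetrahedron this yields 4 vertices, 6 edges, 4 triangles, and one 3-face, matching the binomial counts C(4, m+1).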


Project overview

Edge: a line segment where two faces of a polyhedron meet, also called a side.


N complex

n-Complex

  • A simplicial complex C is a finite set of simplices such that:

    • Any face of a simplex from C is also in C.

    • The intersection of any two simplices from C is either empty or is a face of both of them.

  • If the maximal dimension of the constituting simplices is n, then the complex is called an n-complex.
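For an abstract complex, where each simplex is just a set of keyword vertices, the definition can be checked mechanically. In that setting the intersection condition follows from closure under faces, so the sketch below tests only the first condition; the function names and sample data are assumptions made for illustration.

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check that every non-empty proper subset (face) of each simplex is also present."""
    family = {frozenset(s) for s in simplices}
    for s in family:
        for k in range(1, len(s)):
            for face in combinations(s, k):
                if frozenset(face) not in family:
                    return False
    return True

def dimension(simplices):
    """Largest simplex dimension, i.e. (number of vertices) - 1."""
    return max(len(s) for s in simplices) - 1

# A 1-complex: two keyword vertices and the edge between them.
closed = [{"Wall"}, {"Street"}, {"Wall", "Street"}]
# Not a complex: the edge is present but its vertices are missing.
broken = [{"Wall", "Street"}]
```

With these inputs, `closed` passes the check and has dimension 1, while `broken` fails because the 0-faces of its edge are absent.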


Upper closure approximations

Upper/Closure approximations

Let $B(p)$, $p \in V$, be an elementary granule.

$U(X) = \bigcup \{B(p) \mid B(p) \cap X \neq \emptyset\}$ (Pawlak)

$C(X) = \{p \mid B(p) \cap X \neq \emptyset\}$ (Lin-topology)


Upper closure approximations1

Upper/Closure approximations

$Cl(X) = \bigcup_i C^i(X)$ (Sierpiński-topology)

where $C^i(X) = C(\dots(C(X))\dots)$

(transfinite steps). $Cl(X)$ is closed.
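On a finite universe the operators on these two slides can be computed directly. The sketch below assumes the granulation is stored as a dictionary `B` mapping each point to its granule, reads U(X) as the union of all granules that meet X and C(X) as the set of points whose granule meets X, and assumes that the union defining Cl(X) starts with X itself; on a finite set the transfinite iteration reduces to a loop that stops when nothing new is added. All names and the sample granulation are hypothetical.

```python
def upper(B, X):
    """U(X): union of all granules B(p) that intersect X."""
    out = set()
    for granule in B.values():
        if granule & X:
            out |= granule
    return out

def closure_step(B, X):
    """C(X): all points p whose granule B(p) intersects X."""
    return {p for p, granule in B.items() if granule & X}

def closure(B, X):
    """Cl(X): X together with C(X), C(C(X)), ... until stable (finite V only)."""
    result = set(X)
    while True:
        expanded = result | closure_step(B, result)
        if expanded == result:
            return result
        result = expanded

# Hypothetical granulation on V = {1, 2, 3, 4}; note that B is not reflexive at 1 and 2.
B = {1: {2}, 2: {3}, 3: {3}, 4: {4}}

one_step = closure_step(B, {3})  # points whose granule meets {3}
closed = closure(B, {3})         # repeated application until stable
```

In this example a single step from {3} reaches {2, 3}, and a second step adds 1, so the closure is {1, 2, 3}.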


New view

New View

Divide (and Conquer)

Partition of set (generalize) ?

Partition of B-space

(topological partition)


New view b space

New View:B-space

The pair (V, B) is the universe, namely

an object is a pair (p, B(p))

where $B: V \to 2^V;\ p \mapsto B(p)$ is a granulation


Derived partitions

Derived Partitions

The inverse images of B form a partition (an equivalence relation):

$C = \{C_p \mid C_p = B^{-1}(B_p),\ p \in V\}$


Derived partitions1

Derived Partitions

  • Cp is called the center class of Bp

  • A member of Cp is called a center.


Derived partitions2

Derived Partitions

  • The center class Cp consists of all the points that have the same granule

  • Center class Cp = {q | Bq= Bp}
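Grouping points by their granule gives the center classes directly. This is a small sketch under the assumption that the granulation is a dictionary `B` from points to granules; the function name and the sample data are illustrative only.

```python
from collections import defaultdict

def center_classes(B):
    """Partition V into classes of points that share the same granule."""
    groups = defaultdict(set)
    for p, granule in B.items():
        groups[frozenset(granule)].add(p)
    return list(groups.values())

# Hypothetical granulation: a and b have the same granule, c has its own.
B = {"a": {"a", "b"}, "b": {"a", "b"}, "c": {"c"}}
classes = center_classes(B)
```

Because every point has exactly one granule, the resulting classes are disjoint and cover V, which is why they form a partition.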


C quotient set

C-quotient set

The set of center classes Cp is a quotient set

US, UK, . . .

Iran, Iraq. .

Russia, Korea


New problem solving paradigm

New Problem Solving Paradigm

(Divide and) Conquer

Quotient set 

Topological Quotient space


Neighborhood of center class

Neighborhood of center class

  • C (in the case B is not reflexive)

B-granule/neighborhood

C-classes

C-classes


Neighborhood of center class1

Neighborhood of center class

C-classes

B-granule

C-classes


Topological partition

Topological partition

B-granule/neighborhood

Cp -classes

Cp -classes




Topological table 2 column

Topological Table (2-column)


Future direction

Future Direction

  • Topological Reduct

  • Topological Table processing


Application 1 cwsp

Application 1: CWSP

In the UK, a financial services company may be consulted by competing companies. It is therefore vital to have a legally enforceable security policy.



Background

Background

  • Brewer and Nash (BN) proposed the Chinese Wall Security Policy model (CWSP) in 1989 for this purpose


Policy simple cwsp scwsp

Policy: Simple CWSP (SCWSP)

In "Simple Security", BN asserted that "people (agents) are only allowed access to information which is not held to conflict with any other information that they (agents) already possess."


A little formal

A Little Formal

Simple CWSP (SCWSP):

No single agent can read data X and Y that are in CONFLICT


Formal scwsp

Formal SCWSP

SCWSP says that a system is secure if

$(X, Y) \in \mathrm{CIR} \Rightarrow X \ \mathrm{NDIF}\ Y$

CIR = Conflict of Interests binary relation

NDIF = no direct information flow


Formal simple cwsp

Formal Simple CWSP

SCWSP says that a system is secure if

$(X, Y) \in \mathrm{CIR} \Rightarrow X \ \mathrm{NDIF}\ Y$

$(X, Y) \notin \mathrm{CIR} \Leftarrow X \ \mathrm{DIF}\ Y$

CIR = Conflict of Interests binary relation


More analysis

More Analysis

SCWSP requires that no single agent can read X and Y,

  • but it does not exclude the possibility that a sequence of agents may read them

    Is it secure?


Aggressive cwsp acwsp

Aggressive CWSP (ACWSP)

The Intuitive Wall Model implicitly requires: no sequence of agents can read X and Y:

A0 reads X = X0 and X1,

A1 reads X1 and X2,

. . .

An reads Xn = Y


Composite information flow

Composite Information Flow

Composite information flow (CIF) is

a sequence of DIFs, denoted by $\to$,

such that

$X = X_0 \to X_1 \to \dots \to X_n = Y$

and we write X CIF Y.

NCIF: no CIF


Composition information flow

Composition Information Flow

Aggressive CWSP says that a system is secure if

$(X, Y) \in \mathrm{CIR} \Rightarrow X \ \mathrm{NCIF}\ Y$

$(X, Y) \notin \mathrm{CIR} \Leftarrow X \ \mathrm{CIF}\ Y$


The problem

The Problem

Simple CWSP $\Rightarrow$ ? Aggressive CWSP

This is a malicious Trojan Horse problem
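A small example makes the gap between the two policies concrete. The sketch below treats direct information flows (DIF) as a set of ordered pairs, builds composite flows (CIF) as chains of one or more DIFs, and checks each policy against a conflict relation CIR. The function names, the objects X, Z, Y, and the flows between them are hypothetical.

```python
def cif(dif, objects):
    """Composite flows: pairs (x, y) linked by a chain of one or more direct flows."""
    reach = {x: {y for (a, y) in dif if a == x} for x in objects}
    changed = True
    while changed:
        changed = False
        for x in objects:
            # Everything reachable in one more step from what x already reaches.
            new = set().union(*(reach[y] for y in reach[x])) - reach[x]
            if new:
                reach[x] |= new
                changed = True
    return {(x, y) for x in objects for y in reach[x]}

def satisfies(flow, cir):
    """True if no conflicting pair (x, y) in CIR has a flow from x to y."""
    return not any(pair in flow for pair in cir)

objects = {"X", "Z", "Y"}
dif = {("X", "Z"), ("Z", "Y")}   # hypothetical direct flows
cir = {("X", "Y"), ("Y", "X")}   # X and Y are in conflict

simple_ok = satisfies(dif, cir)                    # no direct flow between X and Y
aggressive_ok = satisfies(cif(dif, objects), cir)  # but X reaches Y through Z
```

In this configuration the simple policy holds while the aggressive policy fails, because information can leak from X to Y through the intermediate object Z.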


Need acwsp theorem

Need ACWSP Theorem

  • Theorem. If CIR is anti-reflexive, symmetric and anti-transitive, then

  • Simple CWSP $\Rightarrow$ Aggressive CWSP


C and cir classes

C and CIR classes

  • CIR: Anti-reflexive, symmetric, anti-transitive

Cp -classes

CIR-class

Cp -classes


Application 2

Application 2

Association mining by Granular/Bitmap computing


Fundamental theorem

Fundamental Theorem

  • Theorem 1:

    All isomorphic relations have isomorphic patterns


Illustrations table k

Illustrations:Table K


Illustrations table k1

Illustrations: Table K’


Illustrations patterns in k

Illustrations: Patterns in K


Isomorphic 2 associations

Isomorphic 2-Associations


Canonical model

Canonical Model

  • Bitmaps in Granular Forms

  • Patterns in Granular Forms


Table k

Table K’


Illustration k gdm

Illustration: KGDM


Illustration k gdm1

Illustration: KGDM


Associations in granular forms

Associations in Granular Forms


Associations in granular forms1

Associations in Granular Forms


Fundamental theorems

Fundamental Theorems

1. All isomorphic relations are isomorphic to the canonical model (GDM)

2. A granule of GDM is a high frequency pattern if it has high support.


Relation lattice theorems

Relation Lattice Theorems

1. The granules of GDM generate a lattice of granules with join = $\cup$ and meet = $\cap$.

This lattice is called the Relational Lattice by Tony Lee (1983).

2. All elements of the lattice can be written as joins of primes (join-irreducible elements)

(Birkhoff & MacLane, 1977, Chapter 11)


Find association by linear inequalities

Find Association by Linear Inequalities

Theorem. Let $P_1, P_2, \dots$ be primes (join-irreducible elements) in the canonical model. Then

$G = x_1 * P_1 \cup x_2 * P_2 \cup \dots$

is a high-frequency pattern if

$|G| = x_1 * |P_1| + x_2 * |P_2| + \dots \geq th$

(each $x_j$ is a binary number)


Am by linear inequalities

AM by Linear Inequalities

|x1*{v1v5v6}=(20, 3rd)

+x2*{v2} =(10, 3rd)

+x3*{v3v4}=(10, 2nd)

+x4*{v7} =(20, 4th)

+x5*{v8v9} =(30, 1st)|

= x1*3+x2*1+x3*2+x4*1+ x5*2


Am by linear inequalities1

AM by Linear Inequalities

|x1*{v1v5v6}+x2*{v2}+x3*{v3v4}+x4*{v7}+x5*{v8v9}|

= x1*3+x2*1+x3*2+x4*1+ x5*2

1. x1=1

2. x2 =1, x3 =1, or x2 =1, x5 =1

3. x3 =1, x4 =1 or x3 =1, x5 =1

4. x4 =1, x5 =1
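The four cases listed above can be reproduced by brute-force enumeration of the binary vectors. The sketch below assumes a threshold of 3, which is not stated on the slide but is the value consistent with the listed solutions, and it reports only minimal solutions (those where dropping any selected prime falls below the threshold). The function name is hypothetical.

```python
from itertools import product

def minimal_solutions(sizes, threshold):
    """Binary vectors x with sum(x_j * |P_j|) >= threshold that stop qualifying if any 1 is cleared."""
    found = []
    for x in product((0, 1), repeat=len(sizes)):
        total = sum(xj * s for xj, s in zip(x, sizes))
        if total < threshold:
            continue
        # Minimal: removing any single selected prime breaks the inequality.
        if all(total - s < threshold for xj, s in zip(x, sizes) if xj):
            found.append(x)
    return found

sizes = [3, 1, 2, 1, 2]  # |P1| .. |P5| from the slide
solutions = minimal_solutions(sizes, threshold=3)  # threshold 3 is an assumption

# Report each solution as the 1-based indices of the selected primes.
chosen = sorted(tuple(j + 1 for j, xj in enumerate(x) if xj) for x in solutions)
```

Under that assumed threshold, the enumeration returns x1 alone and the pairs (x2, x3), (x2, x5), (x3, x4), (x3, x5), (x4, x5), which are the same combinations as the slide lists.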


Am by linear inequalities2

AM by Linear Inequalities

|x1*{v1v5v6}+x2*{v2}+x3*{v3v4}+x4*{v7}+x5*{v8v9}|

= x1*3+x2*1+x3*2+x4*1+ x5*2

1. x1=1

|1*{v1v5v6}| = 1*3 = 3

(20, 3rd): |{v1 v5 v6 v7} ∩ {v1 v2 v5 v6}| =

|{v1 v5 v6}| = 3


Am by linear inequalities3

AM by Linear Inequalities

|x1*{v1v5v6}+x2*{v2}+x3*{v3v4}+x4*{v7}+x5*{v8v9}|

= x1*3+x2*1+x3*2+x4*1+ x5*2

x2 =1, x3 =1, or x2 =1, x5 =1

|x2*{v2}+x3*{v3v4}| = (10, 3rd) ∪ (10, 2nd)

|x2*{v2}+x5*{v8v9}| = (10, 3rd) ∪ (30, 1st)

x3 =1, x4 =1 or x3 =1, x5 =1

x4 =1, x5 =1


Am by linear inequalities4

AM by Linear Inequalities

|x1*{v1v5v6}+x2*{v2}+x3*{v3v4}+x4*{v7}+x5*{v8v9}|

= x1*3+x2*1+x3*2+x4*1+ x5*2

x3 =1, x4 =1 or x3 =1, x5 =1

| x3*{v3v4}+x4*{v7}| = (10, 2nd) ∪ (20, 4th)

| x3*{v3v4}+x5*{v8v9}| = (10, 2nd) ∪ (30, 1st)

x4 =1, x5 =1


Am by linear inequalities5

AM by Linear Inequalities

|x1*{v1v5v6}+x2*{v2}+x3*{v3v4}+x4*{v7}+x5*{v8v9}|

= x1*3+x2*1+x3*2+x4*1+ x5*2

x4 =1, x5 =1

| x4*{v7}+x5*{v8v9}| = (20, 4th) ∪ (30, 1st)

