What Semantic Web researchers need
This presentation is the property of its rightful owner.
Sponsored Links
1 / 151

What Semantic Web researchers need to know about Machine Learning? PowerPoint PPT Presentation


  • 89 Views
  • Uploaded on
  • Presentation posted in: General

What Semantic Web researchers need to know about Machine Learning? http://analytics.ijs.si/events/Tutorial-MachineLearningForSemanticWeb-ISWC2007-Busan-Nov2007/ Marko Grobelnik, Bla ž Fortuna, Dunja M ladeni ć. How Semantic Web compares to Machine Learning? (1/2).

Download Presentation

What Semantic Web researchers need to know about Machine Learning?

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


What semantic web researchers need to know about machine learning analytics ijs si

What Semantic Web researchers need

to know about Machine Learning?

http://analytics.ijs.si/events/Tutorial-MachineLearningForSemanticWeb-ISWC2007-Busan-Nov2007/

Marko Grobelnik, Blaž Fortuna,Dunja Mladenić


How semantic web compares to machine learning 1 2

How Semantic Web compares to Machine Learning? (1/2)

  • Semantic Web is traditionally non-analytic area of research…

    • …it includes a lot of logic, some language technologies, some standardization etc.

    • Generally, manual encoding of knowledge is still appreciated the most…

    • …where the main paradigm is top-down.

    • Top-down approaches enable accuracy when designing knowledge bases, but can cost a lot and are mostly not scalable


How semantic web compares to machine learning 2 2

How Semantic Web compares to Machine Learning? (2/2)

  • Machine Learning is on the other hand almost exclusively analytic & bottom-up…

    • …where analytic techniques have the main goal to extract knowledge from the data…

    • …with a little or no human involvement.

    • The cost per encoded piece of knowledge is lower, achievable scale is high, …

    • …but, the techniques are limited to what is computable in a reasonable time.


What are stitching points between sw and ml

What are stitching points between SW and ML?

  • …SW can contribute to ML and vice versa

  • Machine Learning for Semantic Web

    • …decrease costs when extracting knowledge and building ontologies, speeding-up reasoning, scalability

  • Semantic Web for Machine Learning

    • …Semantic Web techniques can provide new insights into the data (e.g. via reasoning) which are unobservable by statistical means


Part 1 understanding machine learning in simple terms

Part 1: Understanding Machine Learning in simple terms?

Sub-areas of machine learning

What are constituents of a machine learning algorithm?

What kind of problems one could try to solve with ML methods?

Why ML is easy and why it is hard?


Sub areas of machine learning

Sub-areas of machine learning

  • Data-Mining

    • Main events: ACM KDD, IEEE ISCM, SIAM SDM

    • Books: Witten and Frank, 1999

  • AI style Machine Learning

    • Main events: ICML, ECML/PKDD

    • Books: Mitchell, 1997

  • Statistical Machine Learning

    • Main events: NIPS, UAI

    • Books: Bishop 2007, Duda & Hard & Stork 2000

  • Theoretical Machine Learning

    • Main events: COLT, ALT

    • Books: Vapnik 1999


Which areas are contributing

Which areas are contributing?

Learning with

different

representations

Analysis of Data bases

Data Mining

AI Style ML

Theory on learning

Statistical data

analysis

Machine Learning

COLT

Statistics


What is learning in machine learning

What is “Learning” in Machine Learning?

  • Herbert Simon: “Learning is any process by which a system improves performance from experience.”

  • Improve on task T, with respect to performance metric P, based on experience E:

    • T: Recognizing hand-written words

    • P: Percentage of words correctly classified

    • E: Database of human-labeled images of handwritten words

    • T: Driving on four-lane highways using vision sensors

    • P: Average distance traveled before a human-judged error

    • E: A sequence of images and steering commands recorded while observing a human driver

    • T: Categorize email messages as spam or legitimate

    • P: Percentage of email messages correctly classified

    • E: Database of emails, some with human-given labels


What are constituents of a machine learning algorithm

What are constituents of a machine learning algorithm?

  • Typical ML algorithm includes:

    • Input data in some kind of form (most often vectors)

    • Input parameters (settings)

    • Fitting data using some search algorithm into a model

    • Output the model in the form of y=f(x)

    • Use the model for classification of unseen data

  • …the key elements are:

    • Language of the model – determines complexity of the model

      • …popular language for many algorithms is linear model (used in SVM, Perceptron, Bayes, ...): y = a x1 + b x2 + c x3 + …

    • Search algorithm – determines the quality of a result

      • …most often it is some kind of local optimization


What kind of problems one could try to solve with ml methods

What kind of problems one could try to solve with ML methods?

  • Generally, machine learning is about finding patterns and regularities in the data…

  • …the result is always a model which could be understood as a summary of the data used for learning model

  • The resulting model can be used for:

    • Explaining existing phenomena in the data

    • Predicting future situations


Why ml is easy and why it is hard

Why ML is easy and why it is hard?

  • ML is easy:

    • …because it seems the God is not playing dice and the Nature usually behaves nicely and is not chaotic

    • With ML techniques we try to discover the rules how the nature was generating the observed data.

    • Simple algorithmic approaches can bring a lot of success when understanding the observed data.

  • ML is hard:

    • …because we don’t know the language the God is using when programming the Universe

    • There are many different ways to represent similar concepts…

    • …we need to deal with scale, noise, dynamics, etc and is hard to find the right way how to model the observed data


Part 2 main approaches in machine learning

Part 2: Main approaches in Machine Learning

Supervised learning

Semi-supervised learning

Unsupervised learning


When to apply each of the approaches

When to apply each of the approaches?

  • …let’s take the example from medicine:

    • we have patients which have symptoms (inputs) and diagnosis (output)

  • Supervised learning (classification)

    • …given symptoms and corresponding diagnoses for many patients, the goal is to find rules which can map/predict symptoms to diagnosis for an unseen patient

  • Semi-supervised learning (transduction, active learning)

    • …given symptoms and corresponding diagnoses for only a few symptoms, leverage these to find most probable diagnoses for all the patients

  • Unsupervised learning (clustering, decompositions)

    • …given only symptoms for many patients, find the groups of similar patients (with e.g. possibly similar diagnoses).


Supervised learning

Supervised learning

Assign an object to a given finite set of classes:

  • Medical diagnosis

    • …assign diagnosis to a patient

  • Credit card applications or transactions

    • …assign credit score to an applicant

  • Fraud detection in e-commerce

    • …decide about fraud or non-fraud event in a business process

  • Financial investments

    • …decide whether to buy or sell or hold on a stock exchange

  • Spam filtering of e-mails

    • …decide if an email is a spam or a regular email

  • Recommending articles in a newspaper

    • …decide if an article fits the user profile

  • Semantic/linguistic annotation

    • …assign semantic or linguistic annotation to a word or phrase


Recommending news articles

Recommending News Articles

???

new article

(test example)

Machine Learning

Classifier

labeled articles

(training examples)

predicted article class

(interesting or not)


Supervised learning1

Supervised learning

  • Given is a set of labeled examples represented by feature vectors

  • The goal is: to build a model approximating the target function which would automatically assign right label to a new unlabeled example

  • Feature values:

    • discrete (eg., eyes_color  {brown, blue, green})

    • continuous (eg., age  [0..200])

    • ordered (eg., size {small, medium, large})

  • Values of the target function – labels:

    • discrete (classification) or continuous (regression)

    • exclude each other (eg., medical diagnosis) or not (eg., document content can be about arts and computer science)

    • have some predefined relations (taxonomy of document categories, e.,g., DMoz or Medline)

  • The target function can be represented in different ways (storing examples, symbolic, numerical, graphical,…) and modeled by using different algorithms


Illustrative example

Illustrative example

Recommending cartoon for a 5-year old boy

  • Main characters  {vehicle, human, animal}

  • Duration [5..90]


Target function

Target function

There is a trade-off between the expressiveness of a representation and the ease of learning

  • The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function

  • Illustrative example

  • Values of the target function:discrete labels (classification), exclude each other

    • Interesting movie:


Possible data visualization

Possible data visualization

Possible Model

for not interesting

(vehicles = no) and

(human = yes)

human = yeshuman = no

vehicles = yesvehicles = no


Generalization

Generalization

  • Model must generalize the data to correctly classify yet unseen examples (the ones which don’t appear in the training data)

  • Lookup table of training examples is a consistent model that does not generalize

    • An example that was not in the training data can not be classified

      Occam’s razor:

    • Finding a simple model helps ensure generalization


Ockham occam s razor

Ockham (Occam)’s Razor

  • William of Ockham (1295-1349) was a Franciscan friar who applied the criteria to theology:

    • “Entities should not be multiplied beyond necessity” (Classical version but not an actual quote)

    • “The supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” (Einstein)

  • requires a precise definition of simplicity

  • assumes that nature itself is simple


Algorithms for learning classification models

Algorithms for learning classification models

Storing examples

  • Nearest Neighbour

    Symbolic

  • Decision trees

  • Rules in propositional logic or first order logic

    Numerical

  • Perceptron algorithm

  • Winnow algorithm

  • Support Vector Machines

  • Logistic Regression

    Probabilistic graphical models

  • Naive Bayesian classifier

  • Hidden-Markov Models


Nearest neighbor

Nearest neighbor

  • Storing training examples without generating any generalization

    • Simple, requires efficient storage

  • Classification by comparing the example to the stored training examples and estimating the class based on classes of the most similar examples

    • Similarity function is crucial

      Also known as:

    • Instance-based, Case-based, Exemplar-based, Memory-based, Lazy Learning


Similarity distance

Similarity/Distance

  • For continuous features use Euclidian distance

  • For discrete features, assume distance between two values is 0 if they are the same and 1 if they are different (eg., Hamming distance for bit vectors).

  • To compensate for difference in units across features, scale all continuous values to the interval [0,1].


Nearest neighbor1

Nearest neighbor


Nearest neighbor example

Nearest neighbor - example

Model

Model

Features


Nearest neighbor example1

Nearest neighbor - example

Distance:

1+0+1+1=3

1+1+1+1=4

1+1+0+1=3

1+1+0+0=2

1+1+1+1=4

1+0+1+1=3

1+0+1+1=3

1+0+1+1=3


Decision trees

Decision trees

  • Recursively splitting set of examples until (almost) all examples in the set have the same class value

    • starts with a set of all examples

    • find a feature giving the best split

    • split the examples according to the feature values

    • repeat for each subset of examples

  • Classification by “pushing” the example from root to leaf, assign class value of the leaf


Decision tree example

Decision tree - example

duration

characters

5

3

5

3

vehicle

≤10

animals

human

90≥

10<, ≤ 30

30 <,<90

3

3

2

2

2

1

1

2

= 0.63*0.68+0.38*0.53

= 0.95

= 0.33*1.58+0.67*0.58

= 0.91

InfGain=0.95-(0+0+0)=0.95

InfGain=0.95-(0+0+0.38*0.91+0)=0.61


Decision tree example1

Decision tree - example

characters

5

3

vehicle

animals

human

3

3

2


Decision tree model

Decision tree model


If then rules

If-then Rules

Generating rules by adding conditions:

  • Λ – restricting the rule (less examples match)

  • V – generalizing the rule (more examples match)

  • Maximize quality of each rule (eg., matching examples are of the same class) while aiming at describing all the examples with a set of rules


  • If then rules1

    If-then Rules

    characters

    5

    3

    vehicle

    animals

    human

    3

    3

    2

    Converting a tree to rules

    • If (Main characters = vehicles) then interesting

    • If (Main characters = human) then uninteresting

    • If (Main characters = animals) then interesting

    • :


    Support vector machine

    Support Vector Machine

    Learns a hyperplane in higher dimensional space

    that separates the training data and

    gives the highest margin

    Implicit mapping of the original feature space into higher dimensional space

    mapping using so called kernel function (eg., linear, polynomial, …)

    Regarded as state-of-the-art in text document classification

    Positive class

    Margin

    Negative class

    Hyperplane


    What semantic web researchers need to know about machine learning analytics ijs si

    Linear Model


    Na ve bayes

    Naïve Bayes

    Determine class of example ek by estimating

    • P(ci) – estimate from the data using frequency: no. of examples with class ck / no. of all examples

    • P(ek|ci) – too many possibilities (all combinations of feature values)

      • assume feature independence given the class


    Na ve bayes example

    Naïve Bayes - example

    Model

    P( | ) → 5/8*(0.4*0.2) = 0.05

    P( | ) → 3/8*(0*0) = 0

    P(vehicle| )=0.6

    P(vehicle| )=0

    P(human| )=0

    P(human| )=1

    P(animal| )=0.4

    P(animal| )=0

    P( )=5/8

    P( )=3/8

    P(5| )=0.2

    P(5| )=0

    P(10| )=0.2

    P(10| )=0

    P(20| )=0

    P(20| )=0.33

    P(30| )=0

    P(30| )=0.33

    P(60| )=0.2

    P(60| )=0

    P(90| )=0.4

    P(90| )=0.33


    Generative probabilistic models

    Generative Probabilistic Models

    • Assume a simple (usually unrealistic) probabilistic method by which the data was generated

    • Each class value has a different parameterized generative model that characterizes it

    • Training: Use the data for each category to estimate the parameters of the generative model for that category.

      • Maximum Likelihood Estimation (MLE): Set parameters to maximize the probability that the model produced the given training data

      • If Mλ denotes a model with parameter values λ and Dk is the training data for the kth class, find model parameters for class k (λk) that maximize the likelihood of Dk:

    • Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.


    Semi supervised learning

    Semi-supervised learning

    Similar to supervised learning except that

    • we have examples and only some of them are labeled

    • we may have a human available for a limited time to provide labels of examples

      • …this corresponds to the situation where all the patients in the database have symptoms, but only a few have diagnosis

      • …and occasionally we have doctors for a limited time to respond the questions about the patients


    Using unlabeled data nigam et al 2000

    Using unlabeled data(Nigam et al., 2000)

    • Given: a small number of labeled examples and a large pool of unlabeled examples, no human available

      • e.g., classifying news article as interesting or not interesting

    • Approach description (EM + Naive Bayes):

      • train a classifier with only labeled documents,

      • assign probabilistically-weighted class labels to unlabeled documents,

      • train a new classifier using all the documents

      • iterate until the classifier remains unchanged


    Using unlabeled data with expectation maximization em

    Using Unlabeled Data with Expectation-Maximization (EM)

    E-step: Estimate labels of unlabeled documents

    Initialize: Learn from labeled only

    Naive Bayes

    M-step: Use all documents to rebuild classifier

    Guarantees local maximum a posteriori parameters


    Co training blum mitchell 1998

    Co-training (Blum & Mitchell, 1998)

    Theory behind co-training

    • Possible to learn from unlabeled examples

    • Value of unlabeled data depends on

      • How (conditionally) independent are the two representations of the same data

        • The more the better

      • The number of redundant inputs (features)

        • Expected error decreases exponentially with this number

    • Disagreement on unlabeled data predicts true error

      Better performance on labelling unlabeled data compared to EM approach


    Bootstrap learning to classify web pages

    Bootstrap Learning to Classify Web Pages

    few labeled and many unlabeled

    Page Classifier

    Link Classifier

    Document

    Given: set of documents where each

    document is described by two independent

    sets of features (e.g. document text + hyperlinks anchor text)

    Hyperlink to

    the document


    Active learning

    Active Learning


    Active learning1

    Active Learning

    • We use this methods whenever hand-labeled data are rare or expensive to obtain

    • Interactive method

    • Requests only labeling of “interesting” objects

    • Much less human work needed for the same result compared to arbitrary labeling examples

    Data &

    labels

    Teacher

    passive student

    query

    Teacher

    active student

    label

    Active student

    asking smart

    questions

    performance

    Passive student

    asking random

    questions

    number of questions


    Approaches to active learning

    Approaches to Active Learning

    • Uncertainty sampling (efficient)

      • select example closest to the decision hyperplane (or the one with classification probability closest to P=0.5) [Tong & Koller 2000]

    • Maximum margin ratio change

      • select example with the largest predicted impact on the margin size if selected [Tong & Koller 2000]

    • Monte Carlo Estimation of Error Reduction

      • select example that reinforces our current beliefs [Roy & McCallum 2001]

    • Random sampling as baseline

    • Experimental evaluation (using F1-measure) of the four listed approaches shown on three categories from Reuters-2000 dataset

      • average over 10 random samples of 5000 training (out of 500k) and 10k testing (out of 300k)examples

      • two of the methods a rather time consuming, thus we run them for including the first 50 unlabeled examples

      • experiments show that active learning is especially useful for unbalanced data


    What semantic web researchers need to know about machine learning analytics ijs si

    Category with very unbalanced class distribution having 2.7% of positive examples

    Uncertainty seems to outperform MarginRatio


    Illustration of active learning

    Illustration of Active learning

    • starting with one labeled example from each class (red and blue)

    • select one example for labeling (green circle)

    • request label and add re-generate the model using the extended labeled data

      Illustration of linear SVM model using

    • arbitrary selection of unlabeled examples (random)

    • active learning selecting the most uncertain examples (closest to the decision hyperplane)


    What semantic web researchers need to know about machine learning analytics ijs si

    Uncertainty sampling of unlabeled example


    Unsupervised learning

    Unsupervised learning

    • Given is a set of examples

    • The goal is: to cluster the examples into several groups based on some similarity measure

      • examples inside the group should be similar while examples between the groups should be different

        Similarity measure plays a crucial role:

        Cosine Jaccard

        Minkowsky (norm k)

        Manhattan (k=1), Euclidean (k=2 )


    Clustering methods

    Clustering methods

    • Hierarchical

      • agglomerative – at each step merge two or more groups

      • divisive – at each step break the selected group into two or more groups

    • Non hierarchical

      • requires specification of the number of clusters

      • optimization of the initial clustering (e.g., maximize similarity of examples inside the same group)

    • Geometrical

      • map multidimensional space into two- or three-dimensional (e.g., principal component analysis)

    • Graph-theoretical


    K means clustering algorithm

    K-Means clustering algorithm

    • Given:

      • set of examples (e.g., TFIDF vectors of documents),

      • distance measure (e.g., cosine)

      • K (number of groups)

    • For each of K groups initialize its centroid with a random document

    • While not converging

      • Each document is assigned to the nearest group (represented by its centroid)

      • For each group calculate new centroid (group mass point, average document in the group)


    Example of k means clustering

    Example of k-means clustering

    • Examples:

    • A: 1,0,1,0,1

    • B: 1,0,0,0,1

    • C: 1,0,1,0,0

    • D: 0,0,0,1,0

    • E: 0,1,0,1,0

    • K=2

    • Randomly select two examples, e.g., A, D to be representatives of two clustersI: A, II: D

    • Calculate similarity of other examples to the them

      B,I= 0.82, B,II= 0, C,I= 0.82, C,II= 0, E,I= 0, E,II= 0.7

    • Assign examples to the most similar clusterI: (A,B,C) II: (D,E)

    • Calculate the cluster centroid I: 1,0,0.67,0,0.67 II: 0,0.5,0,1,0

    • Calculate similarity of all the examples to the centroidsA,I= 0.88, A,II= 0, B,I= 0.77, B,II= 0, C,I= 0.77, C,II= 0,D,I= 0, D,II= 0.82, E,I= 0, E,II= 0.87

    • Cluster the examplesI: (A,B,C) II: (D,E)

    • Stop as the clustering got stabilized


    Example of hierarchical clustering bisecting k means

    Example of hierarchical clustering(bisecting k-means)

    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

    0, 1, 2, 4, 6, 7, 9, 10, 11

    3, 5, 8

    0, 2, 4, 7, 10, 11

    1, 6, 9

    3, 8

    5

    0, 2, 4, 7, 11

    10

    1, 9

    6

    3

    8

    2, 4, 11

    0, 7

    1

    9

    2, 11

    4

    0

    7

    2

    11


    Latent semantic indexing

    Latent Semantic Indexing

    • LSI is a statistical technique that attempts to estimate the hidden content structure within documents:

      • …it uses linear algebra technique Singular-Value-Decomposition (SVD)

      • …it discovers statistically most significant co-occurrences of terms


    Lsi example

    LSI Example

    Original document-term mantrix

    Rescaled document matrix,

    Reduced into two dimensions

    High correlation although

    d2 and d3 don’t share

    any word

    Correlation matrix


    Data one can analyze with ml methods

    Data one can analyze with ML methods

    • Structured, tabular data

    • Textual data

    • Multilingual data

    • Social networks

    • Images/videos

    • Relational data

    • Temporal data

    • Cross-modal analysis

    • Very large datasets


    Structured tabular data

    Structured, tabular data

    Table, where each example is a raw, each feature is a column with predefined values

    feature

    example

    .

    .

    .

    • Main characters  {vehicle, human, animal}

    • Duration  [5..90]


    Textual data

    Textual data

    Having a set of documents, represent each as a feature vector:

    • divide text into units (eg., words), remove punctuation, (remove stop-words, stemming,…)

    • each unit becomes a feature having numeric weight as its value (eg., number of occurrences in the text - referred to as term frequency or TF)

      Commonly used weight is TFIDF:

    • tf(w) – term frequency (no. of occurrences of wordwin document dokumentu)

    • df(w) – document frequency (no. of documents containing word w)

    • N – no. of all documents


    Textual data example

    Textual data - example

    Bob the builder is a children animated movie on a character Bob and his friends that include several vehicle characters. They face challenges and jointly solve them, such as, repair a roof or save Bob’s cat from a tall tree…

    Pixar has several short animated movies suitable for children. Locomotion is one of them showing train engine and a train wagon as two characters that face a challenge of crossing a half-broken bridge…

    Simpson family provokes a smile on many adult and children faces showing everyday life of a family of four…


    Perceptron learning algorithm on document classification

    Perceptron learning algorithm on (document classification)

    Input:

    set of examples (documents)in the form of vectors; each labeled as +1 or -1

    Output:

    linear model wi(one weight per word)

    Algorithm:

    • initialize the model wi by setting word weights = 0

    • iterate through documents N times

    • For document d from D

      • // Using current model wi classify the document d

      • If sum(di *wi) >= 0 then classify document as positive

      • Else classify document as negative

      • If document classification is wrong then

        • // adjust weights of all words occurring in the document

        • wt+1 =wt +sign(true-class) * Beta (input parameter Beta>0)

        • // where sign(positive) = 1 and sign(negative) =-1


    Multilingual data

    Multilingual data

    • Text in several natural languages

    • Perform machine learning and retrieval on textual data regardless the language differences

    • Approach:

      • Machine Translation (on sentence level)

      • Multilingual lexicon (on word level)

      • Mapping into semantic space (on word level, eg., KCCA)


    Kcca to handle multilingual data

    KCCA to handle multilingual data

    KCCA enables representing documents in a “language neutral way”

    Intuition behind KCCA:

    Given a parallel corpus (such as Acquis)…

    …first, we automatically identify language independent semantic concepts from text,

    …then, we re-represent documents with the identified concepts,

    …finally, we are able to perform cross language statistical operations (such as retrieval, classification, clustering…)


    What semantic web researchers need to know about machine learning analytics ijs si

    Slovenian

    Slovak

    German

    English

    Czech

    French

    Hungarian

    Spanish

    KCCA

    Greek

    Italian

    Language Independent

    Document Representation

    Danish

    Finnish

    Lithuanian

    Swedish

    transform

    Dutch

    New document

    represented as text in

    any of the above languages

    New document

    represented in

    Language Neutral way

    …enables cross-lingual retrieval, categorization, clustering, …


    Input for kcca

    Input for KCCA

    On input we have set of aligned documents:

    For each document we have a version in each language

    Documents are represented as bag-of-words vectors

    Bag-of-words space for English language

    Bag-of-words space for German language

    Pair of aligned documents


    The output from kcca

    The Output from KCCA

    The goal: find pairs of semantic dimensions that co-appear in documents and their translations with high correlation

    Semantic dimension is a weighted set of words.

    These pairs are pairs of vectors, one from e.g. English bag-of-words space and one from German bag-of-words space.

    loss, income, company, quarter

    verlust, einkommen, firma, viertel

    wage, payment, negotiations, union

    zahlung, volle, gewerkschaft, verhand-lungsrunde

    Semantic dimension

    Semantic dimensions pair


    The algorithm theory

    The Algorithm – Theory

    Formally the KCCA solves:

    max(x,y) Corr(<x,, , >, <y,, , >)

    x, y – semantic directions for English and German

    ( , ) is a pair of aligned documents


    Examples of semantic dimensions from acquis corpus english french 1 2

    Examples of Semantic Dimensions from Acquis corpus: English-French(1/2)

    Most important words from semantic dimensions automatically generated from 2000 documents:

    Veterinary, Transport

    DIRECTIVE, DECISION, VEHICLES, AGREEMENT, EC, VETERINARY, PRODUCTS, HEALTH, MEAT

    DIRECTIVE, DECISION, VEHICULES, PRESENTE, RESIDUS, ACCORD, PRODUITS, ANIMAUX, SANITAIRE

    Customs

    NOMENCLATURE, COMBINED, COLUMN, GOODS, TARIFF, CLASSIFICATION, CUSTOMS

    NOMENCLATURE, COMBINEE, COLONNE, MARCHANDISES, CLASSEMENT, TARIF, TARIFAIRES

    EMBRYOS, ANIMALS, OVA, SEMEN, ANIMAL, CONVENTION, BOVINE, DECISION, FEEDINGSTUFFS

    EMBRYONS, ANIMAUX, OVULES, CONVENTION, SPERME, EQUIDES, DECISION, BOVINE, ADDITIFS

    SUGAR, CONVENTION, ADDITIVES, PIGMEAT, PRICE, PRICES, FEEDINGSTUFFS, SEED

    SUCRE, CONVENTION, PORC, ADDITIFS, PRIX, ALIMENTATION, SEMENCES, DECISION

    EXPORT, LICENCES, LICENCE, REFUND, VEHICLES, FISHERY, CONVENTION, CERTIFICATE, ISSUED

    EXPORTATION, CERTIFICATS, CERTIFICAT, PECHE, VEHICULES, LAIT, CONVENTION

    Export Licences

    Agriculture

    Veterinary


    Examples of semantic dimensions from acquis corpora english slovene 2 2

    Examples of Semantic Dimensions from Acquis corpora: English-Slovene(2/2)

    Most important words from semantic dimensions automatically generated from 2000 documents :

    Agriculture

    OLIVE, OIL, AID, SUGAR, PRICE, STATE, MILK, LICENCES, OR, EXPORT, INTERVENTION

    OLJA, OLJCNEGA, POMOCI, SLADKORJA, POMOC, OLJK, SLADKOR, ALI, DOVOLJENJA, CE

    Customs

    NOMENCLATURE, COLUMN, COMBINED, GOODS, TARIFF, CLASSIFICATION, ST, ANNEXED, INVOKED

    NOMENKLATURO, STOLPCU, NOMENKLATURE, KOMBINIRANO, KOMBINIRANE, CARINSKI, BLAGA

    QUOTAS, TARIFF, SEED, CUSTOMS, COLUMN, ENERGY, INVOKED, ATOMIC, QUOTA, OPENING

    KVOT, TARIFNE, SEMENA, KVOTE, TARIFNIH, CARINSKI, ATOMSKO, ENERGIJO, ODPRTJU

    DESIGNATIONS, GEOGRAPHICAL, INDICATIONS, EURATOM, PROTECTED, ECSC, NAMES, ORIGIN

    OZNACB, EURATOM, GEOGRAFSKIH, POREKLA, ESPJ, ZASCITENIH, OZNACBE, IMEN, REGISTER

    WINE, WINES, ALCOHOL, DRINKS, DISTILLATION, POULTRYMEAT, ICEWINE, ANALYSIS

    VINO, VINA, VIN, VINSKEM, VINSKEGA, ALKOHOL, NAMIZNEGA, DESTILACIJO, DESTILACIJE

    Wine

    Agriculture protection

    Energy


    Applications of kcca

    Applications of KCCA

    Cross-lingual document retrieval: retrieved documents depend only on the meaning of the query and not its language.

    Automatic document categorization: only one classifier is learned and not a separate classifier for each language

    Document clustering: documents should be grouped into clusters based on their content, not on the language they are written in.

    Cross-media information retrieval: in the same way we correlate two languages we can correlate text to images, text to video, text to sound, …


    What semantic web researchers need to know about machine learning analytics ijs si

    Example of cross-lingual information retrieval on Reuters news corpus using KCCA

    Borse = Stock Exchange

    Borse = Stock Exchange

    Borse = Stock Exchange

    Borse = Stock Exchange

    ‘Borse’ = ‘Stock Exchange’


    Search engine

    Search engine


    Related approaches

    Related approaches

    Usual approach for modelling cross language Information Retrieval is Latent Semantic Indexing (LSI/SVD) on parallel corpora

    …measured performance of KCCA is significantly better then of LSI

    [Vinokourov et. al, 2002]


    Availability scalability

    Availability/Scalability

    KCCA is available within Text-Garden text-mining software environment

    …available at http://www.textmining.net

    Current version processes up-to 10.000 documents

    Next version (incremental) will be able to process up-to 100.000 documents


    Social networks

    Social networks

    • Social networks can be also potential source of data for machine learning and building semantic structures

      • …conceptually they share similar underlying structure as text – namely, the underlying distribution is generated by power-law

    • In the next slides we show how social networks can be modeled using unsupervised techhniques


    Analysis of e mail graph

    Analysis of e-mail graph

    • An e-mail graph can be analyzed in the following 5 major steps:

      • Starting with log files from an e-mail server where the data include information about e-mail transactions with the fields: sender and the list of receivers.

      • After cleaning we get the data in the form of e-mail transactions which include e-mail addresses of sender and receiver.

      • From a set of e-mail transactions we construct a graph where vertices are e-mail addresses connected if there is a transaction between them

      • E-mail graph is transformed into a sparse matrix allowing to perform data manipulation and analysis operations

      • Sparse matrix representation of the graph is analyzed with ontology learning tools producing an ontological structure corresponding to the organizational structure of the institution where e-mails came from.


    Graph transformation into a set of sparse matrix

    Graph transformationinto a set of sparse matrix

    • Graph with N vertices is transformed into N*N sparse matrix where:

      • …Xth row represents information for Xth vertex

      • …Xth row has nonzero components for:

        • Xth vertex itself and

        • Xth vertex’s neighbors on the distance D (e.g. 1, 2, 3)

      • Intuitively, Xth row represents numerically “neighborhood” of the Xth vertex within the graph:

        • Xth element in the Xth row has weight 1

        • …elements representing neighbors have lower weights relative to the distance (d) from the Xth vertex (1/(2^d))

          • (e.g. 1, 0.5, 0.25, 0.125, …)


    Graph transformation into sparse matrix example

    Graph transformation into sparse matrix (example)

    2

    6

    3

    7

    4

    0

    8

    1

    5

    9

    10

    11

    Transforming

    Graph into

    Matrix


    Data used for experimentation

    Data used for Experimentation

    • The data is the collection of log files with e-mail transactions from local e-mail spam filter software Amavis (http://www.amavis.org/):

      • Each line of the log files denotes one event at the spam filter software

      • We were interested in the events on successful e-mail transactions

        • ...having information on time, sender, and list of receivers

      • An example of successful e-mail transaction is the following line:

        • 2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] <[email protected]> -> <[email protected]>, Message-ID: <[email protected]emhost.net>, Hits: -1.668, 6389 ms


    Some statistics about the data

    Some statistics about the data

    • The log files include e-mails for 19 months:

      • …this sums up to 12.8Gb of data.

      • After filtering out successful e-mail transactions it remains 564Mb

        • …which contains approx. 2.7 million of successful e-mail transitions used for further processing

      • The whole dataset contains references to approx. 45000 e-mail addresses

        • …after the data cleaning phase the number is reduced to approx. 17000 e-mail addresses

        • …out of which 770 e-mail addresses are internal from the home institution (with local domain name)


    What semantic web researchers need to know about machine learning analytics ijs si

    Organizational structure of JSI produced from cleaned

    e-mail transactions with OntoGen in <5 minutes


    What semantic web researchers need to know about machine learning analytics ijs si

    Organizational structure of JSI visualized from

    e-mail transactions with Document-Atlas


    Software mining

    Software Mining

    Software

    Linkanalysis

    Textmining

    Structured data

    = networks

    Unstructured data

    = textual documents

    Document network

    = a set of interlinked documents; each link has type and weight

    [Grčar, Mladenić, Grobelnik, 2007]


    What semantic web researchers need to know about machine learning analytics ijs si

    Extracting data

    Classname

    Comment references

    /** The format of Documents. Subclasses of DocumentFormat know about

    * particular MIME types and how to unpack the information in any

    * markup or formatting they contain into GATE annotations. Each MIME

    * type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,

    * RtfDocumentFormat, MpegDocumentFormat. These classes register themselves

    * with a static index residing here when they are constructed. Static

    * getDocumentFormat methods can then be used to get the appropriate

    * format class for a particular document.

    */

    public abstract class DocumentFormat

    extends AbstractLanguageResource implements LanguageResource{

    /** The MIME type of this format. */

    private MimeTypemimeType = null;

    /**

    * Find a DocumentFormat implementation that deals with a particular

    * MIME type, given that type.

    * @param aGateDocument this document will receive as a feature

    * the associated Mime Type. The name of the feature is

    * MimeType and its value is in the format type/subtype

    * @param mimeType the mime type that is given as input

    */

    static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,

    MimeType mimeType){

    } // getDocumentFormat(aGateDocument, MimeType)

    } // class DocumentFormat

    Classcomment

    Super-class(base class)

    Implementedinterface

    Field comment

    A field

    Field type

    Field name

    Method comment

    A method

    Comment reference

    Returntype

    Methodname


    Extracting data

    Extracting data

    DocumentFormat.class

    DocumentFormat


    Extracting data1

    Extracting data

    DocumentFormat

    DocumentFormat.class

    LanguageResource

    MimeType

    2

    RtfDocumentFormat

    DocumentFormat

    AbstractLanguageResource

    Document

    XmlDocumentFormat

    MpegDocumentFormat


    Example graph

    Example graph

    See next slide


    Example graph zoomed

    Example graph - zoomed


    Structuring extracted knowledge

    Structuring extracted knowledge

    Source code

    Preprocessing

    Featurevectors

    OntoGen

    Ontology


    Semantic web and machine learning

    Semantic Web and Machine Learning


    Examples of semantic web with machine learning

    Examples of Semantic Web with Machine Learning

    • Ontology learning

    • Detecting semantic via Google

    • Visualization of semantic spaces

    • Contextualized search

    • User modeling

    • Detecting bias in the news

    • Semantic summarization


    Ml view on ontology

    ML view on Ontology

    • Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts

    • Ontology can be seen as a graph/network structure consisting from:

      • a set of concepts (vertices in a graph),

      • a set of instances assigned to a particular concepts (data records assigned to vertices in a graph)

      • a set of relationships connecting concepts (directed edges in a graph)


    Example of a topic ontology

    Example of a Topic Ontology


    Ontology learning

    Ontology learning

    • Define the ontology learning tasks in terms of mappings between ontology components, where some of the components are given and some are missing and we want to induce the missing ones.

    • Some typical scenarios in ontology learning are the following:

      • Inducing concepts/clustering of instances (given instances)

      • Inducing relations (given concepts and the associated instances)

      • Ontology population (given an ontology and relevant, but not associated instances)

      • Ontology generation (given instances and any other background information)

      • Ontology updating/extending (given an ontology and background information, such as, new instances or the ontology usage patterns)


    Ontology learning with ontogen developed on the top of text garden

    Ontology Learning with OntoGen(developed on the top of Text Garden)

    • Semi-Automatic

      • provide suggestions and insights into the domain

      • the user interacts with parameters of methods

      • final decisions taken by the user

    • Data-Driven

      • most of the aid provided by the system is based on some underlying data

      • instances are described by features extracted from the data (eg., words-vectors)

        [Fortuna, Mladenić, Grobelnik, 2005]

        Installation package is publicly available in binaries at ontogen.ijs.si


    Basic idea behind ontogen

    Basic idea behind OntoGen

    Text corpus

    Ontology

    Concept A

    Concept B

    Domain

    Concept C

    112


    What semantic web researchers need to know about machine learning analytics ijs si

    Hierarchy of concepts

    Ontology visualization

    Selected concept

    Details about the selected concept

    These two views are synchronized

    Root concept

    Concept name

    Descriptive keywords

    Distinctive keywords


    What semantic web researchers need to know about machine learning analytics ijs si

    Suggesting sub-concepts

    Number of suggestions

    List of suggestions


    What semantic web researchers need to know about machine learning analytics ijs si

    Inspection tool


    What semantic web researchers need to know about machine learning analytics ijs si

    Mining

    Pharmacy

    Real estate

    Oil

    Airlines

    Restaurants

    Life insurance


    What semantic web researchers need to know about machine learning analytics ijs si

    Description of the concept


    What semantic web researchers need to know about machine learning analytics ijs si

    The system asks several “yes or no” questions


    What semantic web researchers need to know about machine learning analytics ijs si

    Checkboxes indicate whether a document belongs to the concept or not

    List of documents

    Document preview pane

    Red dots represent documents that currently belong to the concept

    Similarity graph


    Google meaning

    Google Meaning


    Google meaning1

    Google meaning

    • Google offers insight on how words are used in a large corpora (web)

    • This can be used to estimate distance between terms

      • Estimation is based on the co-occurrence of terms

      • Uses the frequency of each term and their co-occurrences

    • Examples:

      • GoogDist(“Oscar Wilde”, “The Picture of Dorian Gray”) = 0.2

      • GoogDist(“Oscar Wilde”, “A Midsummer Night’s Dream”) = 0.5


    Google meaning2

    Google meaning

    • Google offers insight on how words are used in a large corpora (web)

    • This can be used to estimate similarities between terms

      • Estimation is based on the co-occurrence of terms

      • Uses the frequency of each term and their co-occurrences


    Text visualization document atlas http docatlas ijs si

    Text VisualizationDocument-Atlas http://docatlas.ijs.si


    Document atlas visualization of document collections and their structure

    Document Atlas – visualization of document collections and their structure

    http://docatlas.ijs.si


    Contextualized web search searchpoint http searchpoint ijs si

    Contextualized web-search SearchPoint: http://searchpoint.ijs.si


    Visualization of search results

    Visualization of search results

    • Search engines generally work very well

    • There are cases where it is difficult to specify a query

    • Idea: help the user byclustering all the hits and visualise the results space


    Example jaguar

    Example: jaguar

    Relevant part of ontology

    for the query “jaguar”


    Example iswc

    Example: ISWC

    Query ISWC has many meanings


    What semantic web researchers need to know about machine learning analytics ijs si

    Example: Password


    Sektbar intelligent user profiling

    SEKTbar – intelligent user profiling


    User focused browsing history in sektbar

    User focused browsing history in SEKTbar

    • Web-based user profile is automatically generated while the user is browsing the Web

      • User profile is represented as a user-interest topic ontology (root gives the user’s general interest, leaves specific interests)

      • topic ontology is generated by using hierarchical clustering

      • nodes of current interest are determined by comparing the profile node centroids to the centroid, computed out of the m most recently visited pages

    • User profile is visualized on the SEKTbar (Internet Explorer Toolbar)

      • the user can select a node in the hierarchy to see its specific keywords and associated pages

        [Grčar, Mladenić, Grobelnik, 2006]


    Detecting the bias in media with statistical learning methods

    Detecting the bias in media with statistical learning methods


    Media bias

    Media Bias

    • The descriptions of the same event in news can differ with respect to the news source and its bias.

      • Can we detect this using statistical methods?

      • How does this bias reflect in vocabulary?

    • To answer these questions we need:

      • A controlled collection of news for a longer time period

        • A year corpus of news articles on international events

      • Matching of news articles from different news sources based on the described event

        • We used bag-of-words model and “Best Reciprocal Hit” method from Bioinformatics to match news articles describing similar events


    Experimental setup

    Experimental setup

    Time period: March 31st 2005 – April 14th 2006

    Size of collections:

    Number of discovered matches:


    Prediction of news source

    Prediction of news source

    • The task: given a pair of news articles describing the same event, can we predict the news source for each?

    • In this experiment we focused on CNN and Al Jazeera.

    • SVM linear classifier was used for prediction

      • Evaluation was done using 10-fold cross-validation

      • Significance of results was tested against random matches

    • We used SVM feature selection [Brank et. al.] to extract most important classification keywords.


    Source prediction results

    Source prediction results

    • News source can be predicted with 81% BEP

      • high bias in vocabulary

    • News source specific keywords:


    Bias example

    Bias example


    Keywords differences on topic level

    Keywords differences on topic level


    News source similarity maps

    Based on covered events

    News source similarity maps

    Based on source vocabulary differences


    Summarization of documents through semantic graphs

    Summarization of documents through semantic-graphs


    Approach description

    Approach Description

    • Approach:

      • Learn a machine learning model for selecting sentences

      • Use information about semantic structure of the document (concepts and relations among concepts)

        Results are promising

      • achieved 70% recall of and 40% precision on extracted Subject-Predicate-Object triples on DUC (Document understanding conference) data


    Summarization

    Summarization

    Human built document summary

    Original Document

    • Cracks Appear in U.N. Trade Embargo Against Iraq.

    • Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. ``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''.

    Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to

    exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted.

    Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

    Manual

    summarization

    Creation of

    semantic network

    Semantic net of

    Subj-Pred-Obj

    triples

    Automatically built

    document summary (not done by us)

    70% recall, 40% precision

    of selected triples according to human generated summaries

    Automatic

    summarization by

    selecting relevant triples

    Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to

    exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted.

    Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

    Nat. Lang.

    Generation

    Mapping between graphs learned with ML methods

    Semantic net of

    Subj-Pred-Obj

    triples


    Detailed summarization procedure

    Detailed Summarization Procedure

    Linguistic analysis of the text

    - Deep parsing of sentences

    Refinement of the text parse

    - Named-entity consolidation

    Determine that ’George Bush’ = ‘Bush’

    = ‘U.S. president’

    - Anaphora resolution

    Link pronouns with name-entities

    Extract Subject–Predicate–Object triples

    Tom Sawyer went to town. He met a friend. Tom was happy. …

    Tom Sawyer went to town. He [Tom Sawyer] met a friend. Tom [Tom Sawyer] was happy. …

    Tom  go  town

    Tom  meet  friend

    Tom  is  happy

    Compose a graph from triples

    Describe each triple with a set of features for learning

    Learn a model to classify triples into the summary

    Generate a summary graph

    Use summary graph to generate textual document summary


    Example of summarization

    Example of summarization

    Human written summary

    Cracks Appear in U.N. Trade Embargo Against Iraq.

    Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. The report, made available to The Associated Press, said some Eastern European countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. ``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. Romania's ambassador to the United States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''.

    Cracks appeared in the U.N. trade embargo against Iraq. The State Department reports that Cuba and Romania have struck oil deals with Iraq as others attempt to trade with Baghdad in defiance of the sanctions. Iran has agreed to

    exchange food and medicine for Iraqi oil. Saddam has offered developing nations free oil if they send their tankers to pick it up. Thus far, none has accepted.

    Japan, accused of responding too slowly to the Gulf crisis, has promised $2 billion in aid to countries hit hardest by the Iraqi trade embargo. President Bush has promised that Saddam's aggression will not succeed.

    7800 chars, 1300 words


    What semantic web researchers need to know about machine learning analytics ijs si

    Automatically generated graph of summary triples


    Further online information

    Further online information

    • Recorded tutorials, lectures, summer-schools available from http://videolectures.net

      • Semantic Web: http://videolectures.net/Top/Computer_Science/Semantic_Web/

      • Machine Learning: http://videolectures.net/Top/Computer_Science/Machine_Learning/


    Ml summer schools @ video lectures

    ML Summer Schools @ Video Lectures

    • ML summer schools online


  • Login