Empowering the Publishing Process with Semantic Technologies

Stephen Cohen

Principal Consultant

O’Reilly Tools of Change Conference

23 February 2010


Agenda

  • Overview

  • Semantic technologies

  • Case studies

  • Benefits and challenges

  • Questions


Innodata Isogen – Who We Are

Innodata Isogen provides knowledge, production, technology and consulting services to the world’s leading media, publishing and information services companies

  • We specialize in publishing and help our clients to:

    • lower total cost of ownership for their content supply chain

    • re-engineer business processes

      • use multi-shore services to lower cost, manage risk and balance the cost / quality ratio

      • combine content and technology outsourcing to add value

  • Our clients include:

    • leading scholarly, business and legal publishers

    • secondary publishers (content aggregators)

    • agencies of the U.S. Department of Defense

    • major aerospace manufacturers

6,500 global staff

Locations: London, Paris, Israel, Delhi, Manila, Cebu, Colombo, New Jersey, Dallas


Overview

  • Semantic technologies are often used to more effectively monetize content and improve the customer experience on the Web

    • semantic advertising

    • semantic search

  • They have also been used effectively throughout the publishing process

  • Today we will talk about companies that are using semantic technologies and text mining to process content better, faster, cheaper


What Do Publishers Have in Common?

  • They all want to deliver information better, faster, cheaper

  • Better

    • offer the information customers and users want and need (focused)

    • make it easier for customers to discover new information and relationships between information

  • Faster

    • get it in the hands of customers ahead of your competition (when they need it)

  • Cheaper

    • do it in the most cost-effective way possible


Semantic Analysis Tools Can Help

  • Across the content supply chain

  • Better

    • more accurate, consistent content tagging, indexing, abstracting, linking

  • Faster

    • find out sooner about new information (e.g., announcements, legal opinions, rules changes)

    • (semi) automate content enrichment

    • increase throughput

  • Cheaper

    • deploy resources most cost effectively (do more with less)


Semantic Technologies: Some Characteristics

  • Briefly, semantic technologies are algorithms that seek to model the associative processes that humans perform to extract meaning from information

  • Knowing a little bit about “the man behind the curtain” can help when it comes to deciding which approach is a good fit for your company’s needs

  • They can be rules-based, use statistical analysis, use semantic and linguistic clustering, etc.

  • Not surprisingly, there are many approaches to modeling and each has its strengths and weaknesses


Rules-Based Text Analysis

  • Precisely defines criteria by which a document belongs to a category

  • Matches terms in a thesaurus to words in content

  • Typically uses “if-then-else” rules

  • Relatively easy to deploy; start with simple rules and enhance over time

  • Rules can get complex, difficult to maintain

Example rules:

  • If word = “shrub”, then assign category = ‘bush’

  • If word = “Bush” AND it is within 4 words of “President”, then assign category = ‘chief executive’

  • If doc.type = email, then assign category = ‘internal communication’
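
The three example rules on this slide can be sketched in code. This is a minimal illustration under stated assumptions, not the logic of any particular product: the function name `classify_by_rules`, the dictionary-shaped document with `type` and `text` keys, and the simple word tokenizer are all hypothetical.

```python
import re


def classify_by_rules(doc):
    """Apply the slide's three if-then rules to one document.

    `doc` is a hypothetical dict with "type" and "text" keys.
    Returns the set of categories assigned.
    """
    categories = set()
    words = re.findall(r"[A-Za-z']+", doc["text"])

    # Rule 1: thesaurus-style match; the word "shrub" maps to category 'bush'.
    if any(w.lower() == "shrub" for w in words):
        categories.add("bush")

    # Rule 2: capitalized "Bush" within 4 words of "President".
    bush_positions = [i for i, w in enumerate(words) if w == "Bush"]
    president_positions = [i for i, w in enumerate(words) if w == "President"]
    if any(abs(b - p) <= 4 for b in bush_positions for p in president_positions):
        categories.add("chief executive")

    # Rule 3: a metadata rule that looks at document type, not content.
    if doc["type"] == "email":
        categories.add("internal communication")

    return categories


# Usage: one document can trigger several rules at once.
example = {"type": "email", "text": "President George Bush planted a shrub."}
print(sorted(classify_by_rules(example)))
```

Even this small example hints at why rule sets become hard to maintain: each new ambiguity (a person, a plant, a brand) needs its own condition, and the conditions start to interact.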


Statistical Analysis

  • Word frequency

  • Relative placement of words, groupings

  • Distance between words in a document

  • Pattern analysis

  • Co-occurrence of terms to find clumps or clusters of closely related documents

  • Makes assignments to categories based on a set of training documents

  • Requires more time to deploy due to need to select a representative set of documents for training the tool

  • Accuracy of the semantic analysis will depend on how well the training documents have been chosen
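
To make the dependence on training documents concrete, here is a minimal sketch of one statistical approach: build a word-frequency profile per category from labeled training documents, then assign a new document to the category with the most similar profile (cosine similarity). The technique is chosen for illustration only; the talk does not name a specific algorithm, and the training texts, category names and function names below are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(training_docs):
    """Build one word-frequency profile per category from (category, text) pairs."""
    profiles = {}
    for category, text in training_docs:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_by_similarity(text, profiles):
    """Return the category whose training profile is most similar to the text."""
    words = Counter(tokenize(text))
    return max(profiles, key=lambda category: cosine(words, profiles[category]))


# Hypothetical training set; a real one would need many representative documents.
TRAINING = [
    ("legal", "the court issued an opinion on the appeal"),
    ("legal", "the judge ruled on the appeal"),
    ("sports", "the team won the game in the final inning"),
    ("sports", "the pitcher threw the final pitch of the game"),
]
profiles = train(TRAINING)
print(classify_by_similarity("the court heard the appeal", profiles))
```

Because the profiles are built entirely from the training set, a skewed or unrepresentative sample shifts every later assignment, which is the point the last two bullets make.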


Semantic and Linguistic Clustering

  • Concept extraction

  • Language dependent

  • Documents clustered or grouped depending on meaning of words using thesauri, parts-of-speech analyzers, rule-based & probabilistic grammar, etc.

  • Analyzes structure of sentences

    • analysis of words - prefixes, suffixes, roots

    • word-level analysis including parts of speech

    • analyzes structure & relationships between words in a sentence

    • possible meanings of a sentence; enhanced by statistical analysis
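
A very small sketch of the idea of grouping documents by the meaning of their words: each word is reduced toward a root by stripping a common suffix, looked up in a thesaurus, and the document is grouped under its most frequent concept. The thesaurus entries, suffix list and function names are assumptions made for the example; real tools use far richer morphology, part-of-speech tagging and grammar analysis, and the suffix list illustrates why the approach is language dependent.

```python
import re
from collections import Counter, defaultdict

# Hypothetical thesaurus: root word -> concept label.
THESAURUS = {
    "court": "judiciary",
    "judge": "judiciary",
    "ruling": "judiciary",
    "verdict": "judiciary",
    "pitch": "baseball",
    "inning": "baseball",
}
SUFFIXES = ("ing", "ed", "es", "s")  # English-only


def concept_for(word):
    """Look up the word itself, then each suffix-stripped form, in the thesaurus."""
    candidates = [word] + [word[: -len(s)] for s in SUFFIXES if word.endswith(s)]
    for candidate in candidates:
        if candidate in THESAURUS:
            return THESAURUS[candidate]
    return None


def cluster_by_concept(documents):
    """Group document ids under the concept that occurs most often in each text."""
    clusters = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        concepts = Counter(c for c in map(concept_for, words) if c)
        label = concepts.most_common(1)[0][0] if concepts else "unclassified"
        clusters[label].append(doc_id)
    return dict(clusters)


DOCS = {
    "a": "The judges issued two rulings",
    "b": "He pitched three innings",
    "c": "Courts reviewed the verdict",
    "d": "Quarterly earnings rose",
}
print(cluster_by_concept(DOCS))
```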


The Content Supply Chain

  • We view the publishing process in terms of a supply chain

  • It begins with content acquisition through conversion and enhancement, on to product assembly and, lastly, to product publishing and distribution

  • Using semantic tools has an impact on roles and responsibilities, workflows and the way content is processed at each stage of the content supply chain

  • Semantic tools and text mining are used at different stages of the editorial and production process


Semantic Tools in the Content Supply Chain

Stages of the content supply chain:

  • Source / Create

  • Convert / Structure

  • Normalize

  • Store / Manage

  • Edit / Enhance

  • Product Assembly

  • Publish / Distribute

Semantic tools applied along the chain:

  • Intelligent agents for targeted retrieval (content federation): “acquire what is new or changed from sites I am interested in”

  • Abstracting, auto-summarization (e.g., synopses, headnotes)

  • Custom publishing; ‘synthetic documents’

  • Content delivery for multiple output channels and product formats

  • Linking; entity extraction; citations; classification; machine-aided indexing; contextual meaning

  • Extract content for tagging; identify not only document structure but document meaning; structure unstructured content

  • Controlled vocabulary and authority list management; taxonomy managers; knowledge management


Case Studies


Preview of Case Studies

  • Rules-based auto-classification

  • Document analysis and entity linking

  • Auto-summarization

  • Product assembly

  • Custom information feeds


Case Study

Rules-based Auto-classification


Rules-based Auto-classification

Set-up

  • Taxonomy manager: add/remove terms; create groupings; map terms

  • Define classification rules

    • automatic update of rules to reflect changes in taxonomy (system)

    • indexer defines classification rules (indexer)

    • baseline test set: test and adjust rules

  • Rules management

    • review usage statistics: rules used, not used

    • add, modify, delete rules

  • Result: rules base

Auto-classification (system)

  • Apply rules to classify content against taxonomy

  • System tracks rules usage (which ones used; frequency)

Indexer review (indexer)

  • Accepts, rejects, adds classification terms

  • Reviews rules the system applied that yielded wrong classification

  • Flags problems to rules builder; suggests new terms

  • System tracks rules that generated incorrect classifications
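
The feedback loop in this workflow (the system counts which rules fire, and indexer rejections are charged back to the rule that produced them) can be sketched as follows. This is an assumption-laden illustration, not the system described in the case study: the `RulesBase` class, the rule identifiers and the substring-based trigger matching are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RulesBase:
    # rule id -> (trigger term, taxonomy term); all values are hypothetical
    rules: dict
    usage: Counter = field(default_factory=Counter)
    errors: Counter = field(default_factory=Counter)

    def classify(self, text):
        """Apply every rule; return {taxonomy term: rule id} and count each rule that fired."""
        assigned = {}
        lowered = text.lower()
        for rule_id, (trigger, term) in self.rules.items():
            if trigger in lowered:
                assigned[term] = rule_id
                self.usage[rule_id] += 1
        return assigned

    def record_review(self, assigned, rejected_terms):
        """Indexer review: charge each rejected term to the rule that assigned it."""
        for term in rejected_terms:
            self.errors[assigned[term]] += 1

    def unused_rules(self):
        """Rules that have never fired; candidates for review or deletion."""
        return [rule_id for rule_id in self.rules if self.usage[rule_id] == 0]


base = RulesBase(rules={
    "R1": ("merger", "Mergers & Acquisitions"),
    "R2": ("patent", "Intellectual Property"),
    "R3": ("apple", "Agriculture"),
})
assigned = base.classify("Apple announced a merger")
base.record_review(assigned, rejected_terms=["Agriculture"])  # indexer rejects one term
print(base.unused_rules(), dict(base.errors))
```

The two counters correspond to the two reports in the workflow: usage statistics for rules management, and the list of rules that generated incorrect classifications.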


Case Study

Document Analysis and Entity Linking


Document Analysis and Entity Linking

  • Focus is on document analysis and entity linking in editorial workflow

  • Subsidiary of a global legal publishing house

    • content base of 3.5 million cases, related documents

    • manages over 17 million citations

    • updates of case law processed daily

    • cases growing at 20% per annum

  • Challenges

    • avoid processes performed manually by individuals

    • allow the user to select and filter the information needed for their job

    • take into account an increasing number of legal information sources

  • Note: this case study describes the target configuration, which has not yet been fully realized


Goals for the New Process

  • Aid the process of knowledge extraction and storage

    • identify legal sources (e.g., official publication, case law decision)

    • extract legal citations (which source is cited and why?)

    • populate a knowledge base and cyclically enrich the content

  • Process each piece of information one time

    • normalize, tag, enrich, link, form concepts, etc.

  • Build a standardized common knowledge base for use throughout the editorial and production process and downstream by end users

  • Maintain consistent thesauri, ontologies, taxonomies and provide a mechanism for their management and updating


Document Analysis and Linking Process

Definition phase

  • Domain-specific lists for entity recognition

  • Text mining rules

  • Baseline test set: test text analysis tool

Automated text analysis

  • Entity extraction

  • Automated semantic analysis

  • Tag content

  • Linking

  • Iterative application of rules

Knowledge management

  • Search and navigation services

Review and QC enriched content (legal editors)

  • Use search and navigation tools to review, identify, and correct entity errors, link errors and concept errors

List and rules maintenance (librarian)

  • Weekly review of exception reports
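
As a rough sketch of the entity extraction, tagging and linking steps, the example below matches a domain-specific list against text and wraps each hit in an element that points to a knowledge-base identifier. Everything specific here is an assumption: the list entries, the `kb:` identifiers, the `<entity>` element name and the function names are invented, and overlapping matches are not handled.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical domain-specific list: surface form -> knowledge-base identifier.
ENTITY_LIST = {
    "Court of Appeal": "kb:court/appeal",
    "Supreme Court": "kb:court/supreme",
    "Official Journal": "kb:source/official-journal",
}


def extract_entities(text, entity_list=ENTITY_LIST):
    """Return (start, end, surface form, kb id) for each list entry found, in text order."""
    hits = []
    for surface, kb_id in entity_list.items():
        for match in re.finditer(re.escape(surface), text):
            hits.append((match.start(), match.end(), surface, kb_id))
    return sorted(hits)


def tag_content(text, hits):
    """Wrap each hit in an <entity> element whose ref attribute links to the knowledge base."""
    out, cursor = [], 0
    for start, end, surface, kb_id in hits:
        out.append(escape(text[cursor:start]))
        out.append(f"<entity ref={quoteattr(kb_id)}>{escape(surface)}</entity>")
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


sentence = "The Supreme Court overturned the Court of Appeal."
hits = extract_entities(sentence)
print(tag_content(sentence, hits))
```

Names that appear in the text but not in the list are simply missed, which is why the process pairs the tool with editorial review and ongoing list maintenance.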


Benefits of the New Process

  • Workflow

    • a semi-automated process

    • editors review QC output from text mining tool to enhance and correct as necessary

    • analysis and linking by automated text analysis tool

    • parallel processing in text analysis tool

    • analysis, referencing and linking became part of the same workflow

  • Roles and responsibilities

    • editors no longer need to be experts in mark-up languages; content is tagged automatically

    • low value editorial tasks handled by text analysis tool

    • existing staff can focus on high value tasks

    • new role to maintain and enhance semantic lists and text mining tool rules

  • Content

    • quality of document analysis improves through enhancements to the lists and rules used by the text mining tool

    • able to federate metadata across multiple content management systems

    • same knowledge base and text mining tool integrated into online products


Case Study

Auto-summarization


Auto-summarization – Major Newspaper

Document Analysis

  • Content in: source; type; format; content

Rules Base

  • Document zones

  • Rules: semantics; dictionary; complex grammar rules

  • Section weightings

  • Sentence position

  • Relative importance of sentences

  • Markers for start of sections, paragraphs, sentences

  • Sentence length of summary

  • Administrator monitors and improves the rule set based on usage

Summarization paths (extent of automation depends on article importance)

  • Auto-summarization, OR

  • Auto-summarization (draft version) followed by expert review and edit (final version), OR

  • Manual summarization by outsourced or in-house experts
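
Several items in the rules base (document zones, section weightings, sentence position, summary length) lend themselves to a simple extractive sketch: score each sentence by its zone weight and its position within the zone, then keep the top few in document order. This is one possible reading, not the newspaper's actual rule set; the scoring formula, zone names and weights are assumptions.

```python
import re


def summarize(zones, zone_weights, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order.

    zones: list of (zone name, text) pairs in document order.
    zone_weights: hypothetical section weightings, e.g. {"lead": 2.0, "body": 1.0}.
    """
    scored = []
    order = 0
    for zone, text in zones:
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        for position, sentence in enumerate(sentences):
            # Earlier sentences in a zone score higher (sentence position).
            score = zone_weights.get(zone, 1.0) / (position + 1)
            scored.append((score, order, sentence))
            order += 1
    # Highest score first; ties go to the sentence that appears earlier.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(sentence for _, _, sentence in sorted(top, key=lambda item: item[1]))


article = [
    ("lead", "The council approved the budget. The vote was 5 to 2."),
    ("body", "Debate lasted three hours. Several residents spoke. The mayor abstained."),
]
weights = {"lead": 2.0, "body": 1.0}
print(summarize(article, weights, max_sentences=2))
```

A draft produced this way would still go to an expert for review and edit on the semi-automated path.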


Case Study

Product Assembly


Product Assembly

Process

  • New content

  • Source / Capture

  • Convert / Normalize

  • Analyze / Classify / Enhance (editorial)

  • Content repository: XML content store; rich media

Extract Product Content From Repository (one path per product)

  • Select content (XQuery)

  • Render, with technologies that vary by product:

    • WCSS; proprietary

    • FOSI; XSL-FO; proprietary

    • XSLT; CSS; RSS

  • Format

  • Product
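
The select-then-render pattern can be illustrated with a single XML store feeding two outputs. The slide names XQuery for selection; this sketch substitutes the limited XPath support in Python's standard `xml.etree.ElementTree` so that it runs without extra software. The repository contents, element names and render functions are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical XML content store with three articles.
REPOSITORY = ET.fromstring("""
<repository>
  <article topic="legal"><title>Court rules on appeal</title><body>Full text A.</body></article>
  <article topic="sports"><title>Home team wins</title><body>Full text B.</body></article>
  <article topic="legal"><title>New filing rules</title><body>Full text C.</body></article>
</repository>
""")


def select_content(repository, topic):
    """Select the articles for one product (XPath subset standing in for XQuery)."""
    return repository.findall(f"./article[@topic='{topic}']")


def render_html(articles):
    """Render the selection as an HTML fragment for a web product."""
    return "".join(
        f"<h2>{escape(a.findtext('title'))}</h2><p>{escape(a.findtext('body'))}</p>"
        for a in articles
    )


def render_rss_items(articles):
    """Render the same selection as RSS items for a feed product."""
    return "".join(
        f"<item><title>{escape(a.findtext('title'))}</title></item>" for a in articles
    )


legal = select_content(REPOSITORY, "legal")
print(render_html(legal))
print(render_rss_items(legal))
```

The content is selected once and rendered per channel, so adding a product means adding a query and a renderer rather than a new copy of the content.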


Case Study

Custom Information Feeds


Custom Information Feeds

Flow: content, repository, delivery, end users

Content

  • Sports: baseball, football, soccer, hockey

  • Levels: pro, college, high school, rec

  • Types: scores, players, people, news, rules, standings, stats, scheds

Repository

  • XML

  • Rich media

Delivery

  • Real-time updates

  • Targeted info

  • Real-time feeds

  • Enriched email
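
One way to picture a custom feed is as a filter over classified content: each item carries a sport, level and type, and each subscriber's profile selects the combinations they want on a given delivery channel. The sketch below is an assumption about how such routing might look; the class names, field names, users and sample items are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    sport: str
    level: str
    kind: str
    headline: str


@dataclass(frozen=True)
class Subscription:
    user: str
    channel: str  # e.g. "real-time feed" or "enriched email"
    sports: frozenset
    levels: frozenset
    kinds: frozenset

    def matches(self, item):
        """An item matches when its sport, level and type are all subscribed to."""
        return (item.sport in self.sports
                and item.level in self.levels
                and item.kind in self.kinds)


def route(items, subscriptions):
    """Return {(user, channel): [headlines]} for every subscription with matches."""
    deliveries = {}
    for sub in subscriptions:
        matched = [item.headline for item in items if sub.matches(item)]
        if matched:
            deliveries[(sub.user, sub.channel)] = matched
    return deliveries


ITEMS = [
    Item("baseball", "pro", "scores", "Final: 5-3"),
    Item("soccer", "high school", "schedules", "Playoff dates set"),
    Item("hockey", "college", "news", "Coach retires"),
]
SUBSCRIPTIONS = [
    Subscription("ana", "real-time feed", frozenset({"baseball", "hockey"}),
                 frozenset({"pro"}), frozenset({"scores", "stats"})),
    Subscription("ben", "enriched email", frozenset({"soccer"}),
                 frozenset({"high school", "college"}), frozenset({"schedules", "news"})),
]
print(route(ITEMS, SUBSCRIPTIONS))
```

This only works if the content has been classified consistently upstream, which is where the semantic tagging from the earlier stages comes in.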


Benefits of Using Semantic Technologies

  • People

    • minimize high-value resources performing commodity tasks

    • editors focus on real editorial added value; no need to be concerned about markup

    • increased capacity without increasing headcount

    • novice indexers come up to speed quicker

  • Process

    • reduced processing time due to automation

    • sequential tasks can be performed in one step

    • products can be more targeted to specific customer needs

    • parts can be outsourced

  • Content

    • richer, more consistent classification, linking, summarization, semantic tagging

    • common controlled vocabularies maintained and applied across entire content base

    • same content can be classified and summarized along more dimensions to serve different customer groups

    • greater value can be extracted from unstructured content with text mining and semantic analysis

    • taxonomy managers support a rigorous approach to maintenance and updating


Challenges Using Semantic Technologies

  • People

    • retraining resources for new roles (rules builder, taxonomy manager, etc.) is time-consuming

    • level of accuracy depends on ability of editors to write logical rules

  • Process

    • time required to refine rules and train analysis engine can be extensive (some report 12-18 months)

    • productivity improvements are a function of thesaurus structure, rule-builder’s skill level and document type; the more complex any of these is, the longer it takes to achieve return on investment

  • Content

    • automated content analysis doesn’t match up to the analytical skills of trained subject area experts (at least in some highly technical disciplines)

    • some find it difficult to measure the impact of indexing consistency

    • lower quality when there is fully automated machine aided indexing with no follow-on QC by subject area experts


Questions


Thank You

Stephen Cohen

Principal Consultant

[email protected]

+1 (201) 371-8044

Innodata Isogen, Inc.

Three University Plaza

Hackensack, NJ 07601

+1 (201) 371-2828

www.innodata-isogen.com

Proprietary and Confidential
