Training-less Ontology-based Text Categorization.

Maciej Janik, LSDIS lab, Computer Science, University of Georgia. Major professor: Dr. Krzysztof J. Kochut. Committee: Dr. John A. Miller, Dr. Khaled Rasheed, Dr. Amit P. Sheth. Dissertation Defense, July 1st, 2008

Document categorization Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. [Wikipedia]

Objectives • Document categorization method • Classification is based on knowledge from ontology • Do not require training set • Use semantic information for categorization • Explore role of semantic associations in text categorization • Incorporate user interest (context) into categorization

Automatic document categorization • Methods are based on word/phrase statistics, information gain and other probability or similarity measures (1). • Examples • Naïve Bayes, SVM, Decision Tree, k-NN • Categorization based on information (frequencies, probabilities) learned from the training documents. • Vocabulary extension/unification possible by use of synonyms, homonyms, word groups (e.g. from WordNet) • Document representation for categorization • Set or vector of features - most popular and simple: bag of words • Does not include information about document structure, relative position of phrases, etc. (1) Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34 (1). 1 - 47.

Document categorization by people • People categorize a document by understanding its content, using their knowledge and understanding what the category is. • Categorization is based on: • Document content (entities and relationships) • Knowledge (ontology) • Category (category definition) • Perceived interest (categorization context)

OmniCat approach • Categorization knowledge • Ontology • Features • Entities, relationships and semantic associations • Category definitions • Relevant fragments of ontology • Importance of classes, entities, and relationships • Categorization process • Matching of a document text to find best fit into defined ontology fragments

Semantic associations • Semantic Association (1) • A simple, undirected path that connects two entities in the knowledge base and describes how they are related. • Relationships on the path define the meaning of this connection. • Directionality of relationships sets a specific interpretation of a path. • Entities on the path specify the content. (1) Sheth, A. P., I. B. Arpinar, et al. (2003). Relationships at the Heart of Semantic Web: Modeling, Discovering, and Exploiting Complex Semantic Relationships. Enhancing the Power of the Internet: Studies in Fuzziness and Soft Computing. M. Nikravesh, B. Azvine, R. Yager and L. Zadeh, Springer Verlag.
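The definition above can be illustrated with a minimal sketch that enumerates simple (no repeated entity) undirected paths between two entities while still recording the direction in which each relationship was traversed. The triples, the `max_length` bound, and the function names are illustrative assumptions; this is not the BRAHMS or SPARQLeR implementation.

```python
from collections import defaultdict


def build_undirected_index(triples):
    """Index every triple in both directions, remembering the original direction."""
    index = defaultdict(list)
    for subject, prop, obj in triples:
        index[subject].append((prop, obj, "forward"))
        index[obj].append((prop, subject, "inverse"))
    return index


def simple_paths(triples, start, end, max_length):
    """Return all simple undirected paths from start to end with at most max_length edges."""
    index = build_undirected_index(triples)
    results = []

    def walk(node, visited, path):
        if node == end:
            results.append(list(path))
            return
        if len(path) == max_length:
            return
        for prop, neighbor, direction in index[node]:
            if neighbor in visited:
                continue  # a simple path never revisits an entity
            visited.add(neighbor)
            path.append((prop, direction, neighbor))
            walk(neighbor, visited, path)
            path.pop()
            visited.remove(neighbor)

    walk(start, {start}, [])
    return results


# Hypothetical toy knowledge base.
triples = [
    ("ann", "child", "bob"),
    ("ann", "child", "carl"),
    ("carl", "works_for", "acme"),
]
siblings = simple_paths(triples, "bob", "carl", max_length=2)
```

Here `siblings` holds a single association: `child` followed against its direction from `bob` to `ann`, then `child` followed forward to `carl`. The recorded directions are what give the undirected path its specific interpretation.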

Semantic Associations - Paths in RDF [Figure: three example paths over properties such as child, older and works_for: a directed path; an undirected path, but with specific properties and directionality; an undirected path]

BRAHMS Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery", Fourth International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005

BRAHMS • Features • Main-memory RDF/S storage • Handles RDF and RDFS data • High performance for accessing RDF/S data • Efficient handling of large ontologies • Rich API provides a framework for creating ontology-based algorithms (e.g. semantic association discovery) • Separation of schema and instances • Read-only access to ontology • Developed for the needs of the SemDis (1) project (1) http://lsdis.cs.uga.edu/projects/semdis/

Design decisions • Performance requirements • use main memory for storage – fastest access • create indexes for operations used in graph traversal algorithms • use C/C++ in implementation instead of Java • instead of string URIs, use simple type [int] as resource identifiers. • Ontology size • compact representation for handling large ontologies – leave some memory for algorithms

Design decisions • Handle RDF / S • simplify the design and do not include and check logic or constraints imposed by OWL • Separate instance base from schema • represent instances, schema classes and properties as different object types • have specific methods to access schema or instances • different types of objects require different types of statements

Design decisions • Framework for algorithms • create rich API of basic operations to access RDF/S data • Consequences of design decisions • compact knowledge base to minimize memory usage, no memory fragmentation – use contiguous memory blocks, so make it read-only • create snapshot of memory structures for fast start-up (parse* once, use many times) • handle taxonomy in a special way. (*) Redland’s Raptor is used as RDF/S parser – http://librdf.org/raptor

Results - timing [Chart omitted] Dataset: 45,000 instance statements, 29,889 instances, RDF: 13Mb

Results - timing

SPARQLeR Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, 3-7 June 2007

SPARQLeR • Extension of SPARQL for semantic association discovery. • Seamlessly integrated into the SPARQL syntax. • Graph patterns incorporating simple paths with constraints. • Support for flexible length paths. • Property constraints (path patterns) are based on regular expressions over properties. • Additional constraints on entities included in the path (instances and properties).

Path patterns in SPARQLeR • Path in SPARQLeR is a meta-property • Resource –[property] Resource • Resource –[path] Resource • Path is also a Sequence • Test if a resource is in the path: • rdfs:member • Test if a resource is at a specific position in the path: • rdf:_2, rdf:_4, ... • SPARQLeR-specific path properties • Test all resources or all properties in the path: • rdfms:entityResource and rdfms:propertyResource • Example: all resources on a path must be of type foo:Person

SPARQLeR extensions • Path expressions • use of regular expressions over properties • Flexible path specification • Undirected • Defined directionality paths • Directed • Length restricted • Complex path patterns • Test of resources and properties on the path • Intersecting paths

RegExp in path constraints • Path constraints on properties are based on regular expressions • Uses syntax similar to lex • Easy for grep users • Examples: a c* d a+ (b|c) a [abc] c? d ( b a-1 )+ c
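A rough sketch of how a regular expression over properties could be checked against a path, using Python's `re` module. It assumes that `a-1` in the examples denotes the inverse of property `a` (the superscript appears to have been lost), and it supports only grouping, alternation, `*`, `+` and `?`, not character classes such as `[abc]`. The helper names and the `/` separator are illustrative choices, not SPARQLeR's actual matcher.

```python
import re

# A property name, optionally followed by "-1" to mark the inverse direction.
TOKEN = re.compile(r"[A-Za-z_][\w:]*(?:-1)?")


def compile_path_pattern(pattern):
    """Turn a pattern over property names into a regex over 'name/' sequences."""
    translated = TOKEN.sub(
        lambda match: "(?:" + re.escape(match.group(0)) + "/)", pattern
    )
    # Whitespace in the pattern only separates tokens, so it can be dropped.
    return re.compile(translated.replace(" ", ""))


def encode_path(steps):
    """Encode a path given as (property, is_inverse) steps, e.g. 'b/a-1/c/'."""
    return "".join(prop + ("-1" if inverse else "") + "/" for prop, inverse in steps)


def path_matches(pattern, steps):
    """True when the whole property sequence of the path satisfies the pattern."""
    return compile_path_pattern(pattern).fullmatch(encode_path(steps)) is not None


forward_path = [("a", False), ("c", False), ("c", False), ("d", False)]
mixed_path = [("b", False), ("a", True), ("c", False)]

first_ok = path_matches("a c* d", forward_path)
second_ok = path_matches("( b a-1 )+ c", mixed_path)
```

Because every property is terminated by a separator, a name such as `foo:prop` can never be confused with a longer name that merely starts with the same characters.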

SPARQLeR - example [Figure: RDF graph with resources r, s, e and ?x, class A, properties foo:prop and foo:rel, and rdf:type and rdfs:subPropertyOf edges] SELECT list(%path) WHERE { <r> %path <s> . %path rdf:_2 <e> . %path rdfms:entityResource ?x . ?x rdf:type <foo:A> FILTER ( length(%path) <= 6 && regex(%path, "(foo:prop -foo:rel)+", "dih") ) }

Experiments • Scalability • Modified DBLP datasets in RDF (added random citations) • Test on increasing dataset (adding older years of publications) • Search for cited publications (transitive): PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#> SELECT ?end_publication WHERE { <http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06> %path ?end_publication FILTER ( length(%path) <= 26 && regex(%path, "(opus:cites_publication)*") ) } B. Aleman-Meza et al. Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)

Experiments – dataset characteristics

Experiments – results: single source paths Search paths up to length 26

OmniCat Maciej Janik, Krys Kochut. “OmniCat: Automatic Text Classification with Dynamically Defined Categories”, 7th International Semantic Web Conference (ISWC 2008), Karlsruhe, Germany [submitted to] Maciej Janik, Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text Categorization", Second IEEE International Conference on Semantic Computing, ICSC 2008, Santa Clara, CA, USA, August 2008 [to appear] Maciej Janik, Krys Kochut. "Training-less Ontology-based Text Categorization", Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008) at the 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, 30 March 2008

Ontology • “An explicit specification of a conceptualization.” (1) • Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain. [Wikipedia] (1) Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5 (2). 199-220, 1993.

Ontology-based classification • Ontology IS the knowledge base and THE CLASSIFIER – no need for training set. • Rich instance base defines the known universe. • Schema with taxonomy describes the categorization structure. • Classification is based on recognized entities in text and semantic relationships between them. • Categories assigned are based on entity types, the taxonomy embedded in the schema and the provided categorization contexts.

OntoCategorization – bases • Probability • Document is classified based on probabilities that given feature (word, phrase) belongs to a certain category. • Similarity • Category is defined as ontology fragment (entities, classes, structures, etc.) • Similarity of document graph to given ontology fragment describes closeness to selected category • Connectivity (components) • Knowledge is based on associations. • Entities in one category should form a connected component, as they belong to the same subject. • Context • Specific entities, entity types, or semantic structures may be of different importance for user

Graph representation of text • Graph representation preserves (selected) structural information from document • Relative word positions to find closely co-occurring phrases. • Paragraph, formatting (e.g. emphasis), part of document. • Sample representations • Words form a directed graph, chained in order as they appear in each sentence. • Words form a weighted graph, where an edge connects words within a certain distance and its weight determines closeness. • Connected terms based on NLP processing or co-occurrence.
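As an illustration of the weighted-graph representation listed above, the sketch below connects words that occur within a fixed window and gives closer pairs a larger weight. The tokenizer, the window size and the `1 / distance` weighting are assumptions chosen for the example, not choices stated in the slides.

```python
import re
from collections import defaultdict


def word_window_graph(text, window=2):
    """Build an undirected weighted word graph.

    An edge links two different words that appear at most `window` positions
    apart; each co-occurrence adds 1 / distance to the edge weight.
    """
    words = re.findall(r"\w+", text.lower())
    graph = defaultdict(float)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            other = words[j]
            if other == word:
                continue  # skip self-loops
            edge = tuple(sorted((word, other)))
            graph[edge] += 1.0 / (j - i)
    return dict(graph)


graph = word_window_graph("Ford sells Jaguar", window=2)
```

In this toy sentence, adjacent words receive weight 1.0 and the pair two positions apart receives 0.5.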

Graph-based categorization • Categorization based on similarity metrics (1) • Isomorphism • Maximum common subgraph / minimum common supergraph • Graph edit distance • Statistical methods • Diameter, degree distribution, betweenness • Comparison of node neighbors • Distance preservation measure • Methods • k-NN – most straightforward • similarity to centroids – graph mean and graph median • term distance to category (1) Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic Techniques for Web Content Mining. World Scientific, London, 2005.

Classes and categories • Classes do not have to be categories • Classes • Form taxonomy / partonomy • Strict, formal requirements • Membership based on features • Categories • Can include other categories, intersect with them, etc. – more set-like approach • Category can be a complex structure of classes, relationships and instances • Topic of interest that can span multiple, normally unrelated classes in schema

OmniCat system

Algorithm sketch • Semantic graph construction • Conversion of an unstructured text into semantic graph • Thematic graph selection • Setting a topic by selection of graph(s) for categorization • Categorization using ontology • Bottom-up approach of category discovery • Top-down approach with categorization context projection

Semantic graph construction (1) • Named entity identification • Match known phrases (literals) from the ontology and assign an initial confidence weight • Each phrase is assigned a confidence level based on the uniqueness of entity identification • Number of times each phrase is matched suggests its importance in text • Text-phrase similarity is used when applying stop words removal or stemming
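A minimal sketch of this matching step, under stated assumptions: ontology literals are given as a plain dictionary of entity labels, matching is case-insensitive on word boundaries, and the confidence of a phrase is taken as 1 divided by the number of entities sharing it. The slides do not give OmniCat's actual confidence formula, so this one is purely illustrative.

```python
import re
from collections import defaultdict


def build_label_index(entity_labels):
    """Map each lower-cased label to the set of entities that carry it."""
    index = defaultdict(set)
    for entity, labels in entity_labels.items():
        for label in labels:
            index[label.lower()].add(entity)
    return index


def match_entities(text, label_index):
    """Return {entity: (confidence, match_count)} for labels found in the text."""
    lowered = text.lower()
    matches = {}
    for label, entities in label_index.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        count = len(re.findall(pattern, lowered))
        if count == 0:
            continue
        confidence = 1.0 / len(entities)  # ambiguous labels get lower confidence
        for entity in entities:
            best, total = matches.get(entity, (0.0, 0))
            # Overlapping labels (e.g. "ford" inside "ford motor co") are
            # counted separately in this sketch.
            matches[entity] = (max(best, confidence), total + count)
    return matches


# Hypothetical labels for entities named in the running example.
entity_labels = {
    "Ford Motor Company": ["Ford Motor Co", "Ford"],
    "Jaguar Cars": ["Jaguar"],
    "Jaguar (animal)": ["Jaguar"],
    "Land Rover": ["Land Rover"],
    "Alan Mulally": ["Alan Mulally"],
}
text = ("Ford Motor Co. is in the process of selling Jaguar and Land Rover, "
        "according to Ford CEO Alan Mulally.")
matches = match_entities(text, build_label_index(entity_labels))
```

Both Jaguar entities are kept with a reduced confidence; resolving which one is meant is left to the later graph-based steps.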

Example of entity matching • Text: "Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally." • Candidate entities matched in the ontology: Ford Motor Company, Process (computing), Business process, Process (science), Sales, Jaguar (animal), Jaguar Cars Ltd., Land_Rover, Chief Executive Officer, Alan_Mulally

Semantic graph construction (2) • Entity relationship extraction • NLP parse of each sentence to get dependency tree • Use previously matched phrases as clues for entities positions • If matched phrases are close in the parse tree, add a relationship between them in the final graph • OmniCat does not extract named relationships

Example – parse tree and triples Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.

Semantic graph construction (3) • Connectivity inducement • For each pair of matched entities find all relationships in the ontology • Each relationship has importance factor, based on semantics of information it defines
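The connectivity inducement step can be sketched as a filter over ontology triples: keep every relationship whose two ends were both matched in the text and attach an importance factor to it. The triples, their directions, the importance values and the default factor below are hypothetical placeholders.

```python
def induce_connectivity(matched_entities, ontology_triples, importance,
                        default_importance=1.0):
    """Return weighted edges for ontology relationships between matched entities."""
    matched = set(matched_entities)
    edges = []
    for subject, prop, obj in ontology_triples:
        if subject in matched and obj in matched and subject != obj:
            weight = importance.get(prop, default_importance)
            edges.append((subject, prop, obj, weight))
    return edges


# Hypothetical ontology fragment and importance factors.
ontology_triples = [
    ("Alan Mulally", "CEO_of", "Ford Motor Company"),
    ("Ford Motor Company", "parent_company", "Jaguar Cars"),
    ("Jaguar Cars", "named_after", "Jaguar (animal)"),
    ("Ford Motor Company", "located_in", "Dearborn"),
]
importance = {"CEO_of": 0.9, "parent_company": 1.0, "named_after": 0.1}
matched_entities = ["Ford Motor Company", "Jaguar Cars",
                    "Jaguar (animal)", "Alan Mulally"]

edges = induce_connectivity(matched_entities, ontology_triples, importance)
```

The `located_in` triple is dropped because `Dearborn` was not matched in the text, while the low factor on `named_after` keeps the animal only loosely attached to the car maker.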

Example – NLP + ontology knowledge • Text: "Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally." [Figure: semantic graph connecting Ford Motor Company, Jaguar Cars, Jaguar (animal), Land Rover, Alan Mulally and Chief Executive Officer, with edges labeled named_after, parent_company, sells, has_CEO, CEO_of and is_a]

Thematic graph selection (1) • Removal of specific types of entities (optional) • Specific for news documents • What? Who? • Content of the news • Where? When? • Date, time and place • Entities that may become hotspots in the created document graph

Thematic graph selection (2) • Entity weight propagation • Each entity is assigned an initial match weight • Entities are connected by relationships with a given importance factor • Propagate weight using the HITS (1) algorithm to find the best hub and authority entities • Best authoritative entities are most important for document categorization – core of the graph • Calculate centrality to find entities that are “topic landmarks” (1) Kleinberg, J.M., Authoritative Sources in a Hyperlinked Environment. in ACM-SIAM Symposium on Discrete Algorithms, (1998).
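For reference, a compact sketch of HITS on a weighted directed graph. It assumes that relationship importance factors act as edge weights and that initial match weights only seed the starting scores; how OmniCat combines these quantities is not specified in the slides, so the seeding and the fixed iteration count are illustrative.

```python
import math


def weighted_hits(edges, initial_weight, iterations=50):
    """Run HITS on (source, target, weight) edges; return (hub, authority) scores."""
    nodes = set(initial_weight)
    for source, target, _ in edges:
        nodes.update((source, target))

    hub = {node: initial_weight.get(node, 1.0) for node in nodes}
    authority = dict(hub)

    def normalize(scores):
        norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
        return {node: value / norm for node, value in scores.items()}

    for _ in range(iterations):
        # Authority: weighted sum of hub scores of the entities pointing at a node.
        new_authority = {node: 0.0 for node in nodes}
        for source, target, weight in edges:
            new_authority[target] += weight * hub[source]
        authority = normalize(new_authority)

        # Hub: weighted sum of authority scores of the entities a node points to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target, weight in edges:
            new_hub[source] += weight * authority[target]
        hub = normalize(new_hub)

    return hub, authority


# Hypothetical edges from the running example.
edges = [
    ("Alan Mulally", "Ford Motor Company", 1.0),
    ("Jaguar Cars", "Ford Motor Company", 1.0),
    ("Land Rover", "Ford Motor Company", 1.0),
]
hub, authority = weighted_hits(edges, {"Ford Motor Company": 1.0})
core_entity = max(authority, key=authority.get)
```

In this toy graph every edge points at `Ford Motor Company`, so it ends up as the single authority and would form the core of the thematic graph.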

Thematic graph selection (3) • Selection of the dominant thematic graph for categorization • Select the connected component that is largest and has maximum weight for further categorization • Based on the assumption that entities associated with the same or related topics are interconnected in the ontology • Effectively disambiguates many incorrectly matched entities • Focuses on one or a few major topics of a document
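One way to sketch the selection of the dominant thematic graph is to split the entity graph into connected components and keep the one with the highest total entity weight. Using the summed weight as the single ranking criterion is an assumption; the slides only say the chosen component is the largest and has maximum weight.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        component = set()
        stack = [node]
        seen.add(node)
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def dominant_thematic_graph(weights, edges):
    """Pick the component whose entities have the largest total weight."""
    components = connected_components(weights, edges)
    return max(components, key=lambda c: sum(weights.get(n, 0.0) for n in c))


# Hypothetical weights and edges from the running example.
weights = {
    "Ford Motor Company": 1.0,
    "Jaguar Cars": 0.5,
    "Land Rover": 1.0,
    "Alan Mulally": 1.0,
    "Jaguar (animal)": 0.5,
}
edges = [
    ("Ford Motor Company", "Jaguar Cars"),
    ("Ford Motor Company", "Land Rover"),
    ("Alan Mulally", "Ford Motor Company"),
]
dominant = dominant_thematic_graph(weights, edges)
```

The isolated `Jaguar (animal)` node falls outside the dominant component, which is how the component selection disambiguates the incorrectly matched entity.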

Thematic graph examples [Figure: example thematic graphs; node labels include Chief Executive Officer, Jaguar Cars, Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales, News, Business, Newspaper and Buyer]

Thematic graph categorization • Categorization concentrates on selected dominant thematic graph • Proposed methods • Bottom-up category discovery • Class-category mapping • Top-down category projection • Categorization based on context projection • Combination of categorization contexts for complex categories

Bottom-up categorization (1) • Category discovery approach • No category definitions are needed, only taxonomy from the ontology • Bottom-up approach – discover categories based on classification of entities • Best category should • Cover the largest portion of entities in the thematic graph • Be the most direct class possible for the entities • Include entities from the core of the graph
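The three criteria above can be approximated with a simple scoring sketch: each entity votes for its direct class and, with a decaying factor, for every ancestor of that class, and votes are scaled by the entity's weight so that core entities count more. The single-parent taxonomy, the `decay` value, and the classes `Companies` and `Vehicles` are hypothetical; this is not the scoring function used by OmniCat.

```python
from collections import defaultdict


def ancestors_with_distance(cls, parent_of):
    """Yield (class, distance) for a class and each of its ancestors."""
    distance = 0
    seen = set()
    while cls is not None and cls not in seen:
        yield cls, distance
        seen.add(cls)
        cls = parent_of.get(cls)
        distance += 1


def discover_categories(entity_class, entity_weight, parent_of, decay=0.5):
    """Rank candidate categories by weighted, distance-discounted entity coverage."""
    scores = defaultdict(float)
    for entity, cls in entity_class.items():
        weight = entity_weight.get(entity, 1.0)
        for ancestor, distance in ancestors_with_distance(cls, parent_of):
            # More direct classes (small distance) receive a larger share.
            scores[ancestor] += weight * decay ** distance
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical direct classes and taxonomy for the running example.
entity_class = {
    "Ford Motor Company": "Car Manufacturers",
    "Jaguar Cars": "Car Manufacturers",
    "Land Rover": "Off-road vehicles",
    "Alan Mulally": "Ford executives",
}
parent_of = {
    "Ford executives": "Ford people",
    "Ford people": "Ford",
    "Car Manufacturers": "Companies",
    "Off-road vehicles": "Vehicles",
}
entity_weight = {entity: 1.0 for entity in entity_class}

ranked = discover_categories(entity_class, entity_weight, parent_of)
```

With equal weights, `Car Manufacturers` ranks first because it is the direct class of two entities, while broader ancestors collect only discounted votes.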

Bottom-up class discovery

Bottom-up categorization (2) • External categories are given as a set of classes • In case of Wikipedia and external corpora, categories are defined as a mapping of appropriate Wikipedia categories • Previously discovered categories are matched with category definitions • Top-k are considered for matching • Matching until one category becomes dominant

Entities and categories [Figure: entities Jaguar Cars, Alan Mulally, Jaguar (animal), Ford Motor Company, Chief Executive Officer and Land Rover linked to categories Car Manufacturers, Felines, Living people, Off-road vehicles, Ford, Pantherinae, Ford people, Jaguar, Panthera and Ford executives]

Example Ford, utility ready to work on plug-in car Automaker, Southern California Edison to unveil alliance in response to demand for energy-efficient vehicles. DETROIT (Reuters) -- Ford Motor Co. and power utility Southern California Edison will announce an unusual alliance Monday aimed at clearing the way for a new generation of rechargeable electric cars, the companies said. Ford (Charts , Fortune 500) Chief Executive Alan Mulally and Edison International (Charts , Fortune 500) Chief Executive John Bryson are scheduled to meet with reporters at Edison's headquarters in Rosemead, Calif., the companies said. [...] Led by Toyota Motor Corp's (Charts) Prius, the current generation of hybrid vehicles uses batteries to power the vehicle at low speeds and in to provide assistance during stop-and-go traffic and hard acceleration, delivering higher fuel economy. General Motors Corp. (Charts , Fortune 500) has already begun work this year to develop its own plug-in hybrid car, designed to use little or no gasoline over short distances. The company showed off a concept version of the Chevrolet Volt in January at the Detroit Auto show and has awarded contracts to two battery makers to research advanced batteries for a possible production version.
