
Metadata Mining


Presentation Transcript


  1. Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel May 2004

  2. Outline • Metadata Mining • Proposed Approach • Experimental Results • Future Research

  3. Metadata • Definition: data about data, for example a library catalogue • Applications: cataloguing (items and collections); resource discovery; e-commerce and digital signatures; intelligent software agents; content rating; intellectual property (IP) rights; Semantic Web; learning objects (LO), described by LOM standards such as IEEE LOM, DC, SCORM, and CANCORE

  4. Metadata • Examples shown: LO metadata, map legend, RDF, library catalogue

  5. Metadata

  6. Metadata Specifications • Conceptual data architecture

  7. Metadata Specifications • More structured (usually semi-structured) • Low-dimensional (~400,000 vs. 15,791 terms) • More homogeneous than raw data • Less noisy (stopwords) • Less time-varying

  8. Metadata Specifications

  9. Motivations • No access to the raw data • The raw data is inconvenient for mining (heterogeneous and non-text formats) • Diversity of metadata standards and the need to merge different metadata repositories • An attempt to use background knowledge in the mining process

  10. Applications • Either an alternative or a complementary approach, depending on whether the raw data is accessible • Metadata enrichment (keyword extraction and automatic title generation), a step towards automatic metadata generation • Mining semi-structured data such as XML, RDF, OWL, and other semantically tagged documents • Semi-automatic ontology extraction • Information integration based on metadata (LO aggregation and integration)

  11. Outline • Metadata Mining • Proposed Approach • Experimental Results • Future Research

  12. Assumptions • There is a corpus or metadata collection, usually labeled • No natural language processing (NLP) and no computational linguistics • No semantic analysis or processing • Metadata is treated as a text document, not as a graph, tree, or other data structure • No dictionaries, thesauri, or other global vocabularies • Only frequency-based measures are used

  13. Problem Statement • Mining metadata as a text document with classical document mining techniques, while taking into account the background knowledge represented by the proposed data model and using a fuzzy approach to capture expert knowledge

  14. Metadata Representation • Metadata as a text document (semi-structured format) • Using only statistical, frequency-based measures (the location and order of words do not matter)

  15. Metadata Representation

  16. Metadata Representation

  17. Metadata Representation • Title’s objective semantics

  18. Metadata Representation • Whole document’s semantics about the title

  19. Metadata Representation • Title’s objective semantics • Title’s subjective semantics

  20. Metadata Representation • The meaning of a metadata record is broken into a set of objective semantics, which can be easily extracted with a parser (to extract the contents and attributes of every tag) • The document vector space model is adopted (so that all classical document mining techniques can be used) • Metadata representation model (multiple-partition document vector space model) = a set of objective semantics + Salton’s vector space model
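
The parsing step on this slide can be sketched as follows. This is a minimal illustration, not the author's parser: the record, its tag names, and the function `parse_partitions` are hypothetical, and only tag contents (not attributes) are extracted. Each tag becomes one partition holding raw term frequencies.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical metadata record; the tag names are illustrative only.
RECORD = """
<record>
  <title>Fuzzy metadata mining</title>
  <author>M. Makrehchi</author>
  <abstract>Mining metadata as a text document with fuzzy models</abstract>
</record>
"""

def parse_partitions(xml_text):
    """Return {tag: Counter of lower-cased terms} for each child element."""
    root = ET.fromstring(xml_text.strip())
    partitions = {}
    for child in root:
        # Whitespace tokenisation only: no NLP, in line with the assumptions slide.
        terms = (child.text or "").lower().split()
        partitions.setdefault(child.tag, Counter()).update(terms)
    return partitions

partitions = parse_partitions(RECORD)
```

Keeping one counter per tag preserves the objective semantics (which field a term came from) that a single bag of words would lose.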

  21. Metadata Representation • Diagram labels: concepts about the domain (discourse), background knowledge, expert, document

  22. Metadata Representation • Diagram labels: concepts about the domain (discourse), background knowledge (subjective), background knowledge (objective), expert, abstract, author, subject, title

  23. Metadata Representation • Metadata as a text document • The mining process is translated into classical document mining • Metadata mining has no advantage here, because the background knowledge is ignored

  24. Metadata Representation • Metadata as a multiple-partition text document • The mining process is translated into classical document mining • The background knowledge is taken into account

  25. Metadata Representation • Vector space model, fuzzy representation (figure labels: Vocabulary, di)

  26. Metadata Representation • Multi-partition vector space model, used to model the objective semantics (a portion of the background knowledge) (figure labels: Vocabulary, di)

  27. Metadata Representation • Multi-partition vector space model

  28. Metadata Representation • Converting to the standard vector space model

  29. Metadata Representation • Weight of each partition • To be determined by an expert, for example W_abstract = 1.0, W_title = 1.5
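
One way to read the conversion to a standard vector space model is a weighted sum of the partition vectors. The sketch below assumes exactly that; the slides' own formula did not survive in the transcript, so `merge_partitions` and its default weight are hypothetical. The two weights are the example values from the slide.

```python
from collections import Counter

# Example partition weights from the slide; an expert would tune these.
WEIGHTS = {"abstract": 1.0, "title": 1.5}

def merge_partitions(partitions, weights, default=1.0):
    """Collapse per-partition term counts into one weighted term vector.

    Assumption: the merged weight of a term is the sum over partitions of
    (partition weight * term frequency in that partition).
    """
    merged = Counter()
    for name, counts in partitions.items():
        w = weights.get(name, default)
        for term, tf in counts.items():
            merged[term] += w * tf
    return dict(merged)

doc = {
    "title": Counter({"metadata": 1, "mining": 1}),
    "abstract": Counter({"metadata": 2, "fuzzy": 1}),
}
vector = merge_partitions(doc, WEIGHTS)
# "metadata" gets 1.5 * 1 + 1.0 * 2 = 3.5
```

With these weights a term in the title counts one and a half times as much as the same term in the abstract, which is how the expert's background knowledge enters the otherwise classical model.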

  30. Metadata Representation • Membership degree of each term in every partition, determined either by an expert (considering the vocabulary) or statistically (considering the collection) • Statistical options: absolute frequency-based measures (like tf-idf) • relative frequency-based (geometric) measures (the location of each term in the partition)

  31. Frequency Measures • Types of frequency measures: • Within partition • Within document: document-term frequency (like tf) • Within class: class-term frequency (like term significance) • Within collection: collection-term frequency (like the mean of term significances)

  32. Class-Term Matrix • Document-term matrix, for comparison: • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary) • The matrix is sparse: usually only a small number of its elements are non-zero (Zipf's law) • The matrix is dual with respect to terms and documents

  33. Class-Term Matrix • Class-term matrix: • The matrix is large (tens of classes and millions of terms in the vocabulary) • The matrix is less sparse • The matrix is dual with respect to terms and classes

  34. Class-Term Matrix • Significance factor: represents the relationship between a term and a class

  35. Concept Terminology • Terminology: all terms that occur in a class (or concept) • Represented as a fuzzy set over all terms in the vocabulary

  36. Term Definition • Definition: all concepts (classes) to which the term belongs • Represented as a fuzzy set over all concepts (classes)
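
The class-term matrix and its two readings (a row as a concept's terminology, a column as a term's definition) can be sketched as below. The significance factor's formula is not in the transcript, so this sketch substitutes a simple assumption: class-term frequency divided by the largest frequency in that class, which yields memberships in [0, 1]. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def class_term_matrix(labeled_docs):
    """Build {class: {term: membership}} from (class_label, terms) pairs.

    Assumed significance factor: term count in the class divided by the
    highest term count in that class (a stand-in, not the slides' formula).
    """
    counts = defaultdict(Counter)
    for label, terms in labeled_docs:
        counts[label].update(terms)
    matrix = {}
    for label, ctr in counts.items():
        peak = max(ctr.values(), default=0)
        matrix[label] = {t: c / peak for t, c in ctr.items()} if peak else {}
    return matrix

def term_definition(matrix, term):
    """Column view: the fuzzy set of classes the term belongs to."""
    return {label: row[term] for label, row in matrix.items() if term in row}

docs = [
    ("AI", ["fuzzy", "logic", "fuzzy"]),
    ("AI", ["fuzzy", "agent"]),
    ("DB", ["query", "index", "query"]),
]
M = class_term_matrix(docs)
terminology_ai = M["AI"]                      # row view: terminology of class "AI"
definition_fuzzy = term_definition(M, "fuzzy")  # column view: definition of "fuzzy"
```

Because there are only tens of classes, each row aggregates many documents, which is why this matrix is less sparse than the document-term matrix.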

  37. Term Relationships

  38. Fuzzy Similarity • For example, “sum” and “product” are similar if they partially (fuzzily) co-occur • Similarity is a bidirectional relationship

  39. Fuzzy Inclusion • For example, “web” includes “world” if “world” occurs wherever “web” appears • Inclusion is a one-directional relationship
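
The two relations can be illustrated with textbook fuzzy-set measures applied to term definitions (fuzzy sets over classes). The slides announce new definitions whose formulas are missing from the transcript, so the min/max ratio and the subsethood ratio used here are stand-ins, and the membership values are invented for the example.

```python
def fuzzy_similarity(a, b):
    """Symmetric overlap: sum of minima over sum of maxima (fuzzy Jaccard)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

def fuzzy_inclusion(a, b):
    """Degree to which b is present wherever a is present (not symmetric)."""
    total = sum(a.values())
    overlap = sum(min(v, b.get(k, 0.0)) for k, v in a.items())
    return overlap / total if total else 0.0

# Hypothetical term definitions: membership of each term in each class.
web = {"networks": 0.8, "ir": 0.6}
world = {"networks": 0.9, "ir": 0.7, "graphics": 0.5}

sim = fuzzy_similarity(web, world)
web_includes_world = fuzzy_inclusion(web, world)   # "world" occurs wherever "web" does
world_includes_web = fuzzy_inclusion(world, web)   # the reverse holds only partially
```

In this toy data "world" is at least as strong as "web" in every class where "web" appears, so the first inclusion is complete while the reverse is not, matching the one-directional reading on the slide.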

  40. Fuzzy Document Model • Fuzzy document representation: a fuzzy set over all terms in the vocabulary • The keyword set of the document is obtained either by thresholding the fuzzy set or by term categorization
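
The thresholding option amounts to an alpha-cut of the document's fuzzy set. A minimal sketch, assuming memberships in [0, 1]; the cut-off `alpha` and the sample memberships are hypothetical.

```python
def keyword_set(fuzzy_doc, alpha=0.5):
    """Return terms whose membership is at least alpha, strongest first."""
    kept = [(term, m) for term, m in fuzzy_doc.items() if m >= alpha]
    # Sort by descending membership, then alphabetically for stable output.
    return [term for term, _ in sorted(kept, key=lambda p: (-p[1], p[0]))]

doc = {"metadata": 0.9, "mining": 0.7, "the": 0.1, "fuzzy": 0.5}
keywords = keyword_set(doc)
```

Raising `alpha` shrinks the keyword set; the other option named on the slide, term categorization, replaces the single cut-off with the class and collection frequencies of each term.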

  41. Term Categories • Term categorization: all terms are grouped into three main categories • Features: the most frequent terms within a class • Keywords: the more frequent terms within some documents belonging to a given class • Stopwords: the more frequent terms in all classes • Figure labels: stopwords, features, keywords; collection, class, document

  42. Domain-specific Stopwords • Stopwords do not contribute to the meaning of the document • General stopwords (a, an, the, in, …) • Domain-specific stopwords, for example: Politics: government, state; Medicine: patient; Education: learner, instructor; Social sciences: society; Anthropology: human

  43. Class-Collection Map • Introduces the class-collection map to visualize the location of each term category • Axes: within-collection frequency (horizontal), within-class frequency (vertical) • Regions: features, stopwords, keywords

  44. Class-Collection Map (same labels as slide 43)

  45. Class-Collection Map (same labels as slide 43)
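
A rough reading of the class-collection map as a decision rule is sketched below. The region boundaries are not recoverable from the transcript, so the single cut-off `hi`, the function `categorize`, and the assumption that both frequencies are normalised to [0, 1] are all hypothetical.

```python
def categorize(within_class, within_collection, hi=0.5):
    """Name the map region a term falls into.

    Assumed rule: frequent across the whole collection -> stopword;
    frequent in this class only -> feature; otherwise -> keyword.
    """
    if within_collection >= hi:
        return "stopword"   # common in all classes, carries little meaning
    if within_class >= hi:
        return "feature"    # characteristic of this class
    return "keyword"        # frequent only in some documents of the class

labels = {
    "patient": categorize(within_class=0.9, within_collection=0.8),
    "tumour": categorize(within_class=0.7, within_collection=0.1),
    "biopsy": categorize(within_class=0.2, within_collection=0.05),
}
```

Under this rule a term such as "patient" in a medical collection lands in the stopword region even though it is not a general stopword, which is the idea behind domain-specific stopwords.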

  46. Contributions • A model for metadata representation: the multiple-partition document vector space model • A class-term model representing the relationship between classes and the vocabulary • New fuzzy representations for documents, terms, and concepts (word definition and terminology)

  47. Contributions • A class-collection map to visualize the distribution of terms • A fuzzy model representing the keyword set of a document • New definitions for fuzzy similarity, fuzzy inclusion, and fuzzy term relations, based on the fuzzy term definition • A framework for extracting domain-specific stopwords

  48. Outline • Problem Statement: Metadata Mining • Proposed Approach • Experimental Results • Future Research

  49. Data Set • CiteSeer computer science directory (http://citeseer.ist.psu.edu/directory.html) • ~400,000 terms (vocabulary size) • 17 classes • 2,912 documents • Instead of the raw data (in PDF or PS), we collected BibTeX data (a kind of metadata or catalogue) and the abstracts of the articles

  50. Data Set
