
Learning Object Metadata Mining



Presentation Transcript


  1. Learning Object Metadata Mining • Masoud Makrehchi • Supervisor: Prof. Mohamed Kamel

  2. Outlines • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Conclusion Remarks Makrehchi & Kamel

  3. Metadata Mining • Metadata Definition • Data about data, for example a library catalogue • Metadata Applications: • Cataloging (Items and Collections) • Resource Discovery • Electronic Commerce and Digital Signatures • Intelligent Software Agents • Content Rating • Intellectual Property Rights • Semantic Web • Learning Objects • LOM Standards: IEEE LOM, DC, SCORM, CanCore

  4. Metadata Mining • Definition • Extraction of implicit, previously unknown, and potentially useful information from metadata. • Methods • Classification, clustering, summarization, association rule mining, ontology extraction, information integration, keyword extraction, automatic title generation.

  5. Metadata Mining • Why metadata mining? • No access to the data itself (lack of raw data), • The data is not convenient for mining (heterogeneous or non-text formats), • Diversity of metadata standards, and the need to merge different metadata repositories, • Ontology extraction is much easier at the metadata level.

  6. Metadata Mining • [Figure: Conceptual data architecture]

  7. Metadata Mining • Applications • Metadata mining instead of raw data mining, • Metadata enrichment (keyword extraction), • (Semi-)automatic ontology extraction, • Mining RDF, OWL and other semantically tagged scripts, • Information integration (LO aggregation and integration).

  8. Metadata Mining • Statistical methods based on word frequency analysis, • Syntactic methods based on linguistic parsing and pattern matching, • Structural methods that study the outline of the document, • Conceptual (semantic) methods based on the use of a knowledge base to interpret meaning.

  9. Metadata Mining • We don’t use • Natural Language Processing (NLP), • Semantic analysis and processing, • Graphs, trees and other sophisticated data structures and models, • Dictionaries, thesauruses, or any other global vocabularies (only a simple Porter stemmer).

  10. Outlines • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Conclusion Remarks

  11. Metadata Representation Model • We treat metadata as a text document (semi-structured format), • The only measures are • statistical measures (such as frequency), • geometric features (such as the location of a specific term, or the order of words in a term or phrase).

  12. Metadata Representation Model • Vector Space Model • [Figure labels: vocabulary T, document d_i]

  13. Metadata Representation Model • Multi-Partition Vector Space Model • [Figure labels: vocabulary T, document d_i]

  14. Metadata Representation Model • Multi-Partition Vector Space Model

  15. Metadata Representation Model • Converting to standard vector model

  16. Metadata Representation Model • Weight of each partition • To be determined by an expert, for example: W_abstract = 1.0, W_title = 1.5. • Membership degree of each term in every partition • By an expert, • Frequency-based measures (tf-idf), • Geometric measures (location of each term in the partition).
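The conversion from the multi-partition model to a standard vector can be pictured as a weighted sum of per-partition term vectors. What follows is a minimal sketch under stated assumptions: the slides do not give the combination rule, so the weighted sum, the use of raw counts as membership degrees, the function name `combine_partitions`, and the sample record are all illustrative. Only the example weights (1.0 for the abstract, 1.5 for the title) come from the slide.

```python
from collections import Counter

# Partition weights from the slide example; an expert would set any others.
PARTITION_WEIGHTS = {"title": 1.5, "abstract": 1.0}

def combine_partitions(record, weights=PARTITION_WEIGHTS):
    """Collapse a multi-partition record into one term -> weight vector.

    Assumption: a term's membership degree in a partition is its raw count
    there, and partitions are combined by a weighted sum.
    """
    vector = Counter()
    for partition, text in record.items():
        w = weights.get(partition, 1.0)  # hypothetical default for unlisted partitions
        for term, count in Counter(text.lower().split()).items():
            vector[term] += w * count
    return dict(vector)

# Hypothetical metadata record with two partitions.
record = {"title": "metadata mining", "abstract": "mining learning object metadata"}
vector = combine_partitions(record)
```

With these weights, a term that appears once in the title and once in the abstract ends up with a higher weight than a term that appears only in the abstract, which is the effect the partition weights are meant to have.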

  17. Outlines • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Conclusion Remarks

  18. Class-Term Matrix • Document-Term Matrix (Collection × Vocabulary) • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary), • The matrix is sparse: usually only a small number of its elements are nonzero (Zipf's law), • The matrix is dual with respect to terms and documents.

  19. Class-Term Matrix • Class-Term Matrix (Class × Vocabulary) • The matrix is large (tens of classes and millions of terms in the vocabulary), • The matrix is less sparse, • The matrix is still dual with respect to terms and classes.
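One way to see why the class-term matrix is smaller and denser than the document-term matrix is to build it by merging the rows of all documents that share a class label. The sketch below assumes that class-term frequency is the plain sum of term counts over a class's documents; the slides do not state the formula, and the function name `class_term_matrix` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict

def class_term_matrix(documents):
    """Aggregate (class_label, tokens) pairs into class -> term -> frequency.

    Assumption: class-term frequency is the sum of the term counts of all
    documents labelled with that class.
    """
    matrix = defaultdict(Counter)
    for label, tokens in documents:
        matrix[label].update(tokens)
    return matrix

# Three toy documents in two classes (illustrative data only).
docs = [
    ("databases", ["query", "index", "query"]),
    ("databases", ["index", "transaction"]),
    ("networking", ["packet", "query"]),
]
ctm = class_term_matrix(docs)
```

Three document rows collapse into two class rows here, and each class row has at least as many nonzero entries as any single document row it absorbed.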

  20. Class-Term Matrix • Class-term frequency • Term significance measure • Normalized term significance measure

  21. Class-Term Matrix

  22. Class-Term Matrix • Terminology • All terms which occur in a class (or concept) • A fuzzy set of all terms in the vocabulary

  23. Class-Term Matrix • Definition • All concepts (classes) which the term belongs to • A fuzzy set of all concepts (classes)
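Terminology and definition can be read as the two views of one normalized class-term matrix: a row gives a fuzzy set over the vocabulary, and a column gives a fuzzy set over the classes. The sketch below is only an illustration. The membership function used here, a term's frequency in a class divided by its total frequency over all classes, is an assumption standing in for the normalized term significance measure, whose formula the transcript does not preserve. The function names and the toy matrix are hypothetical.

```python
def memberships(ctm):
    """Return mu[c][t] = f(c, t) / sum over classes of f(., t).

    Assumption: this normalization stands in for the slides' normalized
    term significance measure, which is not given in the transcript.
    """
    totals = {}
    for row in ctm.values():
        for term, freq in row.items():
            totals[term] = totals.get(term, 0) + freq
    return {c: {t: row.get(t, 0) / totals[t] for t in totals} for c, row in ctm.items()}

def terminology(mu, cls):
    """Fuzzy set over the vocabulary: membership of every term in one class."""
    return dict(mu[cls])

def definition(mu, term):
    """Fuzzy set over the classes: membership of one term in every class."""
    return {c: row[term] for c, row in mu.items()}

# Toy class-term frequencies (illustrative data only).
ctm = {"databases": {"query": 3, "index": 2}, "networking": {"query": 1, "packet": 4}}
mu = memberships(ctm)
```

Here `definition(mu, "query")` spreads the term over both classes, while `terminology(mu, "networking")` assigns every vocabulary term a degree of membership in the class, including zero for terms that never occur in it.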

  24. Outlines • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Conclusion Remarks

  25. Case Study • Data set • There is no available LO metadata repository, • CiteSeer computer science directory (http://citeseer.ist.psu.edu/directory.html) • ~400,000 terms (vocabulary size) • 17 classes • 2,912 documents • Instead of the data itself (in PDF or PS), we collected BibTeX data (a kind of metadata or catalogue) and the abstracts of the articles.

  26. Case Study

  27. Case Study

  28. Case Study • Types of Frequency Measures • Within document: document-term frequency (like tf-idf) • Within class: class-term frequency (like term significance) • Within collection: collection-term frequency (like the mean of term significances)

  29. Case Study • Term Clustering: categorizing all terms into three main groups • Features: terms that are more frequent within a class • Keywords: terms that are more frequent within some documents belonging to a given class • Stopwords: terms that are more frequent in all classes • Introducing the Class-Collection Map • To visualize the location of each category
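The three-way split can be illustrated with a crisp rule over two coordinates per term: its significance within the class and its significance across the collection. This is a simplified stand-in and not the authors' method. The slides use a fuzzy inference on the class-collection map, whose rules are not given; the threshold `hi`, the function name `categorize`, and the assumption that both scores lie in [0, 1] are hypothetical.

```python
def categorize(class_sig, collection_sig, hi=0.5):
    """Assign a term to one of the three groups for a given class.

    Crisp stand-in for the fuzzy inference; `hi` is a hypothetical threshold
    and both scores are assumed to lie in [0, 1].
    """
    if collection_sig >= hi:
        return "stopword"  # frequent in all classes
    if class_sig >= hi:
        return "feature"   # frequent in this class but not collection-wide
    return "keyword"       # frequent only in some documents of the class

# Hypothetical (class significance, collection significance) pairs.
labels = [categorize(0.9, 0.8), categorize(0.9, 0.1), categorize(0.2, 0.1)]
```

Every term receives exactly one label, so the three groups partition a class's vocabulary, as the term clustering on the slide does.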

  30. Case Study

  31. Case Study

  32. Case Study

  33. Case Study • Extraction of Stopwords (terms that do not contribute to the meaning of the document) • General stopwords (a, an, the, in, …) • Domain-specific stopwords • Politics: Government, State, • Medicine: Patient, • Education: Learner, Instructor, • Social sciences: Society, • Anthropology: Human.

  34. Case Study • Why do we need to remove domain-specific stopwords? • Dimensionality reduction, • Accurate feature selection (information gain has the drawback of selecting noise as features), • Based on stopwords, we can find and separate phrases (by our definition, a phrase is a set of words between two stopwords).
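The phrase definition above (the words between two stopwords) translates directly into a single pass over the token stream. The sketch below is a minimal illustration: the function name `split_phrases`, the stopword list, and the sample sentence are hypothetical, and treating the start and end of the text as phrase boundaries is an added assumption.

```python
def split_phrases(tokens, stopwords):
    """Return maximal runs of consecutive non-stopword tokens.

    Assumption: the beginning and end of the token list also act as
    phrase boundaries, in addition to stopwords.
    """
    phrases, current = [], []
    for tok in tokens:
        if tok.lower() in stopwords:
            if current:  # a stopword closes the phrase in progress
                phrases.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    return phrases

# General stopwords plus one domain-specific stopword ("learner") for education.
stopwords = {"a", "the", "of", "for", "in", "learner"}
tokens = "a vector space model for the learner in metadata mining".split()
phrases = split_phrases(tokens, stopwords)
```

Adding the domain-specific stopword to the list matters here: without it, "learner" would be returned as a one-word phrase of its own.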


  41. Case Study • Dimensionality reduction process • ~400,000 terms → (using metadata) → 15,971 → (stemming) → 12,044 → (multi-partition document vector space model) → 5,605 → (fuzzy-based term clustering) → 226 features, 507 stopwords, 4,872 keywords

  42. Outlines • Metadata Mining • Metadata Representation Model • Class-Term Matrix • Case Study • Conclusion Remarks

  43. Conclusion Remarks • Most statistics-based data mining methods do not use domain knowledge. • Metadata (semi-structured data) mining uses the domain knowledge embedded in tags and partitions. • We introduced the multi-partition document vector space model. • We mine the class-term matrix in addition to the document-term matrix.

  44. Conclusion Remarks • Based on the visualization model (class-collection map) and a fuzzy inference, we can cluster the vocabulary for each class and extract three essential categories: • Features: to classify unknown documents, • Keywords: for indexing and access to specific documents in IR applications, • Stopwords: for dimensionality reduction and noise removal.

  45. Conclusion Remarks • Based on the class-term matrix, we defined • Terminologies as fuzzy sets of all terms in the vocabulary, • Definitions as fuzzy sets of all concepts.

  46. Conclusion Remarks • Future Work • Collecting LO metadata and constructing an LO metadata repository, • A keyword recall method to test and validate extracted keywords, • Implementing an average classifier (KNN or fuzzy classifier) to test and validate selected features, • Applying a multi-classifier architecture to the metadata mining problem.
