Metadata in Carrot II

Presentation Transcript


  1. Metadata in Carrot II
  • Current metadata
    • TF.IDF for both documents and collections (a sketch of this computation follows below)
    • Full-text index
    • Metadata are transferred between different nodes
  • Potential problems
    • Storage cost: metadata size is huge
    • Computation cost: computation time is long
    • Communication cost: metadata transfer time is long
    • Semantics: TF.IDF weights capture little of the text's semantic meaning
  • Goal
    • A more efficient mechanism to represent documents and collections
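As a concrete illustration of the storage issue, here is a minimal sketch of per-document TF.IDF weights over a small document set. The function name `tf_idf`, the whitespace tokenization, and the weighting details are illustrative assumptions, not Carrot II's actual metadata format.

```python
# Minimal TF.IDF sketch: one weight table per document.
# Illustrative only; Carrot II's real metadata format may differ.
import math
from collections import Counter

def tf_idf(documents):
    """Return a {doc_id: {term: weight}} map of TF.IDF weights."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    n_docs = len(documents)
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

docs = {
    "d1": "metadata for document collections",
    "d2": "full text index of document metadata",
}
print(tf_idf(docs)["d1"])
```

Because every document contributes a weight entry for every distinct term, these tables grow with the vocabulary, which is exactly the storage and transfer cost the slide points out.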

  2. Proposed Approach
  • Sources for metadata generation
    • Text summaries vs. full text
    • Multi-document summarization on collections
  • Metadata organization
    • Topic hierarchy (a possible node layout is sketched below)
  • Automatic metadata generation
    • Statistical language model
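One possible shape for hierarchy-organized metadata is a tree whose nodes carry a topic label, a collection-level summary in place of full-text metadata, and child topics. The `TopicNode` class below is an assumption made for illustration, not Carrot II's design.

```python
# Hypothetical topic-hierarchy node: label + collection summary + children.
# An illustrative assumption, not Carrot II's actual data structure.
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    label: str
    summary: str = ""                      # multi-document summary for this topic's collection
    children: list["TopicNode"] = field(default_factory=list)

    def find(self, label):
        """Depth-first lookup of a topic by label."""
        if self.label == label:
            return self
        for child in self.children:
            hit = child.find(label)
            if hit:
                return hit
        return None

root = TopicNode("root", "All collections", [
    TopicNode("information retrieval", "Documents about indexing and ranking"),
    TopicNode("summarization", "Documents about single- and multi-document summarization"),
])
print(root.find("summarization").summary)
```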

  3. Document (Text) Summarization
  • Document summarization (DS)
    • "The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)." (Mani & Maybury, 1999)
    • Full text can be reduced to an abstract without losing too much useful information (a toy extractive sketch follows this slide)
  • Multi-document summarization (MDS)
    • Works on related documents (same topic)
    • Can capture relations across documents
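To make the idea of reducing full text to an abstract concrete, here is a toy extractive summarizer that keeps the k sentences whose terms carry the highest average TF.IDF weight. It is purely illustrative; the function name, the sentence split on periods, and the scoring rule are assumptions, not the summarizer used in Carrot II.

```python
# Toy extractive summarizer: keep the k sentences whose terms have the
# highest average TF.IDF weight. Illustrative only.
def summarize(text, weights, k=2):
    """weights: {term: tf.idf weight} for this document."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    def score(sentence):
        terms = sentence.lower().split()
        return sum(weights.get(t, 0.0) for t in terms) / max(len(terms), 1)

    ranked = sorted(sentences, key=score, reverse=True)
    selected = set(ranked[:k])
    # Emit the selected sentences in their original order
    return ". ".join(s for s in sentences if s in selected) + "."

doc = "Metadata size is huge. Full text can be summarized. Summaries keep the key terms."
w = {"metadata": 0.4, "summaries": 0.3, "terms": 0.2}
print(summarize(doc, w, k=2))
```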

  4. Language Model
  • Language model
    • An approximation to real language
    • Tries to explain already observed phenomena or predict future behavior
    • A probability distribution over strings in a finite alphabet
  • Basic idea when used in IR
    • Infer a language model for each document
    • Estimate the probability of generating the query according to each model, rather than estimating the probability of relevance of each document to the query
    • Rank the documents according to these probabilities (see the sketch below)
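A minimal sketch of this query-likelihood ranking is given below: a unigram model is estimated per document and documents are ordered by the (log) probability of generating the query. The Jelinek-Mercer smoothing with a collection model and the parameter `lam` are assumptions made here to keep the example short; Ponte's thesis uses a different estimator.

```python
# Query-likelihood sketch: score(d, q) = sum_t log P(t | M_d),
# with Jelinek-Mercer smoothing against the collection model (an assumption).
import math
from collections import Counter

def rank(documents, query, lam=0.5):
    tokenized = {d: text.lower().split() for d, text in documents.items()}

    # Collection-wide term counts for smoothing
    collection = Counter()
    for tokens in tokenized.values():
        collection.update(tokens)
    coll_len = sum(collection.values())

    scores = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        log_p = 0.0
        for t in query.lower().split():
            p_doc = tf[t] / len(tokens)
            p_coll = collection[t] / coll_len
            log_p += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
        scores[d] = log_p
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "metadata for document collections",
    "d2": "language model for information retrieval",
}
print(rank(docs, "language model"))   # d2 ranks first
```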

  5. References
  • Inderjeet Mani, Mark T. Maybury. Advances in Automatic Text Summarization. MIT Press, 1999.
  • J. Ponte. A Language Modeling Approach to Information Retrieval. PhD thesis, Dept. of Computer Science, University of Massachusetts, Amherst, 1998.
  • Michael P. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.
