An information retrieval system for parliamentary documents


Presentation Transcript


  1. An information retrieval system for parliamentary documents
     Book: Bayesian Networks: A Practical Guide to Applications
     Chapter: 12
     Paper authors: Luis M. de Campos, Juan M. Fernandez-Luna, Juan F. Huete, Carlos Martín, Alfonso E. Romero
     CSE 655 Probabilistic Reasoning
     Faculty of Computer Science, Institute of Business Administration
     Presented by Quratulain

  2. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  3. Introduction/Motivation
     • To fulfil the objectives of democracy, all activities of parliament need to be made public.
     • Previously, this information was sent in printed form to all official organizations and libraries.
     • Currently, electronic documents are published on the web, which is faster, cheaper and easier.
     • The official bulletin and the transcripts of all speeches in the different sessions are edited and then published on the website as PDF.
     • The documents are accessible only through database-like queries.

  4. Problems
     • To access information, the user must already know:
       • the session number
       • the date of the legislature
     • This makes the information difficult to access.

  5. Goal
     • A website with a real, content-based search engine.
     • Information is accessed with a natural language query.
     • The system returns the documents relevant to the query.
     • The output is a set of document components of varying granularity (from a complete document down to a single paragraph), sorted by degree of relevance.
     ** This avoids manual search **

  6. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  7. Overview of information retrieval
     • Information retrieval (IR) is concerned with the representation, storage, organization and accessing of information items.
     • An information retrieval system works as follows:
       • It is given a set of documents.
       • Pre-processing:
         • remove words that are not useful for search (stopwords)
         • convert each word to its stem (this reduces the vocabulary)
       • Each word is associated with a weight expressing its importance (in the document or in the whole collection).
       • The natural language query is indexed in the same way, and its representation is matched against the stored documents using an IR model.
       • Finally, a set of document identifiers is presented to the user, sorted by degree of relevance.
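
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the system described in the chapter: the stopword list, the crude `stem` function and the tf-idf style weighting are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "on"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Return tf-idf style weights as {doc_id: {term: weight}}."""
    term_counts = {d: Counter(preprocess(t)) for d, t in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n = len(docs)
    return {
        d: {t: tf * math.log(1 + n / doc_freq[t]) for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }

def search(index, query):
    """Index the query like a document and rank documents by summed weight."""
    q_terms = preprocess(query)
    scores = {d: sum(w.get(t, 0.0) for t in q_terms) for d, w in index.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [d for d, s in ranked if s > 0]

docs = {
    "d1": "Budget debate in the parliament",
    "d2": "Education law and schools",
    "d3": "Parliament votes on the education budget",
}
index = build_index(docs)
results = search(index, "education budget")
```

Here `results` lists the identifiers of matching documents, best match first; the document that contains both query terms is ranked ahead of those containing only one.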

  8. Overview of information retrieval
     • Standard IR treats documents as atomic entities.
     • XML allows structured documents with semantics.
     • Structured IR views a document as an aggregate of interrelated structural elements, each of which is indexed.
     • Structured IR models exploit both the content and the structure of documents to estimate the relevance of document components to a query.

  9. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  10. Bayesian Networks and information retrieval
     • Bayesian networks (BNs) were first applied to IR in the early 1990s by Turtle and Croft.
     • BN-based IR models compute the probability of relevance given a document and a query.
     • Two important BN models within IR:
       • Belief network model
       • Bayesian network retrieval model
     • Common features:
       • Each index term and each document is represented as a node in the network.
       • Links connect each document node with all the term nodes that index it.
     • The models differ in:
       • The direction of the arcs.
       • Additional arcs (relationships between documents and between terms).

  11. BN-based retrieval model
     [Figure: a layer of term nodes T1 to T7 linked to a layer of document nodes D1 to D3.]

  12. Drawbacks of Bayesian networks
     • The time and space required to assess the probability distributions and store them (the number of conditional probabilities per node grows exponentially with its number of parent nodes).
     • The efficiency of inference, because general inference in BNs is an NP-hard problem.
     Therefore, the direct approach, in which the evidence contained in a query is propagated through the whole network, is infeasible.
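
To make the storage problem concrete: a binary node with k binary parents needs one conditional probability for each of the 2^k parent configurations. The short sketch below just evaluates that count; the function name is illustrative.

```python
def cpt_entries(num_parents):
    """Parent configurations of a binary node whose parents are all binary."""
    return 2 ** num_parents

# A document indexed by a few dozen terms already needs an impractically large table.
sizes = {k: cpt_entries(k) for k in (5, 10, 20, 50)}
for k, n in sizes.items():
    print(f"{k} parents -> {n} configurations")
```

A document node typically has far more than 50 term parents, which is why explicitly stored tables are not an option in this setting.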

  13. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  14. Theoretical foundations
     • A set of documents D = {D1, D2, ..., DM}.
     • A set of terms used to index these documents.
     • Each document Di is organized hierarchically; the elements of this hierarchy are called structural units.
     • The structural units of a document form a tree, as in a scientific article, for example.

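  15. The structure of a scientific article
     [Figure: tree rooted at Document 1, whose units include Title, Author, Abstract, Section 1, Section 2 and Bibliography. Sections contain a Title, paragraphs (Parag 1, Parag 2) and subsections (Subsec 1, Subsec 2); the Bibliography contains references (Ref 1, Ref 2). Index terms are attached at the bottom level.]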

  16. BN model for document
     • The BN model of a document contains three kinds of nodes:
       • Term nodes, T = {T1, T2, ..., Tl}
       • Basic structural units, Ub = {B1, B2, ..., Bm}
       • Complex structural units, Uc = {S1, S2, ..., Sm}
     • The set of all structural units is U = Ub ∪ Uc.
     • Each node T, B or S is associated with a binary random variable taking values in {t-, t+}, {b-, b+} or {s-, s+} respectively, where (-) means not relevant and (+) means relevant.
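
One way to picture these three kinds of nodes is as a small tree data structure. The sketch below is an illustration only: the `Unit` class, its fields and the sample article are hypothetical and are not taken from the chapter or from PAIRS. It treats a unit with no sub-units as basic (directly indexed by terms) and any other unit as complex.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A structural unit; all names here are illustrative."""
    name: str
    children: list = field(default_factory=list)  # sub-units; empty for a basic unit
    terms: list = field(default_factory=list)     # index terms; used by basic units

    def is_basic(self):
        return not self.children

def basic_units(unit):
    """Collect the basic units (leaves) under a unit, left to right."""
    if unit.is_basic():
        return [unit]
    found = []
    for child in unit.children:
        found.extend(basic_units(child))
    return found

# A toy article: the document and Section 1 are complex units, the rest are basic.
article = Unit("Document 1", children=[
    Unit("Title", terms=["retrieval", "parliament"]),
    Unit("Section 1", children=[
        Unit("Parag 1", terms=["bayesian", "network"]),
        Unit("Parag 2", terms=["query"]),
    ]),
])

leaf_names = [u.name for u in basic_units(article)]
```

In this toy layout, terms point only to basic units and each unit belongs to exactly one containing unit, so the structure is a tree.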

  17. BN model for document
     [Figure: Bayesian network with term nodes T1 to T16, basic structural units B1 to B7 and complex structural units S1 to S4. The parent sets of distinct complex units are disjoint: Pa(S1) ∩ Pa(S2) = ∅ for S1 ≠ S2 in Uc.]

  18. BN for document
     • Conditional probabilities to be specified:
       • P(t+)
       • P(b+|pa(B))
       • P(s+|pa(S))
     • Because nodes have a large number of parents, an efficient inference procedure is needed.
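
The slide does not say which efficient procedure is used. One possibility, assumed here purely for illustration, is an additive canonical model: P(u+|pa(U)) is the sum of the weights of those parents that are relevant, with the weights of each unit summing to at most 1. Under that assumption the posterior of a unit is a weighted sum of its parents' posteriors, so a single bottom-up pass suffices and no exponential table is stored. The weights, node names and probabilities below are made up.

```python
def propagate(weights, term_posteriors, order):
    """Bottom-up pass under the assumed additive model.

    weights: {unit: {parent: weight}}, weights per unit summing to at most 1.
    term_posteriors: {term: P(t+ | query)}.
    order: units listed so that every parent appears before its child.
    """
    posterior = dict(term_posteriors)
    for unit in order:
        posterior[unit] = sum(w * posterior[p] for p, w in weights[unit].items())
    return posterior

# Hypothetical network: terms -> basic units B1, B2 -> complex unit S1.
weights = {
    "B1": {"T1": 0.6, "T2": 0.4},
    "B2": {"T2": 0.5, "T3": 0.5},
    "S1": {"B1": 0.5, "B2": 0.5},
}
# T1 appears in the query (probability 1); the others keep a small prior.
term_posteriors = {"T1": 1.0, "T2": 0.1, "T3": 0.1}

posterior = propagate(weights, term_posteriors, ["B1", "B2", "S1"])
```

The cost is linear in the number of arcs, which is what makes networks with thousands of term parents workable under this kind of model.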

  19. Influence Diagram Model
     • Once the BN has been constructed, it is transformed into an influence diagram by adding decision and utility nodes.
     • Chance nodes: the nodes of the previous BN
     • Decision node :
     • Utility node :
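
In an influence diagram, a decision is typically made by comparing expected utilities. The sketch below shows that idea for a single structural unit with two options, retrieve or skip. The utility values and function names are hypothetical and are not taken from the chapter; the probability of relevance is assumed to come from inference in the BN.

```python
def expected_utility(p_relevant, u_if_relevant, u_if_not_relevant):
    """Expected utility of one decision given P(unit relevant | query)."""
    return p_relevant * u_if_relevant + (1.0 - p_relevant) * u_if_not_relevant

def should_retrieve(p_relevant, utilities):
    """Retrieve the unit when that decision has the higher expected utility."""
    eu_retrieve = expected_utility(
        p_relevant, utilities[("retrieve", True)], utilities[("retrieve", False)]
    )
    eu_skip = expected_utility(
        p_relevant, utilities[("skip", True)], utilities[("skip", False)]
    )
    return eu_retrieve > eu_skip

# Hypothetical utilities, keyed by (decision, unit is relevant).
UTILITIES = {
    ("retrieve", True): 1.0,
    ("retrieve", False): -0.5,
    ("skip", True): -1.0,
    ("skip", False): 0.0,
}

likely = should_retrieve(0.64, UTILITIES)
unlikely = should_retrieve(0.10, UTILITIES)
```

With these made-up utilities, a unit with a high probability of relevance is retrieved and one with a low probability is skipped.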

  20. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  21. Building the information retrieval system (PAIRS)
     • PAIRS is a software package (documents are stored in a relational database).
     • Written in C++.
     • Specifically developed to store and retrieve documents generated by the Parliament of Andalusia.
     • Based on a probabilistic model.
     [Figure: general scheme of PAIRS. Components shown: PDF document collection, XML document collection, indexing system, indexed document collection, query, indexed query, search engine, and retrieved document components.]

  22. Outline
     • Introduction
     • Overview of information retrieval systems
     • Bayesian network and information retrieval
     • Theoretical foundations
     • Building the information retrieval system
     • Conclusion

  23. Conclusion
     • The paper presents a retrieval system, based on a probabilistic model, for parliamentary information.
     • The system has proven efficient in terms of indexing and retrieval time.
     • Bayesian network technologies can be employed in problem domains whose dimensionality would previously have ruled out their use.
     • The system is not a finished product; several improvements are still possible.
