
FuFaIR: a Fuzzy Farsi Information Retrieval System




  1. FuFaIR: a Fuzzy Farsi Information Retrieval System • Amir Nayyeri, School of Electrical and Computer Engineering, University of Tehran • Farhad Oroumchian, University of Wollongong in Dubai

  2. Overview • Persian Language • Related Work • Fuzzy IR • Farsi IR • FuFaIR Explanation • Experimental Results • Conclusion and Future Work

  3. Persian Language • Spoken in several countries (Iran, Afghanistan, Tajikistan, …) • The language has evolved over the years and has been influenced by many languages • Contains foreign words from many languages such as Arabic, Turkish, French, English, … • In some cases these words still follow the grammatical rules of their original languages, for example: “Maktab” مكتب (singular) → “Makateb” مكاتب (plural) • In some cases these words can follow the grammatical rules of both languages, e.g. “Khabar” خبر (singular) → “Akhbar” اخبار (Arabic plural) or “Khabar-ha” خبرها (Persian plural) • Morphological analyzers for this language need to deal with many forms of words

  4. Information Retrieval and Natural Language Processing for Persian (Farsi) • The Faculty of Engineering of the University of Tehran (UT) started working on the processing of Persian about 7 years ago. • For the past 3 years this has been a joint effort between UT and the University of Wollongong in Dubai (UOWD). • Since then, several thousand experiments on the processing and retrieval of Persian text have been performed.

  5. Test Collections • Qavanin Collection • Documents: Iranian Law Collection • 177089 passages • 41 queries and relevance judgments • Hamshahri Collection • Documents: 300 MB of news from the Hamshahri newspaper • Part of Speech Tagging Collection • A tag set of 40 tags • 2590000+ tagged words

  6. Natural Language Processing • Investigating automatic part of speech tagging based on machine learning approaches: • Probabilistic (Hidden Markov Model) • Rule based • Entropy based • Neural networks • The best tagger so far has reached 96% accuracy.

  7. Information Retrieval Experiments • All major retrieval models of English text retrieval, and their combinations, have been tested, e.g.: • Fuzzy Logic (MMM, Paice) • Vector Space • Probabilistic (BM25) • N-Grams (N=2, N=3, N=4) • Combinational • With many different term weighting schemes.

  8. List of Weights that produced the best results

  9. Best

  10. The context of the current work • Improving the quality of Persian retrieval • Improving IR systems that used Fuzzy Logic as their retrieval model

  11. Related Work: Fuzzy IR • Fuzzy logic has been used in IR since its early days, but only a few fuzzy approaches have shown superiority over classical approaches such as vector space. • This has also been confirmed for the Persian language. • The current work has been mostly inspired by one of them: D.E. Losada, F.D. Hermida, A. Bugarin, S. Barro. Experiments on using fuzzy quantified sentences in adhoc retrieval. ACM Symposium on Applied Computing, 2004.

  12. Mixed Min & Max (MMM) • Calculates the degree of membership of a document in the fuzzy set of the query terms as follows: • OR query: (قيموميت يا حضانت) → (Guardian OR Godparent) • $Q_{or} = (A_1 \text{ OR } A_2 \text{ OR } A_3 \text{ OR } \ldots)$ • $SIM(Q_{or}, D) = C_{or1} \cdot \max(d_{A_1}, d_{A_2}, \ldots) + C_{or2} \cdot \min(d_{A_1}, d_{A_2}, \ldots)$ • AND query: (املاك و ثبت) → (Registration AND Properties) • $Q_{and} = (A_1 \text{ AND } A_2 \text{ AND } A_3 \text{ AND } \ldots)$ • $SIM(Q_{and}, D) = C_{and1} \cdot \min(d_{A_1}, d_{A_2}, \ldots) + C_{and2} \cdot \max(d_{A_1}, d_{A_2}, \ldots)$ • $C_{and}$, $C_{or}$: softness coefficients • $C_{and1} \in [0.5, 0.8]$, $C_{and2} = 1 - C_{and1}$ • $C_{or1} > 0.2$, $C_{or2} = 1 - C_{or1}$
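The MMM formulas on the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mmm_similarity`, the example weights, and the particular coefficient values are hypothetical and are not taken from the presentation; only the max/min combination and the rule that the second coefficient equals one minus the first come from the slide.

```python
def mmm_similarity(weights, operator, c1):
    """Mixed Min and Max similarity for a pure OR or a pure AND query.

    weights  -- the document's weights for the query terms (each in [0, 1])
    operator -- "or" or "and"
    c1       -- first softness coefficient; the second one is 1 - c1
    """
    if not weights:
        raise ValueError("at least one term weight is required")
    highest, lowest = max(weights), min(weights)
    if operator == "or":
        # OR puts the main weight on the maximum
        return c1 * highest + (1.0 - c1) * lowest
    if operator == "and":
        # AND puts the main weight on the minimum
        return c1 * lowest + (1.0 - c1) * highest
    raise ValueError("operator must be 'or' or 'and'")


# Hypothetical document weights for a two-term query
doc_weights = [0.9, 0.3]
or_score = mmm_similarity(doc_weights, "or", c1=0.6)
and_score = mmm_similarity(doc_weights, "and", c1=0.7)
print(or_score, and_score)
```

With these assumed inputs the OR score leans toward the stronger term and the AND score toward the weaker one, which is the "softening" of strict max and min that the coefficients are meant to provide.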

  13. Paice Model • Calculates the degree of membership of a document in the fuzzy set of the query terms as follows: • AND query: (املاك و ثبت) → (Registration AND Properties) • $Q_{and} = (A_1 \text{ and } A_2 \text{ and } A_3 \text{ and } \ldots)$ • OR query: (قيموميت يا حضانت) → (Guardian OR Godparent) • $Q_{or} = (A_1 \text{ or } A_2 \text{ or } A_3 \text{ or } \ldots)$ • $SIM(Q, D) = \sum_{i} r^{i-1} \, td_i \,/\, \sum_{i} r^{i-1}$ • $r = 1.0$ for AND queries ($td_i$ in ascending order) • $r = 0.7$ for OR queries ($td_i$ in descending order)
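A minimal sketch of the Paice similarity, assuming the sums run over all query terms and that `td_i` is the document's weight for the i-th term after sorting. The function name `paice_similarity` and the example weights are hypothetical; the values of `r` and the sort directions are the ones listed on the slide.

```python
def paice_similarity(weights, operator):
    """Paice-model similarity: a weighted average of the sorted term weights.

    weights  -- the document's weights for the query terms (each in [0, 1])
    operator -- "and" (r = 1.0, ascending order) or "or" (r = 0.7, descending)
    """
    if not weights:
        raise ValueError("at least one term weight is required")
    if operator == "and":
        r, ordered = 1.0, sorted(weights)
    elif operator == "or":
        r, ordered = 0.7, sorted(weights, reverse=True)
    else:
        raise ValueError("operator must be 'and' or 'or'")
    # enumerate starts at 0, so r ** i corresponds to r^(i-1) for i = 1, 2, ...
    numerator = sum((r ** i) * w for i, w in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator


# Hypothetical document weights for a two-term query
doc_weights = [0.9, 0.3]
and_score = paice_similarity(doc_weights, "and")
or_score = paice_similarity(doc_weights, "or")
print(and_score, or_score)
```

Note that with `r = 1.0` every term gets the same coefficient, so the AND score reduces to the plain mean of the weights and the sort order has no effect; with `r = 0.7` and descending order the largest weights dominate the OR score.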

  14. Experiments on Qavanin Collection: Comparison of Fuzzy Systems

  15. Experiments on Qavanin Collection: Probabilistic Systems (BM25)

  16. Experiments on Qavanin Collection: Comparison of Vector Space Systems with BM25

  17. Experiments on Qavanin Collection: Comparison of Best Vector Space with Best N-grams

  18. FuFaIR • The query is considered as a fuzzy set of relevant documents in the database • The documents are sent to the client sorted by their degree of membership in the query's fuzzy set • The larger the value of µi, the more relevant the document is to the query

  19. FuFaIR (Cont.) • Each term is assigned a membership degree to a document based on the importance of that term for representing the document’s content. • The membership degree can be computed with classical IR parameters such as tf/idf • The input query is considered as an algebraic sentence whose elements are: • Terms • Fuzzy operators such as AND, OR, and NOT • Applying the operators to the terms yields the final fuzzy set
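The idea of treating a query as an algebraic sentence over terms and fuzzy operators can be sketched as below. This is not FuFaIR's implementation: the nested-tuple query representation, the function names `evaluate` and `rank`, the toy index, and the membership values are all hypothetical, and the operators use the textbook min / max / complement interpretation, which may differ from the interpretation (and the fuzzy quantifiers) used in the actual system.

```python
def evaluate(query, memberships):
    """Evaluate a fuzzy query against one document.

    query       -- a term (str) or a tuple (operator, operand, ...)
    memberships -- dict mapping term -> membership degree in [0, 1]
    """
    if isinstance(query, str):
        # A term absent from the document has membership 0
        return memberships.get(query, 0.0)
    operator, *operands = query
    values = [evaluate(operand, memberships) for operand in operands]
    if operator == "and":
        return min(values)
    if operator == "or":
        return max(values)
    if operator == "not":
        if len(values) != 1:
            raise ValueError("'not' takes exactly one operand")
        return 1.0 - values[0]
    raise ValueError(f"unknown operator: {operator}")


def rank(query, index):
    """Return (doc_id, score) pairs sorted by decreasing membership."""
    scored = [(doc_id, evaluate(query, terms)) for doc_id, terms in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical term memberships for three documents
index = {
    "d1": {"registration": 0.8, "property": 0.6, "tax": 0.1},
    "d2": {"registration": 0.4, "property": 0.9, "tax": 0.7},
    "d3": {"property": 0.5},
}
query = ("and", ("and", "registration", "property"), ("not", "tax"))
ranking = rank(query, index)
print(ranking)
```

In this toy example `d1` ranks first because it matches both required terms strongly and barely matches the negated one, while `d3` scores zero because it lacks one of the AND-ed terms.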

  20. FuFaIR (Cont.) • The membership degree of a document to an individual term is defined in our method by a formula [not preserved in this transcript] over the following quantities: • $f_{t,d}$ = frequency of term t in document d • $idf(t)$ = inverse document frequency of term t

  21. Overview • Persian Language • Related Work • Fuzzy IR • Farsi IR • Fuzzy Logic Overview • FuFaIR Explanation • Experimental Results • Conclusion and Future Work

  22. Experimental Results • Parameters: • The Hamshahri corpus has been used • Total size of the collection: 300+ MB • Indexing has been performed after stop word elimination • No stemming has been applied • 30 queries have been used for these experiments • Precision has been computed for the top 20 retrieved documents.

  23. Experimental Results (Cont.) Some Sample Queries:

  24. Experimental Results (Cont.) • As a benchmark, the best Persian retrieval model so far has been selected: the Vector Space model with the Lnu-ltu weighting scheme. • The pivot and slope parameters have been set to 13.36 and 0.75, respectively. • The effectiveness of these values has been shown by previous work (see paper). • To calculate the performance of each run, the precision at the 5, 10, 15 and 20 document cut-offs has been calculated and averaged over all 30 queries.
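The evaluation measure described above, precision at fixed document cut-offs averaged over queries, can be sketched as follows. The function names and the toy data are hypothetical, and the sketch assumes the average is taken per cut-off across queries; the presentation does not say whether the four cut-off values were additionally combined into one number.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_cutoffs(runs, cutoffs=(5, 10, 15, 20)):
    """Average precision@k over all queries, separately for each cut-off.

    runs -- list of (ranked_ids, relevant_ids) pairs, one per query
    """
    return {
        k: sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in cutoffs
    }


# Toy data: two queries, each with a ranked list of 20 document ids
ranked = list(range(20))
runs = [
    (ranked, {0, 1, 2, 3, 4}),  # all relevant documents retrieved at the top
    (ranked, {0, 9, 19}),       # relevant documents spread through the list
]
results = mean_precision_at_cutoffs(runs)
print(results)
```

Dividing by `k` rather than by the number of documents actually returned penalizes runs that retrieve fewer than `k` documents, which is the usual convention for this measure.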

  25. Experimental Results (Cont.) Comparison Results:

  26. Conclusion & Future Work • Conclusion: • Main contribution of this paper: design, implementation and testing of FuFaIR, a fuzzy retrieval system for the Persian language. • Fuzzy quantifiers are also added to the original model to provide more flexibility. • In comparison with Vector Space, FuFaIR achieves significantly better performance. • Future work: • Testing different interpretations of the fuzzy operators on the Persian corpora • Examining the true value and contribution of a Persian stemmer in retrieval.

  27. Questions ?

  28. Conception of Fuzzy Logic • Many decision-making and problem-solving tasks are too complex to be defined precisely • However, people succeed by using imprecise knowledge • Fuzzy logic resembles human reasoning in its use of approximate information and uncertainty to generate decisions.

  29. Natural Language • Consider: • Joe is tall: what is tall? • Joe is very tall: how does this differ from tall? • Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 (“false”) and 1 (“true”).

  30. Fuzzy Logic • An approach to uncertainty that combines real values [0…1] and logic operations • Fuzzy logic is based on the ideas of fuzzy set theory and fuzzy set membership often found in natural (e.g., spoken) language.

  31. Example: “Young” • Ann is 28, 0.8 in set “Young” • Bob is 35, 0.1 in set “Young” • Charlie is 23, 1.0 in set “Young” • Unlike statistics and probabilities, the degree does not describe the probability that the item is in the set, but instead describes to what extent the item is in the set.

  32. Membership function of fuzzy logic • [Figure: degree of membership (DOM, from 0 to 1) plotted against age, with axis ticks at 25, 40 and 55, for the fuzzy values Young, Middle and Old] • Fuzzy values have associated degrees of membership in the set.

  33. Benefits of fuzzy logic • You want the value to switch gradually as Young becomes Middle and Middle becomes Old. This is the idea of fuzzy logic.
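The gradual switch between Young, Middle and Old can be sketched with piecewise-linear membership functions. The shapes below are an assumption: they only reuse the ages 25, 40 and 55 that appear on the membership-function slide as breakpoints, so they need not reproduce the illustrative degrees given on the “Young” example slide.

```python
def young(age):
    """Fully young up to 25, then decreasing linearly to 0 at 40 (assumed shape)."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15


def middle(age):
    """Triangle peaking at 40 and reaching 0 at 25 and 55 (assumed shape)."""
    if age <= 25 or age >= 55:
        return 0.0
    if age <= 40:
        return (age - 25) / 15
    return (55 - age) / 15


def old(age):
    """Zero up to 40, then increasing linearly to 1 at 55 (assumed shape)."""
    if age <= 40:
        return 0.0
    if age >= 55:
        return 1.0
    return (age - 40) / 15


for age in (23, 28, 40, 47.5, 60):
    print(age, young(age), middle(age), old(age))
```

With these shapes a person aged 47.5 is half "Middle" and half "Old" rather than flipping abruptly from one category to the other at a hard threshold.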

  34. Fuzzy Set Operations • Fuzzy OR (∪): the union of two fuzzy sets is the maximum (MAX) of each element from the two sets. • E.g. • A = {1.0, 0.20, 0.75} • B = {0.2, 0.45, 0.50} • A ∪ B = {MAX(1.0, 0.2), MAX(0.20, 0.45), MAX(0.75, 0.50)} = {1.0, 0.45, 0.75}

  35. Fuzzy Set Operations • Fuzzy AND (∩): the intersection of two fuzzy sets is just the MIN of each element from the two sets. • E.g. • A ∩ B = {MIN(1.0, 0.2), MIN(0.20, 0.45), MIN(0.75, 0.50)} = {0.2, 0.20, 0.50}

  36. Fuzzy Set Operations • The complement of a fuzzy variable with DOM x is (1-x). • Complement: The complement of a fuzzy set is composed of all elements’ complement. • Example. • Ac = {1 – 1.0, 1 – 0.2, 1 – 0.75} = {0.0, 0.8, 0.25}
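The three fuzzy set operations from the last slides can be checked with a short Python sketch. The sets `A` and `B` are the ones used on the slides; the function names and the list representation (one membership degree per element, in a fixed order) are choices made for this illustration.

```python
def fuzzy_or(a, b):
    """Union: element-wise maximum of the membership degrees."""
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_and(a, b):
    """Intersection: element-wise minimum of the membership degrees."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_not(a):
    """Complement: 1 - x for every membership degree x."""
    return [1.0 - x for x in a]


A = [1.0, 0.20, 0.75]
B = [0.2, 0.45, 0.50]

union = fuzzy_or(A, B)
intersection = fuzzy_and(A, B)
complement_a = fuzzy_not(A)
print(union, intersection, complement_a)
```

The results match the worked examples on the slides: the union is {1.0, 0.45, 0.75}, the intersection is {0.2, 0.20, 0.50}, and the complement of A is {0.0, 0.8, 0.25}.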
