Enhancing Text Retrieval Through Weighted Keyphrases in Document Vector Learning

Text Retrieval Improvement Based on Automatically Extracted Keyphrases

Hohyon Ryu
Yonsei University, Seoul, Korea & School of Information Studies, University of Wisconsin-Milwaukee

[Figure 1 labels: Original Text; Neural Network Learning; Document Vector; weighted Keywords (single noun); Rule-based Generation]

Introduction

As Web 2.0 technologies such as free-tagging and folksonomy become popular, many researchers have turned their attention to adopting these newly emerged technologies to enhance text retrieval. Generally, tags reflect the contents of a text, so it is assumed that using tags as a retrieval feature can improve retrieval performance. However, as Hotho et al. (2006) and Passant (2007) remarked, free-tagging has certain limits for use in text retrieval. Since free-tagging is done by users without any control, tags can be ambiguous or irrelevant to the original text.

Instead of using tags assigned arbitrarily by users, the present study is concerned with using automatically extracted keywords or keyphrases in text retrieval. In this study, a keyword is defined as a single word that represents the contents of a given text, and a keyphrase as a phrase that describes the contents well. Since keywords and keyphrases represent the main ideas and items of a given text, giving more weight to keyphrase or keyword terms should enhance retrieval performance.

This study is set in the Korean academic environment, where index terms are compound words or phrases in more than 70% of the cases (Lee et al., 2003). In Korean, extracted single words have only a limited capability of representing the contents. Keywords are therefore extracted first by a neural network algorithm, and keyphrases are then generated by combining the extracted keywords within their context. To make full use of the generated keywords and keyphrases, they are tested in text retrieval experiments: they are coded into a document vector with a certain weight and combined with the original document vector.
By doing so, the original document vector is enhanced to better represent the context of the original document, which can improve overall retrieval performance.

[Figure 1 label: weighted Keyphrases (compound nouns or combination of keywords)]

Result

Figure 3: The change of R-precision according to the assigned weight and features

As Figure 3 suggests, text retrieval performance increased by 15%, from an R-precision of 0.64 to 0.74, when the words that appear in both the keyphrases and the keywords were added to the original vector with double weight. For the keyword + keyphrase vector, the margin of improvement shrank as higher weights were assigned to the additional terms. On the other hand, for the keyword-added vector (the dashed line in Figure 3), the higher the assigned weight, the better the performance. This is because the same weight is assigned to each keyword item, while words in the keyphrases get different weights according to how often they appear in the keyphrase list.

[Illustration: highlighted text vs. plain text]
Highlighted keyphrases help people read a document better. Can they help a search engine work better?

Conclusion & Future Research

The present study shows that giving extra weight to words that appear in the keywords or keyphrases positively affects text retrieval performance. Since current web retrieval returns a significant number of irrelevant documents in response to user requests, making search algorithms more sensitive to the subject of documents will help improve retrieval performance.
Additionally, the neural network keyword extraction and the rule-based keyphrase generation performed with stable efficiency. An evaluation of keyword and keyphrase generation will be carried out in the future so that the modules can be used as independent software. Also, since the retrieval test in the Korean environment has shown significant improvement, further experiments will be conducted on English. A similar improvement in retrieval performance is expected there.

Experimental Design

As shown in Figure 1, a neural network, implemented with Feed-forward Neural Network for Python (Wojciechowski, 2007), judges whether each word is eligible to be a keyword on the basis of TF*IDF and the location of the word in the document. Keyphrases are generated by a rule-based algorithm. The keyphrase generation algorithm makes a window of 1 preceding and 3 following words and merges adjacent or overlapping windows. It then rules out the words that are inadequate for a noun phrase by analyzing the lexical category of each word.

The words that appear in the automatically extracted keywords and generated keyphrases are added to the original document vector with a certain weight, so that essential words weigh more. For each vector, Okapi TF×IDF normalization was applied. As the example in Figure 2 shows, keyphrases include certain keywords repeatedly. Thus, more important keywords get more weight, while less important keywords often get no additional weight at all.

With the weighted vectors, along with the original vectors as a baseline, text retrieval experiments were carried out. The test collection consisted of 545 abstracts of academic papers from the Yonsei University Library in ten different academic fields.
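The window merging and the vector weighting described above can be illustrated with a minimal sketch. It is not the code used in the study. The function names, the choice to center each window on an extracted keyword, and the rule of adding the weight once per occurrence in the keyword and keyphrase lists are all assumptions made for illustration. The lexical-category filter and the Okapi normalization are left out.

```python
from collections import Counter


def keyword_windows(tokens, keywords, before=1, after=3):
    """Return merged (start, end) spans around each keyword; end is exclusive.

    Assumption: each window is centered on an extracted keyword and covers
    `before` preceding and `after` following words.
    """
    spans = []
    for i, tok in enumerate(tokens):
        if tok not in keywords:
            continue
        start = max(0, i - before)
        end = min(len(tokens), i + after + 1)
        if spans and start <= spans[-1][1]:
            # Adjacent to or overlapping the previous window: merge them.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def add_weighted_terms(doc_tf, keywords, keyphrases, weight=2.0):
    """Add `weight` to a term for every occurrence in the two lists.

    Assumption: a word repeated across several keyphrases is boosted once
    per repetition, so frequent keyphrase words end up with more weight.
    """
    boosted = Counter(doc_tf)
    for kw in keywords:
        boosted[kw] += weight
    for phrase in keyphrases:
        for word in phrase.split():
            boosted[word] += weight
    return dict(boosted)
```

With this sketch, a word such as "thesaurus" that occurs in the keyword list and in several keyphrases is boosted several times, while a keyword that occurs in no keyphrase is boosted only once. This matches the behaviour described above only under the stated assumptions.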
The results of the retrieval experiments with the keyword-added vector, the keyphrase-added vector, and a vector with both keyword and keyphrase entry words are shown in Figure 3. The results were evaluated by R-precision.

Literature Review

Previous studies related to neural-network-based keyword extraction, noun phrase generation and extraction, and improving text retrieval performance with contextual features are reviewed. Neural networks and other machine learning methods have been used in several studies to decide whether a given word should be recognized as a keyword (Medelyan et al., 2008; Jo T. C. et al., 2000). Noun phrase extraction has also been approached in many different ways (Tomokiyo et al., 2003; Yang, 2000; Lee S. S. et al., 2003; Lee C. Y. et al., 1993; Lee H. A. et al., 1997). Since keyphrases are more prevalent than single-noun keywords in Korean, and since processing the Korean language involves more complicated steps, many of these studies have been done by Korean researchers. Finally, Hotho et al. (2006) and Cho et al. (2005) conducted studies to improve text retrieval performance with keywords or folksonomy.

References

Cho, M., Yun, B., & Rim, H. 1997. A Korean Document Retrieval Model Considering Compound Nouns and Derived Nouns. Proceedings of Korea Information Science Society Spring Conference 24(1). 449-502.
Hotho, A., Jaschke, R., Schmitz, C., & Stumme, G. 2006. Information Retrieval in Folksonomies: Search and Ranking. Lecture Notes in Computer Science. Springer Berlin: Heidelberg.
Jo, T. C., & Seo, J. 2000. Neural Based Approach to Keyword Extraction from Documents. Proceedings of Korea Information Science Society Autumn Conference 27(2). 317-319.
Lee, C. Y., Kang, H., Jang, H., & Park, S. 1993. A Design of the Automatic Keyword Maker. Proceedings of the 5th Conference of Hangul and Korean Information Processing. 71-77.
Lee, H. A., Lee, J. H., & Lee, G. 1997. Noun Phrase Indexing using Clausal Segmentation. Journal of Korea Information Science Society (B) 25(3). 301-311.
Lee, S. S., & Lee, T. 2003. Concept-based Compound Keyword Extraction. Journal of Korea Association of Computer Education 6(2).
Medelyan, O., & Witten, I. H. 2008. Domain Independent Automatic Keyphrase Indexing with Small Training Sets. JASIST 59(7). 1026-1040.
Passant, A. 2007. Using Ontologies to Strengthen Folksonomies and Enrich Information Retrieval in Weblogs. International Conference on Web Services.
Tomokiyo, T., & Hurst, M. 2003. A Language Model Approach to Keyphrase Extraction. Proceedings of the ACL Workshop on Multiword Expressions.
Wojciechowski, M. 2007. Feed-forward neural network for python. Technical University of Lodz (Poland), Department of Civil Engineering, Architecture and Environmental Engineering, http://ffnet.sourceforge.net/, ffnet-0.6, March 2007.
Yang, J. 2000. Base Noun Phrase Recognition in Korean using Rule-based Learning. Journal of Korea Information Science Society: Software and Applications 27(10).

Figure 1: Outline of the experiment.

Title: Study on Guidelines for the Construction of a Korean Thesaurus
Keywords: 1986, Korean, basic, Hangeul, definition, standard, 2788, relation, word, ISO, alphabet, thesaurus, rule, term, most
Keyphrases: standard for Hangeul thesaurus construction, Hangeul thesaurus, word thesaurus, ISO standard, aspect of Hangeul thesaurus, Hangeul thesaurus test, Hangeul thesaurus data, ISO, word thesaurus construction standard, Hangeul thesaurus management system

Figure 2: An example of the extracted keywords and keyphrases.

For more information, please send an email to hohyon@gmail.com
