1. FTF*IDF Scoring for Fresh Information Retrieval Nobuyoshi Sato, Minoru Uehara
Toyo University
Yoshifumi Sakai
Tohoku University Thank you chairperson. I'm Nobuyoshi Sato, from Toyo University, Japan.
I will talk about our paper titled FTF*IDF Scoring for Fresh Information Retrieval.
2. Outline Backgrounds & Objectives
Related Works
Cooperative Search Engine
Fresh Information Retrieval
FTF*IDF scoring
Evaluation
Conclusions In this slide, I describe the outline of this presentation.
First, as an introduction, I present the background and objectives. Then I talk about the structure and behavior of the Cooperative Search Engine, CSE. After that, I talk about fresh information retrieval, FTF*IDF scoring, and its evaluation.
Finally, I describe future work and conclusions.
3. Background Intranet IR system is important
Fresh IR is required
Update interval is too long to retrieve fresh information in current centralized search engines
Long time is spent to gather documents
Google's update interval is 2 or 3 weeks
Our goal is to update in a few minutes
We developed a distributed search engine, Cooperative Search Engine (CSE)
CSE can retrieve fresh information
Indices are made in each web server
What is fresh information?
Here, I talk about the background.
Recently, Internet and intranet information retrieval has become important, so fresh information retrieval is required in many organizations such as companies and universities. Many search engines are in use, such as Google and AltaVista. These search engines are based on a centralized architecture, and in centralized search engines the update interval is too long: for example, Google takes about a month, recently 2 or 3 weeks. Our goal is to update within a few minutes. A search engine based on a distributed architecture is therefore needed in order to build indices faster and retrieve fresh information, so we developed a distributed search engine named Cooperative Search Engine, CSE.
4. Features of CSE Fresh IR
CSE can update indexes in a few minutes.
Fast IR
CSE is as fast as centralized search engines.
The retrieval response time of CSE is longer than that of centralized search engines because of high latency.
We have developed several speedup techniques. In this slide, I talk about the features of CSE.
The most important feature of CSE is fresh information retrieval. As I said before, CSE can update the indices of the search engines in a few minutes. The next feature is fast information retrieval: CSE is as fast as centralized search engines. The retrieval response time of CSE is longer than that of a centralized search engine because of high latency; however, we have developed several speedup techniques.
5. Objectives We have proposed Fresh Information Retrieval
Fresh Information Retrieval retrieves documents by the time at which their contents were modified
Not just timestamps
Time of the documents can be specified in the query
Freshness of the documents
There was no way to specify the freshness of documents
We propose a scoring method to consider freshness of the documents
TF*IDF based
Newer documents have higher scores, old documents lower scores
Scores of documents decrease as time goes by
Documents become fresh again when they are modified Here, I talk about the objectives of this presentation.
Last year, we proposed fresh information retrieval. Fresh information retrieval is a subset of temporal information retrieval, which retrieves documents by the time at which their contents were modified. The time can be specified in the query.
Although we proposed fresh information retrieval, there was no way to specify the freshness of documents. So we propose a TF*IDF based scoring method that considers the freshness of documents. In this method, newer documents or words have higher scores, and they lose score as time goes by, so old documents have low scores. And documents become fresh again when they are modified.
6. Centralized Search Engine This is an illustration of a conventional centralized search engine. A robot collects and stores web pages. An indexer builds an index, or inverted file. Finally, web documents can be searched.
7. Distributed Search Engine In distributed search engines, no robot is needed. Indexers build indices locally, so they can update very fast. On the other hand, a searcher needs to send requests to other searchers in the retrieval phase, so communication delay occurs at retrieval.
8. Components of CSE LS (Location Server)
LS manages Forward Knowledge
CS (Cache Server)
CS asks LMSEs and holds retrieval results in the cache
LMSE (Local Meta Search Engine)
LMSE retrieves by using LSE, and re-computes scores
LSE (Local Search Engine)
Small search engine in each Web site.
e.g. Namazu Now, I show the components of CSE.
First, the Location Server, LS, controls the whole of CSE, and it knows the contents of each LMSE's documents. There is only one LS in the whole of CSE. The LS has a location information database, which records which LMSE has which keywords in which documents. This information is called Forward Knowledge. The LS replies to the Cache Server with the LMSEs that can be used to find documents.
The Cache Server, CS, caches the results retrieved by LMSEs. The CS has a look-ahead cache to shorten response time at retrieval.
Next, the Local Meta Search Engine, LMSE, searches documents by communicating with other LMSEs and the CS, or by using its own LSE. The LMSE also manages the LSE.
Finally, the Local Search Engine, LSE, searches only the local documents of a specific web server. An LSE can also search other web servers' documents by using a robot, and the LSE updates the indices.
9. Structure of CSE This illustration shows the structure of CSE.
We assume that an LMSE and an LSE are installed on each web server, that a Cache Server is prepared in each domain, and that only one Location Server is prepared for the whole of CSE.
10. Behavior at Update Time Next, I'll show the behavior of CSE when updating index databases.
At first, the LS sends requests to each LMSE to update its index database and to send back a summary of the index database. This summary is Forward Knowledge. Each LMSE lets its LSE update the index database and sends back Forward Knowledge to the LS. The LS receives the Forward Knowledge and writes it to the location database.
An LMSE can also start the update process on its own initiative; in this case, a cron daemon runs the update program.
11. Behavior at Retrieval Time Then, I describe the behavior of CSE at retrieval time.
First, an LMSE receives a search request from a user. This LMSE optimizes the given search expression and sends it to a nearby CS. If the search results are in the cache, the CS replies from the cache. Otherwise, the CS sends the search expression to the LS. The LS searches the location database, determines which LMSEs have matching documents, and sends back the result. Then, the CS sends the search expression to those LMSEs, which have documents matching the specified search expression. The LMSEs search their documents using their LSEs and send the results back to the CS. The CS merges all of the results and returns them to the first LMSE. At last, the first LMSE shows the results to the user.
12. Basic Idea of Temporal Information Retrieval Retrieves documents by the time they were written
Not to retrieve by time contained in documents
A document written in 1998:
"In 2002, Japan and Korea will hold the World Cup"
Retrieve by term → 2002 & ( japan & korea )
Retrieve by time and term → time:1998 & ( japan & korea )
Time of the document is similar to timestamp of the document file
Time of the document is NOT timestamp itself
Details will be described later Here, I talk about the basic idea of temporal information retrieval. Temporal information retrieval retrieves documents by the time at which they were written. That is, it searches documents by the words they contain, sorts them by time, and shows them. This is not retrieval by a time contained in the document text. For example, assume there is a document written in 1998: "In 2002, Japan and Korea will hold the World Cup." The expression for retrieving by content is 2002 & (japan & korea); for retrieving by time and term it is time:1998 & (japan & korea). Here, the time of the document is similar to the timestamp of the document file; however, it is not the timestamp itself, because of how document contents change. I will talk about the details later.
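As a concrete illustration, the time-qualified query above can be evaluated as a term match followed by a filter on each document's content time. The tiny in-memory index and the `search` helper below are illustrative assumptions, not CSE's actual query handling.

```python
# Minimal sketch of temporal retrieval: match terms, then filter by
# the year the content was written (NOT by years mentioned in the text).
# The index layout here is an assumption for illustration only.
docs = {
    "d1": {"year": 1998, "words": {"2002", "japan", "korea", "world", "cup"}},
    "d2": {"year": 2002, "words": {"japan", "korea", "final"}},
}

def search(terms, year=None):
    """Return ids of documents containing all terms, optionally
    restricted to a given content-modification year."""
    hits = [d for d, doc in docs.items() if set(terms) <= doc["words"]]
    if year is not None:
        hits = [d for d in hits if docs[d]["year"] == year]
    return hits

print(search(["japan", "korea"]))              # both documents match
print(search(["japan", "korea"], year=1998))   # only the 1998 document
```

Note how the query 2002 & (japan & korea) matches by content, while time:1998 & (japan & korea) restricts by when the document was written.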
13. Nature of Web and Temporal Information Retrieval Web documents are modified frequently
Some of web documents are seldom modified
Papers, Specification, Datasheets, Catalogues
Character of Web Documents
Only some are completely rewritten
Should be considered as they are newly created
Most pages are modified little by little
Should be considered as they become fresh partially Then, I talk about the nature of web documents and temporal information retrieval. As a general nature of web documents, they are often and easily modified. However, some web pages, such as papers, datasheets, and specifications, are hardly ever modified.
Among all web documents, only some are completely rewritten; most web pages are modified little by little. Completely rewritten documents should be considered as newly created, and documents that are modified little by little should be considered as becoming partially fresh.
14. TF*IDF Scoring Traditional keyword based scoring
Score of a keyword w is given by the following
K is a constant
tf is term frequency (the number of times the word appears)
idf is inverse document frequency (how rare the word is across documents)
n is the number of documents which have the word
N is the total number of documents Here, as related work, I talk about TF*IDF scoring.
TF*IDF is a simple, traditional keyword-based scoring method. In TF*IDF, the score of a keyword in a specific document is the product of tf, idf, and a constant K. Here, tf is the term frequency of the word, and idf is the inverse document frequency, which expresses the rareness of the word: it is calculated from the total number of documents and the number of documents that include the keyword.
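The scoring on this slide can be sketched directly from the bullet definitions. The slide's formula itself is an image, so the logarithmic form idf = log(N / n) and K = 1 used below are common conventions assumed here, not necessarily the exact variant in the paper.

```python
import math

def tf_idf(word, doc, corpus, K=1.0):
    """TF*IDF score of `word` in `doc`: K * tf * idf, where
    tf  = number of times the word appears in the document, and
    idf = log(N / n), N = total documents, n = documents containing the word.
    (The log form and K = 1 are assumed conventions.)"""
    tf = doc.count(word)
    N = len(corpus)
    n = sum(1 for d in corpus if word in d)
    idf = math.log(N / n) if n else 0.0
    return K * tf * idf

corpus = [["fresh", "retrieval", "fresh"], ["retrieval"], ["engine"]]
print(tf_idf("fresh", corpus[0], corpus))      # rare word, tf = 2: high score
print(tf_idf("retrieval", corpus[0], corpus))  # common word: lower idf, lower score
```

A word that appears often in one document but rarely in the collection gets the highest score, which is the intuition FTF*IDF later extends with freshness.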
15. Basic Idea of FTF*IDF Scoring There was no way to score freshness of documents
Freshness of documents are sum of freshness of words
Freshness is decreased by passing of time
Freshness of a word = exp(-(t - ti) / a)
t: now
ti: the time the word was added
a: damping factor
Score of document is Freshness * Term Frequency (FTF)
In this simple definition, the position and timestamp of every word must be recorded
The size of the index would be larger than the original document Here, I describe the basic idea of FTF*IDF, Freshness Term Frequency * Inverse Document Frequency scoring.
As I said before, there was no way to score the freshness of a document. In FTF*IDF, the freshness of a document is the sum of the freshness of all words in the document, and the freshness of a word decreases as time goes by, following this expression. The score of a document is the product of the freshness term frequency (FTF) and IDF.
However, in this simple definition, the position and timestamp of every word must be recorded, so the size of the index would be larger than the original document.
17. FTF*IDF Scoring FTF is calculated from the previous FTF, the previous TF, and the current TF
FTF0 = TF0
FTFi = FTFi-1 * F + (TFi - TFi-1)
Here, I describe the calculation of FTF*IDF scoring with a reduced index size.
When a document is created, it has freshness 1, so its freshness term frequency, FTF, equals TF. As time goes by, the document loses freshness: the older it becomes, the less freshness it has. When the document is modified, it becomes fresh again. In the second expression, the first term decreases FTF by the freshness decay factor F, and the second term adds freshness if the document was modified since the previous update. With this method, only TF and FTF need to be recorded in the index.
17. Evaluation (1/5) Probability that a modified document appears in the top 10 items of retrieval results
100 random words in each document
FTF*IDF's probability is higher than TF*IDF's Here, I talk about the evaluation of FTF*IDF.
At first, we prepared documents that each contain 100 randomly generated words. We then modified one document and evaluated the probability that the modified document appears in the top 10 items of the retrieval results.
18. Evaluation (2/5) Probability that a modified document appears in the top 10%
100 random words in each document
FTF*IDF's probability is higher than TF*IDF's This graph shows the probability that a modified document appears in the highest 10% of the retrieval results. The prepared documents are the same as in the previous slide. With FTF*IDF, the probability is higher than with TF*IDF.
19. Evaluation (3/5) Relative retrieval time of FTF*IDF and TF*IDF
Almost the same This graph shows the relative retrieval response time when FTF*IDF and TF*IDF are used. The retrieval response time is almost the same for TF*IDF and FTF*IDF.
20. Evaluation (4/5) Relative necessary time to update the index when FTF*IDF and TF*IDF are used
FTF*IDF takes almost 1.5 times longer This graph shows the relative time needed to update the index when FTF*IDF and TF*IDF are used. FTF*IDF takes almost 1.5 times longer than TF*IDF.
21. Evaluation (5/5) Anti-word spamming
tf1, ftf1: originally word-spamming document
tf2, ftf2: originally normal document, spam words were added at time 0
FTF*IDF has better tolerance against word spamming
PageRank is not suited for a widely distributed network
Large traffic is generated at each iteration to compute PageRank Next, we evaluated word-spamming tolerance. At first, we prepared two documents, Document 1 and Document 2. Document 1 contains 100 occurrences of the same word; this is the spam document. Document 2 initially has 100 different words; this is the normal document. Next, we added spam words to Document 2 every hour. The graph shows the transition of the scores of the two documents when TF*IDF and FTF*IDF are used. tf1 means TF*IDF scoring is used for Document 1, ftf1 means FTF*IDF is used for Document 1, and so on. With ftf1, the score decreases as time goes by, although tf1 keeps the same score. With ftf2, the score increases slowly, although the score of tf2 increases linearly. So we can conclude that FTF*IDF has better tolerance against word spamming.
22. Conclusions We proposed fresh information retrieval with FTF*IDF scoring for CSE
FTF*IDF indicates freshness of documents
FTF*IDF has higher word-spamming tolerance than normal TF*IDF
In CSE, FTF*IDF works at low additional cost
Size of indices
Retrieval response time
Index update time Finally, I present the conclusions.
Today, we proposed FTF*IDF scoring for fresh information retrieval. FTF*IDF is a TF*IDF based scoring method that indicates the freshness of a document, and it has higher tolerance against word spamming than normal TF*IDF.
FTF*IDF works at low additional cost in terms of index size, retrieval response time, and index update time.