1. FTF*IDF Scoring for Fresh Information Retrieval Nobuyoshi Sato, Minoru Uehara
Toyo University
Yoshifumi Sakai
Tohoku University Thank you chairperson. I'm Nobuyoshi Sato, from Toyo University, Japan.
I will talk about our paper titled FTF*IDF Scoring for Fresh Information Retrieval.
2. Outline Backgrounds & Objectives
Related Works
Cooperative Search Engine
Fresh Information Retrieval
FTF*IDF scoring
Evaluation
Conclusions In this slide, I describe the outline of this presentation.
First, as an introduction, I present the background and objectives. Then I talk about the structure and behavior of the Cooperative Search Engine, CSE. After that, I talk about fresh information retrieval, FTF*IDF scoring, and its evaluation.
Finally, I describe future work and conclusions.
3. Background Intranet IR system is important
Fresh IR is required
Update interval is too long to retrieve fresh information in current centralized search engines
Long time is spent to gather documents
Google's update interval is 2 or 3 weeks
Our goal is to update in a few minutes
We developed a distributed search engine, Cooperative Search Engine (CSE)
CSE can retrieve fresh information
Indices are made in each web server
What is fresh information?
Here, I talk about the background.
Recently, Internet and intranet information retrieval has become important, so fresh information retrieval is required in many organizations such as companies and universities. Many search engines are in use, such as Google and AltaVista. These search engines are based on a centralized architecture, and in centralized search engines the update interval is too long: for example, Google takes about a month, recently 2 or 3 weeks. Our goal is to update within a few minutes. A search engine based on a distributed architecture is therefore needed in order to build indices faster and retrieve fresh information, so we developed a distributed search engine named Cooperative Search Engine, CSE.
4. Features of CSE Fresh IR
CSE can update indexes in a few minutes.
Fast IR
CSE is as fast as centralized search engines.
The retrieval response time of CSE is longer than that of centralized search engines because of high latency.
We have developed several speedup techniques. In this slide, I talk about the features of CSE.
The most important feature of CSE is fresh information retrieval. As I said before, CSE can update the indices of the search engines in a few minutes. The next feature is fast information retrieval: CSE is as fast as centralized search engines. The retrieval response time of CSE is longer than that of a centralized search engine because of high latency; however, we have developed several speedup techniques.
5. Objectives We have proposed Fresh Information Retrieval
Fresh Information Retrieval retrieves documents by the time at which their contents were modified
Not just timestamps
Time of the documents can be specified in the query
Freshness of the documents
There was no way to specify the freshness of documents
We propose a scoring method to consider freshness of the documents
TF*IDF based
Newer documents have higher scores, old documents lower scores
Scores of documents decrease as time goes by
Documents become fresh again when they are modified Here, I talk about the objectives of this presentation.
Last year, we proposed fresh information retrieval. Fresh information retrieval is a subset of temporal information retrieval, which retrieves documents by the time at which their contents were modified. The time can be specified in the query.
Although we proposed fresh information retrieval, there was no way to specify the freshness of documents. So we propose a TF*IDF based scoring method that considers the freshness of documents. In this method, newer documents or words have higher scores, and they lose score as time goes by, so old documents have low scores. And documents become fresh again when they are modified.
6. Centralized Search Engine This is an illustration of a conventional centralized search engine. A robot collects and stores web pages. An indexer builds an index, or inverted file. Finally, web documents can be searched.
7. Distributed Search Engine In distributed search engines, no robot is needed. Indexers build indices locally, so they can update very fast. On the other hand, a searcher needs to send requests to other searchers in the retrieval phase, so communication delay occurs at retrieval.
8. Components of CSE LS (Location Server)
LS manages Forward Knowledge
CS (Cache Server)
CS asks LMSEs and holds retrieval results in the cache
LMSE (Local Meta Search Engine)
LMSE retrieves by using LSE, and re-computes scores
LSE (Local Search Engine)
Small search engine in each Web site.
e.g. Namazu Now, I show the components of CSE.
First, the Location Server, LS, controls the whole of CSE, and it knows the contents of each LMSE's documents. There is only one LS in the whole of CSE. The LS has a location information database, which records which LMSE has which keywords in which documents. This information is called Forward Knowledge. The LS replies to the Cache Server with the LMSEs that can be used to find documents.
The Cache Server, CS, caches the results retrieved by LMSEs. The CS has a look-ahead cache to shorten response time at retrieval.
Next, the Local Meta Search Engine, LMSE, searches documents by communicating with other LMSEs and the CS, or by using its own LSE. The LMSE also manages the LSE.
Finally, the Local Search Engine, LSE, searches only the local documents of a specific web server. An LSE can also search other web servers' documents by using a robot, and the LSE updates the indices.
9. Structure of CSE This illustration shows the structure of CSE.
We assume that an LMSE and an LSE are installed on each web server, that a Cache Server is prepared in each domain, and that only one Location Server is prepared for the whole of CSE.
10. Behavior at Update Time Next, I'll show the behavior of CSE when updating index databases.
At first, the LS sends requests to each LMSE to update its index database and to send back a summary of the index database. This summary is Forward Knowledge. Each LMSE lets its LSE update the index database and sends back Forward Knowledge to the LS. The LS receives the Forward Knowledge and writes it to the location database.
An LMSE can also start the update process on its own initiative; in this case, a cron daemon runs the update program.
11. Behavior at Retrieval Time Then, I describe the behavior of CSE at retrieval time.
First, an LMSE receives a search request from a user. This LMSE optimizes the given search expression and sends it to a nearby CS. If the search results are in the cache, the CS replies from the cache. Otherwise, the CS sends the search expression to the LS. The LS searches the location database, determines which LMSEs have matching documents, and sends back the result. Then, the CS sends the search expression to those LMSEs, which have documents matching the specified search expression. The LMSEs search their documents using their LSEs and send the results back to the CS. The CS merges all of the results and returns them to the first LMSE. At last, the first LMSE shows the results to the user.
12. Basic Idea of Temporal Information Retrieval Retrieves documents by the time they were written
Not to retrieve by time contained in documents
A document written in 1998:
"In 2002, Japan and Korea will hold the World Cup"
Retrieve by term → 2002 & ( japan & korea )
Retrieve by time and term → time:1998 & ( japan & korea )
Time of the document is similar to timestamp of the document file
Time of the document is NOT timestamp itself
Details will be described later Here, I talk about the basic idea of temporal information retrieval. Temporal information retrieval retrieves documents by the time at which they were written. That is, it searches documents by the words they contain, sorts them by time, and shows them. This is not retrieval by a time contained in the document text. For example, assume there is a document written in 1998: "In 2002, Japan and Korea will hold the World Cup." The expression for retrieving by content is 2002 & (japan & korea); for retrieving by time and term it is time:1998 & (japan & korea). Here, the time of the document is similar to the timestamp of the document file; however, it is not the timestamp itself, because of how document contents change. I will talk about the details later.
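As a concrete illustration, the time-qualified query above can be evaluated as a term match followed by a filter on each document's content time. The tiny in-memory index and the `search` helper below are illustrative assumptions, not CSE's actual query handling.

```python
# Minimal sketch of temporal retrieval: match terms, then filter by
# the year the content was written (NOT by years mentioned in the text).
# The index layout here is an assumption for illustration only.
docs = {
    "d1": {"year": 1998, "words": {"2002", "japan", "korea", "world", "cup"}},
    "d2": {"year": 2002, "words": {"japan", "korea", "final"}},
}

def search(terms, year=None):
    """Return ids of documents containing all terms, optionally
    restricted to a given content-modification year."""
    hits = [d for d, doc in docs.items() if set(terms) <= doc["words"]]
    if year is not None:
        hits = [d for d in hits if docs[d]["year"] == year]
    return hits

print(search(["japan", "korea"]))              # both documents match
print(search(["japan", "korea"], year=1998))   # only the 1998 document
```

Note how the query 2002 & (japan & korea) matches by content, while time:1998 & (japan & korea) restricts by when the document was written.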
13. Nature of Web and Temporal Information Retrieval Web documents are modified frequently
Some of web documents are seldom modified
Papers, Specification, Datasheets, Catalogues
Character of Web Documents
Only some are completely rewritten
Should be considered as they are newly created
Most pages are modified little by little
Should be considered as they become fresh partially Then, I talk about the nature of web documents and temporal information retrieval. As a general nature of web documents, they are often and easily modified. However, some web pages, such as papers, datasheets, and specifications, are hardly ever modified.
Among all web documents, only some are completely rewritten; most web pages are modified little by little. Completely rewritten documents should be considered as newly created, and documents that are modified little by little should be considered as becoming partially fresh.
14. TF*IDF Scoring Traditional keyword based scoring
Score of a keyword w is given by the following
K is a constant
tf is term frequency (the number of times the word appears)
idf is inverse document frequency (how rare the word is across documents)
n is the number of documents which have the word
N is the total number of documents Here, as related work, I talk about TF*IDF scoring.
TF*IDF is a simple, traditional keyword-based scoring method. In TF*IDF, the score of a keyword in a specific document is the product of tf, idf, and a constant K. Here, tf is the term frequency of the word, and idf is the inverse document frequency, which expresses the rareness of the word: it is calculated from the total number of documents and the number of documents that include the keyword.
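The scoring on this slide can be sketched directly from the bullet definitions. The slide's formula itself is an image, so the logarithmic form idf = log(N / n) and K = 1 used below are common conventions assumed here, not necessarily the exact variant in the paper.

```python
import math

def tf_idf(word, doc, corpus, K=1.0):
    """TF*IDF score of `word` in `doc`: K * tf * idf, where
    tf  = number of times the word appears in the document, and
    idf = log(N / n), N = total documents, n = documents containing the word.
    (The log form and K = 1 are assumed conventions.)"""
    tf = doc.count(word)
    N = len(corpus)
    n = sum(1 for d in corpus if word in d)
    idf = math.log(N / n) if n else 0.0
    return K * tf * idf

corpus = [["fresh", "retrieval", "fresh"], ["retrieval"], ["engine"]]
print(tf_idf("fresh", corpus[0], corpus))      # rare word, tf = 2: high score
print(tf_idf("retrieval", corpus[0], corpus))  # common word: lower idf, lower score
```

A word that appears often in one document but rarely in the collection gets the highest score, which is the intuition FTF*IDF later extends with freshness.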
15. Basic Idea of FTF*IDF Scoring There was no way to score freshness of documents
Freshness of documents are sum of freshness of words
Freshness is decreased by passing of time
Freshness of a word = exp(-(t - ti) / a)
t: now
ti: the time the word was added
a: damping factor
Score of document is Freshness * Term Frequency (FTF)
In this simple definition, the position and timestamp of every word must be recorded
The size of the index would be larger than the original document Here, I describe the basic idea of FTF*IDF, Freshness Term Frequency * Inverse Document Frequency scoring.
As I said before, there was no way to score the freshness of a document. In FTF*IDF, the freshness of a document is the sum of the freshness of all words in the document, and the freshness of a word decreases as time goes by, following this expression. The score of a document is the product of the freshness term frequency (FTF) and IDF.
However, in this simple definition, the position and timestamp of every word must be recorded, so the size of the index would be larger than the original document.
17. FTF*IDF Scoring FTF is calculated from the previous FTF, the previous TF, and the current TF
FTF0 = TF0
FTFi = FTFi-1 * F + (TFi - TFi-1)
Here, I describe the calculation of FTF*IDF scoring with a reduced index size.
When a document is created, it has freshness 1, so its freshness term frequency, FTF, equals TF. As time goes by, the document loses freshness: the older it becomes, the less freshness it has. When the document is modified, it becomes fresh again. In the second expression, the first term decreases FTF by the freshness decay factor F, and the second term adds freshness if the document was modified since the previous update. With this method, only TF and FTF need to be recorded in the index.
17. Evaluation (1/5) Probability that a modified document appears in the top 10 items of retrieval results
100 random words in each document
FTF*IDF's probability is higher than TF*IDF's Here, I talk about the evaluation of FTF*IDF.
At first, we prepared documents that each contain 100 randomly generated words. We then modified one document and evaluated the probability that the modified document appears in the top 10 items of the retrieval results.
18. Evaluation (2/5) Probability that a modified document appears in the top 10%
100 random words in each document
FTF*IDF's probability is higher than TF*IDF's This graph shows the probability that a modified document appears in the highest 10% of the retrieval results. The prepared documents are the same as in the previous slide. With FTF*IDF, the probability is higher than with TF*IDF.
19. Evaluation (3/5) Relative retrieval time of FTF*IDF and TF*IDF
Almost the same This graph shows the relative retrieval response time when FTF*IDF and TF*IDF are used. The retrieval response time is almost the same for TF*IDF and FTF*IDF.
20. Evaluation (4/5) Relative necessary time to update the index when FTF*IDF and TF*IDF are used
FTF*IDF takes almost 1.5 times longer This graph shows the relative time needed to update the index when FTF*IDF and TF*IDF are used. FTF*IDF takes almost 1.5 times longer than TF*IDF.
21. Evaluation (5/5) Anti-word spamming
tf1, ftf1: originally word-spamming document
tf2, ftf2: originally normal document, spam words were added at time 0
FTF*IDF has better tolerance against word spamming
PageRank is not suited for a widely distributed network
Large traffic is generated at each iteration to compute PageRank Next, we evaluated word-spamming tolerance. At first, we prepared two documents, Document 1 and Document 2. Document 1 contains 100 occurrences of the same word; this is the spam document. Document 2 initially has 100 different words; this is the normal document. Next, we added spam words to Document 2 every hour. The graph shows the transition of the scores of the two documents when TF*IDF and FTF*IDF are used. tf1 means TF*IDF scoring is used for Document 1, ftf1 means FTF*IDF is used for Document 1, and so on. With ftf1, the score decreases as time goes by, although tf1 keeps the same score. With ftf2, the score increases slowly, although the score of tf2 increases linearly. So we can conclude that FTF*IDF has better tolerance against word spamming.
22. Conclusions We proposed fresh information retrieval with FTF*IDF scoring for CSE
FTF*IDF indicates freshness of documents
FTF*IDF has higher word-spamming tolerance than normal TF*IDF
In CSE, FTF*IDF works at low additional cost
Size of indices
Retrieval response time
Index update time Finally, I present the conclusions.
Today, we proposed FTF*IDF scoring for fresh information retrieval. FTF*IDF is a TF*IDF based scoring method that indicates the freshness of a document, and it has higher tolerance against word spamming than normal TF*IDF.
FTF*IDF works at low additional cost in terms of index size, retrieval response time, and index update time.