Information Filtering

Information Filtering Modern Information Retrieval Course, Semantic Web Research Laboratory

Outline • Introduction • Information Filtering concept • Previous work • Filtering general features • Filtering rules and attributes • Types of filters • Profiling and filtering technologies • User-modeling techniques • Conclusion

Introduction • The Internet and information overload • A vast amount of information of varying quality is disseminated. • There are lots of interesting things, but also lots of trash. • Filtering tools help people find the most valuable information.

Introduction • The goal of an information filtering system is to sort through large volumes of dynamically generated information and present to the user those items that are likely to satisfy his or her information requirement.

Introduction • In order to identify information that satisfies a user's information requirement or interest, an IF system needs to acquire an information filter that, when applied to an information item, evaluates whether the item is of interest or not. • An information filter represents the user's interests. • It identifies only those pieces of information that a user would find interesting. • The key question in designing an IF system is how to acquire such an information filter.

Information Filtering concept • Filtering information is not a new concept, nor is it one that is limited to electronic documents. • When we read standard paper texts, information filtering occurs. • We only buy certain magazines, since other magazines may contain information that is redundant with or irrelevant to our interests • With the increasing availability of information in electronic form, it becomes more important and feasible to have automatic methods to filter information.

Information Filtering concept • An information filtering system can be described as an automatic mechanism that monitors a continuous flow of documents and selects those that are relevant to a particular user or user group, according to their needs. • Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests.

Information Filtering concept • These needs are represented through a profile of interests associated with the user or user group. • The ability to select relevant documents relies on retrieval mechanisms that calculate the similarity between the documents in the collection and the profiles. • Documents that are highly similar to the profile are considered important for the user or user group.
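
The similarity calculation between documents and a profile can be sketched as follows. This is a minimal illustration, assuming bag-of-words term vectors and cosine similarity; the slides do not say which similarity measure is used, and the names `term_vector`, `cosine`, `select`, and the threshold value are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Lower-cased bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select(profile, documents, threshold=0.2):
    """Keep the documents whose similarity to the profile meets the threshold."""
    return [d for d in documents if cosine(profile, term_vector(d)) >= threshold]

# Hypothetical profile and incoming document stream
profile = term_vector("information filtering user profile")
stream = [
    "a new user profile method for information filtering",
    "football results from the weekend",
]
selected = select(profile, stream)  # only the first document shares terms with the profile
```

The threshold is a tuning choice: a higher value passes fewer documents, a lower value passes more at the cost of more irrelevant ones.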

Information Filtering concept • Due to personal or professional reasons, a user's interests may shift or change. • These changes may happen in a relatively short time or over a long period. • The shifts can affect the user's interests partially or fully. • To cope with this problem, it should be possible to reformulate the user's profile. • This update is made through feedback sent to the system about the relevance of the received documents.
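
One possible way to reformulate a profile from relevance feedback is a Rocchio-style update, sketched below. The slides do not name an update method, so the technique, the function `update_profile`, and the weights `alpha`, `beta`, and `gamma` are all assumptions made for illustration.

```python
from collections import Counter

def update_profile(profile, doc_terms, relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Move the profile toward a relevant document or away from a non-relevant one.

    profile:   dict mapping term -> weight
    doc_terms: Counter of term frequencies in the judged document
    The weights alpha, beta, gamma are illustrative values, not tuned ones.
    """
    updated = {t: alpha * w for t, w in profile.items()}
    step = beta if relevant else -gamma
    for term, freq in doc_terms.items():
        updated[term] = updated.get(term, 0.0) + step * freq
    # Terms whose weight is no longer positive are dropped from the profile
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical profile with a fading interest in football
profile = {"filtering": 1.0, "football": 0.2}
profile = update_profile(profile, Counter(["filtering", "profile"]), relevant=True)
profile = update_profile(profile, Counter(["football"]), relevant=False)
```

After the two judgments, the weight for "filtering" has grown, "profile" has entered the profile, and "football" has been dropped, which mirrors a partial shift of interest.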

Information Filtering concept • One of the simplest methods of determining whether information matches a user's interests is keyword matching. • If a user's interests are described by certain words, then information containing those words should be relevant. • This straightforward keyword matching often fails, however. • Inappropriate matches can arise because: • The words people use do not unambiguously reflect the topic or content. • A single word can have more than one meaning (e.g., chip). • The same concept can be described by surprisingly many different words (e.g., human factors, ergonomics).
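
Both failure modes can be reproduced with a few lines of code. The sketch below assumes the simplest possible matcher (shared words between a profile and a text); `keyword_match` and the sample texts are hypothetical.

```python
def keyword_match(profile_words, text):
    """True if the text shares at least one word with the profile."""
    return bool(set(profile_words) & set(text.lower().split()))

# Hypothetical profile: the user cares about computer chips and ergonomics
profile = {"chip", "ergonomics"}

# One word, several meanings: a cooking text matches on "chip"
false_hit = keyword_match(profile, "crispy potato chip recipe")

# One concept, several words: a text on human factors is missed
missed = keyword_match(profile, "a study of human factors in cockpit design")
```

Here `false_hit` is `True` although the text is irrelevant, and `missed` is `False` although the text covers the same concept as "ergonomics".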

Information Filtering concept • Furnas showed that two people use the same main word to describe an object only 10 to 20 percent of the time. • Bates has reported comparably poor agreement in the generation of search terms by trained intermediaries.

Previous work • Conventional information retrieval (IR) is very closely related to information filtering (IF) • They both have the goal of retrieving information relevant to what a user wants • And minimizing the amount of irrelevant information retrieved

Previous work • One of the earliest forms of electronic information filtering came from work on Selective Dissemination of Information (SDI). • SDI was designed as an automatic way of keeping scientists informed of new documents published in their areas of specialization. • The scientist could create and modify a user profile of keywords that described his or her interests. • SDI used the profile to match the keywords against new articles in order to predict which new articles would be most relevant to the scientist's interests.

Previous work • Allen conducted a series of experiments to explore user models in predicting preferences for news articles. • He predicted which articles a person would read based on previous articles read using a measure of overlap of nouns between the new and old articles. • While the predictions were better than chance, the average correlation between the predicted articles and the subjects' ratings of the articles was fairly low (r=0.44).

Previous work • The models were more successful at predicting user preferences for general categories of articles than for specific articles. • Predicting what news articles a person will read may be an especially difficult task. • News topics vary from day to day, making it difficult to get stable estimates of interest. In addition, external sources of news probably influenced what people read in the experiment. • We believe that users' interests in technical literature will be more stable over time.

Previous work • In Allen's research, the subject's past preferences were used to construct an implicit model for retrieving relevant articles. • A different approach is to let the user explicitly structure the information. • For example, the Information Lens system allows users to create rules to filter mail messages based on keyword matches in the mail fields. • Mail messages have some structure (e.g., sender, subject). • These rules can take advantage of this structure to perform user-specified actions on the messages.

Previous work • While a variety of information systems have been developed, there has been little systematic evaluation of what features are most effective for filtering. • This leaves many unanswered questions, such as: • What are the most effective methods for matching a user's interests to information available? • How should a user's interests be described? • How will the performance of filtering methods vary in different domains?

Filtering general features • An information filtering system is an information system designed for unstructured or semi-structured data. • This contrasts with a typical database application that involves very structured data, such as employee records. • The notion of structure being used here is not only that the data conforms to a format such as a record type description, but also that the fields of the records consist of simple data types with well-defined meanings. • Email messages are an example of semi-structured data in that they have well-defined header fields and an unstructured text body.

Filtering general features • Information filtering systems deal primarily with textual information. • Unstructured data is often used as a synonym for textual data. • The term is, however, more general and should include other types of data, such as images, voice, and video, that are part of multimedia information systems. • None of these data types are handled well by conventional database systems, and all have meanings that are difficult to represent.

Filtering general features • Filtering systems involve large amounts of data. • Typical applications would deal with gigabytes of text, or much larger amounts of other media. • Filtering applications typically involve streams of incoming data, either being broadcast by remote sources (such as newswire services), or sent directly by other sources (email). • Filtering has also been used to describe the process of accessing and retrieving information from remote databases, in which case the incoming data is the result of the database searches.

Filtering general features • Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests. • Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream. • In the first case: • The users of the system see what is left after the data is removed. • In the latter case: • They see the data that is extracted. • A common example of the first approach is an email filter designed to remove junk mail. • Profiles may express not only what people want, but also what they do not want.

Filtering general features • Many of these features are virtually the same as those found in a variety of other text-based information systems. • Text routing, for example, involves sending relevant incoming data to individuals or groups. • This process is essentially identical to filtering. • Categorization systems are designed to attach one or more predefined categories to incoming objects (this is done by newswire services, for example). • The major difference from filtering in this case is the static nature of the categories, when compared to profiles.

IF vs. IR • The entities and processes relevant to IF are almost identical to those that are relevant to IR. • The major differences appear to be: • IR is typically concerned with single uses of the system, by a person with a one-time goal and one-time query. • IF is concerned with repeated uses of the system, by a person or persons with long-term goals or interests.

IF vs. IR • IR recognizes inherent problems in the adequacy of queries as representations of information needs. • IF assumes that profiles can be correct specifications of information interests. • IR is concerned with the collection and organization of texts. • IF is concerned with the distribution of texts to groups or individuals.

IF vs. IR • IR is typically concerned with the selection of texts from a relatively static database. • IF is mainly concerned with selection or elimination of texts from a dynamic data stream. • IR is concerned with responding to the user’s interaction with texts within a single information-seeking episode. • IF is concerned with long-term changes over a series of information-seeking episodes.

IF vs. IR • In addition to these distinctions based on the models of IR and IF, there seem to be some other, contextual differences that might also be relevant to research interests. • These arise from differences in the social and/or practical situations with which IR and IF have been concerned. • These differences can be categorized as relating to: • Texts • Users • The general environment of concern to each.

IF vs. IR • Text-related issues. • For IF, the timeliness of a text is often of overriding significance. • For IR, this has typically not been the case. • User-related issues. • IR has, by and large, studied well-defined user groups in well-defined, specific domains, largely in science and technology. • IF, however, is often concerned with poorly defined user communities. • Environmental issues. • IF is highly concerned, in many situations, with issues of privacy. • IR, for a variety of reasons, has paid almost no attention to this kind of problem.

Filtering using IR • In general, the idea for filtering is to create a space of documents, some of which have previously been judged by a user to be relevant to his or her interests. • If a new document is close to relevant documents in the space, then it would be considered likely to be interesting to the user. • For all these comparisons, the only difference between the LSI (Latent Semantic Indexing) and the keyword matching methods is that LSI represents terms and documents in a reduced-dimensional space of derived indexing dimensions.

Filtering using IR • Foltz compared LSI and keyword vector matching for filtering of Netnews articles. • In an experiment, subjects rated Netnews articles as either relevant or not relevant to their interests. • The ratings from the initial 80% of the articles they read were used to predict the relevance of the remaining 20% of the articles for each person. • Foltz found that LSI filtering improved prediction performance over the keyword matching method by an average of 13% and showed a 26% improvement in precision.
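
The 80/20 evaluation setup can be sketched in a few lines. The code below is an illustration only: it uses the plain keyword-vector baseline (a new article is predicted relevant if it is close to any previously relevant article), not LSI, which would first project the vectors into a reduced-dimensional space. The function names, the similarity threshold, and the toy articles are assumptions and do not reproduce Foltz's data or results.

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_relevant(text, relevant_vectors, threshold=0.3):
    """Predict relevance if the new article is close to any known relevant one."""
    v = vec(text)
    return any(cosine(v, r) >= threshold for r in relevant_vectors)

def evaluate(rated, train_fraction=0.8):
    """rated: list of (text, is_relevant) in reading order.

    Uses the first part as judged history and returns the precision of the
    predictions on the remaining part (0.0 if nothing is predicted relevant).
    """
    cut = int(len(rated) * train_fraction)
    train, test = rated[:cut], rated[cut:]
    relevant_vectors = [vec(t) for t, rel in train if rel]
    flagged = [rel for t, rel in test if predict_relevant(t, relevant_vectors)]
    return sum(flagged) / len(flagged) if flagged else 0.0

# Toy reading history (hypothetical articles and ratings)
rated = [
    ("latent semantic indexing for retrieval", True),
    ("cheap flights to paris", False),
    ("keyword vector matching for retrieval", True),
    ("local weather report", False),
    ("semantic indexing of news articles", True),
    ("celebrity gossip weekly", False),
    ("filtering news articles by profile", True),
    ("stock market summary", False),
    ("semantic indexing for filtering news", True),
    ("recipe for apple pie", False),
]
precision = evaluate(rated)
```

Precision here is the fraction of held-out articles predicted relevant that the user actually rated relevant; a full evaluation would also report recall.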

Automatic vs. Social filtering • Automatic filtering: • the computer evaluates what is of value to you. • Social filtering (collaborative filtering): • other people help you evaluate what is most worth reading, just as publishers and organizations did in society before the Internet.

Social filtering • Social filtering means that ratings of some kind are assigned to documents. • The ratings can be compared to the stars (***) which newspapers often assign to films, books and other consumer products. • But the ratings can also include categorization into subject areas or according to particular scales. • Social filtering has some similarities to the filtering done by editors, journalists and publishers, since in both cases humans select the filtering attributes.

Social filtering • Why use social filtering? • It is difficult to design automatic or intelligent filtering algorithms that can really evaluate the content of a document and judge its value. • Humans are more capable of deciding the value of a document. • Who makes the ratings? • Ratings for use in social filtering can be provided by:

Social filtering • Editors: • people whose task is to provide such ratings. An example is the people selecting which messages to put into services like Yahoo. • Readers: • ordinary readers might input ratings on what they read, and these ratings might be collected and put into databases to help other people. • Authors: • can provide certain kinds of ratings themselves.

Social filtering • The most successful social filtering system is Yahoo. • Yahoo employs humans to evaluate documents, and puts those that are interesting into its structured information database. • This is very similar to what publishers, editors, journalists and organizations did in the world before the Internet.

Social filtering • The simplest and most common filtering is by organizing discussions into groups (newsgroups, mailing lists, forums, etc.). • Each group has a topic, and wants only contributions within that topic. • Sometimes the right to submit contributions is restricted: • Only members can submit. • Competence control is done before accepting a new member. • Special moderators must approve contributions before distribution. • A recipient's choice of which groups to subscribe to can thus be seen as setting a personal filter.

Thread filtering • Another simple and common filtering method is to filter by thread. • A thread is a set of messages that directly or indirectly refer to each other. • People can use threads for filtering by specifying that they want to skip existing and future contributions in certain threads. • In Usenet News, this functionality is known as a kill file.

Thread filtering • In discussion groups, messages often belong to threads. • It may then not be possible to understand a single message without seeing other messages in the same thread. • A filter or search facility that selects only certain individual messages out of threads might then not satisfy its users. • The filter must either select several items in the thread, or at least make it very easy for users, when reading one selected message, to traverse the tree up and down from this message.
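
Both thread operations, skipping killed threads and reading a selected message in context, can be sketched with a simple reply map. This is a minimal illustration that assumes each message records at most one parent and that reply links contain no cycles; `parent_of`, `thread_root`, `visible`, and `whole_thread` are hypothetical names.

```python
def thread_root(msg_id, parent_of):
    """Follow reply links upward to the message that started the thread."""
    while parent_of.get(msg_id) is not None:
        msg_id = parent_of[msg_id]
    return msg_id

def visible(messages, parent_of, killed_roots):
    """Drop every message, existing or future, whose thread has been killed."""
    return [m for m in messages if thread_root(m, parent_of) not in killed_roots]

def whole_thread(msg_id, parent_of):
    """All messages in the same thread as msg_id, so it can be read in context."""
    root = thread_root(msg_id, parent_of)
    return [m for m in parent_of if thread_root(m, parent_of) == root]

# Hypothetical reply map: message id -> id of the message it replies to
parent_of = {"a": None, "b": "a", "c": "b", "x": None, "y": "x"}

remaining = visible(list(parent_of), parent_of, killed_roots={"a"})
context = whole_thread("c", parent_of)
```

Because the kill decision is stored against the thread root, a reply that arrives later is skipped automatically as long as it links back to a killed thread.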

Filtering rules and attributes • Filtering is done by applying filtering rules to attributes of the documents to be filtered. • Filtering rules are often Boolean conditions. • They are usually put in an ordered list, which is scanned for each item to be filtered. • Document attributes used in filtering include: • words in the title, abstract, or whole document • automatic measurements of stylistic and language quality • the author's name, and ratings of the document supplied by its author or by other people
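
An ordered list of Boolean rules can be sketched as follows. The example assumes a first-match policy (the first rule whose condition holds decides the outcome) and uses a target folder as the rule's action; the `Rule` class, the attribute names, and the sample addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    condition: Callable[[dict], bool]  # Boolean test on the document's attributes
    folder: str                        # where a matching document is filed

def file_document(doc, rules, default="inbox"):
    """Scan the ordered rule list; the first matching rule decides the folder."""
    for rule in rules:
        if rule.condition(doc):
            return rule.folder
    return default

# Hypothetical rules over author, title words, and a 1-to-5 rating
rules = [
    Rule(lambda d: d["author"] == "boss@example.org", "urgent"),
    Rule(lambda d: "filtering" in d["title"].lower() and d["rating"] >= 3, "research"),
    Rule(lambda d: d["rating"] <= 1, "trashcan"),
]

doc = {"author": "a@example.org", "title": "Collaborative Filtering survey", "rating": 4}
folder = file_document(doc, rules)
```

Rule order matters under this policy: a low-rated message from the first author would still be filed as urgent, because its rule is scanned before the trashcan rule.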

Filtering rules and attributes • Filtering can be done in servers or in clients. • A server can filter messages before they are downloaded to the client. • Advantage: • Filtering can be done in the background. • Disadvantage: • Communication between the user and the filtering system becomes more complex.

Filtering rules and attributes • Alternatively, filters may be part of the client, and apply to sets of documents after they have been downloaded to the client.

Delivery of filtering results • The most common way to deliver filtering results is to sort documents into different folders. • Users choose to read new items one folder at a time. • The filter helps users read messages on the same topic at the same time. • The user can also set a personal priority for the order in which folders are read. • Unwanted messages can be filtered to special “trashcan” folders.

Intelligent filtering • Intelligent filtering means using artificial intelligence (AI) methods to enhance filtering. • This can be done in different ways: • to derive attributes for documents, • to derive filtering rules, • for the filtering process itself. • With the machine learning approach: • such filtering can be done in the background, with little or no interaction with the user, • or it can be done interactively, so that the user helps the filter understand why the user likes certain messages.

Filtering against spamming • Many people want filters that remove unsolicited direct marketing e-mail messages, so-called spam. • The filter has to recognize special properties of spam messages that distinguish them from other messages. • Examples of such properties are: • A message does not have your name or e-mail address in the message heading, and it does not come from any mailing list that you subscribe to.

Filtering against spamming • Examples of such properties are: • The author or sender of a message has an invalid e-mail address. • Certain words, such as “money” or “$$$”, appear in the subject. This is not very dependable; it has the same problem as all intelligent filtering. • If you often get similar spam, you might be able to recognize special properties of it and use them to stop further similar spam. • The same message, with identical content, was sent to very many users.
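
Several of these properties can be checked mechanically on a single message. The sketch below is illustrative only: the message layout (a dict with `from`, `to`, `subject`, and an optional `list` field), the rough address pattern, and the word list are assumptions, and the check for identical content sent to many users is left out because it needs data from many mailboxes.

```python
import re

# Rough shape check only; this does not validate addresses against any standard
ADDRESS = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
SUSPICIOUS = ("money", "$$$")  # undependable on its own, as the slide notes

def spam_signals(msg, my_address, my_lists):
    """Return the list of spam properties that the message exhibits."""
    signals = []
    addressed_to_me = my_address in msg["to"]
    from_my_list = msg.get("list") in my_lists
    if not addressed_to_me and not from_my_list:
        signals.append("not addressed to me and not from my lists")
    if not ADDRESS.match(msg["from"]):
        signals.append("malformed sender address")
    if any(word in msg["subject"].lower() for word in SUSPICIOUS):
        signals.append("suspicious subject words")
    return signals

# Hypothetical messages
spam = {"from": "nobody", "to": ["undisclosed"], "subject": "Make MONEY fast"}
legit = {"from": "ann@example.org", "to": ["me@example.org"], "subject": "Lunch?"}

spam_hits = spam_signals(spam, "me@example.org", {"ir-course"})
legit_hits = spam_signals(legit, "me@example.org", {"ir-course"})
```

Returning the individual signals rather than a yes/no verdict lets the caller decide how many weak indicators are needed before a message is moved to a trashcan folder.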

Types of filters • Various types of filters: • Content-based filters • Collaborative filters • Hybrid filters

Content-based Filters • A content-based filter makes use of the content of the information items to evaluate whether an item is interesting. • Profiles take the form of either user-specified keywords or rules, and reflect the long-term interests of the user. • Often the user would like the system to learn the user profile rather than having to provide one. • This generally involves the application of Machine Learning (ML) techniques. • The user's feedback can be acquired either implicitly, by observing the user, or explicitly, by asking the user to rate the information items seen.

Content-based Filters (cont.) • The two primary weaknesses of using ML techniques to learn a user profile are: • Most techniques require large amounts of data. • If a new information item is significantly different from anything seen (and hence labeled) by the user before, the learned profile cannot make an accurate prediction. • Content-based filters have been used successfully in various domains, including: • Web browsing (Letizia and Syskill & Webert), • News filtering (NewsWeeder, WebMate and NewsDude), • Email filtering (Re:Agent and EmailValet).
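
Learning a profile from explicit ratings can be sketched with a very small word-based Naive Bayes classifier. The slides do not say which ML technique the cited systems use, so this choice is an assumption, as are the class name `ProfileLearner`, its methods, and the toy ratings. The last example shows the second weakness: an item unlike anything rated before gets a score close to zero, meaning the learned profile has almost nothing to say about it.

```python
import math
from collections import Counter

class ProfileLearner:
    """Tiny multinomial Naive Bayes over words, trained on the user's ratings."""

    def __init__(self):
        self.words = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def feedback(self, text, interesting):
        """Record one explicit rating from the user."""
        self.words[interesting].update(text.lower().split())
        self.docs[interesting] += 1

    def score(self, text):
        """Log-odds that the item is interesting (positive means probably yes)."""
        vocab = max(1, len(set(self.words[True]) | set(self.words[False])))
        total_yes = sum(self.words[True].values())
        total_no = sum(self.words[False].values())
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for w in text.lower().split():
            # Add-one smoothing keeps unseen words from producing zero probabilities
            p_yes = (self.words[True][w] + 1) / (total_yes + vocab)
            p_no = (self.words[False][w] + 1) / (total_no + vocab)
            log_odds += math.log(p_yes / p_no)
        return log_odds

# Hypothetical feedback from one user
learner = ProfileLearner()
learner.feedback("semantic web ontology", True)
learner.feedback("information retrieval ranking", True)
learner.feedback("football match results", False)
learner.feedback("celebrity gossip", False)

liked = learner.score("semantic retrieval")     # words seen in liked items
disliked = learner.score("football gossip")     # words seen in disliked items
unknown = learner.score("quantum chemistry")    # nothing like it was ever rated
```

With only four ratings the scores are weak evidence at best, which illustrates the first weakness as well: most learning techniques need far more labeled data than a user is willing to provide.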

Collaborative Filters • Collaborative filters, also known as social filters, are often used in recommender systems. • A collaborative filter makes use of a database of user preferences to find users with similar interests. • It predicts whether an unseen information item is likely to be of interest to you based on how other users have rated this item. • A community of users has to continuously rate whether the information they have seen is interesting to them or not. • Generally this rating is on a scale (e.g., from 1, meaning “not interesting”, to 5, meaning “very interesting”).

Collaborative Filters (cont.) • Collaborative filters have two common weaknesses: • The first-rater problem • If no users have rated an information item, the filter cannot evaluate whether that item is likely to be of interest to its user. • Sparse data • Most users do not rate much information because of the time it takes, so it is not always easy to find users with similar profiles. • Collaborative filters work quite well and have been applied successfully in a variety of domains, including: • Finding people who are knowledgeable in a given field (Tapestry) • Netnews (GroupLens) • Music recommendation (Ringo) • Helping people to find Web resources (PHOAKS) • CDNow.com, reel.com, and Amazon.com.
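
A user-based collaborative filter can be sketched as follows. This is a simplified illustration, assuming Pearson correlation as the similarity between users and a similarity-weighted average of the neighbours' ratings as the prediction; the slides do not specify these choices, and the names `similarity`, `predict`, and the toy ratings are hypothetical. The function returns `None` when nobody similar has rated the item, which is the first-rater problem in code.

```python
import math

def similarity(a, b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = a.keys() & b.keys()
    if len(common) < 2:
        return 0.0  # sparse data: too little overlap to compare the users
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = math.sqrt(sum((a[i] - ma) ** 2 for i in common)
                    * sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(user, item, ratings):
    """Similarity-weighted average of ratings by positively correlated users.

    Returns None if no such user has rated the item.
    """
    weighted, total = 0.0, 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = similarity(ratings[user], theirs)
        if w > 0:
            weighted += w * theirs[item]
            total += w
    return weighted / total if total else None

# Hypothetical ratings on a 1-to-5 scale
ratings = {
    "ann": {"d1": 5, "d2": 1, "d3": 4},
    "bob": {"d1": 4, "d2": 2, "d3": 5, "d4": 5},
    "cat": {"d1": 1, "d2": 5, "d4": 2},
}

guess = predict("ann", "d4", ratings)    # driven by bob, whose tastes match ann's
unrated = predict("ann", "d9", ratings)  # nobody has rated d9 yet
```

Ignoring negatively correlated users is one of several possible design choices; other variants also use them, or subtract each user's mean rating before averaging.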
