
Finding Out About (FOA)



1. Finding Out About (FOA)
• “Finding Out About”, by Richard Belew (2000), Cambridge University Press.
• Finding Out About (FOA): research activities that allow a decision maker to draw on others’ knowledge, especially the WWW = “Information Retrieval”.
• A library (or the WWW) contains many books (“documents”) on many topics. The authors are typically people far from our time or place, using language similar but not identical to our own. We must FOA a topic of special interest, looking only for those things which are relevant to our search. This basic skill is a fundamental part of an academic’s job:

2. What we look for
• We look for references to cite in an essay or paper.
• We read a textbook, looking for specific answers to questions.
• We comb through scientific journals to see if a question has already been answered.
• We can search the WWW for (others’ opinions of) music, movies or software.
• Typically an author anticipates the interests of some imagined audience and produces text which is a balance between what the author wants to say and what he or she thinks the audience wants to hear.
• A textual corpus will contain many such documents, written by many different authors in many different styles and for many different purposes (genres, registers).

3. Links with Natural Language Processing (NLP)
• FOA’s concern with the semantics (meaning) of entire documents is complemented by techniques from computational linguistics, which have tended to focus on syntactic analysis of individual sentences.
• The recent availability of new types of electronic artefacts, from email corpora and web logs to the logged browsing behaviour of millions of users, brings an empirical grounding for new theories of language.

4. FOA has 3 phases:
• asking a question
• constructing an answer
• assessing the answer

5. 1. Asking a question
• Users of a search engine may be aware of a specific gap in their knowledge and are motivated to fill it (meta-cognition about ignorance). They may not be able to articulate their knowledge gap: forming a clearly posed question is the hardest part of answering it!
• This common cognitive state is the user’s information need. Users try to take their ill-defined, internal cognitive state and turn it into an external expression of their question. This external expression is called the query, and the language in which it is constructed the query language.

6. 2. Constructing an answer
• A human question-answerer might consider:
• Can they translate the user’s ill-formed query into a better one?
• Do they know the answer themselves?
• Are they able to verbalise this answer in terms the user will understand?
• Can they provide the necessary background knowledge for the user to understand the answer itself?
• Current search engines are rather more limited in scope. The search engine has available to it only a pre-existing set of “canned” texts (although this set may be very large), and its response is limited to identifying one or more of these passages and presenting them to the user.

7. 3. Assessing the answer
• We would normally give feedback to a human answerer, e.g. “That isn’t what I meant”, “Let me ask it another way”, “That helps, but I still have this problem” or “What does that have to do with anything?”.
• So we “close the loop” when the user provides an assessment of how relevant they found the answer provided. In an automatic system this is relevance feedback: the user reacts to each retrieved document as “relevant”, “not relevant” or “neutral” (a minimal sketch of this labelling follows below).
• See Fig. 1.4, which shows the three steps in a computerised, algorithmic context: information retrieval.
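The sketch below renders this labelling step in Python. It is an illustration only, not code from the book: the function name, the callback interface and the canned judge are our own assumptions.

```python
# A minimal sketch of relevance-feedback labelling (illustrative, not from FOA).
# The user reacts to each retrieved document as "relevant", "not relevant"
# or "neutral"; downstream code can then use these labels to refine the query.

def collect_feedback(retrieved, judge):
    """Ask the user (modelled here as a callback) to label each document."""
    allowed = {"relevant", "not relevant", "neutral"}
    labels = {}
    for doc_id in retrieved:
        label = judge(doc_id)
        if label not in allowed:
            raise ValueError(f"unexpected label: {label}")
        labels[doc_id] = label
    return labels

# A canned judge standing in for a real user.
judge = {"d1": "relevant", "d2": "not relevant", "d3": "neutral"}.get
print(collect_feedback(["d1", "d2", "d3"], judge))
# {'d1': 'relevant', 'd2': 'not relevant', 'd3': 'neutral'}
```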

8. Keywords
• The fundamental operation performed by a search engine is a match between descriptive features mentioned by users in their queries and documents sharing those features. By far the most important kind of feature is the keyword.
• Keywords are linguistic atoms (typically words, pieces of words, or phrases) used to characterise the subject or content of the document.
• They are pivotal because they must bridge the gap between the users’ characterisation of their information need (queries) and the characterisation of the documents’ topical focus against which these will be matched.
• Contrast natural language queries with bag-of-words queries; a minimal bag-of-words match is sketched below.
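Here is a minimal bag-of-words match in Python. The scoring-by-overlap scheme and all document contents are our own illustration, not the book’s algorithm.

```python
# A minimal bag-of-words match (illustrative): score each document by how
# many query keywords it shares, then rank documents by that overlap.

def bag_of_words_score(query, document):
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)   # number of shared keywords

docs = {
    "d1": "neural networks for speech recognition",
    "d2": "cooking recipes for busy students",
}
query = "speech recognition networks"
ranked = sorted(docs, key=lambda d: bag_of_words_score(query, docs[d]), reverse=True)
print(ranked)   # ['d1', 'd2'] : d1 shares three keywords, d2 shares none
```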

9. Keyword hierarchies (ontology, thesaurus)
• Keywords tend to be related in a hierarchy, e.g. computer science “is broader than” artificial intelligence, while machine learning and robotics “are narrower than” artificial intelligence.
• These relations may be stored in a thesaurus (an ontology for information retrieval); a small sketch follows below.
• We must constrain the vocabulary so that it is exhaustive enough to express any imaginable topic within a domain, and specific enough that any subject we are interested in will be distinguished from all others.
• Computer science is (too) exhaustive, while robotic vacuum cleaners for 747 airliners may be too specific.
• The vocabulary size must be in the range 10,000 to 500,000 words.
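A tiny sketch of such a hierarchy in Python, using the slide’s own examples. The dictionary encoding and the naive recursive expansion are our illustration, not a prescribed thesaurus format.

```python
# A tiny thesaurus storing "narrower than" links, plus naive query expansion.
# The hierarchy entries come from the slide; the encoding is our own.

NARROWER = {
    "computer science": ["artificial intelligence"],
    "artificial intelligence": ["machine learning", "robotics"],
}

def expand(term):
    """Return the term plus everything narrower than it, recursively."""
    result = [term]
    for child in NARROWER.get(term, []):
        result.extend(expand(child))
    return result

print(expand("computer science"))
# ['computer science', 'artificial intelligence', 'machine learning', 'robotics']
```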

10. Keywords as document descriptors
• Keywords are also used as document descriptors.
• Indexing is the process of associating one or more keywords with each document.
• The vocabulary used can be either controlled or uncontrolled (also known as closed or open). If we organise a conference and ask the authors of each paper to index it manually using only terms on a fixed list of potential keywords, this is a closed vocabulary (sketched below).
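A sketch of the closed-vocabulary case, assuming an invented conference keyword list; the function and variable names are hypothetical.

```python
# Closed-vocabulary indexing: author-supplied keywords are accepted only if
# they appear on the conference's fixed list. All names here are illustrative.

CONTROLLED_VOCAB = {"information retrieval", "relevance feedback", "indexing"}

def index_paper(paper_id, proposed_keywords, index):
    accepted = [kw for kw in proposed_keywords if kw in CONTROLLED_VOCAB]
    rejected = [kw for kw in proposed_keywords if kw not in CONTROLLED_VOCAB]
    index[paper_id] = accepted
    return rejected                      # hand these back to the author

index = {}
rejected = index_paper("paper42", ["indexing", "my pet topic"], index)
print(index)      # {'paper42': ['indexing']}
print(rejected)   # ['my pet topic'] : not in the closed vocabulary
```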

11. Query syntax
• A typical search engine query consists of 2 to 3 words.
• Queries defined only as sets of keywords are simple queries; most search engines use this “bag of words” approach.
• Other possibilities exist, e.g. the Boolean operators AND / OR / NOT, as in “neural networks AND speech recognition” (sketched below).
• Verb(subject, object) triples, e.g. “aspirin treats blood_clotting”.
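A sketch of Boolean matching over per-document keyword sets, simplified to single keywords rather than phrases; the documents are invented.

```python
# Boolean AND / OR / NOT over keyword sets, simplified to single keywords.

docs = {
    "d1": {"neural", "networks", "speech", "recognition"},
    "d2": {"neural", "networks", "vision"},
    "d3": {"speech", "recognition", "hmm"},
}

def match_and(a, b):
    return {d for d, kws in docs.items() if a in kws and b in kws}

def match_or(a, b):
    return {d for d, kws in docs.items() if a in kws or b in kws}

def match_not(a):
    return {d for d, kws in docs.items() if a not in kws}

print(match_and("networks", "recognition"))   # {'d1'}
print(match_not("speech"))                    # {'d2'}
```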

12. Document length
• Longer documents can discuss more topics, and hence be associated with more keywords, and thus are more likely to be retrieved.
• This means we must normalise documents’ indices in some way to compensate for differing lengths (one simple scheme is sketched below).
• We also assume that the smallest unit of text with appreciable “aboutness” is the paragraph, and that larger documents are constructed from a number of paragraphs.
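One simple normalisation divides the raw keyword-overlap score by document length. This is a common choice used here for illustration, not necessarily the book’s prescription.

```python
# Length normalisation: divide the raw keyword-overlap score by document
# length in words, so long documents don't win simply by being long.

def normalised_score(query_words, doc_words):
    overlap = len(set(query_words) & set(doc_words))
    return overlap / len(doc_words) if doc_words else 0.0

query = "speech recognition".split()
short_doc = "speech recognition".split()
long_doc = ("speech recognition and also cooking gardening travel news " * 5).split()

print(normalised_score(query, short_doc))           # 1.0
print(round(normalised_score(query, long_doc), 3))  # 0.044 : same words match, heavily discounted
```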

13. Document proxies
• Nowadays, as computer storage capacities and network communication rates have increased, full-text retrieval and presentation are possible.
• With older systems and index cards, retrieval was limited to some proxy of the document searched for, such as a bibliographic citation, title or abstract. The text of ultimate interest remains quite distinct from the search engine used to find it.
• Proxies are still useful: scanning abridged versions of retrieved documents is easier for users than reading each whole document (gisting). A toy proxy builder is sketched below.
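A toy proxy builder: title plus a truncated abstract. The field names and the 150-character cut-off are our own choices, not from the book.

```python
# A toy proxy for gisting: the document's title plus a truncated abstract.

def make_proxy(doc, max_chars=150):
    abstract = doc["abstract"]
    if len(abstract) > max_chars:
        # cut at the last full word, then mark the truncation
        abstract = abstract[:max_chars].rsplit(" ", 1)[0] + "..."
    return doc["title"] + ": " + abstract

doc = {
    "title": "Finding Out About",
    "abstract": ("A cognitive perspective on search, covering how users pose "
                 "questions, how engines construct answers from indexed text, "
                 "and how relevance feedback closes the loop."),
}
print(make_proxy(doc))
```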

14. Indexing
• Indexing is the process by which a vocabulary of keywords is assigned to all documents of a corpus.
• Mathematically, an index is a relation mapping each document to the set of keywords that it is about:
Index : doc_i --about--> {kw_j}
• The inverse mapping (the inverted index) shows, for each keyword, the documents it describes:
Index⁻¹ : kw_j --describes--> {doc_i}
• This assignment can be done manually or automatically; both mappings are rendered in code below.
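The two mappings above, rendered directly in Python; the document and keyword names are illustrative.

```python
# The forward index and its inverse, built from a toy corpus.

from collections import defaultdict

index = {                                   # Index: doc_i -> {kw_j} it is about
    "doc1": {"aspirin", "blood", "clotting"},
    "doc2": {"aspirin", "headache"},
}

inverted = defaultdict(set)                 # Index^-1: kw_j -> {doc_i} it describes
for doc, keywords in index.items():
    for kw in keywords:
        inverted[kw].add(doc)

print(sorted(inverted["aspirin"]))          # ['doc1', 'doc2']
```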

15. Tokens for indexing
• In automatically selecting keywords, our first candidates will be tokens: sequences of characters delimited by white space. The simplest solution would be to say that all tokens are keywords, but this raises problems (a toy pipeline addressing some of them follows this list):
• Morphological features such as plurals: car and cars should be interchangeable.
• Word segmentation in some Asian texts.
• Rules for hyphenation, e.g. DATA-BASE.
• Phrase recognition: SPEED LIMIT is semantically cohesive, but what algorithm could distinguish it from other bigrams (consecutive pairs of words) that happen to occur sequentially?
• High-frequency terms with low information content (stop words).
• Words which come up only once (hapax legomena).
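A toy pipeline touching three of these problems: case folding, a crude plural rule and stop-word removal. The stop list and the strip-final-s rule are deliberately naive illustrations, not production stemming.

```python
# A naive token-to-keyword pipeline: lowercase, shed punctuation, drop stop
# words, and apply a crude plural rule (cars -> car). Illustrative only.

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def tokens_to_keywords(text):
    keywords = []
    for token in text.lower().split():
        token = token.strip(".,;:!?")          # shed trailing punctuation
        if not token or token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]                 # naive plural rule
        keywords.append(token)
    return keywords

print(tokens_to_keywords("The cars exceeded the speed limit."))
# ['car', 'exceeded', 'speed', 'limit']
```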

  16. IR vs. database retrieval

17. Research area: integration of the two technologies
• As well as their free text, many documents will carry meta-data that gives some facts about the document, e.g. author, date, publication. Both types of information might be sought in the following query:
• “I’m interested in documents about Fools’ Gold that have been published in children’s magazines in the last five years.”
• The Fools’ Gold element would be found by a typical search engine process. But the other elements, “children’s magazines” and “last five years”, seem to be queries against structured attributes that are typical of database queries. Hence we would need a hybrid of database and IR technologies, sketched below. (Actually “children’s magazines” is an intermediate case.)
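A sketch of such a hybrid query: database-style filters on structured attributes combined with an IR-style keyword match on free text. The schema and data are invented for illustration.

```python
# A hybrid query combining structured filters (database side) with a
# free-text keyword match (IR side). Schema and data are invented.

from datetime import date

docs = [
    {"text": "fools gold: how to recognise pyrite",
     "genre": "children's magazine", "published": date(2022, 5, 1)},
    {"text": "fools gold in mining history",
     "genre": "journal", "published": date(1998, 1, 1)},
]

def hybrid_search(docs, keywords, genre, since):
    return [d for d in docs
            if d["genre"] == genre                        # structured attribute
            and d["published"] >= since                   # structured attribute
            and all(kw in d["text"] for kw in keywords)]  # free-text match

hits = hybrid_search(docs, ["fools", "gold"], "children's magazine", date(2020, 1, 1))
print([d["text"] for d in hits])   # only the 2022 children's magazine article
```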

18. How well are we doing (1)?
• Evaluation of search engines is notoriously difficult. However, we have two measures of objective assessment. The first step is to focus on a particular query.
• We identify the set of documents Rel that are determined to be relevant to it (a subjective judgement!).
• A perfect search engine would retrieve all and only the documents in Rel.
• See Fig. 1.10.

19. Recall
• Clearly, the number of documents that were designated both relevant and retrieved, Retr ∩ Rel, will be a key measure of success. But we must compare the size of this set, |Retr ∩ Rel|, to something.
• If we are very concerned that the search engine retrieve every relevant document (e.g. every prior ruling relevant to a judicial case), we should compare the intersection to the number of documents marked as relevant, |Rel|.
• This measure is known as recall:
recall = |Retr ∩ Rel| / |Rel|

20. Precision
• However, we might instead be worried about how much of what we see is relevant (search engine users want a lot of relevant hits on the first page), so an equally reasonable standard of comparison is precision, the proportion of retrieved documents which are in fact relevant (a worked example follows below):
precision = |Retr ∩ Rel| / |Retr|
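A worked example of both measures on invented judgement sets:

```python
# Recall and precision on toy sets, following the definitions above:
#   recall    = |Retr ∩ Rel| / |Rel|
#   precision = |Retr ∩ Rel| / |Retr|

relevant = {"d1", "d2", "d3", "d4"}    # Rel: documents judged relevant
retrieved = {"d2", "d3", "d9"}         # Retr: documents the engine returned

both = retrieved & relevant            # Retr ∩ Rel = {'d2', 'd3'}
recall = len(both) / len(relevant)
precision = len(both) / len(retrieved)

print(recall)                # 0.5   : found 2 of the 4 relevant documents
print(round(precision, 3))   # 0.667 : 2 of the 3 retrieved were relevant
```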

21. Retrieved versus relevant documents
[Figure: Venn diagrams of the Retrieved and Relevant sets, contrasting high-recall retrieval with high-precision retrieval.]

22. Summary (1)
• We constantly and naturally Find Out About (FOA) many things. Computer search engines need to support this activity just as naturally.
• Language is central to our FOA activities. Our understanding of prior work in linguistics and the philosophy of language will inform our search engine development, and the increasing use of search engines will provide empirical evidence reflecting back on these same disciplines.
• IR is the field of computer science that traditionally deals with retrieving free-text documents in response to queries. This is done by indexing all the documents in a corpus with keyword descriptors. There are a number of techniques for automatically recommending keywords, but indexing also involves a great deal of art.

23. Summary (2)
• Users’ interests must be shaped into queries constructed from these same keywords. Retrieval is accomplished by matching the query against the documents’ descriptions and returning a list of those that appear closest.
• A central component of the FOA process is the users’ relevance feedback: assessing how closely the retrieved documents match what they had in mind.
• Search engines have a function similar to databases, but their natural language foundations create fundamental differences as well.
• In order to know how to shop for a good search engine, as well as to allow the science of FOA to move forward, we need an evaluation methodology by which we can fairly compare alternatives.
