Review for IST 441 exam

Review for IST 441exam

Exam structure • Closed book and notes • Graduate students will answer more questions • Extra credit for undergraduates.

Hints All questions covered in the exercises are appropriate exam questions Past exams are good study guides

How much information is there in the world Informetrics - the measurement of information • What can we store • What do we intend to store. • What is stored. • Why are we interested.

What is information retrieval • Gathering information from a source(s) based on a need • Major assumption - that information exists. • Broad definition of information • Sources of information • Other people • Archived information (libraries, maps, etc.) • Web • Radio, TV, etc.

Information retrieved • Impermanent information • Conversation • Documents • Text • Video • Files • Etc.

What IR is usually not about • Usually just unstructured data • Retrieval from databases is usually not considered • Database querying assumes that the data is in a standardized format • Transforming all information, news articles, web sites into a database format is difficult for large data collections

What an IR system should do • Store/archive information • Provide access to that information • Answer queries with relevant information • Stay current • WISH list • Understand the user’s queries • Understand the user’s need • Acts as an assistant

How good is the IR system Measures of performance based on what the system returns: • Relevance • Coverage • Recency • Functionality (e.g. query syntax) • Speed • Availability • Usability • Time/ability to satisfy user requests

How do IR systems work Algorithms implemented in software • Gathering methods • Storage methods • Indexing • Retrieval • Interaction

Existing Popular IR System:Search Engine - Spring 2019

Index Query Engine Interface Indexer Users Crawler Web A Typical Web Search Engine

Index Query Engine Interface Indexer Users Crawler No Databases! Use an index Web A Typical Web Search Engine

Specialty Search Engines • Focuses on a specific type of information • Subject area, geographic area, resource type, enterprise • Can be part of a general purpose engine • Often use a crawler to build the index from web pages specific to the area of focus, or combine crawler with human built directory • Advantages: • Save time • Greater relevance • Vetted database, unique entries and annotations

Information Seeking Behavior • Two parts of the process: • search and retrieval • analysis and synthesis of search results

Size of information resources • Why important? • Scaling • Time • Space • Which is more important?

Scalability for Search • Scaling means how a system must grow if resources or work grows • Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth (wikipedia) • Search usually must scale in various ways: • Number of things searched – N • Number of searchers – M • Algorithms for indexing, ranking, etc.

Measuring the Growth of Work or Hardness of a Problem • While it is possible to measure the work done by an algorithm for a given set of inputs, we need a way to: • Measure the rate of growth of an algorithm based upon the size of the input (or output) • Compare algorithms to determine which is better for the situation • Compare and analyze for large problems • Examples of large problems?

Time vs. Space • Very often, we can trade space for time: • For example: maintain a collection of students’ with ID information. • Use an array of a billion elements and have immediate access (better time) • Use an array of number of students and have to search (better space)

Introducing Big O Notation • Will allow us to evaluate algorithms and understand scaling. • Has precise mathematical definition • Used in a sense to put algorithms into families • Worst case scenario • What does this mean? • Other types of cases?

Simplifying O( ) Answers • We say Big O complexity of • 3n2 + 2 = O(n2) drop constants! • because we can show that there is a n0 and a c such that: • 0  3n2 + 2  cn2 for n  n0 • i.e. c = 4 and n0 = 2 yields: • 0  3n2 + 2  4n2 for n  2 • What does this mean?

Big O issues • Useful for scaling • Sometimes constants matter for real problems • c is .0001 vs c is 106 • Use Big O carefully

Usually death to scaling Measuring the Growth of Work • As input size N increases, how well does our automated system work? • Depends on what you want to do! • Use algorithmic complexity theory: • Use measure big O: O(N) which means worst case • Important for • Search engines • Databases • Social networks • Crime/terrorism Performance classes Polynomial Sub-linear O(Log N) Linear O(N) Nearly linear O(N Log N) Quadratic O(N2) Exponential O(2N) O(N!) O(NN) practical impractical MapReduce may help

Two Categories of Algorithms Lifetime of the universe 1010 years = 1017 sec Unreasonable 1035 1030 1025 1020 1015 trillion billion million 1000 100 10 NN 2N Reasonable Runtime sec N5 N Don’t Care! 2 4 8 16 32 64 128 256 512 1024 Size of Input (N)

Reasonable vs. Unreasonable • Reasonable algorithms have polynomial factors • O (Log N) • O (N) • O (NK) where K is a constant • Unreasonable algorithms have exponential factors • O (2N) • O (N!) • O (NN)

Reasonable vs. Unreasonable • Reasonable algorithms • May be usable depending upon the input size • Unreasonable algorithms • Are impractical and useful to theorists • Demonstrate need for approximate solutions • Remember we’re dealing with large N (input size)

Definitions for IR • Document • what we will index, usually a body of text which is a sequence of terms • Tokens or terms • semantic word or phrase • Collections or repositories • particular collections of documents • sometimes called a database • Query • request for documents on a topic

What is a Document? • A document is a digital object • Indexable • Can be queried and retrieved. • Many types of documents • Text • Image • Audio • Video • data

Information Retrieval from Collections of Textual Documents Major Categories of Methods • Exact matching (Boolean) • Ranking by similarity to query (vector space model) • Ranking of matches by importance of documents (PageRank) • Combination methods What happens in major search engines

Text Based Information Retrieval Most matching methods are based on Boolean operators. Most ranking methods are based on thevector space model. Web searchmethods combine vector space model with ranking based on importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.

Statistical Properties of Text • Token occurrences in text are not uniformly distributed • They are also not normally distributed • They do exhibit a Zipf distribution

Zipf Distribution • The Important Points: • a few elements occur veryfrequently • a medium number of elements have medium frequency • manyelements occur very infrequently

Zipf Distribution • The product of the frequency of words (f) and their rank (r) is approximately constant • Rank = order of words’ frequency of occurrence • Another way to state this is with an approximately correct rule of thumb: • Say the most common term occurs C times • The second most common occurs C/2 times • The third most common occurs C/3 times • …

Zipf Distribution(linear and log scale)

What Kinds of Data Exhibit a Zipf Distribution? • Words in a text collection • Virtually any language usage • Library book checkout patterns • Incoming Web Page Requests (Nielsen) • Outgoing Web Page Requests (Cunha & Crovella) • Document Size on Web (Cunha & Crovella)

Why the interest in Queries? • Queries are ways we interact with IR systems • Nonquery methods? • Types of queries?

Query Engine Index Users Interface On-line Indexer Crawler Off-line Web Online vs offline processing

Queries Index Query Engine Interface Indexer Users Crawler Web A Typical Web Search Engine

Issues with Query Structures Matching Criteria • Given a query, what document is retrieved? • In what order?

Types of Query Structures Query Models (languages) – most common • Boolean Queries • Extended-Boolean Queries • Natural Language Queries • Vector queries • Others?

Simple query language: Boolean • Earliest query model • Terms + Connectors (or operators) • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT

Simple query language: Boolean • Geek-speak • Variations are still used in search engines!

Problems with Boolean Queries • Incorrect interpretation of Boolean connectives AND and OR • Example - Seeking Saturday entertainment Queries: • Dinner AND sports AND symphony • Dinner OR sports OR symphony • Dinner AND sports OR symphony

Order of precedence of operators Example of query. Is • A AND B • the same as • B AND A • Why?

Order of Preference • Define order of preference • EX: a OR b AND c • Infix notation • Parenthesis evaluated 1st with left to right precedence of operators • Next NOT’s are applied • Then AND’s • Then OR’s • a OR b AND c becomes • a OR (b AND c)

Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations. • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” + “frayed collar”

Ordering (ranking) of Retrieved Documents • Pure Boolean has no ordering • Term is there or it’s not • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?

Boolean Query - Summary • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial systems until the WWW

Representation Index Query Engine Interface Indexer Users Crawler Web A Typical Web Search Engine

Review for IST 441 exam