
LEXICAL ANALYSIS AND STOPLISTS



  1. LEXICAL ANALYSIS AND STOPLISTS Chapter 7

  2. Content • Introduction • Lexical Analysis for Automatic Indexing • Lexical Analysis for Query Processing • The Cost of Lexical Analysis • Implementing a Lexical Analyzer • STOPLISTS • Implementing Stoplists • A Lexical Analyzer Generator

  3. 1. Introduction • Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens; tokens are groups of characters with collective significance. Lexical analysis is the first stage both of automatic indexing and of query processing. • Automatic indexing is the process of algorithmically examining information items to generate lists of index terms. The lexical analysis phase produces candidate index terms that may be further processed, and eventually added to indexes (see Chapter 1 for an outline of this process). • Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of a query produces tokens that are parsed and turned into an internal representation suitable for comparison with indexes.
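To make the character-to-token conversion concrete, here is a minimal tokenizer sketch in Python (my own illustration, not code from the chapter). It emits maximal runs of letters as tokens and folds case, one simple policy among those discussed below:

    def tokens(stream):
        # Convert a stream of characters into a stream of tokens:
        # accumulate letters; any other character ends the current token.
        word = []
        for ch in stream:
            if ch.isalpha():
                word.append(ch.lower())   # case folding is a policy choice (see slide 8)
            elif word:
                yield "".join(word)
                word = []
        if word:
            yield "".join(word)

    # list(tokens("Lexical analysis, the first stage."))
    # -> ['lexical', 'analysis', 'the', 'first', 'stage']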

  4. 2. Lexical Analysis for Automatic Indexing • The first decision that must be made in designing a lexical analyzer for an automatic indexing system is: What counts as a word or token in the indexing scheme? At first, this may seem an easy question, and there are some easy answers to it--for example, terms consisting entirely of letters should be tokens. Problems soon arise, however. Consider the following: Digits--Most numbers do not make good index terms, so often digits are not included as tokens. However, certain numbers in some kinds of databases may be important (for example, case numbers in a legal database).

  5. Contd.. • Also, digits are often included in words that should be index terms, especially in databases containing technical documents. • For example, a database about vitamins would contain important tokens like "B6" and "B12." One partial (and easy) solution to the last problem is to allow tokens to include digits, but not to begin with a digit.
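One hedged reading of this "digits allowed, but not first" policy as a regular expression (the exact character classes are a design choice, not fixed by the text):

    import re

    TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # a token must start with a letter

    # TOKEN.findall("vitamins B6 and B12, batch 42")
    # -> ['vitamins', 'B6', 'and', 'B12', 'batch']   ("42" is not a token)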

  6. Contd.. • Hyphens--Another difficult decision is whether to break hyphenated words into their constituents, or to keep them as a single token. Breaking hyphenated terms apart helps with inconsistent usage (e.g., "state-of-the-art" and "state of the art" are treated identically), but loses the specificity of a hyphenated phrase. • Also, a hyphen character is often typed in place of an em dash, or used to mark a single word broken into syllables at the end of a line; treating hyphens used in these ways as true hyphens does not work. On the other hand, hyphens are often part of a name, such as "Jean-Claude," "F-16," or "MS-DOS."
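The two hyphen policies can be sketched as alternative post-processing steps (hypothetical helper names; a real analyzer would also have to handle the line-break and em-dash uses noted above separately):

    def break_hyphens(token):
        # Policy A: split into constituents, so "state-of-the-art"
        # and "state of the art" index identically.
        return token.split("-")

    def keep_hyphens(token):
        # Policy B: keep the hyphenated term whole, preserving the
        # specificity of names like "Jean-Claude", "F-16", "MS-DOS".
        return [token]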

  7. Contd.. • Other Punctuation--Like the dash, other punctuation marks are often used as parts of terms. For example, periods are commonly used as parts of file names in computer systems (e.g., "COMMAND.COM" in DOS), or as parts of section numbers; slashes may appear as part of a name (e.g., "OS/2"). If numbers are regarded as legitimate index terms, then numbers containing commas and decimal points may need to be recognized. The underscore character is often used in terms in programming languages (e.g., "max_size" is an identifier in Ada, C, Prolog, and other languages).
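As a sketch, the token rule above can be extended to admit selected punctuation inside tokens; which characters to admit is, again, a policy choice (the classes below are illustrative, not the chapter's):

    import re

    TOKEN = re.compile(r"""
        [A-Za-z][A-Za-z0-9_]*(?:[./][A-Za-z0-9_]+)*   # words; '_' inside, '.' or '/' joining parts
      | \d+(?:[.,]\d+)*                               # numbers with commas or decimal points
    """, re.VERBOSE)

    # TOKEN.findall("run COMMAND.COM on OS/2 with max_size 1,024.5")
    # -> ['run', 'COMMAND.COM', 'on', 'OS/2', 'with', 'max_size', '1,024.5']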

  8. Contd.. • Case--The case of letters is usually not significant in index terms, and typically lexical analyzers for information retrieval systems convert all characters to either upper or lower case. Again, however, case may be important in some situations. • For example, case distinctions are important in some programming languages, so an information retrieval system for source code may need to preserve case distinctions in generating index terms. There is no technical difficulty in solving any of these problems, but information system designers must think about them carefully when setting lexical analysis policy.

  9. Contd.. • Recognizing numbers as tokens adds many terms with poor discrimination value to an index, but may be a good policy if exhaustive searching is important. • Breaking up hyphenated terms increases recall but decreases precision, and may be inappropriate in some fields (like an author field). Preserving case distinctions enhances precision but decreases recall.

  10. Contd.. • Commercial information systems differ somewhat in their lexical analysis policies, but are alike in usually taking a conservative (recall enhancing) approach. For example, Chemical Abstracts Service, ORBIT Search Service, and Mead Data Central's LEXIS/NEXIS all recognize numbers and words containing digits as index terms, and all are case insensitive. • None has special provisions for most punctuation marks in most indexed fields. However, Chemical Abstracts Service keeps hyphenated words as single tokens, while the ORBIT Search Service and LEXIS/NEXIS break them apart (if they occur in title or abstract fields).

  11. 3. Lexical Analysis for Query Processing • Designing a lexical analyzer for query processing is like designing one for automatic indexing. It also depends on the design of the lexical analyzer for automatic indexing: since query search terms must match index terms, the same tokens must be distinguished by the query lexical analyzer as by the indexing lexical analyzer. In addition, however, the query lexical analyzer must usually distinguish operators (like the Boolean operators, stemming or truncating operators, and weighting function operators), and grouping indicators (like parentheses and brackets).

  12. Contd.. • A lexical analyzer for queries should also process certain characters, like control characters and disallowed punctuation characters, differently from one for automatic indexing. • Such characters are best treated as delimiters in automatic indexing, but in query processing, they indicate an error. Hence, a query lexical analyzer should flag illegal characters as unrecognized tokens.
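A sketch of such a query lexical analyzer in Python (the token names and operator characters are my own illustration, not the chapter's): terms, Boolean operators, and grouping indicators each become distinct tokens, and anything else is reported as unrecognized rather than silently treated as a delimiter:

    import re

    SPEC = [("LPAREN", r"\("), ("RPAREN", r"\)"),
            ("AND", r"&"), ("OR", r"\|"), ("NOT", r"!"),
            ("TERM", r"[A-Za-z][A-Za-z0-9]*"),
            ("SPACE", r"\s+"),
            ("UNRECOGNIZED", r".")]   # illegal character: flag it, don't skip it

    PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in SPEC))

    def query_tokens(text):
        for m in PATTERN.finditer(text):
            if m.lastgroup != "SPACE":
                yield m.lastgroup, m.group()

    # list(query_tokens("(cat & dog) | #"))
    # -> [('LPAREN','('), ('TERM','cat'), ('AND','&'), ('TERM','dog'),
    #     ('RPAREN',')'), ('OR','|'), ('UNRECOGNIZED','#')]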

  13. 4. The Cost of Lexical Analysis • Lexical analysis is expensive because it requires examination of every input character, while later stages of automatic indexing and query processing do not. Although no studies of the cost of lexical analysis in information retrieval systems have been done, lexical analysis has been shown to account for as much as 50 percent of the computational expense of compilation (Waite 1986). Thus, it is important for lexical analyzers, particularly for automatic indexing, to be as efficient as possible.

  14. 5. Implementing a Lexical Analyzer • Lexical analysis for information retrieval systems is the same as lexical analysis for other text processing systems; in particular, it is the same as lexical analysis for program translators. This problem has been studied thoroughly, so we ought to adopt the solutions in the program translation literature (Aho, Sethi, and Ullman 1986). There are three ways to implement a lexical analyzer:

  15. Contd.. • Use a lexical analyzer generator, like the UNIX tool lex (Lesk 1975), to generate a lexical analyzer automatically; • Write a lexical analyzer by hand ad hoc; or • Write a lexical analyzer by hand as a finite state machine.

  16. Contd.. • The first approach, using a lexical analyzer generator, is best when the lexical analyzer is complicated; if the lexical analyzer is simple, it is usually easier to implement it by hand. In our discussion of stoplists below, we present a special purpose lexical analyzer generator for automatic indexing that produces efficient lexical analyzers that filter stoplist words. • Consequently, we defer further discussion of this alternative. • The second alternative is the worst. An ad hoc algorithm, written just for the problem at hand in whatever way the programmer can think to do it, is likely to contain subtle errors. Furthermore, finite state machine algorithms are extremely fast, so ad hoc algorithms are likely to be less efficient.

  17. Contd.. • The third approach is the one we present in this section. We assume some knowledge of finite state machines (also called finite automata), and their use in program translation systems. Readers unfamiliar with these topics can consult Hopcroft and Ullman (1979), and Aho, Sethi, and Ullman (1986). • Our example is an implementation of a query lexical analyzer as described above. The easiest way to begin a finite state machine implementation is to draw a transition diagram for the target machine. • A transition diagram for a machine recognizing tokens for our example query lexical analyzer is pictured in Figure 7.1.

  18. Contd.. • [Figure 7.1: transition diagram for the query lexical analyzer.]
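In code, a transition diagram becomes a transition table. Here is a table-driven sketch of a tiny machine in that spirit (the states and character classes are illustrative; Figure 7.1 itself is not reproduced here):

    DELIM, LETTER, OP = 0, 1, 2                      # character classes

    TABLE = {(0, LETTER): 1, (0, DELIM): 0,          # state 0: scanning delimiters
             (1, LETTER): 1, (1, DELIM): 0}          # state 1: inside a term

    def classify(ch):
        if ch.isalpha():
            return LETTER
        if ch in "&|!()":
            return OP
        return DELIM

    def machine_tokens(text):
        state, start = 0, 0
        for i, ch in enumerate(text + " "):          # trailing delimiter flushes the last term
            cls = classify(ch)
            if cls == OP:                            # operators are single-character tokens
                if state == 1:
                    yield text[start:i]
                yield ch
                state = 0
            else:
                nxt = TABLE[(state, cls)]
                if state == 0 and nxt == 1:
                    start = i                        # entering state 1: a term begins
                elif state == 1 and nxt == 0:
                    yield text[start:i]              # leaving state 1: a term ends
                state = nxt

    # list(machine_tokens("cat & dog"))  ->  ['cat', '&', 'dog']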

  19. 6. STOPLISTS • It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of the most frequently occurring words in English (like "the," "of," "and," "to," etc.) are worthless as index terms. A search using one of these terms is likely to retrieve almost every item in a database regardless of its relevance, so their discrimination value is low (Salton and McGill 1983; van Rijsbergen 1975). • Furthermore, these words make up a large fraction of the text of most documents: the ten most frequently occurring words in English typically account for 20 to 30 percent of the tokens in a document (Francis and Kucera 1982).

  20. Contd.. • Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness. • A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary.

  21. Contd.. • One way to improve information retrieval system performance, then, is to eliminate stopwords during automatic indexing. As with lexical analysis, however, it is not clear which words should be included in a stoplist. Traditionally, stoplists are supposed to include the most frequently occurring words. However, some frequently occurring words are too important as index terms. • For example, included among the 200 most frequently occurring words in general literature in English are "time," "war," "home," "life," "water," and "world." On the other hand, specialized databases will contain many words useless as index terms that are not frequent in general English. • For example, a computer literature database probably need not use index terms like "computer," "program," "source," "machine," and "language."

  22. Contd.. • As with lexical analysis in general, stoplist policy will depend on the database and features of the users and the indexing process. Commercial information systems tend to take a very conservative approach, with few stopwords. For example, the ORBIT Search Service uses only eight stopwords, including "and," "an," "by," "from," "of," "the," and "with." Larger stoplists are usually advisable. • An oft-cited example of a stoplist of 250 words appears in van Rijsbergen (1975). Figure 7.5 contains a stoplist of 425 words derived from the Brown corpus (Francis and Kucera 1982) of 1,014,000 words drawn from a broad range of literature in English. • Fox (1990) discusses the derivation of (a slightly shorter version of) this list, which is specially constructed to be used with the lexical analyzer generator described below.

  23. 7. Implementing Stoplists • There are two ways to filter stoplist words from an input token stream: (a) examine lexical analyzer output and remove any stopwords, or (b) remove stopwords as part of lexical analysis. • The first approach, filtering stopwords from lexical analyzer output, makes the stoplist problem into a standard list searching problem: every token must be looked up in the stoplist, and removed from further analysis if found. • The usual solutions to this problem are adequate, including binary search trees, binary search of an array, and hashing (Tremblay and Sorenson 1984, Chapter 13). Undoubtedly the fastest solution is hashing.

  24. Contd.. • When hashing is used to search a stoplist, the list must first be inserted into a hash table. Each token is then hashed into the table. If the resulting location is empty, the token is not a stopword, and is passed on; otherwise, comparisons must be made to determine whether the token really matches the entries at that hash table location. If not, the token is passed on; if so, the token is a stopword, and is eliminated from the token stream. This strategy is fast, but is slowed by the need to re-examine each character in a token to generate its hash value, and by the need to resolve collisions.
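Sketched with Python's built-in set, which is itself a hash table (so the insertion, hashing, and collision resolution described above happen under the hood); the stoplist here is a tiny illustrative one, not the chapter's Figure 7.5 list:

    STOPLIST = {"the", "of", "and", "to", "a", "in"}

    def filter_stopwords(token_stream):
        for token in token_stream:
            if token not in STOPLIST:   # one hash probe per token
                yield token             # non-stopwords pass on to further analysis

    # list(filter_stopwords(["the", "cost", "of", "lexical", "analysis"]))
    # -> ['cost', 'lexical', 'analysis']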

  25. 8. A Lexical Analyzer Generator • The heart of the lexical analyzer generator is its algorithm for producing a finite state machine. The algorithm presented here is based on methods of generating minimum state deterministic finite automata (DFAs) using derivatives of regular expressions (Aho and Ullman 1975), adapted for lists of strings. (A DFA is minimum state if it has as few states as possible.) This algorithm is similar to one described by Aho and Corasick (1975) for string searching.

  26. Contd.. • During machine generation, the algorithm labels each state with the set of strings the machine would accept if that state were the initial state. It is easy to examine these state labels to determine: (a) the transitions out of each state, (b) the target state for each transition, and (c) the states that are final states. • For example, suppose a state is labeled with the set of strings {a, an, and, in, into, to}. This state must have transitions on a, i, and t. The transition on a must go to a state labeled with the set {ε, n, nd} (where ε is the empty string), the transition on i to a state labeled {n, nto}, and the transition on t to a state labeled {o}. • The label on the target state of a transition on symbol a from a state labeled L is called the derivative of L with respect to a. A state is made a final state if and only if its label contains the empty string.
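The labeling scheme can be sketched directly: represent each state by its label (a set of strings), and compute the target of each transition as the derivative of that label (the function names are mine; "" plays the role of the empty string ε):

    def derivative(label, symbol):
        # Strings in the label that begin with symbol, with symbol removed;
        # the string equal to symbol itself contributes "" (i.e., ε).
        return frozenset(s[1:] for s in label if s[:1] == symbol)

    def build_dfa(words):
        start = frozenset(words)
        states, work, delta = {start}, [start], {}
        while work:
            label = work.pop()
            for symbol in {s[0] for s in label if s}:   # one transition per leading symbol
                target = derivative(label, symbol)
                delta[(label, symbol)] = target
                if target not in states:                # identical labels share one state,
                    states.add(target)                  # which keeps the machine small
                    work.append(target)
        finals = {s for s in states if "" in s}         # final iff the label contains ε
        return start, delta, finals

    # build_dfa({"a", "an", "and", "in", "into", "to"}): the start state's
    # transition on "a" leads to the state labeled {"", "n", "nd"}.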

  27. Contd.. Thank You
