1-INTRODUCTION 1.1-DEFINITION

1-INTRODUCTION 1.1-DEFINITION • Parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech). The term has slightly different meanings in different branches of linguistics and computer science.

Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.

Within computational linguistics, the term is used to refer to the formal analysis by computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.

The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing what linguistic cues help speakers to interpret garden-path sentences.

Within computer science, the term is used for the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of the compilers and interpreters.

1.2-HUMAN LANGUAGES 1.2A-TRADITIONAL METHODS • The traditional grammatical exercise of parsing, sometimes known as clause analysis, involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part. This is determined in large part from study of the language's conjugations and declensions, which can be quite intricate for heavily inflected languages.

To parse a phrase such as 'man bites dog' involves noting that the singular noun 'man' is the subject of the sentence, the verb 'bites' is the third person singular of the present tense of the verb 'to bite', and the singular noun 'dog' is the object of the sentence. Techniques such as sentence diagrams are sometimes used to indicate the relation between elements in the sentence.

Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language. However the teaching of such techniques is no longer current.

1.2B-COMPUTATIONAL METHODS • In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. How?

Human sentences are not easily parsed by programs, because there is substantial ambiguity in the structure of human language, which is used to convey meaning (or semantics) from a potentially unlimited range of possibilities, only some of which are germane to the particular case. So an utterance "Man bites dog" versus "Dog bites man" is definite on one detail, but in another language it might appear as "Man dog bites", relying on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behavior even though it is clear that some rules are being followed.

In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However, such systems are vulnerable to overfitting and require some kind of smoothing to be effective.

Parsing algorithms for natural language cannot rely on the grammars having 'nice' properties as with manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. However, some systems trade accuracy for speed using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parser reranking, in which the parser proposes some large number of analyses, and a more complex system selects the best option.
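As a rough illustration of the CKY idea mentioned above, the following Python sketch fills a table of spans bottom-up for a grammar in Chomsky normal form. The toy grammar, the lexicon and the name `cky_recognize` are assumptions made up for this example; it only recognizes sentences, and it omits the pruning heuristics and statistics a practical parser would add.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustration only).
BINARY = {                      # rules of the form A -> B C
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {                     # rules of the form A -> word
    "man": {"NP"},
    "dog": {"NP"},
    "bites": {"V"},
}

def cky_recognize(words, start="S"):
    """Return True if `words` can be derived from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):            # widths of increasing size
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # every split point
                for b, c in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n]
```

With this toy grammar, `cky_recognize("man bites dog".split())` succeeds while `cky_recognize("bites man dog".split())` does not. The three nested loops over spans, start positions and split points are what give the algorithm its cubic running time in the sentence length.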

1.2C-PSYCHOLINGUISTICS • In psycholinguistics, parsing involves not just the assignment of words to categories, but the evaluation of the meaning of a sentence according to the rules of syntax drawn by inferences made from each word in the sentence. This normally occurs as words are being heard or read.

Consequently, psycholinguistic models of parsing are of necessity incremental, meaning that they build up an interpretation as the sentences are being processed, which is normally expressed in terms of a partial syntactic structure. Creation of the wrong structure can lead to the phenomenon known as garden-pathing.

1.3-PROGRAMMING LANGUAGES 1.3A-PARSER • In computing, a parser is one of the components in an interpreter or compiler that checks for correct syntax and builds a data structure (often some kind of parse tree, abstract syntax tree or other hierarchical structure) implicit in the input tokens. The parser often uses a separate lexical analyser to create tokens from the sequence of input characters. Parsers may be programmed by hand or may be (semi-)automatically generated (in some programming languages) by a tool.

The most common use of a parser is as a component of a compiler or interpreter. This parses the source code of a computer programming language to create some form of internal representation. Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. Parsers are written by hand or generated by parser generators.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.
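The two-step strategy described above can be sketched in Python as follows. The tiny language (lines of the form `let NAME` or `use NAME`) and the function names `parse_statements` and `check_declarations` are hypothetical, chosen only to show a relaxed syntactic pass followed by a filter for the declared-before-use rule.

```python
def parse_statements(source):
    """Relaxed pass: accept any sequence of 'let NAME' / 'use NAME' lines."""
    program = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue                      # ignore blank lines
        if len(parts) != 2 or parts[0] not in ("let", "use"):
            raise SyntaxError(f"line {line_no}: expected 'let NAME' or 'use NAME'")
        program.append((parts[0], parts[1], line_no))
    return program

def check_declarations(program):
    """Later pass: report names that are used before they are declared."""
    declared = set()
    errors = []
    for kind, name, line_no in program:
        if kind == "let":
            declared.add(name)
        elif name not in declared:
            errors.append(f"line {line_no}: '{name}' used before declaration")
    return errors
```

Here the first function accepts `use y` even when `y` was never declared, which is exactly the superset behaviour described; the second function filters such programs out afterwards.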

1.3B-OVERVIEW OF PROCESS • The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
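A minimal sketch of such a lexer in Python, using the standard `re` module, might look like the following. The token names `NUMBER` and `OP` and the function name `tokenize` are assumptions for this illustration, not part of any particular calculator program.

```python
import re

# Each token kind is described by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),        # one or more digits
    ("OP", r"[*+^()]"),        # single-character operators and parentheses
    ("SKIP", r"\s+"),          # whitespace is ignored
    ("ERROR", r"."),           # anything else is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split `text` into (kind, value) pairs."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {match.group()!r} at {match.start()}")
        tokens.append((kind, match.group()))
    return tokens
```

Calling `tokenize("12*(3+4)^2")` yields the nine tokens listed above, with `12` kept together as a single `NUMBER` because the digit rule matches as many digits as it can.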

The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.

The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
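To make the three stages concrete, here is a small Python sketch of a calculator that tokenizes, parses by recursive descent and evaluates as it goes. The grammar (with `^` binding tighter than `*`, and `*` tighter than `+`, and `^` grouping to the right) and all names are assumptions for this example; a real calculator may define precedence differently.

```python
import re

def tokenize(text):
    # Numbers or single-character operators. Other characters are silently
    # skipped in this sketch; a real lexer would report them.
    return re.findall(r"\d+|[*+^()]", text)

class Parser:
    """Assumed grammar:
        expr  -> term ('+' term)*
        term  -> power ('*' power)*
        power -> atom ('^' power)?
        atom  -> NUMBER | '(' expr ')'
    """
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() == "+":
            self.take("+")
            value += self.term()
        return value

    def term(self):
        value = self.power()
        while self.peek() == "*":
            self.take("*")
            value *= self.power()
        return value

    def power(self):
        base = self.atom()
        if self.peek() == "^":
            self.take("^")
            return base ** self.power()   # right-associative
        return base

    def atom(self):
        tok = self.take()
        if tok == "(":
            value = self.expr()
            self.take(")")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return value
```

Under these assumptions `evaluate("12*(3+4)^2")` computes 12 * 7^2. A compiler would follow the same structure but emit code in each method instead of computing a value.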

1.4-TYPES OF PARSER • The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules.

Bottom-up parsing: a parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.
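The bottom-up idea of rewriting the input towards the start symbol can be sketched in Python as a naive shift-reduce loop. The toy grammar `E -> E + n | n` and the greedy "reduce whenever a rule matches the top of the stack" policy are simplifying assumptions; a real LR parser consults a generated table to decide between shifting and reducing.

```python
# Hypothetical toy grammar:  E -> E + n | n
RULES = [
    ("E", ["E", "+", "n"]),    # longer right-hand side is tried first
    ("E", ["n"]),
]

def shift_reduce(tokens):
    """Return (accepted, trace) for a list of terminal symbols."""
    stack, trace = [], []
    remaining = list(tokens)
    while True:
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:          # right-hand side on top of stack
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not remaining:
                break                              # nothing to reduce or shift
            stack.append(remaining.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["E"], trace
```

For the input `["n", "+", "n"]` the trace shows the most basic element `n` being reduced to `E` first, and the larger construct `E + n` being reduced only once all of its parts are on the stack.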

LL parsers and recursive-descent parsers are examples of top-down parsers which cannot accommodate left-recursive production rules. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given CFG (context-free grammar).

An important distinction with regard to parsers is whether a parser generates a leftmost derivation or a rightmost derivation. LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).

1.5-EXAMPLES OF PARSERS 1.5A-TOP-DOWN PARSERS • Some of the parsers that use top-down parsing include: recursive descent parser, LL parser, and Earley parser.

1.5B-BOTTOM-UP PARSERS • Some of the parsers that use bottom-up parsing include: precedence parser (operator-precedence parser and simple precedence parser), BC (bounded context) parsing, LR parser, and CYK parser.

1.5C-PARSER DEVELOPMENT SOFTWARE • Some of the well known parser development tools include the following: ANTLR, Bison, Coco/R, GOLD, JavaCC, Lemon, Lex, Parboiled, ParseIt, Ragel, SHProto (FSM parser language), and Spirit Parser Framework.

1.6-LOOKAHEAD • Lookahead establishes the maximum number of incoming tokens that a parser can use to decide which rule it should use. Lookahead is especially relevant to LL, LR, and LALR parsers, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1).

Most programming languages, the primary target of parsers, are carefully defined in such a way that a parser with limited lookahead, typically one, can parse them, because parsers with limited lookahead are often more efficient. One important change to this trend came in 1990 when Terence Parr created ANTLR for his Ph.D. thesis, a parser generator for efficient LL(k) parsers, where k is any fixed value.

Parsers typically have only a few actions after seeing each token. They are shift (add this token to their stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce).

Lookahead has two advantages.
• It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states.
• It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause.
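The if/else case can be illustrated with a small Python sketch in which a single token of lookahead decides whether an `else` clause follows. The statement syntax (`if COND then STMT [else STMT]`), the tuple-shaped tree and the name `parse_stmt` are assumptions for this example only.

```python
KEYWORDS = {"if", "then", "else"}

def parse_stmt(tokens, pos=0):
    """Parse one statement starting at `pos`; return (tree, next_pos)."""
    def peek(p):
        return tokens[p] if p < len(tokens) else None

    if peek(pos) == "if":
        cond = peek(pos + 1)
        if cond is None or cond in KEYWORDS or peek(pos + 2) != "then":
            raise SyntaxError("expected 'if COND then'")
        then_branch, pos = parse_stmt(tokens, pos + 3)
        else_branch = None
        # One token of lookahead: if the next token is 'else', consume it now,
        # so the clause attaches to the nearest unmatched 'if'.
        if peek(pos) == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos

    tok = peek(pos)
    if tok is None or tok in KEYWORDS:
        raise SyntaxError(f"expected a statement, got {tok!r}")
    return ("stmt", tok), pos + 1
```

For `if a then if b then s1 else s2` the inner call sees `else` as its lookahead token and claims it, so the outer `if` ends up with no else branch. Without that peek the parser would have to guess or backtrack.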

To correctly parse without lookahead, there are three solutions:
• The user has to enclose expressions within parentheses. This often is not a viable solution.
• The parser needs to have more logic to backtrack and retry whenever a rule is violated or not complete. A similar method is followed in LL parsers.
• Alternatively, the parser or grammar needs to have more logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. This correctly parses the expression but with many more states and increased stack depth.

2-AUGMENTED TRANSITION NETWORK • An augmented transition network (ATN) is a type of graph-theoretic structure used in the operational definition of formal languages, used especially in parsing relatively complex natural languages, and having wide application in artificial intelligence. The definition of augmented transition network in Oxford Dictionaries is as follows: a type of grammar that represents a sentence as a series of states and possible continuations, augmented with rules for such matters as word agreement.

An ATN can, theoretically, analyze the structure of any sentence, however complicated. ATNs were built on the idea of using finite state machines (Markov model) to parse sentences. W. A. Woods, in his "Transition Network Grammars for Natural Language Analysis", claims that by adding a recursive mechanism to a finite state model, parsing can be achieved much more efficiently. Instead of building an automaton for a particular sentence, a collection of transition graphs is built. A grammatically correct sentence is parsed by reaching a final state in any state graph. Transitions between these graphs are simply subroutine calls from one state to any initial state on any graph in the network. A sentence is determined to be grammatically correct if the final state is reached by the last word in the sentence.
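The recursive mechanism described above can be sketched in Python as a plain recursive transition network, that is, an ATN without the augmentations (no registers or agreement tests). The lexicon, the three networks and the function names are hypothetical and cover only a handful of sentences.

```python
# Hypothetical lexicon and networks, for illustration only.
LEXICON = {"the": "DET", "a": "DET", "man": "N", "dog": "N", "bites": "V"}

# Arcs are (label, next_state). A label that names another network is a
# subroutine call into that network; any other label is a word category.
NETWORKS = {
    "S":  {"start": "s0", "final": {"s2"},
           "arcs": {"s0": [("NP", "s1")], "s1": [("VP", "s2")]}},
    "NP": {"start": "n0", "final": {"n2"},
           "arcs": {"n0": [("DET", "n1"), ("N", "n2")], "n1": [("N", "n2")]}},
    "VP": {"start": "v0", "final": {"v1", "v2"},
           "arcs": {"v0": [("V", "v1")], "v1": [("NP", "v2")]}},
}

def traverse(net_name, words, pos):
    """Yield every input position at which network `net_name` can finish."""
    net = NETWORKS[net_name]

    def walk(state, pos):
        if state in net["final"]:
            yield pos
        for label, nxt in net["arcs"].get(state, []):
            if label in NETWORKS:                    # call a sub-network
                for end in traverse(label, words, pos):
                    yield from walk(nxt, end)
            elif pos < len(words) and LEXICON.get(words[pos]) == label:
                yield from walk(nxt, pos + 1)        # consume one word

    yield from walk(net["start"], pos)

def accepts(sentence):
    """A sentence is accepted if S reaches a final state on the last word."""
    words = sentence.split()
    return any(end == len(words) for end in traverse("S", words, 0))
```

The `NP` network is written once and called from both `S` and `VP`, which is the kind of reuse the subroutine-call transitions make possible. Because the generators explore every arc, alternative analyses are kept open until the rest of the sentence rules them in or out.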

This model meets many of the goals set forth by the nature of language in that it captures the regularities of the language. That is, if there is a process that operates in a number of environments, the grammar should encapsulate the process in a single structure. Such encapsulation not only simplifies the grammar, but has the added bonus of efficiency of operation.

Another advantage of such a model is the ability to postpone decisions. Many grammars use guessing when an ambiguity comes up. This means that not enough is yet known about the sentence. By the use of recursion, ATNs solve this inefficiency by postponing decisions until more is known about a sentence.

3-AUGMENTED PHRASE STRUCTURE GRAMMAR • Augmented phrase structure grammars consist of phrase structure rules with embedded conditions and structure-building actions in a specially developed language. An attribute-value, record-oriented information structure is an integral part of the theory.

4-DEFINITE CLAUSE GRAMMAR • A definite clause grammar (DCG) is a way of expressing grammar, for either natural or formal languages, in a logic programming language such as Prolog. It is closely related to the concept of attribute grammars/affix grammars from which Prolog was originally developed. DCGs are usually associated with Prolog, but similar languages such as Mercury also include DCGs. They are called definite clause grammars because they represent a grammar as a set of definite clauses in first-order logic.

The term DCG refers to the specific type of expression in Prolog and other similar languages; not all ways of expressing grammars using definite clauses are considered DCGs. However, all of the capabilities or properties of DCGs will be the same for any grammar that is represented with definite clauses in essentially the same way as in Prolog.

The definite clauses of a DCG can be considered a set of axioms where the validity of a sentence, and the fact that it has a certain parse tree can be considered theorems that follow from these axioms. This has the advantage of making it so that recognition and parsing of expressions in a language becomes a general matter of proving statements, such as statements in a logic programming language.

Since DCGs were introduced by Pereira and Warren, several extensions have been proposed. Pereira himself proposed an extension called extraposition grammars (XGs). This formalism was intended in part to make it easier to express certain grammatical phenomena, such as left-extraposition. Pereira states, "The difference between XG rules and DCG rules is then that the left-hand side of an XG rule may contain several symbols." This makes it easier to express rules for context-sensitive grammars.

Another, more recent, extension, called Multi-Modal Definite Clause Grammars (MM-DCGs), was made by researchers at NEC Corporation in 1995. Their extensions were intended to allow the recognition and parsing of expressions that include non-textual parts such as pictures. Another extension, called definite clause translation grammars (DCTGs), was described in 1984. DCTG notation looks very similar to DCG notation; the major difference is that one uses ::= instead of --> in the rules. It was designed to handle grammatical attributes conveniently. The translation of DCTGs into normal Prolog clauses is like that of DCGs, but 3 arguments are added instead of 2.

DCGs can serve as a convenient syntactic sugar to hide certain parameters in code in other places besides parsing applications. In the programming language Mercury, which borrows DCG syntax from Prolog, for example, DCGs can be used to hide io__state arguments in I/O code. They are also used in other, similar situations in Mercury.

Definite Clause Grammars (DCGs) are convenient ways to represent grammatical relationships for various parsing applications. They can be used for natural language work and for creating formal command and programming languages. For example, DCG is an excellent tool for parsing and creating tagged input and output streams, such as HTML or XML. The index and table of contents in this documentation are generated by a Prolog program that uses DCG to parse HTML, looking for headings and index entries.

Difference lists are pairs of lists used to represent the list of elements (tokens, words, character codes, …) being parsed. The actual list being represented is the 'difference' between the first list and the second list. Parsing is analyzing an input stream. Difference lists are powerful tools for parsing applications, in which the input stream is represented by difference lists.

In parsing applications, the first list contains the elements to be parsed. Different parsing predicates find what they are looking for at the front of that first list, and unify the second list with what's left to parse. In other words, the difference lists are chained together in the parsing application.

A natural language parsing application tests to see if a sentence is grammatically correct. It does this using difference lists to chain together various grammatical categories. And the 'terminal' case is when there are one or more specific elements to be identified at the head of the first list.
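The chaining of difference lists can be imitated in Python, where each parsing function takes the list still to be parsed and returns what is left over (or `None` on failure). This is only an analogy to the Prolog technique, under the assumption of a tiny made-up vocabulary; unlike Prolog, the `either` helper below commits to the first alternative that succeeds and does not backtrack into it.

```python
NOUNS = {"man", "dog"}
VERBS = {"bites"}
DETERMINERS = {"the", "a"}

def terminal(vocabulary):
    """Terminal case: match one specific kind of word at the head of the list."""
    def parse(tokens):
        if tokens and tokens[0] in vocabulary:
            return tokens[1:]          # the "second list": what is left to parse
        return None
    return parse

def sequence(*parsers):
    """Chain parsers: the remainder left by each is the input of the next."""
    def parse(tokens):
        for parser in parsers:
            tokens = parser(tokens)
            if tokens is None:
                return None
        return tokens
    return parse

def either(*parsers):
    """Try alternatives in order and keep the first one that succeeds."""
    def parse(tokens):
        for parser in parsers:
            rest = parser(tokens)
            if rest is not None:
                return rest
        return None
    return parse

noun = terminal(NOUNS)
verb = terminal(VERBS)
determiner = terminal(DETERMINERS)
noun_phrase = either(sequence(determiner, noun), noun)
verb_phrase = either(sequence(verb, noun_phrase), verb)
sentence = sequence(noun_phrase, verb_phrase)

def is_sentence(text):
    # Grammatical if the whole input is consumed, i.e. nothing is left over.
    return sentence(text.split()) == []
```

In this sketch `sentence` corresponds roughly to a DCG rule such as `sentence --> noun_phrase, verb_phrase`, with the pair (input list, returned remainder) playing the role of the difference list.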
