To perform any computational process on programs written in a language (e.g., compile or execute those programs), we must translate ASCII strings representing program text into higher-level tree representations for the text. This process, called parsing, conveniently breaks into two parts: lexical analysis and context-free parsing (which is often simply called parsing).
Consider this sequence of characters:
begin middle end
As readers of English, we recognize this text as a phrase consisting of three words: "begin", "middle", and "end". These words are the smallest meaningful pieces of syntax in the phrase. The process of taking a stream of characters and converting it into a stream of distinct, meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that converts a character stream to a token stream is called a tokenizer or a lexer. Example: In Scheme, we tokenize
(set! x (+ x 1))
as (, set!, x, (, +, x, 1, ), and ).
The process of tokenizing is straightforward for most languages because it can be performed by a finite automaton. The rules governing this process are (a very boring) part of the language definition.
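As an illustration, here is a minimal sketch of such a tokenizer in Scheme for parenthesized input like the example above. (The helper names tokenize-word and word->token, and the choice to represent the tokens ( and ) by the symbols lparen and rparen, are ours, not part of the original notes.)

(define tokenize
  (lambda (chars)
    (cond ((null? chars) empty)
          ((char-whitespace? (first chars)) (tokenize (rest chars)))
          ((char=? (first chars) #\() (cons 'lparen (tokenize (rest chars))))
          ((char=? (first chars) #\)) (cons 'rparen (tokenize (rest chars))))
          (else (tokenize-word chars empty)))))

;; Accumulate the characters of one word, then convert it to a token.
(define tokenize-word
  (lambda (chars acc)
    (if (or (null? chars)
            (char-whitespace? (first chars))
            (char=? (first chars) #\()
            (char=? (first chars) #\)))
        (cons (word->token (reverse acc)) (tokenize chars))
        (tokenize-word (rest chars) (cons (first chars) acc)))))

;; A word becomes a number token if it reads as a number, and a symbol token otherwise.
(define word->token
  (lambda (chars)
    (let ((str (list->string chars)))
      (or (string->number str) (string->symbol str)))))

For example, (tokenize (string->list "(set! x (+ x 1))")) produces the token list (lparen set! x lparen + x 1 rparen rparen).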
The task of assembling a structural description of a program (typically a tree) from a sequence of tokens is harder. Consider the C statement
x = x + 1 ;
Why is this sequence of tokens a legal C statement?
In general, sequences of legal tokens don't always form legal program phrases, just as in English. The rules that determine which sequences of tokens are legal program phrases are called a (context-free) grammar. The act of applying the rules of a grammar to determine which sequences of tokens are legal, and of classifying these phrases, is called parsing.
In practice, parsing does more than simply accept or reject sequences of tokens. It constructs a tree representing the structure of the program as analyzed by the rules of the grammar.

Let PI (for Potential Inputs) denote the set of all possible sequences of tokens generated by the lexer for a trivial programming language. Let PI be defined by the following equation:
PI ::= empty | T PI
where T is defined by:
T ::= Num | +
The set of legal inputs AE is defined by the equation:
AE ::= Num | Num + AE
For example, the input
1 2
is a PI but not an AE. A parser is responsible for determining this fact. The following program performs this task, assuming that its argument a-pi is a Scheme list containing the sequence of tokens in the given input:
(define AE?
  (lambda (a-pi)
    (cond ((null? a-pi) false)
          ((number? (first a-pi))
           (or (null? (rest a-pi))
               (and (eq? (first (rest a-pi)) '+)
                    (AE? (rest (rest a-pi))))))
          (else false))))
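For example, applied to the token lists for the inputs 1 + 2, 1 2, and 1 +, this predicate behaves as follows:

(AE? '(1 + 2))   ;; => true
(AE? '(1 2))     ;; => false
(AE? '(1 +))     ;; => false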
Consider the following syntaxes for assignment:
x = x + 1
x := x + 1
(set! x (+ x 1))
Which of these is best?
The answer is that the choice of surface syntax (also called concrete syntax) doesn't matter very much. As computer scientists, we are really interested in the abstract hierarchical structure of the preceding phrases.
To eliminate the irrelevant syntactic details, we can represent programs as trees (or equivalently, words in a free term algebra). Such a tree representation is called an abstract syntax for the language. Example: the abstract syntax for the assignment code given above could be
(make-assignment <Representation of "x"> <Representation of "(x + 1)">)
in Scheme notation or, equivalently,
make-assignment(<Representation of "x">, <Representation of "x + 1">)
in algebraic notation.
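For instance, the constructor make-assignment could be introduced in Scheme by a structure definition along the lines of the following sketch (the structure and field names here are illustrative, not fixed by the notes):

(define-structure (assignment lhs rhs))  ;; lhs: the assigned variable, rhs: the right-hand-side expression

so that make-assignment builds a two-field tree node from a representation of the assigned variable and a representation of the right-hand-side expression.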
Abstract syntax typically eliminates ambiguities in the meaning of tokens with more than one meaning. For example, a language might use the symbol - to denote both a unary operation and a binary operation. In this case, the abstract syntax could represent
-1     as (make-apply (make-unary-op '-) 1)
1 - 2  as (make-apply (make-binary-op '-) 1 2)
Abstract syntax is a much more powerful program representation than surface syntax. Why?
Let us consider a concrete example of parsing a surface syntax into abstract syntax. Assume the following definitions for potential inputs (PI) and legal program expressions (E):
PI ::= empty | T PI
T ::= Num | Sym | ( | )
E ::= Num | Var | (E E) | (lambda Var E)
where Sym is the set of Scheme symbols and Var is the set Sym less the elements lambda, (, and ). Names that are specified explicitly (like lambda) are called keywords.
We will use the following Scheme data definitions to specify an abstract representation for program syntax:
;; Exp := (make-num number) + (make-var symbol) + (make-app Exp Exp) +
;;        (make-proc symbol Exp)
(define-structure (num n))
(define-structure (var s))
(define-structure (app rator rand))
(define-structure (proc param body))  ;; param is a symbol not a var!
where Exp, num, var, app, and proc correspond to E, Num, Var, (E E), and (lambda Var E), respectively, in the concrete syntax. Since the param field in a proc must be a variable, we represent it by a Scheme symbol rather than a var structure.
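For example, the concrete program

(lambda x (x 1))

is represented by the abstract syntax tree built by

(make-proc 'x (make-app (make-var 'x) (make-num 1)))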
A simple strategy for parsing a sequence of input tokens into abstract syntax is to determine how the string is generated from the inductive definition by matching the clauses of the definition against the input stream, proceeding left-to-right and using K symbols of lookahead to determine which choice is made at each point in the generation process. This strategy is called "recursive descent" or LL(K) parsing. Unfortunately, this strategy fails miserably if the inductive definition has left recursive clauses such as
E ::= E + Num

because it will loop forever. In addition, some inductive definitions of syntax will not accommodate this approach because K symbol lookahead is not sufficient to determine which clause of the definition applies.
The notes define a Scheme function parse that uses recursive descent to parse the input stream into a program expression E. In this formulation, the input stream is represented by a procedure stream of no arguments that returns the next token and advances the cursor in the token stream.
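Here is a minimal sketch of such a parser for the grammar above, with K = 1 (it is not the exact code from the notes). It assumes the tokens ( and ) are represented by the symbols lparen and rparen, and the helper names parse-exp, parse-compound, expect, and list->stream are ours:

;; parse consumes one token from the stream and dispatches on it.
(define parse
  (lambda (stream)
    (parse-exp (stream) stream)))

;; parse-exp builds the E that begins with the already-consumed token.
(define parse-exp
  (lambda (token stream)
    (cond ((number? token) (make-num token))
          ((eq? token 'lparen) (parse-compound (stream) stream))
          ((and (symbol? token) (not (eq? token 'lambda)) (not (eq? token 'rparen)))
           (make-var token))
          (else (error "parse: unexpected token" token)))))

;; parse-compound handles the two parenthesized forms: (lambda Var E) and (E E).
(define parse-compound
  (lambda (token stream)
    (if (eq? token 'lambda)
        (let* ((param (stream))
               (body (parse stream)))
          (expect stream 'rparen)
          (make-proc param body))
        (let* ((rator (parse-exp token stream))
               (rand (parse stream)))
          (expect stream 'rparen)
          (make-app rator rand)))))

;; expect consumes one token and checks that it is the one required by the grammar.
(define expect
  (lambda (stream wanted)
    (if (eq? (stream) wanted)
        wanted
        (error "parse: expected" wanted))))

;; list->stream converts a token list into a stream procedure of the kind described above.
;; (Running off the end of the token list is not handled in this sketch.)
(define list->stream
  (lambda (tokens)
    (lambda ()
      (let ((token (first tokens)))
        (set! tokens (rest tokens))
        token))))

For example,

(parse (list->stream '(lparen lambda x lparen x 1 rparen rparen)))

produces the abstract syntax tree (make-proc 'x (make-app (make-var 'x) (make-num 1))) for the concrete program (lambda x (x 1)).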
We can now summarize our view of syntax: