Programming Assignment 1:
Parsing and Abstract Syntax

 

 

 

Assigned: Wednesday, September 5, 2012
Due: 12:00 noon, Monday, September 17, 2012
Points: 100

Files

Overview

Preparation   We suggest that you immediately download and install the latest version of the Java 6 JDK and the latest version of DrScala or DrJava on your preferred computing platform. On Windows and Linux, the best Java 6 JDK is provided by Sun/Oracle. On Macs, the Java 6 JDK is part of the supported system.

The latest version of the Sun Java 6 SE JDK systems (for platforms other than Macs) can be obtained at http://java.sun.com/javase/downloads/index.jsp. We recommend downloading the "SE" JDK, not the JDK with NetBeans (which greatly increases the size of the download). The latest version of DrScala can be obtained from The DrJavaRice downloads page. We will try to maintain the latest versions of the Java 6 JDK and DrJava on CLEAR, but CLEAR is under the control of Rice IT, which limits our ability to configure what software is available. You will need to use CLEAR to submit your programs, but we otherwise recommend using DrJava/DrScala on your personal computer. On CLEAR, we are providing a turnin411 script, which submits your program to our repository. Unfortunately, the IT submission facilities will not permit us to run a script testing that submitted programs pass the trivial unit tests in Assign1Test.java (including in our supporting library files.

On CLEAR, you may need to use the path ~comp411/bin/turnin411 to access the turnin command.

Please create a comp411 directory in your home directory on both your development machine and on CLEAR. Each programming team should submit their solution under the userid one team member. For this assignment, please create a programs/1 directory within the directory comp411. All of your files for this assignment should be stored in the programs/1 subdirectory.

Documentation for DrJava is available at drjava.org. The DrScala user interface is derived from DrJava, so the DrJava documentation should provide reasonable guidance on how to use DrScala. At this point, DrScala does not include language levels or a source level debugger. If you have further questions about DrScala or DrJava, send email to javaplt@rice.edu

All of your files for this assignment should be stored in the programs/1 subdirectory.

Teamwork   You are strongly encouraged to do this assignment and all other programming assignments using pair programming. When you submit your assignment, indicate the composition of your team (names, student ids, and email addresses) in your README file.

If you cannot find a partner and would like one, send an email to comp411@rice.edu and we will try to find one for you. Teams of more than two members are not permitted.

Testing   You are expected to write unit tests for all of your non-trivial methods or functions. DrScala and DrJava provide integrated support for JUnit which makes this task easy.

Task   Your task is to write a parser in Scala or generic Java (Java 6) that recognizes expressions in the pedagogic functionallanguage Jam, which will be defined subsequently. The course staff has provided lexers in Scala and Java for this assignment. Some lines in this lexer have been "commented out" because they support symbols that are not part of the Jam definition for this assignment. These extra symbols correspond to extensions to Jam made in subsequent assignments.

Java Support Library   In our Scala and Java libraries, the parser's input token stream is provided by a Lexer object supporting the parameterless method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token trait/interface. The Lexer class supports two constructors: a zero-ary constructor that converts standard input to a Lexer and a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file lexer.scala (lexer.java) contains the definition of the Lexer class and the collection of classes representing tokens (the Scala trait (Java interface) Token) and abstract syntax trees (the scala trait (java interface) AST).

All classes should be in the default (top-level) package.

Your parser should be expressed as a class Parser with constructors Parser(Reader stream) and Parser(String filename) which create parsers reading the specified stream or file. The Reader form of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods.

The Parser class should contain a method AST parse() (explicitly public in Java) that returns the abstract syntax tree for the Jam expression in the input stream. Our grading program will print the output using the method println (System.out.println in Java), which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file lexer.scala (lexer.java). This library files are available on the web in the directory http://www.cs.rice.edu/~javaplt/411/Assignments/1/scala. (http://www.cs.rice.edu/~javaplt/411/Assignments/1/java.)

If your parser encounters an erroneous input, simply print an explanation of the syntax error and abort parsing the file by throwing a ParseException. Recovering from a syntax error to continue parsing is a complex problem with many special cases. It is not a realistic goal for this assignment.

Input Language Specification

The following paragraphs define the parser's input language and the subset that your parser should recognize as legal Jam programs.

The Meta Language The parser's input language is described using Backus-Naur Form (BNF). When reading the notation, keep the following in mind:

The Input Language   The parser's input language is Input where:

Input ::= Token*
Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | Int

The lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.

Adjacent tokens must be separated by whitespace (a non-empty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.

Lower             ::= a | b | c | d | ... | z
Upper             ::= "A" | "B" | "C" | "D" | ... | "Z"
Other             ::= ? | _
Digit             ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Alpha             ::= Upper | Lower
AlphaOther        ::= Alpha | Other
AlphaOtherNumeric ::= AlphaOther | Digit
Delimiter         ::= ( | ) | [ | ] | , | ;
Operator          ::= "+" | - | ~ | "*" | / | = | != | < | > | <= | >= | & | "|" | :=

Jam   The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions Exp defined by the following grammar:

Expressions

Exp         ::= Term { Binop Exp }
              | if Exp then Exp else Exp
              | let Def+ in Exp
              | map IdList to Exp
Term        ::= Unop Term
              | Factor { ( ExpList ) }
              | Empty
              | Int
              | Bool
Factor      ::= ( Exp ) | Prim | Id
ExpList     ::= { PropExpList }
PropExpList ::= Exp | Exp , PropExpList
IdList      ::= { PropIdList }
PropIdList  ::= Id | Id , PropIdList

Definitions

Def ::= Id := Exp ;

Primitive Constants, Operators, and Operations

Empty  ::= empty
Bool  ::= true | false
Unop  ::= Sign | ~
Sign  ::= "+" | -
Binop ::= Sign | "*" | / | = | != | < | > | <= | >= | & | "|"
Prim  ::= number? | function? | list? | empty? | cons? | cons | first | rest | arity

Identifiers

Id ::= AlphaOther {AlphaOther | Digit}*

Please note that Id does not contain anything that matches Prim or the keywords if, then, else, map, to, let, in, empty, true, and false.

Numbers

Int ::= Digit+

The preceding grammar requires one symbol lookahead in a few situations. The Scala (Java) lexer in our support code includes a method Token peek() that behaves exactly like the method Token readToken() except for the fact that it leaves the the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream).

The Scala file lexer.scala (Java file lexer.java) defines all of the abstract syntax classes required to represent Jam programs using the composite design pattern. There is one constructor for each fundamentally different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes.

For example, suppose the Jam program under consideration is

f(x) + (x * 12))

The abstract syntax representation for this program phrase is:

new BinOpApp(
       new Op("+"),
       new App(new Variable("f"), new AST[] { new Variable("x") }),
       new BinOpApp(new Op("*"), new Variable("x"), new IntConstant(12))
     )

This construction does not tell the whole story. The lexer is guaranteed to always return the same object for a given operator. Similarly, there is only copy of each variable encountered by the lexer. Hence, there is only one copy of the Operator "+", one copy of the Operator "*", one copy of the Variable "x", etc. Note that the Operators "+" and "-" can be used either as unary or binary operators.

Testing and Submitting Your Program

The directory www.cs.rice.edu/~javaplt/411/Assignments/1/tests contains a sample input programs.

Create a README file in the your directory program/1 that

Make sure that your README file is structured correctly:

  1. It has to be named README (all upper case, no .txt or anything)
  2. It has to be in the assignment's "root directory", e.g. ~/comp411/programs/1
  3. The first two lines should only contain your usernames and nothing else. The third line should be blank (if you're working alone, leave lines 2 and 3 blank), e.g.:
    mgricken
    dmp

    This is where the readme text starts.

Your test data files should be stored in the programs/1 directory.

Each procedure or method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, give grammar rules describing that form.

To submit your program, make sure that everything that you want to submit is located in the directory programs/1 and type the command

turnin411 1

from within the comp411/programs directory.

The command will inform you whether or not your submission succeeded. Only submit one copy of your program per team. If you need to resubmit an improved version your program, submit it from the same account as the original so that the old version is replaced.

Note if turnin411 is not on your path, you will need to add /clear/courses/comp411/bin to your path:

export PATH=$PATH:/clear/courses/comp411/bin

Implementation Hints

Use an "unparser" to print a concrete representation for an abstract syntax tree. Then you can directly compare test input strings and output strings (up to differences in whitespace and parentheses using for grouping).

Extra Credit (5 points)

The preceding grammar for Jam has an annoying flaw. To make the language easy to parse using recursive descent (top-down) parsing, we used right recursion in the definition of arithmetic expressions. (Left recursion forces infinite looping.) But right recursion yields right-associative ASTs (where computation must be performed right to left), while the usual mathematical convention for arithmetic expression is for binary operators to be left-associative (where computation must be performed left to right). There is a widely used trick in writing grammars for top-down parsing that avoids this problem. The trick is replace right recursion by iteration. In our Jam grammar, the right recursive rule (production):

Exp         ::= Term { Binop Exp }
can be written using iteration instead of recursion as follows:
Exp         ::= Term { Binop Term }*
Then the code for recogizing program text conforming to this rule can use a help function with an accumulator or an explicit loop to build a left associative AST for such an Exp.

For extra credit, revise your parser to produce left-associative ASTs for arithmetic expressions.

Back to course website