Programming
Assignment 1: |
|
|
|
Preparation
Please create a comp411 directory in your home directory on both your development machine and on CLEAR (assuming that you run the turnin411 script as recommended below). Each programming team should submit their solution under the userid of one team member. For this assignment, please create a programs/1 directory within the directory comp411. All of your files for this assignment should be stored in the programs/1 subdirectory.
If you are using a Windows, Mac OS X, or Linux machine, we suggest that you download and install the latest version of the Amazon Corretto 8 distribution of the Java 8 OpenJDK. We also recommend installing the latest version of DrJava in ".jar" from http://www.cs.rice.edu/~javaplt/drjavarice. If you already have a Java 8 OpenJDK installed in Linux or Mac OS X, you can use that disrtribution of the OpenJDK as well. DrJava is not as comprehensive as professional IDEs so you are welcome to use a professional IDE like IntelliJ or Eclipse if you prefer. Beware of the fact that the assignments you submit must run from the command line and cannot use named packages.
Our scripts can handle your embedding the directory containing your program source (.java) files and README file someplace in your programs/1 file tree. For example, your .java source files may be stored in folder with path programs/1/src. DrJava assumes the same file organization as java command line tools (like java and javac); professional IDEs typically do not. We do not recommend using later versions (9-13) of Java which completely re-organize the JDK; we will grade all assignments using Java 8. The DrJava IDE does not run on Java 9+. Moreover, Java 8 is the version of Java installed on CLEAR. Nothing in this course specifically depends on Java 8. A Java 7 JDK will suffice, but the performance of Java 7 is not quite as good as Java 8. The DrJava interactions pane does not yet support the language extensions added in Java 8. But you can still use these extensions in any source files that you compile because the Java 8 compiler translates all Java 8 source code to standard Java byte code.
We will maintain recent versions of the Amazon Corretto Java 8 OpenJDK distribution of Java and DrJava on CLEAR, but CLEAR is under the control of the Rice OIT, which limits our ability to configure what software is available. We recommend using CLEAR only to run our turnin411 script. You can probably do everything else in the course more easily using your own computer. On CLEAR, we are providing a turnin411 script, which compiles your program and superficially tests it using the same scripts that we will use to grade it. It also checks the format of your README file which is how we determine the identity of the other team member when a project is done jointly by two students (since we only ask for one submission per team). The Canvas submission facilities will not permit us to run a script to check that your program compiles and passes some trivial unit tests (essentially Assign1Test.java). As a result, we are still providing our own turnin411 script, which will perform the specified checks even though it no longer uploads your program. You must compress the contents of the file tree rooted at programs/1 (as described below) into a .zip file that you upload to Canvas.
See below for detailed instructions on how to use turnin411. On CLEAR, you may need to use the path ~comp411/bin/turnin411 to access the turnin411 command.
DrJava is distributed as a Java .jar file at the web site http://drjava.org as well our local site but the local site typically has more recent builds. You can run DrJava on a Linux or Mac system (with Java JDK 7/8 installed) by typing
java -jar drjava.jar
assuming that drjava.jar is located in the current directory. On Windows machines (with a properly installed Java JDK 7/8), you can run the drjava.jar file simply by double-clicking on the icon. If the Registry has the wrong file type associated the .jar files, then double-clicking .jar icons will not work. In this case, we recommend running DrJava insider a Windows Command Prompt or the Power shell using he same command as shown above for Linux.
Documentation for DrJava is available at drjava.org. If you have further questions about DrJava, ask your question on Piazza.
All of your files for this assignment should be stored in the programs/1 subdirectory.
Teamwork You are encouraged to do this assignment and all other programming assignments using pair programming. When you submit your assignment, indicate the composition of your team (names, student ids, and email addresses) in your README file as stipulated in detail below.
If you cannot find a partner and would like one, post a message to our Piazza page and we will try to find one for you. Teams of more than two members are not permitted.
Testing You are expected to write unit tests for all of your non-trivial methods or functions and submit your test code as part of your program. DrJava provides integrated support for JUnit which makes this task easy.
Task Your task is to write a parser in Generic Java (Java 7/8) that recognizes expressions in the pedagogic functional language Jam, which will be defined subsequently. The course staff has provided a lexer in Java for this assignment. This lexer returns a larger set of symbols than what appears in the Jam definition. These extra symbols anticipate future extensions to Jam. At this point, your parser should simply reject any program containing one of these extra symbols as ill-formed, just like it would for any other syntax error.
Java Support Library In our Java library, the parser's input token stream is provided by a Lexer object supporting the parameterless method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token interface. The Lexer class supports two constructors: a zero-ary constructor that converts standard input to a Lexer and a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file lexer.java contains the definition of the Lexer class and the collection of classes representing tokens (the Java interface Token) and abstract syntax trees (the java interface AST).
All classes must be in the default (top-level) package. Using named packages will break our compilation and grading scripts, causing a catastrophic loss of points on the assignment.
Your parser should be expressed as a class Parser with constructors Parser(Reader stream) and Parser(String filename) which create parsers reading the specified stream or file. The Reader form of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods.
The Parser class should contain a public method AST parse() that returns the abstract syntax tree for the Jam expression in the input stream. Our grading program will print the output using the method System.out.println, which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file lexer.java. This library files are available on the web in the directory ../../Assignments/1/java.
If your parser encounters an erroneous input (including illegal
tokens), simply print an explanation of the syntax error and abort
parsing the file by throwing a ParseException. Recovering
from a syntax error to continue parsing is a complex problem with many
special cases. It is not a realistic goal for this assignment.
Input Language Specification
The following paragraphs define the parser's input language and the subset that your parser should recognize as legal Jam programs.
The Meta Language The parser's input language is described using Extended Backus-Naur Form (EBNF). When reading the notation, keep the following in mind:
The Input Language The parser's input language is Input where:
Input ::= Token* Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | Int
The lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.
Adjacent tokens must be separated by whitespace (a non-empty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.
Lower ::= a | b | c | d | ... | z Upper ::= 'A' | 'B' | 'C' | 'D' | ... | 'Z' Other ::= ? | _ Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Alpha ::= Upper | Lower AlphaOther ::= Alpha | Other AlphaOtherNumeric ::= AlphaOther | Digit Delimiter ::= ( | ) | [ | ] | , | ; Operator ::= '+' | - | ~ | '*' | / | = | != | < | > | <= | >= | & | '|' | :=
Note: The terminal symbols 'A', 'B', ... are enclosed in single quotation marks because A, B, ..., would automatically be the names of non-terminal symbols in our notation system. For essentially the same reason, we enclose the symbols '+', '*', and '|' in single quotation marks; these symbols without quotation marks are metasymbols in the notation scheme described above.
Jam The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions Exp defined by the following grammar:
Expressions
Exp ::= Term { Binop Exp } | if Exp then Exp else Exp | let Def+ in Exp | map IdList to Exp Term ::= Unop Term | Factor { ( ExpList ) } | Empty | Int | Bool Factor ::= ( Exp ) | Prim | Id ExpList ::= { PropExpList } PropExpList ::= Exp { , Exp }* IdList ::= { PropIdList } PropIdList ::= Id { , Id}*
Definitions
Def ::= Id := Exp ;
Primitive Constants, Operators, and Operations
Empty ::= empty Bool ::= true | false Unop ::= Sign | ~ Sign ::= '+' | - Binop ::= Sign | '*' | / | = | != | < | > | <= | >= | & | '|' Prim ::= number? | function? | list? | empty? | cons? | cons | first | rest | arity
Identifiers
Id ::= AlphaOther {AlphaOther | Digit}*
Please note that Id does not contain the anything that matches Prim or the keywords if, then, else, map, to, let, in, empty, true, and false.
Numbers
Int ::= Digit+
The preceding grammar requires one symbol lookahead in a few situations. The Java lexer in our support code includes a method Token peek() that behaves exactly like the method Token readToken() except for the fact that it leaves the the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream).
The Java file lexer.java defines all of the abstract syntax classes required to represent Jam programs using the composite design pattern. There is one constructor for each fundamentally different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes.
For example, suppose the Jam program under consideration is
f(x) + (x * 12))
The abstract syntax representation for this program phrase is:
new BinOpApp( new Op("+"), new App(new Variable("f"), new AST[] { new Variable("x") }), new BinOpApp(new Op("*"), new Variable("x"), new IntConstant(12)) )
This construction does not tell the whole story. The lexer is guaranteed to always return the same object for a given operator. Similarly, there is only copy of each variable encountered by the lexer. Hence, there is only one copy of the Operator "+", one copy of the Operator "*", one copy of the Variable "x", etc. As a result, the == operator in Java can be used to compare operators and variables. Note that the Operators "+" and "-" can be used either as unary or binary operators.
The preceding grammar for Jam has an annoying flaw. To make the language easy to parse using recursive descent (top-down) parsing, we used right recursion in the definition of arithmetic expressions (Exp). (Left recursion forces infinite looping.) But right recursion yields right-associative ASTs (where computation must be performed right to left), while the usual mathematical convention in arithmetic expressions is that binary operators are left-associative (where computation must be performed left to right). There is a widely-used trick in writing grammars for top-down parsing using syntax diagrams that avoids this problem. The trick is replace right recursion in a simple production (only one alternative on the right hand side) by iteration: an edge looping back to the input edge of the diagram. We used this trick in generating the syntax diagrams for ExpList and IdList for our original grammar. Given our original Jam grammar, we can convert the production
Exp ::= Term { Binop Exp } | ...to
Exp ::= BinaryExp | ... BinaryExp ::= Term { Binop BinaryExp }at the cost of excluding chains of binary operations terminating in an expression other than a term. Then this revised grammar can be represented using syntax diagrams that define BinaryExp iteratively (just like PropExpList in the original grammar).
The preceding definition of BinaryExp generates the same set of strings as the extended rule:
BinaryExp ::= Term { BinOp Term }*which is ambiguous about left or right associativity. When iteration is used to recognize a BinaryExp, the parsing process will construct a left-associative tree since the tree is constructed left-to-right (each new Binop adds a new root to the AST).
The file SyntaxDiagrams contains iterative syntax diagrams corresponding to the left associative version of the language, which precisely describes the syntax of Jam and how we want the parser to behave. Of course, this language eliminates a few quirky inputs from the language based on our original grammmar. Your parser should treat these quirky inputs as erroneous.
It is possible to write syntax diagrams that treat binary expressions iteratively and still allow arbitrary expressions at the end of the binary expression. But the corresponding grammar and syntax diagrams are much more complex. So we have not followed this design choice in creating the revised syntax for Jam. In fact, we view our Jam grammar as a case study showing that context-free grammars are not quite the right formalism for expressing easily parsed program syntax. Syntax diagrams where iteration is always left-associative (equivalent to extended context-free-grammars where iteration always expands into left-associative parse trees) are much better. I am not aware of a rigorous exposition of such extended BNF grammars in the literature; I suspect that most theoretical computer scientists would not find such an exposition very interesting even though the standard notion of parsing language specified by a context-free grammar is an imperfect formalization of practical parsing.
Write your parser to produce left-associative ASTs for arithmetic expressions. If your parser implements either of the "revised" syntax diagram definitions given above (which are equivalent) using while loops for iteration and constructs the corresponding abstract syntax trees in a functional style (no mutation of trees in theconstruction process), your parser will naturally build syntax trees that are left-associative.
The directory tests/ contains a sample input programs.
Create a README file in the your directory program/1 that
We have provided a sample README for your reference. Make sure that your README file is structured correctly:
mgricken dmp This is where the readme text starts. |
We do not recommend using separate test data files, but if you do, your test data files must be stored in the programs/1 directory. All JUnit test classes submitted for grading should be compatible with a generic solution that conforms to the public interface defined in the assignment. In other words, we should be able to compile and run your unit tests against a reference solution used for grading, which has the same public interface. Your test classes should be defined in files matching the pattern "*Assign#Test.java", where "#" is the the current assignment number. For example, "MyAssign2Test.java" would be a valid name for Assignment 2 unit tests. If you want to include other tests that you do not want graded (e.g., to test your private interfaces), then simply use a different suffix that "Test" for the test class names (e.g., "Assign1Test#.java"). If you have additional utility classes that are required dependences of your unit test classes, please name those classes using the pattern "*Assign#TestSupport.java", which signals that the class is required to support your unit tests, but is not a unit test class that should be independently launched. You may also include test support files with names matching the pattern "*AssignAllTestSupport.java", which can contain common test support code that can be reused across multiple assignment subissions without needing to rename the file.
Each procedure or method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, give grammar rules describing that form.
To submit your program, make sure that everything that you want to submit is located in the directory programs/1 and type the command
~comp411/bin/turnin411 1
from within the comp411/programs directory.
The command will inform you whether or not your submission succeeded. Only submit one copy of your program per team. If you need to resubmit an improved version your program, submit it from the same account as the original so that the old version is replaced.
The toString() methods for each AST class effectively provide an unparser for ASts. You can directly compare test input strings and the output AST by unparsing the AST (using toString()). The two strings should match up to differences in whitespace, new lines, and parentheses using for grouping.