datatype
Types are the thought-police of the programming language world.
In the first part of this course, we discussed how to model the traditional class of language facilities with interpreters. In the second part, we are studying how to understand typed languages. Remember that types impose syntactic restrictions on programmers to prevent simple mistakes. Typically these mistakes re-inforce the abstraction boundaries a programmer thinks about as he programs, e.g., that some variable ranges over integer values and another over integer functions. In practice, we will restrict programmers by making certain program phrases illegal, and not accepting such phrases.
Typed languages require two extensions:
Consider the Scheme procedure assq
, which takes two
inputs, x
of type sym
and list
of type (sym value) list
. It returns a value
(x,v)
if x
is associated with v
in l
, or #f
otherwise. We cannot write
assq
in TLC; the essence of the problem can be described
in the following expression:
(if P #t 0)where
P
is some expression whose type is
bool
. This expression cannot be typed in TLC because
both branches of a conditional expression in TLC have to have the same
type. Likewise, we cannot use lists with assq
unless all
the elements have the same type. Both restrictions are severe
considering the flexibility of Scheme's assq
.
We could avoid these problems if we added a new kind of type constructor to our type language:
t :== ... | (t + t)which is known as a union type. A type
(t1 +
t2)
indicates that the corresponding expressions have either
type t1
or type t2
.
Now consider the following program, which sums up the leaves of a binary tree:
(define Sigma (lambda (tea : tree) (if (is-a-leaf? tea) tea (+ (Sigma (left tea)) (Sigma (right tea))))))Can this program be typed in TLC?
Clearly, tea
is of type tree
; but we want to
view it as a number or as a combination of two trees. In Scheme, we
would write a data description that looked like this:
A Binary Tree is either a number, or a pair of Binary Trees.But with the current type system, we cannot write down such a description for arbitrarily large binary trees; we can only construct trees of fixed, finite depth.
What we really need is the ability to make a component of a type refer to a part of itself, ie, a recursive type. In C, we would solve this problem by using pointers, but this solution is extremely low-level. Pointers are to data structures what labels and goto's are for program structures: they are all low-level methods that make it easy for the programmer to make mistakes. Indeed, pointers are typically worse than goto's: few languages provide ways to jump to computed labels, but many languages, especially C, provide ways to dereference computed pointers.
A datatype
declaration has the following format:
(datatype tid ((constr-1 type-11 type-12 ...) ... (constr-n type-n1 type-n2 ...)) exp)This declaration creates a new type whose name is
tid
.
The type can have several forms, called variants, which are
described by the entries in the declaration. Each variant has a
constructor, whose name is the first field of that variant's
entry. Each n
-place constructor is followed in the entry
by n
types, which specify the types of the arguments
allowed for the constructor. The datatype
declaration
provides a means for declaring union types; most crucially, the scope
of tid
is the entire datatype
declaration,
so recursive types can also be created in this manner.
For example, consider the declaration of the type of binary trees, of the sort we summed above:
(datatype BT ((leaf int) (node BT BT)) exp)However, if we try writing code that uses this type declaration, we soon find that we are missing some key items, such as a means for telling the different variants apart (predicates) and means for getting at the individual fields in a component (selectors). Hence, we extend the syntax of a
datatype
declaration to include these:
(datatype tid ((constr-1 pred-1 (type-11 sel-11) (type-12 sel-12) ...) ... (constr-n pred-n (type-n1 sel-n1) (type-n2 sel-n2) ...)) exp)
datatype
combines union types with recursive types.
After all, the latter only make sense with the former, and the former
can easily be introduced with datatype
. For example, one
might write the return type for assq
as
(datatype assoc-type ((SOME int) (NONE)) ...)
datatype
declaration, and then be explicitly typed in
terms of that declaration. What would this datatype
look
like?
datatype
We have thusfar appealed to intuition to understand the typing
properties of datatype
. This intuitive understanding is
not sufficient to answer questions such as, ``How many types are
created if we place a datatype
statement inside a loop?''
Hence, we shall construct a typing rule for datatype
.
Clearly, the type of the (datatype ...)
expression should
be that of the result expression contained therein. So we can write
... tenv' |- exp : t ------------------------------ tenv |- (datatype ... exp) : twhere
tenv'
is possibly an augmented version of
tenv
. However, it would be quite useless if the type
environments tenv
and tenv'
were the same,
since the declarations would then have had no effect at all. In fact,
the declarations augment the type environment in the expected manner.
Each of the constructors, selectors and predicates is typed --
constr-1 : type-11 ... type-1m -> tid pred-1 : tid -> bool sel-11 : tid -> type-11 ...-- and these types are then incorporated into
tenv'
,
shadowing type bindings in tenv
as appropriate:
tenv + [constr-1 : type-11 ... type-1m -> tid, pred-1 : tid -> bool, sel-11 : tid -> type-11, constr-2 : type-21 ... type-2k -> tid, ... ] |- exp : t ---------------------------------------------- tenv |- (datatype ... exp) : t
This seems sufficient: we augment the type environment with the types of the procedures created by the datatype declaration, and then, using their types, type the contained expression, returning its type as that of the overall expression. What is wrong with this judgement?
One reasonable question is, ``What if the value returned were of type
tid
?'' It would not be a very meaningful value, since
tid
's extent is the lexical (datatype ...)
body; not only would there be no way of inspecting or using the
returned value, it would not even have a meaningful type. Thus, we
should prevent against this possibility. (Incidentally, the
naïve approach is the one taken by ML.)
We therefore impose the restriction that tid
cannot be
free in the type of the expression portion of the
datatype
declaration. However, this is unfortunate,
since there are, in fact, situations where it is meaningful to return
an object of type tid
along with some procedures that
operate on such an object. This cannot be done with our typing
restrictions. The types in such a program would, per force, have to
be anticipated, and the declaration would have to be moved into a
sufficiently global scope such that the creation and all uses of
objects of that type are encompassed.
We are still not done! (This example should illustrate the extremely subtle problem of designing type systems.) Here is a ``poem'' that illustrates one remaining problem:
(datatype A ((cons-1 (int sel-1))) (datatype _ ((cons-2 (bool sel-2))) (if (sel-2 (cons-1 5)) ... ...)))Depending on what
_
is, we may or may not notice the
problem. We do not if it were, say, B
; but we definitely
do if it is A
.