Section 3: The Scope of Variables

 

 

 

The Scope of Variables

Almost every programming language has:

  1. constructs for introducing a new variable x, and
  2. scoping rules for determining which uses of the variable name x in the rest of the program refer to this variable.

If the same name x is used for several different variables, a programmer can resolve which variable is really meant by a particular use of the name x.

For example, in Java a loop header may introduce a new loop variable and its scope:

for (int i = 0; i < NumOfElements; i++) ... ;

The scope of the variable i is the program text following the = all the way to the end of the loop body. If the loop contains a break construct and if the loop is followed by some statement like

System.out.println("the index where we stopped searching is " + i);

then we know from the scoping rules for Java that the program is almost certainly erroneous. Since the use of i in the invocation of println does not refer to the variable i that was introduced as the loop variable, the reference to variable i in the print statement either is illegal or refers to a variable with a scope that encloses the loop and the subsequent print statement.

Note: We analyzed the preceding program fragment with a rough understanding of its execution semantics but with a precise understanding of the scoping rules. These scoping rules are crucial for building a good mental execution model.

Other examples of scoping constructs in Java and C++ include

Exercise: Describe what scope each of the preceding constructs establish in Java.

Free and Bound Scope

For pedagogic purposes, we introduce the language LC modeling a core functional subset of Scheme. It is called LC because the language is essentially the lambda-calculus invented by mathematician Alonzo Church in the 1930's. In fact, LC consists of the (untyped) lambda-calculus augmented by natural numbers (constants) and the binary operator +. It has the following syntax:

M :== x | n | (lambda (x) M) | (M M) | (+ M M)

where x is a variable (any alphanumeric identifier beginning with a letter other lambda) and n is an integer number ...,-2, -1, 0, 1, 2, ... On the right hand side of the our definition, each occurence of M represents an independent choice of an expression in our recursively defined set of strings/trees.

Our definition of the syntax of LC is ambiguous in the sense that we haven't made it clear whether we a defininig a set of strings of a set of trees. In fact, we interpret the definition both ways. Technically, it defines a set of strings, but the strategic inclusion of parentheses (as in Scheme) means that every string has a obvious unique parse. Hence, we can think of each alternative on the right hand of the definition as a distinct tree construction.

Using LC, we can show how to define the notion of free and bound occurrences of program variables rigorously. This specification will serve two purposes:

If we interpret LC as a sub-language of Scheme, it contains only one binding construct: lambda-expressions. In

(lambda (a-variable) an-expression)

a-variable is introduced as a new, unique variable whose scope is the body an-expression of the lambda-expression (with the exception of possible "holes", which we describe in a moment). We have included parentheses around the parameter of a lambda expression (a-variable in the example above) to make LC syntax consistent with Scheme syntax.

One way to determine the scope of variable binding constructs is to define when some variable occurs free in some expression. By specifying where a variable occurs free and how this relation is affected by the various constructs, we also define the scope of variable definitions. Here is the definition for LC.

Definition (Free occurrence of a variable in LC):

Let x,y range over the elements of Var.
Let M, N range over the elements of Exp.

Then x occurs free in:

The relation x occurs free in y is the least relation on LC expressions satisfying the preceding constraints.

If x occurs free in M, then some specific occurrence of x in M can be identified as a "free occurrence". In fact, we could adapt the preceding definition to define the related notion "occurrence k of x is free in M", but the details are tedious because we must label each occurrence of x with a unique index k, or alternatively, we must uniquely identify each occurrence by a path from the root in the abstract syntax tree for M.

An occurrence of variable x is bound in M iff it is not free in M. The most blatant form of bound occurrence is the second component of a lambda-expression. This form of variable occurrence is called the binding occurrence for the variable. Each distinct variable obviously has one binding occurrence; if x occurs twice as the second component of a lambda-expression, it is the name of two distinct variables.

Note that the definition of the "occurs free" and "free occurence" relations are inductive over expressions precisely because the definition of Exp is itself inductive.

Exercise: Formulate the notion of bound occurrence explicitly as an inductively defined function in Scheme or method in Java.

The definitions of free and bound implicitly determine the notion of a scope in LC expressions. Clearly, a lambda expression opens a new scope; or, to put it differently, the scope of the binding occurrence x in

(lambda (x) M)

is that textual region of M where x might occur free.

Exercise: Devise an expression in which x occurs both free and bound..

Static Distance Representation

Given a variable occurrence it is natural for a programmer to ask where the binding occurrence is. This may tell him whether or not some nearby variable occurrence is related, or it may help him to determine something about the value that it stands for. Consider the expression

(lambda (z) (lambda (x) ((lambda (x) (z (z (z x)))) x)))

The two occurrences of x in the fragment

                                          ... x)))) x) 

are unrelated (the first corresponds to the third lambda, and the second corresponds to the second lambda), but only the binding occurrences for the two can tell them apart.

For a human being, a representation that includes arrows from bound occurrences to binding occurrences is clearly preferable. We can approximate such graphical representations of programs by replacing variable occurrences with numbers that indicate how far away in the surrounding context the binding construct is. The above expression would translate into

(lambda (z) (lambda (x) ((lambda (x) (3 (3 (3 1)))) 1)))

Note that the index "1" refers to a different lambda in each case.

Indeed, since the parameters in lambda's are now superfluous, we can omit them completely:

(lambda (lambda ((lambda (3 (3 (3 1)))) 1)))

This representation is often called either the static distance representation of the term or the "deBruijn (de BRAWN) notation" for the term. deBruijn is a Dutch mathematician who recognized the theoretical advantages of the notation in the context of idealized programming language called the lambda-calculus, which was invented by Alonzo Church in the 1930's. LC is a slight extension of the lambda-calculus. Although the static distance representation is not particularly helpful for people, it is valuable for compilers and interpreters.

We could specify the process that replaces variable occurrences with static distances in English along the lines of the above definitions. Instead, we write a program that performs the translation.

First, recall the abstract representation of the set of LC expressions is given by the equation:

E ::= Num | Var | (lambda Var E) | (E E) 
The corresponding abstract representation for the set of LC expression in deBruijn (static distance) form is
E ::= Num | VarRef | (lambda E) | (E E) 

where VarRef is a "tagged" integer indicating a variable reference rather than an integer constant.

In Scheme, we can implement the translation from conventional abstract syntax to static distance abstract syntax as follows:

;; conventional abstract syntax
(define-struct var (symbol))
(define-struct const (number))
(define-struct proc (var body))
(define-struct app (rator rand))
;; Expr ::= (make-var S) | (make-const N) | (make-proc Var Expr)) | (make-app Expr Expr)
;; where S is the set of symbols and N is the set of numbers

;; constructor for static distance (sd) abstract syntax
(define-struct sdvar (index))
(define-struct sdproc (body))
;; SD-Expr ::= (make-var I) | (make-const N) | (make-sdproc SD-Expr) | (make-app SD-Expr SD-Expr)
;; where I is the set of positive numbers and N is the set of numbers

(define sd
  (lambda (an-ar binding-vars)
    (cond
      ((var? an-ar) (make-sdvar (sdlookup (var-name an-ar) binding-vars)))
      ((const? an-ar) an-ar)
      ((proc? an-ar) 
       (make-sdproc 
	(sd (proc-body an-ar) (cons (proc-var an-ar) binding-vars))))
      (else
       (make-app (sd (app-rator an-ar) binding-vars) 
		 (sd (app-rand an-ar) binding-vars))))))

where

(define sdlookup
  (lambda (a-var vars)
    (cond
      ((null? vars) (error 'sdlookup "free occurrence of ~s" a-var))
      (else (if (eq? (car vars) a-var)
		1
		(add1 (sdlookup a-var (cdr vars))))))))

Variables are replaced by their static distance coordinate, which is determined by looking up how deep in the list of binding vars it occurs. If the list does not contain the variable, we signal an error. Constants do not need to be translated. Applications are traversed and re-constructed.

We can test sd by applying it to the result of parsing some surface syntax and the empty list.

Exercise:What should the output of this be?

(sd (parse '(lambda (z) (lambda (x) ((lambda (x) (z (z (z x)))) x)))) null)

(The empty list of variables indicates that we consider this to be the complete program with no further context.)


Supplementary Material on OCaml

The same code can be written in O'Caml as follows:

type exp =
    Con of int
  | Var of string
  | Lam of string * exp
  | App of exp * exp

Intutively, the abstract syntax for the de Bruijn representation would be:

D ::= Num | #Num | (lambda D) | (D D)

In OCaml, this can be implemented by the datatype:

type debruijn =
    Cn of int
  | Vr of int
  | Lm of debruijn
  | Ap of debruijn * debruijn

In OCaml, it is easy to implement the translation back and forth between these different representations:

exception Err of string;;

(* Conversion from "exp" to "debruijn" *)

(* An environment mapping string names to indices 
   (implemented as a function) *)

 (* The initial (empty) environment *)

let en0 x = raise (Err "Unknown variable during De Bruijn conversion")

 (* Extending the environment with a new name *)

let debExt env s =
  fun y -> if y=s then 0 else (env y)+1;;

 (* The translation *)

let rec debIt e env =
  match e with
      Con i -> Cn i
    | Var s -> Vr (env s)
    | App (e1,e2) -> Ap (debIt e1 env, debIt e2 env)
    | Lam (s,e) -> Lm (debIt e (debExt env s))

open Printf;;

let i = ref 0;;
let gensym x = 
 let s = x^(sprintf "%i" (!i)) in
  i:=(!i)+1; s;;

let deb e = debIt e en0;;

let expExt env s =
  fun y -> if y=0 then s else (env (y-1));;

let rec expIt e env =
  match e with
    Cn i -> Con i
  | Vr n -> Var (env n)
  | Ap (e1,e2) -> App (expIt e1 env, expIt e2 env)
  | Lm (e) -> let s = gensym "v" in
              Lam (s,expIt e (expExt env s))

let ne0 x = raise (Err "Unknown variable during Expresion conversion")

let exp e = expIt e ne0;;

(* Example term *)

let term = App (Lam ("x", Var"x"), Lam ("y", Var "y"));;

let dterm = deb term  
  (* = Ap (Lm (Vr 0), Lm (Vr 0)) *)

let eterm = exp dterm 
  (* = App (Lam ("v1", Var "v1"), Lam ("v0", Var "v0")) *)