Interpreting

This section contains a brief introduction to how an interpreter works.

Programs written in Pascal or C++ are normally compiled before they are executed. The process of compilation consists of translating the entire source program into machine language (object code) and placing that in a separate file which can then be executed. In principle, the compilation process need only occur one time, and the program can be executed many times after that.

One alternative to compilation is to ``interpret'' the high level language. An interpreter is similar to a compiler in one sense - it translates the source program into machine language. It differs in that the translation occurs one line at a time, and the resulting machine language instructions are executed as they are generated. No object file is generated, so the translation must occur each time you execute the program.

The interpreter that makes up this take-home problem doesn't actually translate BBB programs into machine language. Your interpreter is a Pascal or C++ program that inputs the BBB code and simulates what it would do if it were executed on a PC. If the BBB code calls for input or output, your interpreter will cause the same thing to occur. It will also carry out calculations, loops, subroutine calls, etc.

There are three main parts to an interpreter. A scanner, a parser, and the execution unit. The source code of a program is just plain text. This is a very difficult form to work with, so the scanner processes the text and turns it into a form that's easier to deal with. The parser takes the output of the scanner and makes sure that it conforms to the proper syntax of the language. At the same time it analyzes the structure of the program and turns it into a form that the execution unit can understand. The execution unit takes the output of the parser and uses it to actually execute the code. In practice, the parser and execution unit are usually combined into one piece, and the code is executed as it is parsed.

The remainder of this section will deal with how each of these steps is accomplished.

Scanning

The goal of the scanner is to process the source code text and produce from it a stream of tokens. A token is a code that is used to identify the basic pieces of the source code. A token might represent a punctuation mark, a variable name, a number, etc. It does this by examining the source text one character at a time and identifying the bits that are there.

Here is the basic algorithm:

  1. Skip any whitespace. The first non-whitespace character encountered determines what kind of token is to be extracted.
  2. Get characters until you get one that does not belong in the token. When this happens, you are done extracting the token.
  3. Process the token you have just extracted.
  4. Repeat these steps to extract the next token.

Parsing

The goal of the parser is to determine if the program is syntactically correct. In the process of doing this, the parser also analyzes the structure of the program in a way that makes it possible to understand how the program should behave. In the case of our interpreter, the program will actually be executed as it is being parsed.

BNF

Since the main idea behind a parser is to check the syntax of a program, we need some way of describing the syntax in a concrete way. We do this using Backus-Nauer Form (BNF). BNF is a kind of special language that is used to describe other languages. It consists of a number of rules, which show the relationship between different symbols. The symbols may be terminal symbols or nonterminal symbols. A terminal symbol is a token. A nonterminal symbol is a symbol that is defined by a BNF rule.

Let's look at a bit of BNF, then I'll describe how to read it:

 primary_exp ::=  Tfor var = expression to

condition ::= relation tex2html_wrap_inline966

relation and condition tex2html_wrap_inline966

relation or condition

rel_op ::= = tex2html_wrap_inline966 <> tex2html_wrap_inline966 < tex2html_wrap_inline966 > tex2html_wrap_inline966 <= tex2html_wrap_inline966 >=

relation ::= expression rel_op expression tex2html_wrap_inline966

not condition

expression ::= term tex2html_wrap_inline966

term + expression tex2html_wrap_inline966

term - expression

The ::= symbol separates the left-hand side of a rule (a nonterminal) from its right-hand side (the definition). The vertical bar separates alternate forms of the definition. Typically, the different forms that a nonterminal can be are listed on separate lines, but this is not a requirement (as the rel_op rule shows). A nonterminal symbol is shown in italics. If you are working in an area where italics are not possible (such as the comments in the parser source code), a nonterminal is often displayed inside angle brackets (i.e., <expression>). Terminal symbols are shown in plain text.

If we were to read the rule for expressions, we would say that an expression can be either a single term, or a term followed by a plus or a minus, followed by another expression. Because this rule is defined recursively, it allows for any number of terms in an expression, separated by pluses and minuses.

Note: there are a few nonterminals, such as number and var, which are actually tokens, so they won't have rules to define them. They are written as nonterminals because they could have many different values.

Recursive-descent parsing

  Going from a BNF description of a language to a parser is fairly easy. Basically, you have a procedure for each nonterminal symbol. Inside these procedures, you look at the next token available from the scanner and decide which definition of the rule to use, if there is more than one. You then step through the right-hand side of the rule. If the rule contains a terminal, you check to make sure you got that same token from the scanner, then get the next token from the scanner. If the rule contains a nonterminal, you just call the procedure associated with that nonterminal. This is called a recursive-descent parser.

For example, in the expression rule above, you would first call the term function (since every kind of an expression begins with a term). Then you would check to see if the next token is a plus or a minus. If it is, you would get the next token and make a recursive call to the expression function.

Take a look at the code that we have provided for the parser. Try and see how each procedure follows the BNF for the nonterminal it is representing.

Execution

The execution unit is the real meat of the interpreter. The only purpose of the other parts of the interpreter is to get us to the point where we can actually execute the code.

Symbol table

Most programs make use of variables. This means that we need some way of keeping track of what variables exist, and what values they hold. The data structure we use to do this is called a symbol table. A symbol table stores any information we want to associate with a variable name. In the case of the BBB interpreter, the only information we really care about is the variable's value. The symbol table should provide a way to lookup the current value of a variable, as well as change it's value. You also need some way of adding new variables to the table.

Subroutines

Many programs make use of subroutines. A subroutine can be called from anywhere, so the interpreter needs to keep track of where it was called from so it can return to that point when the subroutine finishes. Since a subroutine can call another subroutine, which might call yet a third subroutine, something more complicated than a simple variable is needed to keep track of all the return locations. The data structure that is used is called a call stack. The call stack is a last-in, first-out list. When you call a subroutine, you add the line number the subroutine was called from to the call stack. When the subroutine finishes, you take the last address off of the call stack, and continue executing from there.

Executing statements

Every statement has some sort of effect on the program. Otherwise, why have the statement? The value of a variable may be changed, there may be something printed on the screen, or the flow of control may be changed. When executing a statement, you need only make sure that the appropriate effect happens. If a variable should be changed, you find its new value and modify its entry in the symbol table. If there should be some output, figure out what it is and print it out. If the flow of control should be changed, figure out which line to go to and continue executing from there rather than the next line.

Evaluating expressions

Evaluating an expression can be a bit trickier than executing a statement. Most of the work in an expression is done by operators. In order to evaluate an operation, you must first evaluate the operand(s), then you can do the operation. To accomplish this, the procedures in the parser pass around records containing the results of evaluating an expression. To use one of the built-in functions, you first evaluate the argument, then apply the function on that argument. The simple expressions just get their value from a constant or a variable lookup.

Type checking

Another thing that must be done during execution is type checking. Before adding two expressions together, you should check to make sure that the two expressions are the same type. It doesn't make any sense to add a string and an integer together. When storing a value into a variable, you should make sure that the value is of the same type as the variable. For a complete description of the type rules in BBB, see section 3.3.



Alyosha Efros
Mon Mar 3 17:43:27 MST 1997