Compiler Design

Department of Computer Science, Compilers





1         Introduction

 

1.1       Compilation

 

Definition. Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).

 

[Diagram: Source program → Compiler → Target program; error messages are a side output of the compiler.]

An important part of any compiler is the detection and reporting of errors; this will be discussed in more detail later in the introduction. Commonly, the source language is a high-level programming language (i.e. a problem-oriented language), and the target language is a machine language or assembly language (i.e. a machine-oriented language). Thus compilation is a fundamental concept in the production of software: it is the link between the (abstract) world of application development and the low-level world of application execution on machines.

Types of Translators. An assembler is also a type of translator:

 

 

[Diagram: Assembly program → Assembler → Machine program]

 
An interpreter is closely related to a compiler, but takes as input both a source program and the program's input data. The translation and execution phases of the source program are one and the same.

 

 

 

 

 

[Diagram: Source program + input data → Interpreter → Output data]

Although the above types of translator are the most well-known, we also need knowledge of compilation techniques to deal with the recognition and translation of many other types of languages including:


   Command-line interface languages;

   Typesetting / word processing languages (e.g. TeX);

   Natural languages;

   Hardware description languages;

   Page description languages (e.g. PostScript);

   Set-up or parameter files.

 

Early Development of Compilers.

1940s. Early stored-program computers were programmed in machine language. Later, assembly languages were developed where machine instructions and memory locations were given symbolic forms.

1950s. Early high-level languages were developed, for example FORTRAN. Although more problem-oriented than assembly languages, the first versions of FORTRAN still had many machine-dependent features. Techniques and processes involved in compilation were not well understood at this time, and compiler-writing was a huge task: e.g. the first FORTRAN compiler took 18 man-years of effort to write.

Chomsky's study of the structure of natural languages led to a classification of languages according to the complexity of their grammars. The context-free languages proved to be useful in describing the syntax of programming languages.

1960s onwards. The study of the parsing problem for context-free languages during the 1960s and 1970s led to efficient algorithms for the recognition of context-free languages. These algorithms, and associated software tools, are central to compiler construction today. Similarly, the theory of finite state machines and regular expressions (which correspond to Chomsky's regular languages) has proved useful for describing the lexical structure of programming languages.

From Algol 60 onwards, high-level languages have become more problem-oriented and machine-independent, with features far removed from the machine languages into which they are compiled. The theory and tools available today make compiler construction a manageable task, even for complex languages. For example, your compiler assignment will take only a few weeks (hopefully) and will only be about 1000 lines of code (although, admittedly, the source language is small).

 

 

1.2       The Context of a Compiler

 

The complete process of compilation is illustrated as:

skeletal source program → [preprocessor] → source program → [compiler] → assembly program → [assembler] → relocatable m/c code → [linker/loader] → absolute m/c code

 

 

1.2.1        Preprocessors

 

Preprocessing performs (usually simple) operations on the source file(s) prior to compilation. Typical preprocessing operations include:

 

(a)   Expanding macros (shorthand notations for longer constructs). For example, in C,

#define foo(x,y) (3*x+y*(2+x))

defines a macro foo that, when used later in the program, is expanded by the preprocessor. For example, a = foo(a,b) becomes

a = (3*a+b*(2+a))

(b)   Inserting named files. For example, in C,

#include "header.h"

is replaced by the contents of the file header.h

 

1.2.2        Linkers

 

A linker combines object code (machine code that has not yet been linked) produced from compiling and assembling many source programs, as well as standard library functions and resources supplied by the operating system. This involves resolving references in each object file to external variables and procedures declared in other files.


1.2.3        Loaders

 

Compilers, assemblers and linkers usually produce code whose memory references are made relative to an undetermined starting location that can be anywhere in memory (relocatable machine code). A loader calculates appropriate absolute addresses for these memory locations and amends the code to use these addresses.
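As an illustration of this relocation step, here is a small sketch in Python. The three-field instruction format (opcode, operand, address flag) is invented purely for the sketch; it is not any real object-file layout.

```python
def load(relocatable_code, base_address):
    """Convert relocatable machine code into absolute machine code.

    Each instruction is modelled as (opcode, operand, is_address):
    operands flagged as addresses are relative to an undetermined
    start location, so the loader adds the chosen base address.
    """
    absolute_code = []
    for opcode, operand, is_address in relocatable_code:
        if is_address:
            operand += base_address  # relocate the memory reference
        absolute_code.append((opcode, operand))
    return absolute_code

# A program whose memory references are relative to its own start:
program = [("LOAD", 8, True), ("ADD", 1, False), ("JMP", 0, True)]
print(load(program, 0x4000))
```

Only the operands marked as memory references are amended; immediate values (such as the 1 in ADD) are left untouched.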

 

1.3       The Phases of a Compiler

 

The process of compilation is split up into six phases, each of which interacts with a symbol table manager and an error handler. This is called the analysis/synthesis model of compilation. There are many variants on this model, but the essential elements are the same.

 


[Diagram: source program → lexical analysis → syntax analysis → semantic analysis → intermediate code generation → code optimisation → code generation → target program, with all six phases interacting with the symbol table manager and the error handler.]

 

 

1.3.1        Lexical Analysis

 

A lexical analyser or scanner is a program that groups sequences of characters into lexemes, and outputs (to the syntax analyser) a sequence of tokens. Here:

 

(a)   Tokens are symbolic names for the entities that make up the text of the program;

e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.


(b)   A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g. the sequence i, f for the token if , and any sequence of alphanumerics starting with a letter for the token id.

(c)   A lexeme is a sequence of characters from the input that match a pattern (and hence constitute an instance of a token); for example if matches the pattern for if , and foo123bar matches the pattern for id.

 

For example, the following code might result in the table given below.

 

program  foo(input,output);var  x:integer;begin readln(x);writeln('value  read  =  ',x)  end.

 

Lexeme                      Token                        Pattern

program                     program                      p, r, o, g, r, a, m
(newlines, spaces, tabs)
foo                         id (foo)                     letter followed by seq. of alphanumerics
(                           leftpar                      a left parenthesis
input                       input                        i, n, p, u, t
,                           comma                        a comma
output                      output                       o, u, t, p, u, t
)                           rightpar                     a right parenthesis
;                           semicolon                    a semicolon
var                         var                          v, a, r
x                           id (x)                       letter followed by seq. of alphanumerics
:                           colon                        a colon
integer                     integer                      i, n, t, e, g, e, r
;                           semicolon                    a semicolon
begin                       begin                        b, e, g, i, n
(newlines, spaces, tabs)
readln                      readln                       r, e, a, d, l, n
(                           leftpar                      a left parenthesis
x                           id (x)                       letter followed by seq. of alphanumerics
)                           rightpar                     a right parenthesis
;                           semicolon                    a semicolon
writeln                     writeln                      w, r, i, t, e, l, n
(                           leftpar                      a left parenthesis
'value read = '             literal ('value read = ')    seq. of chars enclosed in quotes
,                           comma                        a comma
x                           id (x)                       letter followed by seq. of alphanumerics
)                           rightpar                     a right parenthesis
(newlines, spaces, tabs)
end                         end                          e, n, d
.                           fullstop                     a fullstop


It is the sequence of tokens in the middle column that is passed as output to the syntax analyser.

This token sequence represents almost all the important information from the input program required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often important in separating lexemes, is usually not returned as a token. Also, when outputting an id or literal token, the lexical analyser must also return the value of the matched lexeme (shown in parentheses), or else this information would be lost.
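A lexical analyser along these lines can be sketched with a master regular expression, here using Python's re module. The token names follow the table above; the pattern set is deliberately simplified (for instance, a keyword occurring as a prefix of a longer identifier is not handled), so this is an illustration rather than a production scanner.

```python
import re

# Each pair is (token name, pattern); order matters: keyword
# patterns are listed before the catch-all identifier pattern.
TOKEN_SPEC = [
    ("program", r"program"), ("var", r"var"), ("begin", r"begin"),
    ("end", r"end"), ("integer", r"integer"),
    ("readln", r"readln"), ("writeln", r"writeln"),
    ("input", r"input"), ("output", r"output"),
    ("literal", r"'[^']*'"),          # seq. of chars enclosed in quotes
    ("id", r"[A-Za-z][A-Za-z0-9]*"),  # letter then alphanumerics
    ("leftpar", r"\("), ("rightpar", r"\)"),
    ("comma", r","), ("semicolon", r";"), ("colon", r":"),
    ("fullstop", r"\."),
    ("whitespace", r"\s+"),           # matched but not returned
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokens(source):
    """Yield (token, lexeme) pairs; whitespace is skipped."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "whitespace":
            yield (m.lastgroup, m.group())

print(list(tokens("var x:integer;")))
```

Note how whitespace is matched (so that it separates lexemes) but never emitted, and how id tokens carry their lexeme as the attached value, exactly as described above.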

 

1.3.2        Symbol Table Management

 

A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures etc.) of a source program, together with all the attributes of each identifier.

For variables, typical attributes include:

 

   its type,

   how much memory it occupies,

   its scope.

 

For procedures and functions, typical attributes include:

 

   the number and type of each argument (if any),

   the method of passing each argument, and

   the type of value returned (if any).

 

The purpose of the symbol table is to provide quick and uniform access to identifier attributes throughout the compilation process. Information is usually put into the symbol table during the lexical analysis and/or syntax analysis phases.
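A minimal symbol table can be sketched as a map from identifiers to attribute records. The dictionary-based design and the attribute names below are illustrative, not prescriptive; real compilers typically add scoping and more structure.

```python
class SymbolTable:
    """Map identifiers to their attributes for quick, uniform access."""

    def __init__(self):
        self._entries = {}

    def declare(self, name, **attributes):
        """Enter an identifier with its attributes (once per scope)."""
        if name in self._entries:
            raise KeyError(f"duplicate declaration of {name!r}")
        self._entries[name] = dict(attributes)

    def lookup(self, name):
        """Return the attribute record; KeyError means 'undeclared'."""
        return self._entries[name]

table = SymbolTable()
table.declare("x", type="integer", size=4, scope="global")
table.declare("readln", kind="procedure", args=[("var", "integer")])
print(table.lookup("x")["type"])  # → integer
```

Entries would be made here during lexical and/or syntax analysis, then consulted by the later phases.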

 

1.3.3        Syntax Analysis

 

A syntax analyser or parser is a program that groups sequences of tokens from the lexical analysis phase into phrases each with an associated phrase type.

A phrase is a logical unit with respect to the rules of the source language. For example, consider:

 

a  :=  x  *  y  +  z


After lexical analysis, this statement has the structure

 

id1 assign id2 binop1 id3 binop2 id4

 

Now, a syntactic rule of Pascal is that there are objects called expressions for which the rules are (essentially):

 

(1)    Any constant or identifier is an expression.

(2)    If exp1 and exp2 are expressions then so is exp1 binop exp2.

 

Taking all the identifiers to be variable names for simplicity, we have:

 

   By rule (1) exp1 = id2  and exp2 = id3  are both phrases with phrase type expression;

   by rule (2) exp3 = exp1  binop1  exp2  is also a phrase with phrase type expression;

   by rule (1) exp4 = id4  is a phrase with phrase type expression;

   by rule (2), exp5 = exp3  binop2  exp4  is a phrase with phrase type expression.

 

Of course, Pascal also has a rule that says:

id assign exp

 

is  a  phrase  with  phrase  type  assignment,  and so  the  Pascal  statement  above  is  a  phrase  of type assignment.

Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax tree is a compacted form of parse tree in which the operators appear as the interior nodes. The construction of a parse tree is a basic activity in compiler-writing.

A parse tree for the example Pascal statement is:

 

assignment
  id1
  assign
  exp5
    exp3
      exp1
        id2
      binop1
      exp2
        id3
    binop2
    exp4
      id4


and a syntax tree is:

 


assign
  id1
  binop2
    binop1
      id2
      id3
    id4

 

 

Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary. The main criterion is whether the analyser needs recursion or not:

 

   lexical analysers hardly ever use recursion; they are sometimes called linear analysers since they scan the input in a straight line (from left to right).

   syntax analysers almost always use recursion; this is because phrase types are often defined in terms of themselves (cf.  the phrase type expression above).

 

1.3.4        Semantic Analysis

 

A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree and a symbol table. Its purpose is to determine if the input has a well-defined meaning; in practice semantic analysers are mainly concerned with type checking and type coercion based on type rules. Typical type rules for expressions and assignments are:

Expression Type Rules. Let exp be an expression.

 

(a)   If exp is a constant then exp is well-typed and its type is the type of the constant.

(b)   If exp is a variable then exp is well-typed and its type is the type of the variable.

(c)   If exp is an operator applied to further subexpressions such that:

(i)    the operator is applied to the correct number of subexpressions,

(ii)    each subexpression is well-typed and

(iii)    each subexpression is of an appropriate type,

then exp is well-typed and its type is the result type of the operator.

 

Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed expression of type T2. If


(a)   T1 = T2 and

(b)   T1 is an assignable type

 

then var assign exp is a well-typed assignment. For example, consider the following code fragment:

intvar  :=  intvar  +  realarray

 

where intvar is stored in the symbol table as being an integer variable, and realarray as an array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect, since + is only defined on numbers, whereas its second argument here is an array. The semantic analyser checks for such type errors using the parse tree, the symbol table and the type rules.
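The expression and assignment type rules can be sketched as a recursive check over a syntax tree. The node encoding, the SYMBOLS table and the OPERATORS table below are all invented for the sketch.

```python
# A syntax tree node is ("const", value, type), ("var", name),
# or ("op", operator, [subexpressions]).
SYMBOLS = {"intvar": "integer", "realarray": "array of real"}
OPERATORS = {("+", ("integer", "integer")): "integer",
             ("+", ("real", "real")): "real"}

def type_of(node):
    """Return the type of a well-typed expression, or raise TypeError."""
    kind = node[0]
    if kind == "const":
        return node[2]              # rule (a): the type of the constant
    if kind == "var":
        return SYMBOLS[node[1]]     # rule (b): the type of the variable
    op, args = node[1], [type_of(a) for a in node[2]]   # rule (c)
    try:
        return OPERATORS[(op, tuple(args))]
    except KeyError:
        raise TypeError(f"{op} not defined on {args}")

def check_assignment(var_name, exp):
    """Assignment rule: types must match (assignability omitted here)."""
    if SYMBOLS[var_name] != type_of(exp):
        raise TypeError("assignment types differ")

# intvar := intvar + realarray  -- semantically incorrect:
exp = ("op", "+", [("var", "intvar"), ("var", "realarray")])
try:
    check_assignment("intvar", exp)
except TypeError as err:
    print("type error:", err)
```

Rule (c)(i)–(iii) are enforced in one step by the OPERATORS lookup: a wrong arity or a wrong argument type simply fails to match any entry.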

 

1.3.5        Error Handling

 

Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On detecting an error the compiler must:

 

   report the error in a helpful way,

   correct the error if possible, and

   continue processing (if possible) after the error to look for further errors.

 

Types of Error. Errors are either syntactic or semantic:

Syntax errors are errors in the program text; they may be either lexical or grammatical:

 

(a)   A lexical error is a mistake in a lexeme, for example typing tehn instead of then, or missing off one of the quotes in a literal.

(b)   A grammatical error is one that violates the (grammatical) rules of the language, for example if  x  =  7  y  :=  4 (missing then).

 

Semantic errors are mistakes concerning the meaning of a program construct; they may be either type errors, logical errors or run-time errors:

 

(a)   Type errors occur when an operator is applied to an argument of the wrong type, or to the wrong number of arguments.


(b)   Logical errors occur when a badly conceived program is executed, for example: while  x  =  y  do  ...  when x and y initially have the same value and the body of the loop need not change the value of either x or y.

(c)   Run-time errors are errors that can be detected only when the program is executed, for example:

 

var  x  :  real;  readln(x);  writeln(1/x)

 

which would produce a run-time error if the user input 0.

 

Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way). If possible, the compiler should make the appropriate correction(s). Semantic errors are much harder and sometimes impossible for a computer to detect.

 

1.3.6        Intermediate Code Generation

 

After the analysis phases of the compiler have been completed, a source program has been decomposed into a symbol table and a parse tree both of which may have been modified by the semantic analyser. From this information we begin the process of generating object code according to either of two approaches:

 

(1)    generate code for a specific machine, or

(2)    generate  code  for  a  general  or  abstract  machine,  then  use  further  translators  to turn the abstract code into code for specific machines.

 

Approach (2) is more modular and efficient provided the abstract machine language is simple enough to:

 

(a)   produce and analyse (in the optimisation phase), and

(b)   easy to translate into the required language(s).

 

One of the most widely used intermediate languages is Three-Address Code (TAC).

TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some common TAC instructions include:

 

(i)    var1  := var2  binop var3

(ii)    var1  := unop var2

(iii)    var1  := num


(iv)    goto label

(v)    if var1 relop var2 goto label

 

There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will use only the above for the following discussion.

Syntax-Directed Code Generation. In essence, code is generated by recursively walking through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code generation. For example, consider the code fragment:

 

z  :=  x  *  y  +  x

 

and its syntax tree (with lexemes replacing tokens):

 

:=

z     +

*     x

x     y

 

We use this tree to direct the compilation into TAC as follows.

At the root of the tree we see an assignment whose right-hand side is an expression, and this expression is the sum of two quantities. Assume that we can produce TAC code that computes  the  value  of  the  first  and  second  summands  and  stores  these  values  in  temp1  and temp2 respectively.  Then the appropriate TAC for the assignment statement is just

 

z  :=  temp1  +  temp2

 

Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive way.

For temp1 we see that it is the product of two quantities.  Assume that we can produce TAC code that computes the value of the first and second multiplicands and stores these values in temp3 and temp4 respectively.  Then the appropriate TAC for the computing temp1 is

 

temp1  :=  temp3  *  temp4

 

Continuing the recursive walk, we consider temp3.  Here we see it is just the variable x and thus the TAC code


temp3  :=  x

 

is sufficient.  Next we come to temp4 and similar to temp3 the appropriate code is

 

temp4  :=  y

 

Finally, considering temp2, of course

 

temp2  :=  x

 

suffices.

Each code fragment is output when we leave the corresponding node; this results in the final program:

 

temp3 := x
temp4 := y
temp1 := temp3 * temp4
temp2 := x
z := temp1 + temp2

 

Comment. Notice how a compound expression has been broken down and translated into a sequence of very simple instructions, and furthermore, the process of producing the TAC code was uniform and simple. Some redundancy has been brought into the TAC code but this can be removed (along with redundancy that is not due to the TAC-generation) in the optimisation phase.
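The recursive walk just described can be sketched in code. The tuple encoding of the syntax tree is illustrative; note too that this sketch allocates temporaries as leaves are visited, so the numbering differs from the hand-worked example even though the structure of the generated code is the same.

```python
from itertools import count

def gen_tac(tree, code, temps):
    """Emit TAC for a syntax tree; return the place holding its value."""
    if isinstance(tree, str):            # a leaf: a variable name
        temp = f"temp{next(temps)}"
        code.append(f"{temp} := {tree}")
        return temp
    op, left, right = tree
    if op == ":=":                       # assignment at the root
        place = gen_tac(right, code, temps)
        code.append(f"{left} := {place}")
        return left
    # A binary operator: generate code for both operands first
    # (code is emitted on leaving each node, i.e. post-order).
    lplace = gen_tac(left, code, temps)
    rplace = gen_tac(right, code, temps)
    temp = f"temp{next(temps)}"
    code.append(f"{temp} := {lplace} {op} {rplace}")
    return temp

# z := x * y + x
tree = (":=", "z", ("+", ("*", "x", "y"), "x"))
code = []
gen_tac(tree, code, count(1))
print("\n".join(code))
```

Running this prints a program of the same shape as the one above: leaf copies first, then the product, then the sum, then the final assignment.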

 

1.3.7        Code Optimisation

 

An optimiser attempts to improve the time and space requirements of a program. There are many ways in which code can be optimised, but most are expensive in terms of time and space to implement.

Common optimisations include:

 

   removing redundant identifiers,

   removing unreachable sections of code,

   identifying common subexpressions,

   unfolding loops and


   eliminating procedures.

 

Note that here we are concerned with the general optimisation of abstract code.

Example. Consider the TAC code:

 

temp1 := x
temp2 := temp1
if temp1 = temp2 goto 200
temp3 := temp1 * y
goto 300
200    temp3 := z
300    temp4 := temp2 + temp3

 

Removing redundant identifiers (just temp2) gives

 

temp1 := x
if temp1 = temp1 goto 200
temp3 := temp1 * y
goto 300
200    temp3 := z
300    temp4 := temp1 + temp3

 

Removing redundant code gives

 

temp1  :=  x

200    temp3  :=  z

300    temp4  :=  temp1  +  temp3

 

Notes.  Attempting to find a best optimisation is expensive for the following reasons:

 

   A given optimisation technique may have to be applied repeatedly until no further optimi- sation can be obtained. (For example, removing one redundant identifier may introduce another.)

   A given optimisation technique may give rise to other forms of redundancy and thus sequences of optimisation techniques may have to be repeated. (For example, above we removed a redundant identifier and this gave rise to redundant code, but removing redundant code may lead to further redundant identifiers.)

   The order in which optimisations are applied may be significant. (How many ways are there of applying n optimisation techniques to a given piece of code?)
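The first two notes amount to saying that optimisation passes must be applied to a fixed point. A sketch of that driver loop follows; copy_propagate is a deliberately naive placeholder pass (it does whole-string substitution on instruction text), not a production optimisation.

```python
def copy_propagate(code):
    """Drop 't2 := t1' copies and rewrite later uses of t2 to t1 (toy)."""
    out, copies = [], {}
    for instr in code:
        for old, new in copies.items():
            instr = instr.replace(old, new)   # crude textual substitution
        lhs, _, rhs = instr.partition(" := ")
        if rhs.isidentifier():                # a plain copy: remember it
            copies[lhs] = rhs
        else:
            out.append(instr)
    return out

def optimise(code, passes):
    """Apply each pass repeatedly until no pass changes the code."""
    changed = True
    while changed:
        changed = False
        for p in passes:
            new_code = p(code)
            if new_code != code:
                code, changed = new_code, True
    return code

code = ["temp2 := temp1", "temp3 := temp2 * y", "temp4 := temp2 + temp3"]
print(optimise(code, [copy_propagate]))
```

The outer loop embodies the first two notes: it re-runs the whole pass list whenever any pass made progress, stopping only when a full sweep changes nothing.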


1.3.8        Code Generation

 

The final phase of the compiler is to generate code for a specific machine. In this phase we consider:

 

   memory management,

   register assignment and

   machine-specific optimisation.

 

The output from this phase is usually assembly language or relocatable machine code.

Example. The TAC code above could typically result in the ARM assembly program shown below. Note that the example illustrates a mechanical translation of TAC into ARM; it is not intended to illustrate compact ARM programming!

 

 

.x      EQUD  0                four bytes for x
.z      EQUD  0                four bytes for z
.temp   EQUD  0                four bytes each for temp1,
        EQUD  0                  temp3, and
        EQUD  0                  temp4.

.prog   MOV   R12,#temp        R12 = base address
        MOV   R0,#x            R0 = address of x
        LDR   R1,[R0]          R1 = value of x
        STR   R1,[R12]         store R1 at R12
        MOV   R0,#z            R0 = address of z
        LDR   R1,[R0]          R1 = value of z
        STR   R1,[R12,#4]      store R1 at R12+4
        LDR   R1,[R12]         R1 = value of temp1
        LDR   R2,[R12,#4]      R2 = value of temp3
        ADD   R3,R1,R2         add temp1 to temp3
        STR   R3,[R12,#8]      store R3 at R12+8


2         Languages

 

In this section we introduce the formal notion of a language, and the basic problem of recognising strings from a language.  These are central concepts that we will use throughout the remainder of the course.

Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

 

 

2.1       Basic Definitions

   An alphabet Σ is a finite non-empty set (of symbols).

   A string or word over an alphabet Σ is a finite concatenation (or juxtaposition) of symbols from Σ.

   The length of a string w (that is, the number of characters comprising it) is denoted |w|.

   The empty or null string is denoted ǫ. (That is, ǫ is the unique string satisfying |ǫ| = 0.)

   The set of all strings over Σ is denoted Σ*.

 


   For each n ≥ 0 we define

Σ^n = { w ∈ Σ* : |w| = n }.

   We define

Σ+ = ⋃_{n≥1} Σ^n.

(Thus Σ* = Σ+ ∪ {ǫ}.)

   For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention that x^0 denotes ǫ.

   A language over Σ is a set L ⊆ Σ*.

   Two languages L1 and L2 over a common alphabet Σ are equal if they are equal as sets. Thus L1 = L2 if, and only if, L1 ⊆ L2 and L2 ⊆ L1.

 

 

2.2       Decidability

 

Given a language L over some alphabet Σ, a basic question is: for each possible word w ∈ Σ*, can we effectively decide if w is a member of L or not? We call this the decision problem for L.

Note the use of the word effectively: this implies the mechanism by which we decide on membership (or non-membership) must be a finitistic, deterministic and mechanical procedure that can be carried out by some form of computing agent. Also note the decision problem asks if a given word is a member of L or not; that is, it is not sufficient to be only able to decide when words are members of L.

More precisely then, a language L ⊆ Σ* is said to be decidable if there exists an algorithm such that for every w ∈ Σ*:

 

(1)    the algorithm terminates with output Yes when w ∈ L, and

(2)    the algorithm terminates with output No when w ∉ L.

If no such algorithm exists then L is said to be undecidable.

Note. Decidability is based on the notion of an algorithm. In standard theoretical computer science this is taken to mean a Turing Machine; this is an abstract, but extremely low-level, model of computation that is equivalent to a digital computer with an infinite memory. Thus it is sufficient in practice to use a more convenient model of computation, such as Pascal programs, provided that any decidability arguments we make assume an infinite memory.

Example. Let Σ = {0, 1} be an alphabet. Let L be the (infinite) language

L = { w ∈ Σ* : w = 0^n 1 for some n ≥ 0 }.

Does the program below solve the decision problem for L?

 

read( char );
if  char  =  END_OF_STRING  then
    print( "No" )
else        /*  char  must  be  0  or  1  */
    while  char  =  0  do
        read( char )
    od;     /*  char  must  be  1  or  END_OF_STRING  */
    if  char  =  1  then
        print( "Yes" )
    else
        print( "No" )
    fi
fi
 

Answer: ....
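For comparison, here is one program that certainly does solve the decision problem for L; deciding whether the pseudocode above is equivalent to it is, in effect, the exercise.

```python
def in_L(w):
    """Decide membership of w in L = { 0^n 1 : n >= 0 }."""
    n = len(w) - 1          # if w is in L, it must be n zeros then one 1
    return n >= 0 and w == "0" * n + "1"

for w in ["1", "001", "0", "010", ""]:
    print(w, "Yes" if in_L(w) else "No")
```

This checker examines the whole input, including anything after the first 1; a useful question to ask of the pseudocode is whether it does the same.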

 

2.3       Basic Facts

(1)    Every finite language is decidable. (Hence every undecidable language is infinite.)


(2)    Not every infinite language is undecidable.

(3)    Programming languages are (usually) infinite but (always) decidable. (Why?)

 

 

2.4       Applications to Compilation

 

Languages may be classified by the means in which they are defined. Of interest to us are

regular languages and context-free languages.

Regular Languages. The significant aspects of regular languages are:

 

   they are defined by patterns called regular expressions;

   every regular language is decidable;

   the decision problem for any regular language is solved by a deterministic finite state automaton (DFA); and

   programming languages' lexical patterns are specified using regular expressions, and lexical analysers are (essentially) DFAs.

 

Regular languages and their relationship to lexical analysis are the subjects of the next section.

Context-Free Languages. The significant aspects of context-free languages are:

 

   they are defined by rules called context-free grammars;

   every context-free language is decidable;

   the decision problem for any context-free language of interest to us is solved by a deter- ministic push-down automaton (DPDA);

   programming language syntax is specified using context-free grammars, and (most) parsers are (essentially) DPDAs.

 

Context-free languages and their relationship to syntax analysis are the subjects of sections 4 and 5.


3         Lexical Analysis

 

In this section we study some theoretical concepts with respect to the class of regular languages and apply these concepts to the practical problem of lexical analysis. Firstly, in Section 3.1, we define the notion of a regular expression and show how regular expressions determine regular languages. We then, in Section 3.2, introduce deterministic finite automata (DFAs), the class of algorithms that solve the decision problems for regular languages. We show how regular expressions and DFAs can be used to specify and implement lexical analysers in Section 3.3, and in Section 3.4 we take a brief look at Lex, a popular lexical analyser generator built upon the theory of regular expressions and DFAs.

Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

 

 

3.1       Regular Expressions

 

Recall from the Introduction that a lexical analyser uses pattern matching with respect to rules associated with the source language's tokens. For example, the token then is associated with the pattern t, h, e, n, and the token id might be associated with the pattern "an alphabetic character followed by any number of alphanumeric characters". The notation of regular expressions is a mathematical formalism ideal for expressing patterns such as these, and thus ideal for expressing the lexical structure of programming languages.

 

3.1.1        Definition

 

Regular expressions represent patterns of strings of symbols. A regular expression r matches a set of strings over an alphabet.  This set is denoted L(r) and is called the language determined or generated by r.

Let Σ be an alphabet. We define the set RE(Σ) of regular expressions over Σ, the strings they match and thus the languages they determine, as follows:

 

   ∅ ∈ RE(Σ) matches no strings. The language determined is L(∅) = ∅.

   ǫ ∈ RE(Σ) matches only the empty string. Therefore L(ǫ) = {ǫ}.

   If a ∈ Σ then a ∈ RE(Σ) matches the string a. Therefore L(a) = {a}.

   if r and s are in RE(Σ) and determine the languages L(r) and L(s) respectively, then

 

r|s ∈ RE(Σ) matches all strings matched either by r or by s. Therefore, L(r|s) = L(r) ∪ L(s).


         rs ∈ RE(Σ) matches any string that is the concatenation of two strings, the first matching r and the second matching s. Therefore, the language determined is

L(rs) = L(r)L(s) = { uv ∈ Σ* : u ∈ L(r) and v ∈ L(s) }.

(Given two sets S1 and S2 of strings, the notation S1S2 denotes the set of all strings formed by appending members of S1 to members of S2.)

         r* ∈ RE(Σ) matches all finite concatenations of strings which each match r. The language determined is thus

L(r*) = (L(r))* = ⋃_{i∈N} (L(r))^i = {ǫ} ∪ L(r) ∪ L(r)L(r) ∪ · · ·

 

3.1.2        Regular Languages

 

Let L be a language over Σ. L is said to be a regular language if L = L(r) for some r ∈ RE(Σ).

 

3.1.3        Notation

 

   We need to use parentheses to override the convention concerning the precedence of the operators. The normal convention is: * is higher than concatenation, which is higher than |. Thus, for example, a|bc* is a|(b(c*)).

   We write r+ for rr*.

   We write r? for ǫ|r.

   We write r^n as an abbreviation for r . . . r (n times r), with r^0 denoting ǫ.

 

3.1.4        Lemma

 

Writing r = s to mean L(r) = L(s) for two regular expressions r, s RE(Σ), the following identities hold for all r, s, t RE(Σ):

 

   r|s = s|r (| is commutative)

   (r|s)|t = r|(s|t) (| is associative)

   (rs)t = r(st) (concatenation is associative)

   r(s|t) = rs|rt and (r|s)t = rt|st (concatenation distributes over |)

   ∅r = r∅ = ∅

   ∅|r = r|∅ = r

   ∅* = ǫ

   r?* = r*

   r** = r*

   (r*s*)* = (r|s)*

   ǫr = rǫ = r.

 

3.1.5        Regular definitions

 

It is often useful to give names to complex regular expressions, and to use these names in place of the expressions they represent. Given an alphabet comprising all ASCII characters,

letter = A|B| · · · |Z|a|b| · · · |z
digit  = 0|1| · · · |9
ident  = letter(letter|digit)*

are examples of regular definitions for letters, digits and identifiers.
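These definitions carry over almost verbatim to Python's re notation, where a character class such as [A-Za-z] abbreviates the long alternation of letters, and * and | have exactly the meanings defined above.

```python
import re

letter = "[A-Za-z]"          # A|B|...|Z|a|b|...|z
digit = "[0-9]"              # 0|1|...|9
ident = re.compile(f"{letter}({letter}|{digit})*")

# fullmatch asks whether the WHOLE string is in L(ident):
for lexeme in ["foo123bar", "x", "123abc"]:
    print(lexeme, bool(ident.fullmatch(lexeme)))
```

Using fullmatch rather than match mirrors the definition of L(r): the question is whether the entire string is generated by the expression, not merely a prefix.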

 

3.1.6        The Decision Problem for Regular Languages

 

For every regular expression r ∈ RE(Σ) there exists a string-processing machine M = M(r) such that for every w ∈ Σ*, when input to M:

(1)    if w ∈ L(r) then M terminates with output Yes, and

(2)    if w ∉ L(r) then M terminates with output No.

Thus, every regular language is decidable.

The machines in question are Deterministic Finite State Automata.

 

3.2       Deterministic Finite State Automata

 

In this section we define the notion of a DFA without reference to its application in lexical analysis. Here we are interested purely in solving the decision problem for regular languages; that is, defining machines that say "yes" or "no" to an input string, depending on its membership of a particular language. In Section 3.3 we use DFAs as the basis for lexical analysers: pattern matching algorithms that output sequences of tokens.


3.2.1        Definition

 

A deterministic finite state automaton (or DFA) M is a 5-tuple

M = (Q, Σ, δ, q0, F )

where

 

   Q is a finite non-empty set of states,

   Σ is an alphabet,

   δ : Q × Σ → Q is the transition or next-state function,

   q0 ∈ Q is the initial state, and

   F ⊆ Q is the set of accepting or final states.

 

The idea behind a DFA M is that it is an abstract machine that defines a language L(M) ⊆ Σ* in the following way:

 

   The machine begins in its start state q0;

   Given a string w ∈ Σ*, the machine reads the symbols of w one at a time from left to right;

   Each symbol causes the machine to make a transition from its current state to a new state; if the current state is q and the input symbol is a, then the new state is δ(q, a);

   The machine terminates when all the symbols of the string have been read;

   If, when the machine terminates, its state is a member of F , then the machine accepts w, else it rejects w.

 

Note that the name final state is not a good one, since a DFA does not terminate as soon as a final state has been entered. The DFA only terminates when all the input has been read.
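The acceptance procedure just described can be transcribed almost directly from the 5-tuple. The example DFA below, which accepts strings over {a, b} containing an even number of a's, is purely illustrative.

```python
def accepts(dfa, w):
    """Run DFA M = (Q, Sigma, delta, q0, F) on w; True iff M accepts w."""
    Q, Sigma, delta, q0, F = dfa
    state = q0                           # begin in the start state
    for symbol in w:                     # read symbols left to right
        if symbol not in Sigma:
            return False                 # symbol outside the alphabet
        state = delta[(state, symbol)]   # one transition per symbol
    return state in F                    # accept iff a final state

# DFA over {a, b} accepting words with an even number of a's.
even_as = (
    {"even", "odd"}, {"a", "b"},
    {("even", "a"): "odd", ("odd", "a"): "even",
     ("even", "b"): "even", ("odd", "b"): "odd"},
    "even", {"even"},
)
print(accepts(even_as, "abab"))  # → True
```

Note that the loop runs to the end of the input regardless of which states are visited, which is exactly the point made above about "final" states.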

We formalise this idea as follows:

Definition. Let M = (Q, Σ, δ, q0, F ) be a DFA. We define δ̂ : Q × Σ* → Q by

δˆ(q,