Compiler Design
1 Introduction
1.1 Compilation
Definition. Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).
[Diagram: source program → Compiler → target program, with error messages emitted alongside.]
An important part of any compiler is the detection and reporting of errors; this will be discussed in more detail later in the introduction. Commonly, the source language is a high-level programming language (i.e. a problem-oriented language), and the target language is a machine language or assembly language (i.e. a machine-oriented language). Thus compilation is a fundamental concept in the production of software: it is the link between the (abstract) world of application development and the low-level world of application execution on machines.
Types of Translators. An assembler is also a type of translator:
[Diagram: assembly program → Assembler → machine program.]
An interpreter is closely related to a compiler, but takes both the source program and its input data. The translation and execution phases of the source program are one and the same.
[Diagram: source program and input data → Interpreter → output data.]
Although the above types of translator are the best known, we also need knowledge of compilation techniques to deal with the recognition and translation of many other types of languages, including:
• Command-line interface languages;
• Typesetting / word processing languages (e.g. TEX);
• Natural languages;
• Hardware description languages;
• Page description languages (e.g. PostScript);
• Set-up or parameter files.
Early Development of Compilers.
1940s. Early stored-program computers were programmed in machine language. Later, assembly languages were developed where machine instructions and memory locations were given symbolic forms.
1950s. Early high-level languages were developed, for example FORTRAN. Although more problem-oriented than assembly languages, the first versions of FORTRAN still had many machine-dependent features. Techniques and processes involved in compilation were not well understood at this time, and compiler-writing was a huge task: e.g. the first FORTRAN compiler took 18 man-years of effort to write.
Chomsky’s study of the structure of natural languages
led to a
classification of languages
according to the complexity of their grammars. The context-free languages proved to be useful
in describing the syntax of programming languages.
1960s onwards. The study of the parsing problem for context-free languages during the 1960s and 1970s has led to efficient algorithms for the recognition of context-free languages. These algorithms, and associated software tools, are central to compiler construction today. Similarly, the theory of finite state machines and regular expressions (which correspond to Chomsky's regular languages) has proved useful for describing the lexical structure of programming languages.
From Algol 60 onwards, high-level languages have become more problem-oriented and machine independent, with features far removed from the machine languages into which they are compiled. The theory and tools available today make compiler construction a manageable task, even for complex languages. For example, your compiler assignment will take only a few weeks (hopefully) and will only be about 1000 lines of code (although, admittedly, the source language is small).
1.2 The Context of a Compiler
The complete process of compilation is illustrated as:
[Diagram: skeletal source program → preprocessor → source program → compiler → assembly program → assembler → relocatable m/c code → linker/loader → absolute m/c code.]
1.2.1 Preprocessors
Preprocessing performs (usually simple)
operations on the source file(s) prior to compilation. Typical preprocessing operations include:
(a) Expanding macros (shorthand notations for longer constructs). For example, in C,
#define foo(x,y) (3*x+y*(2+x))
defines a macro foo that, when used later in the program, is expanded by the preprocessor. For example, a = foo(a,b) becomes
a = (3*a+b*(2+a))
(b) Inserting named files. For example, in C,
#include "header.h"
is replaced by the contents of the file header.h
1.2.2 Linkers
A linker combines object code (machine
code that has not yet been linked) produced from compiling and assembling many source programs, as well as
standard library functions and resources
supplied by the operating system. This
involves resolving references in each object
file to external variables and procedures declared in other files.
1.2.3 Loaders
Compilers, assemblers and linkers usually produce code whose memory references are made relative to an undetermined starting
location that can be anywhere in memory (relocatable machine code). A loader calculates appropriate absolute addresses for these memory locations
and amends the code to use these addresses.
1.3 The Phases of a Compiler
The process of compilation is split up into six phases,
each of which interacts with a symbol table manager and an error handler. This is
called the analysis/synthesis model of
compilation. There are many variants on this model, but the essential elements
are the same.
[Diagram: source program passing through the six phases (lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimisation, code generation) to the target program, with each phase interacting with the symbol table manager and the error handler.]
1.3.1 Lexical Analysis
A lexical analyser or scanner is a program that groups sequences
of characters into lexemes, and outputs (to the syntax analyser) a sequence of tokens. Here:
(a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.
(b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g. the sequence i, f for the token if, and any sequence of alphanumerics starting with a letter for the token id.
(c) A lexeme is a sequence of characters from the input that matches a pattern (and hence constitutes an instance of a token); for example, if matches the pattern for if, and foo123bar matches the pattern for id.
For example, the following code might result in the table given below.
program foo(input,output);var x:integer;begin readln(x);writeln(’value read =’,x) end.
Lexeme            Token                      Pattern
program           program                    p, r, o, g, r, a, m
newlines, spaces, tabs
foo               id (foo)                   letter followed by seq. of alphanumerics
(                 leftpar                    a left parenthesis
input             input                      i, n, p, u, t
,                 comma                      a comma
output            output                     o, u, t, p, u, t
)                 rightpar                   a right parenthesis
;                 semicolon                  a semi-colon
var               var                        v, a, r
x                 id (x)                     letter followed by seq. of alphanumerics
:                 colon                      a colon
integer           integer                    i, n, t, e, g, e, r
;                 semicolon                  a semi-colon
begin             begin                      b, e, g, i, n
newlines, spaces, tabs
readln            readln                     r, e, a, d, l, n
(                 leftpar                    a left parenthesis
x                 id (x)                     letter followed by seq. of alphanumerics
)                 rightpar                   a right parenthesis
;                 semicolon                  a semi-colon
writeln           writeln                    w, r, i, t, e, l, n
(                 leftpar                    a left parenthesis
’value read =’    literal (’value read =’)   seq. of chars enclosed in quotes
,                 comma                      a comma
x                 id (x)                     letter followed by seq. of alphanumerics
)                 rightpar                   a right parenthesis
newlines, spaces, tabs
end               end                        e, n, d
.                 fullstop                   a fullstop
It is the sequence of tokens in the middle column that is passed as output to the syntax analyser. This token sequence represents almost all the important information from the input program required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often important in separating lexemes, is usually not returned as a token. Also, when outputting an id or literal token, the lexical analyser must also return the value of the matched lexeme (shown in parentheses) or else this information would be lost.
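To make this concrete, here is a minimal sketch of such a scanner, written in Python for illustration (the course compiler is not written this way). The token names follow the table above; the ordering of the patterns, the straight quotes in the literal pattern, and the error handling are simplifying assumptions of this sketch.

import re

# Token patterns, tried in order: keyword patterns must precede the
# more general identifier pattern.
TOKEN_PATTERNS = [
    ('program', r'program\b'), ('var', r'var\b'), ('begin', r'begin\b'),
    ('end', r'end\b'), ('integer', r'integer\b'), ('input', r'input\b'),
    ('output', r'output\b'), ('readln', r'readln\b'), ('writeln', r'writeln\b'),
    ('id', r'[A-Za-z][A-Za-z0-9]*'), ('literal', r"'[^']*'"),
    ('leftpar', r'\('), ('rightpar', r'\)'), ('comma', r','),
    ('semicolon', r';'), ('colon', r':'), ('fullstop', r'\.'),
]

def scan(source):
    # Yield (token, lexeme) pairs; whitespace separates lexemes but
    # is not itself returned as a token.
    pos = 0
    while pos < len(source):
        if source[pos].isspace():           # skip newlines, spaces, tabs
            pos += 1
            continue
        for token, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, source[pos:])
            if m:
                yield token, m.group()
                pos += m.end()
                break
        else:
            raise ValueError('lexical error at position ' + str(pos))

for token, lexeme in scan("program foo(input,output);"):
    print(token, lexeme)

Running this on the first lexemes of the example program prints the first eight (token, lexeme) pairs of the table above.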
1.3.2 Symbol Table Management
A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures etc.) of a source program together with all the attributes of each identifier.
For variables, typical attributes include:
• its type,
• how much memory it occupies,
• its scope.
For procedures and functions, typical
attributes include:
• the number and type of each argument (if any),
• the method of passing each argument, and
• the type of value returned (if any).
The purpose of the symbol
table is to provide quick
and uniform access
to identifier attributes throughout the compilation process. Information is usually put
into the symbol table during the lexical analysis and/or syntax analysis
phases.
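As an illustrative sketch (not a prescribed design), a symbol table can be little more than a map from identifiers to attribute records; the particular fields and values below are assumptions chosen to match the attributes listed above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class VarAttrs:
    type: str               # e.g. 'integer'
    size: int               # memory occupied, in bytes
    scope: str              # e.g. 'global' or the enclosing procedure

@dataclass
class ProcAttrs:
    params: list            # (type, passing method) for each argument, if any
    result: Optional[str]   # type of value returned, or None

symtab = {}
symtab['x'] = VarAttrs(type='integer', size=4, scope='global')
symtab['readln'] = ProcAttrs(params=[('integer', 'by reference')],
                             result=None)

print(symtab['x'].type, symtab['x'].size)   # quick, uniform attribute access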
1.3.3 Syntax Analysis
A syntax
analyser or parser is a program
that groups sequences of tokens from the lexical analysis phase into phrases each with an associated phrase type.
A phrase
is a logical unit with respect to the rules of the source language.
For example, consider:
a := x * y + z
After lexical
analysis, this statement has the structure
id1 assign id2 binop1 id3 binop2 id4
Now, a syntactic rule of Pascal is that there are objects called ‘expressions’ for which the rules are (essentially):
(1) Any constant or identifier is an expression.
(2) If exp1 and exp2 are expressions then so is exp1 binop exp2.
Taking all the identifiers to be variable
names for simplicity, we have:
• By rule (1), exp1 = id2 and exp2 = id3 are both phrases with phrase type ‘expression’;
• by rule (2), exp3 = exp1 binop1 exp2 is also a phrase with phrase type ‘expression’;
• by rule (1), exp4 = id4 is a phrase with phrase type ‘expression’;
• by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type ‘expression’.
Of course, Pascal also has a rule that says:
id assign exp
is a
phrase with phrase type ‘assignment’, and so the Pascal statement above is a
phrase of type ‘assignment’.
Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax tree is a compacted form of parse tree in which the operators appear as the interior nodes. The construction of a parse tree is a basic activity in compiler-writing.
A parse tree for the example Pascal statement is:
assignment
├── id1
├── assign
└── exp5
    ├── exp3
    │   ├── exp1
    │   │   └── id2
    │   ├── binop1
    │   └── exp2
    │       └── id3
    ├── binop2
    └── exp4
        └── id4
and a syntax tree is:
assign
├── id1
└── binop2
    ├── binop1
    │   ├── id2
    │   └── id3
    └── id4
Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary. The main criterion is whether the analyser needs recursion or not:
• lexical analysers hardly ever use recursion; they are sometimes called linear analysers since they scan the input in a ‘straight line’ (from left to right);
• syntax analysers almost always use recursion; this is because phrase types are often defined in terms of themselves (cf. the phrase type ‘expression’ above).
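The following Python sketch shows rules (1) and (2) and the assignment rule turned into a tiny parser that builds a syntax tree (operators as interior nodes). To reproduce the left-to-right grouping of the example, rule (2) is implemented here as a loop rather than a direct recursive call, and operator precedence is ignored; a fuller parser would recurse on the phrase type ‘expression’, e.g. for parenthesised subexpressions. The token representation is an assumption of this sketch.

def parse_atom(tokens, pos):
    # Rule (1): any constant or identifier is an expression.
    kind, value = tokens[pos]
    if kind not in ('id', 'const'):
        raise SyntaxError('expected identifier or constant')
    return value, pos + 1

def parse_exp(tokens, pos):
    # Rule (2): if exp1 and exp2 are expressions, so is exp1 binop exp2.
    # Grouping is left to right, as in the worked example above.
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == 'binop':
        op = tokens[pos][1]
        right, pos = parse_atom(tokens, pos + 1)
        node = (op, node, right)    # the operator becomes an interior node
    return node, pos

def parse_assignment(tokens):
    # Rule: id assign exp is a phrase of type 'assignment'.
    if tokens[0][0] != 'id' or tokens[1][0] != 'assign':
        raise SyntaxError('expected id := exp')
    exp, _ = parse_exp(tokens, 2)
    return (':=', tokens[0][1], exp)

tokens = [('id', 'a'), ('assign', ':='), ('id', 'x'),
          ('binop', '*'), ('id', 'y'), ('binop', '+'), ('id', 'z')]
print(parse_assignment(tokens))
# (':=', 'a', ('+', ('*', 'x', 'y'), 'z'))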
1.3.4 Semantic Analysis
A semantic analyser takes its input from the syntax analysis phase in
the form of a parse tree and a symbol
table. Its purpose is to determine if
the input has a well-defined meaning; in practice
semantic analysers are mainly concerned with type checking and type
coercion based on type rules. Typical
type rules for expressions and assignments are:
Expression Type Rules.
Let exp be an expression.
(a) If exp is a constant then exp is well-typed and its type is the type of the constant.
(b) If exp is a variable then exp is well-typed and its type is the type of the variable.
(c) If exp is an operator applied to further subexpressions such that:
(i) the operator is applied to the correct number of subexpressions,
(ii) each subexpression is well-typed and
(iii) each subexpression is of an appropriate type,
then exp is well-typed and its type is the result type of the operator.
Assignment Type Rules. Let var be a variable of type T1 and let
exp be a well-typed expression of type T2. If
(a) T1 = T2 and
(b) T1 is an assignable type
then var assign exp is a well-typed assignment. For example,
consider the following code fragment:
intvar := intvar + realarray
where intvar is stored in the symbol table as being an integer variable, and realarray as an array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect since + is only defined on numbers, whereas its second argument is an array. The semantic analyser checks for such type errors using the parse tree, the symbol table and type rules.
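As a sketch of how these type rules can be checked over a syntax tree, consider the following Python fragment. The symbol-table entries, the set of numeric types, and the rule that + yields real when either argument is real are assumptions for illustration.

# A leaf of the syntax tree is a variable name; an interior node is a
# triple (op, left, right), as in the parsing sketch above.
SYMTAB = {'intvar': 'integer', 'realarray': 'array of real'}
NUMERIC = {'integer', 'real'}

def type_of(exp):
    if isinstance(exp, str):            # rule (b): a variable
        return SYMTAB[exp]
    op, left, right = exp               # rule (c): an operator node
    t1, t2 = type_of(left), type_of(right)
    if op == '+' and t1 in NUMERIC and t2 in NUMERIC:
        return 'real' if 'real' in (t1, t2) else 'integer'
    raise TypeError(op + ' not defined on ' + t1 + ' and ' + t2)

def check_assignment(var, exp):
    # Assignment type rules (a) and (b); assignability is simplified
    # to type equality here.
    t1, t2 = SYMTAB[var], type_of(exp)
    if t1 != t2:
        raise TypeError('cannot assign ' + t2 + ' to ' + t1)

try:
    # intvar := intvar + realarray
    check_assignment('intvar', ('+', 'intvar', 'realarray'))
except TypeError as err:
    print('semantic error:', err)   # + not defined on integer and array of real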
1.3.5 Error Handling
Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors.
On detecting an error the compiler must:
• report the error in a helpful way,
• correct the error if possible, and
• continue processing (if possible) after the error to look for further errors.
Types of Error. Errors are either syntactic or semantic:
Syntax errors are errors in the program text; they may
be either lexical or grammatical:
(a) A lexical error is a mistake in a lexeme, for example, typing tehn instead of then, or missing off one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language, for example if x = 7 y := 4 (missing then).
Semantic errors are mistakes concerning the meaning of a program construct; they may be either type errors, logical errors or run-time errors:
(a) Type errors occur when an operator is applied to an argument of the wrong type, or to the wrong number of arguments.
(b) Logical errors occur when a badly conceived program is executed, for example: while x = y do ... when x and y initially have the same value and the body of the loop need not change the value of either x or y.
(c) Run-time errors are errors that can be detected only when the program is executed, for example:
var x : real; readln(x); writeln(1/x)
which would produce a run-time error if the user input 0.
Syntax errors must be detected by a
compiler and at least reported to the user (in a helpful way). If possible, the compiler should make the appropriate correction(s). Semantic errors are much harder and sometimes
impossible for a computer to detect.
1.3.6 Intermediate Code Generation
After the analysis phases of the
compiler have been completed, a source program has been decomposed into a symbol table
and a parse tree both of which may have been modified
by the semantic analyser. From this information we begin the process
of generating object code according to either of two approaches:
(1) generate code for a specific machine, or
(2) generate code for a ‘general’ or abstract machine, then use further translators to turn the abstract code into code for specific machines.
Approach (2) is more modular and efficient provided the abstract machine language is simple enough to:
(a) produce and analyse (in the optimisation phase), and
(b) be easily translated into the required language(s).
One of the most widely used intermediate languages is Three-Address Code (TAC).
TAC Programs. A TAC program is
a sequence of optionally labelled instructions. Some common TAC instructions include:
(i) var1 := var2 binop var3
(ii) var1 := unop var2
(iii) var1 := num
(iv) goto label
(v) if var1 relop var2 goto label
There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will use only the above for the following discussion.
Syntax-Directed Code Generation. In essence, code is generated by
recursively walking through a parse (or syntax) tree, and hence the process
is referred to as syntax-directed code generation. For example, consider the code fragment:
z := x * y + x
and its syntax tree (with lexemes replacing tokens):
:=
├── z
└── +
    ├── *
    │   ├── x
    │   └── y
    └── x
We use this tree to direct the compilation into TAC as follows.
At the root of the tree we see an assignment whose right-hand side is an expression, and this
expression is the sum of two quantities. Assume
that we can produce TAC code that computes the value of the first and second summands and stores these values in temp1 and temp2 respectively. Then the appropriate TAC for the assignment statement is just
z := temp1 + temp2
Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive way.
For temp1 we see that it is the product of two quantities. Assume that we can produce TAC code
that computes the value of the first and second multiplicands and stores these
values in temp3 and temp4 respectively. Then the appropriate TAC for the computing temp1 is
temp1 := temp3 * temp4
Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus the TAC code
temp3 := x
is sufficient. Next we come to temp4 and similar to temp3 the appropriate code is
temp4 := y
Finally, considering temp2, of course
temp2 := x
suffices.
Each code fragment is output when we
leave the corresponding node; this results in the final program:
temp3 := x
temp4 := y
temp1 := temp3 * temp4
temp2 := x
z := temp1 + temp2
Comment. Notice how a compound expression has been broken down and translated into a sequence of very simple instructions; furthermore, the process of producing the TAC code was uniform and simple. Some redundancy has been introduced into the TAC code, but this can be removed (along with redundancy that is not due to the TAC generation) in the optimisation phase.
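A minimal sketch of this syntax-directed walk, in Python: a leaf of the syntax tree is a variable name and an interior node is a triple (op, left, right), as in the earlier sketches. Code for a node is emitted on leaving it, exactly as described above; the numbering of the temporaries differs from the hand-worked example because this sketch allocates a temporary only when it emits an instruction.

code, counter = [], 0

def new_temp():
    global counter
    counter += 1
    return 'temp' + str(counter)

def gen(exp):
    # Emit TAC that computes exp; return the name holding its value.
    if isinstance(exp, str):           # a variable: copy it into a temp
        t = new_temp()
        code.append(t + ' := ' + exp)
        return t
    op, left, right = exp
    t1, t2 = gen(left), gen(right)     # code for the subtrees first
    t = new_temp()
    code.append(t + ' := ' + t1 + ' ' + op + ' ' + t2)   # on leaving the node
    return t

# z := x * y + x
result = gen(('+', ('*', 'x', 'y'), 'x'))
code.append('z := ' + result)
print('\n'.join(code))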
1.3.7 Code Optimisation
An optimiser attempts to improve the time and space requirements of a program. There
are many ways in which code can be optimised, but most are expensive in terms
of time and space to implement.
Common optimisations include:
• removing redundant identifiers,
• removing unreachable sections of code,
• identifying common subexpressions,
• unfolding loops, and
• eliminating procedures.
Note that here we are concerned with the general
optimisation of abstract
code.
Example. Consider the TAC code:
temp1 := x
temp2 := temp1
if temp1 = temp2 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp2 + temp3
Removing redundant identifiers (just temp2) gives
temp1 := x
if temp1 = temp1 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp1 + temp3
Removing redundant code gives
temp1 := x
200 temp3 := z
300 temp4 := temp1 + temp3
Notes. Attempting to find a ‘best’ optimisation is expensive for the following reasons:
• A given optimisation technique may have to be applied repeatedly until no further optimisation can be obtained. (For example, removing one redundant identifier may introduce another.)
• A given optimisation technique may give rise to other forms of redundancy, and thus sequences of optimisation techniques may have to be repeated. (For example, above we removed a redundant identifier and this gave rise to redundant code, but removing redundant code may lead to further redundant identifiers.)
• The order in which optimisations are applied may be significant. (How many ways are there of applying n optimisation techniques to a given piece of code?)
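As a sketch of the first step above (removing the redundant identifier temp2), here is a toy copy-propagation pass in Python over straight-line TAC. It assumes every instruction has the form lhs := rhs, treats an instruction as a removable copy only when its right-hand side is a single temporary, and ignores control flow (labels and gotos), which a real optimiser must respect.

def propagate_copies(program):
    subst, out = {}, []
    for instr in program:
        lhs, rhs = instr.split(' := ')
        # Substitute known copies into the right-hand side.
        words = [subst.get(w, w) for w in rhs.split()]
        if len(words) == 1 and words[0].startswith('temp'):
            subst[lhs] = words[0]      # a pure copy: record it, emit nothing
        else:
            out.append(lhs + ' := ' + ' '.join(words))
    return out

prog = ['temp1 := x',
        'temp2 := temp1',
        'temp3 := temp1 * y',
        'temp4 := temp2 + temp3']
for instr in propagate_copies(prog):
    print(instr)
# temp1 := x
# temp3 := temp1 * y
# temp4 := temp1 + temp3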
1.3.8 Code Generation
The final phase of the compiler is to
generate code for a specific machine. In
this phase we consider:
• memory management,
• register assignment, and
• machine-specific optimisation.
The output from this phase is usually
assembly language or relocatable machine code.
Example. The TAC code above could
typically result in the ARM assembly program shown below. Note that the example
illustrates a mechanical translation of TAC into ARM; it is not intended to illustrate compact ARM programming!
.x      EQUD  0             four bytes for x
.z      EQUD  0             four bytes for z
.temp   EQUD  0             four bytes each for temp1,
        EQUD  0             temp3, and
        EQUD  0             temp4.
.prog   MOV   R12,#temp     R12 = base address
        MOV   R0,#x         R0 = address of x
        LDR   R1,[R0]       R1 = value of x
        STR   R1,[R12]      store R1 at R12
        MOV   R0,#z         R0 = address of z
        LDR   R1,[R0]       R1 = value of z
        STR   R1,[R12,#4]   store R1 at R12+4
        LDR   R1,[R12]      R1 = value of temp1
        LDR   R2,[R12,#4]   R2 = value of temp3
        ADD   R3,R1,R2      add temp1 to temp3
        STR   R3,[R12,#8]   store R3 at R12+8
2 Languages
In this section
we introduce the formal notion
of a language, and the basic problem
of recognising strings from a language. These are central concepts that we will use
throughout the remainder of the course.
Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.
2.1 Basic Definitions
• An alphabet Σ is a finite non-empty set (of symbols).
• A string or word over an alphabet Σ is a finite concatenation (or juxtaposition) of symbols from Σ.
• The length of a string w (that is, the number of characters comprising it) is denoted |w|.
• The empty or null string is denoted ǫ. (That is, ǫ is the unique string satisfying |ǫ| = 0.)
• The set of all strings over Σ is denoted Σ∗.
• For each n ≥ 0 we define Σn = {w ∈ Σ∗ | |w| = n}.
• We define Σ+ = ∪n≥1 Σn. (Thus Σ∗ = Σ+ ∪ {ǫ}.)
• For a symbol or word x, xn denotes x concatenated with itself n times, with the convention that x0 denotes ǫ.
• A language over Σ is a set L ⊆ Σ∗.
• Two languages L1 and L2 over a common alphabet Σ are equal if they are equal as sets. Thus L1 = L2 if, and only if, L1 ⊆ L2 and L2 ⊆ L1.
2.2 Decidability
Given a language
L over some alphabet Σ, a basic question is: For each possible word w ∈ Σ∗, can we effectively decide if w is a member of L or not? We call this the decision problem
for L.
Note the use of the word ‘effectively’: this implies the mechanism by which we decide on membership (or non-membership) must be a finitistic, deterministic and mechanical procedure
that can be carried out by some form of
computing agent. Also note the decision
problem asks if a given word is a
member of L or not; that
is, it is not sufficient to be only able to decide when words are members of L.
More precisely then, a language L ⊆ Σ∗ is said to be decidable if there exists an algorithm such that for every w ∈ Σ∗
(1) the algorithm terminates with output ‘Yes’ when w ∈ L, and
(2) the algorithm terminates with output ‘No’ when w ∉ L.
If no such algorithm exists then L is said to be undecidable.
Note. Decidability is based on the notion of an ‘algorithm’. In standard theoretical computer
science this is taken to mean a Turing Machine; this is an abstract, but extremely low-level model of computation that is equivalent to
a digital computer with an infinite memory.
Thus it is sufficient in practice to
use a more convenient model of computation such as Pascal programs provided
that any decidability arguments we make assume an infinite memory.
Example. Let Σ = {0, 1} be an alphabet.
Let L be the (infinite) language
L = {w ∈ Σ∗ | w = 0n1 for some n}.
Does the program below solve the decision
problem for L?
read( char );
if char = END_OF_STRING then
    print( "No" )
else  /* char must be ’0’ or ’1’ */
    while char = ’0’ do
        read( char )
    od;
    /* char must be ’1’ or END_OF_STRING */
    if char = ’1’ then
        print( "Yes" )
    else
        print( "No" )
    fi
fi
Answer: ....
2.3 Basic Facts
(1) Every finite language is decidable. (Hence every undecidable language is infinite.)
(2) Not every infinite language is undecidable.
(3) Programming languages are (usually) infinite but (always) decidable. (Why?)
2.4 Applications to Compilation
Languages may be classified by the means in which they are defined. Of interest to us are
regular languages and context-free languages.
Regular Languages. The significant aspects
of regular languages
are:
• they are defined by ‘patterns’ called regular expressions;
• every regular language is decidable;
• the decision problem for any regular language is solved by a deterministic finite state automaton (DFA); and
• programming languages’ lexical patterns are specified using regular expressions, and lexical analysers are (essentially) DFAs.
Regular languages and their relationship to lexical analysis are the subjects of the next section.
Context-Free Languages. The significant aspects of context-free languages are:
• they are defined by ‘rules’ called context-free grammars;
• every context-free language is decidable;
• the decision problem for any context-free language of interest to us is solved by a deterministic push-down automaton (DPDA); and
• programming language syntax is specified using context-free grammars, and (most) parsers are (essentially) DPDAs.
Context-free languages and their relationship to syntax analysis
are the subjects
of sections 4 and 5.
3 Lexical Analysis
In this section
we study some theoretical concepts
with respect to the class of regular languages and apply these concepts to the practical problem
of lexical analysis. Firstly, in Section 3.1, we define the notion of a regular expression and show how regular expressions determine regular languages. We then, in Section 3.2,
introduce deterministic
finite automata (DFAs), the class of algorithms that solve the decision
problems for regular languages. We
show how regular expressions and DFAs
can be used to specify and implement lexical analysers in Section 3.3, and in Section 3.4 we take a brief look at
Lex, a popular lexical analyser generator built upon the theory of regular expressions and DFAs.
Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.
3.1 Regular Expressions
Recall from the Introduction that a
lexical analyser uses pattern
matching with respect to rules associated with the source language’s tokens. For example, the token then is associated with the pattern t, h,
e,
n, and the token id might be associated with the pattern “an alphabetic character followed
by any number of alphanumeric characters”. The notation of regular
expressions is a mathematical formalism ideal for expressing patterns such as
these, and thus ideal for expressing the lexical structure
of programming languages.
3.1.1 Definition
Regular expressions represent patterns
of strings of symbols. A regular expression r matches a set of
strings over an alphabet. This set is
denoted L(r) and is called the
language determined or generated by r.
Let Σ be an alphabet. We define the set RE(Σ) of regular expressions over Σ, the strings
they match and thus the languages they determine, as follows:
• ∅ ∈ RE(Σ) matches no strings. The language determined is L(∅) = ∅.
• ǫ ∈ RE(Σ) matches only the empty string. Therefore L(ǫ) = {ǫ}.
• If a ∈ Σ then a ∈ RE(Σ) matches the string a. Therefore L(a) = {a}.
• If r and s are in RE(Σ) and determine the languages L(r) and L(s) respectively, then
– r|s ∈ RE(Σ) matches all strings matched either by r or by s. Therefore, L(r|s) = L(r) ∪ L(s).
– rs ∈ RE(Σ) matches any string that is the concatenation of two strings, the first matching r and the second matching s. Therefore, the language determined is
L(rs) = L(r)L(s) = {uv ∈ Σ∗ | u ∈ L(r) and v ∈ L(s)}.
(Given two sets S1 and S2 of strings, the notation S1S2 denotes the set of all strings formed by appending members of S1 with members of S2.)
– r∗ ∈ RE(Σ) matches all finite concatenations of strings which all match r. The language denoted is thus
L(r∗) = (L(r))∗ = ∪i∈N (L(r))i = {ǫ} ∪ L(r) ∪ L(r)L(r) ∪ · · ·
3.1.2 Regular Languages
Let L be a language
over Σ. L is said to be a regular language
if L = L(r) for some r ∈ RE(Σ).
3.1.3 Notation
• We need to use parentheses to override the convention concerning the precedence of the operators. The normal convention is: ∗ is higher than concatenation, which is higher than |. Thus, for example, a|bc∗ is a|(b(c∗)).
• We write r+ for rr∗.
• We write r? for ǫ|r.
• We write rn as an abbreviation for r . . . r (n times r), with r0 denoting ǫ.
3.1.4 Lemma
Writing ‘r = s’ to mean ‘L(r)
= L(s)’ for two regular
expressions r, s ∈ RE(Σ), the following identities hold for all r, s, t ∈ RE(Σ):
• r|s = s|r (| is commutative)
• (r|s)|t = r|(s|t) (| is associative)
• (rs)t = r(st) (concatenation is associative)
• r(s|t) = rs|rt and (r|s)t = rt|st (concatenation distributes over |)
• ∅r = r∅ = ∅
• ∅|r = r
• ∅∗ = ǫ
• r?∗ = r∗
• r∗∗ = r∗
• (r∗s∗)∗ = (r|s)∗
• ǫr = rǫ = r.
3.1.5 Regular definitions
It is often useful to give names to complex regular expressions, and to use these names in place of the expressions they represent. Given an alphabet comprising all ASCII characters,
letter = A|B| · · · |Z|a|b| · · · |z
digit = 0|1| · · · |9
ident = letter(letter|digit)∗
are examples of regular definitions for letters, digits and identifiers.
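These regular definitions carry over almost verbatim into practical pattern notations. As a sketch, Python's re module can test membership of L(ident) directly; fullmatch requires the whole string to match, mirroring our definition of L(r).

import re

# The regular definitions above, transcribed into Python's notation.
letter = '[A-Za-z]'
digit = '[0-9]'
ident = letter + '(' + letter + '|' + digit + ')*'

for w in ['foo123bar', 'x', '123abc']:
    print(w, bool(re.fullmatch(ident, w)))
# foo123bar True, x True, 123abc False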
3.1.6 The Decision Problem for Regular Languages
For every regular expression r ∈ RE(Σ) there exists a string-processing machine M = M(r) such that for every w ∈ Σ∗, when input to M:
(1) if w ∈ L(r) then M terminates with output ‘Yes’, and
(2) if w ∉ L(r) then M terminates with output ‘No’.
Thus, every regular language is decidable.
The machines
in question are Deterministic Finite
State Automata.
3.2 Deterministic Finite State Automata
In this section we define the notion of
a DFA without reference to its application in lexical analysis. Here we are interested purely in solving the decision
problem for regular languages; that is, defining machines that say “yes” or “no” given an inputted string, depending on its membership of a particular language. In Section 3.3 we use DFAs as the basis
for lexical analysers: pattern matching algorithms that output sequences
of tokens.
3.2.1 Definition
A deterministic finite
state automaton (or DFA) M is a 5-tuple
M = (Q, Σ, δ, q0, F )
where
• Q is a finite non-empty set of states,
• Σ is an alphabet,
• δ : Q × Σ → Q is the transition or next-state function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is the set of accepting or final states.
The idea behind a DFA M is that it is an abstract
machine that defines
a language L(M) ⊆ Σ∗ in the following
way:
• The machine begins in its start state q0;
• Given a string w ∈ Σ∗ the machine reads the symbols of w one at a time from left to right;
• Each symbol causes the machine to make a transition from its current state to a new state; if the current state is q and the input symbol is a, then the new state is δ(q, a);
• The machine terminates when all the symbols of the string have been read;
• If, when the machine terminates, its state is a member of F, then the machine accepts w, else it rejects w.
Note the name ‘final state’ is not a good one since a DFA does not terminate as soon as a ‘final state’ has been entered. The DFA only terminates when all the input has been read.
We formalise this idea as follows:
Definition. Let M = (Q, Σ, δ, q0, F) be a DFA. We define δ̂ : Q × Σ∗ → Q by
δ̂(q, ǫ) = q
for each q ∈ Q, and
δ̂(q, aw) = δ̂(δ(q, a), w)
for each q ∈ Q, a ∈ Σ and w ∈ Σ∗.
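A DFA and its acceptance test are straightforward to simulate. In the Python sketch below, δ is encoded as a dictionary on (state, symbol) pairs, so Q and Σ are implicit in its keys; the particular machine, which solves the decision problem for the language 0∗1 of Section 2.2, is an assumption for illustration.

def accepts(delta, q0, F, w):
    # The acceptance procedure of Section 3.2.1: start in q0, make one
    # transition per input symbol, accept iff the final state is in F.
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q in F

# A DFA for the language 0*1 over {0, 1}; 'dead' is a non-accepting
# trap state entered once the input can no longer be in the language.
delta = {('start', '0'): 'start', ('start', '1'): 'seen1',
         ('seen1', '0'): 'dead',  ('seen1', '1'): 'dead',
         ('dead', '0'):  'dead',  ('dead', '1'):  'dead'}

for w in ['1', '0001', '011', '']:
    print(repr(w), accepts(delta, 'start', {'seen1'}, w))
# '1' True, '0001' True, '011' False, '' False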