The aim of this project is to provide an example of a static code analyser and to remind people of the basics of creating languages.
The analyser has been implemented in Scala for several reasons:
Scala natively supports Java libraries, including ANTLR, which is one of the best-known libraries for writing programming languages.
Scala supports Algebraic data types and has powerful Pattern matching capabilities, which make it very easy to define a language and perform analyses on it.
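To illustrate why these features matter, here is a minimal sketch (the mini-language below is hypothetical, not the one from this project) of an algebraic data type and a pattern-matching evaluator:

```scala
// Hypothetical mini-AST: sealed traits + case classes form an ADT,
// so the compiler can check that the match below is exhaustive.
sealed trait Expr
case class Num(value: Int)       extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr

def eval(e: Expr): Int = e match {
  case Num(v)    => v
  case Add(l, r) => eval(l) + eval(r)
  case Mul(l, r) => eval(l) * eval(r)
}

// (1 + 2) * 3
println(eval(Mul(Add(Num(1), Num(2)), Num(3)))) // prints 9
```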
You can install the project using Gradle and build the application with the following commands:
Windows
./gradlew.bat build
Linux & MacOS
./gradlew build
The application will be built as the build/libs/ExampleAnalyser-X.X.X-all.jar file.
Tests
You can run the tests of the project using Gradle with the following commands:
Windows
./gradlew.bat test
Linux & MacOS
./gradlew test
Syntax
The syntax of a programming language determines what constitutes a valid program in terms of text. In almost all cases for programming languages, this syntax is defined using a Context-free grammar. In the course and in this example, the grammars are defined formally using the EBNF (extended Backus-Naur form) notation.
Grammar
In this example, we will use the following grammar to define the syntax of our language:
By using this grammar, it is possible to check whether a text follows a certain format and therefore to verify the syntax of the programming language we want. However, this grammar cannot be directly converted into code that automatically checks whether a text follows the syntax of our language. This is why we usually use tools such as ANTLR to do this for us. Even if this involves rewriting the grammar in a format that the tool supports, using ANTLR saves a lot of time by generating code that can recognise a language automatically.

You can find the grammar described above in the format supported by ANTLR if you go there. If you open it, you'll see that it looks very similar to our grammar in EBNF notation. The main difference is that the file is divided into two parts: the lexer part and the parser part. The lexer is responsible for converting text into words (tokens) and removing unnecessary characters. The order of the rules is important here because the lexer will use the first matching rule to create the words. This is why the keywords are placed above the identifiers; without this, ANTLR would not be able to recognise the keywords properly. Then there is the parser, which is responsible for converting the sequence of words created by the lexer into a CST (Concrete Syntax Tree). This CST supports the Visitor design pattern, which makes it easy to browse it for the information we want.
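As a hedged illustration of this lexer-rule ordering (the rule names below are invented for this example and are not taken from the project's grammar file), keyword rules must appear before the identifier rule in ANTLR notation:

```antlr
// Illustrative lexer rules: keywords must come before IDENTIFIER,
// otherwise "int" or "while" would be matched as identifiers.
INT        : 'int' ;
BOOL       : 'bool' ;
WHILE      : 'while' ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_]* ;
WS         : [ \t\r\n]+ -> skip ; // the lexer drops whitespace
```

ANTLR prefers the longest match, and breaks ties by rule order, which is why this ordering matters for inputs like `int` versus `intx`.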
Concrete Syntax Tree
Here is an example of how some code can be converted into a CST using ANTLR and our grammar:
int a;
a = {
int b;
b = -a + b - 7;
3 * 4 / b
};
As you can see, with just 6 lines of code, this already represents a fairly large CST (if you are interested, this CST was generated using the ANTLR extension for IntelliJ). In addition, this tree has a strange shape because of the rules on expressions. Even if this seems odd, it is actually intentional: it enforces the Order of operations directly at the syntax level, something that can be found in almost all programming languages (Example with the Python grammar).
Semantics
Once the syntactic analysis has been performed using ANTLR, it is then possible to perform a Semantic analysis. The semantics of a programming language define what has a meaning, i.e. which operations make sense and which don't, what happens when an instruction is executed, and so on. For example, does it make sense to assign a boolean expression to a variable of type int, given that the syntax allows it? In the case of our language, we would rather display an error to the user to indicate that it doesn't make sense, because there is a mismatch between the type of the variable and the type of the expression. That's why we need to formally define the semantics of our language.
Semantic rules
The semantics of our language are defined by the rules of inference below:
The notation $< Var >$ corresponds to the set of variables in the program.
The symbol $\oplus$ corresponds to the operators +, -, *, /, <, >, <=, >=, != and ==, as well as the logical operators and and or, with their usual mathematical semantics.
The symbol $\mathcal{E}$ corresponds to the set of all possible environments.
The symbol $\sigma$, with $\sigma \in \mathcal{E}$ and $\sigma : < Var > \mapsto \mathbb{Z} \cup \{True, False\}$, corresponds to the environment of the function currently being executed.
The symbol $\Sigma$, with $\Sigma = \langle \sigma_0, ..., \sigma_n \rangle$, corresponds to the execution stack which is a sequence of environments.
The notation $\Sigma \bullet \sigma$ splits the execution stack into the environment of the function currently being executed $\sigma$ and the rest of the execution stack $\Sigma$.
The notation $(e, \Sigma \bullet \sigma) \leadsto v$ corresponds to the evaluation of an expression $e$ with respect to an environment $\sigma$ and the rest of the execution stack $\Sigma$ and which produces $v$ as a value.
The notation $\sigma[x \mapsto v]$ corresponds to the update of the environment $\sigma$ with the fact that $v$ is associated to $x$.
The notation $(I, \Sigma \bullet \sigma) \leadsto \Sigma \bullet \sigma'$ corresponds to the fact of executing an instruction $I$ with respect to a state $\Sigma \bullet \sigma$ and yielding a state $\Sigma \bullet \sigma'$.
The notation $stdout(v)$ corresponds to the fact that the value $v$ has been printed to $stdout$.
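The notation above can be mirrored almost one-to-one in Scala. Here is a minimal sketch, with our own names rather than the project's actual code, assuming values are integers or booleans:

```scala
// A value is an integer or a boolean, matching Z ∪ {True, False}.
sealed trait Value
case class IntVal(i: Int)      extends Value
case class BoolVal(b: Boolean) extends Value

// σ : <Var> -> values; Σ • σ is a stack of environments with σ on top.
type Env = Map[String, Value]
type Stack = List[Env] // the head plays the role of σ

// σ[x -> v]: update the environment of the function currently executing.
def update(stack: Stack, x: String, v: Value): Stack = stack match {
  case sigma :: rest => (sigma + (x -> v)) :: rest
  case Nil           => sys.error("empty execution stack")
}

val s0: Stack = List(Map("a" -> IntVal(0)))
val s1 = update(s0, "a", IntVal(7))
println(s1.head("a")) // prints IntVal(7)
```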
With these semantic rules, it is now possible to perform our semantic analysis. In statically typed languages like the one in our example, this can be done statically using the Visitor design pattern and the CST defined earlier. If the programmer tries to perform an operation that doesn't make sense, an error is produced in the visitor and then displayed to the programmer. To help us in this step, it is usually a good idea to use a Symbol table to store the variables that have been defined and their type, potential functions (even if there aren't any in our case), and so on. The symbol table used in this project can be found here. In general, it is also during this step that we try to simplify our CST into an AST (Abstract Syntax Tree) that allows us to keep only the information we need and to create a set of data structures that are easier to use in the rest of our application. The code that performs the semantic analysis and creates the AST can be found here.
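As a hedged sketch of such a symbol table (the class below is illustrative and not the project's actual implementation), nested scopes can be modelled with a parent pointer:

```scala
// Types tracked by the symbol table.
sealed trait Type
case object IntType  extends Type
case object BoolType extends Type

// A symbol table scope: declarations live in this scope,
// lookups fall back to the enclosing scope.
class SymbolTable(parent: Option[SymbolTable] = None) {
  private var symbols = Map.empty[String, Type]

  // Returns false on a duplicate declaration in the same scope.
  def declare(name: String, t: Type): Boolean =
    if (symbols.contains(name)) false
    else { symbols += (name -> t); true }

  def lookup(name: String): Option[Type] =
    symbols.get(name).orElse(parent.flatMap(_.lookup(name)))

  def openScope(): SymbolTable = new SymbolTable(Some(this))
}

val globals = new SymbolTable()
globals.declare("a", IntType)
val block = globals.openScope() // e.g. entering a { S; E } block
block.declare("b", BoolType)
println(block.lookup("a")) // prints Some(IntType)
```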
Abstract Syntax Tree
In our example, the AST is defined using Scala classes that allow easy pattern matching and the use of visitors. These include sequences, instructions, types and expressions. Regarding the visitors, there is one for each type of AST class, as this allows us to return different types depending on what we need and not necessarily have to visit everything every time. In addition, a data parameter has been added because we often need to pass data from one node to another, and this parameter makes this easier.
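As a hedged sketch of this visitor design (the class names below are illustrative, not the project's actual AST), a typed visitor with a data parameter might look like:

```scala
sealed trait Expr
case class Const(value: Int)                   extends Expr
case class Var(name: String)                   extends Expr
case class BinOp(op: String, l: Expr, r: Expr) extends Expr

// One visitor per AST family; the return type T and the data
// parameter D vary with what each analysis needs.
trait ExprVisitor[T, D] {
  def visit(e: Expr, data: D): T = e match {
    case c: Const => visitConst(c, data)
    case v: Var   => visitVar(v, data)
    case b: BinOp => visitBinOp(b, data)
  }
  def visitConst(c: Const, data: D): T
  def visitVar(v: Var, data: D): T
  def visitBinOp(b: BinOp, data: D): T
}

// Example: count variable occurrences, threading no data (Unit).
object VarCounter extends ExprVisitor[Int, Unit] {
  def visitConst(c: Const, d: Unit) = 0
  def visitVar(v: Var, d: Unit)     = 1
  def visitBinOp(b: BinOp, d: Unit) = visit(b.l, d) + visit(b.r, d)
}

// a + (b * 3) contains two variables.
println(VarCounter.visit(BinOp("+", Var("a"), BinOp("*", Var("b"), Const(3))), ())) // prints 2
```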
Here is what the AST looks like after being built from the CST mentioned above:
Zero analysis
In this section, we will define a Zero Analysis, similar to the one used in the course, using the Worklist algorithm implemented here. The file containing the entire analysis code is available here.
Abstract values & Lattice
First, we need to define the set of abstract values $L$ that we are going to use, and the lattice that is used to order these values:
$$
L = \{Bottom, Z, NZ, U\}
$$
with:
$Bottom$ representing the fact that there is no value assigned yet.
$Z$ representing zero.
$NZ$ representing any value different from zero.
$U$ representing the fact that it is not known whether the value is zero or different from zero.
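A minimal Scala sketch of this lattice (with our own naming, not necessarily the project's actual code), including the join ($\sqcup$) and meet ($\sqcap$) operations used later:

```scala
// Abstract values of the zero analysis, ordered Bottom ≤ Z, NZ ≤ U.
sealed trait AbsVal
case object Bottom extends AbsVal
case object Z      extends AbsVal
case object NZ     extends AbsVal
case object U      extends AbsVal

// Least upper bound (⊔), used when merging values from several branches.
def join(a: AbsVal, b: AbsVal): AbsVal = (a, b) match {
  case (Bottom, x)      => x
  case (x, Bottom)      => x
  case (x, y) if x == y => x
  case _                => U
}

// Greatest lower bound (⊓), used by the condition update function.
def meet(a: AbsVal, b: AbsVal): AbsVal = (a, b) match {
  case (U, x)           => x
  case (x, U)           => x
  case (x, y) if x == y => x
  case _                => Bottom
}

println(join(Z, NZ)) // prints U
println(meet(U, Z))  // prints Z
```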
Control-flow graph
Next, we need to convert our AST into a CFG (Control-flow graph). The code that achieves this can be found here.
A CFG is composed of the following elements:
$PRED(p)$ : A function which returns the predecessors of the program point $p$.
$SUCC(p)$ : A function which returns the successors of the program point $p$.
$COND(p, p')$ : A function which returns a boolean expression that must be $True$ to go from the program point $p$ to the program point $p'$.
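As a hedged sketch (the representation below is our own, not the project's actual code), a CFG with these three components can be modelled as:

```scala
// Program points are integers; a placeholder type stands in for the
// boolean expressions guarding edges.
type Point = Int
case class Cond(expr: String)

case class CFG(
  succ: Map[Point, Set[Point]],   // SUCC(p)
  cond: Map[(Point, Point), Cond] // COND(p, p')
) {
  // PRED(p) is derived from SUCC by inverting the edges.
  def pred(p: Point): Set[Point] =
    succ.collect { case (q, ts) if ts.contains(p) => q }.toSet
}

// Tiny graph: 0 -> 1 -> 2, with a guard on the edge 1 -> 2.
val g = CFG(
  succ = Map(0 -> Set(1), 1 -> Set(2), 2 -> Set.empty),
  cond = Map((1, 2) -> Cond("b != 0"))
)
println(g.pred(2)) // prints Set(1)
```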
Abstract environment & Control-flow function
After that, we can define our abstract environment:
$$
\phi: < Var > \mapsto L
$$
with:
The notation $< Var >$ corresponds to the set of variables in the program.
Next, we can define our control-flow function by specifying its instances:
$$
\begin{align}
& fg [[ p ]] (\phi) = & \phi[x \mapsto Z] & \quad if & P[p] \equiv int\ x & \\
& & \phi[x \mapsto NZ] & \quad if & P[p] \equiv bool\ x & \\
& & f [[ P[p] ]] (\phi) & \quad if & P[p] \equiv x = E & \\
& & \phi[x \mapsto U] & \quad if & P[p] \equiv x = \{S; E\} & \\
& & \phi & \quad if & P[p] \equiv while (E) \lor P[p] \equiv if (E) \lor P[p] \equiv print \ E & \\
\end{align}
$$
with:
$$
\begin{align}
& f [[ x = 0 ]] (\phi) = & \phi[x \mapsto Z] & & & \\
& f [[ x = c ]] (\phi) = & \phi[x \mapsto NZ] & \quad if & c \in \mathbb{Z}_0 \cup \{True, False\} & \\
& f [[ x = y ]] (\phi) = & \phi[x \mapsto \phi(y)] & \quad if & y \in < Var > & \\
& f [[ x = c + d ]] (\phi) = & \phi[x \mapsto Z] & \quad if & c = -d & \\
& & \phi[x \mapsto U] & \quad otherwise & & \\
& f [[ x = y + z ]] (\phi) = & \phi[x \mapsto Z] & \quad if & \phi(y) = \phi(z) = Z & \\
& & \phi[x \mapsto U] & \quad otherwise & & \\
& f [[ x = y + c ]] (\phi) = & \phi[x \mapsto Z] & \quad if & \phi(y) = Z \wedge c = 0 & \\
& & \phi[x \mapsto NZ] & \quad if & \phi(y) = Z \wedge c \neq 0 & \\
& & \phi[x \mapsto NZ] & \quad if & \phi(y) = NZ \wedge c = 0 & \\
& & \phi[x \mapsto U] & \quad otherwise & & \\
& f [[ x = c + y ]] (\phi) = & f [[ x = y + c ]] (\phi) & & & \\
& f [[ x = y \oplus z ]] (\phi) = & \phi[x \mapsto U] & \quad if & \oplus \neq + & \\
\end{align}
$$
These control-flow function instances are almost the same as the ones seen in the course; they were deliberately made less precise than they could be.
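Pattern matching makes such case-by-case definitions direct to implement. Here is a hedged sketch of a few of the instances above (our own names, not the project's actual code):

```scala
// Abstract values (Bottom ≤ Z, NZ ≤ U) and the abstract environment φ.
sealed trait AbsVal
case object Bottom extends AbsVal
case object Z      extends AbsVal
case object NZ     extends AbsVal
case object U      extends AbsVal
type Phi = Map[String, AbsVal]

// A few illustrative assignment shapes from the instances above.
sealed trait Assign
case class AssignConst(x: String, c: Int)                 extends Assign // x = c (covers x = 0)
case class AssignVar(x: String, y: String)                extends Assign // x = y
case class AssignAddVars(x: String, y: String, z: String) extends Assign // x = y + z

// f [[ · ]] (φ): the transfer function for assignments.
def f(a: Assign, phi: Phi): Phi = a match {
  case AssignConst(x, c) => phi + (x -> (if (c == 0) Z else NZ))
  case AssignVar(x, y)   => phi + (x -> phi.getOrElse(y, Bottom))
  case AssignAddVars(x, y, z) =>
    // x is Z only when both operands are known to be zero, U otherwise.
    val v = if (phi.get(y) == Some(Z) && phi.get(z) == Some(Z)) Z else U
    phi + (x -> v)
}

val phi0: Phi = Map("y" -> Z, "z" -> Z)
println(f(AssignAddVars("x", "y", "z"), phi0)("x")) // prints Z
```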
Condition update function
Moreover, the Worklist algorithm also requires a function to update the abstract environments according to the boolean expressions in the $COND(p, p')$ function.
Here are the function instances that we will use in our analysis:
$$
\begin{align}
& cg[[ False ]] (\phi) = & \bot & & \\
& cg[[ y < c ]] (\phi) = & \phi[y \mapsto NZ] & \quad if & c \leq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c \leq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \phi[y \mapsto U] & \quad if & c > 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap U] & \quad if & c > 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap U \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c < y ]] (\phi) = & cg[[ y > c ]] (\phi) & & & \\
& cg[[ y > c ]] (\phi) = & \phi[y \mapsto NZ] & \quad if & c \geq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c \geq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \phi[y \mapsto U] & \quad if & c < 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap U] & \quad if & c < 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap U \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c > y ]] (\phi) = & cg[[ y < c ]] (\phi) & & & \\
& cg[[ y <= c ]] (\phi) = & \phi[y \mapsto NZ] & \quad if & c < 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c < 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \phi[y \mapsto U] & \quad if & c \geq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap U] & \quad if & c \geq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap U \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c <= y ]] (\phi) = & cg[[ y >= c ]] (\phi) & & & \\
& cg[[ y >= c ]] (\phi) = & \phi[y \mapsto NZ] & \quad if & c > 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c > 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \phi[y \mapsto U] & \quad if & c \leq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap U] & \quad if & c \leq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap U \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c >= y ]] (\phi) = & cg[[ y <= c ]] (\phi) & & & \\
& cg[[ y == c ]] (\phi) = & \phi[y \mapsto Z] & \quad if & c = 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap Z] & \quad if & c = 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap Z \neq \bot & \\
& & \phi[y \mapsto NZ] & \quad if & c \neq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c \neq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c == y ]] (\phi) = & cg[[ y == c ]] (\phi) & & & \\
& cg[[ y\ != c ]] (\phi) = & \phi[y \mapsto NZ] & \quad if & c = 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap NZ] & \quad if & c = 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap NZ \neq \bot & \\
& & \phi[y \mapsto U] & \quad if & c \neq 0 \wedge \phi(y) = \bot & \\
& & \phi[y \mapsto \phi(y) \sqcap U] & \quad if & c \neq 0 \wedge \phi(y) \neq \bot \wedge \phi(y) \sqcap U \neq \bot & \\
& & \bot & \quad otherwise & & \\
& cg[[ c\ != y ]] (\phi) = & cg[[ y\ != c ]] (\phi) & & & \\
& cg[[ E ]] (\phi) = & \phi & \quad if & \text{E is not defined in the other instances} & \\
\end{align}
$$
In addition, the $cg$ function can also be written as follows: