A C89 compiler written in Rust.
To build the compiler run the following in the root directory of this project:
cargo build --release
This creates the executable ./target/release/comp, which can now be used to
compile files. ANTLR is run automatically during the build to create the needed
parser files.
You can also use the following to build and run in one command:
cargo run --release
Comp reads its input from stdin by default. To compile a file, pass its path
as a positional argument:
./comp INPUT.c
The output is also written to stdout by default; use -o/--output to set
an output file path:
./comp INPUT.c -o OUTPUT.llvm
By default LLVM IR is emitted; use -e/--emit to change this. The possible
output formats are: antlr-tree, ast-dot, ast-rust-dbg, ir-rust-dbg, llvm-ir,
mips-dbg, and mips-asm.
./comp INPUT.c -o OUTPUT.dot -e ast-dot
When using one of the mips emit options you also have to set the target to mips.
This can be done with the -t/--target option (the only two targets are
mips and x86-64):
./comp INPUT.c -t mips -e mips-asm
mips-asm is also selected automatically when using the mips target.
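The interaction between -t/--target and -e/--emit described above can be sketched as a small resolution function. This is only an illustration: the names `Target`, `Emit`, and `resolve_emit` are hypothetical and not taken from the comp source, and whether comp reports an error or handles a mismatch differently is an assumption.

```rust
// Hypothetical types for illustration; not the ones used by comp.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Target {
    X86_64,
    Mips,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Emit {
    AntlrTree,
    AstDot,
    AstRustDbg,
    IrRustDbg,
    LlvmIr,
    MipsDbg,
    MipsAsm,
}

impl Emit {
    /// The two mips formats only make sense with the mips target.
    fn needs_mips(self) -> bool {
        matches!(self, Emit::MipsDbg | Emit::MipsAsm)
    }
}

/// Pick the emit format from the target and an optional explicit -e value.
/// Assumption: a mips format combined with the x86-64 target is rejected.
fn resolve_emit(target: Target, emit: Option<Emit>) -> Result<Emit, String> {
    match (target, emit) {
        // No -e given: the mips target selects mips-asm, otherwise llvm-ir.
        (Target::Mips, None) => Ok(Emit::MipsAsm),
        (Target::X86_64, None) => Ok(Emit::LlvmIr),
        // A mips format without the mips target is an error in this sketch.
        (Target::X86_64, Some(e)) if e.needs_mips() => {
            Err(format!("{:?} requires the mips target", e))
        }
        // Any other explicit choice is used as given.
        (_, Some(e)) => Ok(e),
    }
}
```

With these definitions, `resolve_emit(Target::Mips, None)` yields `Ok(Emit::MipsAsm)`, while `resolve_emit(Target::X86_64, Some(Emit::MipsAsm))` yields an error.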
Lastly, there is also --skip to skip some optional passes. The two optional
passes are const-fold and control-flow-analysis. So
./comp INPUT.c -o OUTPUT.dot -e ast-dot --skip const-fold
will not run const folding.
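To give an idea of what the const-fold pass does, here is a minimal sketch of constant folding on a toy expression tree. The `Expr` type and `const_fold` function are stand-ins invented for this example; they are not the AST or the pass from comp_lib. Wrapping arithmetic is used so the folder itself never panics on overflow; how comp treats overflow is not stated here.

```rust
/// Toy expression tree; a stand-in for illustration, not the comp_lib AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Lit(i32),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Fold bottom-up: fold both operands first, then replace an operation on
/// two literals by its result. Anything else is rebuilt unchanged.
fn const_fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (const_fold(*l), const_fold(*r)) {
            (Expr::Lit(a), Expr::Lit(b)) => Expr::Lit(a.wrapping_add(b)),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Mul(l, r) => match (const_fold(*l), const_fold(*r)) {
            (Expr::Lit(a), Expr::Lit(b)) => Expr::Lit(a.wrapping_mul(b)),
            (l, r) => Expr::Mul(Box::new(l), Box::new(r)),
        },
        // Literals and variables are already as simple as they can be.
        other => other,
    }
}
```

In this sketch `x + 2 * 3` becomes `x + 6`, and `(1 + 2) * 4` becomes the single literal `12`.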
- comp: The CLI frontend that uses the comp library in comp_lib.
- comp_lib: The internal library used by the CLI to compile files.
  - comp_lib/grammar: The ANTLR grammar files.
  - comp_lib/src/codegen: The generation of LLVM/MIPS code from the IR.
  - comp_lib/src/structure: The different trees used by different steps in the compilation process.
  - comp_lib/src/passes: Code to turn one tree into another.
- llvm_ir: Internal library to easily generate LLVM.
- mips_ir: Internal library to easily generate MIPS asm and run control flow graph algorithms.
- The source code is parsed into a CST using ANTLR.
- The CST gets transformed into an AST.
- Const folding is run on this AST.
- The AST gets lowered to an IR; type checking is run at the same time.
- Some control flow analysis is run on the IR.
- Depending on the target one of the following is done:
  - Codegen is run on the IR to create LLVM IR.
  - The MIPS steps below are run. The algorithms for register allocation were outlined by Hack et al. (2006) [1], and those for computing liveness sets by Brandner et al. (2011) [2].
    - Codegen is run on the IR to create a MIPS control flow graph.
    - Patching: Insert fake defines where needed.
    - Dead code elimination: SSA pruning and removal of unreachable blocks and unused registers.
    - Register allocation:
      - Call isolation: Isolate calls in separate basic blocks.
      - Spilling: Save and load registers to and from memory when out of physical registers.
      - Dead code elimination: Remove new unused registers.
      - Call isolation: Reisolate calls.
      - Coloring: Choose physical registers (colors) for virtual registers.
      - Coalescing: Change colors to minimize moves and swaps.
      - SSA destruction: Replace virtual registers and insert moves.
    - Stack frame building.
    - Devirtualization: Replace virtual instructions with real MIPS instructions.
    - Simplification: Merge linear blocks.
    - Fixing: Rearrange blocks and branches to make the CFG representable in MIPS asm.
    - Linking: Insert premade printf and scanf when used and add special main functionality.
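The coloring step in the list above can be illustrated with a greedy graph coloring sketch. As background, Hack et al. observe that interference graphs of programs in SSA form are chordal, so a greedy pass over a suitable order needs no more colors than an optimal coloring. The function below is a simplified illustration under assumptions: `greedy_color` and its adjacency-list input are hypothetical, the order is supplied by the caller rather than computed, and none of this is the code from mips_ir.

```rust
/// Greedily color virtual registers.
///
/// `interference[v]` lists the registers that are live at the same time as
/// `v`; `order` is the order in which registers are colored. The result
/// holds one color (physical register index) per register, or `None` for
/// registers that do not appear in `order`.
fn greedy_color(interference: &[Vec<usize>], order: &[usize]) -> Vec<Option<usize>> {
    let mut color: Vec<Option<usize>> = vec![None; interference.len()];
    for &v in order {
        // Mark the colors already used by colored neighbours of `v`.
        let mut taken = vec![false; interference.len()];
        for &n in &interference[v] {
            if let Some(c) = color[n] {
                taken[c] = true;
            }
        }
        // Pick the lowest color no neighbour uses. With n registers there
        // are at most n - 1 neighbours, so one of the n slots is free.
        let free = taken
            .iter()
            .position(|&t| !t)
            .expect("a free color always exists");
        color[v] = Some(free);
    }
    color
}
```

For a triangle of registers 0, 1, 2 plus a register 3 that only interferes with 2, coloring in the order 0, 1, 2, 3 gives three colors to the triangle and lets register 3 reuse color 0. A real allocator would spill first so that the number of colors never exceeds the number of physical registers.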
A Python script (run.py) is provided, which runs the compiler on the files in examples.
For every input, a .asm (MIPS asm), .llvm, .ast.dot, .ir.dot, and .txt (compiler output)
file with the same file name is created in the same folder. If there is a syntax error,
only the .txt file is generated. If there is a semantic error in the AST to IR step, the
.ir.dot, .asm, and .llvm files are not generated, etc.
All features required by assignments 1 - 6 are supported, as can be seen in the video.
Almost all optional features suggested by assignments 1 - 6 are also supported. Only dynamic arrays and assigning whole array slices to other arrays are unsupported. See the video for more details.
The compiler also supports some things that were not explicitly mentioned in the assignments:
- Logical && and || have short circuiting.
- Assignments can be used in expressions (as specified by the C standard).
- Pointer arithmetic.
- All C types: short, long, unsigned, double.
- Bitwise operators: &, |, ^, ~.
- Octal and hexadecimal character escapes in string literals.
- Scientific notation for floats.
- Left and right shift.
- Ifs, fors, and whiles with a single statement as body (e.g. if (...) a = 2;).
- Octal and hexadecimal number literals.
- Optimized register allocation.
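As an illustration of the octal and hexadecimal number literals mentioned above, the sketch below converts the text of an unsuffixed C integer literal to its value. The function name `parse_int_literal` is hypothetical and this is not the code comp uses; it only shows the C rule that a `0x`/`0X` prefix means base 16 and a leading `0` means base 8.

```rust
/// Parse an unsuffixed C integer literal (decimal, octal, or hexadecimal).
/// Returns `None` for malformed input or values that do not fit in a `u64`.
fn parse_int_literal(text: &str) -> Option<u64> {
    // Reject signs and other characters up front; the std parsers would
    // otherwise accept a leading '+'.
    if !text.bytes().all(|b| b.is_ascii_alphanumeric()) {
        return None;
    }
    if let Some(hex) = text.strip_prefix("0x").or_else(|| text.strip_prefix("0X")) {
        // Hexadecimal: digits after the 0x / 0X prefix.
        u64::from_str_radix(hex, 16).ok()
    } else if text.len() > 1 && text.starts_with('0') {
        // Octal: a leading zero followed by more digits.
        u64::from_str_radix(&text[1..], 8).ok()
    } else {
        // Decimal, including the single digit "0".
        text.parse().ok()
    }
}
```

Under these rules `0x1F` is 31 and `017` is 15, while `09` is rejected because 9 is not an octal digit.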
This project uses the following dependencies:
- anyhow: for easier error propagation
- clap: a crate to facilitate creating command line interfaces
- codespan-reporting: to nicely format the diagnostics
- is-terminal: to detect whether stdout is written to a terminal
- antlr-rust: fork of ANTLR4 with added support for Rust
- vec1: statically guaranteed non-empty vectors
- generational-arena: arena based data structures
- arrayvec: dynamic array stored on the stack
Footnotes
1. Hack, S., Grund, D., & Goos, G. (2006). Register Allocation for Programs in SSA-Form. In Lecture Notes in Computer Science (pp. 247–262). Springer Science+Business Media. https://doi.org/10.1007/11688839_20
2. Brandner, F., Boissinot, B., Darte, A., De Dinechin, B. D., & Rastello, F. (2011). Computing Liveness Sets for SSA-Form Programs. INRIA, 25. https://inria.hal.science/inria-00558509v2