This tutorial is a progressive journey from a simple blinky design to a RISC-V core.
It works with the following boards:
- IceStick
- IceBreaker
- ULX3S
- ARTY
If you do not have a board, you can run everything in simulation (but it is not as fun).
- it is a progressive introduction, changing only one thing at a time. It is a curated version of my logbook from when I learnt these notions (2020-2022). I also tried to keep track of all the dead ends I explored and the traps that caught me; they are often indicated as side remarks and notes;
- I try to keep hardware requirements as minimal as possible. With the tiniest FPGA (IceStick Ice40HX1K) you can do the first episode of the tutorial and transform it into a fully functional RV32I microcontroller that can execute compiled C code;
- in the end, the obtained processor is not the most efficient, but it is not a toy: it can execute any program. To answer the question you may ask: yes, it runs DOOM! (but not on an IceStick, you will need a larger FPGA). It works with the help of LiteX, which has a nice SDRAM controller, because Doom needs some RAM;
- the tutorial is both about hardware and software: you will learn how to compile programs in assembly and in C for your core;
- I try to make all example programs fun and interesting while reasonably short. The bundled demo programs include:
  - Mandelbrot set in assembly and in C
  - rotozoom graphic effect
  - drawing filled polygons
  - raytracing
  These graphic programs are all displayed in text mode on the terminal, using ANSI escape sequences (yes, this makes BIG pixels). For more fun, it is also possible to use a small OLED display instead (will add instructions for that in the future).
- Episode II is on pipelining, you will learn there how to transform the basic processor obtained at the end of this tutorial into a more efficient pipelined processor with branch prediction.
- [Episode III](INTERRUPTS.md) is a WIP on interrupts and the privileged RISC-V ISA.
- This tutorial is in VERILOG. It is currently being ported into other HDLs:
- Amaranth/nMigen version by @bl0x
- TODO: Silice version
- TODO: SpinalHDL version
To understand processor design, the first thing that I read was this answer on Stackoverflow, which I found inspiring. There is also this article suggested by @mithro. For a complete course, I highly recommend this one from MIT; it also gives the principles for going much further than what I've done here (pipelines, etc.).
For Verilog basics and syntax, I read Verilog by example by Blaine C. Readler, it is also short and to the point.
There are two nice things with the Stackoverflow answer:
- it goes to the essential, and keeps nothing else than what's essential
- the taken example is a RISC processor, that shares several similarities with RISC-V (except that it has status flags, that RISC-V does not have).
What we learn there is that there will be a register file, that stores
the so-called general-purpose registers. By general-purpose, we mean
that each time an instruction reads a register, it can be any of them,
and each time an instruction writes a register, it can be any of them,
unlike the x86 (CISC) that has specialized registers. To implement the
most general instruction (register <- register OP register), the
register file will read two registers at each cycle, and optionally
write-back one.
There will be an ALU, that will compute an operation on two values.
There will be also a decoder, that will generate all required internal signals from the bit pattern of the current instruction.
If you want to design a RISC-V processor on your own, I recommend you take a deep look at the Stackoverflow answer, and do some schematics on your own to have all the general ideas in mind before going further... or you can choose to directly jump into this tutorial, one step at a time. It will gently take you from the most trivial Blinky design to a fully functional RISC-V core.
First step is cloning the learn-fpga repository:
$ git clone https://github.com/BrunoLevy/learn-fpga.git
Before starting, you will need to install the following software:
- iverilog/icarus (simulation)
$ sudo apt-get install iverilog
- yosys/nextpnr, the toolchain for your board. See this link.
Note that iverilog/icarus is sufficient to run and play with all the steps of the tutorial, but the experience is not the same. I highly recommend running each step on a real device. The feeling and excitement of your own processor running some code for the first time is not of the same magnitude when you are doing simulation !!!
Let us start and create our first blinky ! Our blinky is implemented as a VERILOG module, connected to inputs and outputs, as follows (step1.v):
module SOC (
input CLK,
input RESET,
output [4:0] LEDS,
input RXD,
output TXD
);
reg [4:0] count = 0;
always @(posedge CLK) begin
count <= count + 1;
end
assign LEDS = count;
assign TXD = 1'b0; // not used for now
endmodule
We call it SOC (System On Chip), which is a big name for a blinky, but that's what our blinky will be morphed into after all the steps of this tutorial. Our SOC is connected to the following signals:
- CLK (input) is the system clock.
- LEDS (output) is connected to the 5 LEDs of the board.
- RESET (input) is a reset button. You'll say that the IceStick has no button, but in fact ... (we'll talk about that later)
- RXD and TXD (input, output) are connected to the FTDI chip that emulates a serial port through USB. We'll also talk about that later.
You can synthesize and send the bitstream to the device as follows:
$ BOARDS/run_xxx.sh step1.v
where xxx corresponds to your board.
The five LEDs will light up... but they are not blinking. Why is this so ? In fact they are blinking, but it is too fast for you to distinguish anything.
To see something, it is possible to use simulation. To use simulation, we write a new VERILOG file bench_iverilog.v, with a module bench that encapsulates our SOC:
module bench();
reg CLK;
wire RESET = 0;
wire [4:0] LEDS;
reg RXD = 1'b0;
wire TXD;
SOC uut(
.CLK(CLK),
.RESET(RESET),
.LEDS(LEDS),
.RXD(RXD),
.TXD(TXD)
);
reg[4:0] prev_LEDS = 0;
initial begin
CLK = 0;
forever begin
#1 CLK = ~CLK;
if(LEDS != prev_LEDS) begin
$display("LEDS = %b",LEDS);
end
prev_LEDS <= LEDS;
end
end
endmodule
The module bench drives all the signals of our SOC (called uut here for "unit under test"). The forever loop wiggles the CLK signal and displays the status of the LEDs whenever it changes.
Now we can start the simulation:
$ iverilog -DBENCH -DBOARD_FREQ=10 bench_iverilog.v step1.v
$ vvp a.out
... but that's a lot to remember, so I created a script for that, you'll prefer to do:
$ ./run.sh step1.v
You will see the LEDs counting. Simulation is precious, it lets you insert "print" statements ($display) in your VERILOG code, which is not directly possible when you run on the device !
To exit the simulation:
<ctrl><c>
finish
Note: I developed the first version of femtorv completely on device, using only the LEDs to debug, because I did not know how to use simulation. Don't do that, it's stupid !
Try this: How would you modify step1.v to slow it down sufficiently for one to see the LEDs blinking ?
Try this: Can you implement a "Knight Rider"-like blinking pattern instead of counting ?
You probably got it right: the blinky can be slowed down either by counting on a larger number of bits (and wiring the most significant bits to the LEDs), or by inserting a "clock divider" (also called a "gearbox") that counts on a large number of bits (and driving the counter with its most significant bit). The second solution is interesting, because you do not need to modify your design, you just insert the clock divider between the CLK signal of the board and your design. Then, even on the device you can distinguish what happens with the LEDs.
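The first option (a wider counter with its most significant bits wired to the LEDs) can be sketched as follows. This is only an illustration, not a file from the repository: the WIDTH name and its value of 26 are assumptions. With a 12 MHz clock, the first LED then changes state roughly every 0.17 s (2^21 cycles).

```verilog
// Sketch: slow blinky using a wider counter.
// WIDTH is a hypothetical parameter, tune it for your board's clock.
module SOC (
    input        CLK,   // system clock
    input        RESET, // reset button (unused here)
    output [4:0] LEDS,  // system LEDs
    input        RXD,   // UART receive (unused here)
    output       TXD    // UART transmit
);
   localparam WIDTH = 26;
   reg [WIDTH-1:0] count = 0;
   always @(posedge CLK) begin
      count <= count + 1;
   end
   // Only the 5 most significant bits of the counter drive the LEDs.
   assign LEDS = count[WIDTH-1:WIDTH-5];
   assign TXD  = 1'b0; // not used for now
endmodule
```

Note that in simulation, the LEDs of this version only change once every 2^21 clock cycles.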
To implement the clock divider, I created a Clockworks module in clockworks.v, that contains the gearbox and a mechanism related to the RESET signal (that I'll talk about later). Clockworks is implemented as follows:
module Clockworks
(
input CLK, // clock pin of the board
input RESET, // reset pin of the board
output clk, // (optionally divided) clock for the design.
output resetn // (optionally timed) negative reset for the design (more on this later)
);
parameter SLOW;
...
reg [SLOW:0] slow_CLK = 0;
always @(posedge CLK) begin
slow_CLK <= slow_CLK + 1;
end
assign clk = slow_CLK[SLOW];
...
endmodule
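This divides the clock frequency by 2^(SLOW+1): bit SLOW of the counter toggles once every 2^SLOW cycles of CLK, so a full period of clk lasts 2^(SLOW+1) cycles.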
The Clockworks module is then inserted between the CLK signal of the board and the design, using an internal clk signal, as follows, in step2.v:
`include "clockworks.v"
module SOC (
input CLK, // system clock
input RESET, // reset button
output [4:0] LEDS, // system LEDs
input RXD, // UART receive
output TXD // UART transmit
);
wire clk; // internal clock
wire resetn; // internal reset signal, goes low on reset
// A blinker that counts on 5 bits, wired to the 5 LEDs
reg [4:0] count = 0;
always @(posedge clk) begin
count <= !resetn ? 0 : count + 1;
end
// Clock gearbox (to let you see what happens)
// and reset circuitry (to workaround an
// initialization problem with Ice40)
Clockworks #(
.SLOW(21) // Divide clock frequency by 2^22 (clk is bit 21 of the counter)
)CW(
.CLK(CLK),
.RESET(RESET),
.clk(clk),
.resetn(resetn)
);
assign LEDS = count;
assign TXD = 1'b0; // not used for now
endmodule
It also handles the RESET signal.
Now you can try it in simulation:
$ ./run.sh step2.v
As you can see, the counter is now much slower. Try it also on device:
$ BOARDS/run_xxx.sh step2.v
Yes, now we can see clearly what happens ! And what about the RESET button ? The IceStick has no button. In fact it has one !
Press a finger on the circled region of the image (around pin 47).
Try this: Knight Rider mode, and RESET toggles direction.
If you take a look at clockworks.v, you will see it can also create a PLL, a component that can be used to generate faster clocks. For instance, the IceStick has a 12 MHz system clock, but the core that we will generate will run at 45 MHz. We will see that later.
Now we have all the tools that we need, so let's see how to transform this blinker into a fully-functional RISC-V processor. This goal seems to be far far away, but the processor we will have created at step 16 is not longer than 200 lines of VERILOG ! I was amazed to discover that it is that simple to create a processor. OK, let us go there one step at a time.
We know already that a processor has a memory, and fetches instructions from there, in a sequential manner most of the time (except when there are jumps and branches). Let us start with something similar, but much simpler: a pre-programmed Christmas tinsel, that loads the LED pattern from a memory (see step3.v). Our tinsel has a memory with the patterns:
reg [4:0] MEM [0:20];
initial begin
MEM[0] = 5'b00000;
MEM[1] = 5'b00001;
MEM[2] = 5'b00010;
MEM[3] = 5'b00100;
...
MEM[19] = 5'b10000;
MEM[20] = 5'b00000;
end
Note that what's in the initial block does not generate any circuitry when synthesized, it is directly translated into the initialization data for the BRAMs of the FPGA.
We will also have a "program counter" PC incremented at each clock, and a mechanism to fetch MEM contents indexed by PC:
reg [4:0] PC = 0;
reg [4:0] leds = 0;
always @(posedge clk) begin
leds <= MEM[PC];
PC <= (!resetn || PC==20) ? 0 : (PC+1);
end
Note the test PC==20 to make it cycle.
Now try it with simulation and on device.
Try this: create several blinking modes, and switch between modes using RESET.
An important source of information is of course the RISC-V reference manual. There you learn that there are several flavors of the RISC-V standard. Let us start from the simplest one (RV32I, that is, the 32-bit base integer instruction set). Then we will see how to add things, one thing at a time. This is a very nice feature of RISC-V: since the instruction set is modular, you can start with a very small self-contained kernel, and this kernel will be compliant with the standard. This means standard tools (compiler, assembler, linker) will be able to generate code for this kernel. Then I started reading Chapter 2 (page 13 to page 30). Seeing also the table page 130, there are in fact only 11 different instructions ! (I say for instance that an AND, an OR, an ADD ... are the same instruction, the operation is just an additional parameter). Now we just try to have an idea of the overall picture, no need to dive into the details for now. Let's take a global look at these 11 instructions:
instruction | description | algo |
---|---|---|
branch | conditional jump, 6 variants | if(reg OP reg) PC<-PC+imm |
ALU reg | Three-registers ALU ops, 10 variants | reg <- reg OP reg |
ALU imm | Two-registers ALU ops, 9 variants | reg <- reg OP imm |
load | Memory-to-register, 5 variants | reg <- mem[reg + imm] |
store | Register-to-memory, 3 variants | mem[reg+imm] <- reg |
LUI | load upper immediate | reg <- (im << 12) |
AUIPC | add upper immediate to PC | reg <- PC+(im << 12) |
JAL | jump and link | reg <- PC+4 ; PC <- PC+imm |
JALR | jump and link register | reg <- PC+4 ; PC <- reg+imm |
FENCE | memory-ordering for multicores | (not detailed here, skipped for now) |
SYSTEM | system calls, breakpoints | (not detailed here, skipped for now) |
- The 6 branch variants are conditional jumps, that depend on a test on two registers.
- ALU operations can be of the form register <- register OP register or register <- register OP immediate
- Then we have load and store, that can operate on bytes, on 16 bit values (called half-words) or 32 bit values (called words). In addition byte and half-word loads can do sign expansion. The source/target address is obtained by adding an immediate offset to the content of a register.
- The remaining instructions are more special (one may skip their description in a first read, you just need to know that they are used to implement unconditional jumps, function calls, memory ordering for multicores, system calls and breaks):
  - LUI (load upper immediate) is used to load the upper 20 bits of a constant. The lower bits can then be set using ADDI or ORI. At first sight it may seem weird that we need two instructions to load a 32 bit constant in a register, but in fact it is a smart choice, because all instructions are 32-bit long.
  - AUIPC (add upper immediate to PC) adds a constant to the current program counter and places the result in a register. It is meant to be used in combination with JALR to reach a 32-bit PC-relative address.
  - JAL (jump and link) adds an offset to the PC and stores the address of the instruction following the jump in a register. It can be used to implement function calls. JALR does the same thing, but adds the offset to a register.
  - FENCE and SYSTEM are used to implement memory ordering in multicore systems, and system calls/breaks respectively.
- To summarize, we have branches (conditional jumps), ALU operations, load and store, and a couple of special instructions used to implement unconditional jumps and function calls. There are also two instructions for memory ordering and system calls (but we will ignore these two for now). OK, in fact only 9 instructions then, it seems doable... At this point, I had not understood everything, so I'll start from what I think to be the simplest parts (instruction decoder, register file and ALU), then we will see how things are interconnected, how to implement jumps, branches, and all the instructions.
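Here is a small worked example of the LUI + ADDI pair mentioned above, as a sketch to run in simulation. The module name lui_addi_split and the constant 32'hDEADBEEF are arbitrary choices for this illustration. It shows one subtlety: ADDI sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the LUI immediate needs to be incremented by one to compensate.

```verilog
// Sketch: splitting a 32-bit constant into a LUI immediate and an ADDI immediate.
module lui_addi_split();
   reg [31:0] value;
   reg [19:0] hi;      // 20-bit immediate for LUI
   reg [11:0] lo;      // 12-bit immediate for ADDI
   reg [31:0] rebuilt; // what the register holds after LUI then ADDI
   initial begin
      value = 32'hDEADBEEF;
      lo = value[11:0];
      // ADDI sign-extends lo: if bit 11 is set, it subtracts 4096,
      // so we add 1 to the upper part to compensate.
      hi = value[31:12] + {19'b0, lo[11]};
      // LUI: hi << 12, then ADDI: add the sign-extended lo
      rebuilt = {hi, 12'b0} + {{20{lo[11]}}, lo};
      $display("value=%h hi=%h lo=%h rebuilt=%h", value, hi, lo, rebuilt);
      if (rebuilt !== value) $display("MISMATCH");
      $finish;
   end
endmodule
```

For this constant, hi ends up as DEADC rather than DEADB, because bit 11 of the lower part (EEF) is set.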
Now the idea is to have a memory with RISC-V instructions in it, load all instructions sequentially (like in our Christmas tinsel) in an instr register, and see how to recognize which of the 11 instructions it is (and light a different LED depending on the recognized instruction). Each instruction is encoded in a 32-bit word, and we need to decode the different bits of this word to recognize the instruction and its parameters.
The RISC-V reference manual has all the information that we need summarized in two tables in page 130 (RV32/64G Instruction Set Listings).
Let us take a look at the big table. The first thing to notice is that the 7 LSBs tell you which instruction it is (there are 10 possibilities, we do not count FENCE for now).
reg [31:0] instr;
...
wire isALUreg = (instr[6:0] == 7'b0110011); // rd <- rs1 OP rs2
wire isALUimm = (instr[6:0] == 7'b0010011); // rd <- rs1 OP Iimm
wire isBranch = (instr[6:0] == 7'b1100011); // if(rs1 OP rs2) PC<-PC+Bimm
wire isJALR = (instr[6:0] == 7'b1100111); // rd <- PC+4; PC<-rs1+Iimm
wire isJAL = (instr[6:0] == 7'b1101111); // rd <- PC+4; PC<-PC+Jimm
wire isAUIPC = (instr[6:0] == 7'b0010111); // rd <- PC + Uimm
wire isLUI = (instr[6:0] == 7'b0110111); // rd <- Uimm
wire isLoad = (instr[6:0] == 7'b0000011); // rd <- mem[rs1+Iimm]
wire isStore = (instr[6:0] == 7'b0100011); // mem[rs1+Simm] <- rs2
wire isSYSTEM = (instr[6:0] == 7'b1110011); // special
Besides the instruction type, we need also to decode the arguments of the instruction.
The table on the top distinguishes 6 types of instructions (R-type, I-type, S-type, B-type, U-type, J-type), depending on the arguments of the instruction and how they are encoded within the 32 bits of the instruction word.
R-type instructions take two source registers rs1 and rs2, apply an operation on them and store the result in a third destination register rd (ADD, SUB, SLL, SLT, SLTU, XOR, SRL, SRA, OR, AND).
Since RISC-V has 32 registers, each of rs1, rs2 and rd uses 5 bits of the instruction word. Interestingly, these are the same bits for all instruction formats. Hence, "decoding" rs1, rs2 and rd is just a matter of drawing some wires from the instruction word:
wire [4:0] rs1Id = instr[19:15];
wire [4:0] rs2Id = instr[24:20];
wire [4:0] rdId = instr[11:7];
Then, one needs to recognize which of the 10 R-type instructions it is. It is done mostly with the funct3 field, a 3-bit code. With a 3-bit code, one can only encode 8 different instructions, hence there is also a funct7 field (7 MSBs of the instruction word). Bit 30 of the instruction word distinguishes ADD from SUB and SRL from SRA (logical right shift/arithmetic right shift with sign expansion). The instruction decoder has wires for funct3 and funct7:
wire [2:0] funct3 = instr[14:12];
wire [6:0] funct7 = instr[31:25];
I-type instructions take one register rs1 and an immediate value Iimm, apply an operation on them and store the result in the destination register rd (ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI, SRLI, SRAI).
Wait a minute: there are 10 R-Type instructions but only 9 I-Type instructions, why is this so ? If you look carefully, you will see that there is no SUBI, but one can instead use ADDI with a negative immediate value. This is a general rule in RISC-V: if an existing functionality can be used, do not create a new functionality.
As for R-type instructions, the instruction can be distinguished using funct3 and funct7 (and in funct7, only the bit 30 of the instruction word is used, to distinguish SRAI/SRLI arithmetic and logical right shifts).
The immediate value is encoded in the 12 MSBs of the instruction word, hence we will draw additional wires to get it:
wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
As can be seen, bit 31 of the instruction word is repeated 21 times, this is "sign expansion" (converts a 12-bits signed quantity into a 32-bits one).
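To see sign expansion at work, here is a minimal sketch that decodes the immediate of addi x1, x1, -1 in simulation. The module name iimm_example is made up for this illustration; the Iimm expression is the one given above. The 12 bits of the immediate are all ones, which after sign expansion gives the 32-bit value -1.

```verilog
// Sketch: decoding the I-type immediate of "addi x1, x1, -1".
module iimm_example();
   //                      imm[11:0]   rs1  add   rd  ALUIMM
   reg  [31:0] instr = 32'b111111111111_00001_000_00001_0010011;
   wire [31:0] Iimm  = {{21{instr[31]}}, instr[30:20]};
   initial begin
      #1; // let the wire settle
      // $signed makes $display interpret the 32 bits as a signed number
      $display("Iimm = %0d", $signed(Iimm));
      $finish;
   end
endmodule
```

This should display Iimm = -1. Replacing the immediate field with 000000000001 should display Iimm = 1.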
There are four other instruction formats: S-type (for Store), B-type (for Branch), U-type (for Upper immediates that are left-shifted by 12), and J-type (for Jumps). Each instruction format has a different way of encoding an immediate value in the instruction word.
To understand what it means, let's get back to Chapter 2, page 16. The different instruction types correspond to the way immediate values are encoded in them.
Instr. type | Description | Immediate value encoding |
---|---|---|
R-type | register-register ALU ops. more on this here | None |
I-type | register-immediate integer ALU ops and JALR. | 12 bits, sign expansion |
S-type | store | 12 bits, sign expansion |
B-type | branch | 12 bits, sign expansion, upper [31:1] (bit 0 is 0) |
U-type | LUI, AUIPC | 20 bits, upper 31:12 (bits [11:0] are 0) |
J-type | JAL | 20 bits, sign expansion, upper [31:1] (bit 0 is 0) |
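Note that I-type and S-type encode the same type of values (but they are taken from different parts of instr). B-type and J-type are also similar to each other: both encode an even offset (bit 0 is always 0), but J-type has 20 bits available where B-type has 12.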
One can decode the different types of immediates as follows:
wire [31:0] Uimm={ instr[31], instr[30:12], {12{1'b0}}};
wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
wire [31:0] Simm={{21{instr[31]}}, instr[30:25],instr[11:7]};
wire [31:0] Bimm={{20{instr[31]}}, instr[7],instr[30:25],instr[11:8],1'b0};
wire [31:0] Jimm={{12{instr[31]}}, instr[19:12],instr[20],instr[30:21],1'b0};
Note that Iimm, Simm, Bimm and Jimm do sign expansion (by copying bit 31 the required number of times to fill the MSBs).
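One way to convince yourself that the scrambled B-type and J-type immediates are decoded correctly is a round trip in simulation: build an instruction word from a known offset, then check that the decoder gives the offset back. The sketch below does that. The functions encode_beq and encode_jal and the module name are hypothetical helpers written for this illustration (they follow the bit layout of the instruction formats table), while Bimm and Jimm are the expressions given above.

```verilog
// Sketch: encode an offset into a B-type / J-type word, then decode it back.
module imm_roundtrip();
   reg  [31:0] instr;
   wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};
   wire [31:0] Jimm = {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};

   // Hypothetical helper: builds "beq x0, x0, off"
   // layout: imm[12] imm[10:5] rs2 rs1 funct3 imm[4:1] imm[11] opcode
   function [31:0] encode_beq;
      input [12:0] off;
      begin
         encode_beq = {off[12], off[10:5], 5'b00000, 5'b00000, 3'b000,
                       off[4:1], off[11], 7'b1100011};
      end
   endfunction

   // Hypothetical helper: builds "jal x0, off"
   // layout: imm[20] imm[10:1] imm[11] imm[19:12] rd opcode
   function [31:0] encode_jal;
      input [20:0] off;
      begin
         encode_jal = {off[20], off[10:1], off[11], off[19:12],
                       5'b00000, 7'b1101111};
      end
   endfunction

   initial begin
      instr = encode_beq(-8);    // branch 8 bytes backwards
      #1 $display("Bimm = %0d", $signed(Bimm));
      instr = encode_jal(-2048); // jump 2048 bytes backwards
      #1 $display("Jimm = %0d", $signed(Jimm));
      $finish;
   end
endmodule
```

If the decoder is right, this displays Bimm = -8 and Jimm = -2048. Trying other even offsets is a cheap way to catch a wrong bit index.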
And that's all for our instruction decoder ! To summarize, the instruction decoder gets the following information from the instruction word:
- signals isXXX that recognize which of the 11 possible RISC-V instructions it is
- source and destination registers rs1, rs2 and rd
- function codes funct3 and funct7
- the five formats for immediate values (with sign expansion for Iimm, Simm, Bimm and Jimm).
Let us now initialize the memory with a few RISC-V instructions and see whether we can recognize them by lighting a different LED depending on the instruction (step4.v). To do that, we use the big table in page 130 of the RISC-V reference manual. It is a bit painful (we will see easier ways later !). Using the _ character to separate fields of a binary constant is especially interesting under this circumstance.
initial begin
// add x1, x0, x0
// rs2 rs1 add rd ALUREG
MEM[0] = 32'b0000000_00000_00000_000_00001_0110011;
// addi x1, x1, 1
// imm rs1 add rd ALUIMM
MEM[1] = 32'b000000000001_00001_000_00001_0010011;
...
// lw x2,0(x1)
// imm rs1 w rd LOAD
MEM[5] = 32'b000000000000_00001_010_00010_0000011;
// sw x2,0(x1)
// imm rs2 rs1 w imm STORE
MEM[6] = 32'b0000000_00010_00001_010_00000_0100011;
// ebreak
// SYSTEM
MEM[7] = 32'b000000000001_00000_000_00000_1110011;
end
Then we can fetch and recognize the instructions as follows:
always @(posedge clk) begin
if(!resetn) begin
PC <= 0;
end else if(!isSYSTEM) begin
instr <= MEM[PC];
PC <= PC+1;
end
end
assign LEDS = isSYSTEM ? 31 : {PC[0],isALUreg,isALUimm,isStore,isLoad};
(first led is wired to PC[0] so that we will see it blinking even if there is the same instruction several times).
As you can see, the program counter is only incremented if instruction is not SYSTEM. For now, the only SYSTEM instruction that we support is EBREAK, that halts execution.
In simulation mode, we can in addition display the name of the recognized instruction and the fields:
`ifdef BENCH
always @(posedge clk) begin
$display("PC=%0d",PC);
case (1'b1)
isALUreg: $display("ALUreg rd=%d rs1=%d rs2=%d funct3=%b",rdId, rs1Id, rs2Id, funct3);
isALUimm: $display("ALUimm rd=%d rs1=%d imm=%0d funct3=%b",rdId, rs1Id, Iimm, funct3);
isBranch: $display("BRANCH");
isJAL: $display("JAL");
isJALR: $display("JALR");
isAUIPC: $display("AUIPC");
isLUI: $display("LUI");
isLoad: $display("LOAD");
isStore: $display("STORE");
isSYSTEM: $display("SYSTEM");
endcase
end
`endif
Try this: run step4.v in simulation and on the device. Try initializing the memory with different RISC-V instructions and test whether the decoder recognizes them.
This paragraph may be skipped. It just contains my own impressions and reflections on the RISC-V instruction set, inspired by the comments and Q&A in italics in the RISC-V reference manual.
At this point, I realized what an instruction set architecture means: it is for sure a specification of what bit pattern does what (Instruction Set) and it is also at the same time driven by how this will be translated into wires (Architecture). An ISA is not abstract: it is independent of any implementation, but it is strongly designed with implementation in mind ! While the pipeline, branch prediction unit, multiple execution units, caches may differ in different implementations, the instruction decoder is probably very similar in all implementations.
There were things that seemed really weird to me in the first place: all these immediate format variants, the fact that immediate values are scrambled in different bits of instr, the zero register, and the weird instructions LUI, AUIPC, JAL, JALR. When writing the instruction decoder, you better understand the reasons. The ISA is really smart, and is the result of a long evolution (there were RISC-I, RISC-II, ... before). It seems to me the result of a distillation. Now, in 2020, many things were tested in terms of ISA, and this one seems to have benefited from all the previous attempts, taking the good choices and avoiding the suboptimal ones.
What is really nice in the ISA is:
- instruction size is fixed. Makes things really easier. (there are extensions with varying instruction length, but at least the core instruction set is simple);
- rs1, rs2, rd are always encoded by the same bits of instr;
- the immediate formats that need to do sign expansion do it from the same bit (instr[31]);
- the weird instructions LUI, AUIPC, JAL, JALR can be combined to implement higher-level tasks (load 32-bit constant in register, jump to arbitrary address, function calls). Their existence is justified by the fact it makes the design easier. Then assembly programmer's life is made easier by pseudo-instructions CALL, RET, ... See risc-v assembly manual, the two tables at the end of the page. Same thing for tests/branch instructions obtained by swapping parameters (e.g. a < b <=> b > a etc...), there are pseudo-instructions that do the job for you.
Put differently, to appreciate the elegance of the RISC-V ISA, imagine that your mission is to invent it. That is, invent both the set of instructions and the way they are encoded as bit patterns. The constraints are:
- fixed instruction length (32 bits)
- as simple as possible: the ultimate sophistication is simplicity [Leonardo da Vinci] !!
- source and destination registers always encoded at the same position
- whenever there is sign-extension, it should be done from the same bit
- it should be simple to load an arbitrary 32-bits immediate value in a register (but may take several instructions)
- it should be simple to jump to arbitrary memory locations (but may take several instructions)
- it should be simple to implement function calls (but may take several instructions)
Then you understand why there are many different immediate formats. For instance, consider JAL, that does not have a source register, as compared to JALR that has one. Both take an immediate value, but JAL has 8 more bits available to store it (20 instead of 12), since it needs to encode neither the source register (5 bits) nor funct3 (3 bits). The slightest available bit is used to extend the dynamic range of the immediates. This explains both the multiple immediate formats and the fact that they are assembled from multiple pieces of instr, slaloming between the three fixed 5-bit register encodings, that are there or not depending on the cases.
Now the rationale behind the weird instructions LUI, AUIPC, JAL and JALR is to give a set of functions that can be combined to load arbitrary 32-bit values in register, or to jump to arbitrary locations in memory, or to implement the function call protocol as simply as possible. Considering the constraints, the taken choices (that seemed weird to me in the first place) perfectly make sense. In addition, with the taken choices, the instruction decoder is pretty simple and has a low logical depth. Besides the 7-bit instruction decoder, it mostly consists of a set of wires drawn from the bits of instr, and duplication of the sign-extended bit 31 to form the immediate values.
Before moving forward, I'd like to say a word about the zero register. I think it is really a smart move. With it, you do not need a MOV rd rs instruction (just ADD rd rs zero), you do not need a NOP instruction (ADD zero zero zero), and all the branch variants can compare with zero ! I think that zero is a great invention, not as great as 0, but really makes the instruction set more compact.
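As an illustration, here is how these zero-register tricks look as instruction words, written in the same field-by-field style as in step4.v. This is a sketch: the module and constant names are made up, and the choice of x5 and x6 is arbitrary. Note that standard assemblers expand the nop and mv pseudo-instructions with ADDI (addi x0, x0, 0 and addi rd, rs, 0) rather than ADD, but the idea is the same.

```verilog
// Sketch: pseudo-instructions built from the zero register (x0).
module zero_register_examples();
   //                             funct7   rs2   rs1  add   rd  ALUREG
   localparam [31:0] MV_X5_X6 = 32'b0000000_00000_00110_000_00101_0110011; // add x5, x6, x0
   localparam [31:0] NOP_ADD  = 32'b0000000_00000_00000_000_00000_0110011; // add x0, x0, x0
   //                              imm[11:0]    rs1  add   rd  ALUIMM
   localparam [31:0] NOP_ADDI = 32'b000000000000_00000_000_00000_0010011;  // addi x0, x0, 0
   initial begin
      $display("add  x5, x6, x0 : %h", MV_X5_X6);
      $display("add  x0, x0, x0 : %h", NOP_ADD);
      $display("addi x0, x0, 0  : %h", NOP_ADDI);
      $finish;
   end
endmodule
```

These words can be copied into the MEM initialization of step4.v to check that the decoder classifies them as ALUreg or ALUimm.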
The register bank is implemented as follows:
reg [31:0] RegisterBank [0:31];
Let us take a closer look at what we need to do to execute an instruction. Consider for instance a stream of R-type instructions. For each instruction, we need to do the following four things:
- fetch the instruction: instr <= MEM[PC]
- fetch the values of rs1 and rs2: rs1 <= RegisterBank[rs1Id]; rs2 <= RegisterBank[rs2Id] where rs1 and rs2 are two registers. We need to do that because RegisterBank will be synthesized as a block of BRAM, and one needs one cycle to access the content of BRAM.
- compute rs1 OP rs2 (where OP depends on funct3 and funct7)
- store the result in rd: RegisterBank[rdId] <= writeBackData. This can be done during the same cycle as the previous step if OP is computed by a combinatorial circuit.
The first three operations are implemented by a state machine, as follows (see step5.v):
localparam FETCH_INSTR = 0;
localparam FETCH_REGS = 1;
localparam EXECUTE = 2;
reg [1:0] state = FETCH_INSTR;
always @(posedge clk) begin
case(state)
FETCH_INSTR: begin
instr <= MEM[PC];
state <= FETCH_REGS;
end
FETCH_REGS: begin
rs1 <= RegisterBank[rs1Id];
rs2 <= RegisterBank[rs2Id];
state <= EXECUTE;
end
EXECUTE: begin
PC <= PC + 1;
state <= FETCH_INSTR;
end
endcase
end
The fourth one (register write-back) is implemented in this block:
wire [31:0] writeBackData = ... ;
wire writeBackEn = ...;
always @(posedge clk) begin
if(writeBackEn && rdId != 0) begin
RegisterBank[rdId] <= writeBackData;
end
end
Remember that writing to register 0 has no effect (hence the test rdId != 0). The signal writeBackEn is asserted whenever writeBackData should be written to register rdId. The data to be written back (writeBackData) will be obtained from the ALU, as explained in the next episode.
Try this: run step5.v in simulation and on the device. You will see your wannabe CPU's state machine dancing waltz on the LEDs (that display the current state).
Now we can fetch instructions from memory, decode them and read register values, but our (wannabe) CPU is still unable to do anything. Let us see how to do actual computations on register values.
So, are you going to create an ALU module ? And by the way, why did you not create a Decoder module, and a RegisterBank module ?
My very first design used multiple modules and multiple files, for a total of 1000 lines of code or so, then Matthias Koch wrote a monolithic version, that fits in 200 lines of code. Not only is it more compact, but it is also much easier to understand when you have everything in one place. Rule of thumb: if you have more boxes and wires between the boxes than circuitry in the boxes, then you have too many boxes !
But wait a minute, modular design is good, no ?
Modular design is neither good nor bad, it is useful whenever it makes things simpler. It is not the case in the present situation. There is no absolute answer though, it is a matter of taste and style ! In this tutorial, we use a (mostly) monolithic design.
Now we want to implement two types of instructions:
- Rtype: rd <- rs1 OP rs2 (recognized by isALUreg)
- Itype: rd <- rs1 OP Iimm (recognized by isALUimm)
The ALU takes two inputs aluIn1 and aluIn2, computes aluIn1 OP aluIn2 and stores it in aluOut:
wire [31:0] aluIn1 = rs1;
wire [31:0] aluIn2 = isALUreg ? rs2 : Iimm;
reg [31:0] aluOut;
Depending on the instruction type, aluIn2 is either the value in the second source register rs2, or an immediate in the Itype format (Iimm). The operation OP depends mostly on funct3 (and also on funct7). Keep a copy of the RISC-V reference manual open page 130 on your knees or in another window:
funct3 | operation |
---|---|
3'b000 | ADD or SUB |
3'b001 | left shift |
3'b010 | signed comparison (<) |
3'b011 | unsigned comparison (<) |
3'b100 | XOR |
3'b101 | logical right shift or arithmetic right shift |
3'b110 | OR |
3'b111 | AND |
- for ADD/SUB, if it is an ALUreg operation (Rtype), then one makes the difference between ADD and SUB by testing bit 5 of funct7 (1 for SUB). If it is an ALUimm operation (Itype), then it can only be ADD. In this context, one just needs to test bit 5 of instr to distinguish between ALUreg (if it is 1) and ALUimm (if it is 0).
- for logical or arithmetic right shift, one also makes the difference by testing bit 5 of funct7: 1 for arithmetic shift (with sign extension) and 0 for logical shift.
- the shift amount is either the content of rs2 for ALUreg instructions or instr[24:20] (the same bits as rs2Id) for ALUimm instructions.
Putting everything together, one gets the following VERILOG code for the ALU:
reg [31:0] aluOut;
wire [4:0] shamt = isALUreg ? rs2[4:0] : instr[24:20]; // shift amount
always @(*) begin
case(funct3)
3'b000: aluOut = (funct7[5] & instr[5]) ? (aluIn1-aluIn2) : (aluIn1+aluIn2);
3'b001: aluOut = aluIn1 << shamt;
3'b010: aluOut = ($signed(aluIn1) < $signed(aluIn2));
3'b011: aluOut = (aluIn1 < aluIn2);
3'b100: aluOut = (aluIn1 ^ aluIn2);
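// both branches are signed: with one unsigned branch, the whole expression
// would be unsigned and >>> would behave like a logical shift
3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) : ($signed(aluIn1) >> shamt);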
3'b110: aluOut = (aluIn1 | aluIn2);
3'b111: aluOut = (aluIn1 & aluIn2);
endcase
end
Note: although it is declared as a reg, aluOut will be a combinatorial function (no flipflop generated), because its value is determined in a combinatorial block (always @(*)), and all the configurations are enumerated in the case statement.
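To convince yourself of what funct7[5] selects, here is a minimal stand-alone testbench sketch. The module name alu_demo and the test values are made up for this illustration, they are not part of step6.v. It computes side by side the two candidate results for funct3 = 3'b000 and the two candidate results for funct3 = 3'b101:

```verilog
// Hypothetical stand-alone testbench (not part of step6.v).
module alu_demo;
   reg  [31:0] aluIn1, aluIn2;
   reg  [4:0]  shamt;

   // Candidates for funct3 = 3'b000 (funct7[5] picks sub for Rtype).
   wire [31:0] add = aluIn1 + aluIn2;
   wire [31:0] sub = aluIn1 - aluIn2;

   // Candidates for funct3 = 3'b101 (funct7[5] picks sra).
   wire [31:0] srl = aluIn1 >> shamt;            // zeros enter from the left
   wire [31:0] sra = $signed(aluIn1) >>> shamt;  // the sign bit is replicated

   initial begin
      aluIn1 = 32'd5; aluIn2 = 32'd7; shamt = 5'd0;
      #1;
      $display("5 + 7 = %0d, 5 - 7 = %0d", $signed(add), $signed(sub));

      aluIn1 = 32'h80000000; shamt = 5'd4;
      #1;
      $display("srl = %h, sra = %h", srl, sra);
   end
endmodule
```

It should print 5 + 7 = 12, 5 - 7 = -2, then srl = 08000000, sra = f8000000. Note that the arithmetic shift is computed in its own wire: in Verilog, mixing a signed and an unsigned operand in the same expression makes the whole expression unsigned, and >>> then behaves like >>.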
Register write-back is configured as follows:
assign writeBackData = aluOut;
assign writeBackEn = (state == EXECUTE && (isALUreg || isALUimm));
Try this run step6.v in simulation and on the device. In simulation it will display the written value and the written register for all register write-back operations. On the device it will show the 5 LSBs of x1 on the LEDs.
Then you can try changing the program, and observe the effect on register values.
You are here ! This is the list of instructions you have to implement, your wannabe RISC-V core currently supports 20 of them. Next steps: jumps, then branches, then... the rest. Before then, as you probably have noticed, translating RISC-V programs into binary (that is, assembling manually) is extremely painful. Next section gives a much easier solution.
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [ ] 2 | [ ] 6 | [ ] | [ ] | [ ] 5 | [ ] 3 | [*] 1 |
To avoid having to manually translate RISC-V assembly into binary, one can use the GNU assembler, generate a binary file, translate it into hexadecimal and use the VERILOG system task $readmemh() to initialize memory with the content of that file. We will see later how to do that.
But in our case, it would be very convenient to be able to write small assembly programs directly in the same VERILOG file as our design. In fact, it is possible to do so, by implementing a RISC-V assembler directly in VERILOG (using tasks and functions), as done in riscv_assembly.v.
In step7.v, memory is initialized with the same assembly program as in step6.v. It now looks like this, much easier to read, no ?
`include "riscv_assembly.v"
initial begin
ADD(x0,x0,x0);
ADD(x1,x0,x0);
ADDI(x1,x1,1);
ADDI(x1,x1,1);
ADDI(x1,x1,1);
ADDI(x1,x1,1);
ADD(x2,x1,x0);
ADD(x3,x1,x2);
SRLI(x3,x3,3);
SLLI(x3,x3,31);
SRAI(x3,x3,5);
SRLI(x1,x3,26);
EBREAK();
end
Note: riscv_assembly.v needs to be included from inside the module that uses assembly.
In this step, we make another modification: in the previous steps, PC was the index of the current instruction. For what follows, we want it to be the address of the current instruction. Since each instruction is 32 bits long, it means that:
- to increment PC, we do PC <= PC + 4 (instead of PC <= PC + 1 as before)
- to fetch the current instruction, we do instr <= MEM[PC[31:2]]; (we ignore the two LSBs of PC).
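The relation between the byte address in PC and the word index used to read MEM can be checked with a tiny simulation-only sketch. The module name pc_index_demo and the dummy memory contents are assumptions made for this example:

```verilog
// Hypothetical stand-alone testbench: PC is a byte address,
// PC[31:2] is the index of the 32-bit word in MEM.
module pc_index_demo;
   reg [31:0] MEM [0:3];
   reg [31:0] PC;
   integer    i;

   initial begin
      // Dummy "instructions", one per 32-bit word.
      for (i = 0; i < 4; i = i + 1) begin
         MEM[i] = 32'hC0DE0000 + i;
      end

      PC = 0;
      for (i = 0; i < 4; i = i + 1) begin
         $display("PC = %0d, word index = %0d, instr = %h",
                  PC, PC[31:2], MEM[PC[31:2]]);
         PC = PC + 4;   // next instruction is 4 bytes further
      end
   end
endmodule
```

PC takes the values 0, 4, 8, 12 while the word index goes 0, 1, 2, 3: dropping the two LSBs is the same as dividing the address by 4.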
There are two jump instructions, JAL (jump and link), and JALR (jump and link register). By "and link", one means that the return address (PC+4, the address of the instruction that follows the jump) can be written to a register. Hence JAL and JALR can be used to implement not only jumps, but also function calls. Here is what the two instructions are supposed to do:
instruction | effect |
---|---|
JAL rd,imm | rd<-PC+4; PC<-PC+Jimm |
JALR rd,rs1,imm | rd<-PC+4; PC<-rs1+Iimm |
To implement these two instructions, we need to make the following changes to our core. First thing is register write-back: now the value can be PC+4 instead of aluOut for jump instructions:
assign writeBackData = (isJAL || isJALR) ? (PC + 4) : aluOut;
assign writeBackEn = (state == EXECUTE &&
(isALUreg ||
isALUimm ||
isJAL ||
isJALR)
);
We also need to declare a nextPC
value, that implements the
three possibilities:
wire [31:0] nextPC = isJAL ? PC+Jimm :
isJALR ? rs1+Iimm :
PC+4;
Then, in the state machine, the line PC <= PC + 4;
is replaced
with PC <= nextPC;
and that's all !
We can now implement a simple (infinite) loop to test our new jump instruction:
`include "riscv_assembly.v"
integer L0_=4;
initial begin
ADD(x1,x0,x0);
Label(L0_);
ADDI(x1,x1,1);
JAL(x0,LabelRef(L0_));
EBREAK();
endASM();
end
The integer L0_ is a label. Unlike with a real assembler, we need to specify the value of L0_ by hand. Here it is easy, L0_ is right after the first instruction, hence it corresponds to the beginning of the RAM (0) plus one 32-bit word, that is, 4. For longer programs with many labels, you can leave the labels uninitialized (integer L0_;), then the first time you run the program, it will compute and display the values to be used for the labels. It is not super-convenient, but still much better than assembling by hand / determining the labels by hand.
The LabelRef() function computes the label's offset relative to the current program counter. In addition, in simulation mode, it displays the current address (to be used to initialize the label), and if the label was already initialized (like here with L0_=4) it checks that the label corresponds to the current address generated by the assembler. If it is not the case, the endASM() statement displays an error message and exits.
Note 1: I systematically insert an EBREAK()
instruction at the end of the program,
here it would not be necessary (we have an infinite loop), but if I change my mind
and exit the loop, then EBREAK()
is already there.
Note 2: the endASM();
statement checks the validity of all the labels and exits
simulation whenever an invalid label is detected. If you use the RISC-V VERILOG
assembler, systematically run your design in simulation before synthesizing (because
this verification cannot be done at synthesis time).
Try this Run the design step8.v in simulation and on the device. Yes, after 8 steps, what we have is just another stupid blinky ! But this time, this blinky is executing a real RISC-V program ! It is not a complete RISC-V core yet, but it starts to have a strong RISC-V flavor. Be patient, our core will be soon able to run RISC-V programs that are more interesting than a blinky.
You are here ! Still some work to do, but we are making progress.
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [ ] 6 | [ ] | [ ] | [ ] 5 | [ ] 3 | [*] 1 |
Try this add a couple of instructions before the loop, run in simulation, fix the label as indicated by the simulator, re-run in simulation, run on device.
Branches are like jumps, except that they compare two registers, and update PC based on the result of the comparison. Another difference is that they are more limited in the address range they can reach from PC (12-bit offset).
There are 6 different branch instructions:
instruction | effect |
---|---|
BEQ rs1,rs2,imm | if(rs1 == rs2) PC <- PC+Bimm |
BNE rs1,rs2,imm | if(rs1 != rs2) PC <- PC+Bimm |
BLT rs1,rs2,imm | if(rs1 < rs2) PC <- PC+Bimm (signed comparison) |
BGE rs1,rs2,imm | if(rs1 >= rs2) PC <- PC+Bimm (signed comparison) |
BLTU rs1,rs2,imm | if(rs1 < rs2) PC <- PC+Bimm (unsigned comparison) |
BGEU rs1,rs2,imm | if(rs1 >= rs2) PC <- PC+Bimm (unsigned comparison) |
Wait a minute: there is BLT, but where is BGT ? Always the same principle in a RISC-V processor: if something can be done with a functionality that is already there, do not add a new functionality ! In this case, BGT rs1,rs2,imm is equivalent to BLT rs2,rs1,imm (just swap the first two operands). If you use BGT in a RISC-V assembly program, it will work (and the assembler replaces it with BLT with swapped operands). BGT is called a "pseudo-instruction". There are many pseudo-instructions to make the RISC-V assembly programmer's life easier (more on this later).
Back to our branch instructions, we will need to add in the ALU some wires to compute the result of the test, as follows:
reg takeBranch;
always @(*) begin
case(funct3)
3'b000: takeBranch = (rs1 == rs2);
3'b001: takeBranch = (rs1 != rs2);
3'b100: takeBranch = ($signed(rs1) < $signed(rs2));
3'b101: takeBranch = ($signed(rs1) >= $signed(rs2));
3'b110: takeBranch = (rs1 < rs2);
3'b111: takeBranch = (rs1 >= rs2);
default: takeBranch = 1'b0;
endcase
end
Note 1 it is possible to create a much more compact ALU, that uses a much smaller number of LUTs when synthesized, we will see that later (for now, our goal is to have a RISC-V processor that works, we will optimize it later).
Note 2 Among the 8 possibilities given by funct3, only 6 of them are used by the branch instructions. It is necessary to have a default: statement in the case, else the synthesizer would not be able to keep takeBranch as purely combinatorial (and would generate a latch, which we do not want).
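Another common way of avoiding the latch is to assign a default value at the top of the always block, so that every path through the block assigns takeBranch whatever the value of funct3. The sketch below assumes a stand-alone module (branch_test_demo is a hypothetical name, it is not in step9.v) so that it can be synthesized or simulated on its own:

```verilog
// Hypothetical stand-alone module, for illustration only.
module branch_test_demo(
   input  wire [2:0]  funct3,
   input  wire [31:0] rs1,
   input  wire [31:0] rs2,
   output reg         takeBranch
);
   always @(*) begin
      // Default value: takeBranch is assigned on every evaluation,
      // including for the two unused funct3 codes (3'b010 and 3'b011),
      // so there is nothing to remember and no latch is inferred.
      takeBranch = 1'b0;
      case(funct3)
        3'b000: takeBranch = (rs1 == rs2);                    // BEQ
        3'b001: takeBranch = (rs1 != rs2);                    // BNE
        3'b100: takeBranch = ($signed(rs1) <  $signed(rs2));  // BLT
        3'b101: takeBranch = ($signed(rs1) >= $signed(rs2));  // BGE
        3'b110: takeBranch = (rs1 <  rs2);                    // BLTU
        3'b111: takeBranch = (rs1 >= rs2);                    // BGEU
      endcase
   end
endmodule
```

Both styles give the same circuit. The default: entry keeps everything inside the case, while the leading assignment scales better when a block drives several signals.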
Now the only thing that remains to do for implementing branches is to add a case for
nextPC
, as follows:
wire [31:0] nextPC = (isBranch && takeBranch) ? PC+Bimm :
isJAL ? PC+Jimm :
isJALR ? rs1+Iimm :
PC+4;
We are now ready to test a simple loop, that counts from 0 to 31,
displays each iteration on the LEDs (remember, they are wired
to x1
) and stops:
`include "riscv_assembly.v"
integer L0_ = 8;
initial begin
ADD(x1,x0,x0);
ADDI(x2,x0,32);
Label(L0_);
ADDI(x1,x1,1);
BNE(x1, x2, LabelRef(L0_));
EBREAK();
endASM();
end
Try this run step9.v in simulation and on device. Try modifying the program, create a "Knight Rider" blinky with an outer loop and two inner loops (one left to right and one right to left).
You are here ! Wow, we have implemented 28 instructions out of 38 ! Let us continue...
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [*] 6 | [ ] | [ ] | [ ] 5 | [ ] 3 | [*] 1 |
We still have these two weird instructions to implement. What do they do ? It is rather simple:
instruction | effect |
---|---|
LUI rd, imm | rd <= Uimm |
AUIPC rd, imm | rd <= PC + Uimm |
And if you look at the Uimm format, it reads its MSBs (imm[31:12]) from the immediate encoded in the instruction. The 12 LSBs are set to zero. These two instructions are super useful: the immediate formats supported by all the other instructions can only modify the LSBs. Combined with these two instructions, one can load an arbitrary value in a register (but this can require up to two instructions).
Implementing these two instructions only requires changing writeBackEn and writeBackData as follows:
assign writeBackData = (isJAL || isJALR) ? (PC + 4) :
(isLUI) ? Uimm :
(isAUIPC) ? (PC + Uimm) :
aluOut;
assign writeBackEn = (state == EXECUTE &&
(isALUreg ||
isALUimm ||
isJAL ||
isJALR ||
isLUI ||
isAUIPC)
);
You are here ! Seems that we are nearly there ! 8 instructions to go...
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [*] 6 | [*] | [*] | [ ] 5 | [ ] 3 | [*] 1 |
Try this run step10.v in simulation and on the device.
Argh !! On my IceStick, it does not fit (requires 1283 LUTs and the IceStick only has 1280). What can we do ? Remember, we took absolutely no care about resource consumption, just trying to write a design that works. In fact, there is a lot of room for improvement in our design, we will see that later, but before then, let's organize our SOC a bit better (then we will shrink the processor).
In our previous designs, we got everything in our SOC
module (memory and
processor). In this step, we will see how to separate them.
First, the Memory
module:
module Memory (
input clk,
input [31:0] mem_addr, // address to be read
output reg [31:0] mem_rdata, // data read from memory
input mem_rstrb // goes high when processor wants to read
);
reg [31:0] MEM [0:255];
`include "riscv_assembly.v"
integer L0_=8;
initial begin
ADD(x1,x0,x0);
ADDI(x2,x0,31);
Label(L0_); ADDI(x1,x1,1);
BNE(x1, x2, LabelRef(L0_));
EBREAK();
endASM();
end
always @(posedge clk) begin
if(mem_rstrb) begin
mem_rdata <= MEM[mem_addr[31:2]];
end
end
endmodule
In its interface, there is a clk signal connected to the clock. Whenever the processor wants to read in memory, it puts the address to be read on mem_addr, and sets mem_rstrb to 1. Then the Memory module returns the data on mem_rdata at the next rising edge of clk.
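Since the read is synchronous, the data is only usable after a clock edge has gone by, and the processor has to wait for it. The following simulation-only sketch shows this latency. It copies the read port of the Memory module into a hypothetical testbench called sync_read_demo, with made-up memory contents:

```verilog
// Hypothetical stand-alone testbench: latency of a synchronous read.
module sync_read_demo;
   reg         clk       = 0;
   reg         mem_rstrb = 0;
   reg  [31:0] mem_addr  = 0;
   reg  [31:0] mem_rdata;
   reg  [31:0] MEM [0:3];

   always #5 clk = ~clk;   // first rising edge at t = 5

   // Same read port as in the Memory module.
   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end

   initial begin
      MEM[0] = 32'h11111111;
      MEM[1] = 32'h22222222;

      // The request: address 4 (second word), read strobe high.
      mem_addr  = 32'd4;
      mem_rstrb = 1'b1;
      #1;
      $display("before the clock edge: mem_rdata = %h", mem_rdata);

      @(posedge clk);  // the request is sampled here
      #1;
      $display("after the clock edge:  mem_rdata = %h", mem_rdata);
      $finish;
   end
endmodule
```

Before the edge, mem_rdata is still undefined (xxxxxxxx in simulation). After the edge, it holds 22222222, the word stored at address 4.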
Symmetrically, the Processor module has a mem_addr signal (as output this time), a mem_rdata signal (as input) and a mem_rstrb signal (as output):
module Processor (
input clk,
input resetn,
output [31:0] mem_addr,
input [31:0] mem_rdata,
output mem_rstrb,
output reg [31:0] x1
);
...
endmodule
(in addition, we have an x1 signal that contains the contents of register x1, that can be used for visual debugging. We will plug it to the LEDs).
The state machine has one additional state:
localparam FETCH_INSTR = 0;
localparam WAIT_INSTR = 1;
localparam FETCH_REGS = 2;
localparam EXECUTE = 3;
case(state)
FETCH_INSTR: begin
state <= WAIT_INSTR;
end
WAIT_INSTR: begin
instr <= mem_rdata;
state <= FETCH_REGS;
end
FETCH_REGS: begin
rs1 <= RegisterBank[rs1Id];
rs2 <= RegisterBank[rs2Id];
state <= EXECUTE;
end
EXECUTE: begin
if(!isSYSTEM) begin
PC <= nextPC;
end
state <= FETCH_INSTR;
end
endcase
Note we will see later how to simplify it and get back to three states.
Now, mem_addr
and mem_rstrb
can be wired as follows:
assign mem_addr = PC;
assign mem_rstrb = (state == FETCH_INSTR);
And finally, everything is installed and connected in the SOC:
module SOC (
input CLK, // system clock
input RESET, // reset button
output [4:0] LEDS, // system LEDs
input RXD, // UART receive
output TXD // UART transmit
);
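wire clk;
wire resetn;
// Declared before use: a name first seen in a port connection
// would otherwise become an implicit 1-bit wire.
wire [31:0] mem_addr;
wire [31:0] mem_rdata;
wire mem_rstrb;
Memory RAM(
.clk(clk),
.mem_addr(mem_addr),
.mem_rdata(mem_rdata),
.mem_rstrb(mem_rstrb)
);
wire [31:0] x1;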
Processor CPU(
.clk(clk),
.resetn(resetn),
.mem_addr(mem_addr),
.mem_rdata(mem_rdata),
.mem_rstrb(mem_rstrb),
.x1(x1)
);
assign LEDS = x1[4:0];
// Gearbox and reset circuitry.
Clockworks #(
.SLOW(19) // Divide clock frequency by 2^19
) CW (
.CLK(CLK),
.RESET(RESET),
.clk(clk),
.resetn(resetn)
);
assign TXD = 1'b0; // not used for now
endmodule
Now you can run step11.v in the simulator. As expected, it does the same thing as in the previous step (counts on the LEDs from 0 to 31 and stops). What about running it on the device ? Wow, even worse, 1341 LUTs (and we only got 1280 of them on the IceStick). So let us shrink our code to make it fit !
Tribute to "the Incredible Shrinking Man" classic movie
There are many things we can do for shrinking this core. Let us first take a look at the ALU. It can compute addition, subtraction, and comparisons. Can't we reuse the result of subtraction for comparisons ? Sure we can, but to do that we need to compute a 33-bit subtraction, and test the sign bit. Matthias Koch (@Mecrisp) explained this trick to me, it is also used in swapforth/J1 (another small RISC core that works on the IceStick). The 33-bit subtraction is written as follows:
wire [32:0] aluMinus = {1'b0,aluIn1} - {1'b0,aluIn2};
If you want to know what A-B does in Verilog, it corresponds to A+~B+1 (negate all the bits of B before adding, and add 1), it is how two's complement subtraction works. For instance, take 4'b0000 - 4'b0001, the result is -1, encoded as 4'b1111. It is computed as follows by the formula: 4'b0000 + ~4'b0001 + 1 = 4'b0000 + 4'b1110 + 1 = 4'b1111. So we will keep the following expression (we could have kept the simpler form above, but it is interesting to be aware of what happens behind the scenes):
wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0,aluIn1} + 33'b1;
Then we can create the wires for the three tests (this saves three 32-bit adders):
wire EQ = (aluMinus[31:0] == 0);
wire LTU = aluMinus[32];
wire LT = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];
- The first one, EQ, goes high when aluIn1 and aluIn2 have the same value, that is, when aluMinus[31:0] == 0 (no need to test the 33rd bit).
- The second one, LTU, corresponds to unsigned comparison. It is given by the sign bit of our 33-bit subtraction.
- For the third one, there are two cases: if the signs differ, then LT goes high if aluIn1 is negative, else it is given by the sign bit of our 33-bit subtraction.
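If you do not fully trust the three tests, you can compare them in simulation against Verilog's own comparison operators. The sketch below is a hypothetical testbench (compare_demo is not one of the tutorial's files) that draws random operands, forces equal operands once in a while so that EQ is exercised, and counts mismatches:

```verilog
// Hypothetical stand-alone testbench: EQ, LTU and LT derived from a
// single 33-bit subtraction, checked against ==, < and signed <.
module compare_demo;
   reg  [31:0] aluIn1, aluIn2;

   wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0,aluIn1} + 33'b1;
   wire        EQ  = (aluMinus[31:0] == 0);
   wire        LTU = aluMinus[32];
   wire        LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

   integer i, errors;

   initial begin
      errors = 0;
      for (i = 0; i < 1000; i = i + 1) begin
         aluIn1 = $random;
         // One pair out of ten is made of equal values.
         aluIn2 = (i % 10 == 0) ? aluIn1 : $random;
         #1;
         if (EQ  !== (aluIn1 == aluIn2))                  errors = errors + 1;
         if (LTU !== (aluIn1 <  aluIn2))                  errors = errors + 1;
         if (LT  !== ($signed(aluIn1) < $signed(aluIn2))) errors = errors + 1;
      end
      $display("%0d mismatches", errors);
   end
endmodule
```

It should report 0 mismatches. Random testing is of course not a proof, but it catches sign mistakes in this kind of trick very quickly.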
Of course, we still need one adder for addition:
wire [31:0] aluPlus = aluIn1 + aluIn2;
Then, aluOut
is computed as follows:
reg [31:0] aluOut;
always @(*) begin
case(funct3)
3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
3'b001: aluOut = aluIn1 << shamt;
3'b010: aluOut = {31'b0, LT};
3'b011: aluOut = {31'b0, LTU};
3'b100: aluOut = (aluIn1 ^ aluIn2);
3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) :
($signed(aluIn1) >> shamt);
3'b110: aluOut = (aluIn1 | aluIn2);
3'b111: aluOut = (aluIn1 & aluIn2);
endcase
end
Let us try on the IceStick. Yes ! 1167 LUTs, it fits ! But it is not a good reason to stop there, there are still several opportunities to shrink space. Let us take a look at takeBranch, can't we reuse the EQ, LT, LTU signals we just created ? Sure we can:
reg takeBranch;
always @(*) begin
case(funct3)
3'b000: takeBranch = EQ;
3'b001: takeBranch = !EQ;
3'b100: takeBranch = LT;
3'b101: takeBranch = !LT;
3'b110: takeBranch = LTU;
3'b111: takeBranch = !LTU;
default: takeBranch = 1'b0;
endcase
end
For this to work, we also need to make sure that rs2
is routed to the
second ALU input also for branches:
wire [31:0] aluIn2 = (isALUreg | isBranch) ? rs2 : Iimm;
What does it give on the device ? 1094 LUTs, not that bad, but let us continue...
The jump target for JALR is rs1+Iimm, and we created an adder especially for that, it is stupid because the ALU already computes that. OK let us reuse it (PCplusImm and PCplus4 are two factored adders, defined a bit further down):
wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
isJALR ? {aluPlus[31:1],1'b0}:
PCplus4;
How do we stand now ? 1030 LUTs. And it is not finished: what eats up the largest number of LUTs is the shifter, and we have three of them in the ALU (one for left shifts, one for logical right shifts and one for arithmetic right shifts). By another sorcerer's trick indicated by Matthias Koch (@mecrisp), it is possible to merge the two right shifts, by creating a 33-bit shifter with the additional bit set to 0 or 1 depending on the input's bit 31 and on whether it is a logical shift or an arithmetic shift.
wire [31:0] shifter =
$signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
Even better, Matthias told me it is possible to use in fact a single shifter, by flipping the input and flipping the output if it is a left shift:
wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
wire [31:0] leftshift = flip32(shifter);
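The single-shifter trick can be checked in simulation against the three plain shift operators. In the sketch below, flip32 is assumed to be a function that reverses the order of the 32 bits of its argument. The version given here is a guess written for this example and may differ from the one used in the tutorial's files. The names shifter_demo, isLeft and isArith are also made up: isLeft stands for (funct3 == 3'b001) and isArith for instr[30].

```verilog
// Hypothetical stand-alone testbench: one 33-bit arithmetic right
// shifter used for SLL, SRL and SRA.
module shifter_demo;
   // Assumed helper: bit-reversal of a 32-bit word.
   function [31:0] flip32;
      input [31:0] x;
      integer k;
      begin
         for (k = 0; k < 32; k = k + 1)
            flip32[k] = x[31-k];
      end
   endfunction

   reg  [31:0] aluIn1;
   reg  [4:0]  shamt;
   reg         isLeft;    // 1 for a left shift
   reg         isArith;   // 1 for an arithmetic right shift

   wire [31:0] shifter_in = isLeft ? flip32(aluIn1) : aluIn1;
   wire [31:0] shifter =
       $signed({isArith & aluIn1[31], shifter_in}) >>> shamt;
   wire [31:0] leftshift = flip32(shifter);

   // Reference for SRA, kept in its own (fully signed) expression.
   wire [31:0] sra_ref = $signed(aluIn1) >>> shamt;

   integer i, errors;

   initial begin
      errors = 0;
      for (i = 0; i < 1000; i = i + 1) begin
         aluIn1 = $random;
         shamt  = $random;
         isLeft = 1; isArith = 0; #1;
         if (leftshift !== (aluIn1 << shamt)) errors = errors + 1;
         isLeft = 0; isArith = 0; #1;
         if (shifter   !== (aluIn1 >> shamt)) errors = errors + 1;
         isLeft = 0; isArith = 1; #1;
         if (shifter   !== sra_ref)           errors = errors + 1;
      end
      $display("%0d mismatches", errors);
   end
endmodule
```

It should report 0 mismatches. The bit reversal costs no LUT at all, it is only wiring, which is why trading two shifters for two flips is such a good deal.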
The ALU then looks like that:
reg [31:0] aluOut;
always @(*) begin
case(funct3)
3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
3'b001: aluOut = leftshift;
3'b010: aluOut = {31'b0, LT};
3'b011: aluOut = {31'b0, LTU};
3'b100: aluOut = (aluIn1 ^ aluIn2);
3'b101: aluOut = shifter;
3'b110: aluOut = (aluIn1 | aluIn2);
3'b111: aluOut = (aluIn1 & aluIn2);
endcase
end
Where do we stand now ? 887 LUTs my friend !
Note 1 well, in fact one can gain even more space with the shifter, by shifting 1 single bit at each clock. The ALU then becomes a little bit more complicated (multi-cycle), but much much smaller (Femtorv32-quark uses this trick). We will see that later.
Note 2 with a multi-cycle ALU, we could also have a single 33-bit adder, and compute subtractions in three cycles, by separating the computation of ~aluIn2, aluIn1+(~aluIn2) and aluIn1+(~aluIn2)+1.
Before then, another easy win is factoring the adder used for address computation, as follows:
wire [31:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
instr[4] ? Uimm[31:0] :
Bimm[31:0] );
wire [31:0] PCplus4 = PC+4;
Then these two adders can be used by both nextPC
and writeBackData
:
assign writeBackData = (isJAL || isJALR) ? (PCplus4) :
(isLUI) ? Uimm :
(isAUIPC) ? PCplusImm :
aluOut;
assign writeBackEn = (state == EXECUTE && !isBranch);
wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
isJALR ? {aluPlus[31:1],1'b0} :
PCplus4;
The verdict ? 839 LUTs (we have gained another 50 LUTs or so...). There is still room for gaining more LUTs (by using a multi-cycle ALU for shifts, and by using a smaller number of bits for address computation), but we'll keep that for later, since we have now enough room on the device for the next steps.
OK, so now we have an (incomplete) RISC-V processor and a SOC, and both fit on the device. Remember, we are approaching the end, only 8 instructions to go (5 Load variants, 3 Store variants).
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [*] 6 | [*] | [*] | [ ] 5 | [ ] 3 | [*] 1 |
Before attacking them, let us learn a bit more on RISC-V assembly, and function calls. Up to now, we have used a gearbox to slow down the CPU in such a way that we can observe it executing our programs. Couldn't we implement a wait function instead and call it ? Let us see how to do that.
First thing to do is to remove the #(.SLOW(nnn)) parameter in the Clockworks instantiation:
Clockworks CW(
.CLK(CLK),
.RESET(RESET),
.clk(clk),
.resetn(resetn)
);
This no longer generates a gearbox and directly wires the CLK signal of the board to the internal clk signal used by our design.
OK, so now we need to see two different things:
- how to write a function that waits for some time
- how to call it
Wait a minute you are talking about function calls, but we do not have Load / Store instructions. We won't be able to push the return address on the stack (because we cannot read/write memory, and the stack is in memory !), so how is it possible ?
There would be many possible ways of using RISC-V instructions to implement function calls. To make sure everybody uses the same convention, there is an application binary interface (ABI) that defines how to call functions, how to pass parameters, and which register does what. See this document for more details.
Calling a function In this document, we learn that for calling a function, the return address will be stored in x1. Hence one can call a function using JAL(x1,offset) where offset is the (signed) difference between the program counter and the address of the function to be called. This works provided the offset fits in the Jimm format (20 encoded bits, that is, a range of about ±1 MB).
Note: for functions that are further away, one can use a combination of AUIPC and JALR to reach an arbitrary offset.
Returning from a function is done by jumping to the address stored in x1, which can be done by JALR(x0,x1,0).
Function arguments and return value: The first 8 function arguments are passed through x10..x17, and the return value is passed through x10 (it overwrites the first function argument).
That's interesting, even though we do not have Load/Store, we can write programs with functions, but we cannot write functions that call other functions, because this requires saving x1 to the stack (well in fact nothing forbids us from doing that by saving x1 in another register but then it would quickly become a mess, so we won't do that).
One little thing: we have just learnt that in the ABI, x1 is used to store the return address of functions. Up to now we have wired it to the LEDs. Since we are now going to comply with the ABI, we need to choose another register instead. From now, x10 will be wired to the LEDs.
OK, so now we have everything we need to write yet another version of the blinky ! Let us choose a slow_bit constant, write a wait function that counts to 2^slow_bit, and call it to slow down our blinky:
`ifdef BENCH
localparam slow_bit=15;
`else
localparam slow_bit=19;
`endif
`include "riscv_assembly.v"
integer L0_ = 4;
integer wait_ = 20;
integer L1_ = 28;
initial begin
ADD(x10,x0,x0);
Label(L0_);
ADDI(x10,x10,1);
JAL(x1,LabelRef(wait_)); // call(wait_)
JAL(zero,LabelRef(L0_)); // jump(l0_)
EBREAK(); // I keep it systematically
// here in case I change the program.
Label(wait_);
ADDI(x11,x0,1);
SLLI(x11,x11,slow_bit);
Label(L1_);
ADDI(x11,x11,-1);
BNE(x11,x0,LabelRef(L1_));
JALR(x0,x1,0);
endASM();
end
always @(posedge clk) begin
if(mem_rstrb) begin
mem_rdata <= MEM[mem_addr[31:2]];
end
end
endmodule
Try step13.v in simulation and on the device.
Try this Knight Rider blinky, with one routine for going from left to right, another routine for going from right to left, and the wait routine. Hint you will need to save x1 to another register.
With the ABI, we have a standard way of writing programs, but there are many things to remember:
- all RISC-V registers are the same, but with the ABI, we need to use certain registers for certain tasks (x1 for return address, x10..x17 for function parameters, etc...);
- calling a function is implemented using JAL or AUIPC and JALR, and returning from a function is implemented using JALR.
On a CISC processor, there are often special instructions for calling functions (CALL) and for returning from a function (RET), and registers are often specialized (function return address, stack pointer, function parameters). This makes the programmer's life easier because there is less to remember. There is no reason not to do the same for a RISC processor ! Let us pretend that the registers are different and give them different names (or aliases). These names are listed here.
ABI name | name | usage |
---|---|---|
zero | x0 | read:0 write:ignored |
ra | x1 | return address |
t0...t6 | ... | temporary registers |
fp, s0...s11 | ... | saved registers, fp = s0: frame pointer |
a0...a7 | ... | function parameters and return value (a0) |
sp | x2 | stack pointer |
gp | x3 | global pointer |
Saved registers (s0, ... s11) are supposed to be left untouched or saved/restored by functions. You can put your local variables there. If you write a function, you are supposed to push the ones you use on the stack and pop them before returning.
For all the other registers, you cannot expect them to be preserved through function calls.
The global pointer gp can be used as a "shortcut" to reach memory areas that are far away in 1 instruction. We will see that later (once we have Load and Store).
In our VERILOG assembler riscv_assembly.v, we just need to declare these aliases for register names:
localparam zero = x0;
localparam ra = x1;
localparam sp = x2;
localparam gp = x3;
...
localparam t4 = x29;
localparam t5 = x30;
localparam t6 = x31;
Besides these names, there are also pseudo-instructions for common tasks, such as:
pseudo-instruction | action |
---|---|
LI(rd,imm) | loads a 32-bit number in a register |
CALL(offset) | calls a function |
RET() | returns from a function |
MV(rd,rs) | equivalent to ADD(rd,rs,zero) |
NOP() | equivalent to ADD(zero,zero,zero) |
J(offset) | equivalent to JAL(zero,offset) |
BEQZ(rd1,offset) | equivalent to BEQ(rd1,x0,offset) |
BNEZ(rd1,offset) | equivalent to BNE(rd1,x0,offset) |
BGT(rd1,rd2,offset) | equivalent to BLT(rd2,rd1,offset) |
If the constant is in the [-2048,2047] range, LI is implemented using ADDI(rd,x0,imm), else it uses a combination of LUI and ADDI (if you want to know how it works, see this stackoverflow answer, there are tricky details about sign extension).
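The tricky detail is that ADDI sign-extends its 12-bit immediate: when bit 11 of the constant is set, ADDI subtracts instead of adding, so the upper part given to LUI has to be incremented by one to compensate. The sketch below shows the usual way of splitting a constant. Whether riscv_assembly.v does exactly this is an assumption, and the names li_split_demo, hi, lo and result are made up for the example:

```verilog
// Hypothetical stand-alone testbench: splitting a 32-bit constant
// into a LUI part (hi) and an ADDI part (lo).
module li_split_demo;
   reg  [31:0] imm;

   // Lower 12 bits, as ADDI sees them (sign-extended to 32 bits).
   wire [31:0] lo = {{20{imm[11]}}, imm[11:0]};

   // Upper 20 bits for LUI. Adding 0x800 first bumps hi by one
   // exactly when lo is negative.
   wire [31:0] hi_sum = imm + 32'h00000800;
   wire [19:0] hi     = hi_sum[31:12];

   // Content of rd after LUI(rd,hi) followed by ADDI(rd,rd,lo).
   wire [31:0] result = {hi, 12'b0} + lo;

   initial begin
      imm = 32'h12345678; #1;
      $display("imm=%h hi=%h lo=%h result=%h", imm, hi, lo[11:0], result);
      imm = 32'hDEADBEEF; #1;
      $display("imm=%h hi=%h lo=%h result=%h", imm, hi, lo[11:0], result);
   end
endmodule
```

For 12345678, bit 11 is 0 and the split is the obvious one (hi=12345, lo=678). For deadbeef, bit 11 is 1, so lo=eef counts as a negative number and hi becomes deadc instead of deadb. In both cases result is equal to imm.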
Using ABI register names and pseudo-instructions, our program becomes as follows:
integer L0_ = 4;
integer wait_ = 24;
integer L1_ = 32;
initial begin
LI(a0,0);
Label(L0_);
ADDI(a0,a0,1);
CALL(LabelRef(wait_));
J(LabelRef(L0_));
EBREAK();
Label(wait_);
LI(a1,1);
SLLI(a1,a1,slow_bit);
Label(L1_);
ADDI(a1,a1,-1);
BNEZ(a1,LabelRef(L1_));
RET();
endASM();
end
It does not make a huge difference, but in longer programs, it improves legibility by showing the intent of the programmer (this one is a function, that one is a jump to a label etc...). Without it, since everything looks the same, reading a program is more difficult.
It is quite funny: the RISC-V standard has a super-simple instruction set, but programming with
it is not that easy, so the ABI pretends that the instruction set is more complicated, like a
CISC processor, and this makes programmer's life easier. It also ensures that a function written
by a programmer can be called from a function written by another programmer, possibly in a different
language. We will see later how to use GNU assembler and C compiler to compile programs for our CPU.
But before playing with software and toolchains, remember, we still have 8 instructions to implement
in hardware (5 Load
variants and 3 Store
variants).
Try this invent (or copy it from somewhere else) a routine to multiply two numbers, test it on various inputs in simulation, and on the device.
Let us see now how to implement load instructions. There are 5 different instructions:
Instruction | Effect |
---|---|
LW(rd,rs1,imm) | Load word at address (rs1+imm) into rd |
LBU(rd,rs1,imm) | Load byte at address (rs1+imm) into rd |
LHU(rd,rs1,imm) | Load half-word at address (rs1+imm) into rd |
LB(rd,rs1,imm) | Load byte at address (rs1+imm) into rd then sign extend |
LH(rd,rs1,imm) | Load half-word at address (rs1+imm) into rd then sign extend |
Note addresses are aligned on word boundaries for LW (multiple of 4 bytes) and halfword boundaries for LH, LHU (multiple of 2 bytes). It is a good thing, it makes things much easier for us...
But we still have some work to do ! First, some circuitry that determines the loaded value (that we will call LOAD_data).
As you can see, we have instructions for loading words, half-words and bytes, and the instructions that load half-words and bytes exist in two versions:
- LBU, LHU that load a byte, halfword in the LSBs of rd and set the remaining MSBs to zero
- LB, LH that load a byte, halfword in the LSBs of rd then do sign extension

For instance, imagine a signed byte with the value -1, that is 8'b11111111. Loading it in a 32-bit register with LBU will result in 32'b00000000_00000000_00000000_11111111, whereas loading it with LB will result in 32'b11111111_11111111_11111111_11111111, that is, the 32-bit version of -1.
So we got a "two-dimensional" array of cases (whether we load a byte, halfword, word, and whether we do sign extension or not). Well, in fact it is even more complicated. Remember, our memory is structured into words, so when we load a byte, we need to know which one it is (among 4), and when we load a halfword, we need to know which one it is (among 2). This can be done by examining the 2 LSBs of the address of the data to be loaded (rs1 + Iimm):
wire [31:0] loadstore_addr = rs1 + Iimm;
wire [15:0] LOAD_halfword =
loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
wire [7:0] LOAD_byte =
loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
OK, so now we need to select among mem_rdata (LW), LOAD_halfword (LH, LHU) and LOAD_byte (LB, LBU). Examining the table in the RISC-V reference manual page 130, this is determined by the two LSBs of funct3:
wire mem_byteAccess = funct3[1:0] == 2'b00;
wire mem_halfwordAccess = funct3[1:0] == 2'b01;
wire [31:0] LOAD_data =
mem_byteAccess ? LOAD_byte :
mem_halfwordAccess ? LOAD_halfword :
mem_rdata ;
Now we need to insert sign extension into this expression. The value to be written in the MSBs of rd, LOAD_sign, depends both on whether the instruction does sign extension (LB, LH), characterized by funct3[2]=0, and on the MSB of the loaded value:
wire LOAD_sign =
!funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);
wire [31:0] LOAD_data =
mem_byteAccess ? {{24{LOAD_sign}}, LOAD_byte} :
mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
mem_rdata ;
Pfiuuuu, it was a bit painful, but in the end it is not too complicated. My initial design was much more complicated, but Matthias Koch (@mecrisp) simplified it a lot, resulting in the (reasonably easy to understand) design above.
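To convince yourself that the expression above does what we want, here is a small self-checking bench, given as a sketch only. It copies the load circuitry as-is and feeds it the word stored in `MEM[103]` by the test program of this step. The module name `load_unit_tb`, the task `check` and the chosen test values are made up for this example, and the `funct3` encodings in the comments are the standard RV32I ones. It should run in any Verilog simulator (for instance Icarus Verilog).

```verilog
// Illustrative bench, not part of the tutorial sources.
module load_unit_tb;
   reg [31:0] mem_rdata;       // word read from memory
   reg [31:0] loadstore_addr;  // byte address (only the 2 LSBs matter here)
   reg [2:0]  funct3;          // 000:LB 001:LH 010:LW 100:LBU 101:LHU

   // Same circuitry as in the text
   wire [15:0] LOAD_halfword =
      loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
   wire [7:0] LOAD_byte =
      loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;
   wire LOAD_sign =
      !funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);
   wire [31:0] LOAD_data =
      mem_byteAccess     ? {{24{LOAD_sign}}, LOAD_byte} :
      mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                           mem_rdata;

   integer errors = 0;

   // Applies one (funct3, address LSBs) pair and compares with the expected value.
   task check(input [2:0] f3, input [1:0] lsb, input [31:0] expected);
      begin
         funct3 = f3;
         loadstore_addr = {30'b0, lsb};
         #1; // let the combinatorial logic settle
         if(LOAD_data !== expected) begin
            errors = errors + 1;
            $display("FAIL funct3=%b lsb=%b: got %h, expected %h",
                     f3, lsb, LOAD_data, expected);
         end
      end
   endtask

   initial begin
      mem_rdata = 32'hFF0F0E0D;           // {8'hff, 8'hf, 8'he, 8'hd}
      check(3'b000, 2'd0, 32'h0000000D);  // LB,  byte 0 (positive)
      check(3'b000, 2'd3, 32'hFFFFFFFF);  // LB,  byte 3 (sign extended)
      check(3'b100, 2'd3, 32'h000000FF);  // LBU, byte 3 (zero extended)
      check(3'b001, 2'd2, 32'hFFFFFF0F);  // LH,  upper halfword
      check(3'b101, 2'd2, 32'h0000FF0F);  // LHU, upper halfword
      check(3'b010, 2'd0, 32'hFF0F0E0D);  // LW
      if(errors == 0) $display("all load checks passed");
      $finish;
   end
endmodule
```

Adding your own `check(...)` lines is an easy way to explore the other address and `funct3` combinations.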
We are not completely done though: we now need to modify the state machine. It will have
two additional states, `LOAD` and `WAIT_DATA`:
localparam FETCH_INSTR = 0;
localparam WAIT_INSTR = 1;
localparam FETCH_REGS = 2;
localparam EXECUTE = 3;
localparam LOAD = 4;
localparam WAIT_DATA = 5;
reg [2:0] state = FETCH_INSTR;
Note 1: we could do with a smaller number of states, but for now our goal is to have
something that works and that is as easy to understand as possible. We will see later
how to simplify the state machine.

Note 2: do not forget to check that `state` has the required number of bits
(`reg [2:0] state` instead of `reg [1:0] state` as before) !

Then the new states are plugged in as follows:
...
EXECUTE: begin
if(!isSYSTEM) begin
PC <= nextPC;
end
state <= isLoad ? LOAD : FETCH_INSTR;
end
LOAD: begin
state <= WAIT_DATA;
end
WAIT_DATA: begin
state <= FETCH_INSTR;
end
...
And finally, the signals `mem_addr` (with the address to be read)
and `mem_rstrb` (that goes high whenever the processor wants to read data) are
driven as follows:
assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
PC : loadstore_addr ;
assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
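To see when these two signals change, the following sketch steps the state machine alone, without the rest of the processor, and prints `mem_rstrb` and the address selection at each cycle. Several things here are assumptions made for the example: the module name `load_fsm_tb`, the helper wire `addr_is_PC`, the fact that every instruction is treated as a load (`isLoad` stuck at 1), and the transitions between the first four states, which come from the earlier steps of the tutorial and are reproduced here in their simplest form.

```verilog
// Illustrative bench, not part of the tutorial sources.
module load_fsm_tb;
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;

   reg clk = 0;
   reg isLoad = 1;                 // assumption: pretend we always decode a load
   reg [2:0] state = FETCH_INSTR;

   // Same conditions as in the text. addr_is_PC is 1 when mem_addr is driven
   // by PC, and 0 when it is driven by loadstore_addr.
   wire mem_rstrb  = (state == FETCH_INSTR || state == LOAD);
   wire addr_is_PC = (state == WAIT_INSTR  || state == FETCH_INSTR);

   always @(posedge clk) begin
      case(state)
        FETCH_INSTR: state <= WAIT_INSTR;
        WAIT_INSTR:  state <= FETCH_REGS;
        FETCH_REGS:  state <= EXECUTE;
        EXECUTE:     state <= isLoad ? LOAD : FETCH_INSTR;
        LOAD:        state <= WAIT_DATA;
        WAIT_DATA:   state <= FETCH_INSTR;
        default:     state <= FETCH_INSTR;
      endcase
   end

   integer cycle;
   initial begin
      #1; // make sure the initial values are in place before printing
      for(cycle = 0; cycle < 8; cycle = cycle + 1) begin
         $display("cycle %0d: state=%0d mem_rstrb=%b addr_is_PC=%b",
                  cycle, state, mem_rstrb, addr_is_PC);
         clk = 1; #1; clk = 0; #1;
      end
      $finish;
   end
endmodule
```

The trace makes the cost of a load visible: two more cycles (`LOAD`, then `WAIT_DATA`) before the next `FETCH_INSTR`.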
Let us test now our new instructions with the following program:
integer L0_ = 8;
integer wait_ = 32;
integer L1_ = 40;
initial begin
LI(s0,0);
LI(s1,16);
Label(L0_);
LB(a0,s0,400); // LEDs are plugged on a0 (=x10)
CALL(LabelRef(wait_));
ADDI(s0,s0,1);
BNE(s0,s1, LabelRef(L0_));
EBREAK();
Label(wait_);
LI(t0,1);
SLLI(t0,t0,slow_bit);
Label(L1_);
ADDI(t0,t0,-1);
BNEZ(t0,LabelRef(L1_));
RET();
endASM();
// Note: index 100 (word address)
// corresponds to
// address 400 (byte address)
MEM[100] = {8'h4, 8'h3, 8'h2, 8'h1};
MEM[101] = {8'h8, 8'h7, 8'h6, 8'h5};
MEM[102] = {8'hc, 8'hb, 8'ha, 8'h9};
MEM[103] = {8'hff, 8'hf, 8'he, 8'hd};
end
This program initializes some values in four words
at address 400, and loads them in `a0` in a loop.
There is also a delay loop (`wait` function) to let
you see something, just as before.
Try this Run the program in simulation and on the device. Test the other instructions. Do a programmable tinsel as in step 3.
You are here ! Just three instructions to go and we will be done !
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [*] 6 | [*] | [*] | [*] 5 | [ ] 3 | [*] 1 |
We are approaching the end, but still some work to do, to implement the following three instructions:
Instruction | Effect |
---|---|
SW(rs2,rs1,imm) | store rs2 at address rs1+imm |
SB(rs2,rs1,imm) | store 8 LSBs of rs2 at address rs1+imm |
SH(rs2,rs1,imm) | store 16 LSBs of rs2 at address rs1+imm |
To do so, we will need to do three different things:
- modify the interface between the processor and the memory in such a way that the processor can write to the memory;
- the memory is addressed by words. Each write operation will modify a word. But `SB` and `SH` need to be able to write individual bytes. Besides the word to be written, we need to compute which bytes of this word should be effectively modified in memory (a 4-bit mask);
- the state machine needs to be modified.
The `Memory` module is modified as follows:
module Memory (
input clk,
input [31:0] mem_addr,
output reg [31:0] mem_rdata,
input mem_rstrb,
input [31:0] mem_wdata,
input [3:0] mem_wmask
);
reg [31:0] MEM [0:255];
initial begin
...
end
wire [29:0] word_addr = mem_addr[31:2];
always @(posedge clk) begin
if(mem_rstrb) begin
mem_rdata <= MEM[word_addr];
end
if(mem_wmask[0]) MEM[word_addr][ 7:0 ] <= mem_wdata[ 7:0 ];
if(mem_wmask[1]) MEM[word_addr][15:8 ] <= mem_wdata[15:8 ];
if(mem_wmask[2]) MEM[word_addr][23:16] <= mem_wdata[23:16];
if(mem_wmask[3]) MEM[word_addr][31:24] <= mem_wdata[31:24];
end
We have two new input signals: `mem_wdata`, a 32-bit signal
with the value to be written, and `mem_wmask`, a 4-bit signal
that indicates which bytes should be written.
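The effect of the mask is easy to observe in simulation. The sketch below keeps only the write part of the memory and performs two writes to the same word, the second one with a single mask bit set. The module name `masked_write_tb`, the task `do_write` and the data values are made up for this example.

```verilog
// Illustrative bench, not part of the tutorial sources.
module masked_write_tb;
   reg clk = 0;
   reg [31:0] mem_addr  = 0;
   reg [31:0] mem_wdata = 0;
   reg [3:0]  mem_wmask = 0;

   reg [31:0] MEM [0:255];
   wire [29:0] word_addr = mem_addr[31:2];

   // Same masked write as in the Memory module
   always @(posedge clk) begin
      if(mem_wmask[0]) MEM[word_addr][ 7:0 ] <= mem_wdata[ 7:0 ];
      if(mem_wmask[1]) MEM[word_addr][15:8 ] <= mem_wdata[15:8 ];
      if(mem_wmask[2]) MEM[word_addr][23:16] <= mem_wdata[23:16];
      if(mem_wmask[3]) MEM[word_addr][31:24] <= mem_wdata[31:24];
   end

   // Presents one write request to the memory during one clock cycle.
   task do_write(input [31:0] addr, input [31:0] data, input [3:0] mask);
      begin
         mem_addr = addr; mem_wdata = data; mem_wmask = mask;
         #1 clk = 1;
         #1 clk = 0;
         mem_wmask = 4'b0000;
      end
   endtask

   initial begin
      MEM[100] = 32'h00000000;
      #1;
      do_write(400, 32'hAABBCCDD, 4'b1111); // whole word
      do_write(400, 32'h11223344, 4'b0100); // only byte 2 is modified
      #1;
      $display("MEM[100] = %h", MEM[100]);
      if(MEM[100] === 32'hAA22CCDD) $display("masked write OK");
      else                          $display("unexpected value");
      $finish;
   end
endmodule
```

After the second write, only bits 23 to 16 of the word have changed, the three other bytes keep the value written the first time.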
Note: you may wonder how this is implemented in practice, in particular
how the masked write to memory is synthesized on the device. BRAMs on
most FPGAs directly support masked writes, through the vendor's special
primitives. Yosys has a (super smart) special step called "technology mapping" that
detects some patterns in the source VERILOG file, and instantiates
the vendor's primitive best adapted to the usage. In fact, technology mapping
was already used before in our tutorial, to represent the register bank: at each
cycle we read two registers, `rs1` and `rs2`. In the IceStick, BRAMs can
read a single value at each clock, so to make it possible, Yosys automatically
duplicates the register bank. Whenever a value is written to `rd`, it is written to
the two register banks: `bank1[rdId] <- writeBackValue; bank2[rdId] <- writeBackValue;`,
and two different registers can be read at the same cycle, each one in its own
register bank: `rs1 <- bank1[rs1Id]; rs2 <- bank2[rs2Id];`. With the magic of Yosys,
you do not have to take care of this, it will automatically select the best
mapping for you (duplicated register bank, single register bank with two read
ports if the target supports it, or even an array of flipflops with address decoder
for larger FPGAs with many LUTs). In our case, the IceStick has an Ice40HX1K,
that has 8 kB of BRAM, organized in 16 blocks of 4 kbits (512 bytes) each. 2 kB are
used for the (duplicated) register bank, leaving 6 kB of BRAM that we use
to synthesize system RAM.
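To make the duplicated register bank more concrete, here is what it would look like if you wrote it by hand. This is only a sketch of the idea described in the note, not what Yosys actually generates: the module name `dual_bank_regs_tb` and the task `tick` are made up, and the signal names follow the pseudo-code of the note.

```verilog
// Illustrative bench, not part of the tutorial sources.
module dual_bank_regs_tb;
   reg clk = 0;
   reg        writeBackEn = 0;
   reg [4:0]  rdId = 0, rs1Id = 0, rs2Id = 0;
   reg [31:0] writeBackValue = 0;

   // Two copies of the register bank, each with a single read port
   reg [31:0] bank1 [0:31];
   reg [31:0] bank2 [0:31];
   reg [31:0] rs1, rs2;

   always @(posedge clk) begin
      if(writeBackEn) begin       // every write goes to both copies
         bank1[rdId] <= writeBackValue;
         bank2[rdId] <= writeBackValue;
      end
      rs1 <= bank1[rs1Id];        // one read per bank and per cycle
      rs2 <= bank2[rs2Id];
   end

   task tick; begin #1 clk = 1; #1 clk = 0; end endtask

   initial begin
      writeBackEn = 1;
      rdId = 5; writeBackValue = 111; tick;   // x5 <- 111
      rdId = 6; writeBackValue = 222; tick;   // x6 <- 222
      writeBackEn = 0;
      rs1Id = 5; rs2Id = 6; tick;             // read both in the same cycle
      #1 $display("rs1=%0d rs2=%0d", rs1, rs2);
      $finish;
   end
endmodule
```

The price to pay is that the register bank uses twice the storage, which is why it takes a visible share of the BRAM on the IceStick.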
The `Processor` module is updated accordingly:
module Processor (
input clk,
input resetn,
output [31:0] mem_addr,
input [31:0] mem_rdata,
output mem_rstrb,
output [31:0] mem_wdata,
output [3:0] mem_wmask,
output reg [31:0] x10 = 0
);
(and everything is connected in the `SOC`).
Let us see now how to compute the word to be written and the mask. The
address where the value should be written is still `rs1 + imm`, but
the format of the immediate value is different between `Load` (`Iimm`)
and `Store` (`Simm`):
wire [31:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
Now the data to be written depends on whether we write a byte, a halfword or a word, and for bytes and halfwords, also depends on the 2 LSBs of the address. Interestingly, we do not need to test whether we write a byte, a halfword or a word, because the write mask (see later) will ignore the bytes that are not concerned by a byte or halfword write:
assign mem_wdata[ 7: 0] = rs2[7:0];
assign mem_wdata[15: 8] = loadstore_addr[0] ? rs2[7:0] : rs2[15: 8];
assign mem_wdata[23:16] = loadstore_addr[1] ? rs2[7:0] : rs2[23:16];
assign mem_wdata[31:24] = loadstore_addr[0] ? rs2[7:0] :
loadstore_addr[1] ? rs2[15:8] : rs2[31:24];
And finally, the 4-bit write mask, which indicates which bytes of `mem_wdata`
should be effectively written to memory. It is determined as follows:

write mask | Instruction |
---|---|
`4'b1111` | `SW` |
`4'b0011` or `4'b1100` | `SH`, depending on `loadstore_addr[1]` |
`4'b0001`, `4'b0010`, `4'b0100` or `4'b1000` | `SB`, depending on `loadstore_addr[1:0]` |
Deriving the expression is a bit painful. With Matthias Koch we ended up with this one:
wire [3:0] STORE_wmask =
mem_byteAccess ?
(loadstore_addr[1] ?
(loadstore_addr[0] ? 4'b1000 : 4'b0100) :
(loadstore_addr[0] ? 4'b0010 : 4'b0001)
) :
mem_halfwordAccess ?
(loadstore_addr[1] ? 4'b1100 : 4'b0011) :
4'b1111;
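Since both expressions are a bit tricky, it is worth checking them together. The sketch below copies `mem_wdata` and `STORE_wmask` as-is, applies the resulting masked write to a word initially filled with `0xAA` bytes, and compares with what `SB`, `SH` and `SW` are supposed to produce. The module name `store_unit_tb`, the function `masked_write`, the task `check` and the test values are made up for this example.

```verilog
// Illustrative bench, not part of the tutorial sources.
module store_unit_tb;
   reg [31:0] rs2;
   reg [31:0] loadstore_addr;
   reg [2:0]  funct3;              // 000:SB 001:SH 010:SW

   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;

   // Same expressions as in the text
   wire [31:0] mem_wdata;
   assign mem_wdata[ 7: 0] = rs2[7:0];
   assign mem_wdata[15: 8] = loadstore_addr[0] ? rs2[7:0]  : rs2[15: 8];
   assign mem_wdata[23:16] = loadstore_addr[1] ? rs2[7:0]  : rs2[23:16];
   assign mem_wdata[31:24] = loadstore_addr[0] ? rs2[7:0]  :
                             loadstore_addr[1] ? rs2[15:8] : rs2[31:24];

   wire [3:0] STORE_wmask =
      mem_byteAccess ?
         (loadstore_addr[1] ?
            (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
            (loadstore_addr[0] ? 4'b0010 : 4'b0001)
         ) :
      mem_halfwordAccess ?
         (loadstore_addr[1] ? 4'b1100 : 4'b0011) :
         4'b1111;

   // What a memory word equal to "old" becomes after the masked write.
   function [31:0] masked_write;
      input [31:0] old;
      input [31:0] wdata;
      input [3:0]  wmask;
      begin
         masked_write = { wmask[3] ? wdata[31:24] : old[31:24],
                          wmask[2] ? wdata[23:16] : old[23:16],
                          wmask[1] ? wdata[15: 8] : old[15: 8],
                          wmask[0] ? wdata[ 7: 0] : old[ 7: 0] };
      end
   endfunction

   integer errors = 0;
   reg [31:0] result;

   task check(input [2:0] f3, input [1:0] lsb, input [31:0] expected);
      begin
         funct3 = f3;
         loadstore_addr = {30'b0, lsb};
         #1; // let the combinatorial logic settle
         result = masked_write(32'hAAAAAAAA, mem_wdata, STORE_wmask);
         if(result !== expected) begin
            errors = errors + 1;
            $display("FAIL funct3=%b lsb=%b: got %h, expected %h",
                     f3, lsb, result, expected);
         end
      end
   endtask

   initial begin
      rs2 = 32'h11223344;
      check(3'b000, 2'd0, 32'hAAAAAA44);  // SB at byte 0
      check(3'b000, 2'd1, 32'hAAAA44AA);  // SB at byte 1
      check(3'b000, 2'd2, 32'hAA44AAAA);  // SB at byte 2
      check(3'b000, 2'd3, 32'h44AAAAAA);  // SB at byte 3
      check(3'b001, 2'd0, 32'hAAAA3344);  // SH in the lower halfword
      check(3'b001, 2'd2, 32'h3344AAAA);  // SH in the upper halfword
      check(3'b010, 2'd0, 32'h11223344);  // SW
      if(errors == 0) $display("all store checks passed");
      $finish;
   end
endmodule
```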
Let us now create additional states in the state machine:
localparam FETCH_INSTR = 0;
localparam WAIT_INSTR = 1;
localparam FETCH_REGS = 2;
localparam EXECUTE = 3;
localparam LOAD = 4;
localparam WAIT_DATA = 5;
localparam STORE = 6;
...
always @(posedge clk) begin
...
case(state)
...
EXECUTE: begin
if(!isSYSTEM) begin
PC <= nextPC;
end
state <= isLoad ? LOAD :
isStore ? STORE :
FETCH_INSTR;
end
LOAD: begin
state <= WAIT_DATA;
end
WAIT_DATA: begin
state <= FETCH_INSTR;
end
STORE: begin
state <= FETCH_INSTR;
end
endcase
end
end
The signals interfaced with the memory are driven as follows:
assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
PC : loadstore_addr ;
assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
assign mem_wmask = {4{(state == STORE)}} & STORE_wmask;
And, at last, a little thing: do not write back to the register bank if the instruction
is a `Store` !
assign writeBackEn = (state==EXECUTE && !isBranch && !isStore && !isLoad) ||
(state==WAIT_DATA) ;
Note: the `!isLoad` term that prevents writing `rd` during `EXECUTE`
can be removed from the condition, since `rd` will be overwritten right after,
during `WAIT_DATA`. It is there to have something easier
to understand in simulations.
try this Run step16.v in simulation and on the device. It copies 16 bytes from address 400 to address 800, then displays the values of the copied bytes.
You are here ! Congratulations ! You have finished implementing your first RV32I RISC-V core !
ALUreg | ALUimm | Jump | Branch | LUI | AUIPC | Load | Store | SYSTEM |
---|---|---|---|---|---|---|---|---|
[*] 10 | [*] 9 | [*] 2 | [*] 6 | [*] | [*] | [*] 5 | [*] 3 | [*] 1 |
But wait a minute, for sure we have worked a lot to implement a RISC-V core, but all I can see now is just something that looks like the stupid blinky at step 1 ! I want to see more !
To do so, we need to let our device communicate with the outside world with more than 5 LEDs.
Now the idea is to add devices to our SOC. We already have LEDs, which are plugged to
register `a0` (`x10`). Plugging devices on a register like that is not super elegant, it would
be better to have a special address in memory that is not actual RAM but that has
a register plugged to the LEDs. With this idea, one can add as many devices as one likes, by
assigning a virtual address to each device. Then the SOC will have address decoding hardware
that routes the data to the right device. As you will see, besides removing from the processor
the wires drawn from `x10` to the LEDs, this only requires some small modifications in the SOC.
Before starting to modify the SOC, the first thing to do is to decide about the "memory map", that is, which portion of the address space corresponds to what. In our system, we have 6 kB of RAM, so in practice we could say that addresses between 0 and 2^13-1 (8 kB, let us keep a power of two) correspond to RAM. I decided to use a larger portion of the address space for RAM (because we also have FPGAs with larger amounts of BRAM): the address space dedicated to RAM will be between 0 and 2^22-1 (that is, 4 MB of RAM).
Then, I decided to say that if bit 22 is set in an address, then this address
corresponds to a device. Now we need to specify how to select among multiple
devices. A natural idea is to use bits 0 to 21 as a "device index", but doing
so is going to require multiple 22-bit wide comparators, and on our IceStick,
it will eat up a significant portion of the remaining LUTs. A better idea,
suggested (once again) by Matthias Koch (@mecrisp), is to use 1-hot encoding,
that is, data is routed to device number `n` if bit `n` is set in the word address.
We will only consider "word addresses" (that is, ignore the two LSBs).
Doing that, we can only plug 20 different devices to our SOC, but it is still
much more than what we need. The advantage is that it dramatically simplifies
address decoding, in such a way that everything still fits in the IceStick.
To determine whether a memory request should be routed to the RAM or to the devices, we insert the following circuitry into the SOC:
wire [31:0] RAM_rdata;
wire [29:0] mem_wordaddr = mem_addr[31:2];
wire isIO = mem_addr[22];
wire isRAM = !isIO;
wire mem_wstrb = |mem_wmask;
The RAM is wired as follows:
Memory RAM(
.clk(clk),
.mem_addr(mem_addr),
.mem_rdata(RAM_rdata),
.mem_rstrb(isRAM & mem_rstrb),
.mem_wdata(mem_wdata),
.mem_wmask({4{isRAM}}&mem_wmask)
);
(note the `isRAM` signal ANDed with the write mask)
Now we can add the logic to wire our LEDs. They are
declared as a `reg` in the SOC module interface:
module SOC (
input CLK,
input RESET,
output reg [4:0] LEDS,
input RXD,
output TXD
);
driven by a simple block:
localparam IO_LEDS_bit = 0;
always @(posedge clk) begin
if(isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
LEDS <= mem_wdata;
end
end
Now we can write (yet another version of) our old good blinky:
LI(gp,32'h400000);
LI(a0,0);
Label(L1_);
SW(a0,gp,4);
CALL(LabelRef(wait_));
ADDI(a0,a0,1);
J(LabelRef(L1_));
First we load the base address of the IO page in `gp` (that is, `2^22`). To write
the LEDs value, we store `a0` to word address 1 (that is, address 4) in the IO page.
To make things easier when we have several devices (right after), let us write
some helper functions:
// Memory-mapped IO in IO page, 1-hot addressing in word address.
localparam IO_LEDS_bit = 0; // W five leds
// Converts an IO_xxx_bit constant into an offset in IO page.
function [31:0] IO_BIT_TO_OFFSET;
input [31:0] bit;
begin
IO_BIT_TO_OFFSET = 1 << (bit + 2);
end
endfunction
Then we can write to the LEDs as follows:
SW(a0,gp,IO_BIT_TO_OFFSET(IO_LEDS_bit));
OK, is that all you have, still your stupid blinky after 17 (!) tutorial steps ?

Sure, you are right man. Let us add a UART to allow our core to display stuff on a
virtual terminal. The IceStick (and many other FPGA boards) has a special chip
(an FTDI FT2232H, if you want to know) that
translates between the plain old RS232 serial protocol and USB. It is good news for
us, because RS232 is a simple protocol, much easier to implement than USB. In fact,
our core will communicate with the outside world through two pins (one for sending
data, called `TXD`, and one for receiving data, called `RXD`), and the FTDI chip
converts to the USB protocol for you. Moreover, it is a good idea not to reinvent
the wheel, and there are many existing implementations of UARTs
(Universal Asynchronous Receiver Transmitter, which implements the RS232 protocol)
in VERILOG. For our
purpose, for now we will only implement half of it (that is, the part that lets
our processor send data over it to display text in a terminal emulator).

Olof Kindgren has written a Tweet-size UART, more legible version here.
Let us insert it into our SOC and connect it:
// Memory-mapped IO in IO page, 1-hot addressing in word address.
localparam IO_LEDS_bit = 0; // W five leds
localparam IO_UART_DAT_bit = 1; // W data to send (8 bits)
localparam IO_UART_CNTL_bit = 2; // R status. bit 9: busy sending
...
wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DAT_bit];
wire uart_ready;
corescore_emitter_uart #(
.clk_freq_hz(`BOARD_FREQ*1000000),
.baud_rate(115200)
) UART(
.i_clk(clk),
.i_rst(!resetn),
.i_data(mem_wdata[7:0]),
.i_valid(uart_valid),
.o_ready(uart_ready),
.o_uart_tx(TXD)
);
wire [31:0] IO_rdata =
mem_wordaddr[IO_UART_CNTL_bit] ? { 22'b0, !uart_ready, 9'b0}
: 32'b0;
assign mem_rdata = isRAM ? RAM_rdata :
IO_rdata ;
The UART is mapped to two different addresses in memory space. The first one, which can only be written to, sends one character. The second one, which can only be read from, indicates whether the UART is ready (bit 9 = 0) or busy sending a character (bit 9 = 1).
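If you want to check which byte address ends up selecting which device, the sketch below prints the decoding signals for a RAM address and for the three IO addresses. It reuses `isIO`, `isRAM`, `mem_wordaddr` and the `IO_xxx_bit` constants from the SOC. The module name `io_decode_tb` and the task `show` are made up, and the argument of `IO_BIT_TO_OFFSET` is renamed `bitpos` here, only because `bit` is a reserved word for simulators running in SystemVerilog mode.

```verilog
// Illustrative bench, not part of the tutorial sources.
module io_decode_tb;
   localparam IO_LEDS_bit      = 0;
   localparam IO_UART_DAT_bit  = 1;
   localparam IO_UART_CNTL_bit = 2;

   // Converts an IO_xxx_bit constant into an offset in IO page.
   function [31:0] IO_BIT_TO_OFFSET;
      input [31:0] bitpos;
      begin
         IO_BIT_TO_OFFSET = 1 << (bitpos + 2);
      end
   endfunction

   reg  [31:0] mem_addr;
   wire [29:0] mem_wordaddr = mem_addr[31:2];
   wire        isIO  = mem_addr[22];
   wire        isRAM = !isIO;

   // Prints the decoding signals for one byte address.
   task show(input [31:0] addr);
      begin
         mem_addr = addr;
         #1; // let the combinatorial logic settle
         $display("addr=%h isRAM=%b isIO=%b LEDS=%b UART_DAT=%b UART_CNTL=%b",
                  mem_addr, isRAM, isIO,
                  isIO & mem_wordaddr[IO_LEDS_bit],
                  isIO & mem_wordaddr[IO_UART_DAT_bit],
                  isIO & mem_wordaddr[IO_UART_CNTL_bit]);
      end
   endtask

   initial begin
      show(32'd400);                                            // plain RAM
      show(32'h400000 + IO_BIT_TO_OFFSET(IO_LEDS_bit));         // LEDs
      show(32'h400000 + IO_BIT_TO_OFFSET(IO_UART_DAT_bit));     // UART data
      show(32'h400000 + IO_BIT_TO_OFFSET(IO_UART_CNTL_bit));    // UART status
      $finish;
   end
endmodule
```

Note that address 400 (word address 100) also has bit 2 of its word address set: this is why the device selection always needs to be combined with `isIO` (or, for reads, with the `isRAM` selection of `mem_rdata`).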
Now our processor has more possibilities to communicate with the outside world than the poor five LEDs we had before ! Let us implement a function to send a character:
Label(putc_);
// Send character to UART
SW(a0,gp,IO_BIT_TO_OFFSET(IO_UART_DAT_bit));
// Read UART status, and loop until bit 9 (busy sending)
// is zero.
LI(t0,1<<9);
Label(putc_L0_);
LW(t1,gp,IO_BIT_TO_OFFSET(IO_UART_CNTL_bit));
AND(t1,t1,t0);
BNEZ(t1,LabelRef(putc_L0_));
RET();
It writes the character to the UART address projected in IO space, then loops while the UART status indicates that it is busy sending a character.
Try this run step17.v in simulation.
Wait a minute in simulation, how does it know how to display something ?
It's because I cheated a bit, I added the following block of code to the SOC:
`ifdef BENCH
always @(posedge clk) begin
if(uart_valid) begin
$write("%c", mem_wdata[7:0] );
$fflush(32'h8000_0001);
end
end
`endif
(the magic constant argument to `$fflush()` corresponds to `stdout`, you need to
do that else you do not see anything on the terminal until the output buffer
of `stdout` is full). Doing so we do not test the UART in simulation (it is completely bypassed).
I trust Olof that it works fine, but to do things properly, it would be better to plug something
on the simulated `TXD` signal, decode the RS232 protocol and display the characters (we'll see
examples of this type of simulation later on).
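To give an idea of what "plugging something on the simulated `TXD` signal" could look like, here is a minimal sketch of a bench-side decoder. It is not the bench used later in the tutorial. It assumes the usual 8N1 framing (one start bit at 0, 8 data bits LSB first, one stop bit at 1) at 115200 bauds, and since the real UART is not included here, a small task (`send_byte`, made up for this example, like the module name `uart_decode_tb`) generates the serial signal instead.

```verilog
// Illustrative bench, not part of the tutorial sources.
`timescale 1ns/1ps
module uart_decode_tb;
   // Assumption: 115200 bauds, so one bit lasts about 8680 ns.
   localparam BIT_NS = 8680;

   reg txd = 1'b1;   // the line is high when idle

   // Stands in for the real UART: serializes one character on txd.
   task send_byte(input [7:0] c);
      integer i;
      begin
         txd = 1'b0; #(BIT_NS);          // start bit
         for(i = 0; i < 8; i = i + 1) begin
            txd = c[i]; #(BIT_NS);       // data bits, LSB first
         end
         txd = 1'b1; #(BIT_NS);          // stop bit
      end
   endtask

   // Decoder: waits for a start bit, then samples each data bit
   // in the middle of its time slot.
   reg [7:0] rx;
   integer k;
   always begin
      @(negedge txd);                    // beginning of the start bit
      #(BIT_NS + BIT_NS/2);              // skip it, go to the middle of bit 0
      for(k = 0; k < 8; k = k + 1) begin
         rx[k] = txd;
         #(BIT_NS);
      end
      $write("%c", rx);                  // we are now inside the stop bit
   end

   initial begin
      #(BIT_NS);                         // let the decoder start waiting
      send_byte("H");
      send_byte("i");
      send_byte(8'd10);                  // newline
      #(BIT_NS);
      $finish;
   end
endmodule
```

In a real bench, `txd` would be connected to the `TXD` output of the SOC instead of being driven by `send_byte`, and the decoder part would stay the same.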
Try this run step17.v on device.
To display what's sent to the UART, use:
$ ./terminal.sh
Note: edit `terminal.sh` and choose your favourite terminal emulator in there. You may also
need to change `DEVICE=/dev/ttyUSB1` according to your local configuration.
Now that we have a functional RISC-V processor and a SOC with an UART that can send characters to a virtual terminal, let us rest a little bit with a purely software step. In this step, we are going to write a program in RISC-V assembly that computes a crude, ASCII-art version of the Mandelbrot set.
Our "image" will be made of 80x80 characters. So let us start by writing a program that fills
the image with "*" characters. To do that, we will use two nested loops. The X coordinate
(inner loop) will be stored in `s0` and the Y coordinate (outer loop) in `s1`. The upper bound (80) will be stored
in `s11`. The program looks like that:
LI(gp,32'h400000); // IO page
LI(s1,0);
LI(s11,80);
Label(loop_y_);
LI(s0,0);
Label(loop_x_);
LI(a0,"*");
CALL(LabelRef(putc_));
ADDI(s0,s0,1);
BNE(s0,s11,LabelRef(loop_x_));
LI(a0,13);
CALL(LabelRef(putc_));
LI(a0,10);
CALL(LabelRef(putc_));
ADDI(s1,s1,1);
BNE(s1,s11,LabelRef(loop_y_));
EBREAK();
(and we copy the `putc` function from the previous example).
Fixed point. So now we want to compute the Mandelbrot set. To do that, we need to manipulate real numbers.
Unfortunately, our super simplistic RISC-V core is not able to directly manipulate floating point
numbers. The C compiler's support library `libgcc` has some functions to support them, but we will
see later how to use them. For now, the idea is to compute the Mandelbrot set using fixed-point
numbers, that is, in an integer number, we will use some bits to represent the fractional part
(10 bits in our case), and some bits to represent the integer part (22 bits in our case). In other
words, it means that if we want to represent a real number `x`, we will store (the integer part of)
`x*2^10` in a register. It is similar to floating point numbers, except that the exponent is fixed
(the stored integer is always scaled by `2^-10`). We will use the following constants in our program:
`define mandel_shift 10
`define mandel_mul (1 << `mandel_shift)
Now, to compute the sum or the difference of two numbers, it does not change anything, because
the `2^10` factor is the same for both numbers to be added (or subtracted). For a product it
is a different story, because when you compute `x*y`, the actual computation that you do is
`x*2^10*y*2^10`, so what you get is `(x*y)*2^20`, and you wanted `(x*y)*2^10`, so you need to
divide by `2^10` (right shift by `10`). OK, that's good, but how do we compute the product
of two integer numbers stored in two registers ? Our processor has no `MUL` instruction ! In fact
it is possible to add a `MUL` instruction (it is part of the RV32M instruction set), we will see
that later, but it will not fit within our tiny IceStick ! So what can we do ? We can implement
a function that takes two numbers in `a0` and `a1`, computes their product and returns it in `a0`.
The C compiler support library `libgcc` has one (it is what is used when compiling C for small
RV32I RISC-V processors that do not have the `MUL` instruction, like ours). The source code of
this function is here.
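Before writing this in assembly, the fixed-point rules above can be tried on a small example in simulation. The sketch below represents 1.5 and -0.25 with 10 fractional bits, then computes their sum and their product. The module name `fixed_point_tb` and the values are made up for this example, and the constants are declared as `localparam` instead of `` `define `` only to keep the example in a single module.

```verilog
// Illustrative bench, not part of the tutorial sources.
module fixed_point_tb;
   localparam mandel_shift = 10;
   localparam mandel_mul   = 1 << mandel_shift;

   integer x, y, sum, prod;

   initial begin
      x = (3 * mandel_mul) / 2;        //  1.5  is stored as  1536
      y = -(mandel_mul / 4);           // -0.25 is stored as  -256
      sum  = x + y;                    // nothing special to do for a sum
      prod = (x * y) >>> mandel_shift; // a product needs to be shifted back
      $display("x    = %0d (%f)", x,    $itor(x)    / mandel_mul);
      $display("y    = %0d (%f)", y,    $itor(y)    / mandel_mul);
      $display("sum  = %0d (%f)", sum,  $itor(sum)  / mandel_mul);
      $display("prod = %0d (%f)", prod, $itor(prod) / mandel_mul);
      if(sum == 1280 && prod == -384) $display("fixed-point checks passed");
      else                            $display("unexpected result");
      $finish;
   end
endmodule
```

Here 1280 is 1.25 and -384 is -0.375 in our representation. Note the arithmetic shift (`>>>`), which keeps the sign of the product, like `SRAI` does on the processor.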
Let us port it to our VERILOG RISC-V assembler (that has a slightly different syntax unfortunately,
we will see later how to directly use gcc and gas):
// Multiplication routine,
// Input in a0 and a1
// Result in a0
Label(mulsi3_);
MV(a2,a0);
LI(a0,0);
Label(mulsi3_L0_);
ANDI(a3,a1,1);
BEQZ(a3,LabelRef(mulsi3_L1_));
ADD(a0,a0,a2);
Label(mulsi3_L1_);
SRLI(a1,a1,1);
SLLI(a2,a2,1);
BNEZ(a1,LabelRef(mulsi3_L0_));
RET();
(do not forget to declare the new labels before the `initial` block).
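If the loop above looks a bit mysterious, it can help to write the same shift-and-add algorithm as a plain Verilog function and compare it with the `*` operator of the simulator. This is only a sketch to understand the algorithm, it is not meant to be synthesized. The module name `mulsi3_tb` and the task `check` are made up, and the local variables `a2`, `a1` and `acc` play the role of the registers `a2`, `a1` and `a0` of the routine.

```verilog
// Illustrative bench, not part of the tutorial sources.
module mulsi3_tb;
   // Same shift-and-add loop as the mulsi3 routine.
   function [31:0] mulsi3;
      input [31:0] a;
      input [31:0] b;
      reg [31:0] a2, a1, acc;
      begin
         a2 = a; a1 = b; acc = 0;
         while(a1 != 0) begin
            if(a1[0]) acc = acc + a2;  // add a*2^k if bit k of b is set
            a1 = a1 >> 1;              // next bit of b
            a2 = a2 << 1;              // a*2^(k+1)
         end
         mulsi3 = acc;
      end
   endfunction

   integer errors = 0;

   // Compares mulsi3 with the 32-bit product computed by the simulator.
   task check(input [31:0] a, input [31:0] b);
      begin
         if(mulsi3(a,b) !== (a * b)) begin
            errors = errors + 1;
            $display("FAIL for %0d * %0d", $signed(a), $signed(b));
         end else begin
            $display("%0d * %0d = %0d",
                     $signed(a), $signed(b), $signed(mulsi3(a,b)));
         end
      end
   endtask

   initial begin
      check(6, 7);
      check(-3, 5);
      check(5, -3);
      check(1536, -256);   // the fixed-point values 1.5 and -0.25
      if(errors == 0) $display("all products match");
      $finish;
   end
endmodule
```

Negative numbers need no special treatment: everything is computed modulo 2^32, which is exactly what two's complement arithmetic expects. A negative `b` simply makes the loop run for the full 32 iterations.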
So now, before displaying the Mandelbrot set, to test our fixed-point
computation idea, let us display a simpler shape, that is, we consider
we are visualizing the `[-2.0,2.0]x[-2.0,2.0]` square (mapped to our
30x30 characters display), and we want to display a disk of radius 2
centered on `(0,0)`. To do that, we first need to compute the (fixed point)
coordinates `x,y`. They will be stored in `s2` and `s3`. Then we need to
compute `x^2+y^2`. We can do that by invoking the `mulsi3` routine twice
(do not forget to right-shift the result by 10). Finally, we compare
the result with `4 << 10` (4 because it is the squared radius, and shifted
to the left by 10 because of our fixed-point representation), to decide
whether the point is inside or outside the disk, and use a different character
to display it. The corresponding program looks like that:
`define mandel_shift 10
`define mandel_mul (1 << `mandel_shift)
`define xmin (-2*`mandel_mul)
`define xmax ( 2*`mandel_mul)
`define ymin (-2*`mandel_mul)
`define ymax ( 2*`mandel_mul)
`define dx ((`xmax-`xmin)/30)
`define dy ((`ymax-`ymin)/30)
`define norm_max (4 << `mandel_shift)
integer loop_y_ = 28;
integer loop_x_ = 36;
integer in_disk_ = 92;
initial begin
LI(gp,32'h400000); // IO page
LI(s1,0);
LI(s3,`ymin); // s3 = y
LI(s11,30);
LI(s10,`norm_max);
Label(loop_y_);
LI(s0,0);
LI(s2,`xmin); // s2 = x
Label(loop_x_);
MV(a0,s2);
MV(a1,s2);
CALL(LabelRef(mulsi3_));
SRLI(s4,a0,`mandel_shift); // s4 = x*x
MV(a0,s3);
MV(a1,s3);
CALL(LabelRef(mulsi3_));
SRLI(s5,a0,`mandel_shift); // s5 = y*y
ADD(s6,s4,s5); // s6 = x*x+y*y
LI(a0,"*");
BLT(s6,s10,LabelRef(in_disk_)); // if x*x+y*y < 4
LI(a0," ");
Label(in_disk_);
CALL(LabelRef(putc_));
ADDI(s0,s0,1);
ADDI(s2,s2,`dx);
BNE(s0,s11,LabelRef(loop_x_));
LI(a0,13);
CALL(LabelRef(putc_));
LI(a0,10);
CALL(LabelRef(putc_));
ADDI(s1,s1,1);
ADDI(s3,s3,`dy);
BNE(s1,s11,LabelRef(loop_y_));
EBREAK();
and the output looks like that:
***********
***************
******************
*********************
***********************
************************
*************************
***************************
***************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
*****************************
***************************
***************************
*************************
*************************
***********************
*********************
*******************
***************
***********
Now to compute the Mandelbrot set, we need to iterate the following operation:
Z <- 0; iter <- 0
do
Z <- Z^2 + C
iter <- iter + 1
while |Z| < 2
where `Z` and `C` are complex numbers. `C = x + iy` corresponds to the current pixel.
Remembering the rule for complex number multiplication (`i*i = -1`), we can compute
`Z^2 = (Zr + i*Zi)^2 = Zr^2-Zi^2 + 2*i*Zr*Zi`. The loop that computes these iterates
reads:
Label(loop_Z_);
MV(a0,s4); // Zrr <- (Zr*Zr) >> mandel_shift
MV(a1,s4);
CALL(LabelRef(mulsi3_));
SRLI(s6,a0,`mandel_shift);
MV(a0,s4); // Zri <- (Zr*Zi) >> (mandel_shift-1)
MV(a1,s5);
CALL(LabelRef(mulsi3_));
SRAI(s7,a0,`mandel_shift-1);
MV(a0,s5); // Zii <- (Zi*Zi) >> (mandel_shift)
MV(a1,s5);
CALL(LabelRef(mulsi3_));
SRLI(s8,a0,`mandel_shift);
SUB(s4,s6,s8); // Zr <- Zrr - Zii + Cr
ADD(s4,s4,s2);
ADD(s5,s7,s3); // Zi <- 2Zri + Ci
ADD(s6,s6,s8); // if norm > norm max, exit loop
LI(s7,`norm_max);
BGT(s6,s7,LabelRef(exit_Z_));
ADDI(s10,s10,-1); // iter--, loop if non-zero
BNEZ(s10,LabelRef(loop_Z_));
Label(exit_Z_);
In the end, we display different characters depending on the value
of `iter` (`s10`) when the loop is exited:
Label(exit_Z_);
LI(a0,colormap_);
ADD(a0,a0,s10);
LBU(a0,a0,0);
CALL(LabelRef(putc_));
where the "colormap" is an array of characters that mimic different "intensities", from the darkest to the brightest:
Label(colormap_);
DATAB(" ",".",",",":");
DATAB(";","o","x","%");
DATAB("#","@", 0 , 0 );
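If you want to play with the algorithm before running it on the core, the same fixed-point computation can be written directly in a Verilog bench, using the `*` operator of the simulator instead of `mulsi3`. This is a sketch under several assumptions: the image size (40x20), the maximum number of iterations (9), the starting point `Z = 0` taken from the pseudo-code, and the names `mandel_tb`, `W`, `H`, `max_iter` and `escaped` are all choices made for this example, they are not taken from step18.v.

```verilog
// Illustrative bench, not part of the tutorial sources.
module mandel_tb;
   localparam mandel_shift = 10;
   localparam mandel_mul   = 1 << mandel_shift;
   localparam xmin = -2*mandel_mul, xmax = 2*mandel_mul;
   localparam ymin = -2*mandel_mul, ymax = 2*mandel_mul;
   localparam W = 40, H = 20;               // assumption: image size
   localparam dx = (xmax-xmin)/W;
   localparam dy = (ymax-ymin)/H;
   localparam norm_max = 4 << mandel_shift;
   localparam max_iter = 9;                 // assumption: one per colormap entry

   // Same ten characters as the colormap (first character in the MSBs)
   reg [8*10-1:0] colormap = " .,:;ox%#@";

   integer px, py, Cr, Ci, Zr, Zi, Zrr, Zii, Zri, iter;
   reg escaped;

   initial begin
      for(py = 0; py < H; py = py + 1) begin
         Ci = ymin + py*dy;
         for(px = 0; px < W; px = px + 1) begin
            Cr = xmin + px*dx;
            Zr = 0; Zi = 0;
            iter = max_iter;
            escaped = 0;
            while(iter != 0 && !escaped) begin
               Zrr = (Zr*Zr) >>> mandel_shift;
               Zii = (Zi*Zi) >>> mandel_shift;
               Zri = (Zr*Zi) >>> (mandel_shift-1);  // 2*Zr*Zi
               Zr  = Zrr - Zii + Cr;
               Zi  = Zri + Ci;
               if(Zrr + Zii > norm_max) escaped = 1;
               else                     iter = iter - 1;
            end
            // Character number iter of the colormap
            $write("%c", colormap[8*(9-iter) +: 8]);
         end
         $write("\n");
      end
      $finish;
   end
endmodule
```

As in the assembly version, `iter` counts down, so points that never escape (inside the set) are displayed with the first character of the colormap.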
Try that run step18.v in simulation and on the device. Modify it to draw your own graphics (for instance, try drawing "concentric circles" using the "colormap").
As you have seen in Step 18, simulation is much slower than running the design on the device. However, there is
another tool, called `verilator`, that lets you convert a VERILOG design into C++. Then you compile the C++, and you
have a simulation that is much faster than icarus/iverilog. Let us first install verilator:
$ apt-get install verilator
Before transforming our design into C++, we will have to create a "bench", that is, some C++ code that will generate the
signals for our design, and that will declare the C++ `main()` function. The main role of the main function is to declare
an object of class `VSOC` (generated from our `SOC` module), and wiggle its `CLK` signal. Each time the `CLK` signal is
changed, you need to call the `eval()` function to take the change into account. The `sim_main.cpp`
file is as follows:
#include "VSOC.h"
#include "verilated.h"
#include <iostream>
int main(int argc, char** argv, char** env) {
VSOC top;
top.CLK = 0;
while(!Verilated::gotFinish()) {
top.CLK = !top.CLK;
top.eval();
}
return 0;
}
In addition, in sim_main.cpp, there is some code to decode whenever the LEDs change, and display their status.
To convert a design to C++, use the following command:
$ verilator -DBENCH -DBOARD_FREQ=12 -Wno-fatal --top-module SOC -cc -exe sim_main.cpp step18.v
Then to compile the C++ and run the generated program:
$ cd obj_dir
$ make -f VSOC.mk
$ ./VSOC
As you can see, it is much faster than icarus/iverilog ! For a small design, it does not make a huge difference, but believe me, when you are developing an RV32IMFC core, with an FPU, it is good to have efficient simulation !
To make things easier, there is a `run_verilator.sh` script, that you can invoke as follows:
$ run_verilator.sh step18.v
At this step, you may have the feeling that our RISC-V design is just a toy, for educational purpose, far away from "the real thing". In fact, at this step, you will start feeling that what you have done is as real as any other RISC-V processor ! What makes a processor interesting is the software you can run on it, hence if our thingy can run any software written for a (RV32I) RISC-V processor, then it is a RV32I RISC-V processor.
Wait a minute, but what we have used up to now to write the software is the VERILOG assembler, it is just a toy, different from the real thing, no ?
In fact, the VERILOG assembler generates exactly the same machine code as any other RISC-V assembler. We could use instead any other RISC-V assembler, load the generated machine code into our design and run it !
To do so, VERILOG has a `$readmemh()` system task, that loads the data to
initialize a memory from an external file. It is used as follows in
step20.v:
initial begin
$readmemh("firmware.hex",MEM);
end
where `firmware.hex` is an ASCII file with the initial content of `MEM`
in hexadecimal.
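To see what such a file looks like, the sketch below first writes a tiny hexadecimal file (one 32-bit word per line), then loads it with `$readmemh()` and prints the result. The file name `demo_firmware.hex` and the module name `readmemh_tb` are made up for this example, and the file is generated by the bench itself only to keep the example self-contained (normally it comes from the toolchain). The two first words are the standard encodings of `nop` and `ebreak`.

```verilog
// Illustrative bench, not part of the tutorial sources.
module readmemh_tb;
   reg [31:0] MEM [0:255];
   integer f, i;

   initial begin
      // Generate a small hex file, one word per line.
      f = $fopen("demo_firmware.hex", "w");
      $fdisplay(f, "00000013");   // nop (addi x0,x0,0)
      $fdisplay(f, "00100073");   // ebreak
      $fdisplay(f, "04030201");   // some data
      $fclose(f);

      // Load it in words 0 to 2 of MEM. The two last arguments are optional,
      // they are used here because the file is shorter than the memory.
      $readmemh("demo_firmware.hex", MEM, 0, 2);

      for(i = 0; i < 3; i = i + 1) begin
         $display("MEM[%0d] = %h", i, MEM[i]);
      end
      $finish;
   end
endmodule
```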
So if we want to use an external assembler, all we have to do is figure out the following things:
- how to compile RISC-V assembly code using GNU tools
- how to tell GNU tools about the device we have created (RAM start address, RAM amount)
- how to convert the output of GNU tools into a file that `$readmemh()` can understand
OK, let us start with a simple blinker, in blinker.S:
# Simple blinker
.equ IO_BASE, 0x400000
.equ IO_LEDS, 4
.section .text
.globl start
start:
li gp,IO_BASE
li sp,0x1800
.L0:
li t0, 5
sw t0, IO_LEDS(gp)
call wait
li t0, 10
sw t0, IO_LEDS(gp)
call wait
j .L0
wait:
li t0,1
slli t0, t0, 17
.L1:
addi t0,t0,-1
bnez t0, .L1
ret
As you can see, it is very similar to the code we wrote up to now in the VERILOG assembler. In this program, we have three different things:
- main program
- utilities, here the `wait` function
- setup, that is, initializing `gp` and `sp`

So we will split the file into three parts:
- FIRMWARE/blinker.S with the `main` function
- FIRMWARE/wait.S with the `wait` function
- FIRMWARE/start.S with the setup code, that calls `main` in the end.
To compile it, you will need to install the RISC-V toolchain (compiler, assembler, linker) on your machine. Our makefile can do that for you:
$ cd learn-fpga/FemtoRV
$ make ICESTICK.firmware_config
Note: always use `ICESTICK.firmware_config`, even if you have a larger board,
it will configure the makefiles for an `RV32I` build (and that's what our processor
supports).

This will download some files and unpack them in `learn-fpga/FemtoRV/FIRMWARE/TOOLCHAIN`.
Add the `riscv64-unknown-elf-gcc..../bin/` directory to your path.
Now to compile our program:
$ cd learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/FIRMWARE
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax start.S -o start.o
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax blinker.S -o blinker.o
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax wait.S -o wait.o
We specify the architecture (`rv32i`) that corresponds to the instructions
supported by our processor and the ABI (`ilp32`) that corresponds to the way functions
are called. The `-mno-relax` option concerns the `gp` register that we use for
accessing the IO page (so we do not let the assembler use it for anything else).

This generates object files (`.o`). We now need to generate an executable from them,
by invoking the linker. The linker will determine where our code and data should
be placed in memory. For that, we need to specify how the memory in our
device is organized, in a linker script (FIRMWARE/bram.ld):
MEMORY
{
BRAM (RWX) : ORIGIN = 0x0000, LENGTH = 0x1800 /* 6kB RAM */
}
SECTIONS
{
everything :
{
. = ALIGN(4);
start.o (.text)
*(.*)
} >BRAM
}
A linker script contains a description of `MEMORY`. In our case, there is a single
segment of 6 kB of memory, that we call `BRAM`. It starts from address `0x0000`.
Then we have `SECTIONS`, that indicates what goes where (or which segment goes
to which memory). In our case, it is super simple: everything goes to BRAM.
We also indicate that the content of `start.o` should be installed first in memory.
The linker is invoked as follows:
$ riscv64-unknown-elf-ld blinker.o wait.o -o blinker.bram.elf -T bram.ld -m elf32lriscv -nostdlib -norelax
It generates an "elf" executable ("elf" stands for Executable and Linkable Format). It is the
same format as the binaries in a Linux system. The option `-T bram.ld` tells it to use our
linker script. The option `-m elf32lriscv` indicates that
we are generating a 32-bit executable. We are not using the C stdlib for now (`-nostdlib`) and
we keep `gp` for ourselves (`-norelax`). We do not need to have `start.o` on the command line
in the list of objects to link, because it is already included in the linker script `bram.ld`.
We are not completely done: we now need to extract the relevant information from the elf executable,
and generate a file with all the machine code in hexadecimal, so that VERILOG's `$readmemh()` function
can understand it. For that, I wrote a `firmware_words` utility, that understands the elf file format,
extracts the parts that are interesting for us and writes them in ASCII hexadecimal:
$ make blinker.bram.hex
Note: you can invoke `make xxxx.bram.hex` directly, it will invoke the assembler, linker and
elf conversion utility for you automatically.
Now you can run the example in simulation and on the device:
$ cd ..
$ ./run_verilator.sh step20.v
$ BOARDS/run_xxx.sh step20.v
Now that things are easier, we can write more complicated programs. Let us see how
to write the famous "hello world" program. What we need is a `putstring` routine to display
a string on the tty. It takes as input the address of the first character of the string
to display in `a0`. We just need to loop on all characters of the string, call `putchar`
for each character, and exit the loop as soon as we find a null character:
# Warning, buggy code ahead !
putstring:
mv t2,a0
.L2: lbu a0,0(t2)
beqz a0,.L3
call putchar
addi t2,t2,1
j .L2
.L3: ret
Have you seen the comment ? It means the code above has an error, can you spot it ? A hint: `putstring` is a function that calls a function. Don't we need to do something special in this case ?
Do you remember what `call` and `ret` do ? Yes, `call` stores `PC+4` in `ra` then jumps to the function, and `ret` jumps to the address in `ra`. Now suppose that somebody called our `putstring` function. When we enter the function, `ra` contains the address we are supposed to jump to when reaching the `ret` statement in `putstring`. But inside `putstring`, we call `putchar`, and it overwrites `ra` with the address right after the call, so that `putchar` is able to jump there when it returns, but `putstring` will jump there as well, which is not what we want. To avoid that, we need to save `ra` at the beginning of `putstring`, and restore it at the end. To do that, we use the stack as follows:
putstring:
addi sp,sp,-4 # save ra on the stack
sw ra,0(sp) # (need to do that for functions that call functions)
mv t2,a0
.L2: lbu a0,0(t2)
beqz a0,.L3
call putchar
addi t2,t2,1
j .L2
.L3: lw ra,0(sp) # restore ra
addi sp,sp,4 # restore sp
ret
The function can be used as follows:
la a0, hello
call putstring
...
hello:
.asciz "Hello, world !\n"
The `la` (load address) pseudo-instruction loads the address of the string in `a0`. The string is declared with a standard label, and the `.asciz` directive that generates a zero-terminated string.
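For comparison, the same loop can be sketched in C. This is a host-side illustration, not part of the firmware: `put_string`, `capture_char` and the `emit` callback are hypothetical names, and the callback stands in for `putchar` so that the sketch can be tried anywhere. When a C compiler targets RISC-V, it generates the save and restore of `ra` for us.

```c
#include <assert.h>
#include <stddef.h>

/* Send each character of a zero-terminated string to emit(),
   stopping at the null character. Returns the number of
   characters sent. */
size_t put_string(const char *s, void (*emit)(char)) {
    assert(s != NULL && emit != NULL);
    size_t count = 0;
    while (*s != '\0') {
        emit(*s);
        ++s;
        ++count;
    }
    return count;
}

/* Stand-in for putchar: collects the characters in a buffer. */
static char   captured[64];
static size_t captured_len = 0;

static void capture_char(char c) {
    if (captured_len < sizeof captured) {
        captured[captured_len++] = c;
    }
}
```

Calling `put_string("Hello", capture_char)` leaves the five characters in `captured`, in order.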
Try this: compile `hello.S` (`cd FIRMWARE; make hello.bram.hex`) and test it in simulation and on device.
Try also `mandelbrot.S`. As you can see, FIRMWARE/mandelbrot.S does not have the `__mulsi3` function. If you take a look at FIRMWARE/Makefile, the executable is linked with the right version of `libgcc.a` (for RV32I), that has it.
Now you can start having a feeling that your processor is a real thing: when you run the Mandelbrot example, it executes code on your processor that was written by somebody else. Can we go further and run code generated by standard tools ?
Let us see now how we can write code in C for our processor. At this point, we are able to generate object files (`.o`) and produce an ELF executable from them using the linker. Our linker script ensures that everything goes at the right place in memory, then our processor can execute the code, first the content of `start.S`, implanted at address 0, that calls in turn the `main` function. Up to now our programs were completely written in assembly. The nice thing with the ABI (Application Binary Interface), that we have seen at steps 13 and 14, is that it makes it possible to combine object files (`.o`) produced by different tools, as long as they respect the ABI, which is the case (of course) of the C compiler.
The example FIRMWARE/sieve.c, taken from the examples in picorv is a good candidate. It is interesting, it does multiplications, divisions and modulos using integer numbers. These operations are not implemented by our RV32I core, but they are supported by the compiler using functions in `libgcc.a`, and since we link with `libgcc.a`, this will work. However, the program also uses `printf()` to display the result, and this function is defined in `libc.a`. In principle, it would be possible to use it, but `printf()` supports so many formats that its code is too large and will not fit in our 6 kB of RAM. For this reason, we include a much smaller / much simpler version in FIRMWARE/print.c (also taken from picorv), and included in the objects to be linked with executables.
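To give an idea of what such a support routine does, here is a minimal shift-and-add multiplication in C that only uses additions, shifts and tests, all of which RV32I provides. It is a sketch of the general technique: `soft_mul` and `soft_mul_selfcheck` are names chosen for this illustration, and the actual implementation in libgcc may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 32-bit numbers without a multiply instruction:
   for each set bit of b, add the correspondingly shifted a.
   The result is the low 32 bits of the product. */
uint32_t soft_mul(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return result;
}

/* Small self-check, handy when trying the sketch on a host machine. */
void soft_mul_selfcheck(void) {
    assert(soft_mul(6u, 7u) == 42u);
    assert(soft_mul(0u, 123u) == 0u);
}
```

The loop runs at most 32 times, once per bit of `b`, which is why a software multiplication costs many more cycles than a hardware one.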
There are two other examples, a C version of the Mandelbrot program: FIRMWARE/mandel_C.c. It uses ANSI colors to display low-resolution "graphics" in the terminal. There is also FIRMWARE/riscv_logo.c that displays a spinning Risc-V logo (in a 90-ish demoscene style !).
Try this: compile `sieve.c` (`cd FIRMWARE; make sieve.bram.hex`) and test it in simulation (`./run_verilator.sh step20.v`) and on device (`BOARDS/run_xxx.sh step20.v; ./terminal.sh`).
Try the other programs. Write your own programs (if you do not have an idea, try for instance cellular automata, Life ...).
Note: the Verilator framework can directly load ELF executables in simulation (no need to regenerate `firmware.hex`). You can generate all demo programs: `cd FIRMWARE; make hello.bram.elf mandelbrot.bram.elf mandel_C.bram.elf riscv_logo.bram.elf;cd ..`, then run the one that you want using `./run_verilator.sh step20.v FIRMWARE/mandel_C.bram.elf` or `./obj_dir/FIRMWARE/mandel_C.bram.elf`.
Now you can see that your processor is not just a toy, it is a real RISC-V processor on which you can run programs produced by standard tools !
Note: on the IceStick, we only have `6kB` of RAM, so only tiny programs will fit. If the compiled program is larger than `6kB` then you will get an error. A more problematic case is a program that nearly fills the whole BRAM: then we have nearly no space for the stack, and the stack will overwrite the rest, putting the CPU in an invalid state, probably frozen. This situation is difficult to understand / to debug when you encounter it, so `firmware_words` displays a big warning message whenever the generated code fills more than 95% of the BRAM.
and some optimizations in the processor
On the IceStick, there are only 8 blocks of 1 kB of BRAM, and since we need to use two of them for the registers, this leaves only 6 kB of RAM for our programs. It is sufficient for small programs like Mandelbrot or little graphic demos, but you will very soon reach the limit. The IceStick has a little chip (see figure) with 4 MBs of FLASH memory (other boards have a similar chip). When you synthesize a design, it is stored in this FLASH memory. On startup, the FPGA loads its configuration from this chip. The nice thing is that the FPGA configuration takes no more than a few kilobytes, this leaves us a lot of space to store our own data. But we will need to create some additional hardware to communicate with this chip.
As you can see on the figure, this chip only has 8 legs, how can we address 4 MBs of data using 8 pins only ? In fact, this chip uses a serial protocol (SPI). To access data, one sends the address to be read on a pin, one bit at a time, then the chip sends the data back on another pin, one bit at a time. If you want to learn more about it, my notes about SPI flash are here and the VERILOG implementation is in spi_flash.v. It supports different protocols, depending on the used number of pins and whether pins are bidirectional.
The `MappedSPIFlash` module has the following interface:
module MappedSPIFlash(
input wire clk,
input wire rstrb,
input wire [19:0] word_address,
output wire [31:0] rdata,
output wire rbusy,
output wire CLK,
output reg CS_N,
inout wire [1:0] IO
);
signal | description |
---|---|
clk | system clock |
rstrb | read strobe, goes high whenever processor wants to read a word |
word_address | address of the word to be read |
rdata | data read from memory |
rbusy | asserted if busy receiving data |
CLK | clock pin of the SPI flash chip |
CS_N | chip select pin of the SPI flash chip, active low |
IO | two bidirectional pins for sending and receiving data |
Now the idea is to modify our SOC in such a way that some addresses correspond to the SPI flash. First we need to decide how it will be projected into the memory space of our processor. The idea is to use bit 23 of memory addresses to select the SPI Flash. Bit 22 is for IO (LEDs, UART). In addition, for IO, we need to check that bit 23 is zero. And if both bits 23 and 22 are zero, then we are in BRAM. So our memory space is decomposed into four "quadrants" depending on bits 23 and 22, and we use three of them.
Then we have the different signals to discriminate the different zones of our memory:
wire isSPIFlash = mem_addr[23];
wire isIO = mem_addr[23:22] == 2'b01;
wire isRAM = mem_addr[23:22] == 2'b00;
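The same decoding can be modelled in a few lines of C, which is convenient for checking by hand where a given address lands. This is a sketch that mirrors the three wires above, assuming 24-bit addresses; `region_t` and `decode_region` are names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { REGION_RAM, REGION_IO, REGION_SPI_FLASH } region_t;

/* Mirror of the isSPIFlash / isIO / isRAM wires. */
region_t decode_region(uint32_t addr) {
    assert(addr < (UINT32_C(1) << 24));   /* assumption: 24-bit addresses */
    unsigned bit23 = (unsigned)((addr >> 23) & 1u);
    unsigned bit22 = (unsigned)((addr >> 22) & 1u);
    if (bit23) return REGION_SPI_FLASH;   /* mem_addr[23]             */
    if (bit22) return REGION_IO;          /* mem_addr[23:22] == 2'b01 */
    return REGION_RAM;                    /* mem_addr[23:22] == 2'b00 */
}
```

With this model, address `0x400000` decodes to IO and any address from `0x800000` upwards decodes to the SPI flash.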
The `MappedSPIFlash` module is wired as follows:
wire [31:0] SPIFlash_rdata;
wire SPIFlash_rbusy;
MappedSPIFlash SPIFlash(
.clk(clk),
.word_address(mem_wordaddr),
.rdata(SPIFlash_rdata),
.rstrb(isSPIFlash & mem_rstrb),
.rbusy(SPIFlash_rbusy),
.CLK(SPIFLASH_CLK),
.CS_N(SPIFLASH_CS_N),
.IO(SPIFLASH_IO)
);
(the pins `SPIFLASH_CLK`, `SPIFLASH_CS_N`, `SPIFLASH_IO[0]` and `SPIFLASH_IO[1]` are declared in the constraint file, in the `BOARDS` subdirectory).
The data sent to the processor has a three-way mux:
assign mem_rdata = isRAM ? RAM_rdata :
isSPIFlash ? SPIFlash_rdata :
IO_rdata ;
OK, now our processor can automatically trigger a SPI flash read by accessing memory with bit 23 set in the address, but how does it know that data is ready ? (remember, data arrives one bit at a time). There is this `SPIFlash_rbusy` that goes high whenever `MappedSPIFlash` is busy receiving some data, we need to take it into account in our processor's state machine. We add a new input signal `mem_rbusy` to our processor, and modify the state machine as follows:
...
WAIT_DATA: begin
if(!mem_rbusy) begin
state <= FETCH_INSTR;
end
end
...
Then, in the SOC, this signal is wired to `SPIFlash_rbusy`:
wire mem_rbusy;
...
Processor CPU(
...
.mem_rbusy(mem_rbusy),
...
);
...
assign mem_rbusy = SPIFlash_rbusy;
By the way, since we are revisiting the state machine, there is something we can do. Remember this portion of the state machine, don't you think we could go faster ?
WAIT_INSTR: begin
instr <= mem_rdata;
state <= FETCH_REGS;
end
FETCH_REGS: begin
rs1 <= RegisterBank[rs1Id];
rs2 <= RegisterBank[rs2Id];
state <= EXECUTE;
end
Yes, `rs1Id` and `rs2Id` are simply 5 wires (each) drawn from `instr`, so we can get them from `mem_rdata` directly, and fetch the registers in the `WAIT_INSTR` state, as follows:
WAIT_INSTR: begin
instr <= mem_rdata;
rs1 <= RegisterBank[mem_rdata[19:15]];
rs2 <= RegisterBank[mem_rdata[24:20]];
state <= EXECUTE;
end
Doing so we gain one cycle per instruction, and it is an easy win !
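To check by hand which register indices a given instruction word selects, the two bit slices can be reproduced in C. This is a small host-side sketch; `encode_add`, `rs1_index` and `rs2_index` are hypothetical helper names, and `encode_add` follows the standard RV32I R-type layout.

```c
#include <assert.h>
#include <stdint.h>

/* Encode the R-type instruction "add rd, rs1, rs2" (opcode 0110011). */
uint32_t encode_add(unsigned rd, unsigned rs1, unsigned rs2) {
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return ((uint32_t)rs2 << 20) | ((uint32_t)rs1 << 15)
         | ((uint32_t)rd  << 7)  | 0x33u;
}

/* Same slice as mem_rdata[19:15]. */
unsigned rs1_index(uint32_t word) { return (unsigned)((word >> 15) & 0x1Fu); }

/* Same slice as mem_rdata[24:20]. */
unsigned rs2_index(uint32_t word) { return (unsigned)((word >> 20) & 0x1Fu); }
```

For `add x1,x2,x3`, the word is `0x003100B3` and the two slices give 2 and 3, as expected.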
Oh, and one more thing, why do we need a `LOAD` and a `STORE` state, couldn't we initiate memory transfers in the `EXECUTE` state ? Yes we can, so we need to change the write mask and read strobes accordingly, like that:
assign mem_rstrb = (state == FETCH_INSTR || (state == EXECUTE & isLoad));
assign mem_wmask = {4{(state == EXECUTE) & isStore}} & STORE_wmask;
Then the state machine has 4 states only !
localparam FETCH_INSTR = 0;
localparam WAIT_INSTR = 1;
localparam EXECUTE = 2;
localparam WAIT_DATA = 3;
reg [1:0] state = FETCH_INSTR;
always @(posedge clk) begin
if(!resetn) begin
PC <= 0;
state <= FETCH_INSTR;
end else begin
if(writeBackEn && rdId != 0) begin
RegisterBank[rdId] <= writeBackData;
end
case(state)
FETCH_INSTR: begin
state <= WAIT_INSTR;
end
WAIT_INSTR: begin
instr <= mem_rdata;
rs1 <= RegisterBank[mem_rdata[19:15]];
rs2 <= RegisterBank[mem_rdata[24:20]];
state <= EXECUTE;
end
EXECUTE: begin
if(!isSYSTEM) begin
PC <= nextPC;
end
state <= isLoad ? WAIT_DATA : FETCH_INSTR;
end
WAIT_DATA: begin
if(!mem_rbusy) begin
state <= FETCH_INSTR;
end
end
endcase
end
end
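To get a feeling for what the four-state machine costs per instruction, here is a small C model that counts clock cycles. It is a simplified sketch: it ignores reset, treats `busy_cycles` as the number of cycles during which `mem_rbusy` stays high once `WAIT_DATA` is entered, and `cycles_per_instruction` is a name made up for the illustration.

```c
#include <assert.h>

typedef enum { FETCH_INSTR, WAIT_INSTR, EXECUTE, WAIT_DATA } state_t;

/* Count the clock cycles spent on one instruction, from FETCH_INSTR
   until the state machine is back in FETCH_INSTR. */
unsigned cycles_per_instruction(int is_load, unsigned busy_cycles) {
    state_t state = FETCH_INSTR;
    unsigned cycles = 0;
    do {
        switch (state) {
        case FETCH_INSTR: state = WAIT_INSTR; break;
        case WAIT_INSTR:  state = EXECUTE;    break;
        case EXECUTE:     state = is_load ? WAIT_DATA : FETCH_INSTR; break;
        case WAIT_DATA:
            if (busy_cycles > 0) {
                --busy_cycles;          /* mem_rbusy still high: stay here */
            } else {
                state = FETCH_INSTR;    /* data is ready */
            }
            break;
        }
        ++cycles;
    } while (state != FETCH_INSTR);
    assert(cycles >= 3);                /* three states are always visited */
    return cycles;
}
```

In this model, an ALU instruction takes 3 cycles, a load from BRAM takes 4, and a load that waits on a busy memory takes 4 plus the number of busy cycles.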
There are several other things that we can optimize. First thing, you may have noticed that the two LSBs of the instructions are always `2'b11` in RV32I, so we do not need to load them:
reg [31:2] instr;
...
instr <= mem_rdata[31:2];
...
wire isALUreg = (instr[6:2] == 5'b01100);
...
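The effect of dropping the two LSBs on the decoder can be checked on the host with a one-line C equivalent of the `isALUreg` test. A sketch only; `is_alu_reg` is a hypothetical name, and the assertion encodes the RV32I property mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of: wire isALUreg = (instr[6:2] == 5'b01100); */
int is_alu_reg(uint32_t instr) {
    assert((instr & 0x3u) == 0x3u);   /* RV32I: the two LSBs are always 11 */
    return ((instr >> 2) & 0x1Fu) == 0x0Cu;   /* 5'b01100 */
}
```

For example, `add x1,x2,x3` (`0x003100B3`) passes the test, whereas `addi x1,x0,1` (`0x00100093`) does not.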
Something else: we are doing all address computations with 32 bits, whereas our address space has 24 bits only, we can save significant resources there:
localparam ADDR_WIDTH=24;
wire [ADDR_WIDTH-1:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
instr[4] ? Uimm[31:0] :
Bimm[31:0] );
wire [ADDR_WIDTH-1:0] PCplus4 = PC+4;
wire [ADDR_WIDTH-1:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
isJALR ? {aluPlus[31:1],1'b0} :
PCplus4;
wire [ADDR_WIDTH-1:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
The up to date verilog file is available in step22.v. Let us now check that we are able to access the SPI flash from our processor, with the following program:
#include "io.h"
#define SPI_FLASH_BASE ((char*)(1 << 23))
int main() {
for(int i=0; i<16; ++i) {
IO_OUT(IO_LEDS,i);
int lo = (int)SPI_FLASH_BASE[2*i ];
int hi = (int)SPI_FLASH_BASE[2*i+1];
print_hex_digits((hi << 8) | lo,4); // print four hexadecimal digits
printf(" ");
}
printf("\n");
}
The SPI flash is mapped in memory space, using addresses with bit 23 set (the first address, that we call `SPI_FLASH_BASE`, is `1 << 23`). Then we access all individual bytes, and display them by grouping them into 16-bit words (for each word, the first byte in memory is the least significant one, because RISC-V follows the little-endian convention). We have a `print_hex_digits()` function in FIRMWARE/print.c that does the job (the second argument is the number of hex characters we want to print for each number).
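The byte grouping can be tried on a host machine with a few lines of C. In this sketch, `word16_le` is a name chosen for the illustration, and the bytes are read as `uint8_t` so that values above `0x7F` are not sign-extended on hosts where `char` is signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine bytes 2*i (least significant) and 2*i+1 (most significant)
   into the i-th 16-bit word, following the little-endian convention. */
uint16_t word16_le(const uint8_t *bytes, size_t i) {
    assert(bytes != NULL);
    unsigned lo = bytes[2 * i];
    unsigned hi = bytes[2 * i + 1];
    return (uint16_t)((hi << 8) | lo);
}
```

With the byte sequence `ff 00 00 ff`, the first two words are `0x00ff` and `0xff00`.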
Now compile the program, synthesize the design and send it to the device as follows:
$ cd FIRMWARE
$ make read_spiflash.bram.hex
$ cd ..
$ BOARDS/run_icestick.sh step22.v
$ ./terminal.sh
... and you see nothing. Why is this so ? The program finished before you started the terminal, so we were not able to see anything, but you can reset the processor, pushing the invisible reset button (mentioned in step 2). Each time you push the "button", it will display on the terminal the first 16 words stored in the SPI flash. On an IceStick, you will see something like:
00FF FF00 AA7E 7E99 0051 0501 0092 6220 4B01 0072 8290 0000 0011 0101 0000 0000
Do you have an idea where these values come from ? Remember why there is this SPI flash chip on your FPGA board: it is where your design is stored. When the FPGA starts, it loads its design from the SPI flash. The design corresponds to the file `SOC.bin`, that is generated at the end of the yosys/nextpnr/icepack pipeline:
- `yosys` transforms your verilog into a "circuit", also called a "netlist"
- then `nextpnr` maps the gates of this circuit to the logical elements of the FPGA,
- and finally `icepack` converts the result into a "binary stream" directly understood by the FPGA.

Let us examine the 16 first words of the binary stream:
$ od -x -N 32 SOC.bin
Then you'll see something like:
0000000 00ff ff00 aa7e 7e99 0051 0501 0092 6220
0000020 4b01 0072 8290 0000 0011 0101 0000 0000
0000040
and this corresponds to what we have just seen on the terminal, read from the SPI flash chip.
So our CPU can read its own FPGA representation from the SPI flash, like a biologist sequencing their own DNA ! While it has a nice and intriguing recursion flavor, it is probably of very little practical use, but let us take a deeper look at it: the `SOC.bin` file is not very large:
$ ls -al SOC.bin
-rw-rw-r-- 1 blevy blevy 32220 Jan 7 07:31 SOC.bin
It weighs only `32KB` or so, and our SPI flash chip has capacity for `4MB`, so there is plenty of room for us ! The only thing we need to take care of is not overwriting the FPGA configuration (in other words, always start further away than the size of `SOC.bin`). So we will use a `1MB` offset for storing our data (you will say we are wasting a lot of space between `32KB` and `1MB` but we shall use that space for something else in subsequent steps of this tutorial).
Try this: create a text file `hello.txt`, send it to the FPGA at the `1MB` offset (see below how to do that), write a program that displays the stored file. To know where to stop, you may need either to decide for a termination character or to precode the length of the file.
For ICE40 boards (IceStick, IceBreaker, ...), use:
$ iceprog -o 1M hello.txt
For ECP5 boards (ULX3S), use:
$ cp hello.txt hello.img
$ ujprog -j flash -f 1048576 hello.img
(using latest version of `ujprog` compiled from https://github.com/kost/fujprog).
OK, so now we are ready to use the new storage that we have for more interesting things. What we will do is displaying an animation on the terminal. The animation is a demo from the 90's, that streams polygon data to a software polygon renderer. Polygon data is a 640 kB binary file, available from `learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat` (see other files in the same directory for more information about the file format). First thing to do is writing the file to the SPI flash, from a 1 MB offset. For ICE40-based boards (IceStick, IceBreaker), use:
$ iceprog -o 1M learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat
For ECP5 boards (ULX3S), use:
$ cp learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat scene1.img
$ ujprog -j flash -f 1048576 scene1.img
(using latest version of `ujprog` compiled from https://github.com/kost/fujprog).
Now you can compile the program:
$ cd FIRMWARE
$ make ST_NICCC.bram.hex
$ cd ..
and send the design and the program to the device:
$ BOARDS/run_xxx.sh step22.v
$ ./terminal.sh
Try this: store an image in SPI Flash (in a format that is easy to read), and write a program to display it. You can use `printf("\033[48;2;%d;%d;%dm ",R,G,B);` to send a pixel (where `R`, `G`, `B` are numbers between 0 and 255), and `printf("\033[48;2;0;0;0m\n");` after each scanline.
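As a starting point for this exercise, the pixel escape sequence can be wrapped in a small helper that formats it into a buffer. This is a host-side sketch using the standard `snprintf()`; `format_pixel` is a hypothetical name, and on the device you would send the characters with the tutorial's own output routines instead.

```c
#include <assert.h>
#include <stdio.h>

/* Format the escape sequence that paints one "pixel": a space on a
   24-bit colored background. Returns the number of characters
   written, or -1 if buf is too small. */
int format_pixel(char *buf, size_t cap, int r, int g, int b) {
    assert(buf != NULL);
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
    int n = snprintf(buf, cap, "\033[48;2;%d;%d;%dm ", r, g, b);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A scanline is then a sequence of such pixels followed by the reset-to-black sequence and a newline.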
With what we have done in the previous step, we are now able to load data from the SPI flash, and we have ample space for all our data, but we still have only 6 kB that is shared between our code and variables, it is not much ! It would be great to be able to use the SPI flash to store our code, and execute it directly from there. We were able to write nice demos that fit in 6 kB, imagine what you could do with 2 MB for code, and the entire 6 kB available for your variables !
To be able to load code from the SPI flash, the only thing we need to change is staying in the `WAIT_INSTR` state until `mem_rbusy` is zero, hence we just need to test `mem_rbusy` before changing state to `EXECUTE`:
WAIT_INSTR: begin
instr <= mem_rdata[31:2];
rs1 <= RegisterBank[mem_rdata[19:15]];
rs2 <= RegisterBank[mem_rdata[24:20]];
if(!mem_rbusy) begin
state <= EXECUTE;
end
end
and we initialize the BRAM with the following program, that jumps to address `0x00820000`:
initial begin
LI(a0,32'h00820000);
JR(a0);
end
This address corresponds to the address where the SPI flash is projected into the address space of our CPU (`0x00800000` = 1 << 23) plus an offset of 128 kB (`0x20000`). This offset of 128 kB is necessary because remember, we share the SPI Flash with the FPGA that stores its configuration in it !
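The address arithmetic is simple enough to check with a short C helper. In this sketch, `SPI_FLASH_CPU_BASE`, `SPI_FLASH_CHIP_SIZE` and `flash_offset_to_cpu_addr` are names introduced for the illustration, and the 4 MB chip size is the IceStick value quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

#define SPI_FLASH_CPU_BASE  (UINT32_C(1) << 23)   /* 0x00800000 */
#define SPI_FLASH_CHIP_SIZE (UINT32_C(4) << 20)   /* assumption: 4 MB chip */

/* CPU address at which the byte stored at flash_offset is visible. */
uint32_t flash_offset_to_cpu_addr(uint32_t flash_offset) {
    assert(flash_offset < SPI_FLASH_CHIP_SIZE);
    return SPI_FLASH_CPU_BASE + flash_offset;
}
```

An offset of 128 kB gives `0x00820000`, and the 1 MB offset used for data in the previous step gives `0x00900000`.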
OK, that's mostly it for the hardware part. Let us see now if we can execute code from there. To do that, we will need a new linker script (FIRMWARE/spiflash0.ld):
MEMORY {
FLASH (RX) : ORIGIN = 0x00820000, LENGTH = 0x100000 /* 1 MB in flash */
}
SECTIONS {
everything : {
. = ALIGN(4);
start.o (.text)
*(.*)
} >FLASH
}
It is the same thing as before, but we tell the linker to put everything in flash memory (for now, we will see later how it works for global variables). Let us test it with a program that does not write to global variables, for instance FIRMWARE/hello.S. To link it using our new linker script, we do:
$ riscv64-unknown-elf-ld -T spiflash0.ld -m elf32lriscv -nostdlib -norelax hello.o putchar.o -o hello.spiflash0.elf
But since it is tedious to type, it is automated by the Makefile:
$ make hello.spiflash0.elf
Now you need to convert the ELF executable into a flat binary:
$ riscv64-unknown-elf-objcopy hello.spiflash0.elf hello.spiflash0.bin -O binary
or with our Makefile:
$ make hello.spiflash0.bin
and send it to the SPI flash at offset 128k:
$ iceprog -o 128k hello.spiflash0.bin
or with our Makefile:
$ make hello.spiflash0.prog
and then:
$ ./terminal.sh
Before starting, let us make a little change in our core: when pushing the reset button, it jumps to address 0, which is initialized as a jump to flash memory, but after executing our program, it is possible (and highly probable) that the RAM will have been used for something else, and no longer has the jump-to-flash instruction. To fix this, one can make the CPU jump to flash memory each time reset goes low:
if(!resetn) begin
PC <= 32'h00820000;
state <= WAIT_DATA;
end
Note that `state` is set to `WAIT_DATA`, so that it waits for `mem_rbusy` to go low before doing anything else.
OK, so now we have a large quantity of flash memory in which we can install the code and run it from there. We can also install readonly variables in there, like the string `.asciz "Hello, world !\n"` in the previous example. And what about local variables ? They are allocated on the stack, that resides in the 6 kB of RAM that we have, so it will work. How does it know where the stack is ? Remember, we have written FIRMWARE/start.S, that initializes `sp` at the end of the RAM (`0x1800`) and it suffices.
But how does it work for a program like that ?
int x = 3;
void main() {
x = x + 1;
printf("%d\n",x);
}
The global variable `x` has an initial value that needs to be stored somewhere, so we need to put it in flash memory, but we are modifying it after, so we need to put it in RAM, how is it possible ?
In fact, what we need is a mechanism for storing all the initial values of the (initialized) global variables in flash memory and copy them to RAM on startup. To do that, we will need a new linker script (that indicates where to put the variables and where to put their initial values) and a new `start.S` (that copies the initial values to the variables). Let us see how to do that.
When you compile C code, the compiler inserts directives to indicate where the different things go (sections). To take a look, compile one of our C programs into an object file and inspect it:
$ cd FIRMWARE
$ make ST_NICCC.o
$ readelf -S ST_NICCC.o
it will show you the different sections that are present in the object file.
section | description |
---|---|
text | executable code |
bss, sbss | uninitialized data |
data, sdata | initialized data |
rodata | read-only data |
The section name (bss) for uninitialized data has a historic reason that dates back to the 1950s (BSS: Block Started by Symbol is a pseudo-instruction of an assembler for the IBM 704). Uninitialized and initialized data sections come in two flavors: sbss and sdata are for small uninitialized (resp. initialized) data.
In `readelf` output, there is also a `Type` field. `PROGBITS` means that some data needs to be loaded from the file (for the `text`, `data` and `rodata` sections). `NOBITS` means that no data should be loaded (for `bss`). Then the `Addr` field indicates where the section will be mapped into memory (for a `.o` file, it is always 0, but it is useful for a linked ELF executable, you can check using `readelf`). Then the `Off` field indicates the offset of the section's data in the `.o` file, and the `Size` field the number of bytes in the section.
So what we have to do is writing a linker script that will say the following things:
- `text` sections go to the flash memory
- `bss` sections go to BRAM
- `data` sections go to BRAM, but have their initial values stored in the flash memory

For `text` and `bss`, we already know how to do it. For `data`, linker scripts can specify a LMA (Load Memory Address), that indicates where initial values need to be stored. In our linker script, we will have something like:
MEMORY {
FLASH (rx) : ORIGIN = 0x00820000, LENGTH = 0x100000
RAM (rwx) : ORIGIN = 0x00000000, LENGTH = 0x1800
}
SECTIONS {
.data : AT(address_in_spi_flash) {
*(.data*)
*(.sdata*)
} > RAM
.text : {
start_spiflash1.o(.text)
*(.text*)
*(.rodata*)
*(.srodata*)
} >FLASH
.bss : {
*(.bss*)
*(.sbss*)
} >RAM
}
Each section indicates how to map sections read from object files to sections in the executable (`.data`, `.text` and `.bss`), and how to map these sections to the flash memory and to the BRAM. For each section, some pattern matching rules indicate which sections from the object files are concerned. For the `.text` section, we make sure that the first section is the text section of `start_spiflash1.o`, because our processor jumps there on reset. Note also that we put the readonly data (`.rodata` and `.srodata`) into the flash.
For the `.data` section, the `AT` keyword indicates the LMA (Load Memory Address) where the linker will put the initial values (an address in spi flash), and whenever a symbol in a `data` or `sdata` section is referenced, the linker will use its address in RAM.
But a question remains: how does the system know that it should copy initialization data from the flash into BRAM ? How does it know at which address ? How can we initialize uninitialized data (BSS) to zero ? In fact we need to do it by hand, in the startup code `start_spiflash1.S`, that looks like that:
.equ IO_BASE, 0x400000
.text
.global _start
.type _start, @function
_start:
.option push
.option norelax
li gp,IO_BASE
.option pop
li sp,0x1800
# zero-init bss section:
la a0, _sbss
la a1, _ebss
bge a0, a1, end_init_bss
loop_init_bss:
sw zero, 0(a0)
addi a0, a0, 4
blt a0, a1, loop_init_bss
end_init_bss:
# copy data section from SPI Flash to BRAM:
la a0, _sidata
la a1, _sdata
la a2, _edata
bge a1, a2, end_init_data
loop_init_data:
lw a3, 0(a0)
sw a3, 0(a1)
addi a0, a0, 4
addi a1, a1, 4
blt a1, a2, loop_init_data
end_init_data:
call main
ebreak
- The first thing that we do is initializing the stack pointer and the global pointer `gp` (with the IO page address in our case).
- The first loop clears the memory between `_sbss` and `_ebss`.
- The second loop copies data from `_sidata` to `_sdata` ... `_edata`.
- Finally we call `main`.

... but wait a minute, how do we know the values for `_sbss`, `_ebss`, `_sidata`, `_sdata`, `_edata` ? In fact, the linker script can generate them for us. Here is what the `.data` section looks like:
.data : AT ( _sidata ) {
. = ALIGN(4);
_sdata = .;
*(.data*)
*(.sdata*)
. = ALIGN(4);
_edata = .;
} > RAM
where `.` denotes the current address. In addition, lines like `. = ALIGN(4);` make sure that addresses remain aligned on 4-byte boundaries, since our initialization loops in `start_spiflash1.S` depend on that.
The declaration for the `.text` section looks like:
.text : {
. = ALIGN(4);
start_spiflash1.o(.text)
*(.text*)
. = ALIGN(4);
*(.rodata*)
*(.srodata*)
_etext = .;
_sidata = _etext;
} >FLASH
Note that it declares `_sidata` right at the end of the text section, so that the `.data` section can put its initialization data there.
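To see what the two initialization loops do without running them on the device, here is a host-side model in C. It is only a sketch: the five pointer parameters play the role of the linker-generated symbols of the same names, plain arrays stand in for the flash and the BRAM, and `init_ram` is a name chosen for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the two loops in start_spiflash1.S. */
void init_ram(uint32_t *sbss, uint32_t *ebss,
              const uint32_t *sidata, uint32_t *sdata, uint32_t *edata) {
    assert(sbss <= ebss && sdata <= edata);
    /* zero-init bss section */
    for (uint32_t *p = sbss; p < ebss; ++p) {
        *p = 0;
    }
    /* copy the initial values of the data section from "flash" to "RAM" */
    for (uint32_t *p = sdata; p < edata; ++p) {
        *p = *sidata++;
    }
}
```

Both loops move whole 32-bit words, which is why the section boundaries have to stay aligned on 4 bytes.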
OK, so let us try it with one of our examples:
$ cd FIRMWARE
$ make mandel_C.spiflash1.prog
$ cd ..
$ ./terminal.sh
Yes, it works, but wait a minute, it is significantly slower than before. Can you guess why ?
Remember that the FLASH memory is a serial memory, which means that addresses are sent one bit at a time and the result is obtained also one bit at a time (well, in fact two bits at a time for both in our case), it is much slower than the BRAM that gets a 32-bit value in one cycle. Can we do something ? Sure we can ! What about putting some critical functions in BRAM ? To do that, we can change our linker script as follows (result in FIRMWARE/spiflash2.ld):
.data_and_fastcode : AT ( _sidata ) {
. = ALIGN(4);
_sdata = .;
/* Initialized data */
*(.data*)
*(.sdata*)
/* integer mul and div */
*/libgcc.a:muldi3.o(.text)
*/libgcc.a:div.o(.text)
putchar.o(.text)
print.o(.text)
/* functions with attribute((section(".fastcode"))) */
*(.fastcode*)
. = ALIGN(4);
_edata = .;
} > RAM
By doing so, we indicate that some specific functions (integer multiply and divide from libgcc and IO functions) should be put in fast RAM, and that's all we have to do ! The linker will put the code for these functions in the same section as the initialization data for initialized variables, and our runtime `start_spiflash1.S` will copy them with the initialization data to RAM at startup, cool !
Let us try it with our example:
$ cd FIRMWARE
$ make mandel_C.spiflash2.prog
$ cd ..
$ ./terminal.sh
Aaaah, much better !
Note also the line `*(.fastcode*)`: you can put your own functions in BRAM, by indicating that they are in a `fastcode` section. In C, you can do that as follows:
void my_function(my args ...) __attribute__((section(".fastcode")));
void my_function(my args ...) {
...
}
Try this: run the `ST_NICCC` demo (`make ST_NICCC.spiflash2.prog`). Then uncomment the line in `ST_NICCC.c` with the definition for `RV32_FASTCODE` and re-run it.
Now we can run larger programs on our device:
- FIRMWARE/pi.c (by Fabrice Bellard, computes the decimals of pi)
- FIRMWARE/tinyraytracer.c (by Dmitry Sokolov, raytracing)
Both of them use floating point numbers. For a RV32I core such as ours, floating point numbers use routines implemented in `libgcc`. As a consequence, executables are larger (`pi` weighs 17 kB and `tinyraytracer` weighs 25 kB) and would have been impossible to run in 6 kB of RAM. The additional memory offered by the SPI FLASH offers much more possibilities to our device !
At this point, not only our device runs code compiled using standard tools (gcc), but also it runs existing code, independently developed (the mathematical routines in `libgcc`). It is quite exciting to run existing binary code on a processor that you create on your own !
- step 1: Blinker, too fast, can't see anything
- step 2: Blinker with clockworks
- step 3: Blinker that loads pattern from ROM
- step 4: The instruction decoder
- step 5: The register bank and the state machine
- step 6: The ALU
- step 7: Using the VERILOG assembler
- step 8: Jumps
- step 9: Branches
- step 10: LUI and AUIPC
- step 11: Memory in separate module
- step 12: Size optimization: the Incredible Shrinking Core !
- step 13: Subroutines 1 (standard Risc-V instruction set)
- step 14: Subroutines 2 (using Risc-V pseudo-instructions)
- step 15: Load
- step 16: Store
- step 17: Memory-mapped devices
- step 18: Mandelbrot set
- step 19: Faster simulation with Verilator
- step 20: Using the GNU toolchain to compile assembly programs
- step 21: Using the GNU toolchain to compile C programs
- step 22: More memory ! Using the SPI Flash
- step 23: Running programs from the SPI Flash, first steps
- step 24: Running programs from the SPI Flash, better linker script
WIP
- step 25: More devices (LED matrix, OLED screen...)