From Blinker to RISC-V

This tutorial is a progressive journey from a simple blinky design to a RISC-V core.

It works with the following boards:

IceStick
IceBreaker
ULX3S
ARTY

If you do not have a board, you can run everything in simulation (but it is not as fun).

About this tutorial

it is a progressive introduction, changing only one thing at a time. It is a curated version of my logbook when I learnt these notions (2020-2022). I also tryed to keep track of all the dead ends I explored and traps that caught me, they are often indicated as side remarks and notes;
I try to keep hardware requirement as minimal as possible. With the tiniest FPGA (IceStick Ice40HX1K) you can do the first episode of the tutorial and transform it into a fully functional RV32I microcontroller that can execute compiled C code.
in the end, the obtained processor is not the most efficient, but it is not a toy: it can execute any program. To anwser the question you may ask, yes, it runs DOOM! (but not on an IceStick, you will need a larger FPGA). It works with the help of LiteX that has a nice SDRAM controller, because Doom needs some RAM;
the tutorial is both about hardware and software: you will learn how to compile programs in assembly and in C for your core;
I try to make all example programs fun and interesting while reasonably short. The bundled demo programs include:
- mandelbrot set in assembly and in C
- rotozoom graphic effect
- drawing filled polygons
- raytracing These graphic program are all displayed in text mode on the terminal, using ANSI escape sequences (yes, this makes BIG pixels). For more fun, it is also possible to use a small OLED display instead (will add instructions for that in the future).
Episode II is on pipelining, you will learn there how to transform the basic processor obtained at the end of this tutorial into a more efficient pipelined processor with branch prediction.
[Episode III)(INTERRUPTS.md) is a WIP on interrupts and the priviledged RISC-V ISA.
This tutorial is in VERILOG. It is currently being ported into other HDLs
- Amaranth/nMigen version by @bl0x
- TODO: Silice version
- TODO: SpinalHDL version

Introduction and references on processor design

To understand processor design, the first thing that I have read was this answer on Stackoverflow, that I found inspiring. There is also this article suggested by @mithro. For a complete course, I highly recommend this one from the MIT, it also gives the principles for going much further than what I've done here (pipelines etc...).

For Verilog basics and syntax, I read Verilog by example by Blaine C. Readler, it is also short and to the point.

There are two nice things with the Stackoverflow answer:

it goes to the essential, and keeps nothing else than what's essential
the taken example is a RISC processor, that shares several similarities with RISC-V (except that it has status flags, that RISC-V does not have).

What we learn there is that there will be a register file, that stores the so-called general-purpose registers. By general-purpose, we mean that each time an instruction reads a register, it can be any of them, and each time an instruction writes a register, it can be any of them, unlike the x86 (CISC) that has specialized registers. To implement the most general instruction (register <- register OP register), the register file will read two registers at each cycle, and optionally write-back one.

There will be an ALU, that will compute an operation on two values.

There will be also a decoder, that will generate all required internal signals from the bit pattern of the current instruction.

If you want to design a RISC-V processor on your own, I recommend you take a deep look at the Stackoverflow answer, and do some schematics on your own to have all the general ideas in mind before going further... or you can choose to directly jump into this tutorial, one step at a time. It will gently take you from the most trivial Blinky design to a fully functional RISC-V core.

Prerequisites:

First step is cloning the learn-fpga repository:

$ git clone https://github.com/BrunoLevy/learn-fpga.git

Before starting, you will need to install the following softwares:

iverilog/icarus (simulation)

  $ sudo apt-get install iverilog

yosys/nextpnr, the toolchain for your board. See this link.

Note that iverilog/icarus is sufficient to run and play with all the steps of the tutorial, but the experience is not the same. I highly recommend to run each step on a real device. The feeling and excitation of your own processor running some code for the first time is not of the same magnitude when you are doing simulation !!!

Step 1: your first blinky

Let us start and create our first blinky ! Our blinky is implemented as a VERILOG module, connected to inputs and outputs, as follows (step1.v):

   module SOC (
       input  CLK,        
       input  RESET,      
       output [4:0] LEDS, 
       input  RXD,        
       output TXD         
   );

   reg [4:0] count = 0;
   always @(posedge CLK) begin
      count <= count + 1;
   end
   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now

   endmodule

We call it SOC (System On Chip), which is a big name for a blinky, but that's what our blinky will be morphed into after all the steps of this tutorial. Our SOC is connected to the following signals:

CLK (input) is the system clock.
LEDS (output) is connected to the 5 LEDs of the board.
RESET (input) is a reset button. You'll say that the IceStick has no button, but in fact ... (we'll talk about that later)
RXD and TXD (input,output) connected to the FTDI chip that emulates a serial port through USB. We'll also talk about that later.

You can synthesize and send the bitstream to the device as follows:

$ BOARDS/run_xxx.sh step1.v

where xxx corresponds to your board.

The five leds will light on... but they are not blinking. Why is this so ? In fact they are blinking, but it is too fast for you to distinguish anything.

To see something, it is possible to use simulation. To use simulation, we write a new VERILOG file bench_iverilog.v, with a module bench that encapsulates our SOC:

module bench();
   reg CLK;
   wire RESET = 0; 
   wire [4:0] LEDS;
   reg  RXD = 1'b0;
   wire TXD;

   SOC uut(
     .CLK(CLK),
     .RESET(RESET),
     .LEDS(LEDS),
     .RXD(RXD),
     .TXD(TXD)
   );

   reg[4:0] prev_LEDS = 0;
   initial begin
      CLK = 0;
      forever begin
	 #1 CLK = ~CLK;
	 if(LEDS != prev_LEDS) begin
	    $display("LEDS = %b",LEDS);
	 end
	 prev_LEDS <= LEDS;
      end
   end
endmodule

The module bench drives all the signals of our SOC (called uut here for "unit under test"). The forever loop wiggles the CLK signal and displays the status of the LEDs whenever it changes.

Now we can start the simulation:

  $ iverilog -DBENCH -DBOARD_FREQ=10 bench_iverilog.v step1.v
  $ vvp a.out

... but that's a lot to remember, so I created a script for that, you'll prefer to do:

  $ ./run.sh step1.v

You will see the LEDs counting. Simulation is precious, it lets you insert "print" statements ($display) in your VERILOG code, which is not directly possible when you run on the device !

To exit the simulation:

  <ctrl><c>
  finish

Note: I developped the first version of femtorv completely on device, using only the LEDs to debug because I did not know how to use simulation, don't do that, it's stupid !

Try this How would you modify step1.v to slow it down sufficiently for one to see the LEDs blinking ?

Try this Can you implement a "Knight driver"-like blinking pattern instead of counting ?

Step 2: slower blinky

You probably got it right: the blinky can be slowed-down either by counting on a larger number of bits (and wiring the most significant bits to the leds), or inserting a "clock divider" (also called a "gearbox") that counts on a large number of bits (and driving the counter with its most significant bit). The second solution is interesting, because you do not need to modify your design, you just insert the clock divider between the CLK signal of the board and your design. Then, even on the device you can distinguish what happens with the LEDs.

To do that, I created a Clockworks module in clockworks.v, that contains the gearbox and a mechanism related with the RESET signal (that I'll talk about later). Clockworks is implemented as follows:

module Clockworks 
(
   input  CLK,   // clock pin of the board
   input  RESET, // reset pin of the board
   output clk,   // (optionally divided) clock for the design.
   output resetn // (optionally timed) negative reset for the design (more on this later)
);
   parameter SLOW;
...
   reg [SLOW:0] slow_CLK = 0;
   always @(posedge CLK) begin
      slow_CLK <= slow_CLK + 1;
   end
   assign clk = slow_CLK[SLOW];
...
endmodule

This divides clock frequency by 2^SLOW.

The Clockworks module is then inserted between the CLK signal of the board and the design, using an internal clk signal, as follows, in step2.v:

`include "clockworks.v"

module SOC (
    input  CLK,        // system clock 
    input  RESET,      // reset button
    output [4:0] LEDS, // system LEDs
    input  RXD,        // UART receive
    output TXD         // UART transmit
);

   wire clk;    // internal clock
   wire resetn; // internal reset signal, goes low on reset
   
   // A blinker that counts on 5 bits, wired to the 5 LEDs
   reg [4:0] count = 0;
   always @(posedge clk) begin
      count <= !resetn ? 0 : count + 1;
   end

   // Clock gearbox (to let you see what happens)
   // and reset circuitry (to workaround an
   // initialization problem with Ice40)
   Clockworks #(
     .SLOW(21) // Divide clock frequency by 2^21
   )CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );
   
   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now   
endmodule

It also handles the RESET signal.

Now you can try it on simulation:

  $ ./run.sh step2.v

As you can see, the counter is now much slower. Try it also on device:

  $ BOARDS/run_xxx.sh step2.v

Yes, now we can see clearly what happens ! And what about the RESET button ? The IceStick has no button. In fact it has one !

Press a finger on the circled region of the image (around pin 47).

Try this Knight-driver mode, and RESET toggles direction.

If you take a look at clockworks.v, you will see it can also create a PLL, it is a component that can be used to generate faster clocks. For instance, the IceStick has a 12 MHz system clock, but the core that we will generate will run at 45 MHz. We will see that later.

Step 3: a blinker that loads LEDs patterns from ROM

Now we got all the tools that we need, so let's see how to transform this blinker into a fully-functional RISC-V processor. This goal seems to be far far away, but the processor we will have created at step 16 is not longer than 200 lines of VERILOG ! I was amazed to discover that it is that simple to create a processor. OK, let us go there one step at a time.

We know already that a processor has a memory, and fetches instructions from there, in a sequential manner most of the time (except when there are jumps and branches). Let us start with something similar, but much simpler: a pre-programmed christmas tinsel, that loads the LEDs pattern from a memory (see step3.v). Our tinsel has a memory with the patterns:

   reg [4:0] MEM [0:20];
   initial begin
       MEM[0]  = 5'b00000;
       MEM[1]  = 5'b00001;
       MEM[2]  = 5'b00010;
       MEM[3]  = 5'b00100;
       ...
       MEM[19] = 5'b10000;
       MEM[20] = 5'b00000;       
   end

Note that what's in the initial block does not generate any circuitry when synthesized, it is directly translated into the initialization data for the BRAMs of the FPGA.

We will also have a "program counter" PC incremented at each clock, and a mechanism to fetch MEM contents indexed by PC:

   reg [4:0] PC = 0;
   reg [4:0] leds = 0;

   always @(posedge clk) begin
      leds <= MEM[PC];
      PC <= (!resetn || PC==20) ? 0 : (PC+1);
   end

Note the test PC==20 to make it cycle.

Now try it with simulation and on device.

Try this create several blinking modes, and switch between modes using RESET.

The RISC-V instruction set architecture

An important source of information is of course the RISC-V reference manual. There you learn that there are several flavors of the RISC-V standard. Let us start from the simplest one (RV32I, that is, 32 bits base integer instruction set). Then we will see how to add things, one thing at a time. This is a very nice feature of RISC-V, since the instruction set is modular, you can start with a very small self-contained kernel, and this kernel will be compliant with the norm. This means standard tools (compiler, assembler, linker) will be able to generate code for this kernel. Then I started reading Chapter 2 (page 13 to page 30). Seeing also the table page 130, there are in fact only 11 different instrutions ! (I say for instance that an AND, an OR, an ADD ... are the same instruction, the operation is just an additional parameter). Now we just try to have an idea of the overall picture, no need to dive into the details for now. Let's take a global look at these 11 instructions:

instruction	description	algo
branch	conditional jump, 6 variants	`if(reg OP reg) PC<-PC+imm`
ALU reg	Three-registers ALU ops, 10 variants	`reg <- reg OP reg`
ALU imm	Two-registers ALU ops, 9 variants	`reg <- reg OP imm`
load	Memory-to-register, 5 variants	`reg <- mem[reg + imm]`
store	Register-to-memory, 3 variants	`mem[reg+imm] <- reg`
`LUI`	load upper immediate	`reg <- (im << 12)`
`AUIPC`	add upper immediate to PC	`reg <- PC+(im << 12)`
`JAL`	jump and link	`reg <- PC+4 ; PC <- PC+imm`
`JALR`	jump and link register	`reg <- PC+4 ; PC <- reg+imm`
`FENCE`	memory-ordering for multicores	(not detailed here, skipped for now)
`SYSTEM`	system calls, breakpoints	(not detailed here, skipped for now)

The 6 branch variants are conditional jumps, that depend on a test on two registers.
ALU operations can be of the form register <- register OP register or register <- register OP immediate
Then we have load and store, that can operate on bytes, on 16 bit values (called half-words) or 32 bit values (called words). In addition byte and half-word loads can do sign expansion. The source/target address is obtained by adding an immediate offset to the content of a register.
The remaining instructions are more special (one may skip their description in a first read, you just need to know that they are used to implement unconditional jumps, function calls, memory ordering for multicores, system calls and breaks):
- LUI (load upper immediate) is used to load the upper 20 bits of a constant. The lower bits can then be set using ADDI or ORI. At first sight it may seem weird that we need two instructions to load a 32 bit constant in a register, but in fact it is a smart choice, because all instructions are 32-bit long.
- AUIPC (add upper immediate to PC) adds a constant to the current program counter and places the result in a register. It is meant to be used in combination with JALR to reach a 32-bit PC-relative address.
- JAL (jump and link) adds an offset to the PC and stores the address of the instruction following the jump in a register. It can be used to implement function calls. JALR does the same thing, but adds the offset to a register.
- FENCE and SYSTEMS are used to implement memory ordering in multicore systems, and system calls/breaks respectively.

To summarize, we got branches (conditional jumps), ALU operations, load and store, and a couple of special instructions used to implement unconditional jumps and function calls. There are also two functions for memory ordering and system calls (but we will ignore these two ones for now). OK, in fact only 9 instructions then, it seems doable... At this point, I had not understood everything, so I'll start from what I think to be the simplest parts (intruction decoder, register file and ALU), then we will see how things are interconnected, how to implement jumps, branches, and all the instructions.

Step 4: the instruction decoder

Now the idea is to have a memory with RISC-V instructions in it, load all instructions sequentially (like in our christmas tinsel), in an instr register, and see how to recognize among the 11 instructions (and light a different LED in function of the recognized instruction). Each instruction is encoded in a 32-bits word, and we need to decode the different bits of this word to recognize the instruction and its parameters.

The RISC-V reference manual has all the information that we need summarized in two tables in page 130 (RV32/64G Instruction Set Listings).

Let us take a look at the big table, first thing to notice is that the 7 LSBs tells you which instruction it is (there are 10 possibilities, we do not count FENCE for now).

   reg [31:0] instr;
   ...
   wire isALUreg  =  (instr[6:0] == 7'b0110011); // rd <- rs1 OP rs2   
   wire isALUimm  =  (instr[6:0] == 7'b0010011); // rd <- rs1 OP Iimm
   wire isBranch  =  (instr[6:0] == 7'b1100011); // if(rs1 OP rs2) PC<-PC+Bimm
   wire isJALR    =  (instr[6:0] == 7'b1100111); // rd <- PC+4; PC<-rs1+Iimm
   wire isJAL     =  (instr[6:0] == 7'b1101111); // rd <- PC+4; PC<-PC+Jimm
   wire isAUIPC   =  (instr[6:0] == 7'b0010111); // rd <- PC + Uimm
   wire isLUI     =  (instr[6:0] == 7'b0110111); // rd <- Uimm   
   wire isLoad    =  (instr[6:0] == 7'b0000011); // rd <- mem[rs1+Iimm]
   wire isStore   =  (instr[6:0] == 7'b0100011); // mem[rs1+Simm] <- rs2
   wire isSYSTEM  =  (instr[6:0] == 7'b1110011); // special

Besides the instruction type, we need also to decode the arguments of the instruction. The table on the top distinguishes 6 types of instructions (R-type,I-type,S-type,B-type,U-type,J-type), depending on the arguments of the instruction and how they are encoded within the 32 bits of the instruction word.

R-type instructions take two source registers rs1 and rs2, apply an operation on them and stores the result in a third destination register rd (ADD, SUB, SLL, SLT, SLTU, XOR, SRL, SRA, OR, AND).

Since RISC-V has 32 registers, each of rs1,rs2 and rd use 5 bits of the instruction word. Interestingly, these are the same bits for all instruction formats. Hence, "decoding" rs1,rs2 and rd is just a matter of drawing some wires from the instruction word:

   wire [4:0] rs1Id = instr[19:15];
   wire [4:0] rs2Id = instr[24:20];
   wire [4:0] rdId  = instr[11:7];

Then, one needs to recognize among the 10 R-type instructions. It is done mostly with the funct3 field, a 3-bits code. With a 3-bits code, one can only encode 8 different instructions, hence there is also a funct7 field (7 MSBs of instruction word). Bit 30 of the instruction word encodes ADD/SUB and SRA/SRL (arithmetic right shift with sign expansion/logical right shift). The instruction decoder has wires for funct3 and funct7:

   wire [2:0] funct3 = instr[14:12];
   wire [6:0] funct7 = instr[31:25];

I-type instructions take one register rs1, an immediate value Iimm, applies an operation on them and stores the result in the destination register rd (ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI, SRLI, SRAI).

Wait a minute: there are 10 R-Type instructions but only 9 I-Type instructions, why is this so ? If you look carefully, you will see that there is no SUBI, but one can instead use ADDI with a negative immediate value. This is a general rule in RISC-V, if an existing functionality can be used, do not create a new functionality.

As for R-type instructions, the instruction can be distinguished using funct3 and funct7 (and in funct7, only the bit 30 of the instruction word is used, to distinguish SRAI/SRLI arithmetic and logical right shifts).

The immediate value is encoded in the 12 MSBs of the instruction word, hence we will draw additional wires to get it:

   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};

As can be seen, bit 31 of the instruction word is repeated 21 times, this is "sign expansion" (converts a 12-bits signed quantity into a 32-bits one).

There are four other instruction formats S-type (for Store), B-type (for Branch), U-type (for Upper immediates that are left-shifted by 12), and J-type (for Jumps). Each instruction format has a different way of encoding an immediate value in the instruction word.

To understand what it means, let's get back to Chapter 2, page 16. The different instruction types correspond to the way immediate values are encoded in them.

Instr. type	Description	Immediate value encoding
`R-type`	register-register ALU ops. more on this here	None
`I-type`	register-immediate integer ALU ops and `JALR`.	12 bits, sign expansion
`S-type`	store	12 bits, sign expansion
`B-type`	branch	12 bits, sign expansion, upper `[31:1]` (bit 0 is 0)
`U-type`	`LUI`,`AUIPC`	20 bits, upper `31:12` (bits `[11:0]` are 0)
`J-type`	`JAL`	12 bits, sign expansion, upper `[31:1]` (bit 0 is 0)

Note that I-type and S-type encode the same type of values (but they are taken from different parts of instr). Same thing for B-type and J-type.

One can decode the different types of immediates as follows:

   wire [31:0] Uimm={    instr[31],   instr[30:12], {12{1'b0}}};
   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
   wire [31:0] Simm={{21{instr[31]}}, instr[30:25],instr[11:7]};
   wire [31:0] Bimm={{20{instr[31]}}, instr[7],instr[30:25],instr[11:8],1'b0};
   wire [31:0] Jimm={{12{instr[31]}}, instr[19:12],instr[20],instr[30:21],1'b0};

Note that Iimm, Simm, Bimm and Jimm do sign expansion (by copying bit 31 the required number of times to fill the MSBs).

And that's all for our instruction decoder ! To summarize, the instruction decoder gets the following information from the instruction word:

signals isXXX that recognizes among the 11 possible RISC-V instructions
source and destination registers rs1,rs2 and rd
function codes funct3 and funct7
the five formats for immediate values (with sign expansion for Iimm, Simm, Bimm and Jimm).

Let us now initialize the memory with a few RISC-V instruction and see whether we can recognize them by lighting a different LED depending on the instruction (step4.v). To do that, we use the big table in page 130 of the RISC-V reference manual. It is a bit painful (we will see easier ways later !). Using the _ character to separate fields of a binary constant is especially interesting under this circumstance.

   initial begin
      // add x1, x0, x0
      //                    rs2   rs1  add  rd  ALUREG
      MEM[0] = 32'b0000000_00000_00000_000_00001_0110011;
      // addi x1, x1, 1
      //             imm         rs1  add  rd   ALUIMM
      MEM[1] = 32'b000000000001_00001_000_00001_0010011;
      ...
      // lw x2,0(x1)
      //             imm         rs1   w   rd   LOAD
      MEM[5] = 32'b000000000000_00001_010_00010_0000011;
      // sw x2,0(x1)
      //             imm   rs2   rs1   w   imm  STORE
      MEM[6] = 32'b000000_00001_00010_010_00000_0100011;
      // ebreak
      //                                        SYSTEM
      MEM[7] = 32'b000000000001_00000_000_00000_1110011;
   end

Then we can fetch and recognize the instructions as follows:

   always @(posedge clk) begin
      if(!resetn) begin
	 PC <= 0;
      end else if(!isSYSTEM) begin
	 instr <= MEM[PC];
	 PC <= PC+1;
      end
   end
   assign LEDS = isSYSTEM ? 31 : {PC[0],isALUreg,isALUimm,isStore,isLoad};

(first led is wired to PC[0] so that we will see it blinking even if there is the same instruction several times).

As you can see, the program counter is only incremented if instruction is not SYSTEM. For now, the only SYSTEM instruction that we support is EBREAK, that halts execution.

In simulation mode, we can in addition display the name of the recognized instruction and the fields:

`ifdef BENCH   
   always @(posedge clk) begin
      $display("PC=%0d",PC);
      case (1'b1)
	isALUreg: $display("ALUreg rd=%d rs1=%d rs2=%d funct3=%b",rdId, rs1Id, rs2Id, funct3);
	isALUimm: $display("ALUimm rd=%d rs1=%d imm=%0d funct3=%b",rdId, rs1Id, Iimm, funct3);
	isBranch: $display("BRANCH");
	isJAL:    $display("JAL");
	isJALR:   $display("JALR");
	isAUIPC:  $display("AUIPC");
	isLUI:    $display("LUI");	
	isLoad:   $display("LOAD");
	isStore:  $display("STORE");
	isSYSTEM: $display("SYSTEM");
      endcase 
   end
`endif

Try this run step4.v in simulation and on the device. Try initializing the memory with different RISC-V instruction and test whether the decoder recognizes them.

Sidebar: the elegance of RISC-V

This paragraph may be skipped. it just contains my own impressions and reflexions on the RISC-V instruction set, inspired by the comments and Q&A in italics in the RISC-V reference manual.

At this point, I realized what an instruction set architecture means: it is for sure a specification of what bit pattern does what (Instruction Set) and it is also at the same time driven by how this will be translated into wires (Architecture). An ISA is not abstract, it is independent on an implementation, but it is strongly designed with implementation in mind ! While the pipeline, branch prediction unit, multiple execution units, caches may differ in different implementations, the instruction decoder is probably very similar in all implementations.

There were things that seemed really weird to me in the first place: all these immediate format variants, the fact that immediate values are scrambled in different bits of instr, the zero register, and the weird instructions LUI,AUIPC,JAL,JALR. When writing the instruction decoder, you better understand the reasons. The ISA is really smart, and is the result of a long evolution (there were RISC-I, RISC-II, ... before). It seems to me the result of a distillation. Now, in 2020, many things were tested in terms of ISA, and this one seems to have benefited from all the previous attempts, taking the good choices and avoiding the suboptimal ones.

What is really nice in the ISA is:

instruction size is fixed. Makes things really easier. (there are extension with varying instrution length, but at least the core instruction set is simple);
rs1,rs2,rd are always encoded by the same bits of instr;
the immediate formats that need to do sign expansion do it from the same bit (instr[31]);
the weird instructions LUI,AUIPC,JAL,JALR can be combined to implement higher-level tasks (load 32-bit constant in register, jump to arbitrary address, function calls). Their existence is justified by the fact it makes the design easier. Then assembly programmer's life is made easier by pseudo-instructions CALL, RET, ... See risc-v assembly manual, the two tables at the end of the page. Same thing for tests/branch instructions obtained by swapping parameters (e.g. a < b <=> b > a etc...), there are pseudo-instructions that do the job for you.

Put differently, to appreciate the elegance of the RISC-V ISA, imagine that your mission is to invent it. That is, invent both the set of instructions and the way they are encoded as bit patterns. The constraints are:

fixed instruction length (32 bits)
as simple as possible: the ultimate sophistication is simplicity [Leonardo da Vinci] !!
source and destination registers always encoded at the same position
whenever there is sign-extension, it should be done from the same bit
it should be simple to load an arbitrary 32-bits immediate value in a register (but may take several instructions)
it should be simple to jump to arbitrary memory locations (but may take several instructions)
it should be simple to implement function calls (but may take several instructions)

Then you understand why there are many different immediate formats. For instance, consider JAL, that does not have a source register, as compared to JALR that has one. Both take an immediate value, but JAL has 5 more bits available to store it, since it does not need to encode the source register. The slightest available bit is used to extend the dynamic range of the immediates. This explains both the multiple immediate formats and the fact that they are assembled from multiple pieces of instr, slaloming between the three fixed 5-bits register encodings, that are there or not depending on the cases.

Now the rationale behind the weird instructions LUI,AUIPC,JAL and JALR is to give a set of functions that can be combined to load arbitrary 32-bit values in register, or to jump to arbitrary locations in memory, or to implement the function call protocol as simply as possible. Considering the constraints, the taken choices (that seemed weird to me in the first place) perfectly make sense. In addition, with the taken choices, the instruction decoder is pretty simple and has a low logical depth. Besides the 7-bits instruction decoder, it mostly consists of a set of wires drawn from the bits of instr, and duplication of the sign-extended bit 31 to form the immediate values.

Before moving forward, I'd like to say a word about the zero register. I think it is really a smart move. With it, you do not need a MOV rd rs instruction (just ADD rd rs zero), you do not need a NOP instruction (ADD zero zero zero), and all the branch variants can compare with zero ! I think that zero is a great invention, not as great as 0, but really makes the instruction set more compact.

Step 5: The register bank and the state machine

The register bank is implemented as follows:

   reg [31:0] RegisterBank [0:31];

Let us take a closer look at what we need to to to execute an instruction. Consider for instance a stream of R-type instructions. For each instruction, we need to do the following four things:

fetch the instruction: instr <= MEM[PC]
fetch the values of rs1 and rs2: rs1 <= RegisterBank[rs1Id]; rs2 <= RegisterBank[rs2Id] where rs1 and rs2 are two registers. We need to do that because RegisterBank will be synthesized as a block of BRAM, and one needs one cycle to access the content of BRAM.
compute rs1 OP rs2 (where OP depends on funct3 and funct7)
store the result in rd: RegisterBank[rdId] <= writeBackData. This can be done during the same cycle as the previous step if OP is computed by a combinatorial circuit.

The first three operations are implemented by a state machine, as follows (see step5.v):

   localparam FETCH_INSTR = 0;
   localparam FETCH_REGS  = 1;
   localparam EXECUTE     = 2;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
	 case(state)
	   FETCH_INSTR: begin
	      instr <= MEM[PC];
	      state <= FETCH_REGS;
	   end
	   FETCH_REGS: begin
	      rs1 <= RegisterBank[rs1Id];
	      rs2 <= RegisterBank[rs2Id];
	      state <= EXECUTE;
	   end
	   EXECUTE: begin
	      PC <= PC + 1;
	      state <= FETCH_INSTR;	      
	   end
	 endcase
      end 
   end

The fourth one (register write-back) is implemented in this block:

   wire [31:0] writeBackData = ... ;
   wire writeBackEn = ...;
   always @posedge(clk) begin	
      if(writeBackEn && rdId != 0) begin
          RegisterBank[rdId] <= writeBackData;
      end
   end

Remember that writing to register 0 has no effect (hence the test rdId != 0). The signal writeBackEn is asserted whenever writeBackData should be written to register rdId. The data to be written back (writeBackData) will be obtained from the ALU, as explained in the next episode.

Try this: run step5.v in simulation and on the device. You will see your wannabe CPU's state machine dancing waltz on the LEDs (that display the current state).

Step 6: the ALU

Now we can fetch instructions from memory, decode them and read register values, but our (wannabe) CPU is still unable to do anything. Let us see how to do actual computations on register's values.

So, are you going to create an ALU module ? And by the way, why did not you create a Decoder module, and a RegisterBank module ?

My very first design used multiple modules and multiple files, for a total of 1000 lines of code or so, then Matthias Koch wrote a monolithic version, that fits in 200 lines of code. Not only it is more compact, but also it is much easier to understand when you got everything in one place. Rule of thumb: if you have more boxes and wires between the boxes than circuitry in the boxes, then you have too many boxes !

But wait a minute, modular design is good, no ?

Modular design is neither good nor bad, it is useful whenever it makes things simpler. It is not the case in the present situation. There is no absolute answer though, it is a matter of taste and style ! In this tutorial, we use a (mostly) monolithic design.

Now we want to implement two types of instructions:

Rtype: rd <- rs1 OP rs2 (recognized by isALUreg)
Itype: rd <- rs1 OP Iimm (recognized by isALUimm)

The ALU takes two inputs aluIn1 and aluIn2, computes aluIn1 OP aluIn2 and stores it in aluOut:

   wire [31:0] aluIn1 = rs1;
   wire [31:0] aluIn2 = isALUreg ? rs2 : Iimm;
   reg [31:0] aluOut;

Depending on the instruction type, aluIn2 is either the value in the second source register rs2, or an immediate in the Itype format (Immm). The operation OP depends mostly on funct3 (and also on funct7). Keep a copy of the RISC-V reference manual open page 130 on your knees or in another window:

funct3	operation
3'b000	`ADD` or `SUB`
3'b001	left shift
3'b010	signed comparison (<)
3'b011	unsigned comparison (<)
3'b100	`XOR`
3'b101	logical right shift or arithmetic right shift
3'b110	`OR`
3'b111	`AND`

for ADD/SUB, if its an ALUreg operation (Rtype), then one makes the difference between ADD and SUB by testing bit 5 of funct7 (1 for SUB). If it is an ALUimm operation (Itype), then it can be only ADD. In this context, one just needs to test bit 5 of instr to distinguish between ALUreg (if it is 1) and ALUimm (if it is 0).
for logical or arithmetic right shift, one makes the difference also by testing bit 5 of funct7, 1 for arithmetic shift (with sign expansion) and 0 for logical shift.
the shift amount is either the content of rs2 for ALUreg instructions or instr[24:20] (the same bits as rs2Id) for ALUimm instructions.

Putting everything together, one gets the following VERILOG code for the ALU:

   reg [31:0] aluOut;
   wire [4:0] shamt = isALUreg ? rs2[4:0] : instr[24:20]; // shift amount
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? (aluIn1-aluIn2) : (aluIn1+aluIn2);
	3'b001: aluOut = aluIn1 << shamt;
	3'b010: aluOut = ($signed(aluIn1) < $signed(aluIn2));
	3'b011: aluOut = (aluIn1 < aluIn2);
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) : (aluIn1 >> shamt); 
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);	
      endcase
   end

Note: although it is declared as a reg, aluOut will be a combinatorial function (no flipflop generated), because its value is determined in a combinatorial block (always @(*)), and all the configurations are enumerated in the case statement.

Register write-back is configured as follows:

   assign writeBackData = aluOut; 
   assign writeBackEn = (state == EXECUTE && (isALUreg || isALUimm));

Try this run step6.v in simulation and on the device. In simulation it will display the written value and the written register for all register write-back operation. On the device it will show the 5 LSBs of x1 on the LEDs. Then you can try changing the program, and observe the effect on register values.

You are here ! This is the list of instructions you have to implement, your wannabe RISC-V core currently supports 20 of them. Next steps: jumps, then branches, then... the rest. Before then, as you probably have noticed, translating RISC-V programs into binary (that is, assembling manually) is extremely painful. Next section gives a much easier solution.

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[ ] 2	[ ] 6	[ ]	[ ]	[ ] 5	[ ] 3	[*] 1

Step 7: using the VERILOG assembler

To avoid having to manually translate RISC-V assembly into binary, one can use the GNU assembler, generate a binary file, translate it into hexadecimal and use the VERILOG function readmemh() to initialize memory with the content of that file. We will see later how to do that.

But in our case, it would be very convenient to be able to write small assembly programs directly in the same VERILOG file as our design. In fact, it is possible to do so, by implementing a RISC-V assembler directly in VERILOG (using tasks and functions), as done in riscv_assembly.v.

In step7.v, memory is initialized with the same assembly program as in step6.v. It looks like that now, Much easier to read, no ?

   `include "riscv_assembly.v"
   initial begin
      ADD(x0,x0,x0);
      ADD(x1,x0,x0);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADD(x2,x1,x0);
      ADD(x3,x1,x2);
      SRLI(x3,x3,3);
      SLLI(x3,x3,31);
      SRAI(x3,x3,5);
      SRLI(x1,x3,26);
      EBREAK();
   end

Note: riscv_assembly.v needs to be included from inside the module that uses assembly.

In this step, we make another modification: in the previous steps, PC was the index of the current instruction. For what follows, we want it to be the address of the current instruction. Since each instruction is 32-bits long, it means that:

to increment PC, we do PC <= PC + 4 (instead of PC <= PC + 1 as before)
to fetch the current instruction, we do instr <= MEM[PC[31:2]]; (we ignore the two LSBs of PC).

Step 8: jumps

There are two jump instructions, JAL (jump and link), and JALR (jump and link register). By "and link", one means that the current PC can be written to a register. Hence JAL and JALR can be used to implement not only jumps, but also function calls. Here is what the two instructions are supposed to do:

instruction	effect
JAL rd,imm	rd<-PC+4; PC<-PC+Jimm
JALR rd,rs1,imm	rd<-PC+4; PC<-rs1+Iimm

To implement these two instructions, we need to make the following changes to our core. First thing is register write-back: now value can be PC+4 instead of aluOut for jump instructions:

   assign writeBackData = (isJAL || isJALR) ? (PC + 4) : aluOut;
   assign writeBackEn = (state == EXECUTE && 
			 (isALUreg || 
			  isALUimm || 
			  isJAL    || 
			  isJALR)
			 );

We also need to declare a nextPC value, that implements the three possibilities:

   wire [31:0] nextPC = isJAL  ? PC+Jimm :
	                isJALR ? rs1+Iimm :
	                PC+4;

Then, in the state machine, the line PC <= PC + 4; is replaced with PC <= nextPC; and that's all !

We can now implement a simple (infinite) loop to test our new jump instruction:

`include "riscv_assembly.v"
      integer L0_=4;
      initial begin
	 ADD(x1,x0,x0);
      Label(L0_);
	 ADDI(x1,x1,1);
	 JAL(x0,LabelRef(L0_));
	 EBREAK();
	 endASM();
      end

The integer L0_ is a label. Unlike with a real assembler, we need to specify the value of L0_ by hand. Here it is easy, L0_ is right after the first instruction, hence it corresponds to the beginning of the RAM (0) plus one 32-bits words, that is, 4. For longer programs with many labels, you can let the labels uninitialized (integer L0_;) then the first time you run the program, it will compute and display the values to be used for the labels. It is not super-convenient, but still much better than assembling by hand / determining the labels by hand.

The LabelRef() function computes the label's offset relative to the current program counter. In addition, in simulation mode, it displays the current address (to be used to initialize the label), and if the label was already initialized (like here with L0_=4) it checks that the label corresponds to the current address generated by the assembler. If it is not the case, the endASM() statement displays an error message and exits.

Note 1: I systematically insert an EBREAK() instruction at the end of the program, here it would not be necessary (we have an infinite loop), but if I change my mind and exit the loop, then EBREAK() is already there.

Note 2: the endASM(); statement checks the validity of all the labels and exits simulation whenever an invalid label is detected. If you use the RISC-V VERILOG assembler, systematically run your design in simulation before synthesizing (because this verification cannot be done at synthesis time).

Try this Run the design step8.v in simulation and on the device. Yes, after 8 steps, what we have is just another stupid blinky ! But this time, this blinky is executing a real RISC-V program ! It is not a complete RISC-V core yet, but it starts to have a strong RISC-V flavor. Be patient, our core will be soon able to run RISC-V programs that are more interesting than a blinky.

You are here ! Still some work to do, but we are making progress.

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[ ] 6	[ ]	[ ]	[ ] 5	[ ] 3	[*] 1

Try this add a couple of instructions before the loop, run in simulation, fix the label as indicated by the simulator, re-run in simulation, run on device.

Step 9: Branches

Branches are like jumps, except that they compare two register, and update PC based on the result of the comparison. Another difference is that they are more limited in the address range they can reach from PC (12-bits offset). There are 6 different branch instructions:

instruction	effect
BEQ rs1,rs2,imm	if(rs1 == rs2) PC <- PC+Bimm
BNE rs1,rs2,imm	if(rs1 != rs2) PC <- PC+Bimm
BLT rs1,rs2,imm	if(rs1 < rs2) PC <- PC+Bimm (signed comparison)
BGE rs1,rs2,imm	if(rs1 >= rs2) PC <- PC+Bimm (signed comparison)
BLTU rs1,rs2,imm	if(rs1 < rs2) PC <- PC+Bimm (unsigned comparison)
BGEU rs1,rs2,imm	if(rs1 >= rs2) PC <- PC+Bimm (unsigned comparison)

Wait a minute: there is BLT, but where is BGT ? Always the same principle in a RISC-V processor: if something can be done with a functionality that is already there, do not add a new functionality ! In this case, BGT rs1,rs2,imm is equivalent to BLT rs2,rs1,imm (just swap the first two operands). If you use BGT in a RISC-V assembly program, it will work (and the assembler replaces it with BLT with swapped operands). BGT is called a "pseudo-instruction". There are many pseudo-instructions to make RISC-V assembly programmer's life easier (more on this later).

Back to our branch instructions, we will need to add in the ALU some wires to compute the result of the test, as follows:

   reg takeBranch;
   always @(*) begin
      case(funct3)
	3'b000: takeBranch = (rs1 == rs2);
	3'b001: takeBranch = (rs1 != rs2);
	3'b100: takeBranch = ($signed(rs1) < $signed(rs2));
	3'b101: takeBranch = ($signed(rs1) >= $signed(rs2));
	3'b110: takeBranch = (rs1 < rs2);
	3'b111: takeBranch = (rs1 >= rs2);
	default: takeBranch = 1'b0;
      endcase

Note 1 it is possible to create a much more compact ALU, that uses a much smaller number of LUTs when synthesized, we sill see that later (for now, our goal is to have a RISC-V processor that works, we will optimize it later).

Note 2 Among the 8 possibilites given by funct3, only 6 of them are used by the branch instructions. It is necessary to have a default: statement in the case, else the synthesizer would not be able to keep takeBranch as purely combinatorial (and would generate a latch, which we do not want).

Now the only thing that remains to do for implementing branches is to add a case for nextPC, as follows:

   wire [31:0] nextPC = (isBranch && takeBranch) ? PC+Bimm :	       
   	                isJAL                    ? PC+Jimm :
	                isJALR                   ? rs1+Iimm :
	                PC+4;

We are now ready to test a simple loop, that counts from 0 to 31, displays each iteration on the LEDs (remember, they are wired to x1) and stops:

`include "riscv_assembly.v"
      integer L0_ = 8;
   
      initial begin
         ADD(x1,x0,x0);
         ADDI(x2,x0,32);
      Label(L0_); 
	 ADDI(x1,x1,1); 
         BNE(x1, x2, LabelRef(L0_));
         EBREAK();

	 endASM();
      end

Try this run step9.v in simulation and on device. Try modifying the program, create a "knight driver" blinky with an outer loop and two inner loops (one left to right and one right to left).

You are here ! Wow, we have implemented 28 instructions out of 38 ! Let us continue...

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[ *] 6	[ ]	[ ]	[ ] 5	[ ] 3	[*] 1

Step 10: LUI and AUIPC

We still have these two weird instructions to implement. What do they do ? It is rather simple:

instruction	effect
LUI rd, imm	rd <= Uimm
AUIPC rd, imm	rd <= PC + Uimm

And if you look at the Uimm format, it reads its MSBs (imm[31:12]) from the immediate encoded in the instructions. The 12 LSBs are set to zero. These two instructions are super useful: the immediate formats supported by all the other instructions can only modify the LSBs. Combined with these two functions, one can load an arbitrary value in a register (but this can require up to two instructions).

Implementing these two instructions just requires to change writeBackEn and writeBackData as follows:

   assign writeBackData = (isJAL || isJALR) ? (PC + 4) :
			  (isLUI) ? Uimm :
			  (isAUIPC) ? (PC + Uimm) : 
			  aluOut;
   
   assign writeBackEn = (state == EXECUTE && 
			 (isALUreg || 
			  isALUimm || 
			  isJAL    || 
			  isJALR   ||
			  isLUI    ||
			  isAUIPC)
			 );

You are here ! Seems that we are nearly there ! 8 instructions to go...

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[ *] 6	[*]	[*]	[ ] 5	[ ] 3	[*] 1

Try this run step10.v in simulation and on the device.

Argh !! On my icestick, it does not fit (requires 1283 LUTs and the IceStick only has 1280). What can we do ? Remember, we absolutely took no care about resource consumption, just trying to write a design that works. In fact, there is a lot of room for improvement in our design, we will see that later, but before then, let's organize our SOC a bit better (then we will shrink the processor).

Step 11: Memory in a separate module

In our previous designs, we got everything in our SOC module (memory and processor). In this step, we will see how to separate them.

First, the Memory module:

module Memory (
   input             clk,
   input      [31:0] mem_addr,  // address to be read
   output reg [31:0] mem_rdata, // data read from memory
   input   	     mem_rstrb  // goes high when processor wants to read
);
   reg [31:0] MEM [0:255]; 

`include "riscv_assembly.v"
   integer L0_=8;
   initial begin
                  ADD(x1,x0,x0);      
                  ADDI(x2,x0,31);
      Label(L0_); ADDI(x1,x1,1); 
                  BNE(x1, x2, LabelRef(L0_));
                  EBREAK();
      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule

In its interface, there is a clk signal connected to the clock. Whenever the processor wants to read in memory, it positions the address to be read on mem_addr, and sets mem_rstrb to 1. Then the Memory module returns the data to be read on mem_rdata.

Symetrically, the Processor module has a mem_addr signal (as output this time), a mem_rdata signal (as input) and a mem_rstrb signal (as output):

module Processor (
    input 	      clk,
    input 	      resetn,
    output     [31:0] mem_addr, 
    input      [31:0] mem_rdata, 
    output 	      mem_rstrb,
    output reg [31:0] x1		  
);
...
endmodule

(in addition, we have a x1 signal that contains the contents of register x1, that can be used for visual debugging. We will plug it to the LEDs).

The state machine has one additional state:

   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;

   case(state)
     FETCH_INSTR: begin
       state <= WAIT_INSTR;
     end
     WAIT_INSTR: begin
       instr <= mem_rdata;
       state <= FETCH_REGS;
     end
     FETCH_REGS: begin
       rs1 <= RegisterBank[rs1Id];
       rs2 <= RegisterBank[rs2Id];
       state <= EXECUTE;
     end
     EXECUTE: begin
        if(!isSYSTEM) begin
  	   PC <= nextPC;
	end
	state <= FETCH_INSTR;
      end
   endcase

Note we will see later how to simplify it and get back to three states.

Now, mem_addr and mem_rstrb can be wired as follows:

   assign mem_addr = PC;
   assign mem_rstrb = (state == FETCH_INSTR);

And finally, everything is installed and connected in the SOC

module SOC (
    input  CLK,        // system clock 
    input  RESET,      // reset button
    output [4:0] LEDS, // system LEDs
    input  RXD,        // UART receive
    output TXD         // UART transmit
);
   wire    clk;
   wire    resetn;
   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb)
   );

   wire [31:0] mem_addr;
   wire [31:0] mem_rdata;
   wire mem_rstrb;
   wire [31:0] x1;
   Processor CPU(
      .clk(clk),
      .resetn(resetn),		 
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb),
      .x1(x1)		 
   );
   assign LEDS = x1[4:0];

   // Gearbox and reset circuitry.
   Clockworks #(
     .SLOW(19) // Divide clock frequency by 2^19
   ) CW (
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );
   
   assign TXD  = 1'b0; // not used for now   
endmodule

Now you can run step11.v in the simulator. As expected, it does the same thing as in the previous step (counts on the LEDs from 0 to 31 and stops). What about running it on the device ? Wow, even worse, 1341 LUTs (and we only got 1280 of them on the IceStick). So let us shrink our code to make it fit !

Step 12: Size optimization: the Incredible Shrinking Core.

Tribute to "the Incredible Shrinking Man" classic movie

There are many things we can do for shrinking this core. Let us first take a look at the ALU. It can compute addition, subtraction, and comparisons. Can't we reuse the result of subtraction for comparisons ? Sure we can, but to do that we need to compute a 33 bits subtraction, and test the sign bit. Matthias Koch (@Mecrisp) explained me this trick, that is also used in swapforth/J1 (another small RISC core that works on the IceStick). The 33 bits subtract is written as follows:

   wire [32:0] aluMinus = {1'b0,aluIn1} - {1'b0,aluIn2};

if you want to know what A-B does in Verilog, it corresponds to A+~B+1 (negate all the bits of B before adding, and add 1), it is how two's complement subtraction works. For instance, take 4'b0000 - 4'b0001, the result is -1, encoded as 4'b1111. It is computed as follows by the formula: 4'b0000 + ~4'b0001 + 1 = 4'b0000 + 4'b1110 + 1 = 4'b1111. So we will keep the following expression (we could have kept the simpler form above, but it is interesting to be aware of what happens under the scene):

   wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0,aluIn1} + 33'b1;

Then we can create the wires for the three tests (this saves three 32-bit adders):

   wire        EQ  = (aluMinus[31:0] == 0);
   wire        LTU = aluMinus[32];
   wire        LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

The first one, EQ, goes high when aluIn1 and aluIn2 have the same value, or aluMinus == 0 (no need to test the 33-rd bit)
the second one, LTU, corresponds to unsigned comparison. It is given by the sign bit of our 33-bits subtraction.
for the third one, there are two cases: if the signs differ, then LT goes high if aluIn1 is negative, else it is given by the sign bit of our 33-bits subtraction.

Of course, we still need one adder for addition:

   wire [31:0] aluPlus = aluIn1 + aluIn2;

Then, aluOut is computed as follows:

   reg [31:0]  aluOut;
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
	3'b001: aluOut = aluIn1 << shamt;;
	3'b010: aluOut = {31'b0, LT};
	3'b011: aluOut = {31'b0, LTU};
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) : 
			 ($signed(aluIn1) >> shamt); 
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);	
      endcase
   end

Let us try on the IceStick. Yes ! 1167 LUTs, it fits ! But it is not a good reason to stop there, there are still several opportunities to shrink space. Let us take a look at takeBranch, can't we reuse the EQ,LT,LTU signals we just created ? Sure we can:

   reg takeBranch;
   always @(*) begin
      case(funct3)
	3'b000: takeBranch = EQ;
	3'b001: takeBranch = !EQ;
	3'b100: takeBranch = LT;
	3'b101: takeBranch = !LT;
	3'b110: takeBranch = LTU;
	3'b111: takeBranch = !LTU;
	default: takeBranch = 1'b0;
      endcase
   end

For this to work, we also need to make sure that rs2 is routed to the second ALU input also for branches:

   wire [31:0] aluIn2 = isALUreg | isBranch ? rs2 : Iimm;

What does it give on the device ? 1094 LUTs, not that bad, but let us continue... The jump target for JALR is rs1+Iimm, and we created an adder especially for that, it is stupid because the ALU already computes that. OK let us reuse it:

   wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm  :	       
	                isJALR                              ? {aluPlus[31:1],1'b0}:
	                PCplus4;

How do we stand now ? 1030 LUTs. And it is not finished: what eats-up the largest number of LUTs is the shifter, and we have three of them in the ALU (one for left shifts, one for logical right shifts and one for arithmetic right shifts). By another sorcerer's trick indicated by by Matthias Koch (@mecrisp), it is possible to merge the two right shifts, by creating a 33 bits shifter with the additional bit set to 0 or 1 depending on input's bit31 and on whether it is a logical shift or an arithmetic shift.

   wire [31:0] shifter = 
               $signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];

Even better, Matthias told me it is possible to use in fact a single shifter, by flipping the input and flipping the output if it is a left shift:

   wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
   wire [31:0] leftshift = flip32(shifter);

The ALU then looks like that:

   reg [31:0]  aluOut;
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
	3'b001: aluOut = leftshift;
	3'b010: aluOut = {31'b0, LT};
	3'b011: aluOut = {31'b0, LTU};
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = shifter;
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);	
      endcase
   end

Where do we stand now ? 887 LUTs my friend !

Note 1 well, in fact one can gain even more space with the shifter, by shifting 1 single bit at each clock. The ALU then becomes a little bit more complicated (multi-cycle), but much much smaller (Femtorv32-quark uses this trick). We will see that later.

Note 2 with a multi-cycle ALU, we could also have a single 33-bits adder, and compute subtractions in three cycles, by separating the computation of ~aluIn2, aluIn1+(~aluIn2) and aluIn1+(~aluIn2)+1.

Before then, another easy win is factoring the adder used for address computation, as follows:

   wire [31:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
				  instr[4] ? Uimm[31:0] :
				             Bimm[31:0] );
   wire [31:0] PCplus4 = PC+4;

Then these two adders can be used by both nextPC and writeBackData:

   assign writeBackData = (isJAL || isJALR) ? (PCplus4) :
			  (isLUI) ? Uimm :
			  (isAUIPC) ? PCplusImm : 
			  aluOut;

   assign writeBackEn = (state == EXECUTE && !isBranch);

   wire [31:0] nextPC = (isBranch && takeBranch || isJAL) ? PC+Imm  :	       
	                isJALR                   ? {aluPlus[31:1],1'b0} :
	                PCplus;

The verdict ? 839 LUTs (we have gained another 50 LUTs or so...). There is still room for gaining more LUTs (by using a multi-cycle ALU for shifts, and by using a smaller number of bits for address computation), but we'll keep that for later, since we have now enough room on the device for the next steps.

Step 13: subroutines (version 1, using plain RISC-V instructions)

OK, so now we have an (uncomplete) RISC-V processor, a SOC, both fit on the device. Remember, we are approaching the end, only 8 instructions to go (5 Load variants, 3 Store variants).

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[ *] 6	[*]	[*]	[ ] 5	[ ] 3	[*] 1

Before attacking them, let us learn a bit more on RISC-V assembly, and function calls. Up to now, we have used a gearbox to slow down the CPU in such a way we can observe it executing our programs. Could'nt we implement a wait function instead and call it ? Let us see how to do that.

First thing to do is to remove the #(.SLOW(nnn)) parameter in the Clockworks instanciation:

   Clockworks CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );

this no longer generates a gearbox and directly wires the CLK signal of the board to the internal clk signal used by our design.

OK, so now we need to see two different things:

how to write a function that waits for some time
how to call it

Wait a minute you are talking about function calls, but we do not have Load / Store instructions. We won't be able to push the return address on the stack (because we cannot read/write memory, and the stack is in memory !), so how is it possible ?

There would many possible ways of using RISC-V instructions to implement function calls. To make sure everybody uses the same convention, there is an application binary interface that defines how to call functions, how to pass parameters, and which register does what. See this document for more details.

Calling a function In this document, we learn that for calling a function, the return address will be stored in x1. Hence one can call a function using JAL(x1,offset) where offset is the (signed) difference between the program counter and the address of the function to be called. This works provided the offset fits in 20 bits (Jimm format). Note: for function that are further away, one can use a combination of AUIPC and JALR to reach an arbitrary offset.

Returning from a function is done by jumping to the address stored in x1, which can be done by JALR(x0,x1,0).

Function arguments and return value: The first 6 function arguments are passed through x10..x16, and the return value is passed through x10 (it overwrites the first function argument).

That's interesting, even though we do not have Load/Store, we can write programs with functions, but we cannot write functions that call other functions, because this requires saving x1 to the stack (well in fact nothing forbids us from doing that by saving x1 in another register but then it would quickly become a mess, so we won't do that).

One little thing: we have just learnt that in the ABI, x1 is used to store the return address of functions. Up to know we have wired it to the LEDs. Since we are going now to comply with the ABI, we need to chose another register instead. From now, x10 will be wired to the LEDs.

OK, so now we have everything we need to write yet another version of the blinky ! Let us chose a slow_bit constant, wire a wait function that counts to 2^slow_bit, and call it to slow-down our blinky:

`ifdef BENCH
   localparam slow_bit=15;
`else
   localparam slow_bit=19;
`endif

   
`include "riscv_assembly.v"
   integer L0_   = 4;
   integer wait_ = 20;
   integer L1_   = 28;
   
   initial begin
      ADD(x10,x0,x0);
   Label(L0_); 
      ADDI(x10,x10,1);
      JAL(x1,LabelRef(wait_)); // call(wait_)
      JAL(zero,LabelRef(L0_)); // jump(l0_)
      
      EBREAK(); // I keep it systematically
                // here in case I change the program.

   Label(wait_);
      ADDI(x11,x0,1);
      SLLI(x11,x11,slow_bit);
   Label(L1_);
      ADDI(x11,x11,-1);
      BNE(x11,x0,LabelRef(L1_));
      JALR(x0,x1,0);	  
	  
      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule

Try step13.v in simulation and on the device.

Try this Knight-driver blinky, with one routine for going from left to right, another routine for going from right to left, and the wait routine. Hint you will need to save x1 to another register.

Step 14: subroutines (version 2, using RISC-V ABI and pseudo-instructions)

With the ABI, we have a standard way of writing programs, but there are many things to remember:

all RISC-V registers are the same, but with the ABI, we need to use certain registers for certain tasks (x1 for return address, x10..x16 for function parameters, etc...);
calling a function is implemented using JAL or AUIPC and JALR, and returning from a function is implemented using JALR.

On a CISC processor, there are often special functions for calling functions (CALL) and for returning from a function (RET), and registers are often specialized (function return address, stack pointer, function parameters). This makes programmer's life easier because there is less to remember. There is no reason not doing the same for a RISC processor ! Let us pretend that the register are different and give them different names (or aliases). These names are listed here.

ABI name	name	usage
`zero`	`x0`	read:0 write:ignored
`ra`	`x1`	return address
`t0`...`t6`	...	temporary registers
`fp`,`s0`...`s11`	...	saved registers, `fp`=`so`: frame pointer
`a0`...`a7`	...	function parameters and return value (`a0`)
`sp`	`x2`	stack pointer
`gp`	`x3`	global pointer

Saved registers (s0, ... s11) are supposed to be left untouched or saved/restored by functions. You can put your local variables there. If you write a function, you are supposed to push the ones you use on the stack and pop them before returning.

For all the other registers, you cannot expect them to be preserved through function calls.

The global pointer gp can be used as a "shortcut" to reach memory areas that are far away in 1 instruction. We will see that later (once we have Load and Store).

In our VERILOG assembler riscv_assembly.v, we just need to declare these aliases for register names:

   localparam zero = x0;
   localparam ra   = x1;
   localparam sp   = x2;
   localparam gp   = x3;
   ...
   localparam t4   = x29;
   localparam t5   = x30;      
   localparam t6   = x31;

Besides these names, there are also pseudo-instructions for common tasks, such as:

pseudo-instruction	action
`LI(rd,imm)`	loads a 32-bits number in a register
`CALL(offset)`	calls a function
`RET()`	return from a function
`MV(rd,rs)`	equivalent to `ADD(rd,rs,zero)`
`NOP()`	equivalent to `ADD(zero,zero,zero)`
`J(offset)`	equivalent to `JAL(zero,offset)`
`BEQZ(rd1,offset)`	equivalent to `BEQ(rd1,x0,offset)`
`BNEZ(rd1,offset)`	equivalent to `BNE(rd1,x0,offset)`
`BGT(rd1,rd2,offset)`	equivalent to `BLT(rd2,rd1,offset)`

If the constant in the [-2048,2047] range, LI is implemented using ADDI(rd,x0,imm), else it uses a combination of LUI and ADDI (if you want to know how it works, see this stackoverflow answer, there are tricky details about sign expansion).

Using ABI register names and pseudo-instructions, our program becomes as follows:

   integer L0_   = 4;
   integer wait_ = 24;
   integer L1_   = 32;
   
   initial begin
      LI(a0,0);
   Label(L0_); 
      ADDI(a0,a0,1);
      CALL(LabelRef(wait_)); 
      J(LabelRef(L0_)); 
      
      EBREAK();

   Label(wait_);
      LI(a1,1);
      SLLI(a1,a1,slow_bit);
   Label(L1_);
      ADDI(a1,a1,-1);
      BNEZ(a1,LabelRef(L1_));
      RET();
	  
      endASM();
   end

It does not make a huge difference, but in longer programs, it improves legibility by showing the intent of the programmer (this one is a function, that one is a jump to a label etc...). Without it, since everything looks like the same, reading a program is more difficult.

It is quite funny: the RISC-V standard has a super-simple instruction set, but programming with it is not that easy, so the ABI pretends that the instruction set is more complicated, like a CISC processor, and this makes programmer's life easier. It also ensures that a function written by a programmer can be called from a function written by another programmer, possibly in a different language. We will see later how to use GNU assembler and C compiler to compile programs for our CPU. But before playing with software and toolchains, remember, we still have 8 instructions to implement in hardware (5 Load variants and 3 Store variants).

Try this invent (or copy it from somewhere else) a routine to multiply two numbers, test it on various inputs in simulation, and on the device.

Step 15: Load

Let us see now how to implement load instructions. There are 5 different instructions:

Instruction	Effect
LW(rd,rs1,imm)	Load word at address (rs1+imm) into rd
LBU(rd,rs1,imm)	Load byte at address (rs1+imm) into rd
LHU(rd,rs1,imm)	Load half-word at address (rs1+imm) into rd
LB(rd,rs1,imm)	Load byte at address (rs1+imm) into rd then sign extend
LH(rd,rs1,imm)	Load half-word at address (rs1+imm) into rd then sign extend

Note addresses are aligned on word boundaries for LW (multiple of 4 bytes) and halfword boundaries for LH,LHU (multiple of 2 bytes). It is a good thing, it makes things much easier for us...

But we still have some work to do ! First, some circuitry that determines the loaded value (that we will call LOAD_data).

As you can see, we got instructions for loading words, half-words and bytes, and instructions that load half-words and bytes exist in two versions:

LBU,LHU that load a byte,halfword in the LSBs of rd
LB,LH that load a byte,halfword in the LSBs of rd then do sign extensin:

For instance, imagine a sign byte with the value -1, that is 8'b11111111, loading it in a 32-bit register with LBU will result in 32'b0000000000000000000000011111111, whereas loading it with LB will result in 32'b11111111111111111111111111111111, that is, the 32-bits version of -1.

So we got a "two-dimensional" array of cases (whether we load a byte, halfword, word, and whether we do sign extension or not). Well, in fact it is even more complicated. Remember, our memory is structured into words, so when we load a byte, we need to know which one it is (among 4), and when we load a halfword, we need to know which one it is (among 2). This can be done by examining the 2 LSBs of the address of the data to be loaded (rs1 + Iimm):

   wire [31:0] loadstore_addr = rs1 + Iimm;
   wire [15:0] LOAD_halfword =
	       loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];

   wire  [7:0] LOAD_byte =
	       loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];

OK, so now we need to select among mem_rdata (LW), LOAD_halfword (LH,LHU) and LOAD_byte (LB,LBU). Examining the table in the RISC-V reference manual page 130, this is determined by the two LSBs of funct3:

   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;

   wire [31:0] LOAD_data =
         mem_byteAccess ? LOAD_byte     :
     mem_halfwordAccess ? LOAD_halfword :
                          mem_rdata     ;

Now we need to insert sign expansion into this expression. The value to be written in the MSBs of rd, LOAD_sign, depends on both whether the instruction does sign expansion (LB,LH), characterized by funct3[2]=0, and the MSB of the loaded value:

   wire LOAD_sign =
	!funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);

   wire [31:0] LOAD_data =
         mem_byteAccess ? {{24{LOAD_sign}},     LOAD_byte} :
     mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                          mem_rdata ;

Pfiuuuu, it was a bit painful, but in the end it is not too complicated. My initial design was much more complicated, but Matthias Koch (@mecrisp) simplified it a lot, resulting in the (reasonably easy to understand) design above.

We are not completely done though, now we need to modify the state machine. It will have two additional states, LOAD and WAIT_DATA:

   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   reg [2:0] state = FETCH_INSTR;

Note 1 we could do with a smaller number of states, but for now our goal is to have something that works and that is as easy to understand as possible. We will see later how to simplify the state machine. Note 2 do not forget to check that state has the required number of bits ! (reg [2:0] state instead of reg [1:0] state as before !!). Then the new states are plugged in as follows:

     ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad ? LOAD : FETCH_INSTR;
	   end
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end

     ...

And finally, the signals mem_addr (with the address to be read) and mem_rstrb (that goes high whenever the processor wants to read data) are driven as follows:

   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);

Let us test now our new instructions with the following program:

   integer L0_   = 8;
   integer wait_ = 32;
   integer L1_   = 40;
   
   initial begin
      LI(s0,0);   
      LI(s1,16);
   Label(L0_); 
      LB(a0,s0,400); // LEDs are plugged on a0 (=x10)
      CALL(LabelRef(wait_));
      ADDI(s0,s0,1); 
      BNE(s0,s1, LabelRef(L0_));
      EBREAK();
      
   Label(wait_);
      LI(t0,1);
      SLLI(t0,t0,slow_bit);
   Label(L1_);
      ADDI(t0,t0,-1);
      BNEZ(t0,LabelRef(L1_));
      RET();

      endASM();

      // Note: index 100 (word address)
      //     corresponds to 
      // address 400 (byte address)
      MEM[100] = {8'h4, 8'h3, 8'h2, 8'h1};
      MEM[101] = {8'h8, 8'h7, 8'h6, 8'h5};
      MEM[102] = {8'hc, 8'hb, 8'ha, 8'h9};
      MEM[103] = {8'hff, 8'hf, 8'he, 8'hd};            
   end

This program initializes some values in four words at address 400, and loads them in a10 in a loop. There is also a delay loop (wait function) to let you see something, just as before.

Try this Run the program in simulation and on the device. Test the other instructions. Do a programmable tinsel as in step 3.

You are here ! Just three instructions to go and we will be done !

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[*] 6	[*]	[*]	[*] 5	[ ] 3	[*] 1

Step 16: Store

We are approaching the end, but still some work to do, to implement the following three instructions:

Instruction	Effect
SW(rs2,rs1,imm)	store rs2 at address rs1+imm
SB(rs2,rs1,imm)	store 8 LSBs of rs2 at address rs1+imm
SH(rs2,rs1,imm)	store 16 LSBs of rs2 at address rs1+imm

To do so, we will need to do three different things:

modify the interface between the processor and the memory in such a way that the processor can write to the memory
the memory is addressed by words. Each write operation will modify a word. But SB and SH need to be able to write individual bytes. Besides the word to be written, we need to compute which byte of this word should be effectively modified in memory (a 4-bits mask)
the state machine needs to be modified.

The Memory module is modified as follows:

module Memory (
   input             clk,
   input      [31:0] mem_addr,  
   output reg [31:0] mem_rdata, 
   input   	     mem_rstrb, 
   input      [31:0] mem_wdata, 
   input      [3:0]  mem_wmask	
);

   reg [31:0] MEM [0:255]; 

   initial begin
      ...
   end

   wire [29:0] word_addr = mem_addr[31:2];
   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[word_addr];
      end
      if(mem_wmask[0]) MEM[word_addr][ 7:0 ] <= mem_wdata[ 7:0 ];
      if(mem_wmask[1]) MEM[word_addr][15:8 ] <= mem_wdata[15:8 ];
      if(mem_wmask[2]) MEM[word_addr][23:16] <= mem_wdata[23:16];
      if(mem_wmask[3]) MEM[word_addr][31:24] <= mem_wdata[31:24];	 
   end

We have two new input signals: mem_wdata, a 32-bits signal with the value to be written, and mem_wmask a 4-bits signal that indicates which byte should be written.

Note you may wonder how it is implemented in practice, in particular how the masked write to memory is synthesized on the device. BRAMs on most FPGAs directly support masked writes, through vendor's special primitives. Yosys has a (super smart) special step called "technology mapping" that detects some patterns in the source VERILOG file, and instances the vendor's primitive best adapted to the usage. In fact technology mapping was used before in our tutorial, to represent the registers bank: at each cycle we read two registers, rs1 and rs2. In the IceStick, BRAMs can read a single value at each clock, so to make it possible, yosys automatically duplicates the register bank. Whenever a value is written to rd, it is written to the two register banks: bank1[rdId] <- writeBackValue; bank2[rdId] <- writeBackValue;, and two different registers can be read at the same cycle, each one in its own register bank rs1 <- bank1[rs1Id]; rs2 <- bank2[rs2Id;. With the magic of Yosys, you do not have to take care of this, it will automatically select the best mapping for you (duplicated register bank, single register bank with two read ports if target supports it, or even array of flipflops with address decoder for larger FPGAs with many LUTs). In our case, the IceStick has an Ice40HX1K, that has 8 kB of BRAM, organized in 8 blocks of 1 kB each. Two of them are used for the (duplicated) register bank, leaving 6 kB of BRAM that we use to synthesize system RAM.

The Processor module is updated accordingly:

module Processor (
    input 	      clk,
    input 	      resetn,
    output [31:0]     mem_addr, 
    input [31:0]      mem_rdata, 
    output 	      mem_rstrb,
    output [31:0]     mem_wdata,
    output [3:0]      mem_wmask,
    output reg [31:0] x10 = 0		  
);

(and everything is connected in the SOC).

Let us see now how to compute the word to be written and the mask. The address where the value should be written is still rs1 + imm, but the format of the immediate value is different between Load (Iimm) and Store (Simm):

   wire [31:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);

Now the data to be written depends on whether we write a byte, a halfword or a word, and for bytes and halfwords, also depends on the 2 LSBs of the address. Interestingly, we do not need to test whether we write a byte, a halfword or a word, because the write mask (see lated) will ignore MSBs for byte and halfword write:

   assign mem_wdata[ 7: 0] = rs2[7:0];
   assign mem_wdata[15: 8] = loadstore_addr[0] ? rs2[7:0]  : rs2[15: 8];
   assign mem_wdata[23:16] = loadstore_addr[1] ? rs2[7:0]  : rs2[23:16];
   assign mem_wdata[31:24] = loadstore_addr[0] ? rs2[7:0]  :
			     loadstore_addr[1] ? rs2[15:8] : rs2[31:24];

And finally, the 4-bits write mask, that indicate which byte of mem_wdata should be effectively written to memory. It is determined as follows:

write mask	Instruction
`4'b1111`	`SW`
`4'b0011` or `4'b1100`	`SH`, depending on `loadstore_addr[1]`
`4'b0001`, `4'b0010`, `4'b0100` or `4'b1000`	`SB`, depending on `loadstore_addr[1:0]`

Deriving the expression is a bit painful. With Matthias Koch we ended up with this one:

   wire [3:0] STORE_wmask =
	      mem_byteAccess      ?
	            (loadstore_addr[1] ?
		          (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
		          (loadstore_addr[0] ? 4'b0010 : 4'b0001)
                    ) :
	      mem_halfwordAccess ?
	            (loadstore_addr[1] ? 4'b1100 : 4'b0011) :
              4'b1111;

Let us now create additional states in the state machine:

   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   localparam STORE       = 6;

   ...

   always @(posedge clk) begin
   ...
       case(state)
           ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad  ? LOAD  : 
		       isStore ? STORE : 
		       FETCH_INSTR;
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end
	   STORE: begin
	      state <= FETCH_INSTR;
	   end
	 endcase 
      end
   end

The signals interfaced with the memory as driven as follows:

   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
   assign mem_wmask = {4{(state == STORE)}} & STORE_wmask;

And, at last, a little thing: do not write back to register bank if instruction is a Store !

   assign writeBackEn = (state==EXECUTE && !isBranch && !isStore && !isLoad) ||
			(state==WAIT_DATA) ;

Note The !isLoad term that prevents writing rd during EXECUTE can be removed from the condition, since rd will be overwritten right after during the WAIT_DATA. It is there to have something easier to understand with simulations.

try this Run step16.v in simulation and on the device. It copies 16 bytes from address 400 to address 800, then displays the values of the copied bytes.

You are here ! Congratulations ! You have finished implementing your first RV32I RISC-V core !

ALUreg	ALUimm	Jump	Branch	LUI	AUIPC	Load	Store	SYSTEM
[*] 10	[*] 9	[*] 2	[*] 6	[*]	[*]	[*] 5	[*] 3	[*] 1

But wait a minute for sure we have worked a lot to implement a RISC-V core, but all what I can see know is just something that looks like the stupid blinky at step 1 ! I want to see more !

To do so, we need to let our device communicate with the outside word with more than 5 LEDs.

Step 17: Memory-mapped device - let's do (much) more than a blinky !

Now the idea is to add devices to our SOC. We already have LEDs, that are plugged to register a0 (x10). Plugging devices on a register like that is not super elegant, it would be better to have a special address in memory that is not really actual RAM but that has a register plugged to the LEDs. With this idea, one can add as many devices as he likes, by assigning a virtual address to each device. Then the SOC will have address decoding hardware that routes the data to the right device. As you will see, besides removing from the processor the wires drawn from x10 to the LEDS, this only requires some small modifications in the SOC.

Before starting to modify the SOC, the first thing to do is to decide about the "memory map", that is, which address space portion corresponds to what. In our system, we have 6 kB of RAM, so in practice we could say that addresses between 0 and 2^13-1 (8 kB, let us keep a power of two) correspond to RAM. I decided to use a larger portion of address space for RAM (because we also have FPGAs that have ampler quantities of BRAM), then the address space dedicated to RAM will be between 0 and 2^22-1 (that is, 4 MB of RAM).

Then, I decided to say that if bit 22 is set in an address, then this address corresponds to a device. Now we need to specify how to select among multiple devices. A natural idea is to use bits 0 to 21 as a "device index", but doing so is going to require multiple 22-bits wide comparators, and on our IceStick, it will eat-up a significant portion of the removing LUTs. A better idea, suggested (once again) by Matthias Koch (@mecrisp), is to use 1-hot encoding, that is, data is routed to device number n if bit n is set in the address. We will only consider "word addresses" (that is, ignore the two LSBs). Doing that, we can only plug 20 different devices to our SOC, but it is still much more than what we need. The advantage is that it dramatically simplifies address decoding, in such a way that everything still fits in the IceStick.

To determine whether a memory request should be routed to the RAM or to the devices, we insert the following circuitry into the SOC:

   wire [31:0] RAM_rdata;
   wire [29:0] mem_wordaddr = mem_addr[31:2];
   wire isIO  = mem_addr[22];
   wire isRAM = !isIO;
   wire mem_wstrb = |mem_wmask;

The RAM is wired as follows:

   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(RAM_rdata),
      .mem_rstrb(isRAM & mem_rstrb),
      .mem_wdata(mem_wdata),
      .mem_wmask({4{isRAM}}&mem_wmask)
   );

(note the isRAM signal ANDed with the write mask)

Now we can add the logic to wire our LEDs. They are declared as a reg in the SOC module interface:

module SOC (
    input 	     CLK, 
    input 	     RESET,
    output reg [4:0] LEDS, 
    input 	     RXD, 
    output 	     TXD  
);

driven by a simple block:

   localparam IO_LEDS_bit = 0;  

   always @(posedge clk) begin
      if(isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
	 LEDS <= mem_wdata;
      end
   end

Now we can write (yet another version of) our old good blinky:

      LI(gp,32'h400000); 
      LI(a0,0);
   Label(L1_);
      SW(a0,gp,4);
      CALL(LabelRef(wait_));
      ADDI(a0,a0,1);
      J(LabelRef(L1_));

First we load the base address of the IO page in gp (that is, 2^22). To write LEDs value, we store a0 to word address 1 (that is address 4) in the IO page. To make things easier when we'll have several devices (right after), let us write some helper functions:

   // Memory-mapped IO in IO page, 1-hot addressing in word address.   
   localparam IO_LEDS_bit      = 0;  // W five leds

   // Converts an IO_xxx_bit constant into an offset in IO page.
   function [31:0] IO_BIT_TO_OFFSET;
      input [31:0] bit;
      begin
	 IO_BIT_TO_OFFSET = 1 << (bit + 2);
      end
   endfunction

Then we can write to the LEDs as follows:

   SW(a0,gp,IO_BIT_TO_OFFSET(IO_LEDS_bit));

OK, is it all what you have, still your stupid blinky after 17 (!) tutorial steps ?

Sure, you are right man. Let us add an UART to allow our core to display stuff to a virtual terminal. The IceStick (and many other FPGA boards) has a special chip (FTDI2232H if you want to know), that translates between the plain old RS232 serial protocol and USB. It is good news for us, because RS232 is a simple protocol, much easier to implement than USB. In fact, our core will communicate with the outside word through two pins (one for sending data, called TXD and one for receiving data, called RXD), and the FTDI chip converts to the USB protocol for you. Moreover, it is a good idea not reinventing the wheel, and there are many existing implementation of UART (Universal Asynchronous Receiver Transmitter, that implement the RS232 protocol) in VERILOG. For our purpose, for now we will only implement half of it (that is, the part that lets our processor send data over it to display text in a terminal emulator).

Olof Kindren has written a Tweet-size UART, more legible version here.

Let us insert it into our SOC and connect it:

   // Memory-mapped IO in IO page, 1-hot addressing in word address.   
   localparam IO_LEDS_bit      = 0;  // W five leds
   localparam IO_UART_DAT_bit  = 1;  // W data to send (8 bits) 
   localparam IO_UART_CNTL_bit = 2;  // R status. bit 9: busy sending
   ...

   wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DAT_bit];
   wire uart_ready;
   
   corescore_emitter_uart #(
      .clk_freq_hz(`BOARD_FREQ*1000000),
      .baud_rate(115200)			    
   ) UART(
      .i_clk(clk),
      .i_rst(!resetn),
      .i_data(mem_wdata[7:0]),
      .i_valid(uart_valid),
      .o_ready(uart_ready),
      .o_uart_tx(TXD)      			       
   );

   wire [31:0] IO_rdata = 
	       mem_wordaddr[IO_UART_CNTL_bit] ? { 22'b0, !uart_ready, 9'b0}
	                                      : 32'b0;

   assign mem_rdata = isRAM ? RAM_rdata :
	                      IO_rdata ;

The UART is projected onto two different addresses in memory space. The first one, that can be only written to, sends one character. The second one, that can be only read from, indicates whether the UART is ready (bit 9 = 0) or busy sending a character (bit 9 = 1).

Now our processor has more possibilities to communicate with the outside world than the poor five LEDs we had before ! Let us implement a function to send a character:

   Label(putc_);
      // Send character to UART
      SW(a0,gp,IO_BIT_TO_OFFSET(IO_UART_DAT_bit));
      // Read UART status, and loop until bit 9 (busy sending)
      // is zero.
      LI(t0,1<<9);
   Label(putc_L0_);
      LW(t1,gp,IO_BIT_TO_OFFSET(IO_UART_CNTL_bit));     
      AND(t1,t1,t0);
      BNEZ(t1,LabelRef(putc_L0_));
      RET();

It writes the character to the UART address projected in IO space, then loops while the UART status indicates that it is busy sending a character.

Try this run step17.v in simulation.

Wait a minute in simulation, how does it know how to display something ?

It's because I cheated a bit, I added the following block of code to the SOC:

`ifdef BENCH
   always @(posedge clk) begin
      if(uart_valid) begin
	 $write("%c", mem_wdata[7:0] );
	 $fflush(32'h8000_0001);
      end
   end
`endif

(the magic constant argument to$fflush() corresponds to stdout, you need to do that else you do not see anything on the terminal until the output buffer of stdout is full). Doing so we do not test the UART in simulation (it is completely bypassed). I trust Olof that it works fine, but to do things properly, it would be better to plug something on the simulated TXD signal, decode the RS232 protocol and display the characters (we'll see examples of this type of simulation later on).

Try this run step17.v on device.

To display what's sent to the UART, use:

  $ ./terminal.sh

Note edit terminal.sh and chose your favourite terminal emulator in there. You may also need to change DEVICE=/dev/ttyUSB1 according to your local configuration.

Step 18: Computing the Mandelbrot set

Now that we have a functional RISC-V processor and a SOC with an UART that can send characters to a virtual terminal, let us rest a little bit with a purely software step. In this step, we are going to write a program in RISC-V assembly that computes a crude, ASCII-art version of the Mandelbrot set.

Our "image" will be made of 80x80 characters. So let us start by writing a program that fills the image with "*" characters. To do that, we will use two nested loops. The Y coordinate will be stored in s0 and the X coordinate in s1. The upper bound (80) will be stored in s11. The program looks like that:

      LI(gp,32'h400000); // IO page
      LI(s1,0);
      LI(s11,80);
      
   Label(loop_y_);
      LI(s0,0);

   Label(loop_x_);
      LI(a0,"*");
      CALL(LabelRef(putc_));

      ADDI(s0,s0,1);
      BNE(s0,s11,LabelRef(loop_x_));

      LI(a0,13);
      CALL(LabelRef(putc_));
      LI(a0,10);
      CALL(LabelRef(putc_));      

      ADDI(s1,s1,1);
      BNE(s1,s11,LabelRef(loop_y_));

      EBREAK();

(and we copy the putc function from the previous example).

Fixed point So now we want to compute the Mandelbrot set. To do that, we need to manipulate real numbers. Unfortunately, our super simplistic RISC-V core is not able to directly manipulate floating point numbers. The C compiler's support library libgcc has some functions to support them, but we will see later how to use them. For now, the idea is to compute the Mandelbrot set using fixed-point numbers, that is, in an integer number, we will use some bits to represent the fractional part (10 bits in our case), and some bits to represent the integer parts (22 bits in our case). In other words, it means that if we want to represent a real number x, we will store (the integer part of) x*2^10 in a register. It is similar to floating point numbers, except that the exponent in our case is always 10. We will use the following constants in our program:

   `define mandel_shift 10
   `define mandel_mul (1 << `mandel_shift)

Now, to compute the sum or the difference of two numbers, it does not change anything, because the 2^10 factor is the same for both numbers to be added (or subtracted). For a product it is a different story, because when you compute x*y, the actual computation that you do is x*2^10*y*2^10, so what you get is (x*y)*2^20, and you wanted (x*y)*2^10, so you need to divide by 2^10 (right shift by 10). OK, that's good, but how do we compute the product of two integer numbers stored in two registers ? Our processor has no MUL instruction ? In fact it is possible to add a MUL instruction (it is part of the RV32M instruction set), we will see that later, but it will not fit within our tiny IceStick ! So what can we do ? We can implement a function that takes two numbers in a0 and a1, computes their products and returns it in a0. The C compiler support library libgcc has one (it is what is used when compiling C for small RV32I RISC-V processors that do not have the MUL instruction, like ours). The source-code of this function is here. Let us port it to our VERILOG RISC-V assembler (that has a slightly different syntax unfortunately, we will see later how to directly use gcc and gas):

      // Mutiplication routine,
      // Input in a0 and a1
      // Result in a0
   Label(mulsi3_);
      MV(a2,a0);
      LI(a0,0);
   Label(mulsi3_L0_); 
      ANDI(a3,a1,1);
      BEQZ(a3,LabelRef(mulsi3_L1_)); 
      ADD(a0,a0,a2);
   Label(mulsi3_L1_);
      SRLI(a1,a1,1);
      SLLI(a2,a2,1);
      BNEZ(a1,LabelRef(mulsi3_L0_));
      RET();

(do not forget to declare the new labels before the initial block).

So now, before displaying the Mandelbrot set, to test our fixed-point computation idea, let us display a simpler shape, that is, we consider we are visualizing the [-2.0,2.0]x[-2.0,2.0] square (mapped to our 30x30 characters display), and we want to display a disk of radius 2 centered on (0,0). To do that, we need first to compute the (fixed point) coordinates x,y. They will be stored in s2 and s3. Then we need to compute x^2+y^2. We can do that by invoking the mulsi3 routine twice (do not forget to rightshift the result by 10). Finally, we compare the result with 4 << 10 (4 because it is the squared radius, and shifted to the left by 10 because of our fixed-point representation), to decide whether the point was inside or outside the disk, and use a different character to display it. The corresponding program looks like that:

   `define mandel_shift 10
   `define mandel_mul (1 << `mandel_shift)
   `define xmin (-2*`mandel_mul)
   `define xmax ( 2*`mandel_mul)
   `define ymin (-2*`mandel_mul)
   `define ymax ( 2*`mandel_mul)	
   `define dx ((`xmax-`xmin)/30)
   `define dy ((`ymax-`ymin)/30)
   `define norm_max (4 << `mandel_shift)
   
   integer    loop_y_      = 28;
   integer    loop_x_      = 36;
   integer    in_disk_     = 92;

   initial begin
      LI(gp,32'h400000); // IO page

      LI(s1,0);
      LI(s3,`xmin);
      LI(s11,30);
      LI(s10,`norm_max);
      
   Label(loop_y_);
      LI(s0,0);
      LI(s2,`ymin);

   Label(loop_x_);

      MV(a0,s2);
      MV(a1,s2);
      CALL(LabelRef(mulsi3_));
      SRLI(s4,a0,`mandel_shift); // s4 = x*x
      MV(a0,s3);
      MV(a1,s3);
      CALL(LabelRef(mulsi3_));
      SRLI(s5,a0,`mandel_shift); // s5 = y*y
      ADD(s6,s4,s5);             // s6 = x*x+y*y
      LI(a0,"*");
      BLT(s6,s10,LabelRef(in_disk_)); // if x*x+y*y < 4
      LI(a0," ");
  Label(in_disk_);
      CALL(LabelRef(putc_)); 

      ADDI(s0,s0,1);
      ADDI(s2,s2,`dx);
      BNE(s0,s11,LabelRef(loop_x_));

      LI(a0,13);
      CALL(LabelRef(putc_));
      LI(a0,10);
      CALL(LabelRef(putc_));      

      ADDI(s1,s1,1);
      ADDI(s3,s3,`dy);
      BNE(s1,s11,LabelRef(loop_y_));

      EBREAK();

and the output looks like that:

          ***********         
        ***************       
       ******************     
     *********************    
    ***********************   
    ************************  
   *************************  
  *************************** 
  *************************** 
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
  *************************** 
  *************************** 
   *************************  
   *************************  
    ***********************   
     *********************    
      *******************     
        ***************       
          ***********

Now to compute the Mandelbrot set, we need to iterate the following operation:

   Z <- 0; iter <- 0
   do
      Z <- Z^2 + C
      iter <- iter + 1
   while |Z| < 2

where Z and C are complex numbers. C = x + iy corresponds to the current pixel. Remember the rule for complex number multiplication (i*i = -1), we can compute Z^2 = (Zr + i*Zi)^2 = Zr^2-Zi^2 + 2*i*Zr*Zi. The loop that computes these iterates writes:

   Label(loop_Z_);
      MV(a0,s4); // Zrr  <- (Zr*Zr) >> mandel_shift
      MV(a1,s4);
      CALL(LabelRef(mulsi3_));
      SRLI(s6,a0,`mandel_shift);
      MV(a0,s4); // Zri <- (Zr*Zi) >> (mandel_shift-1)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRAI(s7,a0,`mandel_shift-1);
      MV(a0,s5); // Zii <- (Zi*Zi) >> (mandel_shift)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRLI(s8,a0,`mandel_shift);
      SUB(s4,s6,s8); // Zr <- Zrr - Zii + Cr  
      ADD(s4,s4,s2);
      ADD(s5,s7,s3); // Zi <- 2Zri + Cr

      ADD(s6,s6,s8); // if norm > norm max, exit loop
      LI(s7,`norm_max);
      BGT(s6,s7,LabelRef(exit_Z_));
      
      ADDI(s10,s10,-1);  // iter--, loop if non-zero
      BNEZ(s10,LabelRef(loop_Z_));
      
   Label(exit_Z_);

in the end, we display different characters depending on the value of iter (s10) when the loop is exited:

   Label(exit_Z_);
      LI(a0,colormap_);
      ADD(a0,a0,s10);
      LBU(a0,a0,0);
      CALL(LabelRef(putc_));

where the "colormap" is an array of characters that mimic different "intensities", from the darkest to the brightest:

   Label(colormap_);
      DATAB(" ",".",",",":");
      DATAB(";","o","x","%");
      DATAB("#","@", 0 , 0 );

Try that run step18.v in simulation and on the device. Modify it to draw your own graphics (for instance, try drawing "concentric circles" using the "colormap").

Step 19: Faster simulation with Verilator

As you have seen in Step 18, simulation is much much slower than running the design on the device. However, there is another tool, called verilator, that lets you convert a VERILOG design into C++. Then you compile the C++, and you have a simulation that is much much faster than icarus/iverilog. Let us first install verilator:

  $ apt-get install verilator

Before transforming our design into C++, we will have to create a "bench", that is, some C++ code that will generate the signals for our design, and that will declare the C++ main() function. The main role of the main function is to declare an object of class VSOC (generated from our SOC module), and wiggle its CLK signal. Each time the CLK signal is changed, you need to call the eval() function to take the change into account. The sim_main.cpp file is as follows:

#include "VSOC.h"
#include "verilated.h"
#include <iostream>

int main(int argc, char** argv, char** env) {
   VSOC top;
   top.CLK = 0;
   while(!Verilated::gotFinish()) {
      top.CLK = !top.CLK;
      top.eval();
   }
   return 0;
}

In addition, in sim_main.cpp, there is some code to decode whenever the LEDs change, and display their status.

To convert a design to C++, use the following command:

  $ verilator -DBENCH -DBOARD_FREQ=12 -Wno-fatal --top-module SOC -cc -exe sim_main.cpp step18.v

Then to compile the C++ and run the generated program:

  $ cd obj_dir
  $ make -f VSOC.mk
  $ ./VSOC

As you can see, it is much much faster than icarus/iverilog ! For a small design, it does not make a huge difference, but believe me, when you are developping an RV32IMFC core, with a FPU, it is good to have efficient simulation !

To make things easier, there is a run_verilator.sh script, that you can invoke as follows:

  $ run_verilator.sh step18.v

Step 20: Using the GNU toolchain to compile programs - assembly

At this step, you may have the feeling that our RISC-V design is just a toy, for educational purpose, far away from "the real thing". In fact, at this step, you will start feeling that what you have done is as real as any other RISC-V processor ! What makes a processor interesting is the software you can run on it, hence if our thingy can run any software written for a (RV32I) RISC-V processor, then it is a RV32I RISC-V processor.

Wait a minute but what we have used up to now to write the software is the VERILOG assembler, it is just a toy, different from the real thing no ?

In fact, the VERILOG assembler generates exactly the same machine code as any other RISC-V assembler. We coud use instead any other RISC-V assembler, load the generated machine code into our design and run it !

To do so, VERILOG has a $readmemh() command, that loads the data to initialize a memory from an external file. It is used as follows in step20.v:

   initial begin
       $readmemh("firmware.hex",MEM);
   end

where firmware.hex is an ASCII file with the initial content of MEM in hexadecimal.

So if we want to use an external assembler, all we have to do is figure out the following things:

how to compile RISC-V assembly code using GNU tools
how to tell GNU tools about the device we have created (RAM start address, RAM amount)
how to convert the output of GNU tools into a file that $readmemh() can understand

OK, let us start with a simple blinker, in blinker.S:

# Simple blinker

.equ IO_BASE, 0x400000  
.equ IO_LEDS, 4

.section .text

.globl start

start:
        li   gp,IO_BASE
	li   sp,0x1800
.L0:
	li   t0, 5
	sw   t0, IO_LEDS(gp)
	call wait
	li   t0, 10
	sw   t0, IO_LEDS(gp)
	call wait
	j .L0

wait:
        li t0,1
	slli t0, t0, 17
.L1:       
        addi t0,t0,-1
	bnez t0, .L1
	ret

As you can see, it is very similar to the code we wrote up to now in the VERILOG assembler. In this program, we have three different things:

main program
utilities, here the wait function
setup, that is, initializing gp and sp

So we will split the file into three parts:

FIRMWARE/blinker.S with the main function
FIRMWARE/wait.S with the wait function
FIRMWARE/start.S with the setup code, that calls main in the end.

To compile it, you will need to install the RISC-V toolchain (compiler, assembler, linker) on your machine. Our makefile can do that for you:

  $ cd learn-fpga/FemtoRV
  $ make ICESTICK.firmware_config

Note: always use ICESTICK.firmware_config, even if you have a larger board, it will configure the makefiles for RV32I build (and that's what our processor supports).

This will download some files and unpack them in learn-fpga/FemtoRV/FIRMWARE/TOOLCHAIN. Add the riscv64-unknown-elf-gcc..../bin/ directory to your path.

Now to compile our program:

  $ cd learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/FIRMWARE
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax start.S -o start.o  
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax blinker.S -o blinker.o
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax wait.S -o wait.o

We specify the architecture (rv32i) that corresponds to the instructions supported by our processor and the ABI (ilp32) that corresponds to the way functions are called. THe no-relax option concerns the gp register that we use for accessing the IO page (so we do not let the assembler use it for anything else).

This generates object files (.o). We now need to generate an executable from them, by invoking the linker. The linker will determine where our code and data should be implanted in memory. For that, we need to specify how the memory in our device is organized, in a linker script (FIRMWARE/bram.ld):

MEMORY
{
   BRAM (RWX) : ORIGIN = 0x0000, LENGTH = 0x1800  /* 6kB RAM */
}
SECTIONS
{
    everything :
    {
	. = ALIGN(4);
	start.o (.text)
        *(.*)
    } >BRAM
}

A linker script contains a description of MEMORY. In our case, there is a single segment of 6 kB of memory, that we call BRAM. It starts from address 0x0000. Then we have SECTIONS, that indicates what goes where (or which segment goes to which memory). In our case, it is super simple: everything goes to BRAM. We also indicate that the content of start.o should be installed first in memory. The linker is invoked as follows:

  $ riscv64-unknown-elf-ld blinker.o wait.o -o blinker.bram.elf -T bram.ld -m elf32lriscv -nostdlib -norelax

It generates an "elf" executable ("elf" stands for Executable and Linkable Format). It is the same format as the binaries in a Linux system. The option -T bram.ld tells it to use our linker script. The option -m elf32lriscv indicates that we are generating a 32-bits executable. We are not using the C stdlib for now (-nostdlib) and we keep gp for ourselves (-norelax). We do not need to have start.o on the command line in the list of objects to link, because it is already included in the linker script bram.ld.

We are not completely done, now we need to extract the relevant information from the elf executable, and generate a file with all the machine code in hexadecimal, so that VERILOG's $readmemh() function can understand it. For that, I wrote a firmware_words utility, that understands the elf file formats, extracts the parts that are interesting for us and writes them in ASCII hexadecimal:

  $ make blinker.bram.hex

Note you can invoke make xxxx.bram.hex directly, it will invoke the assembler, linker and elf conversion utility for you automatically.

Now you can run the example in simulation and on the device:

  $ cd ..
  $ ./run_verilator.sh step20.v
  $ BOARDS/run_xxx.sh step20.v

Now that things are easier, we can write more complicated programs. Let us see how to write the famous "hello world" program. What we need is a putstring routine to display a string on the tty. It takes as input the address of the first character of the string to display in a0. We just need to loop on all characters of the string, and exit the loop as soon as we find a null character, and call putchar for each character:

# Warning, buggy code ahead !
putstring:
	mv t2,a0	
.L2:    lbu a0,0(t2)
	beqz a0,.L3
	call putchar
	addi t2,t2,1	
	j .L2
.L3:    ret

Have you seen the comment ? It means the code above has an error, can you spot it ?

A hint, putstring is a function that calls a function. Don't we need to do special in this case ?

Do you remember what call and ret do ? Yes, call stores PC+4 in ra then jumps to the function, and ret jumps to the address in ra. Now suppose that somebody called our putstring function. When we enter the function, ra contains the address we are supposed to jump to when reaching the ret statement in putstring. But inside putstring, we call putchar, and it overwrites ra with the address right after the call, so that putchar will be able to jump there when it will return, but putstring will jump there as well, which is not what we want. To avoid that, we need to save ra at the beginning of putstring, and restore it at the end. To do that, we use the stack as follows:

putstring:
	addi sp,sp,-4 # save ra on the stack
	sw ra,0(sp)   # (need to do that for functions that call functions)
	mv t2,a0	
.L2:    lbu a0,0(t2)
	beqz a0,.L3
	call putchar
	addi t2,t2,1	
	j .L2
.L3:    lw ra,0(sp)  # restore ra
	addi sp,sp,4 # resptore sp
	ret

The function can be used as follows:

   la   a0, hello
   call putstring
   
   ...

hello:
	.asciz "Hello, world !\n"

The la (load address) pseudo-instruction loads the address of the string in a0. The string is declared with a standard label, and the .asciz directive that generates a zero-terminated string.

Try this Compile hello.S (cd FIRMWARE; make hello.bram.hex) and test it in simulation and on device. Try also mandelbrot.S. As you can see, FIRMWARE/mandelbrot.S does not have the __mulsi function. If you take a look at FIRMWARE/Makefile, the executable is linked with the right version of libgcc.a (for RV32I), that has it.

Now you can start having a feeling that your processor is a real thing: when you run the Mandelbrot example, it executes code on your processor that was written by somebody else. Can we go further and run code generated by standard tools ?

Step 21: Using the GNU toolchain to compile programs - C

Let us see now how we can write code in C for our processor. At this point, we are able to generate object files (.o) and produce an elf executable from them using the linker. Our linker script ensures that everything goes at the right place in memory, then our processor can execute the code, first the content of start.S, implanted at address 0, that calls in turn the main function. Up to now our programs were completely written in assembly. The nice thing with the ABI (Application Binary Interface), that we have seen at steps 13 and 14, is that it makes it possible to combine object files (.o) produced by different tools, as soon as they respect the ABI, which is the case (of course) of the C compiler.

The example FIRMWARE/sieve.c, taken from the examples in picorv is a good candidate. It is interesting, it does multiplications, divisions and modulos using integer numbers. These operations are not implemented by our RV32I core, but they are supported by the compiler using functions in libgcc.a, and since we link with libgcc.a, this will work. However, the program also uses printf() to display the result, and this function is declared in libc.a. In principle, it would be possible to use it, but printf() supports so many formats that its code is too large and will not fit in our 6 kB or RAM. For this reason, we include a much smaller / much simpler version in FIRMWARE/print.c (also taken from picorv), and included in the objects to be linked with executables.

There are two other examples, a C version of the Mandelbrot program: FIRMWARE/mandel_C.c. It uses ANSI colors to display low-resolution "graphics" in the terminal. There is also FIRMWARE/riscv_logo.c that displays a spinning Risc-V logo (in a 90-ish demoscene style !).

Try this Compile sieve.c (cd FIRMWARE; make sieve.bram.hex) and test it in simulation (./run_verilator.sh step20.v) and on device (BOARDS/run_xxx.sh step20.v; ./terminal.sh). Try the other programs. Write your own programs (if you do not have an idea, try for instance cellular automata, Life ...). Note: the Verilator framework can directly load ELF executables in simulation (no need to regenerate firmware.hex). You can generate all demo programs: cd FIRMWARE; make hello.bram.elf mandelbrot.bram.elf mandel_C.bram.elf riscv_logo.bram.elf;cd .., then run the one that you want using ./run_verilator.sh step20.v FIRMWARE/mandel_C.bram.elf or ./obj_dir/FIRMWARE/mandel_C.bram.elf.

Now you can see that your processor is not just a toy, it is a real RISC-V processor on which you can run programs produced by standard tools !

Note on the IceStick, we only have 6kB of RAM, so only tiny programs will fit. If the compiled program is larger than 6kB then you will get an error. A more problematic case is a program that nearly fills the whole BRAM, then we have nearly no space for the stack, and the stack will overwrite the rest, putting the CPU in an invalid state, probably frozen. This situation is difficult to understand / to debug when you encounter it, so firmware_words displays a big warning message whenever the generated code fills more than 95% of the BRAM.

Step 22: Storing data: can I have more than 6 kB of memory ?

and some optimizations in the processor

On the IceStick, there are only 8 blocks of 1 kB of BRAM, and since we need to use two of them for the registers, this leaves only 6 kB of RAM for our programs. It is sufficient for small programs like Mandelbrot or little graphic demos, but you will very soon reach the limit. The IceStick has a little chip (see figure) with 4 MBs of FLASH memory (other boards have a similar chip). When you synthesize a design, it is stored in this FLASH memory. On startup, the FPGA loads its configuration from this chip. The nice thing is that the FPGA configuration takes no more than a few kilobytes, this leaves us a lot of space to store our own data. But we will need to create some additional hardware to communicate with this chip.

As you can see on the figure, this chip only has 8 legs, how can we address 4 MBs of data using 8 pins only ? In fact, this chip uses a serial protocol (SPI). To access data, one sends the address to be read on a pin, one bit at a time, then the chip sends the data back on another pin, one bit at a time. If you want to learn more about it, my notes about SPI flash are here and the VERILOG implementation is in spi_flash.v. It supports different protocols, depending on the used number of pins and whether pins are bidirectional.

The MappedSPIFlash module has the following interface:

module MappedSPIFlash( 
    input wire 	       clk,          
    input wire 	       rstrb,        
    input wire [19:0]  word_address, 

    output wire [31:0] rdata,        
    output wire        rbusy,        

    output wire        CLK,  
    output reg         CS_N, 
    inout  wire [1:0]  IO 
);

signal	description
clk	system clock
rstrb	read strobe, goes high whenever processor wants to read a word
word_address	address of the word to be read
rdata	data read from memory
rbusy	asserted if busy receiving data
CLK	clock pin of the SPI flash chip
CS_N	chip select pin of the SPI flash chip, active low
IO	two bidirectional pins for sending and receiving data

Now the idea is to modify our SOC in such a way that some addresses correspond to the SPI flash. First we need to decide how it will be projected into the memory space of our processor. The idea is to use bit 23 of memory addresses to select the SPI Flash. Bit 22 is for IO (LEDs, UART). In addition, for IO, we need to check that bit 23 is zero. And if both bits 23 and 22 are zero, then we are in BRAM. So our memory space is decomposed into four "quadrants" depending on bits 23 and 22, and we use three of them.

Then we have the different signals to discriminate the different zones of our memory:

   wire isSPIFlash  = mem_addr[23];      
   wire isIO        = mem_addr[23:22] == 2'b01;
   wire isRAM       = mem_addr[23:22] == 2'b00;

The MappedSPIFlash module is wired as follows:

   wire SPIFlash_rdata;
   wire SPIFlash_rbusy;
   MappedSPIFlash SPIFlash(
      .clk(clk),
      .word_address(mem_wordaddr),
      .rdata(SPIFlash_rdata),
      .rstrb(isSPIFlash & mem_rstrb),
      .rbusy(SPIFlash_rbusy),
      .CLK(SPIFLASH_CLK),
      .CS_N(SPIFLASH_CS_N),
      .IO(SPIFLASH_IO)
   );

(the pins SPIFLASH_CLK, SPIFLASH_CS_N, SPIFLASH_IO[0] and SPIFLASH_IO[1] are declared in the constraint file, in the BOARDS subdirectory).

The data sent to the processor has a three-ways mux:

   assign mem_rdata = isRAM      ? RAM_rdata :
                      isSPIFlash ? SPIFlash_rdata : 
	                           IO_rdata ;

OK, now our processor can automatically trigger a SPI flash read by accessing memory with bit 23 set in the address, but how does it know that data is ready ? (remember, data arrives one bit at a time). There is this SPIFlash_rbusy that goes high whenever MappedSPIFlash is busy receiving some data, we need to take it into account in our processor's state machine. We add a new input signal mem_rbusy to our processor, and modify the state machine as follows:

   ...
   WAIT_DATA: begin
      if(!mem_rbusy) begin
	 state <= FETCH_INSTR;
      end
   end
   ...

Then, in the SOC, this signal is wired to SPIFlash_rbusy:

   wire mem_rbusy;
   ...
   Processor CPU(
     ...
     .mem_rbusy(mem_rbusy),
     ...
   );
   ...
   assign mem_rbusy = SPIFlash_rbusy;

By the way, since we are revisiting the state machine, there is something we can do. Remember this portion of the state machine, don't you think we could go faster ?

   WAIT_INSTR: begin
      instr <= mem_rdata;
      state <= FETCH_REGS;
   end
   FETCH_REGS: begin
      rs1 <= RegisterBank[rs1Id];
      rs2 <= RegisterBank[rs2Id];
      state <= EXECUTE;
   end

Yes, rs1Id and rs2Id are simply 5 wires (each) drawn from instr, so we can get them from mem_rdata directly, and fetch the registers in the WAIT_INSTR state, as follows:

   WAIT_INSTR: begin
      instr <= mem_rdata;
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      state <= EXECUTE;
   end

Doing so we gain one cycle per instruction, and it is an easy win !

Oh, and one more thing, why do we need a LOAD and a STORE state, could'nt we initiate memory transfers in the EXECUTE state ? Yes we can, so we need to change the write mask and read strobes accordingly, like that:

   assign mem_rstrb = (state == FETCH_INSTR || (state == EXECUTE & isLoad));
   assign mem_wmask = {4{(state == EXECUTE) & isStore}} & STORE_wmask;

Then the state machine has 4 states only !

   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam EXECUTE     = 2;
   localparam WAIT_DATA   = 3;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
      if(!resetn) begin
	 PC    <= 0;
	 state <= FETCH_INSTR;
      end else begin
	 if(writeBackEn && rdId != 0) begin
	    RegisterBank[rdId] <= writeBackData;
	 end
	 case(state)
	   FETCH_INSTR: begin
	      state <= WAIT_INSTR;
	   end
	   WAIT_INSTR: begin
	      instr <= mem_rdata;
	      rs1 <= RegisterBank[mem_rdata[19:15]];
	      rs2 <= RegisterBank[mem_rdata[24:20]];
	      state <= EXECUTE;
	   end
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad  ? WAIT_DATA : FETCH_INSTR;
	   end
	   WAIT_DATA: begin
	      if(!mem_rbusy) begin
		 state <= FETCH_INSTR;
	      end
	   end
	 endcase 
      end
   end

There are several other things that we can optimize. First thing, you may have noticed that the two LSBs of the instructions are always 2'b11 in RV32I, so we do not need to load them:

   reg [31:2] instr;  
   ...
   instr <= mem_rdata[31:2];
   ...
   wire isALUreg  =  (instr[6:2] == 5'b01100); 
   ...

Something else: we are doing all address computations with 32 bits, whereas our address space has 24 bits only, we can save significant resources there:

   localparam ADDR_WIDTH=24;
   wire [ADDR_WIDTH-1:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
				  instr[4] ? Uimm[31:0] :
				             Bimm[31:0] );
   wire [ADDR_WIDTH-1:0] PCplus4 = PC+4;

   wire [ADDR_WIDTH-1:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm   :
	                                  isJALR   ? {aluPlus[31:1],1'b0} :
	                                             PCplus4;

   wire [ADDR_WIDTH-1:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);

The up to date verilog file is avalaible in step22.v. Let us now check that we are able to access the SPI flash from our processor, with the following program:

#include "io.h"
#define SPI_FLASH_BASE ((char*)(1 << 23))
int main()  {
   for(int i=0; i<16; ++i) {
      IO_OUT(IO_LEDS,i);
      int lo = (int)SPI_FLASH_BASE[2*i  ];
      int hi = (int)SPI_FLASH_BASE[2*i+1];
      print_hex_digits((hi << 8) | lo,4); // print four hexadecimal digits
      printf(" ");
   }
   printf("\n");
}

The SPI flash is mapped in memory space, using addresses with bit 23 set (the first address, that we call SPI_FLASH_BASE, is 1 << 23). Then we access all individual bytes, and display them by grouping them into 16-bit words (for each word, the first byte in memory is the least significant one, because RISC-V follows the little-endian convention). We have a print_hex_digits() function in FIRMWARE/print.c that does the job (the second argument is the number of hex characters we want to print for each number).

Now compile the program, synthesize the design and send it to the device as follows:

 $ cd FIRMWARE
 $ make read_spiflash.bram.hex
 $ cd ..
 $ BOARDS/run_icestick.sh step22.v
 $ ./terminal.sh

... and you see nothing. While is this so ? The program finished before you started the terminal, so we were not able to see anything, but you can reset the processor, pushing the invisible reset button (mentioned in step 2). Each time you push the "button", it will display on the terminal the first 16 words stored in the SPI flash. On a IceStick, you will see something like:

00FF FF00 AA7E 7E99 0051 0501 0092 6220 4B01 0072 8290 0000 0011 0101 0000 0000

Do you have an idea where these values come from ? Remember why there is this SPI flash chip on your FPGA board: it is where your design is stored. When the FPGA starts, it loads its design from the SPI flash. The design corresponds to the file SOC.bin, that is generated at the end of the yosys/nextpnr/icepack pipeline:

yosys transforms your verilog into a "circuit", also called a "netlist"
then nextpnr maps the gates of this circuit to the logical elements of the FPGA,
and finally icepack converts the result into a "binary stream" directly understood by the FPGA.

Let us examine the 16 first words of the binary stream:

  $ od -x -N 32 SOC.bin

Then you'll see something like:

0000000 00ff ff00 aa7e 7e99 0051 0501 0092 6220
0000020 4b01 0072 8290 0000 0011 0101 0000 0000
0000040

and this corresponds to what we have just seen on the terminal, read from the SPI flash chip. So our CPU can read its own FPGA representation from the SPI flash, like a biologist sequencing his hown DNA ! While it has a nice and intriguing recursion flavor, it is probably of very little practical use, but let us take a deeper look at it: the SOC.bin file is not very large:

$ ls -al SOC.bin
-rw-rw-r-- 1 blevy blevy 32220 Jan  7 07:31 SOC.bin

It weights only 32KB or so, and our SPI flash chip has capacity for 4MB, so there is plenty of room for us ! The only thing we need to take care of is not overwriting the FPGA configuration (in other words, always start further away then the size of SOC.bin). So we will use a 1MB offset for storing our data (you will say we are wasting a lot of space between 32KB and 1MB but we shall use that space for something else in subsequent steps of this tutorial).

Try this Create a text file hello.txt, send it to the FPGA at the 1MB offset (see below how to do that), write a program that displays the stored file. To know where to stop, you may need either to decide for a termination character or to precode the length of the file.

For ICE40 boards (IceStick, IceBreaker, ...), use:

 $ iceprog -o 1M hello.txt

For ECP5 boards (ULX3S), use:

 $ cp hello.txt hello.img
 $ ujprog -j flash -f 1048576 hello.img

(using latest version of ujprog compiled from https://github.com/kost/fujprog).

OK, so now we are ready to use the new storage that we have for more interesting things. What we will do is displaying an animation on the terminal. The animation is a demo from the 90's, that streams polygon data to a software polygon renderer. Polygon data is a 640 kB binary file, available from learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat (see other files in the same directory for more information about the file format). First thing to do is writing the file to the SPI flash, from a 1MBytes offset. For ICE40-based boards (IceStick, IceBreaker), use:

 $ iceprog -o 1M learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat

For ECP5 boards (ULX3S), use:

 $ cp learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat scene1.img
 $ ujprog -j flash -f 1048576 scene1.img

(using latest version of ujprog compiled from https://github.com/kost/fujprog).

Now you can compile the program:

 $ cd FIRMWARE
 $ make ST_NICCC.bram.hex
 $ cd ..

and send the design and the program to the device:

 $ BOARDS/run_xxx.sh step22.v
 $ ./terminal.sh

Try this Store an image in SPI Flash (in a format that is easy to read), and write a program to display it. You can use printf("\033[48;2;%d;%d;%dm ",R,G,B); to send a pixel (where R,G,B are numbers between 0 and 255), and printf("\033[48;2;0;0;0m\n"); after each scanline.

Step 23: running programs from SPI Flash, first steps

With what we have done in the previous step, we are now able to load data from the SPI flash, and we have ample space for all our data, but we still have only 6 kB that is shared between our code and variables, it is not much ! It would be great to be able to use the SPI flash to store our code, and execute it directly from there. We were able to write nice demos that fit in 6 kB, imagine what you could do with 2 MB for code, and the entire 6 kB available for your variables !

To be able to load code from the SPI flash, the only thing we need to change is staying in the WAIT_INSTR state until mem_rbusy is zero, hence we just need to test mem_rbusy before changing state to EXECUTE:

   WAIT_INSTR: begin
      instr <= mem_rdata[31:2];
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      if(!mem_rbusy) begin
	 state <= EXECUTE;
      end
   end

and we initialize the BRAM with the following program, that jumps to address 0x00820000:

   initial begin
      LI(a0,32'h00820000);
      JR(a0);
   end

This address corresponds to the address where the SPI flash is projected into the address space of our CPU (0x00800000 = 1 << 23) plus an offset of 128kB (0x20000). This offset of 128 kB is necessary because remember, we share the SPI Flash with the FPGA that stores its configuration in it !

OK, that's mostly it for the hardware part. Let us see now if we can execute code from there. To do that, we will need a new linker script (FIRMWARE/spiflash0.ld):

MEMORY {
   FLASH (RX)  : ORIGIN = 0x00820000, LENGTH = 0x100000 /* 4 MB in flash */
}
SECTIONS {
    everything : {
	. = ALIGN(4);
	start.o (.text)
        *(.*)
    } >FLASH
}

It is the same thing as before, but we tell the linker to put everything in flash memory (for now, we will see later how it works for global variables). Let us test it with a program that does not write to global variables, for instance FIRMWARE/hello.S. To link it using our new linker script, we do:

  $ riscv64-unknown-elf-ld -T spiflash0.ld -m elf32lriscv -nostdlib -norelax hello.o putchar.o -o hello.spiflash0.elf

But since it is tedious to type, it is automated by the Makefile:

  $ make hello.spiflash0.elf

Now you need to convert the ELF executable into a flat binary:

  $ riscv64-unknown-elf-objcopy hello.spiflash0.elf hello.spiflash0.bin -O binary

or with our Makefile:

  $ make hello.spiflash0.bin

and send it to the SPI flash at offset 128k:

  $ iceprog -o 128k hello.spiflash0.bin

or with our Makefile:

  $ make hello.spiflash0.prog

and then:

  $ ./terminal.sh

Step 24: running programs from SPI Flash, a better linker script

Before starting, let us make a little change in our core: when pushing the reset button, it jumps at address 0, which is initialized as a jump to flash memory, but after executing our program, it is possible (and highly probable) that the RAM will have been used for something else, and no longer has the jump-to-flash instruction. To fix this, one can make the CPU jump to flash memory each time reset goes low:

   if(!resetn) begin
     PC    <= 32'h00820000; 
     state <= WAIT_DATA;    
   end

Note that state is set to WAIT_DATA, so that it waits for mem_rbusy to go low before doing anything else.

OK, so now we have a large quantity of flash memory in which we can install the code and run it from there. We can also install readonly variables in there, like the string .asciz "Hello, world !\n" in the previous example. And what about local variables ? They are allocated on the stack, that resides in the 6 kB of RAM that we have, so it will work. How does it know where the stack is ? Remember, we have written FIRMWARE/start.S, that initializes sp at the end of the RAM (0x1800) and it suffices.

But how does it work for a program like that ?

  int x = 3;
  void main() {
     x = x + 1;
     printf("%d\n",x);
  }

The global variable x has an initial value that needs to be stored somewhere, so we need to put it in flash memory, but we are modifying it after, so we need to put it in RAM, how is it possible ? In fact, what we need is a mechanism for storing all the initial values of the (initialized) global variables in flash memory and copy them to RAM on startup. To do that, we will need a new linker script (that indicates where to put the variables and where to put their initial values) and a new start.S (that copies the initial values to the variables). Let us see how to do that.

When you compile C code, the compiler inserts directives to indicate where the different things go (sections). To take a look, generate assembly from one of our C programs:

$ cd FIRMWARE
$ make ST_NICCC.o
$ readelf -S ST_NICCC.o

it will show you the different sections that are present in the object file.

section	description
text	executable code
bss, sbss	uninitialized data
data, sdata	read-only data
rodata	read-only data

The section name (bss) for uninitialized data has an historic reason that dates back to the 60's (BSS: Block Started by Symbol is a pseudo-instruction of an assembler for the IBM 704). Uninitialized and initialized data sections come in two flavor, sbss and sdata is for small uninitialized (resp) initialized) data.

In readelf output, there is also a type field. PROGBIT means that some data needs to be loaded from the file (for text, data and rodata) segments. NOBITS means that no data should be loaded (for bss). Then the Addr indicates where the section will be mapped into memory (for a .o file, it is always 0, but it is useful for a linked elf executable, you can check using readelf). Then the Offs field indicates the offset for the section's data in the .o file, and the Size field the number of bytes in the section.

So what we have to do is writing a linker script that will say the following things:

text sections go to the flash memory
bss sections go to BRAM
data sections go to BRAM, but have their initial values stored in the flash memory

For text and bss, we already know how to do it. For data, linker scripts can specify a LMA (Load Memory Address), that indicates where initial values need to be stored. In our linker script, we will have something like:

  MEMORY {
      FLASH (rx)  : ORIGIN = 0x00820000, LENGTH = 0x100000
      RAM   (rwx) : ORIGIN = 0x00000000, LENGTH = 0x1800  
  }
  SECTIONS {
  
    .data: AT(address_in_spi_flash) {
      *(.data*)          
      *(.sdata*)
    } > RAM
    
    .text : {
      start_spiflash1.o(.text) 
      *(.text*) 
      *(.rodata*) 
      *(.srodata*)
    } >FLASH
    
    .bss : {
      *(.bss*)
      *(.sbss*)
    } >RAM
  }

Each section indicates how to map sections read from object files to sections in the executable (.data, .text and .bss), and how to map these sections to the flash memory and to the BRAM. For each section, some pattern matching rules indicate which sections from the object files are concerned. For the .text section, we make sure that the first section is the text section of start_spiflash1.o, because our processor jumps there on reset. Note also that we put the readonly data (.rodata and .srodata) into the flash.

For the .data section, the AT keyword indicates the LMA (Load Memory Address) where the linker will put the initial values (an address in spi flash), and whenever a symbol in a data or sdata section is referenced, the linker will use its address in RAM.

But a question remains: how does the system know that it should copy initialization data from the flash into BRAM ? How does it know at which address ? How can we initialize uninitialized data (BSS) to zero ? In fact we need to do it by hand, in the startup code start_spiflash1.S, that looks like that:

.equ IO_BASE, 0x400000  

.text
.global _start
.type _start, @function

_start:
.option push
.option norelax
     li  gp,IO_BASE
.option pop

     li   sp,0x1800

# zero-init bss section:
     la a0, _sbss
     la a1, _ebss
     bge a0, a1, end_init_bss
loop_init_bss:
     sw zero, 0(a0)
     addi a0, a0, 4
     blt a0, a1, loop_init_bss
end_init_bss:

# copy data section from SPI Flash to BRAM:
     la a0, _sidata
     la a1, _sdata
     la a2, _edata
     bge a1, a2, end_init_data
loop_init_data:
     lw a3, 0(a0)
     sw a3, 0(a1)
     addi a0, a0, 4
     addi a1, a1, 4
     blt a1, a2, loop_init_data
end_init_data:

     call main
     ebreak

The first thing that we do is initializing the stack pointer and the general pointer gp (with the IO page address in our case).
the first loop clears the memory between _sbss and _ebss.
the second loop copies data from _sidata to _sdata ... _edata
finally we call main

... but wait a minute, how do we know the values for _sbss,_ebss,_sidata,_sdata,_edata ?

In fact, the linker script can generate them for us. Here is what the .data section looks like:

    .data : AT ( _sidata ) {
        . = ALIGN(4); 
        _sdata = .;
        *(.data*)          
        *(.sdata*)
        . = ALIGN(4);
        _edata = .;  
    } > RAM

where . denotes the current address. In addition, lines like . = ALIGN(4); make sure that addresses remain aligned on 4-bytes boundaries, since our initialization loops in start_spiflash1.S depend on that.

The declaration for the .text section looks like:

    .text : {
        . = ALIGN(4);
        start_spiflash1.o(.text)  
        *(.text*)                 
        . = ALIGN(4);
        *(.rodata*)              
        *(.srodata*)             
        _etext = .;              
        _sidata = _etext;        
    } >FLASH

note that it declares _sidata right at the end of the text section, so that the .data section can put its initialization data there.

OK, so let us try it with one of our examples:

  $ cd FIRMWARE
  $ make mandel_C.spiflash1.prog
  $ cd ..
  $ ./terminal.sh

Yes, it works, but wait a minute, it is significantly slower than before. Can you guess why ?

Remember that the FLASH memory is a serial memory, wich means that addresses are sent one bit at a time and the result is obtained also one bit at a time (well, in fact two bits at a time for both in our case), it is much slower than the BRAM that gets a 32-bits value in one cycle. Can we do something ? Sure we can ! What about putting some critical functions in BRAM ? To do that, we can change our linker script as follows (result in FIRMWARE/spiflash2.ld):

    .data_and_fastcode : AT ( _sidata ) {
        . = ALIGN(4);
        _sdata = .;  
	
	/* Initialized data */
        *(.data*)          
        *(.sdata*)

	/* integer mul and div */
	*/libgcc.a:muldi3.o(.text)
	*/libgcc.a:div.o(.text)    

	putchar.o(.text)
	print.o(.text)	

	/* functions with attribute((section(".fastcode"))) */	
	*(.fastcode*)      

        . = ALIGN(4);
        _edata = .;  
    } > RAM

By doing so, we indicate that some specific functions (integer multiply and divide from libgcc and IO functions) should be put in fast RAM, and that's all we have to do ! The linker will put the code for these functions in the same section as the initialization data for initialized variables, and our runtime start_spiflash1.S will copies them with the initialization data to RAM at startup, cool !

Let us try it with our example:

  $ cd FIRMWARE
  $ make mandel_C.spiflash2.prog
  $ cd ..
  $ ./terminal.sh

Aaaah, much better !

Note also the line *(.fastcode*): you can put your own functions in BRAM, by indicating that they are in a fastcode section. In C, you can do that as follows:

 void my_function(my args ...) __attribute((section(".fastcode")));
 void my_function(my args ...) {
      ...
 }

Try this run the ST_NICCC demo (make ST_NICCC.spiflash2.prog). Then uncomment the line in ST_NICCC.c with the definition for RV32_FASTCODE and re-run it.

Now we can run larger programs on our device:

FIRMWARE/pi.c (by Fabrice Beillard, computes the decimals of pi)
FIRMWARE/tinyraytracer.c (by Dmitry Sokolov, raytracing)

Both of them use floating point numbers. For a RV32I core such as ours, floating point numbers use routines implemented in libgcc. As a consequence, executables are larger (pi weights 17 kB and tinyraytracer weights 25 kB) and would have been impossible to run in 6 kB of RAM. The additional memory offered by the SPI FLASH offers much more possibilities to our device !

At this point, not only our device runs code compiled using standard tools (gcc), but also it runs existing code, independently developped (the mathematical routines in libgcc). It is quite exciting to run existing binary code on a processor that you create on your own !

Next tutorial

Pipelining

Files for all the steps

step 1: Blinker, too fast, can't see anything
step 2: Blinker with clockworks
step 3: Blinker that loads pattern from ROM
step 4: The instruction decoder
step 5: The register bank and the state machine
step 6: The ALU
step 7: Using the VERILOG assembler
step 8: Jumps
step 9: Branches
step 10: LUI and AUIPC
step 11: Memory in separate module
step 12: Size optimization: the Incredible Shrinking Core !
step 13: Subroutines 1 (standard Risc-V instruction set)
step 14: Subroutines 2 (using Risc-V pseudo-instructions)
step 15: Load
step 16: Store
step 17: Memory-mapped devices
step 18: Mandelbrot set
step 19: Faster simulation with Verilator
step 20: Using the GNU toolchain to compile assembly programs
step 21: Using the GNU toolchain to compile C programs
step 22: More memory ! Using the SPI Flash
step 23: Running programs from the SPI Flash, first steps
step 24: Running programs from the SPI Flash, better linker script

WIP

step 25: More devices (LED matrix, OLED screen...)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

From Blinker to RISC-V

About this tutorial

Introduction and references on processor design

Prerequisites:

Step 1: your first blinky

Step 2: slower blinky

Step 3: a blinker that loads LEDs patterns from ROM

The RISC-V instruction set architecture

Step 4: the instruction decoder

Sidebar: the elegance of RISC-V

Step 5: The register bank and the state machine

Step 6: the ALU

Step 7: using the VERILOG assembler

Step 8: jumps

Step 9: Branches

Step 10: LUI and AUIPC

Step 11: Memory in a separate module

Step 12: Size optimization: the Incredible Shrinking Core.

Step 13: subroutines (version 1, using plain RISC-V instructions)

Step 14: subroutines (version 2, using RISC-V ABI and pseudo-instructions)

Step 15: Load

Step 16: Store

Step 17: Memory-mapped device - let's do (much) more than a blinky !

Step 18: Computing the Mandelbrot set

Step 19: Faster simulation with Verilator

Step 20: Using the GNU toolchain to compile programs - assembly

Step 21: Using the GNU toolchain to compile programs - C

Step 22: Storing data: can I have more than 6 kB of memory ?

Step 23: running programs from SPI Flash, first steps

Step 24: running programs from SPI Flash, a better linker script

Next tutorial

Files for all the steps

Files

README.md

Latest commit

History

README.md

File metadata and controls

From Blinker to RISC-V

About this tutorial

Introduction and references on processor design

Prerequisites:

Step 1: your first blinky

Step 2: slower blinky

Step 3: a blinker that loads LEDs patterns from ROM

The RISC-V instruction set architecture

Step 4: the instruction decoder

Sidebar: the elegance of RISC-V

Step 5: The register bank and the state machine

Step 6: the ALU

Step 7: using the VERILOG assembler

Step 8: jumps

Step 9: Branches

Step 10: LUI and AUIPC

Step 11: Memory in a separate module

Step 12: Size optimization: the Incredible Shrinking Core.

Step 13: subroutines (version 1, using plain RISC-V instructions)

Step 14: subroutines (version 2, using RISC-V ABI and pseudo-instructions)

Step 15: Load

Step 16: Store

Step 17: Memory-mapped device - let's do (much) more than a blinky !

Step 18: Computing the Mandelbrot set

Step 19: Faster simulation with Verilator

Step 20: Using the GNU toolchain to compile programs - assembly

Step 21: Using the GNU toolchain to compile programs - C

Step 22: Storing data: can I have more than 6 kB of memory ?

Step 23: running programs from SPI Flash, first steps

Step 24: running programs from SPI Flash, a better linker script

Next tutorial

Files for all the steps