This lesson starts at commit d20e09da83bc7dac0d753d0a4db9a9ce99c50327.

4. Execute and writeback stage

To recap, we have:

A fetch stage (a temporary hack that should be redone, but for now it lets us keep making progress).
A decode stage, which only decodes the ADDI instruction.

The goal for this lesson is to actually execute the ADDI instruction and write the result back to the target register. Naturally, this requires changes to the execute and writeback stages.

From the execute stage, we want to return the result of the operation and the destination register.

src/core/constants.vhd CHANGED

@@ -20,7 +20,8 @@ package core_constants is
 	);
 	constant DEFAULT_EXECUTE_OUTPUT: execute_output_t := (
-		placeholder => '0'
+		result => (others => '0'),
+		destination_reg => (others => '0')
 	);
 	constant DEFAULT_MEMORY_OUTPUT: memory_output_t := (

src/core/types.vhd CHANGED

@@ -20,7 +20,8 @@ package core_types is
 	end record decode_output_t;
 	type execute_output_t is record
-		placeholder: std_logic;
+		result: std_logic_vector(31 downto 0);
+		destination_reg: std_logic_vector(4 downto 0);
 	end record execute_output_t;
 	type memory_output_t is record

We want to pass this on to the writeback stage, but the memory stage is still in between. For now, we'll adapt the memory stage to simply copy its input to its output.

src/core/constants.vhd CHANGED

@@ -25,6 +25,7 @@ package core_constants is
 	);
 	constant DEFAULT_MEMORY_OUTPUT: memory_output_t := (
-		placeholder => '0'
+		result => (others => '0'),
+		destination_reg => (others => '0')
 	);
 end package core_constants;

src/core/memory.vhd CHANGED

@@ -21,7 +21,8 @@ begin
 	process (clk)
 	begin
 		if rising_edge(clk) then
-			-- TODO: implement
+			output.result <= input.result;
+			output.destination_reg <= input.destination_reg;
 		end if;
 	end process;

src/core/types.vhd CHANGED

@@ -25,6 +25,7 @@ package core_types is
 	end record execute_output_t;
 	type memory_output_t is record
-		placeholder: std_logic;
+		result: std_logic_vector(31 downto 0);
+		destination_reg: std_logic_vector(4 downto 0);
 	end record memory_output_t;
 end package core_types;

Now we still need to provide the actual implementations in the execute and writeback stages.

In the execute stage, we want to ignore inactive and invalid instructions (we will need to handle invalid instructions some day, but today is not that day). For both, we'll simply output the default output. To signal "we don't need to write to a register", we simply set the destination register to 0: in RISC-V, the first register, x0, always holds zero, and writes to it are ignored. We'll need to handle this in the writeback stage.
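
To make the intended behavior concrete, here is a small Python model of one clock edge of the execute stage. It is only a sketch and not part of the VHDL design. The dataclasses loosely mirror decode_output_t and execute_output_t: the field names come from the records, but the Python types, the omission of `operation`, and the restriction to addition are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecodeOutput:
    """Subset of decode_output_t; only addition is modelled, so `operation` is omitted."""
    is_active: bool = False
    is_invalid: bool = False
    operand1: int = 0
    operand2: int = 0
    destination_reg: int = 0

@dataclass(frozen=True)
class ExecuteOutput:
    result: int = 0
    destination_reg: int = 0  # 0 doubles as "no register write", because x0 is always zero

DEFAULT_EXECUTE_OUTPUT = ExecuteOutput()

def execute(inp: DecodeOutput) -> ExecuteOutput:
    """One rising edge of the execute stage."""
    if not inp.is_active or inp.is_invalid:
        # Bubbles and invalid instructions become the default output, i.e. a "write" to x0.
        return DEFAULT_EXECUTE_OUTPUT
    # 32-bit addition: the carry out of bit 31 is dropped.
    return ExecuteOutput((inp.operand1 + inp.operand2) & 0xFFFFFFFF, inp.destination_reg)

print(execute(DecodeOutput()))  # ExecuteOutput(result=0, destination_reg=0)
print(execute(DecodeOutput(is_active=True, is_invalid=True, destination_reg=5)))  # same default
print(execute(DecodeOutput(is_active=True, operand2=123, destination_reg=1)))
# ExecuteOutput(result=123, destination_reg=1)
```

Using 0 as the "no write" marker avoids adding a separate write-enable field. The writeback stage then only needs a single comparison against x0.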

src/core/execute.vhd CHANGED

@@ -21,7 +21,12 @@ begin
 	process (clk)
 	begin
 		if rising_edge(clk) then
-			-- TODO: implement
+			if input.is_active = '1' and input.is_invalid = '0' then
+				output.result <= (others => '0');  -- TODO: fill this with the result from the operation
+				output.destination_reg <= input.destination_reg;
+			else
+				output <= DEFAULT_EXECUTE_OUTPUT;
+			end if;
 		end if;
 	end process;

Now, we're ready to actually handle the addition operation in the execute stage.

src/core/execute.vhd CHANGED

@@ -22,7 +22,12 @@ begin
 	begin
 		if rising_edge(clk) then
 			if input.is_active = '1' and input.is_invalid = '0' then
-				output.result <= (others => '0');  -- TODO: fill this with the result from the operation
+				if input.operation = OP_ADD then
+					output.result <= std_logic_vector(unsigned(input.operand1) + unsigned(input.operand2));
+				else
+					-- this should never happen
+				end if;
 				output.destination_reg <= input.destination_reg;
 			else
 				output <= DEFAULT_EXECUTE_OUTPUT;
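
Using unsigned addition for ADDI also works when the immediate is negative. Two's-complement addition produces the same bit pattern whether the operands are read as signed or unsigned, and numeric_std's "+" on two 32-bit unsigned vectors returns a 32-bit result, silently dropping the carry out of bit 31. The Python sketch below models `resize(signed(i_imm), 32)` and that addition. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend_12(imm12: int) -> int:
    """Model of resize(signed(i_imm), 32): copy bit 11 into bits 31..12."""
    imm12 &= 0xFFF
    return imm12 | 0xFFFFF000 if imm12 & 0x800 else imm12

def add32(a: int, b: int) -> int:
    """Model of unsigned(a) + unsigned(b) on 32-bit vectors: the carry out of bit 31 is dropped."""
    return (a + b) & MASK32

assert sign_extend_12(0x07B) == 123          # positive immediates are unchanged
assert sign_extend_12(0xFFF) == 0xFFFFFFFF   # -1 as a 32-bit pattern
assert add32(5, sign_extend_12(0xFFF)) == 4  # ADDI with imm = -1 subtracts one
assert add32(0xFFFFFFFF, 1) == 0             # wraps around instead of growing to 33 bits
```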

We can make the "this should never happen" a bit more robust by making it an assertion.

src/core/execute.vhd CHANGED

@@ -25,7 +25,7 @@ begin
 				if input.operation = OP_ADD then
 					output.result <= std_logic_vector(unsigned(input.operand1) + unsigned(input.operand2));
 				else
-					-- this should never happen
+					assert false report "Unhandled operation value in execute stage" severity failure;
 				end if;
 				output.destination_reg <= input.destination_reg;

If we test this in simulation, we see the correct values for result and destination_reg show up at the input of the writeback stage.

However, the writeback stage itself doesn't do anything yet. We run into a problem here: in the last lesson we put the registers in the decode stage, so we don't have access to them from the writeback stage...

As I mentioned last time, one solution is to make a register file module that both the decode stage and the writeback stage can talk to. However, I don't really like adding another module, so instead I'll merge the writeback stage and the decode stage into a single module, which I'll simply call decode_write. I will adapt the decode module and delete the write module (which was only a placeholder anyway).

So, the core module will need to route the output of the memory stage back to the decode stage, which will then handle writing the final value to the destination register.

First, I'll rename decode.vhd to decode_write.vhd and delete write.vhd.

src/core/write.vhd DELETED

@@ -1,27 +0,0 @@
-library ieee;
-use ieee.std_logic_1164.all;
-use ieee.numeric_std.all;
-use work.core_types.all;
-use work.core_constants.all;
-entity write is
-	port (
-		clk: in std_logic;
-		input: in memory_output_t
-	);
-end write;
-architecture rtl of write is
-begin
-	process (clk)
-	begin
-		if rising_edge(clk) then
-			-- TODO: implement
-		end if;
-	end process;
-end rtl;

Then, I'll update the core so that the decode_write module also takes the output of the memory stage, and adapt the module itself accordingly.

src/core.vhd CHANGED

@@ -26,11 +26,12 @@ architecture rtl of core is
 		);
 	end component;
-	component decode is
+	component decode_write is
 		port (
 			clk: in std_logic;
-			input: in fetch_output_t;
-			output: out decode_output_t
+			decode_input: in fetch_output_t;
+			decode_output: out decode_output_t;
+			write_input: in memory_output_t
 		);
 	end component;
@@ -50,22 +51,13 @@ architecture rtl of core is
 		);
 	end component;
-	component write is
-		port (
-			clk: in std_logic;
-			input: in memory_output_t
-		);
-	end component;
 begin
 	fetch_inst: fetch port map(clk => clk, output => fetch_output);
-	decode_inst: decode port map(clk => clk, input => fetch_output, output => decode_output);
+	decode_write_inst: decode_write port map(clk => clk, decode_input => fetch_output, decode_output => decode_output, write_input => memory_output);
 	execute_inst: execute port map(clk => clk, input => decode_output, output => execute_output);
 	memory_inst: memory port map(clk => clk, input => execute_output, output => memory_output);
-	write_inst: write port map(clk => clk, input => memory_output);
 end rtl;

src/core/decode_write.vhd CHANGED

@@ -6,16 +6,19 @@ use work.core_types.all;
 use work.core_constants.all;
-entity decode is
+entity decode_write is
 	port (
 		clk: in std_logic;
-		input: in fetch_output_t;
-		output: out decode_output_t := DEFAULT_DECODE_OUTPUT
+		decode_input: in fetch_output_t;
+		decode_output: out decode_output_t := DEFAULT_DECODE_OUTPUT;
+		write_input: in memory_output_t
 	);
-end decode;
-architecture rtl of decode is
+end decode_write;
+architecture rtl of decode_write is
 	type registers is array(0 to 31) of std_logic_vector(31 downto 0);
 	signal reg: registers := (others => (others => '0'));
@@ -29,38 +32,38 @@ begin
 		variable i_imm: std_logic_vector(11 downto 0);
 		variable i_imm_s: std_logic_vector(31 downto 0);
-		variable v_output: decode_output_t;
+		variable v_decode_output: decode_output_t;
 	begin
 		if rising_edge(clk) then
-			opcode := input.instr(6 downto 0);
-			rs1    := input.instr(19 downto 15);
-			rs2    := input.instr(24 downto 20);
-			funct3 := input.instr(14 downto 12);
-			rd     := input.instr(11 downto 7);
-			i_imm := input.instr(31 downto 20);
+			opcode := decode_input.instr(6 downto 0);
+			rs1    := decode_input.instr(19 downto 15);
+			rs2    := decode_input.instr(24 downto 20);
+			funct3 := decode_input.instr(14 downto 12);
+			rd     := decode_input.instr(11 downto 7);
+			i_imm := decode_input.instr(31 downto 20);
 			i_imm_s := std_logic_vector(resize(signed(i_imm), 32));
-			v_output := DEFAULT_DECODE_OUTPUT;
-			if input.is_active = '1' then
-				v_output.is_active := '1';
-				v_output.is_invalid := '0';
+			v_decode_output := DEFAULT_DECODE_OUTPUT;
+			if decode_input.is_active = '1' then
+				v_decode_output.is_active := '1';
+				v_decode_output.is_invalid := '0';
 				if opcode = "0010011" and funct3 = "000" then
 					-- ADDI rd, rs, imm (I-type): sets rd to the sum of rs1 and the sign-extended immediate
-					v_output.operation := OP_ADD;
-					v_output.operand1 := reg(to_integer(unsigned(rs1)));
-					v_output.operand2 := i_imm_s;
-					v_output.destination_reg := rd;
+					v_decode_output.operation := OP_ADD;
+					v_decode_output.operand1 := reg(to_integer(unsigned(rs1)));
+					v_decode_output.operand2 := i_imm_s;
+					v_decode_output.destination_reg := rd;
 				else
-					v_output.is_invalid := '1';
+					v_decode_output.is_invalid := '1';
 				end if;
 			else
-				output <= DEFAULT_DECODE_OUTPUT;
+				decode_output <= DEFAULT_DECODE_OUTPUT;
 			end if;
-			output <= v_output;
+			decode_output <= v_decode_output;
 		end if;
 	end process;

Now, it's relatively easy to perform the write to the destination register.

src/core/decode_write.vhd CHANGED

@@ -35,6 +35,11 @@ begin
 		variable v_decode_output: decode_output_t;
 	begin
 		if rising_edge(clk) then
+			-- write back result if the destination register is not x0 (which always stays 0)
+			if write_input.destination_reg /= "00000" then
+				reg(to_integer(unsigned(write_input.destination_reg))) <= write_input.result;
+			end if;
 			opcode := decode_input.instr(6 downto 0);
 			rs1    := decode_input.instr(19 downto 15);
 			rs2    := decode_input.instr(24 downto 20);
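
Here is a minimal Python model of the register file and this write logic, assuming 32 registers of 32 bits. The class and method names are illustrative only. The execute stage outputs DEFAULT_EXECUTE_OUTPUT (destination_reg = 0) for inactive and invalid instructions, so the single x0 check drops bubbles and genuine writes to x0 alike.

```python
class RegisterFile:
    """32 x 32-bit registers; writes to x0 are dropped, so x0 always reads as 0."""

    def __init__(self) -> None:
        self.regs = [0] * 32

    def read(self, index: int) -> int:
        return self.regs[index]

    def clock_edge(self, write_reg: int, write_value: int) -> None:
        # Mirrors: if write_input.destination_reg /= "00000" then reg(...) <= write_input.result
        if write_reg != 0:
            self.regs[write_reg] = write_value & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(0, 42)   # a bubble (or an explicit write to x0): ignored
rf.clock_edge(1, 123)  # ADDI x1, x2, 123 reaching the write stage
print(rf.read(0), rf.read(1))  # 0 123
```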

If we simulate this for 70 ns and observe the input to the write stage and the x1 register, we see the following waveforms:

Simulation waveforms

This looks good: the value 0x7b (123 in decimal) gets written to the x1 register. We have implemented our first RISC-V instruction, and it looks like it's being executed correctly! You can give yourself a pat on the back; this is a nice milestone!

When you're done celebrating, let's try another test case that increments the x1 register twice in a row. That is, let's execute:
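
To double-check the waveform, we can split the first word of the instruction memory from the previous lesson, X"07b10093", into its I-type fields. The sketch below uses the same bit slices as the decode stage. It is a quick Python check, and the function name is made up.

```python
def decode_i_type(instr: int) -> dict:
    """Split a 32-bit I-type instruction into fields (same bit slices as decode_write)."""
    imm = (instr >> 20) & 0xFFF        # instr(31 downto 20)
    if imm & 0x800:                    # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {
        "opcode": instr & 0x7F,        # instr(6 downto 0)
        "rd": (instr >> 7) & 0x1F,     # instr(11 downto 7)
        "funct3": (instr >> 12) & 0x7, # instr(14 downto 12)
        "rs1": (instr >> 15) & 0x1F,   # instr(19 downto 15)
        "imm": imm,
    }

print(decode_i_type(0x07B10093))
# {'opcode': 19, 'rd': 1, 'funct3': 0, 'rs1': 2, 'imm': 123}
```

Opcode 19 is 0b0010011 and funct3 is 0, so this is ADDI x1, x2, 123. Since x2 still holds 0, x1 becomes 0 + 123 = 0x7b.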

ADDI x1, x1, 1
ADDI x1, x1, 1

Again using this sweet online RISC-V assembler, we see that ADDI x1, x1, 1 assembles to 00108093. We put this instruction in our instruction memory twice.
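
If you'd rather not depend on the online assembler, ADDI is simple enough to encode by hand. This Python sketch packs the I-type fields (imm[11:0], rs1, funct3, rd, opcode) into a 32-bit word. The function name is made up for illustration.

```python
def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Assemble ADDI rd, rs1, imm into its 32-bit I-type encoding."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32 and -2048 <= imm < 2048):
        raise ValueError("register index or immediate out of range")
    opcode, funct3 = 0b0010011, 0b000
    # The immediate is stored as a 12-bit two's-complement field in bits 31..20.
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

print(f"{encode_addi(1, 1, 1):08x}")    # 00108093
print(f"{encode_addi(1, 2, 123):08x}")  # 07b10093
```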

src/core/fetch.vhd CHANGED

@@ -17,7 +17,7 @@ end fetch;
 architecture rtl of fetch is
 	type instruction_memory_t is array(0 to 15) of std_logic_vector(31 downto 0);
 	signal imem: instruction_memory_t := (
-		X"07b10093", X"00000002", X"00000003", X"00000004", X"00000005", X"00000006", X"00000007", X"00000008",
+		X"00108093", X"00108093", X"00000003", X"00000004", X"00000005", X"00000006", X"00000007", X"00000008",
 		X"00000009", X"0000000A", X"0000000B", X"0000000C", X"0000000D", X"0000000E", X"0000000F", X"00000010"
 	);

Now we run the simulation again for 70 ns, observing the input of the write stage and the value of the x1 register.

Simulation waveforms

We can see that in the fifth and sixth cycles, a write to the x1 register happens (destination_reg is 1). However, the second time, the result written to the register should be 2, not 1. So, we have a bug.

What is going on? When the second ADDI x1, x1, 1 instruction arrives in the decode stage, x1 is read as 0, since the instruction before it has not yet written its result back.

This situation, where an instruction needs the result of an earlier instruction that has not finished yet, is known as a read-after-write hazard. In pipelined processors, you need to track these dependencies and wait until all of them have finished executing. The cycles in which the processor is waiting are called "pipeline bubbles" or "pipeline stalls". In our case, these show up as cycles in which the output of a stage has is_active set to 0.

So the proper solution is to keep track, for every register, of the number of instructions in the pipeline that will write to it. However, I don't feel like doing that at this point. Instead, I am going to do something much simpler for now: let the fetch stage wait until the previous instruction has finished. This is potentially much slower than the proper solution, but I really want to get a simple processor working before spending a lot of effort on pipelining it.
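
The hazard can be reproduced with a small cycle-level Python model of the five stages. It is a simplification under stated assumptions: only ADDI is decoded, invalid instructions are treated as plain bubbles, and fetching stops at the end of the program instead of running into the remaining words of imem. As in the VHDL processes, every stage computes its next output from the values before the clock edge.

```python
MASK32 = 0xFFFFFFFF

def decode(instr, regs):
    """Decode ADDI only; rs1 is read from the register file as it is *before* this edge."""
    if instr & 0x7F != 0b0010011 or (instr >> 12) & 0x7 != 0:
        return None  # invalid: modelled as a bubble
    imm = (instr >> 20) & 0xFFF
    imm_s = imm | 0xFFFFF000 if imm & 0x800 else imm
    rs1, rd = (instr >> 15) & 0x1F, (instr >> 7) & 0x1F
    return (regs[rs1], imm_s, rd)

def simulate(program, edges=10):
    """Fetch -> decode -> execute -> memory -> write, one fetch per edge, no stalls."""
    regs = [0] * 32
    fetch_out = decode_out = exec_out = mem_out = None
    pc, writes = 0, []
    for edge in range(1, edges + 1):
        # All stages clock together, so each next value is computed from the old outputs.
        new_fetch = program[pc] if pc < len(program) else None
        pc += 1
        new_decode = decode(fetch_out, regs) if fetch_out is not None else None
        new_exec = None
        if decode_out is not None:
            op1, op2, rd = decode_out
            new_exec = ((op1 + op2) & MASK32, rd)
        new_mem = exec_out
        # Write stage: applied after decode has read regs, like a VHDL signal update.
        if mem_out is not None and mem_out[1] != 0:
            result, rd = mem_out
            regs[rd] = result
            writes.append((edge, rd, result))
        fetch_out, decode_out, exec_out, mem_out = new_fetch, new_decode, new_exec, new_mem
    return regs, writes

ADDI_X1_X1_1 = 0x00108093
regs, writes = simulate([ADDI_X1_X1_1, ADDI_X1_X1_1])
print(writes)   # [(5, 1, 1), (6, 1, 1)]  -> (edge, register, value)
print(regs[1])  # 1, not 2
```

The second ADDI reads x1 at edge 3, but the first ADDI only updates x1 at edge 5.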
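
For reference, the "proper" solution could be sketched as a scoreboard: a counter of in-flight writes per register. This is a hypothetical Python sketch of that idea, not something this lesson implements. The class and method names are invented, and in hardware it would be a small counter per register updated by the decode and write stages.

```python
class Scoreboard:
    """Per-register count of in-flight instructions that will write that register (hypothetical)."""

    def __init__(self) -> None:
        self.pending = [0] * 32

    def can_issue(self, sources) -> bool:
        # A read-after-write hazard exists if any source register still has a pending write.
        return all(self.pending[r] == 0 for r in sources)

    def issue(self, rd: int) -> None:
        if rd != 0:  # x0 is never written, so it never needs tracking
            self.pending[rd] += 1

    def retire(self, rd: int) -> None:
        if rd != 0:
            self.pending[rd] -= 1

sb = Scoreboard()
sb.issue(1)               # first ADDI x1, x1, 1 leaves decode
print(sb.can_issue([1]))  # False: the second ADDI has to stall
sb.retire(1)              # first ADDI reaches the write stage
print(sb.can_issue([1]))  # True
```

Unlike waiting for every instruction, this only stalls instructions that actually depend on a pending result.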

So, what we can do is add an output pipeline_ready to the write stage that is set to 1 for a cycle whenever an active instruction finishes. This signal is fed back to the fetch stage, which only fetches a new instruction when the signal is 1.

For this, first we need to propagate the is_active signal all the way to the write stage.

src/core/constants.vhd CHANGED

@@ -20,11 +20,13 @@ package core_constants is
 	);
 	constant DEFAULT_EXECUTE_OUTPUT: execute_output_t := (
+		is_active => '0',
 		result => (others => '0'),
 		destination_reg => (others => '0')
 	);
 	constant DEFAULT_MEMORY_OUTPUT: memory_output_t := (
+		is_active => '0',
 		result => (others => '0'),
 		destination_reg => (others => '0')
 	);

src/core/execute.vhd CHANGED

@@ -19,19 +19,23 @@ architecture rtl of execute is
 begin
 	process (clk)
+		variable v_output: execute_output_t;
 	begin
 		if rising_edge(clk) then
+			v_output := DEFAULT_EXECUTE_OUTPUT;
+			v_output.is_active := input.is_active;
 			if input.is_active = '1' and input.is_invalid = '0' then
 				if input.operation = OP_ADD then
-					output.result <= std_logic_vector(unsigned(input.operand1) + unsigned(input.operand2));
+					v_output.result := std_logic_vector(unsigned(input.operand1) + unsigned(input.operand2));
 				else
 					assert false report "Unhandled operation value in execute stage" severity failure;
 				end if;
-				output.destination_reg <= input.destination_reg;
-			else
-				output <= DEFAULT_EXECUTE_OUTPUT;
+				v_output.destination_reg := input.destination_reg;
 			end if;
+			output <= v_output;
 		end if;
 	end process;

src/core/memory.vhd CHANGED

@@ -21,6 +21,7 @@ begin
 	process (clk)
 	begin
 		if rising_edge(clk) then
+			output.is_active <= input.is_active;
 			output.result <= input.result;
 			output.destination_reg <= input.destination_reg;
 		end if;

src/core/types.vhd CHANGED

@@ -20,11 +20,13 @@ package core_types is
 	end record decode_output_t;
 	type execute_output_t is record
+		is_active: std_logic;
 		result: std_logic_vector(31 downto 0);
 		destination_reg: std_logic_vector(4 downto 0);
 	end record execute_output_t;
 	type memory_output_t is record
+		is_active: std_logic;
 		result: std_logic_vector(31 downto 0);
 		destination_reg: std_logic_vector(4 downto 0);
 	end record memory_output_t;

With that done, we "loop" the signal back from the write stage to the fetch stage as pipeline_ready, and only fetch when it is 1. We initialize pipeline_ready to 1 in the output of the write stage to avoid a deadlock, in which the pipeline waits for an instruction while the fetch stage waits for an instruction to finish.
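
The handshake, including the deadlock that the initial value avoids, can be checked with a Python model that reduces each pipeline stage to a slot holding an instruction number (or None for a bubble). This sketch assumes only the feedback loop matters. It also stops fetching after the given number of instructions, whereas the VHDL fetch stage keeps going.

```python
def run(num_instructions: int, ready_init: bool, edges: int = 20):
    """Return (edge, instruction index) pairs for instructions completing in the write stage."""
    fetch_out = decode_out = exec_out = mem_out = None
    ready, next_instr, completed = ready_init, 0, []
    for edge in range(1, edges + 1):
        # Fetch only issues when pipeline_ready was 1 before this edge.
        new_fetch = None
        if ready and next_instr < num_instructions:
            new_fetch, next_instr = next_instr, next_instr + 1
        if mem_out is not None:
            completed.append((edge, mem_out))
        new_ready = mem_out is not None  # pipeline_ready <= write_input.is_active
        # Shift every stage by one slot; the right-hand side uses the old values.
        fetch_out, decode_out, exec_out, mem_out, ready = (
            new_fetch, fetch_out, decode_out, exec_out, new_ready)
    return completed

print(run(2, ready_init=True))   # [(5, 0), (10, 1)]
print(run(2, ready_init=False))  # []: fetch waits for a completion that never comes
```

With pipeline_ready starting at 1, one instruction completes every 5 cycles. Starting at 0, nothing is ever fetched.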

src/core.vhd CHANGED

@@ -18,10 +18,12 @@ architecture rtl of core is
 	signal decode_output: decode_output_t;
 	signal execute_output: execute_output_t;
 	signal memory_output: memory_output_t;
+	signal pipeline_ready: std_logic;
 	component fetch is
 		port (
 			clk: in std_logic;
+			pipeline_ready: in std_logic;
 			output: out fetch_output_t
 		);
 	end component;
@@ -31,7 +33,8 @@ architecture rtl of core is
 			clk: in std_logic;
 			decode_input: in fetch_output_t;
 			decode_output: out decode_output_t;
-			write_input: in memory_output_t
+			write_input: in memory_output_t;
+			pipeline_ready: out std_logic
 		);
 	end component;
@@ -52,9 +55,9 @@ architecture rtl of core is
 	end component;
 begin
-	fetch_inst: fetch port map(clk => clk, output => fetch_output);
-	decode_write_inst: decode_write port map(clk => clk, decode_input => fetch_output, decode_output => decode_output, write_input => memory_output);
+	fetch_inst: fetch port map(clk => clk, output => fetch_output, pipeline_ready => pipeline_ready);
+	decode_write_inst: decode_write port map(clk => clk, decode_input => fetch_output, decode_output => decode_output, write_input => memory_output, pipeline_ready => pipeline_ready);
 	execute_inst: execute port map(clk => clk, input => decode_output, output => execute_output);

src/core/decode_write.vhd CHANGED

@@ -13,7 +13,8 @@ entity decode_write is
 		decode_input: in fetch_output_t;
 		decode_output: out decode_output_t := DEFAULT_DECODE_OUTPUT;
-		write_input: in memory_output_t
+		write_input: in memory_output_t;
+		pipeline_ready: out std_logic := '1'
 	);
 end decode_write;
@@ -40,6 +41,8 @@ begin
 				reg(to_integer(unsigned(write_input.destination_reg))) <= write_input.result;
 			end if;
+			pipeline_ready <= write_input.is_active;
 			opcode := decode_input.instr(6 downto 0);
 			rs1    := decode_input.instr(19 downto 15);
 			rs2    := decode_input.instr(24 downto 20);

src/core/fetch.vhd CHANGED

@@ -9,6 +9,7 @@ use work.core_constants.all;
 entity fetch is
 	port (
 		clk: in std_logic;
+		pipeline_ready: in std_logic;
 		output: out fetch_output_t := DEFAULT_FETCH_OUTPUT
 	);
 end fetch;
@@ -28,10 +29,14 @@ begin
 	process (clk)
 	begin
 		if rising_edge(clk) then
-			pc <= pc + 4;
-			output.is_active <= '1';
-			output.instr <= imem(to_integer(pc(5 downto 2)));
+			if pipeline_ready = '1' then
+				pc <= pc + 4;
+				output.is_active <= '1';
+				output.instr <= imem(to_integer(pc(5 downto 2)));
+			else
+				output <= DEFAULT_FETCH_OUTPUT;
+			end if;
 		end if;
 	end process;

With this change, we see that the value of x1 settles on 2 after 100 ns.

Simulation waveforms

So, now we can also execute multiple successive ADDI instructions.

While we have only implemented a single instruction, it is worth realizing that implementing more instructions is relatively easy, since most of the "infrastructure" is there. To illustrate this, let's implement the ADD instruction, which is similar to ADDI but operates on two registers instead of a register and an immediate value.

src/core/decode_write.vhd CHANGED

@@ -28,6 +28,7 @@ begin
 	process (clk)
 		variable opcode: std_logic_vector(6 downto 0);
 		variable funct3: std_logic_vector(2 downto 0);
+		variable funct7: std_logic_vector(6 downto 0);
 		variable rs1, rs2, rd : std_logic_vector(4 downto 0);
 		variable i_imm: std_logic_vector(11 downto 0);
@@ -47,6 +48,7 @@ begin
 			rs1    := decode_input.instr(19 downto 15);
 			rs2    := decode_input.instr(24 downto 20);
 			funct3 := decode_input.instr(14 downto 12);
+			funct7 := decode_input.instr(31 downto 25);
 			rd     := decode_input.instr(11 downto 7);
 			i_imm := decode_input.instr(31 downto 20);
@@ -64,6 +66,12 @@ begin
 					v_decode_output.operand1 := reg(to_integer(unsigned(rs1)));
 					v_decode_output.operand2 := i_imm_s;
 					v_decode_output.destination_reg := rd;
+				elsif opcode = "0110011" and funct3 = "000" and funct7 = "0000000" then
+					-- ADD rd, rs1, rs2 (R-type): sets rd to the sum of rs1 and rs2
+					v_decode_output.operation := OP_ADD;
+					v_decode_output.operand1 := reg(to_integer(unsigned(rs1)));
+					v_decode_output.operand2 := reg(to_integer(unsigned(rs2)));
+					v_decode_output.destination_reg := rd;
 				else
 					v_decode_output.is_invalid := '1';
 				end if;
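
Checking funct7 matters here: SUB uses the same opcode (0110011) and funct3 (000) as ADD and differs only in funct7 (0100000). The Python sketch below encodes both instructions and applies the same match as the new elsif branch. The helper names are hypothetical.

```python
def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack R-type fields: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def is_add(instr: int) -> bool:
    """Same condition as the decoder: opcode 0110011, funct3 000, funct7 0000000."""
    return ((instr & 0x7F) == 0b0110011
            and ((instr >> 12) & 0x7) == 0b000
            and ((instr >> 25) & 0x7F) == 0b0000000)

add_x3_x1_x2 = encode_r_type(0b0000000, 2, 1, 0b000, 3, 0b0110011)
sub_x3_x1_x2 = encode_r_type(0b0100000, 2, 1, 0b000, 3, 0b0110011)
print(f"{add_x3_x1_x2:08x}", is_add(add_x3_x1_x2))  # 002081b3 True
print(f"{sub_x3_x1_x2:08x}", is_add(sub_x3_x1_x2))  # 402081b3 False
```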

That's it! Many other instructions operate on two operands and write the result to a destination register; these are very easy to implement (although for most of them we do need to implement a new operation in the execute stage).

Instructions that are very different (like memory operations, or control flow operations) will be a bit more work, but even so, it's a lot less than the work we've already done at this point.

In the next lesson we'll look at actually running our design on the Mimas A7 dev board.