# F28HS Hardware-Software Interface: Systems Programming

#### Hans-Wolfgang Loidl

School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh



#### Semester 2 - 2019/20

| <sup>0</sup> No proprietary software   | has been used in producing th     | nese slides | HERIOT<br>WATT |
|----------------------------------------|-----------------------------------|-------------|----------------|
| Hans-Wolfgang Loidl (Heriot-Watt Univ) | F28HS Hardware-Software Interface | 2019/20     | 1 / 39         |

# Lecture 6: Computer Architecture

## Outline

| Lecture 1: Introduction to Systems Programming                                                                     |                                                        |         |                              |  |  |  |  |  |
|--------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|---------|------------------------------|--|--|--|--|--|
| 2 Lecture 2: Systems Prog                                                                                          | 2 Lecture 2: Systems Programming with the Raspberry Pi |         |                              |  |  |  |  |  |
| <ul> <li>Lecture 3: Memory Hierarchy</li> <li>Memory Hierarchy</li> <li>Principles of Cacl</li> </ul>              |                                                        |         |                              |  |  |  |  |  |
| <ul> <li>Lecture 4: Programming external devices</li> <li>Basics of device-level programming</li> </ul>            |                                                        |         |                              |  |  |  |  |  |
| 5 Lecture 5: Exceptional 0                                                                                         | 5 Lecture 5: Exceptional Control Flow                  |         |                              |  |  |  |  |  |
| <ul> <li>Lecture 6: Computer Architecture</li> <li>Processor Architectures Overview</li> <li>Pipelining</li> </ul> |                                                        |         |                              |  |  |  |  |  |
| Lecture 7: Code Securit                                                                                            | Lecture 7: Code Security: Buffer Overflow Attacks      |         |                              |  |  |  |  |  |
| B Lecture 8: Interrupt Handling                                                                                    |                                                        |         |                              |  |  |  |  |  |
| Lecture 9: Miscellaneous Topics                                                                                    |                                                        |         |                              |  |  |  |  |  |
| Lecture 10: Revision                                                                                               |                                                        |         | HERIOT<br>WATT<br>UNIVERSITY |  |  |  |  |  |
| Hans-Wolfgang Loidl (Heriot-Watt Univ)                                                                             | F28HS Hardware-Software Interface                      | 2019/20 | 2/39                         |  |  |  |  |  |

#### **Classes of Computer Architectures**

- There is a wide range of computer architectures from small-scale (embedded) to large-scale (super-computers)
- In this course we focus on embedded systems
- A key requirement for these devices is low power consumption
- This is also increasingly important for main-stream hardware and even for super-computing
- Embedded devices are found in cars, planes, house-hold devices, network-devices, cell-phones etc
- This is the most rapidly growing market for computer hardware

HERIOT WATT

#### Number of processors produced



### **Processor Architectures: Introduction**

- In this part we take a brief look at the design of processor hardware.
- This view will give you a better understanding of how computers work.
- In particular you will gain a better understanding of issues relevant to **resource consumption.**
- So far we have used a very simple model of a CPU: each instruction is fetched and executed to completion before the next one begins.
- Modern processor architectures use **pipeling** to execute multiple instructions simultaneously ("super-scalar architectures").
- Special measures need to be taken to ensure that the processor computes the same results as it would with sequential execution

7/39

# Limitations to further improvements



#### A simple picture of the CPU



- The ALU executes arithmetic/logic operations with arguments in registers
- Load and store instructions move data between memory and registers

HERIOT WATT

#### Why should you learn about architecture design?

- It is intellectually interesting and important.
- Understanding how the processor works aids in understanding how the overall computer system works.
- Although few people design processors, many design hardware systems that contain processors.
- You just might work on a processor design.





On each 320 ps cycle, the system spends 300 ps evaluating a combinational logic function and 20 ps storing the results in an output register.

Lec 6: Computer Architecture

11/39

<sup>0</sup>From Bryant, Chapter 4 Hans-Wolfgang Loid (Heriot-Watt Univ)

|  | Stages of | f executing | an assen | h | ler | ins | truct | tior |
|--|-----------|-------------|----------|---|-----|-----|-------|------|
|--|-----------|-------------|----------|---|-----|-----|-------|------|

Processing an assembler instruction involves a number of operations:

- Fetch: The fetch stage reads the bytes of an instruction from memory, using the program counter (PC) as the memory address.
- 2 Decode: The decode stage reads up to two operands from the register file.
- Execute: In the execute stage, the arithmetic/logic unit (ALU) either performs the operation specified by the instruction, computes the effective address of a memory reference, or increments or decrements the stack pointer.
- Memory: The memory stage may write data to memory, or it may read data from memory.
- Write back: The write-back stage writes up to two results to the register file.
- **OPC update:** The PC is set to the address of the next instruction.

NB: The processing depends on the instruction, and certain stages HERIOT may not be used.

Lec 6: Computer Architecture

10/39

HERIOT WATT

12/39

Hans-Wolfgang Loidl (Heriot-Watt Univ

#### Instruction-level parallelism

- Key observation: We can do the different stages of the execution in parallel ("instruction-level parallelism")
- An architecture that allows this kind of parallelism is called "pipelined" architecture
- This is a big performance boost: ideally each instruction takes just 1 cycle (as opposed to 5 cycles for the 5 stages of the execution)
- However, the ideal case is often not reached, and modern architecture play clever tricks to get closer to the ideal case: branch prediction, out-of-order execution etc





The computation is split into stages A, B, and C. On each 120-ps cycle, each instruction progresses through one stage.

| <sup>0</sup> From Bryant, Chapter 4    |                                   |                              | WAT   |
|----------------------------------------|-----------------------------------|------------------------------|-------|
| Hans-Wolfgang Loidl (Heriot-Watt Univ) | F28HS Hardware-Software Interface | Lec 6: Computer Architecture | 13/39 |

### Example: One clock cycle of pipeline operation.



- We now take a closer look on how values are propagated through the pipeline.
- Instruction I1 has completed stage B
- Instruction I2 has completed stage A

Lec 6: Computer Architecture

HERIOT

#### Three-stage pipeline timing



#### Example: One clock cycle of pipeline operation.

Hans-Wolfgang Loidl (Heriot-Watt Univ)

HERIOT WATT

14/39

Lec 6: Computer Architecture



Just **before** clock rise: values have been computed (stage A of instruction I2, stage B of instruction I1), but the pipeline registers have not been updated, yet.



Example: One clock cycle of pipeline operation.

#### ② Time = 241 Clock 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb Comb. Comb logic logic logic Α В С

On clock rise, inputs are loaded into the pipeline registers.

| <sup>0</sup> From Bryant, Chapter 4    |                                   |                              | HERIOT<br>WATT |
|----------------------------------------|-----------------------------------|------------------------------|----------------|
| Hans-Wolfgang Loidl (Heriot-Watt Univ) | F28HS Hardware-Software Interface | Lec 6: Computer Architecture | 17/39          |

# Example: One clock cycle of pipeline operation.



Before time 360, the result values reach the inputs of the pipeline registers, to be propagated at the next rising clock.

| <sup>0</sup> From Bryant, Chapter 4    |                                   |                              | WATT  |
|----------------------------------------|-----------------------------------|------------------------------|-------|
| Hans-Wolfgang Loidl (Heriot-Watt Univ) | F28HS Hardware-Software Interface | Lec 6: Computer Architecture | 19/39 |

HERIOT

#### Example: One clock cycle of pipeline operation.



Signals then propagate through the combinational logic (possibly at different rates).

| <sup>0</sup> From Bryant, Chapter 4    |                                   |                              | WATT    |
|----------------------------------------|-----------------------------------|------------------------------|---------|
| Hans-Wolfgang Loidl (Heriot-Watt Univ) | F28HS Hardware-Software Interface | Lec 6: Computer Architecture | 18 / 39 |

#### Multiple-clock-cycle pipeline diagram



FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43.



#### Abstract view of a sequential processor



Hans-Wolfgang Loidl (Heriot-Watt Univ

<sup>0</sup>From Bryant, Chapter 4 Hans-Wolfgang Loidl (Heriot-Watt Univ

The information processed during execution of an instruction follows a clockwise flow starting with an instruction fetch using the program counter (PC), shown in the lower left-hand corner of the figure.

Abstract view of a pipelined processor Write Stat M stat icor pipelined Execute E stat icode ifun

Hardware structure of a implementation. By inserting pipeline registers between the stages, we create a five-stage pipeline.

Lec 6: Computer Architecture

Lec 6: Computer Architecture

HERIO WAT

21/39

HERIOT WATT

23/39

## Discussion of pipelined execution

The main pipeline stages are:

Hans-Wolfgang Loidl (Heriot-Watt Univ)

- Fetch: Using the program counter register as an address, the instruction memory reads the bytes of an instruction. The PC incrementer computes valP, the incremented program counter.
- Decode: The register file has two read ports, A and B, via which register values valA and valB are read simultaneously.
- Execute: This uses the arithmetic/logic (ALU) unit for different purposes according to the instruction type: integer operations, memory access, or branch instructions.
- Memory: The Data Memory unit reads or writes a word of memory (memory instruction). The instruction and data memories access the same memory locations, but for different purposes.
- Write back: The register file has two write ports. Port E is used to write values computed by the ALU, while port M is used to write HERIOT WATT values read from the data memory.

Lec 6: Computer Architecture

22/39

### **Pipeline registers**

The pipeline registers are labeled as follows:

- F holds a predicted value of the program counter.
- D sits between the fetch and decode stages. It holds information about the most recently fetched instruction for processing by the decode stage.
- E sits between the decode and execute stages. It holds information about the most recently decoded instruction and the values read from the register file for processing by the execute stage.
- M sits between the execute and memory stages. It holds the results of the most recently executed instruction for processing by the memory stage. It also holds information about branch conditions and branch targets for processing conditional jumps.
- W sits between the memory stage and the feedback paths that supply the computed results to the register file for writing and the return address to the PC selection logic when completing a return instruction.

#### Example of instruction flow through pipeline



# Pipelining and branches

- How can a pipelined architecture deal with conditional branches?
- In this case the processor doesn't know the successor instruction until further down the pipeline.
- To deal with this, modern architectures perform some form of **branch prediction** in hardware.
- There are two forms of branch prediction:
  - static branch prediction always takes the same guess (e.g. guess always taken)
  - dynamic branch prediction uses the history of the execution to take better guesses
- Performance is significantly higher when branch predictions are correct
- If they are wrong, the processor needs to stall or inject bubbles into the pipeline

# The ARM picture

The pipeline in the BCM2835 SoC for the RPi has 8 pipeline stages:

- Fe1: The first Fetch stage, where the address is sent to memory and an instruction is returned.
- Fe2: Second fetch stage, where the processor tries to predict the destination of a branch.
- **Oe:** Decoding the instruction.
- Iss: Register read and instruction issue

# Only for ALU operations:

| Sh: Perform shift operations as required.                                                                                                 |                |  |  |  |  |
|-------------------------------------------------------------------------------------------------------------------------------------------|----------------|--|--|--|--|
| 2 ALU: Perform arithmetic/logic operations.                                                                                               |                |  |  |  |  |
| Sat: Saturate integer results.                                                                                                            |                |  |  |  |  |
| Only for Multiply operations:                                                                                                             |                |  |  |  |  |
| MAC1: First stage of the multiply-accumulate pipeline.                                                                                    |                |  |  |  |  |
| MAC2: Second stage of the multiply-accumulate pipeline.                                                                                   |                |  |  |  |  |
| MAC3: Third stage of the multiply-accumulate pipeline.                                                                                    |                |  |  |  |  |
| Only for Load/Store operations:                                                                                                           |                |  |  |  |  |
| ADD: Address generation stage.                                                                                                            | HERIOT<br>WATT |  |  |  |  |
| DC1: First stage of data cache access.                                                                                                    | W UNIVERSITY   |  |  |  |  |
| Hans-Wolfgang Loidi (Heriot-Watt Univ) F28HS Hardware-Software Interface Lec 6: Computer Architecture                                     | 26 / 39        |  |  |  |  |
| <ul> <li>WBi: Write back of data from any of the above sub-pipelines.</li> <li>Case slides DBAck and the table in Smith's back</li> </ul> |                |  |  |  |  |

<sup>0</sup>See slidesRPiArch and the table in Smith's book

Hans-Wolfgang Loidl (Heriot-Watt Univ)

# Example: bad branch prediction

|         | .global _start |     |     |    |   |              |  |
|---------|----------------|-----|-----|----|---|--------------|--|
| .text   |                |     |     |    |   |              |  |
| _start: | EORS           | R1, | R1, | R1 | Q | always O     |  |
|         | BNE            | tar | get |    | g | Not taken    |  |
|         | MOV            | R0, | #11 |    | Ø | fall through |  |
|         | MOV            | R7, | #1  |    |   |              |  |
|         | SWI            | 0   |     |    |   |              |  |
| target: | MOV            | R0, | #1  |    |   |              |  |
|         | MOV            | R7, | #1  |    |   |              |  |
|         | SWI            | 0   |     |    |   |              |  |

Branch prediction: we assume the processor takes an always taken policy, i.e. it always assumes that that a branch is taken NB: the conditional branch (BNE) will never be taken, because exclusive-or with itself always gives 0, i.e. this is a deliberately bad example for the branch predictor

#### Processing mispredicted branch instructions.



- Predicting "branch taken", instruction 0x014 is fetched in cycle 3, and instruction  $0 \times 018$  is fetched in cycle 4.
- In cycle 4 the branch logic detects that the branch is not taken
- It therefore abandons the execution of 0x014 and 0x018 by injecting **bubbles** into the pipeline.
- The result will be as expected, but performance is sub-optimal! HERIOT <sup>0</sup>Adapted from Bryant, Figure 4.62 Hans-Wolfgang Loidl (Heriot-Watt Univ)

Lec 6: Computer Architecture

29/39

# Hazards of Pipelining

- Pipelining complicates the processing of instructions because of:
  - Control hazards, where branches are mis-predicted (as we have seen)
  - Data hazards, where data dependencies exist between subsequent instructions
- Several ways exist to solve these problems:
  - To deal with control hazards, branch prediction is used and, if necessary, partially executed instructions are abandoned.
  - To deal with data hazards, bubbles can be injected to delay the execution of instructions, or data in pipeline registers (but not written back) can be forwarded to other stages in the pipeline.
- A lot of the complexities in modern processors is due to deep pipelining, (possibly dynamic) branch prediction, and forwarding of data

For details on pipelining and data hazards, see Bryant & O'Hallaron, Computer Systems: A Programmer's View, Chapter 4 (especially HERIOT WATT Sec 4.4 and 4.5).

# Example of bad branch prediction

Code example: sumav3\_asm

#### **Data Hazards**

HERIOT WAT T

30/39

Lec 6: Computer Architecture

- The branch-prediction example above was a case of a control hazard.
- Now we look into a simple example of a data hazard.

Hans-Wolfgang Loidl (Heriot-Watt Univ)

• Consider the following simple ARM assembler program:

| ADD | R3, | R1, | R2 | Q | R3 = | R1 | + | R2 |
|-----|-----|-----|----|---|------|----|---|----|
| SUB | R0, | R3, | R4 | Q | R0 = | R3 | _ | R4 |

- Note, the result from the first instruction, in R3, will only become available in the write-back (5th) stage
- But, the data in R3 is needed already in the decode (2nd) stage of the second instruction
- Without intervention, this would stall the pipeline, similar to the branch-mis-prediction case
- The solution to this is to introduce forwarding (or by-passing) to HERIOT WATT the hardware of the processor

#### A Graphical Representation of Forwarding





#### Example: Reordering Code to Avoid Pipeline Stalls

• We have previously examined, how C expressions are compiled to Assembler code. For example, consider this C program fragment:

```
int a, b, c, d, e, f;
a = b + e;
c = b + f;
```

- Knowing about control and data hazards motivates reordering of code that should be done by the compiler to avoid pipeline stalls.
- Such reordering is commonly done in the backend of compilers.
- Therefore, the sequence of Assembler instructions might be different from the one you expect.

HERIOT WATT

Lec 6: Computer Architecture

#### Example: Reordering Code to Avoid Pipeline Stalls

Example: Translate the following C expression into Assembler:

```
int a, b, c, d, e, f;
a = b + e;
c = b + f;
```

Hans-Wolfgang Loidl (Heriot-Watt Univ)

Hans-Wolfgang Loidl (Heriot-Watt Univ)

Example: We assume the variables are stored in memory, starting from the location held in register R0. Here is the naive Assembler code:

| LDR | R1, | [R0, #4]  | G | load b  |
|-----|-----|-----------|---|---------|
| LDR | R2, | [RO, #16] | Q | load e  |
| ADD | R3, | R1, R2    | g | b + e   |
| STR | R3, | [R0, #0]  | G | store a |
| LDR | R4, | [R0, #20] | g | load f  |
| ADD | R5, | R1, R4    | g | b + f   |
| STR | R5, | [RO, #12] | Q | store c |
|     |     |           |   |         |

Can you spot the data hazard in this example?

HERIOT WATT

#### A Graphical Representation of a Load-Store Hazard



#### Summary: Processor Architecture and Pipelining

- Modern ("super-scalar") processors can execute several instructions at the same time, by organising the execution of an instruction into several stages and using a pipeline structure.
- This exploits **instruction-level** parallelism and boosts performance.
- However, there is a risk of control and data hazards, leading to reduced performance, e.g. due to poor branch prediction
- Knowing these risks, you can develop faster code!

Hans-Wolfgang Loidl (Heriot-Watt Univ)

• These code transformations are often done internally by the compiler.

HERIOT

39/39

Lec 6: Computer Architecture

#### Example: Reordering Code to Avoid Pipeline Stalls

Example: Translate the following C expression into Assembler:

int a, b, c, d, e, f; a = b + e; c = b + f;

Hans-Wolfgang Loidl (Heriot-Watt Univ)

Example: The reordered Assembler code, eliminating the data hazard:

| LDR | R1, | [RO, #4]  | 0 load b                      |
|-----|-----|-----------|-------------------------------|
| LDR | R2, | [RO, #16] | 0 load e                      |
| LDR | R4, | [R0, #20] | <pre>@ load f; moved_up</pre> |
| ADD | R3, | R1, R2    | @ b + e                       |
| STR | R3, | [R0, #0]  | 0 store a                     |
| ADD | R5, | R1, R4    | @ b + f                       |
| STR | R5, | [RO, #12] | 0 store c                     |
|     |     |           |                               |

Moving the third LDR instruction upward, makes its result available soon enough to avoid a pipeline stall.

Lec 6: Computer Architecture

38/39