# F28HS Hardware-Software Interface: Systems Programming

#### Hans-Wolfgang Loidl

School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh



Semester 2 — 2023/24



<sup>0</sup>No proprietary software has been used in producing these slides

#### **Outline**

- Tutorial 1: Using Python and the Linux FS for GPIO Control
- Tutorial 2: Programming an LED
- Tutorial 3: Programming a Button input device
- Tutorial 4: Inline Assembler with gcc
- Tutorial 5: Programming an LCD Display
- Tutorial 6: Performance Counters on the RPi 2




#### Tutorial 6: Performance Counters on the RPi 2

- Performance counters are hardware support for monitoring basic operations on the CPU
- They are very accurate and useful for monitoring resource consumption
- It is possible to count cycles, but also cache misses, (mispredicted) branches etc
- In this tutorial we will cover how to use performance counters to get a precise measure of the runtime of a program

## **Architecture Support**

- Both the BCM2835 (of the RPi 1) and BCM2836 (of the RPi 2) provide a Performance Monitoring Unit (PMU) as a co-processor on the chip
- The unit supports in total 4 counter registers and a separate cycle counter register.
- These 4 registers can be configured to count a range of low-level events.
- There are 2 different interfaces for accessing this information.
  - the APB interface, which uses memory mapping and accesses registers on the PMU directly
  - the CP15 interface, which uses special assembler instructions for communicating between processor and PMU
- The PMU operations are usually not available to user programs (trying to run them directly will raise a SIGILL signal, i.e. an illegal-instruction exception)
- However, we can write a simple Linux kernel module to enable this functionality, and then use it through assembler instructions in our user code.

#### Overview: How to use the PMU

We need to go through the following steps:

- Find out how to interact with the PMU
- Enable access to the PMU from "user space"
- Define what we want to monitor
- Use access to the PMU to measure programs



#### Step 1: Find out how to interact with the PMU



The PMU is a **co-processor**, called **CP15**, separate from the main processor, but on the same chip.

The special assembler instructions MRC and MCR transfer data between a processor register (R) and the co-processor (C).

<sup>0</sup>From Linux Magazin 05/2015: Kerntechnik

# Instructions for data transfer between processor and co-processor

- The ARM instruction set provides 2 instructions for data transfer between processor and co-processor:
  - MCR: Move to Coproc from ARM Reg
  - MRC: Move to ARM Reg from Coproc

#### The technical reference manual describes the instructions like this:

To access the PMCR, read or write the CP15 registers with:

```
MRC p15, 0, <Rt>, c9, c12, 0 ; Read Performance Monitor Control Register
MCR p15, 0, <Rt>, c9, c12, 0 ; Write Performance Monitor Control Register
```

<sup>0</sup>See Cortex A7 MPcore Technical Reference Manual, Table 11-1 PMU register summary, p 241
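To make the operand order of these two instructions more concrete, the following sketch packs the operands into a 32-bit instruction word. It assumes the standard A32 (ARM mode) encoding of MRC/MCR with the condition field set to "always"; `encode_cp_transfer` is a hypothetical helper for illustration and not part of the course code.

```c
#include <stdint.h>

/* Hypothetical helper (not from the course sources): build the A32 machine
 * word for "MRC/MCR p<coproc>, <opc1>, R<rt>, c<crn>, c<crm>, <opc2>",
 * assuming condition code AL (always). */
static uint32_t encode_cp_transfer(int is_mrc, unsigned coproc, unsigned opc1,
                                   unsigned rt, unsigned crn, unsigned crm,
                                   unsigned opc2)
{
    return (0xEu << 28)               /* condition field: AL (always)     */
         | (0xEu << 24)               /* fixed bits 1110 for MCR/MRC      */
         | ((opc1 & 0x7u) << 21)      /* <opc1>                           */
         | ((is_mrc ? 1u : 0u) << 20) /* 1 = MRC (read), 0 = MCR (write)  */
         | ((crn & 0xFu) << 16)       /* <CRn>, e.g. c9                   */
         | ((rt & 0xFu) << 12)        /* ARM register <Rt>                */
         | ((coproc & 0xFu) << 8)     /* co-processor number, 15 for CP15 */
         | ((opc2 & 0x7u) << 5)       /* <opc2>                           */
         | (1u << 4)                  /* fixed bit                        */
         | (crm & 0xFu);              /* <CRm>, e.g. c12                  */
}
```

Under these assumptions, `encode_cp_transfer(1, 15, 0, 0, 9, 12, 0)` corresponds to `MRC p15, 0, r0, c9, c12, 0`, i.e. reading the PMCR into `r0`. In practice the assembler does this packing for us; the sketch only shows where each operand ends up.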

# Step 2: Enabling PMU access through a kernel module

- By default, the PMU can only be accessed in "privileged mode", but this can be changed
- We need to construct a small Linux kernel module that enables the access to the PMU
- In essence, we need to embed some assembler instructions into an API prescribed by the Linux kernel
- For details on how to build a Linux kernel module see
  - The Linux Kernel Module Programming Guide, Peter Jay Salzman
  - Building instructions from a course on "Introduction to Embedded Computing" at Univ of California, San Diego, by Tajana Simunic Rosing
- Here, I'll just briefly summarise the steps needed, and how to use performance monitoring in a simple example program

Tutorial 6: Perf Counters

#### Table 11-1: PMU registers

Table 11-1 PMU register summary

| Register<br>number | Offset      | CRn | Op1 | CRm | Op2 | Name        | Type | Description                                                               |
|--------------------|-------------|-----|-----|-----|-----|-------------|------|---------------------------------------------------------------------------|
| 0                  | 0x000       | c9  | 0   | c13 | 2   | PMXEVCNTR0  | RW   | Event Count Register, see the ARM<br>Architecture Reference Manual        |
| 1                  | 0x004       | c9  | 0   | c13 | 2   | PMXEVCNTR1  | RW   | Event Count Register, see the ARM<br>Architecture Reference Manual        |
| 2                  | 0x008       | c9  | 0   | c13 | 2   | PMXEVCNTR2  | RW   | Event Count Register, see the ARM<br>Architecture Reference Manual        |
| 3                  | 0x00C       | c9  | 0   | c13 | 2   | PMXEVCNTR3  | RW   | Event Count Register, see the ARM<br>Architecture Reference Manual        |
| 4-30               | 0x010-0x078 | -   | -   | -   | -   | -           | -    | Reserved                                                                  |
| 31                 | 0x07C       | c9  | 0   | c13 | 0   | PMCCNTR     | RW   | Cycle Count Register, see the ARM<br>Architecture Reference Manual        |
| 32-255             | 0x080-0x3FC | -   | -   | -   | -   | -           | -    | Reserved                                                                  |
| 256                | 0x400       | c9  | 0   | c13 | 1   | PMXEVTYPER0 | RW   | Event Type Selection Register, see<br>the ARM Architecture Reference Manual |
| 257                | 0x404       | c9  | 0   | c13 | 1   | PMXEVTYPER1 | RW   | Event Type Selection Register, see<br>the ARM Architecture Reference Manual |
| 258                | 0x408       | c9  | 0   | c13 | 1   | PMXEVTYPER2 | RW   | Event Type Selection Register, see<br>the ARM Architecture Reference Manual |
| 259                | 0x40C       | c9  | 0   | c13 | 1   | PMXEVTYPER3 | RW   | Event Type Selection Register, see<br>the ARM Architecture Reference Manual |

<sup>0</sup>See Cortex A7 MPcore Technical Reference Manual, Table 11-1 PMU register summary, p 237

#### Table 11-1: PMU registers

| Register<br>number | Offset      | CRn | Op1 | CRm | Op2 | Name      | Type | Description                                                        |
|--------------------|-------------|-----|-----|-----|-----|-----------|------|--------------------------------------------------------------------|
| 897                | 0xE04       | c9  | 0   | c12 | 0   | PMCR      | RW   | Performance Monitor Control<br>Register on page 11-7               |
| 898                | 0xE08       | c9  | 0   | c14 | 0   | PMUSERENR | RW   | User Enable Register, see the ARM<br>Architecture Reference Manual |
| 899-903            | 0xE0C-0xE1C | -   | -   | -   | -   | -         | -    | Reserved                                                           |

- The two main registers that we need to access are PMCR and PMUSERENR
  - PMCR: controls access to the PMU in general
  - PMUSERENR: is the User Enable Register that needs to be configured to allow user code to access the PMU

<sup>0</sup>See Cortex A7 MPcore Technical Reference Manual, Table 11-1 PMU register summary, p 237


## Structure of the PMCR register

To enable access to the PMU, we need to access the PMCR register. The **Performance Monitor Control Register (PMCR)** defines the core behaviour of the PMU:



Figure 11-2 Performance Monitor Control Register bit assignments

<sup>0</sup>See Cortex A7 MPcore Technical Reference Manual, Figure 11-2 Performance Monitor Control Register bit assignments, p 240

#### The bits in the PMCR

#### Table 11-2 PMCR bit assignments (continued)

| Bits | Name | Function |
|------|------|----------|
| [4]  | X    | Export enable. This bit permits events to be exported to another debug device, such as a trace macrocell, over an event bus:<br>0: Export of events is disabled. This is the reset value.<br>1: Export of events is enabled.<br>This bit is read/write. |
| [3]  | D    | Clock divider:<br>0: When enabled, PMCCNTR counts every clock cycle. This is the reset value.<br>1: When enabled, PMCCNTR counts once every 64 clock cycles.<br>This bit is read/write. |
| [2]  | C    | Clock counter reset:<br>0: No action. This is the reset value.<br>1: Reset PMCCNTR to 0.<br>This bit is write-only, and always RAZ. |
| [1]  | P    | Event counter reset:<br>0: No action. This is the reset value.<br>1: Reset all event counters, not including PMCCNTR, to 0.<br>In Non-secure modes other than Hyp mode, writing a 1 to this bit does not reset event counters that the HDCR.HPMN field reserves for Hyp mode use. See *Hyp Debug Control Register*.<br>In Secure state and Hyp mode, writing a 1 to this bit resets all event counters.<br>This bit is write-only, and always RAZ. |
| [0]  | E    | Enable bit. Performance monitor overflow IRQs are only signaled when the enable bit is set to 1.<br>0: All counters, including PMCCNTR, are disabled. This is the reset value.<br>1: All counters are enabled.<br>This bit is read/write. |



# Configuring the PMCR register

#### We are almost there!

The encoding for the PMCR register is (see Table 11-1): c9, c12, 0

We now configure the PMCR by setting the E, P, C, and X bits. These are bits 0, 1, 2, and 4 in the PMCR register. This means we need a bitmask of 0b00010111 or 0x17.

#### Here is the code:

NB: For longer-running programs you probably also want to set the D bit, so that the cycle counter is incremented only once every 64 clock cycles!
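The bitmask arithmetic above can be sketched in C. The bit positions come from Table 11-2 and the `mcr` operands from Table 11-1; the names `PMCR_*`, `pmcr_mask` and `pmcr_write` are assumptions made for this illustration, not identifiers from the course sources. The write itself is only compiled on 32-bit ARM and only works in privileged mode.

```c
#include <stdint.h>

/* Bit positions as listed in Table 11-2 (names are illustrative). */
#define PMCR_E (1u << 0) /* enable all counters                      */
#define PMCR_P (1u << 1) /* reset all event counters                 */
#define PMCR_C (1u << 2) /* reset the cycle counter PMCCNTR          */
#define PMCR_D (1u << 3) /* clock divider: count every 64th cycle    */
#define PMCR_X (1u << 4) /* export events                            */

/* Build the PMCR value: E, P, C and X always, D only on request. */
static uint32_t pmcr_mask(int use_divider)
{
    uint32_t mask = PMCR_E | PMCR_P | PMCR_C | PMCR_X;
    if (use_divider)
        mask |= PMCR_D;
    return mask;
}

/* Transfer the value to the PMCR (encoding c9, c12, 0).
 * Must run in privileged mode, e.g. inside a kernel module. */
static void pmcr_write(uint32_t mask)
{
#if defined(__arm__)
    __asm__ __volatile__("mcr p15, 0, %0, c9, c12, 0" :: "r"(mask));
#else
    (void)mask; /* not an ARM target: nothing to do in this sketch */
#endif
}
```

With these definitions, `pmcr_mask(0)` gives the bitmask `0x17` derived above, and `pmcr_mask(1)` additionally sets the D bit.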

#### The PMUSERENR register

The PMUSERENR bit assignments are:



Bits[31:1] Reserved, UNK/SBZP.

EN, bit[0] User mode access enable bit. The possible values of this bit are:

- 0: User mode access to the Performance Monitors disabled.
- 1: User mode access to the Performance Monitors enabled.

Some MCR and MRC instruction accesses to the Performance Monitors are UNDEFINED in User mode when the EN bit is set to 0. For more information, see Access permissions on page C12-2330.

#### Accessing the PMUSERENR

To access the PMUSERENR, read or write the CP15 registers with `<opc1>` set to 0, `<CRn>` set to c9, `<CRm>` set to c14, and `<opc2>` set to 0. For example:

- `MRC p15, 0, <Rt>, c9, c14, 0` : Read PMUSERENR into Rt
- `MCR p15, 0, <Rt>, c9, c14, 0` : Write Rt to PMUSERENR

<sup>0</sup>See ARM Architecture Reference Manual Cortex-A7, Sec B6.1.81, PMUSERENR, Performance Monitors User Enable Register, p 1924

# Enabling access to the PMU

We can enable access to the PMU from "user space", i.e. from normal applications that are running outside the Linux "kernel space", by setting the lowest bit in the PMUSERENR:

```
mov r2, #0x01              @ store bitmask 0x01 in reg r2
mcr p15, 0, r2, c9, c14, 0 @ transfer r2 to PMUSERENR
```

The MCR instruction transfers a value in a register to the co-processor. To find the encoding of the PMUSERENR we look up Table 11-1: c9, c14, 0

<sup>0</sup>See also ARM Architecture Reference Manual Cortex-A7, Sec B5.8.2, Table B5-11: Summary of PMSA CP15 register descriptions, p 1796



## Enabling access to the PMU

#### We also need to configure the following registers

- PMCNTENSET: Count Enable Set Register<sup>1</sup>:
   Purpose: The PMCNTENSET register enables the Cycle Count Register, PMCCNTR, and any implemented event counters, PMNx.
   Reading this register shows which counters are enabled. This register is a Performance Monitors register.
- PMOVSR: Overflow Status Register<sup>2</sup>:
   Purpose: The PMOVSR holds the state of the overflow bits for:
  - the Cycle Count Register, PMCCNTR
  - each of the implemented event counters, PMNx.

   Software must write to this register to clear these bits. This register is a Performance Monitors register.

<sup>1</sup>See ARM Architecture Reference Manual Cortex-A7, Sec B6.1.74, p 1910

<sup>2</sup>See ARM Architecture Reference Manual Cortex-A7, Sec B6.1.78, p 1908



## Table 11-1: PMCNTENSET and PMOVSR registers

We now have to find the register encodings for PMCNTENSET and PMOVSR.

Table 11-1 PMU register summary (continued)

| Register<br>number | Offset      | CRn | Op1 | CRm | Op2 | Name       | Type | Description                                                                    |
|--------------------|-------------|-----|-----|-----|-----|------------|------|--------------------------------------------------------------------------------|
| 800                | 0xC80       | c9  | 0   | c12 | 3   | PMOVSR     | RW   | Overflow Flag Status Register, see<br>the ARM Architecture Reference<br>Manual |
| 801-807            | 0xC84-0xC9C | -   | -   | -   | -   | -          | -    | Reserved                                                                       |
| 768                | 0xC00       | c9  | 0   | c12 | 1   | PMCNTENSET | RW   | Count Enable Set Register, see the<br>ARM Architecture Reference Manual        |
| 769-775            | 0xC04-0xC1C | -   | -   | -   | -   | -          | -    | Reserved                                                                       |
| 776                | 0xC20       | c9  | 0   | c12 | 2   | PMCNTENCLR | RW   | Count Enable Clear Register, see the<br>ARM Architecture Reference Manual      |
| 777-783            | 0xC24-0xC3C | -   | -   | -   | -   | -          | -    | Reserved                                                                       |

<sup>2</sup>See either Cortex A7 MPcore Technical Reference Manual, Figure 11-2 Performance Monitor Control Register bit assignments, p 240, or ARM Architecture Reference Manual Cortex-A7, Sec B5.8.2, Table B5-11: Summary of PMSA CP15 register descriptions, p 1796



# Bits in PMCNTENSET and PMOVSR registers

The PMCNTENSET register enables the Cycle Count Register, PMCCNTR, and any implemented event counters, PMNx<sup>3</sup>



The PMOVSR holds the state of the overflow bit for: (i) the Cycle Count Register, PMCCNTR; (ii) each of the implemented event counters, PMNx.<sup>4</sup>

The PMOVSR bit assignments are:



<sup>3</sup>See ARM Architecture Reference Manual Cortex-A7, Sec B4.1.116, p 1676

<sup>4</sup>See ARM Architecture Reference Manual Cortex-A7, Sec B4.1.116, p 1685






# Enabling access to the PMU

#### Almost there!

Both registers hold bitmasks over the event counters, to enable them and to control overflow.

We want to turn on the bit for every counter.

We have 4 counters in total, so we need to set the 4 least significant bits: we need a bitmask of 0b1111 or 0x0f

Finally, here is the code to set the PMCNTENSET and PMOVSR registers:

```
mov r2, #0x0f              @ store bitmask 0x0f in reg r2
mcr p15, 0, r2, c9, c12, 1 @ transfer to PMCNTENSET
mov r2, #0x0f              @ store bitmask 0x0f in reg r2
mcr p15, 0, r2, c9, c12, 3 @ transfer to PMOVSR
```
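The mask `0x0f` covers the 4 event counters only. In both PMCNTENSET and PMOVSR, bit 31 belongs to the cycle counter PMCCNTR, which is why the complete kernel module in this tutorial writes `0x8000000f` instead. A minimal sketch of this computation, where the function name `counter_enable_mask` is an assumption made for illustration:

```c
#include <stdint.h>

/* Build a bitmask for PMCNTENSET or PMOVSR:
 * one bit per event counter (bits 0..n-1), plus bit 31 for PMCCNTR. */
static uint32_t counter_enable_mask(unsigned n_event_counters,
                                    int with_cycle_counter)
{
    /* Guard the shift: at most 31 event counter bits exist (bits 0..30). */
    uint32_t mask = (n_event_counters >= 31)
                        ? 0x7FFFFFFFu
                        : ((1u << n_event_counters) - 1u);
    if (with_cycle_counter)
        mask |= 1u << 31; /* bit 31: cycle counter PMCCNTR */
    return mask;
}
```

For the 4 counters of this PMU, `counter_enable_mask(4, 0)` reproduces the `0x0f` used above, and `counter_enable_mask(4, 1)` gives the value used in the kernel module.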

# Step 3: Defining what to monitor

- Now that the PMU is enabled we need to decide what we want to monitor
- The PMU contains one cycle counter register, which we can use without special configuration: PMCCNTR
- The PMU contains 4 configurable counter registers
- For each of these registers we need to specify an event type to monitor

#### Table 16-1: PMU monitor events

Table 16-1 Performance monitor events

| Number | Event counted                                                             |
|--------|---------------------------------------------------------------------------|
| 0x00   | Software increment of the Software Increment Register                     |
| 0x01   | Instruction fetch that causes a Level 1 instruction cache refill          |
| 0x02   | Instruction fetch that causes a Level 1 instruction TLB refill            |
| 0x03   | Data Read or Write operation that causes a Level 1 data cache refill      |
| 0x04   | Data Read or Write operation that causes a Level 1 data cache access      |
| 0x05   | Data Read or Write operation that causes a Level 1 data TLB refill        |
| 0x06   | Memory-reading instruction executed                                       |
| 0x07   | Memory-writing instruction executed                                       |
| 0x09   | Exception taken                                                           |
| 0x0A   | Exception return executed                                                 |
| 0x0B   | Instruction that writes to the Context ID register                        |
| 0x0C   | Software change of program counter                                        |
| 0x0D   | Immediate branch instruction executed                                     |
| 0x0F   | Unaligned load or store                                                   |
| 0x10   | Branch mispredicted or not predicted                                      |
| 0x11   | Cycle count; the register is incremented on every cycle                   |




#### Table 16-1: PMU monitor events (continued)

| Number    | Event counted                                           |
|-----------|---------------------------------------------------------|
| 0x12      | Predictable branch speculatively executed               |
| 0x13      | Data memory access                                      |
| 0x14      | Level 1 instruction cache access                        |
| 0x15      | Level 1 data cache write-back                           |
| 0x16      | Level 2 data cache access                               |
| 0x17      | Level 2 data cache refill                               |
| 0x18      | Level 2 data cache write-back                           |
| 0x19      | Bus access                                              |
| 0x1A      | Local memory error                                      |
| 0x1B      | Instruction speculatively executed                      |
| 0x1C      | Instruction write to TTBR                               |
| 0x1D      | Bus cycle                                               |
| 0x1E-0x3F | Reserved                                                |



# Defining what to monitor

#### We can define the events we want to monitor like this:

```
mov r2, #0x00              @ counter #0
mcr p15, 0, r2, c9, c12, 5 @ transfer to PMSELR
mov r2, #0x11              @ event type 0x11: cycle count
mcr p15, 0, r2, c9, c13, 1 @ transfer to PMXEVTYPER
```

The first 2 lines select counter no. 0 (`0x00`) as the counter we are configuring, by writing its number to the selection register PMSELR.

The next 2 lines specify that this counter should monitor event no. `0x11`: cycle count.
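The same two-step pattern (select a counter, then set its event type) can be wrapped in a small C function. This is a sketch only: the names `pmu_event_args_ok` and `pmu_select_event` and the two limit constants are assumptions made for illustration, and the `mcr` instructions are only compiled on 32-bit ARM, where they additionally require that access to the PMU has been enabled.

```c
#include <stdint.h>

#define PMU_NUM_EVENT_COUNTERS 4u /* the 4 configurable counters          */
#define PMU_EVENT_MAX 0x1Du       /* highest non-reserved event in the table */

/* Check that the counter index and event number are in range. */
static int pmu_event_args_ok(unsigned counter, unsigned event)
{
    return counter < PMU_NUM_EVENT_COUNTERS && event <= PMU_EVENT_MAX;
}

/* Configure one event counter; returns 0 on success, -1 on bad arguments. */
static int pmu_select_event(unsigned counter, unsigned event)
{
    if (!pmu_event_args_ok(counter, event))
        return -1;
#if defined(__arm__)
    /* select the counter via PMSELR (c9, c12, 5) */
    __asm__ __volatile__("mcr p15, 0, %0, c9, c12, 5" :: "r"(counter));
    /* set its event type via PMXEVTYPER (c9, c13, 1) */
    __asm__ __volatile__("mcr p15, 0, %0, c9, c13, 1" :: "r"(event));
#endif
    return 0;
}
```

A call such as `pmu_select_event(0, 0x11)` then corresponds to the four assembler lines above; other counters and events from Table 16-1 follow the same pattern.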






#### The complete kernel module

```
// 1. Enable "User Enable Register" (PMUSERENR)
asm volatile("mcr p15, 0, %0, c9, c14, 0\n\t" :: "r" (0x00000001));
// 2. Reset Performance Monitor Control Register (PMCR), Count
//    Enable Set Register, and Overflow Flag Status Register
asm volatile("mcr p15, 0, %0, c9, c12, 0\n\t" :: "r" (0x00000017));
asm volatile("mcr p15, 0, %0, c9, c12, 1\n\t" :: "r" (0x8000000f));
asm volatile("mcr p15, 0, %0, c9, c12, 3\n\t" :: "r" (0x8000000f));
// 3. Disable overflow interrupts by writing all 1s to the
//    Interrupt Enable Clear Register
asm volatile("mcr p15, 0, %0, c9, c14, 2\n\t" :: "r" (~0));
// 4. Read how many event counters exist
asm volatile("mrc p15, 0, %0, c9, c12, 0\n\t" : "=r" (v)); // Read PMCR
printk("pmon_init(): have %d configurable event counters.\n",
       (v >> 11) & 0x1f);
```
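The last step extracts the number of event counters from bits [15:11] of the PMCR. The sketch below applies the same shift-and-mask to a PMCR value and also reads two further fields. It assumes that the value `PMCR=41072011` printed by the sample runs in this tutorial is hexadecimal; the function names are made up for this illustration.

```c
#include <stdint.h>

/* N field, bits [15:11]: number of configurable event counters. */
static unsigned pmcr_num_counters(uint32_t pmcr)
{
    return (pmcr >> 11) & 0x1Fu;
}

/* Implementer field, bits [31:24]: 0x41 is ASCII 'A', the code for ARM. */
static unsigned pmcr_implementer(uint32_t pmcr)
{
    return (pmcr >> 24) & 0xFFu;
}

/* Control bits [4:0]: X, D, C, P, E as described in Table 11-2. */
static unsigned pmcr_control_bits(uint32_t pmcr)
{
    return pmcr & 0x1Fu;
}
```

For `0x41072011` this yields 4 event counters, matching the 4 counter registers described earlier. The control bits read back as `0x11` (E and X set) although `0x17` was written, which is consistent with the P and C bits being write-only and always reading as zero.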

#### Build the module

You first need to download the kernel sources. To build the module, get the sample sources from PMU\_pmuon and do this:

```
sudo make clean
sudo make
sudo insmod ./pmuon.ko
dmesg | tail
sudo rmmod pmuon
```




## Step 4: Use the PMU in a user program

First we define macros for assembler 1-liners, which reset all counters (by writing to PMCR) and read the counters from the PMU:

```
#define armv7_reset_counters \
        asm volatile("mcr p15, 0, %0, c9, c12, 0\n\t" :: "r" (0x00000017)) /* write to PMCR */
#define armv7_read_ccr( val ) \
        asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (val)) /* read PMCCNTR */
#define armv7_read_cr0( val ) \
        asm volatile("mcr p15, 0, %0, c9, c12, 5" :: "r" (0x00)); /* select counter #0 */ \
        asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r" (val)) /* read its value */
```


# Measuring a simple C loop

#### The core of our user program is a counting loop:

```
armv7_reset_counters;
armv7_read_ccr( before_ccr );
armv7_read_cr0( before_cr0 );
for (i=0; i<n; i++ ) /* nothing */; // code to measure
armv7_read_ccr( after_ccr );
armv7_read_cr0( after_cr0 );
```

## Example: running the measurement

```
> gcc -DCP15 -o rpi2-pmu01 rpi2-pmu01.c
> sudo ./rpi2-pmu01 10
Raspberry Pi 2 performance monitoring, using CP15 interface
The result is: 10
ccr: 338 (before: 0 after: 338) CYCLES
cr0: 338 (before: 6 after: 344) CYCLES
cr1: 12 (before: 0 after: 12) BRANCHES
cr2: 48 (before: 3 after: 51) CACHE HITS (Data read or write
   operation that causes a cache access at (at least) the
   lowest level of data or unified cache)
cr3: 32 (before: 0 after: 32) CACHE MISSES (Data read
   architecturally executed)
PMCR=41072011
Done.
```
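Each reported value is the difference between the counter readings after and before the measured code. A minimal sketch of that computation, assuming 32-bit counter registers; the helper names `counter_delta` and `cycles_from_count` are hypothetical and not taken from `rpi2-pmu01.c`:

```c
#include <stdint.h>

/* Difference of two readings of a 32-bit counter. Unsigned subtraction
 * wraps modulo 2^32, so a single counter overflow between the two
 * readings still gives the right result. */
static uint32_t counter_delta(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Convert a cycle-counter delta to clock cycles: with the D bit of the
 * PMCR set, the cycle counter only counts every 64th cycle. */
static uint64_t cycles_from_count(uint32_t delta, int divider_enabled)
{
    return divider_enabled ? (uint64_t)delta * 64u : (uint64_t)delta;
}
```

For the `cr0` line above, `counter_delta(6, 344)` gives the reported 338 cycles. With the clock divider enabled the result is only accurate to within 64 cycles, which is why the divider is mainly useful for longer-running programs.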



#### Measuring assembler code

This is an assembler version of the counting loop:

```
armv7_reset_counters;
armv7_read_ccr( before_ccr );
armv7_read_cr0( before_cr0 );
asm volatile (/* inline assembler version of a counting loop */
"measure_me_asm_%=:\n"
"\tMOV     R3, #0          @ initialise counter register\n"
"\tB       TEST%=          @ uncond. jump\n"
"LOOP%=:                   @ loop over counter R3\n"
"\tADD     R3, #1          @ increment counter\n"
"TEST%=:\tCMP     R3, %[n]        @ test end value\n"
"\tBLT     LOOP%=          @ continue while R3 < n\n"
"\tMOV     %[res], R3      @ done\n"
: [res] "=r" (i) : [n] "r" (n) : "r3", "cc");
armv7_read_ccr( after_ccr );
armv7_read_cr0( after_cr0 );
```

#### Output

```
> gcc -DCP15 -o rpi2-pmu01 rpi2-pmu01.c
> sudo ./rpi2-pmu01 10
Raspberry Pi 2 performance monitoring, using CP15 interface
The result is: 10
ccr: 249 (before: 0 after: 249) CYCLES
cr0: 249 (before: 6 after: 255) CYCLES
cr1: 12 (before: 0 after: 12) BRANCHES
cr2: 7 (before: 3 after: 10) CACHE HITS (Data read or write
   operation that causes a cache access at (at least) the
   lowest level of data or unified cache)
cr3: 1 (before: 0 after: 1) CACHE MISSES (Data read
   architecturally executed)
PMCR=41072011
Done.
```

NB: We get a precise runtime in machine cycles. Because we execute the loop 10 times (plus entry and exit), the branch counter shows 12. Most operations work on registers; only a few memory accesses are needed, and most of them can use the cache.

This variant of the counting loop has bad branch prediction:

```
armv7_reset_counters;
armv7_read_ccr( before_ccr );
armv7_read_cr0( before_cr0 );
asm volatile (/* inline assembler version of a counting loop
                 with bad branch prediction */
"measure_me_asm_%=:\n"
"\tMOV     R3, #0          @ initialise counter register\n"
"TEST%=:\tCMP     R3, %[n]        @ test end value\n"
"\tBGE     LEAVE%=         @ leave loop (BAD BRANCH PRED!)\n"
"\tADD     R3, #1          @ increment counter\n"
"\tB       TEST%=          @ unconditional jump\n"
"LEAVE%=:\tMOV     %[res], R3      @ done\n"
: [res] "=r" (i) : [n] "r" (n) : "r3", "cc");
armv7_read_ccr( after_ccr );
armv7_read_cr0( after_cr0 );
```

#### Output

```
> gcc -DCP15 -o rpi2-pmu01 rpi2-pmu01.c
> sudo ./rpi2-pmu01 10
Raspberry Pi 2 performance monitoring, using CP15 interface
The result is: 10
ccr: 116 (before: 0 after: 116) CYCLES
cr0: 116 (before: 6 after: 122) CYCLES
cr1: 21 (before: 0 after: 21) BRANCHES
cr2: 7 (before: 3 after: 10) CACHE HITS (Data read or write
   operation that causes a cache access at (at least) the
   lowest level of data or unified cache)
cr3: 1 (before: 0 after: 1) CACHE MISSES (Data read
   architecturally executed)
PMCR=41072011
Done.
```

NB: In this case we have 21 rather than 12 branches, for the same kind of counting loop; this is because each iteration resulted in a mis-predicted branch, which was partially executed by the processor-pipeline, but then had to be aborted.

# A larger user program: sum-and-average

Code example: sumav3\_asm\_pmu.c




## Summary

- The ARM Cortex-A7 has an on-chip co-processor for hardware performance monitoring (PMU)
- The PMU can be configured to count a range of low-level events, e.g. cycles, branches, cache hits
- The PMU needs to be enabled from within a kernel module, so that user space programs can access it
- Once configured, inline assembler instructions can be used to start/stop counting and read values
- The relevant assembler instructions are MCR and MRC, with a bespoke format for specifying registers on the CP15 co-processor (and on other on-chip co-processors)

