Introduction

The Debug Task Group is discussing two competing debug proposals. The main difference between the specs is the core debug mechanic. One proposal revolves around having the core execute instructions to implement all debugging. This is known as Instruction design. (Historically, this has also been referred to as SiFive/Tim Newsome/Full-Featured/one on the right.)

The other proposal defines a slightly more abstract interface where logically GPRs and CSRs are accessed directly on a Debug Bus. This is known as the Direct design. (Historically, this has also been referred to as ROA Logic/Richard Herveille/Lightweight/one on the left.)

If you’d like to contribute, the git repository is at https://github.com/sifive/debug-mechanism-comparison

Background

Definitions

Run control

Halt, step, and resume a hart. Set breakpoints. Access GPRs, CSRs, other registers, and memory.

Self-hosted debug

Debugging where software running on the target system is the thing that allows run control of the thing you’re debugging. Eg. running gdb on Linux, and gdbserver on a bare metal target.

External debug

Debugging a system without making any assumptions about the software running on that system.

Debug monitor

Software run on a bare metal system that allows for self-hosted debug.

JTAG

De facto standard transport for external debug, but really developed for post-manufacture test. It’s a simple 4-wire serial interface.

Trace

Have a core emit what it’s doing in real time.

Goals

A debug spec should allow a user to connect a debugger running on their computer to a target system. The emphasis is on features that support debugging systems that can’t or won’t support self-hosted debug. The user should have full access to the target’s resources, with the ability to download and execute code, set breakpoints, access registers, and access memory.

The spec must describe how debug hardware behaves in such a way that software written to debug one RISC-V system will also work to debug another. It must also support multiple transports, and not be limited to JTAG.

Metrics

The following is a list of metrics that should be considered when evaluating a debug solution. These are in rough order from most important to least important, but not everybody will agree with any given ranking:

Ease of hardware implementation

How easy it to implement the spec in hardware?

Ease of hardware verification

How easy it to verify the hardware. The smaller and more self-contained each component is, the better.

Hardware cost

How many gates does the implementation require?

Scalability

How well does the design scale across all RISC-V core options, including number of harts, configurations (RV32E—RV128), standard and custom extensions, and compressed and extended instruction sets.

Security

What are the security implications?

Speed of debugging

How much (wall) time does it take to perform basic debug operations?

Ease of software implementation

How easy is it to write a debugger that communicates with the hardware?

Quick access

What is the impact on a running system if the debugger needs to access a single word in memory, or a single CSR? This could be used to eg. sample a counter or set a breakpoint on a system that cannot safely be halted for any length of time.

Usefulness for self-hosted debug

Are the hardware features added for external debug useful to help software on the core debug itself?

Usefulness for hardware bringup

How useful is the interface when bringing up hardware? This is not a requirement per the goals, but certainly a nice-to-have.

Ease of software verification

How easy is it to show that software will always do the right thing?

Instruction Design

Overview

This design is focused on the debugger feeding a halted hart arbitrary instructions, while keeping gate count to a minimum. This is necessary functionality, and once it exists it can be used to accomplish all debug tasks. To speed up certain repetitive operations, an optional Instruction Buffer can be implemented. This is designed so that debug software can easily take advantage of it if it exists, but doesn’t need separate code paths if it doesn’t.

Instruction Design Overview

When a hart halts (because it’s interrupted by the debugger or hits a breakpoint), it starts fetching instructions from the DM. The DM communicates this to the debugger, which feeds instructions. Data is exchanged by the hart writing to special CSRs or using a memory-mapped interface. Below is an example of what the debugger does to read memory.

Table 1. Read from memory
Using CSRs Using memory-mapped interface
  1. Push csrr s0, daccess32

  2. Push address to read from

  3. Push lw s0, 0(s0)

  4. Push csrw s0, daccess32

  5. Read data the hart wrote to the DM

  1. Push lw s0, 0x400(zero)

  2. Push address to read from

  3. Push lw s0, 0(s0)

  4. Push sw s0, 0x400(s0)

  5. Read data the hart wrote to the DM

These operations can all be combined in a single transaction from debugger to USB JTAG adapter, because of the way that errors are sticky when one occurs. When all operations are done the debugger can check for errors, and retry with longer delays if necessary.

The optional Instruction Buffer and Scratch RAM can be used to set up this action without any interaction (steps 2 and 5 use Scratch RAM). This enables a debugger to perform actions while halting the core just briefly.

The Instruction Buffer is also important in achieving fast download speeds.

Rationale

The debugger should be able to access anything that software can access. Instead of adding extra data path logic to make that happen, the simplest way is to allow the debugger to execute instructions on the core. That’s sufficient for all debugging, including custom/future extensions.

Direct Design

Rationale

One of the beauties of the RISC-V spec is that it’s implementation independent. It defines the feature set and the requirements, but leaves the implementation up to the user (implementer). The debug specification should follow the same route. It should describe the base and extensions registers and leave the actual implementation up to the user.

There are many ways to implement this specification, e.g. using a debug RAM/ROM, stalling the entire execution pipeline (for tiny implementations), or stalling the current hart while keeping the execution units running in a multi- threaded implementation.

Considerations

The main considerations while writing the Lightweight spec:

  1. The specification should be interface agnostic.

  2. It should use a very simple interface

  3. It should have minimal implications on the implementation of the actual CPU.

  4. Reset, Halt, Single Step, Hardware Breakpoints, Address breakpoints, Interrupt catching should all be supported.

Comparison

In each of the tables below, Memory-Mapped is on the left, and Instruction is on the right.

Table 2. Ease of hardware implementation

Simple for simple designs, where the debugger reaches directly into the GPR and CSR files. When designs get more complex (out-of-order execution, register renaming), PC/GPRs may not exist in a single place. Implementing this interface in terms of instruction feeding is one option to deal with that.

Simple for simple designs. The DM can reach directly into the core to push instructions and exchange data. More complex for more complex designs. In that case the simple implementation is to have the DM be memory-mapped, so the processor can fetch instructions and perform accesses as usual even when halted.

Table 3. Ease of hardware verification

In an implementation where the register files are accessed directly, verification could be a significant burden.

Requires relatively few changes to the CPU, which is the hardest part of the design to verify.

Table 4. Hardware cost

Per hart: 629 LUTs and 264 DFF. This is for a full featured debug unit, with 3 hardware breakpoints

JTAG/Debug interface: 1180LUTs and 815DFFs. This is not really a fair comparison as my jtag/debug interface handles burst mode accesses (it can read an entire address range in one burst), handles multiple harts, and has a JSP (emulated serial port).

SiFive Coreplex, 0.11 spec

3,900 gates plus 600 gates per hart (RV32, no triggers)

SiFive FE310 silicon, 0.11 spec

46,935 square µm, or 1.17% of the total area of the SoC.

IQ-Analog NanoRisc5, 0.11 spec

628 flops plus 1479 LUTs for one RV32, no triggers

Numbers include everything from JTAG TAP down to core changes.

Table 5. Scalability

If core registers are added (eg. vector), debug address space needs to be reserved, and hardware needs to be taught how to access the registers.

The spec can be easily be extended to RV32/RV64/RV128.

As long as any new state that’s added can be moved into GPRs, then the debug hardware does not need to be modified.

The spec already takes RV32/RV64/RV128 into account.

Table 6. Security

No clear security impact.

No clear security impact. The spec does mention options for authenticating a debug user. The fact that memory accesses go through the core is helpful if a designer wants to allow eg. U-mode debugging but not M-mode debugging.

Table 7. Speed of debugging

Should be equivalent.

Download speed reported by gdb using 10MHz TCK, talking USB from inside a VM to an Olimex USB-JTAG adapter:
Start address 0x80000000, load size 16100
Transfer rate: 121 KB/sec, 4025 bytes/write.

Table 8. Ease of software implementation

Should be easy.

Basic debugging is easy. Making the debugger perform well will require some careful coding. [1]

Table 9. Quick access

Can be implemented in hardware, but will require a second port to register files, or some way to stall the processor while a value is accessed.

With optional Instruction Buffer and Scratch RAM in the Debug Module, a debugger can set up arbitrary short instruction sequences to be execute as fast as the processor can manage.

Table 10. Usefulness for self-hosted debug

A hart can access its own debug resources if it has access to the debug bus, but there doesn’t seem to be a way to trap instead of stall. Once stalled the hart cannot revive itself.

Software breakpoints and hardware triggers can be used. Needs work for single step.

Table 11. Usefulness for hardware bringup

In direct implementations, it’s likely still possible to access GPRs and CSRs even when there are other issues with the core.

If the core cannot execute instructions, it’s not possible to debug it in any useful fashion.

Table 12. Ease of software verification

Should be a bit easier, since the software needs to be less smart.

More complex since the software contains more of the debug logic.


1. The simple implementation involves one round trip to the debug device (USB-JTAG adapter) per scan, which tends to have a lot of overhead. Multiple scans are required to perform almost any operation: at least 1 for an instruction, and at least 1 for data. The only way to make the debugger responsive is to combine multiple JTAG scans into a single request to the debug device. This isn’t rocket science but it does require some thought.