

Until recently, embedded processors were almost always simple lowest-cost devices (8-bit microcontrollers, etc.)



But this is changing!




AdvCompArch - Digital Signal Processors


© lenne 2003-05

#### **High-end Embedded Processors**

- Networking, wireless communication, printer and disk controllers, DVDs and video, digital photography, medical devices...
- Computing power ever growing
  - Sometimes products with 5-10 Digital Signal Processors on a single die (e.g., xDSL, VoIP)
  - Multimedia, cryptographic capabilities, adaptive signal processing, etc.




## General Purpose Processors Cost and Pricing Policy



## Specificities of Embedded Processors

- Cost used to be the only concern; now performance/cost is at a premium, but it is still not performance alone that matters, as in PCs (Intel model); performance is often a constraint
- Binary compatibility is less of an issue for embedded systems
- Systems-on-Chip make processor volume irrelevant (moderate motivation towards a single processor for all products)






#### General Purpose Processors Costs and Pricing Policy

Source: Microprocessor Report, MPR 2002



### **Cost and Performance**

#### Performance is a design constraint

- General Purpose: a new high-end processor must be tangibly faster than the best in class
- Embedded: when adequately programmed, the processor must just satisfy some minimal performance constraints

#### Cost is the optimisation criterion

- General Purpose: must be within some typical range (e.g., 50-120 USD) → profit margin can be as high as some factor (2-3x)
- Embedded: must be minimal → economic margin on the whole product can be as low as a few percentage points




## **Types of Embedded Processors**

#### Microcontrollers

- Relatively slow, microprogrammed, CISC processors
- Typically derivatives of old or very old general purpose processor families (68k, 8051, 6502, etc.)

#### RISC Processors

- Pipelined, relatively simple RISCs, often with special architectural features for the embedded market
- Typical representatives: ARM7 and ARM9

#### Digital Signal Processors

- Special family of processors with peculiar architectures for arithmetic intensive, signal processing applications
- Typical representatives: TI C620, DSP56k, etc.

#### Multimedia Processors...




### Cost Is Not Just the Processor...

- Processors have tangible induced costs
- Some could require:
  - Larger memories
  - More expensive memories (e.g., dual-port)
  - Caches (I and/or D)
  - Peripherals and accelerators
  - Faster clock rate
  - ...
- System cost can be strongly influenced by the architecture of the processor




## **Completely Different Benchmarks**

#### General Purpose (Word, PowerPoint, gcc,...)

- SPEC → Commercial
  - Scientific computing
  - Regular and irregular typical user applications...

#### DSPs (IIR, FIR, FFT,...)

- DSPstone (Aachen Uni) → Academic
- EEMBC (pronounced "embassy") → Commercial
  - IIR, FIR, IDCT, FFT, IFFT, PWM,...
  - Matrix arithmetic, bit manipulation, table lookup, interpolation,...
  - JPEG, RGB-to-CMYK, RGB-to-YIQ, Bezier curves, rotations,...
  - Viterbi, autocorrelation, convolutional encoders,...



### **Trends in Computing?**



## **Typical Features of DSPs**

- Arithmetic and Datapath
  - Fixed-point arithmetic support
  - MAC = multiply-accumulate instruction
  - Special registers, not directly accessible
- Memory Architecture
  - Harvard architecture
  - Multiple data memories
- Addressing Modes
  - Special address generators
  - Bit-reversed addressing
  - Circular buffers
- Optimised Control
  - Zero-overhead loops
- Special-purpose peripherals...



### **Pressure on the Compilers**

#### Performance

- Squeeze out every possible MIPS of performance from irregular architectures

#### Code Size

- Memory is a key cost factor in embedded systems, *much* more than in general purpose systems

#### Power Consumption

- Important metric in embedded systems, hardly of any relevance in general purpose computing (i.e., not even considered by compilers)






## DSP Arithmetic: Fixed-Point Vs. Floating-Point

- Typical example of embedded processor economics: much more complexity in designing the algorithm (NRE cost) and in programming to get much less complexity in the hardware (mfg. cost)
- A floating-point DSP costs ~2-4x as much as a fixed-point DSP and is much slower...
- Automatic tool support is still very poor → decisions are taken by algorithm analysis, simulation, and compliance tests (e.g., accumulated error over a test set below some value)
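
A compliance test of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the Q15 format (15 fractional bits in a 16-bit word), the function names and the choice of accumulated absolute error as the metric are all hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed format: Q15, i.e. a 16-bit word with 15 fractional bits. */

/* Convert a real value to Q15, rounding to nearest and saturating. */
int16_t q15_from_double(double x)
{
    double scaled = x * 32768.0 + (x >= 0.0 ? 0.5 : -0.5);
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;              /* conversion truncates toward zero */
}

double q15_to_double(int16_t q)
{
    return (double)q / 32768.0;
}

/* Accumulated absolute error of fixed-point results against a
 * floating-point reference over a whole test set. */
double accumulated_error(const double *ref, const int16_t *fixed, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = ref[i] - q15_to_double(fixed[i]);
        total += (err < 0.0) ? -err : err;
    }
    return total;
}

/* Returns 1 if the accumulated error stays below the bound, 0 otherwise. */
int passes_compliance(const double *ref, const int16_t *fixed, size_t n,
                      double bound)
{
    assert(bound >= 0.0);
    return accumulated_error(ref, fixed, n) < bound;
}
```

In practice `fixed` would hold the outputs of the fixed-point implementation under test and `ref` those of a floating-point model run on the same inputs.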



#### **Fixed Point**

In principle, if one adds a fractional point in a fixed position, hardware for integers works just as well and there are no additional ISA needs

|                | Integer interpretation            | Fixed-point interpretation                |
|----------------|-----------------------------------|-------------------------------------------|
| First operand  | $01001_2 \rightarrow 9_{10}$      | $01.001_2 \rightarrow 1.125_{10}$         |
| Second operand | $+\ 00011_2 \rightarrow 3_{10}$   | $+\ 00.011_2 \rightarrow 0.375_{10}$      |
| Sum            | $01100_2 \rightarrow 12_{10}$     | $01.100_2 \rightarrow 1.500_{10}$         |
| Bit weights    | $2^4\ 2^3\ 2^2\ 2^1\ 2^0$         | $2^1\ 2^0\ 2^{-1}\ 2^{-2}\ 2^{-3}$        |

It's just a matter of representation! (I.e., an implicit constant multiplicative coefficient, here $2^{-3}$)
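
The addition example can be reproduced with plain integer arithmetic. The sketch below assumes 3 fractional bits, as in the example; the helper names are hypothetical and the conversion only handles values that are exactly representable.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 3   /* assumed format: 3 fractional bits, as in the example */

/* Real value -> fixed point: multiply by the implicit scale 2^FRAC_BITS. */
int32_t fx_from_double(double x)
{
    double scaled = x * (double)(1 << FRAC_BITS);
    int32_t q = (int32_t)scaled;
    assert((double)q == scaled);   /* sketch: exactly representable values only */
    return q;
}

/* Fixed point -> real value: divide by the implicit scale. */
double fx_to_double(int32_t q)
{
    return (double)q / (double)(1 << FRAC_BITS);
}

/* Addition is the ordinary integer addition: no special hardware or
 * instruction is needed as long as both operands share the same scale. */
int32_t fx_add(int32_t a, int32_t b)
{
    return a + b;
}
```

With these helpers, 1.125 and 0.375 are stored as the integers 9 and 3, their integer sum is 12, and 12 read back with the same scale is 1.5.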




## **Different Approximation Choices**

#### Truncate: Discard bits → Large bias

- 00.011 → 00 and 01.011 → 01
- 00.100 → 00 and 01.100 → 01
- 00.101 → 00 and 01.101 → 01

#### Round: <.5 round down, >=.5 round up → Small bias

- 00.011 → 00 and 01.011 → 01
- 00.100 → 01 and 01.100 → 10
- 00.101 → 01 and 01.101 → 10

#### Convergent Round: =.5 round to nearest even → No bias

- 00.011 → 00 and 01.011 → 01
- 00.100 → 00 and 01.100 → 10
- 00.101 → 01 and 01.101 → 10
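
The three approximation choices can be sketched in C as follows, assuming unsigned values with 3 fractional bits that are reduced to an integer, as in the examples above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 3                      /* assumed: 3 fractional bits */
#define HALF      (1u << (FRAC_BITS - 1))
#define FRAC_MASK ((1u << FRAC_BITS) - 1u)

/* Truncate: simply discard the fractional bits. */
uint32_t fx_truncate(uint32_t q)
{
    return q >> FRAC_BITS;
}

/* Round: add one half, then discard the fractional bits (ties go up). */
uint32_t fx_round(uint32_t q)
{
    assert(q <= UINT32_MAX - HALF);      /* avoid wrap-around in the addition */
    return (q + HALF) >> FRAC_BITS;
}

/* Convergent round: like rounding, but an exact tie goes to the even value. */
uint32_t fx_round_convergent(uint32_t q)
{
    uint32_t whole = q >> FRAC_BITS;
    uint32_t frac  = q & FRAC_MASK;
    if (frac > HALF) return whole + 1u;
    if (frac < HALF) return whole;
    return whole + (whole & 1u);         /* tie: round up only if whole is odd */
}
```

For instance, 00.100 (the integer 4) gives 0, 1 and 0 respectively, while 01.100 (the integer 12) gives 1, 2 and 2.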

### **Fixed Point Multiplication**

Multiplication typically introduces the need of arithmetic rescaling with shifts to the right (multiplicative constant cannot be implicit anymore) → Choice of accuracy depending on how many bits one can keep...

| Step                        | Binary               | Decimal          |
|-----------------------------|----------------------|------------------|
| Multiplicand                | $0.1010_2$           | $0.625_{10}$     |
| Multiplier                  | $\times\ 0.0110_2$   | $0.375_{10}$     |
| Full product                | $0.00111100_2$       | $0.234375_{10}$  |
| → Shift right by 1          | $0.0011110_2$        | $0.234375_{10}$  |
| → Shift right by 2          | $0.001111_2$         | $0.234375_{10}$  |
| → Shift right by 3          | $0.00111_2$          | $0.21875_{10}$   |
| → Shift right by 4          | $0.0011_2$           | $0.1875_{10}$    |
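
The rescaling step can be sketched as follows, assuming unsigned operands that share the same number of fractional bits and a result truncated to a chosen number of fractional bits. The function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two fixed-point values with in_frac fractional bits each.
 * The full product has 2*in_frac fractional bits; it is shifted right
 * (truncated) so that only out_frac fractional bits are kept. */
uint32_t fx_mul(uint32_t a, uint32_t b, unsigned in_frac, unsigned out_frac)
{
    assert(in_frac <= 16u);
    assert(out_frac <= 2u * in_frac);
    uint64_t full = (uint64_t)a * (uint64_t)b;     /* exact full product */
    return (uint32_t)(full >> (2u * in_frac - out_frac));
}
```

With 4 fractional bits, 0.625 is stored as 10 and 0.375 as 6. Keeping all 8 fractional bits of the product gives 60 (0.234375); keeping only 5 gives 7 (0.21875) and keeping 4 gives 3 (0.1875), as in the table.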

## Fixed-Point Programming Example

    /* an excerpt from adpcm.c */
    /* adpcm_coder, mediabench */
    /* Step 2 - Divide and clamp */
    /*
    ** This code *approximately* computes:
    **    delta = diff*4/step;
    **    vpdiff = (delta+0.5)*step/4;
    ** but in shift step bits are dropped. The net result of this is
    ** that even if you have fast mul/div hardware you cannot put it
    ** into good use since the fixup would be too expensive.
    */
    delta = 0;
    vpdiff = (step >> 3);                  /* the 0.5*step/4 term */
    if ( diff >= step ) { delta = 4; diff -= step; vpdiff += step; }
    step >>= 1;                            /* now compare against step/2 */
    if ( diff >= step ) { delta |= 2; diff -= step; vpdiff += step; }
    step >>= 1;                            /* now compare against step/4 */
    if ( diff >= step ) { delta |= 1; vpdiff += step; }






#### **DSP Arithmetic Needs**

- Rather than having full floating-point support (expensive and slow), one wants in a DSP some simple and fast ad-hoc operations:
  - MUL + ADD in a single cycle (MAC)
  - Accumulation register after MAC (precision?)
  - Approximation mechanisms
- Nonuniform precision in the whole architecture (e.g., 24-bit × 24-bit + 56-bit)
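
The nonuniform precision can be modelled in C as below. This is a behavioural sketch only, assuming signed 24-bit operands and a signed 56-bit accumulator held in a 64-bit variable; the names are hypothetical and no particular DSP is modelled cycle by cycle. A 24-bit × 24-bit product needs up to 48 bits, so a 56-bit accumulator leaves 8 extra bits of headroom for hundreds of accumulations before overflow.

```c
#include <assert.h>
#include <stdint.h>

#define OPERAND_BITS 24   /* assumed operand width */
#define ACC_BITS     56   /* assumed accumulator width */

/* 1 if v is representable as a signed two's complement number of `bits` bits. */
static int fits_signed(int64_t v, unsigned bits)
{
    int64_t limit = (int64_t)1 << (bits - 1);
    return v >= -limit && v < limit;
}

/* Multiply-accumulate: returns acc + x*y, checking the assumed widths. */
int64_t mac(int64_t acc, int32_t x, int32_t y)
{
    assert(fits_signed(x, OPERAND_BITS) && fits_signed(y, OPERAND_BITS));
    int64_t sum = acc + (int64_t)x * (int64_t)y;
    assert(fits_signed(sum, ACC_BITS));   /* accumulator must not overflow */
    return sum;
}

/* Bring an accumulator value back to operand width by saturation. */
int32_t saturate24(int64_t v)
{
    const int64_t max = ((int64_t)1 << (OPERAND_BITS - 1)) - 1;
    const int64_t min = -((int64_t)1 << (OPERAND_BITS - 1));
    if (v > max) return (int32_t)max;
    if (v < min) return (int32_t)min;
    return (int32_t)v;
}
```

Saturation is one possible way to narrow the result; real DSPs typically combine it with one of the approximation mechanisms (truncation or rounding) when dropping low-order bits.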






## **Slower Clock Speed but 1-Cycle Multiply-Accumulate Instruction**

MAC operations tend to dominate DSP code (maybe 50% of critical code) → highly optimised MAC instruction



## **Example of Pipelined MAC Datapath**



### **Classic FIR Example**

#### Convolution:
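
The formula of the original slide is not available here. Assuming the usual N-tap form, $y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$, each output sample is a chain of N multiply-accumulate operations. A minimal C sketch follows; the function name and the layout of the history buffer (newest sample first) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One output sample of an N-tap FIR filter.
 * h[k]    : filter coefficients, k = 0 .. ntaps-1
 * hist[k] : input history with hist[k] = x[n-k], i.e. newest sample first
 * The accumulator is wider than the operands, as in a DSP MAC unit. */
int64_t fir_sample(const int16_t *h, const int16_t *hist, size_t ntaps)
{
    assert(ntaps > 0);
    int64_t acc = 0;
    for (size_t k = 0; k < ntaps; k++) {
        /* one MAC per tap: two operand reads, one multiply, one add */
        acc += (int32_t)h[k] * (int32_t)hist[k];
    }
    return acc;
}
```

Each iteration needs two data operands, which is why memory bandwidth and address generation matter as much as the MAC unit itself.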



## **Multiple Memory Ports**

#### Harvard architecture:

- Separate instruction memory
- I-Memory at times accessible as another D-Memory (e.g., TI C2000) to spare memory ports

#### Multiple data memories:

- X-Memory
- Y-Memory
- Sometimes more...

#### Multiple buses



### **Memory Bandwidth**

The MAC instruction/unit is not enough...



### **RISC vs. DSP Organisation**

#### RISC:

- Von Neumann (Harvard but hidden from the user)
- ~1 access/cycle
- Heavily relies on caches to achieve performance
- Complex blend of on-chip SRAM/SRAM/DRAM

#### DSP:

- Harvard (architecturally visible)
- 1-4 memory accesses per cycle
- No caches
- SRAM



#### **Caches and DSPs**

- Importance of real-time constraints: no data caches...
- Sometimes caches on the instruction memory, but determinism is key in DSPs:
  - Caches under programmer control to "lock-in" some critical instructions
  - Turn caches into fast program memory

Once again, one is not after **highest performance** but just the **guaranteed minimal performance** one needs




## DSP vs. General Purpose Memory Systems



## Example Motorola DSP56600






#### **Addressing Modes**

- To keep MAC busy all the time, with new data from memory, one needs to generate memory addresses
- Forget about Load/Store architectures
- Complex addressing is now fully welcome if
  - Allows automatic next address calculation
  - Does not require usage of the datapath (MAC is busy...)



    MPYF3 *AR0++%, *AR1++%, R0
    ADDF3 R0, R2, R2




## **Typical Addressing Modes**

AR can be loaded with:

- Immediate load: constant from the instruction field loaded into the pointed AR
- Immediate modify: constant from the instruction field added to the pointed AR
- Autoincrement: small constant (typ. 1 and/or 2) added to the pointed AR
- **Automodify**: value of the pointed MR added to the pointed AR
- Bit Reversing: value of the pointed AR bit-reversed and loaded into the pointed AR
- **Modulo/Circular**: autoincrement/automodify with modulo
- Also decrement/subtract
- Sometimes pre- and/or post-modification
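
Bit-reversed addressing is mainly useful for radix-2 FFTs, which consume or produce their data in bit-reversed index order. What the address generator computes for free in hardware can be sketched in software as follows; the function name is hypothetical and the loop is only a functional model, not how the hardware does it.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of index, e.g. for bits = 3:
 * 001 -> 100, 011 -> 110, 110 -> 011. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    assert(bits > 0 && bits <= 16);
    assert(index < (1u << bits));
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move lowest bit over */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point FFT (3 index bits), the samples are therefore visited in the order 0, 4, 2, 6, 1, 5, 3, 7. On a processor without this addressing mode, the loop above (or a lookup table) costs datapath cycles for every access.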





#### **Address Generation Units**

Dedicated simple datapaths to generate meaningful sequences of addresses—usually 2-4 per DSP



### Radix-2 FFT



#### **Circular Buffers**

DSPs deal with continuous I/O flows, often organised in circular buffers
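
A circular buffer accessed with modulo post-increment can be modelled in C as below. The structure and function names are hypothetical; on a DSP the wrap-around is performed by the address generation unit at no cost to the datapath, whereas here it is an explicit comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an address register with circular addressing. */
typedef struct {
    int16_t *base;    /* start of the buffer */
    size_t   length;  /* number of elements (the modulo) */
    size_t   index;   /* current position, always < length */
} circ_ptr;

/* Read the current element, then post-increment with wrap-around. */
int16_t circ_read_postinc(circ_ptr *p)
{
    assert(p->length > 0 && p->index < p->length);
    int16_t value = p->base[p->index];
    p->index = (p->index + 1 == p->length) ? 0 : p->index + 1;
    return value;
}

/* Write at the current position, then post-increment with wrap-around.
 * The oldest sample is overwritten once the buffer is full. */
void circ_write_postinc(circ_ptr *p, int16_t value)
{
    assert(p->length > 0 && p->index < p->length);
    p->base[p->index] = value;
    p->index = (p->index + 1 == p->length) ? 0 : p->index + 1;
}
```

A typical use is a delay line for a filter: each new input sample overwrites the oldest one, and the read pointer then sweeps once around the buffer to compute the output.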






## **Repeat/Loop Instructions**

#### For loops made of a single instruction:

    RPTS  N-1                      ; repeat next
    MPYF3 *AR0++%, *AR1++%, R0
    ADDF3 R0, R2, R2

#### Zero-overhead Loop instruction:

Configures the Program Control Unit to generate the appropriate next address depending on a condition (e.g., autodecrement of an AR)

#### **Remove Control Bottlenecks**

- Remember typical goal: FIR with MAC busy 100% of the time...
- DSP code made essentially of tight loops, often with a statically determined number of iterations (coefficients of a filter, etc.)
- How can one make the branches "cost nothing"?
  - Repeat instructions
  - Zero-overhead loops




## **DSP World Is Slowly Changing**

In a sense, DSPs already have the main features of VLIWs: explicit parallelism, static scheduling, no "dynamic", poorly predictable behaviour...

→ Convergence?






### **TI TMS320C64x**



## Direct Carmel Translation for G.723.1 DC Filter



#### **Infineon Carmel**



#### Optimal Carmel Code for G.723.1 DC Filter



#### Summary

- DSPs are very different from general-purpose computers
  - Dedicated to embedded applications
  - Cost and power consumption come into the picture (and cost is fundamental)
- Relatively narrow variety of applications
  - More application specialisation possible
- Development cost (programming) relatively irrelevant when compared to per-unit cost
  - The most awkward and hard-to-program solutions are ok if they bring enough savings
  - Compilers? Useful for 90% of the code, but the rest...




