# T800, THOR and SPARC, a performance analysis

Roger Johansson Department of Computer Engineering Chalmers University of Technology Göteborg

October 30, 1990

This document discusses six computer designs based on the T800 Transputer, the SAAB Thor and the Cypress SPARC microprocessors. The purpose is to evaluate hardware aspects of the three processors in two different configurations, a minimum configuration and a maximum configuration. The "paper designs" are intended to give an estimate of:

- maximum possible instruction execution rate
- required number of devices
- area of printed circuit board
- power consumption
- failure rate

The following tables summarise the results:

#### SMALL CONFIGURATION

|  | T800 | THOR | SPARC |
|---|---|---|---|
| Mixed instruction execution rate (MmixedIPS) | 5.0 | 7.8 | 7.5 |
| Number of required devices | 32 | 24 | 31 |
| Total area for devices (mm2) | 10307 | 7844 | 12134 |
| Total PCB area (mm2) | 11500 | 9000 | 13500 |
| Total power requirement (mW) | 6605 | 7770 | 11914 |
| Failure intensity (FITS) | 3079 | 2320 | 3453 |

#### MAXIMUM CONFIGURATION

|  | T800 | THOR | SPARC |
|---|---|---|---|
| Mixed instruction execution rate (MmixedIPS) | 8.5 | 10 | 23 |
| Number of required devices | 21 | 19 | 23 |
| Total area for devices (mm2) | 7730 | 8289 | 12785 |
| Total PCB area (mm2) | 8500 | 9100 | 14100 |
| Total power requirement (mW) | 26114 | 26020 | 36190 |
| Failure intensity (FITS) | 119576 | 104767 | 169453 |

#### 1 General notes on the designs

The three designs are intended to be comparable. In the schematics, readability is emphasised. The diagrams are not intended to be complete but rather focus on devices with a major impact on the performance of the configuration. For each design, a description of a memory read cycle is given. Only read/write memory is included in the designs, on the assumption that read-only memories incur no read cycle time penalty. The T800 and SPARC designs both utilise an "error detection and correction unit" (EDAC). The introduced delay (36 ns, worst case for the EDAC in use) is inserted by the EDAC control and ensures that the memory "Ready" signal is not asserted until correct data is guaranteed.

#### 2 The Minimum configurations

Special requirements for the minimum configuration are:

- microprocessor with 256kB primary memory
- only space qualified components
- low power consumption
- small printed circuit board area

The small configuration design consists of:

- cpu
- 256 kB of static random access memory
- real time clock

#### 3 Execution Rate Estimation

The instruction mix is made up from:

- $x_1 = \text{percentage arithmetical/logical instructions}$
- $x_2 = \text{percentage jump/branch instructions}$
- $x_3 = \text{percentage load/store instructions}$
- $x_4 = \text{percentage floating-point instructions}$

As a consequence:

$$x_1 + x_2 + x_3 + x_4 = 1$$

for a large number of executed instructions. Other parameters are:

- $X_1$ , the number of processor cycles required to execute an arithmetical/logical instruction
- $X_2$ , the number of processor cycles required to execute a jump/branch instruction, composed as  $X_2 = 0.1X_{21} + 0.9X_{22}$  where
  - $X_{21}$  is the number of processor cycles required for a "branch not taken" instruction
  - $X_{22}$  is the number of processor cycles required for a "branch taken" instruction
- $X_3$ , denotes the number of processor cycles required to execute a load/store instruction. For simplicity these are considered equal in this sense.
- $X_4$ , denotes the number of processor cycles required for the execution of a floating point instruction.

These parameters are estimated averages for each group. Data is obtained from the manufacturers' documentation.

- W denotes the number of wait states required for a read bus cycle, determined by the system configuration.
- Y describes the instruction fetch rate assuming a uniform instruction stream. The data bus width is assumed to be 32 bits.

Since instruction fetch and execution are performed simultaneously (assuming a pipelined architecture) we write:

$$Z_1 = \max[X_1, Y(W)]$$
$$Z_2 = \max[X_2, Y(W)]$$
$$Z_3 = X_3 + W$$
$$Z_4 = \max[X_4, Y(W)]$$

We obtain an expression for the Execution Rate Estimation, ERE:

$$ERE = Z_1 x_1 + Z_2 x_2 + Z_3 x_3 + Z_4 x_4 \quad \text{(cycles)}$$

Since the fractions $x_1 \ldots x_4$ are normalised, ERE denotes the average number of cycles required to execute one instruction. Including the cycle time CT in seconds, we arrive at a final expression for the execution rate:

$$ER = \frac{1}{ERE \cdot CT} \quad \frac{\text{instructions}}{\text{second}}$$
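The model above can be written down as a minimal Python sketch. The function and parameter names are illustrative assumptions, not part of the original report; the example values are the THOR small-configuration figures and the instruction mix given later in this document.

```python
def stage_cycles(x_exec, fetch_cycles, wait_states):
    """Return (Z1, Z2, Z3, Z4) from X = (X1..X4), Y(W) and W."""
    x1, x2, x3, x4 = x_exec
    return (
        max(x1, fetch_cycles),  # Z1: arithmetical/logical, overlapped with fetch
        max(x2, fetch_cycles),  # Z2: jump/branch, overlapped with fetch
        x3 + wait_states,       # Z3: load/store pays the wait states directly
        max(x4, fetch_cycles),  # Z4: floating point, overlapped with fetch
    )


def execution_rate_estimation(z, mix):
    """ERE: average number of cycles per instruction for a normalised mix."""
    if abs(sum(mix) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(z_i * x_i for z_i, x_i in zip(z, mix))


def execution_rate(ere_cycles, cycle_time_s):
    """ER in instructions per second, given ERE and the cycle time CT."""
    return 1.0 / (ere_cycles * cycle_time_s)


# Example: THOR small configuration (X = 1, 1, 2, 4; Y(W) = 0.75; W = 0; CT = 83 ns)
mix = (0.50, 0.25, 0.10, 0.15)
z = stage_cycles((1, 1, 2, 4), fetch_cycles=0.75, wait_states=0)
ere = execution_rate_estimation(z, mix)
print(ere, execution_rate(ere, 83e-9) / 1e6)  # cycles, MmixedIPS
```

Keeping the `max` step separate from the weighted sum makes it easy to see which instruction groups are limited by the fetch rate rather than by execution time.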

### 4 Average Bus Activity

The Average Bus Activity (ABA) is a component in the memory power requirement estimation. It is considered a function of:

- 1. Instruction Fetch Rate
- 2. Instruction Mix
- 3. Instruction Execution Timing

Factors with a major impact on the ABA are:

The instruction format: For example, with an instruction format of 32 bits and assuming single-cycle execution of all instructions, the bus will be occupied 100 % of the time with instruction fetches.

Short execution times: Because not all instructions execute in one cycle, the bus does not have to be occupied 100 % of the time with instruction fetches. Thus the longer the execution times, the lower the ABA.

Load/store: Extra bus accesses initiated by load/store instructions occupy the bus, thus increasing the ABA.

Here, the ABA is estimated by:

$$ABA = \frac{x_1}{X_1} + \frac{x_2}{X_2} + \frac{x_3}{X_3} + \frac{x_4}{X_4} \quad \text{(fraction of time the bus is active)}$$
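A minimal Python sketch of this estimate follows. The function name is an illustrative assumption; the example uses the THOR parameters and the instruction mix stated later in this document.

```python
def average_bus_activity(mix, exec_cycles):
    """ABA as the sum of x_i / X_i over the four instruction groups."""
    if len(mix) != len(exec_cycles):
        raise ValueError("mix and exec_cycles must have the same length")
    return sum(x / big_x for x, big_x in zip(mix, exec_cycles))


mix = (0.50, 0.25, 0.10, 0.15)   # x1..x4
thor_x = (1, 1, 2, 4)            # X1..X4 for THOR
print(average_bus_activity(mix, thor_x))
```

Note that the expression yields a fraction between 0 and 1 for the parameters used in this document; multiply by 100 to express it as a percentage.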

### 5 Memory Power Consumption

The memory used, the Cypress CY7C194 (64k × 4 bits), is a 24-pin device with an access time of 35 ns. The memory power consumption is estimated by:

$$P_{average} = ABA P_{active} + (1 - ABA) P_{standby}$$

For this memory device:

$$P_{active} = 660 \, mW$$
$$P_{standby} = 192 \, mW$$
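The weighting between active and standby power can be sketched in Python as follows. The constant and function names are assumptions made for illustration; the two power figures are the ones quoted above for the CY7C194.

```python
P_ACTIVE_MW = 660.0   # active power per device, from the text
P_STANDBY_MW = 192.0  # standby power per device, from the text


def memory_power_mw(aba, p_active=P_ACTIVE_MW, p_standby=P_STANDBY_MW):
    """Average power per memory device for a bus activity 0 <= aba <= 1."""
    if not 0.0 <= aba <= 1.0:
        raise ValueError("ABA must be a fraction between 0 and 1")
    return aba * p_active + (1.0 - aba) * p_standby


# ABA values as estimated in the small-configuration sections of this document
for name, aba in (("T800", 0.38), ("THOR", 0.8375), ("SPARC", 0.82)):
    print(name, round(memory_power_mw(aba)))
```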

### 6 Instruction Mix

The following instruction mix is assumed:

- 50% arithmetical/logical instructions
- 25% jump/branch instructions
- 10% load/store instructions
- 15% floating point instructions

### 7 Notes on the Failure Rate estimation

Failure rate estimation is carried out according to MIL-HDBK-217E. The following assumptions were made:

- Quality Factor = S(0.25)
- Voltage Factor = 1
- Application Environment Factor = Space Flight (0.9)

For the calculation of the temperature acceleration factor, the thermal resistance was used whenever it was available from the manufacturers' documentation. However, this was rare, so assumptions had to be made about the junction temperature.

For complex circuits, such as CPUs and FPUs, a junction temperature of 110 degrees Celsius was assumed. For all others, a junction temperature of 80 degrees Celsius was assumed.

## 8 Inmos T800 small configuration

The T800 has an on-chip timer, so no such peripheral device is required.

Component list:

| Ref.     | Device          | Qty | Power [mW] | Area [mm2] | FITS |
|----------|-----------------|-----|------------|------------|------|
| U1       | T800-G17S       | 1   | 1200 (1)   | 1451       | 532  |
| U2-U5    | 74ACT245        | 4   | 1200 (1)   | 220        | 3    |
| U6       | 74ACT08         | 1   | 9          | 154        | 3    |
| U7       | 74ACT244        | 1   | 12         | 220        | 3    |
| U8,U9    | 74HCT373        | 2   | 11         | 220        | 3    |
| U11      | 74ACT04         | 1   | 10         | 154        | 3    |
| U12      | 0T05            | 1   | 100        | 270        | 27   |
| U13,U14  | 54HCT393        | 2   | 20         | 220        | 3    |
| MU1-MU10 | CY7C194(35)     | 10  | 366 (2)    | 255        | 218  |
| EU1      | IDT49C460B      | 1   | 625        | 1944       | 92   |
| EU2      | CYC7C361-L66DMB | 1   | 750        | 280        | 170  |
| EU3      | 74ACT32         | 1   | 9          | 154        | 3    |
| EU4      | OT050           | 1   | 100        | 270        | 27   |
| EU5-EU8  | 74ACT245        | 4   | 12         | 220        | 3    |
| EU9      | 74ACT244        | 1   | 12         | 220        | 3    |

1) Estimated for the current application

2) Average according to ABA

### 9 T800 Read memory cycle (external memory)

- T1: Address setup time before address valid strobe
- T2: Address hold time after address valid strobe
- T3: Time for the bus to go to tristate on a read cycle, or to present valid data on a write cycle
- T4,T5: Time for the read or write data pulse
- T6: Time for the bus to remain in tristate after the end of read, or for data to remain valid after the end of write

For the selected device, 1 Tm = 28.5 ns.

- 1. The address is latched at the falling edge of T1. The address setup time is "a-8" = 20.5 ns. The 373 typically requires 5 ns, so T1 = 1 Tm is sufficient.
- 2. The address hold time after the falling edge of T1 is "b-9" = 19.5 ns. The 373 typically needs 6 ns, so T2 = 1 Tm.
- 3. For T3, T4 and T5: CS\* is asserted at the end of T1. During a read cycle, data is latched at the falling edge of T5. The buffer propagation delay is 11 ns, the T800 needs stable data 25 ns before it is latched, the memory requires 35 ns from CS\*, and the EDAC adds 36 ns. Hence 35 + 11 + 36 + 25 = 107 ns, which exceeds the 85.5 ns available with T3 = T4 = T5 = 1 Tm, and two extra Tm are required.
- 4. With T6 = 1 Tm we arrive at a total of 8 Tm, i.e. 228 ns for an external memory cycle. Thus a memory read bus cycle is equivalent to 228/57 = 4 processor cycles.
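The arithmetic of the four steps above can be retraced with a short Python sketch. All names are illustrative assumptions. The number of extra Tm periods is taken from the text as given rather than derived, and the processor cycle is taken as 2 Tm = 57 ns, as implied by the division 228/57.

```python
TM_NS = 28.5                    # one Tm period for the selected device
PROCESSOR_CYCLE_NS = 2 * TM_NS  # 57 ns, as used in the text


def t800_read_cycle(extra_tm=2):
    """Retrace the T800 external read-cycle budget from the text."""
    required_ns = 35 + 11 + 36 + 25          # memory + buffer + EDAC + T800 setup
    base_window_ns = 3 * TM_NS               # T3 = T4 = T5 = 1 Tm
    window_ns = (3 + extra_tm) * TM_NS       # window after stretching by extra_tm
    total_tm = 1 + 1 + (3 + extra_tm) + 1    # T1 + T2 + (T3..T5 + extra) + T6
    total_ns = total_tm * TM_NS
    return {
        "required_ns": required_ns,
        "base_window_ns": base_window_ns,
        "window_ns": window_ns,
        "total_tm": total_tm,
        "total_ns": total_ns,
        "processor_cycles": total_ns / PROCESSOR_CYCLE_NS,
    }


result = t800_read_cycle()
print(result)
```

With the default of two extra Tm the sketch returns 8 Tm, 228 ns and 4 processor cycles, which corresponds to the three wait states used in the performance estimate.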

### 10 T800 Estimation of Performance

The following parameters were chosen to describe the T800 configuration:

$$X_1 = 2$$
$$X_{21} = 2, \quad X_{22} = 4, \quad X_2 = 3.8$$
$$X_3 = 2$$
$$X_4 = 8$$

A T800 instruction may be encoded in 8 bits. With respect to the instruction mix, an average of 2 instructions per 32-bit fetch is assumed, therefore:

$$Y = 0.5(1+W) \quad \text{(normalised)}$$

As concluded in the previous section, a memory read takes 4 processor cycles, so W = 3 and Y(W) (normalised) = 2, thus:

$$Z_1 = X_1 = 2$$
$$Z_2 = X_2 = 3.8$$
$$Z_3 = 5$$
$$Z_4 = X_4 = 8$$

leading to:

$$ERE = 3.65 \ \text{cycles}$$

and:

$$ER = \frac{1}{3.65 \cdot 57 \ \text{ns}} = 4.8 \ MmixedIPS$$

For the bus activity we obtain:

$$ABA = 0.38$$

The total memory power requirement: 370 mW/device.
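As a check, the execution rate figures above can be reproduced by substituting the instruction mix of section 6 ($x_1 = 0.50$, $x_2 = 0.25$, $x_3 = 0.10$, $x_4 = 0.15$) into the expressions of section 3. This is only a worked restatement of the arithmetic, not additional data:

$$ERE = 2 \cdot 0.50 + 3.8 \cdot 0.25 + 5 \cdot 0.10 + 8 \cdot 0.15 = 1.00 + 0.95 + 0.50 + 1.20 = 3.65$$
$$ER = \frac{1}{3.65 \cdot 57 \ \text{ns}} = \frac{1}{208.05 \ \text{ns}} \approx 4.8 \cdot 10^{6} \ \text{instructions/second}$$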

## 11 Thor, small configuration

The Thor has an on-chip timer, so no such peripheral device is required. Furthermore, THOR has a built-in EDAC, so no external EDAC is required either. The chip is not yet available; the figures given for the THOR chip are estimates.

Component list:

| Ref.     | Device      | Qty | Power [mW] | Area [mm2] | FITS |
|----------|-------------|-----|------------|------------|------|
| U1       | Thor        | 1   | 1500       | 2450       | 78   |
| U2-U6    | 74ACT245    | 5   | 29         | 220        | 3    |
| U7       | 74ACT138    | 1   | 33         | 220        | 3    |
| U8-U10   | 74ACT244    | 3   | 29         | 220        | 3    |
| U11      | OT016       | 1   | 100        | 270        | 26   |
| U12      | 74ACT04     | 1   | 24         | 154        | 3    |
| U13,U14  | 54HCT393    | 2   | 20         | 220        | 3    |
| MU1-MU10 | CY7C194(35) | 10  | 584 (\*)   | 255        | 218  |

\*) Average according to ABA

## 12 THOR Read memory Cycle

We assume that 5 ns of setup time is needed before data is latched. The '138 introduces a delay of 16 ns, the memory requires 35 ns from  $CS^*$  to valid data, and the data bus buffers delay data by 11 ns. Thus we need a cycle time of:

15 + 16 + 35 + 11 + 5 = 82 ns

The Thor cycle time is 83 ns and therefore no wait states are required.

## 13 THOR Estimation of Performance

The following parameters were chosen to describe the THOR configuration:

$$X_1 = 1$$
$$X_2 = 1$$
$$X_3 = 2$$
$$X_4 = 4$$

A THOR instruction may be encoded in 16 or 32 bits. With respect to the instruction mix and W = 0 from above:

$$Y(W) = 0.75$$

Thus:

$$Z_1 = X_1 = 1$$
$$Z_2 = X_2 = 1$$
$$Z_3 = 2$$
$$Z_4 = X_4 = 4$$

leading to:

$$ERE = 1.55 \ \text{cycles}$$

and:

$$ER = \frac{1}{1.55 \cdot 83 \ \text{ns}} = 7.8 \ MmixedIPS$$

For the bus activity:

$$ABA = 0.8375$$

The total memory power requirement: 584 mW/device.
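The bus activity and memory power figures follow from the expressions of sections 4 and 5 with the instruction mix of section 6. Written out as a worked check (a restatement of the arithmetic only):

$$ABA = \frac{0.50}{1} + \frac{0.25}{1} + \frac{0.10}{2} + \frac{0.15}{4} = 0.50 + 0.25 + 0.05 + 0.0375 = 0.8375$$
$$P_{average} = 0.8375 \cdot 660 + 0.1625 \cdot 192 = 552.75 + 31.20 = 583.95 \approx 584 \ mW$$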

## 14 SPARC small configuration

Component list:

| Ref.     | Device      | Qty | Power [mW] | Area [mm2] | FITS |
|----------|-------------|-----|------------|------------|------|
| U1       | CY7C601     | 1   | 1750       | 1998       | 365  |
| U2       | CY7C344     | 1   | 1000       | 289        | 170  |
| U3 (1)   | CY7C602     | 1   | 1750       | 1600       | 358  |
| U4-U6    | 74ACT244    | 3   | 12         | 220        | 3    |
| U7       | 74ACT04     | 1   | 10         | 154        | 3    |
| U8-U11   | 74HCT373    | 4   | 11         | 220        | 3    |
| U12      | MC146818    | 1   | 20         | 255        | 49   |
| MU1-MU10 | CY7C194(35) | 10  | 576 (2)    | 255        | 218  |
| EU1      | IDT49C460B  | 1   | 625        | 1944       | 92   |
| EU2      | CYC7C361    | 1   | 750        | 280        | 170  |
| EU3      | 74ACT32     | 1   | 9          | 154        | 3    |
| EU4      | 0T050       | 1   | 100        | 270        | 27   |
| EU5-EU8  | 74ACT245    | 4   | 12         | 220        | 3    |
| EU9      | 74ACT244    | 1   | 12         | 220        | 3    |

1) Not Available in mil spec

2) Average according to ABA

## 15 SPARC Read Cycle

Delays:

- A2-A17 to CS\* (PLD decoder): 20 ns
- memory data setup time: 35 ns
- EDAC delay: 36 ns
- data bus buffer: 11 ns

Required: From stable address to data latched:

$$20 + 35 + 36 + 11 = 102ns$$

Available (3 processor cycles):

$$120 + 7 - 3 = 124ns$$

Therefore, a bus read cycle will require 3 processor cycles.
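The comparison of required and available time can be sketched as a search for the smallest whole number of processor cycles that covers the delay chain. The function name and the `adjust_ns` parameter are illustrative assumptions; the +7 ns and -3 ns terms are taken from the text without further interpretation, and the 40 ns cycle time is the one implied by 3 cycles = 120 ns.

```python
def cycles_for_read(required_ns, cycle_ns, adjust_ns=0.0, max_cycles=16):
    """Smallest n such that n * cycle_ns + adjust_ns covers required_ns.

    Returns (n, available_ns). adjust_ns collects fixed timing offsets.
    """
    for n in range(1, max_cycles + 1):
        available_ns = n * cycle_ns + adjust_ns
        if available_ns >= required_ns:
            return n, available_ns
    raise ValueError("required time not covered within max_cycles")


# SPARC small configuration: decoder + memory + EDAC + bus buffer
required = 20 + 35 + 36 + 11
cycles, available = cycles_for_read(required, cycle_ns=40, adjust_ns=7 - 3)
wait_states = cycles - 1  # one cycle is the access itself
print(required, cycles, available, wait_states)
```

The same function applied to the THOR figures (82 ns required, 83 ns cycle time) returns a single cycle, i.e. no wait states.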

### 16 SPARC Estimation of Performance

The following parameters were chosen to describe the SPARC configuration:

$$X_1 = 1$$
$$X_2 = 1$$
$$X_3 = 3$$
$$X_4 = 4$$

A SPARC instruction is encoded in 32 bits. For the uniform instruction flow, W = 2, and:

$$Y(W) = 3$$

thus:

$$Z_1 = Y(W) = 3$$
$$Z_2 = Y(W) = 3$$
$$Z_3 = 5$$
$$Z_4 = X_4 = 4$$

leading to:

$$ERE = 3.35 \ \text{cycles}$$

and:

$$ER = \frac{1}{3.35 \cdot 40 \ \text{ns}} = 7.5 \ MmixedIPS$$

For the bus activity:

$$ABA = 0.82$$

The total memory power requirement: 576 mW/device
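The SPARC case is the one where the fetch rate, rather than the execution time, limits the simple instructions. The following worked restatement assumes, by analogy with the T800 expression $Y = 0.5(1+W)$ for two instructions per fetch, that one 32-bit instruction per fetch gives $Y(W) = 1 + W$; this reading is an assumption, but it agrees with the value $Y(W) = 3$ stated above:

$$Z_1 = \max[1, 3] = 3, \quad Z_2 = \max[1, 3] = 3, \quad Z_3 = 3 + 2 = 5, \quad Z_4 = \max[4, 3] = 4$$
$$ERE = 3 \cdot 0.50 + 3 \cdot 0.25 + 5 \cdot 0.10 + 4 \cdot 0.15 = 1.50 + 0.75 + 0.50 + 0.60 = 3.35$$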

### 17 The maximum configurations

The maximum configuration is intended to estimate peak performance for computer systems with 1 MByte of memory. It consists of:

- cpu
- 1 MByte of static random access memory

### 18 General Notes on the maximum configurations

The maximum configuration is accomplished by eliminating the EDAC circuitry and changing the memory devices from the minimum configuration. Glue logic, apart from address decoding and bus buffers, is implemented using macro cells.

The memory is built from eight 64k × 16-bit, 25 ns static RAMs. Address decoding is performed by dedicated high-speed PAL devices, eliminating any address bus skew which may otherwise arise in systems with high clock frequencies.

The failure rate estimates assume commercial quality components and a "Ground, benign" environment.

### 19 T800 maximum configuration components

| Ref.     | Device    | Qty | Power [mW] | Area [mm2] | FITS  |
|----------|-----------|-----|------------|------------|-------|
| U1       | T800-G30S | 1   | 1200       | 1451       | 13907 |
| U2       | CY7C343   | 1   | 775        | 311        | 4527  |
| U3-U7    | 74ACT245  | 5   | 71         | 220        | 490   |
| U8-U11   | 74ACT244  | 4   | 71         | 220        | 490   |
| MU1-MU8  | CYM1624   | 8   | 2750       | 442        | 11242 |
| MU9-MU10 | CY7C338   | 2   | 750        | 226        | 3398  |

Power requirement calculations were performed assuming a 30 MHz clock.

### 20 T800 maximum configuration execution rate

From the T800 read cycle diagram, and with the chosen configuration, we conclude that an external memory read cycle may be performed without any wait state penalty, i.e. with T1 to T6 each equal to 1 Tm (3 processor cycles). This also implies that there is nothing to gain from a cache memory. It should, however, be emphasised that the T800 internal memory (4 kByte) is not considered.

Hence W = 2 and:

$$Z_1 = 2$$
$$Z_2 = 3.8$$
$$Z_3 = 4$$
$$Z_4 = 8$$

leading to:

$$ER = \frac{1}{3.55 \cdot 33 \ \text{ns}} = 8.5 \ MmixedIPS$$
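The value 3.55 in the denominator is the ERE for this configuration. Written out with the T800 parameters of section 10 and the instruction mix of section 6 (a worked restatement, not additional data):

$$Y(W) = 0.5(1 + 2) = 1.5$$
$$Z_1 = \max[2, 1.5] = 2, \quad Z_2 = \max[3.8, 1.5] = 3.8, \quad Z_3 = 2 + 2 = 4, \quad Z_4 = \max[8, 1.5] = 8$$
$$ERE = 2 \cdot 0.50 + 3.8 \cdot 0.25 + 4 \cdot 0.10 + 8 \cdot 0.15 = 1.00 + 0.95 + 0.40 + 1.20 = 3.55$$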

## 21 THOR maximum configuration components

| Ref.      | Device   | Qty | Power [mW] | Area [mm2] | FITS  |
|-----------|----------|-----|------------|------------|-------|
| U1        | THOR     | 1   | 1500       | 2450       | 78    |
| U2        | CY7C343  | 1   | 775        | 311        | 4527  |
| MU1-MU8   | CYM1624  | 8   | 2750       | 442        | 11242 |
| MU9-MU10  | CY7C338  | 2   | 750        | 226        | 3398  |
| MU11-MU14 | 74ACT245 | 4   | 35         | 220        | 490   |
| MU15-MU17 | 74ACT244 | 3   | 35         | 220        | 490   |

Power requirement calculations were performed assuming a 15 MHz clock.

### 22 THOR maximum configuration execution rate

In the proposed configuration, THOR (15 MHz) does not require any wait states, so the calculations from the previous sections may be reused. With ERE = 1.55 and CT = 67 ns we conclude:

$$ER = \frac{1}{1.55 \cdot 67 \ \text{ns}} \approx 9.6 \ MmixedIPS$$

which rounds to the 10 MmixedIPS quoted in the summary.

### 23 SPARC maximum configuration components

| Ref.      | Device   | Qty | Power [mW] | Area [mm2] | FITS  |
|-----------|----------|-----|------------|------------|-------|
| U1        | CY7C601  | 1   | 3250       | 1998       | 14063 |
| U2        | CY7C602  | 1   | 2250       | 1600       | 13979 |
| U3-U4     | CY7C157  | 2   | 1250       | 397        | 11303 |
| U5        | CY7C604  | 1   | 3250       | 2554       | 14116 |
| U6        | CY7C343  | 1   | 775        | 311        | 4527  |
| MU1-MU8   | CYM1624  | 8   | 2750       | 442        | 11242 |
| MU9-MU10  | CY7C338  | 2   | 750        | 226        | 3398  |
| MU11-MU14 | 74ACT245 | 4   | 95         | 220        | 490   |
| MU15-MU17 | 74ACT244 | 3   | 95         | 220        | 490   |

Power requirement calculations were performed assuming a 40 MHz clock.

### 24 SPARC maximum configuration execution rate

The SPARC configuration utilises a 64 kByte cache memory. Experience has shown that for a cache of this size, a hit rate of 90 % is probable.

Denoting by  $Z_x(C)$  the cycle counts that apply when the 32-bit word is fetched from the cache, we write:

$$ERE = (Z_1x_1 + Z_2x_2 + Z_3x_3 + Z_4x_4) \cdot 0.10 + (Z_1(C)x_1 + Z_2(C)x_2 + Z_3(C)x_3 + Z_4(C)x_4) \cdot 0.90$$

Timing analysis (carried out as in section 15) shows that a cache miss costs one wait state. An access within the cache may be done without wait states. Hence:

$$Z_{1} = 2$$
$$Z_{2} = 2$$
$$Z_{3} = 4$$
$$Z_{4} = 4$$
$$Z_{1}(C) = 1$$
$$Z_{2}(C) = 1$$
$$Z_{3}(C) = 3$$
$$Z_{4}(C) = 4$$

The maximum configuration runs at 40 MHz (CT = 25 ns), and from this:

$$ER = \frac{1}{1.735 \cdot 25 \ \text{ns}} = 23 \ MmixedIPS$$
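The cache-weighted estimate can be sketched in Python as follows. The function and variable names are illustrative assumptions; the cycle counts, hit rate and clock frequency are the ones given in this section, and the instruction mix is the one of section 6.

```python
def cached_ere(z_miss, z_hit, mix, hit_rate):
    """ERE weighted between cache misses and cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    ere_miss = sum(z * x for z, x in zip(z_miss, mix))  # cycles on a miss
    ere_hit = sum(z * x for z, x in zip(z_hit, mix))    # cycles on a hit
    return (1.0 - hit_rate) * ere_miss + hit_rate * ere_hit


mix = (0.50, 0.25, 0.10, 0.15)
ere = cached_ere(z_miss=(2, 2, 4, 4), z_hit=(1, 1, 3, 4), mix=mix, hit_rate=0.9)
cycle_time_s = 1.0 / 40e6  # 40 MHz clock, i.e. 25 ns
print(ere, 1.0 / (ere * cycle_time_s) / 1e6)  # cycles, MmixedIPS
```

Varying `hit_rate` in this sketch shows how strongly the result depends on the assumed 90 % figure: with a hit rate of 0 the estimate falls back to the miss-only ERE of 2.5 cycles.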

#### 25 Conclusions

The maximum configurations result in fewer components, a higher power requirement and a considerably larger expected failure rate than the small configurations.

The T800, just like the THOR, gains only slightly in performance from the maximum configuration. These processors do not seem suitable for use in this configuration.

The SPARC performance, however, increases considerably compared to its minimum configuration. This is not surprising. The SPARC design is intended for systems that take advantage of high-speed cache memories. As a consequence, it suffers more from slow memories and error detection circuitry.