DSP Mappings (Task 1.3a) for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components
Submitted under Subcontract FP-19738-430292
An Integrated Tool for SCA Waveform Development, Testing, and Debugging and a Tool for Automated Estimation of DSP Resource Statistics for Waveform Components
Version 1.1
Revision History
| Version | Summary of Changes | Date |
|---|---|---|
| 0.1 (JN) | Internal Release | |
| 1.0 (JN) | Release as initial documentation for DSP files | |
| 1.1 (JN) | Validation errors fixed: C55 (2 ops/cycle) | |
TABLE OF CONTENTS
1 Introduction and Methodologies
2.1 Information from General Conventions
2.2 Information from DataSheet
3.1 Information from General Conventions
3.2 Information from DataSheet [1]
4.1 Information from General Conventions
4.2 Information from DataSheet [1]
5.1 Information from General Conventions
5.2 Information from DataSheet [1]
6.1 Information from General Conventions
6.2 Information from DataSheet [1]
7.1 Information from General Conventions
7.2 Information from DataSheet [1]
8.1 Information from General Conventions
8.2 Information from DataSheet [1]
9.1 Information from General Conventions
9.2 Information from DataSheet [1]
10.1 Information from General Conventions
10.2 Information from DataSheet [1]
11.1 Information from General Conventions
11.2 Information from DataSheet [1]
13.1 Information from General Conventions
13.2 Information from DataSheet [1-5]
14.1 Information from General Conventions
14.2 Information from DataSheet [1],[2]
15.1 Information from General Conventions
15.2 Information from DataSheet [1]
16.1 Information from General Conventions
16.2 Information from DataSheet [1]
17.1 Information from General Conventions
17.2 Information from DataSheet [1]
18.1 Information from General Conventions
18.2 Information from DataSheet [1]
19.1 Information from General Conventions
19.2 Information from DataSheet [1]
20.1 Information from General Conventions
20.2 Information from DataSheet [1]
21.1 Information from General Conventions
This document is intended to document the steps used to generate the DSP files for the “Tool for Automating Estimation of DSP Resource Statistics for Waveform Components” and to provide enough detail for others to be able to create similar files for their processors.
This
section gives an overview of the methodology for writing a DSP file and details
on key processes associated with this methodology. This is followed by 20
example applications of this process to a variety of different processors.
A DSP file is generated by first collecting the following documentation on the processors:
These documents are then used to determine the following parameters:
With this information, a DSP file for the processor can be written according to the format specified in ]. These steps are then documented for each processor and then validated using mapped components and vendor measured statistics.
This project maps the 20 DSPs shown in Table 1, which reflects the addition of the PowerPC MPC8540 and the removal of the Intel XScale processor per the discussions on February 5th. All DSPs have been mapped and documented. Validation will commence after component mappings are complete.
The following are links to the data sources referenced in the DSP mappings. The data sheets are typically sufficient to identify the processor’s precision.
Table 1: Hyperlinks to data Files used in DSP file generation process
|
DSP |
Datasheet |
Instruction Set |
Library |
Power |
|
SPRU972 ( SPRU973 (TCP2) |
||||
|
SPRA750d ( SPRA749b (TCP) |
||||
|
|
86 W ( |
|||
|
Benchmarks
(56300 Family Reference, App B) |
|
|||
|
|
||||
|
MSC8101RM
(reg) |
??? |
|||
|
??? |
||||
|
??? |
||||
|
??? |
||||
|
??? |
||||
|
??? |
|
Manufacturer |
Website |
|
|
|
|
Analog Devices |
|
|
Intel |
|
|
Freescale |
|
|
ARM |
Characteristics are intended to model features of a DSP that, when present, alter the code so dramatically that the typical modifier equations are not appropriate. In mappings, if a component lists a requirement that the DSP does not have as a characteristic, the mapping will fail. For example, if a component assumes support for floating point operations, the engine will not attempt to map the component onto a DSP that can only implement fixed point operations. Likewise, for a component that assumes fixed point operation, the engine will not attempt to map the component onto a DSP that supports only floating point operations (as it may not have sufficient precision). If the distinction does not matter to the implementation, then the component file should not list the DSP characteristic as a requirement.
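The requirement check described above can be pictured as a subset test. The following is only a minimal sketch: the function name `can_map` and the example characteristic sets are hypothetical and are not taken from the tool or its file format.

```python
def can_map(component_requirements, dsp_characteristics):
    """Return True when every characteristic the component requires
    is listed as a characteristic of the DSP (assumed rule)."""
    return set(component_requirements) <= set(dsp_characteristics)


# Hypothetical characteristic sets, for illustration only.
fixed_point_dsp = {"Fixed", "Saturate", "Circular"}
floating_point_dsp = {"Float", "Double"}

float_component = {"Float"}   # assumes floating point support
agnostic_component = set()    # lists no requirement at all

print(can_map(float_component, floating_point_dsp))
print(can_map(float_component, fixed_point_dsp))
print(can_map(agnostic_component, fixed_point_dsp))
```

A component that lists no requirement passes the check for every DSP, which matches the guidance that a characteristic should be omitted when the distinction does not matter to the implementation.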
Table 2 provides a listing of DSP characteristics which were identified and used in Release 1.1. In general, these should be identified from supported addressing modes and data formats. The exceptions to this rule are the following:
Table 2: Characteristics identified and used in Release 1.1
| Characteristic | Description |
|---|---|
| Fixed | Supports fixed point processing |
| Round | Processor includes instructions for single-cycle rounding of results in various Q notations. |
| Float | Supports floating point processing |
| Double | Supports double floating point processing |
| Circular | Supports circular addressing modes |
| Reverse | Supports bit-reverse addressing |
| RISC | DSP is a RISC processor. Note that the cycle estimation process is not as robust for RISC processors because of the variable time to implement instructions and the high probability of stalls due to inter-instruction contention for resources and data dependencies. |
| Saturate | Supports saturation arithmetic (eliminates overflow condition) |
| Scaling | Implements rounding with scaling (shift left, shift right) |
Modifiers are intended to model specialized hardware within
a DSP that permits multiple operations to be executed in a single cycle, e.g.,
a
Identified instruction modifiers for Release 1.1 are summarized in Table 3. Note that some modifiers are catch-alls for a collection of instruction modifiers that would be too tedious to model individually. For example, EXTRACT models various instructions that support various combinations of typical bit manipulation operations (e.g., shifting, bit extraction and insertion, and masking).
Table 3: Identified instruction modifiers in Release 1.1
| Instruction Modifier | Operation | Modeled Effect |
|---|---|---|
| ABSALU | A typical | Cycles associated with subsequent abs eliminated |
| ADDSUB | The | Consecutive adds/subtracts eliminated |
| AVG | The processor implements (A+B)/2 | Eliminates a right shift when following an add |
| BDEC | The processor decrements a counter and branches | Counter decrement cycles eliminated in loops |
| BPOS | The processor branches on general conditions, e.g., R>0 | Cycle eliminated for generating branch condition |
| BITR | The processor reverses the bits of a register in a single cycle | Cycles for the process to bit-reverse an indexed array are eliminated. 1 cycle per array element added back in. |
| COND_EXEC | All instructions are executed conditionally | Cycles consumed for short control branches are eliminated |
| COND_MOV | All memory/move operations can be executed conditionally | Cycles consumed for short control branches related to memory are eliminated |
| CPX_MPY | A single cycle complex multiplication. Note that several DSPs have an instruction which implements a complex multiplication (or | Complex multiplication cycles (6) reduced to 1 per complex multiplication. |
| EXTRACT | The processor is capable of detailed bit manipulation in a single cycle. This takes various forms in different instructions, but a minimum requirement is the ability to extract a specified set of bits from a word and then pack them into bytes | Bit manipulation cycles cut in half. |
| GMPY | The processor supports Galois Field arithmetic (useful in some error correction codes). | Cycles required to mimic Galois Field arithmetic are eliminated with one cycle per Galois Field arithmetic operation added in. |
| INDEX | The processor supports indexed addressing, e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2. | Cycles used to separately add a register to an address register eliminated. |
| MAC | A single-instruction (functional unit) multiplication and accumulation. Note that when both a multiplier and an | Accumulate cycles eliminated |
| MAX | The processor does the following (and the reverse for a MIN): if A>B, A-> dst | Cycles required to perform move following comparison operation eliminated |
| MAX2 | The processor performs the MAX operation for two pairs of words of native precision. | Cycles required to perform both moves and one comparison eliminated |
| MEM2 | The data bus width of a processor is such that a single instruction fetches 2 words. | Memory cycles cut in half (rounded up). |
| MEM4 | The data bus width of a processor is such that a single instruction fetches 4 words. | Memory cycles cut to one fourth (rounded up). |
| NOREG | The processor memory maps all registers so there’s no need for an instruction to load registers from memory. Processors which do this, however, tend to clock much slower | All cycles used to move |
| SAD | Sum of absolute differences. | Absolute values removed. A special case of ABSALU. |
| | Adds up bytes in a word. | Eliminate cycles of sum of byte-packed words |
| VECT | A single cycle complex multiplication. Note that several DSPs have an instruction which implements a complex multiplication (or | Complex multiplication cycles (6) reduced to 1 per complex multiplication. |
| VSL | A process by which a register is shifted left and an input 1 or 0 is appended to the right most bit. Useful for keeping track of paths (saves an instruction). | Cycle saved per path update in |
| ZOL | The processor supports a form of zero-overhead (hardware) looping wherein loop instructions are placed in a hardware buffer and repeated a specified number of times. | All loop control cycles eliminated (branch, compare, decrement). One cycle added to set loop counter. |
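To make the "Modeled Effect" column more concrete, the sketch below applies three of the Table 3 effects (ZOL, CPX_MPY, EXTRACT) to a tally of cycles. The per-category dictionary, the function name `apply_modifiers`, the single-loop assumption for ZOL, and the round-up choice for EXTRACT are all assumptions made for illustration; the actual engine may represent and apply these effects differently.

```python
import math


def apply_modifiers(cycles, modifiers):
    """Return a new cycle tally after applying a few Table 3 effects.

    `cycles` is a hypothetical dict of per-category cycle counts.
    """
    out = dict(cycles)
    if "ZOL" in modifiers and out.get("loop_control", 0) > 0:
        # All loop control cycles eliminated; one cycle added to set
        # the loop counter (a single loop is assumed here).
        out["loop_control"] = 1
    if "CPX_MPY" in modifiers:
        # Six cycles per complex multiplication reduced to one.
        # Assumes the tally is a whole number of 6-cycle multiplications.
        out["complex_mult"] = out.get("complex_mult", 0) // 6
    if "EXTRACT" in modifiers:
        # Bit manipulation cycles cut in half (rounding up is an assumption).
        out["bit_manip"] = math.ceil(out.get("bit_manip", 0) / 2)
    return out


baseline = {"loop_control": 30, "complex_mult": 60, "bit_manip": 9}
reduced = apply_modifiers(baseline, {"ZOL", "CPX_MPY", "EXTRACT"})
print(reduced)
```

Keeping each effect as an independent rule on one category makes it easy to see which modifier accounts for which saving.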
Very-Long-Instruction-Word (VLIW) is how super-scalar DSP
architectures are implemented. In effect in a single VLIW fetch, multiple
instructions are returned and dispatched. This is accomplished by having
multiple functional units on a processor which execute the instructions.
For example a DSP that has both a multiplier and an ALU could support
simultaneous execution of instructions for both the ALU and the multiplier. In
general the number of instructions that can be simultaneously executed will be
limited by the number of functional units or the program bus width. For
example, a DSP might have two multipliers and two
So to model a DSP that supports VLIW, the following steps should be taken. First, it should be determined if the DSP has multiple functional units. These functional units should be classified as memory units, ALU units, and multiplier units. For our purposes, a shifter is considered an ALU unit (it’s not, but situations where this assumption doesn’t work should be rare). Note that DSP architectures vary significantly, so some discretion will be needed when making these classifications.[1]
Then the maximum number of instructions that can be executed simultaneously should be determined, as well as the number of instructions for each functional unit classification (Multiplier,
Note that some DSPs have wider than normal data-memory paths
that allow a greater bandwidth when moving data between processing registers
and memory. However, being able to fetch 2 or 4 or 8 words with a single
instruction should not be modeled as VLIW as distinct instructions are not
being used in this process. The key phenomenon being modeled is the capacity to
simultaneously execute multiple distinct instructions.
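One way to picture how the VLIW parameters constrain a cycle estimate is a simple lower bound: no class of instruction can finish faster than its unit count allows, and the total cannot beat the maximum issue width. This is a sketch under that assumption only; the function name, argument names, and the example unit counts are hypothetical, and the engine's actual scheduling model is not described here.

```python
import math


def vliw_cycle_bound(n_mem, n_alu, n_mult,
                     mem_units, alu_units, mult_units, vliw_max):
    """Lower bound on cycles for a block of independent instructions."""
    per_class = [
        math.ceil(n_mem / mem_units),    # memory instructions
        math.ceil(n_alu / alu_units),    # ALU (and shifter) instructions
        math.ceil(n_mult / mult_units),  # multiplier instructions
    ]
    # The issue width limits the total number of instructions per cycle.
    issue_limit = math.ceil((n_mem + n_alu + n_mult) / vliw_max)
    return max(per_class + [issue_limit])


# Hypothetical 8-issue DSP with 2 memory units, 4 ALUs, and 2 multipliers.
bound = vliw_cycle_bound(n_mem=40, n_alu=40, n_mult=20,
                         mem_units=2, alu_units=4, mult_units=2, vliw_max=8)
print(bound)
```

In this example the memory units are the bottleneck, which is why wide data-memory paths are handled separately rather than as additional VLIW slots.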
When a DSP supports Single-Instruction-Multiple-Data (SIMD) instructions, the same instruction can be applied to multiple data elements in a single cycle. Generally, this is realized by having functional units that can operate with word widths smaller than the native word width. For example, a DSP in the TI 6455 family natively works with 32-bit words, but its ALUs can also simultaneously add four pairs of 8-bit words.[2] However, there are processors that include secondary functional units that execute the same instructions as the primary functional units when an SIMD bit is set.
Because we are not creating a detailed computational model for each DSP (which would be almost equivalent to creating a compiler for each DSP), SIMD has to be applied somewhat coarsely. Thus we only treat a DSP as a SIMD processor if it supports SIMD operations in its multipliers, ALUs, and memory units (typically only the multipliers or ALUs will be limiting). Note that some DSPs have wider than normal data-memory paths that allow a greater bandwidth when moving data between processing registers and memory. In this situation, rather than (or in addition to) modeling the DSP as SIMD, the DSP is modeled as having instruction modifiers MEM2, MEM4, or MEMx to model the double, quadruple, or x-times memory fetched with a single instruction.
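The coarse SIMD rule and the MEMx treatment of wide memory paths can be sketched as two small helpers. The function names and example values are hypothetical; the only rules taken from the text are that all three unit classes must support SIMD for the flag to be set, and that MEM2/MEM4 divide memory cycles by the fetch width, rounded up.

```python
import math


def simd_flag(mult_simd, alu_simd, mem_simd):
    """Flag a DSP as SIMD only if multipliers, ALUs, and memory units
    all support SIMD operations."""
    return 1 if (mult_simd and alu_simd and mem_simd) else 0


def memory_cycles(mem_cycles, words_per_fetch):
    """Memory cycles after a MEMx modifier (x words per fetch), rounded up."""
    return math.ceil(mem_cycles / words_per_fetch)


# A DSP with packed add/subtract but no packed multiply is not flagged.
partial = simd_flag(mult_simd=False, alu_simd=True, mem_simd=True)
full = simd_flag(mult_simd=True, alu_simd=True, mem_simd=True)

halved = memory_cycles(21, 2)     # MEM2
quartered = memory_cycles(21, 4)  # MEM4
print(partial, full, halved, quartered)
```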
In this implementation, the engine estimates energy consumption by multiplying the peak power draw (peak core power + typical I/O values) by the estimated processing time. This is in line with the assumption that components are implemented for rapid execution, which entails maximum utilization of chip resources. I/O power should be left at typical values when the data is available.
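As a rough illustration of the energy calculation, the sketch below converts a cycle estimate to time using the clock rate and multiplies by the peak power draw. The function name, argument names, and the numbers in the example are hypothetical and do not describe any particular DSP.

```python
def estimate_energy_joules(cycles, clock_mhz, peak_core_w, typical_io_w):
    """Energy = (peak core power + typical I/O power) x processing time."""
    seconds = cycles / (clock_mhz * 1e6)
    return (peak_core_w + typical_io_w) * seconds


# Hypothetical component: 2 million cycles on a 200 MHz DSP drawing
# 1.2 W peak in the core and 0.3 W typical in the I/O.
energy = estimate_energy_joules(cycles=2_000_000, clock_mhz=200,
                                peak_core_w=1.2, typical_io_w=0.3)
print(f"{energy * 1000:.1f} mJ")
```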
The kinds and quality of documentation available for estimating peak power draw vary significantly from processor to processor so various methods are needed to estimate power consumption. The following are some of the more typical methods.
Change the CPU utilization rate to 100% while leaving the other values at the levels initially set by the manufacturer (for peak core power + typical I/O). While not seen in this survey, there may be processor spreadsheets that ask for estimations of bit toggling rates (a common practice in FPGA data sheets). Bit toggling rates should generally not be set above 50% (meaning that half the time bits flip).
Some vendors give measured power consumption values at different levels of CPU activity/utilization (e.g., 1 W at 25% and 2.5 W at 75%). A reasonable first-order approximation of peak power consumption in this situation is a linear extrapolation to 100% utilization. To make this extrapolation, you should be cognizant that power consumption is the sum of dynamic power consumption (power consumed while transistors are actively switching) and static power consumption (power consumed due to leakage). Sometimes a static power consumption value is given and sometimes not. If it is not given, then you need to solve the implied linear system of equations using at least two different activity levels.
For example the following set of equations would yield peak_dyn = 3 W, static = 0.25 W for a maximum power level of 3.25 W.
1 = peak_dyn * 0.25 + static
2.5 = peak_dyn *0.75 + static
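The two equations above can be solved directly by subtracting one from the other. The following sketch does exactly that; the function name is hypothetical, while `peak_dyn` and `static` follow the symbols used in the equations.

```python
def peak_power_from_two_points(u1, p1, u2, p2):
    """Solve p = peak_dyn * u + static from two (utilization, power) points.

    Utilization is given as a fraction (0.25 for 25%). Returns
    (peak_dyn, static, peak power at 100% utilization).
    """
    peak_dyn = (p2 - p1) / (u2 - u1)
    static = p1 - peak_dyn * u1
    return peak_dyn, static, peak_dyn + static


# 1 W at 25% utilization and 2.5 W at 75% utilization.
peak_dyn, static, peak = peak_power_from_two_points(0.25, 1.0, 0.75, 2.5)
print(peak_dyn, static, peak)
```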
Sometimes the only documentation available gives peak current draws and supply voltages for different parts of the chip. In such a situation peak power consumption is simply the sum of the products of peak current draws and supply voltages. In some situations, it may be necessary to extrapolate current draws for chip components as we did with extrapolation of peak power consumption.
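When only currents and voltages are documented, the calculation is a sum of products over the supply rails. The sketch below shows this with hypothetical rail values; the function name and the numbers are illustrative only.

```python
def peak_power_from_rails(rails):
    """Sum of (peak current x supply voltage) over all supply rails.

    `rails` is an iterable of (amps, volts) pairs.
    """
    return sum(amps * volts for amps, volts in rails)


# Hypothetical chip: 0.5 A peak on a 1.2 V core rail and
# 0.1 A peak on a 3.3 V I/O rail.
power_w = peak_power_from_rails([(0.5, 1.2), (0.1, 3.3)])
print(f"{power_w:.2f} W")
```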
Sometimes power estimates are known for one DSP in a family but not for another. When the same fabrication technology is used for both DSPs (e.g., both are 0.13 µm processes), you should leverage the fact that P ∝ C·v²·f, where C is capacitance, v is voltage, f is frequency, and P is power consumption. By assuming that C and the proportionality factor remain the same within a DSP family on the same fabrication technology, the unknown power estimate can be calculated from the known DSP as
P_unknown = P_known x (v_unknown / v_known)² x (f_unknown / f_known)
When fabrication technologies change (but the chip architecture remains the same), the methods used in [2] should be applied.
Note that this approach should not be used if the chip
architectures are different.
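The family-scaling relation can be written as a one-line function. This is a sketch that assumes the same architecture and fabrication technology for both parts, as the text requires; the function name and the example values are hypothetical.

```python
def scale_power(p_known, v_known, f_known, v_unknown, f_unknown):
    """Scale a known power figure by (voltage ratio)^2 x (frequency ratio).

    Frequencies may be in any unit as long as both use the same one.
    """
    return p_known * (v_unknown / v_known) ** 2 * (f_unknown / f_known)


# Hypothetical family member: 1.0 W known at 1.2 V and 200 MHz,
# estimated for a 300 MHz part at the same core voltage.
p_est = scale_power(p_known=1.0, v_known=1.2, f_known=200,
                    v_unknown=1.2, f_unknown=300)
print(f"{p_est:.2f} W")
```

Because voltage enters squared, a modest change in core voltage between family members can matter more than the clock rate difference.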
[1]
[2] J. Neel, S. Srikanteswara, J. Reed, P. Athanas, “A
Comparative Study of the Suitability of a Custom Computing Machine and a VLIW
DSP for Use in 3G Applications,” IEEE
Workshop on Signal Processing Systems SiPS2004, Oct 13-15, 2004, pp.
188-193.
The following provides details on how parameters for the data file for the TMS3206205-200 DSP were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS3206205-200
Native Precision: 16 (multiplication limiting)
Instruction Bits: 32
Clock Rate (MHz): 200
[2] gives two estimates of power consumption, one at 75% “high” / 25% “low” and one at 50% “high” / 50% “low”. Forming a linear extrapolation to 100% “high”, as suggested in the document, from 50% high at 1.15 W to 75% high at 1.32 W yields an estimate of 2 x (1.32 - 1.15) + 1.15 = 1.49 W.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Saturate
Circular
Indexed
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction (excepting saturation and sign manipulation).
BPOS
Positive register values can be used as branch conditions (there is no actual BPOS instruction, but an equivalent conditional branch can be made).
Cond_Exec
All instructions on the TMS3206205 can be executed conditionally.
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
VLIW is at the heart of the TMS3206205-200 architecture. According to [1], it supports up to 8 simultaneous instructions executed over 6 ALUs and two multipliers. Two of these ALUs, however, are primarily used to support memory loading/storing operations and as such are scored as memory units instead. The assigned parameters are summarized as follows.
VLIW_flag: 1
VLIW_max: 8
VLIW Memory: 2
VLIW
VLIW Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
According to [3], the TMS3206205-200 has two instructions which effect SIMD-like operation (ADD2, SUB2). However, broader support is not available (e.g., a MPY2) so SIMD is flagged as 0 for this processor.
Autocorrelation (factor of 8)
Complex Bit-Reverse
Complex Forward FFT (radix-2)
Complex Forward FFT (radix-4)
Mixed radix FFT (scaled, rounded, 16 bit)
Complex FIR Filter
FIR Filter
FIR Filter (radix 4)
FIR Filter (radix 8)
FIR Filter (symmetric)
IIR Filter (5 Coefficients)
IIR Filter (all pole lattice)
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
[1] TMS320C6205 Fixed Point Digital Signal Processor, SPRS106G, July 2006.
[2] Kyle Castille, TMS320C62x/C67x Power Consumption Summary, SPRA486c, July 2002. Available online: http://focus.ti.com/dsp/docs/dspsupporttechdocsc.tsp?sectionId=3&tabId=409&familyId=326&abstractName=spra486c
[3] TMS320C62x DSP CPU and Instruction Set Reference Guide, SPRU731, July 2006.
[4] TMS320C62x DSP Library Programmer’s Reference, SPRU402B, October 2003.
The following provides details on how parameters for the data file for the TMS320C6713B-300 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS320C6713B-300
Native Precision: 32
Instruction Bits: 32
Clock Rate (MHz): 300
[2] provides a spreadsheet for calculating power consumption. At 100% CPU activity and the remaining parameters left at their defaults, the spreadsheet gives a total power consumption of 1.76 W.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Float
Double
Saturate
Circular
Indexed
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction (excepting saturation and sign manipulation).
BPOS
Positive register values can be used as branch conditions (there is no actual BPOS instruction, but an equivalent conditional branch can be made).
Cond_Exec
All instructions on the TMS320C6713B-300 can be executed conditionally.
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
VLIW is at the heart of the TMS320C6713B-300 architecture. According to [1], it supports up to 8 simultaneous instructions executed over 6 ALUs and two multipliers. Two of these ALUs, however, are primarily used to support memory loading/storing operations and as such are scored as memory units instead. The assigned parameters are summarized as follows.
VLIW_flag: 1
VLIW_max: 8
VLIW Memory: 2
VLIW
VLIW Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
According to [3], the TMS320C6713B-300 has two instructions which effect SIMD-like operation (ADD2, SUB2). However, broader support is not available (e.g., a MPY2) so SIMD is flagged as 0 for this processor.
(16-bit fixed, single and double precision for all of the following)
Autocorrelation (factor of 8)
Complex Bit-Reverse
Complex Forward FFT (radix-2)
Complex Forward FFT (radix-4)
Mixed radix FFT (scaled, rounded, 16 bit)
Complex FIR Filter
FIR Filter
FIR Filter (radix 4)
FIR Filter (radix 8)
FIR Filter (symmetric)
IIR Filter (5 Coefficients)
IIR Filter (all pole lattice)
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
[1] TMS320C6713B Floating Point Digital Signal Processor, SPRS294B, June 2006.
[2] Ivan Garcia, TMS320C6711D, C6712D, C6713B Power Consumption Summary, SPRA889c, August 2005.
[3] TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide, SPRU733a, November 2006.
[4] TMS320C67x DSP Library Programmer’s Reference, SPRU657B, March 2006.
The following provides details on how parameters for the data file for the TMS320C6455-850 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS320C6455-850
Native Precision: 32
Instruction Bits: 32
Clock Rate (MHz): 850
The TMS320C6455-850 is a 64x+ architecture (some instructional differences between 64xx and 64x+ architectures)
[2] provides a spreadsheet for calculating power consumption. At 100% CPU activity and the remaining parameters left at their defaults, the spreadsheet gives a total power consumption of 2.20 W.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Saturate
Circular
Indexed
Reverse
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ADDSUB
Using its
AVG
Using the multiplier units, the TMS320C6455-850 supports an averaging operation which is equivalent to an addition and a right shift.
BDEC
Using its
BITR
Reverses the bits of a register in a single instruction.
BPOS
Using its
Cond_Exec
All instructions on the TMS320C6455 can be executed
conditionally.
EXTRACT
Using its multiplier units, the TMS320C6455-850 supports several operations which collectively simplify the process of extracting/packing bit fields into registers. In general, this is treated as cutting bit manipulation cycles in half, but the exact savings by supporting detailed bitfield manipulation operations will vary from component to component.
GMPY
The 6455 supports the calculation of Galois Field multiplications and dual 16x16 (and 8x8) MACs on its .M (multiplier) units.
MAX
Using its
ZOL
According to [3], the TMS320C6455-850 implements what TI calls the SPLOOP instruction (Software Pipelined Loop Buffer), wherein looped code is stored in a buffer and repeated. This eliminates the need for a conditional instruction to control looping on each iteration and reduces the number of program memory fetches (which reduces power). For the purposes of this project, the TMS320C6455-850’s SPLOOP capability will be treated like a zero-overhead-loop instruction.
VLIW Information
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
VLIW is at the heart of the TMS320C6455-850 architecture. According to [1], it supports up to 8 simultaneous instructions executed over 6 ALUs and two multipliers. Two of these ALUs, however, are primarily used to support memory loading/storing operations and as such are scored as memory units instead. The assigned parameters are summarized as follows.
VLIW_flag: 1
VLIW_max: 8
VLIW Memory: 2
VLIW
VLIW Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
According to [3] the TMS320C6455-850 supports a wide variety of SIMD instructions with a minimum word size of 8.
(16-bit fixed, single and double precision for all of the following)
Autocorrelation (factor of 8)
Various FIR/IIR Filters
Various FFT Implementations
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
Additionally, the TMS320C6455-850 has Viterbi and Turbo-Code
co-processors which are treated as benchmarked routines for the purposes of
this project.
[1] TMS320C6455 Fixed Point Digital Signal Processor, SPRS276H, October 2007.
[2] Gustavo Martinzez, TMS320C6455, C6455 Power Consumption Summary, SPRAAE8B, October 2007.
[3] TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide, SPRU732e, November 2007.
[4] TMS320C64x+ DSP Little-Endian DSP Library Programmer’s Reference, SPRUE8A, September 2007.
The following provides details on how parameters for the data file for the TMS320C6416T-850 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS320C6416T-850
Native Precision: 32
Instruction Bits: 32
Clock Rate (MHz): 850
The TMS320C6416T-850 is a 64xx architecture (some instructional differences between 64xx and 64x+ architectures)
[2] provides a spreadsheet for calculating power consumption. At 100% CPU activity and the remaining parameters left at their defaults, the spreadsheet gives a total power consumption of 1.74 W.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Indexed
Saturate
Circular
Reverse
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
(NO) ADDSUB
Unlike the C6455, the TMS320C6416T-850 does not implement ADDSUB.
AVG
Using the multiplier units, the TMS320C6416T-850 supports an averaging operation which is equivalent to an addition and a right shift.
BDEC
Using its
BITR
Reverses the bits of a register in a single instruction.
BPOS
Using its
Cond_Exec
All instructions on the TMS320C6416T can be executed conditionally.
EXTRACT
Using its multiplier units, the TMS320C6416T-850 supports several operations which collectively simplify the process of extracting/packing bit fields into registers. In general, this is treated as cutting bit manipulation cycles in half, but the exact savings by supporting detailed bitfield manipulation operations will vary from component to component.
(NO) GMPY
Unlike the C6455, the TMS320C6416T-850 does not implement Galois Field Multiplications.
MAX/MIN
Using its
(NO) ZOL
Unlike the C6455, the TMS320C6416T-850 does not implement SPLOOP.
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
VLIW is at the heart of the TMS320C6416T-850 architecture. According to [1], it supports up to 8 simultaneous instructions executed over 6 ALUs and two multipliers. Two of these ALUs, however, are primarily used to support memory loading/storing operations and as such are scored as memory units instead. The assigned parameters are summarized as follows.
VLIW_flag: 1
VLIW_max: 8
VLIW Memory: 2
VLIW
VLIW Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
According to [3] the TMS320C6416T-850 supports a wide variety of SIMD instructions with a minimum word size of 8.
Autocorrelation (factor of 8)
Various FIR/IIR Filters
Various FFT Implementations
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
Additionally, the TMS320C6416T-850 has Viterbi and
Turbo-Code co-processors which are treated as benchmarked routines for the
purposes of this project.
[1] TMS320C6416T Fixed Point Digital Signal Processor, SPRS226J, July 2006.
[2] Todd Hiers and Matthew Webster, TMS320C6414T/15T/16T Power Consumption Summary, SPRAA45, August 2004.
[3] TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide, SPRU732e, November 2007.
[4] TMS320C64x DSP Little-Endian DSP Library Programmer’s Reference, SPRU565b, October 2003.
The following provides details on how parameters for the data file for the TMS320VC5502-300 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS320VC5502-300
Native Precision: 16
Clock Rate (MHz): 300
The C55 provides two 17x17 bit MAC units, a 40-bit
Because of its data memory management unit and architectural implementation, all instructions can access data memory and manipulate data address registers, and local registers are generally not used for computation (the exceptions being two accumulators and a temporary register). As a result, cycles are not consumed by separate reads and writes; memory accesses are instead embedded in most instructions (though data addresses can be separately modified as needed). Note that in practice this has the effect of slowing down the overall clock rate, as more functions need to be accomplished in a single cycle. This behavior will be modeled as a cycle modifier that is synergistic with itself and eliminates all data memory cycles (e.g., target – target), which avoids miscalculations in data memory requirements.
Note that unlike the TI 6000 series processors, only branching-type instructions are executed conditionally.
[2] provides a spreadsheet for calculating power consumption. At 100% CPU activity and the remaining parameters left at their defaults, the spreadsheet gives a total power consumption of 488 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Saturate
Circular
Reverse
Indexed
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ABDST
Accumulates the absolute distance between two vectors, X, Y. Specifically, subtracts Y from X, evaluates the absolute value and accumulates the result with a stored result. However, this “instruction” uses two units and as such is already modeled in the VLIW estimate.
ADDSUB
Using its 40-bit
BPOS
Using its
EXTRACT
The TMS320VC5502 supports several operations (e.g.,
MAC
A single-cycle multiply and accumulate
NOREG
Because it memory maps all registers, there’s no need to use
instructions to move data from memory to the inputs of functional units. Note
that moving operands between internal registers (e.g., loading a counter or
shifting data between accumulators) still requires the use of a move operator.
ZOL
According to [3], the TMS320VC5502 implements a block-repeat operation.
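To make the ABDST modifier listed above concrete, the following is a minimal reference sketch in Python of the arithmetic it describes (subtract Y from X, take the absolute value, accumulate). The function name `abdst` and its argument names are illustrative only and are not taken from [3]; the sketch models the result, not the cycle behavior or the functional units involved.

```python
def abdst(x, y, acc=0):
    """Accumulate the absolute distance between vectors x and y.

    Mirrors the description in the text: for each element pair,
    subtract Y from X, take the absolute value, and add it to a
    running (stored) result.
    """
    for xi, yi in zip(x, y):
        acc += abs(xi - yi)
    return acc


# Example: |3-1| + |-1-1| + |4-1| = 2 + 2 + 3
distance = abdst([3, -1, 4], [1, 1, 1])
print(distance)
```

On a processor without this capability, each element would need separate subtract, absolute-value, and add steps, which is the saving the modifier is meant to capture.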
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
VLIW is not exactly natural to the TMS320VC5502-300 architecture, but it does support some computational parallelism by virtue of having three data paths, two MAC units and an ALU. However, only two instructions can be executed at a time. This will be modeled as follows.
VLIW_flag: 1
VLIW_max: 2
VLIW Memory: 2 (as it relates to move instructions)
VLIW
VLIW Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
According to [3] the TMS320VC5502-300 does not support SIMD.
(16-bit fixed, single and double precision for all of the
following)
Autocorrelation (factor of 8)
Various FIR/IIR Filters
Various FFT Implementations
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
Additionally, the TMS320VC5502-300 has Viterbi and
Turbo-Code co-processors which are treated as benchmarked routines for the
purposes of this project.
[1] TMS320VC5502 Fixed Point Digital Signal Processor Data Manual, SPRS166J, August 2006.
[2] Gustavo Martinez, TMS320VC5501/02 Power Consumption Summary,
[3] TMS320C55x DSP Mnemonic Instruction Set Reference Guide, SPRU374G, October 2002.
[5] TMS320C55x Technical Overview, SPRU393, February 2000.
The following provides details on how parameters for the data file for the TMS320VC5416-160 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: TMS320VC5416-160
Native Precision: 16
Clock Rate (MHz): 160
The C54 provides a 17x17 bit multiplier, a 40-bit
Because of its data memory management unit and architectural implementation, all instructions can access data memory and manipulate data address registers, and local registers are generally not used for computation (the exceptions being two accumulators and a temporary register). Consequently, cycles are not consumed in reading and writing data; these accesses are instead embedded in most instructions (though data addresses can be separately modified as needed). Note that in practice, this has the effect of slowing down the overall clock rate, as more functions need to be accomplished in a single cycle. To avoid miscalculations in data memory requirements, this behavior will be modeled as a synergistic cycle modifier (with itself) that eliminates all data memory cycles (e.g., target – target).
[3] estimates typical power consumption for the TMS320VC5416 at 160 mW. (No spreadsheets were available from TI for this processor.) Assuming the same static/dynamic power ratio seen in other processors, this implies a peak power rating of 270 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Saturation arithmetic
Circular addressing
Bit-reversed addressing (for FFTs)
Indexed addressing (e.g., *R1+R2 where value in R1 is loaded and then R1 is offset by R2)
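The indexed and circular addressing characteristics listed above can be illustrated with a short Python sketch of a post-modified address update. This is only a behavioral illustration; the function name `post_modify` and its `base`/`length` parameters are hypothetical and do not correspond to any register names on the device.

```python
def post_modify(addr, offset, base=None, length=None):
    """Return the address register value after a post-modify by `offset`.

    Plain indexed addressing: the value at `addr` is used, then the
    address register is offset by `offset`.
    Circular addressing: if `base` and `length` are given, the updated
    address wraps so it stays inside [base, base + length).
    """
    nxt = addr + offset
    if base is not None and length is not None:
        nxt = base + (nxt - base) % length
    return nxt


# Indexed: R1 = 100, R2 = 2 -> R1 becomes 102 after the access
print(post_modify(100, 2))
# Circular: 8-word buffer starting at 100; stepping 3 from 106 wraps to 101
print(post_modify(106, 3, base=100, length=8))
```

Without hardware support, the wrap test would cost extra compare and branch instructions on every access, which is why a programmer may assume this capability when writing filter code.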
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ABDST
Accumulates the absolute distance between two vectors, X, Y. Specifically, subtracts Y from X, evaluates the absolute value and accumulates the result with a stored result. However, this “instruction” uses two units and as such is already modeled in the VLIW estimate.
ADDSUB
Using its 40-bit
BPOS
Using its
MAX/MIN
Using its
MAC
A single-cycle multiply and accumulate
NOREG
Because it memory maps all registers, there’s no need to use
instructions to move data from memory to the inputs of functional units. Note
that moving operands between internal registers (e.g., loading a counter or
shifting data between accumulators) still requires the use of a move operator.
ZOL
According to [3], the TMS320VC5416-160 implements a block-repeat operation.
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
For the purposes of this project, VLIW will not be considered part of the TMS320VC5416-160 architecture although a handful of instructions can make use of the multiple computational units.
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
For the purposes of this project, the TMS320VC5416-160 architecture will not be considered to implement SIMD. There are some SIMD-like capabilities supported by the accumulators, but these are intended to be reflected in the ADDSUB modifier.
(16-bit fixed, single and double precision for all of the
following)
Autocorrelation (factor of 8)
Various FIR/IIR Filters
Various FFT Implementations
Vector Dot Product
Vector Dot Product (and Square)
Maximum Value of a Vector
Index of Maximum Element of Vector
Minimum Value of Vector
32-bit Vector Multiply
32-bit Vector Negate
16-bit Reciprocal
Sum of Squares
Weighted Vector Sum
Matrix Multiplication
Matrix Transpose
Max Exponent of a Vector
Move Block of Memory
Q15 to Float Conversion
Float to Q15 Conversion
Additionally, the TMS320VC5416-160 has Viterbi and
Turbo-Code co-processors which are treated as benchmarked routines for the
purposes of this project.
[1] TMS320VC5416 Fixed Point Digital Signal Processor Data Manual,
[2] TMS320C54x DSP Reference Set Volume 1: CPU and Peripherals, SPRU131G, March 2001.
[3] R. Chembil, J. Kim, J. Lee, D. Ha, C. Patterson, J. Reed, “Reconfigurable Modem Architecture for CDMA Based 3G Handsets,” SDR Forum 2005.
[4] TMS320C54x DSP Reference Set Volume 2: Mnemonic Instruction Set, SPRU172C, March 2001.
[5] TMS320C54x DSP Library Programmer’s Reference, SPRU518d, October 2004.
The following provides details on how parameters for the
data file for the
DSP ID:
Native Precision: 16
Clock Rate (MHz): 400 (only clock rate supported by BF532)
The
BF532 provides two 16-bit multipliers, two 40-bit accumulators, two 40-bit
[3] gives a typical dynamic internal current drain of 159 mA for a voltage of 1.2 V and frequency of 400 MHz (the configuration assumed in this document). That corresponds to a dynamic internal power consumption of 191 mW. Static current draw is estimated at 130 mA, for a static power consumption of 156 mW and a total internal estimate of 347 mW. Typical external power consumption is then estimated at 171 mW.
Peak power is then 347 mW (internal) + 171 mW (external) = 518 mW.
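The arithmetic behind these figures can be reproduced with the short Python sketch below. The variable names are illustrative; the numeric inputs (159 mA, 130 mA, 1.2 V, 171 mW) are the values quoted above, and each intermediate result is rounded to the nearest milliwatt as in the text.

```python
VDD_INT_V = 1.2          # internal supply voltage quoted above (volts)
DYNAMIC_MA = 159         # typical dynamic internal current (mA)
STATIC_MA = 130          # estimated static current (mA)
EXTERNAL_MW = 171        # typical external power quoted above (mW)

# P = I * V; mA times V gives mW directly
dynamic_mw = round(DYNAMIC_MA * VDD_INT_V)
static_mw = round(STATIC_MA * VDD_INT_V)
internal_mw = dynamic_mw + static_mw
peak_mw = internal_mw + EXTERNAL_MW

print(dynamic_mw, static_mw, internal_mw, peak_mw)
```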
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Rounding
Saturation arithmetic
Circular addressing
Bit reversed addressing (for FFTs)
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ADDSUB
Using its
BPOS
The
EXTRACT
The
MAX/MIN
Using its
MAC
A single-cycle multiply and accumulate
ZOL
According to [3], the
Very Long Instruction Word (VLIW) is a characteristic of a superscalar architecture wherein multiple instructions can be simultaneously executed across several functional units.
Technically, the BF532 is not considered to be a superscalar architecture. However, it can formally execute a 32-bit instruction with two 16-bit instructions. In practice, this generally means 2 arithmetic instructions and 2 data memory management instructions. Accordingly, this will be treated as having VLIW with:
2 Multipliers
2 ALUs
2 Memory
with a maximum of 4 instructions executed per cycle. Technically,
this will create somewhat misleading results as it is not possible to execute 2
multiplier instructions with 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
For the purposes of this project, the
FFTs
Convolutional
Encoders
Interpolation/Decimation
Filters
Digital
Modulation
[1]
[2]
[3] Joe
B., “Estimating Power for
[4] http://www.analog.com/processors/blackfin/technicalLibrary/manuals/codeExamples.html
The following provides details on how parameters for the
data file for the
DSP ID:
Native Precision: 64 (technically 32, but needed to fit into current SIMD model).
Clock Rate (MHz): 333 (only clock rate supported by
The
For a clock rate of 333 MHz, [3] gives a peak dynamic internal current drain of 848 mA and a static current drain of 550 mA which corresponds to 1.68 W at a voltage of 1.2 V. [3] then gives an average external power consumption value of 99 mW. This gives a total power consumption of 2.67 W.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Float
Rounding
Saturation arithmetic
Circular addressing
Bit reversed addressing (for FFTs)
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ADDSUB
Using its
AVG
Using its
BPOS
The
Cond_Exec
Instructions are executed based on evaluation of a condition
– simplifies short control structures. Many instructions on the
EXTRACT
Using its shifters, the
MAX
Using its
MAC
A single-cycle multiply and accumulate.
ZOL
According to [3], the
Each processing element can execute a multiplication
instruction with an
VLIW_max: 4
Memory: 2
Mult: 1
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
The
To model this, we’re treating the architecture as a native 64-bit architecture with a minimum word size of 32 bits.
Block FIR
Complex FIR
Complex Vector Add
Vector Complex Dot Product
Complex Radix 2
FFT
Complex Radix-4
FFT
Single Sample FIR
Vector Maximum
[1]
[2]
[3] C.
Coughlin, “Estimating Power for the
[4] http://www.analog.com/processors/platforms/sharcSoftwareModules.html
The following provides details on how parameters for the data
file for the
DSP ID:
Native Precision: 64 (32x32 bit multiplication is largest multiplication, but can do 4 16x16 for 8 different 16 bit input words)
Clock Rate (MHz): 600
The
Up to four instructions can be executed per cycle, which can
be executed conditionally. Combined with SIMD, this means that the
Using the spreadsheet provided with [3] and keeping all
defaults except changing
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Float
Rounding
Saturation arithmetic
Circular addressing
Bit reversed addressing (for FFTs)
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ABSALU
Executes a typical
ADDSUB
Using its
AVG
Using its
BPOS
The
COND_EXEC
Instructions are executed based on evaluation of a condition
– simplifies short control structures. Many instructions on the
CPX_MPY
Implements a single cycle complex multiplication. Can save
several cycles.
EXTRACT
Using its shifters, the
MAX
Using its
MAC
A single-cycle multiply and accumulate.
Adds up bytes in a word (can save accumulation cycles).
MEM2
Because of its 128 bit bus, only half as many cycles need to be consumed in memory operations as would be expected from the “native” 64 bit size.
ZOL
According to [3], the
The
VLIW_max: 4
Memory: 2
Mult: 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
Typical word size of 64 bits
Minimum word size of 8 bits.
Floating point FIR, IIR
32-bit FFT (real, complex, float)
16-bit Fixed FFT (256,512)
[1]
[2]
[3] Greg
F., “Estimating Power for
[4] http://www.analog.com/processors/tigersharc/technicalLibrary/codeExamples/codeExamples.html
The following provides details on how parameters for the
data file for the
DSP ID:
Native Precision: 16
Clock Rate (MHz): 160
The
Analog Devices did not have a power estimate for the
0.5 dynamic + static = 426 mW
dynamic = 1.54 static
static = 241 mW
peak dynamic = 371 mW
This gives a total power consumption of 711 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Rounding
Saturation arithmetic
Circular addressing
Bit reversed addressing (for FFTs)
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The
COND_EXEC
Instructions are executed based on evaluation of a condition
– simplifies short control structures. Many instructions on the
MAC
A single-cycle multiply and accumulate.
ZOL
Though it doesn’t call it as such, the
Each processing element can execute a multiplication
instruction with an
VLIW_max: 3
Memory: 2
Mult: 1
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width.
The
FFT
Discrete Cosine Transform
GSM Codec
IIR
Linear Predictive Coding
Biquad IIR filter
FIR
Transversal
Filter
DTMF
encoder/decoder
Sine approximation
Companding
Division
Arctangent
Log
Square root
Euclidean distance
[1]
[2]
[3] http://www.bdti.com/procsum/adi219x.htm
[4] http://www.analog.com/processors/adsp/technicalLibrary/codeExamples/applicationsHandbook.html
DSP ID: P4-661
Native Precision: 128 (enabled with SSE2 which extends all MMX 64-bit ops to 128-bits)
Clock Rate (MHz): 3600
The P4-661 is a member of the Pentium IV family. This means it supports the following technologies:
Note this means it does not support Virtualization Technology (which came in with 662), SSE4, and multi-core technologies. In terms of instruction sets (see Table 5-1 in [2]), this means the P4-661 supports the following instruction sets:
The
NetBurst Microarchitecture supports branch prediction, superscalar operation
(three instructions can be retired per cycle), out-of-order execution cores,
supports fixed and floating-point processing, and binary coded decimal. It
consists of 2 fast ALUs (clock at twice main clock rate), 1 slow
SIMD operations are somewhat different than other processors’ SIMD operations in that there can be some staggered latency in computing all of the results (See http://swox.com/doc/x86-timing.pdf). However, in general, the P4-661’s SIMD pipelined throughput is approximately similar to what we get from other processors. Note that SIMD multiplication only goes down to 16 bits, so we will model the processor as having a minimum SIMD width of 16-bits, even though it can do a SIMD add down to 8-bit words (e.g., PADDB). As the SSE2 instructions extend these SIMD operations to work on the 128 bit XMM registers, significant parallelism is possible from SIMD.
Hyper-threading is used to manage multiple threads by
maintaining independent architectural states (thereby effecting multiple
logical cores) within the same processor core. Each logical processor consists
of a full set of IA-32 data registers (including interrupt and debug registers)
so multiple local copies of local processor memory are maintained, but the
actual execution resources are not increased and are instead shared between the
logical processors. Whether or not hyperthreading improves efficiency is
irrelevant to this project (there are arguments on both sides and it likely
depends on the specific application): for our purposes hyperthreading will
not impact execution predictions, as the modules we map are assumed to keep
execution resources fully occupied. It should also be noted that the shared
caching used in hyperthreading can be used by a malicious thread to steal
cryptographic keys from another thread.
[2] gives a peak power measurement (thermal design power) of
86 W or 86,000 mW.
Fixed
Float
Branch prediction
Rounding
Saturation
Binary coded decimal
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded
and then R1 is offset by R2; called “base-plus index addressing” by Intel)
AVG
The P4-661 supports an averaging operation in its SIMD unit.
BPOS
The P4-661 supports numerous different conditional branches.
COND_MOV
All memory operations can be executed conditionally
EXTRACT
Due to its implementation of various bit/byte manipulation instructions (e.g., BSF, BTS), the P4-661 will be considered as implementing the EXTRACT capability.
The P4-661 supports a unique loop structure in which the loop instruction compares the loop counter with zero, branches as needed and decrements the counter.
MAX
In its SIMD unit, the P4-661 supports instructions whereby
words are compared and the greater (lesser) is stored.
SAD
Sum of absolute differences. Useful in a vector distance metric.
VECT
The P4-661 has an instruction (PMADD) wherein pairs of words are multiplied and added. For four pairs of words, 2 results are made. This is slightly different from a complex multiplication and is more akin to an operation to support a dot product of two vectors.
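As an illustration of the VECT modifier, the following Python sketch models the arithmetic described above for the 64-bit case: four pairs of 16-bit words are multiplied and adjacent products are summed, giving two results. The function name `pmadd` is illustrative, and the sketch deliberately ignores 32-bit overflow wrap-around, so it should be read as a behavioral approximation rather than a bit-exact model of the instruction.

```python
def pmadd(a, b):
    """Multiply four pairs of 16-bit words and add adjacent products.

    a, b: sequences of four signed 16-bit words (one operand each).
    Returns two results: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    Overflow wrap-around of the 32-bit results is not modeled.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four words per operand")
    products = [x * y for x, y in zip(a, b)]
    return [products[0] + products[1], products[2] + products[3]]


partial = pmadd([1, 2, 3, 4], [5, 6, 7, 8])
print(partial)        # two partial sums
print(sum(partial))   # one more add completes the 4-element dot product
```

The final `sum` shows why the operation is closer to dot-product support than to a complex multiply: the two partial sums still need one addition to form the full dot product.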
The P4-661 can issue four instructions in a cycle and each
of its fast ALUs can complete two operations per clock cycle. However, the only
factor that matters is that only 3 instructions can complete per cycle. Note
that only one unit is available for SIMD arithmetic operations (and one for
SIMD memory operations). This will have a bigger impact on multiplications than
VLIW_flag: 1
VLIW_max: 3
VLIW Memory: 2
VLIW
VLIW Mult: 1
Single-Instruction-Multiple-Data (SIMD) is a characteristic
of some DSP architectures wherein a single instruction can be simultaneously
applied to two or more words smaller than the native data width. The P4-661
implements SIMD via its MMX and SSE extensions which allow operations down to
the byte level on 128-bit XMM registers for additions and down to the integer word
level (16-bit) for multiplications. Because of the importance of
multiplications to our process, we’ll model the P4-661 as supporting SIMD down
to a minimum word size of 16 with a maximum width of 128. Note that every SIMD
instruction takes multiple cycles to complete, but these are pipelined so that
in general a throughput of one SIMD instruction per cycle is achieved (see http://www.tommesani.com/P4MMX.html)
SIMD_flag: 1
SIMD_min: 16
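One way to read the SIMD parameters above is as a bound on how many words a single SIMD instruction can process. The Python sketch below shows that interpretation; it is an assumption made for illustration, since the tool's actual calculation is not described here, and the function name `simd_lanes` and its parameters are hypothetical.

```python
def simd_lanes(word_bits, native_bits=128, simd_min=16):
    """Words processed per SIMD instruction under the assumed model.

    Words narrower than `simd_min` are treated as `simd_min` wide,
    because the model does not credit any further parallelism below
    the minimum SIMD word size.
    """
    effective_bits = max(word_bits, simd_min)
    if effective_bits > native_bits:
        raise ValueError("word is wider than the native precision")
    return native_bits // effective_bits


for bits in (8, 16, 32, 64):
    print(bits, simd_lanes(bits))
```

Under this reading, 8-bit data is credited with the same eight-way parallelism as 16-bit data, which reflects the decision above to set the minimum width by the multiplication instructions rather than by the additions.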
There exists standard library code, but it’s not focused on DSP-type operations (e.g., memcpy, strlen).
[1] Intel® Pentium® 4 Processor 6x1Δ Sequence Datasheet, Document Number: 310308-002, December 2006.
[2] http://processorfinder.intel.com/Details.aspx?sSpec=SL94V
[3] Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture
[4] Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M, February 2008.
[5] Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2B: Instruction Set Reference, N-Z, February 2008.
G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, P. Roussel, The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal Q1, 2001, pp. 1-13
DSP ID: MPC8540
Native Precision: 64 (really 32, but needed to model the way registers and data accesses are implemented)
Clock Rate (MHz): 1000 (this is the highest value)
The MPC8540 combines a 32-bit PowerPC processor core (e500)
with extensive peripherals for supporting networking protocols (e.g., RapidIO,
ethernet). For our purposes, we’re concerned with the processing core – the
e500. The e500 implements an enhancement on the PowerPC Book E instruction set
architecture and can complete 2 instructions per clock cycle as implemented
over its five functional units – a branch unit, a load/store unit, a multiple
cycle unit, and two single cycle units. For our purposes, the two single-cycle
units will be modeled as
The three computational units are extended via a number of
auxiliary processing units (
The e500 is a RISC architecture which has taken significant
steps to provide capabilities to synchronize operations and limit memory race
conditions. Additionally, most instructions can be completed in a single cycle,
and those that cannot are pipelined. Other than some initial latency
(which we’re also ignoring for other processors), it is likely that most of the
timing irregularities of RISC processors will not be experienced on the MPC8540
(we cannot be certain without actual hand coding, however).
Additionally, the SPE APU defines instructions which permit 32-bit
SIMD operation on 64-bit registers for effectively all multiplication, memory,
and
Table 4 in [1] gives a peak power consumption of 15,900 mW according to the configuration assumed in this mapping (1000 MHz).
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Float
RISC
Branch prediction
Rounding
Saturation
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Reverse
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The MPC8540 supports an operation whereby control branches if a register is positive (equivalent to a branch and a comparison).
BDEC
The MPC8540 supports an operation whereby control branches
if the counter register is positive and the counter is decremented. To make this
behavior conform to other modeled processors, it is broken into BPOS and BDEC steps.
BITR
Reverses the bits of a register in a single instruction. Technically, however, the MPC8540 is implementing this as a bit-reversed carry instruction.
EXTRACT
Detailed bit manipulation is possible via various rotate and mask insert instructions.
ISEL
An instruction wherein one of two different registers is moved into a register based on a condition bit (set previously – so unlike a MAX operation).
MAC
A single-cycle multiply and accumulate.
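For the BITR modifier in the list above, the following Python sketch shows what reversing the bits of an index means and why it matters for FFTs. It models only the result of a bit reversal over a chosen width; it does not model the bit-reversed carry mechanism, and the function name `bit_reverse` is illustrative.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # move the LSB of value into result
        value >>= 1
    return result


# Reordering of indices for an 8-point FFT (3-bit indices)
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

Done in software, the loop above costs several instructions per index, which is the overhead a single-instruction reversal removes.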
The MPC8540 can complete two instructions in a cycle and has two ALUs, a multiplier, and a data memory unit. This will be modeled as follows.
VLIW_flag: 1
VLIW_max: 2
VLIW Memory: 1
VLIW
VLIW Mult: 1
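A simple way to see how these VLIW parameters could constrain a cycle estimate is the resource bound sketched below in Python. This is an assumed interpretation for illustration only: the tool's actual scheduling model is not described in this document, the function name `min_cycles` is hypothetical, and the ALU count of 2 is taken from the sentence above stating that the MPC8540 has two ALUs.

```python
from math import ceil


def min_cycles(n_alu, n_mult, n_mem,
               vliw_max=2, alu_units=2, mult_units=1, mem_units=1):
    """Lower bound on cycles for a block of independent instructions.

    The bound is the largest of:
      - total instructions divided by the per-cycle completion limit
      - instructions of each type divided by the units of that type
    Data dependencies and latencies are ignored.
    """
    total = n_alu + n_mult + n_mem
    return max(
        ceil(total / vliw_max),
        ceil(n_alu / alu_units),
        ceil(n_mult / mult_units),
        ceil(n_mem / mem_units),
    )


# 4 ALU ops, 2 multiplies, 3 memory ops: limited by the 2-per-cycle cap
print(min_cycles(4, 2, 3))
# 1 multiply and 4 memory ops: limited by the single memory unit
print(min_cycles(0, 1, 4))
```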
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. The MPC8540 implements SIMD via its SPE APU wherein 64 bit registers are treated as pairs of 32-bit registers and the same operation is performed. This will be modeled as follows.
SIMD_flag: 1
SIMD_min: 32
There exists standard library code, but it’s not focused on DSP-type operations and instead does GPP-type operations (e.g., memcpy, strlen).
[1] MPC8540 Integrated Processor Hardware Specifications, MPC8540EC Rev. 4.1, July 2007.
[2] MPC8540 PowerQUICC III™ Integrated Host Processor Reference Manual, MPC8540RM Rev.
[3] PowerPC™ e500 Core Family Reference Manual, E500
[4] Book E: Enhanced PowerPC™ Architecture Version 1.0, May 7, 2002
[5] EREF: A Reference for Freescale Book E and the e500 Core, Revision 2.0, January 2004.
The following provides details on how parameters for the data file for the DSP56321 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: DSP56321
Native Precision: 24
Clock Rate (MHz): 275
The DSP56321 combines the DSP56300 core with an Enhanced
Filter Coprocessor. The DSP56300 core includes a Data ALU unit which has a MAC,
bit field unit, and an ALU, and a data address generation unit which supports
two word loads in parallel to an operation by the Data
Note that the enhanced filter coprocessor [3] runs at the same clock rate as the DSP56321 and supports various filter structures. However, it does not support instruction based processing and thus will be treated as a coprocessor which means that its cycle benefits would only appear in benchmarked routines.
[4] states that the DSP56371 (same core) typically consumes about 124 mW in terms of core and memory consumption. However the DSP56371 operates at 180 MHz and 1.25 V and the DSP56321 operates at 275 MHz at 1.6 V. The DSP56371 measurement can be applied to the DSP56321 by scaling (275/180)*(1.6/1.25)^2 or about 2.5. Thus the typical internal core / memory power can be estimated as 2.5*124 mW = 310 mW.
This, however, omits the power consumption of the Enhanced Filter Coprocessor. Loosely, we’ll model the coprocessor as consuming the same fraction of internal power as the floating point coprocessor of the ARM1136J(F)-S, which we had estimated at 17% of internal power, or 310*0.17 = 53 mW, for a total of 363 mW.
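The scaling used above (power proportional to clock frequency and to the square of the supply voltage) and the coprocessor allowance can be reproduced with the short Python sketch below. The function and variable names are illustrative; the numeric inputs are the values quoted in the text, and the scale factor is rounded to 2.5 as in the text before it is applied.

```python
def scale_factor(f_ref_mhz, v_ref, f_mhz, v):
    """Scale factor for power when moving from (f_ref, v_ref) to (f, v).

    Assumes power is proportional to frequency times voltage squared.
    """
    return (f_mhz / f_ref_mhz) * (v / v_ref) ** 2


# DSP56371 reference point: 180 MHz at 1.25 V; DSP56321 target: 275 MHz at 1.6 V
factor = scale_factor(180, 1.25, 275, 1.6)
core_mw = round(round(factor, 1) * 124)   # 124 mW measured on the DSP56371
efcop_mw = round(core_mw * 0.17)          # coprocessor modeled as 17% of internal power
typical_mw = core_mw + efcop_mw

print(round(factor, 2), core_mw, efcop_mw, typical_mw)
```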
Assuming this represents 50% loading and the 99mW of the
0.5 dynamic + static = 363 mW
dynamic = 1.54 static
static = 205 mW
peak dynamic = 316 mW
This gives a total power consumption of 620 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [1].
Fixed
Round
Saturation arithmetic
Circular
Reverse (called reverse-carry in Freescale lingo)
Internal Scaling (automated shifting)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The DSP56321 supports conditional execution of all instructions, including on positive values.
COND_EXEC
Instructions are executed based on evaluation of a condition – simplifies short control structures. The DSP56321 supports conditional execution of its operations; however, this cannot be combined with parallel memory accesses. This restriction will be ignored.
INDEX
Indexed addressing, e.g., *R1++R2 where value in R1 is
loaded and then R1 is offset by R2.
MAC
A single-cycle multiply and accumulate.
VSL (Viterbi Shift Left)
A process by which a register is shifted left and an input 1 or 0 is appended to the right most bit. Useful for keeping track of paths (saves an instruction).
ZOL
According to [3], the DSP56321 implements hardware do-loops which we’ll model as implementing zero-overhead looping.
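The VSL modifier in the list above can be illustrated with the following Python sketch of the operation as described in the text: shift a register left and append an input bit at the rightmost position. The function name is illustrative, the 24-bit width is assumed from the native precision given earlier, and the sketch does not attempt to model the actual instruction's operands or memory behavior.

```python
def viterbi_shift_left(reg, bit, width=24):
    """Shift `reg` left by one and append `bit` as the new LSB.

    The result is truncated to `width` bits (24 assumed here).
    """
    mask = (1 << width) - 1
    return ((reg << 1) | (bit & 1)) & mask


# Track a survivor path by appending decision bits 1, 0, 1, 1
path = 0
for decision in (1, 0, 1, 1):
    path = viterbi_shift_left(path, decision)
print(bin(path))
```

Without a combined operation, the shift and the bit insertion would each take an instruction, which is the single-instruction saving noted above.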
The DSP56321 can do two memory accesses while performing an arithmetic operation. This will be modeled as follows.
VLIW_flag: 1
VLIW_max: 3
VLIW Memory: 2
VLIW
VLIW Mult: 1
This is slightly misleading as it implies that a MULT,
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. The DSP56321 does not support SIMD.
Appendix B in [2] provides numerous programs for
benchmarking the operation of a DSP in the DSP56300 family, but does not include
execution statistics to go with them.
[1] DSP56321 Technical Data, Rev. 11, 2/2005
[2] DSP56300 Family Manual: 24-Bit Digital Signal Processors, DSP56300FM Rev. 5, April 2005
[3] T. Redheendran, Programming the DSP56300 Enhanced Filter Coprocessor (EFCOP), APR39, Rev 1, August 2005.
[4] http://www.bdti.com/procsum/mot563xx.htm
The following provides details on how parameters for the data file for the DSP56F8367 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: DSP56F8367
Native Precision: 16
Clock Rate (MHz): 60
The DSP56F8367 is built around the DSP56800E core which
includes a Data ALU unit which has a MAC, bit field unit, and an
Unlike the DSP56321, it does not implement the VSL instruction [2], has no filter coprocessor, does not support bit-reversed addressing, and does not support general conditional execution of operations (though moves and branches can be conditional).
[3] states that the DSP56F8011 (which uses the same core as the DSP56F8367) consumes a maximum of 59 mA. According to [4], the DSP56F8011 operates at a clock rate of 32 MHz and a voltage of 3.6 V for a total power of 212 mW. [5] notes that the DSP56F8367 operates at 2.5 V and a clock rate of 60 MHz. Scaling by (60/30)*(2.5/3.6)^2 gives 205 mW.
Assuming this represents 50% loading and the 99mW of the
0.5 dynamic + static = 205 mW
dynamic = 1.54 static
static = 116 mW
peak dynamic = 179 mW
This gives a total power consumption of 394 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [2].
Fixed
Round
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Circular (called modulo addressing)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The DSP56F8367 supports conditional branching on positive values.
MAC
A single-cycle multiply and accumulate.
ZOL
According to [2], the DSP56F8367 implements hardware do-loops which we’ll model as implementing zero-overhead looping.
The DSP56F8367 can do two memory accesses while performing an arithmetic operation. This will be modeled as follows.
VLIW_flag: 1
VLIW_max: 3
VLIW Memory: 2
VLIW
VLIW Mult: 1
This is slightly misleading as it implies that a MULT,
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. The DSP56F8367 does not support SIMD.
Appendix B in [2] provides numerous programs for benchmarking the operation of a DSP in the DSP56300 family, but does not include execution statistics to go with them.
[1] 56F8367/56F8167 Data Sheet, Rev 8, January 2007.
[2] DSP56800E Reference Manual, DSP56800ERM, Rev. 2.16, Nov 2005.
[3] C. Wu, “Understanding of 56F800E DSC Motor Control Peripherals and Applications,” PZ308, November 2007. Available online: http://mcuol.com/download/upfile/PZ308.pdf
[4] http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=56F8011&nodeId=01279562921379
[5] http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=56F8367&nodeId=01279562921379&tab=Buy_Parametric_Tab&fromSearch=false#2
The following provides details on how parameters for the data file for the MSC7116 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: MSC7116
Native Precision: 16
Clock Rate (MHz): 266
The MSC7116 is based on a single SC1400 core for which little direct documentation exists. However as noted in [2]:
“The SC140 and SC1400 cores are functionally identical, and the information in this document applies to both cores. For simplicity, the SC140 core is referenced throughout this application note”
Thus characteristics are taken from the SC140 reference manual [3] as opposed to a non-existent SC1400 reference manual.
The SC140 has 4 Data ALUs, each of which contains an
The data sheet [1] provides an example power calculation which gives the following values:
Peripheral power = 15.3 mW
Static (leakage) power = 64 mW
External (memory) power = 326.3 mW
Core power = 287 mW
Taking this core value to be “typical” (i.e., 50%), we double for a peak of 574 mW for a total of 1052 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [2].
Fixed
Round
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Circular (called modulo)
Reverse (called reverse carry)
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ADDSUB
The SC140 includes instructions (e.g., ADD20 which we’ll
model as an A
BPOS
Using its
EXTRACT
Using its multiplier units, the SC140 supports several operations which collectively simplify the process of extracting/packing bit fields into registers. In general, this is treated as cutting bit manipulation cycles in half, but the exact savings by supporting detailed bitfield manipulation operations will vary from component to component.
An instruction which, if true, results in execution of the remaining instructions in a group. This will not be modeled because, in the current model, VLIW plus a branching instruction effects that behavior.
MAC
A single-cycle multiply and accumulate.
MAX
Using its
MAX2
Like the MAX operation, but performed on two word pairs simultaneously.
MEM4
The 64-bit loads effectively act as a quad load.
VSL (Viterbi Shift Left)
A process by which a register is shifted left and an input 1 or 0 is appended to the rightmost bit. Useful for keeping track of paths (saves an instruction).
ZOL
According to [3], the SC140 implements hardware do-loops which we’ll model as implementing zero-overhead looping.
Each of the 4 data ALUs and the 2 data address generator units can operate independently.
VLIW_flag: 1
VLIW_max: 6
VLIW Memory: 2
VLIW
VLIW Mult: 4
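One way to read these parameters is as per-cycle issue limits: at most VLIW_max operations in total, of which at most VLIW Memory are memory accesses and at most VLIW Mult are multiplies. The Python sketch below checks whether a group of operations fits in one cycle under that reading. The class `VliwLimits`, the function `fits_in_one_cycle`, and the treatment of all remaining operations as limited only by the total are assumptions made for illustration; the tool's actual scheduling rule is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VliwLimits:
    """Per-cycle issue limits, mirroring VLIW_max, VLIW Memory and VLIW Mult."""
    max_ops: int
    max_memory: int
    max_mult: int


def fits_in_one_cycle(limits, memory_ops, mult_ops, other_ops=0):
    """Return True if the requested operations can issue in a single cycle."""
    total = memory_ops + mult_ops + other_ops
    return (
        total <= limits.max_ops
        and memory_ops <= limits.max_memory
        and mult_ops <= limits.max_mult
    )


# Values listed above: VLIW_max 6, VLIW Memory 2, VLIW Mult 4.
sc140 = VliwLimits(max_ops=6, max_memory=2, max_mult=4)
print(fits_in_one_cycle(sc140, memory_ops=2, mult_ops=4))  # True
print(fits_in_one_cycle(sc140, memory_ops=3, mult_ops=2))  # False
```

The second call fails only because it requests three memory accesses, even though the total of five operations is within VLIW_max.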
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. The SC140 does not support SIMD.
None available??
[1] MSC7116 Document Number: MSC7116 Rev. 11, July 2007.
[3] SC140 DSP Core Reference Manual, Revision 4.1, September 2005.
[4] MSC711X Reference Manual: MSC711xRM Rev. 1, November 2006.
[5] D. Simon, MSC711x Overview, AN3056, Rev. 0, December 2005.
The following provides details on how parameters for the data file for the MSC8101 were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: MSC8101
Native Precision: 16
Clock Rate (MHz): 300
The MSC8101 is based on a single SC140 core. The SC140 has 4
Data
The MSC8101 includes an enhanced filter coprocessor [3] which runs at the same clock rate as the MSC8101 and supports various filter structures. However, it does not support instruction-based processing and thus will be treated as a coprocessor, which means that its cycle benefits would only appear in benchmarked routines. Additionally, the MSC8101 contains dedicated circuitry for most network interfaces (e.g., IP, ATM, Ethernet).
The data sheet [1] provides an example power calculation assuming a core clock rate of 300 MHz and gives the following values:
Core = 450 mW
PCM = 163 mW
PSIU = 41 mW
PI/O = 67 mW
Taking this core value to be “typical” (i.e., 50%), we double for a peak of 900 mW for a total of 1171 mW.
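The arithmetic behind this total can be written out as a short Python check. The variable names are illustrative; the values are the ones quoted above from the data sheet example, and the doubling of the core figure reflects the assumption that it represents 50% loading.

```python
# Example values from the data sheet calculation quoted above (mW).
core_typical_mw = 450
other_mw = {"PCM": 163, "PSIU": 41, "PI/O": 67}

# Assumption from the text: the core figure represents 50% loading,
# so the peak core power is twice the typical value.
core_peak_mw = 2 * core_typical_mw
total_mw = core_peak_mw + sum(other_mw.values())

print(core_peak_mw, total_mw)  # 900 1171
```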
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
ADDSUB
The SC140 includes instructions (e.g., ADD2 which we’ll
model as an A
BPOS
Using its
EXTRACT
Using its multiplier units, the SC140 supports several operations which collectively simplify the process of extracting/packing bit fields into registers. In general, this is treated as cutting bit manipulation cycles in half, but the exact savings by supporting detailed bitfield manipulation operations will vary from component to component.
An instruction which, if true, results in execution of the remaining instructions in a group. This will not be modeled because, in the current model, VLIW plus a branching instruction effects that behavior.
MAC
A single-cycle multiply and accumulate.
MAX
Using its
MAX2
Like the MAX operation, but performed on two word pairs simultaneously.
MEM4
The 64-bit loads effectively act as a quad load.
VSL (Viterbi Shift Left)
A process by which a register is shifted left and an input 1 or 0 is appended to the rightmost bit. Useful for keeping track of paths (saves an instruction).
ZOL
According to [2], the SC140 implements hardware do-loops which we’ll model as implementing zero-overhead looping.
Each of the 4 data ALUs and the 2 data address generator units can operate independently.
VLIW_flag: 1
VLIW_max: 6
VLIW Memory: 2
VLIW
VLIW Mult: 4
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. The SC140 does not support SIMD.
None available??
[1] MSC8101 Data Sheet Rev. 18, August 2005.
[2] SC140 DSP Core Reference Manual Revision 4.1, September 2005.
[3] MSC8101 Reference Manual: MSC8101RM Rev. 4, December 2005.
The following provides details on how parameters for the data file for the ARM968E-S were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: ARM968E-S
Native Precision: 32
Clock Rate (MHz): 530
The ARM968E-S is a member of the ARM9 Thumb family and implements the ARMv5TE architecture. It supports the 32-bit ARM instruction set and the 16-bit Thumb instruction set. While the number of cycles required to execute an instruction depends on the instructions that follow (interlock), we’ll in general ignore this except for the fact that most cycle savers do not actually save cycles (e.g., the double word store instruction takes 2 cycles instead of 1).
In general, it doesn’t support detailed bit manipulation, but it does allow you to count leading zeros.
For operation at 530 MHz, [3] gives a core power consumption of 0.11 mW/MHz as typical power consumption or 58.3 mW. Assuming this represents 50% loading and that the 99 mW of the
0.5 dynamic + static = 58.3 mW
dynamic = 1.54 static
static = 32.9 mW
peak dynamic = 50.7 mW
This gives a total power consumption of 183 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [3].
Fixed
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
RISC
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The ARM968E-S supports conditional execution of its branches.
COND_EXEC
Instructions are executed based on evaluation of a condition – simplifies short control structures. Many instructions on the ARM968E-S can be conditionally executed, so it is modeled as supporting conditional execution.
MAC
A “single-cycle” multiply and accumulate. Note the instructions on ARMs are generally not single-cycle, but we’re not modeling to that level of detail.
In ARM-speak, VLIW is called Instruction Level Parallelism (ILP). The ARM 968E-S does not support ILP.
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. ARM began to support SIMD with ARM11.
None available??
[1] ARM9E-S Core™ Revision: r2p1 Technical Reference Manual, July 2004, DDI 0240B.
[2] ARM Architecture Reference Manual, July 2000, DDI0100E.
[3] http://www.arm.com/products/CPUs/ARM968E-S.html
[4] J. Goodacre, A. Sloss, “Parallelism and the ARM instruction set architecture,” Computer, Volume 38, Issue 7, July 2005, pp. 42-50.
The following provides details on how parameters for the data file for the ARM1020E were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: ARM1020E
Native Precision: 32
Clock Rate (MHz): 325
The ARM1020E is a member of the ARM10 Thumb family and implements the ARMv5TE architecture. It supports the 32-bit ARM instruction set and the 16-bit Thumb instruction set. While the number of cycles required to execute an instruction depends on the instructions that follow (interlock), we’ll in general ignore this except for the fact that most cycle savers do not actually save cycles and that [1] says “it delivers a peak throughput approaching one instruction per cycle.”
For our modeling purposes the primary difference between the ARM1020E and the ARM968E-S is the independent load/store unit which can load/store 2 registers per cycle in parallel to data processing.
For operation at 325 MHz, [2] gives a core power consumption of 0.6 mW/MHz as typical power consumption or 195 mW. Assuming this represents 50% loading and the 99 mW of the
0.5 dynamic + static = 195 mW
dynamic = 1.54 static
static = 110 mW
peak dynamic = 170 mW
This gives a total power consumption of 379 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [1].
Fixed
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
RISC
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The
Conditional Execution
Instructions are executed based on evaluation of a condition – simplifies short control structures. All instructions on the ARM1020E can be conditionally executed, so it is modeled as supporting conditional execution.
MAC
A “single-cycle” multiply and accumulate. Note the instructions on ARMs are generally not consistently single-cycle (depends on following instruction), but we’re not modeling to that level of detail.
In ARM-speak, VLIW is called Instruction Level Parallelism (ILP). The ARM1020E does not support ILP, but the load/store unit can operate independently of the Integer processing unit. This is modeled as follows.
VLIW_flag: 1
VLIW_max: 3
VLIW Memory: 2
VLIW
VLIW Mult: 1
This is slightly misleading as it implies that a MULT,
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. ARM began to support SIMD with ARM11.
None available??
[1] ARM9E-S Core™ Revision: r1p7 Technical Reference Manual.
[2] http://www.arm.com/products/CPUs/ARM1020E.html
[3] J. Goodacre, A. Sloss, “Parallelism and the ARM instruction set architecture,” Computer, Volume 38, Issue 7, July 2005, pp. 42-50.
The following provides details on how parameters for the data file for the ARM1136J(F)-S were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: ARM1136J(F)-S
Native Precision: 32
Clock Rate (MHz): 620
The ARM1136J(F)-S supports the ARMv6 instruction set, the Thumb instruction set, and the Jazelle (Java) instruction set (the J in ARM1136J(F)-S). Additionally, it supports DSP extensions in which SIMD is supported (down to 16 bits). The basic ARM consists of an ALU, a MAC, a shifter, and a load/store unit which can load/store two words per cycle independently of the ALU/MAC operation.
Additionally, it has a dedicated vector floating-point coprocessor (the F in ARM1136J(F)-S). The VFP11 has 3 independent pipelines: a MAC pipeline, a load/store pipeline, and a divide and square root pipeline. These are single-cycle pipelines (except for divide and square root), and the load/store pipeline can supply two single-precision operands. The divide and square root pipeline executes two instructions, divide and square root. This pipeline will be treated as a coprocessor, potentially modeled in benchmarked functions, as none of our applications require these instructions.
The combination of ARM and VFP11 will be treated as a VLIW architecture capable of supporting 4 simultaneous memory operations and 2 MAC and/or 2 ALU operations.
For operation at 620 MHz, [2] gives a core power consumption of 0.45 mW/MHz as typical power consumption or 279 mW. Assuming this represents 50% loading and the 99 mW of the
0.5 dynamic + static = 279 mW
dynamic = 1.54 static
static = 158 mW
peak dynamic = 243 mW
This gives a total power consumption of 500 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [1].
Fixed
Float
Round
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
Note we’re not calling this a RISC processor because we expect the majority of the processing to occur on the floating-point co-processor.
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The ARM1136J(F)-S supports conditional execution of all instructions, including on positive values.
Conditional Execution
Instructions are executed based on evaluation of a condition – simplifies short control structures. Many instructions on the ARM1136J(F)-S can be conditionally executed, so it is modeled as supporting conditional execution.
MAC
A “single-cycle” multiply and accumulate. Note the instructions on ARMs are generally not consistently single-cycle (depends on following instruction), but we’re not modeling to that level of detail.
SAD
Sum of absolute differences. We were unable to find adequate documentation on this, so it will be ignored.
In ARM-speak, VLIW is called Instruction Level Parallelism (ILP). The combination of ARM and VFP11 will be treated as a VLIW architecture capable of supporting 4 simultaneous memory operations and 2 MAC and/or 2 ALU operations.
VLIW_flag: 1
VLIW_max: 6
VLIW Memory: 4
VLIW
VLIW Mult: 2
This is slightly misleading as it implies that a 2 MULT, 2
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. ARM began to support SIMD with ARM11[3]. The ARM1136J(F)-S implements SIMD down to 8-bit in its ALUs (in add/sub routines) and down to 16 bits in its multiplier. We are ignoring the 8-bit capability in this model.
SIMD_flag: 1
SIMD_min: 16
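With a native precision of 32 bits and SIMD_min of 16, one SIMD instruction operates on two 16-bit lanes packed into a single word. The Python sketch below illustrates a lane-wise add in which the carry out of one lane is discarded rather than propagated into the next. The function `packed_add16` and the constants are hypothetical; the sketch models wraparound arithmetic only and does not correspond to a specific ARM instruction.

```python
LANE_BITS = 16            # SIMD_min
WORD_BITS = 32            # native precision
LANE_MASK = (1 << LANE_BITS) - 1


def packed_add16(a, b):
    """Add two 32-bit words as independent 16-bit lanes (wraparound)."""
    result = 0
    for shift in range(0, WORD_BITS, LANE_BITS):
        lane_a = (a >> shift) & LANE_MASK
        lane_b = (b >> shift) & LANE_MASK
        # Masking the sum discards the carry out of the lane, so it
        # never reaches the neighbouring lane.
        result |= ((lane_a + lane_b) & LANE_MASK) << shift
    return result


print(hex(packed_add16(0x0001FFFF, 0x00010001)))  # 0x20000
```

An ordinary 32-bit add of the same operands gives 0x30000, because the carry from the low half spills into the high half.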
None available??
[1] ARM1136JF-S™ and ARM1136J-S™ Revision: r1p5 Technical Reference Manual, DDI0211J.
[2] VFP11™ Vector Floating-point Coprocessor for ARM1136JF-S processor r1p5 Technical Reference Manual, DDI 0274H.
[3] http://www.arm.com/products/CPUs/ARM1020E.html
[4] J. Goodacre, A. Sloss, “Parallelism and the ARM instruction set architecture,” Computer, Volume 38, Issue 7, July 2005, pp. 42-50.
The following provides details on how parameters for the data file for the ARM1136J-S were generated for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.
DSP ID: ARM1136J-S
Native Precision: 32
Clock Rate (MHz): 620
The ARM1136J-S supports the ARMv6 instruction set, the Thumb instruction set, and the Jazelle (Java) instruction set (the J in ARM1136J-S). Additionally, it supports DSP extensions in which SIMD is supported (down to 16 bits). The basic ARM consists of an ALU, a MAC, a shifter, and a load/store unit which can load/store two words per cycle independently of the ALU/MAC operation.
Unlike the ARM1136J(F)-S, it does not have a dedicated vector float coprocessor.
ARM [2] does not give a typical estimate for the ARM1136J-S, but it does give one for operation without memory cache, which decreases core power consumption from 0.45 mW/MHz to 0.37 mW/MHz; we’ll assume this figure for operation without the floating-point unit. This gives a typical power consumption of 229 mW at 620 MHz. Assuming this represents 50% loading and the 99 mW of the
0.5 dynamic + static = 229 mW
dynamic = 1.54 static
static = 130 mW
peak dynamic = 180 mW
This gives a total power consumption of 409 mW.
Characteristics are DSP capabilities which might be assumed by a programmer when implementing a waveform component. The following were identified via a review of [1].
Fixed
Round
Saturation arithmetic
Indexed addressing (e.g., *R1++R2 where value in R1 is loaded and then R1 is offset by R2)
RISC
Modifiers are instructions which permit the simultaneous execution of two or more operations in a single instruction cycle (excepting saturation and sign manipulation).
BPOS
The ARM1136J-S supports conditional execution of all instructions, including on positive values.
COND_EXEC
Instructions are executed based on evaluation of a condition – simplifies short control structures. Many instructions on the ARM1136J-S can be conditionally executed, so it is modeled as supporting conditional execution.
MAC
A “single-cycle” multiply and accumulate. Note the instructions on ARMs are generally not consistently single-cycle (depends on following instruction), but we’re not modeling to that level of detail.
SAD
Sum of absolute differences. We were unable to find adequate documentation on this, so it will be ignored.
In ARM-speak, VLIW is called Instruction Level Parallelism (ILP). On the ARM1136J-S the load/store unit can operate independently of the Integer processing unit. This is modeled as follows.
VLIW_flag: 1
VLIW_max: 3
VLIW Memory: 2
VLIW
VLIW Mult: 1
This is slightly misleading as it implies that a MULT,
Single-Instruction-Multiple-Data (SIMD) is a characteristic of some DSP architectures wherein a single instruction can be simultaneously applied to two or more words smaller than the native data width. ARM began to support SIMD with ARM11 [3]. The ARM1136J-S implements SIMD down to 8-bit in its ALUs (in add/sub routines) and down to 16 bits in its multiplier. We are ignoring the 8-bit capability in this model.
SIMD_flag: 1
SIMD_min: 16
None available??
[1] ARM1136JF-S™ and ARM1136J-S™ Revision: r1p5 Technical Reference Manual, DDI0211J.
[2] VFP11™ Vector Floating-point Coprocessor for ARM1136JF-S processor r1p5 Technical Reference Manual, DDI 0274H.
[3] http://www.arm.com/products/CPUs/ARM1020E.html
[4] J. Goodacre, A. Sloss, “Parallelism and the ARM instruction set architecture,” Computer, Volume 38, Issue 7, July 2005, pp. 42-50.
[1] This classification step is the greatest source of inaccuracy in the estimation process. More accurate estimates could be formed by creating a separate execution model for each DSP, but this would be dramatically more complicated and time-consuming and would require an entirely different software architecture for the computational engine.
[2] In practice, this is accomplished by stopping the carry operation at 8-bit boundaries instead of propagating the carry bit through the entire 32 bits.
[3] Details to be inserted as components are identified
[4] Details to be inserted as components are identified
[5] Note that there are some other instructions which effectively pipeline sub-instructions and reduce the number of required fetches from program memory, but these types of modifiers are not the intended study of this project. For example, the TMS320C6455-850 implements a complex multiplication instruction which performs the four multiplications involved with a complex multiplication. However, this is performed over four cycles so no cycles are saved by using this instruction.
[6] Details to be inserted as components are identified
[7] Note that there are some other instructions which effectively pipeline sub-instructions and reduce the number of required fetches from program memory, but these types of modifiers are not the intended study of this project. For example, the TMS320C6455-850 implements a complex multiplication instruction which performs the four multiplications involved with a complex multiplication. However, this is performed over four cycles so no cycles are saved by using this instruction.
[8] Details to be inserted as components are identified
[9] Note that there are some other instructions which effectively pipeline sub-instructions and reduce the number of required fetches from program memory, but these types of modifiers are not the intended study of this project. For example, the TMS320VC5502-300 implements an LMS instruction which occurs over numerous cycles and no cycles are saved by using this instruction versus hand-coding the equivalent operations.
[10] Details to be inserted as components are identified
[11] Details to be inserted as components are identified
[12] Details to be inserted as components are identified
[13] Details to be inserted as components are identified
[14] The
[15] Details to be inserted as components are identified
[16] Details to be inserted as components are identified