Components Mappings (Task 1.3b)
for the
Tool for Automating Estimation of DSP
Resource Statistics for Waveform Components
Submitted
under Subcontract FP-19738-430292
An
Integrated Tool for SCA Waveform Development, Testing, and Debugging and a Tool
for Automated Estimation of DSP Resource Statistics for Waveform Components
Version
1.1
Revision
History
| Version | Summary of Changes | Date |
|---|---|---|
| 0.1 (JN) | Internal Release | |
| 1.0 (JN) | Initial Release | |
| 1.1 (JN) | Revised mappings to reflect accounting shift of stack operations from memory to Minor typos | |
1 Introduction and Methodology
1.2.1 Use instructions consistent with the least
complex DSP
1.2.3 Clearly identify assumptions
1.2.5 Reset internal settings before exiting
1.2.6 Group operations that combine together
1.2.7 Attempt to parameterize as much as possible
2.3 Impact of Specialized Instructions
3.3 Impact of Specialized Instructions
4 Fast Fourier Transform (FFT)
4.3 Impact of Specialized Instructions
5.3 Impact of Specialized Instructions
6.3 Impact of Specialized Instructions
7.3 Impact of Specialized Instructions
8.3 Impact of Specialized Instructions
9.3 Impact of Specialized Instructions
10 CIC Filter (Interpolator, M=1)
10.3 Impact of Specialized Instructions
11.3 Impact of Specialized Instructions
12.3 Impact of Specialized Instructions
13.3 Impact of Specialized Instructions
14 Viterbi Decoder (rate 1/r, hard decisions,
traceback = 32)
14.3 Impact of Specialized Instructions and
Synergistic
14.3.3 Add Compare Select (N*num_states)
14.3.5 Hard metric (N*num_states*2)
15.3 Impact of Specialized Instructions
16.3 Impact of Specialized Instructions
17.3 Impact of Specialized Instructions
18.3 Impact of Specialized Instructions
19.3 Impact of Specialized Instructions
20.3 Impact of Specialized Instructions
21.3 Impact of Specialized Instructions
22.3 Impact of Specialized Instructions
23.3 Impact of Specialized Instructions
24.3 Impact of Specialized Instructions
This document is intended to document the steps used to generate the component files created for the “Tool for Automating Estimation of DSP Resource Statistics for Waveform Components” and to provide enough detail for others to be able to create similar files for their components.
This
section gives an overview of the methodology for writing a component file and
details on key processes associated with this methodology. This is followed by
22 example applications of this process to a variety of different component
implementations.
A
component file is intended to provide a base equation which details all of the
instructions necessary to implement a component with the simplest possible
instruction set (generally the ARM RISC set). To create this base equation,
pseudo-code for the component is written which reflects an anticipated
implementation of the component using this instruction set.
In
general different DSPs will include specialized instructions and architectural
characteristics which permit multiple of these simple instructions to execute
in a single cycle. These include instructions that explicitly combine pairs of
instructions (e.g., a MAC is a multiplication and an accumulation), ones that
obviate the need for instructions (e.g., a block repeat eliminates the need for
loop control instructions) and architectural optimizations (e.g., SIMD, VLIW)
that permit multiple instructions to be executed in a single cycle. To model
these conditions, additional equations are introduced which modify the original
base equation.
Note that implementing a component in a different manner (e.g., an algorithmic
optimization) will yield different numbers. The advantage of the approach
adopted in this project is that, rather than requiring a detailed analysis of
a component for each DSP, a single component analysis suffices for all mapped
DSPs, a significant time savings (e.g., for 20 DSPs, only 1/20th of
the time is required). The automated process has other advantages as well.
Others can leverage the results without duplicating the effort, and separating
DSP analysis from component analysis means that systems engineers need not be
experts on all DSPs (each can be an expert on one or two and write those
files), nor on all possible waveform components.
A disadvantage of this approach is that it implicitly assumes every instruction
takes a single cycle to execute, thereby overlooking latency and delay, which can
vary significantly from DSP to DSP and from instruction to instruction. For
well-designed pipelined code, this should have minimal impact on estimates, as
these delays should fall outside the loop kernel, but it is likely the largest source
of error in the estimation method (and means that the tool should be expected to
report fewer cycles than would actually be required).
The first step in
estimating the number of cycles required to implement a component is estimating
the number of instructions required to implement the component. To ensure that
this is done in a manner that facilitates subsequent steps, the following
conventions are used.
In general there’s a
wide variation in the capabilities of DSPs, but all DSPs will have to implement
the same set of operations to implement the same process whether it’s done in a
single instruction or 10. To appropriately model this, all pseudo-code instructions
should be written in a manner consistent with the least complex DSP. In theory,
this could vary from instruction to instruction, but using instructions
consistent with the ARM9 instruction set will generally suffice.
Key examples to be
aware of include the way conditional branches are taken (e.g., some require a
flag be set, some can be done by directly examining the content of a register)
as well as explicit instruction combinations (e.g., a MAC).
In general, the basic
equation is modified by eliminating instructions that on a processor will
either be unnecessary or combined with another instruction. However, it will
sometimes be the case that multiple modifiers for a processor will eliminate
the same line. To identify these situations in the equation formulation step,
we will associate the eliminated instruction lines with the modifier so that
when a duplication occurs, it can be handled as a “synergistic” modifier. Also
a different appellation should be adopted for instructions in each loop.
Examples used in the following include labeling lines as 1,2,3,… for
instructions that occur outside a loop, L1, L2, L3, … for instructions inside a
loop, OL1, OL2,… for instructions in an outer-loop and so on.
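The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `eliminated` and the function `overlaps` are hypothetical names, and the line sets assume the decrement / compare / branch loop-control pattern used in the examples later in this document.

```python
from itertools import combinations

# Hypothetical bookkeeping: each modifier is associated with the labels of
# the pseudo-code lines it would eliminate (L1, L2, ... are lines in a loop).
eliminated = {
    "BDEC": {"L5", "L6"},
    "BPOS": {"L6"},
    "ZOL": {"L5", "L6", "L7"},
    "MAC": {"L4"},
}

def overlaps(table):
    """Return {(mod_a, mod_b): shared_lines} for each pair targeting the same line."""
    shared = {}
    for a, b in combinations(sorted(table), 2):
        common = table[a] & table[b]
        if common:
            shared[(a, b)] = common
    return shared

for pair, lines in overlaps(eliminated).items():
    print(pair, sorted(lines))
```

Each pair reported here is a candidate for a "synergistic" modifier; pairs with no shared lines need no correction.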
It is sometimes hard
to implement a perfectly generalizable component so some assumptions about a processor’s
capabilities must be made (e.g., floating point or 32-bit). When this occurs,
make certain that this is clearly identified so it can be built into the
component file and thereby excluded from implementation on processors that do
not support those assumptions.
While some DSPs (e.g.,
C54) can directly manipulate elements in their caches within their instructions, this
is the exception. Thus before performing an operation on an element in memory,
it should be moved to the local register file (e.g., R1-R16). For processors
without register files, these extra cycles will be handled via a modifier that
eliminates memory accesses.
Many DSPs assign
special meanings to specific registers in their register files (e.g., to store
the branch back address, input data pointers, output data pointers, for
circular buffering). The specific registers should be ignored in this
pseudo-code because of the significant variation from DSP to DSP. Also the
pseudo-code should ignore limitations on the number of available registers. In
practice this would require additional cycles to interface with cache, but this
will be both component and DSP-specific. If need be, a specialized component
could be written to address a register-constrained condition.
However, code should
account for some instruction(s) to put data into an appropriate register when
entering or leaving the procedure. Also when calling another procedure, there
should also be an instruction to place the address of where the procedure
should return to into the appropriate register or stack.
It is a good coding
practice that whenever a procedure changes an internal setting, this setting
should be reset before exiting the procedure. So, for example, if a circular addressing
mode is needed in a particular component, the existing mode should be saved
upon entry to the procedure and restored upon exit from the procedure.
To simplify the
generation of modifier equations, instructions which group together in known
modifiers should be placed sequentially in the pseudo-code. If they cannot be
placed sequentially, that is a good indication that the modifier would not
apply in that situation.
One of the goals of
this tool is to allow for component file reuse as much as possible. For
example, there should be no need to redo the analysis when moving from a filter
of length 31 to a filter of length 63. Thus when possible, operations which
depend on typical parameterizations of the component should be identified and
expressed in terms of that parameter.
Even when the original
coder is unaware of a parameterization, loop counters will generally serve as a
parameterizable value (e.g., filter length). Note that a loop counter is not
necessary for an equivalent parameterization. In fact, on processors where loop
control is a significant burden, it is a common practice to unroll loops. In
such a case, the pseudo-code might include a comment that the following should
be repeated r times or some such. Again to help identify exactly what
will be repeated (or looped) it is helpful to use meaningful labels.
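As a small illustration of why parameterization pays off, the raw equation that this document derives for the real FIR filter (13 + 7*N) can be written once and evaluated for any filter length. The function name below is hypothetical.

```python
def fir_raw_instructions(n_taps: int) -> int:
    """Raw instruction count for the real FIR example: 13 fixed lines plus 7 per tap."""
    return 13 + 7 * n_taps

# Moving from a length-31 filter to a length-63 filter needs no new analysis.
for n in (31, 63):
    print(n, fir_raw_instructions(n))
```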
Reading assembly code
is hard (though not as hard as reading machine code!). To promote the sharing
of component files and documentation and to make it possible for the original
coder to perform validation weeks, if not merely hours, after writing the
pseudo-assembly, the pseudo-code should be commented as much as possible. This
also has the additional benefit that if a DSP is added to the suite later that
has new capabilities, there may be sufficient context to evaluate where it
could be applied to previously generated component files.
Note that lines used
for commenting should not be added to the instruction / cycle count.
The primary goal of
performing the component analysis is to generate the set of equations that
define the component file. The basic operations equation is defined by counting
up the number of instructions required to implement the component,
parameterized as appropriate. This count should then be subdivided into memory
operations, multiplication operations, and other (
Modifier equations should
be generated by reviewing the modifiers listed in Section 1.4 and identifying when the requisite conditions are
satisfied in the pseudo-code. When this occurs, this should be noted along with
the number of instructions that would be eliminated by the presence of that
modifier. Each of these modifier equations should be associated with the
specific instruction lines that would be eliminated.
Synergistic equations
are created by reviewing the list of modifier equations and identifying where
modifiers target the same instruction. The synergistic equation should undo
the effects of instructions that were double-counted (or triple-counted) by the
modifiers.
Table 1 reproduces the table of modifiers identified in the
document entitled “DSP Mappings (Task 1.3a) for the Tool
for Automating Estimation of DSP Resource Statistics for Waveform Components.”
In general when pseudo-code is identified as exhibiting the operations listed
in the middle column, the effect in the right column is applied to generate a
modifier equation. Note that in practice many of these instruction modifiers
actually capture multiple instructions (e.g., ABS
Table 1: Instruction / Cycle modifiers identified from DSP analysis

| Instruction Modifier | Operation | Modeled Effect |
|---|---|---|
| ABSALU | A typical | Cycles associated with subsequent abs eliminated |
| ADDSUB | The | Consecutive adds/subtracts of the same registers eliminated |
| AVG | The processor implements (A+B)/2 | Eliminates a right shift when following an add |
| BDEC | The processor decrements a counter and branches if the counter is nonzero | |
| BPOS | The processor branches on the conditions of registers rather than requiring an instruction to set a flag. | Cycle eliminated for generating branch condition |
| BITR | The processor reverses the bits of a register in a single cycle. Frequently implemented as bit-reverse incrementing/arithmetic. | Cycles for process to bit-reverse an indexed array are eliminated. 1 cycle per array element added back in. |
| COND_EXEC | All instructions are executed conditionally | Cycles consumed for short control branches are eliminated |
| COND_MOV | All memory/move operations can be executed conditionally | Cycles consumed for short control branches related to memory are eliminated |
| CPX_MPY | A single cycle complex multiplication. Note that several DSPs have an instruction which implements a complex multiplication (or | Complex multiplication cycles (6) reduced to 1 per complex multiplication. |
| EXTRACT | The processor is capable of detailed bit manipulation in a single cycle. This takes various forms in different instructions, but a minimum requirement is the ability to extract a specified set of bits from a word and then pack them into bytes | Bit manipulation cycles cut in half. |
| GMPY | The processor supports Galois Field arithmetic (useful in some error correction codes). | Cycles required to mimic Galois Field arithmetic are eliminated with one cycle per Galois Field arithmetic operation added in. |
| INDEX | Of form *R4[R6]++ | Cycles used to offset a register for a memory operation are eliminated |
| MAC | A single-instruction (functional unit) multiplication and accumulation. Note that when both a multiplier and an | Accumulate cycles eliminated |
| MAX | The processor does the following (and the reverse for a MIN): if A>B, A-> dst | Cycles required to perform move following comparison operation eliminated |
| MAX2 | The processor performs the MAX operation for two pairs of words of native precision. | Cycles required to perform both moves and one comparison eliminated |
| MEM2 | The data bus width of a processor is such that a single instruction fetches 2 words. | Memory cycles cut in half (rounded up). |
| MEM4 | The data bus width of a processor is such that a single instruction fetches 4 words. | Memory cycles cut to one fourth (rounded up). |
| NOREG | The processor memory maps all registers so there’s no need for an instruction to load registers from memory. Processors which do this, however, tend to clock much slower | All cycles used to move |
| SAD | Sum of absolute differences. | Absolute values removed. A special case of ABSALU. |
| | Adds up bytes in a word. | Eliminate cycles of sum of byte-packed words |
| VSL | A process by which a register is shifted left and an input 1 or 0 is appended to the rightmost bit. Useful for keeping track of paths (saves an instruction) and some bit manipulation operations. | Cycle saved per path update in |
| ZOL | The processor supports a form of zero-overhead (hardware) looping wherein loop instructions are placed in a hardware buffer and repeated a specified number of times. | All loop control cycles eliminated (branch, compare, decrement). One cycle added to set loop counter. |

This
component represents the implementation of a real filter without assumptions
about the number or symmetry of the taps and computes a single output for a set
of inputs. Note that different structures (e.g., block FIR, symmetric FIRs)
will be generally coded in a different manner. Because the process will vary
from implementation to implementation, all scaling is assumed to occur outside
of this function.
Requirements: circular
Parameters: N (length)
y=fir(coef, data, length, offset)
//Set circular buffer params
1 (instruction to store previous setting in local register)
2 (instruction to store buff length)
3 (instruction to turn on circ buff)
4 (instruction to set buffer length)
//Move input parameters to local registers
5 R1 = coef (address)
6 R2 = data (address)
7 R2 = data + offset // needed for circular buffering
8 R3 = length (actual #)
//zero accumulator (typically done by subtracting a register from itself)
9 acc = 0
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R4 = *R1++ // I don’t know of a DSP that doesn’t support postfix addressing,
                           // but if there is one, a cycle would need to be added here
L2 R5 = *R2++
L3 R6 = R5 * R4
L4 acc = acc + R6
L5 R3 = R3 - 1
L6 flag = cmp(R3,0)
L7 if flag (R3!=0), branch to loop
// Move result to output register
10 R_out = acc
// Restore stuff
11 (instruction to reset addressing mode)
12 (instruction to reset buffer length)
13 (instruction to branch back)
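
The following Python sketch mirrors the pseudo-code above and tallies the pseudo-instructions as it goes, as a cross-check on the line counting. It is an illustration only: the function name `fir_counted` is hypothetical, and it assumes the circular data buffer has the same length as the filter.

```python
def fir_counted(coef, data, offset):
    """Compute one FIR output over a circular buffer and count pseudo-instructions."""
    n = len(coef)            # assumes length > 0, as the pseudo-code does
    count = 9                # lines 1-9: buffer setup, parameter moves, acc = 0
    acc = 0
    coef_idx = 0
    data_idx = offset        # line 7: R2 = data + offset
    remaining = n            # line 8: R3 = length
    while True:
        c = coef[coef_idx]               # L1: load coefficient, post-increment
        coef_idx += 1
        d = data[data_idx]               # L2: load data, circular post-increment
        data_idx = (data_idx + 1) % n
        acc += c * d                     # L3 multiply, L4 accumulate
        remaining -= 1                   # L5: decrement loop counter
        count += 7                       # L1-L7 execute once per tap
        if remaining == 0:               # L6 compare, L7 branch
            break
    count += 4               # lines 10-13: move result, restore settings, return
    return acc, count

print(fir_counted([1, 2, 3], [4, 5, 6], 1))
```

For N taps the tally is 13 + 7*N, which is the raw count this section arrives at.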
| Class | Equation |
|---|---|
| Raw | 13 + 7*N |
| Memory | 2*N |
| Multiplication | N |
| | 4*N + 13 |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L5, L6 eliminated | ( |
| BPOS | L6 eliminated | ( |
| ZOL | 1 cycle to set register, L5, L6, L7 eliminated | ( |
| MAC | L4 eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(2*N)/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(2*N)/4 |
| NOREG | MEM eliminated | (MEM) -(2*N) |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L5, L6 added back in | ( |
| BPOS, BDEC | L6 added back in | ( |
| BPOS, ZOL | L6 added back in | ( |
| BPOS, BDEC, ZOL | L6 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (2*N)/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(2*N)/4 |
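
To see how the loop-control modifier and synergistic rows fit together, the sketch below computes the real FIR estimate two ways: by removing each eliminated loop line exactly once, and by summing per-modifier terms with alternating add-back corrections. The per-line sets are read from the Impact columns above; the function and variable names are hypothetical, and memory modifiers are left out for brevity.

```python
from itertools import combinations

# Loop lines eliminated per modifier, read from the "Impact" column.
LOOP_LINES = {
    "BDEC": {"L5", "L6"},
    "BPOS": {"L6"},
    "ZOL": {"L5", "L6", "L7"},
    "MAC": {"L4"},
}
SETUP_COST = {"ZOL": 1}  # one cycle to set the hardware loop counter

def estimate_by_union(n, mods):
    """Raw count minus every eliminated loop line, each counted once."""
    gone = set().union(*(LOOP_LINES[m] for m in mods))
    return 13 + 7 * n - n * len(gone) + sum(SETUP_COST.get(m, 0) for m in mods)

def estimate_by_equations(n, mods):
    """Same estimate from modifier terms plus synergistic corrections."""
    total = 13 + 7 * n + sum(SETUP_COST.get(m, 0) for m in mods)
    for size in range(1, len(mods) + 1):
        sign = -1 if size % 2 else 1   # subtract singles, add back pairs, ...
        for group in combinations(mods, size):
            common = set.intersection(*(LOOP_LINES[m] for m in group))
            total += sign * n * len(common)
    return total

mods = ["BDEC", "BPOS", "ZOL"]
print(estimate_by_union(31, mods), estimate_by_equations(31, mods))
```

With all three loop-control modifiers present, the triple term removes L6 once more, which corresponds to the "L6 eliminated (again)" row.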
This
component represents the implementation of a complex filter without assumptions
about the number or symmetry of the taps and computes a single output for a set
of inputs. Note that different structures (e.g., block FIR, symmetric FIRs)
will be generally coded in a different manner. Because the process will vary
from implementation to implementation, all scaling is assumed to occur outside
of this function.
Requirements: circular
Parameters: N (length)
y=fir(coef_real, coef_imag, data_real, data_imag, length, offset)
//Set circular buffer params
1 (instruction to store previous setting in local register) (real)
2 (instruction to store previous setting in local register) (imag)
3 (instruction to store previous setting in local register) (real)
4 (instruction to store previous setting in local register) (imag)
5 (instruction to turn on circ buff) (real)
6 (instruction to turn on circ buff) (imag)
7 (instruction to set buffer length) (real)
8 (instruction to set buffer length) (imag)
//Move input parameters to local registers
9 R1 = coef_real (address)
10 R2 = coef_imag (address)
11 R3 = data_real (address)
12 R4 = data_imag (address)
13 R3 = data_real + offset // needed for circular buffering
14 R4 = data_imag + offset // needed for circular buffering
15 R5 = length (actual #)
//zero accumulators (typically done by subtracting a register from itself)
16 acc_real = 0
17 acc_imag = 0
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R6 = *R1++
L2 R7 = *R2++
L3 R8 = *R3++
L4 R9 = *R4++
//end memory fetches
L5 R10 = R6 * R8 // coef_real * data_real
L6 R11 = R7 * R9 // coef_imag * data_imag
L7 R12 = R7 * R8 // coef_imag * data_real
L8 R13 = R6 * R9 // coef_real * data_imag
L9 acc_real = acc_real + R10
L10 acc_real = acc_real - R11
L11 acc_imag = acc_imag + R12
L12 acc_imag = acc_imag + R13
// loop control
L13 R5 = R5 - 1
L14 flag = cmp(R5,0)
L15 if flag (R5!=0), branch to loop
// Move result to output register
18 store real
19 store imag
// Restore stuff
20 (instruction to reset addressing mode) (real)
21 (instruction to reset addressing mode) (imag)
22 (instruction to reset buff) (real)
23 (instruction to reset buff) (imag)
24 (instruction to branch back)
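
As a reference for what the loop body computes, here is a minimal Python sketch of the same four-multiply complex accumulation. The function name is hypothetical, and for brevity it indexes the data linearly rather than through a circular buffer.

```python
def complex_fir(coef_real, coef_imag, data_real, data_imag):
    """One complex FIR output using four real multiplies per tap."""
    acc_real = 0.0
    acc_imag = 0.0
    for cr, ci, dr, di in zip(coef_real, coef_imag, data_real, data_imag):
        acc_real += cr * dr   # L5, L9
        acc_real -= ci * di   # L6, L10
        acc_imag += ci * dr   # L7, L11
        acc_imag += cr * di   # L8, L12
    return acc_real, acc_imag

print(complex_fir([1, 3], [2, -1], [2, 0.5], [-1, 4]))
```

The result can be checked against ordinary complex arithmetic: (1+2j)(2-1j) + (3-1j)(0.5+4j).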
| Class | Equation |
|---|---|
| Raw | 24 + 15*N |
| Memory | 4*N |
| Multiplication | 4*N |
| | 7*N + 24 |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L13-14 eliminated | ( |
| BPOS | L14 eliminated | ( |
| ZOL | 1 cycle to set register, L13-15 eliminated | ( |
| MAC | L9-12 eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(4*N)/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(4*N)/4 |
| NOREG | MEM eliminated | (MEM) -(4*N) |
| CMPX_MPY | L5-L12 eliminated | (MULT) -3*N, ( |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L13, L14 added back in | ( |
| BPOS, BDEC | L14 added back in | ( |
| BPOS, ZOL | L14 added back in | ( |
| BPOS, BDEC, ZOL | L14 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (4*N)/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(4*N)/4 |
| CMPX_MPY, MAC | L9-12 added back in | ( |
This
component represents the implementation of a computation-in-place radix-2
complex FFT (decimation in time) where all twiddle factors have been
precomputed and stored. Note that relevant counters are assumed hard-coded. The
first stage of the FFT accesses the input data in bit-reversed fashion,
subsequent stages in normal fashion. In practice this means the first pass
should be outside of the main loop. For space, this is not actually coded
though it is reflected in the equations. Also note that slightly higher SNR could
be achieved if the FFT input and the output of each stage were scaled by a
factor of ½, as opposed to the implicit
pre-stage scaling used here. However, this would add 4*log2(N) cycles
(right shift and store rather than store).
Requirements: bit-reverse
addressing, circular addressing
Parameters: N (length)
(everything dependent on this is assumed hard coded in)
y=fft(real_data, imag_data, twid_real, twid_imag)
1 R1 = real_data (address)
2 R2 = imag_data (address)
3 R3 = twid_real (address)
4 R4 = twid_imag (address)
//first pass through the input data should be accessed in bit reverse fashion
//after that, should be in normal linear fashion
//that is reflected here for R1, R2
5 (instruction to store previous setting in local register)
6 (instruction to store previous setting in local register)
7 (instruction to turn on bit-reverse add)
8 (instruction to turn on bit-reverse add)
9 (instruction to reset addressing mode)
10 (instruction to reset addressing mode)
//also there are 3 less instructions used (one less pass through outer loop control)
//note that will have a negative effect on modifier equations
11 num_stages = log_2 (N) // outer loop counter (stage), hardcoded
12 data_step = 1
13 num_DFT = N/2 //hard coded
14 offset = 1 //used for many things
// defines # butterflies in DFT
// also 2*offset*DFT_counter is start address for DFT
// e.g., in stage 0, DFT2, start address is 2*1*2 = 4 (bit reversed since stage 0)
//
//**********************
//OUTER
OL1 DFT_count = num_DFT //set up middle loop counter
//**********************
//MIDDLE (DFT)
//Point twiddle to W^0
ML1 twid_real_reg = R3 // wrap around
ML2 twid_imag_reg = R4 // wrap around
//calculate initial index for butterfly (k*offset)
ML3 temp1 = num_DFT - DFT_count //(k=0,1,2,…)
ML4 temp = temp1 * offset
ML5 temp = temp << 1
//initial a address
ML5 a_real_reg = temp
ML6 a_imag_reg = temp
//initial b address
ML7 b_real_reg = a_real_reg + offset
ML8 b_imag_reg = a_imag_reg + offset
ML9 butterfly_count = offset //inner loop counter
//**********************
//INNER
label: INNER
//START BUTTERFLY
//A FFT butterfly is implemented as
// A = a + twid*b (complex multiplication)
// B = a - twid*b (complex multiplication)
//butterfly (there are some chips where the following is collapsed down to 2-4 cycles,
//though not included in this survey). Labeled separately to make it easier to adjust later
B1 twid_real_val = *twid_real_reg
B2 twid_imag_val = *twid_imag_reg
B3 a_real = *a_real_reg
B4 a_imag = *a_imag_reg
B5 b_real = *b_real_reg
B6 b_imag = *b_imag_reg
//complex multiplication b*twid
B7 R1 = b_real*twid_real_val
B8 R2 = b_imag*twid_imag_val
B9 b_mod_real = R1 - R2
B10 R1 = b_real*twid_imag_val
B11 R2 = b_imag*twid_real_val
B12 b_mod_imag = R1 + R2
//A
B13 temp_a_real = a_real + b_mod_real
B14 temp_a_imag = a_imag + b_mod_imag
B15 *a_real_reg++ = temp_a_real
B16 *a_imag_reg++ = temp_a_imag
//B
B17 temp_b_real = a_real - b_mod_real
B18 temp_b_imag = a_imag - b_mod_imag
B19 *b_real_reg++ = temp_b_real
B20 *b_imag_reg++ = temp_b_imag
//END BUTTERFLY
//**********************
IL1 twid_real_reg = twid_real_reg + num_DFT // wrap around
IL2 twid_imag_reg = twid_imag_reg + num_DFT // wrap around
IL3 butterfly_count = butterfly_count - 1
IL4 flag = cmpeq(butterfly_count,0)
IL5 if flag (butterfly_count != 0), branch INNER LOOP
//END INNER
//**********************
ML10 DFT_count = DFT_count - 1
ML11 flag = cmpeq(DFT_count,0)
ML12 if flag (DFT_count != 0), branch MIDDLE LOOP
//END MIDDLE LOOP
//**********************
//resize for next stage
OL2 num_DFT = num_DFT >> 1 //e.g., 8,4,2,1
OL3 offset = offset << 1
OL4 stage_size = stage_size << 1
OL5 stage_counter = stage_counter - 1
OL6 flag = cmpeq(stage_counter,0)
OL7 if flag (stage_counter != 0), branch OUTER LOOP
//END OUTER
//**********************
// Restore stuff
16 (instruction to branch back)
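
The loop structure above (outer loop over stages, middle loop over DFTs, inner loop over butterflies, twiddle pointer advancing by the number of DFTs) can be sketched in Python as follows. This is a floating-point illustration under stated assumptions, not the fixed-point implementation being counted: the function name is hypothetical, Python complex numbers stand in for the separate real and imaginary arrays, and the bit-reversed input access is done as an explicit reordering pass.

```python
import cmath

def fft_radix2_dit(x):
    """In-place radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversed reordering (what bit-reverse addressing provides in hardware).
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Precomputed twiddle factors W^k = exp(-2*pi*j*k/n).
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    offset = 1           # butterflies per DFT in the current stage
    num_dft = n // 2     # DFTs in the current stage
    while num_dft >= 1:                          # outer loop (stages)
        for k in range(num_dft):                 # middle loop (DFTs)
            start = 2 * offset * k               # start address of this DFT
            for b in range(offset):              # inner loop (butterflies)
                w = twiddle[b * num_dft]         # twiddle pointer steps by num_dft
                a_val = x[start + b]
                b_mod = w * x[start + b + offset]
                x[start + b] = a_val + b_mod             # A = a + twid*b
                x[start + b + offset] = a_val - b_mod    # B = a - twid*b
        num_dft >>= 1
        offset <<= 1
    return x
```

Summed over all stages, the middle loop body runs N-1 times, since N/2 + N/4 + ... + 1 = N-1.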
| Class | Equation |
|---|---|
| Raw | 15 + 7*log2(N) + (N-1)*12 + (20 + 6)*log2(N) |
| Memory | 10*log2(N) |
| Multiplication | (N-1)*1 + 4*log2(N) |
| | 15 + 7*log2(N) + (N-1)*11 + 12*log2(N) |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | OL5,6 ML10,11, IL3,4 eliminated | ( |
| BPOS | OL6 ML11, IL4 eliminated | ( |
| ZOL | OL5-7 ML10-12, IL3-5 eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(10*log2(N))/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(10*log2(N))/4 |
| NOREG | MEM eliminated | (MEM) -(10*log2(N)) |
| CMPX_MPY | B7-11 eliminated | (MULT) -3*log2(N), ( |
| MAC | B9, B12 eliminated | ( |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | OL5,6 ML10,11, IL3,4 added back in | ( |
| BPOS, BDEC | OL6 ML11, IL4 added back in | ( |
| BPOS, ZOL | OL6 ML11, IL4 added back in | ( |
| BPOS, BDEC, ZOL | OL6 ML11, IL4 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (10*log2(N))/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(10*log2(N))/4 |
| CMPX_MPY, MAC | B9, B12 added back in | ( |
This
component represents the implementation of a real least-mean-squares equalizer
which generates outputs on a symbol by symbol basis and updates the
coefficients with the error calculated externally. Because the process will
vary from implementation to implementation, all scaling is assumed to occur
outside of this function. Note that circular buffering is not generally done
with
Parameters: N (filter length)
y=lms(coef, data, length, error, step)
//Move input parameters to local registers
1 R1 = coef (address)
2 R2 = data (address)
3 R3 = length (actual #)
4 R4 = error
5 R5 = step
//zero accumulator (typically done by subtracting a register from itself)
6 acc = 0 //an
//filtering operation
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
//(loop1 label)
L1,1 R7 = *R1++
L1,2 R8 = *R2++
L1,3 R9 = R7 * R8
L1,4 acc = acc + R9
L1,5 R3 = R3 - 1
L1,6 flag = cmp(R3,0)
L1,7 if flag (R3!=0), branch to loop1
//adjust coefficients
7 R6 = step * error // no need to calculate this every time
8 R6 = R6 * acc //(y * err * weight)
9 R1 = coef (address) //reset pointers
10 R2 = data (address)
11 R3 = length
//Loop through and update coefficients
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
//loop2
L2,1 R7 = *R1 // note no postfix adjustment yet
L2,2 R8 = *R2++ //x[k]
L2,3 R9 = R8 * R6 //x[k] * y * err * weight
L2,4 R7 = R9 + R7 //h[k] + x[k] * y * err * weight
L2,5 *R1++ = R7 //h[k] + x[k] * y * err * weight
L2,6 R3 = R3 - 1
L2,7 flag = cmp(R3,0)
L2,8 if flag (R3!=0), branch to loop2
// Move result to output register
12 R_out = acc
// Restore stuff
13 (instruction to branch back)
| Class | Equation |
|---|---|
| Raw | 13 + N*15 |
| Memory | 5*N |
| Multiplication | 2*N + 2 |
| | 8*N + 11 |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L1,5-6; L2,6-7 eliminated | ( |
| BPOS | L1,6; L2,7 eliminated | ( |
| ZOL | L1,5-7; L2,6-8 eliminated + setup | ( |
| MAC | L1,4; L2,4 eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(5*N)/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(5*N)/4 |
| NOREG | MEM eliminated | (MEM) -(5*N) |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L1,5-6; L2,6-7 added back | ( |
| BPOS, BDEC | L1,6; L2,7 added back | ( |
| BPOS, ZOL | L1,6; L2,7 added back | ( |
| BPOS, BDEC, ZOL | L1,6; L2,7 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (5*N)/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(5*N)/4 |
This
component represents the implementation of a
Parameters: N (terms)
y=cos_tayl(x, rcp, length)
//Move input parameters to local registers
1 R1 = x (value)
2 R2 = rcp (address)
3 R3 = length (actual #)
//initialize accumulator
4 acc = 1 //scaled as need be
//calculate output
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R7 = *R2++ //load rcp
L2 R1 = R1*R1 //even exponents
L3 R9 = R1 * R7
L4 R1 = R1*R1 //odd exponent
L5 acc = acc + R9
L6 R3 = R3 - 1
L7 flag = cmp(R3,0)
L8 if flag (R3!=0), branch to loop
// Move result to output register
5 R_out = acc
6 (instruction to branch back)
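
For reference, a minimal Python sketch of a cosine evaluated from a table of precomputed reciprocals is shown below. It uses the standard Maclaurin series and makes its own assumptions: the table `rcp` is taken to hold the signed reciprocal factorials (-1)^(k+1)/(2k+2)!, the function name is hypothetical, and the running power is updated with two multiplications per term, so its operation count is not the one tallied in this section.

```python
import math

def cos_taylor(x, rcp):
    """Evaluate 1 + sum_k rcp[k] * x**(2k+2) using a running even power of x."""
    x_sq = x * x
    power = 1.0
    acc = 1.0                # leading term of the series
    for r in rcp:
        power *= x_sq        # next even power of x
        acc += power * r     # scale by the stored reciprocal and accumulate
    return acc

# Assumed table contents: -1/2!, +1/4!, -1/6!, ...
rcp = [(-1) ** (k + 1) / math.factorial(2 * k + 2) for k in range(8)]
print(cos_taylor(0.5, rcp), math.cos(0.5))
```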
| Class | Equation |
|---|---|
| Raw | 6 + N*8 |
| Memory | N |
| Multiplication | 3*N |
| | 4*N + 6 |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L6-7 eliminated | ( |
| BPOS | L7 eliminated | ( |
| ZOL | L6-8 eliminated + setup | ( |
| MAC | L5 eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(N)/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(N)/4 |
| NOREG | MEM eliminated | (MEM) -(N) |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L6-7 added back | ( |
| BPOS, BDEC | L7 added back | ( |
| BPOS, ZOL | L7 added back | ( |
| BPOS, BDEC, ZOL | L7 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (N)/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(N)/4 |
This
component represents the implementation of a CORDIC algorithm operating in
normal mode (as opposed to hyperbolic or linear modes) where arctangent values
have been precomputed and stored. Note K is given by
$K = \prod_{i=0}^{N-1} \frac{1}{\sqrt{1 + 2^{-2i}}}$,
which is assumed to have been precalculated external to this
function (resolution is not frequently changed from call to call).
Parameters: N (length,
i.e., number of iterations)
result=CORDIC(theta, K, length, atan)
//Move input parameters to local registers
1 z = theta (value)
2 R2 = K
3 R3 = length (actual #)
4 R4 = atan (address)
//assign initial values (registers)
4 x = K
5 y = 0
6 iter = 0
//calculate output
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R5 = RSH (y,iter)
L2 R6 = RSH (x,iter)
L3 R7 = *R4++
L4 flag = cmp(z,0)
L5 if flag (z<0), branch to label 2
L5a x = x - R5
L6a y = y + R6
L7a z = z - R7
L7.5a Branch to label 3
(label 2)
L5b x = x + R5
L6b y = y - R6
L7b z = z + R7
(label 3)
L8 R3 = R3 - 1
L9 flag = cmp(R3,0)
L10 if flag (R3!=0), branch to loop
// Move result to output register
7 store x (cos(theta))
8 store y (sin(theta))
9 (instruction to branch back)
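
The iteration can be sketched in floating-point Python as a sanity check on the update equations. This is an assumption-laden illustration rather than the fixed-point routine being counted: the function name is hypothetical, multiplication by 2^-i stands in for the right shifts, the arctangent table and gain K are computed inline instead of being passed in, and the shift amount advances with each iteration.

```python
import math

def cordic(theta, n_iter):
    """Rotation-mode CORDIC; returns (cos(theta), sin(theta)) for |theta| up to about 1.74 rad."""
    atan_table = [math.atan(2.0 ** -i) for i in range(n_iter)]
    k_gain = 1.0
    for i in range(n_iter):
        k_gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k_gain, 0.0, theta
    for i in range(n_iter):
        x_shift = x * 2.0 ** -i     # RSH(x, iter) in fixed point
        y_shift = y * 2.0 ** -i     # RSH(y, iter) in fixed point
        if z < 0:
            x, y, z = x + y_shift, y - x_shift, z + atan_table[i]
        else:
            x, y, z = x - y_shift, y + x_shift, z - atan_table[i]
    return x, y

print(cordic(0.6, 32), (math.cos(0.6), math.sin(0.6)))
```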
| Class | Equation |
|---|---|
| Raw | 9 + N*10.5 (L7.5a only executed half of the time) |
| Memory | N |
| Multiplication | 0 |
| | 9.5*N + 9 |

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L8, L9 eliminated | ( |
| BPOS | L4, L9 eliminated | ( |
| ZOL | L8-10 eliminated + setup | ( |
| COND_EXEC | L7.5a eliminated | ( |
| MEM2 | MEM cut in half | (MEM) -(N)/2 |
| MEM4 | MEM cut in fourth | (MEM) -3*(N)/4 |
| NOREG | MEM eliminated | (MEM) -(N) |

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L8-9 added back | ( |
| BPOS, BDEC | L9 added back | ( |
| BPOS, ZOL | L9 added back | ( |
| BPOS, BDEC, ZOL | L9 eliminated (again) | ( |
| MEM2, | MEM2 effect undone | (MEM) target + (N)/2 |
| MEM4, | MEM4 effect undone | (MEM) target + 3*(N)/4 |
This component represents the implementation of a block interleaver which is given an input linear array of data, an output linear array of data, and a mapping array.
Parameters: N (length, i.e., number of iterations)
interleaver(x, y, length, map)
//Move input parameters to local registers
1 R1 = x (address)
2 R2 = y (address)
3 R3 = length (actual #)
4 R4 = map (address)
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R5 = *R1++ (load x)
L2 R6 = *R4++ (load map, i.e., the address offset)
L3 R7 = R2 + R6 (offset the address)
L4 *R7 = R5 (store x[k] in y[map[k]])
L5 R3 = R3 - 1
L6 flag = cmp(R3,0)
L7 if flag (R3!=0), branch to loop
5 (instruction to branch back)
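The data movement performed by the loop above can be sketched in Python as follows. The function name `interleave` and the example map are illustrative assumptions; the listing itself works on addresses rather than list indices.

```python
def interleave(x, mapping):
    """Write x[k] to y[mapping[k]] for every k, as in lines L1-L4."""
    y = [None] * len(x)
    for k, value in enumerate(x):
        y[mapping[k]] = value  # one load of x, one load of map, one store
    return y
```

For example, `interleave([10, 20, 30, 40], [2, 0, 3, 1])` places 10 at index 2, 20 at index 0, 30 at index 3, and 40 at index 1. The three memory accesses per element are the source of the 3*N term in the Memory equation.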
|
Class |
Equation |
|
Raw |
5 + N*7 |
|
Memory |
3*N |
|
Multiplication |
0 |
|
|
4*N+5 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L5,6 eliminated |
( |
|
BPOS |
L6 eliminated |
( |
|
ZOL |
L5-7 eliminated + setup |
( |
|
INDEX |
L3 eliminated |
(MEM) -N |
|
MEM2 |
MEM cut in half |
(MEM) -(3*N)/2 |
|
MEM4 |
MEM cut in fourth |
(MEM) -3*(3*N)/4 |
|
NOREG |
1-4; L1, L4 eliminated |
(MEM) -3*N - 4 |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L5-6 added back |
( |
|
BPOS, BDEC |
L6 added back |
( |
|
BPOS, ZOL |
L6 added back |
( |
|
BPOS, BDEC, ZOL |
L6 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target +(3*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(3*N)/4 |
This component represents the implementation of a block deinterleaver which is given an input linear array of data, an output linear array of data, and a mapping array. Uses the same map as for the interleaver, i.e., moves y into x.
Parameters: N (length, i.e., number of iterations)
deinterleaver(x, y, length, map)
//Move input parameters to local registers
1 R1 = x (address)
2 R2 = y (address)
3 R3 = length (actual #)
4 R4 = map (address)
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
L1 (loop label) R6 = *R4++ (load map, i.e., the address offset)
L2 R7 = R2 + R6 (offset the address)
L3 R5 = *R7 (fetch y[map[k]])
L4 *R1++ = R5 (store in x[k])
L5 R3 = R3 - 1
L6 flag = cmp(R3,0)
L7 if flag (R3!=0), branch to loop
5 (instruction to branch back)
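The inverse data movement can be sketched the same way. The function name `deinterleave` is an illustrative assumption; the sketch reads through the same map that the interleaver wrote through, so the two operations undo each other.

```python
def deinterleave(y, mapping):
    """Read x[k] back from y[mapping[k]] using the interleaver's map."""
    return [y[mapping[k]] for k in range(len(mapping))]
```

With the map `[2, 0, 3, 1]`, the interleaved block `[20, 40, 10, 30]` is restored to `[10, 20, 30, 40]`.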
|
Class |
Equation |
|
Raw |
5 + N*7 |
|
Memory |
3*N |
|
Multiplication |
0 |
|
|
4*N+5 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L5,6 eliminated |
( |
|
BPOS |
L6 eliminated |
( |
|
ZOL |
L5-7 eliminated + setup |
( |
|
INDEX |
L2 eliminated |
(MEM) -N |
|
MEM2 |
MEM cut in half |
(MEM) -(3*N)/2 |
|
MEM4 |
MEM cut in fourth |
(MEM) –3*(3*N)/4 |
|
NOREG |
1-4; L1, L4 eliminated |
(MEM) –3*N |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L5-6 added back |
( |
|
BPOS, BDEC |
L6 added back |
( |
|
BPOS, ZOL |
L6 added back |
( |
|
BPOS, BDEC, ZOL |
L6 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+(3*N )/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(3*N)/4 |
This component represents the implementation of a block CIC interpolator with differential delay of 1 (M=1 in traditional notation). Note that using M greater than 1 will generally increase the cycle count significantly (address offsets must be calculated), but that increasing M further will not increase cycles (though it does increase memory requirements).
Note: most CIC implementations have their stage loops unrolled (typically fewer than 5 stages, well within the number of registers typically available) and this is reflected in the implementation. Also note that we’re still assuming that scaling occurs outside of the CIC filter (the scaling can be quite large for CIC decimators).
Parameters: N (block length of input data)
S (number of stages)
R (upconversion rate)
cic_interpolator(x, y, length, I_reg, C_reg, R)
//Move input parameters to local registers
1 R1 = x (address)
2 R2 = y (address)
3 R3 = length (actual #)
4 R4 = I_reg (address)
5 R5 = C_reg (address)
6 R6 = R (actual #)
//Note inherent assumption that length > 0
//Note for loops are implemented as conditional branches in assembly
OL1 (loop label) R7 = *R1++ (load x)
//Comb Stage (showing 3 unrolled; could be S)
OL2,1 R8 = *R4++ (get stored value stage 1)
OL2,2 R9 = *R4++ (get stored value stage 2)
OL2,3 R10 = *R4++ (get stored value stage 3)
OL3,1 R10 = R9 - R10 (evaluate comb stage 3)
OL3,2 R9 = R8 - R9 (evaluate comb stage 2)
OL3,3 R8 = R7 - R8 (evaluate comb stage 1)
OL4,1 *R4-- = R10 (store comb stage 3)
OL4,2 *R4-- = R9 (store comb stage 2)
OL4,3 *R4-- = R8 (store comb stage 1)
//Integrator Stage (showing 3 unrolled; could be S)
//There are some accumulated zeros from zero-stuffing that can be saved in the first
//integrator with a less general implementation
OL5 R7 = R6
(loop2)
IL1,1 R8 = *R5++ (get stored value stage 1)
IL1,2 R9 = *R5++ (get stored value stage 2)
IL1,3 R11 = *R5++ (get stored value stage 3)
IL2,1 R11 = R11 + R9 (evaluate integrator stage 3)
IL2,2 R9 = R8 + R9 (evaluate integrator stage 2)
IL2,3 R8 = R10 + R8 (evaluate integrator stage 1)
IL3,1 *R5-- = R11 (store integrator stage 3)
IL3,2 *R5-- = R9 (store integrator stage 2)
IL3,3 *R5-- = R8 (store integrator stage 1)
IL4 *R2++ = R11 (store output)
IL5 R7 = R7 - 1
IL6 flag = cmp(R7,0)
IL7 if flag (R7!=0), branch to loop2
//outer loop control
OL6 (label 3) R3 = R3 - 1
OL7 flag = cmp(R3,0)
OL8 if flag (R3!=0), branch to loop
7 (instruction to branch back)
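The structure of the two nested loops (comb stages once per input sample, integrator stages once per output sample) can be sketched in Python as follows. This is a textbook CIC interpolator with M=1 and no output scaling, written under the assumption of zero initial state; the function name `cic_interpolate` and its argument names are introduced here for illustration, and the register scheduling of the listing is not reproduced.

```python
def cic_interpolate(x, stages, rate):
    """CIC interpolator sketch: S combs, zero-stuffing by R, S integrators."""
    comb_state = [0] * stages   # previous input of each comb (M = 1)
    integ_state = [0] * stages  # running sum of each integrator
    y = []
    for sample in x:
        # Comb section: runs once per input sample (outer loop).
        v = sample
        for s in range(stages):
            prev = comb_state[s]
            comb_state[s] = v
            v = v - prev
        # Integrator section: runs R times per input sample (inner loop).
        for r in range(rate):
            u = v if r == 0 else 0  # zero-stuffed input
            for s in range(stages):
                integ_state[s] += u
                u = integ_state[s]
            y.append(u)
    return y
```

With a single stage the structure reduces to a zero-order hold, so `cic_interpolate([1, 2, 3], 1, 3)` repeats each input sample three times. The inner loop body executes N*R times, which is where the N*R terms in the equations that follow come from.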
|
Class |
Equation |
|
Raw |
7+ N*(3*S +5) +
N*R*(3*S +4) |
|
Memory |
N*(2*S+1) + N*R*(2*S+1) |
|
Multiplication |
0 |
|
|
7+N*(S+4) + N*R*(S+3) |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
IL5,6, OL6,7 eliminated |
( |
|
BPOS |
IL6, OL7 eliminated |
( |
|
ZOL |
IL5-7, OL6 eliminated + setup |
( |
|
MEM2 |
MEM cut in half |
(MEM) –N*S*(1+R) -N/2*(1+R) |
|
MEM4 |
MEM cut in fourth |
(MEM) - 3*(N*(2*S+1) + N*R*(2*S+1))/4 |
|
NOREG |
MEM eliminated |
(MEM) -(N*(2*S+1) + N*R*(2*S+1)) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
IL5,6, OL6,7 added back |
( |
|
BPOS, BDEC |
IL6, OL7 added back |
( |
|
BPOS, ZOL |
IL6, OL7 added back |
( |
|
BPOS, BDEC, ZOL |
IL6, OL7 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target +N*S*(1+R)
+N/2*(1+R) |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(N*(2*S+1) + N*R*(2*S+1))/4 |
This component represents the implementation of a CRC encoder.
Parameters: M (length of message)
remainder = crc_encoder(message, p, length, output)
//Move input parameters to local registers
1 R1 = message
2 R2 = p (encoding polynomial)
3 R3 = length (# of 16-bit words)
4 R10 = output
5 R6 = 0 //clear CRC register
6 R7 = 0 //output register
loop1:
L1 R5 = *R1++ //load next 16-bit word
//unrolled 16 times
L1R R9 = R5&(2^15) //left-most bit
L2R flag = cmpgt(R9,0)
L3R if flag (R9==0), branch to label 1
L3.5R R6 = R6 XOR R2
Label 1:
L4R R7 = R7 << 1
L5R R9 = R6&1
L6R R7 = R7 XOR R9
L7R R5 = R5 << 1
L8R R6 = R6 >> 1 //update shift register
L2 *R10++ = R7 //output word
//Note inherent assumption that length > 2
L3 R3 = R3 - 1
L4 flag = cmp(R3,0)
L5 if flag (R3!=0), branch to loop1
7 (instruction to branch back)
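For orientation, the bit-serial shift-and-XOR structure that the unrolled block follows can be sketched in Python as below. This is the textbook MSB-first CRC shift register, not a transcription of the listing: the 16-bit polynomial 0x1021, the zero initial value, and the function name `crc16_bitwise` are all assumptions chosen for illustration.

```python
def crc16_bitwise(words, poly=0x1021, init=0x0000):
    """Bit-serial CRC over a list of 16-bit words, most significant bit first."""
    crc = init
    for word in words:
        for bit in range(15, -1, -1):      # the "unrolled 16 times" block
            in_bit = (word >> bit) & 1     # next message bit
            msb = (crc >> 15) & 1          # bit about to leave the register
            crc = (crc << 1) & 0xFFFF      # update shift register
            if in_bit ^ msb:
                crc ^= poly                # conditional XOR with the polynomial
    return crc
```

A useful property of this form is that appending the computed remainder to the message yields a zero remainder, which is how a receiver would check the block. The conditional XOR executes on roughly half of the bits for random data, which is the reason for the 8.5 factor in the Raw equation.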
|
Class |
Equation |
|
Raw |
7 +8.5*M*16 + M*5 |
|
Memory |
2*M |
|
Multiplication |
0 |
|
|
7+8.5*M*16 + 3*M |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L3,4 eliminated |
( |
|
BPOS |
L4, L2R eliminated |
( |
|
ZOL |
L2-4 eliminated + setup |
( |
|
COND_EXEC |
L3.5R eliminated |
( |
|
EXTRACT |
L5R, L7R eliminated |
( |
|
MEM2 |
MEM cut in half |
(MEM) – (2*M)/2 |
|
MEM4 |
MEM cut in fourth |
(MEM) – 3*(2*M)/4 |
|
NOREG |
MEM eliminated |
(MEM) -(2*M) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L2,3 added back |
( |
|
BPOS, BDEC |
L3 added back |
( |
|
BPOS, ZOL |
L3 added back |
( |
|
BPOS, BDEC, ZOL |
L3 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+(2*M)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(2*M)/4 |
This
component represents the implementation of a
Parameters: M (length
of message)
remainder = crc_encoder(message, p, length, output)
//Move input parameters to local registers
1 R1 = message
2 R2 = p (encoding polynomial)
3 R3 = length (# of 16-bit words)
4 R10 = output
5 R6 = 0 //clear CRC register
6 R7 = 0 //output register
loop1:
L1 R5 = *R1++ //load next 16-bit word
//unrolled 16 times
L1R R9 = R5&(2^15) //left-most bit
L2R flag = cmpgt(R9,0)
L3R if flag (R9==0), branch to label 1
L3.5R R6 = R6 XOR R2
Label 1:
L4R R7 = R7 << 1
L5R R9 = R6&1
L6R R7 = R7 XOR R9
L7R R5 = R5 << 1
L8R R6 = R6 >> 1 //update shift register
L2 *R10++ = R7 //output word
//Note inherent assumption that length > 2
L3 R3 = R3 - 1
L4 flag = cmp(R3,0)
L5 if flag (R3!=0), branch to loop1
7 (instruction to branch back)
|
Class |
Equation |
|
Raw |
7 +8.5*M*16 + M*5 |
|
Memory |
2*M |
|
Multiplication |
0 |
|
|
7+8.5*M*16 + 3*M |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L3,4 eliminated |
( |
|
BPOS |
L4, L2R eliminated |
( |
|
ZOL |
L2-4 eliminated + setup |
( |
|
COND_EXEC |
L3.5R eliminated |
( |
|
EXTRACT |
L5R, L7R eliminated |
( |
|
MEM2 |
MEM cut in half |
(MEM) – (2*M)/2 |
|
MEM4 |
MEM cut in fourth |
(MEM) – 3*(2*M)/4 |
|
NOREG |
MEM eliminated |
(MEM) -(2*M) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L2,3 added back |
( |
|
BPOS, BDEC |
L3 added back |
( |
|
BPOS, ZOL |
L3 added back |
( |
|
BPOS, BDEC, ZOL |
L3 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+(2*M)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(2*M)/4 |
This component represents the implementation of a convolutional encoder for a message of M 16-bit words, with constraint length K (K<=16) and rate r (of the form 1/r; puncturing can occur in a separate process). The encoding is assumed to be hardcoded to minimize cycles. Note: we’re encoding 16 bits per loop.
Parameters: M (# of 16-bit words in message, rounded up)
K (constraint length)
taps (total # of XOR taps for encoding polynomials)
r (rate)
conv_encoder(message, length, g0, …, gr)
//Move input parameters to local registers
1 R1 = message (address)
2 R3 = length (# of 16-bit words)
//Note inherent assumption that length > 2
3 R4 = 0 //buffer is initially empty
// set pointers to output registers (r lines)
4 Rx1 = g0
5 Rx2 = g1
L1 loop1: R5 = *R1++ // storing word to be shifted in
//repeat K times (here we’ll assume K = 3)
L21 R6 = R4 >> 1
L22 R2 = R5 & 1 //these three cycles are easily collapsed to one with good bit control
L23 R2 = R2 << 15
L24 R6 = R6 XOR R2
L31 R7 = R4 >> 2
L32 R2 = R5 & 3 //these three cycles are easily collapsed to one with good bit control
L33 R2 = R2 << 14
L34 R7 = R7 XOR R2
L41 R8 = R4 >> 3
L42 R2 = R5 & 7 //these three cycles are easily collapsed to one with good bit control
L43 R2 = R2 << 13
L44 R8 = R8 XOR R2
//repeat r times
//repeat taps - 1 times
L51 R9 = R8 XOR R6 // g0 = 1+x+x^2
L52 R9 = R9 XOR R5
L53 *g0++ = R9 //store 16-bit result in output word
// note total # operations will be equal to taps
L6 R4 = R5
//loop control
L7 R3 = R3 - 1 //decrement count
L8 flag = Compare R3,0
L9 If flag, branch
//Epilog
6 R5 = 0
(effectively repeat the loop except for L1, L6-L9)
7 (instruction to branch back)
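The encoding rule itself (each output bit is the XOR of the tapped positions of a K-bit shift register) can be sketched bit-serially in Python as below. The listing processes 16 bits per pass using shifted word copies; this sketch processes one bit at a time for clarity. The function name `conv_encode`, the bit-list input format, and the second generator 1+x^2 are assumptions for illustration; only g0 = 1+x+x^2 appears in the listing.

```python
def conv_encode(bits, generators=(0b111, 0b101), k=3):
    """Rate 1/r convolutional encoder sketch; r = len(generators)."""
    mask = (1 << k) - 1
    reg = 0  # shift register; bit 0 holds the newest input bit
    out = []
    for b in bits:
        reg = ((reg << 1) | (b & 1)) & mask
        for g in generators:
            # Output bit = parity (XOR) of the register bits selected by g.
            out.append(bin(reg & g).count("1") & 1)
    return out
```

Each generator with t taps costs t-1 XOR operations per output, which is the role of the taps parameter in the equations that follow.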
|
Class |
Equation |
|
Raw |
5 + r +
(M-1)*(5+4*K+r*taps) + (4*K+r*taps) |
|
Memory |
r*(M) + M-1 |
|
Multiplication |
0 |
|
|
5 + r+ M*(K*4+(taps-1)*r) + (M-1)*4 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L7,L8 eliminated |
( |
|
BPOS |
L8 eliminated |
( |
|
ZOL |
L7-9 eliminated + setup |
( |
|
EXTRACT |
L22-24 et al collapsed to single cycle |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (r*(M) + M-1)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(r*(M) + M-1)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – (r*(M) + M-1) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L7,L8 added back |
( |
|
BPOS, BDEC |
L8 added back |
( |
|
BPOS, ZOL |
L8 added back |
( |
|
BPOS, BDEC, ZOL |
L8 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+ (r*(M) + M-1)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(r*(M) + M-1)/4 |
This component represents the implementation of a Viterbi decoder with hard decisions for a rate 1/r code without puncturing. The transition matrix (which states go to which states) and the output word matrix (what words would be output by those transitions) should be predefined and passed in, and all memory allocations performed externally. For simplicity, the input receive vector is assumed to be grouped into words of r bits without packing. Note that traceback is dramatically simplified by assuming that the traceback length is less than the register width. Finally, note that this should not be used with codes with K>5 (a rule of thumb for traceback is that path lengths should exceed 5.8xK for good SNR).
Parameters: num_states
N (length > 32)
r
A Viterbi decoder is a very involved piece of software. To improve readability, the following conventions were changed. First, variable names were used instead of generic register names. Second, portions of the code were separated out into distinct subsections. None of these subsections are intended to be treated as independent functions/procedures; they are instead intended to be substituted in where indicated.
=======================
viterbi_decoder(rcv_vect,
length, output_vect, transition_matrix, input_output_matrix)
//initialize metrics and paths
VD1 v = base_address
VD2 *v++ = 0 //0 node is
initial node
VD3 reg_bank = base_address
VD4 *reg_bank++ = 0 // initial bit is 0
VD5 rcv_reg = rcv_vect
//VD4 output_reg = output_vect // assigned later when needed
VD6 transition_reg = transition_matrix
VD7 output_matrix = input_output_matrix
//***************
// INITIALIZATION LOOP
VD8 tmp = num_states-1
LABEL: init_loop
INL1 *v++ = MAX_NEG //hardcoded to largest negative
INL2 *reg_bank++ = 0
INL3 tmp = tmp-1
INL4 flag = cmpeq(tmp,0)
INL5 if !flag, branch init_loop
// END
//************
//****************
//
VD9 main_ctr = length
LABEL: main_loop
MNL1 v = 0 address //repoint to beginning
MNL2 reg_bank = 0 address // repoint to beginning
(Branch Metric Unit)
//advances rcv_sample pointer, assigns values to output_metric array
(Path Metric Unit)
//
//Copy results from Path Metric Unit
MNL3 state_cnt = num_states
LABEL:
//END
MNL4 Flag = cmplt(main_ctr,31) // main_ctr > 31 (full register) ??
MNL5 If Flag, branch MNL6
(Traceback Unit)
MNL6 main_ctr = main_ctr - 1
MNL7 flag = cmpeq(main_ctr,0)
MNL8 if !flag, branch main_loop
//
//****************
//****************
// FLUSH REGISTERS
FR1 FR_cnt = 31 //one less than reg width
FR2 temp_reg = reg_bank +
FR3 temp_val = *temp_reg
LABEL: FR_LOOP
FRL1 temp_val = temp_val << 1
FRL2 temp_val2 = temp_val
FRL3 temp_val2 = temp_val2 >> 31
FRL4 temp = temp_val2 & 1
FRL5 *output_vect++ = temp
FRL6 FR_cnt = FR_cnt - 1
FRL7 flag = cmpeq(FR_cnt, 0)
FRL8 if !flag, branch FR_LOOP
VD10 (Instruction to branch back)
Hard_metric(rcv_samp, ideal)
HM1 Tmp = xor(rcv_word,ideal)
HM2 Acc = 0
//repeat #bits per word (except once)
W1 Acc = Acc + (Tmp & 1)
W2 Tmp = Tmp >> 1
===============================
BRANCH METRIC
BMU1 state_cnt = num_states //hard coded
BMU2 Rcv_word = *rcv_samp++
BMU3 output_reg = input_output_matrix
//OUTER
LABEL: OUTER_LOOP
//INNER
//2x following
BMU IL1 ideal = *output_reg++
(Hard_metric)
BMU IL2 *output_metric++
= Acc
BMU OL 1 state_cnt = state_cnt – 1
BMU OL2 flag = cmpeq(state_cnt,0)
BMU OL3 if !flag, branch outer loop
====================
Add-Compare-Select
ACS1 Temp0 = V1 + M1 //metric for path with new
metric V1 and old M1
ACS2 Temp1 = V2 + M2//metric for path with new metric
V2 and old M2
ACS3 Flag
= cmp(temp0,temp1)
ACS4 If flag branch ACS6 //can’t quite do this with just a MAX
ACS4.1 Rtn_v = temp0
ACS4.2 Rtn_index = 0
ACS5 branch ACS7
ACS6 Rtn_v = temp1
ACS6.1 Rtn_index = 1
ACS7 //really first thing out of ACS – not
counted as an actual instruction
===============================
PMU1 PMU_cnt = num_states //hard coded
PMU2 offset = 0
LABEL: PMU_LOOP
//GET INDICES FOR
PMU L1 ind1 = *temp_transition_matrix++
PMU L2 ind2 = *temp_transition_matrix--
//GET
PMU L3 V_temp = V + ind1 //INDEX
PMU L4 V1 = *V_temp
PMU L5 V_temp = V + ind2 //INDEX
PMU L6 V2 = *V_temp
PMU L8 M_temp = output_metrics + ind1
PMU L9 M_temp = output_metrics + offset //INDEX
PMU L10 M1 = *M_temp
PMU L11 M_temp = output_metrics + ind2
PMU L12 M_temp = output_metrics + offset //INDEX
PMU
L13 M2 = *M_temp
(ACS
CALCULATION)
//PMU L14 *temp_v++ =Rtn_v //actually done in
ACS. commented here for clarity
PMU
L14 temp_transition_matrix =
temp_transition_matrix + Rtn_index
PMU
L15 ind2 =
*temp_transition_matrix
PMU
L16 reg_bank_reg = reg_bank
+ ind2 // INDEX
PMU
L17 reg_bank_val =
*reg_bank_reg
PMU L18 reg_bank_val = reg_bank_val << 1 //(zero fill)
PMU L19 reg_bank_val = reg_bank_val OR offset //eliminated
with VSL instruction
PMU L20 *temp_reg_bank++ =reg_bank_val
PMU L21 offset = offset XOR 1 // note, this is very much not
an add as it’s supposed to toggle the value
PMU L22 PMU_cnt = PMU_cnt - 1
PMU L23 flag = cmpeq(PMU_cnt, 0)
PMU L24 if !flag, branch PMU_LOOP
============================
===============================
TRACEBACK
(FIND_MAX)
TBU1 temp_reg = reg_bank +
TBU2 temp_val = *temp_reg
TBU3 temp_val2 = temp_val >> 31 //extract left most bit
TBU4 temp_val = temp_val2 & 1
TBU6 *output_vect++ = temp_val
====================
===================
find_max(int *V_vect)
MAX1 max_cnt = num_states
MAX2 max = 0
MAX3
MAX4 V_vect = V //address
LABEL: MAX_LOOP
MAXL1 tmp = *V_vect++
MAXL2 flag = cmpgt(tmp,max)
MAXL3 if !flag, branch MAXL6
MAXL4
MAXL5 max = tmp
MAXL6 max_cnt = max_cnt – 1
MAXL7 flag = cmpeq(max_cnt, 0)
MAXL8 if !flag, branch MAX_LOOP
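The Hard_metric and Add-Compare-Select subsections above can be sketched in Python as follows. The function names are illustrative assumptions. The sketch also assumes that the larger candidate metric survives, which is suggested by the MAX_NEG initialization and the find_max routine; a caller using Hamming distance as the branch metric would need to negate it under that convention.

```python
def hard_metric(rcv_word, ideal_word):
    """Hamming distance: XOR the words, then count the set bits."""
    return bin(rcv_word ^ ideal_word).count("1")

def add_compare_select(v1, m1, v2, m2):
    """Return (surviving path metric, index of the surviving branch)."""
    temp0 = v1 + m1  # candidate via branch 0
    temp1 = v2 + m2  # candidate via branch 1
    if temp0 >= temp1:
        return temp0, 0
    return temp1, 1
```

For example, `hard_metric(0b10, 0b11)` counts one differing bit, and `add_compare_select(5, -1, 3, 0)` keeps branch 0 with metric 4. The need to return both the metric and the index is the reason the selection cannot be collapsed into a single MAX instruction, as noted in the ACS4 comment.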
To make these
estimations readable, the following breaks down the Viterbi mapping operations
by module along with the number of times that module is called.
|
Class |
Equation |
|
# Called |
N-31 |
|
Raw |
4 + 8*num_states |
|
Memory |
num_states |
|
Multiplication |
0 |
|
|
4 + 7*num_states |
|
Class |
Equation |
|
# Called |
N - 31 |
|
Raw |
6 |
|
Memory |
2 |
|
Multiplication |
0 |
|
|
4 |
|
Class |
Equation |
|
# Called |
N * num_states |
|
Raw |
4+3*0.5+2*0.5 = 6.5 |
|
Memory |
0 |
|
Multiplication |
0 |
|
|
6.5 |
|
Class |
Equation |
|
# Called |
N |
|
Raw |
2+num_states*24 |
|
Memory |
num_states*10 |
|
Multiplication |
0 |
|
|
2+num_states*14 |
|
Class |
Equation |
|
# Called |
N*num_states*2 |
|
Raw |
2+rate*2 -1 |
|
Memory |
0 |
|
Multiplication |
0 |
|
|
2+rate*2 -1 |
|
Class |
Equation |
|
# Called |
N |
|
Raw |
3+num_states*(4+3) |
|
Memory |
1 +num_states*4 |
|
Multiplication |
0 |
|
|
2+num_states*3 |
|
Class |
Equation |
|
# Called |
1 |
|
Raw |
13 +(num_states-1)*5
+ N*(8 + num_states*7) + 3 + 8*31 |
|
Memory |
3+ (num_states-1)*2 + N*(2 + num_states*4) + 0 + 1*31 |
|
Multiplication |
0 |
|
|
10 + (num_states-1)*3
+ N*(6 + num_states*3) + 31*7 |
|
Class |
Equation |
|
Raw |
[13
+(num_states-1)*5 + N*(8 + num_states*7) + 3 + 8*31] + N*[3+num_states*(4+3)]
+ N*num_states*2*[2+rate*2 -1] + N*[2+num_states*24] + N*num_states*6.5 +
6*(N-31) + (N-31)*(4+8*num_states) |
|
Memory |
3+ (num_states-1)*2 + N*(2 + num_states*4) + 1*31 + N*[1 +num_states*4] + N*num_states*2*0 + N * num_states*10 + 0 + (N-31)*2 + (N-31)*num_states |
|
Multiplication |
0 |
|
|
10 + (num_states-1)*3
+ N*(6 + num_states*3) + 31*7 + N*(2+num_states*3) + N*num_states*2*(2+rate*2-1) +
N*(2+num_states*14) + N*num_states*6.5 + (N-31)*4 + (N-31)*( 4 +
7*num_states) |
The following tables describe the effects of the specialized operations, except for MEM2, MEM4, and NOREG, which are handled in the integrated equations.
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
MAXL4,5 eliminated |
( |
|
BPOS |
MAXL5 eliminated |
( |
|
ZOL |
MAXL4-6 eliminated |
( |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
MAXL4,5 added back |
( |
|
BPOS, BDEC |
MAXL5 added back |
( |
|
BPOS, ZOL |
MAXL5 added back |
( |
|
BPOS, BDEC, ZOL |
MAXL5 eliminated (again) |
( |
|
Instruction |
Impact |
Modifier Equation |
|
INDEX |
TBU1 eliminated |
( |
|
EXTRACT |
TBU3 eliminated |
( |
No synergies.
Subtractive Modifiers
|
Instruction |
Impact |
Modifier Equation |
|
BPOS |
PMUL23
eliminated |
( |
No synergies
Subtractive Modifiers
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
PMUL22,23 eliminated |
( |
|
BPOS |
PMUL23 eliminated |
( |
|
ZOL |
PMUL22-24 eliminated |
( |
|
VSL |
PMUL19 eliminated |
( |
|
INDEX |
PMUL3,5,9,12,16 |
( |
Synergistic Modifiers
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
PMUL22,23 added back |
( |
|
BPOS, BDEC |
PMUL23 added
back |
( |
|
BPOS, ZOL |
PMUL23 added back |
( |
|
BPOS, BDEC, ZOL |
PMUL23 eliminated (again) |
( |
Subtractive Modifiers
|
Instruction |
Impact |
Modifier Equation |
|
EXTRACT |
W1 eliminated |
( |
No synergies
Subtractive Modifiers
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
BMUOL1,2 eliminated |
( |
|
BPOS |
BMUOL2
eliminated |
( |
|
ZOL |
BMUOL1-3 eliminated |
( |
Synergistic Modifiers
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
BMUOL1,2 added back |
( |
|
BPOS, BDEC |
BMUOL2 added
back |
( |
|
BPOS, ZOL |
BMUOL2 added
back |
( |
|
BPOS, BDEC, ZOL |
BMUOL2 eliminated (again) |
( |
Subtractive
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
INL3,4, MNL6,7, CPYL5,6, FRL6,7 eliminated |
( |
|
BPOS |
INL4, MNL4,7, CPYL6, FRL7 eliminated |
( |
|
ZOL |
INL3-5, MNL6-8, CPYL5-7, FRL6-8 eliminated |
( |
|
EXTRACT |
FRL1-3 eliminated |
( |
|
INDEX |
FR2 eliminated |
( |
Synergistic Modifiers
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
INL3,4, MNL6,7, CPYL5,6, FRL6,7 added back |
( |
|
BPOS, BDEC |
INL4, MNL7, CPYL6, FRL7 added back |
( |
|
BPOS, ZOL |
INL4, MNL7, CPYL6, FRL7 added back |
( |
|
BPOS, BDEC, ZOL |
INL4, MNL7, CPYL6, FRL7 eliminated again |
( |
|
Class |
Equation |
|
Raw |
13 +(num_states-1)*5
+ N*(11 +2 + num_states*(7+7 +24+6.5+ 2*(2+r*2-1))) + 8*31 +
(N-31)*(6+4+6*num_states) |
|
Memory |
7 +(num_states-1)*2
+ N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states) |
|
Multiplication |
0 |
|
|
6 + (num_states-1)*3
+ N*(8 + 2+6.5+num_states*(3+3+14+2*(2+r*2-1))) +31*(7) +(N-31)*(4 + 5+6*num_states) |
Subtractive
Equations
Note that the impact
of these modifiers is described in the preceding and not duplicated here for
space considerations.
|
Instruction |
Modifier Equation |
|
BDEC |
( |
|
BPOS |
( |
|
ZOL |
( |
|
EXTRACT |
( |
|
INDEX |
( |
|
VSL |
( |
|
MEM2 |
(MEM) –(7 +(num_states-1)*2
+ N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/2 |
|
MEM4 |
(MEM) -3*(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) +
1*31+(N-31)*(2+1+num_states))/4 |
|
NOREG |
(MEM) -(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 +
10)) + 1*31+(N-31)*(2+1+num_states)) |
Synergistic
Modifiers
|
Modifiers |
Modifier Equation |
|
BDEC, ZOL |
( |
|
BPOS, BDEC |
( |
|
BPOS, ZOL |
( |
|
BPOS, BDEC, ZOL |
( |
|
MEM2, |
(MEM) target
+ (7 +(num_states-1)*2 + N*(3 +
num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/2 |
|
MEM4, |
(MEM) target
+3*(7 +(num_states-1)*2 + N*(3 +
num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/4 |
This component represents the implementation of a block real polyphase interpolator with upsampling factor r and original filter length N, to be calculated for M input samples. N/r is assumed to be an integer. (If it’s not, the designer might as well use more coefficients for less ripple, as it’ll take the same amount of time. However, coefficients can be padded.) Note that coef is the address for the coefficients of the original filter, not the polyphase subfilters. Subfiltering is taken care of automatically.
Parameters: M (block length of input data)
N (filter length)
r (upconversion rate)
Requires: Circular buffering
y=polyphase_interpolate(coef, data, length,
output_array)
//Move input parameters to local
registers
1 (instruction
to store previous setting in local register)
2 (instruction
to turn on circ buff)
3 (instruction
to set buffer length)
4 R11
= data
5 R3
= length (actual #)
6 R8
= output_array (address)
//*****************
// block loop
OL1 (outer loop) R2 = data (address) + R5
// block loop
OL2 R7 = r
//number of subfilters
OL3 R10
= coef (address) + r //points r above
h[0]
//*****************
// filter loop
ML1 (middle loop) R1 = R10– R7 //set
up pointer to base of appropriate subfilter
ML2 R2
= R11 //reset data pointer
ML3 acc
= 0 //zero accumulator (typically done by subtracting a register from itself)
ML4 R9
= N/r //length of subfilter, hard coded
//*****************
// subfilter loop
IL1 (inner loop) R4 = *R1 //get
coefficient p(R10-R7)[k]
IL2 R1 = R1 + N/r – 1 // hard
coded value, not two operations // eliminated by indexing
IL3 R5
= *R2++ // note: data steps at normal size
IL4 R6 = R5 * R4
IL5 acc = acc + R6
IL6 R9= R9 – 1 //decrement count
IL7 flag = Compare R9,0
IL8 If flag, branch inner
//******************
ML5 *R8++
= acc //store subfilter result
ML6 R7=
R7 – 1 //decrement count
ML7 flag = Compare R7,0
ML8 If flag, branch middle
// end filter loop
//****************
OL4 R11++ //increment data pointer
OL5 R3=
R3 – 1 //decrement count
OL6 flag = Compare R3,0
OL7 If flag, branch outer
// end block loop
//****************
// Restore stuff
7 (instruction to reset addressing mode)
8 (instruction to reset buffer length)
9 (instruction to branch back)
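The three nested loops (block, subfilter selection, subfilter taps) can be sketched in Python as below, together with a direct zero-stuff-and-filter reference that produces the same output. Both function names are assumptions introduced here, and the sketch assumes zero history before the first input sample rather than the circular buffer used by the listing.

```python
def polyphase_interpolate(coef, data, r):
    """Interpolate by r using N/r-tap subfilters of the original filter."""
    sub_len = len(coef) // r            # N/r, assumed to be an integer
    out = []
    for m in range(len(data)):          # block (outer) loop
        for p in range(r):              # one pass per subfilter (middle loop)
            acc = 0
            for k in range(sub_len):    # subfilter taps (inner loop)
                if m - k >= 0:
                    acc += coef[p + k * r] * data[m - k]
            out.append(acc)
    return out

def zero_stuff_and_filter(coef, data, r):
    """Reference: insert r-1 zeros after each sample, then FIR filter."""
    up = []
    for v in data:
        up.append(v)
        up.extend([0] * (r - 1))
    out = []
    for n in range(len(up)):
        acc = 0
        for j, h in enumerate(coef):
            if n - j >= 0:
                acc += h * up[n - j]
        out.append(acc)
    return out
```

The polyphase form performs (N/r)*r*M multiplications, matching the Multiplication equation that follows, whereas the direct form performs N*r*M, most of them against stuffed zeros.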
|
Class |
Equation |
|
Raw |
9 + 8*(N/r)*r*M + 8*r*M + 7*M |
|
Memory |
2*(N/r)*r*M + 1*r*M +
3*M |
|
Multiplication |
(N/r)*r*M |
|
|
9+ 5*(N/r)*r*M + 7*r*M + 4*M |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
IL7,8, ML6,7, OL5,6 eliminated |
( |
|
BPOS |
IL8, ML7, OL6 eliminated |
( |
|
ZOL |
IL7-9, ML6-8, OL5-7 eliminated |
( |
|
MAC |
IL5 eliminated |
( |
|
INDEX |
IL2 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) -(2*(N/r)*r*M + 1*r*M + 3*M)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(2*(N/r)*r*M + 1*r*M + 3*M)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (2*(N/r)*r*M + 1*r*M + 3*M) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
IL7,8, ML6,7, OL5,6 added back |
( |
|
BPOS, BDEC |
IL8, ML7, OL6 added back |
( |
|
BPOS, ZOL |
IL8, ML7, OL6 added back |
( |
|
BPOS, BDEC, ZOL |
IL8, ML7, OL6 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+ (2*(N/r)*r*M + 1*r*M + 3*M)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(2*(N/r)*r*M + 1*r*M + 3*M)/4 |
This
component represents the implementation of a block AM modulator for DSB-SC AM with
a block of length N. The module takes as inputs a pointer to a sine
table, an increment factor (to define a carrier frequency), a pointer to a
message data block, and an output buffer.
Parameters: N (block
length of input data)
Requires: Circular
buffering
y=AM_modulate(message, sine, increment, offset,
output_array, length)
//Move input parameters to local
registers
1 (instruction
to store previous setting in local register)
2 (instruction
to turn on circ buff) // for sine buffer
3 (instruction
to set buffer length) // presumably known
4 R1
= message
5 R2
= sine
6 R3
= offset
7 R2
= R2 + R3
8 R3
= increment
9 R4
= length (actual #) //also loop counter
10 R5
= output_array
//*****************
// loop
L1 R6 = *R2 (sine sample)
L2 R2 = R2 + R3
L3 R7 = *R1++ //fetch message
L4 R8 = R6*R7 //modulate
L5 *R5++ = R8 // write to output
L6 R4=
R4 – 1 //decrement count
L7 flag = Compare R4,0
L8 If flag, branch
// end block loop
//****************
// Restore stuff
11 (instruction to reset addressing mode)
12 (instruction to reset buffer length)
13 (instruction to branch back)
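The modulation loop can be sketched in Python as follows. The function name `am_modulate` is an assumption, and the modulo operation stands in for the hardware circular buffering that the listing requires for the sine table.

```python
def am_modulate(message, sine_table, increment, offset=0):
    """DSB-SC AM sketch: multiply each message sample by a table-lookup carrier."""
    out = []
    idx = offset % len(sine_table)
    for m in message:
        out.append(sine_table[idx] * m)            # modulate and write output
        idx = (idx + increment) % len(sine_table)  # step with circular wrap
    return out
```

With a four-entry table `[0, 1, 0, -1]` and an increment of 1, a constant message of 2 produces `[0, 2, 0, -2, 0]`. A larger increment steps through the table faster and therefore yields a higher carrier frequency.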
|
Class |
Equation |
|
Raw |
13 + 8*N |
|
Memory |
3*N |
|
Multiplication |
N |
|
|
13 + 4*N |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L6,7 eliminated |
( |
|
BPOS |
L7 eliminated |
( |
|
ZOL |
L6-8 eliminated |
( |
|
MAC |
L5 eliminated |
( |
|
INDEX |
L2 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (3*N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(3*N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – 3*N |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L6,7 added back |
( |
|
BPOS, BDEC |
L7 added back |
( |
|
BPOS, ZOL |
L7 added back |
( |
|
BPOS, BDEC, ZOL |
L7 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+ (3*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target
+ 3*(3*N)/4 |
This component represents the implementation of a simple Costas-loop PLL for DSB-SC AM. The module takes as inputs a pointer to a sin/cos function, a phase accumulator, I and Q branch accumulators, an output_array, and a message data block. All low-pass filters are implemented as integrators. Note that a slightly different structure is needed for samples generated from a complex ADC.
Parameters: N (block length of input data)
Ovsp_factor (ratio of carrier to sampling rate)
C (cycles to call sin function)
y=AM_demodulate(message, sin_function, phase_acc,
phase_inc, I_acc, Q_acc, output_array, length)
//Move input parameters to local
registers
1 R1
= message
2 R2
= sin_function // function to call
// R3
= sin_val
// R4
= cos_val
3 R5
= length (actual #) //also loop counter
4 R6
= output
5 R7
= phase_acc
6 R8
= I_acc
7 R9
= Q_acc
8 R14
= phase_inc
//*****************
// loop
L1 R15 = *R1++ //fetch message sample
// Generate
L2 (write R7 to appropriate function register)
LC call sin_function // puts sin_val + cos_val into R3, and R4 (branch R2)
L3 R10 = R4*R15 // Generate I branch
L4 R8 = R8 + R10 // LPF acc
L5 R11 = R3*R15 // Generate Q branch
L6 R9 = R9 + R11 // LPF acc
L7 *R6++ = R8 //output I branch LPF as message
L8 R12 = R8*R9 //Generate error term
L9 R7 = R7 + R12 // LPF error term acc
L10 R13 = R7 + R14 //phase_inc
L11 flag = cmp(R13, (value for 2pi))
L12 if flag (R13 < 2pi) branch L14
L13.25 R13 = R13 - (value for 2pi) //assume oversampling factor of 4 for labeling purposes
L14 R5 = R5 - 1 //decrement bit count
L15 flag = Compare R5,0
L16 If flag, branch
// end block loop
//****************
// Restore
stuff
9 (instruction
to store I_acc)
10 (instruction
to store Q_acc)
11 (instruction
to store phase)
12 (instruction
to branch back)
|
Class |
Equation |
|
Raw |
11+ N*15 + N/
Ovsp_factor + C*N |
|
Memory |
2*N |
|
Multiplication |
3*N |
|
|
11+ 10*N + N/Ovsp_factor |
|
Function |
N*C (doesn’t get hit by VLIW or SIMD) |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L14,15 eliminated |
( |
|
BPOS |
L15 eliminated (no good way to eliminate L12) |
( |
|
ZOL |
L14-16 eliminated |
( |
|
MAC |
L4,6,9 eliminated |
( |
|
COND_EXEC |
L13.25 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (2*N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(2*N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – (2*N) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L14,15 added back |
( |
|
BPOS, BDEC |
L15 added back |
( |
|
BPOS, ZOL |
L15 added back |
( |
|
BPOS, BDEC, ZOL |
L15 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target
+(2*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target +3*(2*N)/4 |
This
component represents the implementation of a block FM modulator with a block of
length N. The module takes as inputs a pointer to a sine table, an increment
factor (to define a carrier frequency), a pointer to a message data block, and
an output buffer. The frequency deviation constant is built in rather than
being passed in (difference of a cycle). Note that in practical
implementations, this would be coupled with a pre-emphasis filter (which adds
gain at higher frequencies relative to lower frequencies). Also note that the
increment here refers to the phase step for the carrier.
Parameters: N (block length of input data)
Ovsp_factor (ratio of carrier to sampling rate)
C: trig function time
y=FM_modulate(message, sin_function,
message_acc, phase_inc, output_array, length)
//Move input parameters to local
registers
1 R1
= message
2 R2
= sine_function
3 R3
= message_acc
4 R5
= phase_inc
5 R6
= length (actual #) //also loop counter
6 R7
= output_array
//*****************
// loop
L1 R8 = *R1++ //get message value
L2 R3 = R3 + R8 //accumulate message
// in the current form this should be forced to have zero DC bias,
// otherwise some extra cycles will be needed to make the abs of this value < 2pi
L3 R4 = R3 * scale // multiply by frequency deviation constant
L4 R4 = R4 + phase_inc //not an accumulate (could be done as a MAC, but then
// the accumulator would have to be reset every time through)
L5 flag = Compare R4 < (value for 2 pi)
L6 If flag, branch L7
L6.25 R4 = R4 - (value for 2 pi)
// Generate modulated
L7 (write R4 to appropriate function register)
LC call sin_function // puts sin_val into R9
L8 *R7++ = R9 //write output
L9 R6 = R6 - 1 //decrement block count
L10 flag = Compare R6,0
L11 If flag, branch
//****************
// Restore stuff
7 (store message_acc)
8 (instruction to branch back)
|
Class |
Equation |
|
Raw |
8 + N*11 + N/Ovsp_factor + C*N |
|
Memory |
2*N |
|
Multiplication |
N |
|
|
8+ 9*N + N/Ovsp_factor |
|
Function |
N*C (doesn’t get hit by VLIW or SIMD) |
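As a worked illustration of how the class equations above are evaluated, the following Python sketch computes the FM modulator estimates for given values of N, Ovsp_factor, and C, and applies the MEM2, MEM4, and NOREG modifier equations. The function names and dictionary keys are assumptions made for this example and are not part of the tool. The class whose label is blank in the table is reported under the placeholder key "unnamed".

```python
def fm_mod_estimates(n, ovsp_factor, c):
    """Evaluate the FM modulator class equations from the table above.

    n: block length N, ovsp_factor: Ovsp_factor, c: trig function time C.
    The key names are hypothetical labels chosen for this sketch.
    """
    wraps = n / ovsp_factor  # expected executions of L6.25 (the 2*pi wrap)
    return {
        "raw": 8 + n * 11 + wraps + c * n,
        "memory": 2 * n,
        "multiplication": n,
        "unnamed": 8 + 9 * n + wraps,  # class label is blank in the source table
        "function": n * c,
    }


def apply_mem_modifier(memory, n, modifier):
    """Apply the MEM2 / MEM4 / NOREG modifier equations to a memory count."""
    reduction = {
        "MEM2": (2 * n) / 2,      # MEM operations cut in half
        "MEM4": 3 * (2 * n) / 4,  # MEM operations cut to a quarter
        "NOREG": 2 * n,           # MEM operations eliminated
    }
    return memory - reduction[modifier]


if __name__ == "__main__":
    est = fm_mod_estimates(1024, 8, 20)
    print(est)
    print(apply_mem_modifier(est["memory"], 1024, "MEM2"))
```

With N = 1024, Ovsp_factor = 8, and C = 20, the wrap term contributes 1024/8 = 128 cycles and the function term contributes 20*1024 cycles.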
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L9,10 eliminated |
( |
|
BPOS |
L10 eliminated (no good way to eliminate L6) |
( |
|
ZOL |
L9-11 eliminated |
( |
|
COND_EXEC |
L6.25 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) - (2*N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(2*N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (2*N) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L9,10 added back |
( |
|
BPOS, BDEC |
L10 added back |
( |
|
BPOS, ZOL |
L10 added back |
( |
|
BPOS, BDEC, ZOL |
L10 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (2*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(2*N)/4 |
This component represents the implementation of a block FM demodulator operating on a block of length N. The module takes as inputs a pointer to an arctangent function, I and Q samples, stored loop and phase accumulators, the block length, and the output array. Note that complex input samples could be created via an external Hilbert transform or taken from a complex ADC.
Parameters: N (block length of input data)
Ovsp_factor (oversampling factor: ratio of sampling rate to carrier frequency)
C1, C2 (trig function times: C1 for the sin_cos call at LC1, C2 for the atan call at LC2)
FM_demodulate(sig_I, sig_Q, arctan_function, loop_acc, phase_acc, output_array, length, sin_function)
//Move input parameters to local registers
1 R1 = sig_I
2 R2 = sig_Q
3 R3 = length
4 R4 = output_array
5 R5 = phase_acc
6 R6 = loop_acc
//*****************
// loop
//generate appropriate real, imag from
L1 (push phase_acc to appropriate register for sin_cos call)
LC1 sin_cos_function //function to call, put in R7,R8
//complex multiplication
// I
L2 R9 = R1 * R7
L3 R11 = R2 * R8
L4 R9 = R9 - R11
// Q
L5 R10 = R2 * R7
L6 R11 = R1 * R8
L7 R10 = R10 + R11
L8 (push R9 to appropriate register for atan call)
L9 (push R10 to appropriate register for atan call)
LC2 atan_function (assume ends up in R7)
L10 R8 = R7 * loop_constant //needed to tweak loop BW
L11 R6 = R6 + R8 //accumulate loop
L12 R7 = R7 + R6 //actual output
L13 *R4++ = R7
//phase branch
L14 R8 = R7 * a different constant
L15 R5 = R5 + R8 //phase accumulate
L16 R5 = R5 + hard_code_carrier_step
L17 flag = Compare R5 < (value for 2 pi)
L18 If flag, branch L19
L18.25 R5 = R5 - (value for 2 pi)
L19 R3 = R3 - 1 //decrement block count
L20 flag = Compare R3,0
L21 If flag, branch
// end block loop
//****************
// Restore stuff
7 (instruction to store phase_acc)
8 (instruction to store loop_acc)
9 (instruction to branch back)
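The loop above can be sketched in floating-point Python to make the data flow easier to follow. This is only an illustration under stated assumptions: R7 and R8 are taken to be the cosine and sine of phase_acc, math.atan2 stands in for the atan call, and the parameter names phase_constant and carrier_step (for "a different constant" and hard_code_carrier_step) and all default values are hypothetical.

```python
import math


def fm_demodulate(sig_i, sig_q, loop_acc, phase_acc,
                  loop_constant=0.01, phase_constant=0.1, carrier_step=0.0):
    """Floating-point sketch of the FM demodulator loop (labels refer to the listing)."""
    two_pi = 2.0 * math.pi
    out = []
    for i_s, q_s in zip(sig_i, sig_q):
        c = math.cos(phase_acc)              # LC1: assumed to land in R7
        s = math.sin(phase_acc)              # LC1: assumed to land in R8
        i_mix = i_s * c - q_s * s            # L2-L4
        q_mix = q_s * c + i_s * s            # L5-L7
        err = math.atan2(q_mix, i_mix)       # L8, L9, LC2
        loop_acc += err * loop_constant      # L10, L11
        y = err + loop_acc                   # L12
        out.append(y)                        # L13
        phase_acc += y * phase_constant      # L14, L15
        phase_acc += carrier_step            # L16
        if not phase_acc < two_pi:           # L17, L18
            phase_acc -= two_pi              # L18.25
    return out, loop_acc, phase_acc


if __name__ == "__main__":
    samples_i = [1.0, 0.0]
    samples_q = [0.0, 1.0]
    print(fm_demodulate(samples_i, samples_q, 0.0, 0.0,
                        loop_constant=0.0, phase_constant=0.0))
```

As in the listing, the updated accumulators are returned so that the caller can store them for the next block.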
|
Class |
Equation |
|
Raw |
9 + N*(21+C1+C2) + N/Ovsp_factor |
|
Memory |
4*N |
|
Multiplication |
6*N |
|
|
9+ 11*N + N/Ovsp_factor |
|
Function |
N*(C1+C2) (doesn’t get hit by VLIW or SIMD) |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L19,20 eliminated |
( |
|
BPOS |
L20 eliminated (no good way to eliminate L6) |
( |
|
ZOL |
L19-21 eliminated |
( |
|
MAC |
L4,7,11,15 eliminated |
( |
|
COND_EXEC |
L18.25 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (4*N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(4*N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – (4*N) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L19,20 added back |
( |
|
BPOS, BDEC |
L20 added back |
( |
|
BPOS, ZOL |
L20 added back |
( |
|
BPOS, BDEC, ZOL |
L20 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (4*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(4*N)/4 |
This component represents the implementation of a block BPSK sine wave modulator. It modulates N 16-bit words with M samples per symbol. Sine values are generated by stepping through a sine table, and phase shifts are accomplished by stepping halfway through the sine table. To simplify the implementation, the sine wave buffer is set up for circular addressing.
Parameters: N (number of input words, 16-bit)
M (samples / symbol)
Requires: Circular buffering
BPSK_mod(word_ptr, sine_table, output_buffer, increment)
//Move input parameters to local registers
1 (instruction to store previous settings)
2 (instruction to store previous settings)
3 (instruction to turn on circ buff) (for sine_table)
4 (instruction to set buffer length)
5 R1 = word_ptr
6 R2 = sine_table
7 R3 = output_buffer
8 R4 = N
9 R5 = increment //(defines frequency)
10 R6 = 0 //used to store old bit
//*****************
// word loop
OL1 (outer loop) R7 = *R1++ //fetch word
OL2 R8 = 16 //set bit counter
//*****************
// bit loop
ML1 R9 = R7 & 1
ML2 R7 = R7 >> 1
ML3 R10 = CMPEQ (R9, R6)
ML4 if !R10 branch ML6
ML5 R2 = R2 + pi (equivalent in index)
ML6 R11 = M
ML7 R6 = R9
//*****************
// sine loop
IL1 R12 = *R2
IL2 R2 = R2 + R5 //indexed addressing
IL3 *R3++ = R12
IL4 R11 = R11 - 1 //decrement sample count
IL5 flag = Compare R11,0
IL6 If flag, branch sine
// END SINE
//*******************
ML8 R8 = R8 - 1 //decrement bit count
ML9 flag = Compare R8,0
ML10 If flag, branch bit
// END BIT
//*******************
OL3 R4 = R4 - 1 //decrement word count
OL4 flag = Compare R4,0
OL5 If flag, branch word
// END WORD
//*******************
//Restore stuff
11 (instruction to reset addressing mode)
12 (instruction to reset buffer settings)
13 (instruction to branch back)
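A behavioral sketch of the three nested loops in Python may help when checking the listing. It assumes that the half-table phase step is applied whenever the current bit differs from the previous bit, that bits are taken LSB first (as ML1 and ML2 do), and that circular addressing can be modeled with a modulo on the table index. The function name and argument order are illustrative only.

```python
def bpsk_mod(words, sine_table, increment, m):
    """Behavioral sketch of the BPSK modulator: 16 bits per word, m samples per bit."""
    size = len(sine_table)
    half = size // 2          # index offset equivalent to pi
    idx = 0                   # plays the role of R2 (table pointer)
    old_bit = 0               # R6
    out = []
    for word in words:                        # word loop
        for _ in range(16):                   # bit loop
            bit = word & 1                    # ML1
            word >>= 1                        # ML2
            if bit != old_bit:                # assumed intent of ML3-ML5
                idx = (idx + half) % size     # phase step of pi
            old_bit = bit                     # ML7
            for _ in range(m):                # sine loop
                out.append(sine_table[idx])   # IL1, IL3
                idx = (idx + increment) % size  # IL2 with circular wrap
    return out


if __name__ == "__main__":
    # A table holding its own indices makes the addressing pattern visible.
    print(bpsk_mod([0b01], list(range(8)), increment=1, m=2)[:8])
```

The output length is 16*M samples per input word, which is the factor that appears in the memory and raw equations.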
|
Class |
Equation |
|
Raw |
13 + 6*M*16*N + 16*N*10 + 5*N |
|
Memory |
2*M*16*N+N |
|
Multiplication |
0 |
|
|
13 + 4*M*16*N + 16*N*10 + 4*N |
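The equations above follow from counting listing lines per loop level: 13 setup and restore instructions, 5 outer-loop lines per word, 10 middle-loop lines per bit, and 6 inner-loop lines per output sample. The Python sketch below makes that bookkeeping explicit. The helper name and its parameters are assumptions for illustration and are not part of the tool.

```python
def nested_loop_cycles(setup, outer_lines, middle_lines, inner_lines, n, m, bits=16):
    """Cycle count for a word / bit / sample loop nest.

    setup: instructions executed once
    outer_lines: instructions per word (N iterations)
    middle_lines: instructions per bit (bits * N iterations)
    inner_lines: instructions per output sample (M * bits * N iterations)
    """
    return (setup
            + outer_lines * n
            + middle_lines * bits * n
            + inner_lines * m * bits * n)


def bpsk_mod_raw(n, m):
    """Raw equation: 13 + 6*M*16*N + 16*N*10 + 5*N."""
    return nested_loop_cycles(13, 5, 10, 6, n, m)


def bpsk_mod_memory(n, m):
    """Memory equation: 2*M*16*N + N (two accesses per sample, one per word)."""
    return 2 * m * 16 * n + n


if __name__ == "__main__":
    print(bpsk_mod_raw(10, 8), bpsk_mod_memory(10, 8))
```

The unlabeled class row uses the same structure with 4 outer-loop and 4 inner-loop lines, i.e. nested_loop_cycles(13, 4, 10, 4, n, m).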
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
OL3,4, ML9,10, IL5,6 eliminated |
( |
|
BPOS |
OL4, ML10, IL6 eliminated |
( |
|
ZOL |
OL3-5, ML8-10, IL4-6 eliminated |
( |
|
EXTRACT |
ML2 eliminated |
( |
|
INDEX |
IL2 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) - (2*M*16*N+N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(2*M*16*N+N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (2*M*16*N+N) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
OL3,4, ML9,10, IL5,6 added back |
( |
|
BPOS, BDEC |
OL4, ML10, IL6 added back |
( |
|
BPOS, ZOL |
OL4, ML10, IL6 added back |
( |
|
BPOS, BDEC, ZOL |
OL4, ML10, IL6 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (2*M*16*N+N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(2*M*16*N+N)/4 |
This component represents the implementation of a block BPSK demodulator. Specifically, it maps signal levels (presumably from the output of a symbol-synchronization process) to bits and packs them into 16-bit words. Note that this block performs neither symbol nor carrier synchronization. For high SNR environments, the AM demodulator can be used to implement a BPSK PLL.
Parameters: N (number of input samples, assumed to be divisible by 16)
BPSK_demod(input_data, output_buffer, N)
//Move input parameters to local registers
1 R1 = input_data
2 R2 = output_buffer
3 R3 = N
4 R4 = 16
5 R5 = 0
6 R6 = 2^15 (1000 0000 0000 0000)
//*****************
// sample
L1 (loop) R7 = *R1++ //fetch sample
L2 if R7<0 branch L3
L2.5 R5 = R5 XOR R6 //should happen half the time
L3 R6 = R6 >> 1 //used to set where inputs go in (0 fill)
L4 R4 = R4 - 1 //decrement bit counter
L5 flag = cmpgt (R4,0) //note that using conditional execution/moving of the following would actually
// ADD cycles because the instructions would be fetched every time instead of just once
L6 if flag (R4 >0), branch L7
L6.1 *R2++ = R5 //write filled word; happens once every 16 passes
L6.2 R4 = 16 //reset bit counter
L6.3 R6 = 2^15
L7 R3 = R3 - 1 //decrement sample count
L8 flag = cmpgt R3,0
L9 If flag (R3>0), branch
7 (instruction to branch back)
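The bit-packing logic can be checked with a short Python sketch. It follows the listing's mask-and-shift scheme, in which a non-negative sample sets the bit at the current mask position and the mask starts at 2^15. Clearing the word accumulator after each write is an assumption added here, since the listing does not show R5 being reset. The function name is illustrative.

```python
def bpsk_demod(samples):
    """Map sample signs to bits and pack them MSB first into 16-bit words."""
    out = []
    word = 0              # R5: word being filled
    mask = 1 << 15        # R6: position of the next bit
    bits_left = 16        # R4: bit counter
    for s in samples:                 # L1
        if not s < 0:                 # L2
            word ^= mask              # L2.5: expected on about half the samples
        mask >>= 1                    # L3
        bits_left -= 1                # L4
        if not bits_left > 0:         # L5, L6
            out.append(word)          # L6.1: once every 16 samples
            bits_left = 16            # L6.2
            mask = 1 << 15            # L6.3
            word = 0                  # assumption: accumulator cleared for next word
    return out


if __name__ == "__main__":
    print([hex(w) for w in bpsk_demod([1, -1] * 8)])
```

The three statements executed once per 16 samples correspond to the 3/16 term in the raw equation, and the sign-dependent XOR corresponds to the 0.5 term.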
|
Class |
Equation |
|
Raw |
7 + N*(9+0.5+3/16) |
|
Memory |
N*(1+1/16) |
|
Multiplication |
0 |
|
|
7 + N*(8+0.5+2/16) |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
L7,8 eliminated |
( |
|
BPOS |
L5,8 eliminated |
( |
|
ZOL |
L7-9 eliminated |
( |
|
COND_EXEC |
L2.5 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (N*(1+1/16))/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(N*(1+1/16))/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – (N*(1+1/16)) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
L7,8 added back |
( |
|
BPOS, BDEC |
L8 added back |
( |
|
BPOS, ZOL |
L8 added back |
( |
|
BPOS, BDEC, ZOL |
L8 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (N*(1+1/16))/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(N*(1+1/16))/4 |
This component represents the implementation of a block binary frequency shift keying (BFSK) modulator leveraging a sine lookup table with two different increment offsets (corresponding to two different frequencies). Note that this approach eliminates the need for any special relationship between symbol length and encoding frequencies because there are no abrupt phase transitions between symbols. The block consists of N 16-bit words, and each FSK symbol spans M samples. M is assumed to be hardcoded (though this is not too important).
Parameters: N (# of 16-bit words to encode)
M (symbol length)
Requires: Circular buffering (for wrap around in sine table)
BFSK_encode(word_ptr, sine_table, output_buffer, increment1, increment2, N)
//Move input parameters to local registers
1 (instruction to store previous settings)
2 (instruction to store previous settings)
3 (instruction to turn on circ buff) (for sine_table)
4 (instruction to set buffer length)
5 R1 = word_ptr
6 R2 = sine_table
7 R3 = output_buffer
8 R4 = N
9 R5 = increment1 //(defines frequency 1)
10 R14 = increment2 //(defines frequency 2)
//*****************
// word loop
OL1 (outer loop) R7 = *R1++ //fetch word
OL2 R8 = 16 //set bit counter
//*****************
// bit loop
ML1 R9 = R7 & 1
ML2 R7 = R7 >> 1
ML3 flag = CMPEQ (R9, 0)
ML4 if flag branch second ML5
ML5 R6 = R5 //increment = increment1 // note one of the ML5s must execute
ML5.5 branch ML6
ML5 R6 = R14 //increment = increment2
ML6 R11 = M //no particular reason that this divide evenly into the length of the sine table
//*****************
// sine loop
IL1 R12 = *R2
IL2 R2 = R2 + R6 //indexed addressing
IL3 *R3++ = R12
IL4 R11 = R11 - 1 //decrement sample count
IL5 flag = Compare R11,0
IL6 If flag, branch sine
// END SINE
//*******************
ML7 R8 = R8 - 1 //decrement bit count
ML8 flag = Compare R8,0
ML9 If flag, branch bit
// END BIT
//*******************
OL3 R4 = R4 - 1 //decrement word count
OL4 flag = Compare R4,0
OL5 If flag, branch word
// END WORD
//*******************
//Restore stuff
11 (instruction to reset addressing mode)
12 (instruction to reset buffer settings)
13 (instruction to branch back)
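The continuous-phase behavior described for this component can be illustrated in Python. The sketch assumes that a 0 bit selects increment2 and a 1 bit selects increment1 (the reading of ML3 through ML5 in which exactly one assignment executes), that bits are taken LSB first, and that circular addressing is modeled with a modulo. The function name is illustrative.

```python
def bfsk_encode(words, sine_table, increment1, increment2, m):
    """Behavioral sketch of the BFSK modulator: one table pointer, two step sizes."""
    size = len(sine_table)
    idx = 0                   # table pointer (R2); never reset between symbols
    out = []
    for word in words:                        # word loop
        for _ in range(16):                   # bit loop
            bit = word & 1                    # ML1
            word >>= 1                        # ML2
            # ML3-ML5: choose the step for this symbol (assumed mapping)
            step = increment2 if bit == 0 else increment1
            for _ in range(m):                # sine loop
                out.append(sine_table[idx])   # IL1, IL3
                idx = (idx + step) % size     # IL2 with circular wrap
    return out


if __name__ == "__main__":
    # A table holding its own indices shows the pointer advancing without jumps.
    print(bfsk_encode([0b10], list(range(16)), 1, 3, 2)[:8])
```

Because the table pointer carries over from one symbol to the next and only the step size changes, the phase is continuous at symbol boundaries, which is why M need not divide evenly into the table length.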
|
Class |
Equation |
|
Raw |
13 + 5*N + 9.5*N*16 + 6*M*N*16 |
|
Memory |
N+2*M*N*16 |
|
Multiplication |
0 |
|
|
13+4*N+9.5*N*16+4*M*N*16 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
OL3,4,ML7,8, IL4,5 eliminated |
( |
|
BPOS |
OL4,ML3,8, IL5 eliminated |
( |
|
ZOL |
OL3-5,ML7-9, IL4-6 eliminated |
( |
|
COND_EXEC |
ML5.5 eliminated (that’s the effect at least) |
( |
|
EXTRACT |
ML2 eliminated |
( |
|
INDEX |
IL2 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) - (N+2*M*N*16)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(N+2*M*N*16)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (N+2*M*N*16) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
OL3,4,ML7,8, IL4,5 added back |
( |
|
BPOS, BDEC |
OL4,ML8, IL5 added back |
( |
|
BPOS, ZOL |
OL4,ML8, IL5 added back |
( |
|
BPOS, BDEC, ZOL |
OL4,ML8, IL5 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (N+2*M*N*16)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(N+2*M*N*16)/4 |
This component represents the implementation of a noncoherent block binary frequency shift keying (BFSK) demodulator. The received signal is passed through two filters tuned to the two frequencies, and the output of one filter is subtracted from the output of the other. The result would then need to be passed through a symbol-synchronization circuit to create actual output samples, which in turn should be passed through the BPSK demodulator defined previously to create bits and packed words. Note that because this is a block operation, we’re assuming alignment occurs external to this procedure.
Parameters: N (# samples)
Filt_length (filter length)
Requires: Circular buffering (for filtering)
BFSK_decode(signal, coef1, coef2, output_buffer, N, filt_length)
//Move input parameters to local registers
1 (instruction to store previous settings)
2 (instruction to store previous settings)
3 (instruction to store previous settings)
4 (instruction to store previous settings)
5 (instruction to turn on circ buff) (for coef1)
6 (instruction to set buffer length)
7 (instruction to turn on circ buff) (for coef2)
8 (instruction to set buffer length)
//preceding eliminates need to reset R2, R3 pointers
9 R1 = signal
10 R2 = coef1
11 R3 = coef2
12 R4 = N
13 R10 = output_buffer
//OUTER
//Set up FILTERS
OL1 acc1 = 0
OL2 acc2 = 0
OL3 R9 = filt_length //hard coded
//Inner Loop (Filters)
IL1 R7 = *R1++ //data (kept out of R4, which holds the block count)
IL2 R5 = *R2++ //filt1
IL3 R6 = R5 * R7
IL4 acc1 = acc1 + R6
IL5 R5 = *R3++ //filt2
IL6 R6 = R5 * R7
IL7 acc2 = acc2 + R6
IL8 R9 = R9 - 1 //decrement filter tap count
IL9 flag = Compare R9,0
IL10 If flag, branch inner Loop
//END FILTERS LOOP
OL4 R11 = acc1 - acc2
OL5 *R10++ = R11
OL5 R1 = R1 - (filt_length - 1) //scaled as need be for word size
OL6 R4 = R4 - 1 //decrement block count
OL7 flag = Compare R4,0
OL8 If flag, branch
// END WORD
//*******************
//Restore stuff
14 (instruction to branch back)
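The filter-and-subtract structure can be sketched in Python as two FIR filters sharing one pass over the input window. The sketch assumes both coefficient sets have the same length and produces one output per fully populated window, so an input of length K yields K - filt_length + 1 outputs. The function name and the example coefficient values are illustrative only.

```python
def bfsk_decode(signal, coef1, coef2):
    """Difference of two FIR filter outputs, one output per input position."""
    if len(coef1) != len(coef2):
        raise ValueError("both filters are assumed to have the same length")
    filt_length = len(coef1)
    out = []
    for start in range(len(signal) - filt_length + 1):   # outer loop
        acc1 = 0                                          # OL1
        acc2 = 0                                          # OL2
        for k in range(filt_length):                      # inner loop (filters)
            x = signal[start + k]                         # IL1: one data fetch
            acc1 += coef1[k] * x                          # IL2-IL4
            acc2 += coef2[k] * x                          # IL5-IL7
        out.append(acc1 - acc2)                           # OL4, OL5
        # advancing `start` by one models rewinding the data pointer
        # by filt_length - 1 after the inner loop
    return out


if __name__ == "__main__":
    print(bfsk_decode([1, 2, 3, 4], [1, 0], [0, 1]))
```

Each inner-loop pass performs three memory reads and two multiplications, which matches the 3*filt_length*N memory term and the 2*filt_length*N multiplication term.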
|
Class |
Equation |
|
Raw |
14 + N*8 + N*filt_length*10 |
|
Memory |
N +3*filt_length*N |
|
Multiplication |
2*filt_length*N |
|
|
14 + N*7 + 5*N*filt_length |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
OL6,7,IL8,9 eliminated |
( |
|
BPOS |
OL7, IL9 eliminated |
( |
|
ZOL |
OL6-8,IL8-10 eliminated |
( |
|
MAC |
IL4,7 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) - (N +3*filt_length*N)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(N +3*filt_length*N)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (N +3*filt_length*N) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
OL6,7, IL8,9 added back |
( |
|
BPOS, BDEC |
OL7, IL9 added back |
( |
|
BPOS, ZOL |
OL7, IL9 added back |
( |
|
BPOS, BDEC, ZOL |
OL7, IL9 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (N +3*filt_length*N)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(N +3*filt_length*N)/4 |
This component represents the implementation of a 16-QAM modulator. It takes in 16-bit words and maps them to I and Q values. An additional routine would be needed to modulate these I and Q values onto a carrier.
Parameters: N (# words)
Requires:
16_QAM_mod(signal, out_I, out_Q, LUT_real, LUT_imag, length)
//Move input parameters to local registers
1 R1 = signal
2 R2 = out_I
3 R3 = out_Q
4 R4 = LUT_real
5 R5 = LUT_imag
6 R10 = length
//OUTER
OL1 R6 = *R1++
OL2 R11 = 4
//INNER
IL1 R7 = R6 >> 2 // GET 2 bits
IL2 R7 = R7 & 3
IL3 R8 = R4 + R7 //LUT_REAL
IL4 R9 = *R8
IL5 *R2 = R9
IL6 R7 = R6 >> 2 // GET 2 bits
IL7 R7 = R7 & 3
IL8 R8 = R5 + R7 //LUT_IMAG
IL9 R9 = *R8
IL10 *R3 = R9
IL11 R11 = R11 - 1 //decrement block count
IL12 flag = Compare R11,0
IL13 If flag, branch Inner Loop
//END INNER
OL3 R10 = R10 - 1 //decrement block count
OL4 flag = Compare R10,0
OL5 If flag, branch
// END WORD
//*******************
//Restore stuff
7 (instruction to branch back)
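The lookup-table mapping can be sketched in Python as follows. The bit ordering (two bits for I, then two bits for Q, taken LSB first) and the example table values are assumptions made for illustration, since the listing does not fix either. The function name is illustrative.

```python
def qam16_mod(words, lut_real, lut_imag):
    """Map each 16-bit word to four (I, Q) pairs using two 4-entry lookup tables."""
    out_i, out_q = [], []
    for word in words:                        # outer loop: one word per pass
        for _ in range(4):                    # inner loop: four symbols per word
            out_i.append(lut_real[word & 3])  # IL1-IL5: two bits select I
            word >>= 2
            out_q.append(lut_imag[word & 3])  # IL6-IL10: two bits select Q
            word >>= 2
    return out_i, out_q


if __name__ == "__main__":
    levels = [-3, -1, 1, 3]   # hypothetical amplitude levels for both tables
    print(qam16_mod([0x00E4], levels, levels))
```

Each symbol costs two table reads and two output writes, which is the N*4*4 term in the memory equation.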
|
Class |
Equation |
|
Raw |
7 + N*5 + N*4*13 |
|
Memory |
N +N*4*4 |
|
Multiplication |
0 |
|
|
7 + N*4 + N*4*9 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
OL3,4,IL11,12 eliminated |
( |
|
BPOS |
OL4, IL12 eliminated |
( |
|
ZOL |
OL3-5,IL11-13 eliminated |
( |
|
EXTRACT |
IL1,6 eliminated |
( |
|
INDEX |
IL3,8 eliminated |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) - (N +N*4*4)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) - 3*(N +N*4*4)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) - (N +N*4*4) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
OL3,4,IL11,12 added back |
( |
|
BPOS, BDEC |
OL4, IL12 added back |
( |
|
BPOS, ZOL |
OL4, IL12 added back |
( |
|
BPOS, BDEC, ZOL |
OL4, IL12 eliminated (again) |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (N +N*4*4)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(N +N*4*4)/4 |
This component represents the implementation of a 16-QAM demodulator. It takes in I/Q samples, maps them to bits, and packs the bits into 16-bit words. Note that separate processes would be needed for carrier recovery, channel equalization, and symbol synchronization.
Parameters: N (# samples)
16_QAM_demod(signal_I, signal_Q, out, length)
//Move input parameters to local registers
1 R1 = signal_I
2 R2 = signal_Q
3 R3 = out
4 R4 = length
// OUTER LOOP (16 samples per pass)
OL1 R5 = 0
OL2 R11 = 4 // assuming 16-bit words
//INNER LOOP (4 samples per pass)
//I
IL1 R6 = *R1++
IL2 flag = compare R6 < 0
IL3 If flag branch IL4
IL3.5 R5 = R5 + 1
IL4 R5 = R5 << 1
IL5 R6 = abs(R6)
IL6 flag = compare R6 < thresh
IL7 If flag branch IL8
IL7.5 R5 = R5 + 1
IL8 R5 = R5 << 1
//Q
IL9 R6 = *R2++
IL10 flag = compare R6 < 0
IL11 If flag branch IL12
IL11.5 R5 = R5 + 1
IL12 R5 = R5 << 1
IL13 R6 = abs(R6)
IL14 flag = compare R6 < thresh
IL15 If flag branch IL16
IL15.5 R5 = R5 + 1
IL16 R5 = R5 << 1
IL17 R11 = R11 - 1 //decrement block count
IL18 flag = Compare R11,0
IL19 If flag, branch Inner Loop
// END INNER LOOP
OL3 *R3++ = R5 //store output word
OL4 R4 = R4 - 1 //decrement block count
OL5 flag = Compare R4,0
OL6 If flag, branch
// END OUTER
//*******************
//Restore stuff
5 (instruction to branch back)
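The decision and packing steps can be sketched in Python. Each I or Q sample yields two bits: a sign bit (1 for non-negative) and a magnitude bit (1 when the absolute value is at or above thresh). The sketch shifts before inserting each bit so that all 16 bits stay inside the word. That ordering, the function name, and the example threshold are assumptions made for illustration.

```python
def qam16_demod(sig_i, sig_q, thresh):
    """Slice I/Q samples into 4 bits per symbol and pack 4 symbols per 16-bit word."""
    out = []
    word = 0
    symbols_in_word = 0
    for i_s, q_s in zip(sig_i, sig_q):        # inner loop body, once per symbol
        for v in (i_s, q_s):                  # I branch, then Q branch
            sign_bit = 0 if v < 0 else 1              # IL2-IL4 / IL10-IL12
            word = (word << 1) | sign_bit
            mag_bit = 0 if abs(v) < thresh else 1     # IL5-IL8 / IL13-IL16
            word = (word << 1) | mag_bit
        symbols_in_word += 1
        if symbols_in_word == 4:              # end of inner loop (R11 reaches 0)
            out.append(word)                  # OL3: store output word
            word = 0                          # OL1 for the next word
            symbols_in_word = 0               # OL2 for the next word
    return out


if __name__ == "__main__":
    samples = [3, 1, -1, -3]
    print([hex(w) for w in qam16_demod(samples, samples, thresh=2)])
```

Only the four conditional increments per symbol depend on the data, which is the source of the 4*0.5 term in the raw equation.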
|
Class |
Equation |
|
Raw |
5 + N*6/16 + N*(19 + 4*0.5)/4 |
|
Memory |
N/16 + N*(2)/4 |
|
Multiplication |
0 |
|
|
5 + N*5/16 + N*4*19/4 |
|
Instruction |
Impact |
Modifier Equation |
|
BDEC |
OL4,5,IL17,18 eliminated |
( |
|
BPOS |
OL5, IL3,7,11,15,18 eliminated |
( |
|
ZOL |
OL4-6, IL17-19 eliminated |
( |
|
COND_EXEC |
IL 3.5,7.5,11.5,15.5 eliminated |
( |
|
VSL |
IL4,8,12,16,3.5,7.5,11.5,15.5 |
( |
|
MEM2 |
MEM operations cut in half |
(MEM) – (N/16 +N/2)/2 |
|
MEM4 |
MEM operations cut to quarter |
(MEM) – 3*(N/16 +N/2)/4 |
|
NOREG |
MEM operations eliminated |
(MEM) – (N/16 +N/2) |
|
Modifiers |
Impact |
Modifier Equation |
|
BDEC, ZOL |
OL4,5,IL17,18 added back |
( |
|
BPOS, BDEC |
OL5,IL18 added back |
( |
|
BPOS, ZOL |
OL5,IL18 added back |
( |
|
BPOS, BDEC, ZOL |
OL5,IL18 eliminated (again) |
( |
|
COND_EXEC, VSL |
IL 3.5,7.5,11.5,15.5 added back |
( |
|
MEM2, |
MEM2 effect undone |
(MEM) target + (N/16 +N/2)/2 |
|
MEM4, |
MEM4 effect undone |
(MEM) target + 3*(N/16 +N/2)/4 |
[1] The specific tradeoff is an expected 7.5 * 8 or 60 cycles for specifying # of 16-bit words versus an added 2 cycles per bit (actually a little bit more) for specifying # bits because of the need for extra loop control.