Components Mappings (Task 1.3b)

 

for the

 

Tool for Automating Estimation of DSP Resource Statistics for Waveform Components

 

 

 

 

 

 

Submitted under Subcontract FP-19738-430292

 

 

An Integrated Tool for SCA Waveform Development, Testing, and Debugging and a Tool for Automated Estimation of DSP Resource Statistics for Waveform Components

 

 

 

 

Version 1.1

 

 

 

Revision History

| Version | Summary of Changes | Date |
|---|---|---|
| 0.1 (JN) | Internal Release | 1/31/08 |
| 1.0 (JN) | Initial Release | 3/26/08 |
| 1.1 (JN) | Revised mappings to reflect accounting shift of stack operations from memory to ALU (improves data memory estimate); Viterbi cleaned up; minor typos | 5/15/08 |

 

 

 

 

 

 

 

 

 

 

Table of Contents

1 Introduction and Methodology
    1.1 Overview
    1.2 Pseudo-code Conventions
        1.2.1 Use instructions consistent with the least complex DSP
        1.2.2 Label instruction lines
        1.2.3 Clearly identify assumptions
        1.2.4 Register conventions
        1.2.5 Reset internal settings before exiting
        1.2.6 Group operations that combine together
        1.2.7 Attempt to parameterize as much as possible
        1.2.8 Comment
    1.3 Creating equations
    1.4 Modifiers
2 Real Filter
    2.1 Pseudocode
    2.2 Raw Operations Equation
    2.3 Impact of Specialized Instructions
    2.4 Synergistic Modifiers
3 Complex Filter
    3.1 Pseudocode
    3.2 Raw Operations Equation
    3.3 Impact of Specialized Instructions
    3.4 Synergistic Modifiers
4 Fast Fourier Transform (FFT)
    4.1 Pseudocode
    4.2 Raw Operations Equation
    4.3 Impact of Specialized Instructions
    4.4 Synergistic Modifiers
5 LMS Equalizer (real)
    5.1 Pseudocode
    5.2 Raw Operations Equation
    5.3 Impact of Specialized Instructions
    5.4 Synergistic Modifiers
6 Taylor Series
    6.1 Pseudocode
    6.2 Raw Operations Equation
    6.3 Impact of Specialized Instructions
    6.4 Synergistic Modifiers
7 CORDIC
    7.1 Pseudocode
    7.2 Raw Operations Equation
    7.3 Impact of Specialized Instructions
    7.4 Synergistic Modifiers
8 Interleaver
    8.1 Pseudocode
    8.2 Raw Operations Equation
    8.3 Impact of Specialized Instructions
    8.4 Synergistic Modifiers
9 DeInterleaver
    9.1 Pseudocode
    9.2 Raw Operations Equation
    9.3 Impact of Specialized Instructions
    9.4 Synergistic Modifiers
10 CIC Filter (Interpolator, M=1)
    10.1 Pseudocode
    10.2 Raw Operations Equation
    10.3 Impact of Specialized Instructions
    10.4 Synergistic Modifiers
11 CRC Encoder
    11.1 Pseudocode
    11.2 Raw Operations Equation
    11.3 Impact of Specialized Instructions
    11.4 Synergistic Modifiers
12 CRC Decoder
    12.1 Pseudocode
    12.2 Raw Operations Equation
    12.3 Impact of Specialized Instructions
    12.4 Synergistic Modifiers
13 Convolutional Encoder
    13.1 Pseudocode
    13.2 Raw Operations Equation
    13.3 Impact of Specialized Instructions
    13.4 Synergistic Modifiers
14 Viterbi Decoder (rate 1/r, hard decisions, traceback = 32)
    14.1 Pseudocode
    14.2 Raw Operations
        14.2.1 Find Max
        14.2.2 Traceback Unit
        14.2.3 Add Compare Select
        14.2.4 Path Metric Unit
        14.2.5 Hard metric
        14.2.6 Branch Metric Unit
        14.2.7 Main Decoder
        14.2.8 Integrated Equations
    14.3 Impact of Specialized Instructions and Synergistic
        14.3.1 Find Max (N-31)
        14.3.2 Traceback Unit (N-31)
        14.3.3 Add Compare Select (N*num_states)
        14.3.4 Path Metric Unit (N)
        14.3.5 Hard metric (N*num_states*2)
        14.3.6 Branch Metric Unit (N)
        14.3.7 Main Decoder
        14.3.8 Integrated Equations
15 Polyphase Interpolator
    15.1 Pseudocode
    15.2 Raw Operations
    15.3 Impact of Specialized Instructions
    15.4 Synergistic Modifiers
16 AM Modulator
    16.1 Pseudocode
    16.2 Raw Operations
    16.3 Impact of Specialized Instructions
    16.4 Synergistic Modifiers
17 AM Demodulator
    17.1 Pseudocode
    17.2 Raw Operations
    17.3 Impact of Specialized Instructions
    17.4 Synergistic Modifiers
18 FM Modulator
    18.1 Pseudocode
    18.2 Raw Operations
    18.3 Impact of Specialized Instructions
    18.4 Synergistic Modifiers
19 FM Demodulator
    19.1 Pseudocode
    19.2 Raw Operations
    19.3 Impact of Specialized Instructions
    19.4 Synergistic Modifiers
20 BPSK Modulator
    20.1 Pseudocode
    20.2 Raw Operations
    20.3 Impact of Specialized Instructions
    20.4 Synergistic Modifiers
21 BPSK Demodulator
    21.1 Pseudocode
    21.2 Raw Operations
    21.3 Impact of Specialized Instructions
    21.4 Synergistic Modifiers
22 BFSK Modulator
    22.1 Pseudocode
    22.2 Raw Operations
    22.3 Impact of Specialized Instructions
    22.4 Synergistic Modifiers
23 BFSK Demodulator
    23.1 Pseudocode
    23.2 Raw Operations
    23.3 Impact of Specialized Instructions
    23.4 Synergistic Modifiers
24 16-QAM Modulator
    24.1 Pseudocode
    24.2 Raw Operations
    24.3 Impact of Specialized Instructions
    24.4 Synergistic Modifiers
25 16-QAM Demodulator
    25.1 Pseudocode
    25.2 Raw Operations
    25.3 Impact of Specialized Instructions
    25.4 Synergistic Modifiers


 

1             Introduction and Methodology

This document describes the steps used to generate the component files created for the “Tool for Automating Estimation of DSP Resource Statistics for Waveform Components” and provides enough detail for others to create similar files for their own components.

 

This section gives an overview of the methodology for writing a component file and details the key processes associated with this methodology. It is followed by 24 example applications of this process to a variety of different component implementations (Sections 2 through 25).

1.1           Overview

A component file is intended to provide a base equation which details all of the instructions necessary to implement a component with the simplest possible instruction set (generally the ARM RISC set). To create this base equation, pseudo-code for the component is written which reflects an anticipated implementation of the component using this instruction set.

 

In general different DSPs will include specialized instructions and architectural characteristics which permit multiple of these simple instructions to execute in a single cycle. These include instructions that explicitly combine pairs of instructions (e.g., a MAC is a multiplication and an accumulation), ones that obviate the need for instructions (e.g., a block repeat eliminates the need for loop control instructions) and architectural optimizations (e.g., SIMD, VLIW) that permit multiple instructions to be executed in a single cycle. To model these conditions, additional equations are introduced which modify the original base equation.

 

Note that implementing a component in a different manner (e.g., an algorithmic optimization) will yield different numbers. The advantage of the approach adopted in this project is that, rather than requiring a detailed analysis of a component for each DSP, a single component analysis suffices for all mapped DSPs, a significant time savings (e.g., for 20 DSPs, only 1/20th of the time is required). This automated process has other advantages as well. Others can leverage the results without duplicating the effort. Separating DSP analysis from component analysis also means that systems engineers need not be experts on every DSP (each can be an expert on one or two and write those files), nor on every possible waveform component.

 

A disadvantage of this approach is that it implicitly assumes every instruction takes a single cycle to execute, thereby overlooking latency and delay, which can vary significantly from DSP to DSP and from instruction to instruction. For well-designed pipelined code, this should have minimal impact on the estimates because these effects should fall outside the loop kernel, but it is likely the largest source of error in the estimation method (and means that the tool should be expected to report fewer cycles than would actually be required).

 

1.2           Pseudo-code Conventions

The first step in estimating the number of cycles required to implement a component is estimating the number of instructions required to implement the component. To ensure that this is done in a manner that facilitates subsequent steps, the following conventions are used.

1.2.1      Use instructions consistent with the least complex DSP

In general there’s a wide variation in the capabilities of DSPs, but all DSPs will have to implement the same set of operations to implement the same process whether it’s done in a single instruction or 10. To appropriately model this, all pseudo-code instructions should be written in a manner consistent with the least complex DSP. In theory, this could vary from instruction to instruction, but using instructions consistent with the ARM9 instruction set will generally suffice.

 

Key examples to be aware of include the way conditional branches are taken (e.g., some require a flag be set, some can be done by directly examining the content of a register) as well as explicit instruction combinations (e.g., a MAC).

1.2.2      Label instruction lines

In general, the basic equation is modified by eliminating instructions that on a processor will either be unnecessary or combined with another instruction. However, it will sometimes be the case that multiple modifiers for a processor will eliminate the same line. To identify these situations in the equation formulation step, we will associate the eliminated instruction lines with the modifier so that when a duplication occurs, it can be handled as a “synergistic” modifier. Also a different appellation should be adopted for instructions in each loop. Examples used in the following include labeling lines as 1,2,3,… for instructions that occur outside a loop, L1, L2, L3, … for instructions inside a loop, OL1, OL2,… for instructions in an outer-loop and so on.
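As a minimal sketch of why the labels matter, the following Python fragment associates each modifier with the set of labeled lines it eliminates and reports every pair of modifiers that targets the same line. The line sets are those listed for the real filter in Section 2.3; the dictionary layout and the function name `find_overlaps` are illustrative assumptions, not part of the tool.

```python
from itertools import combinations

# Lines eliminated by each modifier, as listed for the real filter (Section 2.3).
# The data structure itself is a hypothetical illustration.
eliminated = {
    "BDEC": {"L5", "L6"},
    "BPOS": {"L6"},
    "ZOL": {"L5", "L6", "L7"},
    "MAC": {"L4"},
}

def find_overlaps(eliminated_lines):
    """Return {(modifier_a, modifier_b): shared_lines} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(eliminated_lines), 2):
        shared = eliminated_lines[a] & eliminated_lines[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps

for pair, lines in find_overlaps(eliminated).items():
    print(pair, sorted(lines))
```

Each reported pair is a candidate for a synergistic modifier; MAC shares no lines with the loop-control modifiers, so it needs none.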

1.2.3      Clearly identify assumptions

It is sometimes hard to implement a perfectly generalizable component so some assumptions about a processor’s capabilities must be made (e.g., floating point or 32-bit). When this occurs, make certain that this is clearly identified so it can be built into the component file and thereby excluded from implementation on processors that do not support those assumptions.

1.2.4      Register conventions

While some DSPs (e.g., C54) can directly manipulate elements in their caches from within an instruction, this is the exception. Thus, before performing an operation on an element in memory, the element should be moved to the local register file (e.g., R1-R16). For processors without register files, these extra cycles will be handled via a modifier that eliminates memory accesses.

 

Many DSPs assign special meanings to specific registers in their register files (e.g., to store the branch back address, input data pointers, output data pointers, for circular buffering). The specific registers should be ignored in this pseudo-code because of the significant variation from DSP to DSP. Also the pseudo-code should ignore limitations on the number of available registers. In practice this would require additional cycles to interface with cache, but this will be both component and DSP-specific. If need be, a specialized component could be written to address a register-constrained condition.

 

However, the code should account for the instruction(s) needed to put data into an appropriate register when entering or leaving the procedure. When calling another procedure, there should also be an instruction that places the return address into the appropriate register or stack.

1.2.5      Reset internal settings before exiting

It is a good coding practice that whenever a procedure changes an internal setting, this setting should be reset before exiting the procedure. So, for example, if a circular addressing mode is needed in a particular component, the existing mode should be saved upon entry to the procedure and restored upon exit from the procedure.

1.2.6      Group operations that combine together

To simplify the generation of modifier equations, instructions which group together in known modifiers should be placed sequentially in the pseudo-code. If they cannot be placed sequentially, that is a good indication that the modifier would not apply in that situation.

1.2.7      Attempt to parameterize as much as possible

One of the goals of this tool is to allow for component file reuse as much as possible. For example, there should be no need to redo the analysis when moving from a filter of length 31 to a filter of length 63. Thus when possible, operations which depend on typical parameterizations of the component should be identified and expressed in terms of that parameter.

 

Even when the original coder is unaware of a parameterization, loop counters will generally serve as a parameterizable value (e.g., filter length). Note that a loop counter is not necessary for an equivalent parameterization. In fact, on processors where loop control is a significant burden, it is a common practice to unroll loops. In such a case, the pseudo-code might include a comment that the following should be repeated r times or some such. Again to help identify exactly what will be repeated (or looped) it is helpful to use meaningful labels.

1.2.8      Comment

Reading assembly code is hard (though not as hard as reading machine code!). To promote the sharing of component files and documentation and to make it possible for the original coder to perform validation weeks, if not merely hours, after writing the pseudo-assembly, the pseudo-code should be commented as much as possible. This also has the additional benefit that if a DSP is added to the suite later that has new capabilities, there may be sufficient context to evaluate where it could be applied to previously generated component files.

 

Note that lines used for commenting should not be added to the instruction / cycle count.

1.3           Creating equations

The primary goal of performing the component analysis is to generate the set of equations that define the component file. The basic operations equation is defined by counting up the number of instructions required to implement the component, parameterized as appropriate. This count should then be subdivided into memory operations, multiplication operations, and other (ALU) operations. This is needed for the VLIW algorithms to run correctly.

 

Modifier equations should be generated by reviewing the modifiers listed in Section 1.4 and identifying when the requisite conditions are satisfied in the pseudo-code. When this occurs, this should be noted along with the number of instructions that would be eliminated by the presence of that modifier. Each of these modifier equations should be associated with the specific instruction lines that would be eliminated.

 

Synergistic equations are created by reviewing the list of modifier equations and identifying where modifiers target the same instruction. The synergistic equation should undo the effects of instructions that were double-counted (or triple-counted) by the modifiers.
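The correction for double- and triple-counting follows the inclusion-exclusion principle. As a sketch, let $E_A$ denote the set of instruction lines eliminated by modifier $A$ (this notation is introduced here for illustration and is not used by the tool). The number of distinct lines eliminated when two or three modifiers are present is then:

```latex
\begin{align*}
|E_A \cup E_B| &= |E_A| + |E_B| - |E_A \cap E_B| \\
|E_A \cup E_B \cup E_C| &= |E_A| + |E_B| + |E_C|
  - |E_A \cap E_B| - |E_A \cap E_C| - |E_B \cap E_C|
  + |E_A \cap E_B \cap E_C|
\end{align*}
```

The individual modifier equations supply the $|E_A|$ terms as savings, each pairwise synergistic equation adds the shared lines back in, and a three-way synergistic equation eliminates the common lines once more. This is why the three-modifier rows in the synergistic tables carry the opposite sign from the two-modifier rows.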

1.4           Modifiers

Table 1 reproduces the table of modifiers identified in the document entitled “DSP Mappings (Task 1.3a) for the Tool for Automating Estimation of DSP Resource Statistics for Waveform Components.” In general when pseudo-code is identified as exhibiting the operations listed in the middle column, the effect in the right column is applied to generate a modifier equation. Note that in practice many of these instruction modifiers actually capture multiple instructions (e.g., ABSALU is associated with evaluating the absolute value of the result of any ALU operation, and a MAC can be done with either a multiplication and accumulation or a multiplication and negative accumulation) and that different processors will use different terminology for the same effect (e.g., block repeat versus a zero-overhead loop).

Table 1: Instruction / Cycle modifiers identified from DSP analysis

| Instruction Modifier | Operation | Modeled Effect |
|---|---|---|
| ABSALU | A typical ALU instruction (ADD, SUB) is combined with an absolute value operation. Useful for distance calculations and L1 norms. | Cycles associated with subsequent abs eliminated |
| ADDSUB | The ALU can simultaneously add and subtract two numbers and/or add/subtract two numbers of the native precision. Note, this must be the *same* two numbers, e.g., A+B, A-B | Consecutive adds/subtracts of the same registers eliminated |
| AVG | The processor implements (A+B)/2 | Eliminates a right shift when following an add |
| BDEC | The processor decrements a counter and branches if the counter is non-zero | Loop control cycles (decrement, compare) eliminated |
| BPOS | The processor branches on the conditions of registers rather than requiring an instruction to set a flag. | Cycle eliminated for generating branch condition |
| BITR | The processor reverses the bits of a register in a single cycle. Frequently implemented as bit-reverse incrementing/arithmetic. | Cycles for process to bit-reverse an indexed array are eliminated. 1 cycle per array element added back in. |
| COND_EXEC | All instructions are executed conditionally | Cycles consumed for short control branches are eliminated |
| COND_MOV | All memory/move operations can be executed conditionally | Cycles consumed for short control branches related to memory are eliminated |
| CPX_MPY | A single-cycle complex multiplication. Note that several DSPs have an instruction which implements a complex multiplication (or LMS update or an FIR cycle), but these are multi-cycle instructions. Here, we’re only modeling situations where x_r*y_r – x_i*y_i, x_i*y_r + x_r*y_i. So far, this has also included xyH. | Complex multiplication cycles (6) reduced to 1 per complex multiplication. |
| EXTRACT | The processor is capable of detailed bit manipulation in a single cycle. This takes various forms in different instructions, but a minimum requirement is the ability to extract a specified set of bits from a word and then pack them into bytes | Bit manipulation cycles cut in half. |
| GMPY | The processor supports Galois Field arithmetic (useful in some error correction codes). | Cycles required to mimic Galois Field arithmetic are eliminated, with one cycle per Galois Field arithmetic operation added in. |
| INDEX | Of form *R4[R6]++ | Cycles used to offset a register for a memory operation are eliminated |
| MAC | A single-instruction (functional unit) multiplication and accumulation. Note that when both a multiplier and an ALU are required to implement this, a processor is not considered to exhibit the MAC modifier unless it does not support VLIW | Accumulate cycles eliminated |
| MAX | The processor does the following (and the reverse for a MIN): if A>B, A -> dst | Cycles required to perform move following comparison operation eliminated |
| MAX2 | The processor performs the MAX operation for two pairs of words of native precision. | Cycles required to perform both moves and one comparison eliminated |
| MEM2 | The data bus width of a processor is such that a single instruction fetches 2 words. | Memory cycles cut in half (rounded up). |
| MEM4 | The data bus width of a processor is such that a single instruction fetches 4 words. | Memory cycles cut to one fourth (rounded up). |
| NOREG | The processor memory maps all registers so there’s no need for an instruction to load registers from memory. Processors which do this, however, tend to clock much slower | All cycles used to move data between memory and registers eliminated |
| SAD | Sum of absolute differences. | Absolute values removed. A special case of ABSALU. |
| SIDE_SUM | Adds up bytes in a word. | Eliminate cycles of sum of byte-packed words |
| VSL | A process by which a register is shifted left and an input 1 or 0 is appended to the right-most bit. Useful for keeping track of paths (saves an instruction) and some bit manipulation operations. | Cycle saved per path update in … |
| ZOL | The processor supports a form of zero-overhead (hardware) looping wherein loop instructions are placed in a hardware buffer and repeated a specified number of times. | All loop control cycles eliminated (branch, compare, decrement). One cycle added to set loop counter. |

 

 

2             Real Filter

This component represents the implementation of a real filter without assumptions about the number or symmetry of the taps and computes a single output for a set of inputs. Note that different structures (e.g., block FIR, symmetric FIRs) will be generally coded in a different manner. Because the process will vary from implementation to implementation, all scaling is assumed to occur outside of this function.

2.1           Pseudocode

Requirements: circular

Parameters: N (length)

 

y=fir(coef, data, length, offset)

 

//Set circular buffer params

1              (instruction to store previous setting in local register)

2              (instruction to store buff length)

3              (instruction to turn on circ buff)

4              (instruction to set buffer length)

 

//Move input parameters to local registers

5              R1 = coef (address)

6              R2 = data (address)

7              R2 = data + offset // needed for circular buffering

8              R3 = length (actual #)

 

//zero accumulator (typically done by subtracting a register from itself)

9              acc = 0

 

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label)            R4 = *R1++ // I don't know of a DSP that doesn't support postfix addressing, but if there is one, a cycle would need to be added here

L2            R5 = *R2++

L3            R6 = R5 * R4

L4            acc = acc + R6

L5            R3 = R3 - 1

L6            flag = cmp(R3,0)

L7            if flag (R3!=0), branch to loop

 

//             Move result to output register

10            R_out = acc

 

//             Restore stuff

11            (instruction to reset addressing mode)

12            (instruction to reset buffer length)

13            (instruction to branch back)
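The pseudocode above can be cross-checked with a short executable model. The following Python sketch computes one filter output from a circular data buffer and tallies operations by class as it goes. It assumes the circular buffer has the same length as the coefficient array and, like the pseudocode, that the length is greater than zero; the function name and the dictionary of counts are illustrative only.

```python
def fir_with_counts(coef, data, offset):
    """Compute one FIR output and count operations per class.

    The loop body mirrors lines L1-L7. The 13 straight-line instructions
    are charged to the ALU class, as in the raw operations equation.
    """
    n = len(coef)
    counts = {"memory": 0, "multiplication": 0, "alu": 13}
    acc = 0
    idx = offset % n          # start position inside the circular buffer
    remaining = n             # loop counter (R3 in the pseudocode)
    while True:
        c = coef[n - remaining]   # L1: load coefficient, post-increment
        x = data[idx]             # L2: load sample, post-increment
        idx = (idx + 1) % n       # wrap-around handled by circular addressing
        counts["memory"] += 2
        acc += c * x              # L3: multiply, L4: accumulate
        counts["multiplication"] += 1
        remaining -= 1            # L5: decrement
        counts["alu"] += 4        # L4, L5, L6 (compare), L7 (branch)
        if remaining == 0:
            break
    return acc, counts

result, counts = fir_with_counts([1, 2, 3], [4, 5, 6], offset=1)
print(result, counts, sum(counts.values()))
```

For a length-3 filter the tallies sum to 13 + 7*3 instructions, which agrees with the raw equation in Section 2.2.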

2.2           Raw Operations Equation

| Class | Equation |
|---|---|
| Raw | 13 + 7*N |
| Memory | 2*N |
| Multiplication | N |
| ALU | 4*N + 13 |
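As a worked example of the parameterization, the raw equation can be evaluated for the two filter lengths mentioned in Section 1.2.7 without repeating the analysis:

```latex
\begin{align*}
\text{Raw}(N) &= 13 + 7N \\
\text{Raw}(31) &= 13 + 7 \cdot 31 = 230 \\
\text{Raw}(63) &= 13 + 7 \cdot 63 = 454
\end{align*}
```

In both cases the Memory, Multiplication, and ALU equations sum to the Raw value, since $2N + N + (4N + 13) = 13 + 7N$.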

2.3           Impact of Specialized Instructions

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L5, L6 eliminated | (ALU) -2*N |
| BPOS | L6 eliminated | (ALU) -N |
| ZOL | 1 cycle to set register; L5, L6, L7 eliminated | (ALU) 1 - 3*N |
| MAC | L4 eliminated | (ALU) -N |
| MEM2 | MEM cut in half | (MEM) -(2*N)/2 |
| MEM4 | MEM cut to one fourth | (MEM) -3*(2*N)/4 |
| NOREG | MEM eliminated | (MEM) -(2*N) |

2.4           Synergistic Modifiers

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L5, L6 added back in | (ALU) target + 2*N |
| BPOS, BDEC | L6 added back in | (ALU) target + N |
| BPOS, ZOL | L6 added back in | (ALU) target + N |
| BPOS, BDEC, ZOL | L6 eliminated (again) | (ALU) target - N |
| MEM2, NOREG | MEM2 effect undone | (MEM) target + (2*N)/2 |
| MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(2*N)/4 |
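To show how the base, modifier, and synergistic equations of this section combine, the following Python sketch evaluates them for a given filter length and a given set of modifiers. It is an illustrative model only: the data structures and the function `estimate` are assumptions, the real tool's file format is not shown here, and the "rounded up" behavior of MEM2 and MEM4 is not modeled.

```python
def base(n):
    """Raw operation counts per class for the real filter (Section 2.2)."""
    return {"MEM": 2 * n, "MULT": n, "ALU": 4 * n + 13}

# Modifier equations from Section 2.3: name -> (class, equation in N).
MODIFIERS = {
    "BDEC": ("ALU", lambda n: -2 * n),
    "BPOS": ("ALU", lambda n: -n),
    "ZOL": ("ALU", lambda n: 1 - 3 * n),
    "MAC": ("ALU", lambda n: -n),
    "MEM2": ("MEM", lambda n: -(2 * n) / 2),
    "MEM4": ("MEM", lambda n: -3 * (2 * n) / 4),
    "NOREG": ("MEM", lambda n: -(2 * n)),
}

# Synergistic equations from Section 2.4, applied when all listed
# modifiers are present on the processor.
SYNERGIES = {
    frozenset({"BDEC", "ZOL"}): ("ALU", lambda n: 2 * n),
    frozenset({"BPOS", "BDEC"}): ("ALU", lambda n: n),
    frozenset({"BPOS", "ZOL"}): ("ALU", lambda n: n),
    frozenset({"BPOS", "BDEC", "ZOL"}): ("ALU", lambda n: -n),
    frozenset({"MEM2", "NOREG"}): ("MEM", lambda n: (2 * n) / 2),
    frozenset({"MEM4", "NOREG"}): ("MEM", lambda n: 3 * (2 * n) / 4),
}

def estimate(n, dsp_modifiers):
    """Apply every matching modifier and synergistic equation to the base counts."""
    counts = base(n)
    present = set(dsp_modifiers)
    for name in present:
        cls, equation = MODIFIERS[name]
        counts[cls] += equation(n)
    for combo, (cls, equation) in SYNERGIES.items():
        if combo <= present:
            counts[cls] += equation(n)
    return counts

print(estimate(32, ["BDEC", "BPOS", "ZOL", "MAC"]))
```

For a hypothetical processor exhibiting BDEC, BPOS, ZOL, and MAC, the ALU count collapses to the 13 straight-line instructions plus the one cycle that sets the loop counter, because L4 through L7 are each eliminated exactly once after the synergistic corrections.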

 

3             Complex Filter

This component represents the implementation of a complex filter without assumptions about the number or symmetry of the taps and computes a single output for a set of inputs. Note that different structures (e.g., block FIR, symmetric FIRs) will be generally coded in a different manner. Because the process will vary from implementation to implementation, all scaling is assumed to occur outside of this function.

3.1           Pseudocode

Requirements: circular

Parameters: N (length)

 

y=fir(coef_real, coef_imag, data_real, data_imag, length, offset)

 

//Set circular buffer params

1              (instruction to store previous setting in local register) (real)

2              (instruction to store previous setting in local register) (imag)

3              (instruction to store buff length) (real)

4              (instruction to store buff length) (imag)

5              (instruction to turn on circ buff) (real)

6              (instruction to turn on circ buff) (imag)

7              (instruction to set buffer length) (real)

8              (instruction to set buffer length) (imag)

 

//Move input parameters to local registers

9              R1 = coef_real (address)

10            R2 = coef_imag (address)

11            R3 = data_real (address)

12            R4 = data_imag (address)

13            R3 = data_real + offset // needed for circular buffering

14            R4 = data_imag + offset // needed for circular buffering

15            R5 = length (actual #)

 

//zero accumulators (typically done by subtracting a register from itself)

16            acc_real = 0

17            acc_imag= 0

 

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label)            R6 = *R1++

L2            R7 = *R2++

L3            R8 = *R3++

L4            R9 = *R4++

//end memory fetches

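L5            R10 = R6 * R8 // coef_real * data_real

L6            R11 = R7 * R9 // coef_imag * data_imag

L7            R12 = R7 * R8 // coef_imag * data_real

L8            R13 = R6 * R9 // coef_real * data_imag

L9            acc_real = acc_real + R10

L10          acc_real = acc_real - R11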

L11          acc_imag = acc_imag + R12

L12          acc_imag = acc_imag + R13

//             loop control

L13          R5 = R5 - 1

L14          flag = cmp(R5,0)

L15          if flag (R5!=0), branch to loop

 

//             Move result to output register

18            store real

19            store imag

 

//             Restore stuff

20            (instruction to reset addressing mode) (real)

21            (instruction to reset addressing mode) (imag)

22            (instruction to reset buff) (real)

23            (instruction to reset buff) (imag)

24            (instruction to branch back)
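The arithmetic in the loop body can be checked against ordinary complex multiplication. The following Python sketch follows the same four-multiply, four-accumulate structure with separate real and imaginary arrays. It assumes the circular data buffers have the same length as the coefficient arrays; the function name is illustrative.

```python
def complex_fir(coef_real, coef_imag, data_real, data_imag, offset):
    """Compute one complex FIR output using only real-valued operations."""
    n = len(coef_real)
    acc_real = 0.0
    acc_imag = 0.0
    idx = offset % n                              # circular buffer start
    for k in range(n):
        cr, ci = coef_real[k], coef_imag[k]       # L1, L2: load coefficients
        dr, di = data_real[idx], data_imag[idx]   # L3, L4: load data
        idx = (idx + 1) % n
        acc_real += cr * dr                       # L5, L9
        acc_real -= ci * di                       # L6, L10
        acc_imag += ci * dr                       # L7, L11
        acc_imag += cr * di                       # L8, L12
    return acc_real, acc_imag

# Reference computed with Python's built-in complex type.
coef = [1 + 2j, 3 - 1j]
data = [2 + 0j, 0 + 1j]
reference = sum(c * d for c, d in zip(coef, data))
print(complex_fir([1, 3], [2, -1], [2, 0], [0, 1], offset=0), reference)
```

The subtraction on L10 is what distinguishes the real accumulator from the imaginary one; with an addition there, the result would no longer match the complex reference.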

3.2           Raw Operations Equation

| Class | Equation |
|---|---|
| Raw | 24 + 15*N |
| Memory | 4*N |
| Multiplication | 4*N |
| ALU | 7*N + 24 |

3.3           Impact of Specialized Instructions

| Instruction | Impact | Modifier Equation |
|---|---|---|
| BDEC | L13-14 eliminated | (ALU) -2*N |
| BPOS | L14 eliminated | (ALU) -N |
| ZOL | 1 cycle to set register; L13-15 eliminated | (ALU) 1 - 3*N |
| MAC | L9-12 eliminated | (ALU) -4*N |
| MEM2 | MEM cut in half | (MEM) -(4*N)/2 |
| MEM4 | MEM cut to one fourth | (MEM) -3*(4*N)/4 |
| NOREG | MEM eliminated | (MEM) -(4*N) |
| CMPX_MPY | L5-L12 eliminated | (MULT) -3*N, (ALU) -4*N |

3.4           Synergistic Modifiers

| Modifiers | Impact | Modifier Equation |
|---|---|---|
| BDEC, ZOL | L13, L14 added back in | (ALU) target + 2*N |
| BPOS, BDEC | L14 added back in | (ALU) target + N |
| BPOS, ZOL | L14 added back in | (ALU) target + N |
| BPOS, BDEC, ZOL | L14 eliminated (again) | (ALU) target - N |
| MEM2, NOREG | MEM2 effect undone | (MEM) target + (4*N)/2 |
| MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(4*N)/4 |
| CMPX_MPY, MAC | L9-12 added back in | (ALU) target + 4*N |
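As a worked example, consider a hypothetical processor that exhibits both MAC and CMPX_MPY and no other modifier. Applying the two modifier equations from Section 3.3 and the CMPX_MPY, MAC synergistic equation to the raw equations of Section 3.2 gives:

```latex
\begin{align*}
\text{ALU}  &= (7N + 24) - 4N - 4N + 4N = 3N + 24 \\
\text{MULT} &= 4N - 3N = N \\
\text{MEM}  &= 4N \\
\text{Total} &= 8N + 24
\end{align*}
```

Without the synergistic term the four accumulate lines L9-12 would be subtracted twice, and the ALU count would be understated by $4N$.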

 

 

4             Fast Fourier Transform (FFT)

This component represents the implementation of a computation-in-place radix-2 complex FFT (decimation in time) where all twiddle factors have been precomputed and stored. Note that relevant counters are assumed hard-coded. The first stage of the FFT accesses the input data in bit-reversed fashion, and subsequent stages access it in normal fashion. In practice this means the first pass should be outside of the main loop. To save space, this is not actually coded, though it is reflected in the equations. Also note that slightly higher SNR could be achieved if the FFT input and the output of each stage were scaled by a factor of ½, as opposed to the implicit pre-stage scaling used here. However, this would add 4*log2(N) cycles (right shift and store rather than store).

4.1           Pseudocode

Requirements: bit-reverse addressing, circular addressing

Parameters: N (length) (everything dependent on this is assumed hard coded in)

 

y=fft(real_data, imag_data, twid_real, twid_imag)

 

1              R1 = real_data (address)

2              R2 = imag_data (address)

3              R3 = twid_real (address)

4              R4 = twid_imag (address)

 

//first pass through the input data should be accessed in bit reverse fashion

//after that, should be in normal linear fashion

//that is reflected here for R1, R2

5              (instruction to store previous setting in local register)

6              (instruction to store previous setting in local register)

7              (instruction to turn on bit-reverse add)

8              (instruction to turn on bit-reverse add)

9              (instruction to turn reset addressing mode)

10            (instruction to turn reset addressing mode)

//also there are 3 fewer instructions used (one less pass through the outer loop control)

//note that this will have a negative effect on modifier equations

 

11            stage_counter = log_2 (N) // outer loop counter (stage), hardcoded

12            data_step = 1

13            num_DFT = N/2 //hard coded

14            offset = 1 //used for many things

                // defines # butterflies in DFT

                // also 2*offset*DFT_counter is start address for DFT

                // e.g., in stage 0, DFT2, start address is 2*1*2 = 4 (bit reversed since stage 0)

                //

 

//**********************

//OUTER LOOP (STAGE LOOP)

OL1         DFT_count = num_DFT //set up middle loop counter

 

//**********************

//MIDDLE (DFT) LOOP

//Point to twiddle to W^0

ML1        twid_real_reg = R3 // wrap around

ML2        twid_imag_reg = R4 // wrap around

 

//calculate initial index for butterfly (k*offset)

ML3        temp1 = num_DFT - DFT_count //(k=0,1,2,…)

ML4        temp = temp1 * offset

ML5        temp = temp << 1

 

//initial a address

ML5        a_real_reg = temp

ML6        a_imag_reg = temp

//initial b address

ML7        b_real_reg = a_real_reg + offset

ML8        b_imag_reg = a_imag_reg + offset

 

ML9        butterfly_count = offset //inner loop counter

 

//**********************

//INNER LOOP

label: INNER LOOP

 

//START BUTTERFLY

//A FFT butterfly is implemented as

// A = a + twid*b (complex multiplication)

// B = a – twid*b (complex multiplication)

//butterfly (there are some chips where the following is collapsed down to 2-4 cycles, though none are included in this survey). Labeled separately to make it easier to adjust later

B1           twid_real_val = *twid_real_reg

B2           twid_imag_val = *twid_imag_reg

B3           a_real = *a_real_reg

B4           a_imag = *a_imag_reg

B5           b_real = *b_real_reg

B6           b_imag = *b_imag_reg

 

//complex multiplication b*twid

B7           R1 = b_real*twid_real_val

B8           R2 = b_imag*twid_imag_val

B9           b_mod_real = R1-R2

 

B10         R1 = b_real*twid_imag_val

B11         R2 = b_imag*twid_real_val

B12         b_mod_imag = R1 + R2

 

//A

B13         temp_a_real = a_real + b_mod_real

B14         temp_a_imag = a_imag +b_mod_imag

B15         *a_real_reg++ = temp_a_real

B16         *a_imag_reg++ = temp_a_imag

//B

B17         temp_b_real = a_real - b_mod_real

B18         temp_b_imag = a_imag -b_mod_imag

B19         *b_real_reg++ = temp_b_real

B20         *b_imag_reg++ = temp_b_imag

//END BUTTERFLY

//**********************

 

IL1          twid_real_reg = twid_real_reg + num_DFT // wrap around

IL2          twid_imag_reg = twid_imag_reg + num_DFT // wrap around

 

IL3          butterfly_count = butterfly_count - 1

IL4          flag = cmpeq(butterfly_count,0)

IL5          if flag(butterfly_count !=0), branch INNER LOOP

 

//END INNER LOOP

//**********************

ML10      DFT_count = DFT_count – 1

ML11      flag = cmpeq(DFT_count,0)

ML12      if flag(DFT_count !=0), branch MIDDLE LOOP

//END MIDDLE LOOP

//**********************

 

//resize for next stage

OL2         num_DFT = num_DFT >> 1; //e.g., 8,4,2,1                                                      

OL3         offset = offset << 1;                            

 

OL4         stage_size = stage_size << 1;                           

 

OL5         stage_counter = stage_counter - 1

OL6         flag = cmpeq(stage_counter,0)

OL7         if flag(stage_counter !=0), branch OUTER LOOP

 

//END OUTER LOOP

//**********************

//             Restore stuff

16            (instruction to branch back)
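The butterfly at the heart of the inner loop can be sketched in Python as follows. This is only an illustration of the arithmetic in B7 through B18 using the textbook complex product; the function name `butterfly` and the use of Python complex numbers are assumptions made for the example and are not part of the mapping.

```python
def butterfly(a, b, twid):
    """Radix-2 butterfly written with real arithmetic only.

    Each line below is one multiply or add, so the body contains
    four multiplications and six additions/subtractions.
    """
    r1 = b.real * twid.real           # real*real product
    r2 = b.imag * twid.imag           # imag*imag product
    b_mod_real = r1 - r2              # real part of twid*b
    r1 = b.real * twid.imag           # real*imag cross product
    r2 = b.imag * twid.real           # imag*real cross product
    b_mod_imag = r1 + r2              # imaginary part of twid*b
    top = complex(a.real + b_mod_real, a.imag + b_mod_imag)     # A = a + twid*b
    bottom = complex(a.real - b_mod_real, a.imag - b_mod_imag)  # B = a - twid*b
    return top, bottom
```

The four multiplications per butterfly in this sketch match the 4 multiplications counted per butterfly in the pseudocode.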

4.2           Raw Operations Equation

Class | Equation
Raw | 15 + 7*log2(N) + (N-1)*12 + (20 + 6)*log2(N)
Memory | 10*log2(N)
Multiplication | (N-1)*1 + 4*log2(N)
ALU | 15 + 7*log2(N) + (N-1)*11 + 12*log2(N)

4.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | OL5,6; ML10,11; IL3,4 eliminated | (ALU) -2*(log2(N) + (N-1) + log2(N))
BPOS | OL6; ML11; IL4 eliminated | (ALU) -(log2(N) + (N-1) + log2(N))
ZOL | OL5-7; ML10-12; IL3-5 eliminated | (ALU) (log2(N) + (N-1)) - 3*(log2(N) + (N-1) + log2(N))
MEM2 | MEM cut in half | (MEM) -(10*log2(N))/2
MEM4 | MEM cut to a fourth | (MEM) -3*(10*log2(N))/4
NOREG | MEM eliminated | (MEM) -(10*log2(N))
CMPX_MPY | B7-11 eliminated | (MULT) -3*log2(N), (ALU) -4*log2(N)
MAC | B9, B12 eliminated | (ALU) -2*log2(N)

4.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | OL5,6; ML10,11; IL3,4 added back in | (ALU) target + 2*(log2(N) + (N-1) + log2(N))
BPOS, BDEC | OL6; ML11; IL4 added back in | (ALU) target + (log2(N) + (N-1) + log2(N))
BPOS, ZOL | OL6; ML11; IL4 added back in | (ALU) target + (log2(N) + (N-1) + log2(N))
BPOS, BDEC, ZOL | OL6; ML11; IL4 eliminated (again) | (ALU) target - (log2(N) + (N-1) + log2(N))
MEM2, NOREG | MEM2 effect undone | (MEM) target + (10*log2(N))/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(10*log2(N))/4
CMPX_MPY, MAC | B9, B12 added back in | (ALU) target-2*log2(N)

 

5             LMS Equalizer (real)

This component represents the implementation of a real least-mean-squares equalizer which generates outputs on a symbol by symbol basis and updates the coefficients with the error calculated externally. Because the process will vary from implementation to implementation, all scaling is assumed to occur outside of this function. Note that circular buffering is not generally done with LMS equalization because it is run at the symbol rate rather than at the sample rate.

5.1           Pseudocode

Parameters: N (filter length)

 

y=lms(coef, data, length, error, step)

 

//Move input parameters to local registers

1              R1 = coef (address)

2              R2 = data (address)

3              R3 = length (actual #)

4              R4 = error

5              R5 = step

 

//zero accumulator (typically done by subtracting a register from itself)

6              acc = 0 //an ALU operation

 

//filtering operation

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

//(loop1 label)      

L1,1         R7 = *R1++

L1,2         R8 = *R2++

L1,3         R9 = R7 * R8

L1,4         acc = acc + R9

 

L1,5         R3 = R3 - 1

L1,6         flag = cmp(R3,0)

L1,7         if flag (R3!=0), branch to loop1

 

//adjust coefficients

7              R6 = step * error // no need to calculate this everytime               

8              R6 = R6 * acc //(y*err * weight)

9              R1 = coef (address)//reset pointers

10            R2 = data (address)

11            R3 = length

 

//Loop through and update coefficients

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

//loop2

L2,1         R7 = *R1 // note no postfix adjustment yet

L2,2         R8 = *R2++ //x[k]

L2,3         R9 = R8 * R6 //x [k] * y* err * weight

L2,4         R7 = R9 + R7 //h[k] + x [k] * y* err * weight

L2,5         *R1++ = R7 // h[k] + x [k] * y*err * weight

 

L2,6         R3 = R3 - 1

L2,7         flag = cmp(R3,0)

L2,8         if flag (R3!=0), branch to loop2

 

//             Move result to output register

12            R_out = acc

//             Restore stuff

13            (instruction to branch back)

5.2           Raw Operations Equation

Class | Equation
Raw | 13 + N*15
Memory | 5*N
Multiplication | 2*N + 2
ALU | 8*N + 11
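As a quick sanity check on the equations above, the class counts can be evaluated for a given filter length and compared with the raw count. The sketch below assumes that every raw operation falls into exactly one of the three classes (memory, multiplication, ALU); the function names are hypothetical and exist only for this illustration.

```python
def lms_op_counts(n):
    """Evaluate the LMS equalizer operation equations for filter length n."""
    return {
        "raw": 13 + 15 * n,
        "memory": 5 * n,
        "multiplication": 2 * n + 2,
        "alu": 8 * n + 11,
    }


def classes_sum_to_raw(counts):
    """True when the per-class counts add up to the raw count."""
    total = counts["memory"] + counts["multiplication"] + counts["alu"]
    return total == counts["raw"]
```

For example, an 8-tap equalizer gives 133 raw operations, made up of 40 memory operations, 18 multiplications, and 75 ALU operations.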

5.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L1,5-6; L2,6-7 eliminated | (ALU) -4*N
BPOS | L1,6; L2,7 eliminated | (ALU) -2*N
ZOL | L1,5-7; L2,6-8 eliminated + setup | (ALU) 2 - 6*N
MAC | L1,4; L2,4 eliminated | (ALU) -2*N
MEM2 | MEM cut in half | (MEM) -(5*N)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(5*N)/4
NOREG | MEM eliminated | (MEM) -(5*N)

5.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L1,5-6; L2,6-7 added back | (ALU) target+6*N
BPOS, BDEC | L1,6; L2,7 added back | (ALU) target + 2*N
BPOS, ZOL | L1,6; L2,7 added back | (ALU) target + 2*N
BPOS, BDEC, ZOL | L1,6; L2,7 eliminated (again) | (ALU) target - 2*N
MEM2, NOREG | MEM2 effect undone | (MEM) target + (5*N)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(5*N)/4

 

 

6             Taylor Series

This component represents the implementation of a Taylor series expansion of cos(x). It assumes reciprocals of the factorials are precomputed and stored (with appropriate signs). All scaling is expected to occur outside of this component. Note that the number of terms N refers to the number of series terms used in the calculation, which means the exponent of the last term will be approximately 2*N.
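To make the role of the precomputed table concrete, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions rather than a transcription of the pseudocode: the helper `make_rcp` and the floating-point arithmetic are assumptions, and the sketch tracks powers of x^2 directly, so it uses two multiplications per term.

```python
import math


def make_rcp(terms):
    """Signed reciprocal factorials: -1/2!, +1/4!, -1/6!, ..."""
    return [(-1) ** k / math.factorial(2 * k) for k in range(1, terms + 1)]


def cos_tayl(x, rcp):
    """Approximate cos(x) from a table of signed reciprocal factorials."""
    acc = 1.0          # first term of the series
    x2 = x * x
    power = 1.0
    for coeff in rcp:
        power *= x2            # x^2, x^4, x^6, ...
        acc += power * coeff   # add the next even-power term
    return acc
```

With six terms the last power used is x^12, which is consistent with the last exponent being roughly twice the number of terms.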

6.1           Pseudocode

Parameters: N (terms)

 

y=cos_tayl(x, rcp, length)

 

//Move input parameters to local registers

1              R1 = x (value)

2              R2 = rcp (address)

3              R3 = length (actual #)

 

//initialize accumulator to the first term of the series

4              acc = 1 //scaled as need be

 

//calculate output

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label)            R7 = *R2++ //load rcp

L2            R1 = R1*R1 //even exponents

L3            R9 = R1 * R7

L4            R1 = R1*R1 //odd exponent

L5            acc = acc + R9

L6            R3 = R3 - 1

L7            flag = cmp(R3,0)

L8            if flag (R3!=0), branch to loop

 

//             Move result to output register

5              R_out = acc

6              (instruction to branch back)

6.2           Raw Operations Equation

Class | Equation
Raw | 6 + N*8
Memory | N
Multiplication | 3*N
ALU | 4*N + 6

6.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L6-7 eliminated | (ALU) -2*N
BPOS | L7 eliminated | (ALU) -N
ZOL | L6-8 eliminated + setup | (ALU) 1 - 3*N
MAC | L5 eliminated | (ALU) -N
MEM2 | MEM cut in half | (MEM) -(N)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(N)/4
NOREG | MEM eliminated | (MEM) -(N)

6.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L6-7 added back | (ALU) target + 2*N
BPOS, BDEC | L7 added back | (ALU) target + N
BPOS, ZOL | L7 added back | (ALU) target + N
BPOS, BDEC, ZOL | L7 eliminated (again) | (ALU) target - N
MEM2, NOREG | MEM2 effect undone | (MEM) target + (N)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(N)/4

 

7             CORDIC

This component represents the implementation of a CORDIC algorithm operating in normal mode (as opposed to hyperbolic or linear modes) where arctangent values have been precomputed and stored. Note K is given by K = product over i = 0, ..., N-1 of 1/sqrt(1 + 2^(-2*i)), which is assumed to have been precalculated external to this function (resolution is not frequently changed from call to call).
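The rotation can be sketched in Python as below. This is a floating-point illustration only: multiplications by 2^-i stand in for the right shifts a fixed-point DSP would use, the gain constant and arctangent table are built inside the function for convenience (the component itself assumes both are precomputed), and the function name `cordic` is an assumption for this example.

```python
import math


def cordic(theta, n):
    """Return (cos(theta), sin(theta)) after n CORDIC iterations."""
    atan_table = [math.atan(2.0 ** -i) for i in range(n)]
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # accumulated gain correction
    x, y, z = k, 0.0, theta
    for i in range(n):
        dx = x * 2.0 ** -i    # stands in for a right shift of x by i
        dy = y * 2.0 ** -i    # stands in for a right shift of y by i
        if z >= 0:
            x, y, z = x - dy, y + dx, z - atan_table[i]
        else:
            x, y, z = x + dy, y - dx, z + atan_table[i]
    return x, y
```

Each iteration needs no multiplier on real hardware, only shifts, adds, and one table lookup, which is why the multiplication count for this component is zero.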

7.1           Pseudocode

Parameters: N (length, i.e., number of iterations)

 

result=CORDIC(theta, K, length, atan)

 

//Move input parameters to local registers

1              z = theta (value)

2              R2 = K

3              R3 = length (actual #)

4              R4 = atan (address)

 

//assign initial values (registers)

4              x = K

5              y = 0

6              iter = 0

 

//calculate output

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label)  R5 = RSH (y,iter)

L2            R6 = RSH (x,iter)

L3            R7 = *R4++

L4            flag = cmp(z,0)

L5            if flag (z<0), branch to label 2

 

L5a          x = x - R5

L6a          y = y + R6

L7a          z = z - R7

L7.5a       Branch to label 3

 

(label2)

L5b         x = x + R5

L6b         y = y - R6

L7b         z = z + R7

 

(label 3)

L8            R3 = R3 - 1

L9            flag = cmp(R3,0)

L10          if flag (R3!=0), branch to loop

 

//             Move result to output register

7              store x (cos(theta))

8              store y (sin(theta))

9              (instruction to branch back)

 

7.2           Raw Operations Equation

Class | Equation
Raw | 9 + N*10.5 (L7.5a only executed half of the time)
Memory | N
Multiplication | 0
ALU | 9.5*N + 9

7.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L8,9 eliminated | (ALU) -2*N
BPOS | L4, L9 eliminated | (ALU) -2*N
ZOL | L8-10 eliminated + setup | (ALU) 1 - 3*N
COND_EXEC | L7.5a eliminated | (ALU) -0.5*N
MEM2 | MEM cut in half | (MEM) -(N)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(N)/4
NOREG | MEM eliminated | (MEM) -(N)

7.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L8-9 added back | (ALU) target + 2*N
BPOS, BDEC | L9 added back | (ALU) target + N
BPOS, ZOL | L9 added back | (ALU) target + N
BPOS, BDEC, ZOL | L9 eliminated (again) | (ALU) target - N
MEM2, NOREG | MEM2 effect undone | (MEM) target + (N)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(N)/4

 

8             Interleaver

This component represents the implementation of a block interleaver which is given an input linear array of data, an output linear array of data and a mapping array.

8.1           Pseudocode

Parameters: N (length, i.e., number of iterations)

 

interleaver(x, y, length, map)

 

//Move input parameters to local registers

1              R1 = x (address)

2              R2 = y (address)

3              R3 = length (actual #)

4              R4 = map (address)

 

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label) R5 = *R1++ (load x)

L2            R6 = *R4++ (load map address offset)

L3            R7 = R2 + R6 (offset the address)

L4            *R7 = R5 (store x[k] in y[map])

L5            R3 = R3 - 1

L6            flag = cmp(R3,0)

L7            if flag (R3!=0), branch to loop

 

5              (instruction to branch back)

 

8.2           Raw Operations Equation

Class | Equation
Raw | 5 + N*7
Memory | 3*N
Multiplication | 0
ALU | 4*N + 5

8.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L5,6 eliminated | (ALU) -2*N
BPOS | L6 eliminated | (ALU) -N
ZOL | L5-7 eliminated + setup | (ALU) 1 - 3*N
INDEX | L3 eliminated | (MEM) -N
MEM2 | MEM cut in half | (MEM) -(3*N)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(3*N)/4
NOREG | 1-4; L1, L4 eliminated | (MEM) -3*N - 4

8.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L5-6 added back | (ALU) target + 2*N
BPOS, BDEC | L6 added back | (ALU) target + N
BPOS, ZOL | L6 added back | (ALU) target + N
BPOS, BDEC, ZOL | L6 eliminated (again) | (ALU) target - N
MEM2, NOREG | MEM2 effect undone | (MEM) target + (3*N)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(3*N)/4

 

9             DeInterleaver

This component represents the implementation of a block deinterleaver which is given an input linear array of data, an output linear array of data and a mapping array. Uses the same map as for the interleaver, i.e., moves y into x.
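The relationship between the interleaver and the deinterleaver, both driven by the same map, can be illustrated with a short Python sketch. The function names and the row/column map generator `block_map` are assumptions made for the example; the components themselves simply take whatever mapping array they are given.

```python
def interleave(x, mapping):
    """Store x[k] in y[mapping[k]]."""
    y = [None] * len(x)
    for k, value in enumerate(x):
        y[mapping[k]] = value
    return y


def deinterleave(y, mapping):
    """Fetch y[mapping[k]] back into x[k], using the same map."""
    return [y[mapping[k]] for k in range(len(mapping))]


def block_map(rows, cols):
    """Hypothetical block map: write row-wise, read column-wise."""
    return [(k % cols) * rows + k // cols for k in range(rows * cols)]
```

Because the deinterleaver reads through the same map that the interleaver wrote through, applying one after the other returns the original ordering.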

9.1           Pseudocode

Parameters: N (length, i.e., number of iterations)

 

deinterleaver(x, y, length, map)

 

//Move input parameters to local registers

1              R1 = x (address)

2              R2 = y (address)

3              R3 = length (actual #)

4              R4 = map (address)

 

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

L1            (loop label) R6 = *R4++ (load map address offset)

L2            R7 = R2 + R6 (offset the address)

L3            R5 = *R7 (fetch y[map])

L4            *R1++ = R5 (store in x[k])

L5            R3 = R3 - 1

L6            flag = cmp(R3,0)

L7            if flag (R3!=0), branch to loop

 

5              (instruction to branch back)

 

9.2           Raw Operations Equation

Class | Equation
Raw | 5 + N*7
Memory | 3*N
Multiplication | 0
ALU | 4*N + 5

9.3           Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L5,6 eliminated | (ALU) -2*N
BPOS | L6 eliminated | (ALU) -N
ZOL | L5-7 eliminated + setup | (ALU) 1 - 3*N
INDEX | L3 eliminated | (MEM) -N
MEM2 | MEM cut in half | (MEM) -(3*N)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(3*N)/4
NOREG | 1-4; L1, L4 eliminated | (MEM) -3*N

9.4           Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L5-6 added back | (ALU) target + 2*N
BPOS, BDEC | L6 added back | (ALU) target + N
BPOS, ZOL | L6 added back | (ALU) target + N
BPOS, BDEC, ZOL | L6 eliminated (again) | (ALU) target - N
MEM2, NOREG | MEM2 effect undone | (MEM) target + (3*N)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(3*N)/4

 

10       CIC Filter (Interpolator, M=1)

This component represents the implementation of a block CIC interpolator with differential delay of 1 (M=1 in traditional notation). Note that using M greater than 1 will generally increase the cycle count significantly (address offsets must be calculated), but that further increases in M beyond that point will not add cycles (though they do increase memory requirements).

 

Note: most CIC implementations have their stage loops unrolled (typically length < 5, well within the typically available number of registers) and this is reflected in the implementation. Also note that we are still assuming that scaling occurs outside of the CIC filter (this can be quite large for CIC decimators).
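For orientation, the textbook structure of a CIC interpolator with M=1 is sketched below in Python: S comb stages at the input rate, zero stuffing by the upconversion rate R, then S integrator stages at the output rate. This is a sketch under stated assumptions, not the mapped implementation: the stage loops are written as loops rather than unrolled, zero stuffing is done explicitly, no scaling is applied, and the function name `cic_interpolate` is hypothetical.

```python
def cic_interpolate(x, stages, rate):
    """Textbook CIC interpolator with differential delay M = 1."""
    comb_state = [0] * stages     # one stored sample per comb stage
    integ_state = [0] * stages    # one running sum per integrator stage
    out = []
    for sample in x:
        v = sample
        for s in range(stages):               # comb: y[n] = x[n] - x[n-1]
            prev = comb_state[s]
            comb_state[s] = v
            v = v - prev
        for r in range(rate):
            w = v if r == 0 else 0            # zero stuffing between samples
            for s in range(stages):           # integrator: y[n] = y[n-1] + x[n]
                integ_state[s] += w
                w = integ_state[s]
            out.append(w)
    return out
```

With a single stage the filter reduces to a sample-and-hold, and with S stages a constant input settles to R^(S-1) times the input, which is the gain that the external scaling is expected to remove.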

10.1     Pseudocode

Parameters: N (block length of input data)

                    S (number of stages)

                    R (upconversion rate)

 

cic_interpolator(x, y, length, I_reg, C_reg, R)

 

//Move input parameters to local registers

1              R1 = x (address)

2              R2 = y (address)

3              R3 = length (actual #)

4              R4 = I_reg (address)

5              R5 = C_reg(address)

6              R6 = R (actual #)

 

//Note inherent assumption that length > 0

//Note for loops are implemented as conditional branches in assembly

 

OL1         (loop label) R7 = *R1++ (load x)

//Comb Stage (showing 3 unrolled; could be S)

OL2,1      R8 = *R4++ (get stored value stage 1)

OL2,2      R9 = *R4++ (get stored value stage 2)

OL2,3      R10 = *R4++ (get stored value stage 3)

OL3,1      R10 = R9 - R10 (evaluate comb stage 3)

OL3,2      R9 = R8 - R9 (evaluate comb stage 2)

OL3,3      R8 = R7 - R8 (evaluate comb stage 1)

OL4,1      *R4-- = R10 (store comb stage 3)

OL4,2      *R4-- = R9 (store comb stage 2)

OL4,3      *R4-- = R8 (store comb stage 1)

 

//Integrator Stage (showing 3 unrolled; could be S)

//There are some accumulated zeros from zero-stuffing that can be saved in the first integrator with a less

// general implementation

OL5         R7 = R6

(loop2)

IL1,1       R8 = *R5++ (get stored value stage 1)

IL1,2       R9 = *R5++ (get stored value stage 2)

IL1,3       R11 = *R5++ (get stored value stage 3)

IL2,1       R11 = R11 + R9 (evaluate integrator stage 3)

IL2,2       R9 = R8 + R9 (evaluate integrator stage 2)

IL2,3       R8 = R10 + R8 (evaluate integrator stage 1)

IL3,1       *R5-- = R11 (store integrator stage 3)

IL3,2       *R5-- = R9 (store integrator stage 2)

IL3,3       *R5-- = R8 (store integrator stage 1)

IL4          *R2++ = R11 (store output)

IL5          R7 = R7 - 1

IL6          flag = cmp(R7,0)

IL7          if flag (R7!=0), branch to loop2

//outer loop control

OL6         R3 = R3 - 1

OL7         flag = cmp(R3,0)

OL8         if flag (R3!=0), branch to loop

 

7              (instruction to branch back)

 

10.2     Raw Operations Equation

Class | Equation
Raw | 7 + N*(3*S+5) + N*R*(3*S+4)
Memory | N*(2*S+1) + N*R*(2*S+1)
Multiplication | 0
ALU | 7 + N*(S+4) + N*R*(S+3)

10.3     Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | IL5,6; OL6,7 eliminated | (ALU) -2*N*(1+R)
BPOS | IL6; OL7 eliminated | (ALU) -N*(1+R)
ZOL | IL5-7; OL6 eliminated + setup | (ALU) 1 - 3*N*R
MEM2 | MEM cut in half | (MEM) -N*S*(1+R) - N/2*(1+R)
MEM4 | MEM cut to a fourth | (MEM) -3*(N*(2*S+1) + N*R*(2*S+1))/4
NOREG | MEM eliminated | (MEM) -(N*(2*S+1) + N*R*(2*S+1))

10.4     Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | IL5,6; OL6,7 added back | (ALU) target + 2*N*(1+R)
BPOS, BDEC | IL6; OL7 added back | (ALU) target + N*(1+R)
BPOS, ZOL | IL6; OL7 added back | (ALU) target + N*(1+R)
BPOS, BDEC, ZOL | IL6; OL7 eliminated (again) | (ALU) target - N*(1+R)
MEM2, NOREG | MEM2 effect undone | (MEM) target + N*S*(1+R) + N/2*(1+R)
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(N*(2*S+1) + N*R*(2*S+1))/4

 

 

11       CRC Encoder

This component represents the implementation of a CRC encoder of a message with length M bits with an encoding polynomial of order r where r<=16. The implementation assumes a register width of 16. While this is a straight mapping from a typical hardware implementation (bit-by-bit), a more efficient CRC encoder can be implemented using LUTs when the CRC has a relatively low order (e.g., <=5). However, the LUT approach is impractical for high order encoders (e.g., a CRC with order 16 would need 2^(16+) entries). Note that for very short messages, it would be more efficient to specify the number of bits instead of the number of 16-bit words.[1] Also note that if program memory were more important than cycles, this could be rewritten to loop instead of unrolling the loop. Also note that the input message should be padded with zeros at the end to effect outputting of the CRC check bits. Finally, the polynomial should include the leading 1.
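The bit-by-bit division described above can be sketched in Python as follows. This is a textbook long-division sketch, not the register-level mapping: it works on a list of individual bits rather than packed 16-bit words, and the function name `crc_remainder` and its argument layout are assumptions for the example. As the text requires, the polynomial includes its leading 1 and the message is padded with zeros to flush out the check bits.

```python
def crc_remainder(message_bits, poly, order):
    """Bit-serial CRC division.

    message_bits: list of 0/1 values, most significant bit first.
    poly: generator polynomial as an integer, including the leading 1.
    order: degree of the polynomial (number of check bits).
    """
    reg = 0
    for bit in message_bits + [0] * order:   # zero padding flushes the check bits
        reg = (reg << 1) | bit               # shift the next bit in
        if reg >> order:                     # leading bit set: subtract (XOR) the polynomial
            reg ^= poly
    return reg
```

For the message 11010011101100 and polynomial 1011 (order 3) the remainder is 100. Running the same division over the message with those check bits appended gives a zero remainder, which is the check a decoder performs.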

11.1     Pseudocode

Parameters: M (length of message)

                   

remainder = crc_encoder(message, p, length, output)

 

//Move input parameters to local registers

1              R1 = message

2              R2 = p (encoding polynomial)

3              R3 = length (#16 bit words)

4              R10 = output

 

5              R6 = 0 //reset CRC register

6              R7 = 0 //output register

 

loop1:

L1            R5 = *R1++ //load first 16-bit word

//unrolled 16 times

L1R         R9 = R5&(2^15) //leftmost bit

L2R         flag = cmpgt(R9,0)

L3R         if flag(R9==0) branch label1

L3.5R      R6 = R6 XOR R2

Label 1:

L4R         R7 = R7 << 1

L5R         R9 = R6&1

L6R         R7 = R7 XOR R9

L7R         R5 = R5 << 1

L8R         R6 = R6 >> 1 //update shift register

 

L2            *R10++ = R7 //output word

 

//Note inherent assumption that length > 2

L3            R3 = R3 - 1

L4            flag = cmp(R3,0)

L5            if flag (R3!=0), branch to loop1

 

7              (instruction to branch back)

 

11.2     Raw Operations Equation

Class | Equation
Raw | 7 + 8.5*M*16 + M*5
Memory | 2*M
Multiplication | 0
ALU | 7 + 8.5*M*16 + 3*M

11.3      Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L3,4 eliminated | (ALU) -2*M
BPOS | L4, L2R eliminated | (ALU) -M - 16*M
ZOL | L3-5 eliminated + setup | (ALU) 1 - 3*M
COND_EXEC | L3.5R eliminated | (ALU) -8*M
EXTRACT | L5R, L7R eliminated | (ALU) -2*16*M
MEM2 | MEM cut in half | (MEM) -(2*M)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(2*M)/4
NOREG | MEM eliminated | (MEM) -(2*M)

11.4     Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L3,4 added back | (ALU) target + 2*M
BPOS, BDEC | L4 added back | (ALU) target + M
BPOS, ZOL | L4 added back | (ALU) target + M
BPOS, BDEC, ZOL | L4 eliminated (again) | (ALU) target - M
MEM2, NOREG | MEM2 effect undone | (MEM) target + (2*M)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(2*M)/4

 

12       CRC Decoder

This component represents the implementation of a CRC decoder of a message with length M bits with an encoding polynomial of order r where r<=16. Note that a CRC decoder is a CRC encoder, except that the input includes check bits (implicit to the output of the encoder above).

12.1     Pseudocode

Parameters: M (length of message)

                   

remainder = crc_decoder(message, p, length, output)

 

//Move input parameters to local registers

1              R1 = message

2              R2 = p (encoding polynomial)

3              R3 = length (#16 bit words)

4              R10 = output

 

5              R6 = 0 //reset CRC register

6              R7 = 0 //output register

 

loop1:

L1            R5 = *R1++ //load first 16-bit word

//unrolled 16 times

L1R         R9 = R5&(2^15) //leftmost bit

L2R         flag = cmpgt(R9,0)

L3R         if flag(R9==0) branch label1

L3.5R      R6 = R6 XOR R2

Label 1:

L4R         R7 = R7 << 1

L5R         R9 = R6&1

L6R         R7 = R7 XOR R9

L7R         R5 = R5 << 1

L8R         R6 = R6 >> 1 //update shift register

 

L2            *R10++ = R7 //output word

 

//Note inherent assumption that length > 2

L3            R3 = R3 - 1

L4            flag = cmp(R3,0)

L5            if flag (R3!=0), branch to loop1

 

7              (instruction to branch back)

 

12.2     Raw Operations Equation

Class | Equation
Raw | 7 + 8.5*M*16 + M*5
Memory | 2*M
Multiplication | 0
ALU | 7 + 8.5*M*16 + 3*M

12.3      Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L3,4 eliminated | (ALU) -2*M
BPOS | L4, L2R eliminated | (ALU) -M - 16*M
ZOL | L3-5 eliminated + setup | (ALU) 1 - 3*M
COND_EXEC | L3.5R eliminated | (ALU) -8*M
EXTRACT | L5R, L7R eliminated | (ALU) -2*16*M
MEM2 | MEM cut in half | (MEM) -(2*M)/2
MEM4 | MEM cut to a fourth | (MEM) -3*(2*M)/4
NOREG | MEM eliminated | (MEM) -(2*M)

12.4     Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L3,4 added back | (ALU) target + 2*M
BPOS, BDEC | L4 added back | (ALU) target + M
BPOS, ZOL | L4 added back | (ALU) target + M
BPOS, BDEC, ZOL | L4 eliminated (again) | (ALU) target - M
MEM2, NOREG | MEM2 effect undone | (MEM) target + (2*M)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(2*M)/4

13       Convolutional Encoder

This component represents the implementation of a convolutional encoder of a message of M 16-bit words with constraint length K (K<=16) and rate r (of the form 1/r; puncturing can occur in a separate process). The encoding is assumed to be hardcoded to minimize cycles. Note: we're encoding 16 bits per loop.
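As a reference for what the encoder computes, here is a bit-serial Python sketch of a rate 1/r convolutional encoder. It is an illustration under stated assumptions rather than the mapped implementation, which processes 16 bits per loop with word-wide shifts and XORs: the function name `conv_encode`, the representation of each generator as a K-bit tap mask, and the K-1 flushing zeros are all choices made for this example.

```python
def conv_encode(bits, generators, k):
    """Rate 1/r convolutional encoder.

    bits: list of 0/1 message bits.
    generators: one K-bit tap mask per output stream (r masks in total).
    k: constraint length.
    """
    state = 0
    out = []
    for bit in bits + [0] * (k - 1):                      # flush with K-1 zeros
        state = ((state << 1) | bit) & ((1 << k) - 1)     # shift the new bit in
        for g in generators:
            # output bit is the XOR (parity) of the tapped register bits
            out.append(bin(state & g).count("1") & 1)
    return out
```

For K = 3 with generators 111 and 101, the message 1011 encodes to the bit pairs 11 10 00 01 01 11, where the last two pairs come from the flushing zeros.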

13.1     Pseudocode

Parameters:      M (# of 16-bit words in message – round up)

                        K (constraint length)

                        taps (total # of XOR taps for encoding polynomials)

                        r (rate)

                   

conv_encoder(message, length, g0, …, gr)

 

//Move input parameters to local registers

1              R1 = message (address)

2              R3 = length  (#16 bit words)

 

//Note inherent assumption that length > 2

3              R4 = 0 //buffer is initially empty

 

//             set pointers to output registers (r lines)

4              Rx1 = g0

5              Rx2 = g1

 

L1            loop1:     R5 = *R1++ // storing word to be shifted in

//repeat K times (here we'll assume K = 3)

L21          R6 = R4 >> 1

L22          R2 = R5 & 1 //these three cycles are easily collapsed to one with good bit control

L23          R2 = R2 << 15

L24          R6 = R6 XOR R2

 

L31          R7 = R4 >> 2

L32          R2 = R5 & 3 //these three cycles are easily collapsed to one with good bit control

L33          R2 = R2 << 14

L34          R7 = R7 XOR R2

 

L41          R8 = R4 >> 3

L42          R2 = R5 & 7 //these three cycles are easily collapsed to one with good bit control

L43          R2 = R2 << 13

L44          R8 = R8 XOR R2

 

//repeat r times

//repeat taps -1 times

L51          R9= R8 XOR R6 // g0 = 1+x+x^2

L52          R9 = R9 XOR R5

L53          *g0++ = R9 //store 16-bit result in output word

// note total # operations will be equal to taps

 

L6            R4 = R5

//loop control

L7            R3 = R3 - 1 //decrement count

L8            flag = Compare R3,0

L9            If flag, branch loop1

 

//Epilog

6              R5 = 0

(effectively repeat the loop except for L1, L6-L9)

 

7              (instruction to branch back)

13.2     Raw Operations Equation

Class | Equation
Raw | 5 + r + (M-1)*(5+4*K+r*taps) + (4*K+r*taps)
Memory | r*(M) + M-1
Multiplication | 0
ALU | 5 + r + M*(K*4+(taps-1)*r) + (M-1)*4

 

13.3     Impact of Specialized Instructions

Instruction | Impact | Modifier Equation
BDEC | L7, L8 eliminated | (ALU) -2*(M-1)
BPOS | L8 eliminated | (ALU) -(M-1)
ZOL | L7-9 eliminated + setup | (ALU) -3*(M-1)+1
EXTRACT | L22-24 et al. collapsed to single cycle | (ALU) -2*M*K
MEM2 | MEM operations cut in half | (MEM) -(r*(M) + M-1)/2
MEM4 | MEM operations cut to a quarter | (MEM) -3*(r*(M) + M-1)/4
NOREG | MEM operations eliminated | (MEM) -(r*(M) + M-1)

13.4     Synergistic Modifiers

Modifiers | Impact | Modifier Equation
BDEC, ZOL | L7, L8 added back | (ALU) target + 2*(M-1)
BPOS, BDEC | L8 added back | (ALU) target + (M-1)
BPOS, ZOL | L8 added back | (ALU) target + (M-1)
BPOS, BDEC, ZOL | L8 eliminated (again) | (ALU) target - (M-1)
MEM2, NOREG | MEM2 effect undone | (MEM) target + (r*(M) + M-1)/2
MEM4, NOREG | MEM4 effect undone | (MEM) target + 3*(r*(M) + M-1)/4

 

14       Viterbi Decoder (rate 1/r, hard decisions, traceback = 32)

This component represents the implementation of a Viterbi decoder with hard decisions for a rate 1/r code without puncturing. The transition matrix (which states go to which states) and the output word matrix (what words would be output by those transitions) should be predefined and passed in, and all memory allocations performed externally. For simplicity, the input receive vector is assumed to be grouped into words of r bits without packing. Note that traceback is dramatically simplified by assuming that the traceback length is less than the register width. Finally, note that this should not be used with codes with K>5 (there is a rule of thumb that traceback path lengths should be greater than 5.8xK for good SNR).

14.1     Pseudocode

Parameters: num_states

                    N (length > 32)

r

 

A Viterbi decoder is a very involved piece of software. To improve readability, the following conventions were changed. First, variable names were used instead of generic register names. Second, portions of the code were separated out into distinct subsections. None of these subsections are intended to be treated as independent functions/procedures; they are instead intended to be substituted in where indicated.

 

=======================

viterbi_decoder(rcv_vect, length, output_vect, transition_matrix, input_output_matrix)

 

//initialize metrics and paths

VD1        v = base_address

VD2        *v++ =  0 //0 node is initial node

VD3        reg_bank = base_address

VD4        *reg_bank++ = 0 // initial bit is 0

VD5        rcv_reg = rcv_vect

//VD4      output_reg = output_vect // assigned later when needed

VD6        transition_reg = transition_matrix

VD7        output_matrix = input_output_matrix

 

//***************

// INITIALIZATION LOOP

VD8        tmp = num_states-1

LABEL: INIT_LOOP

INL1       *v++ = MAX_NEG //hardcoded to largest negative

INL2       *reg_bank++ = 0

 

INL3       tmp = tmp-1

INL4       flag = cmpeq(tmp,0)

INL5       if !flag, branch INIT_LOOP

 

// END INITIALIZATION LOOP

//************

 

//****************

// MAIN LOOP

VD9        main_ctr = length

LABEL: MAIN_LOOP

MNL1     v = 0 address //repoint to beginning

MNL2     reg_bank = 0 address // repoint to beginning

 

(Branch Metric Unit) //advances rcv_sample pointer, assigns values to output_metric array

 

(Path Metric Unit)

 

//COPY LOOP

//Copy results from Path Metric Unit

MNL3     state_cnt = num_states

 

LABEL:  COPY_LOOP

CPY L1   tmp_V = *temp_V++

CPY L2   tmp_reg = *temp_reg_bank++              

               

CPY L3   *V++ = tmp_V

CPY L4   *reg_bank++ = tmp_reg

 

CPY L5   state_cnt = state_cnt – 1

CPY L6   flag = cmpeq(state_cnt,0)

CPY L7   if !flag, branch COPY_LOOP

//END COPY LOOP

 

MNL4     Flag = cmplt(main_ctr,31) // main_ctr > 31 (full register) ??

MNL5     If Flag, branch MNL6

 

(Traceback Unit)

               

MNL6     main_ctr = main_ctr -1

MNL7     flag = cmpeq(main_ctr,0)

MNL8     if !flag, branch MAIN_LOOP

// END MAIN LOOP

//****************

 

//****************

// FLUSH REGISTERS

 

FR1         FR_cnt = 31 //one less than reg width

FR2         temp_reg = reg_bank + ind

FR3         temp_val = *temp_reg

 

LABEL: FR_LOOP

FRL1       temp_val = temp_val << 1

FRL2       temp_val2 = temp_val

FRL3       temp_val2 = temp_val2 >> 31

FRL4       temp = temp_val2 & 1

FRL5       *output_vect++ = temp

 

FRL6       FR_cnt = FR_cnt - 1

FRL7       flag = cmpeq(FR_cnt, 0)

FRL8       if !flag, branch FR_LOOP

 

VD10      (Instruction to branch back)

 

Hard_metric(rcv_samp, ideal)

 

HM1       Tmp = xor(rcv_word,ideal)

HM2       Acc = 0

//repeat #bits per word (except once)

W1          Acc = Acc + (Tmp & 1)

W2          Tmp = Tmp >> 1

 

===============================

BRANCH METRIC UNIT (effectively)

BMU1                    state_cnt = num_states //hard coded

BMU2                    Rcv_word = *rcv_samp++

BMU3                    output_reg = input_output_matrix

//OUTER LOOP (STATES)

LABEL: OUTER_LOOP

//INNER LOOP (UNROLL twice, no actual loop control)

//2x following

BMU IL1                                ideal = *output_reg++

 

(Hard_metric)

 

BMU IL2                                *output_metric++ = Acc

 

BMU OL1              state_cnt = state_cnt - 1

BMU OL2              flag = cmpeq(state_cnt,0)

BMU OL3              if !flag, branch OUTER_LOOP

 

====================

 

Add-Compare-Select

 

ACS1      Temp0 = V1 + M1 //metric for path with new metric V1 and old M1

ACS2      Temp1 = V2 + M2//metric for path with new metric V2 and old M2

 

ACS3      Flag = cmp(temp0,temp1)

ACS4      If flag branch ACS6 //can't quite do this with just a MAX

ACS4.1   Rtn_v = temp0

ACS4.2   Rtn_index = 0

ACS5      branch ACS7

ACS6      Rtn_v = temp1

ACS6.1   Rtn_index = 1

 

ACS7      //really first thing out of ACS – not counted as an actual instruction
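The hard-decision metric and the add-compare-select step can be sketched in Python as below. This is only an illustration: it assumes that larger path metrics are better (consistent with initializing unused states to the largest negative value and with the later search for a maximum), it leaves the sign convention of the branch metric to the caller, and the function names and argument order are hypothetical.

```python
def hard_metric(rcv_word, ideal, bits_per_word):
    """Count the bit positions where the received word differs from the ideal word."""
    tmp = rcv_word ^ ideal
    acc = 0
    for _ in range(bits_per_word):
        acc += tmp & 1     # add the lowest differing bit
        tmp >>= 1          # move to the next bit
    return acc


def add_compare_select(v1, m1, v2, m2):
    """Return (surviving metric, index of the surviving path).

    v1, v2 are the path metrics of the two lead-in nodes and m1, m2 the
    branch metrics of the transitions from them. Index 0 selects the
    first path and index 1 the second; ties go to the first path.
    """
    temp0 = v1 + m1
    temp1 = v2 + m2
    if temp0 >= temp1:
        return temp0, 0
    return temp1, 1
```

Returning the index alongside the metric is what a plain MAX instruction cannot do, which is why the compare and branch are kept explicit in the pseudocode.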

 

===============================

PATH METRIC UNIT (effectively)

PMU1                     PMU_cnt = num_states //hard coded

PMU2                     offset = 0

 

LABEL: PMU_LOOP

 

//GET INDICES FOR METRICS (WHAT NODE BRANCHES INTO WHAT NODE?)

PMU L1                                 ind1 = *temp_transition_matrix++

PMU L2                                 ind2 = *temp_transition_matrix--

 

//GET METRICS FOR LEAD IN NODES

PMU L3                 V_temp = V + ind1 //INDEX

PMU L4                 V1 = *V_temp

PMU L5                 V_temp = V + ind2 //INDEX

PMU L6                 V2 = *V_temp                       

 

PMU L8                 M_temp = output_metrics + ind1

PMU L9                 M_temp = output_metrics + offset //INDEX

PMU L10               M1 = *M_temp

 

PMU L11               M_temp = output_metrics + ind2

PMU L12               M_temp = output_metrics + offset //INDEX

PMU L13               M2 = *M_temp

 

(ACS CALCULATION)

 

//PMU L14             *temp_v++ =Rtn_v //actually done in ACS. commented here for clarity

PMU L14               temp_transition_matrix = temp_transition_matrix + Rtn_index

PMU L15               ind2 = *temp_transition_matrix

PMU L16               reg_bank_reg = reg_bank + ind2 // INDEX

PMU L17               reg_bank_val = *reg_bank_reg

PMU L18               reg_bank_val = reg_bank_val << 1 //(zero fill)

PMU L19               reg_bank_val = reg_bank_val OR offset //eliminated with VSL instruction

PMU L20               *temp_reg_bank++ =reg_bank_val

 

PMU L21               offset = offset XOR 1 // note, this is very much not an add as it’s supposed to toggle the value

 

PMU L22               PMU_cnt = PMU_cnt - 1

PMU L23               flag = cmpeq(PMU_cnt, 0)

PMU L24               if !flag, branch PMU_LOOP

============================

 

===============================

TRACEBACK UNIT (effectively)

 

(FIND_MAX)

TBU1      temp_reg = reg_bank + ind

TBU2      temp_val = *temp_reg

TBU3      temp_val2 = temp_val >> 31 //extract left most bit

TBU4      temp_val = temp_val2 & 1

TBU6      *output_vect++ = temp_val

====================

 

===================

find_max(int *V_vect)

 

MAX1    max_cnt = num_states

MAX2    max = 0

MAX3    ind = 0

 

MAX4    V_vect = V //address

 

LABEL: MAX_LOOP

MAXL1  tmp = *V_vect++

 

MAXL2  flag = cmpgt(tmp,max)

MAXL3  if !flag, branch MAXL6

MAXL4  ind = num_states – max_cnt

MAXL5  max = tmp

 

MAXL6  max_cnt = max_cnt – 1

MAXL7  flag = cmpeq(max_cnt, 0)

MAXL8  if !flag, branch MAX_LOOP
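
For reference, the find_max loop above corresponds to the following Python sketch. It is only an illustration of the control flow; the running maximum starts at 0 as in MAX2, so the sketch assumes non-negative path metrics.

```python
def find_max(v_vect):
    """Return (index, value) of the first largest entry, mirroring MAX1-MAXL8."""
    num_states = len(v_vect)
    max_cnt = num_states               # MAX1: loop counter
    best = 0                           # MAX2: running maximum
    ind = 0                            # MAX3: index of the running maximum
    for tmp in v_vect:                 # MAXL1: fetch next metric
        if tmp > best:                 # MAXL2/MAXL3: compare and skip if not larger
            ind = num_states - max_cnt # MAXL4: recover the index from the counter
            best = tmp                 # MAXL5: update the maximum
        max_cnt -= 1                   # MAXL6: decrement the counter
    return ind, best
```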

14.2     Raw Operations

To make these estimations readable, the following breaks down the Viterbi mapping operations by module along with the number of times that module is called.

 

14.2.1 Find Max

Class

Equation

# Called

N-31

Raw

4 + 8*num_states

Memory

num_states

Multiplication

0

ALU

4 + 7*num_states

14.2.2 Traceback Unit

Class

Equation

# Called

N - 31

Raw

6

Memory

2

Multiplication

0

ALU

4

14.2.3 Add Compare Select

Class

Equation

# Called

N * num_states

Raw

4+3*0.5+2*0.5 = 6.5

Memory

0

Multiplication

0

ALU

6.5

14.2.4 Path Metric Unit

Class

Equation

# Called

N

Raw

2+num_states*24

Memory

num_states*10

Multiplication

0

ALU

2+num_states*14

14.2.5 Hard metric

Class

Equation

# Called

N*num_states*2

Raw

2+rate*2 -1

Memory

0

Multiplication

0

ALU

2+rate*2 -1

 

14.2.6 Branch Metric Unit

Class

Equation

# Called

N

Raw

3+num_states*(4+3)

Memory

1 +num_states*4

Multiplication

0

ALU

2+num_states*3

14.2.7 Main Decoder

Class

Equation

# Called

1

Raw

13 +(num_states-1)*5 + N*(8 + num_states*7) + 3 + 8*31

Memory

3+ (num_states-1)*2 + N*(2 + num_states*4) + 0 + 1*31

Multiplication

0

ALU

10 + (num_states-1)*3 + N*(6 + num_states*3) + 31*7

14.2.8 Integrated Equations

Class

Equation

Raw

[13 +(num_states-1)*5 + N*(8 + num_states*7) + 3 + 8*31] + N*[3+num_states*(4+3)] + N*num_states*2*[2+rate*2 -1] + N*[2+num_states*24] + N*num_states*6.5 + 6*(N-31) + (N-31)*(4+8*num_states)

Memory

3+ (num_states-1)*2 + N*(2 + num_states*4) + 1*31 + N*[1 +num_states*4] + N*num_states*2*0 + N * num_states*10 + 0 + (N-31)*2 + (N-31)*num_states

Multiplication

0

ALU

10 + (num_states-1)*3 + N*(6 + num_states*3) + 31*7 + N*(2+num_states*3) + N*num_states*2*(2+rate*2-1) + N*(2+num_states*14) + N*num_states*6.5 + (N-31)*4 + (N-31)*( 4 + 7*num_states)
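
The integrated Raw equation is the sum, over modules, of the number of calls times the raw count per call. The Python sketch below makes that bookkeeping explicit. The function name and dictionary keys are assumptions for illustration, and the sketch assumes N > 31 so that the traceback and find-max call counts (N-31) are positive.

```python
def viterbi_raw_ops(N, num_states, rate):
    """Sum calls * raw-per-call over the Viterbi modules of Section 14.2."""
    S = num_states
    modules = {
        # name: (number of calls, raw operations per call)
        "main_decoder": (1, 13 + (S - 1) * 5 + N * (8 + S * 7) + 3 + 8 * 31),
        "branch_metric_unit": (N, 3 + S * (4 + 3)),
        "hard_metric": (N * S * 2, 2 + rate * 2 - 1),
        "path_metric_unit": (N, 2 + S * 24),
        "add_compare_select": (N * S, 6.5),
        "traceback_unit": (N - 31, 6),
        "find_max": (N - 31, 4 + 8 * S),
    }
    return sum(calls * per_call for calls, per_call in modules.values())
```

With N = 100, num_states = 4, and rate = 2, the per-module contributions are 3879, 3100, 4000, 9800, 2600, 414, and 2484, for a total of 26277.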

14.3     Impact of Specialized Instructions and Synergistic Modifiers

The following tables describe the effects of the specialized instructions, except for MEM2, MEM4, and NOREG, which are handled in the integrated equations.

14.3.1 Find Max (N-31)

Instruction

Impact

Modifier Equation

BDEC

MAXL6,7 eliminated

(ALU)  -(2*num_states)*(N-31)

BPOS

MAXL7 eliminated

(ALU) -num_states*(N-31)

ZOL

MAXL6-8 eliminated

(ALU) -3*num_states*(N-31)

 

Modifiers

Impact

Modifier Equation

BDEC, ZOL

MAXL6,7 added back

(ALU)  target +(2*num_states)*(N-31)

BPOS, BDEC

MAXL7 added back

(ALU)  target+ num_states*(N-31)

BPOS, ZOL

MAXL7 added back

(ALU)  target+ num_states*(N-31)

BPOS, BDEC, ZOL

MAXL7 eliminated (again)

(ALU)  target - num_states*(N-31)

14.3.2 Traceback Unit (N-31)

Instruction

Impact

Modifier Equation

INDEX

TBU1 eliminated

(ALU)  -(N-31)

EXTRACT

TBU3 eliminated

(ALU) -(N-31)

 

No synergies.

14.3.3 Add Compare Select (N*num_states)

Subtractive Modifiers

Instruction

Impact

Modifier Equation

BPOS

PMUL23  eliminated

(ALU) -N*num_states

 

No synergies

14.3.4 Path Metric Unit (N)

Subtractive Modifiers

Instruction

Impact

Modifier Equation

BDEC

PMUL22,23 eliminated

(ALU)  -N*2*num_states

BPOS

PMUL23  eliminated

(ALU) - N*num_states

ZOL

PMUL22-24 eliminated

(ALU) +N*(1-3*num_states)

VSL

PMUL19 eliminated

(ALU) - N*num_states

INDEX

PMUL3,5,9,12,16

(ALU) -5*N*num_states

 

Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

PMUL22,23 added back

(ALU)  target +2*N*num_states

BPOS, BDEC

PMUL23  added back

(ALU)  target+ N*num_states

BPOS, ZOL

PMUL23 added back

(ALU)  target+ N*num_states

BPOS, BDEC, ZOL

PMUL23 eliminated (again)

(ALU)  target - N*num_states

 

 

14.3.5 Hard metric (N*num_states*2)

Subtractive Modifiers

Instruction

Impact

Modifier Equation

EXTRACT

W1 eliminated

(ALU)  -N*num_states*2*r

 

No synergies

14.3.6 Branch Metric Unit (N)

Subtractive Modifiers

Instruction

Impact

Modifier Equation

BDEC

BMUOL1,2 eliminated

(ALU)  -N*2*num_states

BPOS

BMUOL2  eliminated

(ALU) -N*num_states

ZOL

BMUOL1-3 eliminated

(ALU) +N*(1-3*num_states)

 

Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

BMUOL1,2 added back

(ALU)  target + N*2*num_states

BPOS, BDEC

BMUOL2   added back

(ALU)  target+ N*num_states

BPOS, ZOL

BMUOL2  added back

(ALU)  target+ N*num_states

BPOS, BDEC, ZOL

BMUOL2  eliminated (again)

(ALU)  target - N*num_states

14.3.7 Main Decoder

Subtractive

Instruction

Impact

Modifier Equation

BDEC

INL3,4, MNL6,7, CPYL5,6, FRL6,7 eliminated

(ALU)  -2*(num_states-1) -2*N - 2*N*num_states - 31*2

BPOS

INL4, MNL4,7, CPYL6, FRL7  eliminated

(ALU) - N*(num_states-1) - 2*N - N*num_states - 31

ZOL

IN3-5, MNL6-8, CPYL5-7, FRL6-8 eliminated

(ALU) +(1-3*(num_states-1)) + 1-3*N + N*(1-3*num_states) - 31*3

EXTRACT

FRL1-3 eliminated

(ALU) -3*31

INDEX

FR2 eliminated

(ALU) -1

 

Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

INL3,4, MNL6,7, CPYL5,6, FRL6,7 added back

(ALU)  target + 2*(num_states-1) +2*N + 2*N*num_states + 31*2

BPOS, BDEC

INL4, MNL7, CPYL6, FRL7  added back

(ALU)  target+ N*(num_states-1) + N + N*num_states + 31

BPOS, ZOL

INL4, MNL7, CPYL6, FRL7  added back

(ALU)  target+ N*(num_states-1) + N +N*num_states + 31

BPOS, BDEC, ZOL

INL4, MNL7, CPYL6, FRL7  eliminated again

(ALU)  target - N*(num_states-1) - N - N*num_states - 31

14.3.8 Integrated Equations

Class

Equation

Raw

13 +(num_states-1)*5 + N*(11 +2 + num_states*(7+7 +24+6.5+ 2*(2+r*2-1))) + 8*31 + (N-31)*(6+4+6*num_states)

Memory

7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states)

Multiplication

0

ALU

6 + (num_states-1)*3 + N*(8 + 2+6.5+num_states*(3+3+14+2*(2+r*2-1))) +31*(7) +(N-31)*(4 + 5+6*num_states)

 

Subtractive Equations

Note that the impact of these modifiers is described in the preceding subsections and is not repeated here to save space.

Instruction

Modifier Equation

BDEC

(ALU)  -(2*num_states)*max((N-31),1)-N*4*num_states -2*(num_states-1) -2*N*(num_states+1) -62

BPOS

(ALU) -num_states*max((N-31),1) - 4*N*num_states -N*max((num_states-1),1) - 2*N - 31

ZOL

(ALU) -3*num_states*max((N-31),1)-9*N*num_states -3*max((num_states-1),1) - 91

EXTRACT

(ALU) -max((N-31),1)-N*num_states*2*r-3*31

INDEX

(ALU) -1-5*N*num_states-max((N-31),1)

VSL

(ALU) -N*num_states

MEM2

(MEM) -(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/2

MEM4

(MEM) -3*(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/4

NOREG

(MEM)  -(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))

 

Synergistic Modifiers

Modifiers

Modifier Equation

BDEC, ZOL

(ALU)  target + 2*(N*max((num_states-1),1) + N + 31 +3*N*num_states +num_states*max((N-31),1))

BPOS, BDEC

(ALU)  target+ (N*max((num_states-1),1) + N + 31 +3*N*num_states +num_states*max((N-31),1))

BPOS, ZOL

(ALU)  target+ (N*max((num_states-1),1) + N + 31 +3*N*num_states +num_states*max((N-31),1))

BPOS, BDEC, ZOL

(ALU)  target- (N*max((num_states-1),1) + N + 31 +3*N*num_states +num_states*max((N-31),1))

MEM2, NOREG

(MEM)  target + (7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/2

MEM4, NOREG

(MEM)  target +3*(7 +(num_states-1)*2 + N*(3 + num_states*(4+4 + 10)) + 1*31+(N-31)*(2+1+num_states))/4

 

15       Polyphase Interpolator

This component represents the implementation of a block real polyphase interpolator with upsampling factor r and original filter length N, calculated for M input samples. N/r is assumed to be an integer. (If it is not, the designer might as well use more coefficients for less ripple, as it will take the same amount of time; alternatively, the coefficients can be padded.) Note that coef is the address of the coefficients for the original filter, not the polyphase subfilters. Subfiltering is taken care of automatically.

15.1     Pseudocode

Parameters: M (block length of input data)

                    N (filter length)

                    r (upconversion rate)

Requires: Circular buffering

 

y=polyphase_interpolate(coef, data, length, output_array)

 

//Move input parameters to local registers

1              (instruction to store previous setting in local register)

2              (instruction to turn on circ buff)

3              (instruction to set buffer length)

 

4              R11 = data

5              R3 = length (actual #)

6              R8 = output_array (address)

//*****************

// block loop

OL1         (outer loop) R2 = data (address) + R5 // block loop

OL2         R7 = r  //number of subfilters

OL3         R10 = coef (address) +  r //points r above h[0]

//*****************

// filter loop

ML1        (middle loop) R1 = R10 - R7 //set up pointer to base of appropriate subfilter

ML2        R2 = R11 //reset data pointer

ML3        acc = 0 //zero accumulator (typically done by subtracting a register from itself)

ML4        R9 = N/r  //length of subfilter, hard coded

//*****************

// subfilter loop

IL1          (inner loop) R4 = *R1 //get coefficient p(R10-R7)[k]

IL2          R1 = R1 + N/r - 1 // hard coded value, not two operations // eliminated by indexing

IL3          R5 = *R2++ // note: data steps at normal size

IL4          R6 = R5 * R4

IL5          acc = acc + R6

 

IL6          R9 = R9 - 1 //decrement count

IL7          flag = Compare R9,0

IL8          If flag, branch inner Loop

 

//******************

ML5        *R8++ = acc //store subfilter result

 

ML6        R7 = R7 - 1 //decrement count

ML7        flag = Compare R7,0

ML8        If flag, branch middle Loop

// end filter loop

//****************

OL4         R11++     //increment data pointer     

 

OL5         R3 = R3 - 1 //decrement count

OL6         flag = Compare R3,0

OL7         If flag, branch outer Loop

// end block loop

//****************

//             Restore stuff

7              (instruction to reset addressing mode)

8              (instruction to reset buffer length)

9              (instruction to branch back)
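
One way to sanity-check the loop structure above is a direct software model. The Python sketch below is an illustration under stated assumptions, not the mapped DSP code: it takes subfilter p to be the coefficients h[p], h[p+r], h[p+2r], ..., treats samples before the start of the block as zero instead of using a circular buffer, and checks the result against zero-stuffing followed by the full length-N filter. All function names are hypothetical.

```python
def polyphase_interpolate(coef, data, r):
    """Interpolate by r using r subfilters of length len(coef)/r."""
    assert len(coef) % r == 0, "N/r is assumed to be an integer"
    sub_len = len(coef) // r
    out = []
    for n in range(len(data)):            # block loop (M input samples)
        for p in range(r):                # filter loop (one pass per subfilter)
            acc = 0
            for k in range(sub_len):      # subfilter loop (N/r taps)
                if n - k >= 0:            # assumed zero history before the block
                    acc += coef[k * r + p] * data[n - k]
            out.append(acc)
    return out


def zero_stuff_reference(coef, data, r):
    """Reference: insert r-1 zeros after each sample, then run the full filter."""
    up = []
    for x in data:
        up.append(x)
        up.extend([0] * (r - 1))
    out = []
    for m in range(len(up)):
        acc = 0
        for j in range(len(coef)):
            if m - j >= 0:
                acc += coef[j] * up[m - j]
        out.append(acc)
    return out
```

Both functions produce r outputs per input sample, but the polyphase form performs N/r multiplies per output instead of N, which is where the (N/r)*r*M multiplication count comes from.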

15.2     Raw Operations

Class

Equation

Raw

9 + 8*(N/r)*r*M + 8*r*M + 7*M

Memory

2*(N/r)*r*M + 1*r*M + 3*M

Multiplication

(N/r)*r*M

ALU

9+ 5*(N/r)*r*M + 7*r*M + 4*M

 

15.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

IL6,7, ML6,7, OL5,6 eliminated

(ALU)  -2*(N*M +r*M +M)

BPOS

IL7, ML7, OL6 eliminated

(ALU) -(N*M +r*M +M)

ZOL

IL6-8, ML6-8, OL5-7 eliminated

(ALU) -3*(N*M +r*M +M)

MAC

IL5 eliminated

(ALU) -N*M

INDEX

IL2 eliminated

(ALU) -N*M

MEM2

MEM operations cut in half

(MEM) -(2*(N/r)*r*M + 1*r*M + 3*M)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(2*(N/r)*r*M + 1*r*M + 3*M)/4

NOREG

MEM operations eliminated

(MEM) - (2*(N/r)*r*M + 1*r*M + 3*M)

15.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

IL6,7, ML6,7, OL5,6 added back

(ALU)  target +2*(N*M +r*M +M)

BPOS, BDEC

IL7, ML7, OL6 added back

(ALU)  target+(N*M +r*M +M)

BPOS, ZOL

IL7, ML7, OL6 added back

(ALU)  target+(N*M +r*M +M)

BPOS, BDEC, ZOL

IL7, ML7, OL6 eliminated (again)

(ALU)  target - (N*M +r*M +M)

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (2*(N/r)*r*M + 1*r*M + 3*M)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(2*(N/r)*r*M + 1*r*M + 3*M)/4

16       AM Modulator

This component represents the implementation of a block AM modulator for DSB-SC AM with a block of length N. The module takes as inputs a pointer to a sine table, an increment factor (to define a carrier frequency), a pointer to a message data block, and an output buffer.

16.1     Pseudocode

Parameters: N (block length of input data)

Requires: Circular buffering

 

y=AM_modulate(message, sine, increment, offset, output_array, length)

 

//Move input parameters to local registers

1              (instruction to store previous setting in local register)

2              (instruction to turn on circ buff) // for sine buffer

3              (instruction to set buffer length) // presumably known

 

4              R1 = message

5              R2 = sine

6              R3 = offset

7              R2 = R2 + R3

8              R3 = increment

9              R4 = length (actual #) //also loop counter

10            R5 = output_array

 

//*****************

// loop

L1            R6 = *R2 (sine sample)

L2            R2 = R2 + R3

L3            R7 = *R1++ //fetch message

L4            R8 = R6*R7 //modulate

L5            *R5++ = R8 // write to output

 

L6            R4 = R4 - 1 //decrement count

L7            flag = Compare R4,0

L8            If flag, branch Loop

// end block loop

//****************

//             Restore stuff

11            (instruction to reset addressing mode)

12            (instruction to reset buffer length)

13            (instruction to branch back)
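
The loop above can be modeled in a few lines of Python. This is a sketch for illustration only: the modulo operation stands in for the circular-buffer addressing of the sine table, the offset and increment are assumed to be given in table entries, and the function name is hypothetical.

```python
def am_modulate(message, sine_table, increment, offset=0):
    """DSB-SC AM: multiply each message sample by the next carrier sample."""
    index = offset % len(sine_table)                  # steps 6-7: start position in the table
    output = []
    for sample in message:                            # one pass per input sample
        carrier = sine_table[index]                   # L1: fetch sine sample
        index = (index + increment) % len(sine_table) # L2: step with wrap-around
        output.append(carrier * sample)               # L3-L5: fetch, modulate, store
    return output
```

A larger increment steps through the table faster and therefore produces a higher carrier frequency for the same table.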

 

16.2     Raw Operations

Class

Equation

Raw

13 + 8*N

Memory

3*N

Multiplication

N

ALU

13 + 4*N
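
A quick consistency check on the table above: each counted instruction appears to fall into exactly one of the Memory, Multiplication, or ALU classes, so the three class equations should sum to the Raw equation. The Python sketch below evaluates the four equations for a given block length. The function name and dictionary keys are assumptions for illustration.

```python
def am_modulator_counts(n):
    """Evaluate the Section 16.2 equations for a block of n samples."""
    return {
        "raw": 13 + 8 * n,
        "memory": 3 * n,
        "multiplication": n,
        "alu": 13 + 4 * n,
    }


counts = am_modulator_counts(100)
# The class counts should add up to the raw count.
class_total = counts["memory"] + counts["multiplication"] + counts["alu"]
```

For n = 100 this gives 300 memory, 100 multiplication, and 413 ALU operations, which together match the raw count of 813.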

16.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

L6,7 eliminated

(ALU)  -2*N

BPOS

L7 eliminated

(ALU) -N

ZOL

L6-8 eliminated

(ALU) 1-3*N

MAC

L5 eliminated

(ALU) -N

INDEX

L2 eliminated

(ALU) -N

MEM2

MEM operations cut in half

(MEM) -(3*N)/2

MEM4

MEM operations cut to quarter

(MEM) -3*(3*N)/4

NOREG

MEM operations eliminated

(MEM) -3*N

16.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

L6,7 added back

(ALU)  target + 2*N

BPOS, BDEC

L7 added back

(ALU)  target + N

BPOS, ZOL

L7 added back

(ALU)  target + N

BPOS, BDEC, ZOL

L7 eliminated (again)

(ALU)  target - N

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (3*N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(3*N)/4
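
The synergistic entries act as corrections when several loop-control instructions are available at once. The sketch below shows one reading of how they combine for the AM modulator ALU count: every subtractive modifier whose instruction is available is applied, and then every synergistic entry whose full instruction set is available is applied on top. That combination rule and the function name are assumptions made for illustration, and only the BDEC, BPOS, and ZOL rows are modeled.

```python
def am_modulator_alu(n, available):
    """ALU count for a block of n samples given a set of available instructions."""
    available = set(available)
    base = 13 + 4 * n
    subtractive = {"BDEC": -2 * n, "BPOS": -n, "ZOL": 1 - 3 * n}
    synergistic = {
        frozenset({"BDEC", "ZOL"}): 2 * n,        # L6,7 added back
        frozenset({"BPOS", "BDEC"}): n,           # L7 added back
        frozenset({"BPOS", "ZOL"}): n,            # L7 added back
        frozenset({"BPOS", "BDEC", "ZOL"}): -n,   # L7 eliminated (again)
    }
    total = base
    total += sum(v for name, v in subtractive.items() if name in available)
    total += sum(v for combo, v in synergistic.items() if combo <= available)
    return total
```

Under this reading, any instruction set that includes ZOL gives the same ALU count as ZOL alone, and BDEC together with BPOS gives the same count as BDEC alone, so no instruction is removed twice.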

 

17       AM Demodulator

This component represents the implementation of a simple Costas-loop PLL for DSB-SC AM. The module takes as inputs a pointer to a sin/cos function, a phase accumulator, I and Q branch accumulators, an output array, and a message data block. All low-pass filters are implemented as integrators. Note that a slightly different structure is needed for samples generated from a complex ADC.

17.1     Pseudocode

Parameters: N (block length of input data)

            Ovsp_factor (ratio of carrier to sampling rate)

            C (cycles to call sin function)

                 

y=AM_demodulate(message, sin_function, phase_acc, phase_inc, I_acc, Q_acc, output_array, length)

 

//Move input parameters to local registers

1              R1 = message

2              R2 = sin_function // function to call

//             R3 = sin_val

//             R4 = cos_val

3              R5 = length (actual #) //also loop counter

4              R6 = output

5              R7 = phase_acc

6              R8 = I_acc

7              R9 = Q_acc

8              R14 = phase_inc

               

//*****************

// loop

L1            R6 = *R1++

//             Generate

L2            (write R7 to appropriate function register)

LC           call sin_function // puts sin_val + cos_val into R3, and R4 (branch R2)

 

L3            R10 = R4*R1 //      Generate I branch

L4            R8 = R8 + R10 // LPF acc

 

L5            R11 = R3*R1 //       Generate Q branch

L6            R9 = R9 + R11 // LPF acc

 

L7            *R6++ = R8 //output I branch LPF as message

 

L8            R12 = R8*R9 //Generate error term

L9            R7 = R7 + R12 // LPF error term acc

L10          R13 = R7 + R14 //phase_inc

 

L11          flag = cmp(R13, (value for 2pi))

L12          if flag (R13 < 2pi) branch L14

L13.25     R13 = R13 - (value for 2pi) //assume oversampling factor of 4 for labeling purposes

 

L14          R5 = R5 - 1 //decrement sample count

L15          flag = Compare R5,0

L16          If flag, branch Loop

// end block loop

//****************

//             Restore stuff

9              (instruction to store I_acc)

10            (instruction to store Q_acc)

11            (instruction to store phase)

12            (instruction to branch back)

17.2     Raw Operations

Class

Equation

Raw

11+ N*15 + N/ Ovsp_factor + C*N

Memory

2*N

Multiplication

3*N

ALU

11+ 10*N + N/Ovsp_factor

Function

N*C (doesn't get hit by VLIW or SIMD)

17.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

L14,15 eliminated

(ALU)  -2*N

BPOS

L15 eliminated (no good way to eliminate L12)

(ALU) -N

ZOL

L14-16 eliminated

(ALU) 1-3*N

MAC

L4,6,9 eliminated

(ALU) -3*N

COND_EXEC

L13.25 eliminated

(ALU) -N/Ovsp_factor

MEM2

MEM operations cut in half

(MEM) -(2*N)/2

MEM4

MEM operations cut to quarter

(MEM) -3*(2*N)/4

NOREG

MEM operations eliminated

(MEM) -(2*N)

17.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

L14,15 added back

(ALU)  target + 2*N

BPOS, BDEC

L15 added back

(ALU)  target + N

BPOS, ZOL

L15 added back

(ALU)  target + N

BPOS, BDEC, ZOL

L15 eliminated (again)

(ALU)  target - N

MEM2, NOREG

MEM2 effect undone

(MEM)  target +(2*N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target +3*(2*N)/4

 

 

18       FM Modulator

This component represents the implementation of a block FM modulator with a block of length N. The module takes as inputs a pointer to a sine table, an increment factor (to define a carrier frequency), a pointer to a message data block, and an output buffer. The frequency deviation constant is built in rather than being passed in (difference of a cycle). Note that in practical implementations, this would be coupled with a pre-emphasis filter (which adds gain at higher frequencies relative to lower frequencies). Also note that the increment here refers to the phase step for the carrier.

18.1     Pseudocode

Parameters:      N (block length of input data)

Ovsp_factor (ratio of carrier to sampling rate)

C: trig function time

 

y=FM_modulate(message, sin_function, message_acc, phase_inc, output_array, length)

 

//Move input parameters to local registers

1              R1 = message

2              R2 = sin_function

3              R3 = message_acc

4              R5 = phase_inc

5              R6 = length (actual #) //also loop counter

6              R7 = output_array

 

//*****************

// loop

L1            R8 = *R1++ //get message value

L2            R3 = R3 + R8 //accumulate message // in the current form this should be forced to have zero DC  // bias, otherwise some extra cycles will be needed to make the abs of this value < 2pi

L3            R4 = R3 * scale // multiply by frequency deviation constant

L4            R4 = R4 + phase_inc //not an accumulate (could be done as a MAC, but then accumulator would

               // have to be reset every time through)

 

L5            flag = Compare R4 < (value for 2 pi)

L6            If flag, branch L7

L6.25       R4 = R4 - (value for 2 pi)

 

//             Generate modulated

L7            (write R4 to appropriate function register)

LC           call sin_function // puts sin_val into R9

 

L8            *R7++ = R9 //write output

 

L9            R6 = R6 - 1 //decrement block count

L10          flag = Compare R6,0

L11          If flag, branch Loop

 

//****************

//             Restore stuff

7              (store message_acc)

8              (instruction to branch back)

18.2     Raw Operations

Class

Equation

Raw

8+ N*11 + N/ Ovsp_factor + C*N

Memory

2*N

Multiplication

N

ALU

8+ 9*N + N/Ovsp_factor

Function

N*C (doesn't get hit by VLIW or SIMD)

18.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

L9,10 eliminated

(ALU)  -2*N

BPOS

L10 eliminated (no good way to eliminate L6)

(ALU) -N

ZOL

L9-11 eliminated

(ALU) 1-3*N

COND_EXEC

L6.25 eliminated

(ALU) -N/Ovsp_factor

MEM2

MEM operations cut in half

(MEM) - (2*N)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(2*N)/4

NOREG

MEM operations eliminated

(MEM) - (2*N)

18.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

L9,10 added back

(ALU)  target + 2*N

BPOS, BDEC

L10 added back

(ALU)  target + N

BPOS, ZOL

L10 added back

(ALU)  target + N

BPOS, BDEC, ZOL

L10 eliminated (again)

(ALU)  target - N

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (2*N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target +3*(2*N)/4

 

 

19       FM Demodulator

This component represents the implementation of a block FM demodulator with a block of length N. The module takes as inputs a pointer to an arctangent function, I and Q samples, stored loop and phase accumulators, length, and the output array. Note that complex input samples could possibly be created via an external Hilbert transform or from a complex ADC.

19.1     Pseudocode

Parameters:      N (block length of input data)

                        Ovsp_factor (ratio of carrier to sampling rate)

C: trig function time     

FM_demodulate(sig_I, sig_Q, arctan_function, loop_acc, phase_acc, output_array, length, sin_function)

 

//Move input parameters to local registers

1              R1 = sig_I

2              R2 = sig_Q

3              R3 = length

4              R4 = output_array

5              R5 = phase_acc

6              R6 = loop_acc

 

//*****************

// loop

//generate appropriate real, imag from

L1            (push phase_acc to appropriate register for sin_cos call)

LC1         sin_cos_function // function to call, put in R7,R8

//complex multiplication

//             I

L2            R9 = R1 * R7

L3            R11 = R2*R8

L4            R9 = R9 - R11

//             Q

L5            R10 = R2 * R7

L6            R11 = R1 * R8

L7            R10 = R10 + R11  

 

L8            (push R9 to appropriate register for atan call)

L9            (push R10 to appropriate register for atan call)

LC2         atan_function (assume ends up in R7)                                                                       

 

L10          R8 = R7 * loop_constant //needed to tweak loop BW

L11          R6 = R6 + R8//accumulate loop

L12          R7 = R7 + R6 //actual output

L13          *R4++ = R7

 

//phase branch

L14          R8 = R7 * a different constant

L15          R5 = R5 + R8 //phase accumulate

L16          R5 = R5 + hard_code_carrier_step

 

L17          flag = Compare R5 < (value for 2 pi)

L18          If flag, branch L19

L18.25     R5 = R5 - (value for 2 pi)

 

L19          R3 = R3 - 1 //decrement block count

L20          flag = Compare R3,0

L21          If flag, branch Loop

// end block loop

//****************

//             Restore stuff

7              (instruction to store phase_acc)

8              (instruction to store loop_acc)

9              (instruction to branch back)

19.2     Raw Operations

Class

Equation

Raw

9+ N*(21+C1+C2) + N/ Ovsp_factor

Memory

4*N

Multiplication

6*N

ALU

9+ 11*N + N/Ovsp_factor

Function

N*(C1+C2) (doesn't get hit by VLIW or SIMD)

19.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

L19,20 eliminated

(ALU)  -2*N

BPOS

L20 eliminated (no good way to eliminate L18)

(ALU) -N

ZOL

L19-21 eliminated

(ALU) 1-3*N

MAC

L4,7,11,15 eliminated

(ALU) -4*N

COND_EXEC

L18.25 eliminated

(ALU) -N/Ovsp_factor

MEM2

MEM operations cut in half

(MEM) -(4*N)/2

MEM4

MEM operations cut to quarter

(MEM) -3*(4*N)/4

NOREG

MEM operations eliminated

(MEM) -(4*N)

19.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

L19,20 added back

(ALU)  target + 2*N

BPOS, BDEC

L20 added back

(ALU)  target + N

BPOS, ZOL

L20 added back

(ALU)  target + N

BPOS, BDEC, ZOL

L20 eliminated (again)

(ALU)  target - N

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (4*N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target +3*(4*N)/4

 

20       BPSK Modulator

This component represents the implementation of a block BPSK sine wave modulator. It modulates N 16-bit words with M samples per symbol. Sine values are generated by stepping through a sine table, and phase shifts are accomplished by stepping halfway through the sine table. To simplify the implementation, the sine wave buffer is set up for circular addressing.

20.1     Pseudocode

Parameters: N (number input words – 16-bit)

                    M (samples / symbol)

Requires: Circular buffering

 

BPSK_mod(word_ptr, sine_table, output_buffer, increment)

 

//Move input parameters to local registers

1              (instruction to store previous settings)

2              (instruction to store previous settings)

3              (instruction to turn on circ buff) (for sine_table)

4              (instruction to set buffer length)

 

5              R1 = word_ptr

6              R2 = sine_table

7              R3 = output_buffer

8              R4 = N

9              R5 = increment //(defines frequency)

10            R6 = 0 //used to store old bit

 

//*****************

// word loop

OL1         (outer loop) R7 = *R1++ //fetch word

OL2         R8= 16 // set bit counter

//*****************

// bit loop

ML1        R9 = R7 &1

ML2        R7 = R7 >> 1

ML3        R10 = CMPEQ (R9, R6)

ML4        if !R10 branch ML6

ML5        R2 = R2 + pi (equivalent in index)

ML6        R11 = M

ML7        R6 = R9

//*****************

// sine loop

IL1          R12 = *R2

IL2          R2 = R2 + R5 //indexed addressing

IL3          *R3++ = R12        

 

IL4          R11 = R11 - 1 //decrement sample count

IL5          flag = Compare R11,0

IL6          If flag, branch sine Loop

 

// END SINE LOOP

//*******************

ML8        R8 = R8 - 1 //decrement bit count

ML9        flag = Compare R8,0

ML10      If flag, branch bit Loop

// END BIT LOOP

//*******************

OL3         R4 = R4 - 1 //decrement word count

OL4         flag = Compare R4,0

OL5         If flag, branch word Loop

 

// END WORD LOOP

//*******************

//Restore stuff

11            (instruction to reset addressing mode)

12            (instruction to reset buffer settings)

13            (instruction to branch back)

20.2     Raw Operations

Class

Equation

Raw

13 + 6*M*16*N + 16*N*10 + 5*N

Memory

2*M*16*N+N

Multiplication

0

ALU

13+ 4*M*16*N + 16*N*10+4*N

20.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

OL3,4, ML8,9, IL4,5 eliminated

(ALU)  -2*(M*16*N + 16*N+N)

BPOS

OL4, ML9, IL5 eliminated

(ALU) -(M*16*N + 16*N+N)

ZOL

OL3-5, ML8-10, IL4-6 eliminated

(ALU) -3*(M*16*N + 16*N+N)

EXTRACT

ML2 eliminated

(ALU) -16*N

INDEX

IL2 eliminated

(ALU) -M*16*N

MEM2

MEM operations cut in half

(MEM) - (2*M*16*N+N)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(2*M*16*N+N)/4

NOREG

MEM operations eliminated

(MEM) - (2*M*16*N+N)

20.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

OL3,4, ML8,9, IL4,5 added back

(ALU)  target + 2*(M*16*N + 16*N+N)

BPOS, BDEC

OL4, ML9, IL5 added back

(ALU)  target + (M*16*N + 16*N+N)

BPOS, ZOL

OL4, ML9, IL5 added back

(ALU)  target + (M*16*N + 16*N+N)

BPOS, BDEC, ZOL

OL4, ML9, IL5 eliminated (again)

(ALU)  target - (M*16*N + 16*N+N)

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (2*M*16*N+N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(2*M*16*N+N)/4

 

21       BPSK Demodulator

This component represents the implementation of a block BPSK demodulator. Specifically, it maps signal levels (presumably from the output of a symbol-synchronization process) to bits and packs them into 16-bit words. Note that this block does not do either symbol or carrier synchronization. For high SNR environments, the AM demodulator can be used to implement a BPSK PLL.

21.1     Pseudocode

Parameters: N (number input samples – assumed to be divisible by 16)

 

BPSK_demod(input_data, output_buffer, N)

 

//Move input parameters to local registers

1              R1 = input_data

2              R2 = output_buffer

3              R3 = N

4              R4 = 16

5              R5 = 0

6              R6 = 2^15 (1000 0000 0000 0000)

 

//*****************

// sample

L1            (loop) R7 = *R1++ //fetch sample

L2            if R7<0 branch L3

L2.5         R5 = R5 XOR R6 //should happen half the time

 

L3            R6 = R6 >> 1 //used to set where inputs go in (0 fill)

L4            R4 = R4 - 1 //decrement bit counter

 

L5            flag = cmpgt (R4,0) //note that using conditional execution/moving of the following would actually

                // ADD cycles because the instructions would be fetched every time instead of just once

L6            if flag (R4 >0), branch L7

L6.1         *R2++ = R5 //write filled word happens once every 16 passes

L6.2         R4 = 16 // reset bit counter

L6.3         R6 = 2^15

 

L7            R3 = R3 - 1 //decrement sample count

L8            flag = cmpgt R3,0

L9            If flag (R3>0), branch Loop

 

7              (instruction to branch back)
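
The bit-packing behavior of the demodulator can be sketched in Python as shown below. The sketch is an illustration only and makes two assumptions that go beyond the pseudocode: a non-negative sample maps to a 1 bit placed at the current mask position (first sample in the most significant bit), and the working word is cleared after each 16-bit word is written. The function name is hypothetical.

```python
def bpsk_demod(samples):
    """Map sample signs to bits and pack them into 16-bit words, MSB first."""
    words = []
    word = 0
    mask = 1 << 15                 # position of the next bit (2^15 for the first)
    for sample in samples:
        if sample >= 0:            # L2/L2.5: set the bit for non-negative samples
            word ^= mask
        mask >>= 1                 # L3: move to the next bit position
        if mask == 0:              # 16 bits collected
            words.append(word)     # L6.1: write the filled word
            word = 0               # assumed reset of the working word
            mask = 1 << 15         # L6.3: reset the bit position
    return words
```

Sixteen alternating samples starting with a positive value pack to 0xAAAA, and sixteen positive samples pack to 0xFFFF.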

21.2     Raw Operations

Class

Equation

Raw

7 + N*(9+0.5+3/16)

Memory

N*(1+1/16)

Multiplication

0

ALU

7 + N*(8+0.5+2/16)
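
The fractional terms in these equations are expected counts per sample: the 0.5 is the XOR that runs for roughly half of the samples, and the 3/16 is the three-instruction word write that runs once every 16 samples. The Python sketch below counts executed instructions by walking the loop and compares the total with the Raw equation. It assumes the sign test skips only the XOR, and the function names are hypothetical.

```python
def count_bpsk_demod_instructions(samples):
    """Count executed instructions for the BPSK demodulator pseudocode."""
    count = 6                      # setup instructions 1-6
    bit_counter = 16
    for sample in samples:
        count += 2                 # L1 fetch, L2 sign test
        if sample >= 0:
            count += 1             # L2.5 XOR (about half the samples)
        count += 4                 # L3, L4, L5, L6
        bit_counter -= 1
        if bit_counter == 0:
            count += 3             # L6.1-L6.3, once per 16 samples
            bit_counter = 16
        count += 3                 # L7, L8, L9
    return count + 1               # instruction 7: branch back


def raw_equation(n):
    """Raw operations equation from Section 21.2."""
    return 7 + n * (9 + 0.5 + 3 / 16)
```

For 32 samples with alternating signs, both the walk and the equation give 317 operations. With a different mix of signs the walk changes while the equation stays at its expected value.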

21.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

L7,8 eliminated

(ALU)  -2*N

BPOS

L5,8 eliminated

(ALU) -2*N

ZOL

L7-9 eliminated

(ALU) 1-3*N

COND_EXEC

L2.5 eliminated

(ALU) -0.5*N

MEM2

MEM operations cut in half

(MEM) -(N*(1+1/16))/2

MEM4

MEM operations cut to quarter

(MEM) -3*(N*(1+1/16))/4

NOREG

MEM operations eliminated

(MEM) -(N*(1+1/16))

21.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

L7,8 added back

(ALU)  target + 2*N

BPOS, BDEC

L8 added back

(ALU)  target + N

BPOS, ZOL

L8 added back

(ALU)  target + N

BPOS, BDEC, ZOL

L8 eliminated (again)

(ALU)  target - N

MEM2, NOREG

MEM2 effect undone

(MEM)  target + (N*(1+1/16))/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(N*(1+1/16))/4

 

22       BFSK Modulator

This component represents the implementation of a block binary frequency shift keying (BFSK) modulator leveraging a sine lookup table with two different increment offsets (corresponding to two different frequencies). Note that this approach eliminates the need for any special relationship between symbol length and encoding frequencies because there are no abrupt phase transitions between symbols. The block length is N 16-bit words, with each FSK symbol spanning M samples. M is assumed to be hardcoded (though this is not too important).

22.1     Pseudocode

Parameters: N (# 16 bit words to encode)

                    M (symbol length)

Requires: Circular buffering (for wrap around in sine table)

BFSK_encode(word_ptr, sine_table, output_buffer, increment1, increment2, N)

 

//Move input parameters to local registers

1              (instruction to store previous settings)

2              (instruction to store previous settings)

3              (instruction to turn on circ buff) (for sine_table)

4              (instruction to set buffer length)

 

5              R1 = word_ptr

6              R2 = sine_table

7              R3 = output_buffer

8              R4 = N

9              R5 = increment1 //(defines frequency 1)

10            R14 = increment2 //(defines frequency 2)

 

//*****************

// word loop

OL1         (outer loop) R7 = *R1++ //fetch word

OL2         R8= 16 // set bit counter

//*****************

// bit loop

ML1        R9 = R7 &1

ML2        R7 = R7 >> 1

ML3        flag = CMPEQ (R9, 0)

ML4        if flag branch ML5b

ML5a      R6 = R5 // increment = increment1 // note one of ML5a/ML5b must execute

ML5.5     branch ML6

ML5b      R6 = R14 // increment = increment2

ML6        R11 = M // M need not divide evenly into the length of the sine table

//*****************

// sine loop

IL1          R12 = *R2

IL2          R2 = R2 + R6 //indexed addressing

IL3          *R3++ = R12        

 

IL4          R11 = R11 - 1 //decrement sample count

IL5          flag = Compare R11,0

IL6          If flag, branch sine Loop

 

// END SINE LOOP

//*******************

ML7        R8 = R8 - 1 //decrement bit count

ML8        flag = Compare R8,0

ML9        If flag, branch bit Loop

// END BIT LOOP

//*******************

OL3         R4 = R4 - 1 //decrement word count

OL4         flag = Compare R4,0

OL5         If flag, branch word Loop

 

// END WORD LOOP

//*******************

//Restore stuff

11            (instruction to reset addressing mode)

12            (instruction to reset buffer settings)

13            (instruction to branch back)
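The three nested loops above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the reference implementation: bits are taken least significant first (as ML1 and ML2 suggest), a 1 bit is assumed to select increment1 and a 0 bit increment2, and the circular buffer is modeled with a modulo on the table index. The 64-entry table and the increment values in the usage lines are hypothetical.

```python
import math


def bfsk_encode(words, sine_table, increment1, increment2, m):
    """Map each bit of each 16-bit word to m sine-table samples."""
    out = []
    size = len(sine_table)
    idx = 0  # table position; never reset, so the phase stays continuous
    for word in words:                      # word loop
        for _ in range(16):                 # bit loop
            bit = word & 1
            word >>= 1
            # Assumption: bit 1 -> increment1, bit 0 -> increment2.
            step = increment1 if bit else increment2
            for _ in range(m):              # sine loop
                out.append(sine_table[idx])
                idx = (idx + step) % size   # circular addressing
    return out


# Hypothetical 64-entry table, one full sine period.
table = [math.sin(2 * math.pi * k / 64) for k in range(64)]
samples = bfsk_encode([0x0001], table, increment1=4, increment2=8, m=16)
print(len(samples))  # one word -> 16 bits * 16 samples
```

Because `idx` is carried from one symbol into the next, a change of increment changes only the slope of the phase, not the phase itself.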

 

22.2     Raw Operations

Class

Equation

Raw

13+5*N+9.5*N*16 + 6*M*N*16

Memory

N+2*M*N*16

Multiplication

0

ALU

13+4*N+9.5*N*16+4*M*N*16
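The equations above follow the loop structure: 13 setup and teardown instructions, 5 instructions per word, 9.5 per bit, and 6 per output sample. A minimal sketch that evaluates them and checks that the classes sum to the raw total (the function name is hypothetical):

```python
def bfsk_mod_ops(n, m):
    """Evaluate the Section 22.2 equations for n words and m samples per symbol."""
    words = n
    bits = 16 * n
    samples = 16 * n * m
    raw = 13 + 5 * words + 9.5 * bits + 6 * samples
    mem = words + 2 * samples          # one word fetch, one table read and one write per sample
    mult = 0
    alu = 13 + 4 * words + 9.5 * bits + 4 * samples
    return raw, mem, mult, alu


raw, mem, mult, alu = bfsk_mod_ops(n=2, m=8)
assert raw == mem + mult + alu
print(raw, mem, mult, alu)
```

For small M the per-bit overhead of 9.5 operations is a noticeable share of the total; as M grows, the sine loop dominates.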

22.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

OL3,4,ML7,8, IL4,5 eliminated

(ALU)  -2*(N+N*16+M*N*16)

BPOS

OL4,ML3,8, IL5 eliminated

(ALU) -(N+2*N*16+M*N*16)

ZOL

OL3-5,ML7-9, IL4-6 eliminated

(ALU) -3*(N+N*16+M*N*16)

COND_EXEC

ML5.5 eliminated (that's the effect at least)

(ALU) -0.5*16*N

EXTRACT

ML2 eliminated

(ALU) -16*N

INDEX

IL2 eliminated

(ALU) -M*16*N

MEM2

MEM operations cut in half

(MEM) - (N+2*M*N*16)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(N+2*M*N*16)/4

NOREG

MEM operations eliminated

(MEM) - (N+2*M*N*16)

22.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

OL3,4,ML7,8, IL4,5 added back

(ALU)  target + 2*(M*16*N + 16*N + N)

BPOS, BDEC

OL4,ML8, IL5 added back

(ALU)  target + (M*16*N + 16*N + N)

BPOS, ZOL

OL4,ML8, IL5 added back

(ALU)  target + (M*16*N + 16*N + N)

BPOS, BDEC, ZOL

OL4,ML8, IL5 eliminated (again)

(ALU)  target - (M*16*N + 16*N + N)

MEM2, NOREG

MEM2 effect undone

(MEM)  target +(N+2*M*N*16)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(N+2*M*N*16)/4

 

 

23       BFSK Demodulator

This component implements a noncoherent block binary frequency shift keying (BFSK) demodulator. The received signal is passed through two filters, each tuned to one of the two frequencies, and the output of one filter is subtracted from the output of the other. The result would then need to be passed through a symbol-synchronization circuit to create actual output samples, which in turn should be passed through the BPSK demodulator defined previously to create bits and packed words. Note that because this is a block operation, we're assuming alignment occurs external to this procedure.

23.1     Pseudocode

Parameters: N (# samples)

                    Filt_length (filter length)

Requires: Circular buffering (for filtering)

 

BFSK_decode(signal, coef1, coef2, output_buffer,  N, filt_length)

 

//Move input parameters to local registers

1              (instruction to store previous settings)

2              (instruction to store previous settings)

3              (instruction to store previous settings)

4              (instruction to store previous settings)

5              (instruction to turn on circ buff) (for coef1)

6              (instruction to set buffer length)

7              (instruction to turn on circ buff) (for coef2)

8              (instruction to set buffer length)

//preceding eliminates need to reset R2, R3 pointers

 

9              R1 = signal

10            R2 = coef1

11            R3 = coef2

12            R4 = N

13            R10 = output_buffer

 

//OUTER LOOP

//Set up FILTERS

OL1         acc1 = 0

OL2         acc2 = 0

OL3         R9 = filt_length //hard coded


//Inner Loop (Filters)

IL1          R7 = *R1++ //data (R4 holds the sample count, so use R7)

IL2          R5 = *R2++ //filt1

IL3          R6 = R5 * R7

IL4          acc1 = acc1 + R6

IL5          R5 = *R3++ //filt2

IL6          R6 = R5 * R7

IL7          acc2 = acc2 + R6

 

IL8          R9 = R9 - 1 //decrement tap count

IL9          flag = Compare R9,0

IL10        If flag, branch inner Loop

//END FILTERS LOOP

OL4         R11 = acc1 - acc2

OL5         *R10++ = R11

OL5         R1 = R1 - (filt_length-1) //step back to the next input sample; scaled as need be for word size

 

OL6         R4 = R4 - 1 //decrement sample count

OL7         flag = Compare R4,0

OL8         If flag, branch Loop

 

// END OUTER LOOP

//*******************

//Restore stuff

14            (instruction to branch back)
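The filter-and-subtract structure above can be sketched in Python as follows. This is a minimal illustration, assuming each output sample is computed from a window of filt_length input samples starting at the current position and that the window then advances by one sample; the input must therefore hold at least N + filt_length - 1 samples. The coefficient values in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def bfsk_decode(signal, coef1, coef2, n):
    """Return n samples of (filter1 output - filter2 output)."""
    filt_length = len(coef1)
    if len(coef2) != filt_length:
        raise ValueError("both filters must have the same length")
    if len(signal) < n + filt_length - 1:
        raise ValueError("signal too short for n outputs")
    out = []
    for start in range(n):                 # outer loop: one output per pass
        acc1 = 0
        acc2 = 0
        for k in range(filt_length):       # inner loop: both filters share the data fetch
            x = signal[start + k]
            acc1 += coef1[k] * x
            acc2 += coef2[k] * x
        out.append(acc1 - acc2)
    return out


# Hypothetical two-tap filters on a short ramp.
print(bfsk_decode([1, 2, 3, 4], coef1=[1, 1], coef2=[1, -1], n=3))
```

Sharing the data fetch between the two filters is what keeps the inner loop at three memory operations per tap rather than four.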

23.2     Raw Operations

Class

Equation

Raw

14 + N*8 + N*filt_length*10

Memory

N +3*filt_length*N

Multiplication

2*filt_length*N

ALU

14 + N*7 + 5*N*filt_length

23.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

OL6,7,IL8,9 eliminated

(ALU)  -2*(N+N*filt_length)

BPOS

OL7, IL9 eliminated

(ALU) -(N+N*filt_length)

ZOL

OL6-8,IL8-10 eliminated

(ALU) -3*(N+N*filt_length)

MAC

IL4,7 eliminated

(ALU) -2*N*filt_length

MEM2

MEM operations cut in half

(MEM) - (N +3*filt_length*N)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(N +3*filt_length*N)/4

NOREG

MEM operations eliminated

(MEM) - (N +3*filt_length*N)

23.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

OL6,7, IL8,9 added back

(ALU)  target + 2*(N+N*filt_length)

BPOS, BDEC

OL7, IL9 added back

(ALU)  target + (N+N*filt_length)

BPOS, ZOL

OL7, IL9 added back

(ALU)  target + (N+N*filt_length)

BPOS, BDEC, ZOL

OL7, IL9 eliminated (again)

(ALU)  target - (N+N*filt_length)

MEM2, NOREG

MEM2 effect undone

(MEM)  target +(N +3*filt_length*N)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(N +3*filt_length*N)/4
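The memory rows can be worked through in the same way. The sketch below assumes that the MEM2, MEM4, and NOREG modifiers are summed onto the base memory count and that each synergistic row applies when both of its instructions are enabled, so that NOREG never drives the estimate below zero. The function name is hypothetical, and `target` is read as the running memory estimate.

```python
def bfsk_demod_mem(n, filt_length, enabled):
    """Memory-operation estimate for Section 23 under a set of enabled features."""
    base = n + 3 * filt_length * n
    single = {
        "MEM2": -base / 2,       # MEM operations cut in half
        "MEM4": -3 * base / 4,   # MEM operations cut to a quarter
        "NOREG": -base,          # MEM operations eliminated
    }
    synergy = {
        frozenset({"MEM2", "NOREG"}): base / 2,      # MEM2 effect undone
        frozenset({"MEM4", "NOREG"}): 3 * base / 4,  # MEM4 effect undone
    }
    enabled = set(enabled)
    total = base + sum(single[name] for name in enabled if name in single)
    total += sum(value for combo, value in synergy.items() if combo <= enabled)
    return total


print(bfsk_demod_mem(10, 8, {"MEM2"}))           # half of the base count
print(bfsk_demod_mem(10, 8, {"MEM2", "NOREG"}))  # NOREG dominates
```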

 

24       16-QAM Modulator

This component implements a 16-QAM modulator. It takes in 16-bit words and maps them to I and Q values. An additional routine would be needed to modulate these I and Q values onto a carrier.

24.1     Pseudocode

Parameters: N (# words)

 

Requires:

 

16_QAM_mod(signal, out_I, out_Q, LUT_real, LUT_imag, length)

 

//Move input parameters to local registers

1              R1 = signal

2              R2 = out_I

3              R3 = out_Q

4              R4 = LUT_real

5              R5 = LUT_imag

6              R10 = length

 

//OUTER LOOP

OL1         R6 = *R1++

OL2         R11 = 4

 

//INNER LOOP

IL1          R7 = R6  >> 2 // GET 2 bits

IL2          R7 = R7 & 3

IL3          R8 = R4 + R7 //LUT_REAL

IL4          R9 = *R8

IL5          *R2++ = R9 //WRITE OUTPUT

 

IL6          R7 = R6  >> 2 // GET 2 bits

IL7          R7 = R7 & 3

IL8          R8 = R5 + R7 //LUT_IMAG

IL9          R9 = *R8

IL10        *R3++ = R9 //WRITE OUTPUT

               

IL11        R11 = R11 - 1 //decrement symbol count

IL12        flag = Compare R11,0

IL13        If flag, branch Inner Loop

//END INNER LOOP

 

OL3         R10 = R10 - 1 //decrement word count

OL4         flag = Compare R10,0

OL5         If flag, branch Loop

 

// END WORD LOOP

//*******************

//Restore stuff

7              (instruction to branch back)
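A Python sketch of the word-to-symbol mapping, under assumptions that the pseudocode does not state explicitly: the word is consumed two bits at a time starting from the least significant bits, the first pair of each symbol indexes the I table and the second pair indexes the Q table, and each table has four entries. The level values in the usage lines are hypothetical.

```python
def qam16_mod(words, lut_real, lut_imag):
    """Map each 16-bit word to four (I, Q) symbols via two 4-entry lookup tables."""
    out_i = []
    out_q = []
    for word in words:              # outer loop: one word per pass
        for _ in range(4):          # inner loop: four symbols per word
            out_i.append(lut_real[word & 3])  # low 2 bits select the I level
            word >>= 2
            out_q.append(lut_imag[word & 3])  # next 2 bits select the Q level
            word >>= 2
    return out_i, out_q


# Hypothetical amplitude levels for both tables.
levels = [-3, -1, 1, 3]
out_i, out_q = qam16_mod([0x00E4], levels, levels)
print(out_i, out_q)
```

With the word 0x00E4 the first four bit pairs are 0, 1, 2, 3, so the first two symbols exercise every table entry.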

 

24.2     Raw Operations

Class

Equation

Raw

7 + N*5 + N*4*13

Memory

N +N*4*4

Multiplication

0

ALU

7 + N*4 + N*4*9

24.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

OL3,4,IL11,12 eliminated

(ALU)  -2*(N + 4*N)

BPOS

OL4, IL12 eliminated

(ALU) -(N + 4*N)

ZOL

OL3-5,IL11-13 eliminated

(ALU) -3*(N + 4*N)

EXTRACT

IL1,6 eliminated

(ALU) -2*4*N

INDEX

IL3,8 eliminated

(ALU) -2*4*N

MEM2

MEM operations cut in half

(MEM) - (N +N*4*4)/2

MEM4

MEM operations cut to quarter

(MEM) - 3*(N +N*4*4)/4

NOREG

MEM operations eliminated

(MEM) - (N +N*4*4)

24.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

OL3,4,IL11,12 added back

(ALU)  target + 2*(N + 4*N)

BPOS, BDEC

OL4, IL12 added back

(ALU)  target + (N + 4*N)

BPOS, ZOL

OL4, IL12 added back

(ALU)  target + (N + 4*N)

BPOS, BDEC, ZOL

OL4, IL12 re-eliminated

(ALU)  target - (N + 4*N)

MEM2, NOREG

MEM2 effect undone

(MEM)  target +(N +N*4*4)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(N +N*4*4)/4

 

 

25       16-QAM Demodulator

This component implements a 16-QAM demodulator. It takes in I/Q samples, maps them to bits, and packs the bits into 16-bit words. Note that separate processes would be needed for carrier recovery, channel equalization, and symbol synchronization.

25.1     Pseudocode

Parameters: N (# samples)

 

16_QAM_demod(signal_I, signal_Q, out, length)

 

//Move input parameters to local registers

1              R1 = signal_I

2              R2 = signal_Q

3              R3 = out

4              R4 = length

 

// OUTER LOOP (16 samples per pass)

OL1         R5 = 0

OL2         R11 = 4 // assuming 16-bit words

 

//INNER LOOP (4 samples per pass)

//I

IL1          R6 = *R1++

 

IL2          flag = compare R6 < 0

IL3          If flag branch IL4

IL3.5       R5 = R5 + 1

 

IL4          R5 = R5 << 1

IL5          R6 = abs(R6)

 

IL6          flag = compare R6 < thresh

IL7          If flag branch IL8

IL7.5       R5 = R5 + 1

IL8          R5 = R5 << 1

//Q

IL9          R6 = *R2++

 

IL10        flag = compare R6 < 0

IL11        If flag branch IL12

IL11.5     R5 = R5 + 1

 

IL12        R5 = R5 << 1

IL13        R6 = abs(R6)

 

IL14        flag = compare R6 < thresh

IL15        If flag branch IL16

IL15.5     R5 = R5 + 1

 

IL16        R5 = R5 << 1

 

IL17        R11 = R11 - 1 //decrement symbol count

IL18        flag = Compare R11,0

IL19        If flag, branch Inner Loop

// END INNER LOOP

OL3         *R3++ = R5 //store output word

               

OL4         R4 = R4 - 1 //decrement block count

OL5         flag = Compare R4,0

OL6         If flag, branch Loop

 

// END OUTER LOOP

//*******************

//Restore stuff

5              (instruction to branch back)
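The slicing and packing steps above can be sketched in Python as follows. Each I or Q sample yields two bits: a sign bit (1 when the sample is not negative) and a magnitude bit (1 when the absolute value is at or above the threshold). One deliberate difference from the pseudocode is that this sketch shifts the word before inserting each bit, so that exactly 16 bits fit in the output word; the threshold value in the usage lines is hypothetical.

```python
def qam16_demod(signal_i, signal_q, thresh):
    """Slice I/Q samples into bits and pack four symbols into each 16-bit word."""
    words = []
    # Only complete groups of four symbols produce an output word.
    for base in range(0, len(signal_i) - 3, 4):
        word = 0
        for k in range(base, base + 4):
            for sample in (signal_i[k], signal_q[k]):
                sign_bit = 1 if sample >= 0 else 0
                mag_bit = 1 if abs(sample) >= thresh else 0
                word = (word << 1) | sign_bit   # shift first, then insert
                word = (word << 1) | mag_bit
        words.append(word)
    return words


# Hypothetical levels -3, -1, 1, 3 with a decision threshold of 2.
result = qam16_demod([3, 1, -1, -3], [3, 1, -1, -3], thresh=2)
print([hex(w) for w in result])
```

Four symbols at 4 bits each fill one word, which is why the inner loop counter in the pseudocode is set to 4.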

25.2     Raw Operations

Class

Equation

Raw

5 + N*6/16 + N*(19 + 4*0.5)/4

Memory

N/16 + N*(2)/4

Multiplication

0

ALU

5 + N*5/16 + N*(17 + 4*0.5)/4

25.3     Impact of Specialized Instructions

Instruction

Impact

Modifier Equation

BDEC

OL4,5,IL17,18 eliminated

(ALU)  -2*(N/16+ N/4)

BPOS

OL5, IL3,7,11,15,18 eliminated

(ALU) -(N/16 + 5*N/4)

ZOL

OL4-5,IL17-19 eliminated

(ALU) -3*(N/16 + N/4)

COND_EXEC

IL 3.5,7.5,11.5,15.5 eliminated

(ALU) -N/2

VSL

IL4,8,12,16,3.5,7.5,11.5,15.5 eliminated

(ALU) -N*(4+4*0.5)/4

MEM2

MEM operations cut in half

(MEM) -(N/16 +N/2)/2

MEM4

MEM operations cut to quarter

(MEM) -3*(N/16 +N/2)/4

NOREG

MEM operations eliminated

(MEM) -(N/16 +N/2)

25.4     Synergistic Modifiers

Modifiers

Impact

Modifier Equation

BDEC, ZOL

OL4,5,IL17,18 added back

(ALU)  target + 2*(N/16+ N/4)

BPOS, BDEC

OL5,IL18 added back

(ALU)  target + (N/16+ N/4)

BPOS, ZOL

OL5,IL18 added back

(ALU)  target + (N/16+ N/4)

BPOS, BDEC, ZOL

OL5,IL18 eliminated (again)

(ALU)  target - (N/16+ N/4)

COND_EXEC, VSL

IL 3.5,7.5,11.5,15.5 added back

(ALU) target + N/2

MEM2, NOREG

MEM2 effect undone

(MEM)  target +(N/16 +N/2)/2

MEM4, NOREG

MEM4 effect undone

(MEM)  target + 3*(N/16 +N/2)/4

 



[1] The specific tradeoff is an expected 7.5 * 8 or 60 cycles for specifying # of 16-bit words versus an added 2 cycles per bit (actually a little bit more) for specifying # bits because of the need for extra loop control.