EE265 Lab 1: TI TMS320C55x DSP Tutorial
Winter 2008-2009
Instructor: Teresa Meng
I. Background and Motivation:
You will be hearing a lot about the term DSP throughout the course. What is DSP? Why is it important?
Why would you process signals digitally instead of leaving them in analog form?
When should you use DSP? What are the advantages, disadvantages and tradeoffs?
What are DSP processors and how are they different from microprocessors? This
section briefly discusses these questions, and hopefully you will
understand these issues in greater detail by the end of this course.
DSP vs. Analog:
DSP stands for digital signal processing and it often involves
taking an analog signal, converting it to samples of digital values, processing
the digital data, and converting it back to analog for output. We often
consider DSP along with the analog front and back-ends since that is what we
ultimately hear and see. Why do we process signals digitally rather than
working with the original signal in the analog domain? The answer is it depends
on the system and its requirements. For some systems, working with the analog
signals gives a better solution. For others, DSP is better. It is up to you,
the designer, to make the decision based on materials you learn through this
and other courses. Here are some things to consider in making the tradeoffs:
- Considerations for Analog systems
o Operating conditions - temperature, frequency, supply voltage
o Radiation/interference
o Out of band interference
o Aliasing
- Considerations for Digital systems
o ADC (analog to digital converter) precision and filter noise
o DAC (digital to analog converter) filter order
o More complex algorithms may be used. Some may be unattainable with
analog systems (examples: compression, data analysis, and data synthesis).
o Flexibility in choice of algorithms.
o Reprogrammable.
o Fixed point - loss of precision due to limited datapath
o Floating point uses more power
o Reliability not dependent on operating conditions
o Reusability of hardware
o Throughput and latency - some systems require that DSP be performed
in real-time. This is a throughput constraint: because storage is finite, the
system must sustain equivalent input and output data rates. This does not mean that
an input signal will be immediately processed to generate an output signal. On
the contrary, most DSP systems generate the output corresponding to an input
signal after some delay. The maximum latency tolerated by a system is part of
the specification.
o Storage of intermediate results - on-chip memory needed (may be a
limiting factor)
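The fixed-point precision loss mentioned in the list above can be illustrated with a minimal sketch. It assumes a 16-bit signed fractional format (often called Q15), which is a common choice on 16-bit fixed-point DSPs; the helper names `to_q15` and `from_q15` are hypothetical and are not part of any TI tool.

```python
def to_q15(x):
    """Quantize a real value in [-1, 1) to a 16-bit signed integer (Q15), saturating at the limits."""
    n = int(round(x * 32768))
    return max(-32768, min(32767, n))

def from_q15(n):
    """Convert a Q15 integer back to a real value."""
    return n / 32768.0

value = 0.123456789
q = to_q15(value)
error = abs(value - from_q15(q))   # quantization error introduced by the 16-bit datapath
print(q, error)
```

Without saturation, the rounding error is at most half of one least significant bit (2^-16). Values at or beyond +1.0 saturate to the largest representable value.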
DSP Processors vs. General-purpose Microprocessors:
How do DSP processors differ from microprocessors? Both types of
processors perform computations on digital signals. The main difference is
that DSP processors are tailored to process data signals, whereas
microprocessors are designed to reduce the amount of computation in a general
computing environment where most signals being processed pertain to some form
of program control. DSP processors are also designed to consume less power
in their target applications. Products that use DSPs include DVD players,
camcorders, cell phones, wireless base stations, modems, karaoke players, etc.
The separation between DSP processors and microprocessors is
narrowing, since microprocessors are approaching, if not exceeding, the throughput
capabilities of DSP processors. In the personal computer industry, microprocessors
are taking over several computational tasks that used to be cost effective only when
implemented with DSPs. This results in a narrowing market segment for DSP
processors in the PC industry. However, DSP processors remain highly
competitive in non-PC areas where high-throughput, low-power computing
is required.
This course
integrates a number of different topics in digital signal processing (DSP): DSP
processor programming, DSP algorithms implementation, and performance
considerations. Understanding the programming methodology of DSP processors is
relatively simple, but writing optimal code requires an understanding of the DSP
algorithms as well as the capabilities of the DSP processors. Although this
course covers the programming methodology for TI's TMS320C55x DSP processor
(which is one of the most popular DSP processors currently in use), the
fundamentals of DSP programming extend to other DSP processors as well.
Furthermore, the design principles you will learn in coding a DSP processor
will be applicable to DSP designs in application-specific hardware.
II. Purpose:
The purpose of this
lab is to introduce the design flow and basic programming methodology for
working with DSP processors (in particular, TI's TMS320C55x DSP processor).
This lab consists of an introduction, a couple of DSP demos, and a brief
programming exercise. The introduction is not intended to cover everything you
need to know about DSP programming, but to provide you with a working knowledge of
TI's C55x DSP and its tool environment and enough material for understanding
TI's C55x reference materials.
III. Introduction to the C55x DSP Processor
This section provides
an overview of the C55x processor. The architecture and the instruction set of
the C55x will be discussed first, followed by an introduction to the
programming tools used in the EE265 lab. As an overview, the following
discussion does not provide detailed explanations of the topics discussed. For
detailed descriptions of both the architecture and the instruction set you will
need to refer to the C55x manuals. For now, you should read the following
section before reading the manuals. Section IV of this lab lists assigned
reading, which should help fill in many of the details.
As a first introduction, the information you are exposed to in this lab can be
rather overwhelming. However, since much of this material will not be discussed
again in later labs, you are highly encouraged to read the material now and
then reread it later in the course after you become more comfortable with the
C55x processor. It may help to make note of which parts are confusing so that
you can clear up the confusion later, when understanding is required by the lab
exercises.
A. Architecture
The most pressing
issue in the design of modern high-speed digital logic is the high power
consumption of large memories. Thus, the architecture of the DSP is designed
around efficient memory access and utilization. Unlike
many other DSP processors, the C55x processor architecture is based on a unified
memory system, although the program memory and the data memory are logically separated. This means that the program memory and the data memory are
separately addressed (the program memory space is byte addressable with
24-bit addresses and the data memory space is word addressable with 23-bit addresses), but
the physical memory is shared by the program memory space and the data memory space. This
memory may include both on-chip and off-chip memories. Take a minute to examine
the C55x architecture in Figure 1.
Notice that there are four read buses, three for data and one for
program code. Specialized instructions allow data access using all three data
buses in one clock cycle. Part of the data memory that is on-chip is
dual-ported and is referred to as DARAM (Dual-Access RAM) in the manuals. This
means that this part of the data memory can be accessed twice in the same clock
cycle. This is useful for implementing DSP algorithms where multi-operand instructions
are predominant.
Take note of the four
basic blocks of the C55x CPU: the Instruction Buffer Unit (I Unit), the Program
Flow Unit (P Unit), the Address-data Flow Unit (A Unit) and the Data
Computation Unit (D Unit). You should read more about these first three units
in the TMS320C55x Technical Overview and the TMS320C55x DSP CPU Reference Guide.
Datapath
Notice that there are
six pairs of buses interconnecting the core and the memory units. Five of these
pairs are for the data memory and the sixth one is for the program memory.
Though there are five data buses, three read and two write, this does not mean
that the processor uses all buses every clock cycle. Rather, there are a
limited number of instructions that can make use of the full memory bandwidth.
The CB and DB read buses and the EB and FB write buses can be used jointly to access
single 32-bit values or individually to access two 16-bit values. The BB read
bus is used primarily by the special dual-MAC instructions, and can access a
single 16-bit value stored in internal memory (only in internal memory!).
Data Computation Unit
Take a minute to
study the architecture of the C55x Data Computation Unit (D Unit) in Figure 2.
This figure can tell you a lot about the capabilities
and limitations of the C55x processor. First, notice the computational building
blocks of the C55x processor. These include the shifter, a 40-bit arithmetic
logic unit (ALU), two 17x17-bit multiply-accumulators (MACs) and four
register-accumulators. Next, note the interconnections between these blocks.
This tells us how data can be transferred between the computing units and how
the processor core can be configured in each cycle. Of special note is the fact
that while there are two MAC units, there are only three data bus pathways.
Thus, while the TI C55x can perform two multiply-accumulate actions in a single
clock cycle, at least one of the input operands must be shared between
multipliers. Furthermore, this shared operand must be stored in internal
memory, as it is transferred on the BB data bus.
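The shared-operand constraint fits FIR filtering well, because two adjacent output samples use the same coefficient at each tap. The following is only a behavioral sketch in Python, not C55x code: the function name `dual_mac_fir` is hypothetical, the two accumulators stand in for the two MAC units, and `coeff` plays the role of the operand shared over the BB bus.

```python
def dual_mac_fir(x, c):
    """Compute two adjacent FIR outputs per pass, sharing each coefficient between two MACs."""
    n_taps = len(c)
    outputs = []
    # n and n + 1 are the two output indices produced in each pass
    for n in range(n_taps - 1, len(x) - 1, 2):
        acc0 = 0
        acc1 = 0
        for k in range(n_taps):
            coeff = c[k]                  # shared operand, fetched once
            acc0 += coeff * x[n - k]      # first MAC
            acc1 += coeff * x[n + 1 - k]  # second MAC
        outputs.extend([acc0, acc1])
    return outputs

print(dual_mac_fir([1, 2, 3, 4, 5, 6], [1, 1]))
```

Each pass of the inner loop reads three values (one coefficient and two samples), which matches the three data read buses described above.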
Functional Units
The two MACs use
17x17-bit multipliers, so that signed values can be multiplied without
additional bit-level operations. The output of the multiplier is fed into a
40-bit adder to generate a 40-bit result, which can then be optionally saturated (at 32
bits or 40 bits). The combination of multiplier and adder
enables a single-cycle multiply-accumulate (MAC) operation and is very
useful for filtering operations. The ALU is independent of the MAC unit. It is
capable of performing logic operations as well as additions. Note that one of
the inputs to the ALU is taken from the Barrel shifter. This means that the
data word can be shifted prior to addition or logic operations. Additionally,
notice that the ALU is 40 bits wide. It can perform a single arithmetic operation on
a 40-bit value (e.g., from an accumulator) or two arithmetic operations on 16-bit
values. As you explore the TI C55x instruction set, you'll note that the size of the operands for the various instructions is typically
specified.
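To see why the accumulator is wider than the 32-bit product, consider the following behavioral sketch of a multiply-accumulate with optional saturation. It is plain Python rather than C55x code, and the helper names `saturate` and `mac` are hypothetical.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range of a two's-complement word of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def mac(acc, a, b, sat_bits=40):
    """One multiply-accumulate step: acc + a * b, saturated to sat_bits."""
    return saturate(acc + a * b, sat_bits)

acc = 0
for _ in range(4):
    acc = mac(acc, -32768, -32768)   # each product is 2**30

acc40 = acc                # still exact in a 40-bit accumulator
acc32 = saturate(acc, 32)  # clamps if the result is reduced to 32 bits
print(acc40, acc32)
```

The extra 8 bits above bit 31 let a running sum exceed the 32-bit range without losing information; saturation only matters when the result is finally reduced to a narrower word.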
You will find
instructions that utilize multiple accumulators, the ALU, the Barrel Shifter,
and one or both MACs. Such instructions are very useful for certain
applications like Viterbi decoding. The ability to perform these types of
instructions makes TI's C55x a very powerful DSP since they utilize most of the
processor's core resources. However, it is up to the DSP engineer to choose the
appropriate algorithms tailored to the DSP architecture. Otherwise, almost any
algorithm can be implemented with simple add and multiply instructions.
Optimal DSP code is code that results in the lowest power consumption. This is
equivalent to code that executes in a minimal number of cycles and
uses instructions having high processor utilization.
Control Circuitry
In the C55x, the
program control and addressing circuitry are split into 3 units: the
Instruction Buffer Unit (I Unit), the Program Flow Unit (P Unit) and the Address-data
Flow Unit (A Unit). C55x instructions can be up to 6 bytes long. However, examining
the architectural diagram in Figure 1,
you'll notice that the program data bus is only 32 bits
wide! The instruction buffer queue stores up to 64 bytes of program code; as
many instructions are shorter than 4 bytes, this means that instructions
longer than 4 bytes can often be executed without the extra cycle of latency that
would be required in the absence of the instruction buffer. However, if the
instruction buffer has been emptied by a sequence of long instructions or as a result
of a branch or call, and the next instruction to be executed is longer than 4
bytes, a one-cycle delay will ensue. Furthermore, code loops which fit entirely
within the instruction buffer queue can be executed efficiently, not only by
eliminating potential delays in fetching long instructions, but also by
forgoing the energy required for code memory accesses.
Many algorithms
access data by way of address pointers (much like C/C++-style pointers). The A
Unit contains a 16 bit ALU which gives the C55x the ability to dynamically
update address pointers without taking any additional cycles to perform pointer
arithmetic such as adding constants to a pointer or incrementing modulo some
value.
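The "incrementing modulo some value" update is what makes circular buffers cheap on a DSP. The following sketch shows only the arithmetic; the function name `circular_increment` and the example addresses are hypothetical, and on the C55x this update is done by the address hardware rather than by software.

```python
def circular_increment(pointer, step, buffer_start, buffer_size):
    """Advance a pointer by `step` words, wrapping inside [buffer_start, buffer_start + buffer_size)."""
    offset = (pointer - buffer_start + step) % buffer_size
    return buffer_start + offset

# Walk a 4-word buffer that starts at word address 0x1000
p = 0x1000
visited = []
for _ in range(6):
    visited.append(p)
    p = circular_increment(p, 1, 0x1000, 4)
print([hex(a) for a in visited])
```

A negative `step` walks the buffer backwards and wraps from the start of the buffer to its end.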
Memory Organization
The C55x has 3 memory
spaces, program (page 0), data (page 1), and I/O (page 2). Make note of the
page numbers associated with each memory space because you will need to know
them for coding. As instructions are specified in 8-bit (1 byte) chunks, the
program memory space is addressed at the byte level, with 24 bit addressing. As
data is processed in 16 bit words, data memory space is addressed at a word
level, and hence data memory addresses are 23 bits wide. The actual amount of
internal memory available on the C55x processor depends on the particular
model; the C5509A, the processor used in this class, has 8 blocks of 8KB of DARAM (64KB
in total) and 24 blocks of 8KB of SARAM (192KB in total). Note that for
full data bandwidth, some instructions require different operands to be stored
in different blocks of memory.
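Because both address spaces refer to the same physical memory, a 23-bit word address and a 24-bit byte address cover the same range. The sketch below assumes the usual convention that word address N occupies byte addresses 2N and 2N+1; check the CPU Reference Guide before relying on it. The function names are hypothetical.

```python
WORD_ADDRESS_BITS = 23
BYTE_ADDRESS_BITS = 24

def word_to_byte_address(word_addr):
    """Byte address of the first byte of a 16-bit data word (assumed mapping: byte = 2 * word)."""
    if not 0 <= word_addr < (1 << WORD_ADDRESS_BITS):
        raise ValueError("word address does not fit in 23 bits")
    return word_addr << 1

def byte_to_word_address(byte_addr):
    """Word address of the 16-bit word that contains the given byte."""
    if not 0 <= byte_addr < (1 << BYTE_ADDRESS_BITS):
        raise ValueError("byte address does not fit in 24 bits")
    return byte_addr >> 1

print(hex(word_to_byte_address(0x7FFFFF)), hex(byte_to_word_address(0x2001)))
```

Either way the addressable range is the same: 2^23 words of 2 bytes each equal 2^24 bytes.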
A more detailed
description of the C55x memory space is available in Chapter 3 - Memory of the CPU Reference Guide and in the TMS320C5509A
Data Manual.
The program memory is
pointed to by the Program Counter (PC) which references the next instruction to
be executed. There are also other auxiliary and status registers that are
associated with the program memory space. They are primarily used for program
flow control such as branching or conditional execution.
The data memory space
is associated with eight 16-bit auxiliary registers, AR0, AR1,
..., AR7. These registers are used primarily as pointers, just
like the pointers found in high-level languages such as C/C++. By functioning as pointers (a form of addressing called
“indirect addressing”), the address registers enable faster instructions to be
implemented (more on this later). The A
Unit data address generator, DAGEN, is dedicated to operating on the address
registers. Automatic increments, decrements, modulo, as well as indexed increments,
bit-reversal, etc., are supported by DAGEN. These all make data manipulation
even more flexible with the C55x. Recall that the data space uses 23-bit
addressing. When the address registers are used for indirect addressing, the
high 7 bits of the address are taken from the high 7 bits of the extended auxiliary registers XAR0, XAR1, ..., XAR7, which are set using a
23-bit constant (k23).
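The way a 23-bit address is assembled from the two parts of an extended auxiliary register can be sketched as follows. This is a behavioral model only. It assumes that a pointer update changes just the low 16 bits, so a pointer wraps within its 64K-word region instead of carrying into the high 7 bits; the function names are hypothetical.

```python
def split_xar(xar):
    """Split a 23-bit extended register value into (high 7 bits, low 16 bits)."""
    return (xar >> 16) & 0x7F, xar & 0xFFFF

def post_increment(xar, step=1):
    """Model a pointer update that only touches the low 16 bits (assumed behavior)."""
    high7, low16 = split_xar(xar)
    low16 = (low16 + step) & 0xFFFF      # wraps without carrying into the high part
    return (high7 << 16) | low16

xar3 = 0x01FFFF                          # hypothetical 23-bit value loaded as a k23 constant
print(split_xar(xar3), hex(post_increment(xar3)))
```

Under this assumption a buffer that is walked with a single pointer should not straddle a 64K-word boundary.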
Tables 6-4 and 6-5 in the TMS320C55x DSP CPU Reference Guide list the
possible variations of indirect addressing using address registers. Notice that there are restrictions on the set of address registers
that can be used with two-operand instructions (such as instructions that multiply two
numbers that are both stored in memory) and with parallel instructions; the “Dual
AR Indirect Addressing Modes” are shown in Table 6-7. Instructions which use three
operands from memory require special addressing, using the “coefficient data
pointer” (CDP). Before using instructions which use the CDP (denoted by the Cmem
notation in the reference guide) you should read and understand the details
given in section 6.4.3.
For those who are curious, here is a brief explanation of why address registers
enable faster instructions: The C55x processor does not have a fixed
instruction length, and instructions can be up to 6 bytes long. However, there
are two reasons why longer instructions are not to be desired. First, memory
access is a major component of energy consumption; the more times memory is
accessed in a battery powered device, the shorter the battery lifetime will be.
Second, since the processor can only fetch two words (32 bits) of memory each
clock cycle, if an instruction is more than two words in length, this increases
the chances that it will not be able to be executed in a single clock cycle.
This means that instructions can be both more energy efficient and faster if
they can be packed into fewer bits (and therefore fewer words). Some of the
bits that make up an instruction are used to tell the instruction decoder what
type of operation is supposed to be performed, while the remaining bits
typically tell the processor what the operands are for the instruction. For
example, if a load instruction is used, the operand to be loaded can be
specified by encoding its address directly in the instruction. For the C55x,
even the 16-bit low part of a data address takes a full word, so encoding the address directly in the instruction
(so-called “absolute addressing”) would result in an instruction that is at
least 2 words long. Alternatively, if the operands are addressed by way of
pointers (“indirect addressing”), then the
instruction only needs to encode enough bits to indicate which pointer to use.
Since there are only 8 address registers available for use as pointers, this
takes at most 3 bits. Even more detail: In the case of dual-operand addressing,
the instruction needs to specify two operands by way of pointers, so more bits
are needed, and thus there are more restrictions on the potential pointer
manipulations.
Recommended: For more information on addressing, read all of Chapter 6 - Addressing Modes of the TMS320C55x
DSP CPU Reference Guide.
Processor Configuration (Recommended)
Inside the C55x
processor core are a number of registers pertaining to the control and
configuration of the DSP processor as well as communication with peripheral
devices. These registers can be used to monitor the status of the processor and to configure
it. To simplify programming, these registers are mapped to the data
memory, which is why they are called memory mapped registers (MMRs). This means that
instructions that work with the data memory can access
and operate on the information contained in the MMRs. The instruction set also has several
instructions that can only be used to operate on MMRs. C55x MMRs are listed in
Table 2-1 of the CPU Reference Guide. Of particular note are the 4 status registers, ST0_55 through ST3_55. These registers control and report many basic operations.
Addressing, conditional flags, overflow mode, sign
extension, saturation, rounding, circular addressing, fraction mode, global
interrupt enable, shifting, and much more are all accessible through these
registers.
B. Instruction Set
Overview:
The C55x instruction
set can be summarized as: too many choices. There are many different types of
instructions available and for any given instruction there may be many
different variations to choose from. For example, a look at the instruction set
documentation will reveal that there are more than 20 variants of the addition
operation (including multiple versions of ADD, ADDV, ADD::MOV, ADDSUB,
ADDSUBCC, and ADDSUB2CC). The number of choices may appear daunting at first,
but the availability of the many variations means that there is usually an
instruction available that will do exactly what you need. For example, if you
need to add two numbers then there is probably a specific instruction available
to do this for you, regardless of where the numbers are stored. Some examples:
1) adding two numbers that are both stored in memory, 2) adding a single number
stored in memory to a number in an accumulator, 3) adding a constant to an
accumulator, or 4) adding a value to an address pointer. The significance of
all of this is that if you know you need to perform a certain type of
operation, you need only find the proper version (the proper syntax) to use for
a particular situation. Even more importantly, the availability of many
different instructions results in more efficient code, since many operations
can be done in a single cycle instead of using several cycles to perform a
task. For example, if you were not allowed to add two numbers that were both
stored in memory, then you would first need to load one of the numbers into an
accumulator and then perform the add. This would take a minimum of two cycles
for the load and add, whereas the dual-operand version that adds two numbers
directly from memory could be done in a single cycle.
DSP-specific and Application-Specific Instructions:
Much of the C55x
instruction set consists of common instructions such as Load/Store, ADD,
Multiply, etc., but there are also many instructions that are available
specifically for DSP operations. These DSP-specific instructions are the reason
why DSP processors can be more efficient than general-purpose processors. They
allow certain DSP operations to be performed using fewer instructions (and
fewer clock cycles) than would be required if using general-purpose
instructions. There are also instructions that are application-specific in that
they are available primarily to speed up certain specific DSP algorithms (such
as FIR filtering with symmetric coefficients or Viterbi decoding). Although
these instructions may have been targeted for certain specific algorithms, they
can sometimes be exploited for other uses as well.
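As an illustration of why a symmetric-coefficient FIR filter is worth a dedicated instruction, the sketch below folds the sample window so that each coefficient is multiplied once with the sum of two samples. It is plain Python under the assumption of an even number of taps; the function names are hypothetical and nothing here models the actual C55x instruction.

```python
def fir_direct(c, x):
    """One FIR output computed tap by tap: len(c) multiplications."""
    return sum(ck * xk for ck, xk in zip(c, x))

def fir_symmetric(c_half, x):
    """Same output for symmetric coefficients, using only len(c_half) multiplications."""
    n = len(x)
    return sum(c_half[i] * (x[i] + x[n - 1 - i]) for i in range(n // 2))

c_half = [1, 2, 3]
c_full = c_half + c_half[::-1]        # symmetric coefficient set: 1 2 3 3 2 1
window = [1, 2, 3, 4, 5, 6]
print(fir_direct(c_full, window), fir_symmetric(c_half, window))
```

The folded form halves the number of multiplications per output by adding the two samples first.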
Optimization, Options and Pitfalls:
The abundance of
available instructions means that there is a lot of room for optimization
depending on which instruction is chosen for a particular task. (This also
means that it is very difficult to design a C-to-assembly compiler for DSPs.)
Being familiar with the available instructions can make programming easier and
more efficient. It is essential for the DSP programmer to have a good
understanding of all the options and pitfalls.
Instruction Types:
The C55x instruction
set can be broken down into the following categories:
- These instructions transfer data within and between
the program and data memory spaces. These are primarily the various Load,
Store and Move instructions.
- DSP programs mostly work on data stored in the data
memory. Data memory is composed entirely of RAM or processor-specific ROM. Many
programs operate on data that is generated externally, such as an audio
stream that is sampled with an external Analog-to-Digital (A/D)
converter. The C55x itself does not have an A/D converter, but (as you'll
see later) data can be obtained from an external A/D converter by way of
a serial port on the C55x.
- The DSP program will be physically stored in some
type of ROM addressable in the program memory space. Program memory
contains not only the instructions for a program but can also include
constant data such as filter coefficients for use in your DSP algorithm.
Most instructions operate on data in the data space, so data (such as
filter coefficients) must sometimes be moved from program memory to data
memory.
- Arithmetic or Logical Instructions: Instructions primarily handled by the ALU such as
AND, OR, XOR, ADD, SUB, etc.
- Shift: Most C55x ALU instructions have shifting embedded since
the Barrel Shifter is in series with the ALU. Arithmetic shift, logical
shift, and rotate instructions are also available.
- Long Word Instructions: As discussed above, each C55x memory data bus is
16 bits wide and the processor core datapaths are 40 bits wide. A single
bus transfers one 16-bit word at a time. The C55x has provisions (i.e., instructions
and processor modes) for working with 32-bit and 40-bit data, which are
available for special circumstances. (Most DSP algorithms do not require
more than 16-bit precision.)
- Exponent encoder: The C55x is a fixed point DSP. However, it is also
capable of performing limited floating point operations using the
exponent encoder.
- MAC: Multiply-accumulate (or multiply-subtract)
instructions allow the result of a multiplication to be added to a
previously-computed sum. This can often be done in a single cycle, which
makes for very efficient DSP operations. MAC is one of the most used
instructions since many DSP algorithms (such as FIR filtering) involve
adding the products of multiplications. There are several variations of
MAC instructions but some of these may not be available in all cases. In
planning to use the MAC, you need to consider where the operands are
stored and in which order you need to do the multiply and accumulate.
(Multiply-subtract (MAS) instructions are also available.)
- Application-Specific Instructions: The C55x has several application-specific
instructions. One such example is MAXDIFF - Compare and Select
Accumulator Content Maximum, specifically designed to facilitate Viterbi
decoding. (Application notes are available on-line that discuss how
Viterbi decoding is done on the C55x.)
- Condition - These instructions enable you to test a variable
(stored in the accumulator, address registers, the memory-mapped
registers, etc.) against a certain condition. If the condition is met,
the appropriate status flags are set. These status flags are used by
instructions such as branch or conditional execution (XCC) to determine what
instruction to execute next.
- Branch - There are a number of different branching
instructions available. These include basic branching (B), conditional
branching (BCC), and subroutine calling (CALL).
- Repeat - Repeat instructions are used for looping over a
single instruction or a block of instructions. Looping can also be done
using conditional branching, but repeat instructions can be used when the
number of iterations is known a priori. Looping with repeat instructions
is equivalent to a “for” loop, whereas branching is equivalent to a
“while” loop. The looping is controlled by way of local registers that
store information such as the loop iteration count, the address of the
first instruction in a loop and also the last instruction to be executed
in a loop. The program decoder will process the information stored in
these registers to decide which instruction to execute next (whether it
needs to exit the loop or branch back to the beginning, etc.). The repeat
loop has much less cycle overhead than branching because it knows exactly
how many times it needs to execute. Also when it is done looping, program
execution resumes at the next instruction following the loop. Therefore
the pipeline does not need to be flushed (unlike branching) and no cycles
are wasted other than the cycles needed to set up the repeat operation.
This type of instruction is characteristic of DSP processors, since the number
of iterations needed by a DSP algorithm is often fixed.
- Many instructions in the C55x instruction set can be
executed in parallel, so that two instructions can be implemented in the
same cycle. This is possible when the two instructions use completely
different resources (functional units and buses), so they can be
performed at the same time without causing resource conflicts. These are
a perfect example of special instructions that are included to optimize
performance as much as possible. Parallelism in the C55x can be either built into a
single instruction (for example, MAC :: MAC is a specific instruction) or user defined.
Read Ch. 2 of the Mnemonic Instruction Set Reference Guide
for more information.
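The idea behind the exponent encoder in the list above can be sketched in a few lines: find how far a value can be shifted left without overflowing, then keep the shifted value (the mantissa) together with the shift count (the exponent). This is a generic Python illustration with hypothetical function names. It does not reproduce the exact result convention of any C55x instruction.

```python
def redundant_sign_bits(value, width=32):
    """Number of left shifts a signed `width`-bit value tolerates without overflow."""
    magnitude = value if value >= 0 else ~value
    return (width - 1) - magnitude.bit_length()

def normalize(value, width=32):
    """Return (mantissa, exponent) such that mantissa == value << exponent."""
    exponent = redundant_sign_bits(value, width)
    return value << exponent, exponent

mantissa, exponent = normalize(0x12345)
print(hex(mantissa), exponent)
```

Normalizing a block of data by a common exponent before fixed-point processing preserves as many significant bits as possible, which is how a fixed-point DSP can emulate a limited form of floating point.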
Pipelining
The pipeline of the
C55x is discussed in Section 4.4 of the TMS320C55x DSP Programmer's Guide. The C55x makes programming quite easy by protecting against almost
all potential pipeline conflicts. Thus, under normal circumstances, the
pipeline will not introduce any problems. However, if you find your code takes
longer than expected to run and are not aware of how the C55x handles
pipelining or even what a pipeline is, it will make debugging quite difficult.
The rule of thumb for debugging an assembly code on a pipelined processor is
that when you run into an instruction that does not make any sense during
debugging, add several NOP (no-operation) instructions before the instruction
or check the conditional flags (if you don't
understand pipelining, this sentence will probably not make any sense. Keep it
in mind and you will understand later).
Interrupts
A majority of the
labs in this course involve the use of interrupts. We will be discussing
interrupts in greater detail in Lab 2. Essentially, interrupts are used as
signals to the processor to do things other than what the processor is
currently doing. The sequence of events for processing an interrupt is as
follows.
- First, an external (or internal) device generates an
interrupt to the C55x processor. The C55x processor decides whether to
accept the interrupt. If the interrupt is accepted, the current program
flow is interrupted.
- If the interrupt is accepted, all interrupts are
henceforth disabled so that the current interrupt that is being responded
to cannot be interrupted.
- The current PC (program counter) value is pushed onto
the stack. (This is called the “context-save” and it is analogous to
setting a bookmark which allows the program execution to return to what it
was doing before the interrupt occurred).
- The PC is then loaded with the address of the
interrupt vector (which is a small set of instructions that are executed
each time an interrupt occurs).
- The interrupt vector corresponding to the incoming
interrupt is then executed. The general term for code that is executed
when an interrupt occurs is “interrupt service routine” or ISR. Often, the
ISR won't fit in the space allotted for the interrupt vector so the
interrupt vector contains a branch instruction to the larger interrupt
service routine. This is because all interrupt vectors for the C55x are
exactly 4 words long. If the vector includes a branch to an ISR, the ISR
is executed following the interrupt vector. If the ISR is no more than 4
words long, then it may be coded into the interrupt vector without using a
branch.
- At the end of the ISR, a return from interrupt (RETI)
is executed. This restores the saved PC from the stack and enables all
interrupts to resume the normal program flow.
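The sequence of events listed above can be mimicked with a toy model. This is a hedged sketch in Python, not a description of the real hardware: the class `ToyCpu`, its fields, and the example addresses are all hypothetical, and only the steps named in the list (accept, disable, save the PC, jump to the vector, return) are modeled.

```python
class ToyCpu:
    """Minimal model of the interrupt entry and return sequence."""

    def __init__(self):
        self.pc = 0
        self.stack = []
        self.interrupts_enabled = True

    def interrupt(self, vector_address):
        """Try to take an interrupt; return True if it was accepted."""
        if not self.interrupts_enabled:
            return False                   # request is not accepted right now
        self.interrupts_enabled = False    # block further interrupts
        self.stack.append(self.pc)         # save the return address
        self.pc = vector_address           # continue at the interrupt vector
        return True

    def reti(self):
        """Return from interrupt: restore the PC and re-enable interrupts."""
        self.pc = self.stack.pop()
        self.interrupts_enabled = True

cpu = ToyCpu()
cpu.pc = 0x100
accepted = cpu.interrupt(0x20)
nested = cpu.interrupt(0x28)   # arrives while the first one is being serviced
cpu.reti()
print(accepted, nested, hex(cpu.pc))
```

The second request is refused because interrupts stay disabled until the return from interrupt executes.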
A number of things
need to be configured for interrupts to work properly. As implied in the
steps above, you need to set up the interrupt vector and the stack and
you need to code the ISR itself. You will also need to set up the interrupt mask
register (IMR) and the interrupt flag register (IFR). Also, before the main
program starts, you will need to enable the global interrupt mask (INTM). You
will learn how to do this by going through the exercises. Sometimes this setup can be done automatically for you by
the programming tools, as will be seen in later labs.
The IMR is used to
selectively enable and disable interrupts. In other words, the IMR configures
the C55x processor to listen to certain interrupts during normal processing.
The IFR indicates which interrupts have active requests so that you can find
out if an interrupt occurs while another interrupt is being serviced. The IFR
can also be used in another method of handling external signaling called
polling. Polling is different from interrupts in that a new interrupt sets the
appropriate IFR bit, but it does not stop the current program flow. Instead,
the program checks the IFR to see if an interrupt request is active. If so, an
equivalent “ISR” is executed. This results in more predictable program flow,
since the program is never actually interrupted. Finally, the global interrupt
mask (INTM) is a convenient way to enable (or disable) all interrupts that are
selected with IMR.
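The roles of the IMR, the IFR and INTM can be summarized with simple bit masks. The sketch below is a behavioral illustration only: the bit positions are made up, the function names are hypothetical, and the way a flag is actually cleared on the hardware is not modeled.

```python
def serviceable(ifr, imr, globally_masked):
    """Requests the CPU would respond to: flagged in IFR, enabled in IMR, not globally masked."""
    if globally_masked:
        return 0
    return ifr & imr

def poll(ifr, bit):
    """Polling: test one flag bit and return (was_pending, ifr_with_that_flag_cleared)."""
    mask = 1 << bit
    return bool(ifr & mask), ifr & ~mask

ifr = 0b0110                      # requests pending on (hypothetical) bits 1 and 2
imr = 0b0100                      # only bit 2 is enabled
print(bin(serviceable(ifr, imr, False)), poll(ifr, 1))
```

With polling, the main loop calls something like `poll` at a point of its own choosing, so the flag on bit 1 is handled without the program flow ever being interrupted.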
Different C55x models
have different numbers of interrupts. The maximum number of interrupts supported
on the C55x is 32, most of which are inactive. There are also two types of
interrupts: software and hardware. Software interrupts may be used to indicate
that an event occurred in the program. Hardware interrupts are generated by
physical devices. There are different types of hardware interrupts. Three
hardware interrupts are user specified. The user may connect a signaling wire
to the DSP hardware to control the DSP through these interrupts. Other hardware
interrupts are used for device-to-device communication such as serial ports and
buffered serial ports. We will focus on the hardware interrupts in this course.
C. Working with Assembly Language
Assembly is a low
level programming language. It is dedicated to a certain type of hardware. For
example, DSP assembly codes cannot run on Sun Workstations unadulterated.
Assembly coding does not have the conveniences of a high level programming
language where you are provided with highly abstracted constructs like objects
or even certain simple functions. Like any programming language, it needs tools
for development, such as compilers, simulators, and debuggers. Often, the
actual DSP hardware is required to verify real-time capabilities of the assembly
code.
In this course, you are provided with the following:
- Lab PC
- Used for writing, simulating and debugging DSP
programs.
- Generates audio input to the DSP for processing.
- Used for running Matlab to verify your algorithm
and to validate the DSP processed signals.
- TI C5509 Evaluation Board (EVM)
- Verifies that the DSP program runs under real-time
constraints.
- Generates audio frequency output for listening to
the signal that is actually produced by your DSP algorithms.
- Code Composer Studio 3.1 (CCS) - This is TI's DSP development environment that
includes:
- Assembler - translates the hand-written assembly code into
object code. Object code is machine code that has not yet been
allocated in the program memory. Relative addresses are resolved, but
absolute addresses in instructions such as branches are still symbolic.
- Linker - links all necessary object code, assigns each
portion of the code to a proper memory location, and generates a DSP
hardware compatible binary (the so-called executable). All symbolic
addresses are resolved to physical addresses.
- Debugger - loads and simulates the executable. Without the
EVM board, the term “debugger” usually refers to a software simulator.
However, when using an EVM board, the debugger essentially serves as an
interface to the EVM board. It allows you to upload your code onto the
EVM board and execute the code on the EVM. In addition, it provides
features for observing the state of the board, such as accumulator
values, status registers, the PC, etc. (which can be observed
after stopping the EVM from running).
- Manuals - a majority of TI's users and reference
manuals are available in the electronic format (linked to the course
webpage) and through the CCS Help menu. They
have been downloaded from their main website and stored locally for
faster access. (A hard copy of some of the manuals will be available in the
lab. Please do not take them out of the lab.)
- The course webpage has links to many related
documents, some of which you will need to refer to (and some which you
will never need to look at). There is a wealth of information available
here, but some of it will not be useful in the labs. We will try to
direct you toward the essential documents, but it is a good idea to take
a look at the other documents that are available.
- You may choose to print the documents, but be
mindful of the print quota where you are printing.
The code for a
typical assembly-based project usually consists of the following:
- Main assembly code
- Hardware configuration files
- Interrupt vector setup
- Serial port setup
- AIC (analog interface chip) setup
- Command files - command files are like configuration scripts that
configure how a tool works. There are various command files that configure
both the linker and the debugger. We will only describe the linker command
file, since this is the only command file that you will need to modify for
each project.
- Linker command file - configures the linker. It contains 3 parts, all of
which are placed in a single ASCII text file:
- The first part is the file I/O section. This
specifies which files the linker uses for input and output of the linking
process. Command line options can also be placed in this section.
- The second part is the memory definition. This
tells the linker about the memory sections that are available on the
processor, since the linker needs to decide where to place both code and
data. The sizes and locations of the processor's memory sections are
specified and assigned names here. The names are arbitrary,
but are usually chosen to indicate the type of memory section that
is referred to. An example of this is the name IDATA, which can indicate
that it refers to internal data memory.
- The third part is the mapping section. This tells
the linker where to place the different parts of the entire DSP code.
Remember that a complete DSP program consists of not only the main program
instructions but also the interrupt vector definition, the stack
definition, space for intermediate variables, etc.
In general, these can be stored in either the program memory or the data
memory and they all need to be mapped into some memory section (as
defined in the previous part of the linker command file).
- General Extension Language (GEL) file - As far as this course is concerned, the GEL
file is used to initialize target memory locations, such as the DSP
configuration registers, to known values. This can be very important for
restoring the DSP to a working state. For example, if the configuration
registers were accidentally overwritten and therefore the processor was
not working properly, the GEL file could be used to restore these
registers. Code Composer Studio provides a GEL file named dsk5509a.gel that
has appropriate settings for the C5509 processor. If you are curious to know
more, the General Extension Language is an interpreted
language similar to C that lets you create functions to extend Code
Composer Studio's usefulness. It is particularly useful for automated
testing and user workspace customization. However, you will not need to
modify any GEL file in this course.
- Compilation scripts/Project files
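To make the memory definition and mapping parts of the linker command file more
concrete, here is a rough Python sketch of the bookkeeping the linker performs.
It is not the linker's actual algorithm or syntax. The function name
place_sections and the region names, origins, lengths, and section sizes used
in the usage example are made up for illustration.

```python
# Sketch of what the memory-definition and mapping parts of a linker
# command file accomplish. All names and numbers are illustrative assumptions.

def place_sections(regions, mapping, sizes):
    """Assign a start address to every section.

    regions: {region_name: (origin, length)}   -- the memory definition
    mapping: {section_name: region_name}       -- the mapping section
    sizes:   {section_name: size_in_words}     -- known after assembly
    Returns {section_name: start_address}.
    Raises ValueError if a section does not fit in its region.
    """
    next_free = {name: origin for name, (origin, _length) in regions.items()}
    placed = {}
    for section, region in mapping.items():
        origin, length = regions[region]
        start = next_free[region]
        end = start + sizes[section]
        if end > origin + length:
            raise ValueError(f"section {section} does not fit in {region}")
        placed[section] = start
        next_free[region] = end  # the next section in this region follows directly
    return placed
```

For example, with regions = {"PROG": (0x0100, 0x0400), "IDATA": (0x2000, 0x0100)},
mapping a 0x200-word code section and a 0x100-word vector table to PROG places
them at 0x0100 and 0x0300, while an 80-word coefficient section followed by a
stack in IDATA places them at 0x2000 and 0x2050.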
IV. Readings:
Chapter 1: CPU Architecture
Chapter 6: Addressing Modes
Chapter 2: Tutorial (optional)
Chapter 4: Optimizing Assembly Code
V. Lab Environment:
This section takes
you through the lab computers and discusses where to find things and what to
look for.
We highly recommend that you read through the lab and do much of the lab work
outside of the lab, as the computer resources as well as the TA resources are
limited. You may find that some labs are rather lengthy. It is important that
you work efficiently inside the lab since there are a limited number of lab
stations. Try to use the lab for debugging your code and executing the
experiment. If this is not possible, try at least to have as much of your
algorithm roughly coded as possible before coming into the lab. This is
especially important in later labs, most of which are project-oriented
and require extensive coding.
We ask that you do
not lock the lab PC's in order to reserve a lab machine. In fact, the TA's and
the system administrators are responsible for logging off any locked lab PC's
if there are none available. If such a situation arises, the user will be logged
off, and any programs that are running will be terminated. The TA will not be able
to tell if there are important programs running.
A. Working Directory
You will be assigned an account on the EE network. This will allow you to log
in on any of the lab computers in Packard 001 to access your ee265 account. You
are responsible for keeping your files in your account folder and not leaving
random files on the local machine, especially the C:/ drive. You may use the
local C:/ drive for TEMPORARY storage since, on occasion, the EE network may be
down, preventing you from working on the labs. Files left on the C:/ drive will
be removed by the system administrator on a regular basis.
Each user will have a personal network directory. This directory is already
mapped to the Z:/ drive for you and will be available to you whichever lab
machine you are on. You may access this working directory through the "My
Computer" icon. We will refer to this working directory as Z:/ in this and
subsequent labs. If you don't know how to manipulate files under the Windows
operating system, please ask the TA's. For the sake of system administration,
please keep ALL your lab files under your working directory. All user files
outside of your working directory may be deleted at any time.
The user privileges are set up such that the different ee265 accounts are
private, so you may not be able to install certain programs. If there is
something that you would like to install but cannot, or a program that you
would like to be accessible on the EE domain, let us know and we will install
it on the server.
B. Setting up the Lab
All the lab materials
are stored on this website. Download these files to your personal directory
before doing lab assignments. Files for lab1 are
C. Code Submission
You are to
demonstrate your code to the TA on the due date (usually on Thursdays).
Demonstrations must be done during the TA office hours. The TA's have allotted
office hours on Thursdays for this and we expect that you will be able to find a
time slot to meet with them. During the meeting, you are asked to set up and
demonstrate your lab. Upon completion, you will need to provide a brief project
report (templates will be provided each week in the handouts section of the
webpage). You will also be expected to attach a hardcopy of your code (and
linker command file) to the lab submission for grading purposes.
Your code will be
graded based on a number of criteria: performance in terms of cycle time and
code size, coding style, and lab write-ups. We are not asking you to spend a
lot of time on comments and style, but enough for you and the TA to understand.
VI. Lab Exercises
In this lab, you will
first be introduced to the CCS tool environment from the perspective of an
experienced user. From it, you will get exposed to the do's and don'ts of lab
equipment and software. Then, you will go through two sample programs for a
demonstration of the C5509 DSP board. Finally, you will be asked to write (and
test) code for a simple FIR filter. You will also need to
answer many questions at the end of this lab.
A. Tool Environment
In this section, we
will take you through the tool environment, how to run CCS, what to look for,
and precautions to take when working with the hardware. There is a CCS tutorial
available. However, it is time consuming to go through. Besides, exact
instructions on what to do in CCS are provided in the lab exercises.
The Code Composer
Studio is an integrated development environment specifically designed for TI
DSP processors. The Code Composer Studio also has extended features for
hardware debugging. It facilitates communication of data between the host and
the EVM board and enables monitoring of the board status information. Like all
development environments, the combination of CCS and the
5509a DSK board is sometimes buggy and it will take time to get used to
the development flow. HINT: Often, repeating the same steps may actually make
something work.
1. On-line References
You can find all the
important documentation you will need on-line through the TI homepage, http://www.ti.com/. To simplify the search for
information, we have downloaded the on-line references (most of them in the
Adobe Acrobat PDF format) and made them available on the EE network server
through the class homepage. In addition the CCS Help menu is VERY USEFUL and
contains searchable information. With a few clicks, you can locate the
description of the ST0_55 register or the MPY::MPY instruction.
2. Hardware setup
Before you run CCS, you need to power up the EVM board and check its
connections. Previous EVM boards proved rather fragile. We hope that, with care
and the lessons learned from those boards, fewer boards will be broken in the
early weeks of this class.
Look around the workbench for the EVM. It is marked TMS320C5509A DSK. The EVM
has a power socket, USB connectors, stereo audio jacks, and a power supply.
In order to use the
EVM with CCS, you will first need to
- Connect the mini-USB connector to the DSP. Oddly, it sometimes appears to be important to plug
the USB cable first into the EVM before plugging it into the computer. The
mini-USB port, which is adjacent to the power socket, connects to a small
JTAG interface which enables the debugging features we will take advantage
of in this course. One of the other two peripheral-size USB ports is
connected directly to a USB output port of the DSP, and the other is
connected to an independent power measurement system which allows
developers to measure the instantaneous level of the DSP's power
consumption.
- Connect the USB cable to the computer.
- (Optionally) Connect the audio input and output
cables. The stereo 3.5 mm
jumper cable should connect on one side to an audio source, such as the
computer's line out. On the other end, it connects to the audio jack
marked “Line In” on the EVM. Your headphones should connect to the audio
jack marked “Headphone” on the EVM.
- Power the board by plugging the power supply into the power socket
at the corner of the board. You should notice the 4 LEDs light up in a
sequential pattern that concludes with all 4 illuminated. While one of the
buttons is marked “Reset”, we have found that unplugging the board is the
most reliable way to power cycle the DSP.
3. Software Tool Environment
This section provides you with REALLY useful knowledge that is entirely
experience based and that you will not find in any of TI's documentation. It
is important that you go through this section in detail even if you don't
understand some of the terminology. This section will save you time
later on when you encounter strange anomalies in CCS that you cannot
comprehend. It is highly recommended that you refer
to this section when you encounter problems in this lab or any future labs.
CCS startup procedure:
- You must first connect power to the EVM board before you start CCS.
Otherwise, you will get an error dialog box telling you that CCS is having a
problem communicating with the EVM.
- You can start CCS
by double clicking the 5509A DSK CCStudio v3.1
icon.
Do not poke around and start the CCS
setup utility. If you do, CCS will still run but the device driver for
RTDX (you will learn about this later) may break. The staff will have to
reinstall CCS to make it work again. If RTDX does not work for you, check to
see if the board is broken first and report the problem to the TA.
- After you have started CCS, the development environment
window should come up. The next thing that you should do is to connect to
the EVM. To do this, choose the Connect
option from the Debug menu (Debug->Connect). If this is successful,
a window titled “Disassembly” will pop up in the code view area of
the development environment.
- To the left is the Project View window. At this point, you can either open a
project or load an executable onto the EVM board. A project is like a
container that groups your source codes to help you keep track of them.
Executables are the compiled versions of your source codes that are ready to be
loaded onto the EVM board. The typical development flow is as follows:
1. Start a new project (Project->New...) or open a project template
(Project->Open...).
2. Complete/edit the source codes.
3. Compile the source codes (Project->Build or use a keyboard or toolbar
shortcut) to build the executable and verify that there are no syntax errors.
4. Load the executable (File->Load Program) onto the EVM board.
5. Run (Debug->Run, or keyboard/toolbar shortcuts) and debug the program. In
general, debugging would likely involve repeating steps 2-5 until the program
runs as expected.
- The Project View
window contains a listing of the source codes. Only the source codes
listed here will be compiled and linked. You can add source codes to the
project either using the pull down menu Project->Add
Files to Project... or by dragging the source code from a file
manager to the Project View
window.
- Make note of the path name of the directory where the source codes are
stored. It cannot contain spaces. For example, the path name
C:\Program Files\ti\my project is not supported by CCS. If the path to your
source codes contains spaces, the extra spaces will be incorrectly encoded
into the .pjt file of your project. This causes a problem when you reload the
project: CCS will not complain, but none of the source codes that were
originally included in the project will appear in the Project View window.
(NOTE: This was the case for earlier versions of CCS, but it may not be true in
CCS 3.1.)
- Make sure that you download and decompress the lab templates into a working
directory in your home account before you start any lab exercises!!!
Crash recovery:
- A crash occurs when the EVM board or the CCS hangs
when running the program. Occasionally, CCS will report in a dialog box
when the EVM malfunctions and these crashes usually require more drastic
actions than a simple software reset. Often,
crashes result from buggy programming such as memory leakage, incorrect
pointer arithmetic, improper addressing mode usage, or incorrectly
preserved context. This type of crash is very difficult to debug since it
is hard to trace where the memory leakage occurred or to determine the
register that had not been properly preserved. Here are some of the
prescribed symptoms and fixes for crashes due
to buggy programs:
- Symptom: When you start the program, it automatically stops and the PC
points to an address that is unfamiliar to you.
- Reason: A far branch has occurred which modified the XPC
register.
- Solution: It is almost impossible to trace the origin of the
branch. In fact, it may not be of any help to know where the branch
origin is since memory leakage is involved (which may have resulted from
improper coding). Double check your code to see if you have correctly allocated memory. Step through
the sections of the code where memory may be modified and verify your pointer arithmetic. If you are
working with interrupts, make sure that you have correctly saved and restored context.
- Symptom: The program is not doing anything when you run it.
When you stop the program, it has stopped at
some unknown hardware interrupt or inside an infinite loop.
- Reason: In DSP/BIOS, hardware
interrupts are defaulted to run infinite loops (i.e. be trapped) when
they are undefined. This tells you which undefined hardware
interrupt has been incorrectly triggered. There
are many reasons why the hardware interrupt may be incorrectly
triggered. A possibility is incorrect usage of the DSP/BIOS modules such
as improper calling of RTDX routines. Another possibility is memory
leakage where the IMR is incorrectly overwritten.
- Solution: Again, you must resolve this through careful
analysis of your code. Similar to the previous crash type, this type of
crash is extremely difficult to trace. In addition to going over the pointer arithmetic, pay special attention
to the STx55 registers. Make sure
they are preserved throughout
subroutine calls.
- Solution for integrated C/ASM codes: If you are integrating assembly source
along with C source, you may have inadvertently modified the CPL bit in the
ST1 register. This will cause the C subroutines to misinterpret stack-relative
addressing as page direct addressing. Stack-relative addressing is how
arguments are passed to C subroutines. If the CPL bit is cleared inside a C
subroutine, the subroutine will read its arguments from the wrong part of the
data memory, which results in unpredictable program behavior.
- Things you must do to reload
a program after a crash:
- Crashes that involve memory
leakage may cause some important system
registers to be incorrectly modified. Depending on the seriousness
of the situation, you will need to do one of the following:
- Always try first - Do a soft reset
(Debug->Reset) and then reload (File->Reload
Program) the program. Try this a
couple of times before telling yourself that this is not doing the job.
If you keep getting complaints from CCS that it cannot reset breakpoints
or that certain sections of the data memory are inaccessible, then go on
to try the next thing.
- Still not working - Then load the gel
file, C5509_Init (GEL->C5509_Configuration->C5509_Init)
and reload (File->Reload Program). Usually, this
should do it.
- Occasionally, the EVM board or the CCS will crash in
a way that is unrecoverable by any
software means. The two most typical unrecoverable crashes are:
- CCS brings up a dialog box indicating that it is unable to perform software reset on the target, or
unable to load the program.
- EVM is not outputting audio when it is
supposed to but the program seems to run normally. This may be a failure
in resetting the AIC or a failure of
the AIC. This is often an indication of a broken EVM board which may require the attention of the
staff. However, you should verify (see
below) that it is not just a software problem before you contact the TA.
- The most reliable crash recovery technique: If you find that going through
the crash recovery procedure described above does not help, then you may need
to take more drastic action. The one that is prescribed for all unrecoverable
crashes is a simple power cycling of the EVM board. Don't power off the EVM
board just yet! You must close CCS first. Otherwise, CCS may cause the host
computer to hang, which may be even more difficult to recover from. Just
remember that the order for power cycling the EVM board and starting CCS is
like a stack - whatever goes onto the stack last is the first to come off. So
CCS is the last to be turned on and the first to be turned off.
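The CPL-bit failure described above under "Solution for integrated C/ASM codes"
can be illustrated with a deliberately simplified Python model of direct
addressing. The function name direct_address and the register values in the
usage note are hypothetical, and the real C55x address computation has more
detail than "base plus offset"; the point is only that the same offset resolves
to a different location depending on CPL.

```python
# Simplified model: the CPL bit selects which base register a direct-addressing
# offset is added to. Real C55x address generation is more involved; this is
# an illustrative assumption, not the hardware's exact computation.

def direct_address(offset, cpl, dp, sp):
    """Return the data address that a direct-addressing operand refers to.

    cpl = 1: stack-relative addressing (base is the stack pointer)
    cpl = 0: page direct addressing    (base is the data page pointer)
    """
    base = sp if cpl else dp
    return base + offset
```

With sp = 0x3F00 and dp = 0x2000, an argument at offset 2 is read from 0x3F02
when CPL is set, but from 0x2002 if CPL has been cleared, which is why a C
subroutine then sees the wrong data.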
Debugging quirks:
- Using break points,
probe points, or profile points:
- The thing to remember about break points is that
when you set them on CCS, they are also set on the EVM board. So suppose
you are working on a project, have a program loaded on the EVM board, and
have break points set for debugging, then you try to load some other
program onto the EVM board. What would happen is that while the new
program is being loaded, CCS will complain that there are break points that cannot be cleared. CCS may
also complain about break points even if you are not loading a new
program but you have made significant changes to your source codes such
as adding a few lines of code. The problem is that CCS is not smart
enough to associate the break points with the code itself. Rather, the
break points are associated with the physical memory address of the
instruction. So if the address of the instruction changes, the break
point will be misaligned causing problems with CCS. (NOTE: This does not seem to be a major problem in
CCS 2.0, but it is still worth mentioning)
- The rule of thumb is
to clear all break points, probe points, and profile points when you are
loading a new program or when you recompile and reload codes that have
major changes.
- Using mixed
source/assembly mode:
- The Mixed Source/ASM
mode (View->Mixed Source/ASM)
is a nice feature of CCS especially when you are coding in C. It enables
you to see the compiled assembly code line-by-line along with your C
source code. It is a great tool for debugging your C code and for
understanding how C maps to assembly. There are a few caveats, however:
- You cannot edit your
source code in the Mixed Source/ASM mode. You would have to turn
this mode off first.
- You need to set up the
compiler so that the compiled executable would have the symbol information to support this mode.
This can be done through Project->Build
Options and in the Basic category
of the Compiler tab, make sure that Generate Debug Info is set to Full Symbolic Debug (-g).
- You must have either a compiled
program loaded onto the EVM board (File->Load
Program...) or its symbolic
information loaded into CCS (File->Load
Symbol...).
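The break point quirk described above (break points are tied to physical
addresses rather than to lines of code) can be sketched as follows. This is a
toy model, not how CCS is implemented: the helper names assemble and
breakpoint_targets are hypothetical, every instruction is assumed to occupy
exactly one address, and the instruction strings are placeholders.

```python
# Toy model of address-based break points. Names, the origin address, and the
# one-word-per-instruction assumption are illustrative only.

def assemble(lines, origin=0x100):
    """Give each source line one address, starting at origin."""
    return {origin + i: text for i, text in enumerate(lines)}

def breakpoint_targets(breakpoints, image):
    """Return the instruction that each address-based break point lands on."""
    return {addr: image.get(addr) for addr in breakpoints}

old_image = assemble(["load", "add", "mac", "branch"])
new_image = assemble(["setup", "load", "add", "mac", "branch"])  # one line added

breakpoints = [0x102]  # originally set on the "mac" instruction
before = breakpoint_targets(breakpoints, old_image)
after = breakpoint_targets(breakpoints, new_image)
```

Here before maps 0x102 to "mac", but after maps the same address to "add": the
break point stayed at its address while the code moved. Clearing all break
points before reloading avoids this mismatch.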
B. Sample Demonstrations
To demonstrate what
real-time DSP boards can do, two sample programs are provided. You will, in
later labs, implement these programs and ones that are more
sophisticated.
1. FIR Filter Demo
Power up the EVM and start CCS as described earlier. If CCS is already opened, then make sure to close the previous project (Project->Close) before you proceed. This
clears the settings from the previous session.
Make sure you understand the hardware
description/precaution discussed in part A before you connect
anything to the EVM board.
Connect the EVM AIC input to the Computer Line-out
cable. For this, you should connect the Computer
Line-out cable to the audio cable and then connect the other side of the
audio cable to the EVM Input.
Connect the EVM output to the headphones. If you see
speakers lying around in the lab, DO NOT USE THEM.
They are the leading cause of EVM board deaths.
Play a music piece from the computer, so that an audio signal goes from your PC sound card
into the Input port of the EVM. For this, you may bring in your own CD.
Download fir.out and FIR_Control.exe for this demonstration to your personal directory under labs/lab1.
Select File->Load Program.
- A Load Program window will appear. Change the directory to your personal
directory under labs/lab1.
- Load fir.out.
Select Tools->RTDX->Configuration Control.
- The RTDX Configuration form will appear at the bottom of the DSK5509A
session window. Click on "Configure". Under the (default) "RTDX Configuration
Page" tab, select "Continuous Mode" of operation. Click "OK", then enable RTDX
by clicking in the check box. Enabling RTDX will allow the exchange of data
between the host computer and the target EVM board.
Select Debug->Run or click on the Run
icon. You may also press the function key <F5>
to do this operation.
Open the program FIR_Control.exe. (This program is written in Visual Basic.)
You should be able to hear music now. You may select
different filters to hear their effects. If you cannot hear music or
if the music stops after a few seconds, then follow the steps for reloading
a program after a crash (i.e. stop the
program, reset the DSP, reload the program, and run
the program again). Sometimes it may take a try or two to get things right,
especially if you are new to this.
C. Simple Assembly Program:
At this point, you
should be quite familiar with the tool environment. You may not have enough
background on the instruction set yet but that will come with practice. The
best way to learn assembly programming is to
write a program (and to look at example programs). In this exercise, you are given
all the resources to write a simple assembly program. The trick is in debugging
and we will go through some of the basics.
All that is asked of
you here is to write a simple FIR filtering program with the following
specification:
- The FIR filter
will have 80 symmetric taps. (Note:
these taps are not perfectly symmetric, but symmetry is definitely
present).
- Your FIR program will only need to generate 1 filtered value as the output. This is
obtained by summing the products of each coefficient and data pair.
- Your code will be graded
based on the number of cycles for
running the filtering loop (setup excluded) as well as the code size (in units of words). Simply put, grading will be based on
the optimality of your FIR code as well as cleanliness of the other
unrelated codes. Strip your program to run in as few cycles as possible.
- You may use any means of writing this program (including
finding an on-line reference of the code) except sharing codes with a
fellow student.
- You may find the sample program (a working code)
stored in lab 1 sample.zip useful to get you
started.
- The project template is already provided to you (see 1. Procedure below). It contains the filter
coefficients that you will be using. So all you really have to do is code
the FIR filtering algorithm.
- After you have gone through the procedure below for
loading and pre-compiling the project template, take a look at the source
codes, simpleFIR.s55 (the assembly source) and simpleFIR.cmd (the linker
command file). Note the following lines of code:
- The filter coefficients are defined in the memory section coefDat. The
coefDat section is labeled by coefDATA. In assembly language, labels can also
be used as symbolic references. What this means is
that labels can be used as addresses in the instructions. The linker
will replace these symbolic references with actual physical addresses
when it allocates memory to the data and the code.
- coefDat is an arbitrary name that we gave to a defined memory section
(.sect) that is to be allocated in the data memory. This section is defined
because the content of the memory section (the filter coefficients) is listed
below the .sect directive.
- The physical memory definitions, e.g. DARAM0, DARAM1,
SRAM0, SARAM0, SARAM1, SARAM2, etc., are supposed to
reflect the actual hardware setup. Defining physical memories this way
gives you the ability to link your code under different hardware (in
this case, memory) configurations. The physical memories defined cannot
exceed the limits of the actual hardware. For example, if the physical
memory is 16K (i.e. length = 0x4000) and this memory is
configured in hardware to map to the address range of 0x2000 to 0x6000,
you should get a CCS linker error if you attempt to define a memory
entity originating from 0x2000 with
length of 0x6000.
- Make sure you understand the discussion on memory
access latency in part A.
- Note that certain performance enhancing
instructions (specifically, the single-cycle dual-addressing
instructions) are only guaranteed to work with dual-access RAM, but they
have been found to work in some other cases as well. The performance is
limited by the data buses that are available. If both operands reside in
the same block of SARAM, then there will be a memory conflict for that
block and a pipeline stall will occur. This will force the instruction
to take 2 cycles. However, if the operands are in different blocks of
memory (for example, one operand is in DARAM and another is in SARAM, or they
are in different banks of SARAM), there
should be enough data buses available to make the instruction work. This
may require you to re-map the data sections in the linker command file,
and you'll have to verify that single-cycle performance is achieved. This is
one reason why it is much easier to
use DARAM for all dual-addressing instructions.
- output is the output buffer, where the FIR
filtering result shall be stored. Its size is 2 words.
- Since the output is 32-bit data, it should be placed at an even word address
(the "2" at the end of the line denotes the alignment constraint).
- dataSect is also an undefined section hosting 80 words,
like coefSect. It holds the input data for FIR filtering.
[NOTE: Since circular buffers do not need to be aligned on the C55x, all
mention of alignment and circular buffers has been removed.]
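Before coding the filter in assembly, it can help to have a host-side reference
value to compare against, much as Matlab is used elsewhere in this course. The
following Python sketch computes the single output value specified above as a
sum of coefficient and data products. The function names are hypothetical, and
the second function is only an illustration of how exact tap symmetry could be
exploited; since the lab's taps are not perfectly symmetric, it would only
approximate the true result for them.

```python
# Host-side reference model for one FIR output value. Function names are
# illustrative; this is not the assembly solution for the lab.

def fir_output(coefs, data):
    """Sum of the products of each coefficient and data pair."""
    if len(coefs) != len(data):
        raise ValueError("coefficient and data buffers must have equal length")
    return sum(c * x for c, x in zip(coefs, data))

def fir_output_symmetric(half_coefs, data):
    """Same result as fir_output when the taps are EXACTLY symmetric,
    i.e. h[k] == h[N-1-k]. Data pairs are added first, so only half as
    many multiplications are needed."""
    n = len(data)
    if n != 2 * len(half_coefs):
        raise ValueError("data length must be twice the number of half taps")
    return sum(h * (data[k] + data[n - 1 - k]) for k, h in enumerate(half_coefs))
```

For the lab, coefs and data would each hold 80 values. Comparing the DSP result
against fir_output for the same inputs is one way to check the assembly code,
keeping in mind that the DSP works in fixed point while this model does not
round or saturate.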
1. Procedure
- You are provided with a template for coding.
Download lab1 asm_tmpl.zip
and decompress it to your personal
directory under labs/lab1/filter.
- Follow the procedure
provided in part A to start CCS.
- In the DSK5509A window, open the project simpleFIR. Select
Project->Open.
- In the Project
Open dialog box, change to the
appropriate directory and select the file labs/lab1/filter/simpleFIR.pjt. There are two files in this project, simpleFIR.s55 and simpleFIR.cmd.
If you have not gone through these files, do so now before you proceed.
- Select Project->Build
Options.
- In the Build
Options dialog box, click on
the Compiler tab
- Change Category to Basic.
- Make sure that "Full
Symbolic Debug (-g)" is selected from the Generate Debug Info drop-down list.
This lets you view the compiled assembler code
along with the linear assembly code that you wrote (see the discussion
on Mixed Source/ASM above). After you
compile the code, you can see the compiled code next to your assembly
code by selecting View->Mixed Source/ASM.
Also note that while the Mixed Source/ASM view is selected, you cannot edit your
assembly code. So if you find yourself frustrated at not being
able to edit your code, deselect this view first.
- Click on OK to update the settings.
- There are 2 basic files required for compilation, simpleFIR.cmd (the linker command file) and simpleFIR.s55 (the linear assembly code
which hosts your FIR filtering code). The linker command file specifies
the memory sections available on the C5509. It also lets you control how
your code gets allocated (or placed) in memory. The linear assembly code
is what actually gets compiled. Linear assembly is a slightly
higher-level version of assembly code, in the sense that you do not
have to worry about some of the details of writing assembly code, such as
pipeline conflicts. Linear assembly code does not need to contain any NOPs, whereas in
pure assembly code you have to explicitly insert NOPs to avoid pipeline
conflicts or to fill delay slots. Pure assembly files have extensions of the form .a* (e.g. .a55,
.asm, etc.). In this course, you will
primarily work with C code and linear assembly code; linear assembly files have
extensions of the form .s* (e.g. .s55, .s,
etc.).
- Compile the project with
Project->Rebuild All (or use a
shortcut). This recompiles all the source files into object code.
If you use Project->Build,
it will do an incremental build, where only files that have been modified
are recompiled. NOTE: You may get a
warning message saying that the code entry point _c_int00 is
undefined. This is a CCS bug and doesn't actually affect the executable
as long as you followed the steps above to define the code entry
point.
- Load the empty program into CCS in the following
manner: Select Debug->Reset CPU.
Then select File->Load Program.
Normally, your output program is in the ./Debug
folder.
- Select Debug->Restart. At this point, either the Disassembly window will open to the
memory location RESET, or your source
file, simpleFIR.s55, will be opened
and the line corresponding to RESET
will be pointed to with a yellow arrow.
- Press <F10> twice and the PC (Program
Counter) will jump to DONE.
This is because you have not written your code yet.
- Go ahead and write your code now and use the
CCS tools to debug your code (as described below).
2. Verifying Your Code:
- Debugging Tools:
Some helpful debugging tools
include (these should be used in all labs):
- CPU Register window
Select View->Registers->CPU
Registers.
- Memory windows
Select View->Memory. You can
examine an entire block of memory with this option. A Memory Window Options dialog box will
pop up. The form should be self-explanatory. You can also enter the
variable names for the address field.
For example, if you enter RESET
for the address field and select Program for the Page field, you would actually see the instruction opcodes for the program
instructions at the location RESET. Similarly, if you enter h0, which is the label of the first
filter coefficient (coefDATA would work too), you
will see the contents of memory starting at the address represented by h0. You can modify values in
this window by double-clicking any value. This will
bring up a window that allows you to change the value. This can be very
useful.
- Watch window
Select View->Watch Window.
The Watch window should be opened
at the bottom of the Code Composer
Studio session window, and it has two tabs: Watch Locals and Watch 1. To add a new expression to watch, select the
Watch 1 tab. You should see a
blank line under the headings Name,
Value, Type and Radix. Click in the blank box under Name to add a new expression.
The following information may be confusing,
but it will be worthwhile to understand:
As an example of a Watch Window expression, you can enter: h0, which is the label of the
first filter coefficient. The address assigned to the label h0 is displayed in the format specified
under the Radix heading (which can
be changed). If you enter *(int *)h0, then you would get the data value
stored at the address h0 (this
is the same as “de-referencing” a pointer). This notation may look
confusing, but it is actually a lot like C. The first * is needed to
tell the watch window that the symbol h0 is an address (a pointer) and you do not
want to see the value of the address itself but rather the value of the
data stored at that address. The (int *)
part tells the watch window that the address h0 points to an integer (a
16-bit number on this processor). This is the same as a C-style type
cast, since an address on its own is only understood to be of type void*
(a pointer to a void type). If the expression in the watch window is
simply given as *h0 then the
watch window does not know the size or the format of the data at that
address, so it cannot properly display it. Other possible type casts
could be (unsigned int *), (long int *), etc. In assembly, all reserved
memory words are identified by labels such as h0, which are effectively memory addresses,
so this type of pointer de-referencing is needed. If a variable is
declared in C, then the watch window will know its type, so it can
simply display its value, instead of the address where it is located. So
if you have a data variable, say unsigned
int Size declared in C, then its value should be correctly
displayed as an unsigned integer if the expression Size is specified as the watch
expression. If you want to see the address of this variable, you can do
so by entering &Size for
the watch expression (the & operator
requests the address of a variable).
- Graph window
Select View->Graph->Time/Frequency. You can visualize the actual filter
coefficient with this function (this shall be described in detail in Lab
2).
A Graph
Property dialog box will pop up. Change the following
settings:
- Graph Title : coefficient This is the label you have chosen to
identify the graph window.
- Start Address : coefDATA This is the label you have chosen for
starting address of your data buffer.
- Acquisition Buffer Size : 80. This is the size of the data buffer.
- Display Data Size : 80. This specifies the portion of the data buffer
you would like to have plotted.
The number specified here must not exceed the acquisition buffer
size.
- DSP Data Type : 16-bit signed integer. This is the data type for
digitized samples.
Click on OK to close this dialog box. The data in your coefficient buffer will be plotted with
respect to time.
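The type casts in the Watch window and the DSP Data Type setting in the Graph window both decide how a raw 16-bit memory word is interpreted. The following Python sketch is not part of the lab files, and its function names are made up for illustration. It mimics that interpretation so you can work out on your PC what value to expect for a given raw word. It assumes that a 32-bit value is stored with its high word first.

```python
import struct


def as_signed16(word):
    """Interpret a raw 16-bit word as signed, similar to *(int *)h0."""
    return struct.unpack(">h", struct.pack(">H", word & 0xFFFF))[0]


def as_unsigned16(word):
    """Interpret the same raw word as unsigned, similar to *(unsigned int *)h0."""
    return word & 0xFFFF


def as_signed32(high_word, low_word):
    """Combine two 16-bit words into one signed 32-bit value.

    Assumption: the high word comes first, similar to *(long int *)h0.
    """
    raw = ((high_word & 0xFFFF) << 16) | (low_word & 0xFFFF)
    return struct.unpack(">i", struct.pack(">I", raw))[0]


raw_word = 0xFFFE  # hypothetical value read from a Memory window
print(as_signed16(raw_word), as_unsigned16(raw_word))
```

The same bit pattern gives two different numbers depending on the cast, which is why the Watch window cannot display *h0 without a type.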
- Verifying with MATLAB:
The filter coefficients can be read into MATLAB using get_coefs.m.
Then
generate a random vector of length 80, in which each element is between
-1 and +1. This random vector can be converted to C55x assembly format
using write_vector_to_file_for_DSP.m.
This script generates a text file containing data that can be used in the assembly
file in CCS, so just copy and paste the contents of this text file into the assembly file. If
you do the FIR filtering in MATLAB, you can compare the MATLAB result and the
DSP result to verify your work.
You will need to convert the floating-point random vector to 16-bit integer
values using round(x*(2^15-1)), where x is the floating-point random vector.
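If you want to cross-check on a PC without MATLAB, the conversion and the reference computation can be sketched in Python as below. This is only an illustration under assumptions: the helper names are hypothetical, the coefficients are placeholders rather than the lab's coefficients, a single output is computed as one sum of products, and no fractional-mode shift is applied. Adjust it to match what your assembly code actually does.

```python
import math
import random

Q15_SCALE = 2**15 - 1  # same scale factor as round(x*(2^15-1))


def round_half_away(p):
    """Round to the nearest integer, with ties away from zero (MATLAB's round)."""
    return int(math.copysign(math.floor(abs(p) + 0.5), p))


def to_q15(x):
    """Convert floats in [-1, 1] to 16-bit integer values."""
    return [round_half_away(v * Q15_SCALE) for v in x]


def mac_sum(coefs, data):
    """Sum of products in exact integer arithmetic (no saturation, no shift)."""
    return sum(c * d for c, d in zip(coefs, data))


def to_words(value):
    """Split the low 32 bits of a result into (high word, low word)."""
    raw = value & 0xFFFFFFFF
    return (raw >> 16) & 0xFFFF, raw & 0xFFFF


random.seed(0)  # fixed seed so the test vector can be regenerated
x = to_q15([random.uniform(-1.0, 1.0) for _ in range(80)])
h = [1000] * 80  # placeholder coefficients, NOT the lab's filter
y = mac_sum(h, x)
high, low = to_words(y)
print(y, hex(high), hex(low))
```

The two words printed at the end are what you would compare against the 2-word output buffer in a Memory window, assuming the result fits in 32 bits.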
Useful Information:
This contains useful advice that has been
compiled from experience in working with these labs. You will probably need
some of this information for this lab, but there is much information here and
it might not make perfect sense at this time. Therefore, it is highly
recommended that you refer to this section again at a later date.
- It is recommended that you become very familiar
with the instruction references. Here are some locations where you can
find instruction set documentation:
- C55x Instruction Summary help menu (this is available in the CCS help menu contents
as TMS320C55x DSP Reference). This help menu can be used to search for
information on individual instructions.
- A PDF document, Mnemonic Instruction Set Reference
Guide (accessible here and
on-line via EE265 Navigation Toolbar -
On-line Tools Reference). It is
recommended that you refer to Sections 1.1 and 1.2 of this document in
order to understand the symbols and abbreviations in the instruction
set documentation. For example, if
the documentation gives an instruction syntax
as “ADD Smem, src”, Section 1.1 of the Mnemonic
Instruction Set Reference can provide more information about
what the symbols “Smem” and “src” represent. Since this is very
important information, this section is also available as course handout
#5.
- You should carefully read the
instruction set documentation for any instruction that you are about to
use. Pay
close attention to the “Execution:” section of the documentation, since this defines exactly
what the instruction does with the arguments and also with many of the
processor's internal registers. This can be difficult to read at first,
because it uses several symbols and abbreviations that you are not yet
familiar with. These symbols and abbreviations are explained briefly in
Section 1.1 of the Mnemonic Instruction Set Reference, which is also available on-line (EE265 Navigation
Toolbar - On-line Tools Reference).
(The rest of this information is intended to be
hints and advice. You don't have to read it, but you probably should at some
point.)
- “Smem” Instructions and
Addressing Modes:
Although
Section 1.1 of the Mnemonic Instruction Set Reference contains much useful information, it can be a bit too
terse for someone who is learning a new instruction set. For example,
the symbol “Smem” is said to mean “Word single data-memory access
(16-bit data access)”. This means that the operand for the instruction
is some word located in memory. This word of memory can be addressed by
practically any addressing mode (indirect addressing, absolute
addressing, etc.). Since Smem is one of the most commonly used symbols,
it is important to recognize this early on. This flexibility in the
choice of addressing mode also applies to other symbols, such as Xmem, Ymem, Cmem, MMR, etc.
- “MMR” Instructions and Syntax
Ambiguity:
There
are many instructions available that operate on MMRs (Memory-mapped
registers). One common example of this is a MOV instruction to store a
constant to AR1: “ MOV #0x0400, AR1”. While
the MMR is listed explicitly in the instruction here, it could also be
listed through indirect addressing as follows “ MOV
#0x0400, *AR2”, where AR2 is a pointer set up to point to AR1. The
result would be the same (if AR2 is initialized properly). This kind of
ambiguity may be a cause of confusion in later labs, so it is worth
mentioning early. It also brings up the very important point
that syntax is critical. Both “MOV #0x0400, AR1” and “MOV #0x0400,
*AR1” are valid instructions, but they mean very different things. This
is a cause of very common problems for assembly coding. It seems that
almost any syntax you use for an instruction has some valid
meaning, but it may not be the meaning that you intended. This means
that the assembler will not complain if you use the wrong syntax for
what you are trying to accomplish, but your code definitely will not
function the way you intend it to.
- Coding Advice:
When
starting out with assembly coding, it is recommended that you write
only one or two instructions at a time and verify that they work as
expected before moving on. If you write 10 or 20 instructions before
you test any of them, it is very likely that you will have a bug
somewhere and debugging will be difficult. Instead, you should use the
Watch Window, the Memory Windows, and the CPU Registers Window to
verify that each instruction works as expected before proceeding.
- Refresh Latency in the
Debugger (NOPs are your friend): When stepping through code in the
debugger, you cannot expect to see the results right away because the
pipeline takes several cycles to produce a result. Therefore you may need
to step forward by several additional cycles to see the result of an
instruction that has already been “stepped-over” in the debugger. This
is very important to recognize, since the debugger will not appear to
be working otherwise. It may be best to insert a few NOPs between
instructions so that the effect of a single instruction can be easily
observed in the debugger. In general, if you want to observe the result
of a single instruction, insert a few NOPs directly after it and step
over them before checking the result. Of course, you'll want to remove
these NOPs before turning in your code; just make sure that the code
still works after the NOPs are removed.
- Debugging RPT loops:
The CCS debugger does not let you step through the individual loop
iterations of a repeat (RPT) loop. This means that you can only see the
end result of all loop iterations whenever you are using a RPT loop. If
an operation is repeated many times (such as repeating MAC 80 times),
it is difficult to determine the reason that the end result is not as
expected. In this case, a good way to debug this is to execute the repeated instruction
only once (RPT #0) and then check the result of that single operation. If
that works as expected, then do two loop iterations (RPT #1) and check
the result. Most bugs should appear by this point, but occasionally you
may have to keep increasing the number of iterations. If the loop works
for several iterations but doesn't work for the full number of
iterations, then you'll want to stop to think about other factors that
can cause this to fail (such as incorrectly used circular
buffering).
- Another way
to debug RPT loops:
Instead of RPT, which is the single-instruction repeat, you can use
the block repeat instruction during debugging. You can step over each
instruction in a repeat block, so it is much easier to debug. The following
code behaves identically to an RPT loop, except that it is somewhat slower and
you can step over each instruction.
    MOV #(loopcount-1), BRC0
    RPTBLOCAL LOOPEND-1
    <repeated instruction>
    NOP
LOOPEND
Since this is only for debugging, don't forget to convert the block repeat
back to RPT after debugging.
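When you debug by increasing the repeat count one step at a time, it helps to know what the accumulator should hold after each iteration. A minimal Python sketch of that bookkeeping follows. The names are hypothetical, and it assumes that the accumulator starts at zero and that your pointers pair coefficient i with data sample i, which may not match your code.

```python
def partial_mac_sums(coefs, data):
    """Return the expected accumulator value after 1, 2, ... MAC iterations."""
    sums = []
    acc = 0  # assumption: the accumulator is cleared before the loop
    for c, d in zip(coefs, data):
        acc += c * d
        sums.append(acc)
    return sums


def expected_after_rpt(coefs, data, rpt_operand):
    """Expected accumulator after 'RPT #rpt_operand' followed by one MAC.

    RPT #0 executes the MAC once, RPT #1 twice, and so on.
    """
    return partial_mac_sums(coefs, data)[rpt_operand]


# Hypothetical small test vectors, not the lab data.
coefs = [1, 2, 3]
data = [4, 5, 6]
for k in range(len(coefs)):
    print("RPT #%d ->" % k, expected_after_rpt(coefs, data, k))
```

Comparing these running totals with the accumulator shown in the CPU Registers window tells you at which iteration the result first goes wrong.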
3. Determining Cycle Count [Basic Profiling]:
This information is needed for determining the performance of your
code (after you have verified that it works correctly).
1. Make sure your program
is loaded into DSP memory (File->Load Program)
2. Click on Profile->Clock->Enable.
Make sure the clock is enabled.
3. Click on
Profile->Clock->View.
4. Double click in the
margin of your simpleFIR.s55 file to place a breakpoint in your code. Place a
breakpoint on the line immediately after "bset SXMD" instruction
(your first line of code). <Break Point 1>
5. Place another
breakpoint at the end of your code, on the following line: <Break Point 2>
B DONE ;This is an infinite loop!!!!
6. Run your code to
<Break Point 1>
7. Reset clock
8. Run your code to
<Break Point 2>
9. Observe the cycle count
in the clock. Let this cycle count be <cycle1>.
10. Then measure the cycles
from "Done: nop" to "B Done";
this is the cycle count for the NOPs alone. Let this be <cycle2>.
To do that, first set a breakpoint at "Done: nop"
<Break Point 3>, then restart the code, run to <Break Point 3>, reset the
clock, and run to <Break Point 2>.
11. <cycle1> -
<cycle2> is the performance of your code.
12. Compare this with a
hand calculation of the cycle count you would expect from
the instructions in your program. Are they different? Why?
NOTE: for the profiling, use <F5> instead of <F10>.
D. Concluding
Remarks:
Although you may not
have understood some aspects of the tools used in this lab, we hope to have
conveyed to you how these tools are tied together to create a working environment
for analyzing and creating code for the DSP processors. You will become very
familiar with these tools as you use them over and over again throughout this
course. This lab is important in that it serves as the primary reference for
running the tools, so you should refer to it in later labs.
The EVM board is
usually not the complete solution. Often the DSP processors are integrated into
another system without a physical connection to a PC. Examples of stand-alone
systems include portable devices such as cell phones, modems, electric pianos,
network switches and routers, wireless base stations, etc. What you have seen
in this lab is only a small part of what you can do with DSP processors. It
just happens that the PC you are running the software on is very powerful. This
means that the PC can support the same digital signal processing algorithms
that the DSP can support. But you must realize a few things: 1) the PC's CPU is much
more expensive, and 2) it is not designed for computing DSP algorithms and is
not cost effective to implement in portable devices.
VI.
Lab Demonstration
You'll have to prove
to yourself that your code is working before you demonstrate it to us, the TAs. We are also interested in how you prove to yourself
that your code is working. That is usually our approach to checking your labs.
In this particular lab, you should first use all 1's as your input data, then
test with random data. You should verify your result using MATLAB for both
cases.
The following are
some short-answer questions related to the readings. They are part of the Lab1
write-up.
1. Name the 3 memory
spaces on the C55x processor. Also describe the number of bits for addressing,
the total amount of address space, and the granularity of addressing for each
memory space.
2. Where are the memory
mapped registers located (memory space and address range)?
3. Which memory segment do
these directives map to (program or data or both, initialized or
uninitialized)?
- .bss
- .text
- .data
- .sect
- .usect
4. What is dual-access RAM
and how big is it in C5509A DSP?
5. What are the available
addressing modes for instructions with 3 memory operands (Xmem, Ymem, and Cmem), such
as dual-MAC, and what constraints on the memory allocation should be
considered?
6. Addressing Modes: Find
out the following details about each of the following instructions (try View->Mixed
Source/ASM):
Instruction              #bytes    #cycles    addressing mode(s)
------------------------------------------------------------------------------------------
RPT #256
RPT #255
AMOV #8000, XAR2
AMAR *AR1(#1), XAR0
7. Explain the differences
between the following two instructions for setting an auxiliary register, in terms of
the bits modified in the auxiliary register, the length of the instruction, etc.
AMOV #0x00A008, XAR2
MOV #0xA008, AR2
8. What are the data at
addresses 0xa000/0xa001 and the value of AR3 after execution of "MOV AC0, dbl(*AR3+)"?
1) AC0 = 0xFFFF1111, AR3=0xa000
2) AC0 = 0xFFFF1111, AR3=0xa001
Last modified: 10:43 am 10/10/2008