You will be hearing a lot about the term DSP throughout the course. What is DSP? Why is it important? Why would you process signals digitally instead of leaving them in the analog domain? When should you use DSP? What are the advantages, disadvantages, and tradeoffs? What are DSP processors, and how do they differ from microprocessors? This section briefly discusses these questions; you should understand the issues in greater detail by the end of this course.
DSP stands for digital signal processing and it often involves taking an analog signal, converting it to samples of digital values, processing the digital data, and converting it back to analog for output. We often consider DSP along with the analog front and back-ends since that is what we ultimately hear and see. Why do we process signals digitally rather than working with the original signal in the analog domain? The answer is it depends on the system and its requirements. For some systems, working with the analog signals gives a better solution. For others, DSP is better. It is up to you, the designer, to make the decision based on materials you learn through this and other courses. Here are some things to consider in making the tradeoffs:
o Operating conditions - temperature, frequency, supply voltage
o Radiation/interference
o Out of band interference
o Aliasing
o ADC (analog to digital converter) precision and filter noise
o DAC (digital to analog converter) filter order
o More complex algorithms may be used. Some may be unattainable with analog systems (examples: compression, data analysis, and data synthesis).
o Flexibility in choice of algorithms.
o Reprogrammable.
o Fixed point - loss of precision due to limited datapath
o Floating point uses more power
o Reliability not dependent on operating conditions
o Reusability of hardware
o Throughput and latency - some systems require that DSP be performed in real time. This is a throughput constraint that requires the system to have finite storage and equivalent input and output data rates. This does not mean that an input signal will be immediately processed to generate an output signal. On the contrary, most DSP systems generate the output corresponding to an input signal after some delay. The maximum latency tolerated by a system is part of the specification.
o Storage of intermediate results - on-chip memory needed (may be a limiting factor)
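To make the fixed-point precision item in the list above concrete, here is a minimal Python sketch that rounds a value to a 16-bit fixed-point code and measures the error. The Q15 format (one sign bit, 15 fractional bits) and the helper names `to_q15`, `from_q15`, and `quantization_error` are assumptions chosen for illustration; the list itself does not name a format.

```python
Q15_SCALE = 1 << 15          # 32768: one sign bit, 15 fractional bits (assumed format)
Q15_MAX = Q15_SCALE - 1      # largest representable code (0x7FFF)
Q15_MIN = -Q15_SCALE         # smallest representable code (0x8000 as two's complement)


def to_q15(x):
    """Round x (nominally in [-1, 1)) to the nearest Q15 code, saturating at the limits."""
    code = int(round(x * Q15_SCALE))
    return max(Q15_MIN, min(Q15_MAX, code))


def from_q15(code):
    """Convert a Q15 code back to a floating-point value."""
    return code / Q15_SCALE


def quantization_error(x):
    """Absolute error introduced by a round trip through Q15."""
    return abs(x - from_q15(to_q15(x)))
```

For values inside the representable range, the round-trip error is at most half of one least significant bit (2^-16). Values at or above +1.0 cannot be represented and saturate to the largest code, which is one reason fixed-point designs need careful scaling.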
DSP Processors vs. General-purpose Microprocessors:
How do DSP processors differ from microprocessors? Both types of processors perform computations on digital signals. The main difference is that DSP processors are tailored to process data signals, whereas microprocessors are designed to reduce the amount of computation in a general computing environment, where most of the signals being processed pertain to some form of program control. DSP processors are also designed to consume less power in their target applications. Products that use DSPs include DVD players, camcorders, cell phones, wireless base stations, modems, karaoke players, etc.
The separation between DSP processors and microprocessors is narrowing, since microprocessors are approaching, if not exceeding, the throughput capabilities of DSP processors. In the personal computer industry, microprocessors are taking over several computational tasks that used to be cost effective only when implemented with DSPs. This results in a narrowing market segment for DSP processors in the PC industry. However, DSP processors remain highly competitive in non-PC areas where high-throughput, low-power computing is required.
This course integrates a number of different topics in digital signal processing (DSP): DSP processor programming, DSP algorithm implementation, and performance considerations. Understanding the programming methodology of DSP processors is relatively simple. Writing optimal code, however, requires an understanding of the DSP algorithms as well as the capabilities of the DSP processors. Although this course covers the programming methodology for TI's TMS320C55x DSP processor (which is one of the most popular DSP processors currently in use), the fundamentals of DSP programming extend to other DSP processors as well. Furthermore, the design principles you will learn in coding a DSP processor will be applicable to DSP designs in application-specific hardware.
The purpose of this lab is to introduce the design flow and basic programming methodology for working with DSP processors (in particular, TI's TMS320C55x DSP processor). This lab consists of an introduction, a couple of DSP demos, and a brief programming exercise. The introduction is not intended to cover everything you need to know about DSP programming, but to provide you with a working knowledge of TI's C55x DSP and its tool environment, and enough material for understanding TI's C55x reference materials.
This section provides an overview of the C55x processor. The architecture and the instruction set of the C55x will be discussed first, followed by an introduction to the programming tools used in the EE265 lab. As an overview, the following discussion does not provide detailed explanations of the topics discussed. For detailed descriptions of both the architecture and the instruction set you will need to refer to the C55x manuals. For now, you should read the following section before reading the manuals. Section IV of this lab lists assigned reading, which should help fill in many of the details.
As a first introduction, the information you are exposed to in this lab can be rather overwhelming. However, since much of this material will not be discussed again in later labs, you are highly encouraged to read the material now and then reread it later in the course after you become more comfortable with the C55x processor. It may help to make note of which parts are confusing so that you can clear up the confusion later, when understanding is required by the lab exercises.
The most pressing issue in the design of modern high-speed digital logic is the high power consumption of large memories. Thus, the architecture of the DSP is designed around efficient memory access and utilization. Unlike other DSP processors, the C55x processor architecture is based on a unified memory system, although the program memory and the data memory are logically separated. This means that the program memory and the data memory are separately addressed: the program memory space is byte addressable with a 24-bit address and the data memory space is word addressable with a 23-bit address, but the physical memory is shared by the program memory space and the data memory space. This memory may include both on-chip and off-chip memories. Take a minute to examine the C55x architecture in Figure 1. Notice that there are four read buses, three for data and one for program code. Specialized instructions allow data access using all three data buses in one clock cycle. Part of the data memory that is on-chip is dual-ported and is referred to as DARAM (Dual-Access RAM) in the manuals. This means that this part of the data memory can be accessed twice in the same clock cycle. This is useful for implementing DSP algorithms where multi-operand instructions are predominant.
Take note of the four basic blocks of the C55x CPU: the Instruction Buffer Unit (I Unit), the Program Flow Unit (P Unit), the Address-data Flow Unit (A Unit), and the Data Computation Unit (D Unit). You should read more about the first three units in the TMS320C55x Technical Overview and the TMS320C55x DSP CPU Reference Guide.
Notice that there are six pairs of buses interconnecting the core and the memory units. Five of these buses are for the data memory and the sixth one is for the program memory. Though there are five data buses, three read and two write, this does not mean that the processor uses all buses every clock cycle. Rather, there are a limited number of instructions that can make use of the full memory bandwidth. The CB and DB read buses and the EB and FB write buses can be used jointly to access single 32-bit values or individually to access two 16-bit values. The BB read bus is used primarily by the special dual-MAC instructions, and can access a single 16-bit value stored in internal memory (only in internal memory!).
Take a minute to study the architecture of the C55x Data Computation Unit (D Unit) in Figure 2. This figure can tell you a lot about the capabilities and limitations of the C55x processor. First, notice the computational building blocks of the C55x processor. These include the shifter, a 40-bit arithmetic logic unit (ALU), two 17x17-bit multiply-accumulators (MACs), and four accumulator registers. Next, note the interconnections between these blocks. These tell us how data can be transferred between the computing units and how the processor core can be configured in each cycle. Of special note is the fact that while there are two MAC units, there are only three data read buses. Thus, while the TI C55x can perform two multiply-accumulate operations in a single clock cycle, at least one of the input operands must be shared between the multipliers. Furthermore, this shared operand must be stored in internal memory, as it is transferred on the BB data bus.
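The shared-operand constraint fits FIR filtering well, because two consecutive outputs use the same coefficient at each tap. The following Python sketch is a behavioral model only, not C55x code; the function names and the zero-padding convention for samples before the start of the input are assumptions made for illustration. Each inner step fetches one coefficient (the shared operand) and two different data samples, mirroring the one shared plus two independent operand reads described above.

```python
def fir_reference(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], with x[m] = 0 for m < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_dual_mac(x, h):
    """Same filter, producing two outputs per pass with one shared coefficient fetch per tap."""
    def sample(m):
        # Samples outside the input are treated as zero (assumed convention).
        return x[m] if 0 <= m < len(x) else 0

    y = [0] * len(x)
    shared_fetches = 0
    for n in range(0, len(x), 2):
        acc0 = 0
        acc1 = 0
        for k in range(len(h)):
            coeff = h[k]                         # fetched once, used by both multipliers
            shared_fetches += 1
            acc0 += coeff * sample(n - k)        # first MAC:  contributes to y[n]
            acc1 += coeff * sample(n + 1 - k)    # second MAC: contributes to y[n+1]
        y[n] = acc0
        if n + 1 < len(x):
            y[n + 1] = acc1
    return y, shared_fetches
```

In this model the number of coefficient fetches is roughly half of what a one-output-at-a-time loop would need, which is the benefit the dual-MAC arrangement is meant to deliver.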
The two MACs use a 17x17-bit multiplier, so that signed values can be multiplied without additional bit-level operations. The output of the multiplier is fed into a 40-bit adder to generate a 40-bit result, which can then be optionally saturated (to 32 bits or 40 bits). The combination of multiplier and adder enables a single-cycle multiply-accumulate (MAC) operation and is very useful for filtering operations. The ALU is independent of the MAC units. It is capable of performing logic operations as well as additions. Note that one of the inputs to the ALU is taken from the barrel shifter. This means that the data word can be shifted prior to addition or logic operations. Additionally, notice that the ALU is 40 bits wide. It can perform a single arithmetic operation on a 40-bit value (i.e., from the accumulator) or two arithmetic operations on 16-bit values. As you explore the TI C55x instruction set, you'll note that the size of the operands for the various instructions is typically specified.
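The effect of the accumulator width and the optional saturation can be sketched in a few lines of Python. This is a simplified arithmetic model under stated assumptions: operands are 16-bit signed integers, saturation is applied after every accumulate, and the names `saturate`, `mac`, and `dot` are invented for the example. It does not model the status-register bits that select the saturation mode on the real processor.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed two's-complement number with the given width."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def mac(acc, a, b, sat_bits=40):
    """Return acc + a*b for 16-bit signed operands, saturated to sat_bits."""
    for operand in (a, b):
        if not -32768 <= operand <= 32767:
            raise ValueError("operands are assumed to be 16-bit signed values")
    return saturate(acc + a * b, sat_bits)


def dot(xs, hs, sat_bits=40):
    """Accumulate a sum of products one MAC at a time."""
    acc = 0
    for x, h in zip(xs, hs):
        acc = mac(acc, x, h, sat_bits)
    return acc
```

With four products of (-32768) x (-32768), the 40-bit accumulator holds the exact sum 2^32, while 32-bit saturation clamps the result at 2^31 - 1. The 8 extra bits above bit 31 act as guard bits that let a long sum of products grow without overflowing.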
You will find instructions that utilize multiple accumulators, the ALU, the Barrel Shifter, and one or both MACs. Such instructions are very useful for certain applications like Viterbi decoding. The ability to perform these types of instructions makes TI's C55x a very powerful DSP since they utilize most of the processor's core resources. However, it is up to the DSP engineer to choose the appropriate algorithms tailored to the DSP architecture. Otherwise, almost any algorithm can be implemented with simple add and multiply instructions. An optimal DSP code is one that results in the lowest power consumption. This is equivalent to a code that executes with minimal number of cycles and one that uses instructions having high processor utilization.
In the C55x, the program control and addressing circuitry are split into three units: the Instruction Buffer Unit (I Unit), the Program Flow Unit (P Unit), and the Address-data Flow Unit (A Unit). C55x instructions can be up to 6 bytes long. However, examining the architectural diagram in Figure 1, you'll notice that the program data bus is only 32 bits wide! The instruction buffer queue stores up to 64 bytes of program code; as many instructions are shorter than 4 bytes, this means that instructions longer than 4 bytes can often be executed without the extra cycle of latency that would be required in the absence of the instruction buffer. However, if the instruction buffer is emptied by a sequence of long instructions or as a result of a branch or call, and the next instruction to be executed is longer than 4 bytes, a one-cycle delay will ensue. Furthermore, code loops that fit entirely within the instruction buffer queue can be executed efficiently, not only by eliminating potential delays in fetching long instructions, but also by forgoing the energy required for code memory accesses.
Many algorithms access data by way of address pointers (much like C/C++-style pointers). The A Unit contains a 16 bit ALU which gives the C55x the ability to dynamically update address pointers without taking any additional cycles to perform pointer arithmetic such as adding constants to a pointer or incrementing modulo some value.
The C55x has three memory spaces: program (page 0), data (page 1), and I/O (page 2). Make note of the page numbers associated with each memory space because you will need to know them for coding. As instructions are specified in 8-bit (1-byte) chunks, the program memory space is addressed at the byte level, with 24-bit addressing. As data is processed in 16-bit words, the data memory space is addressed at the word level, and hence data memory addresses are 23 bits wide. The actual amount of internal memory available on a C55x processor depends on the particular model; the C5509A, the processor used in this class, has 8 blocks of 8KB of DARAM (64KB in total) and 24 blocks of 8KB of SARAM. Note that for full data bandwidth, some instructions require different operands to be stored in different blocks of memory.
A more detailed description of the C55x memory space is available in Chapter 3 - Memory of the CPU Reference Guide and in the TMS320C5509A Data Manual.
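The relationship between the two addressing schemes comes down to a factor of two. The short Python sketch below works through the arithmetic; the helper names are hypothetical, and the mapping "byte address = 2 x word address" is an assumption that follows from both spaces sharing the same physical memory, so check the CPU Reference Guide for the exact mapping on your device.

```python
PROGRAM_ADDRESS_BITS = 24   # program space: byte addressed
DATA_ADDRESS_BITS = 23      # data space: 16-bit word addressed
BYTES_PER_WORD = 2
BLOCK_BYTES = 8 * 1024      # one 8KB memory block


def word_to_byte_address(word_addr):
    """Byte address of a data word, assuming both spaces view the same physical memory."""
    if not 0 <= word_addr < (1 << DATA_ADDRESS_BITS):
        raise ValueError("word address does not fit in 23 bits")
    return word_addr * BYTES_PER_WORD


def byte_to_word_address(byte_addr):
    """Return (word address, byte offset within that word) for a program byte address."""
    if not 0 <= byte_addr < (1 << PROGRAM_ADDRESS_BITS):
        raise ValueError("byte address does not fit in 24 bits")
    return divmod(byte_addr, BYTES_PER_WORD)


# Both address widths span the same number of bytes.
program_space_bytes = 1 << PROGRAM_ADDRESS_BITS
data_space_bytes = (1 << DATA_ADDRESS_BITS) * BYTES_PER_WORD

# On-chip memory sizes quoted for the C5509A in the text.
daram_bytes = 8 * BLOCK_BYTES
saram_bytes = 24 * BLOCK_BYTES
```

A 24-bit byte address and a 23-bit word address therefore cover the same 16MB range, which is why one bit fewer is enough for the data space.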
The program memory is pointed to by the Program Counter (PC) which references the next instruction to be executed. There are also other auxiliary and status registers that are associated with the program memory space. They are primarily used for program flow control such as branching or conditional execution.
The data memory space is associated with eight 16-bit auxiliary registers, AR0, AR1, ..., AR7. These registers are used primarily as pointers, just like the pointers found in high-level languages such as C/C++. By functioning as pointers (using a form of addressing called “indirect addressing”), the address registers enable faster instructions to be implemented (more on this later). The A Unit data address generator, DAGEN, is dedicated to operating on the address registers. Automatic increments, decrements, modulo updates, indexed increments, bit-reversed addressing, etc., are supported by DAGEN. These all make data manipulation even more flexible with the C55x. Recall that the data space uses 23-bit addressing. When the address registers are used for indirect addressing, the high 7 bits of the address are taken from the high 7 bits of the extended auxiliary registers XAR0, XAR1, ..., XAR7, which are set using a 23-bit constant (k23).
Tables 6-4 and 6-5 in the TMS320C55x DSP CPU Reference Guide list the possible variations of indirect addressing using address registers. Notice that there are restrictions on the set of address registers that can be used with two-operand instructions (such as instructions that multiply two numbers that are both stored in memory) and with parallel instructions; the “Dual AR Indirect Addressing Modes” are shown in Table 6-7. Instructions that use three operands from memory require special addressing, using the “coefficient data pointer” (CDP). Before using instructions that use the CDP (denoted by the Cmem notation in the reference guide), you should read and understand the details given in section 6.4.3.
For those who are curious, here is a brief explanation of why address registers enable faster instructions. The C55x processor does not have a fixed instruction length; instructions can be up to 6 bytes long. However, there are two reasons why longer instructions are undesirable. First, memory access is a major component of energy consumption; the more times memory is accessed in a battery-powered device, the shorter the battery lifetime will be. Second, since the processor can only fetch two words (32 bits) of program memory each clock cycle, an instruction that is more than two words long is less likely to be executed in a single clock cycle. This means that instructions can be both more energy efficient and faster if they can be packed into fewer bits (and therefore fewer words). Some of the bits that make up an instruction are used to tell the instruction decoder what type of operation is supposed to be performed, while the remaining bits typically tell the processor what the operands are for the instruction. For example, if a load instruction is used, the operand to be loaded can be specified by encoding its address directly in the instruction. For the C55x, addresses are 16 bits, so encoding this address directly in the instruction (so-called “absolute addressing”) would result in an instruction that is at least 2 words long. Alternatively, if the operands are addressed by way of pointers (“indirect addressing”), then the instruction only needs to encode enough bits to indicate which pointer to use. Since there are only 8 address registers available for use as pointers, this takes only 3 bits. Even more detail: in the case of dual-operand addressing, the instruction needs to specify two operands by way of pointers, so more bits are needed, and thus there are more restrictions on the potential pointer manipulations.
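The pointer updates mentioned above can be modeled in a few lines. The Python sketch below is a behavioral illustration only: the function names and the way the buffer base and size are passed as arguments are assumptions, whereas the real DAGEN takes this information from dedicated registers described in the CPU Reference Guide.

```python
def circular_increment(pointer, step, base, size):
    """Advance pointer by step, wrapping so it stays inside [base, base + size)."""
    return base + (pointer - base + step) % size


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (the access order used by radix-2 FFTs)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def extended_address(high7, ar16):
    """Form a 23-bit data address from 7 high bits and a 16-bit auxiliary register value."""
    return ((high7 & 0x7F) << 16) | (ar16 & 0xFFFF)
```

Circular updates keep a pointer inside a delay line without explicit compare-and-reset code, and bit-reversed indexing produces the reordered access pattern that FFT routines need. Doing both in the address generator is what lets the pointer update happen alongside the data access.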
Recommended: For more information on addressing, read all of Chapter 6 - Addressing Modes of the TMS320C55x DSP CPU Reference Guide.
Inside the C55x processor core are a number of registers pertaining to the control and configuration of the DSP processor as well as communication with peripheral devices. These registers can be used to monitor the status of the processor and to configure it. To simplify programming, these registers are mapped into the data memory space, which is why they are called memory-mapped registers (MMRs). This means that instructions that work with the data memory can access and operate on the information contained in the MMRs. The instruction set also has several instructions that can only be used to operate on MMRs. C55x MMRs are listed in Table 2-1 of the CPU Reference Guide. Of particular note are the four status registers, ST0_55 through ST3_55. These registers control and report many basic operations. Addressing, conditional flags, overflow mode, sign extension, saturation, rounding, circular addressing, fraction mode, global interrupt enable, shifting, and much more are all accessible through these registers.
Overview:
The C55x instruction set can be summarized as: too many choices. There are many different types of instructions available, and for any given instruction there may be many different variations to choose from. For example, a look at the instruction set documentation will reveal that there are more than 20 variants of the addition operation (including multiple versions of ADD, ADDV, ADD::MOV, ADDSUB, ADDSUBCC, and ADDSUB2CC). The number of choices may appear daunting at first, but the availability of the many variations means that there is usually an instruction available that will do exactly what you need. For example, if you need to add two numbers, then there is probably a specific instruction available to do this for you, regardless of where the numbers are stored. Some examples: 1) adding two numbers that are both stored in memory, 2) adding a single number stored in memory to a number in an accumulator, 3) adding a constant to an accumulator, or 4) adding a value to an address pointer. The significance of all of this is that if you know you need to perform a certain type of operation, you need only find the proper version (the proper syntax) to use for a particular situation. Even more importantly, the availability of many different instructions results in more efficient code, since many operations can be done in a single cycle instead of using several cycles to perform a task. For example, if you were not allowed to add two numbers that were both stored in memory, then you would first need to load one of the numbers into an accumulator and then perform the add. This would take a minimum of two cycles for the load and add, whereas the dual-operand version that adds two numbers directly from memory could be done in a single cycle.
DSP-specific and Application-Specific Instructions:
Much of the C55x instruction set consists of common instructions such as Load/Store, ADD, Multiply, etc., but there are also many instructions that are available specifically for DSP operations. These DSP-specific instructions are the reason why DSP processors can be more efficient than general-purpose processors. They allow certain DSP operations to be performed using fewer instructions (and fewer clock cycles) than would be required with general-purpose instructions. There are also instructions that are application-specific, in that they are available primarily to speed up certain specific DSP algorithms (such as FIR filtering with symmetric coefficients or Viterbi decoding). Although these instructions may have been targeted at certain specific algorithms, they can sometimes be exploited for other uses as well.
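To see why symmetric coefficients are worth special hardware support, consider the arithmetic such an instruction can exploit. The Python sketch below is an algorithmic illustration, not a model of any particular C55x instruction; the function names, the even tap count, and the `window[k] = x[n-k]` layout are assumptions made for the example. When h[k] equals h[N-1-k], the two samples that share a coefficient can be added first and multiplied once.

```python
def fir_direct(window, h):
    """One FIR output with one multiply per tap; window[k] is assumed to hold x[n-k]."""
    return sum(coeff * sample for coeff, sample in zip(h, window))


def fir_symmetric(window, h):
    """One FIR output for symmetric h, using one multiply per coefficient pair."""
    n_taps = len(h)
    if n_taps % 2 != 0:
        raise ValueError("this sketch assumes an even number of taps")
    if any(h[k] != h[n_taps - 1 - k] for k in range(n_taps // 2)):
        raise ValueError("coefficients are not symmetric")

    acc = 0
    multiplies = 0
    for k in range(n_taps // 2):
        paired = window[k] + window[n_taps - 1 - k]   # pre-add the two mirrored samples
        acc += h[k] * paired                          # single multiply-accumulate per pair
        multiplies += 1
    return acc, multiplies
```

The folded form produces the same output with half the multiplications, which is the saving an application-specific instruction for symmetric FIR filters is designed to capture.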
Optimization, Options and Pitfalls:
The abundance of available instructions means that there is a lot of room for optimization depending on which instruction is chosen for a particular task. (This also means that it is very difficult to design a C-to-assembly compiler for DSPs.) Being familiar with the available instructions can make programming easier and more efficient. It is essential for the DSP programmer to have a good understanding of all the options and pitfalls.
Instruction Types:
The C55x instruction set can be broken down into the following categories:
The pipeline of the C55x is discussed in Section 4.4 of the TMS320C55x DSP Programmer's Guide. The C55x makes programming quite easy by protecting against almost all potential pipeline conflicts. Thus, under normal circumstances, the pipeline will not introduce any problems. However, if you find that your code takes longer than expected to run and you are not aware of how the C55x handles pipelining, or even what a pipeline is, debugging will be quite difficult. The rule of thumb for debugging assembly code on a pipelined processor is this: when you run into an instruction that does not make any sense during debugging, add several NOP (no-operation) instructions before the instruction or check the conditional flags. (If you don't understand pipelining, this sentence will probably not make any sense. Keep it in mind and you will understand later.)
A majority of the labs in this course involve the use of interrupts. We will be discussing interrupts in greater detail in Lab 2. Essentially, interrupts are used as signals to the processor to do things other than what the processor is currently doing. The sequence of events for processing an interrupt is as follows.
A number of things need to be configured for interrupts to work properly. As implied in the previous paragraph, you need to set up the interrupt vector and the stack, and you need to code the ISR itself. You will also need to set up the interrupt mask register (IMR) and the interrupt flag register (IFR). Also, before the main program starts, you will need to enable the global interrupt mask (INTM). You will learn how to do this by going through the exercises. Sometimes this setup can be done automatically for you by the programming tools, as will be seen in later labs.
The IMR is used to selectively enable and disable interrupts. In other words, the IMR configures the C55x processor to listen to certain interrupts during normal processing. The IFR indicates which interrupts have active requests so that you can find out if an interrupt occurs while another interrupt is being serviced. The IFR can also be used in another method of handling external signaling called polling. Polling is different from interrupts in that a new interrupt sets the appropriate IFR bit, but it does not stop the current program flow. Instead, the program checks the IFR to see if an interrupt request is active. If so, an equivalent “ISR” is executed. This results in more predictable program flow, since the program is never actually interrupted. Finally, the global interrupt mask (INTM) is a convenient way to enable (or disable) all interrupts that are selected with IMR.
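The interplay between the flag register, the mask register, and the global enable can be captured in a toy model. The Python sketch below is a simplification for illustration only: the class and method names are invented, the global enable is modeled as a plain boolean rather than with the polarity of the actual INTM bit, and the rule that the lowest-numbered pending interrupt is serviced first is an assumption, not a statement about C55x priorities.

```python
class InterruptModel:
    """Toy model of per-interrupt flags (IFR), per-interrupt enables (IMR), and a global enable."""

    def __init__(self):
        self.ifr = 0                # one pending-request bit per interrupt
        self.imr = 0                # one enable bit per interrupt
        self.global_enable = False  # stands in for the global interrupt mask

    def request(self, n):
        """A device raises interrupt n; the flag is latched whether or not it is enabled."""
        self.ifr |= 1 << n

    def next_serviceable(self):
        """Return the lowest-numbered pending and enabled interrupt, or None."""
        if not self.global_enable:
            return None
        active = self.ifr & self.imr
        if active == 0:
            return None
        return (active & -active).bit_length() - 1

    def service(self, n):
        """Run the (imaginary) ISR for interrupt n and clear its flag."""
        self.ifr &= ~(1 << n)

    def poll(self, n):
        """Polling: the program tests the flag itself, handles it, and clears it."""
        if self.ifr & (1 << n):
            self.service(n)
            return True
        return False
```

A masked interrupt still sets its flag, so a program can leave it disabled and check the flag at a convenient point in its main loop. That is the polling approach described above, and it trades response time for a predictable program flow.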
Different C55x models have different numbers of interrupts. The maximum number of interrupts supported on the C55x is 32, most of which are inactive. There are also two types of interrupts: software and hardware. Software interrupts may be used to indicate that an event occurred in the program. Hardware interrupts are generated by physical devices. There are different types of hardware interrupts. Three hardware interrupts are user specified. The user may connect a signaling wire to the DSP hardware to control the DSP through these interrupts. Other hardware interrupts are used for device-to-device communication, such as serial ports and buffered serial ports. We will focus on the hardware interrupts in this course.
Assembly is a low-level programming language. It is dedicated to a certain type of hardware. For example, DSP assembly code cannot run unmodified on a Sun workstation. Assembly coding does not have the conveniences of a high-level programming language, where you are provided with highly abstracted constructs like objects or even certain simple functions. Like any programming language, it needs tools for development, such as compilers, simulators, and debuggers. Often, the actual DSP hardware is required to verify the real-time capabilities of the assembly code.
In this course, you are provided with the following:
The code for a typical assembly-based project usually consists of the following:
Chapter 1: CPU Architecture
Chapter 6: Addressing Modes
Chapter 2: Tutorial (optional)
Chapter 4: Optimizing Assembly Code
This section takes you through the lab computers and discusses where to find things and what to look for.
We highly recommend that you read through the lab and do much of the lab work outside of the lab, as the computer resources as well as the TA resources are limited. You may find that some labs are rather lengthy. It is important that you work efficiently inside the lab since there are a limited number of lab stations. Try to use the lab for debugging your code and executing the experiment. If this is not possible, try at least to have as much of your algorithm roughly coded as possible before coming into the lab. This is especially important in later labs, most of which are project-oriented and require extensive coding.
We ask that you do not lock the lab PCs in order to reserve a lab machine. In fact, the TAs and the system administrators are responsible for logging off any locked lab PCs if there are none available. If such a situation arises, the user will be logged off, and any programs that are running will be terminated. The TA will not be able to tell if there are important programs running.
You will be assigned an account on the EE network. This will allow you to log in on any of the lab computers in Packard 001 to access your ee265 account. You are responsible for keeping your files in your account folder and not leaving random files on the local machine, especially the C:/ drive. You may use the local C:/ drive for TEMPORARY storage since, on occasion, the EE network may be down, preventing you from working on the labs. Files left on the C:/ drive will be removed by the system administrator on a regular basis.
Each user will have a personal network directory. This directory is already mapped to the Z:/ drive for you and will be available to you whichever lab machine you are on. You may access this working directory through the "My Computer" icon. We will refer to this working directory as Z:/ in this and subsequent labs. If you don't know how to manipulate files under the Windows operating system, please ask the TAs. For the sake of system administration, please keep ALL your lab files under your working directory. All user files outside of your working directory may be deleted at any time.
The user privileges are setup such that the different ee265 accounts are private, so you may not be able to install certain programs. If you find something that you would like to install but couldn't or you would like those programs accessible on the EE domain, let us know and we will install it on the server.
All the lab materials are stored on this website. Download these files to your personal directory before doing lab assignments. Files for lab1 are
You are to demonstrate your code to the TA on the due date (usually a Thursday). Demonstrations must be done during the TA office hours. The TAs have allotted office hours on Thursdays for this, and we expect that you will be able to find a time slot to meet with them. During the meeting, you are asked to set up and demonstrate your lab. Upon completion, you will need to provide a brief project report (templates will be provided each week in the handouts section of the webpage). You will also be expected to attach a hardcopy of your code (and linker command file) to the lab submission for grading purposes.
Your code will be graded based on a number of criteria: performance in terms of cycle time and code size, coding style, and lab write-ups. We are not asking you to spend a lot of time on comments and style, but enough for you and the TA to understand.
In this lab, you will first be introduced to the CCS tool environment from the perspective of an experienced user. From it, you will be exposed to the do's and don'ts of the lab equipment and software. Then, you will go through two sample programs for a demonstration of the C5509 DSP board. Finally, you will be asked to write and test a simple FIR filter. You will also need to answer many questions at the end of this lab.
In this section, we will take you through the tool environment, how to run CCS, what to look for, and precautions to take when working with the hardware. There is a CCS tutorial available. However, it is time consuming to go through. Besides, exact instructions on what to do in CCS are provided in the lab exercises.
Code Composer Studio (CCS) is an integrated development environment specifically designed for TI DSP processors. It also has extended features for hardware debugging. It facilitates communication of data between the host and the EVM board and enables monitoring of the board status information. Like all development environments, the combination of CCS and the C5509A DSK board is sometimes buggy, and it will take time to get used to the development flow. HINT: Often, repeating the same steps may actually make something work.
You can find all the important documentation you will need on-line through the TI homepage, http://www.ti.com/. To simplify the search for information, we have downloaded the on-line references (most of them in the Adobe Acrobat PDF format) and made them available on the EE network server through the class homepage. In addition the CCS Help menu is VERY USEFUL and contains searchable information. With a few clicks, you can locate the description of the ST0_55 register or the MPY::MPY instruction.
Before you run CCS, you need to power up the EVM board and check its connections. Previous EVM boards proved rather fragile. We hope that, with care and the lessons learned from those boards, fewer boards will be broken in the early weeks of this class.
Look around the workbench for the EVM. It is marked TMS320C5509A DSK. The EVM has a power socket, USB connectors, stereo audio jacks, and a power supply.
In order to use the EVM with CCS, you will first need to
This section provides you with the REALLY useful knowledge that is entirely experience based and you will not find it in any of TI's documentation. It is important that you go through this section in detail even if you don't understand some of the terminologies. This section will save you time later on when you encounter strange anomalies in CCS that you cannot comprehend. It is highly recommended that you refer to this section when you encounter problems in this lab or any future labs.
Do not poke around and start the CCS setup utility. If you do, CCS will still run but the device driver for RTDX (you will learn about this later) may break. The staff will have to reinstall CCS to make it work again. If RTDX does not work for you, check to see if the board is broken first and report the problem to the TA.