Level: Intermediate
Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio
22 Feb 2007
Continue looking in depth at the Cell Broadband Engine™ (Cell BE) processor's synergistic processor elements (SPEs) and how they work at the lowest level. This installment explores storage alignment issues and the communication facilities of the SPEs.
Non-aligned loads and stores
Because the synergistic processing unit (SPU) is focused on vector, not scalar, processing, it can only
load and store 16 bytes at a time (the size of a register), and only at local store
locations that are aligned on 16-byte boundaries. Therefore, you cannot just
load a word from, say, memory location 12. To get that word, you would
need to load the quadword at memory location 0, and then shift the bits so that
the value you want is in the preferred slot. Storing a word there is more work
still: the original quadword must be loaded, the new value inserted into the
right location in the quadword, and then the result stored back. Because of
these issues, it is usually advisable to store all data aligned to 16 bytes.
Loading a value that crosses a 16-byte boundary is even more difficult, as you
would have to load it into two registers, shift them, and then mask and combine
them. Storing such values is harder again, so it is best never to use values
that cross 16-byte boundaries.
While it allows you to use
data that is not aligned to 16-byte boundaries, the loading and storing technique I will discuss requires that the data be
naturally aligned to prevent it from crossing the 16-byte
boundary. That means that words will be 4-byte aligned, halfwords will be 2-byte
aligned, and bytes don't have to be aligned at all.
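The claim that natural alignment keeps a value inside one quadword can be checked with a little arithmetic. The following is a minimal sketch in C rather than SPU assembly; the function names are made up for illustration and are not part of any Cell BE toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a value of `size` bytes starting at local store address
   `addr` lies entirely inside one 16-byte quadword, 0 otherwise. */
int fits_in_one_quadword(uint32_t addr, uint32_t size) {
    assert(size >= 1 && size <= 16);
    return (addr & 15u) + size <= 16u;
}

/* Returns 1 if `addr` is naturally aligned for a power-of-two `size`
   (words on 4-byte boundaries, halfwords on 2-byte boundaries, and so on). */
int is_naturally_aligned(uint32_t addr, uint32_t size) {
    return (addr & (size - 1u)) == 0;
}
```

A word at address 12 is naturally aligned and occupies bytes 12 through 15 of its quadword, so it fits. A word at address 14 would spill into the next quadword, which is exactly the case the technique does not handle.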
Doing an unaligned load requires two or three instructions, depending on the
size of the data. The reason for this is that if you are loading a single value,
you probably want it in the preferred slot of the register. The first
instruction does the load and the second instruction rotates the value so that
the requested address is at the beginning of the register. Then, if the data is
smaller than a word, a shift is needed to move it away from the beginning into
the preferred slot (if it is a word or a doubleword, the beginning of the
register is the preferred slot). Here is the code for a byte load, which
takes an address in the preferred slot of register 3 and uses it to load a byte
into the preferred slot of register 4:
Listing 3. Load from non-aligned memory
###Load byte unaligned address $3 into preferred slot of register $4###
#Loads from nearest quadword boundary
lqd $4, 0($3)
#Rotate value to the beginning of the register
rotqby $4, $4, $3
#Rotate value to the preferred slot (-3 for bytes, -2 for halfwords, and nothing for
#words or doublewords)
rotqbyi $4, $4, -3
Remember, the lqd instruction only loads from 16-byte
boundaries. It will therefore ignore the four least significant bits during the
load, and just load an aligned quadword from memory. Therefore, for arbitrary
addresses, we have no idea where in the loaded quadword the value we wanted is.
The rotqby instruction, "rotate (left) quadword by
bytes," uses the address you loaded from to indicate how far to rotate the
register. It only uses the four least significant bits of the address in the
register (the ones ignored by the load) to determine how far to rotate. This is
always the number of bytes the quadword must move left to bring the byte at the
specified address to the beginning of the register. Finally, for bytes, the preferred
slot is not at the beginning of the register, but three bytes to the
right. So the rotqbyi instruction rotates the register
using an immediate-mode value; rotating left by -3 moves the byte three positions
to the right. Word- and doubleword-sized transfers
do not need this last instruction, because their preferred slot is at the
beginning of the register anyway. At the end of this, register 4 has the final
value, with the byte in the preferred slot.
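To make the rotation arithmetic concrete, here is a small C model of the three-instruction byte load. It is only a sketch: the quadword type and the function names are invented for this illustration, and the local store is modeled as a plain byte array whose index 0 is the leftmost byte of a register image.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quadword;   /* b[0] is the leftmost byte */

/* Model of lqd: the low four address bits are ignored. */
quadword load_quadword(const uint8_t *local_store, uint32_t addr) {
    quadword q;
    memcpy(q.b, local_store + (addr & ~15u), 16);
    return q;
}

/* Model of rotqby/rotqbyi: rotate left by the low four bits of `count`. */
quadword rotate_left_bytes(quadword q, uint32_t count) {
    quadword r;
    for (int i = 0; i < 16; i++)
        r.b[i] = q.b[(i + (count & 15u)) & 15u];
    return r;
}

/* The byte load sequence; the result ends up in byte 3 of the register,
   which is the preferred slot for bytes. */
uint8_t load_byte_unaligned(const uint8_t *local_store, uint32_t addr) {
    quadword q = load_quadword(local_store, addr);   /* lqd     */
    q = rotate_left_bytes(q, addr);                  /* rotqby  */
    q = rotate_left_bytes(q, (uint32_t)-3);          /* rotqbyi */
    return q.b[3];
}
```

In this model a count of -3 keeps only its low four bits, which is 13, and rotating left by 13 bytes is the same as rotating right by 3.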
Storing is more difficult. Here is the code to store a byte that is in the
preferred slot of register $4 into the address specified by register $3:
Listing 4. Store to non-aligned address
###Store preferred byte slot $4 into unaligned address $3
#Load the data into a temporary register
lqd $5, 0($3)
#Generate the controls for a byte insertion
cbd $6, 0($3)
#Shuffle the data in
shufb $7, $4, $5, $6
#Store it back
stqd $7, 0($3)
To understand this cryptic-looking sequence, again keep in mind that
the SPU only loads and stores a quadword at a time, on quadword-aligned
addresses. Therefore, if you tried to store a single byte
directly to an unaligned address, it would both go into the wrong location and
clobber the remaining bytes in the quadword. To avoid this, you need to first
load the quadword from memory, insert the value into the appropriate byte in the
quadword, and then store it back. The hard part is inserting it into the proper
location based only on the address. Thankfully, two instructions help out, cbd ("generate controls for byte insertion")
and shufb ("shuffle bytes"). The cbd instruction takes an address and generates a control
quadword that can be used by shufb to insert a byte at the
proper location in the quadword for that address. cbd $6,
0($3) uses the address in register 3 to generate the control quadword, and
stores it in register 6. The instruction shufb $7, $4,
$5, $6 uses the control quadword in register 6 to build a new value
in register 7, which consists of the original quadword that was in memory (now
in register 5) with the byte from the preferred slot of register 4 inserted at
the target position. Once the byte is shuffled in, the value is stored back
into memory.
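The read-modify-write store can be modeled in C as well. This is a sketch under the same assumptions as before (a byte array standing in for the local store, invented type and function names); it follows the documented behavior of the control bytes, where values 0x00 to 0x0F select a byte of the first source, 0x10 to 0x1F select a byte of the second source, and a few high bit patterns produce constants.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quadword;   /* b[0] is the leftmost byte */

/* Model of cbd: control bytes 0x10..0x1F pass the second source through,
   and a single 0x03 selects byte 3 (the preferred byte slot) of the first
   source at the position given by the low four address bits. */
quadword byte_insert_controls(uint32_t addr) {
    quadword c;
    for (int i = 0; i < 16; i++)
        c.b[i] = (uint8_t)(0x10 + i);
    c.b[addr & 15u] = 0x03;
    return c;
}

/* Model of shufb: each control byte picks one byte from the 32-byte
   concatenation of `a` and `b`, or produces a constant. */
quadword shuffle_bytes(quadword a, quadword b, quadword control) {
    quadword r;
    for (int i = 0; i < 16; i++) {
        uint8_t c = control.b[i];
        if ((c & 0xC0) == 0x80)      r.b[i] = 0x00;
        else if ((c & 0xE0) == 0xC0) r.b[i] = 0xFF;
        else if ((c & 0xE0) == 0xE0) r.b[i] = 0x80;
        else if (c & 0x10)           r.b[i] = b.b[c & 0x0F];
        else                         r.b[i] = a.b[c & 0x0F];
    }
    return r;
}

/* The store sequence: load, generate controls, shuffle, store back. */
void store_byte_unaligned(uint8_t *local_store, uint32_t addr, uint8_t value) {
    quadword src, mem, ctl, out;
    memset(src.b, 0, 16);
    src.b[3] = value;                                 /* preferred byte slot */
    memcpy(mem.b, local_store + (addr & ~15u), 16);   /* lqd   */
    ctl = byte_insert_controls(addr);                 /* cbd   */
    out = shuffle_bytes(src, mem, ctl);               /* shufb */
    memcpy(local_store + (addr & ~15u), out.b, 16);   /* stqd  */
}
```

Only the byte at the target address changes; the other fifteen bytes of the quadword are written back exactly as they were read.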
To illustrate the technique, I'll write a function that takes the
address of an ASCII character, loads it, converts it to uppercase, and stores it
back. I'll put the function convert_to_upper in a separate file from the main function so that I can reuse it in another program
later on. Here is the code for the main
function (save it as convert_main.s):
Listing 5. Uppercase conversion program start
.data
string_start:
.ascii "We will convert the following letter, "
letter_to_convert:
.ascii "q"
remaining:
.ascii ", to uppercase\n\0"
.text
.global main
.type main, @function
main:
.equ MAIN_FRAME_SIZE, 32
.equ LR_OFFSET, 16
#PROLOGUE
stqd $lr, LR_OFFSET($sp)
stqd $sp, -MAIN_FRAME_SIZE($sp)
ai $sp, $sp, -MAIN_FRAME_SIZE
#MAIN FUNCTION
ila $3, letter_to_convert
brsl $lr, convert_to_upper
ila $3, string_start
brsl $lr, printf
#EPILOGUE
ai $sp, $sp, MAIN_FRAME_SIZE
lqd $lr, LR_OFFSET($sp)
bi $lr
Now enter the function that actually does the uppercase conversion (enter as
convert_to_upper.s):
Listing 6. Function to convert to uppercase
.text
.global convert_to_upper
.type convert_to_upper, @function
convert_to_upper:
#Register usage
# $3 - parameter 1 -- address of byte to be converted
# $4 - byte value to be converted
# $5 - $4 greater than 'a' - 1?
# $6 - $4 greater than 'z'?
# $7 - $4 less than or equal to 'z'?
# $8 - $4 between 'a' and 'z' (inclusive)?
# $9 through $12 - temporary storage for final store
# $13 - conversion factor
#address of letter stored in unaligned address in $3
#UNALIGNED LOAD
lqd $4, 0($3)
rotqby $4, $4, $3
rotqbyi $4, $4, -3
#IS IN RANGE 'a'-'z'?
cgtbi $5, $4, 'a' - 1
cgtbi $6, $4, 'z'
nand $7, $6, $6
and $8, $5, $7
#Mask out irrelevant bits
andi $8, $8, 255
#Skip uppercase conversion and store if $4 is not lowercase (based on $8)
brz $8, end_convert
is_lowercase:
#Perform Conversion
il $13, 'a' - 'A'
absdb $4, $4, $13
#Unaligned Store
lqd $9, 0($3)
cbd $10, 0($3)
shufb $11, $4, $9, $10
stqd $11, 0($3)
end_convert:
#no stack frame, no return value, just return
bi $lr
To compile and run, perform the following commands:
spu-gcc convert_main.s convert_to_upper.s -o convert
./convert
The main function works much as it did
before, so I won't discuss it here. Note, however, that it passes the
address of the letter to convert_to_upper, not
the letter itself.
The convert_to_upper function takes the address of an
arbitrary character, converts it to uppercase, and then stores it back and
returns nothing. It never calls another function, so it doesn't need a stack
frame.
The first thing the function does is an unaligned load, as described previously,
into register 4. It then checks to see if the byte is in the range a through z. It does that by
checking whether it is greater than 'a' - 1, and then
whether it is greater than 'z'. I did not do a
"less than" comparison, because one isn't available on the SPU! SPUs
only have comparisons for "greater than" and "equal to." Therefore, if you want
to do a "less than or equal to" comparison, you must do a "greater than"
comparison and then do a "not" on it, which is performed using the nand instruction with both source arguments being the same
register. You then combine the comparisons using the and instruction (note that you could have replaced the
nand and the and with a single xor, but the
code would have been much less clear). Finally, because the branch instructions
only operate on halfword or word values, you have to mask out the non-relevant
portions of the register. (I didn't have to do that in the factorial example
because I was dealing with a full word.)
If the bits in the preferred slot of register 8 are all set to false, you skip to
the end of the function. If they are true, you perform the conversion. The SPU
has no byte-wise add or subtract instruction, but it does have absdb,
"absolute difference of bytes," which gives the absolute value of the difference
between corresponding bytes of two operands. You use that, combined with the difference between the
lowercase and uppercase values, to perform the conversion. Finally, you perform
an unaligned store. Since you did not call any functions or use any local
storage, you did not need a stack frame at all, so you can now just exit through
the link register.
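The comparison logic is easier to follow when written out on a single byte. The C sketch below models the byte compare as a signed comparison that yields all ones or all zeros, and shows both the nand/and combination used in the listing and the xor shortcut mentioned above. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of cgtbi on one byte: signed compare, all ones when true. */
uint8_t greater_than_mask(uint8_t value, int8_t limit) {
    return ((int8_t)value > limit) ? 0xFF : 0x00;
}

/* Model of absdb on one byte: absolute difference of unsigned bytes. */
uint8_t abs_diff_byte(uint8_t x, uint8_t y) {
    return (uint8_t)(x > y ? x - y : y - x);
}

/* The nand + and combination from the listing. */
uint8_t is_lowercase_mask(uint8_t c) {
    uint8_t above_a_minus_1 = greater_than_mask(c, 'a' - 1);
    uint8_t above_z = greater_than_mask(c, 'z');
    uint8_t not_above_z = (uint8_t)~(above_z & above_z);   /* nand */
    return above_a_minus_1 & not_above_z;                  /* and  */
}

/* The xor shortcut: "above z" implies "above a - 1", so the two masks
   differ exactly when the byte is in the range. */
uint8_t is_lowercase_mask_xor(uint8_t c) {
    return greater_than_mask(c, 'a' - 1) ^ greater_than_mask(c, 'z');
}

/* The whole conversion on one byte. */
uint8_t to_upper_byte(uint8_t c) {
    if (is_lowercase_mask(c) == 0)
        return c;                          /* brz skips the conversion */
    return abs_diff_byte(c, 'a' - 'A');    /* absdb with 32 */
}
```

Because the compare is signed, bytes with the high bit set count as negative and are never treated as lowercase letters.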
Communication with the PPE
So far I have concentrated on SPE-only programs. Now I will look into
PPE-controlled programs, and for that, I need to know how to get the PPE and the
SPE to communicate.
Channels and the MFC
Remember that SPEs have a memory that is separate from the processor's main
memory, called the local store. The SPE cannot read main memory directly,
but instead must import and export data between the local store and main memory
using DMA commands to a unit called the memory flow controller, or MFC.
The local store address space is limited to 32 bits, but it is usually much
smaller (in the Sony® PLAYSTATION® 3, for instance, it is only 18 bits). The reason for
keeping this memory separate is so that memory accesses by SPE code can be deterministic. Main memory
can get swapped out, moved around, cached, uncached, or memory mapped.
Therefore, the amount of time required for any particular memory access is
completely unknown (if the memory is swapped out, who knows how long it will
take). By separating out the SPE memory into a local store, the SPE can have a
deterministic access time for any memory it accesses, and schedule the MFC to
asynchronously move data in and out of main memory as needed. Addresses within
an SPE's local store are called local store addresses (LSAs), while
addresses within the main memory are called effective addresses (EAs).
This will be important as you learn how to use the memory flow controller's DMA
facilities.
SPEs communicate with the outside world by using channels. A channel is
a 32-bit area which can be written to or read from (but not both -- they are
unidirectional) using special instructions. A channel can also have a depth, or
channel count. The channel count is the amount of data waiting to be read
(for read channels), or the amount of data which can still be written (for write
channels). Channels are used for all SPE input and output. They are used for
issuing DMA commands to the memory flow controller, handling SPE events, and
reading and writing messages to and from the PPE. The next program I'll show you
uses the MFC and the channel interface to do character conversions
on data specified by the PPE.
Creating and running SPE tasks
So far, the main function has not been using any
parameters. However, when it is run from a PPE program, it actually receives
three 64-bit parameters -- the SPE task identifier in register 3, a pointer to
application parameters in register 4, and a pointer to runtime environment
information in register 5. The contents of the areas pointed to by the application
and environment pointers are actually user-defined. However, remember that they
point to memory in the main storage of the application (an effective
address), not to the SPE's local store. Therefore, they cannot be accessed
directly, but must be moved in through DMA.
SPE tasks are created with the function speid_t
spe_create_thread(spe_gid_t spe_gid, spe_program_handle_t *spe_program_handle,
void *argp, void *envp, unsigned long mask, int flags). The parameters
work as follows:
- spe_gid
This is the SPE thread group to assign this task to. It can simply be set to zero.
- spe_program_handle
This is a pointer to a structure which holds the data about the SPE program
itself. This data is normally defined either automatically by embedding an SPU
application within a PPU executable (this will be shown later), by using
dlopen()/dlsym() on a library containing an SPU application, or by using
spe_open_image() to directly load an SPU application.
- argp
This is a pointer to application-specific data for program initialization. Set
to null if it is not going to be used.
- envp
This is a pointer to environment data for the program. Set to null if it is not
going to be used.
- mask
This is the processor affinity mask. Set it to -1 to assign the process to any
available SPE. Otherwise, it contains a bitmask with one bit for each available
processor: 1 means that the processor should be used, 0 means that it should
not. Most applications set this to -1.
- flags
This is a set of bit flags which modify how the SPE is set up. These are all
outside the scope of this article.
A PPE/SPE program using DMA
As an example of DMA communication, I will write a program where the PPE takes
a string, and invokes an SPE program which copies over the string, converts it to
uppercase, and copies it back into main storage. All of the data transfers will
use the MFC's DMA facilities, controlled through SPE channels.
The main SPE program will receive an effective address pointer to a struct
containing the size and pointer of a string in main memory. It will then copy it
into its buffer, perform the conversion, and copy it back. Here is the SPE code
(enter as convert_dma_main.s):
Listing 7. SPU code to perform uppercase conversion for PPU program
.data
.align 4
conversion_info:
conversion_length:
.octa 0
conversion_data:
.octa 0
.equ CONVERSION_STRUCT_SIZE, 32
.section .bss #Uninitialized Data Section
.align 4
.lcomm conversion_buffer, 16384
.text
.global main
.type main, @function
#MFC Constants
.equ MFC_GET_CMD, 0x40
.equ MFC_PUT_CMD, 0x20
#Stack Frame Constants
.equ MAIN_FRAME_SIZE, 80
.equ MAIN_REG_SAVE_OFFSET, 32
.equ LR_OFFSET, 16
main:
#Prologue
stqd $lr, LR_OFFSET($sp)
stqd $sp, -MAIN_FRAME_SIZE($sp)
ai $sp, $sp, -MAIN_FRAME_SIZE
#Save Registers
#Save register $127 (will be used for current index)
stqd $127, MAIN_REG_SAVE_OFFSET($sp)
#Save register $126 (will be used for base pointer)
stqd $126, MAIN_REG_SAVE_OFFSET+16($sp)
#Save register $125 (will be used for final size)
stqd $125, MAIN_REG_SAVE_OFFSET+32($sp)
##COPY IN CONVERSION INFORMATION##
ila $3, conversion_info #Local Store Address
#register 4 already has address #64-bit Effective Address
il $5, CONVERSION_STRUCT_SIZE #Transfer size
il $6, 0 #DMA Tag
il $7, MFC_GET_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
##COPY STRING IN TO BUFFER##
#Load buffer data pointer
ila $3, conversion_buffer #Local Store
lqr $4, conversion_data #64-bit Effective Address
lqr $5, conversion_length #SIZE
il $6, 0 #DMA Tag
il $7, MFC_GET_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
#LOOP THROUGH BUFFER
#Load buffer size
lqr $125, conversion_length
#Load buffer pointer
ila $126, conversion_buffer
#Load buffer index
il $127, 0
loop:
ceq $7, $125, $127
brnz $7, loop_end
#Compute address for function parameter
a $3, $127, $126
#Next index
ai $127, $127, 1
#Run function
brsl $lr, convert_to_upper
#Repeat loop
br loop
loop_end:
#Copy data back
ila $3, conversion_buffer #Local Store Address
lqr $4, conversion_data #64-bit effective address
lqr $5, conversion_length #Size
il $6, 0 #DMA Tag
il $7, MFC_PUT_CMD #DMA Command
brsl $lr, perform_dma
#Wait for DMA to complete
il $3, 0
brsl $lr, wait_for_dma_completion
#Return Value
il $3, 0
#Epilogue
ai $sp, $sp, MAIN_FRAME_SIZE
lqd $lr, LR_OFFSET($sp)
bi $lr
This code relies on some utility functions for handling DMA commands. Enter
those functions as dma_utils.s:
Listing 8. DMA transferring utilities
##UTILITY FUNCTION TO PERFORM DMA OPS##
#Parameters - Local Store Address, 64-bit Effective Address, Transfer Size,
#DMA Tag, DMA Command
.global perform_dma
.type perform_dma, @function
perform_dma:
shlqbyi $9, $4, 4 #Get the low-order 32-bits of the address
wrch $MFC_LSA, $3
wrch $MFC_EAH, $4
wrch $MFC_EAL, $9
wrch $MFC_Size, $5
wrch $MFC_TagID, $6
wrch $MFC_Cmd, $7
bi $lr
.global wait_for_dma_completion
.type wait_for_dma_completion, @function
wait_for_dma_completion:
#We receive a tag in register 3 - convert to a tag mask
il $4, 1
shl $4, $4, $3
wrch $MFC_WrTagMask, $4
#Tell the DMA that we only want it to inform us on DMA completion
il $5, 2
wrch $MFC_WrTagUpdate, $5
#Wait for DMA Completion, and store the result in the return value
rdch $3, $MFC_RdTagStat
#Return
bi $lr
Now, not only do you need to compile this program, you need to prepare it for
embedding in a PPE application. Assuming you still have the convert_to_upper.s from your last program in the current
directory, here are the commands to compile the code and prepare it for
embedding:
spu-gcc convert_dma_main.s dma_utils.s convert_to_upper.s -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
This produces what is called a CESOF Linkable, which allows an object
file for the SPE to be embedded in a PPE application and loaded as needed.
Here is the PPU code to make use of the SPU code (enter as ppu_dma_main.c):
Listing 9. PPU code to utilize SPU application
#include <stdio.h>
#include <malloc.h> /* declares memalign */
#include <libspe.h>
#include <errno.h>
#include <string.h>
/* embedspu actually defines this in the generated object file,
we only need an extern reference here */
extern spe_program_handle_t convert_to_upper_handle;
/* This is the parameter structure that our SPE code expects */
/* Note the alignment on all of the data that will be passed to the SPE is 16-bytes */
typedef struct {
int length __attribute__((aligned(16)));
unsigned long long data __attribute__((aligned(16)));
} conversion_structure;
int main() {
int status = 0;
/* Pad string to a quadword -- there are 12 spaces at the end. */
char *tmp_str = "This is the string we want to convert to uppercase.            ";
/* Copy it to an aligned boundary */
char *str = memalign(16, strlen(tmp_str) + 1);
strcpy(str, tmp_str);
/* Create conversion structure on an aligned boundary */
conversion_structure conversion_info __attribute__((aligned(16)));
/* Set the data elements in the parameter structure */
conversion_info.length = strlen(str) + 1; /* add one for null byte */
conversion_info.data = (unsigned long long)str;
/* Create the thread and check for errors */
speid_t spe_id = spe_create_thread(0, &convert_to_upper_handle,
&conversion_info, NULL, -1, 0);
if(spe_id == 0) {
fprintf(stderr, "Unable to create SPE thread: errno=%d\n", errno);
return 1;
}
/* Wait for SPE thread completion */
spe_wait(spe_id, &status, 0);
/* Print out result */
printf("The converted string is: %s\n", str);
return 0;
}
To build and execute the program, enter the following commands:
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert
A lot of things are going on in this code, and my goal is to introduce
all of the necessary foundational material so that we don't get bogged
down in it when learning optimization secrets in the next article. (Stay with me, and you'll be on your way to expert SPU
programming in no time!) Now, I'll explain what the code is doing. I'll start
with the PPU code, since it's a little easier.
The first interesting part of the PPU code is the inclusion of the libspe.h header file, which contains all of the function
declarations for running programs on the SPE. It then references a handle called
convert_to_upper_handle. This is only an extern declaration, not the definition itself,
because convert_to_upper_handle is defined in spe_convert_csf.o. The name of the variable was set on the
command line of the embedspu command. That variable
is the handle to the program code, which will be used to create your SPE tasks.
Next, you define the structure that will be used as the parameter to your SPE
program. You need the length of the string and the pointer to the string itself.
Both need to be quadword aligned, so that you can copy the structure into your SPE
program and use the values with DMA transfers. Note that the pointer is
declared as an unsigned long long rather than as a
pointer. This is so that the address is stored the same way whether the program
is compiled in 32-bit mode or 64-bit mode. With a pointer, if it were compiled in
32-bit mode, the pointer would be aligned differently within the structure. You
also have to use the memalign function and strcpy to copy the data into an area of appropriate
alignment. Here's a pointer from long nights of trial and error with this stuff:
If you are continually receiving a "bus error," you are probably doing a DMA
transfer that is either not 16-byte aligned or is not a multiple of 16 bytes.
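The layout the SPE code depends on can be verified at compile time. The sketch below repeats the structure definition and adds C11 static assertions; it assumes a compiler that accepts the GCC-style aligned attribute, and the assertions themselves are an addition for illustration, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Same member declarations as the PPU program's parameter structure. */
typedef struct {
    int length __attribute__((aligned(16)));
    unsigned long long data __attribute__((aligned(16)));
} conversion_structure;

/* Each member starts its own quadword, and the whole structure is two
   quadwords long, which is the size the SPE side transfers. */
_Static_assert(offsetof(conversion_structure, length) == 0,
               "length must start the first quadword");
_Static_assert(offsetof(conversion_structure, data) == 16,
               "data must start the second quadword");
_Static_assert(sizeof(conversion_structure) == 32,
               "structure must be exactly two quadwords");
```

With this layout, each member sits at the start of a 16-byte slot, matching the two .octa reservations on the SPE side. On the big-endian PPE, that also places each value at the left end of the quadword, where the SPU's preferred slot is.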
In the main program, you declare your variables. Note that all of the declared
variables which will be copied using DMA are aligned on quadword boundaries and
are multiples of quadwords. That's because DMA transfers, with a few exceptions
for small transfers, must be quadword aligned in both the source and
destination addresses (the program will get even better performance if both
source and destination are 128-byte aligned). Next, the SPE task is
created with spe_create_thread, passing in your
parameter structure. Now you can just wait for the SPE task to complete using
spe_wait, and then print out the final value. As you
may have guessed, most of the interesting parts of the program are taking place
on the SPE, including all of the DMA transfers. DMA transfers are almost always
done by the SPEs rather than by the PPE because they can handle much more data
and many more active DMA operations than the PPE.
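The size and alignment rules can be captured in a small validity check. This is a hedged sketch of the commonly documented MFC constraints (sizes of 1, 2, 4, or 8 bytes with natural alignment and matching low address bits, or multiples of 16 bytes up to 16 KB with quadword alignment). The function names are invented, and the sketch rejects a size of zero for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a single DMA command could move `size` bytes between local
   store address `lsa` and effective address `ea`, 0 otherwise. */
int dma_transfer_is_valid(uint32_t lsa, uint64_t ea, uint32_t size) {
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfers: naturally aligned, and both addresses must have
           the same offset within their quadword. */
        return (lsa & (size - 1u)) == 0
            && (lsa & 15u) == (uint32_t)(ea & 15u);
    }
    /* Larger transfers: a multiple of 16 bytes, at most 16 KB, with both
       addresses on quadword boundaries. */
    return size != 0 && size <= 16384u && (size & 15u) == 0
        && (lsa & 15u) == 0 && (ea & 15u) == 0;
}

/* Rounds a byte count up to the next multiple of 16, which is why the
   example string is padded with spaces. */
uint32_t round_up_to_quadword(uint32_t n) {
    return (n + 15u) & ~15u;
}
```

The example string is 51 characters long, so with 12 spaces of padding and the null byte it comes to 64 bytes, a legal transfer size.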
Before getting into the details of the main program, I'll explore the DMA
utility functions. The first function is perform_dma,
which, not surprisingly, performs DMA commands. The Cell BE Handbook defines the
sequence of channel operations needed to perform a DMA transfer on pages 450-456 (see Resources).
The first thing the function does is convert the 64-bit effective address
in register 4 into two 32-bit components -- a high- and a low-order component
(remember, the channels are only 32 bits wide). Because channels are written
using a register's preferred word-sized slot, the 64-bit address already has the
high-order bits in the preferred slot. Therefore, you just shift the contents to
the left by four bytes into a new register to get the low-order bits in the
preferred slot. You then write the local store address, the high-order bits of
the effective address, the low-order bits of the effective address, the size of
the transfer, the "tag" of the DMA command, and then the command itself to their
appropriate channels using the wrch instruction.
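The address split can be pictured with a register image. The following C sketch models a 16-byte register as a big-endian byte array, reads its preferred word slot, and shifts it left by four bytes the way the utility function does. All names here are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes a 64-bit value into bytes 0-7 of a register image, big-endian,
   which is where a doubleword in the preferred slot lives. */
void put_doubleword(uint8_t reg[16], uint64_t value) {
    memset(reg, 0, 16);
    for (int i = 0; i < 8; i++)
        reg[i] = (uint8_t)(value >> (56 - 8 * i));
}

/* Reads the preferred word slot (bytes 0-3) of a register image. */
uint32_t preferred_word(const uint8_t reg[16]) {
    return ((uint32_t)reg[0] << 24) | ((uint32_t)reg[1] << 16)
         | ((uint32_t)reg[2] << 8)  |  (uint32_t)reg[3];
}

/* Model of shlqbyi: shift the whole register left by `count` bytes,
   filling with zeros on the right. */
void shift_left_bytes(const uint8_t in[16], uint8_t out[16], unsigned count) {
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + count < 16) ? in[i + count] : 0;
}

/* The same split written with ordinary integer arithmetic. */
uint32_t ea_high(uint64_t ea) { return (uint32_t)(ea >> 32); }
uint32_t ea_low(uint64_t ea)  { return (uint32_t)(ea & 0xFFFFFFFFu); }
```

Before the shift, the preferred word holds the high half of the address; after it, the preferred word holds the low half, ready to be written to the low-address channel.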
When the command is written, the DMA request is enqueued into the MFC provided it
has available slots -- yours certainly does as you are not doing any other
concurrent DMA requests. The "tag" is a number which can be assigned to one or
many DMA commands. All DMA commands issued with the same tag are considered a
single group, and status updates and sequencing operations apply to the group as
a whole. In this application, you will only have one DMA command active at a
time, so all of your operations will use 0 as the DMA tag. The DMA command should
be either MFC_GET_CMD or MFC_PUT_CMD. There are others, but we aren't concerned with
them here. MFC commands are all done from the perspective of the SPE, whether or
not it is actually the SPE issuing the command. So MFC_GET_CMD moves data from main memory to the local store,
and MFC_PUT_CMD goes the other way.
Because DMA commands are asynchronous, it is useful to be able to wait for one
to complete. The function wait_for_dma_completion
does precisely that. It takes a tag as its only parameter, converts it to a tag
mask, requests a DMA status, and then reads the status. So how does this wait
for the DMA operation to complete? Writing a value of 2 to the $MFC_WrTagUpdate channel causes the
$MFC_RdTagStat channel to have no value until every operation
in the masked tag groups has completed. Thus, when you try to read the channel using rdch, it will block until the status is available, at which
point the transfer will be complete.
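The tag mask and the completion condition are simple bit operations. The C sketch below models them with invented names: the enum values mirror the update conditions as commonly documented for the tag update channel (0 for immediate, 1 for any, 2 for all), but the identifiers themselves are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the values written to the tag update channel. */
enum { TAG_UPDATE_IMMEDIATE = 0, TAG_UPDATE_ANY = 1, TAG_UPDATE_ALL = 2 };

/* Converts a tag number (0-31) to a one-bit mask, like il + shl. */
uint32_t tag_to_mask(unsigned tag) {
    assert(tag < 32);
    return (uint32_t)1 << tag;
}

/* Returns 1 when a read of the tag status channel would stop blocking.
   `completed` has one bit set for each tag group with nothing outstanding. */
int tag_status_ready(uint32_t completed, uint32_t mask, int condition) {
    switch (condition) {
    case TAG_UPDATE_ANY: return (completed & mask) != 0;
    case TAG_UPDATE_ALL: return (completed & mask) == mask;
    default:             return 1;   /* immediate: never blocks */
    }
}
```

With a single tag in the mask, "any" and "all" behave the same; the distinction only matters once a mask covers several tag groups.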
Now, moving on to the actual program itself. The first thing our SPE program
does is reserve space for the application's parameter data. This is also aligned
to quadword boundaries (.align 4 in assembly language
works the same as __attribute__((aligned(16))) in C
because 2^4 = 16). .octa reserves quadword values
(the mnemonic is a holdover from 16-bit days). You then define a constant CONVERSION_STRUCT_SIZE for the size of the whole structure.
After this, you go to the .bss section, which is like
the .data section, except that the executable itself
does not contain the values, it just notes how much space should be reserved for
them. This section is for uninitialized data. .lcomm
conversion_buffer, 16384 reserves 16K of space, with the starting address
defined in the symbol conversion_buffer. It is
defined for holding 16K because that is the maximum size of an MFC DMA transfer.
Therefore, if any string is longer than that, the PPE will have to invoke the
program multiple times (a better program would simply break up the request into
chunks on the SPE side).
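The chunking approach mentioned in passing could look roughly like this. It is a sketch only, written in C with a callback standing in for the get, convert, and put steps; the names are hypothetical, and it assumes the starting address is quadword aligned and the total length is a multiple of 16 so that every piece is itself a legal transfer.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DMA_SIZE 16384u   /* largest single MFC transfer */

/* Hypothetical callback: stands in for "DMA this piece in, process it,
   DMA it back out". */
typedef void (*chunk_fn)(uint64_t ea, uint32_t size, void *context);

/* Walks the range [ea, ea + total) in pieces of at most 16 KB and returns
   the number of pieces processed. */
unsigned for_each_dma_chunk(uint64_t ea, uint32_t total,
                            chunk_fn process, void *context) {
    unsigned chunks = 0;
    while (total > 0) {
        uint32_t size = total < MAX_DMA_SIZE ? total : MAX_DMA_SIZE;
        process(ea, size, context);
        ea += size;
        total -= size;
        chunks++;
    }
    return chunks;
}

/* Example callback that only adds up how many bytes it was handed. */
void count_bytes(uint64_t ea, uint32_t size, void *context) {
    (void)ea;
    *(uint32_t *)context += size;
}
```

A 40,000-byte string would be handled as two full 16 KB pieces followed by one piece of 7,232 bytes, all through the same 16 KB local buffer.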
The main function has the main meat of the program.
It starts by setting up a stack frame. It then saves three non-volatile registers
that will be used for the main control of the program. Next, it performs a DMA
transfer to copy in the parameter structure from the PPE. Remember, the second
parameter to the function, in register 4, is the 64-bit address that was passed in from the PPE.
You then use a DMA command to fetch the full structure, and wait for the DMA to
complete. After the transfer, you use the data in that structure to copy the
string itself into your buffer in the local store using another DMA transfer, and
wait for it to complete. Note that you used the ila
instruction ("immediate load address") to load the address of the buffer. The
ila instruction maxes out at 18 bits, which works for
the PLAYSTATION 3. However, if a Cell BE processor had a larger local store,
you would load the address instead with the following two instructions:
ilhu $3, conversion_buffer@h #load high-order 16 bits of conversion_buffer
iohl $3, conversion_buffer@l #"or" it with the low-order 16 bits of conversion_buffer
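The two-step address load amounts to assembling a 32-bit value from two 16-bit halves. Here is a minimal C model of that idea; the function names are invented, and the second step corresponds to the SPU's "immediate or halfword lower" instruction, iohl.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value an 18-bit unsigned immediate can hold. */
#define ILA_MAX 0x3FFFFu

/* Model of ilhu: place a 16-bit immediate in the upper halfword of a word
   and clear the lower halfword. */
uint32_t load_upper_halfword(uint16_t imm) {
    return (uint32_t)imm << 16;
}

/* Model of iohl: OR a 16-bit immediate into the lower halfword. */
uint32_t or_lower_halfword(uint32_t reg, uint16_t imm) {
    return reg | imm;
}

/* Builds a full 32-bit address in two steps. */
uint32_t load_address_two_steps(uint32_t address) {
    uint32_t reg = load_upper_halfword((uint16_t)(address >> 16));
    return or_lower_halfword(reg, (uint16_t)(address & 0xFFFFu));
}
```

Any address up to ILA_MAX fits in a single ila; the two-step form is only needed beyond that.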
Then the target effective address, the length of the string, the DMA tag, and a
MFC_GET_CMD DMA command are all passed to perform_dma. The program then waits for the operation to
complete.
At this point, all of the data is loaded in and you just need to convert it. You
then use register 127 as your loop counter and register 126 as your base pointer,
and perform convert_to_upper on each value until you
get to the end of the buffer.
At loop_end, all of the data is converted, and you
need only to copy it back. You use the same DMA parameters as for the last
transfer, but this time it is an MFC_PUT_CMD command.
Once the DMA is completed, your function is done. You load register 3 with the
return value and perform the function epilogue to restore the stack frame and
return.
SPE/PPE communication using mailboxes
While DMA transfers are an excellent way of moving bulk data between the SPE and
the PPE, a simpler method for small transfers, which I will discuss briefly,
is mailboxes. For the SPE, a mailbox is simply a set of channels (a read
channel and a write channel) used to exchange 32-bit values with the PPE.
To demonstrate the concept, I will write a very simple SPE server which waits
for an unsigned integer number in the mailbox and then writes back the square of
that number. Here is the code (enter as square_server.s):
Listing 10. SPU squaring server
.text
.global main
.type main, @function
main:
#Read the value from the inbox (stalls if no value until one is available)
rdch $3, $SPU_RdInMbox
#Square the value (mpyu multiplies the low-order 16 bits of each word,
#so the input must be below 65536)
mpyu $3, $3, $3
#Write the value back
wrch $SPU_WrOutMbox, $3
#Go back and do it again
br main
That's all! This will just sit around and wait for requests and process them.
It simply quits when the parent program quits. And, if there is no value
available in the inbox, the rdch instruction simply
stalls until there is one.
The PPE side isn't much harder (enter as square_client.c):
Listing 11. PPE squaring client
#include <libspe.h>
#include <stdio.h>
extern spe_program_handle_t square_server_handle;
int main() {
int status = 0;
/* Create SPE thread */
speid_t spe_id = spe_create_thread(0, &square_server_handle, NULL, NULL, -1, 0);
if(spe_id == 0) {
fprintf(stderr, "Unable to create SPE thread!\n");
return 1;
}
/* Request a square */
spe_write_in_mbox(spe_id, 4);
/* Wait for result to be available */
while(!spe_stat_out_mbox(spe_id)) {}
/* Read and display result */
printf("The square of 4 is %d\n", spe_read_out_mbox(spe_id));
/* Do it again */
spe_write_in_mbox(spe_id, 10);
while(!spe_stat_out_mbox(spe_id)) {}
printf("The square of 10 is %d\n", spe_read_out_mbox(spe_id));
return 0;
}
To compile and run this program, issue the following commands:
spu-gcc square_server.s -o square_server
embedspu -m64 square_server_handle square_server square_server_csf.o
gcc -m64 square_client.c square_server_csf.o -lspe -o square
./square
The mailboxes, even for the PPE, are named according to the perspective of the
SPE. So you write to the inbox and read from the outbox if you are the PPE.
Unlike the SPE, the PPE does not stall and wait for a value when it reads or
writes. Instead, the program must use spe_stat_out_mbox to wait for a value, and spe_stat_in_mbox to see if there are slots left for writing
to the mailbox. You don't use the latter as you only have one value in play at a
time.
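The polling protocol can be simulated on one machine to see how the two sides interact. The sketch below is a single-threaded C model, not libspe code: the structure, the function names, and the step function are all invented for illustration. It assumes the commonly documented depths of four entries for the SPU inbound mailbox and one entry for the outbound mailbox, and it models the multiply as operating on the low 16 bits of each operand.

```c
#include <assert.h>
#include <stdint.h>

#define IN_MBOX_DEPTH 4   /* assumed depth of the SPU inbound mailbox */

typedef struct {
    uint32_t in[IN_MBOX_DEPTH];   /* PPE -> SPE */
    unsigned in_count;
    uint32_t out;                 /* SPE -> PPE, one entry deep */
    unsigned out_count;
} mailboxes;

/* PPE side: free slots in the inbox, and entries waiting in the outbox. */
unsigned stat_in_mbox(const mailboxes *m)  { return IN_MBOX_DEPTH - m->in_count; }
unsigned stat_out_mbox(const mailboxes *m) { return m->out_count; }

/* PPE side: returns -1 instead of stalling when the inbox is full. */
int write_in_mbox(mailboxes *m, uint32_t value) {
    if (m->in_count == IN_MBOX_DEPTH) return -1;
    m->in[m->in_count++] = value;
    return 0;
}

/* PPE side: only call after stat_out_mbox reports a waiting entry. */
uint32_t read_out_mbox(mailboxes *m) {
    assert(m->out_count == 1);
    m->out_count = 0;
    return m->out;
}

/* Model of a 16-bit by 16-bit unsigned multiply with a 32-bit result. */
uint32_t multiply_unsigned_16(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* One pass of the server loop. Returns 0 where the real SPE would stall
   on its channel read or write (simplified to a single check). */
int spe_server_step(mailboxes *m) {
    if (m->in_count == 0 || m->out_count == 1) return 0;
    uint32_t value = m->in[0];
    for (unsigned i = 1; i < m->in_count; i++) m->in[i - 1] = m->in[i];
    m->in_count--;
    m->out = multiply_unsigned_16(value, value);
    m->out_count = 1;
    return 1;
}
```

The PPE side of the model never blocks; it checks the counts first, just as the client polls before reading. The SPE side simply makes no progress when a mailbox is empty or full, which is the model's stand-in for a stalled channel instruction.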
The real power of mailboxes comes when a program combines the mailbox and the
DMA approach. For example, an SPE task can be created which listens for buffer
addresses on its mailbox, and then uses that address to pull in all of the data
to be processed through DMA.
Conclusion
Thus far, this series has covered the main concepts of assembly language
programming on the Cell BE processor of the PLAYSTATION 3 under Linux®. Topics covered include the
basic architecture, the syntax of the SPU assembly language, and the primary
modes of communication between the SPE and the PPE. The next article looks at how to pump every ounce of performance out of the Cell BE
processor's SPEs. And later articles will apply this knowledge to SPE programming in C, to
make your life just a little bit easier.
Resources
About the author
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.