From the Cell/B.E. Architecture forum
This blog-based column looks at some of the more interesting problems and challenges posed recently in the Cell Broadband Engine Architecture forum.
Originally, paulsimon wanted to know about expected behavior when starting more pthreads than SPEs: I have been using libspe1 and am now learning libspe2. Running the example code from program "simple" from the SDK 2.1, I don't understand the results I'm getting. I'm using SDK 1.1 on a PS3, so I only have 6 SPEs, not the 8 which are assumed in the example. The program just starts 8 threads, each is supposed to print "Hello Cell...". When using more than 6 SPEs, the program usually hangs, but occasionally executes completely. It doesn't block or return errors for any of spe_create_context, spe_program_load, pthread_create, or spe_context_run. In ppu-gdb, it reliably hangs on the join of the 7th thread. This behavior is quite different from the "simple" program that used libspe1.
Is this the expected behavior when starting more pthreads than SPEs? Have others seen this when running the libspe2 example code on the PS3? I have seen the advice not to have more SPE threads than actual SPEs, but I understood that this was just for efficiency reasons. But it should still execute, shouldn't it? What is the consideration or synchronization that I'm overlooking here? Or maybe this is something to do with me using libspe2 in SDK 1.1?
The IBM SDK Service Administrator answered with: In 1.1 the threads would block, so that if 6 are in use and you start a 7th, it will block until one of those 6 exits. 2.1 has the pre-emptive scheduler so it can time-slice for realtime threads. Also, there are some known problems with pre-emption in SDK 2.1 that will be fixed in the next release, so it's possible you're hitting those and that's why the behavior is not consistent.
Now there's more.
[lowellns]: Is this context switching supported by the libspe2? Or the spufs interface? Perhaps a scenario will help.
If I have 6 SPEs and 8 pthreads that do this:
    spe_context_create()
    spe_program_load()
    spe_context_run()
what will happen? Won't two threads wait until two SPEs return, or are you telling me that there will be some context switching?
And possibly related: How would setting pthread schedule policies affect spe threads once they are running?
[SDK Service Administrator]: I don't believe this is supported on the PS3 since it's kernel dependent, so in your case the 2 extra threads will wait for two of the others to exit.
Editor: You might also want to see "Changes in libspe: How libspe2 affects Cell Broadband Engine programming" (developerWorks, July 2007).
Can you answer this one? Why are the fma (floating multiply and add) and lqd (load quad-word, d form) pairs not dual issued in timing analysis? Answer A.
iamrohitbanga wants to know if there are resources for getting the SDK going on Debian: I need help installing Cell SDK 2.0 on Ubuntu Hardy Heron. The reason is that Fedora would not support my wifi card. I tried to convert the RPM packages using alien, but have not succeeded so far.
[davidhi] (from 2006): I wanted to get the CBE SDK up and running on my laptop, so I thought I'd share my experiences getting the CBE simulator running on Debian. I went ahead and made debs out of all the packages and did away with install.sh (the postinst scripts do the appropriate work). I ran into two basic problems, one having to do with a Makefile and one having to do with tcl/tk compile-time options.
At first the command PATH=$SCE_INST_DIR/ppu/bin:$PATH make -C src was breaking under Debian. After a bit of digging I found out that the Makefile doesn't work properly under make 3.81beta4 (which Debian uses), but it still works under 3.80 (which FC4 uses). I just ended up grabbing a debian package of 3.80-9 and using that for the time being. The breakage has something to do with the $$(@F) sysV make compatibility feature.
After I got everything built/installed, the simulator would fail like so:
Segmentation fault on address 0x2c6cc528
Restored previous handler.
../run_gui: line 30: 6483 Segmentation fault $TOP/../systemsim -cell $* -g
After more digging around (noticing that the ldd output for systemsim-cell had libpthread on Debian but not on FC4), I found that this was due to Debian's tk8.4 and tcl8.4 being built with --enable-threads. I don't know why this causes the simulator to fail, but I built new tk and tcl packages without this flag and now everything works peachy. This will be an important thing for IBM to fix, however, because the maintainer of the FC tcl and tk packages has said they are now being built with --enable-threads as of last week.
[davidhi] (later): Well, just one day after I finished making Debian packages of all the tools, they released new versions of almost everything. I went ahead and built new packages and tried them all out and I didn't encounter any new problems. The Makefile structure in cell-sdk-lib-samples-1.0.1 still fails under make 3.81beta4 and the simulator 1.0.1 still crashes when tcl8.4 and tk8.4 are built with --enable-threads.
So again, everything works great as long as I use the slightly older make 3.80 to build the stuff in cell-sdk-lib-samples and rebuild my own tcl and tk packages without --enable-threads.
For fun, I also tried to see if I could bootstrap the ppc64 version of Debian in the simulator with cdebootstrap/debootstrap as my system image (I mean, why not remove FC4 all the way). It works fine, although it takes a long time to boot because by default it'll try to add swap, fsck, start up an MTA, cron, atd, inetd, syslogd, klogd, etc. So I disabled most of that. Of course, you also need to change inittab so it doesn't try to spawn getty on the first ttys. Also, I needed to make /bin/sh a statically linked shell (I used zsh) rather than the default ppc64 bash. Otherwise, it'd crash when trying to interpret the scripts called by init. The error:
malloc: ../bash/variables.c:1854: assertion botched
malloc: block on free list clobbered
[davidhi] (later): Some people have asked me how to rebuild the tk and tcl packages without the --enable-threads flag and I thought I'd post some directions in case other people find it useful. First, you can use my packages if you'd like. They're i386 arch and based off of the testing version of tcl and tk.
If you are using a different architecture (like x86_64) or flavor (stable) or you'd just like to build them yourself, here's a quick tutorial.
You'll need a couple of tools (fakeroot and debhelper): apt-get install fakeroot debhelper. Run apt-get source tk8.4 and apt-get source tcl8.4 in some directory, let's call it "pkg-tmp."
For me, this unpacks some source in the tk8.4-8.4.12 subdirectory and in the tcl8.4-8.4.12 subdirectory. Their names might be slightly different depending on whether you're running stable, testing or unstable.
In the tcl8.4-8.4.12/debian directory, edit the "rules" file and remove --enable-threads from the ./configure line. Then, from the tcl8.4-8.4.12 directory run the command fakeroot debian/rules binary and it will build new debs in the top-level "pkg-tmp" directory.
tcl8.4-doc_8.4.12-1_all.deb
tcl8.4_8.4.12-1_i386.deb
tcl8.4-dev_8.4.12-1_i386.deb
It rebuilds the documentation too, but you only need to reinstall the main and -dev libraries. Use dpkg -i tcl8.4_8.4.12-1_i386.deb tcl8.4-dev_8.4.12-1_i386.deb. Then do the same process for tk8.4-8.4.12 (edit debian/rules, run fakeroot debian/rules binary and install).
There is some order you must do these in (one package will build with threads if the other is installed with threads no matter what configure says): Based on my experience, I'm almost certain tcl needs to be built and installed with no threads before tk.
Run
ldd /usr/lib/libtcl8.4.so
ldd /usr/lib/libtk8.4.so
and make sure that there is no line like libpthread.so.0 => /lib/tls/libpthread.so.0 in the output (that would imply that they were built with threads).
Can you answer this one? How do I view the pipe trace output in Debug Controls SPU_DISPLAY_EXEC & SPU_DISPLAY_ISSUE? Answer B.
kabe wants to know why DMA transfers are failing after a move to libspe2: Hey! I have a problem with the DMA transfer. At first it all worked fine. I used really simple code and libspe.h. This worked. Then I noticed that opening an SPE thread is quite expensive. So now I wanted to open a thread and keep it. I found some bits about using pthreads with the SPEs, and the program I'm trying to convert to work on the Cell already uses pthreads on other architectures to use multiple cores. So I thought that would be a good idea.
But now I have switched to libspe2 and the execution of the code just stops in the SPE-code when I use mfc_get. In one instance without changing the code in that area it actually stopped after the mfc_write_tag_mask, so it looks a bit like undefined behaviour.
PPE-Code I'm using:
void *evaluateOnSpu(void *data)
{
    int retval;
    unsigned int entry_point = SPE_DEFAULT_ENTRY; // Required for continuing execution,
                                                  // SPE_DEFAULT_ENTRY is the standard starting offset.
    spe_context_ptr_t my_context;

    // Create the SPE Context
    my_context = spe_context_create(SPE_EVENTS_ENABLE|SPE_MAP_PS, NULL);

    // Load the embedded code into this context
    spe_program_load(my_context, &spuevaluate_handle);

    //evaluateInfo* info = AB::BSplineSurface::mInfo;

    // Run the SPE program until completion
    do {
        retval = spe_context_run(my_context, &entry_point, 0, &info, NULL, NULL);
    } while (retval > 0); // Run until exit or error

    pthread_exit(NULL);
}

void AB::BSplineSurface::testEvaluateThread()
{
    std::cout << "test" << std::endl;
    info = mInfo;
    pthread_t my_thread;
    int retval;

    // Create Thread
    retval = pthread_create(
        &my_thread,     // Thread object
        NULL,           // Thread attributes
        evaluateOnSpu,  // Thread function
        NULL            // Thread argument
    );

    // Check for thread creation errors
    if(retval) {
        fprintf(stderr, "Error creating thread! Exit code is: %d\n", retval);
        exit(1);
    }

    // Wait for Thread Completion
    retval = pthread_join(my_thread, NULL);

    // Check for thread joining errors
    if(retval) {
        fprintf(stderr, "Error joining thread! Exit code is: %d\n", retval);
        exit(1);
    }
}
SPE-Code:
#ifndef EVALUATEINFO
#define EVALUATEINFO
typedef struct {
    AB::VecPack2 param;
    Bool4 mask;
    AB::VecPack3 result[3];
    Float4 cDBItemList83;
    unsigned int cDBOffset[4];
    unsigned int count[2];
    unsigned int degree[2];
    Float4 controlPointList31;
} evaluateInfo;
#endif

/**
 *\brief This function is supposed to realize the evaluation on an SPU
 */
int main(unsigned long long speid __attribute__ ((unused)),
         unsigned long long argp,
         unsigned long long envp __attribute__ ((unused)))
{
    evaluateInfo pd __attribute__((aligned(sizeof(evaluateInfo))));
    int tag_id = 0;

    //READ DATA IN
    //Initiate copy
    //program_data_ea >>= 32;
    mfc_get(&pd, argp, sizeof(pd), tag_id, 0, 0);
    // ************* Normally it just stops here ******************

    //Wait for completion
    mfc_write_tag_mask(1<<tag_id);
    mfc_read_tag_status_any();
http://...
I confess I don't really understand everything I'm doing here. Most of the code is from examples found around the net. Did I perhaps only move half-heartedly to libspe2? Did I forget something?
I found a thread with a similar problem here, but it was still with libspe1, I think. And it wasn't solved.
P.S. I found out that it works perfectly if I transfer a simple char array of length 128. Is there a problem with transferring structs? Am I doing it wrong? Can I use something else?
On a side note: I've read that spe_context_run should be pretty slow, too. Is there a way to keep the SPU busy while transferring new data? Is that what mailboxes are for? Do I have to use completely different code then anyway?
[jmt_dh1]: Well I'll pick up on one thing in the small extracts of code you posted:
evaluateInfo pd __attribute__((aligned(sizeof(evaluateInfo))));
You should be aligning it to 16 bytes (or larger). Aligning it to a multiple of the structure size is not conceptually what you should be doing to follow the DMA alignment rules (though you may of course get lucky if the size is a multiple of 16).
"On a side note: I've read that spe_context_run should be pretty slow, too. Is there a way to keep the SPU busy while transferring new data? Is that what mailboxes are for? Do I have to use completely different code then anyway?" That's a lot of questions, but have a read about double buffering...
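Editor: jmt_dh1's alignment advice can be sketched in plain C. This is only an illustration under stated assumptions: dma_block_t and is_dma_friendly are hypothetical names, the fields are stand-ins rather than kabe's real struct, and the aligned attribute is the GCC-style one already used in the post. Putting the attribute on the type aligns every instance to 16 bytes and also rounds sizeof up to a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload; the field names are illustrative only. */
typedef struct {
    float        values[6];
    unsigned int count[2];
    unsigned int degree[2];
} __attribute__((aligned(16))) dma_block_t;

/* Returns 1 when both the address and the size are multiples of 16,
   which is the case jmt_dh1 describes for a struct-sized transfer. */
static int is_dma_friendly(const void *p, size_t size)
{
    return ((uintptr_t)p % 16 == 0) && (size % 16 == 0);
}
```

A call such as is_dma_friendly(&pd, sizeof pd) on a dma_block_t pd is a cheap check to run before issuing a transfer; the same attribute is needed on the PPE side for the buffer the effective address points to.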
[NotZed]: As jmt said, alignment is important. And for the dma size as well. It must be <16 or a multiple of 16. I always seem to hit that problem as I only dabble with cell coding rarely and forget it the next time I get back to it. My own side note: Whenever I tried something like the loop:
// Run the SPE program until completion
do {
    retval = spe_context_run(my_context, &entry_point, 0, &info, NULL, NULL);
} while (retval > 0); // Run until exit or error
It would always seem to "lose the executable" at some point after a second or so and crash -- It seems you need to re-load the programme every time. I'm wondering if this is expected or is it a bug or just a coincidence?
[jmt_dh1]: You need to re-initialize the entry point variable every time. So many things to remember :)
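Editor: The point about the entry point can be shown without Cell hardware. The sketch below does not call libspe2; stub_context_run is a hypothetical stand-in that mimics what the libspe2 documentation, as far as I understand it, says about spe_context_run: the entry argument is read on the way in and overwritten with the stop address on the way out. DEFAULT_ENTRY and the value 0x1234 are made up for the illustration.

```c
#include <assert.h>

#define DEFAULT_ENTRY 0u   /* hypothetical stand-in for SPE_DEFAULT_ENTRY */

/* Hypothetical stub: succeeds only if started at the default entry, and
   always overwrites *entry with a pretend stop address before returning. */
static int stub_context_run(unsigned int *entry)
{
    int started_at_default = (*entry == DEFAULT_ENTRY);
    *entry = 0x1234u;
    return started_at_default ? 0 : -1;
}

/* Runs the stub `times` times and returns how many runs started correctly.
   When reset_each_time is nonzero, the entry point is re-initialized first. */
static int run_repeatedly(int times, int reset_each_time)
{
    unsigned int entry = DEFAULT_ENTRY;
    int good = 0;
    for (int i = 0; i < times; i++) {
        if (reset_each_time)
            entry = DEFAULT_ENTRY;
        if (stub_context_run(&entry) == 0)
            good++;
    }
    return good;
}
```

With the reset, every run in this model starts at the default entry; without it, only the first one does, which is consistent with the "loses the executable" symptom NotZed describes.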
[kabe]: Thanks for your comments so far. Yes, I constructed the struct so it is aligned to a multiple of 16. Otherwise it won't even compile. I'm not that far that I could run my SPU-programme for a longer time, so I don't know if I have to reload the code.
I tried another route now and it kinda works... I'm using the mailbox to tell the SPU where the data it needs starts, then I load for example an int, then I load a float from the starting address + sizeof(int) etc. Doesn't look that elegant.
Another point is, that I have to load the data in a special order. There is a bigger array in my data set of 128 Float4. When I first load two other variables and then the big array, it crashes. When I first load the big array and then the other data, it works.
At another point when I load an int[4] and two int[2] it crashes and when I load one int[8] and divide it by hand into three different arrays it works. This tells me that there has to be something seriously wrong. Can't you do many mfc_get consecutively? Even when I use mfc_write_tag_mask and mfc_read_tag_status_all after every mfc_get it hangs. What can be the reason for this?
[jmt_dh1]: Sounds extremely odd. Can you attach some reproducible sample code?
[kabe]: Sadly I am not allowed to post the actual code; I even had to sign a form. But I will try to reproduce the problem with standard data types...
Meanwhile I noticed that the ppu-gdb wasn't installed and that it could give info about the DMA. So I installed it, checked it... and didn't understand anything. This is the output:
(gdb) info spu dma
Tag-Group Status 0x80000000
Tag-Group Mask 0x80000000 ('all' query pending)
Stall-and-Notify 0x00000000
Atomic Cmd Status 0x00000000
Opcode Tag TId RId EA LSA Size LstAddr LstSize E
get 30 0 0 0x00000000ffa5de68 0x3ed50 0x00010 *
get 31 0 0 0x0000000010c2a0c0 0x3f040 0x00000
get 31 0 0 0x0000000010c29ea0 0x3fbb0 0x00000
get 30 0 0 0x00000000ffa5df08 0x3ed70 0x00020 *
get 0 0 0 0xd0000000002a6900 0x00e80 0x00000
putl 0 0 0 0xd0000000002f5000 0x00000 0x00000 0x00bf0 0x00008
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
In another post I read that the * is a sign for trouble. Can anyone help me understand this?
[jeshua]: I don't know what a lot of this means, but it looks to me like the two marked with a * both have bad alignment on the EA (Effective Address). The LSAs (Local Store Addresses) look aligned properly.
[kabe]: That comment pointed me in the right direction, thank you! I really only forgot to align some variables in the PPE code. Aligning the variables in the struct leads to new problems, but I think I'll find the answer to that.
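Editor: jeshua's reading of the gdb output can be checked with a little arithmetic. The sketch below encodes the commonly documented MFC rules (transfer sizes of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB; natural alignment for the small sizes and 16-byte alignment otherwise). dma_args_ok is a hypothetical helper, not part of any SDK, and it treats a size of 0 as invalid purely to keep the sketch simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if (ea, lsa, size) satisfies the basic DMA rules sketched above. */
static int dma_args_ok(uint64_t ea, uint32_t lsa, size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfers: naturally aligned, and the low 4 bits of the
           effective address and local store address must match. */
        return (ea % size == 0) && (lsa % size == 0)
            && ((ea & 0xF) == (lsa & 0xF));
    }
    if (size == 0 || size % 16 != 0 || size > 16 * 1024)
        return 0;
    /* Larger transfers: both addresses must be 16-byte aligned. */
    return (ea % 16 == 0) && (lsa % 16 == 0);
}
```

Fed the two starred rows from the listing (EA 0xffa5de68 with size 0x10, and EA 0xffa5df08 with size 0x20), the helper rejects both because each EA ends in 8, while the matching LSAs 0x3ed50 and 0x3ed70 are 16-byte aligned.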
Can you answer this one? Why is my workload distribution so uneven for the two pipelines? Answer C.
b.lix wants to know what's the difference between the LOOSE and TURBO modes on the simulator: I'm working with the Cell SDK on two different computers. When I switch the Cell System Simulator into "Fast Mode" I get different messages from the simulator on the different machines. On one I get Simulator now in TURBO mode and on the other I get Simulator now in LOOSE mode.
What is the difference between these modes? And how do I get the second system also into TURBO mode?
[mkistler]: TURBO mode is only supported on 64-bit systems. It is a special flavor of FAST mode that uses just-in-time translation of PPC instructions into instructions of the host machine. Since Cell is a 64-bit architecture, it was not practical to support TURBO mode on 32-bit host systems.
[b.lix]: I think there is also a FAST mode on 32-Bit Systems. I want to know why I get Simulator now in LOOSE mode and not Simulator now in FAST mode. What is this LOOSE mode?
[mkistler]: LOOSE mode is a minor tweak on FAST mode that executes "blocks" of instructions per CPU before switching to simulating instructions on the other CPU. This is far more efficient but can result in minor timing differences in interrupt delivery and/or synchronization mechanisms.
Threads worth pursuing
A fresh new license: The original license included in the Extras package allows evaluation until 2008-05-31. Users who wish to evaluate the software for longer should download and install the license refresh. When installed, this refresh will supersede the prior license and extend the evaluation period until 2008-12-31. All other aspects of the prior license will remain unchanged.
Forum statistics for May 2008 Threads: 116 | Participants: 5,016 | Replies: 480 | % threads answered: 22%
Jun 20 2008, 02:41:00 PM EDT