
Cell Broadband Engine/Power Architecture notebook

This web log is the product of the collaborative, innovative, virtual minds of the editors of the IBM developerWorks Multicore acceleration (Cell/B.E. SDK) zone.



Tuesday July 01, 2008

Greener than speeding ethanol: Big Blue's quite a shade of green

It's all IBM at the top of the Green500 list: The Green500 list, a complement to the Top500 list that ranks the most energy-efficient supercomputers in the world, is out with its latest rankings. And the winners are:

  1. IBM Germany: BladeCenter QS22 cluster, PowerXCell 8i 3.2GHz, Infiniband.
  1. Fraunhofer ITWM: BladeCenter QS22 cluster, PowerXCell 8i 3.2GHz, Infiniband.
  1. DOE/NNSA/LANL Roadrunner: BladeCenter QS22/LS21 cluster, PowerXCell 8i 3.2GHz/Opteron DC 1.8GHz, Voltaire Infiniband.
  1. Argonne National Laboratory | Dublin Institute for Advanced Studies/ICHEC | Science and Technology Facilities Council/Daresbury Laboratory: Blue Gene/P.
  1. RZG/Max-Planck-Gesellschaft MPI/IPP | Stony Brook/BNL | ASTRON/University Groningen | IBM/Rochester | DOE/Oak Ridge National Lab: Blue Gene/P.

The Green500 list uses measured power of the system if available; otherwise the peak power of the system is used.


Categories: [ Cell | news ]

Jul 01 2008, 02:30:00 PM EDT


Tuesday July 01, 2008

Games people play: Playing the deep oil game

Video game chip could unlock deepest oil: The Spanish oil conglomerate Repsol, working with the Barcelona Supercomputing Center, has tested a system powered by the Cell Broadband Engine that could scour for reserves at depths of 30,000 feet. The partners used the PowerXCell 8i chip to perform Reverse Time Migration (RTM), a sophisticated subsurface imaging technique. RTM lets users find subsurface features but is hampered by a lack of processing speed; the Cell/B.E. processor can meet the speed demands of its imaging algorithm. The updated video-game chips, first used to search below the seabed in the Gulf of Mexico, run on the BladeCenter QS22 computer.

See also: PlayStation 3 Chips Help Repsol Find Deep Oil.


Categories: [ Cell | news ]

Jul 01 2008, 12:36:00 PM EDT


Tuesday July 01, 2008

Faster than a speeding bullet: Canada's new fastest to shoot down cancer

System Cluster 1350 and DCV build high-res R&D snaps: The Ontario Cancer Institute is the recipient of a new tool in the fight to eliminate cancer -- a new IBM System Cluster 1350 supercomputer using the DCS9550 Disk Storage System and Deep Computing Visualization (to create the high-resolution images required for cancer research analysis). It will allow researchers to analyze millions of images of proteins via automation; high-resolution imaging and sophisticated computer-based image classification can help scientists more quickly identify the structure of disease-related proteins. The system includes 1,344 processor cores in the Linux cluster running at 12.5tflops with 150TB of storage -- it is one of the fastest research clusters in Canada.


Categories: [ general | news ]

Jul 01 2008, 12:35:00 PM EDT


Tuesday July 01, 2008

Design on a dime: Multiple interconnects for SoC video developers

Link supports multiple memory channels for video SoC: SonicsSX (from Sonics Inc.) is a new chip-level interconnect technology that should help address the memory bandwidth problem for video processor developers (set-top ASICs, HDTVs, game consoles, media players, cell phones) since it can intelligently connect multiple processing elements on a chip with multiple channels of external memory. Sonics CTO Drew Wingard notes that the shift to DDR3 DRAMs helps but does not eliminate the bandwidth problem because many commonly used processor cores and accelerators are typically limited to transferring 32-bit chunks of video data at a time. SonicsSX expands the maximum bus width from 128 to 256 bits, increases throughput from 6- to 16GB, and supports as many as eight memory channels. To address the multichannel memory problem of load balancing, SonicsSX uses flexible interleaving techniques that are optimized to the operations of underlying memory and controller chips (the company has also designed some new data structures it claims will improve the efficiency with which the interconnect grabs related pixel data and addresses).


Categories: [ general | news ]

Jul 01 2008, 12:33:00 PM EDT


Tuesday July 01, 2008

Oddments: 10 really bad auto "enhancement" gadgets

10 worst auto "electronics": Hagerty Insurance has created a list of the 10 quirkiest, most outrageous auto options.

  1. Automatic Lit Cigarette Dispenser (1940s): Designed to eliminate the distractions of lighting a cigarette while motoring down the road, one option was to attach it to a steering wheel. And I thought the wheel was too hot to grab after the car had been cooking in the summer sun.
  2. Highway Hi-Fi (1955): Users had some of the largest collections of scratched LPs in existence.
  3. The Destroilet (1960s): A gas-fueled incinerator-type toilet for Dodge motor homes, designed to simplify waste disposal. If you got to the waste before the incineration was complete, you could leave it on an annoying neighbor's porch and ring the doorbell and run.

Amazing, isn't it, how many auto manufacturers thought driving and open flames were a good combination.

  4. Electric Shaver (1940s): So the guys won't feel left out, the male version of putting on makeup while moving.
  5. Auto Swamp Cooler (1940s-50s): Actually, not a bad idea if a little cumbersome. Also known as evaporative coolers and in use today in place of air conditioning in less humid parts of the world, these attached to one of the car's windows. There was a reservoir for water, a wick to soak it up, and air movement from driving forced cooled air into the interior. (A lot like the Peltier auto A/C designed by teenagers.)
  6. Steam Pressure Cooker (?): Mounted to the rear bumper and using exhaust gases for heat, you were probably lucky if the pot exploded and launched your monoxide-flavored meal all over the scenic drive (that way you didn't have to eat it).
  7. Steering Wheel Watch (1958): Guess it was just easier to mount in the dash -- the driver wouldn't get dizzy trying to tell what time it was.
  8. Trafficators (Before flashing turn signals): These were those cute little flags that popped up to tell other drivers which way you were planning to turn. Their descendants are those balls on the antennae.
  9. Swivel Seats (1950s-1960s): Not a bad idea. Now make them self-contained and able to leave the auto and voila -- you never have to walk again!
  10. Talking Car (1980s): Great until car makers realized people didn't want cars talking to them. ("Pardon me, but you're about to hit that tree. Oak, I think.")

All this and still no flying car.


Categories: [ general | news ]

Jul 01 2008, 12:31:00 PM EDT



Tuesday June 24, 2008

Faster than a speeding bullet: Topping the Top500 list

The new Top500 list is out!: The top ten are:

Rank  System              Where                         Max. sustained tflops  Peak tflops  Cores    Power (kW)
01    Roadrunner          LANL                          1026                   1375         122,400  2345
02    Blue Gene/L         LLNL                          478                    596          212,992  2329
03    Blue Gene/P         Argonne NL                    450                    557          163,840  1260
04    Sun Ranger          TACC/UT                       326                    503          62,976   2000
05    Cray Jaguar         Oak Ridge NL                  205                    260          30,976   1580
06    JUGENE Blue Gene/P  FZJ                           180                    222          65,536   504
07    SGI Encanto         NMCAC                         133                    172          14,336   861
08    HP Eka              TATA SONS                     132                    172          14,384   786
09    Blue Gene/P         IDRIS                         112                    139          40,960   315
10    SGI ICE             Total Exploration Production  106                    122          10,240   442


Other ways to parse this list would be, say, the maximum-sustained-tflops-per-core:

  1. Total Exploration Production SGI ICE
  2. NMCAC SGI Encanto
  3. TATA SONS HP Eka
  4. LANL Roadrunner
  5. Cray Jaguar

the peak-tflops-per-core:

  1. Tie: SGI ICE, SGI Encanto, HP Eka
  4. LANL Roadrunner
  5. Cray Jaguar

the kW-per-core:

  1. Tie: All three Blue Gene/Ps
  4. LLNL Blue Gene/L
  5. LANL Roadrunner

the maximum-sustained-tflops-per-kW:

  1. LANL Roadrunner
  2. Argonne Blue Gene/P
  3. Tie: FZJ and IDRIS Blue Gene/Ps
  5. TXP SGI ICE

the peak-tflops-per-kW:

  1. LANL Roadrunner
  2. Argonne Blue Gene/P
  3. FZJ Blue Gene/P
  4. IDRIS Blue Gene/P
  5. LLNL Blue Gene/L

Notice that Roadrunner shows up in the top five no matter how you slice the data.
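For anyone who wants to re-slice the list themselves, here is a minimal sketch in C that recomputes a few of the ratios from the top-ten table above. The row values are the rounded figures printed in the table, so near-ties can fall either way; the names system_row, top10, and best_by are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the top-ten table (rounded values as printed in the post). */
struct system_row {
    const char *name;
    double max_tflops;   /* maximum sustained tflops */
    double peak_tflops;
    double cores;
    double power_kw;
};

static const struct system_row top10[] = {
    {"LANL Roadrunner",        1026, 1375, 122400, 2345},
    {"LLNL Blue Gene/L",        478,  596, 212992, 2329},
    {"Argonne Blue Gene/P",     450,  557, 163840, 1260},
    {"TACC Sun Ranger",         326,  503,  62976, 2000},
    {"Oak Ridge Cray Jaguar",   205,  260,  30976, 1580},
    {"FZJ JUGENE Blue Gene/P",  180,  222,  65536,  504},
    {"NMCAC SGI Encanto",       133,  172,  14336,  861},
    {"TATA SONS HP Eka",        132,  172,  14384,  786},
    {"IDRIS Blue Gene/P",       112,  139,  40960,  315},
    {"TXP SGI ICE",             106,  122,  10240,  442},
};

#define TOP10_COUNT (sizeof top10 / sizeof top10[0])

/* Metrics where a larger value is better. */
double sustained_per_core(const struct system_row *s) { return s->max_tflops / s->cores; }
double sustained_per_kw(const struct system_row *s)   { return s->max_tflops / s->power_kw; }
double peak_per_kw(const struct system_row *s)        { return s->peak_tflops / s->power_kw; }

/* Index into top10 of the row with the largest value of the given metric. */
size_t best_by(double (*metric)(const struct system_row *))
{
    size_t best = 0;

    assert(metric != NULL);
    for (size_t i = 1; i < TOP10_COUNT; i++)
        if (metric(&top10[i]) > metric(&top10[best]))
            best = i;
    return best;
}
```

Calling best_by(sustained_per_kw) or best_by(peak_per_kw) picks out the Roadrunner row, and best_by(sustained_per_core) picks out the SGI ICE row, in line with the lists above.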

A supercomputer you can run Windows apps on: The Akka system, installed at the High Performance Computing Center North in Sweden, is a new HPC cluster that comprises a total of 672 nodes, each loaded with two low-power Intel Xeon quad-core L5420 CPUs/16GB of RAM (total of 5376 cores and 10.7TB RAM). With a theoretical peak performance of 53.8tflops, Akka is ranked 39 on the June 2008 Top 500 list.

What's important about this supercomputer, though, is that a small part of the cluster will use IBM Cell/B.E. and Power chips (mostly for the development of parallel algorithms).

Unlike supercomputer designs that connect clusters of processors PC-style, this system requires less electricity to run and cool thanks to its compact configuration based on IBM's BladeCenter technology -- the system can perform about 266 million calculations per second per Watt (266mflops/W) based on sustained performance.
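The Akka core count and peak figure quoted above can be checked with a couple of one-line formulas. This is only a sketch: the 2.5GHz clock and the four floating-point operations per cycle per core are assumptions about the L5420 that the post itself does not state, and the function names are invented here.

```c
#include <assert.h>

/* Total cores in a cluster of identical nodes. */
unsigned total_cores(unsigned nodes, unsigned cpus_per_node, unsigned cores_per_cpu)
{
    return nodes * cpus_per_node * cores_per_cpu;
}

/* Theoretical peak in tflops: cores x clock (GHz) x flops per cycle gives gflops. */
double peak_tflops(unsigned cores, double clock_ghz, unsigned flops_per_cycle)
{
    assert(clock_ghz > 0.0);
    return cores * clock_ghz * flops_per_cycle / 1000.0;
}
```

Under those assumptions, total_cores(672, 2, 4) gives the 5376 cores quoted, and peak_tflops(5376, 2.5, 4) comes to 53.76, which rounds to the 53.8tflops in the post.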


Categories: [ Cell | general | news ]

Jun 24 2008, 12:17:00 PM EDT


Tuesday June 24, 2008

Games people play: 40 PS3s to solve climate challenges

40-cluster PS3 system models magnetosphere/solar wind interaction: UNH researchers have cobbled together a 40-PS3 cluster so they can solve a climate problem -- the interaction between Earth's magnetic field and the solar wind. Until now, the UNH EOS (Earth, Oceans, and Space) Space Science Center has been running its Open Geospace General Circulation Model (a magneto-hydro-dynamic simulation of the previously mentioned interaction) on a US$750,000 distributed system. Now they can run it on a US$16,000 cluster of PS3s. The heavy investment came in the two-plus months of tweaking the system to accommodate an open-source operating system and in rewriting the simulation program to run on the system.


Categories: [ Cell | news ]

Jun 24 2008, 11:56:00 AM EDT


Tuesday June 24, 2008

Trends and tradeoffs: Why MEMS may get cheaper

No moving parts: According to MEMS accelerometer chip manufacturer MEMSIC, accelerometer chips becoming standard equipment on consumer devices (enabling volume production) is just one of the reasons MEMS device prices are dropping. The other is that new no-moving-parts designs (like MEMSIC's) eliminate one of the most expensive components of making MEMS: the moving parts. In fact, according to MEMSIC, the company can run its design on a standard CMOS assembly line. No moving parts also means a lower failure rate off the line and a wider range of operating conditions for the product, since there are no moving bits that shock (what accelerometers measure) can damage. An interesting look into the manufacturing process.

"Embedded Everywhere" motto for Freescale future: Freescale CTO Lisa Su thinks embedded processors with on-chip sensors will rule the IC business in the future of 2015, redefining the chip industry away from PC-oriented makers towards embedded providers. Already, Su said, "there are about 150 embedded microprocessors around the home ... [plus] another 40 or 50 in your car ... We see that trend accelerating and we predict that there will be over 1,000 embedded devices per person by 2015." Su noted that the big changes in embedded processing won't be from the hardware or software but from the way people's lifestyles will change to accommodate the ubiquitous invisible processor. Su also forecast the three most important trends:

  • green energy (smart consumption provided by embedded devices and monitoring capabilities),
  • health-related electronics (more monitoring, this time of your biological systems), and
  • ubiquitous networking (or "The Net Effect").

Categories: [ general | news ]

Jun 24 2008, 11:54:00 AM EDT


Tuesday June 24, 2008

Clear conference calendars: Register for Cell/B.E. apps workshop

Workshop coming in July at GA Tech: This two-day workshop (July 10-11, 2008) will cover everything from ray tracing to Roadrunner, and from applications on low-cost Cell/B.E. clusters to computer vision and digital imaging. It will address programmability issues like language and compiler, programming models and common runtime, and ISV programmability framework and tooling. There is no charge to attend; attendees must register by June 30, 2008.


Categories: [ Cell | events ]

Jun 24 2008, 11:52:00 AM EDT


Tuesday June 24, 2008

It came from the Lab: Lowest-power chip yet

Picowatt chip sets low-power record: UMich developers think that their 30-picowatt Phoenix processor, designed for medical implants and announced at the IEEE Symposium on VLSI Circuits in Honolulu, could be the lowest-power processor yet developed. It is (chip and thin-film battery) about 1,000 times smaller than the technologies being used for implants today and since it is only in operation part of the time (and consumes 30,000 times less power when in sleep mode), the developers predict it could last up to three years -- by contrast, a watch battery would power the Phoenix for about 263 years.


Categories: [ general | news ]

Jun 24 2008, 11:51:00 AM EDT


Tuesday June 24, 2008

Product watch: Toshiba SpursEngine laptops debut

Qosmio G55 laptop goes Cell/B.E. quad core: What's most exciting about the Toshiba Qosmio G55 laptop set to debut sometime in July 2008? Probably the fact it is a quad-core portable with a Cell/B.E.-based processor and all that entails -- performance and relatively low-power operations. The four processing elements inside the SpursEngine sport a performance level of 48gflops (look out high-def video-stream coding!), a clock frequency of 1.5GHz, and a power envelope of 10/20W (typical mobile processors run at about 35W).


Categories: [ Cell | news ]

Jun 24 2008, 11:49:00 AM EDT


Tuesday June 24, 2008

Conventional Wisdom alert: Glass is a solid, right?

See-through wonder straddles the line between solid and other: There's the macro-obvious: Glass is a solid. (Toss a rock through it and you see this demonstrated.) Then there's the cool answer: Glass is in a transitive state; our observational lifetimes are too short to notice. So what's the truth?

Bizarrely, probably somewhere in the middle. The properties of glass allow it to behave at times like both a solid and a liquid. Glass exists in a "jammed" state of matter between liquid and solid that moves slowly. Atoms in glass try to drip down, but because their routes are blocked by neighbors, it acts somewhat like a solid. Glass is trying to express a crystalline lattice structure like so many solids, but its atoms get stuck in an almost random arrangement -- an icosahedron (like a 3D pentagon). You cannot fill 3D space with icosahedrons (just like you can't tile a floor with pentagonal shapes) so it will not form a lattice.

Some scientists think that glass is trending toward the crystalline and that eventually (maybe billions of years) it will get there and become a solid. Some don't.

Aside: Normally metals form crystals when they cool; this causes weakness along the crystal boundaries (metal fatigue). When metals are made to cool with icosahedrons (as metallic glasses), they don't crystallize and are not subject to the same fatigues.


Categories: [ general | news ]

Jun 24 2008, 11:48:00 AM EDT



Friday June 20, 2008

From the Cell/B.E. Architecture forum

Forum watch: That mean old libspe2 DMA transfer
From May 1 to June 1, 2008: Is context switching supported in libspe2? Help me install the SDK on Ubuntu! I've got a problem with DMA transfer in libspe2. Playing FAST and LOOSE (and TURBO?). I've got three related questions: Why are the fma, lqd pairs not dual issued in timing analysis? | How do I view the pipe trace output in Debug Control? | Why is the workload distribution so uneven for the two pipelines? It's license-refresh time. Plus hot topics roundup and forum statistics!

This blog-based column looks at some of the more interesting problems and challenges posed recently in the Cell Broadband Engine Architecture forum.


Originally, paulsimon wanted to know about expected behavior when starting more pthreads than SPEs: I have been using libspe1 and am now learning libspe2. Running the example code from program "simple" from the SDK 2.1, I don't understand the results I'm getting. I'm using SDK 1.1 on a PS3, so I only have 6 SPEs, not the 8 which are assumed in the example. The program just starts 8 threads, each is supposed to print "Hello Cell...". When using more than 6 SPEs, the program usually hangs, but occasionally executes completely. It doesn't block or return errors for any of spe_context_create, spe_program_load, pthread_create, or spe_context_run. In ppu-gdb, it reliably hangs on the join of the 7th thread. This behavior is quite different from the "simple" program that used libspe1.

Is this the expected behavior when starting more pthreads than SPEs? Have others seen this when running the libspe2 example code on the PS3? I have seen the advice not to have more SPE threads than actual SPEs, but I understood that this was just for efficiency reasons. But it should still execute, shouldn't it? What is the consideration or synchronization that I'm overlooking here? Or maybe this is something to do with me using libspe2 in SDK1.1?

The IBM SDK Service Administrator answered with: In 1.1 the threads would block, so that if 6 are in use and you start a 7th, it will block until one of those 6 exits. 2.1 has the pre-emptive scheduler so it can time-slice for realtime threads. Also, there are some known problems with pre-emption in SDK 2.1 that will be fixed in the next release, so it's possible you're hitting those and that's why the behavior is not consistent.

Now there's more.

[lowellns]: Is this context switching supported by the libspe2? Or the spufs interface? Perhaps a scenario will help.

If I have 6 SPEs and 8 pthreads that do this:

spe_context_create()
spe_program_load()
spe_context_run()

what will happen? Won't two threads wait until two SPEs return, or are you telling me that there will be some context switching?

And possibly related: How would setting pthread schedule policies affect spe threads once they are running?

[SDK Service Administrator]: I don't believe this is supported on the PS3 since it's kernel dependent, so in your case the 2 extra threads will wait for two of the others to exit.

Editor: You might also want to see "Changes in libspe: How libspe2 affects Cell Broadband Engine programming" (developerWorks, July 2007).
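Editor: The blocking behavior described in this thread (extra threads wait until an SPE is released) can be pictured with a toy model. The sketch below is plain C, not libspe2 code: it assumes six SPEs, no pre-emption, and that every thread is ready at tick 0; the function name and the tick-based run times are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPES 6   /* SPEs available on the PS3, per the forum post */

/*
 * Toy model of blocking (non-preemptive) SPE scheduling. Thread i needs an
 * SPE for run_time[i] ticks; start[i] receives the tick at which it begins.
 * Threads are served in order, each taking the SPE that frees up first.
 */
void model_blocking_schedule(const unsigned *run_time, unsigned *start, size_t nthreads)
{
    unsigned free_at[NUM_SPES] = {0};   /* tick at which each SPE becomes idle */

    assert(run_time != NULL && start != NULL);
    for (size_t i = 0; i < nthreads; i++) {
        size_t spe = 0;
        for (size_t s = 1; s < NUM_SPES; s++)   /* find the earliest-free SPE */
            if (free_at[s] < free_at[spe])
                spe = s;
        start[i] = free_at[spe];                /* waits here if all SPEs are busy */
        free_at[spe] += run_time[i];
    }
}
```

With eight threads, the first six start at tick 0 and the seventh and eighth start only when the shortest-running of the earlier threads finish, which is the "2 extra threads will wait" case in the answer above.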


Can you answer this one? Why are the fma (floating multiply and add) and lqd (load quad-word, d form) pairs not dual issued in timing analysis? Answer A.

iamrohitbanga wants to know if there are resources for getting the SDK going on Debian: I need help to install Cell SDK 2.0 on Ubuntu Hardy Heron. The reason is that Fedora would not support my wifi card. I tried to convert the rpm packages using alien, but have not succeeded so far.

[davidhi] (from 2006): I wanted to get the CBE SDK up and running on my laptop, so I thought I'd share my experiences getting the CBE simulator running on Debian. I went ahead and made debs out of all the packages and did away with install.sh (the postinst scripts do the appropriate work). I ran into two basic problems, one having to do with a Makefile and one having to do with tcl/tk compile-time options.

At first the command PATH=$SCE_INST_DIR/ppu/bin:$PATH make -C src was breaking under Debian. After a bit of digging I found out that the Makefile doesn't work properly under make 3.81beta4 (which Debian uses), but it still works under 3.80 (which FC4 uses). I just ended up grabbing a debian package of 3.80-9 and using that for the time being. The breakage has something to do with the $$(@F) sysV make compatibility feature.

After I got everything built/installed, the simulator would fail like so:

Segmentation fault on address 0x2c6cc528
Restored previous handler.
../run_gui: line 30: 6483 Segmentation fault $TOP/../systemsim -cell $* -g

After more digging around (noticing that the ldd output for systemsim-cell had libpthread on Debian but not on FC4), I found that this was due to Debian's tk8.4 and tcl8.4 being built with --enable-threads. I don't know why this causes the simulator to fail, but I built new tk and tcl packages without this flag and now everything works peachy. This will be an important thing for IBM to fix, however, because the maintainer of the FC tcl and tk packages has said they are now being built with --enable-threads as of last week.

[davidhi] (later): Well, just one day after I finished making Debian packages of all the tools, they released new versions of almost everything. I went ahead and built new packages and tried them all out and I didn't encounter any new problems. The Makefile structure in cell-sdk-lib-samples-1.0.1 still fails under make 3.81beta4 and the simulator 1.0.1 still crashes when tcl8.4 and tk8.4 are built with --enable-threads.

So again, everything works great as long as I use the slightly older make 3.80 to build the stuff in cell-sdk-lib-samples and rebuild my own tcl and tk packages without --enable-threads.

For fun, I also tried to see if I could bootstrap the ppc64 version of Debian in the simulator with cdebootstrap/debootstrap as my system image (I mean, why not remove FC4 all the way). It works fine, although it takes a long time to boot because by default it'll try to add swap, fsck, start up an MTA, cron, atd, inetd, syslogd, klogd, etc. So I disabled most of that. Of course, you also need to change inittab so it doesn't try to spawn getty on the first ttys. Also, I needed to make /bin/sh a statically linked shell (I used zsh) rather than the default ppc64 bash. Otherwise, it'd crash when trying to interpret the scripts called by init. The error:

malloc: ../bash/variables.c:1854: assertion botched 
malloc: block on free list clobbered

[davidhi] (later): Some people have asked me how to rebuild the tk and tcl packages without the --enable-threads flag and I thought I'd post some directions in case other people find it useful. First, you can use my packages if you'd like. They're i386 arch and based off of the testing version of tcl and tk.

If you are using a different architecture (like x86_64) or flavor (stable) or you'd just like to build them yourself, here's a quick tutorial.

You'll need a couple of tools (fakeroot and debhelper): apt-get install fakeroot debhelper. Run apt-get source tk8.4 and apt-get source tcl8.4 in some directory, let's call it "pkg-tmp."

For me, this unpacks some source in the tk8.4-8.4.12 subdirectory and in the tcl8.4-8.4.12 subdirectory. Their names might be slightly different depending on whether you're running stable, testing or unstable.

In the tcl8.4-8.4.12/debian directory, edit the "rules" file and remove --enable-threads from the ./configure line. Then, from the tcl8.4-8.4.12 directory run the command fakeroot debian/rules binary and it will build new debs in the top-level "pkg-tmp" directory.

tcl8.4-doc_8.4.12-1_all.deb
tcl8.4_8.4.12-1_i386.deb
tcl8.4-dev_8.4.12-1_i386.deb

It rebuilds the documentation too, but you only need to reinstall the main and -dev libraries. Use dpkg -i tcl8.4_8.4.12-1_i386.deb tcl8.4-dev_8.4.12-1_i386.deb. Then do the same process for tk8.4-8.4.12 (edit debian/rules, run fakeroot debian/rules binary and install).

There is some order you must do these in (one package will build with threads if the other is installed with threads no matter what configure says): Based on my experience, I'm almost certain tcl needs to be built and installed with no threads before tk.

Run

ldd /usr/lib/libtcl8.4.so
ldd /usr/lib/libtk8.4.so

and make sure that there is no line like libpthread.so.0 => /lib/tls/libpthread.so.0 in the output (that would imply that they were built with threads).


Can you answer this one? How do I view the pipe trace output in Debug Controls SPU_DISPLAY_EXEC & SPU_DISPLAY_ISSUE? Answer B.

kabe wants to know why he/she's having a problem with DMA transfer in libspe2: Hey! I have a problem with the DMA transfer. At first it all worked fine. I used really simple code and libspe.h. Then I noticed that opening an SPE thread is quite expensive, so I wanted to open a thread and keep it. I found some bits about using pthreads with the SPEs, and the program I'm trying to convert to work on the Cell already uses pthreads on other architectures to use multiple cores. So I thought that would be a good idea.

But now I have switched to libspe2 and the execution of the code just stops in the SPE-code when I use mfc_get. In one instance without changing the code in that area it actually stopped after the mfc_write_tag_mask, so it looks a bit like undefined behaviour.

PPE-Code I'm using:

void *evaluateOnSpu(void *data)
{
    int retval;
    unsigned int entry_point = SPE_DEFAULT_ENTRY; // Required for continuing execution;
                                                  // SPE_DEFAULT_ENTRY is the standard starting offset.
    spe_context_ptr_t my_context;

    // Create the SPE context
    my_context = spe_context_create(SPE_EVENTS_ENABLE|SPE_MAP_PS, NULL);

    // Load the embedded code into this context
    spe_program_load(my_context, &spuevaluate_handle);

    //evaluateInfo* info = AB::BSplineSurface::mInfo;

    // Run the SPE program until completion
    do {
        retval = spe_context_run(my_context, &entry_point, 0, &info, NULL, NULL);
    } while (retval > 0); // Run until exit or error

    pthread_exit(NULL);
}

void AB::BSplineSurface::testEvaluateThread()
{
    std::cout << "test" << std::endl;
    info = mInfo;
    pthread_t my_thread;
    int retval;

    // Create thread
    retval = pthread_create(
        &my_thread,    // Thread object
        NULL,          // Thread attributes
        evaluateOnSpu, // Thread function
        NULL           // Thread argument
    );

    // Check for thread creation errors
    if (retval) {
        fprintf(stderr, "Error creating thread! Exit code is: %d\n", retval);
        exit(1);
    }

    // Wait for thread completion
    retval = pthread_join(my_thread, NULL);
    // Check for thread joining errors
    if (retval) {
        fprintf(stderr, "Error joining thread! Exit code is: %d\n", retval);
        exit(1);
    }
}

SPE-Code:

#ifndef EVALUATEINFO
#define EVALUATEINFO
typedef struct {
    AB::VecPack2 param;
    Bool4 mask;
    AB::VecPack3 result[3];
    Float4 cDBItemList83;
    unsigned int cDBOffset[4];
    unsigned int count[2];
    unsigned int degree[2];
    Float4 controlPointList31;
} evaluateInfo;
#endif


/**
 * \brief This function is supposed to realize the evaluation on an SPU
 */
int main(unsigned long long speid __attribute__ ((unused)),
         unsigned long long argp,
         unsigned long long envp __attribute__ ((unused)))
{
    evaluateInfo pd __attribute__((aligned(sizeof(evaluateInfo))));
    int tag_id = 0;

    // READ DATA IN
    // Initiate copy
    //program_data_ea >>= 32;
    mfc_get(&pd, argp, sizeof(pd), tag_id, 0, 0);
    // ************* Normally it just stops here ******************
    // Wait for completion
    mfc_write_tag_mask(1<<tag_id);
    mfc_read_tag_status_any();
http://...

I confess I don't really understand everything I'm doing here. Most of the code is from examples found around the net. Did I perhaps only move half-heartedly to libspe2? Did I forget something?

I found a thread with a similar problem here, but it was still with libspe1, I think. And it wasn't solved.

P.S. I found out that it works perfectly if I transfer a simple char array of length 128. Is there a problem with transferring structs? Am I doing it wrong? Can I use something else?

On a sidenote: I've read that spe_context_run should be pretty slow, too. Is there a way to keep the SPU busy while transferring new data? Is that what mailboxes are for? Do I have to use completely different code then anyway?

[jmt_dh1]: Well I'll pick up on one thing in the small extracts of code you posted:

evaluateInfo pd __attribute__((aligned(sizeof(evaluateInfo))));

You should be aligning it to 16 bytes (or larger). Aligning it to a multiple of the structure size is not conceptually what you should be doing to follow the DMA alignment rules (though you may of course get lucky if the size is a multiple of 16).
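Editor: To make the alignment advice concrete, here is a minimal sketch using the GCC aligned attribute with a fixed 16-byte boundary instead of aligned(sizeof(...)). GCC requires the alignment to be a power of two, which a structure size need not be. The structure dma_block_t is a hypothetical stand-in for the poster's evaluateInfo, and is_dma_aligned is a helper invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_ALIGN 16   /* minimum boundary for transfers of 16 bytes or more */

/* Hypothetical stand-in for evaluateInfo: 20 bytes of payload. */
typedef struct {
    float        params[4];
    unsigned int count;
} __attribute__((aligned(DMA_ALIGN))) dma_block_t;

/* The attribute also pads the type, so its size is a multiple of 16. */
static_assert(sizeof(dma_block_t) % DMA_ALIGN == 0,
              "dma_block_t must be padded to a multiple of 16 bytes");

/* Nonzero if p sits on a DMA_ALIGN-byte boundary. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}
```

Any variable of this type, on either the PPE or the SPE side, then starts on a 16-byte boundary and has a size that can be moved in whole 16-byte units.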

"On a sidenote: I've read that spe_context_run should be pretty slow, too. Is there a way to keep the SPU busy while transferring new data? Is that what mailboxes are for? Do I have to use completely different code then anyway?" That's a lot of questions, but have a read about double buffering...
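Editor: For readers new to the term, double buffering means fetching the next chunk of data into one buffer while computing on the other. The sketch below shows only the buffer ping-pong structure in plain C: the fetch helper is a synchronous memcpy standing in for an asynchronous mfc_get, so nothing actually overlaps here, and all names and the tiny chunk size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4   /* elements per transfer; tiny for illustration */

/* Stand-in for an asynchronous DMA get: here just a synchronous copy. */
static void fetch(float *local, const float *remote, size_t n)
{
    memcpy(local, remote, n * sizeof *local);
}

/* Sum n floats (n must be a multiple of CHUNK) using two alternating buffers. */
float sum_double_buffered(const float *remote, size_t n)
{
    float buf[2][CHUNK];
    float total = 0.0f;
    size_t nchunks = n / CHUNK;
    int cur = 0;

    assert(n % CHUNK == 0);
    if (nchunks == 0)
        return total;

    fetch(buf[cur], remote, CHUNK);              /* prime the first buffer */
    for (size_t c = 0; c < nchunks; c++) {
        int next = cur ^ 1;
        if (c + 1 < nchunks)                     /* request the next chunk... */
            fetch(buf[next], remote + (c + 1) * CHUNK, CHUNK);
        /* On an SPE, this is where you would wait on the tag for buf[cur]. */
        for (size_t i = 0; i < CHUNK; i++)       /* ...then work on this one */
            total += buf[cur][i];
        cur = next;
    }
    return total;
}
```

On real hardware each buffer would get its own tag ID, so the wait before the compute loop blocks only on the buffer about to be used.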

[NotZed]: As jmt said, alignment is important. The same goes for the DMA size: it must be 1, 2, 4, or 8 bytes, or a multiple of 16. I always seem to hit that problem, as I only rarely dabble with Cell coding and forget it by the next time I get back to it. My own side note: Whenever I tried something like the loop:

// Run the SPE program until completion
do {
    retval = spe_context_run(my_context, &entry_point, 0, &info, NULL, NULL);
} while (retval > 0); // Run until exit or error

It would always seem to "lose the executable" at some point after a second or so and crash -- It seems you need to re-load the programme every time. I'm wondering if this is expected or is it a bug or just a coincidence?

[jmt_dh1]: You need to re-initialize the entry point variable every time. So many things to remember :)

[kabe]: Thanks for your comments so far. Yes, I constructed the struct so it is aligned to a multiple of 16. Otherwise it won't even compile. I'm not that far that I could run my SPU-programme for a longer time, so I don't know if I have to reload the code.

I tried another route now and it kinda works... I'm using the mailbox to tell the SPU where the data it needs starts, then I load for example an int, then I load a float from the starting address + sizeof(int) etc. Doesn't look that elegant.

Another point is, that I have to load the data in a special order. There is a bigger array in my data set of 128 Float4. When I first load two other variables and then the big array, it crashes. When I first load the big array and then the other data, it works.

At another point when I load an int[4] and two int[2] it crashes and when I load one int[8] and divide it by hand into three different arrays it works. This tells me that there has to be something seriously wrong. Can't you do many mfc_get consecutively? Even when I use mfc_write_tag_mask and mfc_read_tag_status_all after every mfc_get it hangs. What can be the reason for this?

[jmt_dh1]: Sounds extremely odd. Can you attach some reproducible sample code?

[kabe]: Sadly I am not allowed to post the actual code; I even had to sign a form. But I will try to reproduce the problem with standard data types...

Meanwhile I noticed that the ppu-gdb wasn't installed and that it could give info about the DMA. So I installed it, checked it... and didn't understand anything. This is the output:

(gdb) info spu dma
Tag-Group Status 0x80000000
Tag-Group Mask 0x80000000 ('all' query pending)
Stall-and-Notify 0x00000000
Atomic Cmd Status 0x00000000

Opcode  Tag  TId  RId  EA                  LSA      Size     LstAddr  LstSize  E
get     30   0    0    0x00000000ffa5de68  0x3ed50  0x00010                    *
get     31   0    0    0x0000000010c2a0c0  0x3f040  0x00000
get     31   0    0    0x0000000010c29ea0  0x3fbb0  0x00000
get     30   0    0    0x00000000ffa5df08  0x3ed70  0x00020                    *
get     0    0    0    0xd0000000002a6900  0x00e80  0x00000
putl    0    0    0    0xd0000000002f5000  0x00000  0x00000  0x00bf0  0x00008
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000
        0    0    0    0                   0x00000  0x00000

In another post I read that the * is a sign for trouble. Can anyone help me understand this?

[jeshua]: I don't know what a lot of this means, but it looks to me like the two entries marked with a * both have bad alignment on the EA (Effective Address). The LSAs (Local Store Addresses) look properly aligned.

[kabe]: That comment pointed me in the right direction, thank you! All I had really forgotten was to align some variables in the PPE code. Aligning the variables in the struct leads to new problems, but I think I'll find the answer to that.
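
The alignment problem in this thread can be checked mechanically. What follows is a minimal sketch, assuming the usual Cell/B.E. MFC rules: a single DMA transfer is 1, 2, 4, 8, or a multiple of 16 bytes (up to 16 KB), and a transfer of 16 bytes or more needs both the effective address and the local store address on a 16-byte boundary. The helper names (is_aligned16, dma_size_ok, dma_request_ok) and the shared_buf variable are hypothetical, not SDK names, and the aligned attribute is the GCC-style spelling.

```c
#include <stddef.h>
#include <stdint.h>

/* PPE-side buffer forced onto a 16-byte boundary (128 four-float vectors). */
float shared_buf[128 * 4] __attribute__((aligned(16)));

/* Nonzero when addr sits on a 16-byte (quadword) boundary. */
int is_aligned16(uint64_t addr)
{
    return (addr & 0xF) == 0;
}

/* Nonzero when size is a legal single-transfer size. This sketch rejects 0. */
int dma_size_ok(size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return 1;
    return size != 0 && size % 16 == 0 && size <= 16384;
}

/* Nonzero when a transfer of `size` bytes between ea and lsa obeys the rules. */
int dma_request_ok(uint64_t ea, uint32_t lsa, size_t size)
{
    if (!dma_size_ok(size))
        return 0;
    if (size >= 16)
        return is_aligned16(ea) && is_aligned16(lsa);
    /* Small transfers: naturally aligned, same offset within the quadword. */
    return ea % size == 0 && lsa % size == 0 && (ea & 0xF) == (lsa & 0xF);
}
```

Fed the two starred entries from the gdb dump, dma_request_ok rejects both: the EAs 0x00000000ffa5de68 and 0x00000000ffa5df08 end in 8 rather than 0, while their LSAs and sizes are fine.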


Can you answer this one? Why is my workload distribution so uneven for the two pipelines? Answer C.

b.lix wants to know the difference between the LOOSE and TURBO modes on the simulator: I'm working with the Cell SDK on two different computers. When I switch the Cell System Simulator into "Fast Mode" I get different messages from the simulator on the two machines. On one I get "Simulator now in TURBO mode" and on the other I get "Simulator now in LOOSE mode".

What is the difference between these modes? And how do I get the second system also into TURBO mode?

[mkistler]: TURBO mode is only supported on 64-bit systems. It is a special flavor of FAST mode that uses just-in-time translation of PPC instructions into instructions of the host machine. Since Cell is a 64-bit architecture, it was not practical to support TURBO mode on 32-bit host systems.

[b.lix]: I think there is also a FAST mode on 32-bit systems. I want to know why I get "Simulator now in LOOSE mode" and not "Simulator now in FAST mode". What is this LOOSE mode?

[mkistler]: LOOSE mode is a minor tweak on FAST mode that executes "blocks" of instructions on one CPU before switching to simulating instructions on the other CPU. This is far more efficient but can result in minor timing differences in interrupt delivery and/or synchronization mechanisms.


Threads worth pursuing
A fresh new license: The original license included in the Extras package allows evaluation until 2008-05-31. Users who wish to evaluate the software for longer should download and install the license refresh. When installed, this refresh will supersede the prior license and extend the evaluation period until 2008-12-31. All other aspects of the prior license will remain unchanged.

Hot topics roundup


Forum statistics for May 2008 Threads: 116 | Participants: 5,016 | Replies: 480 | % threads answered: 22%





Categories : [   Cell  |  forums  ]

Jun 20 2008, 02:41:00 PM EDT Permalink



Tuesday June 17, 2008

Clear conference calendars: Register for Cell/B.E. apps workshop

Workshop coming in July at GA Tech: This two-day workshop (July 10-11, 2008 | agenda) will cover topics ranging from ray tracing to Roadrunner, and from applications on low-cost Cell/B.E. clusters to computer vision and digital imaging. It will address programmability issues such as languages and compilers, programming models and common runtimes, and ISV programmability frameworks and tooling. There is no charge to attend; attendees must register by June 30, 2008.





Categories : [   Cell  |  events  ]

Jun 17 2008, 02:10:00 PM EDT Permalink


Tuesday June 17, 2008

Design on a dime: Beyond 45nm, is multithreading dead?

From DAC: Is multithreading really the best way to exploit multicore systems effectively?: A troubling question came up at the recent 45th Design Automation Conference: "Is multithreading really the best way to exploit multicore systems effectively?" It reflects the effort EDA vendors have been putting into adding multithreading capabilities to their tools to help with multicore design. The problem is that at the 45nm node, more designs climb over the 100 million-gate mark and break current IC CAD tools. Parallel processing has traditionally relied on threads, but threads tend to bottom out at about four processors.

Read the detailed report to see what some of the best thinkers in the industry think about this question, including Gary Smith of Gary Smith EDA -- he thinks threads are dead: "It is a short-term solution to a long-term problem. Library- or model-based concurrency is the best midterm approach."





Categories : [   general  |  news  ]

Jun 17 2008, 12:34:00 PM EDT Permalink
