
The Core Problem of Computing

This note began as a review of the tech report “The Landscape of Parallel Computing Research: A View from Berkeley”, with David Patterson among the many authors. It turned into a trip down memory lane. By the end, I agreed with them that Something Big is in play.
As ever, I urge you to read the paper. It has received quite a bit of attention, so there is also much surrounding material - the group’s blog is one good jumping-off point.

Patterson et al. tag Intel’s decision to cancel the follow-on to the Pentium 4 as historically significant for computing: the free lunch was over (Moore’s Law still delivers ever more transistors, but no longer ever-faster single cores doubling in performance every 18 months). Intel said, in essence, “Sorry, we’re not going to build you 10GHz processors - but we will let you have lots of 3GHz ones.” Welcome to the world of “multi-core” processor chips. (In fact, good luck finding a consumer PC with only one core.)

The biggest wall that CPU designers have crashed into was the power wall. But there is also the omnipresent “memory wall”: whereas it took a handful of cycles to get to main memory in a 1980 computer, it is now apt to take a few hundred cycles.

Slightly more obscure is the “ILP wall” (Instruction-Level Parallelism). The great hope - unrealized, it must be said - of architectures like IA64 (Itanium®) was that a compiler could take a good hard look at conventional code and find lots of down-in-the-trenches parallelism for the hardware to be getting on with. On a good day, ILP can keep a few functional units busy, but otherwise that well is dry. So to summarize, the Berkeleyites say:

Power Wall + Memory Wall + ILP Wall = Brick Wall

The “answer” from the vendors has been to go “multi-core”: 2 or 4 CPUs per chip is common; 8 CPUs per chip is available (Sun’s Niagara); and there is every reason to think 1000-ish CPU cores per chip is entirely feasible.

Too bad nobody knows what to do with all those CPUs. When they wheel out the argument, “Well, you could use one CPU to run your anti-virus program…”, you know they’re drowning, not waving.

You see, making many cores do something sensible is parallel computing and that is a paradoxical trade. Since at least the 1960s (and certainly since the “killer micros” of the ’80s), it has been comparatively easy to bundle up hardware that could theoretically compute at a whopping pace. At the same time, nobody - but nobody - has had a compelling story on how such “parallel computers” might be programmed successfully by mere mortals.

There is a small set of computing problems that are “embarrassingly parallel” (yes, that’s the technical term), and can be made to work on any parallel-computing platform (including screensavers scattered across the public Internet). Looking for extra-terrestrial signals (SETI@home) was one of those. And then there is a small set of clients who can afford to throw phalanx after phalanx of PhDs at getting their less-embarrassing problem running on the parallel box at hand. (A recent note about programming the new NVidia GPUs gives a notion of what it takes.)
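What makes a problem “embarrassing” is that the work items are completely independent: scatter inputs, gather outputs, no coordination in between. A minimal sketch, with a made-up work function standing in for real per-chunk analysis:

```python
# Sketch of an "embarrassingly parallel" job: each chunk is independent,
# so it can be farmed out to worker processes with no communication
# between tasks. analyze_chunk is a hypothetical stand-in for real work
# (e.g. scanning one slice of radio-telescope data).
from multiprocessing import Pool

def analyze_chunk(chunk):
    # Stand-in for real per-chunk computation.
    return sum(x * x for x in chunk)

def run_parallel(chunks, workers=4):
    with Pool(workers) as pool:
        return pool.map(analyze_chunk, chunks)  # no inter-task coordination

if __name__ == "__main__":
    chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    print(sum(run_parallel(chunks)))
```

The screensaver-on-the-Internet case is the same pattern with the “pool” spread across strangers’ PCs.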

The ultimate problem with parallel programs is that they are non-deterministic, whereas conventional (sequential) programs are more-or-less deterministic. Bugs pop up and then cannot be replicated. Deadlocks leap out of the blue (and cannot be replicated). It is a software verification nightmare.
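The classic shape of such a bug is the lost update: two threads both read a shared value, then both write back a stale result. A toy demonstration (the barrier is there only to force the unlucky interleaving on every run; in real code the interleaving - and hence the bug - varies from run to run, which is exactly the problem):

```python
# Toy illustration of why shared-state parallelism is a verification
# nightmare: two threads do an unsynchronized read-modify-write on a
# shared counter. The barrier deliberately forces the bad interleaving
# so the lost update happens every time; without it, it would happen
# only "sometimes", and never when you're watching.
import threading

counter = 0
barrier = threading.Barrier(2)

def unsafe_increment():
    global counter
    value = counter       # both threads read 0
    barrier.wait()        # force the unlucky interleaving
    counter = value + 1   # both write 1: one increment is lost

threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 1, not 2
```

Wrap the read-modify-write in a lock and the bug vanishes - along with, all too often, your speedup.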

Lest we forget, the only reason for parallel computing is to go faster. Another major practical problem for would-be parallel pioneers has been competition from what the Berkeley folks call La-Z-Boy performance engineering: write a conventional sequential program, sit in your La-Z-Boy chair for 18 months, and you will have doubled your performance. So, a fervent grad student cobbles together (say) twenty cheap microprocessors, works tirelessly for a couple of years to wire and program them up, thus demonstrating some fiendishly clever parallel innovation - and…? Everybody yawns - the La-Z-Boy guy (plus Intel’s billions spent on conventional silicon advances) comes out looking better.

In the ’80s, people really tried building parallel computers that stood a chance of being programmable. In the ’90s, they just gave up - it was simply too hard to impress (funding bodies).
It is utterly amazing to me (and the Berkeley people, too, I think) that vendors are flogging their multi-core chips and, actually, they don’t have the slightest idea how they might sensibly be programmed. Nobody does. The Berkeley paper is, therefore, mostly about setting a research agenda for a way forward.

Passing remark: I found it interesting that they dragged embedded computing into their discussion, as well as the obvious “high performance computing”. They argue that these two ends of the spectrum are becoming more alike, not less.

I’m not too happy with their choice of application “dwarfs” (essentially “kernels” of key applications) as their suggested benchmarking baseline - but I see their point. I am a hard-core believer in benchmarking with real programs, but they argue it isn’t really possible. With parallel computing, the programming model is still up for grabs, so expecting entire real applications to be recoded every time some professor has a neat idea just isn’t going to happen. But a “dwarf” might get stomped on (again).

The section on hardware - “optimal processor size”, that kind of thing… - is good clean fun. You’ll enjoy it.

Section 5, on “programming models”, makes clear the weediness of the field. The Berkeleyites suggest a line of attack I hadn’t heard of. How to express a parallel computation effectively is, in some sense, a Human-Computer Interaction (HCI) problem, and they suggest using some of the same techniques to decide what works and what doesn’t. It’s not language design by focus group, but close.

The use of “autotuners” (in Section 6) - weren’t these called “supercompilers” at one time? - is interesting. Instead of requiring a super-whizzy compiler, have an “ordinary” compiler but with some “autotuners” to have a go at critical code. For each routine in an important library, an autotuner tries lots of correct-but-possibly-ludicrous code generation techniques/combinations; you have some chance that one combo will be great, so that’s the one you install. This is better than having clever boffins sweat their guts out to tune the library for the new parallel kit before them.
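The idea is simple enough to sketch. Here is a toy autotuner in that spirit - several hand-written variants of one routine stand in for machine-generated code versions; the tuner checks they agree, times each on representative input, and “installs” whichever wins on this machine. The variants and names are illustrative, not anything from the Berkeley paper:

```python
# Toy autotuner: try several correct-but-possibly-ludicrous variants of
# a routine, time each on representative input, and install the fastest.
# Real autotuners (ATLAS, FFTW, ...) generate the variants mechanically;
# here three hand-written dot-product variants stand in for them.
import timeit

def dot_loop(xs, ys):
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total

def dot_zip(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def dot_map(xs, ys):
    return sum(map(lambda p: p[0] * p[1], zip(xs, ys)))

CANDIDATES = [dot_loop, dot_zip, dot_map]

def autotune(candidates, sample_args, repeats=5):
    # Sanity-check that all variants agree on the sample input...
    reference = candidates[0](*sample_args)
    assert all(abs(f(*sample_args) - reference) < 1e-6 for f in candidates)
    # ...then pick whichever runs fastest on this machine.
    return min(candidates,
               key=lambda f: timeit.timeit(lambda: f(*sample_args),
                                           number=repeats))

xs = [float(i) for i in range(10_000)]
dot = autotune(CANDIDATES, (xs, xs))  # the "installed" routine
```

The payoff is that the tuning re-runs automatically on each new machine, instead of a boffin re-doing it by hand.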

Finally, the Berkeley lads reveal that they and their friends have got together and built an FPGA-based box to practice on (called RAMP).

In summary, the present direction is ludicrous. Silicon vendors are gearing up to sell us multi-core silicon that no-one knows how to program. So, what’s going to happen - are we really at a major technology “inflection point”?

I would vote for ‘yes’ simply because it seems unlikely we will collectively decide, “OK, a couple of 3GHz CPUs are plenty for me”, forever. Another possibility is that the silicon budget will be redirected in some more effective direction - after all, the GHz-at-any-cost bandwagon must have led to some poor engineering decisions that need to be unwound. Or maybe the parallel-programming community will come through - fifth time lucky, perhaps?

Or maybe another Finnish kid is going to surprise us.

2 Responses to “The Core Problem of Computing”

  1. John Busco Says:

    I’m starting to see multi-threaded EDA applications being introduced, and they sound very attractive for 4-8 CPU servers.

    Do you think this trickle of applications will expand, or remain isolated? From a user perspective, they sound perfectly sensible. From a developer perspective, are they prohibitively complex? Are these techniques only applicable to a few applications in EDA?

  2. Will Partain Says:

    @John Busco:

    Disclaimer: I don’t know beans about what goes on inside various EDA tools, how "parallelizable" their code might be, or much else pertinent to the question :-)
    I’ll vote for "the trickle of applications will remain isolated and/or not do much (speedup-wise)". There may be one or two sweet-spot apps (you know, like the 32-way Niagara chips and web serving…), which would be cool. But as a general thing? — trickle.

    Chomping through simulations might be a sweet spot. But I’d worry about memory contention. Having 8 cores waiting around on memory instead of one core isn’t great progress.

    The point of the Berkeley paper, which I was trying to reflect, is that we will soon be offering the ordinary programmer (perhaps) 128 cores, saying "Want fast code? What more do you need?" Well, the same as in 1980 — a programming model that might possibly work. Unless you’ve got armloads of PhDs to throw at the programming job, we’re still in nowheres-ville.
