Beginner's Mind
Algorithms for scalable synchronization on shared-memory multiprocessors
From the abstract:
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes. We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors.
Quicksort in Haskell
Haskell is certainly concise; I’ll give it that.
qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)
But is it easy to form a mental model of how this will execute under optimization? A transliteration into an imperative language creates copies of xs on each execution, as well as traversing the list twice.
I can’t recall where I originally heard the idea, but one argument made for the success of C and C++ was their simplicity: specifically, that programmers could construct a reasonable mental model representing the execution of their program. The optimizer might speed things up, but without any radical changes to observable semantics. In the Haskell program above, I suppose Haskell’s lazy semantics and optimization mean the program can run efficiently, but don’t exactly know how!
James Hague, in On Being Sufficiently Smart, reaches a similar conclusion about Haskell:
Let’s say that a compiler can detect O(N^2) algorithms and replace them with O(N) equivalents. This is a classic example of being sufficiently smart. You can write code knowing that the compiler will transform and fix it for you. But what if the compiler isn’t perfect (and it clearly won’t be, as there aren’t O(N) versions all algorithms)? It will fix some parts of your code and leave others as-is. Now you run your program, and it’s slow, but why? You need insight into what’s going on behind the scenes to figure that out, and if you find the problem then you’ll have to manually recode that section to use a linear approach. Wouldn’t it be more transparent to simply use linear algorithms where possible in the first place, rather than having to second guess the system? …
The GHC developers know that laziness can be expensive (or at least unnecessary in many cases), so strictness analysis is done to try to convert lazy code to non-lazy code. If and when that’s successful, wonderful! Maybe some programs that would have previously blown-up now won’t. But this only works in some cases, so as a Haskell coder you’ve got to worry about the cases where it doesn’t happen. As much as I admire the Haskell language and the GHC implementation, I find it difficult to form a solid mental model of how Haskell code is executed, partially because that model can change drastically depending on what the compiler does. And that’s the price of being sufficiently smart.