I've just come across a verification tool for MPI programs. Even debugging sequential programs written in a language like C/C++ can be quite a daunting task because of the complexity involved: all those pointer operations and manual memory management, cryptic C++ template errors, the time-wasting edit-compile-test cycle of compiled languages, and so on. Dynamic languages are easier in my opinion because they are simpler, with automatic garbage collection, duck typing and interactive shells that let you test your programs and snippets much faster. Parallel programming introduces complexities of its own. You have to watch out for deadlocks, race conditions and the like, and thinking about or visualizing them is much harder than with sequential programs. The human brain seems better suited to a small number of parallel tasks; serial programs are the most intuitive to understand. Maybe this is because of the way we were taught at university, since parallel programming is not so widespread right now.

Simply thinking about the number of different possible executions of a parallel program can be daunting. The simple example on the ISP website is enlightening:

It is well known that even an MPI program with five processes, each making five MPI calls, has over 10 billion interleavings (schedules). This is what bogs testing tools down -- they do not know which interleavings matter, and thus end up missing some interleavings while pursuing others redundantly.

My friend from the mathematics department, Hakan Özadam, and I had a discussion about the number of interleavings. Five processes making five MPI calls each means 25 function calls in total. Think of an interleaving as a sequence of 25 numbered slots: the first process chooses the 5 slots where its calls appear, the second process chooses 5 of the remaining 20, and so on down the line. Thus, the number becomes:

C(25, 5) * C(20, 5) * C(15, 5) * C(10, 5) * C(5, 5)

where C(n, k) = n! / ((n - k)! * k!). The last process simply has to take all the remaining calls for itself, so its number of choices is C(5, 5) = 1, only one.
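For the curious, here is a quick C++ sketch of my own (not from the ISP material) that evaluates this product. It comes out to 623,360,743,125,120, roughly 6.2 * 10^14, so the "over 10 billion" claim above is a comfortable understatement.

```cpp
// Evaluate C(25,5) * C(20,5) * C(15,5) * C(10,5) * C(5,5), i.e. 25!/(5!)^5:
// the number of ways to interleave 5 processes making 5 calls each.
#include <cstdint>
#include <iostream>

// Binomial coefficient C(n, k); each intermediate division is exact,
// and the values involved here fit easily in 64 bits.
std::uint64_t choose(unsigned n, unsigned k) {
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}

int main() {
    std::uint64_t total = 1;
    for (unsigned remaining = 25; remaining > 0; remaining -= 5)
        total *= choose(remaining, 5);   // assign 5 of the remaining slots to the next process
    std::cout << total << std::endl;     // prints 623360743125120
    return 0;
}
```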

That makes the number quite high. No wonder it is hard to formally verify that a parallel program doesn't deadlock somewhere and wait indefinitely for an input it is supposed to process (halting problem, anyone?). The ISP debugger looks promising for finding certain types of bugs in MPI programs. It seems to have a formal framework for following the interleavings that matter while skipping redundant ones. I wonder how it could be used with the Boost libraries, which are what I am using for C++ MPI development.
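To make the kind of bug concrete, here is a minimal sketch written with Boost.MPI (my own illustration, not an example from ISP) of the classic head-to-head receive deadlock that such a tool is meant to catch:

```cpp
// Both ranks post a blocking recv before their matching send,
// so each waits forever for a message the other never gets to send.
#include <boost/mpi.hpp>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    if (world.size() < 2) return 0;            // needs at least two ranks
    if (world.rank() > 1) return 0;            // only ranks 0 and 1 take part

    int partner = (world.rank() == 0) ? 1 : 0;
    int sent = world.rank(), received = -1;

    world.recv(partner, 0, received);          // blocks here: deadlock
    world.send(partner, 0, sent);              // never reached

    std::cout << "rank " << world.rank() << " got " << received << std::endl;
    return 0;
}
```

This particular hang happens in every interleaving; the nastier cases, where a deadlock only shows up in some schedules, are exactly where an approach that explores the relevant interleavings should pay off.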


[thanks to Hakan for his help]