Changing How Programmers Think about Parallel Programming
For context, please make sure you've attended the webinar, presented by William Gropp.
Q: So, are you assuming "distributed" non-shared memory? I thought we were talking about parallel programming on a shared-memory machine?
A: No, I'm not assuming distributed memory, though many of my examples were drawn from experiences with distributed memory. The issues apply to shared memory as well: being more dynamic (fixed block-size data-parallel loop decompositions are not appropriate), using atomic operations, and understanding the memory model (for your execution model) are all essential.
Q: Since the world has largely gone "multi-core," shouldn't we also discuss parallelism within a shared-memory domain—for me, often within a JVM?
A: Yes, but don't stop there.
Q: Will CUDA be obsolete in a few years and be replaced by OpenCL?
A: CUDA and OpenCL (and OpenACC) have different targets—CUDA is lower level and is closer to a programming system that matches the NVIDIA execution model. The others are programming systems that are for an abstract GPU execution model. If the latter are successful, only experts will use CUDA.
Q: "Hard problems" includes NP?
A: Yes, but parallelism isn't magic—for NP problems, the key is better algorithms that also exploit parallelism.
Q: If and how does the messaging of this webcast change when thinking about parallel programming for mobile devices and on iOS and Android? The majority of smartphones being shipped now are multicore.
A: Extreme scale isn't the issue there, but performance variation needs to be part of the design, as does a realistic execution model. Such a model might include the possibility of long delays in some network operations; programming models and algorithms that take that into account would be less susceptible to the "spinning beach ball of death."
Q: Amdahl's law ⇒ when we have many cores next year, the time will be driven by the sequential parts and the locking overhead.
A: Under those assumptions, yes. But some applications have essentially no serial fraction (some HPC applications have less than 10 to the minus 10 serial fraction) and locking overhead assumes locking—there are other ways to achieve parallel coordination without locks, though it requires that the programming system or language support it (there's a nice paper on why threads can't be implemented as a library). Thinking parallel from the start is a good way to reduce the serial fraction; trying to parallelize a sequential code often leaves a significant serial fraction and thus limited potential for parallel speedup.
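The bound that Amdahl's law places on this can be sketched numerically (an illustrative calculation, not from the talk): with serial fraction s and p processors, speedup is at most 1 / (s + (1 − s) / p).

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup for a fixed-size problem under Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

if __name__ == "__main__":
    # Even a 1% serial fraction caps speedup near 100x, no matter how
    # many cores you add:
    print(amdahl_speedup(0.01, 1_000_000))   # ~99.99
    # A 1e-10 serial fraction (as in some HPC codes) barely limits
    # a million cores:
    print(amdahl_speedup(1e-10, 1_000_000))  # ~999900
```

This is why thinking parallel from the start matters: the achievable speedup is dominated by the serial fraction, not the core count.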
Q: Why does adding barriers sometimes increase code performance? Doesn't this indicate synchrony is sometimes better?
A: One example is using barriers to reduce communication contention, particularly when the communication is fairly synchronous to begin with. And yes, reality is more complicated; sometimes synchrony can be better.
Q: Bill, what do you think about the February CACM article about power requirements possibly limiting the future number of cores?
A: Very nice article. There are some assumptions in there about the type of application that make the result less universal than it might appear.
Q: So a barrier is a synchronization point?
A: They are not always used for (only) that, but yes, using a barrier forces a synchronization.
Q: Would you characterize Hadoop as parallel computation or simply distributed computation?
A: There isn't a precise line between parallel and distributed, but I would put Hadoop and MapReduce more in the distributed category because of the emphasis on separate tasks that communicate rarely (in this case, in the reduce step).
Q: Isn't there also a point where we cannot extract any more parallelism? In the spirit of Amdahl's Law, for problems that don't perfectly break down into parallelizable portions, that sequential part will be the bottleneck, which could take a long time.
A: Yes, but many applications have been demonstrated with extremely small serial fractions (those computational science codes running on 1 million cores today). Parallel applications created by parallelizing a sequential application often have large serial fractions and thus limited speedup. Some problems, particularly in the sciences, have very little non-parallel computation and are expected to run efficiently even on 1 billion cores. Of course, not all problems have that level of concurrency, and for them we need a different solution.
Q: Slide 21: why log base 2? Base can be higher. That's only a constant factor, but that's very important in HPC.
A: True, higher-radix trees can be used, particularly if the execution model supports multiple communication paths at the same time (as on IBM's Blue Gene).
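As an illustrative sketch (the function and setup are mine, not from the webinar): a radix-k reduction tree over p processes needs ⌈log_k p⌉ rounds, so raising the radix trades more simultaneous communication paths for fewer rounds.

```python
def tree_rounds(p: int, radix: int) -> int:
    """Communication rounds for a radix-`radix` reduction tree over p
    processes: the smallest d with radix**d >= p, i.e. ceil(log_radix(p))."""
    rounds, span = 0, 1
    while span < p:
        span *= radix
        rounds += 1
    return rounds

# A higher radix cuts rounds by a constant factor--valuable when the
# network (as on Blue Gene) can drive several links at once:
# tree_rounds(1024, 2) is 10 rounds; tree_rounds(1024, 16) is only 3.
```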
Q: In nature, weather (for example) does not need to fully synchronize globally to proceed locally. How can we mirror this in computation (e.g., computational "light cone")?
A: We can implement this model nearly directly. However, there are challenges (in weather, sound speed is much higher than typical wind speeds, but without carefully formulating the math model, the sound speeds limit the time step). And we must not ask for a global view of, for example, the fastest wind speed—if so, we've introduced a global constraint.
Q: Just to add to that, this paper may be interesting for the current subject.
Q: Do we have the issue of deadlock when using a halo exchange among processors?
A: We can if it's not implemented correctly. Using synchronous communication makes this "more" likely; non-blocking communication permits more flexible progress. We illustrate this example in our MPI tutorials.
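Here is a toy sketch of that ordering fix in Python rather than MPI (queues stand in for messages, threads for ranks; all names are invented). The parity ordering—even ranks send first, odd ranks receive first—is the classic way to keep a ring halo exchange deadlock-free even when sends are synchronous:

```python
import queue
import threading

def halo_exchange(rank, nprocs, value, inbox, results):
    """Each rank sends its halo value to the right neighbor and receives
    from the left. Even ranks send first, odd ranks receive first, so a
    matching receive is always pending for every send."""
    right = (rank + 1) % nprocs
    if rank % 2 == 0:
        inbox[right].put(value)        # "send" to right neighbor
        received = inbox[rank].get()   # "receive" from left neighbor
    else:
        received = inbox[rank].get()
        inbox[right].put(value)
    results[rank] = received

def run_ring(nprocs=4):
    inbox = [queue.Queue(maxsize=1) for _ in range(nprocs)]
    results = [None] * nprocs
    threads = [threading.Thread(target=halo_exchange,
                                args=(r, nprocs, r * 10, inbox, results))
               for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results   # results[r] holds the left neighbor's value
```

`run_ring(4)` returns `[30, 0, 10, 20]`: each rank ends up with its left neighbor's value. Nonblocking communication (MPI_Isend/MPI_Irecv) removes the need for such careful ordering, which is the flexibility mentioned above.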
Q: Regarding susceptibility to large-scale variance: CACM had a nice article from Google about how they launch redundant requests to avoid the risk that one is by chance delayed: "The Tail at Scale" (CACM, February 2013).
A: Thanks—a nice example of rethinking the problem.
Q: Is "Execution Model" another way of saying "Architecture" (a structural and cost model for computation on a particular class of machines)?
A: They're related—an execution model may be more abstract than we usually mean when we talk about the architecture. And I include runtime services in the execution model—since as a programmer or algorithm designer, I don't care whether a feature is in hardware or software, just that it works and is fast enough.
Q: Didn't the difference between the execution model and the programming model move us from low-level to high-level programming languages? So isn't this just a step back?
A: Good question, and the key is to use as high level an abstraction for the execution model as possible. Relying only on a low-level model, such as the low-level architecture, would be a step back.
Q: Is portability lost when I move toward the execution model?
A: Not necessarily—if your abstraction is a high enough level, portability can be maintained. However, there is a tradeoff between performance and maximum portability.
Q: You started the talk with saying that parallel programming is (already) hard. Now you're proposing an execution model (and matching programming model) that makes it a lot harder. Do you think that programmers will be able to make sense of your new model?
A: The other option is to hack without a model—so yes, a little harder, but not necessarily much harder. I'm not asking programmers to understand everything. Think OpenACC instead of CUDA.
Q: Doesn't overdecomposition increase "congestion"? How does one balance the performance gain of overdecomposition against the performance loss due to the additional overhead?
A: Good question—the answer, at least in part, is in balance. Don't overdo the decomposition, and maintain performance-important properties such as locality. There is research on building systems that manage the details and make it easier for programmers to build effective programs.
Q: I'm an undergrad student. What math do you recommend studying to prepare for thinking about parallelism?
A: Math is good—there is usually a lot of concurrency in the mathematical representation of the problem, and a top-down view of the problem often gives you lots of parallelism. So study the math in application areas in which you are interested; that's what I did.
Q: Testing approach was not discussed. Any comments?
A: Good point. That's part of what's hard in parallel computing. One property that I like to see is that the programming model and system preserve deterministic algorithms—that is, if the algorithm is deterministic, then the implementation is verifiably deterministic as well. This is one reason why some practitioners prefer message passing or more restrictive shared-memory approaches.
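One concrete source of such nondeterminism is floating-point reduction order (an illustrative sketch, not from the talk): FP addition is not associative, so a sum whose order depends on scheduling can differ between runs, while a fixed reduction tree stays reproducible.

```python
def pairwise_sum(values):
    """Sum in a fixed binary-tree order, independent of thread or message
    scheduling, so a deterministic algorithm stays deterministic."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

# Floating-point addition is not associative, so order matters:
a = (1e16 + 1.0) + 1.0   # each 1.0 is absorbed: result is 1e16
b = 1e16 + (1.0 + 1.0)   # the 2.0 survives: result is larger than a
```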
Q: Can DSLs (Domain Specific Languages) such as Liszt be used to abstract away software complexity (coding synchronization via directives/OpenMP and/or MPI) and hardware complexity (increasingly large memory-hierarchy latencies)?
A: Yes, as long as they can be integrated into the overall application. That's why I insist on calling DSLs "data-structure specific languages."
Q: When is MPI-3 going to be widely available on systems?
A: Probably within the next year. Several implementations, including MPICH, already provide functional MPI-3 libraries, and tuning for big systems is in progress.
Q: It seems that barriers are becoming such a big problem that new algorithms are needed that alleviate them altogether, instead of a new execution model that takes the problems into better account. Can you comment?
A: Both are important. We need to move away from algorithms and programming models that encourage the use of barriers and to a lesser extent synchronizing communication. But that is easier said than done—some of the most efficient numerical algorithms (e.g., Conjugate Gradient) have an Allreduce—a strongly synchronizing communication. A better execution model will give us more to work with as we look for alternatives, including, for example, non-blocking Allreduce.
Q: Are the programming model(s) on which MPI is based a good match for the execution models of the highest performance systems?
A: They're a pretty good match today, partly because the hardware has been designed to support MPI programs. One exception is one-sided, RDMA, or put-get models. While MPI-2 had some support for one-sided, it was limited. MPI-3 provides much greater support, and this is expected to be a good match (we'll see when high performance implementations become available).
Q: Are there special considerations dependent on time scales? We seek to increase performance on a problem that takes 5–50 seconds on a single core—it must run in total time of < 0.2 seconds.
A: Since you need O(100x) speedup, yes. First, you need to get a quantitative understanding of your performance—it's possible to lose 10x or more on a single core just through unfortunate memory access patterns, for example. Once you know that the single core performance is close to the achievable limits, then you'll need to think very parallel—start by finding as much concurrency as possible, then look for ways to aggregate that work into efficient units.
Q: I would like to know your opinion about developments on FPGA?
A: A short question with a long answer. Short form—for specialized uses, FPGAs can be very effective, but not for all problems.
Q: The hard part for most programmers is probably the correctness aspect. How can we get some help with this? We need compilers/tools that can analyze our code for write-write conflicts, etc.
A: One way is to build correctness into the programming model and then the system, rather than trying to back-fit it. For example, a general shared-memory programming model is very hard to reason about and thus hard to prove even simple programs correct.
Q: You did not mention "eventual consistency," why is that?
A: We only had an hour. Yes, the various consistency models are important, and they can be considered part of your execution model (e.g., what is the consistency model in your execution model? Is it even well defined? For many programmers working with shared memory, the answer is often no).
Q: Can you give some examples of moving the computation to the data?
A: The simplest one is following links in a graph that is distributed. A compute-centric approach fetches information on a remote vertex, finds the edges to follow, and fetches them. A data-centric approach sends code to the node hosting the vertex, executing it there.
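A toy sketch of the data-centric version (the partition layout and helper names are invented for illustration): instead of fetching a remote vertex's edge list, we ship a small closure to the "node" that owns the vertex and return only the result.

```python
# A graph partitioned across two "nodes" (plain dicts stand in for
# distributed memory). Each node owns some vertices and their edge lists.
partitions = {
    0: {"a": ["b", "c"]},      # node 0 owns vertex "a"
    1: {"b": ["c"], "c": []},  # node 1 owns "b" and "c"
}

def owner(vertex):
    """Hypothetical lookup: which node hosts this vertex."""
    return 0 if vertex == "a" else 1

def run_at(node_id, func):
    """Data-centric step: execute func where the data lives (simulated
    locally here); only the small result travels back."""
    return func(partitions[node_id])

def degree(vertex):
    # Ship the computation to the owning node instead of fetching
    # the edge list to the caller.
    return run_at(owner(vertex), lambda local: len(local[vertex]))
```

A compute-centric `degree` would instead pull the whole edge list across the network before counting; for link-following workloads that difference dominates.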
Q: Is Illinois the best in parallel computing? I mean, the best research center in the USA?
A: I think so :). My friends might disagree.
Q: Do you think HPC on the cloud is feasible?
A: HPC, yes, because there is HPC that you can do on a node in a cloud system. But there are HPC applications that are not effective or feasible on cloud services (unless you include a parallel supercomputer as a cloud resource).
Q: Is MapReduce/Hadoop a programming system or a programming model?
A: MapReduce is a programming model. Hadoop is a programming system that implements the MapReduce model; there are other implementations of MapReduce. MapReduce is inspired by an execution model that includes frequent failures, as well as massive parallelism.
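The model itself is small enough to sketch in a few lines (word count, the canonical example; this illustrates the model, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs; each document is processed independently,
    which is where the massive parallelism comes from."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then combine counts per word.
    This is the rare communication step."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(documents):
    return reduce_phase(map_phase(documents))
```

In a real system the map tasks run on many machines (and are re-run on failure, matching the fault-prone execution model), but the programming model is just these two functions.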
Q: Should IO be part of a programming model?
A: Yes. In HPC, many applications are limited by their I/O, in large part because I/O wasn't considered. Also, POSIX I/O is singularly unsuited for high performance, parallel I/O.
Q: What's your opinion of auto tools (e.g., auto-vectorization or auto-parallelization) for parallel programming?
A: A wonderful dream and a very challenging problem.
Q: The brain is a massively parallel system—isn't the structure of the mesh (its interconnectedness) missing as a distinguishing factor in the performance of a parallel system?
A: Yes! For example, what distinguishes a supercomputer today is the interconnect. Too many parallel applications today assume an "execution model" of the interconnect that acts like a contention-free, complete connection network. There are better models that do not require diving into the details of a specific interconnect, for example.
Q: I am a Java developer, so which way can I think about parallel programming?
A: For single chip/node, you have workable shared memory models. Locality remains important, so you need a model of where the data goes and where computation is done (somewhat hidden). For example, an "execution model" that says that the core that first accesses memory will be the one that can most efficiently process it suggests a programming style often called "first touch"—ensure that the first reference to data is distributed across all of your cores.
Q: How do you see programming of heterogeneous systems progressing—are you aware of any languages that allow the programmer to specify multiple paths to solution and choose the best one at runtime?
A: You can already do this with some research systems and in simple ways with OpenACC. I expect this work to expand as well as change, as I don't believe accelerators on separate chips will continue to be the most common heterogeneous system.
Q: What operating systems are best for parallel programming?
A: At extreme scale, ones that stay out of the way of the user's applications. There are a number of both research and commercial OSs that can do that; look at the version of Linux that Cray uses, the OS that IBM uses for Blue Gene, for example.
Q: What do you think of parallel systems taking advantage of hardware optimizations (e.g., speculative parallelization or transactional memories)? In my opinion, depending on hardware makes it difficult to see the level to which we can parallelize an application.
A: Portability with performance is hard, no doubt about it. Sometimes you can hide the details from most users—for example, some numerical libraries exploit special hardware and even control of rounding modes; if those features are not present, other, if slower, approaches can be used. So I would not build a general purpose language on top of hardware operations that are not nearly ubiquitous. But I would define a library routine in a way that would let me exploit those features if they were present, and accept that I might have to write (and test) several versions.
Q: For some projects, it is really difficult to find time to sit down with end-users. [Is this] the "Achilles Heel" for the Agile method, or is there a way to accommodate such circumstances?
A: You need to have someone act in that role for Agile to work: A more exaggerated version of a Scrum customer advocate.
Q: OK, if we have complex systems to get the most from devices, why don't we develop some other programming paradigms?
A: I'm encouraging you to think about it—the specifics of these complex systems are new and we need experience with them to better understand the persistent and effective abstractions that would inform a good programming model.
Q: As we go to systems with 5M cores, systems will become more fragile. Do you see how the programming systems will deal with this (think SDC, core going down, etc.)?
A: Good question, and one that is a current topic of research. What matters is the fault model—exactly what sort of faults you have to deal with. Think of that as the "execution model for faults," with the programming systems designed to let you work within it (and no, I haven't seen anything I really like yet).
Q: Is there any way to detect delay while synchronizing?
A: With some help from the system. MPI-3 added access to internal performance variables, some of those (depending on the implementation) measure those delays. For synchronizing collectives, many libraries have an option to measure the time around a barrier (MPI_Barrier for MPI programs) before the collective operation—any time over the typical barrier time is a delay.
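The barrier-before-collective measurement can be sketched with threads standing in for ranks (the 0.2 s straggler and all names here are illustrative, not from any MPI implementation):

```python
import threading
import time

def timed_step(barrier, work_seconds, waits, rank):
    """Do uneven 'compute', then time a barrier placed just before the
    collective: time spent waiting beyond the typical barrier cost is
    delay caused by load imbalance."""
    time.sleep(work_seconds)
    t0 = time.perf_counter()
    barrier.wait()               # stand-in for MPI_Barrier
    waits[rank] = time.perf_counter() - t0

def measure(n=4):
    barrier = threading.Barrier(n)
    waits = [0.0] * n
    work = [0.0] * (n - 1) + [0.2]   # the last rank is the straggler
    threads = [threading.Thread(target=timed_step,
                                args=(barrier, work[r], waits, r))
               for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits   # fast ranks wait ~0.2 s; the straggler barely waits
```

The ranks that arrive early sit in the barrier for roughly the straggler's excess work time, which is exactly the delay the question asks about.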
Q: Statement, not question: I've worked extensively with geographically-distributed teams with pair-programming. I've found they need to be in close time zones; in my experience no more than ~2 hours.
A: Makes sense. Thanks for the note.
Q: Do you see an important role for functional languages in the future of parallel programming?
A: Not in the short term; it's still too hard to get good performance in real applications. That doesn't mean people shouldn't keep researching functional languages, but to date we still depend on an (often expert) programmer to achieve the performance goals we have.
Q: One of my research arguments is that the first language learned shapes the further development of expertise. In your view, what type of language should be introduced first so that it better prepares students to learn HPC later?
A: Tough question. But I'd say that what you emphasize is also important. So a language that lets you focus on good abstractions but also recognize and work with performance-important issues, such as memory locality, would be ideal. And sometimes the second language might help by contrast.
Q: What about transactional memory?
A: Exactly—transactional memory (or an abstraction for it) could be part of your execution model. How you take advantage of that in your algorithms and programs depends on your programming model and system.