John Mellor-Crummey, Ph.D.
High Performance Fortran (HPF) was designed in the early 1990s with the intent of harnessing the power of automatic parallelization technology to provide a commercially viable high-level programming model for developing portable parallel programs. HPF provides an attractive model for parallel programming because of its relative simplicity. Principally, programmers write a single-threaded Fortran program and augment it with layout directives to map data elements onto an array of processors. HPF compilers use these directives to partition a program's computation among processors and to synthesize code for required data movement and synchronization. As of 1997, fifteen companies were offering HPF compilers, including all major parallel computer vendors, and nearly forty major applications had been developed in the language, some over 100,000 lines.
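To give a flavor of the model, the fragment below is an illustrative sketch (the arrays, sizes, and processor arrangement are invented for exposition): HPF directives declare a logical processor array, distribute data across it, and align related arrays, while the executable statements remain ordinary single-threaded Fortran.

    !HPF$ PROCESSORS P(4,4)                  ! logical 4x4 processor arrangement
          REAL A(1024,1024), B(1024,1024)
    !HPF$ DISTRIBUTE A(BLOCK,BLOCK) ONTO P   ! partition A into 256x256 blocks
    !HPF$ ALIGN B(I,J) WITH A(I,J)           ! co-locate each B(I,J) with A(I,J)
          DO J = 2, 1023
             DO I = 2, 1023
                B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
             END DO
          END DO

From these directives alone, an HPF compiler must partition the loop iterations among the processors (for example, by the owner-computes rule) and synthesize the communication that fetches the block-boundary elements of A referenced by the stencil.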
Despite considerable initial enthusiasm for HPF, it has not achieved widespread acceptance by scientists as the model of choice for developing parallel applications. The success of HPF has been limited principally by the shortcomings of its compilers, which were adapted from automatic parallelization technology of the late 1980s and early 1990s. This technology has not been sophisticated enough to deliver scalable performance across a variety of architectures and applications. The compilation techniques used by commercial HPF compilers fail to generate code whose performance is competitive with that of hand-coded programs for all but the most straightforward applications. As a result, application developers have been reluctant to use HPF. Some have chosen to hand-code applications using message passing instead, while many others have simply delayed the transition to scalable parallelism.
It has been difficult for developers to use HPF compilers to parallelize existing codes. Commercial HPF compilers lack the analysis and code generation capabilities needed to effectively parallelize a wide range of loops written in a Fortran 77 style. Therefore, for codes to achieve reasonable parallelism when compiled with commercial HPF compilers, they have had to be rewritten substantially. For example, the Portland Group developed HPF versions of the NAS parallel application benchmarks by extensively rewriting the codes to avoid limitations of their compiler. Key changes included selective unrolling of loops, inlining of procedures, and repeated data transposition to avoid wavefront (pipelined) communication. In addition, forward substitutions and loop realignments were necessary to avoid using privatizable arrays. The resulting codes were nearly twice the length of the original serial versions of the benchmarks. Despite the rewriting, the performance of the NAS SP and BT codes is still about a factor of two behind the performance and scalability of hand-coded message-passing (MPI) versions of the same benchmarks.
In addition, the restricted set of data partitionings allowed in HPF has turned out to be a significant limitation, particularly for tightly-coupled codes. For example, hand-written parallelizations of the NAS SP and BT application benchmarks use multipartitioning, a skewed-cyclic block distribution that is not available in standard HPF but delivers exceptional performance and scalability for these applications. Alternative parallelizations based on either static or dynamic block distributions, which are supported by HPF, do not yield comparable efficiency, even in hand-coded implementations. Ideally, HPF compiler technology should be extensible to support new user-specified distribution strategies.
A principal goal of the Rice dHPF compiler project has been to develop the compiler optimization techniques necessary to deliver scalable parallel performance for a broad class of data-parallel applications. In particular, the project has aimed to develop compiler techniques that simplify the conversion of well-written codes into efficient HPF and that preserve the benefits of any code restructuring a user may have done to improve memory hierarchy performance.
In the dHPF compiler project, we have developed a wide spectrum of program analysis and optimization techniques that enable us to generate code that closely approximates hand-coded performance for complex (but regular) data-parallel applications. These include computation partitioning, communication, and memory hierarchy optimizations.
A particularly challenging class of applications is tightly-coupled line-sweep applications such as the NAS SP and BT benchmark codes. From lightly modified copies of standard serial versions of these benchmarks, dHPF generates MPI-based parallel code that is within a few percent of the performance of hand-crafted MPI implementations of these codes for a 102x102x102 problem size (Class B) on 64 processors. The adjacent figure shows that, for the NAS SP computational fluid dynamics code, dHPF-generated code using compiler-generated multipartitioning yields performance similar to that of hand-coded MPI; both achieve roughly linear speedup.
In our papers, we describe and
quantitatively evaluate the impact of partitioning, communication and memory
hierarchy optimizations implemented in the dHPF compiler.
For more information, see my papers in this area.
Coarray Fortran (CAF) is a SPMD parallel programming model based on a small set of language extensions to Fortran 90. CAF supports access to non-local data using a natural extension to Fortran 90 syntax, lightweight and flexible synchronization primitives, pointers and dynamic allocation of shared data, and parallel I/O. An executing CAF program consists of a static collection of asynchronous process images. Like MPI programs, CAF programs explicitly manage locality, data and computation distribution; however, CAF is a shared-memory programming model based on one-sided communication. Rather than explicitly coding message exchanges to obtain off-processor data, CAF programs can directly reference off-processor values using an extension of Fortran 90 syntax for subscripted references. Since both remote data access and synchronization are expressed in the language, communication and synchronization are amenable to compiler-based optimizing transformations.
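The sketch below illustrates the style of the language (the variable names and the boundary-exchange pattern are illustrative only): a bracketed co-subscript names the image whose copy of a co-array is accessed, so a single assignment expresses one-sided communication.

          real :: x(0:1025)[*]        ! one copy of x on every image
          integer :: me
          me = this_image()
          call sync_all()             ! barrier, in the original CAF spelling
          if (me > 1) then
             x(0) = x(1024)[me-1]     ! one-sided read of left neighbor's boundary
          end if
          call sync_all()

Because both the remote reference and the synchronization are visible in the source, a compiler is free, for example, to aggregate or overlap such transfers.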
To date, CAF has not appealed to application scientists as a model
for developing scalable, portable codes, because the language is still
somewhat immature and a fledgling compiler is only available on Cray
platforms. We are working to create an
open-source, portable, retargetable, high-quality CAF compiler
suitable for use with production codes. Our compiler translates CAF
into Fortran 90 plus calls to ARMCI, a multi-platform library for
one-sided communication. Recently, we completed implementation of the
core CAF language features, enabling us to begin experimentation to
assess the potential of CAF as a high-performance programming
model. Preliminary experiments comparing CAF and MPI versions of the
BT, MG, SP and CG NAS parallel benchmarks on a large Itanium 2
cluster with a Myrinet 2000 interconnect show that our CAF compiler
prototype already yields code with performance that is roughly equal
to hand-tuned MPI. The adjacent figure shows that a CAF implementation
of the NAS multigrid benchmark has equal or better efficiency than
MPI for all problem sizes.
We are in the process of designing and implementing a portable, high-performance, source-to-source CAF compiler. To achieve portability, our compiler performs a source-to-source translation from CAF to Fortran 90 with calls to run-time library primitives. To achieve high performance, we want the Fortran 90 code that we generate from CAF to be optimizable by vendor compilers. To generate high-performance code for a variety of target platforms, our CAF compiler will use a parameterization of target platform characteristics to guide optimization. In our design, CAF optimization strategies are guided by platform-specific cost models that capture the costs of various operations for a particular architecture and interconnect. Among other things, these cost models will be used to decide how best to implement communication. For example, on a platform whose interconnect is suited to coarse-grain communication, the cost model should guide our CAF compiler to vectorize communication; on a platform with hardware shared memory, it should guide the compiler to generate code that uses loads and stores to access non-local data directly.
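As an illustration of communication vectorization (a hand-written sketch of the transformation, not literal output of our compiler; the names are hypothetical), a loop that makes one fine-grain remote reference per iteration can be rewritten to perform a single coarse-grain transfer:

          real :: a(n), c(n), tmp(n)  ! local arrays (n and image p assumed defined)
          real :: b(n)[*]             ! co-array: one copy per image

          ! Before: one small remote read per iteration; costly when
          ! per-message overhead is high.
          do i = 1, n
             a(i) = b(i)[p] + c(i)
          end do

          ! After: one bulk get into a local temporary, then purely local work.
          tmp(1:n) = b(1:n)[p]
          do i = 1, n
             a(i) = tmp(i) + c(i)
          end do

On a hardware shared-memory platform, the cost model would instead favor leaving the fine-grain references in place, since each compiles to a load.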
More information about the Rice CAF compiler, including slides, publications, and source code, can be found at the project WWW site http://caf.rice.edu.
It is increasingly difficult for application developers writing complex scientific programs to attain a significant fraction of peak performance on modern microprocessor-based computer systems. This problem stems largely from the difficulty of expressing an application in a form that can effectively exploit the high degree of instruction-level parallelism and the deep memory hierarchies present in these systems. Furthermore, the complexity of these systems makes it difficult to pinpoint performance bottlenecks.
To address this issue,
we have developed HPCToolkit -- a novel suite of
multi-platform
tools for performance analysis of sequential and parallel
programs. The tools are designed for analyzing the node performance of
optimized application binaries without prior arrangement. The figure
below shows the organization of the toolkit.
The toolkit
includes support for collecting sample-based profiles using
performance monitoring hardware, analyzing application binaries to
decode instructions and recover loop nesting structure, interpreting
raw performance measurements by combining them with opcode
information obtained by binary analysis, correlating metrics with
source code using symbol table information, and producing a
hierarchically structured representation of application
performance.
The multi-platform binary analysis capabilities of the
toolkit are the key to its effectiveness.
They (1) enable performance metrics to be mapped to individual loop nests, the level at which key program optimizations such as software pipelining and tiling are applied, and (2) enable raw machine-level metrics to be distilled into higher-level measures, such as the relative frequency of instructions of different classes within each scope.
A carefully designed user interface (shown in the adjacent figure) facilitates
interactive top-down
exploration of profile-based program performance data that has been
collected and organized by the toolkit. In
addition to daily use at Rice by the parallel compiler group, the
toolkit is being used regularly at Los Alamos National Laboratory by
the ASCI code teams for performance analysis of their
stockpile stewardship applications.
More information about HPCToolkit, including papers, slides, and source code, can be found at the project WWW site http://hpctoolkit.org.
HPCToolkit currently collects and presents flat profiles for programs. All profilers of this class have two key limitations. First, they record no dynamic calling context when samples are collected: there is no information about where a procedure is called from, and if a procedure is called from multiple places, there is no way to tell whether it behaves differently in different calling contexts. Second, profilers of this class do not collect information that enables one to determine whether the time attributed to a function results from a few costly invocations or from many cheaper ones.
The Gprof call-graph profiler partially addresses these concerns; however, (1) it requires compile and/or link-time instrumentation, (2) it approximates costs to attribute to a routine's calling contexts based on call frequency, (3) it overstates the cost of small procedures because of instrumentation invoked at every procedure call, and (4) it does not support monitoring of multi-threaded programs. We are developing a call-graph profiler that (1) can be applied at launch time to unmodified, optimized application binaries, (2) uses a stack sampling technique to precisely attribute costs to calling contexts, and (3) collects call-graph edge counts with an overhead proportional to the sampling frequency rather than the program's call frequency. Our prototype is designed for monitoring programs on Alpha processors. It copes with the complexity of the Alpha/Tru64 calling conventions for optimized programs, including call chains containing multiple register-frame procedures. Ongoing work is focused on completing run-time support for multi-threaded programs to enable accurate monitoring of MPI programs, which are naturally multi-threaded.
Characterizing and modeling the performance of applications in an automatic way has been a long-standing goal of computer science research. Understanding how an application's performance will scale given different problem sizes and predicting how it will perform on proposed future architectures are two important problems.
Building accurate performance models for sequential or parallel applications is difficult. Simple metrics such as the number of floating-point operations that a scientific application executes provide little indication of its performance. Scientific applications rarely approach peak performance on microprocessor-based systems; memory hierarchy bandwidth and latency are significant limiting factors. Also, an application's instruction mix can dramatically affect performance. Today's superscalar processors can execute multiple instructions in parallel if they are provided with the right mix of instructions. For parallel programs, communication frequency, communication bandwidth, and serialization complicate the situation further. While architecture simulators can provide detailed information about a program's behavior on a particular input, to understand how a program's behavior will scale as data and system parameters vary, scientists typically construct analytical models of an application's performance by hand. Although this approach can produce highly accurate models, constructing them is enormously labor intensive and requires a thorough understanding of the algorithms used, as well as their implementation. Our research aims to simplify performance modeling by characterizing the performance of sequential and parallel applications semi-automatically and with reasonable accuracy.
To date, we have been working to model the characteristics of sequential applications and predict their execution behavior on different architectures. Building parameterized performance models of an application is difficult because of the large number of variables that affect performance, including architecture-dependent factors, algorithm/application choices, and input data parameters. Moreover, these factors interact in complex ways, yielding a performance function that is non-convex and non-smooth in this multivariate parameter space.
We have developed a strategy for (1) separating the contribution of application-specific factors from the contribution of architectural characteristics to the overall performance of an application and (2) constructing models of an application's characteristics parameterized by problem size or other input parameter. The benefits of this approach are twofold. First, by modeling application-specific factors in isolation, we build architecture-neutral models that are portable across different platforms. Second, models that describe the algorithmic and application choices are typically monotonic polynomial functions that are easier to synthesize.
We synthesize models to predict the behavioral characteristics and
execution time of applications
by using a combination of static and dynamic analysis of application binaries.
The figure shows the organization of our performance prediction
toolkit.
By operating on application binaries instead of program source code,
we are able to build language-independent tools that can naturally analyze
applications with modules written in different languages or linked
with third-party libraries. For accurate performance prediction, it is easier
to analyze a mix of machine instructions with predictable
latencies than to estimate the execution cost of high-level language constructs.
Our tools can be useful to both
application writers and compiler developers as they enable users to evaluate
performance and scalability of algorithms as well as to verify the
effectiveness of program optimizations.
To provide predictions of how executions will scale as problem size is changed, we use quadratic programming to fit detailed polynomial models (using a suitable set of basis functions) to measurements of basic block execution frequency and to measurements of memory hierarchy reuse distance for each load/store instruction.
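For concreteness, here is a sketch of one such fit; the basis set shown and the use of a nonnegativity constraint to enforce monotonicity are illustrative assumptions rather than our exact formulation. Given observations y_i of a quantity (e.g., a basic block's execution frequency) at problem sizes n_i, and basis functions B_k, we choose coefficients c to solve

\[
\min_{c \,\ge\, 0} \; \lVert \Phi c - y \rVert_2^2,
\qquad \Phi_{ik} = B_k(n_i),
\qquad B_k \in \{\, 1,\; n,\; n \log n,\; n^2,\; n^3 \,\}.
\]

Expanding the norm and dropping the constant term gives the convex quadratic program \(\min_{c \ge 0} \tfrac{1}{2}\, c^{T} (\Phi^{T}\Phi)\, c \;-\; (\Phi^{T} y)^{T} c\); the constraint \(c \ge 0\) keeps the fitted model monotone for positive problem sizes, matching the monotonic polynomial forms described above.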
For more information, see my papers in this area.
The software gap - the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it - is a serious problem for scientific software. Although users appreciate the convenience (and thus improved productivity) of using relatively high-level scripting languages, the slow execution speeds of these languages remain a problem. Lower-level languages, such as C and Fortran, provide better performance for production applications, but at the cost of tedious programming and optimization by experts. If applications written in scripting languages could be routinely compiled into highly-optimized machine code, a huge productivity gain would be realized. Our aim is to conduct research on technologies that can help achieve this goal.
Many previous research projects have reported excellent results in compiling and optimizing particular scripting languages such as Matlab. However, in practice, scientists typically extend these languages with their own domain-centric components, such as the Matlab signal processing toolbox. Doing so effectively defines a new domain-specific language. If we are to address efficiency problems for such extended languages, we must develop a framework for automatically generating optimizing compilers for them.
To accomplish this goal, we are working on an innovative strategy we call telescoping languages, which uses a library pre-processing phase to extensively analyze and optimize collections of libraries that define an extended language. The adjacent figure shows how a domain language toolbox and a domain language's runtime support are preprocessed to create an optimized extended library and a library-aware compiler.
Results of this analysis are collected into annotated libraries and used to generate a library-aware optimizer. Since this pre-processing phase need be done only at infrequent 'language-generation' times, its cost can be amortized over many compilations of individual scripts that use the library. The generated library-aware optimizer, which will be run much more frequently to translate individual scripts, uses the knowledge gathered during pre-processing to carry out fast and effective optimization of high-level scripts. The figure below shows how a library-aware optimizer will be used to optimize a domain language script in conjunction with a vendor compiler.
This enables scripts to benefit from the intense analysis performed during pre-processing without repaying its price. We call the strategy 'telescoping languages' because it merges knowledge of a hierarchy of extended languages into a single library-aware optimizer.
We have begun designing and building a prototype telescoping language generation framework. We will use this framework to construct prototype compilation systems for three domains: (1) statistical calculations from medical research written in R (a public domain enhanced version of the award-winning S language for statistical computing), (2) image processing applications in Matlab, and (3) component integration frameworks for scientific software. The project will develop domain-independent compiler technology that it will apply to optimize both R and Matlab themselves (these languages are quite similar) and collections of procedures written in them.
The intellectual merit of our work lies in the development of several important new compiler technologies, including (1) optimizations applied to procedure invocations, specifically analogs of common instruction-level optimizations; (2) mechanisms for annotating programs with specifications of abstract properties and their interprocedural propagation; (3) analysis of library procedures to uncover opportunities for optimization; (4) mechanisms for specifying complex high-level transformations and their integration into a high-level compilation system; (5) strategies for using contextual knowledge to drive library specialization and optimization; and (6) methods for determining how best to apply sequences of context-sensitive transformations in a library-aware optimizer. As a byproduct of this research, we expect to produce and distribute an open-source infrastructure for telescoping languages.
Our first target for this research is to support high-performance statistical computation in R by compiling R programs into efficient code. To date, we have implemented a prototype translator from R to C and are currently building a framework that analyzes and optimizes the combination of the R runtime system, translated scripts, and user annotations into efficient C code.
If the telescoping language research succeeds, it will make it possible to develop high-performance applications in high-level domain-specific programming environments. This will significantly increase productivity for the community of computational scientists and engineers.
Over the past twenty years, the cycle time of microprocessors has shortened about 60% annually, while the cycle time of DRAM has decreased only about 7% per year. To bridge the widening gap between processor and memory speeds, various hardware technologies for improving memory bandwidth and tolerating memory latency, including multiple levels of cache and hardware support for prefetching, have become commonplace. However, exploiting these technologies effectively remains a formidable challenge, particularly for complex scientific applications. When the characteristics of a scientific application are not well matched to the characteristics of the system on which it executes, performance can fall far below the architecture's peak. Furthermore, since different computer systems have different characteristics, achieving the best possible performance can require architecture-specific tuning. For data-intensive applications, improving spatial and temporal data reuse is often the most effective way to boost performance.
While strategies for improving spatial and temporal reuse through data and loop transformations have been widely studied by the compiler research community and automatic techniques for such transformations work well on application kernels, their results on real applications often leave much room for improvement.
In our research, we have investigated
- library-based strategies for improving locality in programs with irregular data accesses, including reordering both data and computation using space-filling curves, and
- compiler-based strategies for improving locality and performance in loop nests in regular programs by applying an integrated set of transformations, including loop fusion, array contraction, iteration space slicing, tiling, time skewing, unroll-and-jam, and scalar replacement, among others (see the sketch following this list).
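For example, here is a sketch of unroll-and-jam combined with scalar replacement on a generic loop nest (the loop is illustrative, not drawn from any particular application):

          ! Original: every iteration of the outer j loop re-reads all of a(1:n).
          do j = 1, n
             do i = 1, n
                c(j) = c(j) + a(i) * b(i,j)
             end do
          end do

          ! After unroll-and-jam of j by 2 plus scalar replacement: each a(i)
          ! is loaded once for two j iterations, and the accumulators live in
          ! registers. (The cleanup loop for odd n is omitted.)
          do j = 1, n-1, 2
             s0 = 0.0
             s1 = 0.0
             do i = 1, n
                t  = a(i)
                s0 = s0 + t * b(i,j)
                s1 = s1 + t * b(i,j+1)
             end do
             c(j)   = c(j)   + s0
             c(j+1) = c(j+1) + s1
          end do

The payoff is fewer loads per floating-point operation and better register reuse; similar reasoning drives the tiling and fusion decisions.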
Since automatic techniques for improving spatial and temporal reuse in complex regular scientific applications often fail to achieve the best possible result, we have developed a tool that enables application developers to choreograph the application of loop transformations to scientific programs. This provides a useful alternative to manually transforming programs when automatic techniques are ineffective.
With both regular and irregular applications, we have found that improving memory hierarchy utilization by reorganizing data and computation can often double application performance.
For more information, see my papers in this area.
The Open64/sl project at Rice University is adapting the Open64 compiler infrastructure, released as open source by SGI, into an infrastructure that supports source-to-source transformation of production programs. The Open64 infrastructure includes a near commercial-quality front end for Fortran 90 from Cray and gcc-based front ends for C and C++. Our Open64/sl infrastructure is currently separate from the Open64 project as a whole because we have modified the Cray Fortran 90 front end to generate a slightly higher-level intermediate form so that programs can be regenerated as correct Fortran 90. It will remain separate until a compiler pass is added to lower this representation to Open64's mid-level Whirl representation.
The project has collaborators from the European Center for Parallelism of Barcelona at the Technical University of Catalonia, the Institute for Software Science at the University of Vienna, and the Adjoint Compiler Technology and Standards project -- a multi-institutional project with collaborators from Argonne National Laboratory, MIT, and Rice. At Rice, this infrastructure currently supports the Rice Co-array Fortran compiler and research on compiler technology for library-based programming models, known as the Telescoping Languages research project.
Source code for Open64/sl is available from the project WWW site www.hipersoft.rice.edu/open64.
The OpenAnalysis open-source project grew out of an effort to build simple, compiler-infrastructure-independent analyses suitable for use with both abstract-syntax-tree-level intermediate representations and machine-code-level representations. At present, the OpenAnalysis code base includes intermediate-form-independent code for building control flow graphs, building call graphs, and performing interval analysis on control flow graphs. Ongoing work is developing support for value numbering and alias analysis. OpenAnalysis components are currently being used to analyze high-level intermediate-form programs in the Open64/sl compiler infrastructure and application binaries in tools that are part of the HPCToolkit performance analysis suite.
This project includes collaborators at Argonne National Laboratory and Lawrence Livermore National Laboratory.
Source code for OpenAnalysis is available at http://www.hipersoft.rice.edu/openanalysis.
I have been a minor participant in the Grid Application Development Software Project. My work related to this project has been developing a strategy for modeling the performance of programs to understand how to map them to grid systems and when to remap a program because of changes in resource availability. Initial work on this problem grew into the black-box performance modeling work described earlier on this page.
For more information, see the GrADS WWW site.
On shared-memory multiprocessors, efficient synchronization is essential for parallel applications to be fast and scalable.
With Michael Scott (University of Rochester), I have developed a number of efficient software synchronization algorithms that are widely considered the algorithms of choice for commercially available, large-scale, shared-memory multiprocessors. These algorithms have been widely used in practice, have served as a principal point of comparison for proposed hardware synchronization primitives, and have influenced the design of general-purpose hardware support in shared-memory architectures.
We showed that by exploiting locality in the memory hierarchy of shared-memory multiprocessors, efficient software synchronization algorithms can be constructed for a broad spectrum of architectures, ranging from cache-coherent parallel architectures (from shared-bus machines that use a snoopy cache coherence protocol to modern distributed shared-memory multiprocessors that use directory-based caching) to non-cache-coherent distributed-shared-memory machines such as the Cray T3D. We developed efficient algorithms for queue-based mutual exclusion locks, synchronous barriers, reader-writer locks that support various priority schemes, and fuzzy barriers that minimize processor waiting time by (a) splitting barrier arrival and wakeup into separate phases and (b) using asynchronous, adaptive restructuring of barrier data structures to minimize the latency from the last arrival to wakeup. In each of these synchronization algorithms, a processor busy-waits only on locally accessible memory and uses only a small, bounded number of remote memory references. This property enables the algorithms to avoid causing network congestion, which is the key to their efficiency and scalability.
For more information, see my papers in this area.
Debugging parallel programs is notoriously difficult because any two executions of the same program can have different outcomes depending on the precise interleaving of operations executed by each of the processors. The principal strategy used for debugging sequential programs consists of executing a program until errors are observed, and then gathering more detailed information about the observed errors in a series of subsequent identical re-executions until the errors are isolated. I showed that this debugging methodology could be extended effectively to parallel programs by collecting compact synchronization traces during an execution and using them to control subsequent executions to ensure that they exhibit equivalent behaviors. These synchronization traces are also useful for constructing a rich set of visualizations to aid in debugging and performance analysis. I also developed a technique for a 'software instruction counter' that can uniquely identify points in the execution of a process. Among other things, this technique serves as the basis for supporting execution replay for programs that handle asynchronous events, such as parallel programs using an `active messages' protocol for communication.
A difficult type of error to pinpoint in shared-memory parallel programs is a data race, a conflicting, unsynchronized access to a shared variable. I developed a practical and asymptotically efficient run-time technique to detect and pinpoint data races in programs with nested fork-join parallelism. Using this technique in practice requires inserting instrumentation in programs to monitor accesses to shared data. I developed Eraser, a compile-time instrumentation system for parallel Fortran programs that exploits a combination of data dependence analysis and interprocedural analysis to identify which accesses would never cause a data race; other accesses are instrumented for run-time monitoring. This compile-time optimization of access instrumentation dramatically reduced the monitoring overhead for detecting data races, making it suitable for use in a program testing phase.
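To make the target of this analysis concrete, the fragment below exhibits a data race (OpenMP syntax is used purely for illustration; the original work targeted the fork-join parallel Fortran dialects of the time):

          program race_example
            integer :: i, s
            integer :: a(100)
            a = 1
            s = 0
          !$omp parallel do
            do i = 1, 100
               s = s + a(i)    ! data race: unsynchronized update of shared s
            end do
            print *, s         ! nondeterministic; a reduction clause fixes it
          end program race_example

Accesses like the update of s are exactly what the compile-time analysis tries to prove race-free; only accesses that cannot be proven safe are instrumented for run-time monitoring.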
For more information, see my papers in this area.
For more information, see my papers in this area and the (somewhat stale) D System project WWW site.