FLASH: Integration of Distributed Shared Memory and Message Passing

Stanford University
Computer Systems Laboratory

Semiannual Technical Report
Advanced Research Projects Agency

For the period October 1994 - March 1995
Contract Number: DABT63-94-C-0054
AO #B837

Principal Investigator
John L. Hennessy
(jlh@vsop.stanford.edu)
415-725-3712 / 415-725-6949 (fax)

Co-principal Investigator
Mark A. Horowitz
(horowitz@chroma.stanford.edu)
415-725-3707 / 415-725-6949 (fax)


Executive Summary

FLASH Design: Over the last few months, the hardware design team has completed the coding of the MAGIC RTL. We have also partitioned the RTL into control and datapath logic so that the RTL is amenable to logic synthesis and datapath layout. The hardware efforts over the next few months will be concerned mostly with synthesizing the RTL model and addressing any resulting timing or layout difficulties. Of course, we also will be focusing heavily on design validation, but that topic is detailed in a separate section of this report.

DSM Architecture Studies: We have developed a model for DSM multiprocessors that extends the logP ideas to shared memory architectures. Using the model, we show the importance of occupancy, the DSM counterpart to the message passing overhead. The results show that several proposed design points have poor performance.

SPLASH-2 Benchmarks: We released the SPLASH-2 benchmark suite, which includes both new benchmarks and improved versions of the original SPLASH codes. Extensive documentation, including scaling instructions, is included. The benchmarks and their documentation are available over the Internet.

Architecture Simulation Tools: We have expanded the functionality of FlashLite, our system-level simulator, and have continued to develop multiple protocols to run on FLASH. We have used FlashLite both as a tool for performance studies [10] and as a hardware verification tool. Our FlashLite/Verilog environment has enabled us to test basic protocol operations within the MAGIC chip and has been instrumental in finding most of the early bugs in the design. We are currently working on a lower-level interface that will allow us to write more controlled directed-diagnostics to further verify the chip. We are also in the process of converting the front-end reference generator for FlashLite to a processor emulator so that we can more thoroughly model data transfer throughout the processor-memory system hierarchy. This accuracy in data handling will further increase the test coverage of our protocols as well as the test coverage for the MAGIC chip itself.

Parallel Applications: We have continued our efforts in studying the characteristics of parallel applications on shared-address-space multiprocessors. As part of this effort we have released the SPLASH-2 suite of parallel applications to facilitate the study of centralized and distributed shared-address-space multiprocessors. The SPLASH-2 suite is the successor to the SPLASH suite, and includes several new codes in application domains not previously explored in SPLASH.

Shared Cache Architecture Studies: We have studied the impact of clustering at the second level cache on the performance of small scale bus-based multiprocessors. We show that in an eight-processor system, the use of a large shared secondary cache that might be implemented with MCM technology can significantly reduce bus contention and overall execution time.

FLASH Operating System: A first prototype of Hive, the operating system for the FLASH machine, has been booted and runs on a machine simulator. This version contains most of the mechanisms for the internal fault containment system. We have been able to do some performance studies using complex workloads as well as some fault injection studies to evaluate its effectiveness. We also have an initial implementation of the automatic page migration and replication system that is in the process of being debugged and tuned.

Simulation Environment: We continue to enhance the SimOS simulation environment to make it a faster, more accurate model of the FLASH machine. We have added to SimOS an accurate CPU model of a superscalar dynamically-scheduled processor (like the MIPS R10000), which the FLASH machine will use. We have also added more accurate disk models to the simulator as well as improved its statistics gathering capabilities.

Efficient Logic Simulation: We have developed a new approach to event-driven simulation that does not use a centralized run-time event queue, yet is capable of handling arbitrary models, including those with unclocked feedback and nonunit delay. The elimination of the event queue significantly reduces run-time overhead, resulting in faster simulation. We have implemented our algorithm in a prototype Verilog simulator called VeriSUIF. Using this simulator we demonstrate improved performance vs. a commercial simulator on a small set of programs.

Compiler Tools: Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.

Parallel VLSI Simulation: We are developing a heterogeneous, multi-level, mixed-mode simulation environment implemented on multiprocessors. The project goal is to speed up simulation for large system design tasks through parallel computing. The research includes distributed scheduling, parallel execution of feedback networks, and integration of various simulation paradigms.

Formal Verification: We have been developing formal verification methods for use in FLASH. We have successfully used a new verification method to formally verify the complete pipeline for the protocol processor (PP) in FLASH. We have a prototype of an improved verifier for synchronous systems, where the input language and implementation are based on C++.

State Enumeration: We continued work on the automatic generation of corner-case tests using state-enumeration, and succeeded in finding several non-trivial bugs in the Protocol Processor design [9]. This work was extended to other parts of the MAGIC chip, starting with the Data Buffer Allocator and moving to the Inbox. A new state-enumeration tool, called MPP, has been developed to replace Synchronous Murphi. MPP is functionally equivalent to Synchronous Murphi but does not suffer from an exponential increase in the size of the code it generates with increased non-determinism in the model. This will allow us to model larger parts of the design for automatic corner-case test generation.

Technical Summary

Parallel Hardware

MAGIC Hardware Status

The hardware design has progressed significantly over the last few months. The focus of the design team has been in two areas: (1) completion of the RTL description of the MAGIC chip; and (2) partitioning the RTL in preparation for logic synthesis and datapath layout.

Task (1) involved completing the RTL code such that all functional features of the chip are implemented and, at least at a rudimentary level, tested. Though the actual amount of code that had to be written to complete this task was not large, the goal of having an RTL description that is functionally complete required us to resolve a number of outstanding design alternatives, and to coordinate with the operating system team to ensure that the final MAGIC feature set would provide efficient support for the operating system. Because the MAGIC chip contains an embedded general-purpose processor, a number of low-level operating system functions are actually being implemented on MAGIC rather than on the compute processor. Though the flexibility of the MAGIC chip allows the exact implementation of these functions to be modified even after the design is frozen, efficient low-level communication between the MAGIC chip and the OS running on the compute processor requires hardware support.

Along with implementing all the functionality in the RTL, the design team also partitioned the RTL so that it will be amenable to logic synthesis. As with almost all complex chips, MAGIC will make heavy use of logic synthesis to translate the RTL description into actual gates. Logic synthesis will be used mostly for the control logic; to achieve higher performance and density, layout of the datapath logic will make use of specialized datapath layout and compilation tools instead of a traditional logic synthesis program. Basically, then, the goal of the RTL partitioning effort was to separate the RTL into control logic (meant to be synthesized) and datapath logic (meant for manual layout or processing by datapath tools).

In tandem with the partitioning effort, we also began the actual synthesis process, taking a number of the larger control logic sections of the chip and running them through the Synopsys logic synthesis program. The results of these synthesis runs allow us to gauge how efficiently the synthesis program is able to implement the control logic, both in terms of delay and area. Because the target clock rate for the MAGIC chip is 100 MHz, effective use of the synthesis tools will be necessary if the target clock rate is to be met. In many cases, RTL modifications will be required to improve the output of the synthesis tools.

Since the RTL is now complete and fully partitioned, the focus of the hardware design team over the next few months will shift from implementation of the RTL to synthesis, timing, and other issues related to the translation of the RTL model into a gate-level description suitable for fabrication. As noted, much of this effort will be driven by the results of the synthesis process -- as critical logic paths and other timing or layout issues are identified, the RTL will need to be altered to alleviate the problem. Though we do not expect significant difficulties in this area, the process is nevertheless time-consuming.

Architecture Simulation Tools

We have improved many of our simulation tools. Our main simulation tool is FlashLite, a system-level simulator written in C and C++ that models each component on the FLASH node as well as the internals of the MAGIC chip itself. The simulation includes the Protocol Processor portion of the MAGIC chip, which runs the actual protocol code we will run on the FLASH prototype. The base cache-coherence protocol code has been expanded to include support for cache-coherent I/O, code to boot the Protocol Processor, and code for error handling and for software partitioning of the machine into "cells". FlashLite has also been partitioned to better facilitate the simultaneous development of multiple protocols. The development of an alternate cache-coherence COMA protocol is finished, and it is being performance-tuned. In addition, the message passing protocols are expanding beyond the original RDMA implementation. A simple memory-copy protocol is now operational and will be used in our operating system development.

Besides the ability to run multiprocessor applications through FlashLite to test the protocols and analyze machine performance, we also have the ability to run an N processor simulation with N-1 FlashLite nodes and one Verilog node. That is, on one of the nodes we replace the FlashLite model of the MAGIC chip with the actual Verilog model of the MAGIC chip. We call this simulation regime FlashLite/Verilog. This proves to be a convenient and useful way to generate test vectors for the part and to ensure that basic protocol operations are handled properly. With the addition of I/O modeling in FlashLite, we can now test the I/O subsystem via FlashLite/Verilog as well. Currently, we are developing a verification environment that gives a diagnostic writer lower-level access to the pins of MAGIC to facilitate directed tests. It is these diagnostics which will form the suite of regression tests that we run through the part.

As an execution-driven memory system simulator, FlashLite accepts a reference stream from a reference generator. Our original reference generator, Tango Lite, annotated an executable's load and store instructions with calls into FlashLite, which then modelled those loads and stores appropriately. This form of reference generation has the limitation that it is extremely difficult to simulate data handling, since library calls and system calls are not annotated. For this reason, we have developed another reference generator, Mipsy, which emulates each instruction of the main processor, including all library calls. With Mipsy, we can simulate proper data handling throughout the memory system. This is critical for verifying the cache-coherence protocol and, when using FlashLite/Verilog, for verifying the MAGIC chip design itself. We are adding uncached operations and prefetching support to Mipsy, as well as detailed statistics similar to those available under Tango Lite.
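To make the division of labor concrete, the following minimal C++ sketch shows one way a processor emulator such as Mipsy could feed memory references, with their data, into a memory-system simulator such as FlashLite. The interface names (MemRef, MemorySystem, ProcessorEmulator, emulateOne) are illustrative only and are not FlashLite's actual API.

    // Hypothetical sketch of the reference-generator / memory-simulator boundary.
    #include <cstdint>

    enum class RefType { Load, Store, Uncached, Prefetch };

    struct MemRef {
        RefType  type;
        uint64_t addr;
        int      size;      // bytes
        uint8_t* data;      // data payload, so the memory system sees real values
    };

    // The memory-system simulator (FlashLite plays this role) consumes
    // references and returns the simulated latency of each access.
    class MemorySystem {
    public:
        virtual ~MemorySystem() = default;
        virtual int access(int cpu, const MemRef& ref) = 0;   // returns cycles
    };

    // A processor emulator (Mipsy plays this role) interprets every instruction,
    // including library and system calls, and forwards each load or store with
    // its data.  An annotation-based generator like Tango Lite only sees
    // references in instrumented user code, which is why data handling was hard
    // to model there.
    class ProcessorEmulator {
    public:
        ProcessorEmulator(int cpu, MemorySystem& mem) : cpu_(cpu), mem_(mem) {}

        // Emulate a single instruction; on a memory instruction, hand the
        // reference (with data) to the memory simulator and charge its latency.
        void emulateOne(const MemRef* memOp) {
            cycles_ += 1;                         // base cost of the instruction
            if (memOp) cycles_ += mem_.access(cpu_, *memOp);
        }

        uint64_t cycles() const { return cycles_; }

    private:
        int           cpu_;
        MemorySystem& mem_;
        uint64_t      cycles_ = 0;
    };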

Applications and Architectural Studies

Parallel Application Effort

Recently, we have continued our efforts in studying the characteristics of parallel applications on shared-address-space multiprocessors. As part of this effort (see [19] for details), we have released the SPLASH-2 suite of parallel applications to facilitate the study of centralized and distributed shared-address-space multiprocessors. The SPLASH-2 suite is the successor to the SPLASH suite, and includes several new codes in application domains not previously explored in SPLASH. In addition, several of the SPLASH codes have been improved to use better algorithms and data structures, and to incorporate an understanding of the underlying architecture to enhance performance. We have quantitatively characterized the SPLASH-2 programs in terms of the fundamental properties and architectural interactions that are important to understanding them. The properties we have studied include computational load balance, communication-to-computation ratio, sizes and scaling of the important working sets, sharing behavior, and spatial locality.

We have also addressed methodological considerations for pruning the domain of application and machine parameters for architectural studies. Architectural evaluations often have a huge space of application and machine parameters to consider, and many of these parameters can have a substantial impact on the results of a given study. Since most studies that evaluate architectural ideas use software simulation (which is typically very slow), it is important to identify and avoid unrealistic combinations of problem and machine parameters, and to identify the realistic ones so that the rest of the space can be pruned. We find that an understanding of the working set characteristics of an application, coupled with an understanding of the growth rates of key characteristics such as the communication-to-computation ratio, is crucial to selecting important and meaningful points in the experiment space.

Latency, Occupancy, and Bandwidth Issues in Distributed Shared Memory MPs

Designers of distributed shared memory (DSM) multiprocessors are moving toward the use of commodity parts, not only in the processor and memory subsystem but also in the communication architecture. While the desire to use commodity parts and not perturb the underlying uniprocessor node can compromise the efficiency of the communication architecture, the impact on the end performance of applications is unclear. We studied this (see [10] for details) performance impact through detailed simulation and analytical modeling, using a range of important applications and computational kernels.

To make general claims about DSMs, we characterized the communication architectures of DSM machines by four parameters, similar to those in the logP model, rather than studying one specific machine in detail. The four parameters are latency, occupancy (of the communication controller in this model), gap (node-to-network bandwidth), and the number of processors. Conventional wisdom holds that latency is the dominant performance bottleneck in DSM machines for scientific applications. We showed that for 64 processors, controller occupancy also has a substantial impact on application performance, even in the range of realistic occupancies being proposed for cache-coherent DSM machines. Of the two contributions of occupancy to performance degradation--the latency it adds and the contention it induces--it is the contention component that dominates performance regardless of network latency. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. What is surprising, however, is that the performance impact of occupancy is substantial even for highly tuned applications and even in the absence of latency hiding techniques. We also showed that the applications we studied are not limited by the bandwidth of recent and upcoming machines.
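As a rough illustration of why occupancy-induced contention matters (a back-of-the-envelope sketch only, not the analytical model of [10]; the function name and parameter values are hypothetical), a remote miss can be charged its uncontended latency, the controller occupancy, and an M/M/1-style waiting term that grows sharply as controller utilization approaches one:

    #include <cstdio>

    // Estimate the average cost of a remote miss from latency L, controller
    // occupancy o, and the rate of protocol requests offered to the controller.
    double remoteMissCost(double L,        // uncontended latency (cycles)
                          double o,        // controller occupancy per request (cycles)
                          double reqRate)  // requests per cycle at the controller
    {
        double util = o * reqRate;                  // controller utilization
        if (util >= 1.0) return 1e30;               // saturated: unbounded queueing
        double wait = (util * o) / (1.0 - util);    // mean waiting-time approximation
        return L + o + wait;                        // latency + service + contention
    }

    int main() {
        // Even modest occupancy becomes dominant as the request rate rises,
        // which is the qualitative effect described above.
        for (double rate = 0.01; rate <= 0.05; rate += 0.01)
            std::printf("rate=%.2f  cost=%.1f cycles\n",
                        rate, remoteMissCost(/*L=*/100, /*o=*/15, rate));
        return 0;
    }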

Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. We showed that in many structured computations occupancy-induced contention is not alleviated by increasing problem size, and that there are important classes of applications for which the performance lost by using higher latency networks or higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.

COMA Protocol on FLASH

A key attribute of FLASH is that it will allow experimentation with multiple protocols. Over the last six months, we have continued development of the COMA (Cache-Only Memory Architecture) protocol for the FLASH multiprocessor. The first phase of the work focused on understanding the requirements a COMA protocol places on the FLASH design, specifically on the custom node controller, MAGIC. The next phase was an implementation that ignored physical constraints such as input queue sizes and the number of available buffers (both of which were assumed to be infinite). Completion of these phases took roughly six months.

After this idealized protocol was debugged, we began implementing the remaining functionality to handle finite queue depths and buffer management. This phase took roughly six months and resulted in a functional, albeit unoptimized, protocol. In fleshing out this portion of the design, we were able to evaluate the MAGIC chip and provide feedback to the hardware team concerning performance and correctness problems. One example concerns resource allocation. Before our work, certain buffering elements were allocated in such a way that one of the possible request queues could be repeatedly denied a buffer, starving that unit. Because the COMA protocol is more buffer-intensive than the base protocol, it exposed this problem. The problem has now been solved using a simple partitioning of the buffer elements.
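The following C++ sketch illustrates the flavor of such a partitioned scheme (the queue count, reservation size, and class names are hypothetical and do not reflect MAGIC's actual buffer organization): each request queue keeps a small reserved share of the data buffers, with the remainder in a common pool, so no queue can be starved by the others.

    #include <array>
    #include <cassert>

    constexpr int kNumQueues = 4;    // illustrative number of request queues
    constexpr int kReserved  = 2;    // buffers reserved per queue
    constexpr int kTotal     = 16;   // total data buffers
    constexpr int kShared    = kTotal - kNumQueues * kReserved;

    class PartitionedAllocator {
    public:
        // Grant a buffer to queue q if possible: first from the queue's reserved
        // share, then from the common pool.
        bool alloc(int q) {
            if (reserved_[q] < kReserved) { ++reserved_[q]; return true; }
            if (shared_ < kShared)        { ++shared_;      return true; }
            return false;   // queue q must retry; its reserved share guarantees progress
        }

        // Release a buffer held by queue q; accounting returns it to the queue's
        // reserved share first, then to the common pool.
        void release(int q) {
            if (reserved_[q] > 0) --reserved_[q];
            else { assert(shared_ > 0); --shared_; }
        }

    private:
        std::array<int, kNumQueues> reserved_{};  // per-queue reserved buffers in use
        int shared_ = 0;                          // common-pool buffers in use
    };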

A baseline performance comparison of the COMA protocol versus the base protocol was then undertaken. Preliminary results indicate that COMA can provide performance gains of up to 50% if the application reads large data sets that do not fit in the caches, but can also degrade performance by up to 30% for applications that involve large amounts of writing. However, there is much room for low-level optimizations in the COMA protocol. We are currently in the process of optimizing the COMA protocol so that we can perform a more meaningful comparison to the base protocol.

In addition, another goal of this work is to run COMA on workloads previously not studied, for example, operating systems. To that end, we are also modifying the COMA protocol so that it can interface with SimOS, a detailed operating system simulator developed at Stanford. We expect these modifications and optimizations to be done over the next few months, and we will then undertake a more thorough evaluation of COMA versus the base protocol for both scientific and operating system workloads. After that, we will evaluate whether or not further architectural changes to MAGIC can improve performance.

Software Based Clustered Distributed Virtual Shared Memory

We are collaborating with Silicon Graphics to evaluate software-based distributed virtual shared memory over fast networks. The topic of virtual shared memory is not new; however, research to date has focused mainly on networks of workstations. By using n-way multiprocessors instead of workstations as the underlying nodes, we can take advantage of the improved communication-to-computation ratio between the nodes and amortize the cost of the fast network link. Existing bus-based shared memory multiprocessors scale to tens of processors, and next-generation hardware technologies, such as FLASH, will scale to hundreds of processors. Clustered DVSM introduces a new scaling principle that has the potential to increase the number of processors in each generation of machines using SW while leveraging the state of the art in HW. The resulting machine uses both HW and SW coherency and maintains the shared memory programming paradigm of the underlying multiprocessors: HW is used within the node and SW between nodes. The TLB is used to detect coherency violations for SW coherency.
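A minimal user-level sketch of that detection mechanism is shown below, assuming a generic POSIX system (the prototype itself targets IRIX and includes a kernel-level path): shared pages are mapped with restricted protection, and the first conflicting access traps to a handler that stands in for the SW protocol. The symbols and structure here are illustrative only.

    #include <csignal>
    #include <cstdio>
    #include <sys/mman.h>
    #include <unistd.h>

    static void* g_page   = nullptr;
    static long  g_pagesz = 0;
    static volatile std::sig_atomic_t g_protocol_ran = 0;

    // Stand-in for the software protocol handler: it would fetch the page or
    // upgrade ownership, then grant access so the faulting write can retry.
    static void coherence_fault(int, siginfo_t*, void*) {
        g_protocol_ran = 1;
        mprotect(g_page, g_pagesz, PROT_READ | PROT_WRITE);
    }

    int main() {
        g_pagesz = sysconf(_SC_PAGESIZE);
        g_page = mmap(nullptr, g_pagesz, PROT_READ,          // start read-only
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {};
        sa.sa_sigaction = coherence_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, nullptr);

        static_cast<char*>(g_page)[0] = 1;   // faults, handler runs, write retried
        std::printf("software coherence action taken: %d\n", (int)g_protocol_ran);
        munmap(g_page, g_pagesz);
        return 0;
    }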

We are prototyping a Distributed Virtual Shared Memory machine using SGI Challenge machines and a high speed switch based HIPPI network. The prototype is leveraging the work of the Stanford FLASH HW development effort by using the FLASH protocol handlers (state machines). Because FLASH executes SW handlers on a relatively general purpose processor (MAGIC), the protocols are well suited for software implementation. The prototype code is being implemented both at the user level and at the kernel level (SGI IRIX based). User and kernel implementations will be able to exist on the same network, facilitating debugging. A kernel level implementation is necessary to provide fast interrupts and user level delivery of data pages, which are the underlying coherency objects, directly to user space.

The current status of the project is as follows: We have implemented the FLASH DELAYED mode handlers and are starting to implement the EAGER mode handlers. The code has been tested over slower networks (Ethernet, FDDI) and will soon be tested on HIPPI. Experiments will then begin to characterize the application space and understand the performance. Eventually, more novel protocol optimizations, such as lazy release consistency and possibly entry consistency, will be tested.

Silicon Graphics is also supporting this work by providing computer time for development, a testbed for experiments, and the efforts of an IRIX kernel SW engineer.

Shared Cache Architecture Studies

Small-scale, shared-memory multiprocessors represent an important class of multiprocessor design due to their increasing popularity in both the workstation and server markets. Typically, these multiprocessors consist of two to eight high-performance processing nodes connected together, and to main memory, by a shared global bus. Parallel applications can place heavy demands on the bus due to capacity and communication misses in the L2 caches, yet bus performance has failed to scale at the same rate as processor performance due to the inherent limitations of the bus topology.

To address the bus performance bottleneck, we have investigated the performance improvements that are possible when the plentiful low-latency interconnections available with multi-chip module (MCM) technology are used to build a shared secondary cache architecture.

Our results show that clustering at the secondary cache provides many of the same benefits as sharing at the primary cache, without the degradation in primary cache access time and bandwidth. That degradation is more of an issue for multiprogramming workloads and for applications with little sharing, since they gain no clustering benefits to counteract it.

We also find that for parallel scientific applications with high levels of communication, clustering can dramatically reduce contention for the shared global bus while not increasing contention at the shared secondary caches appreciably. This has the effect of more evenly distributing resource contention throughout the system, and can result in large performance gains.

Operating System and Compilers

FLASH OS - Hive

The work on the operating system for the FLASH machine has advanced significantly over the last six months. By the end of the reporting period, the operating system, Hive, had booted and was running complex workloads within our simulation environment. Using the SimOS simulation environment, we have been measuring and evaluating the performance and correctness of Hive's failure containment mechanisms. A description of these mechanisms and the results of our experience to date can be found in the paper "Hive: Fault Containment for Shared-Memory Multiprocessors" [14], which has been submitted to SOSP.

Development work on Hive during the reporting period included: extending the virtual memory system to support sharing of memory among kernels, extending the file system to support locating and accessing file data cached in a remote kernel, extending the process fork mechanism to fork a child process on a remote kernel, extending the process signal and process group mechanisms to manage distributed sets of processes, and implementing a failure recovery mechanism that reestablishes a consistent system state when one or more of the kernels fails. These changes represent much new code and heavy modifications of the existing code. Much of our time has been spent and will continue to be spent on the debugging, performance tuning, and functional enhancement of this code.

A second focus of the operating system work on Hive has been on mechanisms and policies for reducing memory latency on a CC-NUMA machine like FLASH. For the last six months, we have focused on using automatic page replication and migration to move memory pages closer to the CPUs that are accessing them. Using SimOS, we studied the benefits of automatic page replication and migration on CC-NUMA and CC-NOW compute servers. We studied several realistic compute server workloads, including a program development environment, engineering simulations, a commercial database (Sybase), and a multiprogrammed parallel graphics and scientific workload.

Results of this study are the focus of the paper "OS Support for Improving Data Locality on CC-NUMA Compute Servers" [16], which has been submitted to SOSP. We found that automatic page migration and replication can improve performance by 15% to 35% on CC-NUMA machines and by 15% to 50% on CC-NOWs for many workloads. We studied various metrics for deciding when to migrate and replicate pages and found that a sampled subset of cache misses approximates the full cache-miss behavior well; TLB misses, however, do not consistently approximate cache misses. We also found that the cost of migrating and replicating pages is very sensitive to the OS implementation, and that efficient data structures and algorithms need further study.
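To make the policy concrete, the sketch below shows one counter-based way such decisions could be driven (illustrative only, not Hive's implementation; the thresholds, node count, and names are hypothetical): sampled cache misses are tallied per page and per node, pages dominated by one node are migrated to it, and read-mostly pages with significant sharing are replicated.

    #include <array>
    #include <cstdint>

    constexpr int      kNodes        = 8;
    constexpr uint32_t kHotThreshold = 64;   // sampled misses before acting on a page
    constexpr double   kShareRatio   = 0.5;  // second node's share => replicate

    enum class Action { None, MigrateTo, Replicate };

    struct PageStats {
        std::array<uint32_t, kNodes> misses{};   // sampled cache misses per node

        // Called periodically by the VM system for candidate pages.
        Action decide(int& targetNode, bool readOnly) const {
            uint32_t total = 0, best = 0, second = 0;
            int bestNode = 0;
            for (int n = 0; n < kNodes; ++n) {
                total += misses[n];
                if (misses[n] >= best) { second = best; best = misses[n]; bestNode = n; }
                else if (misses[n] > second) second = misses[n];
            }
            if (total < kHotThreshold) return Action::None;
            if (readOnly && second >= kShareRatio * best)
                return Action::Replicate;        // read-mostly page shared by several nodes
            targetNode = bestNode;
            return Action::MigrateTo;            // page dominated by one node
        }
    };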

As part of the page migration/replication study, we have developed an initial implementation of the Hive virtual memory system that supports page migration and replication. This work required redoing much of the locking structure of the existing virtual memory system to support the new functionality.

Finally, the SimOS simulation environment was advanced in two areas. First, it was heavily used by the studies described above, which flushed out many of the remaining bugs and motivated new functionality to enhance its usability. Much of the new functionality involves better ways to generate and maintain the statistics collected from the simulation. In particular, we built a system of annotations that allows the user of the simulation system to specify what to study. Using annotations, we have been able to study the behavior of the operating system and application programs in detail.

The second push in the simulation environment has been to increase its accuracy. We have included detailed simulation models of advanced microprocessors. These CPU models include features such as multiple instruction issue, out of order issue, and non-blocking caches. Many of these features are needed to accurately model the MIPS R10000 being used in FLASH. We also added more detailed models of the memory systems as well as some of the I/O devices.

Efficient Logic Simulation

Modern digital system design relies heavily on simulation to reduce the number of design errors and to improve system efficiency. In large system designs, so much time is spent in simulation that it has become a design bottleneck. Event-driven simulation and levelized compiled simulation are two well-known simulation techniques that are currently used in digital system design. Levelized compiled code logic simulators have the potential to provide much higher simulation performance than event-driven simulators because they eliminate much of the run-time overhead associated with ordering and propagating events. The main disadvantage of levelized compiled simulation techniques is that they are not general.

We have devised a general method for compiling event-driven models, called static simulation, that combines the generality of event-driven simulation with the efficiency of the levelized compiled approach. Like event-driven simulation, our technique applies to general models, including both synchronous and asynchronous designs. The only restriction is that any delays specified in the simulation must be known constants at compile time.

In our method, we represent the event-driven behavior with an event graph, whose vertices represent events in the simulation and whose edges represent the causal relationships between the events. We then use the general technique of partial evaluation to schedule the events as well as possible using statically available information. Specifically, the compiler tries to approximate the dynamic simulation process by keeping track of all the available static information that affects the contents of the run-time event queue in a dynamic simulation.
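The core idea can be sketched in a few lines of C++ (an illustration of the general approach, not VeriSUIF's implementation; it assumes the events have already been placed in an order consistent with the causal edges, with constant delays folded into that order): the run-time event queue is replaced by a fixed, compile-time schedule in which each event's code is guarded by a flag that its predecessors set only when their outputs actually change.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Event {
        std::vector<int>      fanout;   // events this one can cause (causal edges)
        std::function<bool()> action;   // evaluate the event; true if its output changed
    };

    // One simulated time step.  A single left-to-right pass over the statically
    // ordered events replaces the dynamic event queue; the per-event flags keep
    // the event-driven "only run what changed" behavior.
    void simulateStep(const std::vector<Event>& events, std::vector<char>& activated) {
        for (std::size_t i = 0; i < events.size(); ++i) {
            if (!activated[i]) continue;
            activated[i] = 0;
            if (events[i].action())                 // output changed:
                for (int succ : events[i].fanout)   // activate dependent events
                    activated[succ] = 1;
        }
    }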

To test our algorithm, we have implemented a prototype simulator, called VeriSUIF, using the SUIF (Stanford University Intermediate Format) compiler system. Our prototype implementation of the simulator achieves an average speedup of about two over VCS 2.3 on six benchmarks. More importantly, our average scheduling overhead amounts to only 4% of that found in the VCS code. We are improving our implementation with the goal of simulating a chip as complex as MAGIC five times faster than is possible today.

Data and Computation Transformations for Multiprocessors

Even though shared address space machines have hardware support for coherence, getting good performance on these machines requires programmers to pay special attention to the memory hierarchy. Today, expert users restructure their codes and change their data structures manually to improve a program's locality of reference.

We have developed a fully automatic compiler that translates sequential code into efficient parallel code for shared address space machines. Multiprocessor memory hierarchy performance can be affected by decisions made in several phases of the compilation process: parallelization, assignment of computation to processors, and data layout. We show that the problem of optimizing parallelism in conjunction with memory system performance can be reduced to two simpler problems. The first step chooses the parallelization and computation assignment such that synchronization and true sharing are minimized, without regard to the data layout. The algorithm for this step is also useful for distributed address space machines. The second step then makes the data accessed by each processor contiguous in memory. We introduce two primitive data transforms, strip-mining and permutation, and show that they can be used to derive all the standard data distributions. These primitives can also be used for other purposes, such as data layout optimizations on uniprocessor code.
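A small, hypothetical example of how the two primitives compose (illustrative index arithmetic only, not the compiler's internal representation): suppose an n x n array is accessed so that each processor works on a block of b columns. With a row-major layout, each processor's data is scattered across every row; strip-mining the column index into (j/b, j%b) and permuting the dimensions to (j/b, i, j%b) makes each processor's block contiguous.

    #include <cstddef>
    #include <vector>

    // Transformed storage for an n x n array distributed by blocks of b columns.
    struct BlockedArray {
        std::size_t n, b;                 // array size and block (strip) size
        std::vector<double> data;

        BlockedArray(std::size_t n_, std::size_t b_) : n(n_), b(b_), data(n_ * n_) {}

        // Offset of logical element (i, j) in the transformed layout.
        std::size_t offset(std::size_t i, std::size_t j) const {
            std::size_t blk = j / b, off = j % b;        // strip-mining of j
            return (blk * n + i) * b + off;              // permutation: (blk, i, off)
        }
        double& at(std::size_t i, std::size_t j) { return data[offset(i, j)]; }
    };

    // Processor p's block of columns now occupies the contiguous offsets
    // [p*n*b, (p+1)*n*b), so the cache hardware provides locality "for free".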

Since our data transformation algorithm uses data decompositions as inputs, it is also immediately applicable to HPF programs. HPF programs have traditionally been targeted for distributed address space machines. Our algorithm uses this same information to optimize for shared address space machines, while taking full advantage of the underlying hardware. Locality of reference is achieved simply by making the data accessed by each processor contiguous and relying on the cache hardware to provide memory management and coherence functions.

We have implemented our techniques in the SUIF compiler system. We ran our compiler over a set of sequential FORTRAN programs and measured the performance of our parallelized code on the DASH multiprocessor. Our experimental results show that our algorithm can dramatically improve the parallel performance of shared address space machines.

Verification

We have been developing formal verification methods for use in FLASH. There have been two initiatives: state enumeration methods for finite-state machines, and a new method for formally verifying processor pipelines.

State Enumeration

Our technique of using state-enumeration to automatically generate corner-case tests for validation produced some promising results with the Protocol Processor design [9]. Some bugs were discovered that required a complex combination of unusual events to occur simultaneously. These cases might not have been found using our hand-written test vectors or with randomly-generated vectors. The use of state-enumeration gives us an automatic way to create a large number of corner-case situations directly from the RTL Verilog, hopefully covering situations that the designers overlooked in their testing.

We continued our work by applying the same technique to the Data Buffer Allocator (DBA) unit of MAGIC. The state-enumeration required some down-scaling of the size of counters and the number of input ports modelled. However, it was functionally complete in other respects. Running the generated test vectors showed that the RTL Verilog conformed to the abstract model that we had created from the written specification of the DBA for the corner-cases found in the state-enumeration. The next unit targeted was the Inbox. Work on this has progressed to the point where test vectors are about to be generated and run.

We are also beginning to work on techniques to extend this unit-level testing method so that corner-case tests can be generated spanning multiple units of the chip and run on a full-chip simulation model. This extension is required since a unit may pass a test when run in isolation against a specification model, and yet contain a bug because of conflicting assumptions with other units.

Since October, we have transitioned from Synchronous Murphi to MPP as our state-enumeration tool. While working on the Inbox, we discovered that Synchronous Murphi suffered from an exponential growth in the size of the code it produced. The Synchronous Murphi compiler takes a description of the model in its own specialized language and creates a C++ program that performs the state-enumeration for that model. Unconstrained inputs to the model are represented in Synchronous Murphi as non-deterministic variables. The verifier tries all possible combinations of these non-deterministic variables to find all the reachable states. The problem arises because the Synchronous Murphi compiler statically expands all possible combinations of the non-deterministic variables in the generated C++ code. The resulting exponential growth in code size quickly overwhelms the C++ compiler. For many of our units, this prevents us from modelling all the inputs together, forcing us to choose subsets for each run and to tie the remaining inputs to known values. Our new state-enumeration tool, MPP, was developed to overcome this problem. Instead of statically expanding the combinations of non-determinism, it enumerates them dynamically at run time, trading some running time for the amount of non-determinism that can be modelled. We have found that we can drastically increase the amount of non-determinism modelled, giving us a better chance of reaching more corner-cases in each unit for test generation.
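The difference can be sketched as follows (a hypothetical illustration, not MPP's actual code): instead of the generated program containing one branch per combination of non-deterministic inputs, a run-time "odometer" enumerates the combinations and hands each one to the next-state computation.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using Assignment = std::vector<int>;   // one value per non-deterministic input

    // Enumerate every combination of input values at run time and pass each one
    // to the visitor (which would apply the model's next-state relation).  Code
    // size stays constant no matter how many non-deterministic inputs there are.
    void forEachInputCombination(const std::vector<int>& domainSizes,
                                 const std::function<void(const Assignment&)>& visit) {
        Assignment a(domainSizes.size(), 0);
        while (true) {
            visit(a);
            std::size_t k = 0;                        // advance the odometer
            while (k < a.size() && ++a[k] == domainSizes[k]) a[k++] = 0;
            if (k == a.size()) break;                 // wrapped around: done
        }
    }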

New Methods in Processor Verification

Last year, under other funding, we developed a new method for formally verifying processor pipeline control. The method compares two high-level descriptions, reporting inconsistencies if it detects them. One description is of the microprocessor implementation, called the MicroArchitecture (MA). The microarchitecture description must capture all of the pipelining and cycle-level timing of the implementation. The second description, which specifies the correctness requirements of the processor, is a programmer-level view of the microprocessor, which we call the Instruction-Set Architecture (ISA). Both descriptions must have functionally equivalent data-path elements; for example, the processor ALU must be the same in both descriptions.

The new method is similar to traditional methods, but requires orders of magnitude less labor because it automates many of the difficult parts of the verification process.

The ultimate theorem to be proved is that programs execute equivalently on the MA and ISA (in other words, the MA is a faithful implementation of the ISA). More precisely, we aim to show that, for every step of every program, the MA state can be mapped to an ISA state. This is done by creating an abstraction function to do the mapping, and proving that it holds for every step of a program.

Our method and traditional methods would approach this as an induction proof. The most difficult and time-consuming parts of this process would be finding the induction hypothesis (the right abstraction function), and proving that the hypothesis was correct. Our method greatly reduces the difficulty of both of these tasks.

We have discovered a processor-specific trick to make it easier to find an appropriate abstraction function. The primary problem is that, at any given time, the MA has a number of "partially executed" instructions in its pipeline. For example, it is difficult to compare the registers and program counters of the MA and ISA because the MA may have a jump instruction sitting somewhere in its pipe which updates the program counter in a different step from when it updates the registers. However, it is relatively simple to define an abstraction function for the special case where there are no partially executed instructions (we call this a "clean state") -- it requires that certain parts of the MA state, such as the registers, memory, and program counter, directly match corresponding state in the ISA.

Once the abstraction function has been specified in a clean state, our method can automatically augment the definition for the unclean state. If the processor starts in an unclean state, it can be forced into a clean state by "stalling" (e.g. simulating a cache miss) for a certain number of cycles, which will finish executing the instructions in the pipeline without starting any new instructions. Using a symbolic simulator, this stalling process can be executed from a symbolic state, yielding a "flushing function" which can be applied to any state. The flushing function can then be composed with the user-supplied abstraction function, to produce an abstraction function that works for any state (by flushing to a clean state, then applying the abstraction function for clean states).

The second difficult task is proving that the abstraction function truly is an abstraction function. Let MAstep(s) be the function that simulates one step of the MA, starting in state s, and returning the resulting state as a value; let ISAstep(s') be the function that simulates a step in the ISA given an ISA state s'; and let ABS(s) be the abstraction function, which maps MA states to ISA states. Then it is necessary to prove the theorem

ABS(MAstep(s)) = ISAstep(ABS(s))

for legal MA state s. In other words, if we start with s and take a step in the MA, the result is the same as abstracting s first, then taking a step in the ISA. This is an induction hypothesis that can be used to prove that programs execute in the same way on the MA and ISA.
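The construction and the proof obligation can be illustrated with a deliberately tiny example (entirely hypothetical and far simpler than the PP): an accumulator machine whose "MA" holds one not-yet-applied add in a pipeline latch, while the "ISA" applies each add immediately. Flushing drains the latch, a clean-state abstraction maps the drained state to the ISA, and the commutation theorem is checked for the reachable states. The real tool performs this check symbolically, in the logic described below, rather than on concrete states.

    #include <cassert>

    struct ISAState { int acc; int pc; };
    struct MAState  { int acc; int pc; bool pending; int pendingVal; };

    constexpr int kProg[]  = {3, 5, 7, 11};   // the "program": values to add
    constexpr int kProgLen = 4;

    ISAState ISAstep(ISAState s) {            // fetch and apply in one step
        if (s.pc < kProgLen) { s.acc += kProg[s.pc]; ++s.pc; }
        return s;
    }

    MAState MAstep(MAState s) {               // apply the pending add, fetch the next
        if (s.pending) { s.acc += s.pendingVal; s.pending = false; }
        if (s.pc < kProgLen) { s.pendingVal = kProg[s.pc]; s.pending = true; ++s.pc; }
        return s;
    }

    MAState flush(MAState s) {                // "stall": drain the pipeline latch
        if (s.pending) { s.acc += s.pendingVal; s.pending = false; }
        return s;
    }

    ISAState absClean(const MAState& s) {     // user-supplied map for clean states
        assert(!s.pending);
        return {s.acc, s.pc};
    }

    // Abstraction function for arbitrary states: flush, then map the clean state.
    ISAState ABS(const MAState& s) { return absClean(flush(s)); }

    int main() {
        // Check ABS(MAstep(s)) = ISAstep(ABS(s)) for the states reachable here.
        MAState s{0, 0, false, 0};
        for (int i = 0; i <= kProgLen; ++i) {
            ISAState lhs = ABS(MAstep(s)), rhs = ISAstep(ABS(s));
            assert(lhs.acc == rhs.acc && lhs.pc == rhs.pc);
            s = MAstep(s);
        }
        return 0;
    }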

We use a particular logic that is well-suited for processor verification problems: the logic of equality over uninterpreted functions. This logic allows expressions like f(a) = g(b) to appear in logical formulas, along with the usual Boolean functions. The uninterpreted function symbols are especially useful for modelling data path computations.

For example, ALU(opcode, src1, src2) might represent the result of the ALU when given opcode, src1, and src2. Note that the expression says nothing about what the ALU actually does, or how many bits of data are in src1 and src2. This information is not necessary, since the MA and ISA have the same ALU and data values.

We have written a program that checks validity ("is it always true?") for formulas in this logic. The problem is NP-complete, so the program is heuristic. However, it has succeeded on some large practical examples. The theorem

ABS(MAstep(s)) = ISAstep(ABS(s))

can be created automatically by generating the abstraction function as described above, and generating the MAstep and ISAstep functions by simulating the MA and ISA for one step starting in a symbolic state.

We have applied this formal verification method to the FLASH protocol processor (PP). Formally verifying the PP is a challenging problem: it is a dual-pipeline VLIW processor, with complex constraints on the code generated by the compiler.

We hand-translated the Verilog description of FLASH (written by the designers) to the input language for our symbolic simulator. We wrote an ISA for the PP in the same language. We have been able to formally verify the entire pipeline for the PP, which is orders-of-magnitude more difficult than previous examples that have been verified using other methods.

To do this, we had to deal with some difficulties in modelling. First, unlike many processors, the PP does not continue processing the instructions already in the pipeline when it gets a cache miss. So we had to add this capability in order to get the abstraction function. This entails some extra work, but it is not dangerous since the MAstep in the theorem can still be performed using the original, unmodified description -- the modified hardware is only used to obtain an abstraction function. Also, the compiler constraints had to be written in our logic, and used as assumptions in the proof of correctness (since the PP is not guaranteed to work correctly if the constraints are not satisfied).

The PP also has a complex memory system with caches, store buffers, and so on. We modelled the memory system abstractly to verify the pipeline. We are currently working on verifying that the actual memory system implementation is consistent with the abstract memory. When this succeeds, the results of the individual pipeline and memory system verifications can be composed, without doing any additional verification, with full confidence that the results are sound.

The verification of the memory system is posing some special problems because it has embedded state machines, which seem to cause our method some difficulty. Research is currently underway to understand and solve this problem.

Bibliography

1. Amarasinghe, S., Anderson, J., Lam, M.S., and Tseng, C.-W. "An Overview of the SUIF Compiler for Scalable Parallel Machines" in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing. San Francisco, CA. February, 1995.

2. Anderson, J., Amarasinghe, S., and Lam, M.S. "Data and Computation Transformations for Multiprocessors" in Proceedings of the 5th Symposium on Principles and Practice of Parallel Programming. ACM SIGPLAN. July, 1995. To appear.

3. Chandra, R., Devine, S., Verghese, B., Gupta, A., et al. "Scheduling and Page Migration for Multiprocessor Compute Servers" in Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM & IEEE. San Jose, CA. pgs. 12-24. October, 1994.

4. Chapin, J., Herrod, S., Rosenblum, M., and Gupta, A. "Memory System Performance of UNIX on CC-NUMA Multiprocessors" in Joint International Conference on Measurement and Modeling of Computer Systems. ACM Sigmetrics. Ottawa, Ontario, Canada. May, 1995. To appear

5. French, R., Lam, M.S., Levitt, J., and Olukotun, K. "A General Method for Compiling Event-Driven Simulations" in 32nd Design Automation Conference. IEEE/ACM. San Francisco, CA. June, 1995. To appear.

6. Hall, M., Murphy, B., and Amarasinghe, S. "Interprocedural Parallelization Analysis: A Case Study" in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing. San Francisco, CA. February, 1995.

7. Heinlein, J., Gharachorloo, K., Dresser, S., and Gupta, A. "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor" in Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM & IEEE. San Jose, CA. pgs. 38-50. October, 1994.

8. Heinrich, M., Kuskin, J., Ofelt, D., Heinlein, J., et al. "The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor" in Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM & IEEE. San Jose, CA. pgs. 274-285. October, 1994.

9. Ho, R., Yang, C.H., Horowitz, M., and Dill, D. "Architecture Validation for Processors" in 22nd International Symposium on Computer Architecture. ACM SIGARCH and IEEE TCCA. Santa Margherita Ligure, Italy. June, 1995. To appear.

10. Holt, C., Heinrich, M., Singh, J.P., Rothberg, E., et al. "The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors". Stanford University-Computer Systems Laboratory. Technical Report, CSL-TR-95-660. January 1995.

11. Nayfeh, B. and Olukotun, K. "Evaluating the Impact of Clustering for Small-Scale Shared-Memory Multiprocessors" in 2nd International Symposium on High-Performance Computer Architecture. IEEE/ACM. San Jose, CA. February, 1996. Submitted.

12. Olukotun, K., Bergmann, J., and Chang, K.-Y. "Rationale and Design of the Hydra Multiprocessor". Stanford University-Computer Systems Laboratory. Technical Report, CSL-TR-94-645. November 1994.

13. Rosenblum, M., Herrod, S., Witchel, E., and Gupta, A., "Complete Computer System Simulation: The SimOS Approach." IEEE Journal of Parallel and Distributed Technology. November 1995. To appear.

14. Rosenblum, M. "Hive: Fault Containment for Shared-Memory Multiprocessors" in SOSP-95. 1995. Submitted.

15. Sidiropoulos, S. and Horowitz, M. "Current Integrating Receivers for High Speed System Interconnects" in Custom Integrated Circuits Conference. IEEE Electron Devices Society. Santa Clara, CA. May, 1995. To appear.

16. Verghese, B., Devine, S., Rosenblum, M., and Gupta, A. "OS Support for Improving Data Locality on CC-NUMA Compute Servers" in Symposium on Systems Programming. 1996. To appear.

17. Wilson, R., French, R., Wilson, C., Amarasinghe, S., et al., "SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers." ACM SIGPLAN Notices. Vol. 29(12): pgs. 31-37. December 1994.

18. Woo, S., Singh, J.P., and Hennessy, J. "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors" in Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM & IEEE. San Jose, CA. pgs. 219-230. October, 1994.

19. Woo, S., Ohara, M., Torrie, E., Singh, J.P., et al. "Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite" in 22nd International Symposium on Computer Architecture. ACM SIGARCH and IEEE TCCA. Santa Margherita Ligure, Italy. June, 1995. To appear.

 
