FastSysC : A Fast SystemC Engine

SystemC is rapidly gaining wide acceptance as a simulation framework for SoC and embedded processors. While its main assets are modularity and the very fact it is becoming a de facto standard, the evolution of the SystemC framework (from version 0.9 to version 2.0.1) suggests the environment is particularly geared toward increasing the framework functionalities rather than improving simulation speed. For cycle-level simulation, speed is a critical factor as simulation can be extremely slow, affecting the extent of design space exploration.

Presenting FastSysC

We propose a fast SystemC engine that, in our experience, can speed up simulations by a factor of 1.93 to 3.56 over SystemC 2.0.1. This SystemC engine is designed for cycle-level simulators and for the moment, it only supports the subset of the SystemC syntax (signals, methods) that is most often used for such simulators. We achieved greater speed (1) by completely rewriting the SystemC engine and improving the implementation software engineering, and (2) by proposing a new scheduling technique, intermediate between SystemC dynamic scheduling technique and existing static scheduling schemes. Unlike SystemC dynamic scheduling, our technique removes many if not all useless process wake-ups, while using a simpler scheduling algorithm than in existing static scheduling techniques.

Improving SystemC Scheduling with Acyclic Scheduling

SystemC simulators are typically made of sequential processes which are sensitive to a clock edge, and combinational processes which are sensitive to their input ports. The former processes are typically waken up at the beginning of the simulated clock cycle; the order in which the latter processes are waken up will determine the total number of wake-ups within the simulated clock cycle. This order is called the process schedule.

SystemC proposes a dynamic scheduling mechanism: at each simulated clock cycle, processes which are sensitive to the clock front edge are waken up, and during their execution, they modify outgoing signals, waking up in turn other processes. A simulated clock cycle is over when no more process needs to be waken up, possibly after multiple iterations/delta-cycles. This technique is called dynamic scheduling because the set and order of process to wake-up is not known at the beginning of the simulated clock cycle. While dynamic scheduling is fairly easy to implement, it also suffers from an excessive number of wake-ups.

To better understand how process scheduling can affect the number of wake-ups, consider the example of figure below:

The architecture is composed of 4 modules with one process each.

The dotted arrows in figure above indicate the data flow within each process: for instance, when S2 is modified, S4 and S6 are modified by Process C. Suppose that signal S3 is a request and signal S4 is an acknowledge: Process B must assert signal S3 before gaining an acknowledge on signal S4.

The left side (a) of the figure below shows how SystemC could schedule such a set of processes:

  1. Processes A and D wake up because they are sensitive to the clock front edge. The scheduler updates signals S1 and S2.
  2. Processes B and C wake up. The scheduler updates signals S3, S4, S5 and S6.
  3. Processes C and B wake up. The scheduler updates signals S4, S5 and S6.
  4. Process B wakes up. The scheduler updates signals S5.

We can observe that Process B wakes up three times, and Process C wakes up twice. If Process C were waken up after Process B has executed, and then Process B were waken up again, the total number of process wake-ups would be 5 instead of 7, as shown in the optimal scheduling presented in (b) of the figure above. Therefore, if process wake-ups were scheduled in a given order and serialized, we could avoid many useless wake-ups.

Seeking an appropriate wake-up order for processes is the principle of static scheduling, an alternative to dynamic scheduling: the architecture is viewed as a graph, each process being a vertex and each link an oriented edge (or conversely), and the graph is analyzed to determine the shortest possible path through the graph; this path corresponds to a process schedule which is then compiled.

In practice, we found that providing enough additional dependency information between signals is enough to obtain an acyclic graph, because the natural flow-oriented structure of processor and system architectures already give a fairly (if not fully) acyclic graph.
Therefore, we have developed a simple scheduling technique that is intermediate between static and dynamic scheduling called acyclic scheduling.

Unlike the abovementioned scheduling techniques, we do not attempt to find and schedule strongly connected components, we bluntly interpret the graph as one or several acyclic graphs. For that purpose, we simply rank the vertices starting with the source vertices, i.e., the vertices with no inbound edge.

Consider again the previous example. The dependency graph is shown on the left of figure below:

Signals S1 and S2 have rank 0 because they are source vertices; signal S3 has rank 1, signals S4 and S6 have rank 2 and signal S5 has rank 3. The corresponding process schedule is obtained by replacing the signals with the processes updating the signals, as shown on the right of the figure.

Results

The 5-stage RISC processor we tested has 11 sequential processes and 18 combinational processes. We found that its original graph has no cycle, so that adding dependency information was not necessary to improve its performance.
Using acyclic scheduling, the number of process wake-ups decreased by 22.7% compared to SystemC 2.0.1.

The superscalar processor simulator has 25 sequential processes, 29 combinational processes, and 460 signals. Unlike the RISC processor, its graph has several cycles, but we found that adding 68 signal dependency informations would remove all cycles.
For the combinational processes, we found that our acyclic scheduling eliminated almost all process wake-ups in excess, meaning the performance of simulators executed with this schedule can be much more predictable and regular than with a dynamic scheduling algorithm.

The overall speedup of the acyclic scheduling technique is 3.56 for the RISC processor and 1.96 for the superscalar processor, see figure below; we can note that the compiled version (see Acyclic Scheduling) performs way better than the non-compiled version.

Related Documents

Links

FastSysC was originaly developed as part of the MicroLib project. You can refer to the FastSysC page on the Microlib website.