A Highly Parallel Processor
Using a Data Flow Machine Language

by

Jack B. Dennis
Clement K. Leung
David P. Misunas

This work was supported by grant DCR75-04060 from the National Science Foundation for research on data flow computer architecture.

January 1977
(Revised June 1979)
A Highly Parallel Processor Using a Data Flow Machine Language*

by

Jack B. Dennis
Clement K. Leung
David P. Misunas

Abstract: A computer based on the data flow principle executes instructions in response to the arrival of their operands -- there is no notion of sequential control flow. Because programs expressed in data flow form are free of sequencing constraints other than those imposed by the flow of operands between instructions, a processor using a data flow program representation can be designed to achieve highly parallel operation through concurrent execution of program parts which have no data dependencies. A graphical data flow language and a corresponding architecture for a highly parallel processor are presented in this paper. The language is illustrated by expressing a Fast Fourier Transform algorithm in data flow form. A machine language program suitable for executing the FFT algorithm on the data flow processor is derived, and the potential performance of the processor for this computation is discussed.

*This work was supported by grant DCR75-04060 from the National Science Foundation for research on data flow computer architecture.
1. Introduction

Most efforts to devise computer architectures for highly parallel computation have retained the traditional concept of sequential control flow in their machine level program representations. On one hand a very limited level of parallelism is achieved through use of multiprocessor organizations. On the other hand, parallel operation is achieved in single instruction stream machines either through analysis of the instruction stream as in the IBM 360/195 [5], or through use of specialized data formats and operations as in machines like the Texas Instruments Advanced Scientific Computer [35] which have pipelined processing units, and in array machines such as the Illiac IV [9] and Staran [10].

An alternate approach is to use an architecture able to exploit parallelism on a global basis through use of a machine level program representation based on the concept of data flow. In such a machine, the execution of instructions is driven by the flow of data, and any instructions that are not data-dependent may be executed concurrently.

The concept of data-directed instruction execution has appeared several times in the literature; in particular, see the reports of Shapiro, Saint and Presberg [33], Seeber and Lindquist [32] and Miller and Cocke [25]. However, these early papers left many questions unanswered and failed to relate proposed architecture to a well-defined level of programming language expressive power.

Later, work in the area of program schemata, especially that of Karp and Miller [21, 22] led to development of the data flow program graphs [14, 29] and related abstract models for data-driven programs [4, 8, 24]. This work led to the use of data flow program graphs as the basis for the conception and development of several computer architectures having potential for extensive exploitation of parallelism in computation. These early projects include the work of Davis at the University of Utah [13], Arvind and Gostelow at the University of California at Irvine [6], and a
closely related project at Toulouse, France [28], based on the single assignment idea of Tesler and Enea [34]. Interest in the data driven approach to computer architecture has spread recently with projects at the Texas Instruments Co., at Manchester and Newcastle Universities in England, and a second project at the University of Utah.

In the Computation Structures Group of the MIT Laboratory for Computer Science, two of the authors have designed several hypothetical machines that implement specific data flow languages [16, 17], and the architecture of a data flow multiprocessor that implements a high order data flow language has been presented in the doctoral thesis of Rumbaugh [30, 31]. The subject of the present paper is a comprehensive treatment of a data flow processor in which the machine language programs are a direct encoding of programs expressed as data flow program graphs, and an analysis of its performance potential for a particular computation of considerable importance -- the Fast Fourier Transform.

Data flow program graphs as a representation for computation that exposes concurrency are developed in Section II. In illustration, a data flow program for the Fast Fourier Transform algorithm is developed. The overall architecture and principles of operation of the data flow processor are presented in Sections III and IV. The coding of the FFT algorithm as a machine level program for the data flow processor is developed in Section V. The structure of the routing networks that convey information between sections of the processor is discussed in Section VI, where we also estimate the performance potential of our hypothetical processor for the Fast Fourier Transform. In the concluding section we discuss improvements and extensions of our architectural concepts.

II. Data Flow Program Graphs

We envision that a user of a data flow processor would express his programs in a textual language and a translation program would be used to generate the machine level program
representation directly. The design of a textual source language and a translator to match the qualities of a data flow computer is an interesting problem in itself and has been addressed in a thesis report by Weng [36] and in [3], but further discussion is beyond the scope of this paper. To illustrate the concepts of data-driven computation, we have chosen for presentation in this paper a graphical representation of data flow programs, an example of which is given in Figure 1. These data flow program graphs are convenient for representing the structure of programs prepared for execution on the data flow processor and for studying their properties. In later sections, we develop a program graph to represent a data flow program for an FFT algorithm and also derive from it a machine language representation of the algorithm.

The nodes of a data flow program graph are of two kinds called actors and links. Informally, actors perform elementary computational steps and pass results to succeeding actors by way of the links.

For a formal discussion of the behavior of data flow program graphs we consider configurations of the program graph in which tokens, represented by large solid dots, are associated with certain arcs of the graph. Each token carries a value which is an element of some data type. Several configurations of the program graph in Fig. 1 are shown in Fig. 2. The behavior of a data flow program graph is specified by firing rules which specify the possible sequences of configurations that may describe computations by the program graph. For all links and actors of data flow program graphs (with an exception noted later), the firing rule is the following: (1) A node (an actor or a link) is said to be enabled if a token is present on each of its input arcs and no token is present on any of its output arcs; (2) any enabled node may be chosen to "fire"; (3) firing a link means removing the token from the input arc and associating its value with tokens placed on each of its output arcs; (4) firing an actor means that tokens are removed from each of the actor's input arcs, the values associated with the input tokens are used to determine a result, and a token
Figure 1. A data flow program graph.
Figure 2. Snapshots of a data flow program in execution.
carrying this result value is placed on the actor's output arc.

Figure 2 illustrates the behavior of the program graph of Fig. 1, which consists of the eight actors A1 - A8 and the six links L1 - L6. Actors A1, A2 and A3 are input operators which place values from the environment on their output arcs whenever they are available and the arcs are not occupied. In Fig. 2a the values $x_1^0$, $x_2^0$ and $w^0$ have been received. In this configuration of the program graph, links L1, L2 and L3 are enabled. Suppose links L2 and L3 fire, placing tokens on the input arcs of the multiplication actor A4. Then A4 may fire, placing a token carrying the value $x_2^0 w^0$ on its output arc, yielding configuration (b). Links L1 and L4 are now enabled and may fire in either order. Once both fire, configuration (c) is reached. Actors A5 and A6 are now enabled and their firings deliver the result values $x_1^0 + x_2^0 w^0$ and $x_1^0 - x_2^0 w^0$ to output actors A7 and A8 through links L5 and L6 as in configuration (d). Whenever the environment is ready to accept an available output, an output actor removes the token carrying the output value from its input arc and delivers it to the environment. In the meantime, more input values $x_1^1$, $x_2^1$ and $w^1$ may have been received and a second instance of computation by the program graph may follow the first through the graph, as indicated in the figure. Thus this data flow program graph allows the execution of computations in pipeline fashion.

The links and actors used in the data flow program graphs of the present paper are shown in Figs. 3 and 4. The two kinds of links transmit data values (values of type integer, real or complex, for example) and boolean values, respectively. The behavior of operator actors (Fig. 4a) has already been explained; the function letter f may denote any primitive operation on data values. The actors in the program graph of Fig. 1 are all operators. An identity operator (Fig. 4b) is a special kind of operator that has one input arc and transmits its input value unchanged.

Deciders, gates and merge actors are used in the representation of conditional or iterative computation in data flow program graphs. A decider (Fig. 4c) requires a value from each input arc
Figure 3. Links of the data flow language.
Figure 4. Actors of the data flow language.
and produces the truth value resulting from applying the predicate \( p \) to the values received.

Tokens bearing truth values control the flow of data tokens by means of T-gates, F-gates and merge actors (Fig. 4d, 4e, 4f). A T-gate passes a data token from its data input arc to its output arc when it receives the value true on its control input arc. It will absorb a data token from its data input arc and place nothing on its output arc if it receives the truth value false. An F-gate has similar behavior, but with the sense of the truth value reversed. A merge actor has T- and F-data input arcs, and a truth value input arc. When a truth value is received, the merge actor places a token on its output arc bearing the next data value received on the corresponding data input arc. A token on the other data input arc is unaffected.

The data flow program graph in Fig. 5 illustrates use of the decider, gate and merge actors.

It represents the computation of \( z = x^n \) specified by the following conventional program:

\[
\text{input } x, n; \\
y := 1; i := n; \\
\text{while } i > 0 \text{ do} \\
\quad \text{begin } y := y \cdot x; i := i - 1 \text{ end; } \\
z := y \\
\text{output } z;
\]

The successive values assumed by the loop variables \( y \) and \( i \) pass through the links they label in the program graph. The decider emits a token carrying the value true each time execution of the loop body is required. (This routes the current values of loop variables through the body operators.) When firing of the decider yields false, the value of \( y \) is routed to the output link \( z \).

Note the presence of tokens carrying false values on the truth value input arcs of the merge actors. These tokens allow the merge actors to initiate execution of the loop by passing in initial values for the loop variables.
Figure 5. An iterative data flow program.
The Fast Fourier Transform

Now we are ready to construct a data flow program graph for the Fast Fourier Transform algorithm. The discrete Fourier transform of a sequence of \( N = 2^n \) input samples \( x_0, ..., x_{N-1} \) is the sequence of values \( f_0, ..., f_{N-1} \) where

\[
f_k = \sum_{i=0}^{N-1} x_i W^{ik}
\]

and

\[
W = e^{-j(2\pi/N)}
\]

The direct computation of these values involves the accumulation of \( N^2 \) product terms; the Fast Fourier Transform (FFT) is based on the observation that the transform on \( 2^p \) data samples can be simply expressed in terms of two transformations on \( 2^{p-1} \) samples. Continuing recursively, one discovers that the transform on \( 2^n \) points can be expressed in terms of \( n \cdot 2^{n-1} \) transformations of two points each. Figure 6 shows the flow of values in one arrangement of the FFT computation for eight data points \( (n = 3) \). This arrangement, in which the computation consists of \( n \) stages (the columns of the figure) having identical form, is known as the time decimated, constant geometry FFT \([19]\). Each stage of the computation consists of \( N/2 \) units of similar form, known as "butterflies," which compute two-point transforms.

The general form of this FFT algorithm may be described as follows: Let \( u_{p,k} \) be the \( k^{th} \) component of the vector of values computed by the \( p^{th} \) stage of the computation. Then \( B_{p,q} \) the \( q^{th} \) butterfly of stage \( p \) computes

\[
u_{p,q} = u_{p-1,2q} + u_{p-1,2q+1} W^{p,q}
\]

(2)

\[
u_{p,q+2^{n-1}} = u_{p-1,2q} - u_{p-1,2q+1} W^{p,q}
\]

(3)
Figure 6. The eight-point, constant geometry, time decimated FFT.
where the exponent $e_{p,q}$ of each phase factor $w_{p,q} = W^{e_{p,q}}$ is given by

$$e_{p,q} = 2^{n-p} \text{quo}(q, 2^{n-p})$$

and

$$0 \leq q < 2^{n-1} \quad 0 < p \leq n$$

The function \text{quo}(m,n) yields the integer quotient of $m$ divided by $n$. The input values for stage one are related to the data samples by

$$u_{0,k} = x_i \text{ where } i = \text{rev}(k)$$

in which \text{rev} is the operation on integers such that the $n$-bit binary representation of $i$ is the reverse of the $n$-bit representation of $k$. The output values are

$$f_k = u_{n,k}, \quad 0 \leq k < 2^n$$

We wish to take maximum advantage of parallelism in representing the FFT as a data flow program graph, but since each actor will take space in the machine representation, we do not want to use a larger program graph than necessary to exploit concurrency. Since each stage of the computation uses values computed by the preceding stage, it is appropriate to construct the program graph as an $n$-cycle iteration in which the body consists of the $2^{n-1}$ butterflies comprising one stage of computation written out explicitly. The form of the corresponding data flow program graph is shown in Figure 7 for the eight-point case. This is fairly easy because the constant geometry of the computation over all stages makes it possible to use a fixed routing of values from the outputs of the butterflies to their inputs where they become operands for the next cycle. Generating the phase factors for each butterfly, however, presents a problem. The usual technique is to use a table lookup in a table of powers of $W$, but our program graph notation includes no suitable mechanism. Instead, the factor $w_{p,q}$ used for butterfly $q$ in stage $p$ may be computed from
Figure 7. Iteration data flow program graph for the eight point FFT.

Phase Factor Generation*  
*for each q in {0, 1, 2, 3}

Loop Control
the factor $w_{p-1,q}$ used for the previous stage by a simple rule derived as follows: The exponents of $W$ for $w_{p,q}$ and $w_{p-1,q}$ are (from (4)):

$$e_{p,q} = 2^{n-p} \text{quo}(q, 2^{n-p})$$

$$e_{p-1,q} = 2^{n-p-1} \text{quo}(q, 2^{n-p-1})$$

Then

$$e_{p,q} = e_{p-1,q} \cdot (e_{p,q} - e_{p-1,q})$$

$$= e_{p-1,q} \cdot 2^{n-p} \left(\text{quo}(q, 2^{n-p}) - 2 \text{quo}(q, 2^{n-p+1})\right)$$

The careful study of the factor $T_{p,q}$ reveals that

$$T_{p,q} = \begin{cases} 
0 & \text{if } \text{quo}(q, 2^{n-p}) \text{ is even} \\
1 & \text{if } \text{quo}(q, 2^{n-p}) \text{ is odd}
\end{cases}$$

Thus $T_{p,q}$ is the $(n-p)^{th}$ bit in $q_{n-1} \ldots q_0$, the $n$-bit binary representation of $q$. Let $\text{bit}(r, q)$ be a primitive function that yields the $r^{th}$ bit of $q$. Then we have

$$w_{p,q} = w_{p-1,q} \times (\text{if } \text{bit}(n-p, q) = 1 \text{ then } W^{2^{n-p}} \text{ else } 1)$$

The initial value of the phase factor for the $q^{th}$ butterfly is

$$w_{1,q} = W^{e_{1,q}} \text{ where } e_{1,q} = 2^{n-1} \text{quo}(q, 2^{n-1})$$

$$= W^0 = (1 + j0)$$

The computation of the phase factors $w_{p,q}$ is performed by the sections of Figure 7 labelled "Phase Factor Generation" and "Phase Constant Queue."

We suppose that the signal values are delivered to the program graph as a continuous
Figure 8. Tree of fan-out alternators for sample distribution.
Figure 9. Fan-in alternator.
stream through a single input operator, and must be distributed among the $2^N$ input links of the FFT program. This may be done by means of a binary tree of program graph fragments, which we may call fan-out alternators, connected as in Fig. 8. A similar binary tree of fan-in alternators (Fig. 9) can be used to form the transform values $f_0, ..., f_{N-1}$ into a stream.

III. The Data Flow Processor: An Overview

The data flow processor is a stored program computer designed to exploit the concurrency of action represented by data flow program graphs such as we have illustrated for the Fast Fourier Transform. The overall structure of this processor is shown in Fig. 10; it consists of five major sections connected by channels through which information is sent in the form of discrete packets. The five sections are:

Memory Section -- consists of Instruction Cells which hold instructions and their operands.

Processing Section -- consists of Processing Units that perform the basic operations on data values.

Arbitration Network -- delivers operation packets from the Memory Section to the Processing Section.

Control Network -- delivers control packets from the Processing Section to the Memory Section.

Distribution Network -- delivers data packets from the Processing Section to the Memory Section.

Briefly, instructions held in the Memory Section are enabled for execution by the arrival of their operands in data packets from the Distribution Network and control packets from the Control Network. Enabled instructions, together with their operands, are sent as operation packets to the Processing Section through the Arbitration Network. The results of instruction execution are sent through the Distribution and Control Networks to the Memory Section where they become
Figure 10. General structure of the data flow processor.
operands of other instructions. We next consider the operation of each section in more detail.

The Memory Section of the processor is a collection of Instruction Cells. Each Instruction Cell has a unique identifying address, the cell identifier. An occupied Cell holds an instruction consisting of an operation code and several destinations. Each destination contains a destination address, which is a cell identifier, and additional control information used by processing units to generate result packets. An instruction represents one or more actors of the program graph together with their output links. Instructions are linked together through destination addresses stored in their destination fields.

Each Cell also contains three Receivers (Figure 12) which await the arrival of values for use as operands by the instruction. Once an Instruction Cell has received the necessary operand values and acknowledge signals,\(^1\) the Cell becomes enabled and sends an operation packet, consisting of the instruction and the operand values, to the appropriate Processing Unit through the Arbitration Network.

The Arbitration Network provides a path from each Instruction Cell to each Processing Unit, and sorts the operation packets among its output ports according to the operation codes of the instructions they contain. For each operation packet received, a Processing Unit performs the operation specified by the instruction using the operand values in the packet, and produces one or more result packets which are sent to Instruction Cells through the Control Network and Distribution Network. Each result packet consists of a result value and a destination address derived from the instruction being processed by the Processing Unit. There are two kinds of result packets: control packets containing boolean values or acknowledge signals, which are sent through the Control Network; and data packets containing integer or complex values, which are sent

\(^1\) Acknowledge packets at the machine level are needed to correctly implement the firing rule for program graphs. Their use is explained fully in Section V.
through the Distribution Network. The two networks deliver result packets to Receivers of Instruction Cells as specified by their destination address fields; that is, result packets are routed according to their destination address.

Arrival of a result packet at an Instruction Cell either provides one of the Receivers of the Cell with an operand value or delivers an acknowledge signal; if all result packets required by the instruction in the Cell have been received, the Instruction Cell becomes enabled and dispatches its contents to the Arbitration Network as a new operation packet.

Note that the functions performed by the processing unit of a conventional machine are distributed among several sections of the data flow processor. The operations specified by instructions are carried out in the Processing Section, but control of instruction sequencing is a function of the Instruction Cells of the Memory Section, and the decoding of operation codes is partially done within the Arbitration Network. Also unusual is the fact that address fields (destination addresses) of instructions only specify where results are to go, and are not used to access operand values. Instead of instructions having to ask for their operands, the operand values are sent to the instructions.

We emphasize that all communication between parts of the data flow processor is by packet transmission over the channels shown explicitly in Fig. 10; there are no connections other than those shown in the figure. Furthermore, transmission of packets over each channel is done using an asynchronous protocol so the five sections of the processor may operate independently without need for a clock or other central source of timing signals. Systems organized to operate in this manner are said to have packet communication architecture [15].

In particular, note that the Instruction Cells are assumed to be physically independent, so at any time many of them may be enabled. Later, we will show how the Arbitration Network can be designed so many instruction packets may flow into it concurrently and be funneled into dense
streams of packets directed to the processing units. Similarly, the Control Network and the Distribution Network can be designed to distribute dense streams of control and data packets efficiently to the Instruction Cells through highly concurrent operation. In this way, highly parallel operation of the entire processor is achieved, and the appetites of pipelined processing units can be satisfied.

IV. The Data Flow Processor: Principles of Operations

We have described the major modules of a data flow processor and a graphical representation of data flow programs. The architectural implications of data flow concepts is further studied in this paper by deriving a machine level program representation of the FFT algorithm from the program graph of Figure 7 and estimating the performance of a data flow processor in executing this program. In this section we define the data flow processor modules in sufficient detail to support these developments. An instruction set is specified, and its capabilities are explained by detailing the principles of operation for an instruction cell and for two functional units.

The Processing Section consists of five processing units having characteristics appropriate for the FFT algorithm:

1. Multiplier -- multiplication of complex operands.
2. Adder -- addition and subtraction of complex operands.
3. Distributor -- replication and distribution of data and control values.
4. Int-Processor -- integer arithmetic and test operations.
5. Cntl-Processor -- replication and routing of data and control values.

The formats of packets transmitted between major sections of the data flow processor are given in Figure 11. Each instruction consists of an operation code from opcode-set and an array of up to five destinations which specify the data and control packets to be generated by instruction
definitions:

**type** instruction = record
  opcode: opcode-set;
  dest: array[1..5] of destination
end;

**type** destination = record
  used: boolean;
  send-ack-signal: boolean;
  switch: {null, boolean};
  addr: address
end;

**type** address = record
  cell-id: L..n-of-cell;
  rec-id: 1..3
end;

**type** operand = {null, boolean, integer, complex};

**type** operation-pkt = packet
  inst: instruction;
  opd: array[1..3] of operand
end;

**type** cntl-pkt-c = packet
  ctype: (ACK, BOOL);
  value: {null, boolean};
  addr: address
end;

**type** cntl-pkt-r = packet
  ctype: (ACK, BOOL);
  value: {null, boolean};
  rec: 1..3
end;

**type** data-pkt-c = packet
  dtype: (INT, CPLX);
  value: {integer, complex};
  addr: address
end;

**type** data-pkt-r = packet
  dtype: (INT, CPLX);
  value: {integer, complex};
  rec: 1..3
end;

Figure II. Packet Definitions for a Data Flow Processor.
execution in the Processing Section and where the packets are to be sent. A destination consists of
an address and other information to be explained later, which is interpreted by the Processing
Units. An address consists of an integer that designates an Instruction Cell and an integer that
specifies the Receiver of the Instruction Cell which is to receive a control or data packet. An
operation packet consists of an instruction and an array of three operands, each of which may be
null in case the operand is not required. Two forms of control packets and two forms of data
packet are declared; this is because the cell-id component of the address field is irrelevant once the
packet has been routed to the correct Instruction Cell by the Control or Distribution Network.
Packets generated by the processing units are of type data-pkt-c or cntl-pkt-c. Packets delivered to
instruction cells are of type data-pkt-r or cntl-pkt-r.

The Arbitration Network delivers an operation packet from an Instruction Cell to the
processing unit specified by its opcode. The Distribution Network delivers a data packet from a
processing unit to the Instruction Cell specified in its cell-id. Similarly the Control Network
delivers control packets from processing units to Instruction Cells. The structure of the routing
networks will be discussed later where we relate their characteristics to the performance potential of
the data flow processor for the FFT algorithm.

Instruction Cell Operation

The function of each Instruction Cell is to receive control and data packets, and to transmit
an operation packet when all needed operands have arrived and the enabling condition for its
instruction is satisfied. In a machine language program which contains an iteration construct, or
which is intended to process data streams via pipelining, it is necessary to condition instruction
execution on receipt of a specified number of acknowledge signals (as explained in Section V) in
order to implement the firing rule correctly. Thus the general enabling condition for an Instruction
Cell is that the required data and control packets have arrived and that the Cell has also received a
specified number of acknowledge packets.

As shown in Figure 12, an Instruction Cell consists of an input interface module Cell-Input-Cntl, three Receiver modules and an output interface module Cell-Output-Cntl. The Cell-Input-Cntl unit processes control and data packets, sending acknowledge signals on to Cell-Output-Cntl and distributing operand values to the Receivers according to the receiver number in the packet. The format of receiver packets, sent from Cell-Input-Cntl to the Receivers, is:

\[
\text{type receiver-pkt = packet \(rtype\): (BOOL, INT, CPLX);}
\]

\[
\text{value: \{boolean, integer, complex\} end.}
\]

A behavior description of the Cell-Input-Cntl module is given in Figure 13.\(^2\) Behavior descriptions for the Receiver module and the Cell-Output-Cntl module are given in Figures 14 and 15.

Each Receiver (Figure 14) may be set (by initializing receiver-type and receiver-mode) to provide one of three different types of operand values, boolean, integer or complex, and to operate in one of two modes: constant and variable. The behavior of a Receiver for each allowed

\(^2\) The constructs used in these descriptions are familiar programming language constructs except for the receive and send statements. We envision that operations specified in the separate boxes of a behavioral description are carried out concurrently, and are synchronized by passing signals between them explicitly. A statement

\[\text{receive x at P}\]

means that the next packet to arrive at input port P is made the value denoted by identifier x. A statement

\[\text{send y at Q}\]

means that the value denoted by identifier y is transmitted at output port Q. Execution of a receive statement cannot be completed until an input packet is available at the named input port. Likewise execution of a send statement is not completed until the unit connected to the output port is prepared to accept a packet. It is also assumed that proper arbitration is performed when send statements having the same output port are executed concurrently.
Figure 12. Structure of an Instruction Cell.
/* Process control packets */
var cntlpkt: cntl-pkt-r;
do forever
begin
    receive cntlpkt at cntl-pkt-in;
case cntlpkt.type of
    ACK: send signal at ack-sig-out;
    BOOL: send [rtype: BOOL, value: cntlpkt.value]
         at rec-pkt-out(cntlpkt.rec);
end case
end

/* Process data packets */
var datapkt: data-pkt-r;
do forever
begin
    receive datapkt at data-pkt-in;
send [rtype: datapkt.dtype, value: datapkt.value]
    at rec-pkt-out [datapkt.rec];
end

Figure 13. Behavior description of Cell-Input-Cntl.
\begin{algorithm}
\begin{verbatim}
/* variables set at instruction load time of */
var receiver-type: (NULL, BOOL, INT, CPLX);
receiver-mode: (constant, variable);
value: operand; /* set for constant node only */
/* others */
rec-pkt: receiver-pkt;

case receiver-mode of
/* Receiver in constant node of */
constant:
  case receiver-type of
    NULL: do forever
       begin send nil at contents end;
    INT, CPLX:
       do forever
       begin send value at contents end;
    otherwise: error;
  end case;
/* Receiver in variable node of */
variable:
  if receiver-type = NULL then do forever
     begin send nil at contents end;
  else begin
     receive rec-pkt at data;
     if rec-pkt.type <> receiver-type then error;
     else send rec-pkt.value at contents;
   end;

otherwise: error;
end case
\end{verbatim}
\end{algorithm}

Figure 14. Behavior description of a Receiver module.
/* variables initialized at instruction load time of */
var ack-expected, ack-received: integer;

/* count acknowledge packets of */
do forever
begin
if ack-received = ack-expected
then begin signal ack-complete; ack-received := 0 end
else begin receive signal at ack-signal;
ack-received := ack-received + 1 end;
end

/* variables initialized at instruction load time of */
var instr: instruction;
/* others of */
operand-array: array [1..3] of operand;
opn-pkt: operation-pkt;
/* Assemble operation packet of */
do forever
begin
receive operand-array [1] at operand-in [1];
receive operand-array [2] at operand-in [2];
receive operand-array [3] at operand-in [3];
opn-pkt := [inst: instr, opd: operand-array];
on ack-complete send opn-pkt at opn-pkt-out end

Figure 15. Behavioral Description of the Cell-Output_Cntl unit
combination of type and mode is summarized in Table I. If the instruction held by an Instruction Cell requires fewer than three operands, then one or more of the Receivers are set to type NULL; in this case the Receiver is always enabled and delivers the value nil repeatedly, since Cell-Output-Cntl requires an operand packet from each Receiver to form each operation packet.

The instruction held by an Instruction Cell is represented by the value of instr in Cell-Output-Cntl (Figure 15). The number of acknowledge packets required to enable the Instruction cell and the number of acknowledge packets received since the previous firing of the Instruction Cell are stored in ack-expected and ack-received. The upper box of the Cell-Output-Cntl module waits for arrival of the expected number of acknowledge packets; it then resets its count and transmits the signal ack-complete. The lower box waits for this signal, and then transmits an operation packet containing one operand value (possibly nil) from each of the three receivers.

The variables instr, ack-expected and ack-received in Cell-Output-Cntl and receiver-type, receiver-mode and value in each Receiver must be initialized with values derived from a machine language program for its proper execution on the data flow processor. The corresponding "loading" mechanisms for setting these values will not be discussed in this paper.

Table 2 presents the instruction types which will be used to code the FFT program for the data flow processor. These instructions are defined so several actors in a data flow graph may be encoded by a single instruction. Although these instructions have been chosen to illustrate the performance achievable in the FFT computation, they are nevertheless representative of the sort of instructions that might be included in a complete instruction code. For each type of instruction, the letter (M, A, D, I or C) under each instruction name indicates the Processing Unit that executes instructions of that type. The format given in the table specifies the type of each Receiver and whether a value is required from it. The format also indicates the kinds of destinations that make sense for the instruction. This requires a bit more explanation. Destinations have the form
<table>
<thead>
<tr>
<th>Symbol</th>
<th>Receiver Type</th>
<th>Receiver Mode</th>
<th>Packet(s) Required</th>
<th>Enabling Condition</th>
</tr>
</thead>
<tbody>
<tr>
<td>N</td>
<td>NULL</td>
<td>none</td>
<td>BOOL</td>
<td>always enabled</td>
</tr>
<tr>
<td>B</td>
<td>BOOL</td>
<td>variable</td>
<td>BOOL</td>
<td>receipt of a boolean packet</td>
</tr>
<tr>
<td>I</td>
<td>INT</td>
<td>variable</td>
<td>INT</td>
<td>receipt of an integer packet</td>
</tr>
<tr>
<td>C</td>
<td>CPLX</td>
<td>variable</td>
<td>CPLX</td>
<td>receipt of a complex packet</td>
</tr>
<tr>
<td>Ic</td>
<td>INT</td>
<td>constant</td>
<td>none</td>
<td>always enabled</td>
</tr>
<tr>
<td>Cc</td>
<td>CPLX</td>
<td>constant</td>
<td>none</td>
<td>always enabled</td>
</tr>
<tr>
<td>Name</td>
<td>Format</td>
<td>Results</td>
<td>Name</td>
<td>Format</td>
</tr>
<tr>
<td>-----------------------</td>
<td>------------</td>
<td>--------------------------</td>
<td>-----------------------</td>
<td>------------</td>
</tr>
<tr>
<td>Integer Add (I)</td>
<td>I [ p ]</td>
<td>(p + q)</td>
<td>Complex Switch (C)</td>
<td>C [ z ]</td>
</tr>
<tr>
<td></td>
<td>I [ q ]</td>
<td>a</td>
<td></td>
<td>B [ b ]</td>
</tr>
<tr>
<td></td>
<td>N</td>
<td></td>
<td></td>
<td>N</td>
</tr>
<tr>
<td>Integer Subtract (I)</td>
<td>I [ p ]</td>
<td>(p - q)</td>
<td>Complex Add</td>
<td>C [ x ]</td>
</tr>
<tr>
<td></td>
<td>I [ q ]</td>
<td>a</td>
<td>and Switch (A)</td>
<td>C [ y ]</td>
</tr>
<tr>
<td></td>
<td>N</td>
<td></td>
<td></td>
<td>B [ b ]</td>
</tr>
<tr>
<td>Integer Distribute (D)</td>
<td>I [ r ]</td>
<td>[r]</td>
<td>Complex Subtract</td>
<td>C [ x ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td>a</td>
<td>and Switch (A)</td>
<td>C [ y ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>B [ b ]</td>
</tr>
<tr>
<td>Complex Multiply (M)</td>
<td>C [ x ]</td>
<td>[x x y]</td>
<td>Integer Compare (I)</td>
<td>I [ p ]</td>
</tr>
<tr>
<td></td>
<td>C [ y ]</td>
<td></td>
<td></td>
<td>I [ q ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>N</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Complex Distribute (D)</td>
<td>C [ z ]</td>
<td>(z)</td>
<td>Bit Test (I)</td>
<td>I [ p ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td>a</td>
<td></td>
<td>I [ q ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>N</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Boolean Distribute (D)</td>
<td>B [ b ]</td>
<td>(b)</td>
<td>Integer Switch (C)</td>
<td>I [ p ]</td>
</tr>
<tr>
<td></td>
<td></td>
<td>a</td>
<td></td>
<td>B [ b ]</td>
</tr>
</tbody>
</table>

Receiver types: N-NULL; B-BOOL; I-INT; C-CPLX
Values: p,q,r-integer; x,y,z-complex; b-boolean

Table 2. Instruction types.
type destination = record
    used: boolean;
    send-ack-signal: boolean;
    switch: {null, boolean};
    addr: address
end;

and are used by Processing Units to determine the control and data packets they generate. The
used field is set to false if this destination is not needed in the program. If an acknowledge packet
is to be sent to the Instruction Cell specified by a destination address, then its send-ack-signal field
is set to true. Otherwise a result packet generated according to the instruction type is sent.

An instruction such as i-add that represents a data flow operator, or such as c-dist that
provides necessary fan-out, would typically use destinations for both purposes. The switch fields of
destinations, while ignored in the execution of these instructions, are used in the several switch
instructions provided. A switch instruction is convenient for coding the commonly occurring
program graph fragment shown in Figure 16. For these instructions, the switch field of a
destination indicates whether a result packet should be sent to the Instruction Cell specified by the
destination address in the event of a false outcome (F), a true outcome (T), or both (nil). In our
figures, destination arcs of switch instructions with non-null switch fields are always labelled with
the corresponding boolean values (Figure 16).

The memory space required in an Instruction Cell depends on the number and type of
operands and the number of destinations used, and the possibilities permitted by our specification
of the Instruction Cell span a wide range. Clearly, the instructions for complex arithmetic are the
most demanding of memory for operands, so the number of used destinations should be limited. In
our programs, we have permitted instructions to have up to five destinations. The manner in
which operands and destination addresses are efficiently coded in Instruction Cells and in operation
packets is a matter that would be dealt with in the detailed design of a complete instruction code.
Figure 16. The switch instruction.
Processing Unit Operation

To illustrate the operation of the processing units, we consider the Int-Processor unit and the Cntl-Processor unit in more detail. Operation of the remaining units is similar.

As shown in Fig. 17, the Int-Processor unit consists of an Int-ALU module and an Int-Cntl module. The Int-Cntl submodule receives operation packets of the form

\[
\text{inst: [opcode: (i-add, i-sub, bit, less), dest: array[1..5] of destination],}
\text{ opd: array[1: integer, 2: integer, 3: null]}\]

On receipt of an operation packet, Int-Cntl determines whether addition, subtraction, bit test or comparison is required and sends a command-pkt of the form

\[
\text{op: (i-add, i-sub, bit, less), opi, op2: integer}\]

to Int-ALU to request that the appropriate operation be performed. Upon receiving from Int-ALU a result-pkt of the form

\[
\text{rtype: (BOOL, INT), value: [boolean, integer]}\]

Int-Cntl constructs control and data packets for transmission through the Control and Distribution Networks according to the destination fields of the operation packet.

Behavioral descriptions of the integer processor control unit and of the integer arithmetic logical unit are given in Figs. 18 and 19. Both descriptions are straightforward and should be easily understood.

The Int-Cntl module consists of two parts which are activated alternately. The first part waits for an operation packet to arrive; then it fetches the operands from the packet, determines if the operation code is either of the four allowed values, and dispatches to the Int-ALU module the two integer operands and a scalar value indicating whether addition, subtraction, bit test or integer
Figure 17. Structure of the Int-Processor processing unit.
var dest-arr: array [1..5] of destination;
  opd1, opd2: integer;
  op: (i-add, i-sub, bit, less);
  op-pkt: operation-pkt;
  result: result-pkt;

do forever
begin
  /* Process operation packets o/*
  receive op-pkt at arbnet-in:
    typecase op-pkt.opd[1] of
      integer: opd1 := op-pkt.opd[1];
      otherwise: error;
    end case;
    typecase op-pkt.opd[2] of
      integer opd2 := op-pkt.opd[2];
      otherwise: error;
    end case
  op := op-pkt.inst.opcode;
  dest-arr := op-pkt.inst.dest;
  send [opn:op, op1:opd1, op2:opd2] at cmnd-out;
  /* Transmit data and control packets o/*
  receive result at result-in;
  for i = 1..5 do
  begin
    d := dest-arr[i];
    if d.used then
      case d.send-ack-signal of
        true: send [ctype: 'ACK', value: nil, addr: d.addr] at cntlnet-out;
        false: case result.type of
          BOOL: send [ctype: BOOL, value: result.value, addr: d.addr] at cntlnet-out;
          INT: send [dtype: INT, value: result.value, addr: d.addr] at distnet-out;
          otherwise: error;
        end case
      end case
    end
  end

Figure 18. Behavioral description of the integer processor control unit.
var cmd: command-pkt;
    op: (i-add, i-sub, bit, less);
    rl, r2: integer;
    rtype: (BOOL, INT);
    value: {boolean, integer};
do forever
begin
receive cmd at command;
    op := cmd.opn;
    rl := cmd.op1;
    r2 := cmd.op2;
case op of
    i-add: begin value := rl + r2;
               rtype := INT end;
    i-sub: begin value := rl - r2;
               rtype := INT end;
    bit: begin value := odd (rl rshift r2); /\ bit test on r2th bit of rl o/
                 rtype := BOOL end;
    less: begin value := rl < r2;
               rtype := BOOL end;
otherwise: error;
end case;
send {rtype, rtype, value, value} at result
end

Figure 19. Behavioral description of the Int-ALU unit.
var op-pkt: operation-pkt;
dest-array: array[1..5] of destination;
d: destination
sendp: array[1..5] of boolean,
certype: (INT, CPLX);
do forever
begin
receive op-pkt at arnet-in;
dest-array := op-pkt.inst-dest;
/* determine conditions for sending result packets of*/
for i = 1..5 do
begin
  d := dest-array[i];
  typecase d.switch of
    null: sendp[i] := d.used;
    boolean: sendp[i] := d.used and
      (op-pkt.opd[2] eq d.switch)
    otherwise: error
  end case
end /* use conditions set up above to send result packets of*/
for i = 1..5 do
if sendp[i] then
  begin
    d := dest-array[i];
    if d.send-ack-signal then
      send [ctype: ACK, value: nil, address: d.addr]
      at cntlnet-out;
    else begin
      typecase op-pkt.opd[i] of
        complex: ctype := CPLX;
        integer: ctype := INT;
        otherwise: error
      end case
      send [ctype: ctype, value: op-pkt.opd[i], address: d.addr]
      at distnet-out;
    end
  end
end cntlnet-out distnet-out
cnt1-pkt-c --- data-pkt-c

Figure 20. Behavioral description of the Cnt1-Processor.
comparison is required. The components of the destination array of the operation packet specify what is done with the result value when it is returned by Int-ALU. For each destination, Int-Cntl sends an acknowledge packet, a boolean packet, a data packet, or nothing if the destination is marked as unused.

Use of the switch field is illustrated in the operation of the Cntl-Processor unit (Figure 20). For each operation packet received, the Cntl-Processor unit first of all decides which Instruction Cells should receive result packets. Each such cell is addressed by a destination marked as used; furthermore, if the switch field of the destination is set to either true or false, its value must match that of the boolean operand carried along in the operation packet. To each such cell the Cntl-Processor unit sends a result packet, just like the Int-Cntl module.

V. The Fast Fourier Transform Program

In this section we develop a complete machine level program for the FFT algorithm expressed as a data flow program graph in Figure 7. Groups of nodes in the program graph are encoded into machine instructions. For execution on the data flow processor, these machine instructions are loaded into Instruction Cells and linked together through cell identifiers stored in the destination fields of these instructions.

To illustrate the general procedure, let us consider the pair of data flow operators shown in Fig. 21a. First the program graph is partitioned (as in Fig. 21b, for example), where it is intended that each block be able to compute concurrently with others in pipeline fashion. Then the program graph is encoded into Instruction Cells using acknowledge signals so the machine level program correctly simulates the program graph firing rules. Partitioning of this program segment and the corresponding machine instruction encoding are shown in Figs. 21b and 21c, respectively. Each execution of the instructions in Cell-c delivers a data packet to Cell-d. Each execution of the instruction in Cell-d returns an acknowledge packet to Cell-c, and the re-enabling of Cell-c is
Figure 21. Machine level representation of program graphs.
predicated upon the receipt of this acknowledge packet (c.f. Instruction Cell Operation, Section IV). According to the semantics of data flow graphs under the firing rule (Section II), data link c encoded in Cell-c cannot fire until its output arc is empty. This condition is signalled during the execution of a machine language program by sending an acknowledge packet from Cell-d to Cell-c. To indicate the necessary synchronization, the pair of numbers within the ellipse in an Instruction Cell (Fig. 2lc) specifies the number of acknowledge signals required to enable the cell, ae, and the number of signals presumed to have been received in the specified configuration, ar. These variables are initialized to their proper values for each instruction in a machine language program. It should be noted that when a machine instruction encodes several actors and their output links, each execution of the machine instruction corresponds to the firing of the actors followed immediately by the firing of the output links. Hence such a machine instruction must not be enabled until the instruction cells implementing the output arcs of these data links are all free to receive the next set of data.

A detailed discussion on the deadlock problems that may arise in a data flow processor supporting iteration if acknowledgements are not provided in a machine language program can be found in [17].

The main body of the FFT program graph (Fig. 7) is an iteration construct. To facilitate the subsequent presentation, an example of an iteration construct and its machine level implementation are shown in Fig. 22. Initially Cell-in (Fig. 22) is enabled upon receiving a data packet. Each time the predicate p in Cell-pred evaluates to true, execution of the instruction in Cell-cntI delivers a data packet to Cell-l1 and an acknowledge packet to Cell-l2, allowing a new iteration. If p evaluates to false, the output of the iteration construct is sent to Cell-out and an acknowledge packet is sent to Cell-in, allowing the iteration construct to be re-entered. Cell-cntI can be re-enabled upon receiving an acknowledgement from either Cell-l1 or Cell-out. No
Figure 22. Machine level representation of an iteration construct.
acknowledgement is needed from Cell-pred to Cell-in and Cell-\text{l}. The reader should convince himself that the cell configuration in Fig. 22 implements an iteration construct correctly, according to the firing rule for program graphs.

The data flow program graph for the FFT algorithm (Fig. 7) consists of five major sections: Phase Constant Generation, Loop Control, Phase Factor Generation, Butterfly and Distribution Trees. We consider each section in turn.

**Phase Constant Generation**

The Phase Constant Queue section of the FFT program graph (reproduced in Fig. 23) might be partitioned into three blocks (Fig. 23a) for machine level encoding. This yields the instruction cell configuration in Fig. 23b. However, we have already noted that this partition should not be used naively to derive the machine language representation in Fig. 23b. The difficulty is that executing any instruction in Fig. 23b results in delivering a data packet to an occupied cell receiver, in violation of the firing rule. To avoid this problem, we use the partition in Fig. 23c, after adding an identity operator, to derive the machine language representation in Fig. 23d.

**Loop Control**

The data flow graph of the Loop Control section is reproduced in Fig. 24. Partitioning and implementation of this program graph (Figs. 24a and b) follows closely the strategy illustrated in Fig. 22. Note that each execution of the instruction held in Cell-cntl delivers two acknowledge packets back to Cell-cntl, so that the enabling condition of Cell-cntl does not depend on the decision

---

3. To avoid clustering up the figures, we have introduced a new arc type \[ \rightarrow \] to denote a pair of arcs delivering a data packet in the forward direction and an acknowledge packet in the opposite direction.
(a) phase constant queue with partitioning.

(b) machine language program from (a)

(c) adding an identity operator

(d) machine language program from (c)

Figure 23. Machine level representation of the phase constant queue.
(a) partitioning the program graph

(b) machine level representation

Figure 24. Machine level representation of the loop control section.
Figure 25. An optimized machine level implementation of the loop control section.
outcome (a boolean operand) it receives from Cell-pred. A slightly optimized implementation of the Loop Control section is given in Fig. 25, obtained by applying the following transformations to the machine level representation of Fig. 24b:

i. Whenever execution of the i-sw instruction held in Cell-cntl delivers a data packet to Cell-l₂, an acknowledge packet is also delivered to Cell-l₂. The acknowledgement can be omitted and the enabling condition of Cell-l₂ adjusted accordingly. Similarly acknowledgement from Cell-l₂ to Cell-cntl can be omitted.

ii. Each time Cell-l₁ receives an acknowledgement, it will send one in turn to Cell-cntl. The latter can be eliminated by sending the acknowledgement intended for Cell-l₁ to Cell-cntl directly (Fig. 25).

iii. Every execution of the instruction in Cell-cntl leads indirectly, either through Cell-in or Cell-l₁, to delivery of a data packet to Cell-pred. Acknowledgement from Cell-cntl to Cell-pred can be omitted and the enabling condition of Cell-pred adjusted accordingly.

These optimizing transformations illustrate the opportunities for reducing the number of acknowledgements needed to implement the firing rule correctly.

Phase Factor Generation

For each iteration of the FFT algorithm, each butterfly section receives a phase factor generated by the program graph shown in Fig. 26. To derive the machine level implementation, we partition the program graph as in Fig. 26a. The corresponding machine language program is given in Fig. 26b. Noting that several acknowledgements have been eliminated through optimization, the reader should again convince himself that this machine level program correctly implements the program graph.

Butterfly Section

Derivation of a machine level program from the program graph of a Butterfly section is
(a) partitioning the program graph for phase factor generation

Figure 26. Machine level representation of phase factor generation.
(b) machine level implementation
Figure 27. Machine level representation of the butterfly section.
(b) machine level implementation
Figure 28. One unit of a distribution tree and its implementation.
illustrated in Fig. 27. Generating the cell configuration for performing the required arithmetic is straightforward. Each butterfly section also contains two control cells to receive either a pair of inputs (through $x_{rev(2q), x_{rev(2q+1)}}$) to initiate a new FFT computation, or a pair of intermediate results (through $u_{2q, u_{2q+1}}$) from the previous stage of the current FFT computation.

**Distribution Trees**

For each stage of the FFT computation, the values $b_p, n_p$ and $w_p$ must be distributed to the Phase Factor Generation sections and the Butterfly sections. This may be done by three Distribution Trees made up of units as shown in Fig. 28. Use of these units to construct Distribution Trees allows enough skew in the execution of the Phase Factor Generation and Butterfly sections that the computation for successive stages of the FFT may overlap substantially.

**VI. Routing Network Structure and Performance**

The Arbitration, Distribution and Control Networks of the data flow processor are examples of routing networks that perform the function of directing packets to one of several or many physical units of the processor. If the parallelism represented in the data flow form of algorithms such as the FFT is to be exploited by the kind of machine we have described, these routing networks must be structured so they can handle many packets concurrently. If a high degree of parallelism is supported, then it is not crucial that each unit acts in the fastest possible time on information it receives to achieve balanced utilization of sections of the machine.

A structure for the Arbitration Network is shown in Fig. 29a. It is built of arbitration units, switch units and buffer units. Each arbitration unit passes packets arriving at its input ports one-at-a-time to its output port, using a round-robin discipline to resolve any ambiguity about which packet should be sent next. A switch unit assigns a packet at its input to one of its output ports according to some property of the packet, the operation code in the case of the Arbitration
(a) Arbitration Network

(b) Distribution Network

arb: arbitration unit;  sw: switch unit;  buf: buffer unit

Figure 29. Structure of the Arbitration and Distribution Networks.
Network. A buffer unit stores a packet until the succeeding switch or arbitration unit is ready to accept it. The network shown has three stages: stages 1 and 2 have arbitration units that funnel operation packets from many Instruction Cells into a smaller number of more heavily utilized channels; the switch units in stage 2 split the streams of operation packets into separate streams for each Processing Unit; and the output streams of the switch units are merged by stage 3 into a single stream for each Processing Unit.

The Distribution Network (Fig. 29b) is structured in a similar manner and provides a path for data packets from each Processing Unit to each Instruction Cell. Switch units direct each data packet toward the appropriate Cell by examining bits of the cell identifier in the packet's destination address. A few arbitration units are needed in the Distribution Network to provide for merging the flow of data packets from different Processing Units to the same group of Instruction Cells.

Since the Arbitration Network has many inputs, a serial format is appropriate for packet transfer between Instruction Cells and the Arbitration Network to reduce the number of connections needed. However, to achieve a high rate of packet flow at the output ports, a parallel format is required. For this reason, serial-to-parallel is done within the buffer units as a packet travels through the Arbitration Network. Parallel-to-serial conversion is performed in the Distribution Network for similar reasons.

The structure of the Control Network is similar to that of the Distribution Network. However, the packets passing through the Control Network convey either simple boolean values or acknowledge signals, and parallel-to-serial conversion of packets is not required; thus the Control Network is composed of only switch units and arbitration units.

We now turn to an analysis of the performance achievable by the data flow processor in performing the FFT computation using the programs of Figs. 23, 25, 26, 27, and 28.
Computation by the data flow processor will be at the maximum rate permitted by the three routing networks (Arbitration, Distribution and Control) provided the capacities of the Processing Units are not exceeded and provided sufficiently many Instruction Cells are enabled to keep the Arbitration Network supplied with operation packets. Two factors control the number of enabled instructions: delays in the passage of packets through the routing networks and constraints on the enabling of instructions imposed by the structure of the machine level program.

Our plan of analysis is as follows: First we determine the maximum computation rate permitted by the Processing Units for the FFT program. Then we consider suitable structures for routing networks able to support this computation rate, and compute the minimum packet transit time for each network. Finally, we show that the processing rate assumed is consistent with the sequencing constraints embodied in the machine level FFT program.

Table 3 gives the number of operation packets that must be processed during one stage of the 1024-point FFT computation, and the number of data and control packets that must be distributed. For determining computation rate, we shall use as a basic measure the rate at which a Processing Unit can receive bytes at a port. Let this rate be 5 MHz, corresponding to use of a medium speed logic family. Since serial-to-parallel conversion is carried out in the Arbitration Network, this is also the maximum rate at which a Processing Unit receives operation packets. We suppose that operation packets can indeed be received every 200 nanoseconds by the Distributor, Int-Processor and Cntl-Processor units, and at half this rate by the two complex arithmetic Processing Units. As shown in Table 4, the Processing Units are able to support execution of one stage of the 1024-point FFT every 512.2 microseconds where this limit on computation rate is set by the speed of the Cntl-Processor unit.

For the 1024-point FFT, 7349 Instruction Cells are required to hold the machine level data flow program, so let us hypothesize a processor having $8192 = 2^{13}$ Instruction Cells. We suppose the
<table>
<thead>
<tr>
<th></th>
<th>Instruction Cells</th>
<th>Multiplier</th>
<th>Adder</th>
<th>Distributor</th>
<th>Int-Processor</th>
<th>Cntl-Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>O     D     C</td>
<td>O     D     C</td>
<td>O     D     C</td>
<td>O     D     C</td>
<td>O     D     C</td>
<td>O     D     C</td>
</tr>
<tr>
<td>Butterfly Units</td>
<td>3072</td>
<td>512 1024 1024</td>
<td>1024 1024 2048</td>
<td>512 1024 512</td>
<td>-- -- --</td>
<td>1024 1024 2048</td>
</tr>
<tr>
<td>Phase Factor</td>
<td>3584</td>
<td>256 256 --</td>
<td>-- -- --</td>
<td>512 1024 512</td>
<td>512 0 1536</td>
<td>1536 1280 2048</td>
</tr>
<tr>
<td>Generation</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Distribution Trees</td>
<td>684</td>
<td>-- -- --</td>
<td>-- -- --</td>
<td>684 1368 2052</td>
<td>-- -- --</td>
<td>-- -- --</td>
</tr>
<tr>
<td>Loop Control</td>
<td>5</td>
<td>-- -- --</td>
<td>-- -- --</td>
<td>-- -- --</td>
<td>3 3 2</td>
<td>1 2 0</td>
</tr>
<tr>
<td>Phase Constant</td>
<td>4</td>
<td>-- -- --</td>
<td>-- -- --</td>
<td>4 5 4</td>
<td>-- -- --</td>
<td>-- -- --</td>
</tr>
<tr>
<td>Queue</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Totals</td>
<td>7349</td>
<td>768 1280 1024</td>
<td>1024 1024 2048</td>
<td>1712 3421 3080</td>
<td>515 3 1538</td>
<td>2561 2306 4096</td>
</tr>
</tbody>
</table>

O - operation packets; D - data packets; C - control packets

Table 3. Packet counts for one stage of the FFT computation.
<table>
<thead>
<tr>
<th>Processing Unit</th>
<th>Time per Packet</th>
<th>Operation Packets Per Stage</th>
<th>Period for One Stage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multiplier</td>
<td>400 ns</td>
<td>768</td>
<td>307.2 microsec.</td>
</tr>
<tr>
<td>Adder</td>
<td>400 ns</td>
<td>1024</td>
<td>409.6 microsec.</td>
</tr>
<tr>
<td>Distributor</td>
<td>200 ns</td>
<td>1712</td>
<td>342.4 microsec.</td>
</tr>
<tr>
<td>Int-Processor</td>
<td>200 ns</td>
<td>515</td>
<td>103.0 microsec.</td>
</tr>
<tr>
<td>Cntl-Processor</td>
<td>200 ns</td>
<td>2561</td>
<td>512.2 microsec.</td>
</tr>
</tbody>
</table>

**Table 5. Design Parameters for an Arbitration Network**

<table>
<thead>
<tr>
<th>Stage</th>
<th>fan-in</th>
<th>fan-out</th>
<th>format</th>
<th>arbitration units</th>
<th>flow rate</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$P_i$</td>
<td>$q_i$</td>
<td>$s_i \times t_i$</td>
<td>$n_i$</td>
<td>$R_i = n_i/(s_i \times T)$</td>
</tr>
<tr>
<td>Stage 1</td>
<td>8</td>
<td>1</td>
<td>$48 \times 3$</td>
<td>1024</td>
<td>106.6 MHz</td>
</tr>
<tr>
<td>Stage 2</td>
<td>8</td>
<td>1</td>
<td>$12 \times 12$</td>
<td>128</td>
<td>53.3 MHz</td>
</tr>
<tr>
<td>Stage 3</td>
<td>8</td>
<td>1</td>
<td>$3 \times 48$</td>
<td>16</td>
<td>26.6 MHz</td>
</tr>
<tr>
<td>Stage 4</td>
<td>4</td>
<td>4</td>
<td>$1 \times 144$</td>
<td>4</td>
<td>20 MHz</td>
</tr>
<tr>
<td>Stage 5</td>
<td>4</td>
<td>5/4</td>
<td>$1 \times 144$</td>
<td>4</td>
<td>20 MHz</td>
</tr>
</tbody>
</table>
Arbitration Network has the structure specified in Table 5. Each stage consists of a rank of arbitration units having fan-in $p_i$, a rank of buffer units, and rank of switch units having fan-out $q_i$. The number of arbitration or switch units in stage $i$ is $n_i$ and these numbers satisfy the relation

$$n_i q_i = n_{i+1} p_{i+1}$$

$i = 1, \ldots, 4$

which expresses the condition that the number of output links from stage $i$ must equal the number of input links to stage $i+1$. In stage $i$, the input packets are represented by $s_i$ bytes of $t_i$ bits each. The formats specified in the table assume that operation packets are $144$ bits in length, and that serial-to-parallel conversions are done by the buffers in stages 1, 2 and 3. The packet flow rate $R_i$ for each stage is computed as the number of channels (one per arbitration unit) divided by the time required to transmit the bytes of a packet serially; the time $T$ for transmitting one byte is assumed to be $200$ nsec.

The design parameters of this Arbitration Network have been chosen so that the full $20$ MHz capacity of the Processing Section can be met. The early stages of the network have generous capacity to accommodate shifts of activity among the Instruction Cells during the running of a computation. For this network, the minimum transmit time is

$$T_A = \sum_{i=1}^{5} s_i T = (48+12+3+1+1)T = 13 \text{ microseconds}$$

Let us suppose that the Control Network and the Distribution Network of our hypothetical processor are similarly designed to accommodate the flows of data packets and control packets indicated in Table 3. It is reasonable to assume approximately equal transit times for the Distribution Network and the Arbitration Network, and about one fifth of this time for the Control Network as it is much simpler than the others:

$$T_D = 13 \text{ microseconds} \quad \text{-- Distribution Network delay}$$
$T_C = 3$ microseconds -- Control Network delay

Now we may ask whether these transit times and the computation rates of the Processing Units are consistent with the sequencing constraints imposed by the FFT program. To answer this question, we must determine how the rate of executing the sequence of $n$ stages of the FFT computation is limited by the structure of the machine level program. We consider the Phase Factor Generation and Butterfly sections of the program since these make up the major portion of the iterative computation. The following analysis is readily extended to include the iterative Loop Control section and the distribution trees without effect on the conclusion. For computing the period of computation we may assume that the boolean values arriving at input $b_p$ are all true so the switch instructions in cells Gen-cntl, But-cnt1-1, and But-cnt1-2 always transmit results and signals to their T-destinations and never to their F-destinations. Thus cell Gen-1 does not participate in the periodic behavior we wish to analyze.

We shall make two simplifying assumptions of a conservative nature: The first is to suppose that each Butterfly Unit sends its results to itself rather than to other Butterfly Units. This assumption is justified by the symmetry of the interconnection pattern of the Butterfly Units. We further assume that the boolean output of cell Gen-2 is always true since this choice invokes execution of the multiplication in cell Gen-5, and will yield the worst case period for the computation.

With these assumptions, the cyclic execution of the FFT program may be accurately represented by a special kind of Petri net known as a marked graph [12]. In Fig. 30 each node of the marked graph corresponds to an Instruction Cell participating in the cyclic computation or to a source of input values from the Loop Control Section. Each directed arc represents a data path between cells specified by one destination of an instruction. The arrowheads of the arcs indicate the type of the packets -- data, boolean, or signal -- that flow over the corresponding path. Tokens
Figure 30. Marked graph description of the FPI computation.
are placed on arcs of the marked graph to represent a live and safe initial configuration of the data flow program. Signal values are denoted by a and data tokens carry unknown values indicated by ( ).

We regard each arc of the marked graph as having an associated "propagation delay" which is the time interval from the moment a token is placed on the arc by its origin node to the moment presence of the token is observed by the destination node. For data arcs we take this time to be $T_A + T_D + T_P$, the time for an operation packet to pass through the Arbitration Network plus the time for the data packet resulting from instruction execution to pass through the Distribution Network plus the time $T_P$ for processing an operation packet by a Processing Unit. Similarly, the propagation delay for boolean and signal arcs is $T_A + T_C + T_P$.

Now our question of computation rate for the FFT program becomes a question about the minimum period for the cyclic behavior of a marked graph when each arc has a known propagation time. This problem was solved by Karp and Miller [21]: The minimum period is determined by the directed cycle in the marked graph having the largest value of total delay divided by the number of tokens on the cycle. Assuming

$$T_A = 13, \quad T_D = 13, \quad T_C = 3, \quad T_P = 4$$

we find there is one critical cycle in the program -- one involving cells Gen-6, Gen-cntl, Gen-3 and Gen-5. This cycle has one token and a total delay of $4 \times (T_A + T_P + T_D) = 120$ microseconds. Evidently, the sequencing constraints present in our FFT program are not a significant factor in determining the performance of the data flow processor. Indeed, its performance could be improved substantially by raising the number or performance of the Processing Units, and increasing the capacity of the routing networks.
VIII. Conclusion

The prospective performance of our hypothetical data flow processor is attractive in comparison with conventional stored program computers. However, whether such a machine is practical depends on the feasibility and cost of its construction. In the interest of obtaining a more practical design, it is attractive to divide the Instruction Cells into groups of eight or more cells and to implement each group together with associated portions of the Arbitration, Distribution and Control Networks by a combination of RAM chips and common control logic.

A limitation of the present data flow processor is that one Instruction Cell is required for each instruction, imposing a practical limit on the size of programs that may be run. This limitation may be overcome by including an auxiliary memory system that has space for all instructions of the data flow program and using a smaller number of Instruction Cells arranged to hold the most active instructions during program execution. The Instruction Cells then form a "cache" whose contents changes as activity shifts from one section of the program to another. An outline of the mechanisms required for this extension of the data flow architecture has been given in [17], and some ideas on the structure of memory systems for a data flow computer are given in [1, 2, 15]. These papers are in harmony with the principles of packet communication architecture.

We are also extending the generality of data flow architectural concepts by developing mechanisms to support procedure definition and invocation, and data structures. Data flow program graphs as described by Dennis [14], among others, encompass procedures and a general data structure capability, so the major problems concern development of satisfactory architectural schemes for implementing these capabilities. Issues in the design of multilevel memory systems for data flow computers have been studied by Ackerman [1, 2]; Rumbaugh's machine [30] implements procedures within a more conventional framework. Recently an extension of the processor design of the present paper to handle procedures and data structures with a high level of generality has been
presented in the work of Weng [18, 37]. Other approaches are being developed by Arvind, Gostelow and Plouffe [7], and by Keller, Patil and Lindstrom [23].
REFERENCES


[27] Peterson, J. Petri nets. Submitted for publication.


