### MASSACHUSETTS INSTITUTE OF TECHNOLOGY

Project MAC

Company Company



## A Note on Asynchronous Parallel Processing\*

#### H. Witsenhausen

maskronous parrallel paralling within a given job in an essentially sulfi-processor computer is considered in a limited framework.

- The scope of an element of data must be defined in two dimensions .
- to the operations performed at a fork.
- the operations that replace the notion of join.
- The impact of look-ahead.
- that loss of time due to lock-out is not significant.
- What a programmers specification of parallelism could be like.
- Forme hardware features that may be helpful.

<sup>• 10</sup> a slightly edited version of a term paper submitted for subject in January 1964. - J.B.D.

# CONTENTS

| 8 | of Parallel Processing                                                                                                                                                                                                                                                                                                                                                                                                                            | Page | 3                                                 |
|---|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|---------------------------------------------------|
|   |                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      | <b>4</b> 4                                        |
|   | Monosequence Execution                                                                                                                                                                                                                                                                                                                                                                                                                            |      | 5                                                 |
| 4 | General Experson Structure Nata Equivalence Exogram Description Special Cases Allocation                                                                                                                                                                                                                                                                                                                                                          |      | 37743                                             |
|   | Processing                                                                                                                                                                                                                                                                                                                                                                                                                                        |      | 9                                                 |
|   | Conditions for Parallel Operation Public and Private Data Forks  Sement Definition Place Diagrams  Laic Example Rardware Control of Data Ownership Piliation of Data Ownership Programmers Description Serix Multiplication Example Loops of the "While" Type Locked Branches for Scope Termination Publication, Grabbing and Lock-Out of Data Data Dating-in of subprograms Execution Trees Look-Ahead Sharing Processing Capability Application |      | 9 11 12 13 14 17 18 80 21 22 24 25 25 27 27 31 31 |

#### Types of Parallel Processing

The distinction between sequential, single processor and parallel,

- a) Parallel treatment of several bits and registers within a single
- b) Local concurrency in the sense of Codd, which is found in the \$\cdot\$6500 and in the Stretch look-ahead scheme does not raise the same type \$\cdot\$700 problems as the operation we consider below. (No programmers specification)
- c) Synchronous parallel execution, as found in the Solomon computer, similar to conventional operation in that processors requested are assumed that available and the detailed relative timing of all operations is completely planned. Logically, these are single processor computers with a distributed beessor.
- d) We shy away from the difficult and very important case where some retions (inequalities) between execution times of some program sequences by recessors are known in advance. In such a case a "fork" does not necessarily life a "joint." The processor executing the shorter prong returns unconditionally to the idle processor list while the other processor unconditionally rings the program, secure in the knowledge that the parallel operations been completed in due time. As opposed to this we assume that no upper it can be predicted for the time of execution of any program sequence, for tance because any processor may be pulled out from under us by a higher tity interrupt from some other program.
- e) Parallel operation of processors assigned to different jobs, on lead to the same problems (some analogies occur in case tell upon common subroutines). It is essentially a scheduling problem which tempts to satisfy optimization criteria instead of being obligated to satisfy estraints.

Summarizing, we distinguish

- 1. conventional single processor computers
- ditto with local concurrency
- 3. distributed processor computers (synchronous multiprocessors)
- 4. asynchronous multiprocessor computers with interjob concurrenty only.

- 5. ditto with intrajob concurrency
- 6. ditto with consideration of known execution time relationships
- 7. ditto with real time constraints for operation with an environment (strict inequality constraints).

address ourselves to case 5, and this in a restricted framework.

## Framework

In order to concentrate on the questions of most interest to the writer, a number of simplifying assumptions will be made, unless otherwise stated, in this note.

- 1. We seek for a minimum of changes in conventional memories and pressors, such that parallel operation becomes easy to implement, rather than a radically new structure (such as ALPS). In particular one should be able on any problem sequentially with a single processor in the conventional manner, desired.
- 2. We consider a single job. In fact, the multiprogramming concurrency exertal jobs is essential to make parallel processing efficient. Idle processors that be able to find employment. This scheduling problem is similiar to that of the sential computers with I/O concurrency and is not our concern.
- 3. While the main reason for parallel operation within a job is to the distribution as required by the final user, we will not consider the raised by rigid real time constraints imposed on the production of certain that.
- 4. I/O operations are not considered; in the parallel processing framethey differ from other program segments solely by the necessity to obtain to one or more of a special kind of processing units. In all other respects are implicitly included (of Conway).
- 5. No interrupts or traps are considered explicitly though the considered explicitly though the considered explicitly of interrupts is one of the factors that make execution time of a segment a random variable without upper bound and uncorrelated with that wher executions.
- 6. Allocation problems are mosidered only from the point of view of should be saved and to whom it is accessible (the 2-dimensional scope problem)

there it is stored. The latter would be handled by an extension of the models of the storage allocation techniques for the single processor case (cf. Van Horn).

## Remarks on Monosequence Execution

#### . General

Monosequence execution of a job means its accomplishment by a string of citive steps, which include conditional branching. It has been shown by der Foel that a single instruction is logically sufficient:

(A)-(Y) 
$$\Rightarrow$$
 (A)-(Y()  $\Rightarrow$  (Y)   
(PC)+1  $\Rightarrow$  (PC) arithmetic

A distinction between processor state and other data is not essential.

Multiple address instructions the state of the processor between instructions is characterized solely by the program counter. The latter could be a life storage location. The notion of state-word is unnecessary as long as are no interrupts. Even with interrupts, provision for multiple program (TX2, Honeywell 800) does way with the state-word concept. In aschines this concept occurs only because of technological structure:

Precial location of AC, MQ, XR's. . . makes a difference as to how they specified (implicitly, tags, . . .) and how fast they are accessible, and ling more.

The specification of the algorithm carried out by the machine can be extracted at various levels.

sentation instruction by instruction is shown in Fig. 1. Without repts and exception traps, and taking all illegal instructions as halts:

This is the representation considered by Rimet.

Representation at the program level is provided by

- flow diagrams (Rutishauser's, Plan-Kalkiil)
- Algorithm schemes (A. Markov, A.A. Lyapounovland, V.I. Yanov)
- Indidence matrices (Karp)



Restriction-by-instruction representation.



Simplified instruction-by-instruction representation.

we the notion of operator which can be a single instruction or a sequence which conditional branching out of the operator; the terminal control point an operator is unique.

These schemes do not explicitly account for subroutine lipitage because terminal control point of the last operator of the routine is not unique.

The sake it unique one must follow it by a series of tests to identify the call that, an unrealistic representation. This can be avoided by returning to the lastruction level.

To prepare for discussion of parallel processing we now introduce a specific representation.

#### Program Structure

As in the flow diagram approach we consider data as distinct from program.

Consider all programs to be pure procedure (by use of indirect addressing, index states, etc.). Our reason for doing so is not the relocation problem (that we not consider) but the problems of parallel processing.

Programs are considered as made up of a finite and fixed number Not segments

This a unique entry point to each. They are labeled by the integers 1 to N,

Line 1 0 being reserved for halt (or trap to supervisor). When a subroutine call

to be explicitly shown, the calling segment is split, creating a label for

return point. The subroutine entry point has of course its own label.

#### Bata Equivalence

The program, being rigidly fixed, can be thought of as in a protected read-only store. All other information, including the contents of processor states, is called data. In a finit machine the data can have a finite number states only. At any time there are likely to be large numbers of states which equivalent as far as the execution of the job, or of the current segment, is exercised.

At halt or at end-of-segment time this equivalence has two causes: 1) only fraction of the words in store are of interest; the others contain immaterial that the contain immaterial occur leading to the recording of error flags and possible a premature halt.

production situation, as distinct from debugging, all these final states

In effect then, if the job terminates, we are only interested in the state within an equivalence. At all previous stages we are only concerned distinguishing those states that will lead to non-equivalent final states, the states that will lead to endless operation are all members of the like class.

In some way, an equivalence can be defined for any program segment by

## \* Program Description

Each program segment, whenever it is called into operation has access the the current state of the data. Its effect is to map this initial state into the current state and into one of the N > 1 labels 0 to N. If the label producing the state and into one of the N > 1 labels 0 to N. If the label producing the was to lead to anything but one of these, this would be considered the label 0 with error flag. We thus consider the program perfectly desugged the initial data not necessarily meaningful.

Each segment is defined by two single-valued functions.

segment 
$$i\begin{cases} D_1 = f_i(D_0) \\ L = g_i(D_0) \end{cases}$$
  $i = 1, \dots, N$ 

there D and D range over the possible data states, while L is in the set of t

The description of the program by these 2N functions is felt to be more weral than the description of Yanov. The function  $g_i(D_0)$  partitions the states into M+1 classes. The tests of boolean variables constrained by change lists such more restrictive, even when the maximum of 9 possibilities to which scheme can be extended is taken into account. These are obtained, for exem operator, by combining the 3 possible change conditions of the initial wife zero with the 3 similar conditions for the initial value one. The 2 litions are no change, complementation, either. This still does not account the fact that the effect of an operator depends on the previous operator and the initial data.

### Special Cases

- l. g (D<sub>o</sub>) = constant; such a segment is called a "box" (operator)
  string of boxes is a box.
- 2.  $g_1(D_0)$  n or  $n_2$ ; such a segment is a combination of a box and two-way branch.asAshebrownniEig.g33.
- 3.  $g_i(D_0)$  is a function only of the label of the previous segment carplied in  $D_0$ ). This is the case of the subroutine with unconditional return.
- 4. f<sub>i</sub>(D<sub>o</sub>) = D<sub>o</sub> the case of a pure branch, which can always be viewed successive two-way branches.
- Internally generated traps (overflow, etc.) do not transcend this transcend this transcend the coding execution of conditional branches.

The assumption of pure procedure programs means that the functions figure are fixed for successive passages through the same segment and that no segments can be created.

It is impossible to know the exact equivalence classes at any time in the running the problem for various initial states but it is possible to the equivalence classes which are surely fine enough. Assume the error flag is non program resettable, then all states leading to its setting in a second are equivalent. Also states are equivalent if they differ only in the times to be currently meaningless.

#### Llocation

To allow dynamic storage allocation the programmer must specify the more of every piece of data. This amounts to specifying its release (erasure). Scopes are nested in Algol for push-down allocation but this is not a more state.

### Parallel Processing

### Conditions for Parallel Operation

Given a sequential formulation of an algorithm, what is required to

Corn two successive segments in parallel?

Let i and i + 1 be the segments in question, with

(1) 
$$\begin{cases} f_{i}(D_{o}) = D_{1} \\ g_{i}(D_{o}) = i + 1 & \text{for all } D_{o} \text{ of interest} \\ f_{i+1}(D_{1}) = D_{2} = f_{i+1}(f_{i}(D_{o})) = q(D_{o}) \\ g_{i+1}(\theta_{1}) = g_{i+1}(f_{i}(D_{o})) = p(D_{o}) = L \end{cases}$$

to the single segment

(2) 
$$D_2 = q(D_0)$$
  $L = p(D_0)$ 

they are equivalent to any segment which results in

(3) 
$$\begin{cases} D_2 = q(D_0) & \text{modulo equivalence} \\ L = p(D_0) \end{cases}$$

all D of interest, segments i and i+l are said to be compatible if (3) is the relation of the individual instructions of the two sequences.

Is the recessary and sufficient condition for their parallel implementation

The framework.

Two other concepts occur in this connection. The segments are said to

(4)  $f_i(f_{i+1}(D_0)) = f_{i+1}(f_i(D_0))$  modulo equivalence, for all  $D_0$  of interest.

Listing necessary but not sufficient results from very simple examples such as and f(x) + f(x) and f(x) + f(x) which commute but are not compatible if the additions place in the accumulators of two different processors which fetch and store with two operations which are asynchronous.

Commuting segments can be made compatible by lock-out of data but this

The segments are said to be <u>data-disjoint</u> if the set of locations into **segment** i stores is disjoint from the set of locations accessed (for either tending or writing) by segment i+1 and vice versa, and this for all cesses of **laterest**.

This is a sufficient but not necessary condition, for instance both

\*\*Treets could use the same location for temporary storage of a single inter
\*\*Lists result which is necessarily the same in both, or they could execute

\*\*Lists and x+5 \*\* by add-to-memory instructions for which lock-out is automatic.

Thus we have

## Public and Private Data

In a monosequence execution, the same segment can be executed numerous [loop], subroutines) usually with different initial data (different existence class) at each passage. In multiprocessing, similarly, the same executed several times and this by different processors.

It will therefore occur that the same segment is being used simultaneously by the or more processors. These executions must be compabible, - in practice, late-disjoint.

#### This requires:

- a) pure procedure segments referencing data by indirect means.
- b) the indirect means must include at least one element which is distinct for each processor, for instance an index register R associated with the physical processor, P.

It follows the intermediate results generated by two processors

couting the same segment go into disjoint locations and that cross references

one processor P1 to the data generated by P2 are in general undesirable

may be impossible. To find the location of a result generated by P2 we

and to know the state of R in P2 as of the time the result was stored. Even

The has not been changed by P2 it is impossible for P1 to discover whether

is physical processor P2 or P15 that is involved. Looking at all processor

regram counters is no help, since P2 may already have finished the segment

the assumption of completely asynchronous time relationships))

In conclusion we must distinguish two types of data: public and privatelic data is accessible to all processors unless a temporary lock-out condition
lis. Private data is associated with specific processors and accessible
to them. This is not a "look-out" but a long-term memory protection, It
lise accomplished by pointers in the processor and protected pages or segments
linearry. In case of interrupt, only pointers need to be stacked.

In main memory, access to any page or segment by any of the proposed the proposed the proposed the proposed the proposed the public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by this public data not locked out or to private data owned or co-owned by the public data owned or co-owned by the publi

Ownership of private data can be extended to more than one processor when new ones are called to serve, i.e. at fork time.

## Forks

Program execution always starts by activating a single processor, providing with a program counter setting (label of segment) and a pointer to an initially set of private data. To get more than a single execution sequence, an tase in active processors must be possible: the fork. Forks are a new type mero instruction that can be specified in a program segment. Their function the creation of a new set of private data and segment label for the benefit of edditional processor. If none is available, a pointer to this data is the processor scheduling system, which hands out such pointers to physical the sessors as soon as they become available. The fork can optionally specify of several priority levels for consideration by the scheduler. If only job is considered, the scheduler could entity to implemented in herdere. the scheduler must also consider the requirements of other jobs traffic. It will keep a number of queues, one per priority level, the priorities by some algorithm. A processor then becomes necessary the scheduling job. It can be obtained by interrupting any one of the processors. (The interrupted sequence resumes immediately if a processor times available before the end of the scheduling activity). Alternatively, mecial purpose processor can be reserved for scheduling and other supervisory

Expense for hardware to speed up the red tape involved in forks is satisfied. It reduces the minimum segment length for which parallelization worthwhile: large numbers of low priority requests for short segments can be stacked. Even if the majority will ultimately be sequentially promoted, due to the limited number of processors, an over-all gain in job selection time will result. This applies only to the highest priority in multiprogramming. It is very questionable whether parallel processing any but the highest priority job is meaningful. The most likely situation the presence of one job of extreme urgency and of others forming a backwood phich prevents wastage of processing capability. In any case a lower living job would have no incentive to fork for short segments under a meanable scheduling algorithm.

Richards is quoted by Conway as having shown that a single queue of wate words" is not optimum. I do not know his assumptions but it seems plausible segment length should be taken into consideration.

#### Segment Definition

Let D<sub>O</sub> be the public data at the time processor k begins execution of its linear i. Since other processors are constantly modifying public data, D<sub>O</sub> is ally defined modulo equivalence for the operations of segment i. The equivalence states for execution of i by k can not be affected by the other processors since they are assumed to be compatible. Let P he the initial state of data private k (within equivalence for co-owned data). We allow only one fork per segment they menting segments where necessary). At termination of segment i we have at two processors with private data P<sub>1</sub> and P<sub>2</sub> which are to continue with the processors are constantly modifying public data has undergone a transformation into equivalence class D.

We have, modulo equivalence for segment i

segment i 
$$\begin{cases} D = f_{i}(D_{o}, P) \\ P_{1} = h_{i}^{1}(D_{o}, P) \\ j_{1} = g_{i}^{1}(D_{o}, P) \\ P_{2} = h_{i}^{2}(D_{o}, P) \\ j_{2} = g_{i}^{2}(D_{o}, P) \end{cases}$$

description is symmetrical with respect to the two prongs of the fork.

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to

Letty is introduced if we have the convention that processor k is to be a convention to the convention that the convention t

$$f_{j_1}(D, P_1) = D$$

$$h_{j_1}(D, P_1) = \emptyset \text{ the empty set}$$

$$g_{j_1}^1(D, P_1) = 0 \text{ label of halt}$$

$$g_{j_1}^2(D, P_1) = P_1$$

$$h_{j_1}^2(D, P_1) = f_3 \text{ the desired continuation.}$$

dess segments are characterized by  $h_1^2 = \emptyset$  and  $g_1^2 = 0$ , this amounts to the legices processor situation on the cartesian product D x P (within equivalence, the other processors modifying D and co-owned parts of P are running). In  $h_1^2 = 0$ ,  $h_1^2 = 0$  we say that the execessor "quits". Instead timitying job completion this means that all private data is released to free storage unless co-owned by another processor) and the statuter informed of the processor's availability. Only when the scheduler no requests in store for the job and all processors working on it have

### Flow Diagrams

For monosequence processing, diagrams need only the following symbols:

\*\*Example \*\* \*\*Continuous of the following symbols:

\*\*Example \*

With proper interpretation only two new symbols are necessary for multipressing: the fork and the locked brahch. Definitions of the flow diagram the sols in terms of segments are now given:

Box: 
$$g_1^2 = 0$$
  $h_1^2 = \emptyset$   $g_1^1 = constant$ 

Figure 4a.

Junction: indicates equality of successor labels of two boxes; it is not a "join", no waiting is involved. Figure 4B.

Branch: 
$$f_i(D, P) = D;$$
  $h_i^1(D, P) = P;$   $g_i^1(D, P) = j_1 \text{ or } j_2$ 

$$g_i^2(D, P) = o;$$
  $h_i^2(D, P) = f$ 

Figure 4c.

Locked Branch: 
$$g_i^1$$
 (D, P) =  $j_1$  or  $j_2$ 

$$g_i^2$$
 (D, P) = 0  $h_i^2$  (D, P) =  $\emptyset$ 
Figure 4d.

branch in that the elements of data relevant for  $f_i$ ,  $h_i^l$ ,  $g_i^l$  are locked to all other processors during execution of the locked branch. In most cases a single add-to-memory operation is sufficient (with a copy of the same left in the accumulator), lock-out can then be automatic by virtue of the memory access system. In more complex cases the page or segment containing the relevant items has to be marked as locked (not as private) before execution begins and is unlocked at completion. If it is found locked requests are repeated until access is granted. This is feasible because locked operations will only amount to a few instructions and the number of processors fighting for the data is limited by the number of physical processors.

Lock-out of co-owned private data by one of the owners to all others is likewise required for all elements of data relevant to a looked branch.

Return connector 
$$f_i$$
 (D, P) = D;  $h_i^1$  (D, P) = P;  $g_i^1$  (D, P) =  $\phi$ (P);  $h_i^2 = \phi$ ;  $g_i^2 = 0$ 

Figure 4e.



where address is saved as private data as are parameters and wiry results, to permit simultaneous multiple execution of the with nesting and recursion large amounts of private data inerated temporarily.

**Ealt** or quit:  $g_1^1 = g_1^2 = o$ ;  $h_1^1 = h_1^2 = \emptyset$   $f_1(D, P) = P$ Figure 4f.

Exist:  $f_{i}(D, P) = D$   $h_{i}^{1}(D, P) = P$   $g_{1}^{1}(D, P) = constant = j_{1}$   $g_{i}^{2}(D, P) = constant = j_{2}$ 

Figure 4g.

essential feature of a fork is the creation of an initial set of the data  $h_1^2$  (D, P) and a label  $j_2$  for the new processor (this is like to Conway's state-word), with a request for scheduler assignate a physical processor.

#### Lesic Example

A besic example is illustrated by Figure 5. Let A, B, C, and D be inet segments and let the executions required from B and C be compatible.

Let public variable T: = 2, before the fork. The locked branch is an example assection of -1 to T; the quit exit is taken unless the result is zero, like the ease continuation is with D. This is the cusual "join" operation, the locked branch is a more powerful concept.

# Mare Control of Data Ownership

For each page of segment a set of data control bits have to be a set instance as follows. With four physical processors, we use ten the second in Fig. 6. The interpretation of the data control bits is the following example.

| Ł. | Free storage:               | 000000       |  |
|----|-----------------------------|--------------|--|
|    | r seens of the €            | 21111        |  |
| 2. | Public data:                | 101111       |  |
|    |                             | 1111         |  |
| 3. | Public data locked out by   | 101111       |  |
|    | processor 3:                | 0010         |  |
| ٨. | Private data of yet         | 010000       |  |
|    | unscheduled processor       | 1111         |  |
| 5. | Private data owned by       | 000010       |  |
|    | processor 3:                | 1111         |  |
| Б. | Private data co-owned by    | 010010       |  |
|    | processor 3 and one or more | 1111         |  |
|    | yet unschededed processors: | 1111         |  |
| 7. | Same as 6, locked out by    | 010010       |  |
|    | processor 3:                | 0010         |  |
| 8. | Private data co-owned by    | 001010       |  |
|    | processors 1 and 3:         | 1111         |  |
| 9. | e same locked out by        | 001010       |  |
|    | processor 3;                | <b>601</b> 0 |  |

Locking processor with zero meaning no lock-out.



A Same to example.

| 9 s | P1 P2   | På   94 |
|-----|---------|---------|
|     | 181 182 | 1.3 1.4 |

the late for data ownership.

Separate bits for private ownership and lock-out are necessary,

ise to an on what pattern to restore at the end of the

branch is lost. With a six bit set, if the locking processor stores

bits to locking, restoration of the initial pattern by want to

thanges in the ownership status of the segment and there is no reason

them wait. A 7->9 transition occurs when the scheduler assigns

I processor one with this segment on its initial private data list.

transition occurs if Proc. one releases the segment or quits. In

the lock-out bits are than immaterial though processor 3 has no way

ing, if he was not sole owner (Sole ownership cannot be changed

the owners; caseebileout below)

# Lation of Data Ownership

Programs begin with a single processor and all public data. The thing processor can create data private points to astachiless cropies or the soft public data, it can operate on both kinds.

At the first fork, it specifies what initial private data the new

- a) It wreates data which becomes private solely to the new processor.
- b) It declares some of its own private data to be common to the
- "allocation" activity (for lack of a better word) we call a "fork

Both processors can produce new forks withheheamentuislessoschmat

The scope of elements of data is delimited on one side by public, and fork declaration, and on the other side by the following erasing tions.

Erasure of public data (return to free storage) can be effected by processor. That this should not cause conflicts is part of the compati-

Release of private data owned solely by the processor will cause its

A processor can not always know, without a locked branch on the data

bits, whether its ownership is sole. Once it has made the data common

if ownership is sole at any time this fact can not be changed withthe processor's consent.

Release of private data which is common to other processors (which still be unscheduled) only lowers the corresponding data control bit. The last-but-one processor releases we are back to sole ownership finally, when the last processor releases, to erasure.

When a processor quits it automatically releases all its private

## Programmers Description of Data Control, Forks, Locked Branches

We modify Algol, to eliminate the block concept; declarations releases are explicit and at the whim of the programmer (not nested), and end serve only for statement parentheses.

Examples:

declare public real X; integer array NAME (1:N\*M)

and later

release X

Le the array stays on.

Similarly

declare private y

nd later

release y

er perhaps

quit

use here A when we want to underscore the fact that a variable is private, exposition purposes.

A COMPANY CAN APPLICA

for instance

fork declare private T common y; 1: = J + K -1;

priority 5 to lab;

is public, I is known to the current processor but will not be known to be the new one, I will be known to both.

The priority level is for the scheduler, it could be an arithmetic fraction. When scheduled the new processor will begin at label "lab." Lime the values of  $\hat{J}$  and  $\hat{J}$  may have changed, that of  $\hat{I}$  not.

At a locked branch

begin lock T; T: = T-1; if T > 0 quit else so to lab end

The fork in the basic example is part of a subroutine which can be called

\*\*Processor\*\*, the variable T must be declared private to the enter
\*\*Rescensor\*\* who then extends its ownership by a common declaration at the

\*\*This shows the necessity of distinguishing between privacy and lock-

## Example of Matrix Multiplication

We presume the following declarations of public data:

integer N, real array A(1:N, 1:N), B(1:N, 1:N), C(1:N, 1:N)

Leastble program is the following which is diagramed in Fig. 7.

daclare public integer T, I, J; T:=N2 + 1

I: = 1 step 1 until N do

for J :=1 step 1 until B do

fork declare private integer I, J; I: = I; J: = J;

priority 2 to term;

test; begin lock T; T: = T-1;

if T> o quit else go to next end

term; declare private K; C(1, 1): = 0;

for K: = 1 step 1 until N do

C(1, J) := C(1, J) + A(1, K) \* B(K, J);

release K, I, J;

go to test;

next: - - - -

Note that we can just as well have the initial processor quit conditionally after issuing the  $N^2$  forks (set initially  $T := N^2$ ).



Flow diagram for matrix multiplication.

why storage positions for I,J must be available? For k << N<sup>2</sup>+1

"wars, if the scheduler queue system builds up to a maximum of

"k we need k+q stores which could be close to N<sup>2</sup>. To economize

"is storage one has to test queue length (a public read-only variable)

"sach fork. If q is too large the processor queues itself with

"late priority.

t distance

15 0 > 10 fork declare common all, priority 2 to next; quit

# Loops of the "While" Type

Loops of the "while" type differ from the previous case in that the wariable can not be set at entry. Instead we start at one and index such time the range of the loop is entered as illustrated in Fig. 8.

I shows a case where two side computations are done for each traversal the range. If B is traversed n times, so are C, D, G and I. After loop and J will be traversed once when all n executions of G and I, crively, have been accomplished. Box K is executed once upon complementations are done to the complementation of Both H and J.

# te of Locked Branches for Scope Termination

It it is important to release storage space of some large public soon as possible, we can use locked branches as shown in Fig. 10.

declare public M; T:=2;

a: - - - ;

fork declare common all, priority 2 to C;







b: ---; begin lock T; T:=T-1;

diff T=0 go to 1<sub>1</sub> else 1<sub>2</sub> end

1<sub>1</sub>: release M; 1<sub>2</sub>: ---
c: ---; begin lock T; T:=T-1;

elif T = 0 go to 1<sub>3</sub> else 1<sub>4</sub> end

1<sub>3</sub>: release M; 1<sub>4</sub>: ----

private no locked branch is necessary, we simply release M at and the data control bits take care of the rest.

# Publication, Grabbing and Lock-Out of Data

A processor can publish any of its private data, this being equivalent declaration of a public copy before release; this can be abbreviated

A processor can grab public data by declaring a private copy of it

By first Prabbing and later publishing data elements we obtain lockthe the ordinary sense, independently of the locked branch concept. It the lieved that such lock-outs are rarely justifiable in programming since the tend to nullify the speed advantages of parallel processing.

# ng-in of Subprograms

A segment of the box type (defined previously) has the property

for each entrance of a processor there will be one and only one exit

corresponding processor. To analyze complex flow diagrams it helps to

lify subprograms with a single entry and exit which have the same property.

condition is that the number of quits always balances the number of

before emergence of the processor or of one of its descendants at

it. Just as for segments, these boxes can, in general, be subject to

simultaneous executions - namely if a new processor enters before

previous one exits.

The box concept can be generalized to the single entry multiple case. The condition is then that for every entry there will be an extreme at one and only one of the exits. Again all forks must be considered by quits before the emergence. Different boxes can call the subroutine, provided the subroutine is itself a box (no"side-effects").

## Execution Trees

To check a program as to compatibility of all segments that can recuted in parallel, it is helpful to find all commutation conditions.

This can be found by imagining execution with a single processor with a single process

At a fork the number of sequences ready to run increases by unity.

It number is shown by the exit degree of the corresponding thee node.

It quit the number decreases by one. Therefore successive tree nodes

It by + 1 in their exit degree. The root (entry to program) has exit

while tips of the leaves (exit degree 0) are reached when all

leaves have quit. Each arc is a transformation of the data, or operator.

It is the string of operations for every path from root to a tip and

while these strings to be equivalent gives the commutation conditions.

There is in general dependent on the initial data, so that a family of

these (forest) has to be considered.

Boxing-in allows to go through this procedure in steps, beginning that the trees for the innermost, simplest boxes. At the next level, the mountation conditions having been checked, these boxes are equivalent to be sperator.

#### 16. Look-ahead

Definitions of look-ahead programs:

- Definition 1: The final exit can be reached with some processing still occuring.
- having to wait for execution of all the sequences it initiated.



Fig. 10 Use of a locked branch for scope termination.



Fig. 11 Subroutine for look sheed control.

The entire program cannot be considered a single box.

A look-ahead sequence is one which may turn out unnecessary; it is lated by a fork and proceeds in "look-ahead status". At a later time processor will discover whether it has become a necessary or inecessary. The fork is taken as soon as the data is available, necessity is taken or disproved at a later time.

The major problem is to communicate the change in status to the change in stat

The suggested solution uses boolean variables and a hardware feature.

\*\*Contract Cook-ahead bit LAH is associated with every processor (in its word") and is normally zero. At a look-ahead form this bit is set and a pointer is set to two bits LAH and UNN declared common to both the fork. The \*\*Contract Cook in Fig. 11 is then prepared, partly in the form. At the fork we set

LAH: = LAH: = 1 UNN: = 0

The subroutine could be run automatically, as long as LAH=1, by the subroutine assignment of a physical processor and them by that processor k instructions.

As long as LAH=1 all descendants of the processor are initiated in wheat status, with common ownership of LAH and UNN.

The subroutine action for UNN=1 can be a jump to the end of the

As an example consider the "while" loop implementation in Fig. 12 where
3. . . . 7 designate boxed subroutines. Box 3 is \*\*\* STATE \*\*\* inside

y only in some revolutions, and for which the data

ble before test of predicate P decides its necessity. The method

excounts for the case that on the n<sup>th</sup> revolution an unnecessary look
sequence started on a previous revolution may still be running.

Note that the arrays indexed with o are always of small size, regardof c, because the lower c-values are progressively released. The value
could be incremented modulo a sufficiently large integer if very large
lation numbers are expected.



Fig. 12 An essample of Look-shead.

heker's take-a-check system).

A simpler method is available

if P=1 wait for termination of box 3 at the "join" if P=0 do not begin box 6 without backing (by a common boolean variable) that box 3 has been stopped.

way we prevent restart of the loop with old look-ahead sequences

funning. It is felt that the complications of the first approach are

functive time-wise, expecially if the look-ahead-status checking subroutine

programmed and therefore is called at longer intervals.

## Raring Processing Capability

Some operations which can be accomplished much more rapidly by hardware than by subroutines do not occur quite often enough for the properties of this hardware in every processor. It seems worthwhile to an appropriate number of these units, one for operation A, 2 for frequent operation B, etc. for, say, 10 processors. Call is by todes": if all special units are busy, transfer to the subroutine is

If the time difference (subroutine-hardware) is considerable, one queue requests up to a maximum length after which the subroutine is

### **Applications**

An obvious class of applications for this type of multi-processing
integration of ordinary differential equation systems. Speed gains by
proportional to the order of the system (up to the number of physical
are possible and parallelism is easy to specify. Such problems
cour under time pressure conditions (missile tracking). The superiority
present approach compared to DDA's or a superiority
specific of compilation techniques with changes in problem size. The
line of compilation techniques with changes in problem size. The
line database must fit the problem to a fixed equipment complement
line tracking and the system is the line of the system of the system is the line of the system of the system is the line of the system of the system of the system is the line of the system of