6.1 StressFlow Electronic Architecture Node Model
So far stress-flow has been described mainly from the perspective of universal interface to the multiprocessor machine it offers to the user. But it also has another side to look at – namely how it defines multiprocessor architecture and how it utilizes it. These two faces of stress-flow are not separable and have to come together to make it a practical method. The model of stress-flow from the point of view of low level organization and electronic circuitry also allows better explaining of inner workings of stress-flow.
FIG. 31: Layout of processing node of stress-flow using conventional stack parameter passing
FIG. 31A: Layout of processing node of stress-flow using lock structure parameter passing
FIG. 31 and FIG. 31A show hardware conceptual layouts of processing node of stress-flow. FIG. 31 shows the layout where actual parameters of the tasks and optional return value are passed through stack in conventional fashion. FIG. 31A shows the advanced architecture where task locks provide storage for both actual parameters and optional return value.
The solution shown on FIG. 31 has to be discussed because it is still capable of fulfilling all functional requirements that necessitated stress-flow, although with much lesser efficiency, particularly on distributed memory architectures. Other than the issue of selected method of passing parameters, presented layout is the same regardless of details of particular target hardware. There always has to be plurality of such nodes, even in special case implementations that run on a single processor. In such case, the implementation must emulate virtual processors that time-share the same physical resource, but which otherwise behave as if they were separate hardware pieces. From the standpoint of code running on a single processor, every architecture appears the same, the code only interfaces through lock constructs which, by virtues of stress-flow, become critical processing/hardware resources, are regulating all aspects of parallelism and providing interface to other tasks. The locks in turn interface with implementation specific Task Sending and Task Receiving Interfaces – which could be seen as part of hardware/software implementation of locks themselves. Due to importance of the locks it would only make sense that computer architectures geared toward stress-flow would attempt to implement all lock operations in hardware.
Operation of architecture shown on FIG. 31 is as follows. For a new task to come to life, it has to be sent to the Task Receiving Interface. The task definition always consists of lock identification of the current task plus identification of sequence of instructions to run. In this architecture, with parameters passed through the stack, the new task stack identification or address has to be also included in task definition because at the point of sending the task definition to the Task Receiving Interface the stack must already be allocated and set with task’s actual parameters.
The lock is reserved before the task is sent to Task Receiving Interface. Task Receiving Interface can run the new task at once or store it in some queue to be retrieved from there when the processor becomes available. A queue like that makes sense even in the case when the target platform has a very large number of processors. Also, the ability to frequently schedule far more tasks than there are processors is one of the key futures of stress-flow that allows very efficient parallelism on different platforms. Without such a queue a processor would not have anything else ready for it do after a currently running task has been suspended through a reservation attempt on some lock while schedule instructions would have to suspend the caller in cases when there were no available processors to take the task – complicating the scheme and making it far less efficient.
At the moment the Task Receiving Interface receives the new task, the processor can either be doing nothing or already executing some other task. If it is doing nothing, the Task Receiving Interface will make the processor execute the new task right away. If the processor is already executing another task, the incoming task is stored in the queue and will be removed from it when the processor becomes free: the previous task completes or gets suspended waiting for some lock.
Architectures with distributed memory may need another module connected to the Task Receiving Interface and cooperating with it: Code and (optionally) Data Fetching Interface. The time the tasks wait for processor in the Task Receiving Interface FIFO is perfect time to pre-fetch any code or data if it is necessitated by the architecture since these functions can be performed with little or no involvement of the node processor and all information needed for it is available in from task definition. Data copying pre-fetching should only be performed in architectures such as virtual parallel networks where the overhead of trying to do data access on byte-by-byte as needed basis would be prohibitive. However, code pre-fetching combined with caching is a good idea on all systems with distributed memory, even if nodes can access each other’s memory at relatively little cost. The reason for it is that in most cases the same sequences of instructions will be scheduled to execute as separate tasks many times, and often they will be assigned to the same node or one close to it.
Once the new task is running, it will have to release its associated lock somewhere inside its sequence of instructions and it may need to initiate new tasks. The idea here is that few commands performed on the lock constructs is all that a running task knows how to do in regards to other processors or hardware. This lock/task initiation scheme is intended to be the main (or even the only) means to communicate with other processors and to synchronize with them. A task may access some common or global memory without the lock/task interface, but if any constraints are needed on such access, the method used to accomplish it is once again through using the lock construct.
To start a new task, the current task first reserves the new task’s lock. Next it obtains a new stack for the new task and stores the actual parameters on that stack. A reserved lock, identification of sequence of instruction to run, and identification of stack for the new lock constitutes definition of a new task. Additionally on memory distributed parallel architectures, the new task definition should also contain originator task provided “Resource Location” to aid in selecting the node to run the new task on. The task parameter resource location concept was described in separate chapter and its idea is to provide rough information as to where most of resources needed by the new task reside.
Next the schedule command is issued to the lock that will allow thus completed task definition to be passed by lock to the Task Sending Interface. If the new task return a value, an optional wait-for-lock-release command may be used before the returned value is accessed which will be discussed in detail later.
Once a new task description is complete it gets passed into Task Sending Interface. This module routes the new task to the node in which it will be run. This involves selecting the optimal target node, using (if available) the Parameter Location information combined with current possible target nodes loads. The cost/delay of waiting for other tasks to complete is weighted against the cost of accessing the needed resources to come up with an optimal target node. On memory distributed architectures the Task Sending Interface of a specific node may only have access to its neighbor nodes. In such a case it simply passes the new task to the Task Receiving Interface of its neighbor in the direction of the optimal target node – and not directly to the Task Receiving Interface of the target node. Regardless of details and architecture of the target platform, the only information that can be required of the user software for proper node assignment and optimization is the Resource Location calculated by the originating task.
FIG. 31A shows how using the special parameter passing scheme through the lock affects the architecture. As previously explained, the idea is to have memory reserved for task actual parameters as part of the lock or next to it in the same memory space. Since only one task can have a particular lock reserved at a given moment, this is a perfectly sound scheme. A task calling a new task stores the actual parameters only after reserving the lock. Another task will be prevented from storing/corrupting the previously stored actual parameters in the lock by the need of reserving the lock first. If some parameters are needed inside the new task’s relaxed section, they can be copied to the local stack at or before the stressed to relaxed section transition.
The big difference that this scheme makes is that the stack for the new task needs not to be allocated and accessible until the new task actually runs on the target, possibly after period of waiting in target node’s queue as shown on FIG. 31A. This greatly simplifies the Task Receiving and Sending Interfaces as the task definition passed through now does not need to contain the ID of the stack assigned to the new task. But the performance gains are most significant when implementing stress-flow on any distributed memory architecture. In such a case, the preferred location for the stack is in the local memory space of the target node. The problem is that at the time of writing the new task actual parameters we do not know the target node yet, and it would be better not to have to make final determination of where the new task should run until the moment when it actually runs. The load situation of all nodes changes all the time and therefore so does optimal node assignment. Having to firmly assign target node for new task way ahead of time would therefore introduce costly inflexibility. For these reasons, allocating the stack for the new task way ahead of time only to store parameters passed to it there is a very bad solution. A complex scheme could be designed where new task actual parameters would be stored on the calling task’s stack and the pointer to it passed along. This scheme would however be far more complex and less efficient than using the lock as the actual parameter storage area. With no need to pre-allocate the new task stack or need to access anything but the lock space, the Task Sending Interfaces can now freely pass the new task definitions from one node to another based on current node loads. Node loads themselves do not need to be known precisely until actually dealing with a new task ready to run.
Send mail to
email@example.com with questions or