[Company Logo Image]    

Home Feedback Contents Search

5.1 Stack Use
5.1 Stack Use 5.2 Internals 5.3 Microprocessors 5.4 Parallel HW 5.5 VPP HW 

Home Up Next

5.1 Stack use, Automatic Stack and De-allocation

The fact that all execution is split into a very large number of mini-threads each with own stack requires that these stacks can dynamically grow (and at times also shrink) as needed. This would have been a problem with older processors that used “flat” (also called unsegmented and non-virtually mapped) memory model. Since in general case we do not know how long the stack would need to be, fixed size allocation of stack would make stress-flow practically unmanageable. The only way to run stress-flow on processors with flat memory would be for the compiler to determine maximum required stack at compilation time which would not be that complex (due to relatively small size of stress-flow atoms) but would require prohibiting techniques that would make such determination impossible (example: allocating an array on stack with non-constant number of elements). The modern processors obviously do not have such problems. Virtual (or segmented) memory mapping techniques used there allow starting with minimal stack and growing it as needed and this is precisely what is required for efficient implementation of stress-flow.


FIG. 29: Managed stack that persists as long as any references to its data persist

There are, however, additional stack allocation requirements. As briefly mentioned in discussion of code from FIG. 8A and 8B, the management of stacks for stress-flow atoms creates some challenges. Suppose some data structure is allocated on stack of a stress-flow atom and another stress flow atom is called with address of the stack allocated data structure. Normally, the first stress flow atom finishing its calculations would result in destroying the stack or allocating it to another micro-thread. If this happened before the second stress flow atom accessed the data, the consequent access would result in reading invalid data. The problem could be avoided by allocating all local-use data on automatic heap rather than stack, but this solution, (resulting in some code limitation that would have to be rigidly enforced) would be fairly inefficient in case of many small variables being constantly allocated and de-allocated from the heap. The solution to this problem is a stack organization that counts all references to an instance of stack and prevents releasing/reusing a stack as long as references to it exist – as shown on FIG 29. The functioning is similar to garbage-collected (or managed) dynamically allocated heap memory blocks. Each stress-flow atom/mini-thread associated stack thus maintains the counter of references to it and persists past the last instruction of the stress flow atom instance it was generated for. Once the counter drops to zero, the stack can be released/reused.


FIG. 29A: Node assignment process in mesh architecture 

Another issue affecting construction of stack interfaces is the need of handling the default path calls for ordered and similar stress-flow atoms. Default path calls require re-firing the ordered and similar stress flow atoms with the same set of parameters as immediately before. This requires keeping a copy of these parameters somewhere. The method of stack operation based on previous art concepts would allocate or assign a stack right after the lock of a stress flow atom was reserved. The parameters would then be pushed onto the stack and a new stress-flow atom scheduled. In case this was a default call, the previous parameters stored somewhere would now have to be pushed on the newly allocated/assigned stack. On some large number processor system, several parts of such scheme would be inefficient even without default path support. Consider a FIG. 29A four node part of “mesh” multi-processor computer with distributed memory. In such organization a node consists of a processor, memory, and interconnect circuitry to communicate with neighboring nodes. A new task (instance of stress-flow atom) has to be assigned a node as close as possible to the nodes the new task has to communicate with. This is required in order to optimize performance. Suppose we have some object X allocated in node C and node B executes code that needs to fire stress-flow atom aaa from object X. In order to be able to pass parameters, node B needs to obtain access to the stack that will be used by X.aaa instance fired by it. In this architecture this means the need to already assign node to the fired instance of X.aaa even if all the nodes around are busy and none will be able to actually run X.aaa any time soon. Since B needs to write parameters for the new instance of X.aaa, it has to be allocated somehow close to both B and C where the object X and its data reside. In the situation shown on FIG. 29A, if node C was overloaded and node A was relatively free, A would be assigned the new task due to its relative closeness to both B & C even if say node D was significantly more free than A. Performance problems with this solution come from the fact that the new node and especially new local stack had to be assigned up front at firing time, based on node loads at time moment possibly way ahead of the moment X.aaa will actually be executed. At the time X.aaa actually runs the node load situation might be totally different. The second performance problem comes from the fact that due to need to write parameters for the new stress-flow instance, the node selected for X.aaa has to be close to the node that fired it (in this case node B).

It may appear that not much can be done with the problem described above. Stress-flow, however, has some unique characteristics that can be taken advantage of in order to greatly improve performance in the situation shown on FIG. 29A. Every stress-flow atom consists of the stressed and relaxed sections. The underlying principle of stress-flow is that only one process (or one mini-thread) can be inside the stressed section at the same moment. Also, once a newly fired stress-flow atom gains control of the associated lock, it does not release it until it is run and until it reaches the relaxed section. The implementation also already requires allocating data space for holding the lock data at the same scope where the stress flow atom was defined. This is also where the default-path copies of the parameters of a selected stress-flow atom types would have to be stored. But, there is nothing preventing us from reserving proper space there and storing the stress-flow atom parameters themselves. The only downside would be that doing something like that is impossible with currently available compilers.


// Atom lock structure generated and hidden by the compiler here

// int aaa;          // Storage for atom actual parameters

// some_type bbb;    // Storage for atom actual parameters

detach void SomeAtom(int aaa, some_type bbb)  

   … some_instruction(bbb); // This use can accesses bbb stored

                            // with the lock


   … some_instruction(aaa); // Use copy of aaa on the local stack


FIG. 29B: Access scheme of stress flow atom actual parameters

The firing code would thus store a call’s actual parameters in the space reserved for it with the stress flow atom lock. The stored values would need to be copied from there to the stress-flow atom’s stack only if they were used in the relaxed section, as shown on FIG. 29B. Storage for all actual parameters for “SomeAtom” is generated at the same place and scope the lock data is. If a parameter is used only in the stressed section, no copy of it is ever made on the local stack. If a parameter is used in the relaxed section, a copy of it is made on the stack and the accesses to it within the relaxed section operate on/with the copy.

This may seem as a little unorthodox and complex, but the advantages, however are enormous and well worth the cost. First of all, there is no delay associated with allocating a stack (or a whole mini-thread) to a newly fired stress flow atom which already improves performance on all possible implementation platforms. The moment the control of the stress-flow atom’s lock is obtained the parameters can already be stored in the space reserved with the lock without even knowing what node/processor will actually run the newly fired atom. In case no acceptable node is ready to run the fired atom, it can be queued inside the node that hosts the atom data awaiting change in situation. Queuing the fired stress-flow atom rather than actual node-assigned mini-thread complete with local stack with parameters written to it means that we do not tie resources (such as mini-thread control data and stack) until a node is ready to run the fired atom. This means ability to easily assign and reassign the target node to run the atom and further improvement in performance. But the gains are most important in distributed parallel architectures. The firing node does not have to communicate directly with the node that runs the fired atom which allows more flexibility in assignments and a simpler assignment algorithm. Also, not being forced to assign a node ahead of time means that when the atom actually runs it will run on best possible node and not one that had to be guessed to be the best ahead of time. If the guessing was wrong, and an atom was assigned to a very busy node, it can be easily reassigned to its neighbors as soon as they become free. In situation shown on FIG. 29A this means that X.aaa can be assigned to node D rather than being artificially and wrongly forced into node A.

The last advantage is that default call/firing implementation is automatic. Since the stress atom actual parameter data is persistent between calls, default call firing simply means firing the node without touching the parameters at all since the copies of previous values already are exactly where they need to be.


detach void Func(int@ aaa)










FIG. 29C: Example of passing parameters by “coercion”

Above described method of passing actual parameters has another advantage of allowing new type of parameter passing. Since storage place and previous value are persistent between calls and safe to access by tasks while the corresponding lock is reserved, this can be taken advantage of by allowing calls of, for example, “one more than before” sort. Such parameter method could be indicated by placing say “@” character inside parameter definition as shown on FIG. 29C. Such scheme would allow passing unary expressions operating on the space reserved on the parameters as also shown in several examples of calls on FIG. 29C. With prior art parameter passing through stack, something like that would be impossible, but here it has perfect sense. Before the expression given is executed, the lock of function “Func” must be reserved, assuring that two processes cannot be accessing the parameter space at the same time. Due to its characteristics, such method of parameter passing could be named “by expression” or “by coercion”. This method will allow merging many separate functions for typical object operations into one. Rather then write several functions for object: “increment,” “decrement”, “set” – all of it can be now done with one function passing proper parameter through “coercion.”

It may appear that similar effect could be achieved with prior art object-oriented languages through operator overloading. This is false. Here, we specifically guarantee that only one task/process can be changing the variable at the moment while also allowing proper notification of other objects and functions if it so desired by designer of the function. Consider, once more, the example on FIG. 29C. The function body allows notifying others to take action, before another change can take place, through “notify_A” call, and it also allows to start post-fact notifications by “notify_B” call made from the relaxed section of “Func” routine.    

Home Up Next
Send mail to info@stressflow.com with questions or comments.
Copyright © 2005-2010. All Rights Reserved. Patents Pending in Multiple Jurisdictions.
First published 03/29/06. Last modified: 06/25/10