4.7 Optimization of Parallelism in Stress-Flow

const int n=100;

void  Get_Data ( some_type* Dest );
void  Calc_MA  ( some_type* Dest, some_type* Src, int iDepth );
void  Run_Sim  ( some_type* MA );

//----------------------------------------------------
struct SimData
{  some_type  RawData[n];
   some_type  MAData[n];

   detach void StepA (int iDepth)
   {  Calc_MA (MAData, RawData, iDepth);      // everything before "return" is stressed
      Run_Sim (MAData);
      return;
   };

   detach void Go()
   {  Get_Data (RawData);                     // all work done in the stressed section
      for (int i=1; i<=100; i++ )
         StepA (i);
      return;
   };
};

FIG. 8: An example to be made parallel and optimized

As briefly mentioned, the parallel performance of stress-flow code is increased by moving as much code as possible from the stressed section to the relaxed section of each stress-flow atom. FIG. 8 shows a simple example to optimize. The code is invoked externally by repeated calls to routine “Go.” External routine “Get_Data” obtains an array of raw data. This raw data needs to be filtered by routine “Calc_MA” with a hundred different filtering depths. Each set of resulting filtered data then needs to be processed by external routine “Run_Sim.” The un-optimized code of FIG. 8 stores both raw and filtered data in “SimData” member variables. Since all of the code is in the stressed sections of routines “Go” and “StepA,” the code has zero parallelism. Even if the target hardware had hundreds of processors available, the FIG. 8 example would take just as much time to execute as if the target hardware had a single processor.

 

struct SimData
{  detach void StepA (int iDepth, some_type* RawData)
   {  some_type   MAData[n];
      Calc_MA (MAData, RawData, iDepth);      // needs the caller's RawData - stays stressed
      return;
      Run_Sim (MAData);                       // relaxed - runs after the caller is released
   };

   detach void Go()
   {  return;                                 // entire body below is relaxed
      some_type   RawData[n];
      Get_Data (RawData);
      for (int i=1; i<=100; i++ )
         StepA (i, RawData);
   };
};

FIG. 8A: Code from FIG. 8 after partial optimization

Simply moving code from the stressed to the relaxed section of routine “Go” would produce erroneous code. Since “RawData” is declared as a “SimData” member variable, consecutive calls to “Go” would keep filling “RawData” with new sets of data before the previous set could be processed. Constructing code that way would be a logical error, but one that could easily be detected by a compiler due to the structural clarity of stress-flow's parallel-processing declarations. A properly constructed partial optimization of the example code is shown in FIG. 8A. Both the “RawData” and “MAData” arrays are now declared as local variables, which means they are located on the separate stacks of every invocation of the “Go” and “StepA” stress-flow atoms. The entire code of “Go” can now be placed in its relaxed section. However, placing all of the “StepA” code in the relaxed section would not be correct: it needs to access the “RawData” array that is allocated on the stack of the “Go” atom. Un-stressing the “Go” atom before a given set of “RawData” was completely processed would allow routine “Go” to finish and its stack to be reused by some other stress-flow atom. For these reasons, all “StepA” code that needs to access “RawData” is placed in the stressed section; the rest is placed in the relaxed section.

 

struct SimData
{  detach void StepA (int iDepth, some_type* RawData)
   {  return;                                 // entire body is relaxed
      some_type* MAData = new some_type[n];   // heap-allocated, garbage-collected
      Calc_MA (MAData, RawData, iDepth);
      Run_Sim (MAData);
   };

   detach void Go()
   {  return;                                 // entire body is relaxed
      some_type* RawData = new some_type[n];  // persists as long as pointers to it exist
      Get_Data (RawData);
      for (int i=1; i<=100; i++ )
         StepA (i, RawData);
   };
};

FIG. 8B: Code from FIG. 8 after full optimization

Full optimization of the discussed code is shown in FIG. 8B. Both “RawData” and “MAData” are now allocated on the heap. It is assumed that a garbage collector is operational, meaning that the memory used by both arrays is released the moment the last pointer storing their addresses gets destroyed. Since the “RawData” array persists on the heap as long as valid pointers to it exist, all of the “StepA” code can now be placed in the relaxed section.

Side Note: Garbage collection was not part of the original C language behavior. It was, however, added later and is available in many C++ compiler implementations. Such garbage-collector-enabled pointers are often called “managed pointers” or “smart pointers” by various authors. Internally, every allocated memory block has a usage counter associated with it. Pointer operations automatically increment the usage counter whenever a new variable points to the block, and decrement it whenever such a variable is destroyed or made to point to another block. When the usage counter reaches zero, the memory block can be safely released.
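For readers who want to see the usage-counter mechanism concretely, it can be approximated in standard C++ (C++17 or later) with std::shared_ptr. This is only an illustrative sketch of the counting behavior described above, not the stress-flow compiler's own pointer implementation; the array size comes from FIG. 8B and the element type and stub routines are assumptions made for the example.

#include <memory>
#include <cstdio>

const int n = 100;
typedef double some_type;                       // placeholder element type (assumption)

void Get_Data(some_type* Dest) { for (int i = 0; i < n; i++) Dest[i] = i; }   // stub
void Run_Sim (some_type* MA)   { std::printf("%f\n", MA[0]); }                // stub

int main()
{
    // Usage counter becomes 1 when the block is allocated.
    std::shared_ptr<some_type[]> RawData(new some_type[n]);
    Get_Data(RawData.get());
    {
        std::shared_ptr<some_type[]> Alias = RawData;   // counter incremented to 2
        Run_Sim(Alias.get());
    }                                                   // Alias destroyed - counter back to 1
    return 0;
}                                                       // RawData destroyed - counter 0, block released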

If the compiler were modified so that each stress-flow atom maintained a pointer/reference counter for anything allocated on the atom's stack, that stack could be automatically kept alive until the usage counter drops to zero. This would allow writing the same code as shown in FIG. 8B, but with the variables kept on the stack rather than on the garbage-collected heap. This solution is actually the most efficient from the standpoint of optimal performance and clean object-oriented design. As such, it will be assumed to be a standard feature of the stress-flow compiler in the remainder of this document.
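A conceptual sketch of the bookkeeping such a modified compiler might generate is shown below. The names “StackFrame” and “frame_ptr” are hypothetical and exist only to illustrate the idea; in practice this record keeping would be emitted by the compiler, not written by hand.

#include <atomic>
#include <cstdlib>

// Hypothetical runtime record for one stress-flow atom invocation.
struct StackFrame
{
    std::atomic<int> usage;        // counts pointers/references into this stack
    void*            memory;       // the atom's stack area

    void add_ref()  { usage.fetch_add(1); }
    void release()                 // stack is reclaimed only when nothing points into it
    {
        if (usage.fetch_sub(1) == 1)
            std::free(memory);
    }
};

// Hypothetical compiler-generated wrapper for a pointer into another atom's stack.
struct frame_ptr
{
    StackFrame* frame;
    void*       data;

    frame_ptr(StackFrame* f, void* d) : frame(f), data(d) { frame->add_ref(); }
    frame_ptr(const frame_ptr& o) : frame(o.frame), data(o.data) { frame->add_ref(); }
    frame_ptr& operator=(const frame_ptr&) = delete;      // kept minimal for the sketch
    ~frame_ptr() { frame->release(); }
};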

Unexpectedly, the organization of data in stress-flow programs becomes far more logical than in non-parallel programs. To allow better parallelism, only data that must persist beyond several directly communicating atoms should be declared as member variables. For example, the object-oriented matrix-processing “Array” object presented in FIG. 5 needed the array declared as a member variable because, at the lower level, many atoms were processing different elements of the array and needed to access it together. However, actual use of the “Array” object should, as much as possible, involve declaring “Array” objects locally or on the heap, to allow better parallelism, as sketched below.
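The following fragment illustrates that guideline in the same pseudo-code notation used by the figures. The “Array” interface shown here (its constructor and the “Load” and “Invert” members) is assumed purely for illustration, since FIG. 5 is not reproduced in this section.

// Each call to Solve gets its own local "Array" object on the atom's stack,
// so many Solve invocations can run their matrix work in parallel.
detach void Solve (some_type* Input)
{  return;                    // relaxed section - runs independently of the caller
   Array  A(n);               // local object: one independent instance per invocation
   A.Load (Input);            // assumed member function, for illustration only
   A.Invert ();               // assumed member function, for illustration only
};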

Following these rules quickly produces code that has all the characteristics and advantages of the data-flow concept. The crucial difference is that with stress-flow it is still possible to have structurally persistent data, not just data that gets passed around as the “flowing” values. The lack of a natural way to represent such structurally persistent data was the biggest problem with most attempted universal data-flow multi-processor computer architectures, and the critical weakness that prevented their wider application.

This trivial example also shows the parallel self-scalability of stress-flow. If Get_Data, Calc_MA, and Run_Sim were all time-consuming operations and routine “Go” were called 100 times, the code would instantly be using almost 200 processors, and the performance boost would be almost 200-fold. If Run_Sim were modified to be a properly constructed stress-flow atom itself, almost 300 processors would be used in parallel.
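One possible shape of such a modified Run_Sim is sketched below in the same pseudo-code notation. It assumes the heap-allocated data of FIG. 8B, so the filtered array outlives the calling StepA atom; this is an illustration, not a prescribed implementation.

struct Simulator
{  detach void Run_Sim (some_type* MA)
   {  return;                 // MA is heap-allocated (FIG. 8B) and persists after StepA
                              // returns, so the whole body can be relaxed and StepA is
                              // freed immediately to continue in parallel
      /* ...time-consuming simulation work on MA... */
   };
};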
