[Company Logo Image]    

Home Feedback Contents Search

5.4 Parallel HW
5.1 Stack Use 5.2 Internals 5.3 Microprocessors 5.4 Parallel HW 5.5 VPP HW 

Back Home Up Next

5.4 Application in Parallel Architectures

The important feature of stress-flow is that, in spite of being intended as extension to existing object-oriented languages, it always relies on a large number of processors (physical or simulated/virtual) to accomplish its object-oriented character. This means that stress-flow offers substantial gains in clarity of software development even if a program developed using it runs on a single processor that emulates many virtual processors. This is the key, crucial element that seemed to be missing from the previously proposed parallel programming methods. In many people’s view, this is the very characteristic that has doomed all previous methods, limiting most practical parallel software development into sequential design in Fortran or C where elaborate compilers tried to parallelize software written as sequential. Specifically proposed parallel methods, on the other hand, would almost always be geared toward one specific architecture, sacrificing universality of written code and extremely complicating debugging process. Stress-flow does not suffer such limitations and it can be said that decisive majority of algorithms correctly coded using stress-flow will run equally well on all universal-purpose computing platforms. Furthermore, it naturally distributes/localizes the memory it uses which allows application on distributed memory systems. It will be shown how stress-flow is implemented on different existing parallel platforms. 

Implementation of stress-flow on shared-memory multi-processor system is most straightforward and leaves least issues to resolve as far as application of stress-flow goes. But the problem with such architecture is that it either uses the same memory bus for all processors or it uses crossbar circuitry for connecting memory with processors. The first solution makes the system bus the hardware bottleneck resulting in situation where only a few processors can be used efficiently. The second solution of using the crossbar is quite expensive and not easy to miniaturize. It results in complex, high-speed, energy-consuming circuitry that connects multiple processors with multiple memory blocks. Stress-flow can run well on shared-memory multi-processor system but its advantages come from the fact that they can run equally well on other multi-processor architectures.

The object-oriented stress-flow programming method of stress-flow results in breaking the problem into a number of stress-flow atoms or parallel mini subtasks. Each of them is clearly defined user of one or more small sets of data and a producer of usually one small set of data. To be able to break a problem into sub-atoms, the data itself has to be organized into subparts that can be worked on independently. A well written stress-flow atom only gets access to the subset of data it actually needs to access and the stressed/relaxed layout of atoms naturally encourages it – properly written stress-flow code is almost always shortest. Practice quickly shows that stress-flow object-oriented programming is often more clear and easier to understand than object-oriented programming without it.

The sets of data processed by a particular stress-flow atom are most often data located on stack or heap of the stress-flow atoms that led to firing current stress flow atom and rarely some more persistent structure created far back in some remote location. For these reasons, the ideal hardware for stress-flow is a large number of small processors configured into an n-dimensional mesh/toroid of processing nodes, each having own processor(s) and its own memory, preferably all on one chip. The node interconnecting circuitry/bus could be much slower than the internal bus thus allowing low cost/low power design thanks to all high speed circuitry being inside individual chips. Large number of slow processors organized in a higher degree mesh (three, four, five-dimensional, etc) thus becomes a far better computing platform than a few very fast processors. Such architecture is very natural from the standpoint of ease and cost of hardware design.

Since stress-flow inherently atomizes all problems into a very large number of mini-tasks running in parallel, increase in number of processors directly translates into performance increase even if both individual processors and their means of interconnection are relatively slow. What is required, however in such a situation is a method of properly assigning newly fired stress flow atoms to processors in a mesh, in order to minimize performance loss due to remote atoms accessing each other’s data. The issue is very important, because it only makes sense to spread newly fired atoms around the mesh as long as delay of accessing remote data is not higher than cost of waiting for processing resources to be available closest to where the data resides.      

To properly assign a newly fired stress-flow atom through localized (non-central) computation of best node to use, we need current load of each processor and cost of node hops required to access each of the non-local memory blocks passed as pointers/ references to the newly fired atom. This computation can be crude, giving only very rough estimation of the best node to run the new stress-flow atom on. Since stress-flow atoms are run as relatively short mini-threads, load situation of nodes changes rapidly. Therefore, key to performance of stress-flow design is atomization of all code into a large mass of mini-threads and being able to instantly reassign scheduled atoms from one node to another – not completely accurate/optimal original choice of node to run a newly fired atom on. Whenever a node becomes free it takes work off the queues of its neighbors if they are busy. Moving a scheduled instance of stress-flow atom is quite easy and, if the actual-parameters-stored-with-lock method is used, it requires moving just one pointer from one node to the other. Empirical experiments and simulations show that roughly accurate estimation of node hoping cost is far more important that estimate of current or projected node load. This has to do with the fact that if the node load is self correcting, while node hoping is not. A node with underestimated load will get more waiting tasks that the neighbors, and the neighbors will either be now getting newly fired atoms or they will quickly finish and take the load of our overloaded node. A node with overestimated load will finish its work and thus will get its load estimate lowered. No such self-regulation is possible as far as optimizing node hoping is concerned. Wrongly assigned stress-flow atoms will push new calculations away from the data that needs to be accessed and the problem will compound itself rather than self-correct.

 

FIG. 30: Subroutine data location concept used for assignment of processing node

For these reasons, it is helpful to have a universal mechanism to properly ascertain penalty associated with accessing remote data. Such mechanism does not have to be accurate. Doing that completely accurately would be time consuming and the cost of it could outweigh benefits. The idea is to have an estimate of where most needed data is located and how much of data is there plus quick method to combine such estimates together into one as roughly shown on FIG. 30. The result location structure A+B is not an exact merging of A and B but rather rough approximation of one data location structure somewhere in the middle between A and B or the most important elements of A and B, depending on architecture.

 

FIG. 30A: Sample assignment and distribution of tasks in mesh architecture

 

FIG. 30B: Sample operations on data locations in mesh architecture of FIG. 30A

In a mesh parallel processor architecture, the best starting candidate for running a newly fired stress-flow atom from the standpoint of least node hopping is a “center of gravity node” of pointers/references passed as parameters to that newly fired atom. If that node’s neighbor offers better performance because of the lower load, it now becomes the candidate and its neighbors are checked. The process would stop when the current candidate was found to be better than any of its neighbors. Such algorithm, combined with natural, non-hardware specific stress-flow design rules therefore allows optimal utilization of any specific architecture. FIG. 30A shows an example of how such algorithm would work for a 8 by 8 mesh of processing nodes. The numbers inside the squares represent the number of tasks waiting for processor (but not a lock) in a particular node. The shaded nodes N(2,2) and N(5,5) show nodes where memory of the parameters for some atom being fired are located. If the amount of data that needs to be accessed in both shaded (parameter data) nodes is the same, we want the fired node to run as close as possible to half distance between the parameter nodes but while taking current load of nodes into consideration. If amount of data to access is considerably more in one of the shaded nodes, we want the newly fired node as close as possible to the node with more parameter data. A sufficient method to calculate the center of gravity in this system is simply the average of the coordinates of nodes weighted by the amount of data to access in each of them. For example, if the amount of data to access in both N(2,2) and N(5,5) was some number m, then the center of gravity would be ((2m+5m)/2m,(2m+5m)/2m) = (3.5,3.5). If the amount of data to access in node N(2,2) was twice as much as in N(5,5), then the center of gravity would be ((2*2m+5m)/3m,(2*2m+5m)/3m) = (3,3). Locations structures for this situation and process of combining them is shown on FIG. 30B. Structure A has 2000 units of memory at (2,2) and structure B has 1000 units of memory at (5,5), therefore, A+B means 3000 units of memory at (3,3). Such location information passed to the node assignment circuitry or routine will try to assign relatively non-busy node as close as possible to node (3,3).

In order to make possible optimal assignment of a processing node to a fired atom in any architecture with distributed memory and to limit penalty associated with accessing remote data, stress-flow provides platform independent means of retrieving and processing location and amount of all accessed data. This is accomplished by means of defining a special “location” function and type as part of the language of stress-flow. The idea behind location type is to provide platform dependent data location and amount information but allowing platform independent way of providing and processing this information according to scheme of FIG. 30. In the two-dimensional mesh described above, the “location” type would be a variable combining x and y node coordinates and some rough amount of data to access at these coordinates. Platform independent calculation of “location” variable is accomplished by means of assignment of pointer or reference to some data, for example:

     char* p = new char[2000];

location loc_p = p;

This code makes sense regardless of how the location data is stored and formatted. Depending on type of data and type of platform, the location may return meaningful data or empty location as indication that cost of access of data is negligible. Any pointer or reference to any globally accessible data already must contain the identification of the node where the data is stored, therefore, the only thing to do is to strip that part and to combine it with information about the amount of data to access there. Statically allocated data are obviously no problem as we simply combine “sizeof” information with masked part of the pointer. For dynamically allocated data it is also as simple all dynamically allocated data already includes a header with platform dependant maintenance info. The user data is located directly below this header and the header out of necessity always includes amount of allocated memory and other info such as usage count for garbage collection purposes. For purposes of processing node assignment it is actually beneficial in most cases to use the amount of currently allocated memory rather than amount actually used at the moment and all that we want and need is a rough estimate.

For structures and objects (classes), the location info needs to be more elaborate. For example, for a structure defining a dynamically allocated string or array, the location and size of the dynamically allocated buffer is obviously much more important than the location of the pointer to it.

     class String

            {  char* m_buffer;

            } s;

            location loc_p = &s;

The location calculated here should obviously contain location of m_buffer as the main component and not location of the place where the pointer to it is stored. A compiler can automatically generate such location function, or it can be written/overwritten by the author or the class. To avoid having to reserve another keyword, the location function can take advantage of the fact that constructors cannot return information and therefore declare the class location function with the same name as the class, except returning the “location” type:

     class String

            {  char* m_buffer;

                location String() { return m_buffer; };    

            } s;

If an object contains pointers to more than one block, the location function needs to combine them in some way, but not necessarily in a way that assigns equal weight to all components. Consider a binary tree class. Assigning the same weight to both current object and the left and right branches would needlessly constrain the tree from distributing itself throughout many nodes. For these reasons, a proper general case “location” function for a binary tree would look like this:

            class Tree

            {  char *left, *right;

                data m_Data;         

                location Tree() { return this+left/2+right/2; };       

            } s;

What the location function in this case says is: return center of gravity of current, left, and right elements, but in a way that treats left and right contribution to have together the same weight as that of the current element – which makes perfect sense for a tree that we want distributed throughout many nodes. Increasing or decreasing the weight of a location expression parameter through multiplication and division operators makes perfect sense because these operators do not need to be used for adjusting/combining coordinates and because the adjustment relative to amount of data being processed is the ideal required effect. A simple scheme simply declared importance of different data components as seen by the designer – all that should be required from standpoint of high-level language design. This scheme has been found to be sufficient method of optimizing parallelism for any parallel architecture. If interconnect cost is fairly small or non-existent (as in case with shared memory system), the generated location info can be ignored or implemented in a way where same weight is returned for all original data pointers. The scheme will still keep all processing in general proximity of the nodes containing data.

 Such crude universal data location scheme would not work with other methods of programming, but it is completely sufficient as far as stress-flow method of programming is concerned due to the way most data is defined and used. Therefore, the programmer does not need to concern himself with details of hardware on which his software will run. The programmer only provides expressions listing importance of various components in interactions with others, and the actual hardware provides means to store and combine this information.

Back Home Up Next
Send mail to info@stressflow.com with questions or comments.
Copyright © 2005-2010. All Rights Reserved. Patents Pending in Multiple Jurisdictions.
First published 03/29/06. Last modified: 06/25/10