[Company Logo Image]    

Home Feedback Contents Search

5.5 VPP HW
5.1 Stack Use 5.2 Internals 5.3 Microprocessors 5.4 Parallel HW 5.5 VPP HW 

Back Home Up

5.5 Application in Virtual Parallel Platforms

High cost of custom parallel-processing computer hardware resulted in fairly significant interest in being able to organize low-cost “commodity” computers into “virtual” parallel processing platform. The commodity computers would most often be connected with relatively slow universal network hardware. Communication between nodes is accomplished explicitly by means of message passing framework, including data interchange between nodes by means of explicit commands. In spite of its limitation, the method was quite successful due to low cost - result of utilization of often unused processing power of many computers connected to a network. The computing structures utilized in such virtual parallel commodity systems were relatively simple, most often computing the same code with different sets of data on many network machines – which happened to be quite sufficient for many common problems requiring massive amount of processing. Programming/communication method used there is rather crude and incapable of implementing truly complex systems where large number of many dissimilar software objects communicates with one another. There were attempts to encapsulate the crude, low-level message processing interface used for programming such architecture in object-oriented programming wrappers, but no universal programming method was proposed for such an architecture, much less a universal programming method compatible with other parallel systems.

In spite of relying on a generic, low level programming technique, the stress-flow programming language and method implement well on virtual parallel processing platforms. The stressed/relaxed programming layout of all elements is sufficient method to describe all common applications of virtual parallel processing platforms, while its unique features (such as redundancy support) allow a lot of new applications. Feasibility of stress-flow for “virtual parallel platforms” is also a very good demonstration of universal characteristics of stress-flow with data “location” functions described in previous chapter.

The most challenging part of any “virtual parallel platform” is lack of any direct means for platform participants accessing each other’s memory which necessitates copying data back and forth in order for parallel processing to take place. If the means of communication between participating computers are relatively slow network interfaces, the overhead of the communication becomes the most important system management issue. The prior-art methods of programming “virtual parallel platforms” performed the copying explicitly. Manual, explicit method of inter-computer communication might offer appearance of being a performance safe, efficient method and is sufficient for some computation intensive tasks. However, this approach totally prevents high-level structural software design concepts and as a universal-purpose computing platform is as efficient as programming complex software systems all in assembly language would be. 

The first necessary step to adapt stress-flow to “virtual parallel platform” is globalization of pointers passed between stress flow atoms to identify data being passed from remote computers (processors with local memory). The details depend on particulars of the used processors, but special far pointers, with locally illegal segment ID, is the most optimal solution in most situations.  

Second step is creating proper location processing implementation appropriate for virtual processing platform corresponding to the generic scheme shown on FIG. 30. One way to implement location representation is simply by making a list of amount of data used on different remote computers. As the purpose of the “location” functions is to determine best node to fire a new atom on, the location information does not have to be complete or precise. The goal is simply to try to assign the new atom to one of the nodes that already have some or most of data needed if it is not overloaded with tasks. It can simply be a fixed length list/array of IDs of nodes containing data most relevant to the newly fired atom together with the total amount of data in it to be accessed by the new atom. Empirical tests have shown that such list being about four elements long is completely sufficient for all practical applications of stress-flow. Combining such location information is fairly straightforward too. If both added “location” structures have elements indicating use of data in the same node, the amounts of data are added up and all the elements combined into the result “location” structure. In case the result structure is too short to keep all the elements, only the elements with the highest amount of data to be accessed are kept. Platform specific code that knows costs of copying data then processes the location information and assigns optimal node based on amount of data needed to be copied versus current load of a node. The necessary copying is then initiated and running in parallel with previously scheduled atoms at the node. This is where the key feature of stress-flow makes it possible to efficiently use it on Virtual Parallel Platforms – for all time consuming calculation the code generated by stress-flow will schedule a lot of atoms together, long before any of them are even run. This results in automatic overlapping of data copying and execution phases – a key advantage without which this universal scheme could not generate efficient code.


FIG. 30C: Sample operations on data locations in a virtual parallel platform

An example of this scheme at work is shown on FIG. 30C. Locations structures A and B contain list of nodes where needed data resides together with the amount of data needed. The list is sorted by amount with largest amount being first. Adding such location structures involves merging two lists, keeping the result sorted by amount as well. When both lists have matching nodes, the amount of data is added together as is the case with Node A of the example. Elements with smaller amount of data that do not fit in the final list are simply discarded. When such combined location structure is now passed to the node assignment routine, an attempt is made to assign the new task to node E or node A because this is where a lot of needed data is already located. If both E and A are so busy that cost of copying 6000 units of data would mean lower delay than waiting for E or A, an attempt is made to assign the new task to node D, and so on.

The method of calculation described above might appear imprecise and crude, but is sufficient to determine location information needed to allocate newly fired atom to a node that will run it. If it was more complex, the overhead needed to calculate location would outweigh its benefits since we always want the individual stress flow atoms to be as short as possible. For all this to work properly, we now have to do one little thing more. Once a node determines that load situation justifies copying some data to another node, from now on the other node’s copied data location has to be returned as location of source data until load situation changes to the point where the original node is loaded less. Once the assignment algorithm decided the current node was so overloaded that it was worth copying the data to another node, we have to try to take advantage of that copying work as much as possible. If we didn’t do that, copying would rarely be worth it, and what’s worse, the original node’s algorithm would try to copy the data to yet another work-free node, and then another, and so on. With this scheme, for most cases, a well tuned node assignment algorithm tries to keep everything in one node/computer due to heavy cost of copying. Once that load gets loaded so much that it justifies copying some data to another node, a significant set of associated data gets copied there and location of it reassigned. Now that node will be getting loaded with all new work until it too gets overloaded. The reversal of the scheme happens if one of the original nodes finishes its work and discovers that it is less loaded than some of the nodes using copies of its data. The “location” function is fixed back to return the older location.  It would be possible, of course, to use more elaborate scheme, one that, say, could remember all copied locations and try to come up with best location set combinations for each atom call, but like almost everything else with stress-flow concept, the simplest scheme is best and most compatible with other platforms. Stress-flow thus becomes an excellent tool to program virtual parallel platforms, compatible with other platform implementations. The only thing that has to be remembered about it is that due to need of copying data, this platform is less forgiving performance-wise if data and parameters are declared in clumsy fashion and, for example, “const” data is not declared as such, forcing unnecessary copying.

Back Home Up
Send mail to info@stressflow.com with questions or comments.
Copyright © 2005-2010. All Rights Reserved. Patents Pending in Multiple Jurisdictions.
First published 03/29/06. Last modified: 06/25/10