The stress-flow method relates to object-oriented parallel computer languages, both script and visual, together with compiler construction, for writing programs to be executed on fully parallel (multi-processor) architectures, virtually parallel systems, and single-processor multitasking computer systems. The proposed method also relates to the architecture and synchronization of multi-processor hardware.
Fundamentally, two drastically opposed methods of constructing computer architectures and programs were known: control-flow and data-flow. In the control-flow method, a program takes the form of a series of instructions to be executed in strict sequence; in the data-flow method, execution occurs when the set of data needed for processing becomes available.
Control-flow is the method used widely in mainstream computing, while the data-flow method has been unable to make its way into mainstream computing. Its application is currently limited to rare custom-built hardware and, sometimes, to serving as the top-level conceptual model for some multi-user and real-time software.
The data-flow method is naturally concurrent, but even with this appeal it was unable to overcome other severe problems, chief among them the fact that data-flow as a low-level software design method does not translate well to common computer algorithms. Numerous data-flow architectures were researched, and they were excellent in providing high-speed computing in a number of specific applications. However, even these very limited applications of data-flow computing exposed numerous problems with the concept.
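The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under assumed names (`Node`, `put`, and `control_flow_sum` are hypothetical and do not come from any particular data-flow system): the control-flow function runs its statements in written order, while the data-flow node fires only once every input has arrived, regardless of arrival order.

```python
class Node:
    """Hypothetical data-flow node: fires as soon as every input slot is filled."""

    def __init__(self, func, n_inputs):
        self.func = func
        self.slots = [None] * n_inputs
        self.filled = [False] * n_inputs
        self.outputs = []  # one entry per firing

    def put(self, index, value):
        self.slots[index] = value
        self.filled[index] = True
        if all(self.filled):
            # All data needed is available: the node fires now.
            self.outputs.append(self.func(*self.slots))
            self.filled = [False] * len(self.slots)


def control_flow_sum(a, b):
    # Control-flow: instructions execute in strict written sequence.
    total = a
    total = total + b
    return total


adder = Node(lambda x, y: x + y, 2)
adder.put(1, 5)                  # only one input present: nothing happens
fired_early = list(adder.outputs)
adder.put(0, 3)                  # last input arrives: the node fires
```

Here the order of the two `put` calls does not matter to the node; only the availability of data does.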
Parallelism in data-flow relies on splitting the problem to be solved (the input token, in data-flow terminology) into many sub-tokens traveling parallel paths. The results of computations performed on sub-tokens traveling parallel paths would then have to gradually merge back together to produce the final result. A key problem with data-flow architecture concerned allowing unrestricted feeding of tokens (input data) into a data-flow system and letting them take the shortest rather than a predefined path (so-called “dynamic data-flow”). Such unrestricted data-flow processing would result in mismatched result tokens arriving at the destination. To solve this problem, numerous methods of tagging and instance numbering of dynamic data-flow tokens were proposed.
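The mismatch problem and the tagging remedy can be illustrated with a small Python sketch. All names here (`pair_untagged`, `TaggedMatcher`, the `(tag, port, value)` token layout) are assumptions made for the example, not a description of any specific data-flow machine. Tokens from two instances arrive out of order at a two-input node; pairing by arrival order mixes instances, while matching on the tag does not.

```python
from collections import defaultdict


def pair_untagged(arrivals):
    """Pair tokens per port in arrival order, ignoring which instance they belong to."""
    ports = defaultdict(list)
    for _tag, port, value in arrivals:
        ports[port].append(value)
    return list(zip(ports[0], ports[1]))


class TaggedMatcher:
    """Hypothetical matching store: a node fires only on tokens carrying the same tag."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.waiting = defaultdict(dict)  # tag -> {port: value}

    def arrive(self, tag, port, value):
        partial = self.waiting[tag]
        partial[port] = value
        if len(partial) == self.n_inputs:
            del self.waiting[tag]
            return tag, tuple(partial[p] for p in range(self.n_inputs))
        return None  # still waiting for the partner token of this instance


# (tag, port, value): instance 1 overtakes instance 0 on port 0.
arrivals = [(1, 0, "a1"), (0, 1, "b0"), (0, 0, "a0"), (1, 1, "b1")]

mismatched = pair_untagged(arrivals)
matcher = TaggedMatcher(2)
matched = [m for m in (matcher.arrive(*t) for t in arrivals) if m is not None]
```

The untagged pairing combines `a1` with `b0`, which belong to different instances; the tagged matcher pairs each token only with its own instance.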
A key problem with concurrency in a control-flow environment centers on simultaneously executing processes that share the same data and on separate control sequences cooperating with one another. Rudimentary non-structural synchronization tools (critical sections, semaphores, signals) for dealing with these issues have been known for a very long time. Many mass-produced processors are equipped with special instructions that allow exclusive, safe access to memory shared by two or more processors. These instructions (called interlocked memory access instructions) allow easy implementation of the rudimentary synchronization tools and are used by all operating systems that support multi-processor use.
Programs using the rudimentary tools are, however, fairly hard to construct and prone to conceptual problems that are hard to see and correct and that lead to deadlocks. A deadlock is a situation in which two or more separate processes all hang (are forever suspended) waiting for resources reserved by other suspended processes.
For these reasons, many structural, object-oriented methods of concurrent process synchronization have been proposed and implemented. For example:
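As an illustration of the rudimentary tools named above, the following Python sketch uses a counting semaphore to limit how many threads may be in a section at once and a lock as a critical section around shared counters. The function name `run_workers` and the bookkeeping fields are hypothetical; Python's `threading` module does not expose interlocked instructions directly, so this shows only the higher-level tools built on them.

```python
import threading


def run_workers(n_threads, limit):
    slots = threading.Semaphore(limit)  # at most `limit` threads inside at once
    guard = threading.Lock()            # critical section for the shared counters
    state = {"inside": 0, "peak": 0, "total": 0}

    def worker():
        with slots:
            with guard:
                state["inside"] += 1
                state["peak"] = max(state["peak"], state["inside"])
            with guard:
                state["total"] += 1
                state["inside"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], state["total"]


peak, total = run_workers(n_threads=8, limit=2)
```

Every access to the shared dictionary has to be wrapped by hand, which is exactly the kind of bookkeeping that makes these tools error-prone in larger programs.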
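The deadlock described above can be reproduced in a small, safe form. In this Python sketch (the names `worker` and `demonstrate_deadlock` are hypothetical), two threads each reserve one lock and then try to reserve the other's. A timeout is used in place of an unbounded wait so that the example terminates; with unbounded waits both threads would be suspended forever.

```python
import threading


def worker(name, first, second, barrier, results):
    with first:
        barrier.wait()                     # both threads now hold their first lock
        got = second.acquire(timeout=0.2)  # bounded wait instead of hanging forever
        results[name] = got
        if got:
            second.release()
        barrier.wait()                     # keep holding `first` until both attempts end


def demonstrate_deadlock():
    lock_a, lock_b = threading.Lock(), threading.Lock()
    barrier = threading.Barrier(2)
    results = {}
    t1 = threading.Thread(target=worker, args=("p1", lock_a, lock_b, barrier, results))
    t2 = threading.Thread(target=worker, args=("p2", lock_b, lock_a, barrier, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


outcome = demonstrate_deadlock()  # neither thread obtains its second lock
```

A common remedy is to have every thread acquire locks in one agreed global order, but nothing in the rudimentary tools themselves enforces such a discipline.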
“Monitors” implemented in Concurrent Pascal define sections of a program as the only elements that may be accessed by more than one process.
“Rendezvous sections” implemented in Ada provide instructions that allow two separate processes to meet at points specified in both of them.
“Named channels” of Occam and similar messaging methods (Concurrent Object-Oriented C) provide special constructs through which to send and receive data between processes running in parallel.
“Shared variables” of QPC++ allow exchanging inter-process information by special data bound to semaphores.
The “separate” designation for routines and data in SCOOP/Eiffel allows specifying routines to be executed as a separate process and data to be used exclusively by one process. The method seems very appealing but fails to address many of the problems. A further mechanism, the “require” block within a separate procedure, allows specifying conditions that must be met for the separate procedure to execute, as a multi-tasking extension of the “design by contract” concept.
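To give a flavor of the structural approaches listed above, the following Python sketch imitates the monitor idea: the shared data is private to one object, and the object's methods are the only code allowed to touch it. This is an approximation written for illustration, not Concurrent Pascal semantics; the class name `BoundedBuffer` and its methods are assumptions.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Monitor-style object: shared state is reachable only through these methods."""

    def __init__(self, capacity):
        self._cond = threading.Condition()
        self._items = deque()
        self._capacity = capacity

    def put(self, item):
        with self._cond:
            # Wait inside the monitor until there is room.
            self._cond.wait_for(lambda: len(self._items) < self._capacity)
            self._items.append(item)
            self._cond.notify_all()

    def get(self):
        with self._cond:
            # Wait inside the monitor until an item is available.
            self._cond.wait_for(lambda: len(self._items) > 0)
            item = self._items.popleft()
            self._cond.notify_all()
            return item


buffer = BoundedBuffer(capacity=2)
producer = threading.Thread(target=lambda: [buffer.put(i) for i in range(10)])
producer.start()
received = [buffer.get() for _ in range(10)]
producer.join()
```

The callers never see a lock, yet the construct covers only one pattern (guarded access to one object) rather than concurrency in general.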
None of the above methods has been widely accepted in mainstream computing, and concurrent programming is still an exception rather than the rule. In spite of the tremendous need for parallel programming support, the most popular languages offer either only rudimentary, non-object-oriented parallel programming support or none at all. In particular, in spite of numerous attempts, the C++ standard committees failed to agree on universal support for parallel programming in C++. None of the proposed methods gained enough support to be accepted as the basis for a standard of parallel programming in C++.
In such a situation, programmers were often forced to find their own ways to implement some parallel programming in C++ and other widely used languages. Many innovations were made in order to do some parallel programming using C++.
All these old methods either elegantly solve only one small subset of concurrent programming needs or propose a concept that is very costly to implement in practice. Monitors, Rendezvous Sections, Named Channels, and Separate Objects all appear to be incomplete solutions, serving only a few needs. SCOOP/Eiffel Require blocks, on the other hand, while conceptually appealing, are costly and impractical to implement because they specify an expression that must be satisfied for a concurrent procedure to begin executing. This requires some method of reevaluating the expression each time the source conditions might have changed enough to merit starting the execution of the object containing the “require” block.
Purely control-flow programs result in cumbersome behavior and many unacceptable features. They cannot naturally share resources and cannot easily cooperate with each other. Due to these problems and the pervasive lack of a universal multi-thread and multi-process parallel programming method, various workarounds were designed to eliminate some of the bad characteristics of the control-flow programming model. These included message-based operating system interfaces and some visual programming.
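The reevaluation cost mentioned above can be made concrete with a toy Python model. This is not how SCOOP/Eiffel is implemented; `GuardedScheduler` and its methods are hypothetical and exist only to count how often a guard expression must be rechecked. The simplest strategy is assumed here: every state change, related or not, triggers a recheck of every pending guard.

```python
class GuardedScheduler:
    """Toy model: procedures wait until their guard expression becomes true."""

    def __init__(self):
        self.state = {}
        self.pending = []     # (condition, action) pairs not yet started
        self.evaluations = 0  # how many times any guard was evaluated

    def submit(self, condition, action):
        self.pending.append((condition, action))
        self._recheck()

    def update(self, key, value):
        self.state[key] = value
        self._recheck()       # any change might have satisfied any guard

    def _recheck(self):
        # Note: actions in this sketch must not call update() themselves.
        still_waiting = []
        for condition, action in self.pending:
            self.evaluations += 1
            if condition(self.state):
                action(self.state)
            else:
                still_waiting.append((condition, action))
        self.pending = still_waiting


log = []
sched = GuardedScheduler()
sched.submit(lambda s: s.get("buffer_len", 0) >= 3, lambda s: log.append("consume"))
sched.update("unrelated", 0)   # irrelevant change still forces a recheck
for n in (1, 2, 3):
    sched.update("buffer_len", n)
```

One procedure started once, yet its guard was evaluated five times; with many pending procedures and frequent state changes the overhead grows accordingly.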
In order to simplify software development, a lot of common work was shifted to supervisory programs: the operating systems. A key part of an operating system is its “kernel”, the central program that manages the key hardware resources and lets other programs use them through a common interface. At first, operating systems simply provided the means for several programs to share the resources, including the processor. This meant being able to run many programs at once by constantly switching processor ownership: executing a portion of one program and then switching to the next one. Later, a messaging system was added to handle certain functions, especially the user interface in a “windowed” environment. The roles were somewhat reversed. Rather than individual programs calling the operating system to provide the user interface, the operating system would call the user programs with messages to process. This scheme solved some problems inherent in control-flow programming by a method that bears a resemblance to some data-flow concepts. At least the user interface was now driven by events and new data rather than by looping while waiting for new data. These messaging features allowed a fairly good appearance of multi-tasking and data-flow. Multiple program elements such as windows could be serviced virtually simultaneously, individual programs would not waste the processor while looping for input, and so on.
Messaging methods provided a very good emulation of parallel processing for the most popular computer uses. They also allowed running many independent programs simultaneously, sharing all the available hardware. However, this was by no means the true low-level parallelism sought after. Individual programs would most often still be single threads processing messages received through a single entry point. If actual, true parallelism or multi-tasking was desired for better performance, additional “threads” would have to be created by hand, and the rudimentary synchronization tools would again be used to allow safe sharing of data.
To simplify the software development process, numerous visual programming tools have been proposed and developed. The “flow-charting” methods, which simply represent regular script-type instructions through graphics, did not really offer any practical advantages over script programming. More advanced visual programming tools based on some of the data-flow concepts have found much wider application, particularly in instrumentation markets. Prior to the appearance of such tools, the users of computer-based instrumentation were forced to convert essentially parallel, data-flow type concepts (such as connecting sources of voltages to displays, or switches to control lights) into control-flow code that is extremely unnatural for such tasks.
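The reversal of roles described above, where the system calls the program with messages, can be sketched as a minimal Python message loop. The names (`MessageLoop`, `register`, `post`, `run`) are hypothetical and do not correspond to any real operating system API; a real loop would also block while waiting for the next message rather than return when the queue is empty.

```python
from collections import deque


class MessageLoop:
    """Toy message dispatcher: handlers are called by the loop, not the reverse."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}  # window name -> callback

    def register(self, window, handler):
        self.handlers[window] = handler

    def post(self, window, message):
        self.queue.append((window, message))

    def run(self):
        # One thread services every window, one message at a time.
        while self.queue:
            window, message = self.queue.popleft()
            self.handlers[window](message)


trace = []
loop = MessageLoop()
loop.register("win1", lambda m: trace.append(("win1", m)))
loop.register("win2", lambda m: trace.append(("win2", m)))
loop.post("win1", "paint")
loop.post("win2", "key")
loop.post("win1", "click")
loop.run()
```

Both windows appear to be serviced together, yet every handler runs sequentially on the single thread that executes `run`.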
Two kinds of such partially dataflow-based instrumentation programming tools have been developed. Some of them (like SoftWIRE™) allow the user to compose their applications out of “controls” - rudimentary functional building blocks where each block’s action is triggered explicitly. Asserting a control’s “control-in” input triggers a control’s action. Once a control has finished its processing, it triggers its “control-out” output which can be connected to the next control’s “control-in” to continue such explicitly designed data-flow.
National Instruments’ LabVIEW™ “virtual instruments” system is another such tool and is the subject of several patents. The working model here is somewhat closer to the commonly understood data-flow concept, as processing happens when a complete set of new data is available on the inputs of a node.
By emulating a data-flow interface, these concepts and systems do offer the user some degree of multi-tasking or, more precisely, a good appearance of it. The success of these systems shows the tremendous need for parallel, non-control-flow programming tools.
Internally, the emulation of data-flow in these systems is pretty straightforward. As the data gets updated in various parts of the user-designed program graph, this triggers new graph nodes to be updated, often in very remote locations. The update requests get queued and executed sequentially, but for most of these systems’ applications this passes as good enough parallelism. This method is very similar to the messaging system used by operating systems for user interface.
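The queued-update scheme described above can be approximated in a short Python sketch. It is a simplified model written for illustration, not the implementation of any of the products mentioned; `Emulator`, `add_node`, `set`, and `run` are hypothetical names. Changing a value queues the dependent nodes, and a single loop executes the queued requests one after another.

```python
from collections import deque


class Emulator:
    """Toy data-flow emulator with one central, sequentially served update queue."""

    def __init__(self):
        self.values = {}
        self.nodes = {}      # node name -> (function, list of input names)
        self.listeners = {}  # value name -> names of nodes that read it
        self.queue = deque()
        self.order = []      # order in which nodes actually executed

    def add_node(self, name, func, inputs):
        self.nodes[name] = (func, inputs)
        for source in inputs:
            self.listeners.setdefault(source, []).append(name)

    def set(self, name, value):
        self.values[name] = value
        for node in self.listeners.get(name, []):
            self.queue.append(node)  # request an update of each dependent node

    def run(self):
        while self.queue:
            name = self.queue.popleft()
            func, inputs = self.nodes[name]
            if all(i in self.values for i in inputs):
                self.order.append(name)
                self.set(name, func(*(self.values[i] for i in inputs)))


emu = Emulator()
emu.add_node("sum", lambda a, b: a + b, ["a", "b"])
emu.add_node("double", lambda s: 2 * s, ["sum"])
emu.add_node("display", lambda d: f"value={d}", ["double"])
emu.set("a", 1)
emu.run()        # "sum" is requested but cannot fire yet: "b" is missing
emu.set("b", 2)
emu.run()        # now the update ripples through the graph, one node at a time
```

The graph looks parallel, but `run` is an ordinary sequential loop, which is why such a design does not by itself use more than one processor.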
Originally, the entire data-flow emulator (which could be considered the centralized operating system in this case) would run as a single thread, which by nature eliminated all the synchronization and data-sharing headaches of true parallelism. As the systems became more popular and performance demands harsher, the emulator was split into several threads handling tasks and update requests grouped by their nature (for example, user interface, instrument I/O, standard code). Later, to further meet growing performance needs, user-controlled multi-threading and synchronous multi-processing support was added. This opened the old can of worms of users, once again, having to create a few threads by hand and code the crude rudimentary synchronization tools (critical sections, semaphores) to avoid race conditions and corruption of data shared by several threads.
The necessity for the user to assign work to separate threads and the need to use the rudimentary synchronization tools substantially negate the true data-flow concept and all its advantages. However, the limitation of such near-data-flow visual programming was not so much the visual programming concept itself (which is fairly universal) as the way it was implemented internally through control-flow, non-parallel code. A single visually designed program could not naturally run on more than one processor, and multi-processor use would result in the need for explicit rudimentary control tools. Once again, the lack of a low-level, universal programming method that is multi-tasking at its core and quintessentially multi-processor was the chief culprit here.
Prior-art visual programming tools created mainly for the instrumentation market (LabVIEW™, SoftWIRE™) must be addressed here in more detail because they tend to make the very unfortunate claim that merely by being able to create parallel-wire-like diagrams, full possible parallelism or data-flow processing can be described and achieved. If this claim were even remotely true, it would make the proposed method completely unnecessary. However, the claim is either completely false or grossly imprecise, as can be seen by studying the actual implementation details of these systems. First, the centralized supervisory software that these systems use to queue and execute fired nodes prevents this technique from being a universal programming method for constructing, say, operating systems, databases, or device drivers. Second, contrary to the often-repeated “hassle-free parallelism” claims made by these systems, the parallelism achieved there is by no means the actual parallelism seen in, for example, a data-flow computer where nodes are actual separate pieces of electronic hardware. Most of the time, the parallelism offered there is an illusion achieved by complex centralized supervisory software sequentially executing nodes fired at distant parts of the program graph. This is good enough for the specific application in the instrumentation market but is by no means the actual parallelism sought by universal prior-art programming tools. Some two-processor parallelism was achieved there at great effort, by expansions of the centralized supervisory software, but even then the parallelism offered cannot happen in most cases without the user modifying his graphically designed software. Third, the existence of any centralized queue or supervisory software prevents fully auto-scalable parallel execution on many processors.
The above points can clearly be seen in application notes describing methods to accomplish (some) multi-tasking in, for example, the prior-art LabVIEW™ system. National Instruments Application Note 114, “Using LabVIEW™ to Create Multithreaded VIs for Maximum Performance and Reliability,” describes the steps necessary to accomplish limited parallel performance with this prior-art system. To begin with, the application note concerns itself with creating two or more “virtual instruments” to be run in parallel. This already goes against the stated goals of actual parallel programming, where the entire code would naturally be parallel, with many pieces executing in parallel, and where breaking it into several logical parts would not improve performance. On page 5, the application note describes various central “execution systems” that make the execution of various elements appear parallel, and the ways to direct the execution of a specific instrument to the proper execution system. On pages 10 through 12, it describes steps that need to be taken to prevent “race conditions” from corrupting data. The methods offered include global variables that are only changed in one place, “Functional Global Variables,” and semaphores. This brings back the already discussed specter of hard-to-use, non-object-oriented “rudimentary synchronization” tools, which further shows that this prior-art system is by no means the parallel programming tool sought after. In fact, by most definitions such prior-art systems should not be considered parallel programming tools at all, any more than, say, the standard C or C++ language could be considered as such. Just as manually coded limited parallelism is possible in C and C++ at extra effort and by using the rudimentary synchronization tools, very similar limited parallelism can be achieved in these prior-art instrumentation-market tools.
Another National Instruments document, Application Note 199, “LabVIEW™ and Hyper-Threading,” shows a “Primes Parallelism Example” on page 2. Stating that dataflow order forces mandatory waits for every input in a loop, it claims that the only way to make “dataflow” code able to execute on more than one processor is to split it into two loops, odd and even, as shown in the modified diagram on page 3. This claim is either patently false or at least very imprecise, since it uses the fairly standard term “data-flow” to mean something that has very little to do with data-flow as defined in computer science literature. Even if we assume it was meant that LabVIEW™ implements a “static data-flow” machine, where a single node cannot be fired again until it has processed the previous firing, the claim still does not make much sense. In any data-flow machine as understood by the computer science literature that coined the term, the various nodes of the machine work simultaneously. This means that if we have a data-flow graph consisting of consecutive parts A and B, then as soon as A finishes work on input dataset 0, it should pass the result to B and be able to start processing input dataset 1. A system that does not do that probably should not be considered a data-flow system capable of parallelism. Forcing the user to split the problem into odd and even loops to take advantage of two processors clearly shows that the prior-art LabVIEW™ system does not even begin to deal with the issues addressed by the stress-flow method, shows the conceptual limitations of the centralized supervisory node-queuing execution system used there, and proves the tremendous need for the proposed method. One of the goals of the stress-flow method was to provide universal low-level tools to allow, among other things, replicating static and dynamic data-flow algorithms executing in parallel on non-data-flow hardware.
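The A-then-B behavior described above corresponds to an ordinary two-stage pipeline. The Python sketch below is an illustration only, not the stress-flow method and not LabVIEW™ internals; `stage`, `run_pipeline`, and `_DONE` are hypothetical names. Each stage runs in its own thread, so stage A is free to begin dataset 1 as soon as it has handed dataset 0 to stage B. In standard CPython the global interpreter lock limits how much pure-Python computation truly overlaps, so the sketch demonstrates the structure rather than a speedup.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def stage(func, inbox, outbox):
    """Repeatedly take one item, process it, and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(datasets, part_a, part_b):
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(part_a, q_in, q_mid)),
        threading.Thread(target=stage, args=(part_b, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for d in datasets:
        q_in.put(d)
    q_in.put(_DONE)

    results = []
    while True:
        item = q_out.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


results = run_pipeline([0, 1, 2], part_a=lambda x: x + 1, part_b=lambda x: x * 10)
```

No splitting of the input into odd and even halves is needed; the consecutive parts themselves provide the parallelism, and the single worker per stage keeps results in input order.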
In spite of tremendous need for it, parallel programming remains a black art which is only used where absolutely necessary. True multi-processor parallel programming is only used for very specific, chosen time-consuming applications running on very costly and relatively rare hardware.
Most computers used in mainstream computing still have one processor executing user programs. Multi-processor server and workstation computers are available, but their use mostly amounts to several separate processes sharing two processors instead of one. The rare applications that take advantage of two or more processors at once do so only for very specific time-consuming tasks, and the code for this is almost always written using the non-structural rudimentary control tools or fundamentally non-object-oriented messaging systems.
The problem with the small use of parallel architectures is not with electronics. From the standpoint of the electronics art, there is absolutely no obstacle to building, for example, a computer in which a small processor accompanies each small chunk of RAM. The problem is that we simply still do not have a universal-purpose methodology for describing desired parallelism in general and for programming such architectures with a plurality of processors in particular.
To make computing faster, a tremendous effort is made to make series of instructions of software conceptually written for single processor somehow run in parallel. Modern processors try to pre-fetch data and code, guess forward, cache data, all in order to partially parallelize software written as non-parallel. This results in extremely complex circuitry using a lot of energy and dissipating a lot of heat, which is the direct result of most data having to go through “narrow throat” of a single processor and single high-speed bus connecting the processor with memory.
Multi-processor architecture, if it could easily be programmed in a natural, self-scaling fashion, would solve all these problems. It would be cheaper and consume far less energy, and there would be no physical limits on performance, as processors could be added the same way users today expand the amount of RAM in their computers. Simply observing nature proves beyond any doubt that we are only beginning to understand parallel information processing. Our huge supercomputers, wasting kilowatts of energy, still cannot replicate the image recognition, processing, and storing capabilities of a tiny honey bee, for example.