[Company Logo Image]    

Home Feedback Contents Search

4.10 Emulation
4.1 Structure 4.2 Compiling 4.3 Example 4.4 Socket 4.5 OOP 4.6 Competing 4.7 Optimization 4.8 Connector 4.9 Merging 4.10 Emulation 4.11 Ordered Calls 4.12 Last-Only 4.13 Group Inputs 4.14 Complex Sys. 4.15 Prior Art... 4.16 Redundant SW 4.17 Visual SW 

Back Home Up Next

4.10 Emulation of Electrical Systems

Stress-flow was developed through years of trial and error attempting to create efficient object-oriented method for real-time signal processing applications. The conceptual problem was somewhat similar to application of visual-programming tools created for instrumentation markets. The key goals, however, were quite different: 1. Having every little node doing all its synchronization work alone and 2. Avoidance of any central supervisory code handling queuing of nodes, scheduling, and synchronization. These requirements were created by essence of application – creating efficient embedded code for monitoring/controlling numerous hardware devices (some of them other processors) at a very high speed. Efficient code meant a cheaper microprocessor that needed to be used in mass production of large number of devices, translating into cost benefits, and commercial success of the undertakings. As such, the solutions created for emulating electrical instruments and their interfaces though using a personal computer were not acceptable here.

A typical simplest function of an embedded signal processing code is to emulate some analog controller. That involves monitoring some inputs and producing one or more outputs based on the values of inputs. The case with single input and single output is completely trivial and can be solved instantly with just about any method. Still very trivial case with two inputs and one output, however, can already show all the challenges associated with the issue.


FIG. 12: An analog circuit to be emulated digitally

Suppose we want to build a digital device that as fast as possible calculates a square root of sum of squares of its inputs as shown on FIG. 12. We have two inputs “a” and “b” and we calculate the square root of the sum of “a” squared and “b” squared: sqrt(a*a+b*b). The triviality of operations performed was selected in order to make the example short to code while explaining various methods of programming it in parallel. If inputs were, for example, matrices, and all the operations extremely time-consuming operations, the templates presented here would be exactly the same. If this were an analog circuit, “a” and “b” could change independently at any time and the output would always instantly reflect the result. But a digital system, by its nature, has to take a finite number of readings per time unit and process each of them one by one. A control-flow solution to the problem would simply involve a free running loop that would read “a”, read “b”, calculate the formula, and output the result. In a trivial case, such a solution would not necessarily a bad one, especially if reading of inputs required explicit action rather than happening automatically. However, big problems would occur when such a solution would be applied to a large system, with a large number of inputs and large number of outputs. First of all, such a system would be incredibly wasteful. The loop would re-process all the inputs together, at the same interval, regardless the fact that only a few inputs would change from sample to sample, with rest of them changing rarely. Second problem was inability to use structural, object-oriented methodology to implement a system based on that method. Ironically, due to lack of a better method, the loop method would often be used even in advanced systems where some of all inputs would come automatically. Rather than use this feature to improve performance, such asynchronous automatic input readings would be latched into a table of input values to be processed by the main loop whenever it was ready to do it. Such a solution was not only incredibly wasteful performance-wise, it would lead to erroneous results if not implemented correctly. Suppose, “a” and “b” changed at the same moment, but their handler wrote only “a” before the main loop fetched both new “a” and an old “b”. Old “b” could have been a value corresponding to an “a” value totally different from the current “a” value. The result would be absolutely wrong. As a result, even such unsophisticated, crude method would require working with hard to use rudimentary synchronization tools (critical sections/semaphores - hardware and/or software) to make the system work. This method would be cumbersome to use even on a single-processor system, and lacking any parallel features, due to the fact that the problem itself had plenty of natural parallelism. For example: Both squares could be calculated at the same time, and the sum and square root could be calculated while squares or new values were already being calculated. This, after all, was the essence of an analog circuit.

This is exactly where the appeal of data-flow concepts came in. If we know when the inputs changed, it should be the fact of changed inputs, not some arbitrarily running control loop that should trigger the recalculation. However, the problem of proper pairing of input data remained. Prior art “virtual instrumentation” market dealt with this issue in one of two ways.

First such prior-art method (used in SoftWire™ and sometimes called triggered data-flow) avoided the data-pairing issue altogether by building the code out of blocks that all had explicit control-in line triggering firing of a node’s calculations and control-out line that would be asserted when the node has finished its calculations. To implement a software instrument emulating our example, the control-in and control-out lines would be added to it. Our instrument’s control-in line would be fed into control-in lines of both square calculations. The moment the instruments’ control-in line was asserted, both “a” and “b” would be latched as input pair for calculation. The square nodes’ control-out lines would go to an AND gate. The output of the gate would go to control-in line of the addition node to initiate addition when both squares were ready.  This was cumbersome, but allowed building a system where the connections handling the need for recalculations could be made after change in a single input parameter.

The second method (used in LabView™) designed the system around requirement that all such related data would come together and that the system would wait until the complete set of data for the node has arrived at its inputs. Again, our entire example would be such a single instrument. The moment both new “a” and new “b” arrived, the instrument would be placed in execution queue firing the calculation of the squares. Once both squares were calculated, this would provide the set of inputs for the addition node, which would place that node in the execution queue and so on. The second method required less effort on part of the user, but would not naturally handle the case with only one parameter changing.

In both prior-art methods, data passing between nodes involves latching it in storage space allocated inside each globally defined instrument, which by definition turns them into so called “static” data-flow machines thus not being able to achieve most parallelism offered by stress-flow where ability exists to run a given node on new data before the previous one has finished processing. Also, the scheduler/supervisor handling all the node dependencies/scheduling in prior art is a very complex, centralized process (see “execution map” in U.S. Patent 5,301,336) further limiting data-flow parallel capabilities. The fact that many nodes/instruments of the graph/program would be often executed sequentially in spite of their remoteness makes a good appearance of parallelism to a human user, but is still far from real low level parallelism that would allow running such an instrument on real multi-processor hardware. This issue was discussed partially in background of this description and will further be dealt with in the next section of the description.


struct calc_inst

{  detach  void dsqrt   (type s)   { result(sqrt(s));    };

   collect void add_a   (type aa)

                add_b   (type bb)  { dsqrt(aa+bb);       };

   detach  void sq1     (type a)   { add_a(a*a);         };   

   detach  void sq2     (type b)   { add_b(b*b);         };                                                                          

   detach  void go (type a,type b) { sq1(a); sq2(b);     };

} aa;


aa.go(10,12);  // example of invocation

FIG. 13: Implementation of circuit in FIG. 12 using stress-flow

Implementation of the device from FIG. 12 using the language of stress-flow is shown on FIG. 13. For now, firing of the whole instrument happens by supplying the new pair of inputs to routine “go.” All routines have all their code in the stressed sections, necessitated by the fact that we need the results outputted at the proper order. If any of the routine passed processing into the relaxed sections, many instances of the same nodes could be running at once. This would mean that sometimes results of calculations on older parameters could arrive after newer ones. In case the “result()” routine involves latching the result to some output hardware we would be latching older data after new ones. We still, however are achieving full parallelism possible for static data-flow machines.


FIG. 13A: Best case execution timeline of code from FIG. 13

FIG. 13A shows best-case execution scenario where unlimited number of processors was available. Due to the fact that all code sq1 and sq2 code is in the stressed section, calculations of square on new data will not begin until previous squares are finished. New data arrives before that time, which means that, by making new calls to sq1/sq2, routine “go” gets itself suspended until previous square calculations finish up. This fact is demonstrated by dotted line representing go routine’s timeline.

There is always a physical limit on time it takes to process individual set of data, which is often called latency of the system. Assuming that individual operations of calculating squares, square root, and addition cannot be broken down into smaller elements and thus be parallelized, the sum of times it takes to run single square operation plus time to run addition plus time to run the square root is the physical limit here -- the lowest latency possible. But our system also limited the frequency at which new data can be fed, in spite having numerous unused processors available. Being able to stall some actions until resources are available is often a desired action. In this case, however, we want to be able to feed and process incoming data as fast as possible.


struct calc_inst

{  static  type out[];

   static  int  last;

   int          curr;


   detach  void dsqrt   (type s)   { out[curr]=sqrt(s);  };

   collect void add_a   (type aa)

                add_b   (type bb)  { dsqrt(aa+bb);       };

   detach  void sq1     (type a)   { add_a(a*a);         };

   detach  void sq2     (type b)   { add_b(b*b);         };

   detach  calc_inst(type a,type b){ curr=last++;


                                     sq1(a); sq2(b);  




new calc_inst(10,12);  // example of invocation

FIG. 14: Another implementation of circuit in. FIG. 12 using stress-flow 

If the goal of calculations was different, and the data, for example, needed to only to be stored in an array, code with dynamic data-flow machine characteristics could be created by method described in the previous chapter. Such code is shown on FIG. 14. All calls to “calc_inst” have their own instance with separate paths/synchronization mechanisms. The paths of different data cannot get mismatched, therefore the calls to sq1 and sq2 can be placed in relaxed section of the routine. Same could be done with all other routines of the object, but this would not really improve performance because all the locks in individual instance of “calc_inst” are free and are used only once.  

FIG. 14A: Best case execution timeline of code from FIG. 14

Due to the fact that “calc_inst” does not stress the caller and does not make him wait will sq1 and sq2 are finished, many input pairs can be fed into the system with minimal delay when a number of processors a waiting for work. This is shown on timeline on FIG. 14A. The overhead of creating an instance and incrementing the table index are the only factors limiting the speed at which the new samples can be fed.

The code from FIG. 14 removed the data feeding speed limitation of FIG. 13 code, but is not really applicable to the problem at hand. When trying to emulate an electrical device, we need to feed data out to either some output hardware or to another software instrument. In both cases it is required that the data is outputted in the right order. In fact, if we could guarantee that the newer data never manages to make it to the end before older data do, the code from FIG. 13 could all be running as “relaxed.” It may appear that the problem would be solved if we created consecutive numbers(tags) or time stamp accompanying every dataset, passed it along to every calculation in the chain, and made sure that at “merging” collection nodes sort queued calls in order. This would roughly be replication of the “tagged” dataflow machine. Not only would it require costly sorting operation of waiting calls in each stress flow atom, but what is far worse, it wouldn’t work in some situations. Imagine we were so unlucky that sq1(a2) and sq2(b2) finished before both sq1(a1) and sq2(b1). Tagging would not help if there was nothing waiting at the “add” construct, the “add” construct would have no information whatsoever that there is something that needs to come through ahead of (a2,b2), and the result of calculations on (a2,b2) pair would still make it through the instrument before (a1,b1) would.

Back Home Up Next
Send mail to info@stressflow.com with questions or comments.
Copyright © 2005-2010. All Rights Reserved. Patents Pending in Multiple Jurisdictions.
First published 03/29/06. Last modified: 06/25/10