Fundamentals of parallel programming in DataRush

Part 1 of a recent article at Embedded.com explores fundamental concepts and techniques for parallel programming. The DataRush framework provides mechanisms for avoiding or solving many of the issues and challenges the article raises.

Decomposition

The article points out that standard single-threaded designs must be decomposed into dependent, interacting tasks in order to exploit parallelism. Decomposition typically falls along the lines of task, data, or data flow.

Task decomposition requires the developer to explicitly schedule simultaneous operations and coordinate their interactions. DataRush can support this style of parallelism simply by combining assemblies: the overall dataflow network need not be connected. Interaction between the assemblies requires connecting them via ports, which are unidirectional FIFO queues. This limited model of interaction may not support some task decompositions, but it does give the developer a simple means of coordination guaranteed to be free of deadlocks and other concurrency issues.

Data decomposition parcels the same operation out across different blocks of data. In DataRush, horizontal partitioning achieves exactly this: all the developer must do to exploit this technique is add the RoundRobinPartition operator to the dataflow. If you need to combine the results of the concurrent operations, the standard library provides RoundRobinUnpartition.

Data flow decomposition connects tasks in exactly the same way DataRush processes are composed. Of all the decompositions, this is the one most readily achieved under the framework. It is also the technique the article explores in the most depth, cautioning against several pitfalls of the producer/consumer problem:

  1. The consumer may idle awaiting the producer
  2. The producer and consumer are not cleanly decoupled, so careful planning must go into orchestrating their interaction
  3. The producer may idle while the consumer conducts its operation, particularly once the producer is finished creating data

The DataRush framework circumvents all of these issues, in most cases requiring no effort on the part of the developer. All dataflow queues transport data tokens in batches rather than individually. This coarsens the granularity of data flow and processing, which tends to reduce the cost of communication between operators. If enough data is involved, the consumer can begin processing the first batches while the producer is still emitting later ones, so production and consumption overlap instead of running one after the other. The consumer therefore waits only as long as it takes to produce a single batch; likewise, once the producer is finished, it is idle only while the final batch passes through the network. Further, the batch size is tunable by the developer, so trouble spots in the network can be easily adjusted. The interface between producer and consumer is strictly governed by the framework, so stages in the pipeline cannot be inappropriately coupled.
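To make the batching idea concrete, here is a minimal sketch in plain Java. It does not use the DataRush API: the class name, the END marker, and the fixed queue capacity are all assumptions made for illustration, and a bounded java.util.concurrent queue stands in for the framework's own queues. The producer hands over whole batches, so the consumer starts work as soon as the first batch arrives, and batchSize plays the role of the tunable batch size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: plain java.util.concurrent, NOT the DataRush API.
class BatchedPipelineSketch {
    // Hypothetical end-of-stream marker, recognized by identity.
    private static final List<Integer> END = new ArrayList<>();

    /** Sums 1..count by streaming the values to a consumer in batches. */
    static long sumThroughPipeline(int count, int batchSize) {
        // Assumption: a small fixed capacity; the queue holds batches, not single tokens.
        BlockingQueue<List<Integer>> queue = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                List<Integer> batch = new ArrayList<>(batchSize);
                for (int i = 1; i <= count; i++) {
                    batch.add(i);
                    if (batch.size() == batchSize) {
                        queue.put(batch);          // blocks while the queue is full
                        batch = new ArrayList<>(batchSize);
                    }
                }
                if (!batch.isEmpty()) {
                    queue.put(batch);              // final, possibly partial, batch
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // The consumer runs on the calling thread and starts on the first batch
        // while the producer is still filling later ones.
        long sum = 0;
        try {
            for (List<Integer> batch = queue.take(); batch != END; batch = queue.take()) {
                for (int value : batch) {
                    sum += value;
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumThroughPipeline(1000, 64)); // prints 500500
    }
}
```

A larger batchSize means fewer queue operations per token but a longer wait before the consumer sees its first batch; a smaller one reverses the trade.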

Challenges

The article also lists a number of challenges generally encountered by developers writing parallel programs:

  1. Synchronization
  2. Communication
  3. Load balancing
  4. Scalability

Of these, DataRush handles synchronization and communication automatically, and addresses scalability through property configurations in the standard library and customizers in custom operators. Load balancing is left to the developer. Synchronization is implicit in the stepNext() and push() methods of input and output ports. Coordination between operators becomes trivial using this simple interface and DRXML to specify connections. For communication, as mentioned above, DataRush implements an efficient mechanism for transporting batches of data tokens between operators. The framework simplifies exchanging data between simultaneously operating processes: the developer sees only an interface for iterating through a process's inputs and pushing to its outputs. Customizers facilitate scalability by allowing assemblies to poll the number of available processors at run time and adjust the structure of the dataflow graph accordingly. For example, Lookup partitions key lookups by the number of available processors by default; ReadDelimitedText horizontally partitions the decoding of the incoming character stream similarly.
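The idea of sizing a horizontal partition to the machine can be sketched in plain Java. This is not how DataRush customizers or its RoundRobinPartition and RoundRobinUnpartition operators are implemented; the class and method names below are hypothetical, and the only real API calls are standard JDK ones. The sketch asks the runtime how many processors are available, deals the input round-robin into that many parts, applies the same operation to each part on its own thread, and then interleaves the results back into the original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

// Illustrative only: plain JDK code, NOT the DataRush operators of similar name.
class RoundRobinSketch {
    /** Deals items into `parts` lists: item i goes to list i % parts. */
    static List<List<Integer>> partition(List<Integer> items, int parts) {
        List<List<Integer>> out = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            out.get(i % parts).add(items.get(i));
        }
        return out;
    }

    /** Inverse of partition: takes one item from each list in turn. */
    static List<Integer> unpartition(List<List<Integer>> parts) {
        int total = 0;
        for (List<Integer> part : parts) {
            total += part.size();
        }
        List<Integer> out = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            out.add(parts.get(i % parts.size()).get(i / parts.size()));
        }
        return out;
    }

    /** Applies op to every item, one partition per available processor. */
    static List<Integer> mapInParallel(List<Integer> items, IntUnaryOperator op) {
        int parts = Runtime.getRuntime().availableProcessors(); // polled at run time
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (List<Integer> part : partition(items, parts)) {
                Callable<List<Integer>> task = () -> {
                    List<Integer> result = new ArrayList<>(part.size());
                    for (int value : part) {
                        result.add(op.applyAsInt(value));
                    }
                    return result;
                };
                futures.add(pool.submit(task));
            }
            List<List<Integer>> results = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                results.add(future.get());
            }
            return unpartition(results);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a partition", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a partition failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(mapInParallel(Arrays.asList(1, 2, 3, 4, 5), x -> x * x));
    }
}
```

Because the partition count comes from the runtime rather than from the source, the same code spreads over more threads on a larger machine without being edited.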

Load balancing in the sense discussed in the article is still an issue in DataRush. The article's idea of a balanced load is spreading the work to be done evenly among threads. The real concern is keeping all available threads busy so nondeterministic scheduling will effectively utilize the machine. DataRush allocates a thread for each process in your network. Just as you must develop your threaded program such that each thread gets a fair share of the work, you must design your dataflow graph so that each process does a comparable amount of work. The framework does, however, ensure your application will not deadlock so long as there is memory available to dynamically expand queues.
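A rough back-of-the-envelope model shows why balance matters when every process gets its own thread. The sketch below is a simplification and not a DataRush facility: it assumes a linear pipeline, one thread and one free core per stage, a fixed cost per item at each stage, and no queueing overhead. Under those assumptions the slowest stage sets the steady-state throughput, and every other stage is busy only for the fraction of time given by its cost divided by the slowest cost.

```java
// Illustrative model only; the class and method names are hypothetical.
class PipelineBalanceSketch {
    /** Largest per-item cost among the stages, in seconds. */
    static double slowest(double[] secondsPerItem) {
        double max = 0.0;
        for (double cost : secondsPerItem) {
            max = Math.max(max, cost);
        }
        return max;
    }

    /** Steady-state items per second: the slowest stage is the bottleneck. */
    static double throughput(double[] secondsPerItem) {
        return 1.0 / slowest(secondsPerItem);
    }

    /** Fraction of time each stage's thread is busy in steady state. */
    static double[] utilization(double[] secondsPerItem) {
        double max = slowest(secondsPerItem);
        double[] busy = new double[secondsPerItem.length];
        for (int i = 0; i < busy.length; i++) {
            busy[i] = secondsPerItem[i] / max;
        }
        return busy;
    }

    public static void main(String[] args) {
        double[] unbalanced = {1.0, 4.0, 1.0}; // middle stage does most of the work
        double[] balanced = {2.0, 2.0, 2.0};   // same total work, spread evenly
        System.out.println(throughput(unbalanced)); // prints 0.25
        System.out.println(throughput(balanced));   // prints 0.5
    }
}
```

Both pipelines do six seconds of work per item in total, but in this model the balanced one delivers twice the throughput because none of its threads sits idle waiting on a bottleneck.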

Conclusion

The DataRush framework greatly facilitates parallel programming by addressing many fundamental issues. Not all applications can be recast as dataflow, but DataRush puts no granularity restriction on custom processes. You could have entire threaded applications run within a single process, though typically a dataflow programmer strives to balance the work done by each process in the network to more effectively utilize the underlying machine. DataRush allows programmers to express applications at a higher level of abstraction, freeing them from dealing with many of the common issues in parallel programming.
