Beyond Algorithms, Data Mining in the Real World
Data mining is the data-driven discovery and modeling of hidden patterns in large volumes of data. In the real world, data volumes have grown so vast that traditional data mining algorithms are at best expensive to run and at worst unusable because they run out of memory. The overwhelming challenge in data mining today is scaling with both the size and the dimensionality of the data. Pervasive Data Mining Solutions are built on Pervasive DataRush, which uses dataflow, a computational model particularly well suited to concurrency and to data-intensive, high-performance computing. In addition to offering the potential to scale to problems larger than the heap (memory) would otherwise permit, dataflow graphs exploit multiple forms of parallelism.
Pervasive Parallelism in Data Mining: the Pervasive DataRush Computational Model
In DataRush, data is transformed into dataflows. Programs are arranged as a set of processes that communicate only through unidirectional ports, which act as FIFO queues. A process accepts data from dataflows on its input ports, computes results from that data, and pushes the results onto its output ports. Because the processes share no state, they can operate concurrently, allowing dataflow applications to take advantage of multiple processor cores. Further, the process developer need not be concerned with threads, deadlock detection, starvation, or concurrent memory access, since parallel scheduling and synchronization are handled outside the process.
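The port model described above can be approximated with plain Java concurrency utilities. The following is a minimal sketch, not DataRush code: it assumes each port is a bounded `java.util.concurrent.BlockingQueue`, and the names `PortSketch`, `Body`, `process`, and the `EOS` end-of-stream marker are hypothetical. In this sketch the thread handling is written by hand, which is exactly the part a dataflow framework would take over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PortSketch {
    // Hypothetical end-of-stream marker; not part of any DataRush API.
    static final long EOS = Long.MIN_VALUE;

    // A process body that may block on its ports.
    interface Body { void run() throws InterruptedException; }

    // Wraps a body in a thread; a framework would normally do this for you.
    static Thread process(Body body) {
        return new Thread(() -> {
            try {
                body.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Runs source -> square -> sink and returns what the sink collected.
    static List<Long> run(int n) {
        // Bounded FIFO queues stand in for unidirectional ports.
        BlockingQueue<Long> in = new ArrayBlockingQueue<>(4);
        BlockingQueue<Long> out = new ArrayBlockingQueue<>(4);
        List<Long> collected = new ArrayList<>();

        Thread[] processes = {
            // Source: pushes 1..n onto its output port, then the marker.
            process(() -> {
                for (long i = 1; i <= n; i++) in.put(i);
                in.put(EOS);
            }),
            // Transform: reads one port, writes another, shares no state.
            process(() -> {
                for (long v = in.take(); v != EOS; v = in.take()) out.put(v * v);
                out.put(EOS);
            }),
            // Sink: the only process that touches the result list.
            process(() -> {
                for (long v = out.take(); v != EOS; v = out.take()) collected.add(v);
            })
        };

        for (Thread p : processes) p.start();
        try {
            for (Thread p : processes) p.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(5)); // prints [1, 4, 9, 16, 25]
    }
}
```

Because each queue holds at most four items, a fast producer blocks until its consumer catches up, so memory use stays bounded regardless of how many values flow through.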
The Essence of Dataflow Programming
The essence of dataflow programming is the concept of execution as the streaming of data through a graph. Because the data is streamed, only the data required by the currently active operations needs to be in memory at any given time, allowing very large data sets to be analyzed. Besides offering the potential to scale to problems larger than the heap would otherwise permit, dataflow graphs exploit multiple forms of parallelism. By its very nature, a dataflow graph exhibits pipeline parallelism: if each operator generates output incrementally, dependent operators can execute simultaneously, just a few steps behind. Also, if the results of an operator are independent for each piece of data, the operator can be replaced with multiple copies, each receiving a portion of the original input. This is called horizontal partitioning. Finally, the output of an operator might undergo several independent sets of processing whose results are later merged as input to another operator (this is most prevalent with record data). The different branches can execute in parallel; this is vertical parallelism.
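Horizontal partitioning can be illustrated with standard Java executors. This is a sketch under stated assumptions rather than how DataRush implements it: the class `HorizontalPartitionSketch` and the method `applyPartitioned` are hypothetical names, the input is assumed to fit in a list for brevity, and the operator is assumed to produce a result for each item independently of all other items.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class HorizontalPartitionSketch {
    // Runs `copies` instances of the same operator, each on its own slice
    // of the input, then merges the partial results in input order.
    // `copies` must be at least 1.
    static <T, R> List<R> applyPartitioned(List<T> input, Function<T, R> op, int copies) {
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        try {
            // Ceiling division so every item lands in some slice.
            int chunk = (input.size() + copies - 1) / copies;
            List<Future<List<R>>> parts = new ArrayList<>();
            for (int start = 0; start < input.size(); start += chunk) {
                List<T> slice = input.subList(start, Math.min(start + chunk, input.size()));
                // Each task is one copy of the operator working on one slice.
                parts.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T item : slice) out.add(op.apply(item));
                    return out;
                }));
            }
            List<R> merged = new ArrayList<>();
            for (Future<List<R>> part : parts) merged.addAll(part.get());
            return merged;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> squares = applyPartitioned(Arrays.asList(1, 2, 3, 4, 5), x -> x * x, 2);
        System.out.println(squares); // prints [1, 4, 9, 16, 25]
    }
}
```

The merge step preserves input order only because the partial results are collected in the order the slices were submitted; an operator whose output depends on neighboring items would not be safe to partition this way.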
Pervasive DataRush™ is a library and dataflow engine for constructing and executing dataflow graphs in Java. All threading and synchronization are handled by the framework, since data is shared only through inputs and outputs. An operator is written by extending an existing class in the framework, and a library of common operators is already implemented as part of Pervasive DataRush alongside the dataflow engine. A dataflow graph is composed by adding operators to it. Operators require their input sources in order to be constructed, so outputs are wired to inputs as you build the graph. Once you have finished composing, you invoke the run method and the graph begins execution. Because this is all done in Java, composition can be made conditional on pre-execution processing. Scalability refers to the ability to speed up an application linearly with the resources available, so the most common such adjustment is to tailor the graph to the number of available processors.
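The composition style described above, with inputs supplied at construction time and the graph shaped by the processor count, can be mimicked in a few lines of plain Java. This sketch deliberately does not use the DataRush API: `CompositionSketch`, `Node`, and `compose` are hypothetical stand-ins, and the graph is only built and printed, not executed. The one real call it relies on is `Runtime.getRuntime().availableProcessors()` from the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CompositionSketch {
    // Hypothetical stand-in for an operator; this is NOT a DataRush class.
    static final class Node {
        final String name;
        final List<Node> inputs;

        // Inputs are constructor arguments, so wiring happens as nodes are created.
        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Composes reader -> N parallel workers -> merge, with N chosen before execution.
    static List<Node> compose(int processors) {
        List<Node> graph = new ArrayList<>();
        Node reader = new Node("reader");
        graph.add(reader);

        // One worker per processor, each wired to the reader's output.
        Node[] workers = new Node[processors];
        for (int i = 0; i < processors; i++) {
            workers[i] = new Node("worker-" + i, reader);
            graph.add(workers[i]);
        }

        // The merge node takes every worker as an input.
        graph.add(new Node("merge", workers));
        return graph;
    }

    public static void main(String[] args) {
        // Pre-execution processing: size the graph to the machine it runs on.
        int processors = Runtime.getRuntime().availableProcessors();
        for (Node node : compose(processors)) {
            System.out.println(node.name + " <- " + node.inputs.size() + " input(s)");
        }
    }
}
```

Since a node cannot be created without its inputs, the graph is always wired in dependency order, which is the property the paragraph attributes to building graphs operator by operator.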