Pervasive DataRush Basics

Architecture

Pervasive DataRush™ uses a dataflow architecture: a program executes as a graph of computation nodes interconnected by dataflow queues. The nodes share data only through the queues. In this sense, dataflow is a shared-nothing architecture. The lack of shared state simplifies node implementation, since threads do not have to synchronize access to shared state. The in-memory, blocking queues implement the synchronization required to safely hand off data from node to node.
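
The hand-off through a blocking queue can be sketched in plain Java. This is not DataRush code: the class and method names are hypothetical, and the standard java.util.concurrent.BlockingQueue stands in for a dataflow queue. The two threads share nothing except the queue, which supplies all of the synchronization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration; none of these names are DataRush classes.
class QueueHandoffSketch {
    // Sentinel value that marks the end of the data on the queue.
    private static final Integer END = Integer.MIN_VALUE;

    static List<Integer> runPipeline(int count) {
        // Small bounded queue: put() blocks when full, take() blocks when empty.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        List<Integer> results = new ArrayList<>();

        // Upstream node: produces the values 1..count, then the sentinel.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    queue.put(i);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Downstream node: doubles each value until it sees the sentinel.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Integer value = queue.take();
                    if (value.equals(END)) {
                        break;
                    }
                    results.add(value * 2);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for nodes", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(5)); // prints [2, 4, 6, 8, 10]
    }
}
```

Neither node takes a lock of its own; the bounded queue also applies back-pressure, because the producer blocks once four values are waiting.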

In Pervasive DataRush, the computation nodes of a dataflow graph are known as "operators." Pervasive DataRush provides a library of ready-to-use operator components. You can also write custom operators to extend the standard library. For example, several of the sample applications have their own implementations of operators.

To support the creation of a dataflow graph for execution, Pervasive DataRush provides a composition phase for constructing operators and linking them in an execution graph. Operator properties can be set to determine both operator composition and runtime behavior. At runtime, a composed graph is realized by creating threads for each computation node, creating dataflow queues, and linking nodes. The Pervasive DataRush execution engine also supports monitoring using Java Management Extensions (JMX). During the execution phase, statistics objects may be created and MBeans instantiated to export profile and debug information. Pervasive DataRush provides a VisualVM plugin that can be used within VisualVM to display the exported runtime information.
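
As a rough idea of what exporting statistics through JMX involves, the sketch below registers a standard MBean with the platform MBean server and reads an attribute back. It uses only the standard javax.management API; the QueueStats bean, its attribute, and the object name are invented for illustration and are not the MBeans that Pervasive DataRush actually exposes.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical illustration; these are not the DataRush MBeans.
class QueueMonitoringSketch {
    // A standard MBean interface must be public and named <ImplClass>MBean.
    public interface QueueStatsMBean {
        long getTokensTransferred();
    }

    // Statistics object updated by the engine and read by JMX clients.
    public static class QueueStats implements QueueStatsMBean {
        private final AtomicLong tokens = new AtomicLong();

        void recordTransfer() {
            tokens.incrementAndGet();
        }

        @Override
        public long getTokensTransferred() {
            return tokens.get();
        }
    }

    static long registerAndRead(int transfers) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Assumed object name, chosen only for this example.
            ObjectName name = new ObjectName("example.sketch:type=QueueStats,name=demo");
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            QueueStats stats = new QueueStats();
            server.registerMBean(stats, name);
            for (int i = 0; i < transfers; i++) {
                stats.recordTransfer();
            }
            // Read the attribute the same way a JMX console would.
            return (Long) server.getAttribute(name, "TokensTransferred");
        } catch (Exception e) {
            throw new IllegalStateException("JMX registration failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(registerAndRead(3)); // prints 3
    }
}
```

A JMX console such as VisualVM attached to the same JVM would list the bean under the example.sketch domain.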

The next two topics give more detail about composition and execution.

Composition

Pervasive DataRush supports two types of operators, DataflowOperator and DataflowProcess, both Java interfaces. DataflowOperator is a composite operator, used only to compose other operators. After composition, a DataflowOperator no longer exists (it is "compiled away"). DataflowProcess is an executable operator attached to a thread and executed at runtime. Both operator types can be used to set operator properties and can be linked by dataflow queues.
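
One way to picture a composite being "compiled away" is a tree that is flattened into its executable leaves before anything runs. The sketch below is an assumption about the general idea, not the DataRush implementation; Node, Leaf, and Composite are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of composite operators; not DataRush classes.
class CompositionSketch {
    interface Node {
        // Appends the names of all executable nodes reachable from this node.
        void collectExecutables(List<String> out);
    }

    // Stands in for an executable operator that would get its own thread.
    static final class Leaf implements Node {
        private final String name;

        Leaf(String name) {
            this.name = name;
        }

        @Override
        public void collectExecutables(List<String> out) {
            out.add(name);
        }
    }

    // Stands in for a composite operator: it only groups other nodes.
    static final class Composite implements Node {
        private final List<Node> children;

        Composite(Node... children) {
            this.children = Arrays.asList(children);
        }

        @Override
        public void collectExecutables(List<String> out) {
            for (Node child : children) {
                child.collectExecutables(out);
            }
        }
    }

    // After flattening, only leaves remain; the composites have disappeared.
    static List<String> flatten(Node root) {
        List<String> out = new ArrayList<>();
        root.collectExecutables(out);
        return out;
    }

    static List<String> demo() {
        Node joinSample = new Composite(
                new Leaf("readLeft"), new Leaf("readRight"),
                new Leaf("join"), new Leaf("write"));
        Node application = new Composite(joinSample);
        return flatten(application);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [readLeft, readRight, join, write]
    }
}
```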

Dataflow queues are not instantiated at composition time to prevent premature access before runtime. During composition, operators are linked using a flow concept. When an operator is composed, its internal structure is created and its methods exposed for obtaining output flows. The output flow of one operator can be passed as input to another operator, which uses the passed-in flow to complete linking.

ApplicationGraph is a special DataflowOperator, used to create an application to run within the Pervasive DataRush engine. Like other graphs, it lives at composition time and has an interface for adding operators. Once composed, the ApplicationGraph can be run. You can use the static method on the GraphFactory to obtain an implementation of an ApplicationGraph.

Execution

After a graph is composed, it is ready to run. The ApplicationGraph interface defines a run method. On instantiation, engine properties set during composition define monitoring structures. At runtime, threads are launched and the main thread then waits for either normal thread completion or an error.
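
The "launch threads, then wait for completion or an error" pattern can be sketched with the standard java.util.concurrent executors. This is only an analogy for how an engine might behave; runAll and the string results are hypothetical and say nothing about how ApplicationGraph.run is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not the DataRush engine.
class RunAndWaitSketch {
    // Returns "completed" if every node finishes, or "failed: <message>"
    // as soon as the first node reports an error.
    static String runAll(List<Callable<Void>> nodes) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        ExecutorCompletionService<Void> completion = new ExecutorCompletionService<>(pool);
        List<Future<Void>> futures = new ArrayList<>();
        try {
            for (Callable<Void> node : nodes) {
                futures.add(completion.submit(node));
            }
            // The calling thread waits here, taking results in completion order.
            for (int i = 0; i < nodes.size(); i++) {
                try {
                    completion.take().get();
                } catch (ExecutionException e) {
                    // One node failed: cancel the rest and report the cause.
                    for (Future<Void> future : futures) {
                        future.cancel(true);
                    }
                    return "failed: " + e.getCause().getMessage();
                }
            }
            return "completed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            pool.shutdownNow();
        }
    }

    static String demo(boolean injectError) {
        List<Callable<Void>> nodes = new ArrayList<>();
        nodes.add(() -> null);
        nodes.add(() -> {
            if (injectError) {
                throw new IllegalStateException("reader failed");
            }
            return null;
        });
        return runAll(nodes);
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints completed
        System.out.println(demo(true));  // prints failed: reader failed
    }
}
```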

The Pervasive DataRush engine includes a deadlock detection algorithm that is invoked whenever a thread has to wait on a queue and certain criteria are met. The algorithm looks for cycles in the wait graph. If any are found, then deadlock has occurred. Graph execution halts while the algorithm determines which queue is at fault and expands memory for that queue; no user intervention is required. Deadlocks are thus often transient and occur only occasionally on a graph under particular stress.
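
Looking for cycles in a wait graph is a standard depth-first search. The sketch below is a generic version of that search, not the engine's algorithm: the graph is a plain map from each waiting node to the nodes it waits on, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of wait-graph cycle detection.
class WaitGraphSketch {
    // waitsOn maps a node to the nodes it is currently blocked on.
    static boolean hasCycle(Map<String, List<String>> waitsOn) {
        Set<String> finished = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : waitsOn.keySet()) {
            if (visit(node, waitsOn, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> waitsOn,
                                 Set<String> finished, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true; // reached a node already on the current path: cycle
        }
        if (finished.contains(node)) {
            return false; // already fully explored without finding a cycle
        }
        onPath.add(node);
        for (String next : waitsOn.getOrDefault(node, Collections.<String>emptyList())) {
            if (visit(next, waitsOn, finished, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }

    // The join always waits on the reader for data. When the reader's other
    // output queue is full, the reader also waits on the join to drain it.
    static boolean demo(boolean readerBlockedOnFullQueue) {
        Map<String, List<String>> waitsOn = new HashMap<>();
        waitsOn.put("join", Arrays.asList("reader"));
        if (readerBlockedOnFullQueue) {
            waitsOn.put("reader", Arrays.asList("join"));
        }
        return hasCycle(waitsOn);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // prints true
        System.out.println(demo(false)); // prints false
    }
}
```

In the demo, removing the reader's wait (for example by giving the full queue more room) breaks the cycle, which mirrors the idea of expanding the queue at fault.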

Monitoring

The execution of a Pervasive DataRush application can be monitored using VisualVM, the JMX console that is shipped with the Java JDK. Pervasive DataRush ships with a plugin for VisualVM. The plugin can be found in the plugins directory under the Pervasive DataRush installation, in the file named "datarush-visualvm-*.nbm". Follow the instructions within VisualVM for installing a new plugin. To obtain runtime information, connect to the JVM executing Pervasive DataRush using VisualVM. You can use the Pervasive DataRush tabs within VisualVM to see the running nodes, view queue information, and obtain general JVM and system information.

Writing a Pervasive DataRush Application

Join Sample

The Pervasive DataRush release includes several sample applications located in the samples directory. The samples include the source code with sample input data. We'll use the join sample for this tutorial. The purpose of the join sample is to read in two pre-sorted text files, join them, and write the results of the join to a third file. This is a very simple example, but it demonstrates several concepts within Pervasive DataRush.

Source code:

package example.join;

import com.pervasive.datarush.graphs.ApplicationGraph;
import com.pervasive.datarush.operators.CompositionContext;
import com.pervasive.datarush.operators.DataflowOperator;
import com.pervasive.datarush.operators.GraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedTextProperties;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedTextProperties;
import com.pervasive.datarush.operators.join.JoinMode;
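import com.pervasive.datarush.operators.join.JoinSortedRows;
// JoinKey is used in compose() but was not imported; this package is an assumption.
import com.pervasive.datarush.operators.join.JoinKey;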

public class JoinSample extends DataflowOperator { 

 String leftInputFile;
 ReadDelimitedTextProperties leftInputProperties;
 String rightInputFile;
 ReadDelimitedTextProperties rightInputProperties;
 JoinMode joinMode;
 String[] keyNames = { "firstName", "lastName" };
 String outputFile;
 WriteDelimitedTextProperties outputProperties;

 public static void main(String[] args) {
  ApplicationGraph app = GraphFactory.newApplicationGraph("Join Sample");
  app.add(new JoinSample());
  app.run();
 }

 public JoinSample() {

  this.leftInputFile = "leftInput.txt";
  this.leftInputProperties = new ReadDelimitedTextProperties();
  this.leftInputProperties.setHeader(true);
  this.rightInputFile = "rightInput.txt";
  this.rightInputProperties = new ReadDelimitedTextProperties();
  this.rightInputProperties.setHeader(true);
  this.joinMode = JoinMode.FULL_OUTER;  
  this.outputFile = "outputFile.txt";
  this.outputProperties = new WriteDelimitedTextProperties();
 }

 @Override
 protected void compose(CompositionContext ctx) {
  // Add reader for left hand side of join
  ReadDelimitedText readLeft = ctx.add(new ReadDelimitedText(leftInputFile, leftInputProperties));

  // Add reader for right hand side of join
  ReadDelimitedText readRight = ctx.add(new ReadDelimitedText(rightInputFile, rightInputProperties));

  // Add join
  JoinSortedRows join = ctx.add(new JoinSortedRows(readLeft.getOutput(),
                 readRight.getOutput(),
                 joinMode,
                 JoinKey.makeJoinKeys(keyNames, keyNames)));

  // Add writer for output of join
  ctx.add(new WriteDelimitedText(join.getOutput(), outputFile,
          outputProperties), "write");
 }
}

 

As an alternative to writing a Pervasive DataRush Java application, the developer can use the JavaScript interface, which offers a simplified way to work with Pervasive DataRush.

A comparable join written with the JavaScript interface looks like this (the script uses its own key fields and output file):

// Read the unit price data
leftInput = readDelim('leftInput.txt', {header:true})

// Read the sales data
rightInput = readDelim('rightInput.txt', {header:true})

// Define the keys to use for the join. In this example, the sources
// use the same field names.
joinKeys=["PRODUCT_ID", "CHANNEL_NAME"]

// Join the two data sources. Note that the order of using the data sources
// is important. With the join operator, the flow used to invoke join is
// the left hand side flow and the flow that is the first parameter is the
// right hand side flow. This is important when using joins like left
// outer or right outer.
output = leftInput.join(rightInput, joinKeys, joinKeys, 'FULL_OUTER')

// Write the output of the join operation to a delimited text file.
output.writeDelim('target/scratch/JoinSampleOutput.txt', {mode:'overwrite', header:true})

 

A DataflowOperator that exposes a main method can be invoked directly using the Pervasive DataRush dr script (found in the bin directory). The dr script builds the command line needed to invoke the JVM and run the desired application. For example, to run the join application, use the following command line:

dr example.join.JoinSample

 

Since JoinSample extends DataflowOperator, it must override the compose method, which takes a CompositionContext. The JoinSample constructor only sets the properties used during composition (the file names, the reader and writer properties, and the join mode); the composition of the join application happens in the compose method. The CompositionContext lets the developer build graphs through its add method, which takes either a DataflowOperator or a DataflowNode and adds it to the composition of JoinSample. Looking at the compose method, we can see that it first creates and then adds a ReadDelimitedText operator.

This operator reads a text file in delimited form and outputs a dataflow containing the data from the file. Another ReadDelimitedText operator is constructed and added to the graph. This one reads the second file to be joined. Next, the JoinSortedRows operator is created and added to the graph. This operator performs the join of the data. Note that we are assuming the two input data sources are already sorted on the join keys. The first two parameters of the join operator constructor are the two dataflows to join together. The join operator is passed the output of the first text reader and the output of the second text reader. In this way, the operators are linked together. Next, a WriteDelimitedText operator is created and added to the graph. This operator reads from a dataflow and writes to a delimited text file. Note that the writer takes the output dataflow of the join as its input. The writer does not produce a dataflow output.
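
To see why the join depends on sorted inputs, consider a minimal merge join in plain Java. This is a generic sketch, not the JoinSortedRows implementation: rows are simple {key, value} pairs, keys are assumed unique on each side, and the method name is hypothetical. Because both inputs are in key order, a single forward pass over each is enough to produce a full outer join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a full outer join over two sorted inputs.
class SortedJoinSketch {
    // Each row is {key, value}. Both lists must be sorted ascending by key,
    // and this sketch assumes a key appears at most once per side.
    static List<String> fullOuterJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() || j < right.size()) {
            int cmp;
            if (i >= left.size()) {
                cmp = 1;   // left exhausted: only right rows remain
            } else if (j >= right.size()) {
                cmp = -1;  // right exhausted: only left rows remain
            } else {
                cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            }

            if (cmp == 0) {
                // Keys match: emit one joined row and advance both sides.
                out.add(left.get(i)[0] + "," + left.get(i)[1] + "," + right.get(j)[1]);
                i++;
                j++;
            } else if (cmp < 0) {
                // Left key has no partner on the right.
                out.add(left.get(i)[0] + "," + left.get(i)[1] + ",null");
                i++;
            } else {
                // Right key has no partner on the left.
                out.add(right.get(j)[0] + ",null," + right.get(j)[1]);
                j++;
            }
        }
        return out;
    }

    static List<String> demo() {
        List<String[]> left = Arrays.asList(
                new String[] {"adams", "1"},
                new String[] {"baker", "2"},
                new String[] {"clark", "3"});
        List<String[]> right = Arrays.asList(
                new String[] {"baker", "x"},
                new String[] {"davis", "y"});
        return fullOuterJoin(left, right);
    }

    public static void main(String[] args) {
        // prints [adams,1,null, baker,2,x, clark,3,null, davis,null,y]
        System.out.println(demo());
    }
}
```

If either input were unsorted, the single pass would skip matches, which is why the sample requires pre-sorted files.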

The DataflowOperator implementations are composite operators, including ApplicationGraph. They can contain composite operators that may contain composite operators and so on. We can see this behavior in the above example: an ApplicationGraph is created and the JoinSample is added to it. JoinSample contains several operators, which themselves may be composites that use other composites.

When the JoinSample graph is executed, the readers begin reading and parsing their input files and pushing data onto their output queues. In parallel, the join operator reads its two input queues, joins the incoming data by the specified key fields, and writes the result to its output queue. The writer reads data from the join operator and writes it to an output text file. All of the components can run in independent threads in pipelined fashion, taking advantage of multiple CPUs in parallel.

IDE - Eclipse Integration

Setting the Project Classpath

Since Pervasive DataRush is 100% Java-based and is delivered in the form of a set of jar files, it is easy to integrate with the Eclipse IDE. Start with installing Pervasive DataRush and Eclipse. Create a new Java project within Eclipse. Once the project is created, bring up the properties dialog for the project and navigate to the Java Build Path section. On the Libraries tab, select the option to add external jar files. Navigate to the lib directory within the Pervasive DataRush installation. Select all of the jar files contained in the lib directory. These jar files are now part of the build path for the Java project within Eclipse. The Java interfaces and classes within Pervasive DataRush can be used directly, taking advantage of the useful features within the Eclipse IDE, such as incremental building, code completion, managing of imports, and many others.

Integrating Javadoc

To integrate the Pervasive DataRush Javadoc documentation within Eclipse you have to specify the location of the Javadoc files. After adding the Pervasive DataRush jar files to the classpath for the project, an entry in the navigator is added. This entry should be named "Referenced Libraries." Expand this item to display all of the jar files contained within. You should see a list of the jar files you added in the step above. One of them will have a name starting with "datarush-library-4.*", with a specific version number. Right click on this entry and select Properties from the context menu. A properties dialog is displayed. Select the "Javadoc location" entry. Use the browse button on the right hand side to navigate to the install location of Pervasive DataRush. Select the docs/apidocs directory under the installation area. Eclipse will now be able to find the Javadoc for Pervasive DataRush and integrate it into the Java editor and the Javadoc view.

The picture below is a snapshot of the properties dialog within Eclipse for a Java project. Note that the "Java Build Path" section is selected and the Pervasive DataRush library jar file is expanded within the tree view of the classpath.

This same operation can be executed using the Properties dialog on the Java project itself. Navigate to the "Java Build Path" area and select the Libraries tab. Expand the entry for the "datarush-library-4.*" jar file and click on the Javadoc location entry. Select the Edit button and specify the location of the Javadoc files as demonstrated above.

Executing a Pervasive DataRush Application

Eclipse provides the capability to run and debug Java programs. When building a Pervasive DataRush application that exposes a main method, the run capability of Eclipse can be used to execute the application. Create a new run configuration for the class containing the main method. See the snapshot below of the run configuration dialog:

Use the arguments tab of the run configuration dialog to set the arguments to pass to the application and to the JVM. The program arguments area is used to set the property file option and any other program arguments that are applicable. Use the JVM arguments area to set JVM-level options such as the amount of memory to use and the location of the Pervasive DataRush license file. The screenshot below captures the arguments tab of the run dialog with settings for both program and JVM arguments.

Once a run configuration is built for the application it is ready to be executed. This same run configuration can be used to debug the Java code within the Pervasive DataRush application. Simply set breakpoints in your Java code using Eclipse and execute the application using the debug launcher. When a breakpoint is encountered, Eclipse will switch to the debug perspective. You can then step through your Java code using the full power of the Eclipse debugger.

IDE - NetBeans Integration

Creating a New Library

Begin integrating Pervasive DataRush into your Java projects within NetBeans by creating a new class library. From the NetBeans menu bar, select the Tools->Libraries menu item. This will bring up the Library Manager dialog. Select the option to create a new class library. Name the library "DataRush" and set the type to the default "Class Libraries." Once the library is created, add the Pervasive DataRush jar files to it from the Pervasive DataRush installation area. Do this by selecting the "Classpath" tab and clicking on the "Add Jar/Folder" button. Navigate to the Pervasive DataRush installation directory and add all of the jar files located in the lib directory. The screenshot below shows a DataRush library entry with all of the DataRush jars added.

Now select the Javadoc tab and click on the "Add Zip/Folder" button. Navigate to the docs folder within the Pervasive DataRush installation area and select the apidocs folder. This will make the Javadoc available to any projects using the Pervasive DataRush library within NetBeans. The screenshot below shows the DataRush library entry with the addition of the Javadoc location.

Using the Pervasive DataRush Library in a Project

Now that the Pervasive DataRush class library has been created, it can be used within a Java project. This will make the DataRush jar files and dependencies available to the Java project's classpath and enable the viewing of the Pervasive DataRush Javadoc documentation. To add the Pervasive DataRush library to a Java project, right click on the project within the NetBeans project navigator. Select the "Properties" menu item and the property editor will appear. Select the "Libraries" entry in the navigation pane. In the Libraries section of the dialog, select the "Compile" tab and click on the "Add Library ..." button. A list of libraries will be presented. Select the Pervasive DataRush library that was added in the previous section. The project is now ready to take advantage of Pervasive DataRush.

The screenshot below shows the project property dialog with the Pervasive DataRush library added to the project's compile classpath.

Running a Pervasive DataRush Application

Once your project builds against Pervasive DataRush, you will want to run your Pervasive DataRush application. Within NetBeans, this is accomplished by creating a run configuration. The default configuration cannot be used, because program and JVM arguments must be set to run a Pervasive DataRush application.

To create a new run configuration, select your Java project in the project navigator, right click, and then select the Configuration->Customize menu item. The project properties dialog will appear with the Run section already selected. Click on "New" to create a new configuration. Enter a name for the configuration, the class to execute, the program arguments, and the JVM arguments. The screenshot below shows a run configuration built for the Join sample application contained in the Pervasive DataRush release.

The arguments to the class to invoke include the standard arguments that the Pervasive DataRush command-line helper classes support. The -pf application option specifies the property file that contains properties to pass to the Pervasive DataRush application. The -e option can be used to set engine configuration values. In this example, the licensePath is set to the path name of the Pervasive DataRush license file. The VM Options can be used to set the amount of memory to use and other JVM arguments.

Once this configuration is saved, it can be used to run the Pervasive DataRush application by clicking the run icon on the NetBeans toolbar. This configuration can also be used to debug your Java code. Simply set breakpoints as normal using the NetBeans Java editor and then use the debug icon on the NetBeans toolbar. You can now debug your Pervasive DataRush application and any operators that you have written.

This quick tutorial gave an overview of how to get started using Pervasive DataRush within the NetBeans IDE. You will be able to take advantage of the full-featured Java editor with code completion and other useful capabilities. You can also run, debug, and profile your Pervasive DataRush applications and components using NetBeans.
