Quick Start

Pervasive DataRush Basics

Writing a Pervasive DataRush Application

IDE - Eclipse

IDE - NetBeans

Quick Start

Installation

To install and configure Pervasive DataRush, follow the directions given in the README file.

This Javadoc (Application Programming Interface document) has pages corresponding to the items in the navigation bar. 

PDR requires an installation of Java 6. If you do not currently have a version of the Java 6 JDK installed, download the latest release and install it.

Run the Samples

To ensure that you have PDR installed and configured correctly, you can run the samples included in the installation. From the command line, make the samples directory your current directory. Execute the runsamples (or runsamples.bat) script. Each of the sample applications will be executed within the PDR engine. You can view the runsamples script for an example of how to use the dr script from the command line.


Pervasive DataRush Basics

Architecture

PDR uses a dataflow architecture, in which a program executes as a graph of computation nodes interconnected by dataflow queues. The nodes use the queues to share data. In this sense, dataflow is a shared-nothing architecture. The lack of shared state simplifies node implementation, since threads do not have to synchronize access to shared state. The in-memory, blocking queues implement the synchronization required to safely hand off data from node to node.
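The hand-off described above can be illustrated with plain java.util.concurrent classes. This is only a minimal sketch, assuming Java 8 or later; the class name, the END sentinel and the queue capacity are invented for the illustration and are not part of the PDR API or a description of how the engine is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Generic illustration: two nodes that share data only through a blocking queue.
// None of these names come from the PDR API.
class QueueHandoffSketch {
    // Hypothetical end-of-data marker used only by this sketch.
    private static final int END = Integer.MIN_VALUE;

    static List<Integer> runPipeline(int count) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        List<Integer> results = new ArrayList<>();

        // Producer node: put() blocks while the queue is full.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    queue.put(i);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer node: take() blocks while the queue is empty.
        Thread consumer = new Thread(() -> {
            try {
                for (int v = queue.take(); v != END; v = queue.take()) {
                    results.add(v * v);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join(); // results is safe to read after join()
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(5)); // prints [1, 4, 9, 16, 25]
    }
}
```

Neither thread uses locks or synchronized blocks of its own; the queue is the only point of coordination between them.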

In PDR, the computation nodes of a dataflow graph are known as operators. PDR provides a library of ready-to-use operator components. You can also write custom operators to extend the standard library. For example, several of the sample applications have their own implementations of operators.

To support the creation of a dataflow graph for execution, PDR provides a composition phase for constructing operators and linking them in an execution graph. Operator properties can be set to determine both operator composition and runtime behavior. At runtime, a composed graph is realized by creating threads for each computation node, creating dataflow queues, and linking nodes. The Pervasive DataRush execution engine also supports monitoring using JMX. During the execution phase, statistics objects may be created and MBeans instantiated to export profile and debug information. Pervasive DataRush provides a VisualVM plug-in that can be used within VisualVM to display the exported run-time information.

The next two topics give more detail about composition and execution.

Composition

Pervasive DataRush supports two types of operator, DataflowGraph and DataflowNode, both Java interfaces. DataflowGraph is a composite operator, used only to compose other operators. After composition, a DataflowGraph operator no longer exists (it is "compiled away"). DataflowNode is an executable operator attached to a thread and executed at run time. Both operator types can be used to set operator properties and can be linked by dataflow queues.

To prevent premature access before run time, dataflow queues are not instantiated at composition time. Instead, operators are linked during composition using a flow concept. When an operator is composed, its internal structure is created and its methods for obtaining output flows are exposed. The output flow of one operator can be passed as input to another operator, which uses the passed-in flow to complete the link.
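One way to picture the separation between linking and queue creation is sketched below. The Flow and Operator classes here are hypothetical stand-ins written for this illustration (assuming Java 8 or later); they do not reproduce the PDR interfaces and only show the general idea of a composition-time handle that has no queue behind it until the graph is realized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of composition-time linking; not the PDR API.
class FlowLinkingSketch {
    /** Composition-time handle for a connection; holds no queue until realized. */
    static final class Flow {
        final String producerName;
        private BlockingQueue<Object> queue; // stays null during composition

        Flow(String producerName) {
            this.producerName = producerName;
        }

        boolean isRealized() {
            return queue != null;
        }

        // Intended to be called only when the graph is about to run.
        void realize() {
            queue = new LinkedBlockingQueue<>();
        }
    }

    /** Minimal operator: records its input flows and exposes one output flow. */
    static final class Operator {
        final String name;
        final List<Flow> inputs = new ArrayList<>();
        private final Flow output;

        Operator(String name, Flow... inputs) {
            this.name = name;
            for (Flow in : inputs) {
                this.inputs.add(in);
            }
            this.output = new Flow(name);
        }

        Flow getOutput() {
            return output;
        }
    }

    public static void main(String[] args) {
        Operator reader = new Operator("reader");
        // Linking: the reader's output flow is passed to the writer as input.
        Operator writer = new Operator("writer", reader.getOutput());
        Flow link = writer.inputs.get(0);
        System.out.println(link.producerName + " -> " + writer.name
                + ", realized=" + link.isRealized());
        link.realize(); // the queue exists only from this point on
        System.out.println("realized=" + link.isRealized());
    }
}
```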

ApplicationGraph is a special DataflowGraph, used to create an application to run within the Pervasive DataRush engine. Like other graphs, it lives at composition time and has an interface for adding operators. Once composed, the ApplicationGraph can be run. You can use the static method on the OperatorFactory to obtain an implementation of an ApplicationGraph.

Execution

After a graph is composed, it is ready to run. The ApplicationGraph interface defines a run method. On instantiation, engine properties set during composition define monitoring structures. At run time, threads are launched and the main thread then waits for either normal thread completion or an error.

The PDR engine includes a deadlock detection algorithm that is invoked whenever a thread has to wait on a queue and certain criteria are met. The algorithm looks for cycles in the wait graph. If any are found, a deadlock has occurred. Without user intervention, graph execution pauses while the algorithm determines which queue is at fault and expands the memory for that queue. Deadlocks are thus often transient and occur only occasionally, when a graph is under particular stress.
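The cycle check at the heart of this approach can be sketched in a few lines. The example below is not the PDR algorithm; it is a generic illustration (assuming Java 8 or later) that models the wait graph as a map from each waiting node to the single node it is waiting on, which is a simplifying assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Generic wait-graph cycle check; an illustration, not the PDR implementation.
class WaitGraphSketch {
    /**
     * Returns true if following "waits on" edges from any node leads back
     * to a node already visited on that walk.
     * Assumption: each node waits on at most one other node at a time.
     */
    static boolean hasCycle(Map<String, String> waitsOn) {
        for (String start : waitsOn.keySet()) {
            Set<String> seen = new HashSet<>();
            String current = start;
            // Walk the chain until it ends (null) or revisits a node.
            while (current != null && seen.add(current)) {
                current = waitsOn.get(current);
            }
            if (current != null) {
                return true; // revisited a node: the wait chain is circular
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> waitsOn = new HashMap<>();
        waitsOn.put("A", "B");
        waitsOn.put("B", "C");
        System.out.println(hasCycle(waitsOn)); // prints false
        waitsOn.put("C", "A");
        System.out.println(hasCycle(waitsOn)); // prints true
    }
}
```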

Monitoring

The execution of a PDR application can be monitored using VisualVM, the JMX console that is shipped with the Java JDK. Pervasive DataRush ships with a plugin for VisualVM. The plugin can be found in the plugins directory under the PDR installation. It is contained in the file named datarush-visualvm-*.nbm. Follow the instructions within VisualVM for installing a new plugin. To obtain runtime information, connect to the JVM executing Pervasive DataRush using VisualVM. You can use the Pervasive DataRush tabs within VisualVM to see the running nodes, view queue information, and obtain general JVM and system information.
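For readers unfamiliar with JMX, the following sketch shows how any Java program can export a value that a JMX console such as VisualVM can read. It uses only the standard javax.management API. The QueueStats bean, its Depth attribute and the example.sketch domain are invented for this illustration and do not correspond to the MBeans that PDR registers.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Generic JMX illustration; the bean and its attribute are hypothetical.
class JmxStatsSketch {
    /** Standard MBean interface: its name is the class name plus "MBean". */
    public interface QueueStatsMBean {
        int getDepth();
    }

    /** Statistics object exported through JMX as the read-only attribute "Depth". */
    public static class QueueStats implements QueueStatsMBean {
        private volatile int depth;

        public int getDepth() {
            return depth;
        }

        void setDepth(int depth) {
            this.depth = depth;
        }
    }

    /** Registers the bean with the JVM's platform MBean server. */
    static ObjectName register(QueueStats stats, String queueName) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name =
                new ObjectName("example.sketch:type=QueueStats,name=" + queueName);
        server.registerMBean(stats, name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        QueueStats stats = new QueueStats();
        stats.setDepth(3);
        ObjectName name = register(stats, "joinInput");
        // Read the attribute back the same way a JMX console would.
        Object depth = ManagementFactory.getPlatformMBeanServer()
                .getAttribute(name, "Depth");
        System.out.println("Depth=" + depth);
    }
}
```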


Writing a Pervasive DataRush Application

Join Sample

The PDR release includes several sample applications located in the samples directory. The samples include the source code along with sample input data. We'll use the join sample for this tutorial. The purpose of the join sample is to read in two pre-sorted text files, join them together and output the results of the join to a third file. This is a very simple example, but it is useful for demonstrating several concepts within PDR.

Source code:

public class JoinSample extends DataflowGraphBase
{
    private static final String[] JSON_PROPERTY_KEYS = {
        "leftInputProperties", "rightInputProperties", "joinKeys", "outputProperties"
    };

    public static void main(String[] args) {
        // Parse command-line flags, and read and parse properties file
        CliConfig cli = new CliParser().parseArgs(args);
        cli.parse(JSON_PROPERTY_KEYS);
        // Reflectively construct a configured JoinSample, and run it
        cli.run(JoinSample.class);
    }

    public JoinSample (
            @Param("leftInputFile") String leftInputFile,
            @Param("leftInputProperties") DelimitedTextProperties leftInputProperties,
            @Param("rightInputFile") String rightInputFile,
            @Param("rightInputProperties") DelimitedTextProperties rightInputProperties,
            @Param("joinMode") JoinMode joinMode,
            @Param("joinKeys") String[] keyNames,
            @Param("outputFile") String outputFile,
            @Param("outputProperties") DelimitedTextProperties outputProperties) {

        // Add reader for left hand side of join
        ReadDelimitedText readLeft = add(
            new ReadDelimitedText(leftInputFile, leftInputProperties), "readLeft");

        // Add reader for right hand side of join
        ReadDelimitedText readRight = add(
            new ReadDelimitedText(rightInputFile, rightInputProperties), "readRight");

        // Add join
        JoinSortedRows join = add(
            new JoinSortedRows(readLeft.getOutput(), readRight.getOutput(), keyNames, keyNames, joinMode),
            "join");

        // Add writer for output of join
        add(new WriteDelimitedText(join.getOutput(), outputFile, outputProperties), "write");
    }
}

 

Note that JoinSample extends DataflowGraphBase instead of using ApplicationGraph. This sample uses the PDR CliParser and CliConfig helper classes and so does not need to be an ApplicationGraph. The CliParser parses the arguments passed to the main method, and the CliConfig class exposes a run method that is invoked within the main method of JoinSample. One of the standard options (-pf) can be used to pass in a property file. The CliParser maps the properties in the given property file, by name, to the @Param annotations on the JoinSample constructor. For example, if the property file contains a property named "leftInputFile", the value for that property will be passed to the JoinSample constructor as the leftInputFile parameter. JavaScript Object Notation (JSON) is used to help construct complex properties such as lists, maps and Java objects.
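The name-based mapping from properties to annotated constructor parameters can be illustrated with standard Java reflection. The sketch below is not how the PDR helper classes are implemented. It defines its own stand-in Param annotation and a hypothetical CopyJob class, handles only String parameters, and assumes Java 8 or later.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.util.Properties;

// Generic reflection illustration; all names here are hypothetical.
class ParamBindingSketch {
    /** Stand-in annotation defined for this sketch; PDR's own @Param may differ. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    @interface Param {
        String value();
    }

    /** Hypothetical application class with annotated constructor parameters. */
    static class CopyJob {
        final String inputFile;
        final String outputFile;

        CopyJob(@Param("inputFile") String inputFile,
                @Param("outputFile") String outputFile) {
            this.inputFile = inputFile;
            this.outputFile = outputFile;
        }
    }

    /** Builds an instance by matching property names to @Param values. */
    static <T> T construct(Class<T> type, Properties props) throws Exception {
        // Assumption: the class declares exactly one constructor.
        @SuppressWarnings("unchecked")
        Constructor<T> ctor = (Constructor<T>) type.getDeclaredConstructors()[0];
        Annotation[][] annotations = ctor.getParameterAnnotations();
        Object[] args = new Object[annotations.length];
        for (int i = 0; i < annotations.length; i++) {
            for (Annotation a : annotations[i]) {
                if (a instanceof Param) {
                    // Look up the property whose name matches the annotation value.
                    args[i] = props.getProperty(((Param) a).value());
                }
            }
        }
        return ctor.newInstance(args);
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("inputFile", "in.txt");
        props.setProperty("outputFile", "out.txt");
        CopyJob job = construct(CopyJob.class, props);
        System.out.println(job.inputFile + " -> " + job.outputFile); // prints in.txt -> out.txt
    }
}
```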

A DataflowGraph that exposes a main method and uses the CLI helper classes to process arguments can be invoked directly using the PDR dr script (found in the bin directory). The dr script builds up the command line needed to invoke the JVM to run the desired application. For example, to run the join application, the following command line can be used:

dr example.join.JoinSample -pf join.properties

The composition of the join application happens in the JoinSample constructor. Since JoinSample extends DataflowGraphBase, the add method is available to it. This method takes either a DataflowGraph or a DataflowNode to add to the composition of JoinSample. Looking at the constructor code, we can see that it first creates and then adds a ReadDelimitedText operator. This operator reads a text file in delimited form and outputs a dataflow containing the data from the file. Another ReadDelimitedText operator is constructed and added to the graph. This one reads the second file to be joined. Next, the JoinSortedRows operator is created and added to the graph. This operator performs the join of the data. Note that we are assuming the two input data sources are already sorted. The first two parameters of the join operator constructor are the two dataflows to join together. The join operator is passed the output of the first text reader and the output of the second text reader. In this way, the operators are linked together. Next, a WriteDelimitedText operator is created and added to the graph. This operator reads from a dataflow and writes to a delimited text file. Note that the writer takes the output dataflow of the join as its input. The writer does not produce a dataflow output.
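To see why a join over sorted inputs depends on that ordering, consider the generic merge-join sketch below. It is not the JoinSortedRows implementation. It is a simplified inner join written for this illustration (assuming Java 8 or later), where each row is a hypothetical {key, value} pair and keys are assumed to be unique on each side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic merge join over key-sorted inputs; an illustration, not PDR code.
class SortedJoinSketch {
    /**
     * Inner join of two row lists that are both sorted by key.
     * Each row is a {key, value} pair; keys are assumed unique per side.
     */
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int l = 0;
        int r = 0;
        while (l < left.size() && r < right.size()) {
            int cmp = left.get(l)[0].compareTo(right.get(r)[0]);
            if (cmp < 0) {
                l++; // left key is smaller: no match can appear later on the right
            } else if (cmp > 0) {
                r++; // right key is smaller: advance the right side
            } else {
                joined.add(left.get(l)[0] + "," + left.get(l)[1] + "," + right.get(r)[1]);
                l++;
                r++;
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"d", "4"});
        List<String[]> right = Arrays.asList(
                new String[] {"b", "x"}, new String[] {"c", "y"}, new String[] {"d", "z"});
        System.out.println(innerJoin(left, right)); // prints [b,2,x, d,4,z]
    }
}
```

Because each side is read strictly in order and never revisited, a join of this kind can consume its inputs as streams, which fits the queue-based model described earlier. If either input were unsorted, matching rows could be skipped.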

Now, back to the main method. The CliParser is used to parse the arguments passed to the class by the invocation from the JVM (which was invoked by the dr script). The parser finds the -pf option, parses the property file and creates a CliConfig object. The class to invoke (JoinSample) and the parsed config are passed to the run method. The run method creates an ApplicationGraph, constructs the JoinSample and adds it to the application graph, and then executes the application. The run method does not return until the JoinSample application completes execution.

DataflowGraph implementations, including ApplicationGraph, are composite operators. They can contain composite operators, which may in turn contain other composite operators, and so on. We can see this behavior in the above example: an ApplicationGraph (itself a DataflowGraph) is created and the JoinSample is added to it. JoinSample contains several composite operators which themselves may use other composites.

When the JoinSample graph is executed, the readers begin reading and parsing their input files and placing data onto their output queues. In parallel, the join operator reads its two input data queues, joins the incoming data by the specified key fields and writes the data to its output queue. The writer reads data from the join operator and writes the data to an output text file. All of the components are able to run in independent threads in pipelined fashion, thereby taking advantage of multiple CPUs in parallel.

 


IDEs

Eclipse Integration

Setting the Project Classpath

Since Pervasive DataRush is 100% Java based and is delivered in the form of a set of jar files, it is easy to integrate with the Eclipse IDE. Start with installing Pervasive DataRush and Eclipse. Create a new Java project within Eclipse. Once the project is created, bring up the properties dialog for the project and navigate to the Java Build Path section. On the Libraries tab, select the option to add external jar files. Navigate to the lib directory within the PDR installation. Select all of the jar files contained in the lib directory. These jar files are now part of the build path for the Java project within Eclipse. The Java interfaces and classes within PDR can be used directly, taking advantage of the useful features within the Eclipse IDE, such as incremental building, code completion, managing of imports and many others.

Integrating Javadoc

To integrate the Pervasive DataRush Javadoc documentation within Eclipse you have to specify the location of the Javadoc files. After adding the PDR jar files to the classpath for the project, an entry in the navigator is added. This entry should be named "Referenced Libraries". Expand this item to display all of the jar files contained within. You should see a list of the jar files you added in the step above. One of them will have a name starting with "datarush-library-4.*", with a specific version number. Right click on this entry and select Properties from the context menu. A properties dialog is displayed. Select the "Javadoc location" entry. Use the browse button on the right hand side to navigate to the install location of Pervasive DataRush. Select the docs/apidocs directory under the installation area. Eclipse will now be able to find the Javadoc for Pervasive DataRush and integrate it into the Java editor and the Javadoc view.

The picture below is a snapshot of the properties dialog within Eclipse for a Java project. Note that the "Java Build Path" section is selected and the PDR library jar file is expanded within the tree view of the classpath.

 

 

This same operation can be executed using the Properties dialog on the Java project itself. Navigate to the "Java Build Path" area and select the Libraries tab. Expand the entry for the "datarush-library-4.*" jar file and click on the Javadoc location entry. Select the Edit button and specify the location of the Javadoc files as demonstrated above.

Executing a Pervasive DataRush Application

Eclipse provides the capability to run and debug Java programs. When building a PDR application that exposes a main method, the run capability of Eclipse can be used to execute the application. Create a new run configuration for the class containing the main method. See the snapshot below of the run configuration dialog:

 

 

Use the arguments tab of the run configuration dialog to set the arguments to pass to the application and to the JVM. The program arguments area is used to set the property file option and any other program arguments that are applicable. Use the JVM arguments area to set JVM level options such as the amount of memory to use and the location of the Pervasive DataRush license file. The screen shot below captures the arguments tab of the run dialog with settings for both program and JVM arguments.

 

Screenshot of run configuration dialog on the arguments tab

 

Once a run configuration is built for the application it is ready to be executed. This same run configuration can be used to debug the Java code within the Pervasive DataRush application. Simply set breakpoints in your Java code using Eclipse and execute the application using the debug launcher. When a breakpoint is encountered, Eclipse will switch to the debug perspective. You can then step through your Java code using the full power of the Eclipse debugger.


NetBeans Integration

Creating a New Library

Begin integrating Pervasive DataRush into your Java projects within NetBeans by creating a new class library. From the NetBeans menu bar, select the Tools->Libraries menu item. This will bring up the Library Manager dialog. Select the option to create a new class library. Name the library "DataRush" and set the type to the default "Class Libraries". Once the library is created, add the DataRush jar files to it from the DataRush installation area. Do this by selecting the "Classpath" tab and clicking on the "Add Jar/Folder" button. Navigate to the PDR installation directory and add all of the jar files located in the lib directory. The screenshot below shows a DataRush library entry with all of the DataRush jars added.

 

Screenshot of the NetBeans Library Manager dialog

 

Now select the Javadoc tab and click on the "Add Zip/Folder" button. Navigate to the docs folder within the PDR installation area and select the apidocs folder. This will make the PDR Javadoc available to any projects using the PDR library within NetBeans. The screenshot below shows the DataRush library entry with the addition of the Javadoc location.

 

Screenshot of NetBeans Library Manager dialog

 

Using the Pervasive DataRush Library in a Project

Now that the Pervasive DataRush class library has been created, it can be used within a Java project. This will make the DataRush jar files and dependencies available to the Java project's classpath and enable the viewing of the PDR Javadoc documentation. To add the PDR library to a Java project, right-click on the project within the NetBeans project navigator. Select the "Properties" menu item and the property editor will appear. Select the "Libraries" entry in the navigation pane. In the Libraries section of the dialog, select the "Compile" tab and click on the "Add Library ..." button. A list of libraries will be presented. Select the PDR library that was added in the previous section. The project is now ready to take advantage of Pervasive DataRush.

The screenshot below shows the project property dialog with the PDR library added to the project's compile classpath.

 

Screenshot of NetBeans project property dialog

 

Running a Pervasive DataRush Application

Now that you can build a project that uses Pervasive DataRush, you'll want to run your PDR application. Within NetBeans, this is accomplished by creating a run configuration. The default configuration cannot be used because running a PDR application requires program and JVM arguments to be set.

To create a new run configuration, select your Java project in the project navigator, right click and select the Configuration->Customize menu item. The project properties dialog will appear with the Run section already selected. Click on New to create a new configuration. Enter a name for the configuration, the class to execute, the program arguments and the JVM arguments. The screenshot below shows a run configuration built for the Join sample application contained in the Pervasive DataRush release.

 

Screenshot of run configuration within NetBeans

 

The arguments to the class to invoke include the standard arguments that the PDR command-line helper classes support. The -pf application option specifies the property file that contains properties to pass to the PDR application. The -e option can be used to set engine configuration values. In this example, the licensePath is set to the path name of the Pervasive DataRush license file. The VM Options can be used to set the amount of memory to use and other JVM arguments.

Once this configuration is saved, it can be used to run the PDR application by clicking the run icon on the NetBeans toolbar. This configuration can also be used to debug your Java code. Simply set breakpoints as normal using the NetBeans Java editor and then use the debug icon on the NetBeans toolbar. You can now debug your PDR application and any PDR operators that you have written.

This quick tutorial gave an overview of how to get started using Pervasive DataRush within the powerful NetBeans IDE. You will be able to take advantage of the full featured Java editor with code completion and other useful capabilities. You can also run, debug and profile your PDR applications and components using NetBeans.