Pervasive DataRush FAQ


What is Pervasive DataRush™?
Why Pervasive? What led to the development of Pervasive DataRush?
Why is Pervasive DataRush needed?
Why only “data-intensive” applications? Can't I use DataRush for purely compute-intensive applications?
Why aren't current techniques around parallelism working?
Why is a new generation of hyper-parallel software urgently needed?
What is the Pervasive DataRush "secret sauce"? How is it different from other approaches?
What does Pervasive DataRush™ technology look like "under the covers"?
How far can Pervasive DataRush scale?
Why is the "auto-scaling" in Pervasive DataRush so important?
Can Pervasive DataRush also scale "down"?
Does Pervasive DataRush come with a library of pre-built components?
Can I write my own components for DataRush?
Are there any applications already written on Pervasive DataRush?
Why did you write Pervasive DataRush in Java instead of C/C++; isn't Java too slow?
Are there any third-party dependencies inside DataRush?
How big is the disk/memory/startup footprint of Pervasive DataRush?
Will Pervasive DataRush work on MPP-style clusters or grids?
What is the competition for Pervasive DataRush?
What will DataRush cost?
What will Pervasive DataRush licensing terms be?
How do I get support on Pervasive DataRush?
How do I report problems with Pervasive DataRush?


What is Pervasive DataRush™?

Pervasive DataRush is a 100% Java framework for developing highly parallel, data- and computation-intensive applications that run and scale on emerging multicore hardware without requiring any knowledge of concurrent programming techniques.

Why Pervasive? What led to the development of Pervasive DataRush?

Pervasive has been in the business of developing database and data integration software for more than 25 years. As a leading supplier of data infrastructure software with thousands and thousands of customers, we are at the forefront of trends (and challenges!) around data. Looking over the horizon we can see a data volume "crisis" looming (volumes growing at 100% per year, exacerbated by trends like the WWW, RFID, XML, etc.), and want to be part of the solution. Being part of the solution going forward, however, will require a fresh look at how to technically and economically scale data-intensive applications, and we are bringing exactly that kind of thinking to Pervasive DataRush.

Why is Pervasive DataRush needed?

Organizations of all sizes are drowning in oceans of data (and it's only going to get worse). At the same time everybody is under pressure to process that data within ever-shrinking time windows. There’s growing consensus that the only feasible answer to this BIG problem is parallelism. So Pervasive DataRush is being launched to give Java developers a new weapon, one that allows them to develop a whole new generation of high-performing, auto-scaling, hyper-parallel data-intensive applications on multicore platforms.

Why only “data-intensive” applications? Can't I use DataRush for purely compute-intensive applications?

The core engine that powers DataRush is based on decades of computer science research into what is called “dataflow”. Any application or algorithm that can be designed as a dataflow graph (a 'directed graph', to be precise) is an excellent candidate for implementation using DataRush. Some business and scientific problems just aren't dealing with bulk data and are more process-oriented than dataflow-oriented. Also, some problems such as Online Transaction Processing (OLTP) are best handled with J2EE or .NET server architectures.
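
To make the 'directed graph' idea concrete, here is a minimal sketch in plain Java (JDK 8 or later). It is not the DataRush API: the class `DataflowGraphSketch` and its methods `addEdge` and `executionOrder` are hypothetical names invented for this illustration, and the sketch assumes the graph has no cycles. Operators are nodes, data dependencies are edges, and a topological sort yields an order in which every producer comes before its consumers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the DataRush API.
class DataflowGraphSketch {
    // Adjacency list: operator name -> operators that consume its output.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    // Records that data flows from producer to consumer.
    void addEdge(String producer, String consumer) {
        edges.computeIfAbsent(producer, k -> new ArrayList<>()).add(consumer);
        edges.computeIfAbsent(consumer, k -> new ArrayList<>());
    }

    // Kahn's algorithm: every producer appears before its consumers.
    List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Operators with no pending inputs are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : edges.get(node)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        DataflowGraphSketch graph = new DataflowGraphSketch();
        graph.addEdge("readFile", "parse");
        graph.addEdge("parse", "filter");
        graph.addEdge("parse", "aggregate");
        graph.addEdge("filter", "write");
        graph.addEdge("aggregate", "write");
        System.out.println(graph.executionOrder());
    }
}
```

In this example `filter` and `aggregate` depend only on `parse`, so nothing in the graph prevents them from running at the same time. That kind of independence is what a dataflow engine can exploit.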

Why aren't current techniques around parallelism working?

There have been serious efforts to exploit parallelism for decades, and to some degree they are "working", but compared to what we think is possible, they generally fall woefully short. A seminal moment in hyper-parallelism occurred in the 1980s with a notion that could be described as: "...hey, if mass-market CPUs only cost $3, why don't we just lash hundreds or thousands of them together and build a hyper-parallel hardware platform?". As it turned out, this foray (let's call it SMP - Symmetric Multi-Processing) bumped into significant challenges - it was extremely difficult to lash even 2 or 4 CPUs together in a tightly-coupled SMP design, and so the parallel "industry" (some overlap with HPC - High Performance Computing) took two fateful turns:

1) True SMP boxes were produced, but only by highly specialized players (e.g., Convex), and then only for VERY high-end customers who could afford the extremely expensive technology. While most of these specialty hardware players have disappeared, examples of this kind of high-end SMP box continue today within the product lines of the major manufacturers (e.g., the Sun Fire and PRIMEPOWER systems).

2) The rest of the effort moved to what we'll call MPP (Massively Parallel Processing), with an argument that might be described as: "..hey, if we can't lash together hundreds of CPUs in the preferred tightly-coupled SMP way, then let's just lash together hundreds of PCs on a network - at least that's achievable." Users would tie together "clusters" of thin-node (single CPU) PCs to at least have a "platform" for running parallel applications. The latest advances in this area are around grids (even bigger clusters), and moving to fat (multi-CPU) nodes within the clusters.

The second path (MPP-style clusters) has been the dominant one in terms of adoption (especially if we consider "loose" coupling of a handful of servers/blades), but in some ways, the move to MPP was a tragic detour that derailed any chance for genuine parallelism (beyond the simple, task-based parallelism predominantly in use today) to see truly widespread adoption. While it technically "works", it has serious and, in our opinion, permanent pitfalls.

One problem is simply hardware heterogeneity - too many different "moving" parts is the culprit in many a failed architecture, and lashing together different systems, over different protocols is a recipe for brittleness and high cost. But the bigger and more enduring problem is the stunning lack of "standard" software available for parallel processing. To write highly parallel applications today beyond simple task-based parallelism is to largely throw yourself and your finite resources at a world of fractured, obscure, proprietary and often ancient software tools and languages. All the modern benefits that we take for granted in the serial, single-threaded software world (single address space, great tools/languages, etc.) are miserably absent in the world of parallel software today.

In any case, MPP has all the hallmarks of niche/specialty solutions, and none of the grinding power of commoditization that drives all really successful paradigm shifts. But all is not lost - riding in like the cavalry is the multicore revolution (we call it "the revenge of SMP"), which will finally bring the hyper-parallel hardware revolution originally dreamed about 20 years ago.

But who will develop the software?

Why is a new generation of hyper-parallel software urgently needed?

Since the advent of the microprocessor 30 years ago, virtually all system platforms have been built on a single CPU. This means that the vast majority of microprocessor-based software platforms take a simple view of the world: one CPU and a software stack supporting single-threaded applications. Even with the development of multi-threaded OS layers and chip-level multi-threading in recent years, the vast majority of programmers have remained with the simple serial model for application programming. And why not? Programming in a serial, single-threaded world is much simpler, and on top of that, for the last 30 years each succeeding generation of CPU produced so much sheer clock-speed horsepower gain that almost all existing software would enjoy an automatic speed-boost, simply by the introduction of new CPUs - no need to change the serially threaded code at all.

But those halcyon days are coming to a close - the historical model of ever-accelerating clock speeds on CPUs has hit the proverbial wall (cost, heat, etc.), and every single CPU manufacturer in the world is now racing to build CPUs on the model that will come to dominate – multicore. Benchmark performance of CPUs is moving from the clock-rate of a single core as the measuring bar to the harnessing into a single chip of multiple, cheaper to produce cores that can perform work in parallel – what is sometimes called "throughput" computing.

This heralds a tremendously exciting new cycle of energy, investment and performance from the hardware vendors, but there is a "gotcha". The "gotcha" is that, while this multicore revolution in microprocessor design will provide years and years of eye-popping "throughput" improvements from ever-denser packings of multiple cores, it also unleashes for the bulk of developers an entirely new paradigm for how to exploit this newfound parallel power - concurrent programming. And that presents several challenges:

1) For the first time in generations, as each new round of multicore hardware is released, existing single-threaded applications won't run any faster.

2) There is very little opportunity to "tune" single-threaded applications to run faster. Programming in a hyper-parallel multicore world means much of today's application (and even system) software will have to be rewritten. This will be especially urgent for data-intensive applications.

3) Concurrent programming in a world where applications will soon have hundreds of threads available is a fundamentally different (and harder) practice than serial programming on one thread, and there is a vast skills gap between the two among developers.

4) The typical software developer urgently needs new “auto-scaling”, hyper-parallel frameworks to deal with this situation.

What is the Pervasive DataRush "secret sauce"? How is it different from other approaches?

Concurrent programming in a world of multicore-based hyper-parallelism is going to be hard; there is no way around that. The reason programming frameworks are built is exactly to tackle a hard problem and hide the complexity from the developer who needs to concentrate on solving a higher-order problem. This is what we have done with our patent-pending Pervasive DataRush technology.

It is a 100% pure Java framework that, using dataflow principles, allows developers to employ an extensive and customizable library of components ranging from simple to very sophisticated. Some components are even "self-customizing", with late-binding facilities to dynamically adjust parallel execution strategies. Using Pervasive DataRush, developers can build "simple" dataflows that are then "compiled" into highly parallelized auto-scaling dataflow execution graphs, fully ready to exploit the underlying multicore hardware even as it grows. All the classic challenges around concurrent programming (e.g., deadlocks, memory management, queue management, threading, etc.) are handled by the DataRush framework. There is a more technical presentation available for download on the website that goes into more detail.
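
To give a feel for the queue and thread management mentioned above, the following sketch hand-writes a tiny two-stage pipeline in plain Java (JDK 8 or later). It is not DataRush code and says nothing about how the DataRush engine is implemented; the class `HandWrittenPipelineSketch`, the method `sumOfSquares`, and the `END` sentinel are hypothetical names chosen for this illustration. Even for two stages, the developer has to manage bounded queues, end-of-stream signalling, interruption, and joining threads by hand.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; this is not the DataRush API.
class HandWrittenPipelineSketch {
    // Sentinel marking end of stream; inputs must not contain this value.
    private static final int END = Integer.MIN_VALUE;

    // Runs a two-stage pipeline: stage 1 squares each input, stage 2 sums the squares.
    static long sumOfSquares(int[] inputs) {
        BlockingQueue<Integer> toSquare = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> toSum = new ArrayBlockingQueue<>(16);
        long[] result = new long[1];

        Thread squarer = new Thread(() -> {
            try {
                for (int v = toSquare.take(); v != END; v = toSquare.take()) {
                    toSum.put(v * v);
                }
                toSum.put(END); // propagate end-of-stream downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread summer = new Thread(() -> {
            try {
                long total = 0;
                for (int v = toSum.take(); v != END; v = toSum.take()) {
                    total += v;
                }
                result[0] = total;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            squarer.start();
            summer.start();
            for (int v : inputs) {
                toSquare.put(v); // blocks when the bounded queue is full
            }
            toSquare.put(END);
            squarer.join();
            summer.join(); // join() makes result[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running pipeline", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new int[] {1, 2, 3, 4}));
    }
}
```

Both stages run concurrently, which is the pipeline parallelism a dataflow framework aims to provide without the developer writing this plumbing.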

What does Pervasive DataRush™ technology look like "under the covers"?

There is a technical overview available for download on the website that goes into greater detail, but here are some key concepts:

1) Built on dataflow principles leveraging a well-established paradigm for handling data-intensive work in a highly parallel and scalable way

2) High-level XML dataflow composition language with some nascent UI tooling, allowing developers to create parallel applications quickly and efficiently

3) Exploits all kinds of parallelism:

- Task parallelism

- Pipeline parallelism

- Horizontal partitioning

- Vertical partitioning (this last one is quite interesting - imagine the additional scaling multiplier you could get from being able to quickly and easily "divide and conquer" at the field/column/element level)

4) Component architecture that comes with a pre-populated library of operators and is fully extensible (write your own components!), and allows the assembly/composition of higher-order components

5) Special components that support execution-time "customization" - meaning they guide operators (components) in determining how to scale themselves for higher degrees of parallelism if they discover additional hardware resources at runtime

6) A framework and dataflow execution engine that hides many of the complexities (thread management, deadlocks, etc.) around concurrent programming
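
As a rough illustration of two of the parallelism styles listed above, the sketch below contrasts horizontal partitioning (splitting by rows) with vertical partitioning (splitting by columns), using only standard `java.util.concurrent` classes on JDK 8 or later. It is not the DataRush API; `PartitioningSketch`, `runAll`, `horizontalTotal`, and `columnTotals` are hypothetical names, and sizing the work by `Runtime.getRuntime().availableProcessors()` is just one simple way to adapt to the hardware found at runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not the DataRush API.
class PartitioningSketch {
    // Runs all tasks on a fixed pool and returns their results in task order.
    static List<Long> runAll(List<Callable<Long>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Horizontal partitioning: each task totals a contiguous range of rows.
    static long horizontalTotal(int[][] rows, int partitions) {
        List<Callable<Long>> tasks = new ArrayList<>();
        int chunk = Math.max(1, (rows.length + partitions - 1) / partitions);
        for (int start = 0; start < rows.length; start += chunk) {
            final int from = start;
            final int to = Math.min(rows.length, start + chunk);
            tasks.add(() -> {
                long sum = 0;
                for (int r = from; r < to; r++) {
                    for (int value : rows[r]) {
                        sum += value;
                    }
                }
                return sum;
            });
        }
        long total = 0;
        for (long part : runAll(tasks, partitions)) {
            total += part;
        }
        return total;
    }

    // Vertical partitioning: each task totals a single column across all rows.
    static long[] columnTotals(int[][] rows, int columns, int threads) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            final int col = c;
            tasks.add(() -> {
                long sum = 0;
                for (int[] row : rows) {
                    sum += row[col];
                }
                return sum;
            });
        }
        List<Long> sums = runAll(tasks, threads);
        long[] totals = new long[columns];
        for (int c = 0; c < columns; c++) {
            totals[c] = sums.get(c);
        }
        return totals;
    }

    public static void main(String[] args) {
        int[][] rows = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
        int cores = Runtime.getRuntime().availableProcessors(); // discovered at runtime
        System.out.println(horizontalTotal(rows, cores));
        System.out.println(Arrays.toString(columnTotals(rows, 3, cores)));
    }
}
```

The same data is processed either way; only the axis along which the work is divided changes, which is why the two styles can in principle be combined.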

How far can Pervasive DataRush scale?

Pervasive DataRush was invented to automatically and continually scale applications that are both data- and computation-heavy, as long as you are willing to throw more (increasingly inexpensive and commoditized) hardware at the application. All data- and computation-heavy applications are eventually IO-bound, CPU-bound, or both. Pervasive DataRush can help with both, given the right mix of auto-parallelizing components, but the gains are likely to be most dramatic for CPU-bound work, given the rapid advances we are seeing and continuing to forecast around multicore chips and the massively multi-threaded future they promise.

While benchmarking results don't necessarily translate across applications, we anticipate that many, but NOT all, data-intensive applications will enjoy, provided there are no IO constraints, approximately 60%+ linear throughput gains for every doubling of cores. Our internal testing has so far confirmed this on 2-, 4-, 8- and 16-way machines, and we are continuing to expand our testing to larger SMP configurations. Some data-intensive applications generate dataflow graphs that will encounter memory-boundedness, but they are few, and we are working to improve these cases as well. Note that the following two questions are also very relevant to the scaling issue.
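
As a back-of-the-envelope aid, the sketch below works through one possible reading of the figure above: that each doubling of cores multiplies throughput by about 1.6. That reading, the class name `ScalingEstimateSketch`, and the method `projectedSpeedup` are assumptions made for this illustration, not published DataRush benchmarks. Under this reading, 16 cores (four doublings) would give roughly 1.6 × 1.6 × 1.6 × 1.6 ≈ 6.5 times the single-core throughput, compared with 16 times for perfectly linear scaling.

```java
// Hypothetical illustration only; the 60% figure is interpreted, not measured here.
class ScalingEstimateSketch {
    // Projected throughput multiplier relative to one core, assuming each
    // doubling of cores multiplies throughput by (1 + gainPerDoubling).
    static double projectedSpeedup(int cores, double gainPerDoubling) {
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be at least 1");
        }
        double doublings = Math.log(cores) / Math.log(2);
        return Math.pow(1.0 + gainPerDoubling, doublings);
    }

    public static void main(String[] args) {
        for (int cores : new int[] {2, 4, 8, 16}) {
            System.out.printf("%d cores: about %.2fx%n", cores, projectedSpeedup(cores, 0.6));
        }
    }
}
```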

Why is the "auto-scaling" in Pervasive DataRush so important?

Traditional notions of "scaling" normally mean using parallelism to scale up for ever-larger data volumes or shrinking time windows or both. However, it is important to realize that the majority of traditional solutions built on parallel techniques are notoriously "brittle". They were typically engineered by very expensive and rare concurrent programming talent with specific knowledge of the hardware configuration current at the time. The resulting implementations, while highly optimized for that specific configuration, will in almost all cases NOT automatically scale as additional hardware resources are thrown at the problem. This can be a very frustrating moment for users/buyers, who experience a fairly painless hardware expansion, but then find the software application/solution does not scale for the new hardware, and in fact has to be "tuned" at length or in some cases re-written altogether. Of course, with the gathering speed of advances around multicore, and the ease with which users will be able to quickly, easily and inexpensively upgrade their hardware, the frustration around "lagging" software will increase. Absent an auto-scaling development platform like Pervasive DataRush, this mismatch between hardware and software scalability will become jarring, even critical.

Can Pervasive DataRush also scale "down"?

Historically, the benefits of parallelism have remained the exclusive province of very high-end (read: expensive) hardware and software configurations. While Pervasive DataRush will play a key role in this arena, the biggest opportunity to "spread the wealth" of parallelism is by bringing it to the masses.

Every industry has that moment when benefits formerly enjoyed only by the upper tier come cascading down the commodity curve, and we feel multicore is bringing one of those inflection points to the software industry. As outrageous as it may sound now, in the coming years we will be commanding armies of hyper-parallel machines, each with the ability to run hundreds, even thousands of parallel threads, and each at a stunningly low price point. And what are the odds that, on this new breed of inexpensive and hyper-parallel hardware, users will want to fork out enterprise-level spending for brittle software parallelism on those machines? Pretty much zero.

The pressure is on the software industry to quickly come up with hyper-parallel software answers, and we believe Pervasive DataRush is one of those answers. Because of the very small footprint, including startup time, size and cost, of Pervasive DataRush, it can be used profitably on virtually any hardware configuration, from the biggest down to the smallest. Why shouldn't the user who in 2010 spends $10,000 on a mega-powerful 32-core server or workstation also enjoy the benefits of software hyper-parallelism? In the same way that users will always want to economically reduce the runtimes of their monster data-intensive applications from days to hours, why stop there? Why not reduce the hours-long jobs to minutes? The minutes-long jobs to seconds? Why can't we shrink the “wall-clock” time of every data-intensive application?

Does Pervasive DataRush come with a library of pre-built components?

Yes. With a foundation of many years of research and development, Pervasive DataRush ships with a complete library of "foundation" components. The pre-existing portfolio includes reader and writer components for JDBC, flat files of all kinds, and XML. In addition, there is a series of low-level components available for basic data manipulation, as well as more sophisticated components for join, lookup, sort, aggregation, etc. Over time, there will be a rich library of pre-built components/operators, so many users will be able to "build" highly parallel and auto-scalable applications by simply stitching together a series of "standard" components in a novel way, thereby building a custom dataflow for virtually any application purpose. Of course, via the open SDK, users can also extend the framework with their own components for any task.

Can I write my own components for DataRush?

Yes, in fact we encourage it, and we believe this can unleash the power of Pervasive DataRush for your application needs. There is a rich SDK and a full set of samples, including source code, illustrating how to write your own Pervasive DataRush components to extend the framework for your particular needs. We fully expect application developers to develop highly valuable add-on components to the Pervasive DataRush framework that allow their particular applications to scale for their domain space. Once the new components are developed they can be freely and creatively combined with the existing foundation components to produce custom dataflow-based user applications.

Are there any applications already written on Pervasive DataRush?

Yes, for three years Pervasive has been shipping a product called Pervasive Data Profiler™ built entirely on Pervasive DataRush. Data profiling is a highly data-intensive activity where entire source files/tables are scanned, and computationally intensive metrics (e.g., Sum, Avg, Min, Max, Tests, Frequency Distributions, regulatory compliance checks, etc.) are performed on any number of columns in all the rows/records simultaneously. Needless to say, scalability is very important for such an application and Pervasive DataRush has performed up to expectations with near-linear scalability as the hardware is expanded with additional multicore CPU resources.
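
For readers unfamiliar with data profiling, the sketch below computes the kind of per-column metrics mentioned above (sum, average, min, max, and a frequency distribution) using the standard Java streams library on JDK 8 or later. It is not code from Pervasive Data Profiler or DataRush; the class `ColumnProfileSketch` and its methods `numericProfile` and `frequencies` are hypothetical names, and the sketch assumes the columns contain no null values.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not Pervasive Data Profiler or the DataRush API.
class ColumnProfileSketch {
    // Computes count, sum, min, max and average of a numeric column in one parallel pass.
    static DoubleSummaryStatistics numericProfile(double[] column) {
        return Arrays.stream(column).parallel().summaryStatistics();
    }

    // Builds a frequency distribution (value -> occurrence count), sorted by value.
    static Map<String, Long> frequencies(String[] column) {
        return Arrays.stream(column)
                .parallel()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        double[] amounts = {2.0, 4.0, 6.0};
        String[] states = {"TX", "CA", "TX", "NY"};

        DoubleSummaryStatistics stats = numericProfile(amounts);
        System.out.println("sum=" + stats.getSum() + " avg=" + stats.getAverage()
                + " min=" + stats.getMin() + " max=" + stats.getMax());
        System.out.println(frequencies(states));
    }
}
```

Each column can be profiled independently of the others, which is one reason this workload parallelizes well as cores are added.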

Why did you write Pervasive DataRush in Java instead of C/C++; isn't Java too slow?

1) The software industry focuses too often only on run-time performance, and while that is important (after all, you're reading this FAQ!), it is equally important to focus on design-time (and maintenance/QA-time) performance. In this respect, Java is superb with good design-time tools, a productive language, rich third-party libraries and portability across HW platforms.

2) There was a time when Java was materially slower than C/C++, but those days are largely over in our opinion. There are many very large players including Sun, IBM, Oracle and BEA who are committed to delivering constant performance improvements around Java.

3) The whole point of Pervasive DataRush is to enable you to ride the coming wave of multicore-based parallelism. As each new generation of hardware improvements rapidly washes over us, the legacy advantage in wringing the last ounce of performance from serial algorithms diminishes.

Are there any third-party dependencies inside DataRush?

Other than Java, no. And this is a very important point. Many practitioners around parallelism and concurrent programming, both Java and non-Java, think that this "nut" of parallelism can only be cracked with either major extensions to existing languages or new languages altogether. Needless to say, both approaches are costly, fraught with risk, and introduce entire new complexities and delays around learning and compatibility. While the temptation is always there to "invent" YAL (yet another language) we are convinced that is not necessary. If we’re right, you save a ton of time and money, and all you need is a JVM!

How big is the disk/memory/startup footprint of Pervasive DataRush?

Pervasive DataRush was built from the beginning to be very small. This contrasts starkly with other tools and products in the space. The quick metrics are:

1) The DataRush engine takes about 2 MB of disk space

2) Basic memory footprint is 1 MB, though of course, it can and will consume more memory to execute data-intensive applications

3) Startup time is sub-second

Will Pervasive DataRush work on MPP-style clusters or grids?

The short answer is no. Pervasive DataRush needs a single JVM, which itself generally implies a single address space and OS. It’s worth pointing out that to the degree the MPP/cluster/grid folks can make the hardware look like a single SMP-style machine with a single OS and JVM (e.g. Azul Systems) then even those environments are fair game for Pervasive DataRush.

We are making the bet that has been proven right many times before (e.g. the inexorable march of Moore's Law) – once the hardware vendors get the commodity multicore wheel turning then constant, eye-popping advances in processor capabilities ensue at blinding speed.

What is the competition for Pervasive DataRush?

We see few if any natural competitors out there. Obviously there are other players tackling the problem of massive data volumes with tools/solutions that employ high degrees of parallelism, but most are ill-suited to help Java developers find embeddable technology to vault over the ugly reality of concurrent programming and into the hyper-parallel future. A quick list of other approaches/players might include:

1) The commercial RDBMS vendors. There is frankly outstanding parallelism going on inside these products whose top engineers have been working on dataflow-based parallelism inside database engines for years, but who wants to drag a heavy database around, and be forced into back-breaking loads/unloads, just because you need parallelism? Pervasive DataRush is a better solution for many computer applications.

2) There is some interesting work going on around stream-based data "engines" that appears parallel-friendly, and relevant to certain vertical, data-intensive application needs, but then again, you could build your own streaming data engine using Pervasive DataRush.

3) A motley crew of fat, pricey, MPP-style ETL options focused on niche scenarios.

4) Some academic projects here and there, but almost all deal with compute-intensive applications and not data-intensive ones.

Given the relatively tiny footprint of Pervasive DataRush in terms of technology, learning curve and economics, we really don't see another solution out there.

What will DataRush cost?

Pricing has not been finalized, but the whole history of Pervasive Software is delivering fast, easy and affordable "embedded" data engines to developers. Stay tuned to the website for further developments around pricing, but thoughts so far are:

1) For development purposes, the SDK/Developer Seat will always be free

2) There will be an optional annual Support Subscription option for those developers who need/want it. Free community support will be fostered on www.pervasivedatarush.com

3) When it comes to production deployment of applications using the Pervasive DataRush runtime, you will have to obtain a license from Pervasive Software:

------- There will be friendly academic licensing options. Provided no commercial deployments are contemplated there will be no cost

------- For commercial developers (ISVs - both traditional and hosted) there will be embedded OEM options that fit your business model

------- For end-users and corporate developers there will be very simple pricing that likely aligns with the scale of the hardware platform on which it is deployed

What will Pervasive DataRush licensing terms be?

Some of this is still under discussion, but current plans have the SDK and developer seat UI pieces being offered under a no-cost license for any development purposes, including academic uses, and the runtime dataflow "engine" being licensed under an annual subscription model when used for production/commercial purposes. Stay tuned to the website for further details.

How do I get support on Pervasive DataRush?

There will be two options:

1) Pervasive will host free community-based support on www.pervasivedatarush.com. While there are no guarantees, Pervasive DataRush experts will be monitoring the support activity in the forums, and participating where it makes sense.

2) Pervasive will also offer a "per developer" annual Support Subscription, which will allow direct support interactions with Pervasive experts, as well as other benefits.

How do I report problems with Pervasive DataRush?

There is a series of forums for user feedback at www.pervasivedatarush.com. While there are no guarantees, these forums will be monitored by Pervasive DataRush experts, and such feedback will factor into the successful evolution and growth of the product. Those developers purchasing an annual Support Subscription will receive direct access to the Pervasive DataRush support team.

Have you looked at Azul Systems as a DataRush platform? The Vega series offers hundreds of cores and close to a terabyte of memory, all of which can be brought to bear on a single JVM (the hardware allows pauseless garbage collection, so memory limits aren't a factor). Any idea what DataRush technology could do in terms of allowing a Java application to actually use HUNDREDS of threads without blocking?

Yes, we have been working with Azul for a while, and we were so excited to light up all the cores on their awesome box.

The only hitch today is that they have not yet released their Java 6 JVM, which we require. But they are in beta with it, we have run on it, and I have heard that Cliff and his team are even using DataRush to test their platform as they continue to refine it.

We are hopeful that they will support Java 6 soon, and can't wait to try it out at a customer site.

Thanks for the comment!
steveh

I think discounting the whole MPP field is a little premature, especially when very large data flows are required to be computed. It is unlikely that the number of cores in a single processor will keep up with the rising tide of data. Also, the virtualization of a single JVM across a grid will probably prove challenging, since the DataRush optimizations will wind up fighting and conflicting with the cross-processor virtualization optimizations. DataRush sounds like a good idea to harness the power of multiple cores, but I don't think it's the scalability panacea it claims to be.

Hi, and thank you for your comment.

You are correct that the rising tide of data seems to be unstoppable, and it is true that we have yet to see a single JVM running across multiple boxes.

But the increase in the number of cores in a single machine is relentless. We are already running on 32-core platforms from Sun and HP using processors from Intel and AMD, and the announcements from all of these vendors keep coming, for example the DL785 Proliant announcement last week from HP. So inexpensive computing resources continue to grow.

More importantly from our perspective, DataRush completely changes the design time costs as compared to MPI or OpenMP or other grid-based alternatives, and we believe that the scarce resource is developers, not compute power. Have you had a chance to read the JDJ cover story this month? The entire app was coded and tuned and deployed in a month, which is quite fast.

In a larger sense, DataRush should be considered as an alternative to grid-based approaches, and we believe it will prove to be faster to design and code, more robust in deployment on varying platforms, and produce very high performance in data-intensive situations.

We hope you will keep watching as we move to our next release, as we develop and deploy more early applications, and as the product continues to gain acceptance.

Thanks!
steveh

The idea of using a runtime dataflow engine is not so unique.
There are many similar products to DataRush. Probably the most famous one is BMDFM - Binary Modular DataFlow Machine.

No one's claiming the idea of dataflow is unique. In fact, we've explored technologies like BMDFM, Morrison's flow-based programming, and MIT's StreamIt. Each of these projects is interesting in its own right, but we feel their goals are not exactly our own. Note that DataRush provides a Java shop with a framework for parallelizing their apps without struggling through all the standard concurrency issues. Any team's decision to use DataRush must be based upon an informed comparison of the alternatives. Feel free to download the SDK and give it a try in your next project.