Press clips, articles, and reviews of DataRush

Building Multi-core Ready Java Applications, Part 1 and Part 2

Author: 
Eric Bruno
Publication: 
JavaLobby
Date: 
2008-04-21

With the advent of multi-core processors, the existing subject of symmetric multiprocessing has been thrust to the forefront in the development community. As multi-core CPUs make parallel processing systems more prevalent, and more affordable, there is an increasing need for frameworks that help to handle threading, synchronization, deadlock detection, memory management, data pipelining, vertical/horizontal data partitioning, and so on.

This two-part article series will discuss some of the special design, development and testing techniques that must be used to take full advantage of multiprocessor systems. It examines some frameworks available to make Java-based parallel programming easier and even transparent in some cases.

Part 1: Symmetric Multiprocessing Overview

Symmetric multiprocessing (SMP) is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory. Most common multiprocessor systems today use an SMP architecture, where the system allows any processor to work on any task no matter where the data for that task is located in memory. In fact, all system resources are available to all processors, and hence all applications, equally. With proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.
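As an illustration of the SMP model described above, the following is a minimal Java sketch, not taken from the article. It sizes a thread pool to the number of processors the JVM reports and leaves it to the operating system to schedule the worker threads across them. The class name SmpSum and the method parallelSum are hypothetical, and the lambda syntax assumes Java 8 or later, which postdates the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; not part of any framework named in the article.
public class SmpSum {

    /** Sums the integers 1..n by splitting the range into one chunk per available processor. */
    public static long parallelSum(long n) throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = (n + cores - 1) / cores; // ceiling division
            for (int i = 0; i < cores; i++) {
                final long lo = i * chunk + 1;
                final long hi = Math.min(n, (i + 1) * chunk);
                // Each chunk is an independent task; any processor may run it.
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (long v = lo; v <= hi; v++) {
                        s += v;
                    }
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get(); // wait for each partial sum
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(1_000_000L));
    }
}
```

Because all memory is shared, the tasks need no explicit data movement; only the partial results are combined at the end.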

Part 2 of this article will focus on a framework from Pervasive Software, called Pervasive DataRush™, which reduces the complexity of parallel programming. The issues around parallel computing, as applied to the problem of loading large amounts of text-based data into an RDBMS, will be introduced, and Pervasive DataRush will be explored as a way to solve them.

Source URL: http://java.dzone.com/news/building-multi-core-ready-java

Links:
[1] http://java.dzone.com/news/build-multi-core-ready-java-ap-0

Transforming Big Data

Author: 
John E. West
Publication: 
http://www.hpcwire.com
Date: 
2008-03-28

"There is a lot that we still don't know about the architectures, tools and techniques needed to effectively process the data we are amassing at work and at play in much of the first and second world. But, as with multicore programming techniques, data intensive computing provides the HPC community the opportunity to leverage products and models developed in the commodity community to advance the state of the art in our own field."

Crunching Big Data with Java: One Team, One Month, One JVM

Author: 
Jim Falgout
Publication: 
Java Developers Journal
Date: 
2008-03-25

"We have shown that a successful approach for data analytic problems is to take a data-oriented view of the solution. This type of analysis leads to designs that are transferred easily to implementations using dataflow techniques. It was demonstrated during the course of this article how a small team of three developers created a fuzzy matching application in a short amount of time using these techniques. The debugging and profiling tools built into Java and the dataflow framework used were instrumental in optimizing the application not only for minimal runtime, but for optimal resource (CPU) utilization. The development focus on parallelization and utilization led to good scalability, allowing the application to show much faster runtimes on configurations with more processing cores. This is important because we don't want to have to re-code as more cores arrive on the scene.

The productivity of Java, its excellent IDE support, and the wide variety of Java libraries available make it an excellent platform for software development. My team's work on the fuzzy matching application demonstrates that Java can also be used for applications that are a mix of data-intensive and compute-intensive elements. The Java platform provides an excellent mix of design-time and runtime performance and scalability. With new architectural approaches and dataflow library extensions, Java can be turned into a formidable data-crunching machine."

Pervasive Software's Datarush

Author: 
Dr Peter Dzwig & Dr Russel Winder
Publication: 
www.it-director.com
Date: 
2008-03-11

Pervasive Software (NASDAQ: PVSW) is a company that many of us may have come across in the past. They have been around for some twenty-five years in the database and search businesses, going back to SoftCraft in the early eighties and then Btrieve with its eponymous ISAM product, subsequently becoming Pervasive Software. The company has been quoted for about a decade. Our interest in them arises because their Datarush product is targeted at doing heavy searching and data analysis on large databases running on multi-core based systems.

The rate at which we can capture data from commercial, financial, telecommunications, banking and a raft of other industries is growing explosively. At the same time the need for fast, deep analysis becomes ever more acute. The only way to address this in future will be through the deployment of large multiprocessor, multi-core systems.

Datarush addresses an important issue: the gap between what hardware vendors are promising to deliver by way of MCP hardware (Xeon, Opteron, etc.) and what the software industry is able to deliver in terms of commercial applications to exploit that hardware. Datarush is not a silver bullet as far as parallel programming goes. Pervasive openly state this. What it does do is provide a set of tools targeted at what this particular sector is going to require, at least in the near term. Requirements will grow and change as people realise what parallel computing can achieve.

Datarush is targeted at applications that need to process very large datasets (sets with several hundred million records have been trialled to date) and that are analytical rather than straightforward, embarrassingly parallel transaction-based systems. In this respect Pervasive are very clear that their present trial applications are showing very real improvements in both throughput and analytic capability.

Datarush is a framework that sits atop a JVM running on hardware from vendors such as Sun, IBM, HP, Dell, Azul, SGI. The framework is therefore operating system agnostic, running on Linux, Solaris, AIX, HP-UX, and Windows Server. Of course the efficient use of multicore hardware by Datarush depends on how well the operating system and the JVM implementation manage the hardware. Not all platforms are equal. Datarush provides a series of modules covering, for example, sorting and collation, ETL, extraction, profiling and similar functions, in a format that is readily applied by the user. The Datarush model is therefore essentially a thread-based approach at the functional level. That is, Datarush uses the thread pool to implement the parallelism that is required by the algorithms created using its framework.

Interestingly, Pervasive has chosen to provide a framework through Datarush that is a functional, dataflow programming model to provide the necessary mapping between the application and the underlying hardware. Programmers use one of the predefined functions, or some combination of them, to build their process and execute it. By abstracting in this way applications developers do not have to worry about low-level issues such as shared memory, synchronisation, and locking. occam and Erlang have long had a similar approach to managing parallelism, one that is increasingly viewed by many as more appropriate for programmers to work with. By separating applications developers and users from the need to consider issues such as synchronisation, they are freed to focus on the data itself and the algorithms for its analysis. Greater performance should then be gained without a lot of highly specialist effort and time that more low-level approaches require. Pervasive has produced a series of claims in which the performance increases rapidly with the number of processors applied to the problem. We have not verified these figures independently, but they seem to be consistent with the model that they propose.
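To make the dataflow idea concrete, here is a minimal Java sketch of the general technique, written with the standard java.util.concurrent classes only. It is not the Datarush API, which this review does not show; the class DataflowSketch and the helpers source, map and run are hypothetical names chosen for illustration. Each stage runs on a thread from a pool and talks to its neighbours only through bounded queues, so the stage functions themselves contain no locking code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical illustration of a dataflow pipeline; NOT the Datarush API.
public class DataflowSketch {

    // End-of-stream marker, compared by identity so no real record can match it.
    private static final String EOS = new String("<end-of-stream>");

    /** Source stage: pushes every input record downstream, then the end marker. */
    static Runnable source(List<String> input, BlockingQueue<String> out) {
        return () -> {
            try {
                for (String s : input) {
                    out.put(s);
                }
                out.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** Transform stage: applies fn to each record and forwards the result. */
    static Runnable map(Function<String, String> fn,
                        BlockingQueue<String> in, BlockingQueue<String> out) {
        return () -> {
            try {
                for (String s = in.take(); s != EOS; s = in.take()) {
                    out.put(fn.apply(s));
                }
                out.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** Wires source -> trim -> upper-case and collects the output in order. */
    public static List<String> run(List<String> input) throws InterruptedException {
        BlockingQueue<String> a = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> b = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> c = new ArrayBlockingQueue<>(16);
        ExecutorService pool = Executors.newFixedThreadPool(3); // one thread per stage
        try {
            pool.submit(source(input, a));
            pool.submit(map(String::trim, a, b));
            pool.submit(map(String::toUpperCase, b, c));
            List<String> result = new ArrayList<>();
            for (String s = c.take(); s != EOS; s = c.take()) {
                result.add(s);
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList(" alpha ", "beta ", " gamma")));
    }
}
```

The queues carry all the synchronisation, which is the point the review makes: the developer composes functions over the data and the plumbing handles the concurrency. A production framework would presumably also partition the data so that a single stage can run on several cores at once, which this sketch does not attempt.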

Pervasive have trialled Datarush on a wide variety of systems from Azul's 384-processor engine to AMD's Barcelona processor using a range of JVM implementations. We haven't yet seen the results of the trials, but we expect them to show very good results. Anyone with large-scale data-analysis requirements would do well to "watch this space".

As with any data-oriented system, it is actually the balance between compute power, memory access times and I/O that determines performance. For a well-balanced system and many classes of database problems, it seems that Datarush has a viable approach for most users. One would expect that, with a little trialling, most large-scale analysis applications ought to be able to exploit a lot of the advantage of MCPs using a system like Datarush.

Others active in the search-space arena have looked at parallel processor-based databases previously. In the 1980s and 1990s many companies, from Oracle down, experimented with implementations that could use the power of multi-processor arrays. Many of the lessons learnt made their way into production software and have been in circulation for a number of years. These lessons will, in some respects, translate to multi-core under a threads-based model, too. What still remains to be addressed in the longer term is what happens when hardware goes beyond the currently limited multicore systems. In other words, what happens when the threads model starts to reach the limit of its usefulness. Techniques that work well on a limited number of virtually independent processors (multi-processor mainframes for instance) don't necessarily translate to high core-count MCP engines. The underlying technologies hint that the approaches developed to underpin the Datarush framework may well prove portable.

Datarush is at present in beta-2 and will go to beta-3 later in the year.

Pervasive Software Works with HP to Help Developers Tap the Power of Multicore Processing

Author: 
Pervasive & HP
Publication: 
www.pervasive.com/company/press/releases_show.asp?cid=684
Date: 
2008-03-10

Pervasive DataRush Scales Linearly on 32-Core HP Integrity Server Running HP-UX 11i

AUSTIN, Texas – March 10, 2008 – Pervasive Software® Inc. (NASDAQ: PVSW), a global value leader in embeddable data management and agile integration software, today announced the results of two benchmark tests that ran Pervasive DataRush™, a Java development framework for quickly building highly parallel data processing applications for today's multicore hardware, on a 32-core HP Integrity server running HP-UX 11i with HP Java Platform, Standard Edition 6 (Java SE 6) Version 6.0.00. The K-means and Levenshtein distance tests recorded linear scalability for Pervasive DataRush across all 32 cores.

"Building parallelized software, especially for data-intensive applications, remains complex, time-consuming and error-prone," said Mike Hoskins, CTO and general manager of Integration Products at Pervasive. "The key to Pervasive DataRush's power is its ability to deliver speed and dividing processing up in an intelligent fashion. With these benchmarks, we've demonstrated that Pervasive DataRush not only hides the complexity of parallel data processing but seamlessly scales to light up all 32 cores in a state-of-the-art multicore SMP machine. We're excited to align our priorities for innovation in data processing with leading companies like HP that are developing multicore hardware, operating systems and Java support with the developer community in mind."

Pervasive DataRush gives Java developers a framework to write applications that harness the power of multicore – once – in a scalable and cost-effective way, eliminating the need to rewrite code each time an application runs on a server with additional cores. Pervasive DataRush enables applications to leverage the benefits of multicore processing while eliminating the strain of constantly re-optimizing applications in a fast-evolving hardware environment.

"These benchmark results effectively demonstrate that Pervasive DataRush with HP's Java SE 6 on HP Integrity servers running HP-UX 11i provides an exemplary platform for developing, integrating and deploying Java applications to harness multicore processing," said Lorraine Bartlett, worldwide director of Integrity server marketing, HP. "This achievement supports the reality of Java's 'write once, run anywhere' vision and HP's commitment to Java as an open, vendor-neutral standard platform."

About Pervasive DataRush
Pervasive DataRush is now available as a free beta download at www.pervasivedatarush.com.

About Pervasive Software
Pervasive Software (NASDAQ: PVSW) helps companies get the most out of their data investments through embeddable data management and agile integration software. The embeddable PSQL database engine allows organizations to successfully embrace new technologies while maintaining application compatibility and robust database reliability in a near-zero database administration environment. Pervasive's agile, multi-purpose integration platform accelerates the sharing of information between multiple databases, applications, or hosted business systems and allows customers to re-use the same software for diverse integration scenarios. For more than two decades, Pervasive products have delivered value with a compelling combination of performance, flexibility, reliability and low total cost of ownership. Pervasive's hallmark is the size, diversity and loyalty of its customer base, partners and channels: tens of thousands of customers in virtually every industry, in more than 150 countries, rely on Pervasive to manage, integrate, analyze and secure their critical data. For additional information, go to www.pervasive.com.

Cautionary Statement
This release may contain forward-looking statements, which are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. All forward-looking statements included in this document are based upon information available to Pervasive as of the date hereof, and Pervasive assumes no obligation to update any such forward-looking statement.

###

All Pervasive brand and product names are trademarks or registered trademarks of Pervasive Software Inc. in the United States and other countries. All other marks are the property of their respective owners.

No rush for Pervasive Software's DataRush, but the time is right

Author: 
John Barr
Publication: 
451 Report
Date: 
2008-02-05

Event summary:
Pervasive Software has been using the DataRush parallel Java framework in its data-profiling tool for some time. DataRush has been in beta for over a year, although general availability had been planned for last year. Successful scalability tests with 'lighthouse' customers have paved the way for a production release in 2008.

The 451 take:
The world is going multicore, but parallel programming skills are thin on the ground. The market is slow to adopt new programming models, and some of the traditional parallel programming tools are (possibly unfairly) seen as old hat. Pervasive Software's development of DataRush – a Java framework for highly parallel, data-intensive applications – could be in the right place at the right time. While there are no special tools for helping the developer analyze an application (and hence optimize its parallel performance), Java Management Extensions (JMX) does let you see performance data once you are up and running with DataRush.

Details:
Pervasive Software's evolutionary plan has been to first build its own products, then to add value by exploiting DataRush internally, and finally to make DataRush available as a general-purpose parallel programming framework for Java environments. DataRush supports parallelism in symmetric multiprocessing systems – not clusters or grids.

Parallel programming can be complicated. To get the best performance, a developer often has to understand what is happening at a low level to handle cache management, threading issues and performance tuning. However, it's difficult to do this with Java, as the Java Virtual Machine (JVM) handles thread affinity and it's difficult for a developer to dig too deep. While this is a limitation for an expert wanting to wring the last drop of scalability and performance, it's still a very appropriate approach for taking multicore and parallel processing to the masses. (Note that Intel had a patent issued last year titled 'Flexible acceleration of Java thread synchronization on multiprocessor computers,' which addresses the issue of Java thread affinity, although no product implementation is available yet.)

DataRush tests on synthetic parallel applications have delivered scalability and performance of 28 times faster on 32 cores. This is excellent, considering that the JVM hides the processor affinity and cache management from the application.
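To put the figure of 28 times faster on 32 cores in context, the short Java sketch below works through the arithmetic. It computes the parallel efficiency (speedup divided by cores) and, under the assumption that Amdahl's law is a fair model here, the serial fraction that such a speedup would imply. The report does not make that assumption itself, and the class and method names are hypothetical.

```java
// Hypothetical helper for reading a speedup figure; assumes Amdahl's law applies.
public class ScalingCheck {

    /** Parallel efficiency: achieved speedup as a fraction of the ideal (one per core). */
    public static double efficiency(double speedup, int cores) {
        return speedup / cores;
    }

    /**
     * Serial fraction s implied by Amdahl's law, speedup = 1 / (s + (1 - s) / cores),
     * solved for s: s = (cores / speedup - 1) / (cores - 1).
     */
    public static double impliedSerialFraction(double speedup, int cores) {
        return (cores / speedup - 1.0) / (cores - 1.0);
    }

    public static void main(String[] args) {
        // Figures from the report: 28x on 32 cores.
        System.out.printf("efficiency = %.3f%n", efficiency(28.0, 32));
        System.out.printf("implied serial fraction = %.4f%n", impliedSerialFraction(28.0, 32));
    }
}
```

For these inputs the efficiency is 28/32 = 0.875, and the implied serial fraction is 1/217, or a little under half a percent of the single-core runtime.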

Competitive landscape:
The Java Grande Forum was established a decade ago to encourage the development of Java language design for parallel, data-intensive applications. That effort and Java OpenMP petered out after a few years, despite initial industry interest. Pervasive says it doesn't see much competition out there for parallel Java frameworks, a conclusion that we agree with.

The main competition for the exploitation of multicore systems is parallel-processing environments for C++, not Java. The open standard in this space is OpenMP, and all the major compiler companies have good implementations. OpenMP's focus is more on hot computational kernels, while Pervasive's is on building scalable, data-intensive applications. But there's a big overlap between the two.

Others take different approaches. RapidMind has language extensions to C++ that allow the developer to express parallelism in a way that can more easily be exploited by a variety of parallel architectures, while Connective Logic Systems uses graphical tools to describe the structure of parallel applications.