DataRush Benchmarks

We will post DataRush benchmarks here as they become available. Not all benchmarks share the same goals: some show how DataRush scales as processing cores are added, while others compare DataRush with non-parallel implementations. Be sure to read each benchmark's goals below, as well as its technology configuration notes.

Goal: We know that understaffed and overwhelmed IT projects typically result in a developer just "getting it done" when it comes time to implement an algorithm or application. There is barely time to develop, unit test, and system test the application, let alone implement complex threading, data management, deadlock detection, and so on. So a common baseline question is: "Take a known problem (algorithm). What is the performance of a typical implementation using 'regular' Java vs. the same algorithm powered by the DataRush Java framework?"

Clustering for Large Datasets: DataRush vs. Java

The Algorithm: K-Means is a simple clustering algorithm. It generally makes multiple passes over the data set, computing clusters by minimizing an error measurement. However, an even simpler one-pass form exists. This one-pass algorithm does not create the best clusters, but it does a fair job as a first guess. It uses an epsilon (maximum error distance) to determine whether two points are close enough to be considered part of the same cluster. A simple distance measurement such as Euclidean distance is used in the comparison. The algorithm is fairly computationally intensive: it computes the square root of the sum of squared differences and compares every point's distance to every centroid. Using floating-point data makes it even more compute intensive.

Results: The graph above shows that DataRush is able to take advantage of the multicore server: the algorithm now runs in 1/8th the previous time (i.e., an 88% reduction in runtime).
Of course, we are not implying that an expert software engineer couldn't build a multi-threaded Java application to improve the performance of the non-DataRush implementation. Returning to our Goal above and the Assumptions below: what we are showing is the massive ROI a company can achieve by using any of its Java developers to quickly build data-intensive applications. And in this case (which is not true for all algorithms), if the organization needed the job to run in half the time, it would simply double the cores available to DataRush at runtime. Not many developers possess the know-how or the time to develop a custom threaded Java application that can auto-scale without changes as processors are added.

Assumptions: We assume the DataRush developer has already taken the 1 or 2 weeks to learn our data processing framework. Once this assumption is taken into account, the time to implement with or without DataRush is approximately the same. Also, we assume this is not a RAID array I/O test, so for this benchmark we do all data generation in memory for both implementations.

Technical Parameters:
    4 Dual Core CPUs (8 cores total) with 64 GB RAM
    10 million records of data
    100 fields in each record
    Cluster every 2 columns as a data point
    Therefore, 500 million K-means operations per run
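
The one-pass, epsilon-based clustering described under "The Algorithm" above can be sketched roughly as follows. This is a minimal illustration in plain, single-threaded Java, not the code used in the benchmark. The class and method names (OnePassClustering, Centroid, cluster) are hypothetical, and the choice to update each centroid as a running mean of its members is an assumption; the text above only states that every point is compared against every centroid using Euclidean distance and an epsilon threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** One-pass, epsilon-based clustering of 2-D points (illustrative sketch only). */
class OnePassClustering {

    /** A cluster centroid, kept as a running mean of its member points (an assumption). */
    static final class Centroid {
        double x;
        double y;
        int count;

        Centroid(double x, double y) {
            this.x = x;
            this.y = y;
            this.count = 1;
        }

        /** Folds a new point into the running mean. */
        void add(double px, double py) {
            count++;
            x += (px - x) / count;
            y += (py - y) / count;
        }
    }

    /** Euclidean distance: square root of the sum of squared differences. */
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    /**
     * Visits each point once. A point joins the nearest existing centroid if it
     * lies within epsilon of it; otherwise it starts a new cluster.
     * Returns the cluster index chosen for each point.
     */
    static int[] cluster(double[][] points, double epsilon, List<Centroid> centroids) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            double px = points[i][0];
            double py = points[i][1];

            // Compare this point's distance to every centroid found so far.
            int best = -1;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.size(); c++) {
                Centroid ctr = centroids.get(c);
                double d = distance(px, py, ctr.x, ctr.y);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }

            if (best >= 0 && bestDist <= epsilon) {
                centroids.get(best).add(px, py);      // close enough: same cluster
            } else {
                centroids.add(new Centroid(px, py));  // too far: start a new cluster
                best = centroids.size() - 1;
            }
            assignment[i] = best;
        }
        return assignment;
    }
}
```

Under the parameters listed above, each record would contribute 50 such two-column points (100 fields taken 2 at a time), which is where the figure of 500 million operations for 10 million records comes from.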

A Key Piece to the Search Puzzle: DataRush vs. Java and Perl

The Algorithm: Levenshtein distance is a fairly simple word-matching measure introduced by Vladimir Levenshtein in 1965. The distance is expressed as an integer: the minimum number of single-character edits (insertions, deletions, or substitutions) required to make string 1 equal string 2. See the Wikipedia entry on Levenshtein distance for further reading. Used with other fuzzy matching techniques, edit distance could form the basis of a search or information surveillance application. If we did use this algorithm as part of a large enterprise search application, it would be prudent to stress Levenshtein distance computations on massive datasets to understand their performance characteristics. The benchmark below compares edit distance written in Perl or Java (non-threaded) to Pervasive DataRush. The DataRush Java code is substantially the same as the non-threaded Java code, with only minor changes to wrap the edit distance class in a DataRush interface and to implement a DataRush "customizer".

Results: The graph above shows that DataRush is able to take advantage of the multicore SMP server: the algorithm runs in almost 1/7th the previous time (i.e., an 85% reduction in runtime). Unfortunately, the Perl script ran for 20 hours and we were forced to euthanize the poor thing. We are not implying that an expert software engineer couldn't build a multi-threaded Java or Perl application to improve the performance of the non-DataRush implementations. Returning to our Goal above and the Assumptions below: what we are showing is the massive ROI a company can achieve by using any of its Java developers to quickly build highly parallel, data-intensive applications. And in this case (which is not true for all algorithms), if the organization needed the job to run in half the time, it would simply double the cores available to DataRush at runtime.
Not many developers possess the know-how or the time to develop custom threaded Java or Perl applications that can auto-scale without code changes as processors are added.

Assumptions: We assume the DataRush developer has already taken the 1 or 2 weeks to learn our data processing framework. Once this assumption is taken into account, the time to implement with or without DataRush is approximately the same. Also, we assume this is not an I/O test, so for this benchmark we did all data generation in memory for all three implementations.

Technical Parameters:
    4 Dual Core CPUs (8 cores total) with 64 GB RAM
    10 million records of data
    100 fields in each record
    Every 2 columns require an edit distance computation
    Therefore, 500 million Levenshtein operations per run
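
For readers unfamiliar with the edit distance computation described under "The Algorithm" above, here is a minimal single-threaded Java sketch. It is not the benchmark's actual code: the class and method names (EditDistance, levenshtein) are hypothetical, and it uses the standard two-row dynamic programming formulation (Wagner-Fischer), which is only one of several ways to compute the distance.

```java
/** Levenshtein (edit) distance between two strings (illustrative sketch only). */
class EditDistance {

    /**
     * Returns the minimum number of single-character insertions, deletions,
     * or substitutions needed to turn string a into string b.
     */
    static int levenshtein(String a, String b) {
        // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning an empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into an empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, EditDistance.levenshtein("kitten", "sitting") returns 3: two substitutions and one insertion. Each call costs time proportional to the product of the two string lengths, which is why 500 million of them make a reasonable stress test.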


Why I should use DataRush is not really explained. This only explains why there is a difference between standard single-threaded code and some random multithreading framework. There is no really useful information about WHY DataRush should be a BETTER multithreading framework than anything else on the market (or even simpler: use one's brain and just code it so it works!)