490X better performance, no way! Really?

Yes! A little more background on the problem. Part of the DataRush team quickly developed a matching application for a customer. More on that in a later blog entry. The output of the application is record pairs that are considered to be matches. And that's where we stopped as the customer decided to implement their own "roll-up". Not being sure what they meant by roll-up but glad it wasn't on our schedule, we marched on.

That is, until the overall process ran and the "roll-up" step took over 3 hours, while the matching step ran in minutes. OK, now we care about roll-ups, so what are they? A roll-up takes the set of record pairs and rolls them up into larger sets. For instance, record A and record B are matched, and record B and record C are matched, so we want records A, B and C to be in the same set. Only one record in a set wins; the others are thrown out as duplicates. That's easy with 3 records. Not so easy with millions.

Right, doesn't sound too hard. It looks like a disjoint-set problem. One of the guys on the team remembered an algorithm from his undergrad days, did a little Googling and then coded the algorithm in Java. He quickly wrapped that code in a DataRush process. The next step was building an application that read the input data (the output of the matching app), ran it through the disjoint-set algorithm and then wrote the output. DataRush already has very parallelized readers and writers, so that part was easy.
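
For readers who haven't seen a disjoint-set (union-find) structure, here is a minimal sketch of the idea in plain Java. This is not the code we wrapped in DataRush; the class name `RollUp`, the use of String record IDs, and the choice of path compression with union by size are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a disjoint-set roll-up; not the team's actual code.
class RollUp {
    // parent.get(x) is the next hop from x toward its set's representative.
    private final Map<String, String> parent = new HashMap<>();
    // size.get(root) is the number of records in the set headed by root.
    private final Map<String, Integer> size = new HashMap<>();

    // Return the representative of the set containing id,
    // registering id as its own one-record set on first sight.
    String find(String id) {
        parent.putIfAbsent(id, id);
        size.putIfAbsent(id, 1);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every record on the path straight at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Merge the sets containing a and b, i.e. process one matched pair.
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (ra.equals(rb)) {
            return; // already rolled up together
        }
        // Union by size: hang the smaller set under the larger one.
        if (size.get(ra) < size.get(rb)) {
            String tmp = ra;
            ra = rb;
            rb = tmp;
        }
        parent.put(rb, ra);
        size.put(ra, size.get(ra) + size.get(rb));
    }

    // Group every record seen so far under its set's representative.
    Map<String, List<String>> groups() {
        Map<String, List<String>> out = new HashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return out;
    }
}
```

Calling `union("A", "B")` and then `union("B", "C")` leaves A, B and C sharing one representative, which is the roll-up described above. Note that the representative is whichever record the structure happens to pick; choosing the business "winner" of each set would be a separate step over the output of `groups()`.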

Next step: punch the run button and, 22 seconds later, we are done, with the same "roll-up" output that was generated by 3+ hours of running SQL statements. Three hours is 10,800 seconds, and 10,800 / 22 is roughly 490, hence the 490X.

Many data jobs are not well suited to RDBMS processing and run much faster outside of the database. Looks like we found another one!
