Pervasive DataMatcher Finds Duplicate Data

Pervasive DataRush DataMatcher™ is a highly accurate, high-throughput fuzzy matching solution that finds matching records in large data sources. Inconsistencies are inevitable when data comes from multiple sources. Conventional approaches do well at finding exact matches, but they cannot handle variations or even minor inconsistencies in the data, and their performance can degrade as the size of the data increases.

Turn to Pervasive DataRush DataMatcher…

  • If you need to find not only exact matches, but also variations (e.g. Mike and Michael)
  • If you need to clean up duplicates caused by data entry errors (e.g. Michael and Micheal)
  • If you need to correlate records from multiple data sources
  • If you need to quickly find matches in very large and growing data sets

Whether you need to detect fraudulent behavior, load clean data into a BI system, reduce leakage in your claims management process, conform to compliance requirements, or improve data quality, Pervasive DataMatcher is a key part of the solution.

The seemingly straightforward task of data matching on large disparate data sets is a daunting challenge for many organizations. Even with a fairly small data set, comparing each record to every other record can generate an overwhelming number of comparisons. For example, comparing every record in a 100,000-row data set to every other record would involve nearly 5 billion record comparisons (100,000 × 99,999 / 2).
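The arithmetic behind that figure can be checked with a short sketch. It assumes each unordered pair of records is compared exactly once; the function name `pair_count` is purely illustrative and is not part of any Pervasive product.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# A 100,000-row data set compared exhaustively against itself.
print(f"{pair_count(100_000):,}")  # 4,999,950,000
```

Because the count grows with the square of the row count, doubling the data set roughly quadruples the number of comparisons.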

Pervasive DataRush DataMatcher is designed to scale seamlessly on large, complex data sets. It can match on any or all fields in a data set, using exact or fuzzy comparisons. Pervasive DataRush DataMatcher is built on top of the Pervasive DataRush Parallel Dataflow Engine, enabling the solution to crunch through massive data sets quickly and accurately on commodity multicore hardware.

Capabilities

  • Fuzzy matching to detect duplicate records in single or multiple data sets.
  • Multithreaded performance provides unprecedented response times for large data sets.
  • Ability to match on any combination of fields in data sets.
  • Ability to handle multilingual and international character sets with full Unicode support.
  • Data partitioning to segregate data into like groups for deeper comparison, delivering high scalability and better application performance as more processor cores become available.
  • Encodings include Soundex, Refined Soundex, Metaphone, Double Metaphone and substring.
  • Field comparisons using matching algorithms including Levenshtein Edit Distance, Jaro, Jaro-Winkler, Jaro-Hef, Damerau-Levenshtein, Q-gram, Positional Q-gram, prefix, suffix, and exact match.
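To make the partitioning, encoding, and comparison ideas above concrete, here is a minimal single-threaded sketch in Python. It is not DataMatcher's implementation or API. It assumes a standard American Soundex code as the partitioning key and plain Levenshtein edit distance as the field comparison, and the names `soundex`, `levenshtein`, `find_candidate_matches`, and the `max_distance` threshold are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Standard American Soundex letter-to-digit table.
_CODES = {}
for _digit, _letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                         ("4", "l"), ("5", "mn"), ("6", "r")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_candidate_matches(names, max_distance=2):
    """Partition names by Soundex code, then compare only within each group."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                matches.append((a, b))
    return matches


names = ["Michael", "Micheal", "Mike", "Sara", "Sarah"]
print(find_candidate_matches(names))
# [('Michael', 'Micheal'), ('Sara', 'Sarah')]
```

In this sketch only records that share a Soundex code are compared, so the five names need two comparisons instead of ten. Note that "Mike" (M200) lands in a different group from "Michael" (M240), so nickname variations would need a different encoding or comparison than the ones assumed here.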