Pervasive DataMatcher Finds Duplicate Records 

Pervasive DataMatcher™ is a highly accurate, high-throughput fuzzy matching solution that finds matching records in large data sources. Inconsistencies are inevitable when data comes from multiple sources. Conventional approaches do well at finding exact matches, but they cannot handle variations or even minor inconsistencies in the data, and their performance can degrade as the data grows.

Turn to Pervasive DataMatcher…

  • If you need to find not only exact matches, but also variations (e.g. Mike and Michael)
  • If you need to clean up duplicates that result from data entry errors (e.g. Michael and Micheal)
  • If you need to correlate records from multiple data sources 
  • If you need to quickly find matches in very large and growing data sets

Whether you need to detect fraudulent behavior, load clean data into a BI system, reduce leakage in your claims management process, conform to compliance requirements, or improve data quality, Pervasive DataMatcher is a key part of the solution.

The seemingly straightforward task of matching data across large, disparate datasets is a daunting challenge for many organizations. Even with a fairly small dataset, comparing each record to every other record can generate an overwhelming number of comparisons. For example, comparing every record in a 100,000-row dataset to every other record would involve nearly 5 billion record comparisons.
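The "nearly 5 billion" figure follows from counting unordered pairs: n records yield n × (n − 1) / 2 distinct comparisons. The short Python sketch below checks that arithmetic; the function name `pairwise_comparisons` is purely illustrative and is not part of any Pervasive product.

```python
import math


def pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs when each of n records is
    compared to every other record exactly once: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    n = 100_000
    total = pairwise_comparisons(n)
    # Cross-check against the standard library's "n choose 2".
    assert total == math.comb(n, 2)
    print(f"{n:,} records -> {total:,} comparisons")
```

Because the count grows with the square of the row count, doubling the dataset roughly quadruples the number of comparisons.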

Pervasive DataMatcher is designed to scale seamlessly on large, complex datasets and can match, exactly or fuzzily, on any or all fields in a dataset. It is built on top of the Pervasive DataRush Parallel Dataflow Engine, enabling the solution to crunch through massive datasets quickly and accurately on commodity multicore hardware.


Capabilities

  • Fuzzy matching to detect duplicate records in single or multiple datasets
  • Multithreaded performance that delivers fast response times on large datasets
  • Ability to match on any combination of fields in datasets
  • Ability to handle multilingual and international character sets with full Unicode support
  • Data partitioning that segregates data into like groups for deeper comparison, delivering high scalability and better application performance as more processor cores become available
  • Encodings including Soundex, Refined Soundex, Metaphone, Double Metaphone and substring
  • Field comparisons using matching algorithms including Levenshtein Edit Distance, Jaro, Jaro-Winkler, Jaro-Hef, Damerau-Levenshtein, Q-gram, Positional Q-gram, prefix, suffix and exact match
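
To make the partition-then-compare idea concrete, here is a minimal Python sketch that groups names by a Soundex encoding and then compares records only within each group using Levenshtein edit distance. It is an illustration of the general technique, not Pervasive DataMatcher's implementation or API: the function names (`soundex`, `levenshtein`, `find_candidate_duplicates`), the sample names, and the `max_distance` threshold are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Standard American Soundex letter-to-digit table.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")  # vowels (and unmapped letters) map to ""
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_candidate_duplicates(names, max_distance=2):
    """Partition names by Soundex code, then compare only within a block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                matches.append((a, b))
    return matches


if __name__ == "__main__":
    sample = ["Michael", "Micheal", "Smith", "Smyth", "Jones"]
    print(find_candidate_duplicates(sample))
```

In this sketch, "Michael" and "Micheal" share the Soundex code M240 and differ by an edit distance of 2, so they are reported as a candidate pair, while "Jones" sits alone in its block and is never compared. Partitioning trades some recall for speed: "Mike" encodes to M200 and would not land in the same block as "Michael", so a nickname variation like that would need a different encoding or comparison strategy.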