PDR Data Mining Operators
The Pervasive DataRush (PDR) Marketing and Sales Optimization system is a scalable and efficient dataflow implementation of an end-to-end Knowledge Discovery Process. PDR is a library and dataflow engine used to construct and execute dataflow graphs in Java. The framework handles all threading and synchronization, because operators share data only through their inputs and outputs.
An operator is written by extending an existing class in the framework. In addition to the dataflow engine, PDR ships with a library of common operators. A dataflow graph is composed by adding operators to it. Each operator requires its input sources at construction time, so outputs are wired to inputs as you build the graph. Once you have finished composing, you invoke the run method and the graph begins execution. Because all of this is done in Java, the graph can be composed conditionally, based on processing performed before execution.
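The composition pattern described above can be illustrated with a toy sketch in plain Java. It does not use the PDR API: the `ToyGraph` class and its `source`, `map`, `sink`, and `run` methods are hypothetical names invented for this illustration. The sketch only shows the general idea that each operator receives its upstream input when it is constructed, that operators exchange data solely through queues, and that nothing executes until `run` is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Toy illustration only: this is NOT the PDR API. Every name is hypothetical.
final class ToyGraph {
    private static final Object END = new Object(); // end-of-stream marker

    // An edge of the graph: the only channel through which operators share data.
    static final class Port<T> {
        private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);
    }

    private interface Body { void run() throws InterruptedException; }

    private final List<Runnable> operators = new ArrayList<>();

    // Registers an operator body; it runs on its own thread once run() is called.
    private void add(Body body) {
        operators.add(() -> {
            try {
                body.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Source operator: emits the given (non-null) items, then the end marker.
    <T> Port<T> source(List<T> items) {
        Port<T> out = new Port<>();
        add(() -> {
            for (T item : items) out.queue.put(item);
            out.queue.put(END);
        });
        return out;
    }

    // Map operator: needs its input port at construction, so wiring happens here.
    <A, B> Port<B> map(Port<A> in, Function<A, B> fn) {
        Port<B> out = new Port<>();
        add(() -> {
            for (Object o = in.queue.take(); o != END; o = in.queue.take()) {
                @SuppressWarnings("unchecked") A a = (A) o;
                out.queue.put(fn.apply(a));
            }
            out.queue.put(END);
        });
        return out;
    }

    // Sink operator: collects everything it receives into the returned list.
    <T> List<T> sink(Port<T> in) {
        List<T> results = new ArrayList<>();
        add(() -> {
            for (Object o = in.queue.take(); o != END; o = in.queue.take()) {
                @SuppressWarnings("unchecked") T t = (T) o;
                results.add(t);
            }
        });
        return results;
    }

    // Starts one thread per operator and waits for all of them to finish.
    void run() {
        List<Thread> threads = new ArrayList<>();
        for (Runnable op : operators) {
            Thread t = new Thread(op);
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running graph", e);
        }
    }

    public static void main(String[] args) {
        ToyGraph graph = new ToyGraph();
        Port<Integer> stage = graph.source(Arrays.asList(1, 2, 3, 4));
        boolean squareValues = args.length == 0; // decided before execution
        if (squareValues) stage = graph.map(stage, n -> n * n);
        List<Integer> out = graph.sink(stage);
        graph.run();
        System.out.println(out); // with no arguments: [1, 4, 9, 16]
    }
}
```

The `if (squareValues)` line is the point of interest: because the graph is built with ordinary Java statements, an operator can be included or left out before anything runs.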
Scalability refers to the ability to speed up your applications linearly with the number of resources available. The common library of operators in PDR is designed to adjust the degree of parallelism automatically to the hardware, so dataflow graphs built with PDR tend to scale with the number of cores. The most common scalability practice is therefore to identify bottlenecks in the user-designed portions of the dataflow graph, which become obvious as the number of cores increases, and then to parallelize those portions with the built-in PDR partitioning operators.
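To make the partition-and-parallelize idea concrete, here is a minimal sketch in plain Java. It uses `java.util.concurrent` rather than PDR's own partitioning operators, whose API is not shown in this document. The hypothetical `PartitionSketch.parallelSum` method splits an array of rows into contiguous chunks, processes each chunk on a thread pool, and combines the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: plain java.util.concurrent, not PDR partitioning operators.
final class PartitionSketch {

    // Sums `values` by splitting it into `partitions` contiguous row ranges.
    static double parallelSum(double[] values, int partitions) {
        if (partitions < 1) throw new IllegalArgumentException("partitions must be >= 1");
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Double>> partials = new ArrayList<>();
            int chunk = (values.length + partitions - 1) / partitions; // ceiling division
            for (int p = 0; p < partitions; p++) {
                final int start = Math.min(values.length, p * chunk);
                final int end = Math.min(values.length, start + chunk);
                // Each task reads only its own row range, so no locking is needed.
                Callable<Double> task = () -> {
                    double sum = 0.0;
                    for (int i = start; i < end; i++) sum += values[i];
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            double total = 0.0;
            for (Future<Double> partial : partials) total += partial.get();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("partition failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] rows = new double[1_000];
        for (int i = 0; i < rows.length; i++) rows[i] = i + 1; // 1..1000
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(rows, cores)); // 500500.0
    }
}
```

Sizing the pool from `availableProcessors()` mirrors the idea of matching parallelism to the hardware. Whether throughput actually improves depends on how much work each partition does relative to the cost of coordinating the threads.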
The solution includes PDR data mining operators for the following tasks:
- Data Preprocessing using vertical parallelism (partitioned by columns/attributes)
  - Calculate Basic Statistics: Mean, Median, Standard Deviation, Kurtosis, Skewness, Min, Max
  - Calculate Distinct Values on Categorical Variables for Discretization
  - Calculate Pearson Correlation Coefficient for Feature Selection
  - Calculate Basic Statistics on Missing Values: counts, counts by rows, counts by columns
- Data Cleansing using horizontal parallelism (partitioned by rows/customer records)
  - Missing Value Replenishment
    - Missing Value Imputation
    - Impute columns with standard deviation below a threshold
  - Data Normalization
    - Z-Scaling (zero mean, 1 std dev)
    - Range Scaling (Max – Min range)
    - Log Scaling for Skewness less than threshold
- Feature Selection using horizontal parallelism (partitioned by rows/customer records)
  - Pearson Correlation to target/labels
- Classification Model
  - Winnow Algorithm for numerical variables (vertically partitioned)
  - Naïve Bayes algorithm for categorical variables (horizontally partitioned)
- Evaluation
  - Prediction Accuracy, % correctly classified
  - Sensitivity and Specificity based on True Positives, False Positives, True Negatives, False Negatives
  - ROC analysis, Area-under-Curve (AUC)
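Several of the calculations named in this list are short enough to sketch directly. The plain-Java example below is not PDR operator code, and the `MiningSketch` class and its method names are hypothetical. It assumes the population standard deviation (dividing by n) and shows textbook versions of Z-scaling, range scaling, the Pearson correlation coefficient, and the accuracy, sensitivity, and specificity measures computed from confusion-matrix counts.

```java
// Illustration only: textbook formulas in plain Java, not PDR operators.
final class MiningSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    // Population standard deviation (divides by n, not n - 1): an assumption.
    static double stdDev(double[] v) {
        double m = mean(v), ss = 0.0;
        for (double x : v) ss += (x - m) * (x - m);
        return Math.sqrt(ss / v.length);
    }

    // Z-scaling: shift to zero mean and scale to unit standard deviation.
    // A constant column (std dev 0) is mapped to all zeros.
    static double[] zScale(double[] v) {
        double m = mean(v), sd = stdDev(v);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = sd == 0.0 ? 0.0 : (v[i] - m) / sd;
        return out;
    }

    // Range scaling: map values onto [0, 1] using the max - min range.
    static double[] rangeScale(double[] v) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double x : v) { min = Math.min(min, x); max = Math.max(max, x); }
        double range = max - min;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = range == 0.0 ? 0.0 : (v[i] - min) / range;
        return out;
    }

    // Pearson correlation between a feature column x and the target y.
    // Returns NaN if either column is constant.
    static double pearson(double[] x, double[] y) {
        double mx = mean(x), my = mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Fraction of all records that were classified correctly.
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // Sensitivity (true positive rate) = TP / (TP + FN).
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Specificity (true negative rate) = TN / (TN + FP).
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    public static void main(String[] args) {
        double[] feature = {1, 2, 3, 4};
        double[] target = {0, 0, 1, 1};
        System.out.println("pearson     = " + pearson(feature, target));
        System.out.println("accuracy    = " + accuracy(8, 9, 1, 2));
        System.out.println("sensitivity = " + sensitivity(8, 2));
        System.out.println("specificity = " + specificity(9, 1));
    }
}
```

Each method works on a single column or a handful of counts, which is what makes these tasks easy to partition: per-column statistics can run on separate columns at the same time, and row-wise scaling can run on separate row ranges once the column's mean and standard deviation are known.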
Benefits
- Data Preprocessing using vertical parallelism
- Data Cleansing using horizontal parallelism
- Feature Selection using horizontal parallelism
- Classification Model
- Evaluation
Additional PDR Solutions: