Data Visualization & Analysis

GeneXproTools Online Guide Learn how to use the 5 modeling platforms of GeneXproTools with the Online Guide

Last update: February 19, 2014

Introduction
Datasets
Variables & Models
Variable Charts & Analyses: Sequential Distribution Chart; Bivariate Line Chart; Histogram; Scatter Plot; Statistics Charts; Variable Importance; Summary Statistics; Outlier Detection & Removal; Model Analysis
Record Charts & Analyses: Sequential Distribution Chart; Histogram; Scatter Plot; Summary Statistics; Error Analysis

Introduction

The Data Panel of GeneXproTools is a multifaceted platform for analyzing and visualizing both your variables and your records. The rich web of interactions between different kinds of variables (original input variables, derived variables, all the models in the Run History, and specific model variables) and different datasets (the training dataset, the validation dataset, or both) lets you perform powerful analyses of both.

Datasets

In the Data Panel, GeneXproTools lets you choose which dataset to work with, so that you can analyze and visualize your variables and your records for a particular set of data. You can analyze all of your data as a whole, or you can analyze the training and validation/test datasets separately. For example, you can check for outliers in all your variables using all of your data, or using just the training or the validation dataset.

Or you can compare the histograms of certain variables in the training and validation datasets with each other and with their overall distribution in the union of both datasets, just by switching from one dataset to the other.

Or you can visualize correlations between each independent variable and the response (the dependent variable) or between any pair of variables in your data, including original and derived variables, all the generated models and their variables, and all types of model outputs, which in Classification and Logistic Regression include not only the predicted class but also the raw model output and the probability.

Or, on the Record Analytics Platform of the Data Panel, you can compare the record prototypes for different subsets of variables both in the training and validation sets and then see how they compare with the overall prototype evaluated for the whole data.

Variables & Models

The new sophisticated Data Panel of GeneXproTools is a powerful analytics platform for analyzing and visualizing not only the variables of the input data (the Original Variables), but also all the new features you create from the original variables (the Derived Variables) and all the models you generate (the History Models and the Active Model with its specific Model Variables). For Classification and Logistic Regression problems, this also includes being able to choose the type of model output: the raw model output, the probability or the predicted class.

Thus, in the new Data Panel you can not only study your original variables but also see how they relate to the new features and the models you create. For example, you can analyze the correlation between all the variables of a model and the model output.

Or you can see exactly what points are associated with classification errors in your classification and logistic regression models by highlighting the misclassifications. In Regression and Time Series Prediction you can also highlight model outliers based on the relative or absolute error. This functionality of highlighting different subsets of records is available for the Sequential Distribution Chart, the Bivariate Line Chart and the Scatter Plot.

Or you can evaluate and visualize the variable importance of all the variables in a model, which you can perform quickly and easily for all the models in the Run History by browsing the models right in the Data Panel.

On the Record Analytics Platform you can browse different subsets of records by choosing different categories in the Browse Records combobox. For example, in Classification and Logistic Regression problems you can browse just the positives or negatives, the records that were misclassified for each of the models in the Run History, or just the false positives or the false negatives. In Regression and Time Series Prediction you can also browse different record categories, such as different kinds of model outliers and hits.

Variable Charts & Analyses

In the variable charts and analyses, variables are broadly defined to include not only the original variables but also the derived variables and the models (which are, in effect, a special kind of derived variable).

By combining all these kinds of variables with different charts and with tools that highlight particular records, such as outliers and misclassifications, you can perform powerful analyses quickly and easily in the Variable Analytics Platform of GeneXproTools, and so better understand both your input variables and the models you generate with them.

Sequential Distribution Chart

The Sequential Distribution Chart, with the option to show the standard deviation lines and the average line, offers a simple and very effective way of detecting outliers and analyzing the distribution of values for all your variables.

You can also use the Sequential Distribution Chart to highlight points that are being misclassified by the current model or that result in strong or weak responses of the current model (these responses are called model outliers as they are defined in relation to a model).

The Sequential Distribution Chart can also be used to analyze the distribution of positive and negative records across a variable range to help you spot simple patterns in your data.

Bivariate Line Chart

The Bivariate Line Chart is a flexible tool for comparing any pair of variables. With the Bivariate Line Chart you can select any two variables and then plot them in their original order or sorted in different ways. This chart also allows you to scale your variables so that you can compare them in a more meaningful way. In addition, you can use this chart in combination with different types of Highlight Options, such as misclassifications, false positives, false negatives, true positives, true negatives, positives, negatives, outliers and hits.
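To illustrate why scaling helps when two variables share one chart, the sketch below rescales each variable to the range 0 to 1. Min-max scaling and the function name `min_max_scale` are assumptions made for this example only; this guide does not state which scaling method the Bivariate Line Chart uses.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (assumed scaling method)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant variable has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Two variables on very different scales become directly comparable.
income = [20000, 35000, 50000]
age = [20, 30, 40]
scaled_income = min_max_scale(income)
scaled_age = min_max_scale(age)
```

After scaling, both series run from 0 to 1, so their shapes can be compared on the same axis regardless of their original units.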

Histogram

With the Histogram you can visualize very quickly the distribution of values of all your variables. GeneXproTools allows you to browse easily from one variable to the other and also change the number of bins in your histograms.
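The effect of changing the number of bins can be sketched as follows. This is a minimal illustration that assumes equal-width bins spanning the minimum to the maximum value; the function name `histogram_counts` is hypothetical and is not part of GeneXproTools.

```python
def histogram_counts(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: put everything in the first bin
        else:
            # The maximum value would land one past the end, so clamp it.
            index = min(int((v - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts
```

Calling the function with a small and then a large `n_bins` on the same variable shows how fewer bins smooth the distribution while more bins expose gaps.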

Also interesting is the analysis of the distribution of model outputs, especially in Classification and Logistic Regression where you can analyze how your models are evaluating the classifications: is there a clear separation between the two classes or is there some overlap?

Scatter Plot

Like the Bivariate Line Chart, the Scatter Plot allows the comparison of any pair of variables, which are easily selected using the up-down controls for the X-axis and the Y-axis.

Scatter Plots are powerful analytic tools for showing the correlation between two variables, especially when the regression line and the regression equation with its slope and intercept are also shown.

Scatter Plots are also useful for detecting outliers, as these points usually fall far away from the main cloud of points.

The Scatter Plots of GeneXproTools can also be combined with different highlighting and synchronization tools, so that model outliers, misclassifications and other record categories can be easily spotted.

In addition, the Sync functionality makes it easy to compare different models. For example, you can very quickly determine whether your models tend to misclassify the same records or whether they seem to be capturing different patterns in the data and are therefore good candidates for creating a better ensemble.
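The idea of checking whether two models misclassify the same records can also be expressed numerically. The sketch below is not the Sync functionality itself; it uses the Jaccard overlap of the two error sets as an assumed measure, and the function names are hypothetical.

```python
def misclassified(actuals, predicted):
    """Return the set of record indices where the prediction is wrong."""
    return {i for i, (a, p) in enumerate(zip(actuals, predicted)) if a != p}


def error_overlap(actuals, predicted_a, predicted_b):
    """Jaccard overlap (0 to 1) between the error sets of two models."""
    errors_a = misclassified(actuals, predicted_a)
    errors_b = misclassified(actuals, predicted_b)
    union = errors_a | errors_b
    if not union:
        # Neither model makes any error; treat the error sets as identical.
        return 1.0
    return len(errors_a & errors_b) / len(union)
```

A value near 1 suggests the two models fail on the same records, while a value near 0 suggests they make different mistakes and may complement each other in an ensemble.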

Statistics Charts

The Statistics Charts offer a simple and clear way of analyzing the summary statistics of all your variables: minimum, maximum, average, median, and standard deviation. Moreover, for all different subsets of variables GeneXproTools also plots the slope, intercept, correlation coefficient and R-square, all evaluated against the response variable.

These statistics are also shown on the Statistics Report, but the Statistics Charts aggregate them all together by statistic so that you can visualize the summary statistics of all your variables quickly with just one glance. For example, if you have a lot of models, you can compare their summary statistics very quickly by browsing all the Statistics Charts in succession.

Variable Importance

The Variable Importance Chart is a special kind of Statistics Chart, as it pertains only to the variables of a model. So, by selecting Model Variables in the Variables Combobox, GeneXproTools computes the variable importance of all the variables of the current model and shows the results both in the Statistics Report and in the Variable Importance Chart.

By browsing all the models in the Run History, you can quickly analyze the type and importance of the variables in every one of your models.

GeneXproTools uses a sophisticated stochastic method to compute the variable importance of all the variables in a model. For all kinds of models (Classification, Logistic Regression, Regression, Time Series Prediction and Logic Synthesis), the importance of each variable is computed by randomizing its input values and then computing the decrease in the R-square between model outputs and actual values. The results for all variables are then normalized so that they add up to 1.
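The procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the GeneXproTools implementation: "randomizing" is taken to mean shuffling a variable's values across records, R-square is taken to be the squared Pearson correlation, the drop is averaged over `n_repeats` shuffles, and negative drops are clipped to zero. The names `r_square`, `variable_importance`, `n_repeats` and `seed` are hypothetical.

```python
import random


def r_square(outputs, actuals):
    """Squared Pearson correlation between model outputs and actual values."""
    n = len(outputs)
    mean_o = sum(outputs) / n
    mean_a = sum(actuals) / n
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_aa = sum((a - mean_a) ** 2 for a in actuals)
    s_oa = sum((o - mean_o) * (a - mean_a) for o, a in zip(outputs, actuals))
    if s_oo == 0 or s_aa == 0:
        return 0.0  # a constant series carries no correlation information
    return (s_oa * s_oa) / (s_oo * s_aa)


def variable_importance(model, records, actuals, n_repeats=10, seed=0):
    """Normalized importance of each input variable of `model`.

    `model` is any callable that maps one record (a list of inputs) to an output.
    """
    rng = random.Random(seed)
    baseline = r_square([model(r) for r in records], actuals)
    drops = []
    for j in range(len(records[0])):
        total_drop = 0.0
        for _ in range(n_repeats):
            # Shuffle variable j across records (assumed meaning of "randomizing").
            column = [r[j] for r in records]
            rng.shuffle(column)
            shuffled = []
            for record, value in zip(records, column):
                row = list(record)
                row[j] = value
                shuffled.append(row)
            total_drop += baseline - r_square([model(r) for r in shuffled], actuals)
        # Assumption: a variable whose shuffling does not hurt gets importance 0.
        drops.append(max(total_drop / n_repeats, 0.0))
    total = sum(drops)
    if total == 0:
        return [0.0] * len(drops)
    # Normalize so that the importances add up to 1.
    return [d / total for d in drops]
```

A variable the model never uses receives an importance of 0, because shuffling it leaves the model outputs unchanged.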

Summary Statistics

Summary statistics for all kinds of variables (original, derived and models) and across different datasets are computed and shown in the Data Panel, both in the Statistics Report and in the Statistics Charts. These statistics include:

Minimum
Maximum
Average
Median
Standard Deviation
Slope vs Response
Intercept vs Response
Correlation Coefficient vs Response
R-square vs Response
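For reference, the statistics listed above can be computed for a single variable as in the sketch below. Two details are assumptions, since this guide does not specify them: the standard deviation is the sample standard deviation, and the slope and intercept come from regressing the response on the variable (response = slope × variable + intercept). The function name `summary_statistics` is hypothetical.

```python
import math
import statistics


def summary_statistics(values, response):
    """Summary statistics of one variable, including its regression against the response."""
    mean_x = statistics.fmean(values)
    mean_y = statistics.fmean(response)
    s_xx = sum((x - mean_x) ** 2 for x in values)
    s_yy = sum((y - mean_y) ** 2 for y in response)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(values, response))
    nan = float("nan")
    # Regression statistics are undefined for a constant variable or response.
    slope = s_xy / s_xx if s_xx else nan
    intercept = mean_y - slope * mean_x if s_xx else nan
    correlation = s_xy / math.sqrt(s_xx * s_yy) if s_xx and s_yy else nan
    return {
        "minimum": min(values),
        "maximum": max(values),
        "average": mean_x,
        "median": statistics.median(values),
        "standard_deviation": statistics.stdev(values),  # sample std dev (assumed)
        "slope": slope,
        "intercept": intercept,
        "correlation_coefficient": correlation,
        "r_square": correlation ** 2,
    }
```

Because models are treated as a special kind of variable, the same function applies equally to a list of model outputs.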

Outlier Detection & Removal

Detecting and removing outliers in input variables is important, as outliers may have a negative impact on modeling. GeneXproTools provides you with different tools for detecting outliers and allows you to remove them from your datasets through the Delete Records Window.

Different charts can be used to detect outliers in GeneXproTools, including the Sequential Distribution Chart, the Histogram and the Scatter Plot. For example, the Sequential Distribution Chart, with both the average and standard deviation lines clearly visible, provides a simple and very efficient way of spotting outliers. Moreover, because GeneXproTools lets you copy the IDs of all the outliers in the format required by the Delete Records Window, you can remove all the outliers quickly and easily.

The Histogram offers another dimension to outlier detection, allowing you to see quickly if there are any gaps in the distribution of values.

The Scatter Plot can also be used to detect outliers, letting you quickly visualize the cloud of points and see whether there are any outliers worth investigating.
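The standard-deviation approach used with the Sequential Distribution Chart can be sketched in a few lines. This is an illustration only: the cutoff `n_std`, the use of the population standard deviation, the 1-based record IDs, and the function name `find_outlier_ids` are all assumptions, and the output is a plain list rather than the format expected by the Delete Records Window.

```python
import statistics


def find_outlier_ids(values, n_std=3.0):
    """Return 1-based IDs of records lying beyond average +/- n_std standard deviations."""
    average = statistics.fmean(values)
    std_dev = statistics.pstdev(values)  # population std dev (assumed)
    lower = average - n_std * std_dev
    upper = average + n_std * std_dev
    return [i for i, v in enumerate(values, start=1) if v < lower or v > upper]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
outlier_ids = find_outlier_ids(readings, n_std=2.0)
```

Note that an extreme value inflates the standard deviation itself, so on small datasets a strict cutoff can miss an obvious outlier; this is one reason to confirm the result visually with the Histogram or the Scatter Plot.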

Model Analysis

By choosing History Models or Model Variables in the Variables Combobox, GeneXproTools allows you to perform a myriad of model analyses in the Data Panel. For example, you can compare summary statistics across all models such as minimum and maximum values, average and standard deviation, correlation coefficient and R-square, and slope and intercept values.

For Classification and Logistic Regression, GeneXproTools also allows you to compare your models using different model outputs: the raw model output, the probability and the predicted class.

For each model GeneXproTools computes the variable importance of all its variables, showing the results both in the Statistics Report and in the Variable Importance Chart.

A simpler kind of variable importance can be inferred from the analysis of scatter plots, where each model variable can be plotted both against the response and the model output. In Classification and Logistic Regression you can also perform these analyses against the raw model output and the probability.

The summary of the regression analysis of all model variables against the response can be quickly accessed by selecting the Correlation Coefficient Chart or the R-square Chart in the Statistics Charts.

Record Charts & Analyses

GeneXproTools also supports extensive record analyses in the Record Analytics Platform in the Data Panel. In the Record Analytics Platform you can analyze different types of records using different charts and browsing tools. For example, by comparing each record with different record prototypes you can gain insight into both your data and your models.

Sequential Distribution Chart

The Sequential Distribution Chart allows you to browse quickly all your records or subsets of records in your datasets, such as the positive or negative records. By plotting each record side by side with different record prototypes, such as the global centroid or medoid, class centroids or medoids and so on, it allows a much faster analysis and understanding of your data. For example, even for multivariate problems with 5-20 predictor variables, you can easily spot basic patterns in your data. In a dataset with 9 predictor variables for diagnosing breast cancer, for instance, one can clearly identify, even without a model, whether a patient has breast cancer by comparing their results with the class centroids and the global centroid.
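The visual comparison of a record with the class centroids has a simple numerical analogue, sketched below. It assumes that a centroid is the per-variable average of a group of records and that closeness is measured with Euclidean distance; neither detail is stated in this guide, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def centroid(records):
    """Per-variable average of a list of records."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]


def class_centroids(records, labels):
    """Centroid of the records belonging to each class."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    return {label: centroid(group) for label, group in groups.items()}


def nearest_class(record, centroids):
    """Class whose centroid is closest to the record (Euclidean distance, assumed)."""
    return min(centroids, key=lambda label: math.dist(record, centroids[label]))
```

For example, with two well-separated classes, a new record is assigned to the class whose centroid it resembles most, which mirrors what the chart lets you judge by eye.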

The Sequential Distribution Chart is also an essential tool for error analysis, as it allows you to browse the records that were misclassified by your models: all the misclassifications, or just the false negatives or the false positives. By performing this type of error analysis, you can gain insight into both your models and your data. For example, error analysis can give you ideas for improving your models, such as requiring additional tests or variables or creating new features that will improve their predictive accuracy.

Histogram

By plotting the histogram of each record, GeneXproTools allows you to study the distribution of values across different types of variables. The detailed analysis of record histograms might be useful for error analysis to help you find patterns in the histograms of records that, for example, are being systematically misclassified by most models.

The analysis of Record Histograms is particularly useful for datasets with a large number of variables, especially if they are scaled. GeneXproTools allows you to normalize your data for visualization purposes and then revert to the raw inputs once you have finished analyzing your data.

Also interesting is the analysis of record histograms of large model ensembles. For example, by browsing the subset of misclassified records you can gain insights into both your models and your data.

Scatter Plot

The Records Scatter Plot of GeneXproTools goes beyond a simple scatter plot of pairs of records. By allowing you to plot records against pre-computed record prototypes, such as the global centroid or the class centroids of classification and logistic regression data, the Records Scatter Plot can also be used to perform a wide range of analyses, including error analysis. For example, browsing just the model outliers or the misclassified records and plotting them against selected record prototypes can reveal important aspects of both your data and your models.

Summary Statistics

GeneXproTools also provides summary statistics for all the record prototypes it supports (global centroid, global medoid and global extrema for all modeling categories and class centroids and class medoids for Classification, Logistic Regression and Logic Synthesis) and shows them in the Statistics Report.
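Two of the global prototypes named above can be sketched as follows. The definitions used here are assumptions, since this guide does not define them: the medoid is taken to be the actual record with the smallest total Euclidean distance to all other records, and the extrema are taken to be the per-variable minimum and maximum. The function names are hypothetical.

```python
import math


def global_medoid(records):
    """The record with the smallest total distance to all records (assumed definition)."""
    return min(records, key=lambda candidate: sum(math.dist(candidate, r) for r in records))


def global_extrema(records):
    """Per-variable minimum and maximum across all records (assumed definition)."""
    columns = list(zip(*records))
    return [min(c) for c in columns], [max(c) for c in columns]
```

Unlike a centroid, a medoid under this definition is always one of the records in the dataset, which makes it less sensitive to extreme records.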

GeneXproTools also provides real time statistics for the record under study in the Record Charts by synchronizing the record stats in the Statistics Report with the current chart.

Error Analysis

All record charts (Sequential Distribution Chart, Scatter Plot and Histogram) can be used to perform error analysis, each contributing in a different way. In all cases, you select either a subset of misclassified records (all misclassifications or just the false positives or false negatives) in Classification and Logistic Regression problems, or the model outliers (records whose prediction error is above a pre-specified threshold) in Regression and Time Series Prediction problems. For example, you can choose to browse only the records for which the prediction error is above a pre-specified threshold, say 10% of the actual value.
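Selecting model outliers by an error threshold can be sketched as follows. The function name `model_outliers`, its parameters, the 0-based indices, and the handling of an actual value of zero are assumptions made for this example.

```python
def model_outliers(actuals, predictions, threshold=0.10, relative=True):
    """Return 0-based indices of records whose prediction error exceeds the threshold.

    With relative=True the threshold is a fraction of the actual value
    (0.10 means 10%); with relative=False it is an absolute error.
    """
    flagged = []
    for i, (actual, predicted) in enumerate(zip(actuals, predictions)):
        error = abs(predicted - actual)
        if relative:
            if actual == 0:
                # Assumption: relative error is undefined here, so flag any miss.
                error = float("inf") if error > 0 else 0.0
            else:
                error = error / abs(actual)
        if error > threshold:
            flagged.append(i)
    return flagged
```

With the default 10% relative threshold, a prediction of 60 for an actual value of 50 is flagged, while a prediction of 105 for an actual value of 100 is not.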

In Classification and Logistic Regression, GeneXproTools allows you to specify exactly the type of misclassification, so that you can analyze only the false positives, only the false negatives, or all the misclassifications.

GeneXproTools also allows you to browse the correct classifications, either the true positives or the true negatives or both, which of course is also important for error analysis.
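The four record categories mentioned above can be derived from the actual and predicted classes as in this sketch. The function name `categorize_records`, the dictionary keys, and the convention that the positive class is labeled 1 are assumptions for illustration.

```python
def categorize_records(actuals, predicted, positive=1):
    """Group 0-based record indices into the four classification outcome categories."""
    categories = {
        "true_positives": [],
        "true_negatives": [],
        "false_positives": [],
        "false_negatives": [],
    }
    for i, (actual, prediction) in enumerate(zip(actuals, predicted)):
        if prediction == positive:
            key = "true_positives" if actual == positive else "false_positives"
        else:
            key = "false_negatives" if actual == positive else "true_negatives"
        categories[key].append(i)
    return categories
```

The misclassifications are the union of the false positives and the false negatives, and the correct classifications are the union of the true positives and the true negatives.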