Last update: February 19, 2014
Data Visualization & Analysis
Introduction
The Data Panel of GeneXproTools is a multifaceted platform for analyzing and
visualizing both your variables and your records. You can combine different kinds of
variables (original input variables, derived variables, all the models in
the run History, and specific model variables) with different
datasets (the training dataset, the validation dataset, or
both) to perform powerful analyses.
Datasets
In the Data Panel, GeneXproTools allows you to choose different
datasets so that you
can analyze and visualize both your variables and your
records
for particular sets of data. You can either
analyze all of your data as a whole or
analyze the training and validation/test datasets separately. For example, you can check for
outliers
in all your variables using all of your data or just the training or validation dataset.
Or you can compare the histograms of certain variables in the training and
validation datasets, comparing them with their overall distribution in the union of
both datasets, just by switching from one dataset to the other.
Or you can visualize correlations between each independent variable and the response
(the dependent variable) or
between any pair of variables in your data, including original and derived variables,
all the generated models and their variables and
all types of model outputs, which in Classification and
Logistic Regression include
not only the predicted class but also the raw model output and the probability.
Or, on the Record Analytics Platform of the Data Panel, you can compare the
record prototypes for
different subsets of variables
both in the training and validation sets and then see how they compare with the
overall prototype evaluated for the whole data.
Variables & Models
The new sophisticated Data Panel of GeneXproTools is a
powerful analytics platform for
analyzing and visualizing not only the variables of the input data (the Original Variables),
but also all the new features you create from the original variables (the Derived Variables) and
all the models you generate (the History Models and the
Active Model with its specific
Model Variables). For Classification and Logistic Regression problems, this also includes
being able to choose the type of model output: the
raw model output, the probability
or
the predicted class.
Thus, in the new Data Panel you can not only study your original variables but also see
how they relate to the new features and the models you create. For example, you can analyze
the correlation between all the variables of a model and the model output.
Or you can see exactly what points are associated with classification errors in
your classification and logistic regression models
by highlighting the misclassifications. In
Regression and Time Series Prediction you can also
highlight model outliers based on the relative or
absolute error. This
functionality of highlighting different subsets of
records is available for the Sequential Distribution
Chart, the Bivariate Line Chart and the Scatter Plot.
Or you can evaluate and visualize the variable importance of all the variables in a model,
which you can perform quickly and easily for all the models in the Run History by browsing
the models right in the Data Panel.
On the Record Analytics Platform you can browse
different subsets of records by choosing
different categories in the Browse Records
combobox. For example, in Classification and
Logistic Regression problems you can browse just the
positives or negatives, the records that were misclassified for each of
the models in the Run History, or just the false
positives or the false negatives. In Regression and
Time Series Prediction you can also browse different
record categories, such as different kinds of model
outliers and hits.
Variable Charts & Analyses
Variable charts and analyses cover variables in a broad sense:
not only the original variables but also the
derived variables and the models
(which are themselves a special kind of derived variable).
By combining all these kinds of variables with different charts and with visualization tools
that highlight particular records, such as outliers and misclassifications, you can
perform powerful analyses quickly and easily
in the Variable Analytics Platform of
GeneXproTools and so better understand
both your input variables and the models you generate with them.
Sequential Distribution Chart
The Sequential Distribution Chart, with the option to show the
standard deviation lines and
the average line, offers a simple and very effective way of
detecting outliers and analyzing the
distribution of values for all your variables.
You can also use the Sequential Distribution Chart to
highlight points that are being
misclassified by the current model or that result in strong or weak responses of
the current model (these responses are called
model outliers as they are defined in relation to
a model).
The Sequential Distribution Chart can also be used to analyze the distribution of
positive and negative records across a variable range to help you spot simple patterns
in your data.
Bivariate Line Chart
The Bivariate Line Chart is a flexible tool for comparing any pair of
variables. With the Bivariate Line Chart you can select
any two variables and then plot them in their original order or sorted in different ways.
This chart also
allows you to scale your variables so that you can compare them in a
more meaningful way.
You can also use this chart in combination with different types of
Highlight Options, such as
misclassifications, false positives, false
negatives, true positives, true negatives,
positives, negatives, outliers and hits.
Histogram
With the Histogram you can quickly visualize the distribution
of values of all your variables.
GeneXproTools allows you to browse easily from one variable to the other and also change
the number of bins in your histograms.
Also interesting is the analysis of the distribution of model outputs, especially in
Classification and Logistic Regression where you can analyze how your models are evaluating
the classifications: is there a clear separation between the two classes or is there
some overlap?
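As a rough illustration of what changing the number of bins means, the sketch below counts values in equal-width bins. It is a minimal example under assumptions: the function name `histogram` and the equal-width binning rule are hypothetical and are not taken from GeneXproTools.

```python
def histogram(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins.

    Assumption: bins are equal-width between the minimum and maximum value,
    and the maximum value is counted in the last bin.
    """
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if lo == hi:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


print(histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=5))
```

Computing such counts separately for the model outputs of positive and negative records, over the same bin edges, is one way to see whether the two classes overlap.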
Scatter Plot
Like the Bivariate Line Chart, Scatter Plots allow you to compare any
pair of
variables, which are easily selected using the up-down controls
for the X-axis and the Y-axis.
Scatter Plots are powerful analytic tools for showing the
correlation between
two variables, especially when the regression line and the
regression equation
with its slope and intercept are also shown.
Scatter Plots are also useful for detecting outliers,
as these points usually fall
far away from the main cloud of points.
The Scatter Plots of GeneXproTools can also be combined with different
highlighting and
synchronization tools, so that model outliers, misclassifications
and other record categories can be
easily spotted.
In addition, the Sync functionality provides
an extremely useful tool for analyzing different
models very easily. For example, you can very
quickly determine if your models are similar or not
in how they are misclassifying the same records or
if they seem to be capturing different patterns in
the data and are therefore good candidates for
creating a better ensemble.
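The idea of checking whether two models misclassify the same records can be sketched numerically. The example below is only an illustration: the function names are hypothetical, record IDs are assumed to start at 1, and the Jaccard overlap is a measure chosen for this sketch, not something GeneXproTools is stated to compute.

```python
def misclassified_ids(predicted, actual):
    """Return the set of 1-based record IDs where prediction and actual class differ."""
    return {
        record_id
        for record_id, (p, a) in enumerate(zip(predicted, actual), start=1)
        if p != a
    }


def error_overlap(predicted_a, predicted_b, actual):
    """Jaccard overlap of the misclassified records of two models (assumed measure).

    1.0 means both models fail on exactly the same records;
    0.0 means they never fail on the same record.
    """
    errors_a = misclassified_ids(predicted_a, actual)
    errors_b = misclassified_ids(predicted_b, actual)
    union = errors_a | errors_b
    if not union:
        return 1.0  # neither model makes any error
    return len(errors_a & errors_b) / len(union)


actual = [1, 0, 1, 0, 1]
model_a = [1, 1, 1, 0, 0]  # wrong on records 2 and 5
model_b = [0, 0, 1, 0, 0]  # wrong on records 1 and 5
print(error_overlap(model_a, model_b, actual))
```

A low overlap suggests the models err on different records, which is the situation the text describes as promising for an ensemble.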
Statistics Charts
The Statistics Charts offer a simple and clear way of analyzing the
summary statistics
of all your variables: minimum, maximum,
average, median, and standard deviation. Moreover,
for all different subsets of variables GeneXproTools also plots the
slope, intercept, correlation coefficient and
R-square,
all evaluated against the response variable.
These statistics are also shown on the Statistics Report, but the Statistics Charts
aggregate them all together by statistic so that you can visualize the summary statistics
of all your variables quickly with just one glance. For example, if you have a lot of
models, you can compare their summary statistics very quickly by browsing all the
Statistics Charts in succession.
Variable Importance
The Variable Importance Chart is a special kind of Statistics Chart, as it pertains
only to the variables of a model. So, by selecting Model Variables in the
Variables Combobox, GeneXproTools computes the
variable importance of all
the variables of the current model and shows the results both in the Statistics Report
and in the Variable Importance Chart.
By browsing all the models in the run History, you can quickly analyze the type and
importance of the variables in every one of your models.
GeneXproTools uses a sophisticated stochastic method to compute the variable importance
of all the variables in a model. For all kinds of models (Classification,
Logistic Regression, Regression, Time Series Prediction and
Logic Synthesis),
the importance of each variable is computed by randomizing its input values and
then computing the decrease in the R-square between model outputs and actual
values.
The results for all variables are then normalized so that they add up to 1.
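The procedure described above (randomize one variable, measure the decrease in R-square, normalize) can be sketched in Python. This is a minimal illustration, not the GeneXproTools implementation. It assumes that randomizing means shuffling a variable's values across records, that R-square is the squared Pearson correlation, that the decrease is averaged over several shuffles (`repeats`), and that negative decreases are clipped to zero. All names are hypothetical.

```python
import random


def r_square(outputs, actuals):
    """Squared Pearson correlation between model outputs and actual values."""
    n = len(outputs)
    mean_o = sum(outputs) / n
    mean_a = sum(actuals) / n
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_aa = sum((a - mean_a) ** 2 for a in actuals)
    s_oa = sum((o - mean_o) * (a - mean_a) for o, a in zip(outputs, actuals))
    if s_oo == 0 or s_aa == 0:
        return 0.0  # constant series: correlation undefined, treated as 0 here
    return s_oa * s_oa / (s_oo * s_aa)


def variable_importance(model, rows, actuals, repeats=20, seed=0):
    """Importance of each variable as the normalized decrease in R-square."""
    rng = random.Random(seed)
    base = r_square([model(row) for row in rows], actuals)
    drops = []
    for j in range(len(rows[0])):
        total_drop = 0.0
        for _ in range(repeats):
            # Shuffle variable j across records, leaving the others intact.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled_rows = []
            for row, value in zip(rows, column):
                new_row = list(row)
                new_row[j] = value
                shuffled_rows.append(new_row)
            outputs = [model(row) for row in shuffled_rows]
            total_drop += base - r_square(outputs, actuals)
        drops.append(max(total_drop / repeats, 0.0))
    total = sum(drops)
    if total == 0:
        return [0.0] * len(drops)
    return [d / total for d in drops]  # normalized so the values add up to 1


# Hypothetical model that uses only the first of two variables.
rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
model = lambda row: 3.0 * row[0] + 1.0
actuals = [model(row) for row in rows]
print(variable_importance(model, rows, actuals))
```

In this toy case the second variable is never used by the model, so shuffling it cannot change the outputs and all the importance goes to the first variable.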
Summary Statistics
Summary statistics for all kinds of variables (original, derived and models) and across different datasets are computed and shown in the Data Panel, both in the
Statistics Report and in the Statistics Charts. These statistics include:
- Minimum
- Maximum
- Average
- Median
- Standard Deviation
- Slope vs Response
- Intercept vs Response
- Correlation Coefficient vs Response
- R-square vs Response
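For reference, the statistics listed above can be computed for a single variable as in the following sketch. It rests on assumptions not stated in this page: the regression is taken as response = slope × variable + intercept, the standard deviation is the population form, and R-square is the squared correlation coefficient. The function and key names are hypothetical.

```python
import math
import statistics


def summary_stats(values, response):
    """Summary statistics of one variable, plus its linear fit against the response."""
    mean_x = statistics.fmean(values)
    mean_y = statistics.fmean(response)
    s_xx = sum((x - mean_x) ** 2 for x in values)
    s_yy = sum((y - mean_y) ** 2 for y in response)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(values, response))
    # Assumes neither the variable nor the response is constant.
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    correlation = s_xy / math.sqrt(s_xx * s_yy)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "average": mean_x,
        "median": statistics.median(values),
        "standard_deviation": statistics.pstdev(values),  # population form (assumption)
        "slope": slope,
        "intercept": intercept,
        "correlation_coefficient": correlation,
        "r_square": correlation ** 2,
    }


stats = summary_stats([1, 2, 3, 4], [3, 5, 7, 9])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```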
Outlier Detection & Removal
Detecting outliers in input variables and being able to remove them is important,
as outliers may have a negative impact on modeling. GeneXproTools provides you with different tools for
detecting outliers and allows you
to remove them from your datasets through the Delete Records Window.
Different charts can be used to detect outliers in GeneXproTools, including the
Sequential Distribution Chart, the
Histogram and also the
Scatter Plot. For example,
the Sequential Distribution Chart, with both the average and standard deviation lines
clearly visible, provides a simple and very efficient way
of spotting outliers.
Moreover, because GeneXproTools allows you to copy the IDs of all the outliers in the format
required for record deletion in the Delete Records Window, you can
remove
all the outliers quickly and easily.
The Histogram offers another dimension to outlier detection, allowing you to see
quickly if there are any gaps in the distribution of
values.
The Scatter Plot can also be used to detect outliers, letting you quickly visualize
the cloud of points and see whether there are any outliers worth investigating.
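The average and standard deviation lines suggest a simple numerical rule: flag values that lie more than a chosen number of standard deviations from the average. The sketch below illustrates that rule only. The function name, the default of 3 standard deviations and the 1-based record IDs are assumptions, and the comma-separated output is purely illustrative, since the format expected by the Delete Records Window is not described here.

```python
import statistics


def find_outliers(values, num_std=3.0):
    """Return 1-based IDs of values farther than num_std standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    lower = mean - num_std * std
    upper = mean + num_std * std
    return [
        record_id
        for record_id, value in enumerate(values, start=1)
        if value < lower or value > upper
    ]


column = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outlier_ids = find_outliers(column, num_std=2.0)
print(",".join(str(record_id) for record_id in outlier_ids))
```

Note that a single extreme value inflates the standard deviation itself, so a stricter setting such as 3 standard deviations may fail to flag it in a small dataset like this one.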
Model Analysis
By choosing History Models or Model Variables in the Variables Combobox, GeneXproTools
allows you to perform a myriad of model analyses in the Data Panel. For example,
you can compare summary statistics across all models such as minimum and maximum values,
average and standard deviation, correlation coefficient and R-square, and slope and
intercept values.
For Classification and Logistic Regression, GeneXproTools also allows you to compare
your models using different model outputs: the raw
model output, the probability and
the predicted class.
For each model GeneXproTools computes the variable importance
of all its variables, showing the results both in the
Statistics Report and in the
Variable Importance Chart.
A simpler kind of variable importance can be inferred from the analysis of scatter plots,
where each model variable can be plotted both against the response and the model output.
In Classification and Logistic Regression you can
also perform these analyses against the raw model
output and the probability.
The summary of the regression analysis of all
model variables against the response can be quickly accessed by selecting the
Correlation Coefficient
Chart or the R-square Chart in the
Statistics Charts.
Record Charts & Analyses
GeneXproTools also supports extensive record analyses in the
Record Analytics Platform in the Data Panel.
In the Record Analytics Platform you can analyze different types of records using different charts and browsing tools.
For example, by comparing each record with different
record prototypes you can gain insight
into both your data and your models.
Sequential Distribution Chart
The Sequential Distribution Chart allows you to browse quickly all your records or
subsets of records in your datasets, such as the positive or negative records.
By plotting each record side by side with different
record prototypes, such as
the global centroid or medoid, class centroids or medoids and so on, it allows
a much faster analysis and understanding of your data. For example, even for
multivariate problems with 5-20 predictor variables, you can easily spot basic patterns
in your data. In the example below with 9 predictor variables for diagnosing breast cancer,
one can clearly identify, even without a model, if a patient has breast cancer or not
by comparing their results with the class centroids and global centroid.
The Sequential Distribution Chart is also an essential tool for
error analysis,
as it allows you to browse the records that were misclassified by your models:
all misclassifications, or just the false negatives or the false positives. By performing this type of error analysis,
you can gain insight into both your models and your data.
For example, error analysis can
give you ideas
for improving your models, such as requiring additional tests or
variables or creating new features that increase the predictive accuracy of your models.
Histogram
By plotting the histogram of each record, GeneXproTools allows you to study
the distribution of values across different types of variables. The detailed analysis
of record histograms might be useful for error analysis to help you find patterns
in the histograms of records that, for example, are being systematically misclassified
by most models.
The analysis of Record Histograms is particularly useful for datasets with
a large number of variables, especially if
they are scaled. Also interesting is the
analysis of record histograms of large model ensembles. For example, by browsing
the subset of misclassified records you can also gain some insights about
your models and your data.
The analysis of Record Histograms of input variables is also useful, especially
when there are many variables and they are scaled. GeneXproTools
allows you to normalize your data for visualization purposes
and to revert to the raw inputs once you have finished analyzing your data.
Scatter Plot
The Records Scatter Plot of GeneXproTools goes beyond a simple scatter plot
where you plot different pairs of records. Indeed, by allowing you to plot
certain records against pre-computed record prototypes such as the
global centroid or
the class centroids of classification and logistic regression data, the
Records Scatter Plot can also be used to perform a
wide range of analyses,
including error analysis. For example, browsing just the
model outliers or
the misclassified records and plotting them against selected
record prototypes
can reveal important aspects of both your data and your models.
Summary Statistics
GeneXproTools also provides summary statistics for all the
record prototypes
it supports (global centroid, global medoid and global extrema for all
modeling categories and class centroids and class medoids for
Classification,
Logistic Regression and Logic Synthesis) and shows them in the
Statistics Report.
GeneXproTools also provides real time statistics for the record under study in
the Record Charts by synchronizing the record stats
in the Statistics Report with the current chart.
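To make the notion of record prototypes concrete, the sketch below computes a global centroid, a global medoid and class centroids for a small set of records. It is an illustration under assumptions: the centroid is taken as the per-variable mean, the medoid as the actual record with the smallest total Euclidean distance to all other records, and all function names are hypothetical. The exact definitions used by GeneXproTools are not given in this page.

```python
import math


def centroid(records):
    """Per-variable mean of a list of records (each record is a list of numbers)."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]


def medoid(records):
    """The actual record with the smallest total Euclidean distance to all records."""
    return min(
        records,
        key=lambda r: sum(math.dist(r, other) for other in records),
    )


def class_centroids(records, labels):
    """Centroid of the records of each class, keyed by class label."""
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(label, []).append(record)
    return {label: centroid(members) for label, members in groups.items()}


records = [[0, 0], [1, 1], [2, 2], [3, 3], [10, 10]]
labels = [0, 0, 0, 0, 1]
print(centroid(records))
print(medoid(records))
print(class_centroids(records, labels))
```

Unlike the centroid, the medoid is always one of the original records, which makes it less sensitive to an extreme record such as the last one in this example.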
Error Analysis
All record charts (Sequential Distribution Chart,
Scatter Plot and Histogram) can be used to perform
error analysis, each contributing in a
different way. In all cases, you select a subset
of misclassified records (all misclassifications or just the false positives or
false negatives) in Classification and Logistic Regression problems, or the
model outliers (records whose model output error is above a pre-specified
threshold) in Regression and Time Series Prediction problems.
For example, you can choose to browse only the records for which the
error of the predicted value is above a pre-specified threshold, say 10% of the actual value.
In Classification and Logistic Regression, GeneXproTools allows you to specify exactly
the type of misclassifications, allowing you to
analyze only the false positives or
the false negatives or all the misclassifications.
GeneXproTools also allows you to browse the correct classifications, either the
true positives or the true negatives or both, which of course is also important
for error analysis.
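The record categories used in error analysis can be sketched as follows. This is a minimal illustration with hypothetical function names: it assumes binary classes with 1 as the positive class, 1-based record IDs, and a relative error measured against the actual value (records with an actual value of zero are skipped, which is a simplification made for this sketch).

```python
def classification_categories(predicted, actual, positive=1):
    """Group 1-based record IDs into true/false positives and negatives."""
    groups = {
        "true_positives": [],
        "true_negatives": [],
        "false_positives": [],
        "false_negatives": [],
    }
    for record_id, (p, a) in enumerate(zip(predicted, actual), start=1):
        if p == positive:
            key = "true_positives" if a == positive else "false_positives"
        else:
            key = "false_negatives" if a == positive else "true_negatives"
        groups[key].append(record_id)
    return groups


def model_outliers(predicted, actual, threshold=0.10):
    """IDs of records whose relative error exceeds the threshold (10% by default)."""
    return [
        record_id
        for record_id, (p, a) in enumerate(zip(predicted, actual), start=1)
        if a != 0 and abs(p - a) / abs(a) > threshold
    ]


groups = classification_categories([1, 0, 1, 0], [1, 1, 0, 0])
misclassified = sorted(groups["false_positives"] + groups["false_negatives"])
print(groups)
print(misclassified)
print(model_outliers([105.0, 80.0, 100.0], [100.0, 100.0, 100.0]))
```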