Last update: February 19, 2014
Dataset Partitioning & Sub-sampling
Through the Dataset Partitioning Window and the sub-sampling schemes in the General Settings Tab,
GeneXproTools allows you to split your data into different datasets that can be used to:
- Create the models (the training dataset or a sub-set of the training dataset).
- Check and select the models during the design process (the validation dataset or a sub-set of the validation dataset).
- Test the final model (a sub-set of the validation set reserved for testing).
Of all these datasets, only the training dataset is mandatory, as GeneXproTools
requires data to create models. The validation and test sets are optional: you can
create models without checking or testing them. This approach is not recommended, however,
as you stand a better chance of creating good models if you check their generalization
regularly, not only during model evolution but also during model selection. That said,
if you don't have enough data, you can still create good models with GeneXproTools,
as its learning algorithms are not prone to overfitting the data. In addition, if you are
using GeneXproTools to create random forests, validating/testing the individual models of
the ensemble is less important, as ensembles tend to generalize better than individual models.
Training Dataset
The training dataset is used to create the models, either in its entirety or as a
sub-sample of the training data. Sub-samples of the training data are managed in the
Settings Panel. GeneXproTools supports different sub-sampling schemes, such as bagging
and mini-batch. For example, to operate in bagging mode you just have to set the
sub-sampling to Random. You can also change the number of records used in each bag,
allowing you to speed up evolution if you have enough data to get good generalization.
Besides Random sampling (which is done with replacement), you can also choose Shuffled
(done without replacement); Balanced Random (used in classification and
logistic regression runs, where a sub-sample is randomly generated so that the proportions
of positives and negatives are the same); and Balanced Shuffled (similar to Balanced Random,
but with the sampling done without replacement).
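To make these schemes concrete, here is a minimal Python sketch of the four random
sampling modes. The function names and the 50/50 class split assumed for the balanced
schemes are illustrative only; in GeneXproTools these options are selected in the
Settings Panel, not written as code.

    import random

    def sample_random(records, n):
        """'Random' scheme: sampling with replacement (bagging-style)."""
        return [random.choice(records) for _ in range(n)]

    def sample_shuffled(records, n):
        """'Shuffled' scheme: sampling without replacement."""
        return random.sample(records, n)

    def sample_balanced_random(records, labels, n):
        """'Balanced Random': positives and negatives drawn with replacement
        in equal proportions (a 50/50 split is assumed here)."""
        pos = [r for r, y in zip(records, labels) if y == 1]
        neg = [r for r, y in zip(records, labels) if y == 0]
        half = n // 2
        return ([random.choice(pos) for _ in range(half)] +
                [random.choice(neg) for _ in range(half)])

    def sample_balanced_shuffled(records, labels, n):
        """'Balanced Shuffled': same balance, but drawn without replacement."""
        pos = [r for r, y in zip(records, labels) if y == 1]
        neg = [r for r, y in zip(records, labels) if y == 0]
        half = n // 2
        return random.sample(pos, half) + random.sample(neg, half)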
All types of random sampling can be used in mini-batch mode, an extremely useful
sampling method for handling big datasets. In mini-batch mode a sub-sample of the
training data is generated every p generations (the period of the mini-batch, which
is adjustable and can be set in the Settings Panel) and is used for training. This way,
for large datasets, good models can be generated quickly using a high percentage of
the records in the training data, without stalling the whole evolutionary process
by evaluating a huge dataset every generation. It's important, however, to find a
good balance between the size of the mini-batch and efficient model evolution.
This means that you'll still have to choose an appropriate number of records to ensure
good generalizability, which is true for all datasets, big and small. A simple rule
of thumb is to check whether the best fitness is increasing overall: if you see it fluctuating
up and down, evolution has stalled and you need either to increase the mini-batch size or
to increase the time between batches (the period). The chart below clearly shows the overall
upward trend in best fitness for a run in mini-batch mode with a period of 50.
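As a rough illustration of how mini-batch training proceeds, here is a hedged Python
sketch. The evolve_one_generation callback and the exact resampling details are
assumptions; GeneXproTools performs this resampling internally with the period set
in the Settings Panel.

    import random

    def evolve_with_minibatch(records, generations, batch_size, period,
                              evolve_one_generation):
        """Sketch of mini-batch training: a fresh random sub-sample of the
        training data is drawn every `period` generations and used for
        fitness evaluation until the next resampling."""
        batch = []
        for gen in range(generations):
            if gen % period == 0:            # resample every p generations
                batch = random.sample(records, batch_size)
            evolve_one_generation(batch)     # fitness evaluated on the current batch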
The training dataset is also used to evaluate data statistics required by certain models,
such as the average and standard deviation of the predictor variables in models created
with standardized data. Other data statistics used by GeneXproTools include pre-computed
suggested mappings for missing values; training data constants used in the Excel worksheets
of models and ensembles deployed to Excel; min and max values of variables when normalized
data is used; and so on.
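For illustration, here is a small Python sketch of the kind of per-variable constants
involved (mean and standard deviation for standardized data, min and max for normalized
data). The exact constants GeneXproTools stores may differ; this is only a sketch of
the idea.

    def training_constants(column):
        """Compute illustrative per-variable constants from a training column."""
        n = len(column)
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        return {"mean": mean, "std": var ** 0.5,
                "min": min(column), "max": max(column)}

    def standardize(x, c):
        """Apply training-set standardization to a new value."""
        return (x - c["mean"]) / c["std"]

    def normalize(x, c):
        """Apply training-set 0-1 normalization to a new value."""
        return (x - c["min"]) / (c["max"] - c["min"])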
It’s important to note that when a sub-set of the training dataset is used in a run, the
training data constants are those of the entire training dataset as defined in the
Data Panel. This matters, for example, when designing ensemble models using the different
random sampling schemes selected in the Settings Panel.
On the other hand, when a sub-set of the training dataset is used to evolve a model, the
model statistics or constants, if they exist, remain fixed and become an integral part of
the evolved model. Examples of such model parameters include the evolvable rounding
thresholds of classification and logistic regression models.
In addition to random sampling schemes, GeneXproTools supports non-random sampling schemes,
such as using just the odd or even records; the top half or bottom half records; and the top n or
bottom n records.
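A minimal sketch of these non-random schemes, assuming records are numbered from 1 so
that the "odd records" correspond to 0-based indices 0, 2, 4, and so on:

    def odd_records(records):    return records[0::2]   # records 1, 3, 5, ...
    def even_records(records):   return records[1::2]   # records 2, 4, 6, ...
    def top_half(records):       return records[:len(records) // 2]
    def bottom_half(records):    return records[len(records) // 2:]
    def top_n(records, n):       return records[:n]
    def bottom_n(records, n):    return records[-n:]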
Validation Dataset
GeneXproTools supports the use of a validation dataset, which can either be loaded as a
separate dataset or generated from a single dataset using the GeneXproTools partitioning
algorithms. If a single dataset is loaded during the creation of a new run, GeneXproTools
automatically splits the data into Training and Validation/Test datasets, using sensible
default strategies to ensure good model design and evolution. These default partitions
offer useful guidelines, but you can choose different partitions in the Dataset Partitioning
Window to meet your needs. For instance, the Odds/Evens partition is useful for
time series data, allowing for a good split without losing the time dimension of the
original data, which can help in better understanding both the data and the generated models.
GeneXproTools also supports sub-sampling for the validation dataset, which, as explained
for the training dataset above, is controlled in the Settings Panel.
The sub-sampling schemes available for the validation data are exactly the same as those
available for the training data, except of course for the mini-batch strategy, which
pertains only to the training data.
It’s worth pointing out, however, that the same sampling scheme in the training and validation data
can play very different roles. For example, by choosing the Odds or the Evens, or the Bottom Half or
Top Half for validation, you can reserve the other part for testing
and only use this test dataset at the very
end of the modeling process to evaluate the accuracy
of your model.
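For instance, choosing the Odds for validation implicitly reserves the Evens for
testing, as in this small illustrative snippet (the data here is a stand-in):

    validation_data = list(range(100))            # stand-in for the validation dataset
    used_for_validation = validation_data[0::2]   # "Odds": records 1, 3, 5, ...
    held_out_for_test   = validation_data[1::2]   # "Evens": kept for final testing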
Another ingenious use of the random sampling schemes available for the validation set
(Random, Shuffled, Balanced Random and Balanced Shuffled, with the last two available only
for Classification, Logistic Regression and Logic Synthesis) is to estimate the
cross-validation accuracy of a model. A simple way to do this is to create a run with the
model you want to cross-validate and then copy this model n times, for instance by
importing it n times. Then, in the History Panel, you can evaluate the performance of the
model on different sub-samples of the validation dataset. The average fitness and
favorite statistic shown in the summary of the History models constitute the
cross-validation results for your model.
Below is an example of a 30-fold cross-validation
evaluated for the training and validation datasets
using random sampling of the respective datasets.
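Although the evaluation above happens in the History Panel rather than in code, the
technique amounts to something like the following Python sketch, where plain
classification accuracy stands in for the fitness or favorite statistic (the model
callable and the sample sizes are assumptions):

    import random

    def cross_validate(model, records, labels, n_folds=30, sample_size=50):
        """Evaluate one fixed model on n_folds random sub-samples of a dataset
        and average the results (a Monte Carlo-style estimate, mirroring the
        import-the-model-n-times trick described above). `model` is any
        callable mapping a record to a predicted label."""
        scores = []
        for _ in range(n_folds):
            fold = random.sample(range(len(records)), sample_size)
            correct = sum(model(records[i]) == labels[i] for i in fold)
            scores.append(correct / sample_size)
        return sum(scores) / n_folds    # the cross-validation estimate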
Test Dataset
The dividing line between a test dataset and a validation dataset is not always clear.
A popular
definition comes from modeling competitions, where part of the data is held out and not
accessible to the
people doing the modeling. In this case, of course, there’s no other choice: you create your
model and then others check if it is any good or not. But in most real situations people do have
access to all the data and they are the ones who decide what goes into training, validation and
testing.
GeneXproTools allows you to experiment with all these scenarios and you can choose what works best
for the data and problem you are modeling. So if you want to be strict, you can hold out part of
the data for testing and load it only at the very end of the modeling process, using the
Change Validation Dataset functionality of GeneXproTools.
Another option is to use the technique described
above for the validation dataset, where you
hold out part of the validation data for testing. For example, you hold
out the Odds or the Evens,
or the Top Half
or Bottom Half. This obviously requires strong-willed and very disciplined people, so it’s
perhaps best practiced only when a single person is doing the modeling.
The takeaway message of all these what-if scenarios is that, after working with GeneXproTools
for a while, you’ll be comfortable with what constitutes good practice in testing the accuracy
of your models. We like to claim that GeneXproTools is not prone to overfitting, and with all
its partitioning and sampling schemes you can develop a better sense of the quality of the
models it generates.
And finally, the same cross-validation technique described above
for the validation dataset can be
performed for the test dataset.