Creating a Model Ensemble with Large Datasets
In this age of Big Data we are often asked how to process large datasets with GeneXproServer. Whether this is feasible depends on the problem at hand, but in many cases it can be done with a few simple procedures. The solution proposed here is to process different runs using different subsets of the data, both for training and for validation. With GeneXproServer we can load these subsets into as many runs as we like, and at the end GeneXproServer creates a single run with the best models from all the subsets. We can then load that run into GeneXproTools and deploy the models to Excel as an ensemble. This model ensemble is the final model.
Assumptions
The main assumption we make in this example is that the data sits in a database on a remote server. The training data is in a table named TrainingData and the validation data is in a table named ValidationData. For the sake of this example, both tables have 1 million records and each has ten columns named v0-v9, where v9 is the response variable.
The database server can be any supported database, but the exact SQL may differ from what is presented below. Finally, we also assume that GeneXproServer is installed on a server with several CPUs to speed up the processing.
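For reference, the layout assumed throughout this example corresponds to tables like the following sketch (the column types are hypothetical; any numeric types will do):

CREATE TABLE TrainingData (
    v0 DOUBLE, v1 DOUBLE, v2 DOUBLE, v3 DOUBLE, v4 DOUBLE,
    v5 DOUBLE, v6 DOUBLE, v7 DOUBLE, v8 DOUBLE,
    v9 DOUBLE  -- the response variable, stored last
);

CREATE TABLE ValidationData (
    v0 DOUBLE, v1 DOUBLE, v2 DOUBLE, v3 DOUBLE, v4 DOUBLE,
    v5 DOUBLE, v6 DOUBLE, v7 DOUBLE, v8 DOUBLE,
    v9 DOUBLE  -- the response variable, stored last
);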
Template GeneXproTools File and the Job Definition
The first step is to create a small sample of the training data with which to build the template GeneXproTools file. Create a new run loading the training data and leave all the settings at their default values. It is a good idea to minimize the size of this GeneXproTools file as it will be copied many times; to do this, use a small dataset and delete the history. We will call this file MyRun.gep.
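For instance, assuming a MySQL-style database matching the SQL used later in this example, a 1,000-record sample for the template run could be pulled with:

SELECT * FROM TrainingData ORDER BY Rand() LIMIT 1000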
Now that we have this template run, we can move on to creating the job definition file:
<job filename="MyRun.gep"
path="c:\examples"
createconsolidatedrun="2"
feedback="2"
async="yes"
xmlns="http://tempuri.org/gxpsApi.xsd">
</job>
The job will be run in the folder c:\examples and the user interface will be updated every two seconds. At the end of the process GeneXproServer will create a new run, selecting the model with the best validation fitness from each of the created runs. The selection criterion, controlled by the createconsolidatedrun attribute, has several other options, as described in the GeneXproServer documentation. Finally, we set the async attribute to yes to allow GeneXproServer to leverage all the processors of the computer.
The next step is to create the runs that will load each data sample and process it. We start with the skeleton:
<run id="1" stopcondition="generations" value="500" type="start"></run>
Each run will run for 500 generations.
Then we add the data loading directives:
<run id="1" stopcondition="generations" value="500" type="start">
<datasets>
<dataset type="training" records="10000">
<connection type="database" format="responselast">
<oledbconnectionstring>Server=srv01;Database=db;Uid=user;Pwd=pass;</oledbconnectionstring>
<sqlstatement>SELECT * FROM TrainingData ORDER BY Rand() LIMIT 10000</sqlstatement>
</connection>
</dataset>
<dataset type="validation" records="10000">
<connection type="database" format="responselast">
<oledbconnectionstring>Server=srv01;Database=db;Uid=user;Pwd=pass;</oledbconnectionstring>
<sqlstatement>SELECT * FROM ValidationData ORDER BY Rand() LIMIT 10000</sqlstatement>
</connection>
</dataset>
</datasets>
</run>
The data is randomized in the database, which in some cases can be a very time-consuming operation. Another option is to pre-randomize the data in both tables and then do a straight select, paginating the data from run to run.
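As an illustration, assuming the rows of both tables have been pre-shuffled and the database supports MySQL-style LIMIT/OFFSET, run number n could fetch its own page of 10,000 records with a statement such as:

SELECT * FROM TrainingData LIMIT 10000 OFFSET 20000

where the offset is (n - 1) × 10000, shown here for run 3. This avoids sorting by Rand() at query time.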
Now that we have the skeleton run in place we need to replicate it as many times as required. Since each table has 1 million records and each run selects 10,000 of them at random, 100 runs draw as many records as the dataset contains (100 × 10,000 = 1,000,000), giving good, although not guaranteed to be complete, coverage of the data both for training and validation. Depending on your requirements it may be enough to process a much smaller number of runs.
When replicating the runs, the run id must be incremented so that no two runs share the same id. This can be done with a simple script, such as the example in Python below.
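The following is a minimal sketch of such a replicator script, assuming 100 runs of 10,000 records each and the connection string used above; it simply prints one run element per id, mirroring the skeleton run defined earlier:

# replicator.py -- prints the <run> elements of the job definition,
# one per id (a sketch; adjust RUNS, RECORDS and CONNECTION to your setup)
RUNS = 100
RECORDS = 10000
CONNECTION = "Server=srv01;Database=db;Uid=user;Pwd=pass;"

RUN_TEMPLATE = """<run id="{id}" stopcondition="generations" value="500" type="start">
  <datasets>
    <dataset type="training" records="{records}">
      <connection type="database" format="responselast">
        <oledbconnectionstring>{conn}</oledbconnectionstring>
        <sqlstatement>SELECT * FROM TrainingData ORDER BY Rand() LIMIT {records}</sqlstatement>
      </connection>
    </dataset>
    <dataset type="validation" records="{records}">
      <connection type="database" format="responselast">
        <oledbconnectionstring>{conn}</oledbconnectionstring>
        <sqlstatement>SELECT * FROM ValidationData ORDER BY Rand() LIMIT {records}</sqlstatement>
      </connection>
    </dataset>
  </datasets>
</run>"""

for run_id in range(1, RUNS + 1):
    print(RUN_TEMPLATE.format(id=run_id, records=RECORDS, conn=CONNECTION))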
Then you can run this script from the command line and pipe its output to a text file:
python replicator.py > jobdefinition.xml
And then wrap the contents in the job definition node defined above.
The final result can be downloaded from here.
Processing
To process the resulting job definition, open a command line, navigate to the path of the job definition and run the following command:
gxps50x.exe jobdefinition.xml
If the job definition is valid and the template GeneXproTools file is in place, GeneXproServer will start and process as many files in parallel as there are logical CPUs in the computer.
After all the runs have been processed, GeneXproServer will create a run named MyRun_consolidated.gep that contains all the selected models.
The final step is to create an ensemble with all these models using the Deploy Ensemble to Excel functionality of GeneXproTools. When this process is over, the final ensemble with all its models can be inspected in Excel.