Creating a Model Ensemble with Large Datasets
In this age of Big Data we are often asked how to process large datasets with GeneXproServer. Whether this is feasible depends on the problem at hand, but in many cases it can be done with a few simple procedures. The solution proposed here is to process different runs using different subsets of the data, both for training and for validation. With GeneXproServer we can load these subsets into as many runs as we like, and at the end GeneXproServer creates a single run with the best models from all the subsets. We can then load that run into GeneXproTools and deploy the models to Excel as an ensemble. This model ensemble is the final model.
Assumptions
The main assumption we make in this example is that the data sits in a database on a remote server. The training data is in a table named TrainingData and the validation data is in a table named ValidationData. For the sake of this example, both tables have 1 million records and each has ten columns named v0-v9, where v9 is the response variable.
The database server can be any supported database, but the exact SQL may differ from what is presented below. Finally, we also assume that GeneXproServer is installed on a server with several CPUs to speed up the processing.
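For reference, the layout assumed throughout this example corresponds to tables like the following sketch (the column types are hypothetical; any numeric types will do):

CREATE TABLE TrainingData (
    v0 DOUBLE, v1 DOUBLE, v2 DOUBLE, v3 DOUBLE, v4 DOUBLE,
    v5 DOUBLE, v6 DOUBLE, v7 DOUBLE, v8 DOUBLE,
    v9 DOUBLE  -- the response variable, stored last
);

CREATE TABLE ValidationData (
    v0 DOUBLE, v1 DOUBLE, v2 DOUBLE, v3 DOUBLE, v4 DOUBLE,
    v5 DOUBLE, v6 DOUBLE, v7 DOUBLE, v8 DOUBLE,
    v9 DOUBLE  -- the response variable, stored last
);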
Template GeneXproTools File and the Job Definition
The first step is to create a small sample of the training data with which to build the template GeneXproTools file. Create a new run loading the training data and leave all the settings at their default values. It is a good idea to minimize the size of this GeneXproTools file as it will be copied many times; to do this, use a small dataset and delete the history. We will call this file MyRun.gep.
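For instance, assuming a MySQL-style database matching the SQL used later in this example, a 1,000-record sample for the template run could be pulled with:

SELECT * FROM TrainingData ORDER BY Rand() LIMIT 1000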
Now that we have this template run, we can move on to creating the job definition file:
<job filename="MyRun.gep"
path="c:\examples"
createconsolidatedrun="2"
feedback="2"
async="yes"
xmlns="http://tempuri.org/gxpsApi.xsd">
</job>
The job will be run in the folder c:\examples and the user interface will be updated every two seconds. At the end of the process GeneXproServer will create a new run, selecting the model with the best validation fitness from each of the created runs. The selection criterion, controlled by the createconsolidatedrun attribute, has several other options, as described in the GeneXproServer documentation. Finally, we set the async attribute to yes to allow GeneXproServer to leverage all the processors of the computer.
The next step is to create the runs that will load each data sample and process it. We start with the skeleton:
<run id="1" stopcondition="generations" value="500" type="start"></run>
Each run will run for 500 generations.
Then we add the data loading directives:
<run id="1" stopcondition="generations" value="500" type="start">
<datasets>
<dataset type="training" records="10000">
<connection type="database" format="responselast">
<oledbconnectionstring>Server=srv01;Database=db;Uid=user;Pwd=pass;</oledbconnectionstring>
<sqlstatement>SELECT * FROM TrainingData ORDER BY Rand() LIMIT 10000</sqlstatement>
</connection>
</dataset>
<dataset type="validation" records="10000">
<connection type="database" format="responselast">
<oledbconnectionstring>Server=srv01;Database=db;Uid=user;Pwd=pass;</oledbconnectionstring>
<sqlstatement>SELECT * FROM ValidationData ORDER BY Rand() LIMIT 10000</sqlstatement>
</connection>
</dataset>
</datasets>
</run>
The data is randomized in the database, which in some cases can be a very time-consuming operation. Another option is to pre-randomize the data in both tables and then do a straight select, paginating the data from run to run.
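As an illustration, assuming the rows of both tables have been pre-shuffled and the database supports MySQL-style LIMIT/OFFSET, run number n could fetch its own page of 10,000 records with a statement such as:

SELECT * FROM TrainingData LIMIT 10000 OFFSET 20000

where the offset is (n - 1) × 10000, shown here for run 3. This avoids sorting by Rand() at query time.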
Now that we have the skeleton run in place we need to replicate it as many times as required. Since each table has 1 million records and each run selects 10,000 of them at random, 100 runs draw as many records as the dataset contains (100 × 10,000 = 1,000,000), giving good, although not guaranteed to be complete, coverage of the data both for training and validation. Depending on your requirements it may be enough to process a much smaller number of runs.
When replicating the runs, the run id must be incremented so that no two runs share the same id. This can be done with a simple script, such as the example in Python below.
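The following is a minimal sketch of such a replicator script, assuming 100 runs of 10,000 records each and the connection string used above; it simply prints one run element per id, mirroring the skeleton run defined earlier:

# replicator.py -- prints the <run> elements of the job definition,
# one per id (a sketch; adjust RUNS, RECORDS and CONNECTION to your setup)
RUNS = 100
RECORDS = 10000
CONNECTION = "Server=srv01;Database=db;Uid=user;Pwd=pass;"

RUN_TEMPLATE = """<run id="{id}" stopcondition="generations" value="500" type="start">
  <datasets>
    <dataset type="training" records="{records}">
      <connection type="database" format="responselast">
        <oledbconnectionstring>{conn}</oledbconnectionstring>
        <sqlstatement>SELECT * FROM TrainingData ORDER BY Rand() LIMIT {records}</sqlstatement>
      </connection>
    </dataset>
    <dataset type="validation" records="{records}">
      <connection type="database" format="responselast">
        <oledbconnectionstring>{conn}</oledbconnectionstring>
        <sqlstatement>SELECT * FROM ValidationData ORDER BY Rand() LIMIT {records}</sqlstatement>
      </connection>
    </dataset>
  </datasets>
</run>"""

for run_id in range(1, RUNS + 1):
    print(RUN_TEMPLATE.format(id=run_id, records=RECORDS, conn=CONNECTION))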
Then you can run this script from the command line and pipe its output to a text file:
python replicator.py > jobdefinition.xml
And then wrap the contents in the job definition node defined above.
The final result can be downloaded from here.
Processing
To process the resulting job definition, open a command line, navigate to the path of the job definition and run the following command:
gxps50x.exe jobdefinition.xml
If the job definition is valid and the template GeneXproTools file is in place, GeneXproServer will start and process as many files in parallel as there are logical CPUs in the computer.
After all the runs have been processed, GeneXproServer will create a run named MyRun_consolidated.gep that contains all the selected models.
The final step is to create an ensemble with all these models using the Deploy Ensemble to Excel functionality of GeneXproTools. When this process is over, the final ensemble with all its models can be inspected in Excel.