GeneXproTools 4.0 ships with four different Sample
Runs for Classification: Iris
Virginica, Breast Cancer, Credit
Screening, and DNA Microarrays. To try any one of them, you just
have to click its link on the Welcome Screen of GeneXproTools.
The Iris Virginica sample run is a simple real-world problem for
distinguishing Iris Virginica from the other two irises: Iris Setosa and
Iris Versicolor. The original iris dataset contains fifty examples each of
the three types of iris. In this sample run, though, the sub-problem Virginica versus Not Virginica is
analyzed, where 100 randomly chosen samples are used for training and the remaining 50 for testing.
You will see that this is an easy problem for GeneXproTools and
exceptionally good models with 100% accuracy on the testing set can
be easily created.
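The data preparation described above (recoding three species into Virginica versus Not Virginica and splitting 150 samples into 100 for training and 50 for testing) can be sketched as follows. This is only an illustration: the label list is a stand-in for the real dataset, and the random seed is an arbitrary assumption, not the split actually used in the sample run.

```python
import random

# Hypothetical stand-in for the 150 iris labels: 50 examples per species.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

# Recode the three-class labels into the binary target of the sample run:
# 1 = Virginica, 0 = Not Virginica.
targets = [1 if name == "virginica" else 0 for name in species]

# Shuffle the sample indices with a fixed seed (an arbitrary choice) so the
# split is reproducible, then keep 100 for training and the other 50 for testing.
rng = random.Random(0)
indices = list(range(len(targets)))
rng.shuffle(indices)
train_idx, test_idx = indices[:100], indices[100:]

print(len(train_idx), len(test_idx), sum(targets))  # 100 50 50
```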
The Breast Cancer sample run is a complex real-world problem for
diagnosing breast cancer based on nine different cell analyses (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses).
The dataset cancer1 used in this sample run was obtained from PROBEN1,
with 350 samples used for training and 174 for testing. And again
you will see that GeneXproTools is able to create exceptionally good
models for diagnosing breast cancer with predictive accuracies around 99%.
The Credit Screening sample run is a high-dimensional real-world
problem in which the goal is to decide whether or not to approve a customer’s request for a credit card. Each sample in the dataset represents a real credit card application and the output describes whether the bank granted the credit card or not. This problem has 51 input attributes, all of them unexplained in the original dataset for confidentiality reasons.
The dataset card1 used in this sample run was obtained from PROBEN1,
with 345 samples used for training and 172 for testing. And here you
will see that GeneXproTools is able to create very good models with
high predictive accuracy.
The DNA Microarrays sample run is an extremely complex real-world
problem with thousands of variables. The training dataset consists of 38 bone marrow samples (27 of acute lymphoblastic leukemia, ALL, and 11 of acute myeloid leukemia, AML), each measured over 7129 probes from 6817 human genes.
The testing dataset consists of 34 samples, with 20 ALL and 14 AML.
In this sample run, "0" was used to represent "ALL" and "1" to
represent "AML". The 7129 genes were numbered d0-d7128.
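As a small illustration of this encoding, the sketch below builds the 0/1 targets and the d0-d7128 variable names in Python. The class counts are the ones quoted above; the names `CLASS_CODES`, `variable_names`, and the label lists are hypothetical and are not part of GeneXproTools.

```python
# Class encoding described in the text: 0 for ALL, 1 for AML.
CLASS_CODES = {"ALL": 0, "AML": 1}

# Variable names d0 .. d7128, one per probe (7129 in total).
variable_names = [f"d{i}" for i in range(7129)]

# Class counts quoted in the text for the two datasets.
training_labels = ["ALL"] * 27 + ["AML"] * 11
testing_labels = ["ALL"] * 20 + ["AML"] * 14

# Translate the class names into the numeric targets.
training_targets = [CLASS_CODES[label] for label in training_labels]
testing_targets = [CLASS_CODES[label] for label in testing_labels]

print(len(variable_names), variable_names[0], variable_names[-1])  # 7129 d0 d7128
print(len(training_targets), len(testing_targets))  # 38 34
```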
As in all DNA microarray problems, the number of samples available
for both training and testing in the ALL-AML Leukemia problem is
quite small, which poses some challenges.
First, it is common to try and narrow down the search space with
sophisticated discretization algorithms to filter out the noise or
irrelevant genes. And although good results have been reported (data
size reductions of 50%-98%), the problem still remains: hundreds or
thousands of variables to be mined using just a few dozen samples.
The great advantage of using GeneXproTools in DNA microarray studies
is that you can use the raw data (although you can also use
the filtered data) and still obtain excellent results. Of course,
not all the models with good accuracy on the training data will
have good predictive accuracy, but by making several runs you can
select the top 10-20 models and then cross-reference the genes
(attributes) used in all of them. You can then select and copy these
most important genes from the Data
Panel and create a much smaller dataset for creating the final
model.
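The cross-referencing step can be sketched outside GeneXproTools as a simple frequency count. The example below is a minimal illustration only: the gene sets are made up, and the threshold `min_models` is an assumption rather than a value prescribed by the software.

```python
from collections import Counter

# Hypothetical gene sets used by four selected models (illustrative only;
# these are not the models from the sample run).
models = [
    {"d12", "d305", "d4100"},
    {"d305", "d977"},
    {"d12", "d305", "d6021"},
    {"d977", "d4100", "d305"},
]

# Count in how many models each gene appears.
gene_counts = Counter(gene for genes in models for gene in genes)

# Keep the genes that appear in at least `min_models` models
# (the threshold is an assumption), most frequent first.
min_models = 2
shortlist = sorted(
    (gene for gene, count in gene_counts.items() if count >= min_models),
    key=lambda gene: (-gene_counts[gene], gene),
)

print(shortlist)  # ['d305', 'd12', 'd4100', 'd977']
```

The shortlisted variables are the ones you would then copy into the smaller dataset for the final model.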
For instance, for the
ALL-AML Leukemia problem, a total of 308 promising genes were
identified in 25 good runs, that is, runs with 100% training
accuracy and testing accuracies between 91.18% and 97.06%. Of these 308
promising genes, only 11 (genes 759, 1881, 2287, 2407, 4362, 4846, 5485, 6040, 6587, 6638,
and 6854) appeared in two or more models; and of these, only five (genes
759, 1881, 2287, 4846, and 6854) appeared in more than three models,
with the most prevalent being genes 1881, 2287, and 4846, with
eight, six, and four appearances, respectively. So, it is a good
guess that the genes most likely to be involved in ALL-AML leukemia are
genes 1881, 2287, and 4846, which is an exceptionally good starting
point for tackling leukemia.
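The testing accuracies quoted above follow directly from the size of the testing set: with 34 testing samples, 91.18% corresponds to three misclassified samples and 97.06% to one. The short check below reproduces that arithmetic; the helper `accuracy_percent` is a hypothetical name used only for this illustration.

```python
TESTING_SAMPLES = 34  # size of the ALL-AML testing set given in the text


def accuracy_percent(correct, total):
    """Return classification accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)


# Three errors and one error bracket the reported range of testing accuracies.
for errors in (3, 1):
    correct = TESTING_SAMPLES - errors
    print(errors, accuracy_percent(correct, TESTING_SAMPLES))
# 3 91.18
# 1 97.06
```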
References
Golub, T. R., D. K. Slonim, P. Tamayo, C.
Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R.
Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, 1999.
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring. Science, 286:531-537.
Prechelt, L., 1994. PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Univ.
Karlsruhe, Germany.