Sample Runs

GeneXproTools 4.0 ships with four different Sample Runs for Classification: Iris Virginica, Breast Cancer, Credit Screening, and DNA Microarrays. To try one of them, click its link on the Welcome Screen of GeneXproTools.

The Iris Virginica sample run is a simple real-world problem: distinguishing Iris Virginica from the two other irises, Iris Setosa and Iris Versicolor. The original iris dataset contains fifty examples of each of the three types of iris. This sample run, though, analyzes the sub-problem Virginica versus Not Virginica, with 100 randomly chosen samples used for training and the remaining 50 for testing. This is an easy problem for GeneXproTools, and exceptionally good models with 100% accuracy on the testing set can be created easily.
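The sub-problem setup described above can be sketched in a few lines of Python. This is only an illustration of relabeling and random splitting, not how GeneXproTools prepares its data: the function name `split_virginica`, the 1/0 encoding for Virginica versus Not Virginica, and the fixed seed are all assumptions made for the example.

```python
import random

def split_virginica(labels, n_train=100, seed=0):
    """Relabel species as 1 (Virginica) or 0 (Not Virginica) and split at random.

    The 1/0 encoding and the seed are assumptions for this sketch.
    """
    binary = [1 if name == "virginica" else 0 for name in labels]
    indices = list(range(len(binary)))
    random.Random(seed).shuffle(indices)  # reproducible random order
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return binary, train_idx, test_idx

# Stand-in for the class column of the iris dataset: fifty examples per type.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
binary, train_idx, test_idx = split_virginica(labels)
print(len(train_idx), len(test_idx), sum(binary))  # prints: 100 50 50
```

Because the split is random, the number of Virginica samples that land in the training set varies with the seed.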

The Breast Cancer sample run is a complex real-world problem: diagnosing breast cancer from nine different cell analyses (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses). The dataset cancer1 used in this sample run was obtained from PROBEN1, with 350 samples used for training and 174 for testing. Here again, GeneXproTools is able to create exceptionally good models for diagnosing breast cancer, with predictive accuracies around 99%.

The Credit Screening sample run is a high-dimensional real-world problem in which the goal is to decide whether or not to approve a customer’s request for a credit card. Each sample in the dataset represents a real credit card application, and the output describes whether the bank granted the credit card or not. This problem has 51 input attributes, all of them unexplained in the original dataset for confidentiality reasons. The dataset card1 used in this sample run was obtained from PROBEN1, with 345 samples used for training and 172 for testing. Here, too, GeneXproTools is able to create very good models with high predictive accuracy.

The DNA Microarrays sample run is an extremely complex real-world problem with thousands of variables. The training dataset consists of 38 bone marrow samples (27 ALL and 11 AML), measured over 7129 probes from 6817 human genes. The testing dataset consists of 34 samples (20 ALL and 14 AML). In this sample run, "0" was used to represent "ALL" and "1" to represent "AML". The 7129 genes were numbered d₀-d₇₁₂₈.

As in all DNA microarray problems, the number of samples available for both training and testing in the ALL-AML Leukemia problem is quite small, which poses some challenges.

It is common to try to narrow down the search space with sophisticated discretization algorithms that filter out noise or irrelevant genes. Although good results have been reported (data size reductions of 50%-98%), the problem remains: hundreds or thousands of variables must be mined using just a few dozen samples.

The great advantage of using GeneXproTools in DNA microarray studies is that you can use the raw data (although you can also use filtered data) and still obtain excellent results. Not all models with good accuracy on the training data will have good predictive accuracy, but by making several runs you can select the top 10-20 models and then cross-reference the genes (attributes) used in all of them. You can then select and copy the most important genes from the Data Panel and create a much smaller dataset for building the final model.
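The cross-referencing step can be done by hand, but a small script makes the idea concrete. The following Python sketch counts in how many models each gene appears and keeps the genes shared by at least a given number of models. The function name `cross_reference`, the `min_models` threshold, and the gene names in the example are hypothetical; they are not GeneXproTools output.

```python
from collections import Counter

def cross_reference(models, min_models=2):
    """Count how many models use each gene; keep genes used by >= min_models."""
    # set(genes) ensures a gene is counted once per model, even if repeated.
    counts = Counter(gene for genes in models for gene in set(genes))
    return {gene: n for gene, n in counts.items() if n >= min_models}

# Hypothetical gene sets from three models (made-up names, not real results).
models = [
    {"d12", "d45", "d300"},
    {"d45", "d77"},
    {"d45", "d300", "d901"},
]
shared = cross_reference(models)
# List the most frequently used genes first.
for gene, n in sorted(shared.items(), key=lambda kv: (-kv[1], kv[0])):
    print(gene, n)
# prints:
# d45 3
# d300 2
```

Raising `min_models` shortens the list of candidate genes, which is the same trade-off made when choosing which genes to copy into the smaller dataset.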

For instance, for the ALL-AML Leukemia problem, a total of 308 promising genes were identified in 25 good runs, that is, runs with 100% training accuracy and testing accuracies between 91.18% and 97.06%. Of these 308 promising genes, only 11 (genes 759, 1881, 2287, 2407, 4362, 4846, 5485, 6040, 6587, 6638, and 6854) appeared in two or more models; and of these, only five (genes 759, 1881, 2287, 4846, and 6854) appeared in more than three models, with the most prevalent being genes 1881, 2287, and 4846, with eight, six, and four appearances, respectively. So, it is a good guess that the genes most likely to be involved in ALL-AML leukemia are genes 1881, 2287, and 4846, which is an exceptionally good starting point for tackling leukemia.
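The testing accuracies quoted above can be related to the size of the testing set. With 34 testing samples, the range is consistent with 31 to 33 samples classified correctly. This reading is an inference from the percentages, not something stated in the original results, and the helper name `accuracy_percent` is an assumption for this short check:

```python
N_TEST = 34  # testing samples in the ALL-AML sample run (from the text above)

def accuracy_percent(n_correct, n_total=N_TEST):
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100.0 * n_correct / n_total, 2)

for n_correct in (31, 32, 33):
    print(n_correct, accuracy_percent(n_correct))
# prints:
# 31 91.18
# 32 94.12
# 33 97.06
```

With so few testing samples, a single misclassification shifts the accuracy by almost three percentage points, which is one reason to compare several models rather than rely on one.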

References

Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286:531-537.

Prechelt, L., 1994. PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Univ. Karlsruhe, Germany.
