I’ve been using R to accomplish a lot of statistical and data mining tasks at work recently, and it has really brought me back to my *nix scripting roots. While the learning curve with R is somewhat steep compared to more graphical tools like SAS, Statistica, or SPSS, where it truly shines is its ability to express statistical computing tasks as programs. Much like UNIX, R forgoes the graphical environment for a component-based approach that more easily turns tasks into repeatable scripts. This can be particularly useful when tuning a data mining algorithm to a specific problem.

### The Problem

I had a data mining task at work for which, after some initial exploration, I decided to use the Random Forest classification algorithm: it can handle large numbers of input variables (thousands) without variable deletion, and it has particularly good methods for dealing with missing data while maintaining accuracy in an unbiased manner. The R implementation of Random Forest has several tuning parameters that can affect the accuracy of the model, most notably the number of trees to grow and the number of variables to consider at each split. Typically, finding the ideal settings for these two options requires several iterations of running the algorithm manually and analyzing the results, and even after several iterations the best possible settings might remain unknown. There are general rules for each of these parameters, but in practice they are highly data set dependent.

For the number of trees, more is generally better, but at what point do the diminishing returns outweigh the added complexity and processing cost? In some cases, adding more trees can actually increase the error rate. The number of variables to consider at each decision node is even more data set dependent. The general rule is to use the square root of the total number of input variables. However, when there are many insignificant variables this number may need to be increased, and that must be done with caution: choosing too many variables reduces the randomness of the random forest and can greatly reduce overall accuracy.
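For reference, the square-root rule described above is exactly what the randomForest package uses as its default `mtry` for classification. A quick sketch of how that default is computed (the variable count here is just an illustrative number):

```r
# randomForest's default mtry for classification is floor(sqrt(p)),
# where p is the number of input variables.
p <- 900                        # e.g. a data set with 900 input variables
mtry_default <- floor(sqrt(p))
print(mtry_default)             # 30
```

Any `mtry` value passed explicitly, as in the script below, overrides this default.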

### The Solution

Not wanting to leave anything up to chance, I wrote a short script in R to do the work for me. The script executes the model with thousands of combinations of the two tuning parameters (number of trees and number of variables). After each run, the accuracy of that specific model is calculated and printed along with the tuning parameters for further review. Once the script has completed, it’s a fairly simple task to sift through the output to find the combination with the lowest error rate.

Here’s the code:

```r
# Requires the randomForest package; roc() here is assumed to come from pROC
library(randomForest)
library(pROC)

for (i in 10:100) {
  for (j in 1:50) {
    # Set the tuning parameters for this iteration
    mtryVal  <- i
    ntreeVal <- 500 + (j * 30)
    set.seed(stageOne$seed)

    # Execute the random forest model
    stageOne$rf <- randomForest(Gold_1_Success ~ .,
      data = stageOne$dataset[crs$sample, c(stageOne$input, stageOne$target)],
      ntree = ntreeVal,
      mtry = mtryVal,
      importance = TRUE,
      na.action = na.roughfix,
      replace = FALSE)

    # Display the tuning parameters and the corresponding error rate;
    # votes[, 2] is the vote fraction for the positive class
    print(paste("number of trees:", ntreeVal, "number of variables:", mtryVal))
    print(roc(stageOne$rf$y, stageOne$rf$votes[, 2]))
    print("--------------------")
  }
}
```
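Sifting through thousands of printed lines can itself be automated: each run’s out-of-bag error can be appended to a data frame inside the loop, and the best combination selected at the end with `which.min`. A minimal sketch of that last step, using placeholder error values standing in for the `err.rate` output of each fitted model:

```r
# Sketch: one row per (ntree, mtry) combination, with placeholder OOB
# error rates standing in for rf$err.rate[ntree, "OOB"] from each run.
results <- expand.grid(ntree = c(530, 560, 590), mtry = c(10, 11))
results$oob <- c(0.29, 0.27, 0.28, 0.26, 0.25, 0.28)  # placeholder errors

# The winning combination is simply the row with the lowest error
best <- results[which.min(results$oob), ]
print(best)
```

In the real script, `results` would be built up with `rbind()` inside the double loop rather than hard-coded as it is here.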

The results varied somewhat across iterations, with accuracy ranging from 71% to 75%. Using this methodology not only saved hours of manual effort but also resulted in the selection of the most accurate model possible. The final result was a little over 2% better than using the “general rule” tuning parameters. That may seem small, but depending on the business case a 2% difference could be worth thousands of dollars, or could even be the difference between the model being accepted for production use or discarded.