library(heuristica)
This document provides a simple example of how to compare the out-of-sample performance of different models in the heuristica package.
Replication
# Use this seed to exactly replicate my tables and graphs below.
set.seed(3)
# Remove it to see a new sampling, and whether the overall conclusions still hold.
Helper functions
The heuristica package loaded above provides the heuristics we will compare, as well as functions to calculate their accuracy.
Let’s enter the models we want to test:
vec_of_models <- c(ttbModel, unitWeightModel, regModel, minModel)
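Before wrapping these models in a cross-validation loop, here is a minimal sketch of the scoring step used below: fit one of them with ttbModel and score it with percentCorrectList on a small made-up data frame (the column names and values are purely illustrative).
# Tiny illustrative data set: a criterion in column 2 and two binary cues.
toy <- data.frame(name = c("A", "B", "C", "D"),
                  criterion = c(9, 7, 4, 2),
                  cue1 = c(1, 1, 0, 0),
                  cue2 = c(1, 0, 1, 0))
ttb <- ttbModel(toy, 2, c(3, 4))    # criterion in column 2, cues in columns 3 and 4
percentCorrectList(toy, list(ttb))  # percent of row pairs ranked correctly (fitting accuracy)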
Here’s a function that does cross-validation. It takes the vector of models, the criterion column, the columns to fit, the data set, the number of repetitions, and the training proportion as input:
crossV <- function(vec_of_models, criterion_col, cols_to_fit, data, reps, training_proportion){
  fitting <- vector()
  prediction <- vector()
  for(i in 1:reps){
    # Randomly sample training and test row indexes.
    train <- sample(1:nrow(data), nrow(data)*training_proportion)
    test <- setdiff(1:nrow(data), train)

    # Create the training and test sets.
    training_set <- data[train,]
    test_set <- data[test,]

    # If a regression is overdetermined (e.g. has too many columns), it will
    # drop the right-most columns. To instead make it drop random columns,
    # we shuffle the column order.
    shuffled_cols_to_fit <- sample(cols_to_fit)

    models <- list()
    y <- 0
    for (mod in vec_of_models) {  # Fit each model to the training_set.
      y <- y + 1
      models[[y]] <- mod(training_set, criterion_col, shuffled_cols_to_fit)
    }

    # Calculate the percentage of correct predictions.
    fittingAccuracy <- percentCorrectList(training_set, models)
    predictionAccuracy <- percentCorrectList(test_set, models)
    fitting <- rbind(fitting, fittingAccuracy)
    prediction <- rbind(prediction, predictionAccuracy)
  }
  results <- rbind(colMeans(fitting), colMeans(prediction))
  rownames(results) <- c("Fitting", "Prediction")
  results
}
City population
Now we can run this function to calculate predictive accuracy for different training and test set sizes. First, let's have the models predict the populations of 83 German cities using 9 binary cues. The criterion column may change depending on your data set, so set it correctly!
data("city_population")
data_set <- city_population
criterion_col <- 3
cols_to_fit <- 4:ncol(data_set)
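To double-check that these indexes point where you expect, you can inspect the column names (a quick sanity check, not part of the original analysis):
# criterion_col should name the population column; cols_to_fit the binary cues.
colnames(data_set)[criterion_col]
colnames(data_set)[cols_to_fit]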
Below, the models train on half of the data (a training proportion of 0.5) and predict the other half, and we repeat this for 100 random splits of the data.
reps <- 100
training_proportion <- 0.5
results <- crossV(vec_of_models, criterion_col, cols_to_fit, data_set, reps, training_proportion)
round(results, 1)
## ttbModel unitWeightModel regModel minModel
## Fitting 74.9 73.6 75.9 70.2
## Prediction 72.4 71.7 73.6 68.4
Finally, let’s plot the results:
library(ggplot2)
library(reshape)
rownames(results) <- c("Fitting","Prediction")
p <- melt(results)
colnames(p) <- c("condition","model","value")
ggplot(p, aes(x = condition, y = value, colour = model, group = model)) +
  geom_line() +
  geom_point() +
  xlab("Condition") + ylab("Percent correct")
High school drop-outs
Now do the same analysis for the high school drop-out data set. It has 23 real-valued cues (rather than binary cues) for 63 Chicago public high schools.
Note that this data set has NA values, so we use na.omit to remove them, because not all heuristics can handle missing data properly.
data(highschool_dropout)
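# A quick sanity check (my addition, not part of the original analysis): count
# how many schools (rows) contain at least one missing value before removing them.
sum(!complete.cases(highschool_dropout))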
data_set <- na.omit(highschool_dropout)
criterion_col <- 4
cols_to_fit <- 6:ncol(data_set)

reps <- 100
training_proportion <- 0.5
results <- crossV(vec_of_models, criterion_col, cols_to_fit, data_set, reps, training_proportion)
rownames(results) <- c("Fitting","Prediction")
p <- melt(results)
colnames(p) <- c("condition","model","value")
ggplot(p, aes(x = condition, y = value, colour = model, group = model)) +
  geom_line() +
  geom_point() +
  xlab("Condition") + ylab("Percent correct")
Discussion
The performance of all models drops when they are predicting unseen data. In the city population data set, the rank order of the models remains the same for fitting and prediction. However, when predicting high-school dropout rates, some of the simple models (ttbModel and unitWeightModel) outperform linear regression in prediction. These results suggest that different environmental structures (such as the number of cues in the environment) favor different strategies.
How would other models compare to take-the-best? Try some of the existing models in the heuristica package (e.g., logRegModel for logistic regression), or create your own model (see the vignette on how to make a heuristic).
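For instance, a run that adds logistic regression could reuse the objects defined above. A rough sketch, assuming logRegModel accepts the same arguments as the other models (results will vary across random splits):
# Add logistic regression to the comparison and re-run the cross-validation
# on the high-school data prepared above.
vec_of_models2 <- c(ttbModel, unitWeightModel, regModel, minModel, logRegModel)
results2 <- crossV(vec_of_models2, criterion_col, cols_to_fit, data_set, reps, training_proportion)
round(results2, 1)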