Type: | Package |
Title: | Credit Scorecard Modelling Utils |
Version: | 0.0.1.0 |
Maintainer: | Arya Poddar <aryapoddar290990@gmail.com> |
Description: | Provides infrastructure functionalities such as missing value treatment, information value calculation, GINI calculation etc. which are used for developing a traditional credit scorecard as well as a machine learning based model. The functionalities defined are standard steps for any credit underwriting scorecard development, extensively used in financial domain. |
License: | GPL-2 | GPL-3 |
LazyData: | TRUE |
RoxygenNote: | 6.0.1 |
Imports: | car, e1071, gbm, partykit, randomForest, reshape2, sqldf, stringr, stats, ggplot2, utils |
NeedsCompilation: | no |
Packaged: | 2019-04-14 16:04:05 UTC; ARYASOURYA |
Author: | Arya Poddar [aut, cre], Aiana Goyal [ctb], Kanishk Dogar [ctb] |
Repository: | CRAN |
Date/Publication: | 2019-04-14 20:53:03 UTC |
Clubbing class of categorical variables with low population percentage with another class of similar event rate
Description
The function groups classes of categorical variables that have a population percentage below a threshold with another class of similar event rate. If no class with exactly the same event rate is available, the class is clubbed with the one whose event rate is higher and closest to it.
Usage
cat_new_class(base, target, cat_var_name, threshold, event = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
cat_var_name |
column name or array of column names of categorical variable on which the operation is to be done, to be passed as string |
threshold |
threshold population percentage below which the class will be considered for clubbing with another class, to be provided as decimal/fraction |
event |
(optional) the event class, to be passed as 0 or 1 (default is 1) |
Value
The function returns an object of class "cat_new_class" which is a list containing the following components:
base_new |
a dataframe after clubbing low percentage classes with another class of similar or closest but higher event rate |
cat_class_new |
a dataframe with mapping between original classes and new clubbed classes (if any) |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Kanishk Dogar <Kanishkd4@gmail.com>
Examples
data <- iris[1:110,]
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data_newclass <- cat_new_class(base = data,target = "Y",cat_var_name = "Species",threshold = 0.1)
IV table for individual categorical variable
Description
The function takes base data, target and the categorical variable for which IV is to be calculated. It returns a dataframe with the WOE and IV value of the variable.
Usage
categorical_iv(base, target, variable, event = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
variable |
categorical variable name for which IV is to be calculated, to be passed as string |
event |
(optional) the event class, to be passed as 0 or 1 (default is 1) |
Value
The function returns a dataframe.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Aiana Goyal <aianagoel002@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
cat_iv <- categorical_iv(base = data,target = "Y",variable = "Species",event = 1)
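As a rough illustration of the quantities such a table reports, the sketch below computes WOE and IV for one categorical variable in base R. It assumes the common convention WOE = ln(share of non-events / share of events) and IV = sum((share of non-events - share of events) * WOE). The package's own sign convention and output columns are not stated here, so `woe_iv_sketch` and its column names are hypothetical.

```r
# Hypothetical helper, not the package implementation.
# Assumed convention: WOE = log(dist_nonevent / dist_event).
# A class with zero events or zero non-events would give an infinite WOE.
woe_iv_sketch <- function(base, target, variable, event = 1) {
  is_event <- base[[target]] == event
  classes <- sort(unique(as.character(base[[variable]])))
  n_event <- sapply(classes, function(cl) sum(is_event & base[[variable]] == cl))
  n_nonevent <- sapply(classes, function(cl) sum(!is_event & base[[variable]] == cl))
  dist_event <- n_event / sum(n_event)
  dist_nonevent <- n_nonevent / sum(n_nonevent)
  woe <- log(dist_nonevent / dist_event)
  iv <- (dist_nonevent - dist_event) * woe
  data.frame(class = classes, event = n_event, nonevent = n_nonevent,
             woe = woe, iv = iv, row.names = NULL)
}

toy <- data.frame(
  grade = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Y     = c(1, 0, 0, 0, 1, 1, 1, 0)
)
woe_tab <- woe_iv_sketch(toy, target = "Y", variable = "grade")
woe_tab
sum(woe_tab$iv)  # total IV of the variable
```

The IV of the variable is the sum of the per-class contributions, which is why a single class with a very different event rate can dominate the total.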
Clubbing class of a categorical variable with low population percentage with another class of similar event rate
Description
The function groups classes of a categorical variable that have a population percentage below a threshold with another class of similar event rate. If no class with exactly the same event rate is available, the class is clubbed with the one whose event rate is higher and closest to it.
Usage
club_cat_class(base, target, variable, threshold, event = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
variable |
column name of categorical variable on which the operation is to be done, to be passed as string |
threshold |
threshold population percentage below which the class will be considered for clubbing with another class, to be provided as decimal/fraction |
event |
(optional) the event class, to be passed as 0 or 1 (default is 1) |
Value
The function returns a dataframe after clubbing low percentage classes with another class of similar or closest but higher event rate.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Kanishk Dogar <kanishkd4@gmail.com>
Examples
data <- iris[1:110,]
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data_clubclass <- club_cat_class(base = data,target = "Y",variable = "Species",threshold = 0.2)
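The clubbing rule described for this function can be sketched in base R as follows. This is only an illustration under stated assumptions: `club_sketch` is a hypothetical helper, and the fallback used when no larger class has a higher event rate (take the nearest class overall) is a guess rather than documented behaviour.

```r
# Hypothetical helper, not the package implementation.
club_sketch <- function(base, target, variable, threshold, event = 1) {
  x <- as.character(base[[variable]])
  is_event <- base[[target]] == event
  pop_perc <- table(x) / length(x)          # population share per class
  event_rate <- tapply(is_event, x, mean)   # event rate per class
  small <- names(pop_perc)[pop_perc < threshold]
  large <- setdiff(names(pop_perc), small)
  if (length(large) == 0) return(base)      # nothing to club into
  for (cl in small) {
    # Prefer classes whose event rate is the same or higher
    higher <- large[event_rate[large] >= event_rate[cl]]
    # Assumed fallback: if no such class exists, consider all large classes
    candidates <- if (length(higher) > 0) higher else large
    nearest <- candidates[which.min(abs(event_rate[candidates] - event_rate[cl]))]
    x[x == cl] <- nearest
  }
  base[[variable]] <- x
  base
}

# A: 50 rows, event rate 0.3; B: 45 rows, event rate 0.6; C: 5 rows, event rate 0.4
toy <- data.frame(
  grade = rep(c("A", "B", "C"), times = c(50, 45, 5)),
  Y = c(rep(c(1, 0), times = c(15, 35)),
        rep(c(1, 0), times = c(27, 18)),
        rep(c(1, 0), times = c(2, 3)))
)
clubbed <- club_sketch(toy, target = "Y", variable = "grade", threshold = 0.1)
table(clubbed$grade)
```

In this toy data class C is closer to A in event rate, but it is merged into B because the rule prefers the closest class with a higher event rate.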
Variable reduction based on Cramer's V filter
Description
The function returns a list of variables that can be dropped because of high correlation with another variable, based on Cramer's V and IV. If V1 and V2 have a Cramer's V value above a user-defined threshold, the variable with the lower IV is recommended to be dropped. A variable that has already been dropped is not used to drop any further variables.
Usage
cv_filter(cv_table, iv_table, threshold)
Arguments
cv_table |
dataframe of class cv_table with three columns - var_1, var_2, cv_value |
iv_table |
dataframe of class iv_table with two columns - Variable_name, iv |
threshold |
Cramer's V value above which one of the variables will be recommended to be dropped |
Value
An object of class "cv_filter" is a list containing the following components:
retain_var_list |
list of variables remaining post CV filter |
dropped_var_list |
list of variables that can be dropped based on CV filter |
dropped_var_tab |
CV correlation value for dropped variables as a dataframe |
threshold |
threshold CV value used as input parameter |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
cv_tab_list <- cv_table(data, c("Species", "Sepal.Length"))
cv_tab <- cv_tab_list$cv_val_tab
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
iv_tab <- iv_table_list$iv_table
cv_filter_list <- cv_filter(cv_table = cv_tab,iv_table = iv_tab,threshold = 0.5)
cv_filter_list$retain_var_list
cv_filter_list$dropped_var_list
cv_filter_list$dropped_var_tab
cv_filter_list$threshold
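The filtering rule in the description can be illustrated with a small base R sketch. It assumes the column names given in the Arguments section (var_1, var_2, cv_value, Variable_name, iv) and processes pairs in table order, which is an assumption; `cv_filter_sketch` is a hypothetical name and the package may order or break ties differently.

```r
# Hypothetical helper, not the package implementation.
cv_filter_sketch <- function(cv_tab, iv_tab, threshold) {
  iv <- setNames(iv_tab$iv, as.character(iv_tab$Variable_name))
  v1_all <- as.character(cv_tab$var_1)
  v2_all <- as.character(cv_tab$var_2)
  keep <- cv_tab$cv_value > threshold & v1_all != v2_all
  v1_all <- v1_all[keep]
  v2_all <- v2_all[keep]
  dropped <- character(0)
  for (i in seq_along(v1_all)) {
    v1 <- v1_all[i]
    v2 <- v2_all[i]
    # A variable that is already dropped does not drop any other variable
    if (v1 %in% dropped || v2 %in% dropped) next
    dropped <- c(dropped, if (iv[[v1]] < iv[[v2]]) v1 else v2)
  }
  list(retain_var_list = setdiff(names(iv), dropped),
       dropped_var_list = dropped)
}

cv_toy <- data.frame(var_1 = c("a", "b", "a"),
                     var_2 = c("b", "c", "c"),
                     cv_value = c(0.9, 0.8, 0.2))
iv_toy <- data.frame(Variable_name = c("a", "b", "c"),
                     iv = c(0.5, 0.3, 0.1))
cv_res <- cv_filter_sketch(cv_toy, iv_toy, threshold = 0.5)
cv_res$retain_var_list
cv_res$dropped_var_list
```

Here b is dropped because of its pairing with a, so the pair (b, c) no longer causes c to be dropped.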
Pairwise Cramer's V among a list of categorical variables
Description
The function gives a dataframe with the pairwise Cramer's V value for all possible combinations of categorical variables from the list of variables provided.
Usage
cv_table(base, column_name)
Arguments
base |
input dataframe |
column_name |
column name or array of column names for which Cramer's V is to be calculated |
Value
An object of class "cv_table" is a list containing the following components:
cv_val_tab |
pairwise Cramer's V value as a dataframe |
single_class_var_index |
array of column index of variables with only one class |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Sepal.Length <- as.character(floor(data$Sepal.Length))
cv_tab_list <- cv_table(data, c("Species", "Sepal.Length"))
cv_tab_list$cv_val_tab
cv_tab_list$single_class_var_index
Cramer's V value between two categorical variables
Description
The function gives the pairwise Cramer's V value between two input categorical variables.
Usage
cv_test(base, var_1, var_2)
Arguments
base |
input dataframe |
var_1 |
categorical variable name, to be passed as string |
var_2 |
categorical variable name, to be passed as string |
Value
The function returns a dataframe with pairwise CV value.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Sepal.Length <- as.character(floor(data$Sepal.Length))
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
cv_result <- cv_test(base = data,var_1 = "Species",var_2 = "Sepal.Length")
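For reference, Cramer's V can be computed from a chi-squared statistic as sqrt(chi-squared / (n * (k - 1))), where n is the number of observations and k is the smaller of the number of rows and columns of the contingency table. The base R sketch below uses that textbook formula without bias correction; whether the package applies any correction is not stated here, and `cramers_v_sketch` is a hypothetical name.

```r
# Hypothetical helper using the uncorrected textbook formula.
cramers_v_sketch <- function(x, y) {
  tab <- table(x, y)
  chi_sq <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi_sq / (n * (k - 1))))
}

species <- as.character(iris$Species)
sepal_band <- as.character(floor(iris$Sepal.Length))
v_pair <- cramers_v_sketch(species, sepal_band)
v_self <- cramers_v_sketch(species, species)  # a variable against itself
v_pair
v_self
```

A variable compared with itself gives the maximum association, which is a quick sanity check for any implementation.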
Getting the split value for terminal nodes from decision tree
Description
The function takes a ctree type model, with only one numerical variable, as argument input and gives a dataframe with the minimum and maximum value of each node. The intervals are open at the lower limit and closed at the upper limit.
Usage
dtree_split_val(desc_model, variable)
Arguments
desc_model |
ctree class model with one variable |
variable |
name of the numerical variable on which the decision tree was run, to be passed as string |
Value
The function returns a dataframe giving the lower and upper bound of split values of each node.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
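data$Y <- ifelse(data$Species=="setosa",1,0)
# Assumed completion of the truncated example: fit a one-variable ctree, then extract its splits
tree_model <- partykit::ctree(Y ~ Sepal.Length,data = data)
split_tab <- dtree_split_val(desc_model = tree_model,variable = "Sepal.Length")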
Recursive Decision Tree partitioning with monotonic event rate along with IV table for individual numerical variable
Description
The function takes base data, target and the numerical variable which is to be binned. It returns the optimal cuts based on a recursive partitioning decision tree such that the trend of event rate holds, i.e. it is strictly monotonically increasing or decreasing. If missing values are imputed by an extreme value, that value can be passed as an argument, and it will be shown as a separate category. The output is a dataframe with the WOE and IV value.
Usage
dtree_trend_iv(base, target, variable, num_missing = -99999,
mincriterion = 0.1, event = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
variable |
numerical variable name which is to be binned into categorical buckets, to be passed as string |
num_missing |
(optional) imputed missing value for numerical variable or an array of values which are to be kept as different bucket in binning step (default value is -99999) |
mincriterion |
(optional) the value of the test statistic or (1 - p-value) that must be exceeded in order to implement a split (default value is 0.1) |
event |
(optional) the event class, to be passed as 0 or 1 (default is 1) |
Value
The function returns a dataframe with count and iv.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Aiana Goyal <aianagoel002@gmail.com>
Examples
data <- iris
data$Y <- ifelse(data$Species=="setosa",1,0)
dtree_trend_tab <- dtree_trend_iv(base = data,target = "Y",variable = "Sepal.Length",event = 1)
Creates confusion matrix and its related measures
Description
The function takes the base dataframe with observed/actual and predicted columns. The actual/predicted class preferably should be binary and if not, it will be considered as event vs rest. It computes the performance measures like accuracy, precision, recall, sensitivity, specificity and f1 score.
Usage
fn_conf_mat(base, observed_col, predicted_col, event)
Arguments
base |
input dataframe |
observed_col |
column / field name of the observed event |
predicted_col |
column / field name of the predicted event |
event |
the event class, to be passed as string |
Value
An object of class "fn_conf_mat" is a list containing the following components:
confusion_mat |
confusion matrix as a table |
accuracy |
accuracy measure |
precision |
precision measure |
recall |
recall measure |
sensitivity |
sensitivity measure |
specificity |
specificity measure |
f1_score |
F1 score |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data$Y_pred <- sample(0:1,size=nrow(data),replace=TRUE)
fn_conf_mat_list <- fn_conf_mat(base = data,observed_col = "Y",predicted_col = "Y_pred",event = 1)
fn_conf_mat_list$confusion_mat
fn_conf_mat_list$accuracy
fn_conf_mat_list$precision
fn_conf_mat_list$recall
fn_conf_mat_list$sensitivity
fn_conf_mat_list$specificity
fn_conf_mat_list$f1_score
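The measures listed above follow from the four cells of an event-vs-rest confusion matrix. The base R sketch below shows the standard definitions on a small hand-made vector; `conf_measures_sketch` is a hypothetical helper and is not the package's code.

```r
# Hypothetical helper using the standard definitions.
conf_measures_sketch <- function(observed, predicted, event) {
  obs <- observed == event    # event vs rest
  pred <- predicted == event
  tp <- sum(obs & pred)
  fp <- sum(!obs & pred)
  fn <- sum(obs & !pred)
  tn <- sum(!obs & !pred)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  list(accuracy = (tp + tn) / length(obs),
       precision = precision,
       recall = recall,
       sensitivity = recall,            # sensitivity is another name for recall
       specificity = tn / (tn + fp),
       f1_score = 2 * precision * recall / (precision + recall))
}

observed  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
conf_m <- conf_measures_sketch(observed, predicted, event = 1)
conf_m
```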
Creates random index for k-fold cross validation
Description
The function takes the base and returns a list of length k, to be used for k-fold cross validation sampling. Each element of the returned list is an array of random indices for sampling one fold.
Usage
fn_cross_index(base, k)
Arguments
base |
input dataframe |
k |
number of cross validation folds |
Value
The function returns a list of length k, each element holding an array of index/row numbers for sampling the base.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data$Y_pred <- sample(0:1,size=nrow(data),replace=TRUE)
data_k_list <- fn_cross_index(base = data,k = 5)
data_k_list$index1
data_k_list$index2
data_k_list$index3
data_k_list$index4
data_k_list$index5
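One simple way to build such a list is to shuffle the row numbers and deal them into k groups, as in the base R sketch below. The element names index1 to indexk mirror the example above; that the folds are disjoint and near-equal in size is an assumption of this sketch, and `cross_index_sketch` is a hypothetical name.

```r
# Hypothetical helper, not the package implementation.
cross_index_sketch <- function(n_rows, k) {
  shuffled <- sample(seq_len(n_rows))               # random permutation of row numbers
  fold_id <- rep(seq_len(k), length.out = n_rows)   # deal rows into k folds
  folds <- split(shuffled, fold_id)
  names(folds) <- paste0("index", seq_len(k))
  folds
}

set.seed(11)
folds <- cross_index_sketch(nrow(iris), k = 5)
lengths(folds)
```

Each fold can then serve once as the hold-out sample, for example `iris[folds$index1, ]`, while the remaining rows are used for fitting.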
Computes error measures between observed and predicted values
Description
The function takes the input dataframe with observed and predicted columns and computes mean absolute error, mean squared error and root mean squared error terms.
Usage
fn_error(base, observed_col, predicted_col)
Arguments
base |
input dataframe |
observed_col |
column / field name of the observed event |
predicted_col |
column / field name of the predicted event |
Value
An object of class "fn_error" is a list containing the following components:
mean_abs_error |
mean absolute error between observed and predicted value |
mean_sq_error |
mean squared error between observed and predicted value |
root_mean_sq_error |
root mean squared error between observed and predicted value |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data$Y_pred <- sample(0:1,size=nrow(data),replace=TRUE)
fn_error_list <- fn_error(base = data,observed_col = "Y",predicted_col = "Y_pred")
fn_error_list$mean_abs_error
fn_error_list$mean_sq_error
fn_error_list$root_mean_sq_error
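The three measures have simple closed forms, shown here in a base R sketch on a four-element vector. The component names follow the Value section above; `error_sketch` itself is a hypothetical helper.

```r
# Hypothetical helper using the standard definitions.
error_sketch <- function(observed, predicted) {
  residual <- observed - predicted
  mse <- mean(residual^2)
  list(mean_abs_error = mean(abs(residual)),
       mean_sq_error = mse,
       root_mean_sq_error = sqrt(mse))
}

err <- error_sketch(observed = c(3, 5, 2, 7), predicted = c(2, 5, 4, 4))
err$mean_abs_error
err$mean_sq_error
err$root_mean_sq_error
```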
Calculating mode value of a vector
Description
The function returns the mode of a vector. The vector can be of any datatype, i.e. numerical or categorical.
Usage
fn_mode(x)
Arguments
x |
a vector of string or number |
Value
The function returns the mode value of the input vector.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
fn_mode(c(1,2,3,1,4,1,7))
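Base R has no built-in function for the statistical mode, so a helper is usually written around `table()`. The sketch below is one such version; `mode_sketch` is a hypothetical name, and returning the first of several tied values (in sorted order) is an assumption rather than documented behaviour of `fn_mode`.

```r
# Hypothetical helper, not the package implementation.
mode_sketch <- function(x) {
  counts <- table(x)                       # frequencies, NA excluded
  top <- names(counts)[which.max(counts)]  # first value with the highest count
  x[match(top, as.character(x))]           # return it in the original type
}

mode_num <- mode_sketch(c(1, 2, 3, 1, 4, 1, 7))
mode_chr <- mode_sketch(c("a", "b", "b"))
mode_num
mode_chr
```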
Redefines target value
Description
The function redefines the "binary" target variable to be used for modelling. It takes the variable or field name of the target and the event class. It changes the target field name to "Target", changes the events into 1 and non-events as 0 and places the target column at the end of the dataframe before returning it as output.
Usage
fn_target(base, target, event)
Arguments
base |
input dataframe |
target |
column / field name for the target variable, to be passed as string |
event |
the event class, to be passed as string |
Value
The function returns a dataframe after changing the target classes to 0 or 1.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data2 <- fn_target(base = data,target = "Y",event = 1)
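The transformation described above amounts to three steps: recode the event class to 1 and everything else to 0, rename the column to "Target", and move it to the end. A minimal base R sketch, with the hypothetical name `target_sketch`, could look like this:

```r
# Hypothetical helper, not the package implementation.
target_sketch <- function(base, target, event) {
  new_target <- as.integer(base[[target]] == event)  # events become 1, the rest 0
  base[[target]] <- NULL                             # remove the original column
  base$Target <- new_target                          # append as the last column
  base
}

iris_target <- target_sketch(iris, target = "Species", event = "setosa")
names(iris_target)
table(iris_target$Target)
```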
Performance measure table with Gini coefficient, KS-statistics and Gini lift curve
Description
The function takes a dataframe along with a model or the name of a column with predicted values. If a model is provided as argument (only lm or glm is guaranteed to work), the response on the data is predicted. Otherwise, if the data already contains a predicted column, it can be referred to as an argument. The predicted column thus obtained is classified into bands to get the Gini coefficient, Kolmogorov-Smirnov statistic and Gini lift curve. The number of bands required can be passed as argument, with a default value of 10, i.e. decile binning is done. Alternatively, the cutpoints for converting the predicted value into bands can be specified.
Usage
gini_table(base, target, col_pred = F, model = F, brk = F,
quantile_pt = 10, event_rate_direction = "decreasing")
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
col_pred |
(optional) column name which contains the predicted value, not required if "model" is provided (default value is FALSE) |
model |
(optional) object of type lm or glm model, required only if "col_pred"=FALSE (default value is FALSE) |
brk |
(optional) array of break points of predicted value (default value is FALSE) |
quantile_pt |
(optional) number of quantiles to divide the predicted value range (default value is 10) |
event_rate_direction |
(optional) directionality of event rate with increasing value of predicted column, to be chosen among "increasing" or "decreasing" (default value is decreasing) |
Value
An object of class "gini_table" is a list containing the following components:
prediction |
base with the predicted value as a dataframe |
gini_tab |
gini table as a dataframe |
gini_value |
gini coefficient value |
gini_plot |
gini curve plot |
ks_value |
Kolmogorov-Smirnov statistic |
breaks |
break points |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Aiana Goyal <aianagoel002@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y_pred <- sample(300:900,size=nrow(data),replace=TRUE)
gini_tab_list <- gini_table(base = data,target = "Y",col_pred = "Y_pred",quantile_pt = 10)
gini_tab_list$prediction
gini_tab_list$gini_tab
gini_tab_list$gini_value
gini_tab_list$gini_plot
gini_tab_list$ks_value
gini_tab_list$breaks
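To make the banding idea concrete, the base R sketch below bins a predicted column into quantile bands, accumulates the share of events and non-events band by band, and derives a Gini coefficient (2 * area under the cumulative curve - 1) and a KS statistic (largest gap between the two cumulative shares). The exact formulas used by the package are not given here, so this is one common formulation under that assumption, and `gini_ks_sketch` is a hypothetical name.

```r
# Hypothetical helper, not the package implementation.
gini_ks_sketch <- function(target, predicted, n_bands = 10,
                           event_rate_direction = "decreasing") {
  brk <- unique(stats::quantile(predicted,
                                probs = seq(0, 1, length.out = n_bands + 1)))
  band <- cut(predicted, breaks = brk, include.lowest = TRUE)
  events <- as.vector(table(band[target == 1]))
  nonevents <- as.vector(table(band[target == 0]))
  # Order bands from the highest expected event rate to the lowest
  if (event_rate_direction == "increasing") {
    events <- rev(events)
    nonevents <- rev(nonevents)
  }
  cum_event <- cumsum(events) / sum(events)
  cum_nonevent <- cumsum(nonevents) / sum(nonevents)
  # Trapezoidal area under cumulative events vs cumulative non-events
  prev_event <- c(0, cum_event[-length(cum_event)])
  auc <- sum(diff(c(0, cum_nonevent)) * (cum_event + prev_event) / 2)
  list(gini_value = 2 * auc - 1,
       ks_value = max(abs(cum_event - cum_nonevent)))
}

# Perfectly separating score: all events have the lowest predicted values
perf <- gini_ks_sketch(target = c(rep(1, 50), rep(0, 50)), predicted = 1:100)
perf$gini_value
perf$ks_value
```

With a perfectly separating score both measures reach their maximum, while a score unrelated to the target gives values near zero.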
Hyperparameter optimisation or parameter tuning for Gradient Boosting Regression Modelling by grid search
Description
The function runs a grid search with k-fold cross validation to arrive at the best parameters as decided by a performance measure. The parameters that can be tuned using this function for the gradient boosting regression modelling algorithm are ntree, depth, shrinkage, min_obs and bag_fraction. The objective function to be minimised is the error (mean absolute error / mean squared error / root mean squared error). For the grid search, the possible values of each tuning parameter need to be passed as an array into the function.
Usage
gradient_boosting_parameters(base, target, ntree, depth, shrinkage, min_obs,
bag_fraction, error = "rmse", cv = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
ntree |
number of trees to be fitted |
depth |
maximum depth of variable interactions |
shrinkage |
learning rate |
min_obs |
minimum size of terminal nodes |
bag_fraction |
fraction of the training set observations randomly selected for next tree |
error |
(optional) error measure as objective function to be minimised, to be chosen among "mae", "mse" and "rmse" (default value is "rmse") |
cv |
(optional) k value for k-fold cross validation to be performed (default value is 1, i.e. without cross validation) |
Value
An object of class "gradient_boosting_parameters" is a list containing the following components:
error_tab_detailed |
error summary for each cross validation sample of the parameter combinations iterated during grid search as a dataframe |
error_tab_summary |
error summary for each combination of parameters as a dataframe |
best_ntree |
ntree parameter of the optimal solution |
best_depth |
depth parameter of the optimal solution |
best_shrinkage |
shrinkage parameter of the optimal solution |
best_min_obs |
min_obs parameter of the optimal solution |
best_bag_fraction |
bag_fraction parameter of the optimal solution |
runtime |
runtime of the entire process |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
gbm_params_list <- gradient_boosting_parameters(base = data,target = "Y",ntree = 2,depth = 2,
shrinkage = 0.1,min_obs = 0.1,bag_fraction = 0.7)
gbm_params_list$error_tab_detailed
gbm_params_list$error_tab_summary
gbm_params_list$best_ntree
gbm_params_list$best_depth
gbm_params_list$best_shrinkage
gbm_params_list$best_min_obs
gbm_params_list$best_bag_fraction
gbm_params_list$runtime
Variable reduction based on Information Value filter
Description
The function returns a list of variables that can be dropped because of low discriminatory power, based on Information Value. If IV for a variable is less than a user defined threshold, the variable will be recommended to be dropped by this function.
Usage
iv_filter(base, iv_table, threshold)
Arguments
base |
input dataframe |
iv_table |
dataframe of class iv_table with two columns - Variable_name, iv |
threshold |
threshold IV value below which the variable will be recommended to be dropped |
Value
An object of class "iv_filter" is a list containing the following components:
retain_var_tab |
variables remaining post IV filter as a dataframe |
retain_var_name |
array of column names of variables to be retained |
dropped_var_tab |
variables that can be dropped based on IV filter as a dataframe |
threshold |
threshold IV value used as input parameter |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
ivf_list <- iv_filter(base = data,iv_table = iv_table_list$iv_table,threshold = 0.02)
ivf_list$retain_var_tab
ivf_list$retain_var_name
ivf_list$dropped_var_tab
ivf_list$threshold
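The filter itself is a simple split of the IV table at the threshold. The base R sketch below uses the two documented columns (Variable_name, iv) on made-up values; treating a variable whose IV equals the threshold as retained is an assumption.

```r
# Illustrative IV values, not produced by the package
iv_toy <- data.frame(Variable_name = c("v1", "v2", "v3"),
                     iv = c(0.30, 0.01, 0.05))
iv_threshold <- 0.02

retain_tab <- iv_toy[iv_toy$iv >= iv_threshold, ]   # enough discriminatory power
dropped_tab <- iv_toy[iv_toy$iv < iv_threshold, ]   # recommended to be dropped
retain_name <- as.character(retain_tab$Variable_name)

retain_name
dropped_tab
```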
WOE and IV table for list of numerical and categorical variables
Description
The function takes column names of categorical and numerical variables and returns a list with four dataframes: a WOE table of numerical variables, a WOE table of categorical variables, a consolidated table of both numerical and categorical variables, and an IV table.
Usage
iv_table(base, target, num_var_name = F, num_missing = -99999,
cat_var_name = F, mincriterion = 0.1, event = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
num_var_name |
column name or array of column names of numerical variable for which IV is to be calculated, to be passed as string |
num_missing |
(optional) imputed missing value for numerical variable or an array of values which are to be kept as different bucket in binning step (default value is -99999) |
cat_var_name |
column name or array of column names of categorical variable for which IV is to be calculated, to be passed as string |
mincriterion |
(optional) the value of the test statistic or (1 - p-value) that must be exceeded in order to implement a split (default value is 0.1) |
event |
(optional) the event class, to be passed as 0 or 1 (default is 1) |
Value
An object of class "iv_table" is a list containing the following components:
num_woe_table |
numerical woe table with IV as a dataframe |
cat_woe_table |
categorical woe table with IV as a dataframe |
woe_table |
numerical and categorical woe table with IV as a dataframe |
iv_table |
Variable with IV value as a dataframe |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Aiana Goyal <aianagoel002@gmail.com>
Kanishk Dogar <kanishkd4@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
iv_table_list$num_woe_table
iv_table_list$cat_woe_table
iv_table_list$woe_table
iv_table_list$iv_table
Missing value imputation
Description
The function imputes the missing values in the input dataset. For numerical variables, missing values can be replaced by one of four methods: 1. "mean" - mean or simple average of the non-missing values; 2. "median" - median or the 50th percentile of the non-missing values; 3. "mode" - mode or the value with maximum frequency among the non-missing values; 4. a special extreme value of the user's choice, to be passed as an argument (-99999 is the default value). For categorical variables, the missing class can be replaced by one of two methods: 1. "mode" - mode or the class with maximum frequency among the non-missing values; 2. a special class of the user's choice, to be passed as an argument ("missing_value" is the default class). The target column will remain unchanged.
Usage
missing_val(base, target, num_missing = -99999,
cat_missing = "missing_value")
Arguments
base |
input dataframe |
target |
column/field name of the target variable, to be passed as a string |
num_missing |
(optional) method for replacing missing values for numerical type fields - to be chosen between "mean", "median", "mode" or a value of users' choice (default value is -99999) |
cat_missing |
(optional) method for replacing missing values for categorical type fields - to be chosen between "mode" or a class of users' choice (default value is "missing_value") |
Value
The function returns an object of class "missing_val" which is a list containing the following components:
base |
a dataframe after imputing missing values |
mapping_table |
a dataframe with mapping between original variable and imputed missing value (if any) |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data[sample(1:nrow(data),size=25),"Sepal.Length"] <- NA
data[sample(1:nrow(data),size=10),"Species"] <- NA
missing_list <- missing_val(base = data,target = "Y")
missing_list$base
missing_list$mapping_table
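The imputation options described above can be sketched in base R as below. `impute_sketch` is a hypothetical helper that only returns the imputed dataframe (no mapping table), and its treatment of non-numeric columns as character is an assumption.

```r
# Hypothetical helper, not the package implementation.
impute_sketch <- function(base, target, num_missing = -99999,
                          cat_missing = "missing_value") {
  for (col in setdiff(names(base), target)) {   # the target column is left untouched
    x <- base[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      fill <- switch(as.character(num_missing),
                     mean = mean(x, na.rm = TRUE),
                     median = stats::median(x, na.rm = TRUE),
                     mode = as.numeric(names(which.max(table(x)))),
                     num_missing)               # otherwise use the value as given
    } else {
      x <- as.character(x)
      fill <- if (identical(cat_missing, "mode")) {
        names(which.max(table(x)))
      } else {
        cat_missing
      }
    }
    x[is.na(x)] <- fill
    base[[col]] <- x
  }
  base
}

toy <- data.frame(income = c(10, NA, 30, 50),
                  city = c("X", NA, "X", "Y"),
                  Y = c(1, 0, NA, 1),
                  stringsAsFactors = FALSE)
filled_default <- impute_sketch(toy, target = "Y")
filled_median <- impute_sketch(toy, target = "Y",
                               num_missing = "median", cat_missing = "mode")
filled_default
filled_median
```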
Binning numerical variables based on cuts from IV table
Description
The function takes the num_woe_table output from an object of class "iv_table". Based on the split points from the num_woe_table, the numerical variables are binned into categories.
Usage
num_to_cat(base, num_woe_table, num_missing = -99999)
Arguments
base |
input dataframe |
num_woe_table |
num_woe_table class from iv table output |
num_missing |
(optional) imputed missing value for numerical variable (default value is -99999) |
Value
The function returns a dataframe after converting the numerical variables into categorical classes.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
num_cat <- num_to_cat(base = data,num_woe_table = iv_table_list$num_woe_table)
Clubbing of classes of categorical variable with low population percentage into one class
Description
The function groups the classes of a categorical variable which have population percentage less than a threshold as "Low_pop_perc". The user can choose whether to club the missing class or keep it as separate class. The default setting is that missing classes are not treated separately.
Usage
others_class(base, target, column_name, threshold, char_missing = NA)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
column_name |
column name or array of column names of the dataframe on which the operation is to be done |
threshold |
threshold population percentage below which the class is to be classified as others, to be provided as decimal/fraction |
char_missing |
(optional) imputed missing value for categorical variable if it is to be kept separate (default value is NA) |
Value
base |
a dataframe after converting all low percentage classes into "Low_pop_perc" class |
mapping_table |
a dataframe listing the original classes that were converted into the "Low_pop_perc" class (if any) |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris[c(1:110),]
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
data$Species <- as.character(data$Species)
data_otherclass <- others_class(base = data,target = "Y",column_name = "Species",threshold = 0.15)
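The core of this operation is a frequency count followed by a relabelling, which the base R sketch below illustrates on a single vector. `low_pop_sketch` is a hypothetical helper; it leaves NA values untouched, which is an assumption about how missing classes are handled.

```r
# Hypothetical helper, not the package implementation.
low_pop_sketch <- function(x, threshold) {
  x <- as.character(x)
  pop_perc <- table(x) / length(x)                 # population share per class
  small <- names(pop_perc)[pop_perc < threshold]
  x[x %in% small] <- "Low_pop_perc"
  x
}

# In the first 110 rows of iris, virginica accounts for 10 / 110 (about 9.1%) of rows
species_110 <- as.character(iris$Species[1:110])
species_new <- low_pop_sketch(species_110, threshold = 0.15)
table(species_new)
```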
Hyperparameter optimisation or parameter tuning for Random Forest by grid search
Description
The function runs a grid search with k-fold cross validation to arrive at the best parameters as decided by a performance measure. The parameters that can be tuned using this function for the random forest algorithm are ntree, mtry, maxnodes and nodesize. The objective function to be minimised is the error (mean absolute error / mean squared error / root mean squared error). For the grid search, the possible values of each tuning parameter need to be passed as an array into the function.
Usage
random_forest_parameters(base, target, model_type, ntree, mtry,
maxnodes = NULL, nodesize, error = "rmse", cv = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
model_type |
to be chosen among "regression" or "classification" |
ntree |
number of trees to be fitted |
mtry |
number of variables to be sampled as split criteria at each node |
maxnodes |
(optional) maximum number of terminal nodes (default is NULL, i.e. no restriction on depth of the trees) |
nodesize |
minimum size of terminal nodes |
error |
(optional) error measure as objective function to be minimised, to be chosen among "mae", "mse" and "rmse" (default value is "rmse") |
cv |
(optional) k value for k-fold cross validation to be performed (default value is 1, i.e. without cross validation) |
Value
An object of class "random_forest_parameters" is a list containing the following components:
error_tab_detailed |
error summary for each cross validation sample of the parameter combinations iterated during grid search as a dataframe |
error_tab_summary |
error summary for each combination of parameters as a dataframe |
best_ntree |
ntree parameter of the optimal solution |
best_mtry |
mtry parameter of the optimal solution |
maxnodes |
maxnodes parameter of the optimal solution |
best_nodesize |
nodesize parameter of the optimal solution |
runtime |
runtime of the entire process |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Aiana Goyal <aianagoel002@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
rf_params_list <- random_forest_parameters(base = data,target = "Y",
model_type = "classification",ntree = 2,mtry = 1,nodesize = 3)
rf_params_list$error_tab_detailed
rf_params_list$error_tab_summary
rf_params_list$best_ntree
rf_params_list$best_mtry
rf_params_list$maxnodes
rf_params_list$best_nodesize
rf_params_list$runtime
Random sampling of data into train and test
Description
The function randomly samples the data and splits it into train and test datasets. The training base percentage and an optional seed value are taken as arguments. If no seed value is specified, a random seed is generated on each run.
Usage
sampling(base, train_perc = 0.7, seed = NA, replace = F)
Arguments
base |
input dataframe |
train_perc |
(optional) percentage of total base to be kept as training sample, to be provided as decimal/fraction (default percentage is 0.7) |
seed |
(optional) seed value (if not given random seed is generated) |
replace |
(optional) whether sampling will be with or without replacement (default is FALSE, i.e. without replacement) |
Value
An object of class "sampling" is a list containing the following components:
train_sample |
training sample as a dataframe |
test_sample |
test sample as a dataframe |
seed |
seed used |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
sampling_list <- sampling(base = data,train_perc = 0.7,seed = 1234)
sampling_list$train
sampling_list$test
sampling_list$seed
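The split described above can be sketched in base R. This is only an illustration under stated assumptions: `split_sketch` is a hypothetical helper, the way a missing seed is generated is assumed, and none of it is taken from the package's source.

```r
# Hypothetical helper: illustrates a random train/test split, not the package's code.
split_sketch <- function(base, train_perc = 0.7, seed = NA) {
  # Assumption: when no seed is supplied, draw one at random so it can be reported back.
  if (is.na(seed)) seed <- sample.int(100000, 1)
  set.seed(seed)
  n_train <- round(train_perc * nrow(base))
  # Sample row indices without replacement; the remaining rows form the test sample.
  train_idx <- sample(seq_len(nrow(base)), size = n_train, replace = FALSE)
  test_idx <- setdiff(seq_len(nrow(base)), train_idx)
  list(train_sample = base[train_idx, , drop = FALSE],
       test_sample = base[test_idx, , drop = FALSE],
       seed = seed)
}

split_list <- split_sketch(iris, train_perc = 0.7, seed = 1234)
nrow(split_list$train_sample)
nrow(split_list$test_sample)
split_list$seed
```

Returning the seed alongside the two samples makes a split reproducible even when the seed was generated rather than supplied.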
Converting coefficients of logistic regression into scores for scorecard building
Description
The function takes a logistic model as input and scales the coefficients into scores to be used for scorecard generation.
Usage
scalling(base, target, model, point = 15, factor = 2, setscore = 660)
Arguments
base |
base input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
model |
input logistic model from which the coefficients are to be picked |
point |
(optional) points after which the log odds will get multiplied by "factor" (default value is 15) |
factor |
(optional) factor by which the log odds must get multiplied after a step of "point" (default value is 2) |
setscore |
(optional) input for setting offset (default value is 660) |
Value
The function returns a dataframe with the coefficients and scaled scores for each class of all explanatory variables of the model.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
num_cat <- num_to_cat(base = data,num_woe_table = iv_table_list$num_woe_table)
log_model <- glm(Y ~ ., data = num_cat, family = "binomial")
scaling_tab <- scalling(base = num_cat,target = "Y",model = log_model)
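To give a feel for how the three optional arguments might interact, the sketch below uses the conventional "points to multiply the odds" scaling found in scorecard practice. This is an assumption for illustration only: the package's exact formula is not shown in this manual and may differ, the reference odds of 1:1 are assumed, and `log_odds_to_score` is a hypothetical helper.

```r
# Assumption: conventional scorecard scaling; the package's exact formula may differ.
point <- 15      # points per step
factor <- 2      # odds multiplier per step
setscore <- 660  # score assigned at the reference odds (assumed here to be 1:1)

# Points awarded per unit of log odds.
scale_per_log_odds <- point / log(factor)

# Hypothetical helper: map log odds to a score.
log_odds_to_score <- function(log_odds) setscore + scale_per_log_odds * log_odds

log_odds_to_score(log(1))  # score at the reference odds
log_odds_to_score(log(2))  # one step of "point" higher
log_odds_to_score(log(4))  # two steps of "point" higher
```

Under this convention, each multiplication of the odds by "factor" moves the score by "point" points, and "setscore" fixes where the scale sits.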
Scoring a dataset with class based on a scalling logic to arrive at final score
Description
The function takes the data with every variable expressed as a class. The dataframe of class scalling is used to convert each class into its score, and the row level final score is obtained by adding up the score values.
Usage
scoring(base, target, scalling)
Arguments
base |
input dataframe with classes same as scalling logic |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
scalling |
dataframe of class scalling with at least two columns - Variable, Category, Coefficient, D(i,j)_hat, Score |
Value
The function returns a dataframe with classes converted to scores and the final score for each record in the input dataframe.
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
x <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
iv_table_list <- iv_table(base = data,target = "Y",num_var_name = x,cat_var_name = "Species")
num_cat <- num_to_cat(base = data,num_woe_table = iv_table_list$num_woe_table)
log_model <- glm(Y ~ ., data = num_cat, family = "binomial")
scaling_tab <- scalling(base = num_cat,target = "Y",model = log_model)
score_tab <- scoring(base = num_cat,target = "Y",scalling = scaling_tab)
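The class-to-score lookup and row-level summation can be sketched in base R as follows. The table and data below are made up for illustration (hypothetical variable names and score values); only the column names Variable, Category and Score follow the Arguments section, and this is not the package's implementation.

```r
# Hypothetical scaling table; values are invented for illustration.
scaling_sketch <- data.frame(
  Variable = c("age_band", "age_band", "income_band", "income_band"),
  Category = c("young", "old", "low", "high"),
  Score = c(10, 25, 5, 30),
  stringsAsFactors = FALSE
)

# Hypothetical input where every variable is already a class.
base_sketch <- data.frame(
  age_band = c("young", "old", "old"),
  income_band = c("high", "low", "high"),
  stringsAsFactors = FALSE
)

# Look up the score of each class, variable by variable.
score_cols <- lapply(names(base_sketch), function(v) {
  lookup <- scaling_sketch[scaling_sketch$Variable == v, ]
  lookup$Score[match(base_sketch[[v]], lookup$Category)]
})
names(score_cols) <- names(base_sketch)
scored <- as.data.frame(score_cols)

# The row-level final score is the sum of the class scores.
scored$final_score <- rowSums(scored)
scored
```

A class that is missing from the lookup table would produce NA here, which is one reason the input classes need to match the scaling logic.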
Hyperparameter optimisation or parameter tuning for Support Vector Machine by grid search
Description
The function runs a grid search with k-fold cross validation to arrive at the best parameters, as decided by a performance measure. The parameters that can be tuned using this function for the support vector machine algorithm are - kernel (linear / polynomial / radial / sigmoid), degree of polynomial, gamma and cost. The objective function to be minimised is the error (mean absolute error / mean squared error / root mean squared error). For the grid search, the possible values of each tuning parameter need to be passed as an array into the function.
Usage
support_vector_parameters(base, target, scale = T, kernel, degree = 2,
gamma, cost, error = "rmse", cv = 1)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
scale |
(optional) logical vector indicating the variables to be scaled (default value is TRUE) |
kernel |
an array of kernels to be iterated on; kernel used in training and predicting, to be chosen among "linear", "polynomial", "radial" and "sigmoid" |
degree |
(optional) an array of degree of polynomial to be iterated on; parameter needed for kernel of type "polynomial" (default value is 2) |
gamma |
an array of gamma values to be iterated on; parameter needed for all kernels except linear |
cost |
an array of cost to be iterated on; cost of constraints violation |
error |
(optional) error measure as objective function to be minimised, to be chosen among "mae", "mse" and "rmse" (default value is "rmse") |
cv |
(optional) k value for k-fold cross validation to be performed (default value is 1, i.e. without cross validation) |
Value
An object of class "support_vector_parameters" is a list containing the following components:
error_tab_detailed |
error summary for each cross validation sample of the parameter combinations iterated during grid search as a dataframe |
error_tab_summary |
error summary for each combination of parameters as a dataframe |
best_kernel |
kernel parameter of the optimal solution |
best_degree |
degree parameter of the optimal solution |
best_gamma |
gamma parameter of the optimal solution |
best_cost |
cost parameter of the optimal solution |
runtime |
runtime of the entire process |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
svm_params_list <- support_vector_parameters(base = data,target = "Y",gamma = 0.1,
cost = 0.1,kernel = "radial")
svm_params_list$error_tab_detailed
svm_params_list$error_tab_summary
svm_params_list$best_kernel
svm_params_list$best_degree
svm_params_list$best_gamma
svm_params_list$best_cost
svm_params_list$runtime
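The grid search with k-fold cross validation described for this function can be sketched directly with `svm()` from e1071, which the package imports. This is a minimal illustration, not the package's implementation: the candidate values, the number of folds and the fold assignment are assumptions chosen for the example.

```r
library(e1071)  # provides svm()

set.seed(11)
data <- iris[, 1:4]
data$Y <- sample(0:1, size = nrow(data), replace = TRUE)

# Candidate values for each tuning parameter (illustrative choices).
grid <- expand.grid(kernel = c("linear", "radial"), gamma = c(0.1, 1),
                    cost = c(0.1, 1), stringsAsFactors = FALSE)

k <- 3
fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels

# Mean held-out RMSE across the k folds for every parameter combination.
grid$rmse <- sapply(seq_len(nrow(grid)), function(i) {
  fold_err <- sapply(seq_len(k), function(f) {
    fit <- svm(Y ~ ., data = data[fold != f, ], kernel = grid$kernel[i],
               gamma = grid$gamma[i], cost = grid$cost[i], scale = TRUE)
    pred <- predict(fit, newdata = data[fold == f, ])
    sqrt(mean((data$Y[fold == f] - pred)^2))
  })
  mean(fold_err)
})

# The combination with the smallest error is the optimal solution of the search.
best <- grid[which.min(grid$rmse), ]
best
```

Because every combination is refitted once per fold, the runtime grows with the product of the lengths of the parameter arrays and k.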
Univariate analysis of variables
Description
The function returns a univariate analysis of the variables as an output dataframe. The univariate statistics include - minimum, maximum, mean, median, number of distinct values, variable type, count of null values, percentage of null values, maximum population percentage among all classes/values, correlation with target. It also returns the names of the character and numerical type variables, along with the names of variables whose population concentration at a single class/value is more than a threshold.
Usage
univariate(base, target, threshold)
Arguments
base |
input dataframe |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
threshold |
sparsity threshold, to be provided as decimal/fraction |
Value
The function returns an object of class "univariate" which is a list containing the following components:
univar_table |
univariate summary of variables |
num_var_name |
array of column names of numerical type variables |
char_var_name |
array of column names of categorical type variables |
sparse_var_name |
array of column names where population concentration at a class or value is more than the sparsity threshold |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
data$Species <- as.character(data$Species)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
univariate_list <- univariate(base = data,target = "Y",threshold = 0.95)
univariate_list$univar_table
univariate_list$num_var_name
univariate_list$char_var_name
univariate_list$sparse_var_name
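A few of the statistics listed in the description, and the sparsity check, can be sketched in base R as follows. `univar_sketch` is a hypothetical helper and not the package's implementation; the threshold of 0.3 is chosen only because iris has no highly concentrated column.

```r
# Hypothetical helper: a few of the listed statistics for one column.
univar_sketch <- function(x) {
  c(distinct = length(unique(x[!is.na(x)])),
    null_count = sum(is.na(x)),
    null_perc = mean(is.na(x)),
    # Share of rows held by the most frequent class/value.
    max_pop_perc = max(table(x, useNA = "ifany")) / length(x))
}

data <- iris
data$Species <- as.character(data$Species)

# One row per variable, one column per statistic.
univar_tab <- t(sapply(data, univar_sketch))
univar_tab

# Split variable names by type.
num_var <- names(data)[sapply(data, is.numeric)]
char_var <- names(data)[sapply(data, is.character)]

# Columns whose largest class/value holds more than the sparsity threshold.
threshold <- 0.3
sparse_var <- rownames(univar_tab)[univar_tab[, "max_pop_perc"] > threshold]
sparse_var
```

With a realistic threshold such as 0.95, a variable is flagged only when almost all records share one value, which usually means it carries little information for a scorecard.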
Removing multicollinearity from a model using vif test
Description
The function takes a dataset with the starting variables and target only. The vif is calculated and, if the maximum vif value is more than the threshold, the variable with the maximum vif is dropped from the model and the vifs are recomputed. These steps of computing vif and dropping a variable keep iterating till the maximum vif value is less than or equal to the threshold.
Usage
vif_filter(base, target, threshold = 2)
Arguments
base |
input dataframe with set of final variables only along with target |
target |
column / field name for the target variable to be passed as string (must be 0/1 type) |
threshold |
threshold value for vif (default value is 2) |
Value
An object of class "vif_filter" is a list containing the following components:
vif_table |
vif table post vif filtering |
model |
the model used for vif calculation |
retain_var_list |
variables remaining in the model post vif filter as an array |
dropped_var_list |
variables dropped from the model in vif filter step |
threshold |
threshold value used for the vif filter |
Author(s)
Arya Poddar <aryapoddar290990@gmail.com>
Examples
data <- iris
suppressWarnings(RNGversion('3.5.0'))
set.seed(11)
data$Y <- sample(0:1,size=nrow(data),replace=TRUE)
vif_data_list <- vif_filter(base = data,target = "Y")
vif_data_list$vif_table
vif_data_list$model
vif_data_list$retain_var_list
vif_data_list$dropped_var_list
vif_data_list$threshold
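The compute-and-drop loop described for this function can be sketched in base R. The sketch assumes the textbook definition of vif, 1 / (1 - R^2) from regressing each predictor on the others, applied to numeric predictors only; `vif_sketch` is a hypothetical helper, and the package itself may compute vif differently (for example from a fitted model that includes the target).

```r
# Hypothetical helper: textbook vif, 1 / (1 - R^2), for numeric predictors only.
vif_sketch <- function(x) {
  sapply(names(x), function(v) {
    f <- reformulate(setdiff(names(x), v), response = v)
    1 / (1 - summary(lm(f, data = x))$r.squared)
  })
}

x <- iris[, 1:4]
threshold <- 2  # same value as the documented default
dropped <- character(0)

# Drop the variable with the largest vif until every vif is within the threshold.
while (ncol(x) > 1) {
  v <- vif_sketch(x)
  if (max(v) <= threshold) break
  worst <- names(v)[which.max(v)]
  dropped <- c(dropped, worst)
  x <- x[, setdiff(names(x), worst), drop = FALSE]
}

names(x)  # retained variables
dropped   # dropped variables, in the order they were removed
```

Recomputing all vifs after each removal matters because dropping one variable changes the vif of every variable that was correlated with it.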