Parallelization issues:
· Eliminated package snowfall and introduced package parallel instead (simpler code, better maintenance).
· No read/write of the SRF file <filename>.SRF.V1.Rdata in the parallel processes anymore. Instead we
    o transport the SRF info via list opts$srf (a list of data frames, one for each response variable) and save the SRF file(s) Output/<confFile>.SRF.RData in case opts$SRF.calc==T (saveSRFinfo() in tdmBigLoop.r, after all parallel processes),
    o read the SRF file(s) in case opts$SRF.calc==F prior to branching into parallel runs and store them in opts$srf (addSRF() in tdmEnvTMakeNew.r).
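The idea of handing the SRF info to the workers inside opts, instead of letting each worker read or write a file, can be sketched with package parallel as follows. This is only a minimal illustration: runOne and the content of opts$srf are made up for the example and are not TDMR code.

```r
library(parallel)

# Hypothetical stand-in for one TDMR run; not part of TDMR.
runOne <- function(i, opts) {
  # The worker only reads opts$srf; nothing is read from or written to disk.
  nrow(opts$srf[[1]]) + i
}

# Assumed example content: one data frame per response variable.
opts <- list(srf = list(y = data.frame(var = c("a", "b"), imp = c(0.7, 0.3))))

cl <- makeCluster(2)                             # two local worker processes
res <- parSapply(cl, 1:4, runOne, opts = opts)   # opts travels with each call
stopCluster(cl)
print(res)
```

Because opts is passed as an argument, the workers need no shared file, which is presumably what makes the file-free variant safer in parallel runs.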
Simplified the code / the workflow (12/2012):
· Simplified the opts part connected with data file reading: (a) made opts$filesuffix obsolete, there is now only a local variable filesuffix (tdmReadData); (b) made opts$READ.CMD simpler: a string "readFunc(filename,opts)", where readFunc is defined in main_TASK.r and returns dset or tset. A template tdmReadCmd(filename,opts) is defined in tdmReadData.r. (No need for the complicated syntax ‘sep=\";\",…’ any more.)
· Changed opts$TST.testFrac → tdm$TST.testFrac (opts$TST.testFrac is only used in tdmSplitTestData) and tdm$tstFrac → tdm$TST.testFrac (tdm$tstFrac is only used in tdmMapDesign, branch umode="RSUB").
· Made opts$READ.INI=T the recommended choice, but kept the possibility to read in main_TASK.r (so that main_TASK can run alone).
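A user-defined reader in the style of the simplified opts$READ.CMD might look like the sketch below. The body of readFunc, the demo file and the use of eval(parse()) to run the command string are assumptions made for illustration; how TDMR evaluates the string internally is not stated here.

```r
# Hypothetical reader; in a TDMR task it would be defined in main_TASK.r.
readFunc <- function(filename, opts) {
  read.csv2(filename, nrows = opts$READ.NROW)   # ';'-separated file
}

opts <- list(READ.CMD = "readFunc(filename,opts)", READ.NROW = -1)

# Small demo file so that the sketch is self-contained.
filename <- tempfile(fileext = ".csv")
writeLines(c("x;y", "1;a", "2;b"), filename)

# Assumption: the command string is evaluated where filename and opts are visible.
dset <- eval(parse(text = opts$READ.CMD))
print(dset)
```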
Integrated AdaBoost (package adabag) as a new learner. Tuning parameters are opts$ADA.coeflearn, ADA.mfinal, ADA.rpart.minsplit.
Added in function checkOpts (tdmEnvTMakeNew.r) a logic that checks opts for new (unusual) parameters and prints a notification: “Note: a new variable xyz has been defined for list opts.” (a safeguard against misspelling a parameter).
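Such a check could be sketched as a comparison of names(opts) with the names of a default list. The helper checkNewOpts and the two-element default list are hypothetical and only illustrate the idea; they are not the code of checkOpts.

```r
# Hypothetical helper, not TDMR's checkOpts: report names unknown to the defaults.
checkNewOpts <- function(opts, defaults) {
  newNames <- setdiff(names(opts), names(defaults))
  for (nm in newNames)
    cat("Note: a new variable", nm, "has been defined for list opts.\n")
  invisible(newNames)
}

defaults <- list(READ.INI = TRUE, ADA.mfinal = 10)   # assumed default list
opts <- defaults
opts$ADA.mfinl <- 20          # misspelled on purpose
found <- checkNewOpts(opts, defaults)
```

A misspelled parameter silently creates a new list element in R, so a note like this is often the only hint that the intended setting was never changed.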
Added tuning params for SVM with kernel “polynomial”, “linear” or “sigmoid”. Added tuning param TRNFRAC. Removed opts$SVM.C (never needed).
Bug fix: If opts$filename was not of the form “a.suf” but simply “a”, an error occurred (connected with filesuffix, see tdmReadData). Now fixed.
Bug fix: Stop when duplicates appear in tdm$runList.
Rewrote checkRoiParams to work properly in all cases, no need for envT$roiNames1 any more.
Bug fix: Column “tdmSplit” was not always excluded from the input variables (main_sonar). Now dsetTrnVa() and dsetTest() return all columns except x$TST.COL, which is usually “tdmSplit”.
Fixed the lev.resp bug in tdmClassify: Do not take for granted that lev.resp = levels(d_train[,response.variable]) is always the same (in the same ordering) as the column names of app$test.prob in apply.SVM! This assumption sometimes resulted in very bad performance for the wdbc task in case opts$CLS.cutoff=c(0.5,0.5). Now we use colnames(app$test.prob) instead of lev.resp, and almost everything is o.k. There are, however, still a few cases where the results with app$test.prob and those with standard predict differ, see function check.apply.SVM in tdmClassify. It is unclear why they differ.
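The effect of the ordering assumption can be reproduced with a small made-up probability matrix. The class labels and values below are assumptions for illustration, not data from the wdbc task.

```r
# Assumed example: factor levels in one order, probability columns in another.
lev.resp <- c("B", "M")
test.prob <- matrix(c(0.2, 0.8,
                      0.9, 0.1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("M", "B")))   # order differs from lev.resp

idx <- max.col(test.prob, ties.method = "first")  # column with the largest probability
wrong <- lev.resp[idx]              # assumes the columns follow lev.resp
right <- colnames(test.prob)[idx]   # uses the actual column names
print(rbind(wrong, right))
```

With only two classes, a swapped ordering inverts every prediction, which would fit the very bad performance described above.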
Improved the TDMR workflow by testing it on the BreastCancer (BC) dataset (has factor variables with many levels, has columns with <NA>):
· New check on NAs in the input variables with na_input_check in tdmClassify.r. This will issue a warning each time an NA is found.
· New checkData() function in tdmSplitTestData.r: checks whether dset and tset have the same mode in every column and gives a warning if a column is a factor and has an unusually high number of levels (>32).
· Specifically for the BC dataset: New function tdmPreNAroughfix(data), which main_BC() calls prior to tdmClassify and which avoids those warnings.
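The two checks described for checkData() can be sketched roughly as follows. checkDataSketch is a hypothetical re-implementation written for this illustration; the real function may differ in its details and messages.

```r
# Hypothetical re-implementation for illustration; not TDMR's checkData().
checkDataSketch <- function(dset, tset, maxLevels = 32) {
  for (nm in intersect(names(dset), names(tset))) {
    if (mode(dset[[nm]]) != mode(tset[[nm]]))
      warning("Column ", nm, " has different modes in dset and tset")
    if (is.factor(dset[[nm]]) && nlevels(dset[[nm]]) > maxLevels)
      warning("Factor column ", nm, " has ", nlevels(dset[[nm]]), " levels")
  }
  invisible(NULL)
}

# Assumed demo data: 'id' is a factor with 40 levels, 'x' differs in mode.
dset <- data.frame(id = factor(paste0("p", 1:40)), x = 1:40)
tset <- data.frame(id = factor(paste0("p", 1:5)), x = as.character(1:5),
                   stringsAsFactors = FALSE)

# Collect the warnings instead of printing them at the end of the run.
msgs <- character(0)
withCallingHandlers(checkDataSketch(dset, tset),
                    warning = function(w) {
                      msgs <<- c(msgs, conditionMessage(w))
                      invokeRestart("muffleWarning")
                    })
print(msgs)
```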
Documentation issues:
· options opts → complete list in Appendix B
· mappings → complete list in Appendix A
· opts$gainmat in docu → opts$CLS.gainmat
· “phase” may be misleading, as if a user had to pass all three of them → use “level” instead
· confusing that the main file has opts settings; they should all be in the .apd file <task>_00.apd
· flow diagram for train/test/vali settings (sampling)
· updated tutorial, lesson 1 and many links and source code snippets
· updated tutorial, lesson 2: no more bst and res files, but now returning spotConfig$alg.currentBest, alg.currentResult.
Bug fix: any open sinks are now closed in tdmGraAndLogInitialize using sink.number. This prevents *.log from growing bigger and bigger.
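Closing leftover sinks with sink.number can be sketched as below. The helper closeSinks and its keep argument are hypothetical; keep is only there so that the demo does not disturb sinks opened by a surrounding session, whereas keep = 0 would close every open sink.

```r
# Hypothetical helper: pop output sinks until only `keep` remain.
closeSinks <- function(keep = 0) {
  while (sink.number() > keep) sink()
  invisible(sink.number())
}

n0 <- sink.number()                  # sinks already open before this demo
logFile <- tempfile(fileext = ".log")
sink(logFile)                        # simulate a sink left open by an earlier run
cat("line written to the log\n")
closeSinks(keep = n0)                # keep = 0 would close all open sinks
cat("sinks opened by the demo and still open:", sink.number() - n0, "\n")
```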
Changed the logic for opts$PRE.PCA / opts$PRE.PCA.npc and opts$PRE.SFA / opts$PRE.SFA.npc (see docu in tdmOptsDefaultsSet).
Appropriate settings for d_preproc = all non-validation data or = training data, depending on opts$PRE.allNonVali=T/F. d_preproc is also in tdmRegressLoop and tdmRegress.
SFA for regression now also available.
Simplify: Prohibit PCA and SFA from being both activated (makes no sense, since SFA has its own PCA step).
tdmPrePCA bug (too few columns in dset when the number of training records is too small): The bug happens in tdmPreprocUtils.r, tdmPrePCA.train, tdmAdjustDSet: If the number of records Nr is smaller than the number of input vars, then the list of returned PCs will be shorter, thus leading to a shorter dset. Fix: Introduce d_preproc, the data set for preprocessing, which is normally = d_train. An error is raised if the number of records in d_preproc is smaller than the number of numeric vars. The user has the option to activate opts$PRE.allNonVali=T, which leads to d_preproc = all non-validation data.
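The shortening of the PC list can be reproduced with base R's prcomp, used here as a stand-in since the internals of tdmPrePCA.train are not shown. The guard checkPreproc is a hypothetical sketch in the spirit of the fix, not the TDMR implementation.

```r
set.seed(1)
nVars <- 5
small <- matrix(rnorm(3 * nVars), nrow = 3)     # 3 records, 5 numeric variables
large <- matrix(rnorm(20 * nVars), nrow = 20)   # 20 records, 5 numeric variables

ncol(prcomp(small)$x)   # fewer PCs than variables
ncol(prcomp(large)$x)   # one PC per variable

# Hypothetical guard: refuse preprocessing data with too few records.
checkPreproc <- function(d_preproc) {
  if (nrow(d_preproc) < ncol(d_preproc))
    stop("d_preproc has fewer records than numeric variables")
  invisible(TRUE)
}

res <- try(checkPreproc(small), silent = TRUE)
inherits(res, "try-error")   # the small data set is rejected
```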
Bug fix: many main_TASK functions had the argument tset, but inside they still called tdmClassifyLoop(dset, …, opts) where it should read tdmClassifyLoop(dset, …, opts, tset) instead.
Simplified the code / the workflow (10/2012):
· Improved modularity: different parts of TDMR construct different objects:
    o the user (or tdmDefaultsFill) constructs tdm
    o envT = tdmEnvTMakeNew(tdm) constructs envT
    o envT = tdmEnvTAddBstRes(envT,fileRData) augments envT by bstGrid, resGrid from an .RData file (if needed)
    o envT = tdmBigLoop(envT,spotStep) does (tuning and) unbiased runs.
· tdmBigLoop is the new function which should supersede the now deprecated tdmCompleteEval:
    o only two parameters: envT, spotStep
    o since envT is passed instead of tdm, we are more flexible in which input to send into tdmBigLoop. Example: If spotStep=="rep", tdmBigLoop requires the data frames bst and res from prior tuning runs → this is not possible via tdm, but can easily be done via envT$bstGrid, envT$resGrid
    o simplified tdm$fileMode section (no .res or .bst file writing & copying any more, which makes the code much simpler to understand!) → bst and res are returned / passed via envT
    o tdm$fileMode=FALSE is now the default. tdm$fileMode=TRUE is deprecated and leads only to writing of .fin and .exp files (these files are hardly necessary, since we store envT with theFinals in the .RData file)
    o always envT$spotConfig$spot.fileMode=FALSE
· tdmCompleteEval is still there for downward compatibility, but it is deprecated:
    o it writes .res, .bst, .fin, .exp files if tdm$fileMode=T
    o tdmCompleteEval has other calling arguments
    o tdmCompleteEval now sets envT$spotConfig$spot.fileMode=tdm$fileMode (was done before in tdmDispatchTuner)
    o tdmCompleteEval should become obsolete once all demos / user files are changed to tdmBigLoop (but we perhaps keep it for downward compatibility)
· Simplified envT$result, which contained 3x opts! → now only 1x opts + accessor function Opts().
· Reformulated the tdm$fileMode sections in tdmCompleteEval / tdmBigLoop: The normal case is now tdm$fileMode==FALSE.
· Abandoned the writing of <name>_train.csv.SRF.<target>.RData, <name>_train.log and <name>_train_eval.csv when tdm$oFileMode==FALSE, since this may cause conflicts if we do certain parallel tasks.
· tdmGetObj is now marked as deprecated (we still use it, however, in unbiasedRun to ensure downward compatibility).
· Renamed “Test2” → “Vali2” and other naming issues around “Test” and “Validation”. Made variable names more meaningful: VALI if connected with validation data, TST if connected with test data.
Simplified the code / the workflow (06/2012):
· Simplified start of parallel execution: no need for sourcing start.tdm.r (except if you want the R developer sources), all sfExport-related stuff is now in function prepareParallelExec in tdmBigLoop.r.
· Simplified design mapping: only one function pair {tdmMapDesLoad, tdmMapDesApply} and no longer tdmMapDesSpot, no makeTdmDesSpot. The maps map (from tdmMapDesign.csv) and mapUser (from userMapDesign.csv) are now stored in list tdm.
· Simplified the triangle startFromSource.r, start.tdm.r, source.tdm.r: startFromSource.r and start.tdm.r are now only needed for the developer (if you want to start from R sources). They are NO LONGER needed if the normal TDMR user wants to initiate parallel execution (all sfExports and the like are now done in function prepareParallelExec in tdmBigLoop.r, which is called if tdm$parallelCPUs>1).
· Warning: if tdm$umode="TST" *and* opts$TST.kind="col", then tdmSplitTestData will tag all records with opts$TST.col!=0 as test data. Later on, tdmStartSpot will hand only the data with opts$TST.col==0 to main_TASK, and this will separate into vali and train data according to opts$TST.col again → all data are train, no vali data (this is o.k. for opts$MOD.method=="RF", but may lead to strange results in other cases). How to fix:
    o Make a check on the number of vali records for cases opts$MOD.method!="RF".
    o Issue a warning in tdmBigLoop/tdmCompleteEval if tdm$umode="TST" and opts$TST.kind="col" and opts$MOD.method!="RF".
Docu TDMR and Demos TDMR:
· added TDMR-tutorial.html, moved the section “Example usage” in there.
· added a FAQ section (“How to”) in TDMR-tutorial.html
· added two appendices on tdmMapDesign.csv and on elements of opts in TDM-docu.html
· adapted all documentation & demos to the new tdmBigLoop
· added citation ROCR
Modified the function tdmModAdjustCutoff:
· Extended so that any one of the parameters CUTOFF1, … , CUTOFFn can be the missing one.
· Guaranteed that the dependent CUTOFF can never become negative when enforcing the constraint.
· If tdmModAdjustCutoff is entered with a cutoff with length(cutoff)==n.class-1, then cutoff[n.class] becomes the dependent CUTOFF.
· The old function tdmMapCutoff is now disabled, everything is in tdmModAdjustCutoff.
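One way to keep the dependent cutoff non-negative is sketched below. The sketch assumes that the constraint is sum(cutoff)==1 and that the free cutoffs are rescaled when their sum exceeds 1; both are assumptions made for illustration, and adjustCutoffSketch is a hypothetical function, not the code of tdmModAdjustCutoff.

```r
# Hypothetical sketch; assumes the constraint is sum(cutoff) == 1.
adjustCutoffSketch <- function(cutoff, n.class) {
  stopifnot(length(cutoff) == n.class - 1, all(cutoff >= 0))
  s <- sum(cutoff)
  if (s > 1) cutoff <- cutoff / s      # rescale so the dependent value cannot drop below 0
  c(cutoff, max(0, 1 - sum(cutoff)))   # append the dependent cutoff[n.class]
}

adjustCutoffSketch(c(0.2, 0.3), n.class = 3)   # free cutoffs sum to less than 1
adjustCutoffSketch(c(0.8, 0.6), n.class = 3)   # free cutoffs sum to more than 1
```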
Fixed a bug: tdmPlotResMeta could crash if not all .conf files had the same tuning pars. Fix: Now the x- and y-selectors in the twiddler-interface are the union of all tuning pars. If an x- or y-selection is not part of the specific tuning pars for the selected .conf, an error message box is issued and spot is not started.
Added skipIncomplete-part in tdmPlotResMeta(). Fixed a bug (no mergedData) concerning nSkip in tdmPlotResMeta().
Fixed a bug concerning opts$READ.NROW: it is now also applied when loading <filetest>.RData.
For regression: new option opts$rgain.type="made" (mean absolute deviation).
Extended opts$rgain.string to work also for the regression options, adapted column names in theFinals accordingly.
tdmOptsDefaultsSet now returns in opts an object of class “tdmOpts”. Central TDMR files check for the right class of opts.
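A class check of this kind can be sketched with inherits(). Both functions below are hypothetical stand-ins written for this illustration; in particular, the exact class vector that tdmOptsDefaultsSet assigns is an assumption.

```r
# Hypothetical constructor standing in for tdmOptsDefaultsSet.
makeOptsSketch <- function() {
  opts <- list(READ.INI = TRUE)
  class(opts) <- c("tdmOpts", "list")   # assumed class vector
  opts
}

# Hypothetical guard as a TDMR function might apply it on entry.
requireTdmOpts <- function(opts) {
  if (!inherits(opts, "tdmOpts"))
    stop("opts is not of class 'tdmOpts'; create it with tdmOptsDefaultsSet")
  invisible(TRUE)
}

requireTdmOpts(makeOptsSketch())                      # passes silently
res <- try(requireTdmOpts(list(READ.INI = TRUE)), silent = TRUE)
inherits(res, "try-error")                            # a plain list is rejected
```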
tdmRegressLoop.r, tdmClassifyLoop.r: More accurate averaging of evaluation measures for the regression CV case, new variable ‘result$predictions’. Bug fix ‘nfold=max(cvi,1)’ to avoid nfold=0 in the special case that all records in dset are training cases (zero validation cases).
Saving envT: parameter savePredictions (default =FALSE) lets the user decide whether result$predictions and result$predProbList are saved to .RData.
Some small bug fixes concerning ‘predProb’ and ‘predictions’ for the case opts$ncopies>0. predProb is needed by tdmModConfmat, which is called from tdmClassify (in case opts$rgain.type="ar*", this will call tdmROCR_calc with predProb). predProbList is needed by tdmROCR.TDMclassifier.
Added in tdmSortedRFimport the option opts$SRF.scale to use scaled or unscaled importance.
Bug fix in tdmClassify: build EVALa correctly also in cases where nrow(d_test)==0 → set cm.test$* to NA and not cm.test to NULL.
tdmSortedRFImport: negative importance values are now clipped to 0 (no longer additive shift of importance values).
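The difference between clipping and an additive shift can be seen on a few made-up importance values. The numbers are assumptions, and the shift is written here as imp - min(imp), which is only one plausible form of the former behaviour.

```r
imp <- c(V1 = 0.40, V2 = 0.10, V3 = -0.05)   # assumed importance values

clipped <- pmax(imp, 0)      # negatives become 0, all other values stay unchanged
shifted <- imp - min(imp)    # assumed form of the old shift: every value changes

print(rbind(imp, clipped, shifted))
```

Clipping leaves the non-negative importances untouched, so their relative sizes are preserved, whereas a shift changes every value.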
If tdm$parallelCPUs>1: snowfall would fail if there is only one pass through sfSapply, i.e. if length(indVec)==1. Fix: Check in tdmCompleteEval whether length(indVec)==1, issue a warning and set tdm$parallelCPUs to 1.
Renamed bind_response to tdmBindResponse (tdmGeneralUtils.r).
Bug fix ‘path → tdm$path’ in tdmMapDesLoad (tdmMapDesign.r).
Bug fix for cma_es (package cmaes): When running demo/demo04cpu.r with tuner cmaes, we got “Error in eigen.log[iter, ] <- rev(sort(e$values)): subscript out of bounds". Solution: control$maxit = round(control$maxit), because this error only occurs if control$maxit is NOT an integer.
Fixed a bug concerning opts$filesuffix (tdmOptsDefaultsFill) which could lead to an unwanted stop.
Bug fix: regression tuning did strange things (names in data frame) if you tuned only 1 variable (cpu, roi with only XPERC). Now fixed.
Bug fix: cma_es (and other tdmStartOther-tuners) usually did not include the last design points in the BST data frame (those which are usually formed after the last time where “des$CONFIG %% tdm$spotConfig$seq.design.new.size==0” was TRUE). Now fixed.
Bug fix in tdmDispatchTuner: cma_jTuner did not yet return a list of type spotConfig in tunerVal. Consequence: the above “Append” would not work. Now fixed.
tdmDispatchTuner.r: Made all tuners return a list of type spotConfig with the proper settings in tunerVal$alg.currentResult and tunerVal$alg.currentBest.
Bug fix in tdmMapDesApply: the “[-1]” in “dn=setdiff(names(des[-1]), c("COUNT","CONFIG",...))” was wrong.
Extended tdmPlotResMeta by a slider y_10Exp, which allows multiplying the y-values by 10^0, 10^1, …, 10^3 on the fly in the twiddler interface. This usually gives a better color scheme for the 3D-plot in spotReport3d.
o New ROC chart and lift chart capabilities, based on package ROCR on CRAN, see help(tdmROCR).
o New measures for opts$rgain.type= “arROC”, “arLIFT”, “arPRE” for area under ROC, lift or precision-recall chart, based on package ROCR.
o Improved and extended the set of demos (demo00, … , demo06). New demos for interactive visualization.
o Improved cma_jTuner (CMA-ES, Java version). Works now on Linux and Windows OS platforms when using tdm$fileMode=FALSE.
o Improved tdmPlotResMeta (confFile, nSkip, chkSkip, xAxis, yAxis).
o Changed opts$fct.postproc: this is now the name of the postprocessing function and not the function itself. Reason: If opts contains a function, then it also contains its environment, and this can be pretty big (contains envT, …) and makes the .RData saving of envT big.
o Flag opts$DO.POSTPROC is now deprecated, use opts$fct.postproc instead.
o Fixed a bug concerning opts$filesuffix (tdmReadData, could lead to overwriting of opts$filename).
o Improved the examples section in TDM-docu.html. Now most examples in TDM-docu.html are in sync with the set of demos. Separated in TDM-docu.html the example usage description from the example details. New chapter describing the interactive visualization example.
o Improved the package documentation (simpler index via @keywords internal, many small fixes).
o New 3D graphics for tuning results and their metamodels, using a twiddler-interface on environment envT: see help(tdmPlotResMeta).
o New print() for TDMdata object “dataObj”
o Fixed a bug in tdmClassify (wrong ifelse in applySVM).
o Fixed some minor bugs to reactivate parallel mode: some sfExports were missing.
o Fixed the saveEnvT-bug (“[9:9]”) in tdmCompleteEval. New option tdm$filenameEnvT.
o Fixed the tdmMapDesign bug (design variables missing in tdmMapDesign.csv and userMapDesign.csv would not be mapped to opts; now missing variables are detected and an error is thrown).
o Added opts$SPLIT.SEED variable: a variable to decide whether tdmSplitTestData runs in deterministic mode
o Added opts$TST.trnFrac: now trnFrac can be smaller than 1-opts$TST.valiFrac.
o Added SAVESEED-part in tdmSplitTestData, tdmClassifyLoop, tdmRegressLoop
o Added tdm$stratified with new meaning: if not NULL, make stratified sampling w.r.t. the column of dset named in tdm$stratified.
o Some minor fixes concerning data reading
o TDMR documentation now available in PDF and HTML format (TDM-docu.html)
o integration of SFA (slow feature analysis, see package rSFA on CRAN) as a feature generation method for classification
o bug fix concerning tdmMapDesign; extension of tdmMapDesign.csv
o moved PCA feature generation from main_* into tdmClassifyLoop, it now uses only the training data for establishing the PCA rotation (same for SFA)
o new training / validation / test set capabilities, see Section “TDMR Data Reading and Data Split …” in TDM-docu.html and help(tdmSplitTestData), help(tdmReadData).
o modified TDMR’s seed concept, new option opts$*.seed = “algSeed” (get the seed from spotConfig$alg.seed)
o new parameter tdm$mainFunc, simpler and more general usage (as compared to tdm$mainFile and tdm$mainCommand)
o powell, cmaes, rSFA now in the “Depends” list of DESCRIPTION
o added a TDMR-package description (file tdmGeneralUtils.r)
o extended documentation (e.g. full docu for tdmOptsDefaultsSet and many other small documentation extensions)
o new section opts$CLS.* for classification-related settings
o bug fixes in demo01cpu (seed variation) and demo02sonar (GD.DEVICE)
o merged former functions unbiasedBestRun_C and unbiasedBestRun_R into only one function unbiasedRun
o extended functions for information on class objects: print.TDMclassifier, print.tdmClass, print.TDMregressor, print.tdmRegre
o removed the dependencies on packages matlab and mlbench
o new function tdmParaBootstrap.r: add parametric bootstrap patterns, if opts$ncopies>0
o new version of TDM-docu.pdf: see documentation index – directory
o new demo: demo00sonar (with some graphics)
o fix in print.TDMclassifier, print.TDMregressor: optional argument ‘type’
o doc/index.html added
o doc/changes.html added (this file)
o initial version