Tutorial extended: new simple TDMR usage Lesson 0 (aaClassify.r and aaRegression.r).
Docu extended: section on sampsize, cutoff and classwt, new appendix on list tdm.
Fixed several sampsize and cutoff bugs, added NA checks for sampsize, cutoff,
classwt.
Better integration of graphic device RStudioGD, better check on RStudio.
Improved graphics for regression, check on NAs.
Fixed missing ‘import’s in NAMESPACE.
Eliminated .path.package in favor of path.package.
Eliminated an error in tdmROCR.r example reported by Brian Ripley which
blocked the package build.
Shortened the “Depends” list in DESCRIPTION and moved some packages to the “Suggests” list
(and adapted the cross-references in the R documentation accordingly).
If opts$RF.samp was a vector and not a scalar, wrong processing occurred in
train.RF (in tdmClassify.r) and in tdmModSortedRFimport (in tdmModelingUtils.r).
Now corrected.
Changed parameter SAMPSIZE to SAMPSIZE1 in tdmMapDesign.csv and added lines with
SAMPSIZE2, …, SAMPSIZE5.
Eliminated .find.package in favor of find.package according to the mail
from Brian Ripley.
Simplify
the code / the workflow:
· Allow only one parameter for
tdm$umode, no longer a vector (Why? – Then we do not need a recipe to check the
correctness of tdm$umode: SP_T may only appear as first element, only RSUB and
CV as further elements. Instead it is easier to restart a certain tuning
experiment with spotStep=“rep”, where the previous tuning result is reused and
a new unbiased evaluation is done.)
· Added the possibility for
umode=“TST” that the training fraction can be smaller than 100% in the unbiased
run as well: set tdm$U.trnFrac to a value different from NULL (needed for SVM on large data sets).
· Simplified tdmBigLoop insofar that
spotStep is now only a string, not a vector of strings.
Introduced the new function tdmExecSpotStep, which makes all demos runnable for both
spotStep==“auto” and spotStep==“rep”. There are now demo03sonar.r and demo03sonar_A.r,
where the former does the same but is shorter thanks to tdmExecSpotStep. Adjusted all
other demos according to demo03sonar.r.
Introduced package rCMA: Fixed the demo in demo07cma_j.r to run
again under Windows 7. Abandoned complicated system calls and switched to package rCMA,
with the help of package rJava. It now runs in the same
fashion under Linux, Windows and Mac (at least if rJava is installable). rCMA uses only one .jar file instead of many class files.
Added a new demo demo08parallel.r.
Added predict
functionality: see predict.TDMenvir and others in INDEX of TDMR-manual. New parameter tdm$U.saveModel.
Added new class TDMenvir for
the envT objects.
Parallelization issues:
· Eliminated package snowfall and introduced
package parallel instead (simpler code, better maintenance)
· No read/write of SRF-file
<filename>.SRF.V1.Rdata anymore in the parallel processes. Instead we
o transport the SRF-info via list opts$srf (a list of data frames, one for
each response variable), save SRF-file(s) Output/<confFile>.SRF.RData in
case opts$SRF.calc==T (saveSRFinfo() in tdmBigLoop.r, after all parallel
processes)
o and read SRF-file(s) in case opts$SRF.calc==F prior to branching into
parallel runs & store it on opts$srf (addSRF() in tdmEnvTMakeNew.r).
Simplified the code / the workflow (12/2012):
· Simplified the opts-part connected
with data file reading: (a) make opts$filesuffix obsolete, only local variable
filesuffix (tdmReadData); (b) make opts$READ.CMD simpler: a string “readFunc(filename,opts)”, where readFunc is defined in
main_TASK.r and returns dset or tset. A template tdmReadCmd(filename,opts)
is defined in tdmReadData.r. (No need for complicated syntax ‘sep=\”;\”,…’ any more.)
· Change opts$TST.testFrac → tdm$TST.testFrac (opts$TST.testFrac
is only used in tdmSplitTestData)
and tdm$tstFrac → tdm$TST.testFrac (tdm$tstFrac is
only used in tdmMapDesign, branch umode=“RSUB”).
· Make opts$READ.INI=T the recommended
choice. But keep the possibility to read in main_TASK.r (so that main_TASK can
run alone).
Integrated AdaBoost (package adabag) as a new learner. Tuning parameters are
opts$ADA.coeflearn, ADA.mfinal, ADA.rpart.minsplit.
Added in
function checkOpts (tdmEnvTMakeNew.r) a logic that checks opts against new
(unusual) parameters in list opts and prints a notification: “Note: a new
variable xyz has been defined for list opts.” (safeguard
against misspelling a parameter)
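A minimal sketch of such a safeguard in base R. The helper name checkNewOptNames and the reference list are hypothetical and do not reproduce the code of checkOpts; the sketch only compares names(opts) against a reference list and prints the notification for every name that is not in it.

```r
# Hypothetical helper (not the TDMR code): report names in opts that are
# absent from a reference list of known options.
checkNewOptNames <- function(opts, defaults) {
  newNames <- setdiff(names(opts), names(defaults))
  for (nm in newNames) {
    cat("Note: a new variable", nm, "has been defined for list opts.\n")
  }
  invisible(newNames)
}

# Illustrative reference list; the values are placeholders.
defaults <- list(SRF.calc = TRUE, READ.NROW = -1, rgain.type = "rgain")

# "rgain.typ" is a deliberate misspelling of "rgain.type".
opts <- list(SRF.calc = TRUE, READ.NROW = 100, rgain.typ = "made")
found <- checkNewOptNames(opts, defaults)
```

The check does not stop execution, since a new name may be intended; it only makes a silent misspelling visible.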
Added tuning params for SVM with kernel “polynomial”, “linear” or
“sigmoid”. Added tuning param TRNFRAC. Removed opts$SVM.C (never
needed).
Bug fix: If
opts$filename is not of the form “a.suf” but simply “a”, an error occurred
(connected with filesuffix, see tdmReadData). Now fixed.
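For illustration, a suffix lookup that handles both forms can be sketched with tools::file_ext from the R distribution. The helper name getSuffix is hypothetical, and the actual fix in tdmReadData may look different.

```r
# Hypothetical helper: return the suffix of a file name, or "" if it has none.
getSuffix <- function(filename) {
  tools::file_ext(filename)
}

suffixes <- c(getSuffix("a.suf"), getSuffix("a"))
hasSuffix <- nzchar(suffixes)   # TRUE for "a.suf", FALSE for "a"
```

Code that branches on the suffix can then test hasSuffix instead of assuming that a dot is always present.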
Bug fix:
Stop when duplicates appear in tdm$runList. Rewrote checkRoiParams to work
properly in all cases, no need for envT$roiNames1 any more.
Bug fix:
Column “tdmSplit” was not always excluded from input variables (main_sonar).
Now dsetTrnVa() and dsetTest() return all columns except
x$TST.COL, which is usually “tdmSplit”.
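The column exclusion can be sketched in base R as follows. dropSplitCol is a hypothetical stand-in and not the dsetTrnVa() or dsetTest() accessor; it only shows the idea of returning every column except the split indicator.

```r
# Hypothetical helper (not the TDMR accessor): drop the split-indicator column.
dropSplitCol <- function(dset, tstCol = "tdmSplit") {
  dset[, setdiff(names(dset), tstCol), drop = FALSE]
}

dset <- data.frame(x1 = 1:3, x2 = c(0.5, 0.1, 0.9), tdmSplit = c(0, 0, 1))
inputs <- dropSplitCol(dset)
```

Using drop = FALSE keeps the result a data frame even if only one input column remains.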
Fixed the
lev.resp bug in tdmClassify: Do not take for granted that lev.resp = levels(d_train[,response.variable]) is always the same (in
the same ordering) as the column names of app$test.prob in apply.SVM! This
assumption sometimes resulted in very bad performance for the wdbc task in case
opts$CLS.cutoff=c(0.5,0.5). Now we use colnames(app$test.prob) instead of lev.resp, which
resolves most cases. There are, however, still a few cases where the results
with app$test.prob and those with standard predict differ, see function check.apply.SVM in tdmClassify. It is unclear
why they differ.
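The effect of the ordering mismatch can be reproduced with a small hand-made probability matrix. The values and the class labels "B" and "M" below are illustrative only; the sketch picks the most probable class per row and ignores opts$CLS.cutoff.

```r
# Factor levels in their usual (alphabetical) order.
lev.resp <- c("B", "M")

# Class probabilities whose columns come in a different order.
prob <- matrix(c(0.9, 0.1,
                 0.2, 0.8),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("M", "B")))

idx <- max.col(prob, ties.method = "first")   # column index of the row maximum

wrong <- lev.resp[idx]         # assumes column order == level order
right <- colnames(prob)[idx]   # reads the label from the matrix itself
```

Here wrong assigns every record to the opposite class, which is consistent with the very bad performance described above.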
Improved
TDMR workflow by testing it on BreastCancer (BC)
dataset (has factor variables with many levels, has columns with <NA>):
· New check on NA’s in input variables
with na_input_check in tdmClassify.r. This will issue a warning each time a NA
is found.
· New checkData() function in
tdmSplitTestData.r: check whether dset and tset have the same mode in every
column and give a warning if a column is factor and has an unusual high number
of levels (>32).
· Specifically for BC dataset: New
function tdmPreNAroughfix(data) which main_BC() calls
prior to tdmClassify and avoids those warnings
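The two data checks in the list above can be sketched in base R as follows. checkDataSketch is a hypothetical function and not the checkData() of tdmSplitTestData.r; its messages and the way it reports them are assumptions.

```r
# Hypothetical sketch: compare column modes of dset and tset and flag
# factor columns in dset with more than maxLevels levels.
checkDataSketch <- function(dset, tset, maxLevels = 32) {
  msgs <- character(0)
  for (nm in intersect(names(dset), names(tset))) {
    if (mode(dset[[nm]]) != mode(tset[[nm]])) {
      msgs <- c(msgs, paste0("column ", nm, ": mode differs between dset and tset"))
    }
    if (is.factor(dset[[nm]]) && nlevels(dset[[nm]]) > maxLevels) {
      msgs <- c(msgs, paste0("column ", nm, ": factor with ",
                             nlevels(dset[[nm]]), " levels"))
    }
  }
  for (m in msgs) warning(m, call. = FALSE)
  invisible(msgs)
}

dset <- data.frame(id = factor(1:40), v = 1:40)
tset <- data.frame(id = factor(1:40), v = as.character(1:40),
                   stringsAsFactors = FALSE)
msgs <- suppressWarnings(checkDataSketch(dset, tset))
```

In this example column id is flagged for its 40 levels and column v for the numeric versus character mismatch.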
Documentation issues
· options opts → complete list in Appendix B
· mappings → complete list in Appendix A
· opts$gainmat in docu → opts$CLS.gainmat
· “phase” may be misleading, as if a
user had to pass all three of them → use “level” instead
· confusing that main-file has
opts-settings, should be all in .apd file <task>_00.apd
· flow diagram for train/test/vali
settings (sampling)
· updated tutorial, lesson 1 and many
links and source code snippets
· updated tutorial, lesson 2: no more
bst and res-files, but now returning spotConfig$alg.currentBest
, alg.currentResult.
Bug fix: any open sinks are now closed in tdmGraAndLogInitialize
using sink.number. This prevents *.log from growing bigger and bigger.
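Closing all open sinks with sink.number can be sketched as below. closeAllSinks is a hypothetical name; the code in tdmGraAndLogInitialize may differ in detail.

```r
# Close every output diversion that is still open on the sink stack.
closeAllSinks <- function() {
  while (sink.number() > 0) sink()
  invisible(sink.number())
}

tmp <- tempfile(fileext = ".log")
sink(tmp)            # open a diversion and "forget" to close it
closeAllSinks()
remaining <- sink.number()
unlink(tmp)
```

Without such a loop, every run that opens a sink and exits early leaves one more diversion open, so later output keeps being written to the old log file.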
Changed the
logic for opts$PRE.PCA / opts$PRE.PCA.npc and opts$PRE.SFA / opts$PRE.SFA.npc
(see docu in tdmOptsDefaultsSet). Appropriate settings for d_preproc = all
non-validation data or = training data, depending on opts$PRE.allNonVali=T/F.
d_preproc is also in tdmRegressLoop and tdmRegress.
SFA for regression now also available.
Simplify:
Prohibit PCA and SFA to be both activated (makes no sense, since SFA has its
own PCA-step).
tdmPrePCA bug (too few columns in dset when the number of training records is too small): The
bug happens in tdmPreprocUtils.r, tdmPrePCA.train, tdmAdjustDSet: If the number of
records Nr is smaller than the number of input vars, then the list of
returned PCs will be shorter, thus leading to a shorter dset. Fix: Introduce
d_preproc, the data set for preprocessing, which is normally = d_train. An error is raised
if the number of records in d_preproc is smaller than the number of numeric
vars. The user has the option to activate opts$PRE.allNonVali=T, which leads to
d_preproc = all non-validation data.
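The shortened list of PCs can be reproduced with base R's prcomp, used here as a stand-in (an assumption; the routine used inside tdmPrePCA.train is not stated here). The guard checkPreproc is a hypothetical sketch of the new error condition.

```r
set.seed(1)
nVars <- 10
small <- matrix(rnorm(5 * nVars), nrow = 5)     # 5 records, 10 numeric variables
large <- matrix(rnorm(50 * nVars), nrow = 50)   # 50 records, 10 numeric variables

nPcSmall <- ncol(prcomp(small)$rotation)   # limited by the number of records
nPcLarge <- ncol(prcomp(large)$rotation)   # one PC per variable

# Hypothetical guard in the spirit of the fix: refuse to preprocess
# when there are fewer records than numeric variables.
checkPreproc <- function(d_preproc, nNumericVars) {
  if (nrow(d_preproc) < nNumericVars)
    stop("d_preproc has fewer records than numeric variables")
  invisible(TRUE)
}
guardResult <- try(checkPreproc(small, nVars), silent = TRUE)
```

With only 5 records, prcomp returns 5 instead of 10 components, which is the kind of shortening that led to a shorter dset.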
Bug fix:
many main_TASK functions had the argument tset,
but internally they still called tdmClassifyLoop(dset, …, opts) where it
should read instead
tdmClassifyLoop(dset, …, opts, tset)
Simplified the code / the workflow (10/2012):
· Improved modularity: different parts
of TDMR construct different objects:
o the user (or tdmDefaultsFill) constructs tdm
o envT = tdmEnvTMakeNew(tdm) constructs envT
o envT = tdmEnvTAddBstRes(envT,fileRData) augments envT by bstGrid,
resGrid from .RData file (if needed)
o envT = tdmBigLoop(envT,spotStep) does (tuning and) unbiased runs.
· tdmBigLoop is now the new function which
should supersede the now deprecated tdmCompleteEval:
o only two parameters: envT, spotStep
o since envT is passed instead of tdm, we are more flexible which input to
send into tdmBigLoop. Example: If spotStep==“rep”, tdmBigLoop requires the data
frames bst and res from prior tuning runs → this is not possible via tdm, but can easily
be done via envT$bstGrid, envT$resGrid
o simplified tdm$fileMode-section (no .res or .bst-file writing &
copying any more, makes the code much simpler to understand!!) → bst and
res are returned / passed via envT
o tdm$fileMode=FALSE is now the default. tdm$fileMode=TRUE
is deprecated and leads only to writing of .fin and .exp files (these files are
not very necessary, since we store envT with theFinals in .RData file)
o always envT$spotConfig$spot.fileMode=FALSE
· tdmCompleteEval is still there for downward compatibility,
but it is deprecated:
o it writes .res, .bst, .fin, .exp files, if tdm$fileMode=T
o tdmCompleteEval has other calling arguments
o tdmCompleteEval now sets envT$spotConfig$spot.fileMode=tdm$fileMode (was
done before in tdmDispatchTuner)
o tdmCompleteEval should become obsolete, if all demos / user files are
changed to tdmBigLoop (but we keep it perhaps for downward compatibility)
· Simplified envT$result, which
contained 3x opts !! → now only 1x opts +
accessor function Opts().
· Reformulated the
tdm$fileMode sections in tdmCompleteEval / tdmBigLoop: The normal case is now
tdm$fileMode==FALSE.
· Abandoned the writing of
<name>_train.csv.SRF.<target>.RData,
<name>_train.log and <name>_train_eval.csv when tdm$oFileMode==FALSE,
since this may cause conflicts in certain parallel tasks.
· tdmGetObj is now marked as deprecated (we use
it however in unbiasedRun to ensure downward compatibility).
· Renamed “Test2” → “Vali2” and other naming issues around
“Test” and “Validation”. Made variable names more meaningful: VALI, if
connected with validation data, TST, if connected with test data.
Simplified the code / the workflow (06/2012):
· Simplified start of parallel
execution: no need for sourcing start.tdm.r (except if you want the R developer
sources), all sfExport-related stuff is now in function
prepareParallelExec in tdmBigLoop.r.
· Simplified design mapping: only one
function pair {tdmMapDesLoad, tdmMapDesApply} and no longer tdmMapDesSpot, no
makeTdmDesSpot. The maps map (from tdmMapDesign.csv) and mapUser (from
userMapDesign.csv) are now stored in list tdm.
· Simplified the triangle
startFromSource.r, start.tdm.r, source.tdm.r: startFromSource.r and start.tdm.r are now only needed for the developer
(if you want to start from R sources). They are NO LONGER
needed if the normal TDMR user wants to initiate parallel execution (all
sfExport’s and the like are now done in function prepareParallelExec in
tdmBigLoop.r, which is called if tdm$parallelCPUs>1).
· Warning: if tdm$umode=“TST” *and*
opts$TST.kind=“col”, then tdmSplitTestData will tag all records with
opts$TST.col!=0 as test data. Later on, tdmStartSpot
will hand only the data with opts$TST.col==0 to main_TASK, and this will
separate into vali and train data acc. to opts$TST.col again → all data are train, no vali data
(this is o.k. for opts$MOD.method==“RF”, but may lead to strange results in
other cases). – How to fix:
o Make a check on number of vali records for cases opts$MOD.method!=”RF”.
o Issue in tdmBigLoop/tdmCompleteEval a warning, if tdm$umode=”TST” and
opts$TST.kind=”col” and opts$MOD.method!=”RF”.
Docu TDMR and Demos TDMR:
· added TDMR-tutorial.html,
moved the section “Example usage” in there.
· added a FAQ section (“How to”) in TDMR-tutorial.html
· added two appendices on
tdmMapDesign.csv and on elements of opts in TDM-docu.html
· adapted all documentation &
demos to the new tdmBigLoop
· added citation ROCR
Modified the function tdmModAdjustCutoff:
· Extended that either
parameter CUTOFF1, … , CUTOFFn can be the missing one.
· Guaranteed that the dependent CUTOFF can never
become negative when enforcing the constraint.
· If tdmModAdjustCutoff is entered
with a cutoff with length(cutoff)==n.class-1, then
cutoff[n.class] becomes the dependent CUTOFF.
· The old function tdmMapCutoff is now
disabled, everything in tdmModAdjustCutoff.
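A minimal sketch of the dependent-cutoff idea for the case length(cutoff)==n.class-1. Two things are assumptions and not taken from tdmModAdjustCutoff: that the constraint is sum(cutoff)==1, and that an excess is handled by rescaling the given cutoffs. The function name is hypothetical.

```r
# Hypothetical sketch (not tdmModAdjustCutoff itself).
# Assumption: the constraint is sum(cutoff) == 1 over all n.class entries.
adjustCutoffSketch <- function(cutoff, n.class) {
  stopifnot(length(cutoff) == n.class - 1, all(cutoff >= 0))
  s <- sum(cutoff)
  if (s > 1) cutoff <- cutoff / s      # rescale so the dependent entry cannot be negative
  c(cutoff, max(0, 1 - sum(cutoff)))   # cutoff[n.class] is the dependent one
}

a <- adjustCutoffSketch(c(0.2, 0.3), n.class = 3)   # free cutoffs already feasible
b <- adjustCutoffSketch(c(0.8, 0.8), n.class = 3)   # free cutoffs exceed 1 in sum
```

In the second call the two free cutoffs are scaled down before the dependent one is computed, so no entry falls below zero.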
Fixed a bug: tdmPlotResMeta could crash, if not all .conf files had the
same tuning pars.
Fix: Now the x- and y-selectors in twiddler-interface are the union of
all tuning pars. If a x- or y-selection is not part of
the specific tuning pars for the selected .conf, issue an error message box and
do not start spot.
Added
skipIncomplete-part in tdmPlotResMeta(). Fixed a bug
(no mergedData) concerning nSkip in tdmPlotResMeta().
Fixed a bug
concerning opts$READ.NROW: now this is applied also when loading
<filetest>.RData
For regression: new option opts$rgain.type=“made” (mean absolute
deviation)
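As a reminder of the measure, a mean absolute deviation between observed and predicted values can be computed as below. The function name made is chosen for illustration, and whether TDMR rescales or negates the value internally is not stated here.

```r
# Mean absolute deviation between observed and predicted values.
made <- function(observed, predicted) mean(abs(observed - predicted))

err <- made(observed = c(1, 2, 4), predicted = c(1.5, 2, 3))
```

For these three records the absolute deviations are 0.5, 0 and 1.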
Extended
opts$rgain.string to work also for the regression options, adapt column names in
theFinals accordingly.
tdmOptsDefaultsSet returns now in opts an object of class “tdmOpts”. Checks
for the right class of opts in central TDMR files.
tdmRegressLoop.r,
tdmClassifyLoop.r: More accurate averaging of evaluation measures for
regression CV case, new variable ‘result$predictions’.
Bug fix
‘nfold=max(cvi,1)’: this avoids nfold=0 in the special
case that all records in dset are training cases (zero validation cases)
Saving
envT: parameter savePredictions (default =FALSE) allows to decide whether
result$predictions and result$predProbList are saved to .RData.
Some small
bug fixes concerning ‘predProb’ and ‘predictions’ for the case
opts$ncopies>0. predProb is needed by
tdmModConfmat, which is called from tdmClassify (in case opts$rgain.type=”ar*”,
this will call tdmROCR_calc with predProb). predProbList
is needed by tdmROCR.TDMclassifier.
Added in
tdmSortedRFimport the option opts$SRF.scale to use scaled or unscaled
importance.
Bug fix in
tdmClassify: build EVALa correctly also in cases where nrow(d_test)==0
→ set cm.test$* to NA and not cm.test
to NULL.
tdmSortedRFImport: negative importance values are now clipped to 0 (no longer additive
shift of importance values).
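The difference between clipping and shifting can be seen on a small importance vector. The values are made up, and the exact form of the former additive shift (here: subtracting the minimum) is an assumption.

```r
imp <- c(V1 = 0.12, V2 = -0.03, V3 = 0.40)

clipped <- pmax(imp, 0)     # new behavior: negative values become 0
shifted <- imp - min(imp)   # assumed form of the former additive shift
```

Clipping leaves the positive importances unchanged, whereas a shift changes all values and therefore their relative shares.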
If
tdm$parallelCPUs>1: snowfall would fail, if there is only one pass through
sfSapply, i.e. if length(indVec)==1. Fix: Check
in tdmCompleteEval whether length(indVec)==1, issue a
warning and set tdm$parallelCPUs to 1.
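The guard can be sketched in base R as follows. adjustParallelCPUs is a hypothetical helper and the warning text is an assumption; only the condition and the reset to 1 are taken from the entry above.

```r
# Hypothetical sketch of the guard: fall back to sequential execution
# when there is only a single pass to distribute.
adjustParallelCPUs <- function(tdm, indVec) {
  if (tdm$parallelCPUs > 1 && length(indVec) == 1) {
    warning("Only one pass through the loop: setting tdm$parallelCPUs to 1.")
    tdm$parallelCPUs <- 1
  }
  tdm
}

tdmOne  <- suppressWarnings(adjustParallelCPUs(list(parallelCPUs = 4), indVec = 1))
tdmMany <- adjustParallelCPUs(list(parallelCPUs = 4), indVec = 1:3)
```

With several passes the setting is left untouched.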
Renamed
bind_response to tdmBindResponse (tdmGeneralUtils.r)
Bug fix
‘path → tdm$path’ in
tdmMapDesLoad (tdmMapDesign.r)
Bug fix for
cma_es (package cmaes): When running demo/demo04cpu.r with tuner cmaes, we got
“Error in eigen.log[iter, ] <- rev(sort(e$values)):
subscript out of bounds”. Solution: control$maxit = round(control$maxit),
because this error only occurs if control$maxit is NOT an integer.
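One plausible mechanism for this error can be reproduced in base R. The sketch assumes that a log matrix is preallocated with maxit rows and filled while iter < maxit; this is an assumption about the tuner's internals and not a copy of its code.

```r
# Illustration under the stated assumption, not the cma_es source.
runLoop <- function(maxit) {
  eigenLog <- matrix(0, nrow = maxit, ncol = 2)   # a fractional nrow is truncated
  iter <- 0
  while (iter < maxit) {
    iter <- iter + 1
    eigenLog[iter, ] <- c(iter, iter)
  }
  nrow(eigenLog)
}

bad  <- try(runLoop(10.5), silent = TRUE)   # loop reaches row 11 of a 10-row matrix
good <- runLoop(round(10.5))                # integer maxit: loop and matrix agree
```

Rounding maxit makes the number of loop passes equal to the number of preallocated rows.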
Fixed a bug
concerning opts$filesuffix (tdmOptsDefaultsFill) which could lead to an
unwanted stop.
Bug fix:
regression tuning produced strange results (names in data frame) if only 1
variable was tuned (cpu, roi with only XPERC). Now fixed.
Bug fix:
cma_es (and other tdmStartOther tuners) usually did not include the last
design points in the BST data frame (these are usually formed after the
last time that “des$CONFIG %% tdm$spotConfig$seq.design.new.size==0” was
TRUE). Now fixed.
Bug fix in
tdmDispatchTuner: cma_jTuner did not yet return a list of type spotConfig in
tunerVal. Consequence: the above “Append” would not work. Now
fixed.
tdmDispatchTuner.r:
Made all tuners return a list of type spotConfig with the proper
settings in tunerVal$alg.currentResult and tunerVal$alg.currentBest.
Bug fix in
tdmMapDesApply: the “[-1]” in “dn=setdiff(names(des[-1]),
c("COUNT","CONFIG",...))” was wrong.
Extended
tdmPlotResMeta by a slider y_10Exp, which allows multiplying the y-values by
10^0, 10^1, …, 10^3 on the fly in the
twiddler interface. This usually gives a better color scheme for the 3D-plot in
spotReport3d.
o New ROC chart and lift chart capabilities, based on
package ROCR on CRAN, see help(tdmROCR).
o New measures for opts$rgain.type= “arROC”, “arLIFT”, “arPRE” for area
under ROC, lift or precision-recall chart, based on package ROCR.
o Improved and extended the set of demos (demo00, … , demo06). New demos for interactive visualization.
o Improved cma_jTuner (CMA-ES, Java version). Works now on Linux
and Windows OS platforms when using tdm$fileMode=FALSE.
o Improved tdmPlotResMeta (confFile, nSkip, chkSkip, xAxis, yAxis).
o Changed opts$fct.postproc: this is now the name of the
postprocessing function and not the function itself. Reason: If opts
contains a function, then it contains also its environment and this can be
pretty big (contains envT, …) and makes the .RData
saving of envT big.
o Flag opts$DO.POSTPROC is now deprecated, use instead opts$fct.postproc.
o Fixed a bug concerning opts$filesuffix (tdmReadData, could lead to
overwriting of opts$filename).
o Improved the examples section in TDM-docu.html. Now most examples in TDM-docu.html are in sync with the set of demos. Separated in TDM-docu.html the example usage description
from the example details. New chapter describing the
interactive visualization example.
o Improved the package documentation (simpler index via @keywords
internal, many small fixes).
o New 3D graphics for tuning results and their metamodels, using a
twiddler-interface on environment envT: see help(tdmPlotResMeta).
o New print() for TDMdata object “dataObj”
o Fixed a bug in tdmClassify (wrong ifelse in applySVM).
o Fixed some minor bugs to reactivate parallel mode: some sfExports
were missing.
o Fixed the saveEnvT-bug (“[9:9]”) in tdmCompleteEval. New option
tdm$filenameEnvT.
o Fixed the tdmMapDesign bug (Design variables missing in tdmMapDesign.csv
and userMapDesign.csv would not be mapped to opts. Now missing variables are
detected and an error is thrown.)
o Added opts$SPLIT.SEED variable: a variable to decide if tdmSplitTestData
runs in deterministic mode
o Added opts$TST.trnFrac: now trnFrac can be smaller than
1-opts$TST.valiFrac.
o Added SAVESEED-part in tdmSplitTestData, tdmClassifyLoop, tdmRegressLoop
o Added tdm$stratified with new meaning: if not NULL, make stratified
sampling w.r.t. the column of dset named in tdm$stratified.
o Some minor fixes concerning data reading
o TDMR documentation now available in PDF and HTML format (TDM-docu.html)
o integration of SFA (slow feature
analysis, see package rSFA on CRAN) as a feature generation method for
classification
o bug fix concerning tdmMapDesign; extension of
tdmMapDesign.csv
o moved PCA feature generation from
main_* into tdmClassifyLoop, it uses now only the training data for
establishing the PCA rotation (same for SFA)
o new training / validation / test
set capabilities, see Section “TDMR Data
Reading and Data Split …” in TDM-docu.html and help(tdmSplitTestData),
help(tdmReadData).
o modified TDMR’s seed concept, new
option opts$*.seed = “algSeed” (get the seed from spotConfig$alg.seed)
o new parameter tdm$mainFunc, simpler
and more general usage (as compared to tdm$mainFile and tdm$mainCommand)
o powell, cmaes, rSFA now in the “Depends”
list of DESCRIPTION
o added a TDMR-package description (file
tdmGeneralUtils.r)
o extended documentation (e.g. full
docu for tdmOptsDefaultsSet and many small other documentation
extensions)
o new section opts$CLS.* for
classification-related settings
o bug fixes in demo01cpu (seed
variation) and demo02sonar (GD.DEVICE)
o merged former functions
unbiasedBestRun_C and unbiasedBestRun_R into only one function unbiasedRun
o extended functions for information
on class objects: print.TDMclassifier, print.tdmClass, print.TDMregressor,
print.tdmRegre
o removed the dependencies on packages
matlab and mlbench
o new function tdmParaBootstrap.r: add
parametric bootstrap patterns, if opts$ncopies>0
o new version of TDM-docu.pdf: see
documentation index – directory
o new demo: demo00sonar (with some
graphics)
o fix in print.TDMclassifier,
print.TDMregressor: optional argument ‘type’
o doc/index.html added
o doc/changes.html added (this file)
o initial version