- Updated `py/ANNZ.py` and `scripts/annz_evalWrapper.py` for python-3.6 compatibility.
- Fixed bug in the `Makefile`; ROOT shared libraries are now linked after the local objects.
- Changed the optimization method for generating regression PDFs. The new default method (denoted in the output as `PDF_0`) is now generated based on a simple random walk algorithm. The previous versions of the PDF are now denoted as `PDF_1` and `PDF_2`. While currently available, the deprecated PDFs are not guaranteed to be supported in the future. In order to derive the deprecated PDFs, set:

  ```python
  glob.annz["nPDFs"]           = 3
  glob.annz["addOldStylePDFs"] = True
  ```
- Two new job options corresponding to `PDF_0` have been added: `max_optimObj_PDF` and `nOptimLoops` (see `README.md` and `scripts/annz_rndReg_advanced.py` for details).
- The default value of `excludeRangePdfModelFit` has been changed from `0.1` to `0`.
- Added several job options for plotting, to control the extent of the underflow and overflow regions in the regression target: `underflowZ`, `overflowZ`, `underflowZwidth`, `overflowZwidth`, `nUnderflowBins`, `nOverflowBins`. (See `src/myANNZ.cpp` for details.)
- Added a variable, `nZclosBins`, to control the number of bins used for optimization-metric calculations in regression. (See `src/myANNZ.cpp` for details.)
- ROOT scripts are no longer stored by default for each plot. Set `savePlotScripts` to choose otherwise.
- Added a wrapper class, which allows calling the evaluation phase for regression/classification directly from python. This can be used to integrate ANNZ directly within pipelines. The python interface is defined in `py/ANNZ.py`, with a full example given in `scripts/annz_evalWrapper.py`. (See `README.md` for details.)
- Bug fix in a few python scripts, where the example for the `weightInp_wgtKNN` option had previously been set to numerically insignificant values.
- Changed the interface for turning off colour output (see `README.md`).
- Major revamp of the `Makefile`, including a new precompilation step for the shared `include/commonInclude.hpp` header.
- Reorganization of shared namespaces.
- Created a new `Manager` class as part of `include/myANNZ.hpp` and `src/myANNZ.cpp`.
- The new random walk algorithm for generating regression PDFs is implemented in `ANNZ::getRndMethodBestPDF()`, which has been completely revamped. The old version of this function has been renamed to `ANNZ::getOldStyleRndMethodBestPDF()`; it is now used to derive `PDF_1` and `PDF_2`.
- Added a wrapper class for e.g., python integration, implemented in `include/Wrapper.hpp`, `src/Wrapper.cpp` and `py/ANNZ.py`.
- Completely rewrote `ANNZ::doEvalReg()` to comply with pipeline integration. Added new interfaces for regression evaluation, as implemented in `src/ANNZ_regEval.cpp`.
- Added the option to not store the full value of PDFs in the output of optimization/evaluation, by setting

  ```python
  glob.annz["doStorePdfBins"] = False
  ```

  In this case, only the average metrics of a PDF are included in the output.
- Added the `sampleFrac_errKNN` option, which allows sub-sampling of the input dataset for the KNN uncertainty calculation (similar to e.g., `sampleFracInp_wgtKNN` and `sampleFracInp_inTrain`).
- Added metric plots of the distribution of the KNN error estimator vs. the true bias. The plots are added to the output by setting

  ```python
  glob.annz["doKnnErrPlots"] = True
  ```
- Added support for input ROOT files with different tree names.
- Added support for ROOT version `6.8.*`.
- Other minor modifications and bug fixes.
- Fixed bug with using general math expressions for the `weightVarNames_wgtKNN` and `weightVarNames_inTrain` variables.
- Modified the `Makefile` to explicitly include `rpath` in `LDFLAGS`, which may be needed for pre-compiled versions of ROOT.
- Modified `subprocess.check_output()` in `examples/scripts/annz_qsub.py` and `fitsFuncs.py` for Python 3.x.
- Fixed bug which caused a segmentation fault in some cases during reweighting.
- Other minor modifications and bug fixes.
- Added a bias correction procedure for MLMs, which may be switched off using `glob.annz["doBiasCorMLM"] = False`. (See `README.md` and `scripts/annz_rndReg_advanced.py` for details.)
- Added the option to generate error estimates (using the KNN method) for a general input dataset. An example script is provided as `scripts/annz_rndReg_knnErr.py`. (A detailed description is given in `README.md`.)
- Added the `userWeights_metricPlots` job option, which can be used to set weight expressions for the performance plots of regression. (See `README.md` for details.)
- Changed the binning scheme for the performance plots of auxiliary variables (defined using `glob.annz["addOutputVars"]`). Instead of equal-width bins, the plots now use bins defined such that each contains the same number of objects (equal-quantile binning). This e.g., reduces statistical fluctuations in computations of the bias, scatter and other parameters, as a function of the variables used for the training.
- Changed the default number of training cycles for ANNs from `5000` to a (more reasonable) randomized choice in the range `[500,2000]` (`ANNZ::generateOptsMLM()`). The option may be set to any other value by the user, using the `NCycles` setting. E.g., during training, set:

  ```python
  glob.annz["userMLMopts"] = "ANNZ_MLM=ANN::HiddenLayers=N,N+3:NCycles=3500"
  ```
- Fixed minor bug in `ANNZ::Train_binnedCls()`, which caused a mismatch of job options for some configurations of binned classification.
- Added a version tag to all intermediate option files, with a format such as `[versionTag]=ANNZ_2.1.3`.
- Minor change to the selection criteria for `ANNZ_best` in randomized regression.
- Other minor modifications and bug fixes.
- Improved selection criteria for `ANNZ_best` in randomized regression. The optimization is now based on `glob.annz["optimCondReg"] = "sig68"` or `"bias"`. (The `"fracSig68"` option is deprecated.)
- Significant speed improvement for KNN weights and `inTrainFlag` calculations in `CatFormat::addWgtKNNtoTree()`.
- Modified `CatFormat::addWgtKNNtoTree()` and `CatFormat::inputToSplitTree_wgtKNN()` so that both training and testing objects are used together as the reference dataset when deriving KNN weights. This new option is on by default, and may be turned off by setting:

  ```python
  glob.annz["trainTestTogether_wgtKNN"] = False
  ```
- For developers: internal interface change (not backward compatible). What used to be `CatFormat::addWgtKNNtoTree(TChain * aChainInp, TChain * aChainRef, TString outTreeName)` has been changed to `CatFormat::addWgtKNNtoTree(TChain * aChainInp, TChain * aChainRef, TChain * aChainEvl, TString outTreeName)`.
- Cancelled the `splitTypeValid` option, which was not very useful and was confusing for users. From now on, input datasets may only be divided into two subsets, one for training and one for testing. The user may define the training/testing samples in one of two ways (see `scripts/annz_rndReg_advanced.py` for details):
  - Automatic splitting:

    ```python
    glob.annz["splitType"]    = "random"
    glob.annz["inAsciiFiles"] = "boss_dr10_0.csv;boss_dr10_1.csv"
    ```

    Set a list of input files in `inAsciiFiles`, and use `splitType` to specify the method for splitting the sample. Allowed values for the latter are `serial`, `blocks` or `random`.
  - Splitting by file:

    ```python
    glob.annz["splitType"]      = "byInFiles"
    glob.annz["splitTypeTrain"] = "boss_dr10_0.csv"
    glob.annz["splitTypeTest"]  = "boss_dr10_1.csv;boss_dr10_2.csv"
    ```

    Set a list of input files for training in `splitTypeTrain`, and a list of input files for testing in `splitTypeTest`.
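The three automatic splitting modes can be illustrated with a toy index-based sketch. The function `split_indices` and the exact per-mode rules below are assumptions for illustration only; the authoritative definitions of `serial`, `blocks` and `random` are in the ANNZ source:

```python
import random

def split_indices(n_obj, split_type, seed=1984):
    """Illustrative two-way train/test split, mirroring the naming of the
    splitType job option (serial/blocks/random) - NOT ANNZ's exact code."""
    idx = list(range(n_obj))
    if split_type == "serial":    # alternate objects: train, test, train, ...
        train, test = idx[0::2], idx[1::2]
    elif split_type == "blocks":  # first half -> train, second half -> test
        half = n_obj // 2
        train, test = idx[:half], idx[half:]
    elif split_type == "random":  # shuffle reproducibly, then split in half
        rng = random.Random(seed)
        rng.shuffle(idx)
        half = n_obj // 2
        train, test = sorted(idx[:half]), sorted(idx[half:])
    else:
        raise ValueError("unknown splitType: " + split_type)
    return train, test
```

Whatever the mode, every object ends up in exactly one of the two subsets.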
- Added plotting for the evaluation mode of regression (single regression, randomized regression and binned classification). If the regression target is detected as part of the evaluated dataset, the nominal performance plots are created. For instance, for the `scripts/annz_rndReg_quick.py` script, the plots will be created in `output/test_randReg_quick/regres/eval/plots/`.
- Fixed bug in the plotting routine `ANNZ::doMetricPlots()`, when adding user-defined cuts for variables not already present in the input trees.
- Simplified the interface for string variables in cut and weight expressions. For example, given a set of input parameters,

  ```python
  glob.annz["inAsciiVars"] = "D:MAG_AUTO_G;D:MAG_AUTO_R;D:MAG_AUTO_I;D:Z_SPEC;C:FIELD"
  ```

  one can now use cuts and weights of the form:

  ```python
  glob.annz["userCuts_train"]    = "(FIELD == \"FIELD_0\") || (FIELD == \"FIELD_1\")"
  glob.annz["userCuts_valid"]    = "(FIELD == \"FIELD_1\") || (FIELD == \"FIELD_2\")"
  glob.annz["userWeights_train"] = "1.0*(FIELD == \"FIELD_0\") + 2.0*(FIELD == \"FIELD_1\")"
  glob.annz["userWeights_valid"] = "1.0*(FIELD == \"FIELD_1\") + 0.1*(FIELD == \"FIELD_2\")"
  ```

  Here, training is only done using `FIELD_0` and `FIELD_1`; validation is weighted such that galaxies from `FIELD_1` have ten times the weight of galaxies from `FIELD_2`, etc.
- The same rules also apply for the weight and cut options of the KNN re-weighting method (`cutInp_wgtKNN`, `cutRef_wgtKNN`, `weightRef_wgtKNN` and `weightInp_wgtKNN`), and for the corresponding variables of the evaluation compatibility test (`cutInp_inTrain`, `cutRef_inTrain`, `weightRef_inTrain` and `weightInp_inTrain`). (Examples for the re-weighting and for the compatibility test using these variables are given in `scripts/annz_rndReg_advanced.py`.)
- `ANNZ_PDF_max_0` is no longer calculated by default. This may be turned back on by setting

  ```python
  glob.annz["addMaxPDF"] = True
  ```
- Other minor modifications and bug fixes.
- Fixed bug in generating a name for an internal `TF1` function in `ANNZ::setupKdTreeKNN()`.
- Fixed bug in the plotting routine `ANNZ::doMetricPlots()`, when adding user-requested variables which are not floats.
- Added the option:

  ```python
  glob.annz["optimWithMAD"] = False
  ```

  If set to `True`, the MAD (median absolute deviation) is used instead of the 68th percentile of the bias (`sigma_68`). This affects only the selection of the "best" MLM and the PDF optimization procedure in randomized regression. See `scripts/generalSettings.py`.
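The two scatter estimators can be sketched in a few lines. This is a standalone illustration using simple index-based quantiles, not the ANNZ implementation:

```python
import statistics

def sigma_68(deltas):
    """Half the width of the central 68% of the distribution: (q84 - q16) / 2."""
    srt = sorted(deltas)
    n = len(srt)
    q16 = srt[int(0.16 * (n - 1))]
    q84 = srt[int(0.84 * (n - 1))]
    return 0.5 * (q84 - q16)

def mad(deltas):
    """Median absolute deviation around the median of the bias distribution."""
    med = statistics.median(deltas)
    return statistics.median(abs(d - med) for d in deltas)
```

Both are robust scatter measures; the MAD is less sensitive to the tails of the bias distribution, which is the point of the new option.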
- Added the option:

  ```python
  glob.annz["optimWithScaledBias"] = False
  ```

  If set to `True`, then instead of the bias, `delta == zReg-zTrg`, the expression `deltaScaled == delta/(1+zTrg)` is used, where `zReg` is the estimated result of the MLM/PDF and `zTrg` is the true (target) value. This affects only the selection of the "best" MLM and the PDF optimization procedure in randomized regression. E.g., one can set this parameter in order to minimize the value of `deltaScaled` instead of the value of `delta`, or correspondingly the scatter of `deltaScaled` instead of that of `delta`. The criterion for prioritizing the bias or the scatter remains the parameter `glob.annz["optimCondReg"]`. The latter can take the value `bias` (for `delta` or `deltaScaled`), `sig68` (for the scatter of `delta` or of `deltaScaled`), or `fracSig68` (for the outlier fraction of `delta` or of `deltaScaled`). See `scripts/generalSettings.py`.
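The two bias definitions amount to a one-line transformation per object; a minimal sketch (the helper name `bias_terms` is illustrative):

```python
def bias_terms(z_reg, z_trg, scaled=False):
    """Per-object bias. With scaled=True this mimics the optimWithScaledBias
    definition: deltaScaled = (zReg - zTrg) / (1 + zTrg)."""
    out = []
    for zr, zt in zip(z_reg, z_trg):
        delta = zr - zt
        out.append(delta / (1.0 + zt) if scaled else delta)
    return out
```

The `1/(1+zTrg)` factor down-weights the bias at high target values, which is the usual convention for photometric-redshift metrics.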
- Added the option:

  ```python
  glob.annz["plotWithScaledBias"] = False
  ```

  If set to `True`, then instead of the bias, `delta == zReg-zTrg`, the expression `delta/(1+zTrg)` is used. This affects only the figures generated with the plotting routine, `ANNZ::doMetricPlots()`, and does not change any of the optimization/output of the code. See `scripts/generalSettings.py`.
- Added option to set the PDF bins in randomized regression by the width of the bins, instead of by their number. That is, one can now set e.g.,

  ```python
  glob.annz["pdfBinWidth"] = 0.01
  ```

  instead of e.g.,

  ```python
  glob.annz["nPDFbins"] = 100
  ```

  Assuming the regression range is `[minValZ,maxValZ] = [0,1.5]`, the first option will lead to 150 PDF bins of width 0.01, while the second will result in 100 bins of width 0.015. The two options are mutually exclusive (the user should define only one or the other).
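The arithmetic relating the two options is simple; a sketch (helper names are illustrative, not ANNZ functions):

```python
def n_pdf_bins(min_val_z, max_val_z, bin_width):
    """Number of PDF bins implied by a fixed bin width over [minValZ, maxValZ]."""
    return round((max_val_z - min_val_z) / bin_width)

def pdf_bin_width(min_val_z, max_val_z, n_bins):
    """Bin width implied by a fixed number of bins over [minValZ, maxValZ]."""
    return (max_val_z - min_val_z) / n_bins
```

This reproduces the numbers quoted above for the `[0,1.5]` range: a width of 0.01 gives 150 bins, while 100 bins give a width of 0.015.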
- For developers: changed the internal key-word interface in `Utils::getInterQuantileStats()` for requesting a MAD calculation: to add the calculation, changed from `medianAbsoluteDeviation` to `getMAD`; to retrieve the result of the calculation, from `quant_medianAbsoluteDeviation` to `quant_MAD`.
- Other minor modifications.
- Removed unnecessary dictionary generation from the `Makefile`.
- Changed `std::map` to `std::unordered_map` in the main containers of the `OptMaps()` and `VarMaps()` classes (a slight performance boost).
- Nominally, no longer keeping track of the name of the original input file (stored in the ROOT trees with the name defined in `origFileName` in `myANNZ::Init()`). This may be switched back on by setting `glob.annz["storeOrigFileName"] = True`.
- Added the option to use an entire input file as signal or background for single/randomized classification, in addition to (or instead of) defining a cut based on one of the input parameters. In order to use this option, one must define the variables `inpFiles_sig` and `inpFiles_bck`. An example is given in `scripts/annz_rndCls_advanced.py`.
- Added a bias correction for randomized regression PDFs. This option is now active by default, and may be turned off by setting

  ```python
  glob.annz["doBiasCorPDF"] = False
  ```
- Other minor modifications.
- Did some code optimization for tree-looping operations.
- Added the script `annz_rndReg_weights.py`: this shows how one may derive the weights based on the KNN method (using `useWgtKNN`), and/or the `inTrainFlag` quality flag, without training/evaluating any MLMs.
- Added a plot-reference guide (`thePlotsExplained.pdf`).
- Added the option `doGausSigmaRelErr` (now set to `True` by default) to estimate the scatter of the relative uncertainty of regression solutions by a Gaussian fit, instead of by the RMS or the 68th percentile of the distribution of the relative uncertainty. This only affects the plotting output of regression problems (`ANNZ::doMetricPlots()`).
- Added support for general math expressions for the `weightVarNames_wgtKNN` and `weightVarNames_inTrain` variables.
- Nominally, the `inTrainFlag` quality flag is a binary operator, and may only take values of either `0` or `1`. An option has now been added to set `maxRelRatioInRef_inTrain < 0`; in this case, the `maxRelRatioInRef_inTrain` parameter is ignored, and the `inTrainFlag` may take floating-point values between zero and one.
- Added a transformation of the input parameters used for the kd-tree during the nominal uncertainty calculation in regression. The variables after the transformation span the range `[-1,1]`. The transformations are performed by default, and may be turned off by setting

  ```python
  glob.annz["doWidthRescale_errKNN"] = False
  ```

  Similarly, added the same transformations for the kd-tree used during the `glob.annz["useWgtKNN"] = True` and `glob.annz["addInTrainFlag"] = True` setups. These may be turned off using the flags `doWidthRescale_wgtKNN` and `doWidthRescale_inTrain`, respectively.
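A transformation of this kind is, in essence, a linear min-max mapping onto `[-1,1]`; a sketch under that assumption (the function name is illustrative, and ANNZ's handling of edge cases may differ):

```python
def rescale_to_unit_range(values):
    """Linearly map values so that [min, max] -> [-1, 1].
    A degenerate (constant) variable is mapped to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

Putting all kd-tree inputs on a common scale prevents variables with large dynamic ranges from dominating the distance metric of the neighbor search.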
- Added support for ROOT file inputs, which may be used instead of ascii inputs (an example is given in `scripts/annz_rndReg_advanced.py`).
- Other minor modifications.
- Fixed bug in `CatFormat::addWgtKNNtoTree()`, where the weight expression for the KNN trees did not include the `ANNZ_KNN_w` weight in cases of `glob.annz["addInTrainFlag"] = True`.
- Modified the condition on the MLM error estimator which is propagated to PDFs in randomized regression. Previously, if an error was undefined (indicating a problem), the code would stop; now the MLM is ignored and the code continues. The change is needed because sometimes there is a valid reason for the error to be undefined.
- Other minor modifications.
- Modified the function `CatFormat::addWgtKNNtoTree()`, and added `CatFormat::asciiToFullTree_wgtKNN()`: the purpose of the new features is to add an output variable, denoted as `inTrainFlag`, to the output of evaluation. The new output indicates whether the corresponding object is "compatible" with other objects from the training dataset. The compatibility is estimated by comparing the density of objects in the training dataset in the vicinity of the evaluated object. If the evaluated object belongs to an area of parameter space which is not represented in the training dataset, we will get `inTrainFlag = 0`; in this case, the output of the training is probably unreliable.
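A toy version of such a density-based compatibility flag is sketched below. The distance rule, radius and neighbor threshold are invented for illustration; ANNZ's actual test is KNN-based and more refined:

```python
def in_train_flag(obj, train_set, radius=1.0, min_neighbors=3):
    """Toy compatibility test: count training objects within `radius` of the
    evaluated object in parameter space, and flag 0 if the neighborhood is
    (nearly) empty. All thresholds here are arbitrary illustrative choices."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    n_near = sum(1 for t in train_set if dist2(obj, t) <= radius ** 2)
    return 1 if n_near >= min_neighbors else 0
```

An object lying well inside the populated region of the training sample gets flag 1; an object far outside it gets flag 0, signalling that its evaluation output should not be trusted.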
- Other minor modifications.
- Added MultiClass support to binned classification: the new option is controlled by setting the `doMultiCls` flag. In this mode, multiple background samples can be trained simultaneously against the signal. In the context of binned classification, this means that each classification bin acts as an independent sample during the training.
- Added the function `ANNZ::deriveHisClsPrb()`: binned classification has been modified such that all classification probabilities are calculated by hand, instead of using the `CreateMVAPdfs` option of `TMVA::Factory`. By default, the new calculation takes into account the relative size of the signal in each classification bin, compared to the number of objects in the entire training sample. The latter feature may be turned off by setting:

  ```python
  glob.annz["useBinClsPrior"] = False
  ```
- Added `ANNZ_PDF_max`, the most likely value of a PDF (the peak of the PDF), to the outputs of regression.
- Fixed compatibility issues with ROOT v6.02.
- Fixed bug in `VarMaps::storeTreeToAscii()`, where variables of type `Long64_t` were treated as `Bool_t` by mistake, causing a crash.
- Other minor modifications.
The following changes were made:
- Modified the way in which the KNN error estimator works: in the previous version, the errors were generated by looping, for each object, over the n near-neighbors in the training dataset; for a given object, this was done for all neighbors, for each of the MLMs. In the revised version, the MLM response values for the entire training dataset are estimated once, before the loop over the objects begins, with the results stored in a dedicated tree (see `ANNZ::createTreeErrKNN()`). This tree is then read in during the loop over the objects for which the errors are generated. In this implementation, the KNN neighbor search is done once for all MLMs, and the errors are estimated simultaneously for all of them. This prevents both the unnecessary repeated calculation of MLM outputs, and the redundant searches for the n near-neighbors of the same object.
- Name of evaluation subdirectory: added the variable `evalDirPostfix`, which allows the user to modify the name of the evaluation subdirectory. Different input files can now be evaluated simultaneously, without overwriting previous results. The example scripts have been modified accordingly.
- Various small modifications.
First version (v2.0.0) of the new implementation of the machine learning code, ANNz.