-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide jet-smearing (including random numbers) within correctionlib #130
Comments
Following up on this: |
I think yes we could add a function to the formula evaluator to generate a deterministic random number. We could also introduce a new node type for this purpose, which might make it a bit more flexible. |
Running a small benchmark to see what we can do for per-object deterministic random normal distributions: // compile with, e.g.
// g++ -Ixxhash/include -Ipcg-cpp/include -Icxx-ziggurat/include --std=c++17 -O2 -Wall hashtest.cpp -o hashtest
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
// https://github.com/RedSpah/xxhash_cpp
#include "xxhash.hpp"
// https://github.com/imneme/pcg-cpp
#include "pcg_random.hpp"
// https://github.com/snsinfu/cxx-ziggurat
#include "ziggurat.hpp"
void stats(const std::vector<double>& out) {
double mean{0.0};
for(size_t i=0; i<out.size(); ++i) {
mean += out[i];
}
mean /= out.size();
std::cout << "mean " << mean << std::endl;
}
int main(void) {
size_t n {100000};
std::vector<int> in1(n);
std::vector<double> in2(n);
std::vector<double> out(n);
std::mt19937 gen(42);
std::exponential_distribution<> d(20);
for(size_t i=0; i<n; ++i) {
in1[i] = i;
in2[i] = d(gen) + d(gen);
}
{
std::cout << "JME current (std::mt19937 + std::normal_distribution)" << std::endl;
auto tic = std::chrono::high_resolution_clock::now();
std::normal_distribution<> smear(1, 0.1);
for(size_t i=0; i<n; ++i) {
uint32_t seed = in1[i] << 20;
seed |= reinterpret_cast<uint64_t&>(in2[i]) & 0xFFFFFFFF;
gen.seed(seed);
out[i] = smear(gen);
}
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
stats(out);
}
{
std::cout << "std::normal_distribution (marsaglia, probably)" << std::endl;
auto tic = std::chrono::high_resolution_clock::now();
pcg32_oneseq g;
for(size_t i=0; i<n; ++i) {
g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
// if not a temporary, it sneakily reuses the spare value
out[i] = std::normal_distribution<>(1, 0.1)(g);
}
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
stats(out);
}
{
std::cout << "Marsaglia polar method" << std::endl;
auto tic = std::chrono::high_resolution_clock::now();
pcg32_oneseq g;
float u, v, s;
for(size_t i=0; i<n; ++i) {
g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
do {
// cheap but wrong-ish https://www.pcg-random.org/using-pcg-c.html#generating-doubles
u = std::ldexp(g(), -31) - 1;
v = std::ldexp(g(), -31) - 1;
s = u*u + v*v;
} while ( s>= 1.0 || s == 0.0 );
s = std::sqrt(-2.0 * std::log(s) / s);
out[i] = 1 + 0.1*u*s;
}
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
stats(out);
}
{
std::cout << "Ziggurat" << std::endl;
auto tic = std::chrono::high_resolution_clock::now();
pcg32_oneseq g;
cxx::ziggurat_normal_distribution<double> smear{1.0, 0.1};
for(size_t i=0; i<n; ++i) {
g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
out[i] = smear(g);
}
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
stats(out);
}
{
std::cout << "Just the generator" << std::endl;
auto tic = std::chrono::high_resolution_clock::now();
pcg32_oneseq g;
for(size_t i=0; i<n; ++i) {
g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
out[i] = std::ldexp(g(), -32)*0.1 + 1;
}
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
stats(out);
}
return 0;
} on my machine I get
I think in general its important to use a proper hash function such as xxhash to seed any generator, see https://blog.unity.com/technology/a-primer-on-repeatable-random-numbers for some discussion on why. Then its a matter of finding a reasonably fast, reproducible normal distribution generator. Technically, the exact algorithm of The idea would then be to define a new |
Hi Nick, thanks for the follow-up. Would be great to have this going forward! |
Example of a JER smearing correction with the new HashPRNG node: with gzip.open("../jsonpog-integration/POG/JME/2017_UL/jet_jerc.json.gz") as fin:
cset = CorrectionSet.parse_raw(fin.read())
cset.corrections = [
c for c in cset.corrections
if c.name in (
"Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
"Summer19UL17_V5_MC_L2Relative_AK4PFchs",
"Summer19UL17_JRV2_MC_PtResolution_AK4PFchs",
"Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs",
)
]
cset.compound_corrections = []
res = Correction.parse_obj(
{
"name": "JERSmear",
"description": "Jet smearing tool",
"inputs": [
{"name": "JetPt", "type": "real"},
{"name": "JetEta", "type": "real"},
{"name": "GenPt", "type": "real", "description": "matched GenJet pt, or -1 if no match"},
{"name": "Rho", "type": "real", "description": "entropy source"},
{"name": "EventID", "type": "int", "description": "entropy source"},
{"name": "JER", "type": "real", "description": "Jet energy resolution"},
{"name": "JERsf", "type": "real", "description": "Jet energy resolution scale factor"},
],
"output": {"name": "smear", "type": "real"},
"version": 1,
"data": {
"nodetype": "binning",
"input": "GenPt",
"edges": [-1, 0, 1],
"flow": "clamp",
"content": [
# stochastic
{
# rewrite gen_pt with a random gaussian
"nodetype": "transform",
"input": "GenPt",
"rule": {
"nodetype": "hashprng",
"inputs": ["JetPt", "JetEta", "Rho", "EventID"],
"distribution": "normal",
},
"content": {
"nodetype": "formula",
# TODO min jet pt?
"expression": "1+sqrt(max(x*x - 1, 0)) * y * z",
"parser": "TFormula",
# now gen_pt is actually the output of hashprng
"variables": ["JERsf", "JER", "GenPt"],
},
},
# deterministic
{
"nodetype": "formula",
# TODO min jet pt?
"expression": "1+(x-1)*(y-z)/y",
"parser": "TFormula",
"variables": ["JERsf", "JetPt", "GenPt"],
},
],
},
}
)
cset.corrections.append(res)
stack = CompoundCorrection.parse_obj(
{
"name": "JEC",
"output": {"name": "sf", "type": "real"},
"inputs": [
{"name": "JetPt", "type": "real"},
{"name": "JetEta", "type": "real"},
{"name": "JetA", "type": "real"},
{"name": "Rho", "type": "real"},
],
"inputs_update": ["JetPt"],
"input_op": "*",
"output_op": "*",
"stack": [
"Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
"Summer19UL17_V5_MC_L2Relative_AK4PFchs",
],
}
)
cset.compound_corrections.append(stack)
ceval = cset.to_evaluator()
with gzip.open("jectest.json.gz", "wt") as fout:
fout.write(cset.json(exclude_unset=True))
import rich
rich.print(cset) Application of a complete JEC would then be something like jec = ceval.compound["JEC"].evaluate(jets.pt_raw, jets.eta, jets.area, jets.event_rho)
pt_jec = jets.pt_raw * jec
jer = ceval["Summer19UL17_JRV2_MC_PtResolution_AK4PFchs"].evaluate(jets.eta, pt_jec, jets.event_rho)
ptgen = np.where(np.abs(pt_jec - jets.pt_gen) < 3*pt_jec*jer, jets.pt_gen, -1.0)
jersf = ceval["Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs"].evaluate(jets.eta, "nom")
jersmear = ceval["JERSmear"].evaluate(pt_jec, jets.eta, ptgen, jets.event_rho, jets.event_id, jer, jersf)
pt_final = pt_jec * jersmear |
Discussed in #123
Originally posted by kirschen February 3, 2022
Hi @nsmith- , all,
@lgray brought up the idea of going a step further in what is provided via correctionlib. One classical example of where one would profit from harmonizing is jet energy smearing, see e.g. https://twiki.cern.ch/twiki/bin/viewauth/CMS/JetResolution#Smearing_procedures
Depending on whether or not a genjet is matched (well enough) to a recojet, either the difference between genjet pt and recojet pt is changed or some stochastic smearing is applied. Ideally, the result would be reproducible, e.g. deciding on a standard way to calculate a random number seed (e.g. requiring event/lumisection as input).
Input would be:
recopt, recoeta, genpt, (info for random number seed)
Output:
smeared recopt
Would this still be doable and within scope of correctionlib @nsmith- ?
Cheers,
Henning
The text was updated successfully, but these errors were encountered: