
Provide jet-smearing (including random numbers) within correctionlib #130

Closed
nsmith- opened this issue on Apr 20, 2022 · Discussed in #123 · 5 comments · Fixed by #140
Labels
enhancement New feature or request

Comments

@nsmith- (Collaborator) commented Apr 20, 2022

Discussed in #123

Originally posted by kirschen February 3, 2022
Hi @nsmith- , all,

@lgray brought up the idea of going a step further in what is provided via correctionlib. One classic example where one would profit from harmonization is jet energy smearing; see e.g. https://twiki.cern.ch/twiki/bin/viewauth/CMS/JetResolution#Smearing_procedures
Depending on whether or not a genjet is matched (well enough) to a recojet, either the difference between genjet pt and recojet pt is scaled, or some stochastic smearing is applied. Ideally the result would be reproducible, e.g. by agreeing on a standard way to compute the random number seed (for instance requiring event/lumisection numbers as input).

Input would be:
recopt, recoeta, genpt, (info for random number seed)
Output:
smeared recopt

Would this still be doable and within the scope of correctionlib, @nsmith-?

Cheers,
Henning
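For reference, the two cases described above amount to the standard hybrid smearing recipe. Below is a minimal Python sketch (the function and argument names are placeholders for illustration, not the CMSSW or correctionlib implementation):

import numpy as np

def smear_factor(reco_pt, gen_pt, jer_sf, jer_res, rng):
    # Illustrative hybrid JER smearing factor (placeholder names):
    #   reco_pt: reconstructed jet pt (after JEC)
    #   gen_pt:  matched GenJet pt, or a negative value if no match
    #   jer_sf:  jet energy resolution scale factor
    #   jer_res: relative pt resolution (sigma_pt / pt)
    #   rng:     a deterministically seeded numpy Generator
    if gen_pt > 0:
        # matched: scale the reco-gen pt difference by the scale factor
        factor = 1.0 + (jer_sf - 1.0) * (reco_pt - gen_pt) / reco_pt
    else:
        # unmatched: stochastic smearing with width sqrt(max(sf^2 - 1, 0)) * resolution
        factor = 1.0 + rng.normal(0.0, jer_res) * np.sqrt(max(jer_sf**2 - 1.0, 0.0))
    return max(factor, 0.0)

The open question in the thread is precisely how to make rng deterministic and reproducible across frameworks.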

nsmith- added the enhancement label on Apr 20, 2022
@kirschen (Contributor) commented May 24, 2022

Following up on this: there is a reference implementation of the deterministic smearing at https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PatUtils/interface/SmearedJetProducerT.h#L207-L214
Could a call to the random number generator (including setting the deterministic seed) be included in the formula evaluator, or how should this proceed?
The same random number generator used in the CMSSW implementation is also available in numpy: https://numpy.org/doc/stable/reference/random/bit_generators/mt19937.html
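For illustration, deterministic per-jet seeding with numpy's MT19937 could look like the sketch below. The seed recipe here is invented for the example, and note that numpy scrambles integer seeds through a SeedSequence and draws normals with a different algorithm than libstdc++, so this does not reproduce the CMSSW stream bit for bit even though the engine is the same.

from numpy.random import Generator, MT19937

def per_jet_normal(run, lumi, event, jet_index, loc=1.0, scale=0.1):
    # Hypothetical per-jet seed built from event metadata
    # (illustration only, not the CMSSW seed formula)
    seed = (run << 40) ^ (lumi << 20) ^ event ^ jet_index
    rng = Generator(MT19937(seed))
    return rng.normal(loc, scale)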

@nsmith- (Collaborator, Author) commented Jun 1, 2022

I think yes, we could add a function to the formula evaluator to generate a deterministic random number. We could also introduce a new node type for this purpose, which might make it a bit more flexible.
But I think it would be nicer to move to something more performant, such as an integer hashing function followed by a Box-Muller transform.
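A pure-Python sketch of that direction, using splitmix64 as the integer hash and a Box-Muller transform (the names and the way the entropy words are combined are illustrative only, not what correctionlib ended up implementing):

import math

MASK64 = (1 << 64) - 1

def splitmix64(x):
    # splitmix64 finalizer: a cheap, well-mixed 64-bit integer hash
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def hashed_normal(*entropy, loc=1.0, scale=0.1):
    # Fold the entropy words into one hash, derive two uniforms, apply Box-Muller
    h = 0
    for word in entropy:
        h = splitmix64(h ^ (word & MASK64))
    u1 = (splitmix64(h) >> 11) / float(1 << 53)      # uniform in [0, 1)
    u2 = (splitmix64(h + 1) >> 11) / float(1 << 53)
    z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
    return loc + scale * z

The C++ benchmark in the comment below compares faster variants of the same idea (xxhash-seeded generators with different normal samplers).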

@nsmith- (Collaborator, Author) commented Jul 26, 2022

Running a small benchmark to see what we can do for per-object deterministic random normal distributions:

// compile with, e.g.
// g++ -Ixxhash/include -Ipcg-cpp/include -Icxx-ziggurat/include --std=c++17 -O2 -Wall hashtest.cpp -o hashtest
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
#include <cmath>    // std::ldexp, std::sqrt, std::log
#include <cstdint>  // uint32_t, uint64_t

// https://github.com/RedSpah/xxhash_cpp
#include "xxhash.hpp"

// https://github.com/imneme/pcg-cpp
#include "pcg_random.hpp"

// https://github.com/snsinfu/cxx-ziggurat
#include "ziggurat.hpp"

void stats(const std::vector<double>& out) {
  double mean{0.0};
  for(size_t i=0; i<out.size(); ++i) {
    mean += out[i];
  }
  mean /= out.size();
  std::cout << "mean " << mean << std::endl;
}

int main(void) {
  size_t n {100000};
  std::vector<int> in1(n);
  std::vector<double> in2(n);
  std::vector<double> out(n);

  std::mt19937 gen(42);
  std::exponential_distribution<> d(20);
  for(size_t i=0; i<n; ++i) {
    in1[i] = i;
    in2[i] = d(gen) + d(gen);
  }

  {
    std::cout << "JME current (std::mt19937 + std::normal_distribution)" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    std::normal_distribution<> smear(1, 0.1);
    for(size_t i=0; i<n; ++i) {
      uint32_t seed = in1[i] << 20;
      seed |= reinterpret_cast<uint64_t&>(in2[i]) & 0xFFFFFFFF;
      gen.seed(seed);
      out[i] = smear(gen);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "std::normal_distribution (marsaglia, probably)" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      // if not a temporary, it sneakily reuses the spare value
      out[i] = std::normal_distribution<>(1, 0.1)(g);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Marsaglia polar method" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    float u, v, s;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      do {
        // cheap but wrong-ish https://www.pcg-random.org/using-pcg-c.html#generating-doubles
        u = std::ldexp(g(), -31) - 1;
        v = std::ldexp(g(), -31) - 1;
        s = u*u + v*v;
      } while ( s>= 1.0 || s == 0.0 );
      s = std::sqrt(-2.0 * std::log(s) / s);
      out[i] = 1 + 0.1*u*s;
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Ziggurat" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    cxx::ziggurat_normal_distribution<double> smear{1.0, 0.1};
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      out[i] = smear(g);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Just the generator" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      out[i] = std::ldexp(g(), -32)*0.1 + 1;
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }
  return 0;
}

on my machine I get

JME current (std::mt19937 + std::normal_distribution)
Ran in 973ns/entry
mean 0.99971
std::normal_distribution (marsaglia, probably)
Ran in 39ns/entry
mean 0.999821
Marsaglia polar method
Ran in 34ns/entry
mean 0.999502
Ziggurat
Ran in 18ns/entry
mean 1.00048
Just the generator
Ran in 9ns/entry
mean 1.04997

I think in general it's important to use a proper hash function such as xxhash to seed any generator; see https://blog.unity.com/technology/a-primer-on-repeatable-random-numbers for some discussion of why. Then it's a matter of finding a reasonably fast, reproducible normal-distribution generator. Technically, the exact algorithm of std::normal_distribution is implementation-defined, so it may differ from compiler to compiler, which is why we might want to roll our own. One example is shown above.

The idea would then be to define a new HashPRNG node type that takes a collection of inputs (entropy sources, effectively) and returns an output random number from a configurable distribution. This could then be used with Transform nodes or with CompoundCorrection to smear a value as appropriate.
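Condensed from the full example in the next comment, the rule of such a Transform node would look roughly like this (field names taken from that example):

rule = {
    "nodetype": "hashprng",
    # entropy sources, drawn from the correction's declared inputs
    "inputs": ["JetPt", "JetEta", "Rho", "EventID"],
    # distribution to sample from
    "distribution": "normal",
}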

@kirschen (Contributor) commented

Hi Nick, thanks for the follow-up. Would be great to have this going forward!

@nsmith- (Collaborator, Author) commented Aug 8, 2022

Example of a JER smearing correction with the new HashPRNG node:

import gzip

import numpy as np  # used in the application snippet further below
from correctionlib.schemav2 import CompoundCorrection, Correction, CorrectionSet

with gzip.open("../jsonpog-integration/POG/JME/2017_UL/jet_jerc.json.gz") as fin:
    cset = CorrectionSet.parse_raw(fin.read())

cset.corrections = [
    c for c in cset.corrections
    if c.name in (
        "Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
        "Summer19UL17_V5_MC_L2Relative_AK4PFchs",
        "Summer19UL17_JRV2_MC_PtResolution_AK4PFchs",
        "Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs",
    )
]
cset.compound_corrections = []

res = Correction.parse_obj(
    {
        "name": "JERSmear",
        "description": "Jet smearing tool",
        "inputs": [
            {"name": "JetPt", "type": "real"},
            {"name": "JetEta", "type": "real"},
            {"name": "GenPt", "type": "real", "description": "matched GenJet pt, or -1 if no match"},
            {"name": "Rho", "type": "real", "description": "entropy source"},
            {"name": "EventID", "type": "int", "description": "entropy source"},
            {"name": "JER", "type": "real", "description": "Jet energy resolution"},
            {"name": "JERsf", "type": "real", "description": "Jet energy resolution scale factor"},
        ],
        "output": {"name": "smear", "type": "real"},
        "version": 1,
        "data": {
            "nodetype": "binning",
            "input": "GenPt",
            "edges": [-1, 0, 1],
            "flow": "clamp",
            "content": [
                # stochastic
                {
                    # rewrite gen_pt with a random gaussian
                    "nodetype": "transform",
                    "input": "GenPt",
                    "rule": {
                        "nodetype": "hashprng",
                        "inputs": ["JetPt", "JetEta", "Rho", "EventID"],
                        "distribution": "normal",
                    },
                    "content": {
                        "nodetype": "formula",
                        # TODO min jet pt?
                        "expression": "1+sqrt(max(x*x - 1, 0)) * y * z",
                        "parser": "TFormula",
                        # now gen_pt is actually the output of hashprng
                        "variables": ["JERsf", "JER", "GenPt"],
                    },
                },
                # deterministic
                {
                    "nodetype": "formula",
                    # TODO min jet pt?
                    "expression": "1+(x-1)*(y-z)/y",
                    "parser": "TFormula",
                    "variables": ["JERsf", "JetPt", "GenPt"],
                },
            ],
        },
    }
)
cset.corrections.append(res)

stack = CompoundCorrection.parse_obj(
    {
        "name": "JEC",
        "output": {"name": "sf", "type": "real"},
        "inputs": [
            {"name": "JetPt", "type": "real"},
            {"name": "JetEta", "type": "real"},
            {"name": "JetA", "type": "real"},
            {"name": "Rho", "type": "real"},
        ],
        "inputs_update": ["JetPt"],
        "input_op": "*",
        "output_op": "*",
        "stack": [
            "Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
            "Summer19UL17_V5_MC_L2Relative_AK4PFchs",
        ],
    }
)

cset.compound_corrections.append(stack)
ceval = cset.to_evaluator()

with gzip.open("jectest.json.gz", "wt") as fout:
    fout.write(cset.json(exclude_unset=True))
    
import rich
rich.print(cset)

Application of a complete JEC would then be something like

jec = ceval.compound["JEC"].evaluate(jets.pt_raw, jets.eta, jets.area, jets.event_rho)
pt_jec = jets.pt_raw * jec
jer = ceval["Summer19UL17_JRV2_MC_PtResolution_AK4PFchs"].evaluate(jets.eta, pt_jec, jets.event_rho)
ptgen = np.where(np.abs(pt_jec - jets.pt_gen) < 3*pt_jec*jer, jets.pt_gen, -1.0)
jersf = ceval["Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs"].evaluate(jets.eta, "nom")
jersmear = ceval["JERSmear"].evaluate(pt_jec, jets.eta, ptgen, jets.event_rho, jets.event_id, jer, jersf)
pt_final = pt_jec * jersmear
