
Provide jet-smearing (including random numbers) within correctionlib #130

Closed
nsmith- opened this issue on Apr 20, 2022 · Discussed in #123 · 5 comments · Fixed by #140
Labels
enhancement New feature or request

Comments

@nsmith- (Collaborator) commented Apr 20, 2022

Discussed in #123

Originally posted by kirschen February 3, 2022
Hi @nsmith- , all,

@lgray brought up the idea of going a step further in what is provided via correctionlib. One classic example where one would profit from harmonization is jet energy smearing; see e.g. https://twiki.cern.ch/twiki/bin/viewauth/CMS/JetResolution#Smearing_procedures
Depending on whether or not a genjet is matched (well enough) to a recojet, either the difference between genjet pt and recojet pt is scaled, or some stochastic smearing is applied. Ideally the result would be reproducible, e.g. by agreeing on a standard way to compute the random number seed (for instance requiring event/lumisection numbers as input).

Input would be:
recopt, recoeta, genpt, (info for random number seed)
Output:
smeared recopt

Would this still be doable and within the scope of correctionlib, @nsmith-?

Cheers,
Henning
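For reference, the two cases described above amount to the standard hybrid smearing recipe. Below is a minimal Python sketch (the function and argument names are placeholders for illustration, not the CMSSW or correctionlib implementation):

import numpy as np

def smear_factor(reco_pt, gen_pt, jer_sf, jer_res, rng):
    # Illustrative hybrid JER smearing factor (placeholder names):
    #   reco_pt: reconstructed jet pt (after JEC)
    #   gen_pt:  matched GenJet pt, or a negative value if no match
    #   jer_sf:  jet energy resolution scale factor
    #   jer_res: relative pt resolution (sigma_pt / pt)
    #   rng:     a deterministically seeded numpy Generator
    if gen_pt > 0:
        # matched: scale the reco-gen pt difference by the scale factor
        factor = 1.0 + (jer_sf - 1.0) * (reco_pt - gen_pt) / reco_pt
    else:
        # unmatched: stochastic smearing with width sqrt(max(sf^2 - 1, 0)) * resolution
        factor = 1.0 + rng.normal(0.0, jer_res) * np.sqrt(max(jer_sf**2 - 1.0, 0.0))
    return max(factor, 0.0)

The open question in the thread is precisely how to make rng deterministic and reproducible across frameworks.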

nsmith- added the enhancement label on Apr 20, 2022
@kirschen (Contributor) commented May 24, 2022

Following up on this: there is a reference implementation of the deterministic smearing at https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PatUtils/interface/SmearedJetProducerT.h#L207-L214
Could a call to the random number generator (including setting the deterministic seed) be included in the formula evaluator, or how should this proceed?
The same random number generator used in the CMSSW implementation is also available in numpy: https://numpy.org/doc/stable/reference/random/bit_generators/mt19937.html
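For illustration, deterministic per-jet seeding with numpy's MT19937 could look like the sketch below. The seed recipe here is invented for the example, and note that numpy scrambles integer seeds through a SeedSequence and draws normals with a different algorithm than libstdc++, so this does not reproduce the CMSSW stream bit for bit even though the engine is the same.

from numpy.random import Generator, MT19937

def per_jet_normal(run, lumi, event, jet_index, loc=1.0, scale=0.1):
    # Hypothetical per-jet seed built from event metadata
    # (illustration only, not the CMSSW seed formula)
    seed = (run << 40) ^ (lumi << 20) ^ event ^ jet_index
    rng = Generator(MT19937(seed))
    return rng.normal(loc, scale)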

@nsmith- (Collaborator, Author) commented Jun 1, 2022

I think yes, we could add a function to the formula evaluator to generate a deterministic random number. We could also introduce a new node type for this purpose, which might make it a bit more flexible.
But I think it would be nicer to move to something more performant, such as an integer hashing function followed by a Box-Muller transform.
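A pure-Python sketch of that direction, using splitmix64 as the integer hash and a Box-Muller transform (the names and the way the entropy words are combined are illustrative only, not what correctionlib ended up implementing):

import math

MASK64 = (1 << 64) - 1

def splitmix64(x):
    # splitmix64 finalizer: a cheap, well-mixed 64-bit integer hash
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def hashed_normal(*entropy, loc=1.0, scale=0.1):
    # Fold the entropy words into one hash, derive two uniforms, apply Box-Muller
    h = 0
    for word in entropy:
        h = splitmix64(h ^ (word & MASK64))
    u1 = (splitmix64(h) >> 11) / float(1 << 53)      # uniform in [0, 1)
    u2 = (splitmix64(h + 1) >> 11) / float(1 << 53)
    z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
    return loc + scale * z

The C++ benchmark in the comment below compares faster variants of the same idea (xxhash-seeded generators with different normal samplers).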

@nsmith- (Collaborator, Author) commented Jul 26, 2022

Running a small benchmark to see what we can do for per-object deterministic random normal distributions:

// compile with, e.g.
// g++ -Ixxhash/include -Ipcg-cpp/include -Icxx-ziggurat/include --std=c++17 -O2 -Wall hashtest.cpp -o hashtest
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
#include <cmath>    // std::ldexp, std::sqrt, std::log
#include <cstdint>  // uint32_t, uint64_t

// https://github.com/RedSpah/xxhash_cpp
#include "xxhash.hpp"

// https://github.com/imneme/pcg-cpp
#include "pcg_random.hpp"

// https://github.com/snsinfu/cxx-ziggurat
#include "ziggurat.hpp"

void stats(const std::vector<double>& out) {
  double mean{0.0};
  for(size_t i=0; i<out.size(); ++i) {
    mean += out[i];
  }
  mean /= out.size();
  std::cout << "mean " << mean << std::endl;
}

int main(void) {
  size_t n {100000};
  std::vector<int> in1(n);
  std::vector<double> in2(n);
  std::vector<double> out(n);

  std::mt19937 gen(42);
  std::exponential_distribution<> d(20);
  for(size_t i=0; i<n; ++i) {
    in1[i] = i;
    in2[i] = d(gen) + d(gen);
  }

  {
    std::cout << "JME current (std::mt19937 + std::normal_distribution)" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    std::normal_distribution<> smear(1, 0.1);
    for(size_t i=0; i<n; ++i) {
      uint32_t seed = in1[i] << 20;
      seed |= reinterpret_cast<uint64_t&>(in2[i]) & 0xFFFFFFFF;
      gen.seed(seed);
      out[i] = smear(gen);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "std::normal_distribution (marsaglia, probably)" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      // if not a temporary, it sneakily reuses the spare value
      out[i] = std::normal_distribution<>(1, 0.1)(g);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Marsaglia polar method" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    float u, v, s;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      do {
        // cheap but wrong-ish https://www.pcg-random.org/using-pcg-c.html#generating-doubles
        u = std::ldexp(g(), -31) - 1;
        v = std::ldexp(g(), -31) - 1;
        s = u*u + v*v;
      } while ( s>= 1.0 || s == 0.0 );
      s = std::sqrt(-2.0 * std::log(s) / s);
      out[i] = 1 + 0.1*u*s;
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Ziggurat" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    cxx::ziggurat_normal_distribution<double> smear{1.0, 0.1};
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      out[i] = smear(g);
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }

  {
    std::cout << "Just the generator" << std::endl;
    auto tic = std::chrono::high_resolution_clock::now();
    pcg32_oneseq g;
    for(size_t i=0; i<n; ++i) {
      g.seed(xxh::xxhash<64>({static_cast<uint64_t>(in1[i]), reinterpret_cast<uint64_t&>(in2[i])}));
      out[i] = std::ldexp(g(), -32)*0.1 + 1;
    }
    auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - tic);
    std::cout << "Ran in " << diff.count() / n << "ns/entry" << std::endl;
    stats(out);
  }
  return 0;
}

on my machine I get

JME current (std::mt19937 + std::normal_distribution)
Ran in 973ns/entry
mean 0.99971
std::normal_distribution (marsaglia, probably)
Ran in 39ns/entry
mean 0.999821
Marsaglia polar method
Ran in 34ns/entry
mean 0.999502
Ziggurat
Ran in 18ns/entry
mean 1.00048
Just the generator
Ran in 9ns/entry
mean 1.04997

I think in general it's important to use a proper hash function such as xxhash to seed any generator; see https://blog.unity.com/technology/a-primer-on-repeatable-random-numbers for some discussion of why. Then it's a matter of finding a reasonably fast, reproducible normal-distribution generator. Technically, the exact algorithm of std::normal_distribution is implementation-defined, so it may differ from compiler to compiler, which is why we might want to roll our own. One example is shown above.

The idea would then be to define a new HashPRNG node type that takes a collection of inputs (entropy sources, effectively) and returns an output random number from a configurable distribution. This could then be used with Transform nodes or with CompoundCorrection to smear a value as appropriate.
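Condensed from the full example in the next comment, the rule of such a Transform node would look roughly like this (field names taken from that example):

rule = {
    "nodetype": "hashprng",
    # entropy sources, drawn from the correction's declared inputs
    "inputs": ["JetPt", "JetEta", "Rho", "EventID"],
    # distribution to sample from
    "distribution": "normal",
}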

@kirschen (Contributor) commented

Hi Nick, thanks for the follow-up. Would be great to have this going forward!

@nsmith- (Collaborator, Author) commented Aug 8, 2022

Example of a JER smearing correction with the new HashPRNG node:

import gzip

import numpy as np  # used in the application snippet further below
from correctionlib.schemav2 import CompoundCorrection, Correction, CorrectionSet

with gzip.open("../jsonpog-integration/POG/JME/2017_UL/jet_jerc.json.gz") as fin:
    cset = CorrectionSet.parse_raw(fin.read())

cset.corrections = [
    c for c in cset.corrections
    if c.name in (
        "Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
        "Summer19UL17_V5_MC_L2Relative_AK4PFchs",
        "Summer19UL17_JRV2_MC_PtResolution_AK4PFchs",
        "Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs",
    )
]
cset.compound_corrections = []

res = Correction.parse_obj(
    {
        "name": "JERSmear",
        "description": "Jet smearing tool",
        "inputs": [
            {"name": "JetPt", "type": "real"},
            {"name": "JetEta", "type": "real"},
            {"name": "GenPt", "type": "real", "description": "matched GenJet pt, or -1 if no match"},
            {"name": "Rho", "type": "real", "description": "entropy source"},
            {"name": "EventID", "type": "int", "description": "entropy source"},
            {"name": "JER", "type": "real", "description": "Jet energy resolution"},
            {"name": "JERsf", "type": "real", "description": "Jet energy resolution scale factor"},
        ],
        "output": {"name": "smear", "type": "real"},
        "version": 1,
        "data": {
            "nodetype": "binning",
            "input": "GenPt",
            "edges": [-1, 0, 1],
            "flow": "clamp",
            "content": [
                # stochastic
                {
                    # rewrite gen_pt with a random gaussian
                    "nodetype": "transform",
                    "input": "GenPt",
                    "rule": {
                        "nodetype": "hashprng",
                        "inputs": ["JetPt", "JetEta", "Rho", "EventID"],
                        "distribution": "normal",
                    },
                    "content": {
                        "nodetype": "formula",
                        # TODO min jet pt?
                        "expression": "1+sqrt(max(x*x - 1, 0)) * y * z",
                        "parser": "TFormula",
                        # now gen_pt is actually the output of hashprng
                        "variables": ["JERsf", "JER", "GenPt"],
                    },
                },
                # deterministic
                {
                    "nodetype": "formula",
                    # TODO min jet pt?
                    "expression": "1+(x-1)*(y-z)/y",
                    "parser": "TFormula",
                    "variables": ["JERsf", "JetPt", "GenPt"],
                },
            ],
        },
    }
)
cset.corrections.append(res)

stack = CompoundCorrection.parse_obj(
    {
        "name": "JEC",
        "output": {"name": "sf", "type": "real"},
        "inputs": [
            {"name": "JetPt", "type": "real"},
            {"name": "JetEta", "type": "real"},
            {"name": "JetA", "type": "real"},
            {"name": "Rho", "type": "real"},
        ],
        "inputs_update": ["JetPt"],
        "input_op": "*",
        "output_op": "*",
        "stack": [
            "Summer19UL17_V5_MC_L1FastJet_AK4PFchs",
            "Summer19UL17_V5_MC_L2Relative_AK4PFchs",
        ],
    }
)

cset.compound_corrections.append(stack)
ceval = cset.to_evaluator()

with gzip.open("jectest.json.gz", "wt") as fout:
    fout.write(cset.json(exclude_unset=True))
    
import rich
rich.print(cset)

Application of a complete JEC would then be something like

jec = ceval.compound["JEC"].evaluate(jets.pt_raw, jets.eta, jets.area, jets.event_rho)
pt_jec = jets.pt_raw * jec
jer = ceval["Summer19UL17_JRV2_MC_PtResolution_AK4PFchs"].evaluate(jets.eta, pt_jec, jets.event_rho)
ptgen = np.where(np.abs(pt_jec - jets.pt_gen) < 3*pt_jec*jer, jets.pt_gen, -1.0)
jersf = ceval["Summer19UL17_JRV2_MC_ScaleFactor_AK4PFchs"].evaluate(jets.eta, "nom")
jersmear = ceval["JERSmear"].evaluate(pt_jec, jets.eta, ptgen, jets.event_rho, jets.event_id, jer, jersf)
pt_final = pt_jec * jersmear
