Make it possible to read only a subset of available collections in readers (#504)

* Add test case for checking whether reading limited set works

* Also check auxiliary information for RNTuple

* Move test function into header and template it

* Make ROOTReader capable of only reading a subset

* Improve error message on failing tests

* Add selective collection reading to RNTupleReader

* Add selective collection reading to SIO

* Enable selective reading in Reader interface

* Make sure that passing non-existent names doesn't break

* Add simple test for reusing collection names

It is possible that users want to "ignore" a certain collection when
reading files. It should then still be possible to add a new collection
with the same name to the event that has been created without breaking
anything.

* Replace a pair with a struct

Facilitates the usage of projections with range algorithms

* Throw an exception instead of ignoring non-existent names

* Remove obsolete checks

* Update docstrings to inform about the exception

* Add basic checks for roundtripping

* Add content checks for roundtripped files with dropped collections

* Add include guards

* Improve formatting

* Don't run tests that miss inputs for sanitizers

* Remove double return

* Bump minimum ROOT version for RNTuple support

* Remove no longer necessary special casing

* Add possibility to select collections to Python API

* Update readme to reflect new ROOT requirements

* Add possibility to limit collections to RDataSource
tmadlener authored Feb 17, 2025
1 parent c06253f commit cc36b39
Showing 37 changed files with 615 additions and 189 deletions.
4 changes: 3 additions & 1 deletion CMakeLists.txt
@@ -82,13 +82,15 @@ option(ENABLE_JULIA "Enable Julia support. When enabled, Julia datamodels w
list(APPEND CMAKE_PREFIX_PATH $ENV{ROOTSYS})
set(CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cmake)
set(root_components_needed RIO Tree)
set(root_min_version 6.28.04)
if(ENABLE_RNTUPLE)
list(APPEND root_components_needed ROOTNTuple)
set(root_min_version 6.32)
endif()
if(ENABLE_DATASOURCE)
list(APPEND root_components_needed ROOTDataFrame)
endif()
find_package(ROOT 6.28.04 REQUIRED COMPONENTS ${root_components_needed})
find_package(ROOT ${root_min_version} REQUIRED COMPONENTS ${root_components_needed})

# ROOT_CXX_STANDARD was introduced in https://github.com/root-project/root/pull/6466
# before that it's an empty variable so we check if it's any number > 0
4 changes: 3 additions & 1 deletion README.md
@@ -18,12 +18,14 @@ use a recent LCG or Key4hep stack release.

On Mac OS or Ubuntu, you need to install the following software.

### ROOT 6.28.04
### ROOT 6.28.04 (6.32 for RNTuple support)

Install ROOT 6.28.04 (or later) built with c++20 support and set up your ROOT environment:

source <root_path>/bin/thisroot.sh

If you want to build with RNTuple support, you will need 6.32 at least.

### Catch2 v3 (optional)

Podio uses [Catch2 v3](https://github.com/catchorg/Catch2/tree/devel) for some
30 changes: 25 additions & 5 deletions include/podio/DataSource.h
@@ -23,12 +23,25 @@ class DataSource : public ROOT::RDF::RDataSource {
///
/// @brief Construct the podio::DataSource from the provided file.
///
explicit DataSource(const std::string& filePath, int nEvents = -1);
/// @param filePath Path to the file that should be read
/// @param nEvents Number of events to process (optional, defaults to -1 for
/// all events)
/// @param collsToRead The collections that should be made available (optional,
/// defaults to empty vector for all collections)
///
explicit DataSource(const std::string& filePath, int nEvents = -1, const std::vector<std::string>& collsToRead = {});

///
/// @brief Construct the podio::DataSource from the provided file list.
///
explicit DataSource(const std::vector<std::string>& filePathList, int nEvents = -1);
/// @param filePathList Paths to the files that should be read
/// @param nEvents Number of events to process (optional, defaults to -1 for
/// all events)
/// @param collsToRead The collections that should be made available (optional,
/// defaults to empty vector for all collections)
///
explicit DataSource(const std::vector<std::string>& filePathList, int nEvents = -1,
const std::vector<std::string>& collsToRead = {});

///
/// @brief Inform the podio::DataSource of the desired level of parallelism.
@@ -139,26 +152,33 @@ class DataSource : public ROOT::RDF::RDataSource {
///
/// @param[in] nEvents Number of events.
///
void SetupInput(int nEvents);
void SetupInput(int nEvents, const std::vector<std::string>& collsToRead);
};

///
/// @brief Create RDataFrame from multiple Podio files.
///
/// @param[in] filePathList List of file paths from which the RDataFrame
/// will be created.
/// @param[in] collsToRead List of collection names that should be made
/// available
///
/// @return RDataFrame created from input file list.
///
ROOT::RDataFrame CreateDataFrame(const std::vector<std::string>& filePathList);
ROOT::RDataFrame CreateDataFrame(const std::vector<std::string>& filePathList,
const std::vector<std::string>& collsToRead = {});

///
/// @brief Create RDataFrame from a Podio file or glob pattern matching multiple Podio files.
///
/// @param[in] filePath File path from which the RDataFrame will be created.
/// The file path can include glob patterns to match multiple files.
/// @param[in] collsToRead List of collection names that should be made
/// available
///
/// @return RDataFrame created from input file list.
///
ROOT::RDataFrame CreateDataFrame(const std::string& filePath);
ROOT::RDataFrame CreateDataFrame(const std::string& filePath, const std::vector<std::string>& collsToRead = {});
} // namespace podio

#endif /* PODIO_DATASOURCE_H */
24 changes: 15 additions & 9 deletions include/podio/RNTupleReader.h
@@ -12,17 +12,11 @@
#include <vector>

#include <ROOT/RNTuple.hxx>
#include <ROOT/RNTupleReader.hxx>
#include <RVersion.h>
#if ROOT_VERSION_CODE >= ROOT_VERSION(6, 31, 0)
#include <ROOT/RNTupleReader.hxx>
#endif

namespace podio {

/**
This class has the function to read available data from disk
and to prepare collections and buffers.
**/
/// The RNTupleReader can be used to read files that have been written with the
/// RNTuple backend.
///
@@ -61,20 +55,32 @@ class RNTupleReader {
/// Read the next data entry for a given category.
///
/// @param name The category name for which to read the next entry
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns FrameData from which a podio::Frame can be constructed if the
/// category exists and if there are still entries left to read.
/// Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string& name);
///
/// @throws std::invalid_argument in case collsToRead contains collection
/// names that are not available
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string& name,
const std::vector<std::string>& collsToRead = {});

/// Read the desired data entry for a given category.
///
/// @param name The category name for which to read the next entry
/// @param entry The entry number to read
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns FrameData from which a podio::Frame can be constructed if the
/// category and the desired entry exist. Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string& name, const unsigned entry);
///
/// @throws std::invalid_argument in case collsToRead contains collection
/// names that are not available
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string& name, const unsigned entry,
const std::vector<std::string>& collsToRead = {});

/// Get the names of all the available Frame categories in the current file(s).
///
4 changes: 1 addition & 3 deletions include/podio/RNTupleWriter.h
@@ -9,9 +7,7 @@
#include "TFile.h"
#include <ROOT/RNTuple.hxx>
#include <ROOT/RNTupleModel.hxx>
#if ROOT_VERSION_CODE >= ROOT_VERSION(6, 31, 0)
#include <ROOT/RNTupleWriter.hxx>
#endif
#include <ROOT/RNTupleWriter.hxx>

#include <string>
#include <unordered_map>
11 changes: 7 additions & 4 deletions include/podio/ROOTLegacyReader.h
@@ -75,20 +75,23 @@ class ROOTLegacyReader {
/// Read the next data entry from which a Frame can be constructed.
///
/// @note the category name has to be "events" in this case, as only that
/// category is available for legacy files.
/// category is available for legacy files. Also the collections to read
/// argument will be ignored.
///
/// @returns FrameData from which a podio::Frame can be constructed if there
/// are still entries left to read. Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string&);
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string&, const std::vector<std::string>& = {});

/// Read the desired data entry from which a Frame can be constructed.
///
/// @note the category name has to be "events" in this case, as only that
/// category is available for legacy files.
/// category is available for legacy files. Also the collections to read
/// argument will be ignored.
///
/// @returns FrameData from which a podio::Frame can be constructed if the
/// desired entry exists. Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string&, const unsigned entry);
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string&, const unsigned entry,
const std::vector<std::string>& = {});

/// Get the number of entries for the given name
///
35 changes: 26 additions & 9 deletions include/podio/ROOTReader.h
@@ -28,6 +28,10 @@ namespace detail {
// vector
using CollectionInfo = std::tuple<std::string, bool, SchemaVersionT, size_t>;

struct NamedCollInfo {
std::string name{};
CollectionInfo info{};
};
} // namespace detail

class CollectionBase;
@@ -74,20 +78,32 @@ class ROOTReader {
/// Read the next data entry for a given category.
///
/// @param name The category name for which to read the next entry
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns FrameData from which a podio::Frame can be constructed if the
/// category exists and if there are still entries left to read.
/// Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string& name);
///
/// @throws std::invalid_argument in case collsToRead contains collection
/// names that are not available
std::unique_ptr<podio::ROOTFrameData> readNextEntry(const std::string& name,
const std::vector<std::string>& collsToRead = {});

/// Read the desired data entry for a given category.
///
/// @param name The category name for which to read the next entry
/// @param entry The entry number to read
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns FrameData from which a podio::Frame can be constructed if the
/// category and the desired entry exist. Otherwise a nullptr
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string& name, const unsigned entry);
///
/// @throws std::invalid_argument in case collsToRead contains collection
/// names that are not available
std::unique_ptr<podio::ROOTFrameData> readEntry(const std::string& name, const unsigned entry,
const std::vector<std::string>& collsToRead = {});

/// Get the number of entries for the given name
///
@@ -146,12 +162,12 @@ class ROOTReader {
/// constructor from chain for more convenient map insertion
CategoryInfo(std::unique_ptr<TChain>&& c) : chain(std::move(c)) {
}
std::unique_ptr<TChain> chain{nullptr}; ///< The TChain with the data
unsigned entry{0}; ///< The next entry to read
std::vector<std::pair<std::string, detail::CollectionInfo>> storedClasses{}; ///< The stored collections in this
///< category
std::vector<root_utils::CollectionBranches> branches{}; ///< The branches for this category
std::shared_ptr<CollectionIDTable> table{nullptr}; ///< The collection ID table for this category
std::unique_ptr<TChain> chain{nullptr}; ///< The TChain with the data
unsigned entry{0}; ///< The next entry to read
std::vector<detail::NamedCollInfo> storedClasses{}; ///< The stored collections in this
///< category
std::vector<root_utils::CollectionBranches> branches{}; ///< The branches for this category
std::shared_ptr<CollectionIDTable> table{nullptr}; ///< The collection ID table for this category
};

/// Initialize the passed CategoryInfo by setting up the necessary branches,
@@ -174,7 +190,8 @@ class ROOTReader {
/// Read the data entry specified in the passed CategoryInfo, and increase the
/// counter afterwards. In case the requested entry is larger than the
/// available number of entries, return a nullptr.
std::unique_ptr<podio::ROOTFrameData> readEntry(ROOTReader::CategoryInfo& catInfo);
std::unique_ptr<podio::ROOTFrameData> readEntry(ROOTReader::CategoryInfo& catInfo,
const std::vector<std::string>& collsToRead);

/// Get / read the buffers at index iColl in the passed category information
podio::CollectionReadBuffers getCollectionBuffers(CategoryInfo& catInfo, size_t iColl, bool reloadBranches,
38 changes: 24 additions & 14 deletions include/podio/Reader.h
@@ -22,8 +22,8 @@ class Reader {
struct ReaderConcept {
virtual ~ReaderConcept() = default;

virtual podio::Frame readNextFrame(const std::string& name) = 0;
virtual podio::Frame readFrame(const std::string& name, size_t index) = 0;
virtual podio::Frame readNextFrame(const std::string& name, const std::vector<std::string>&) = 0;
virtual podio::Frame readFrame(const std::string& name, size_t index, const std::vector<std::string>&) = 0;
virtual size_t getEntries(const std::string& name) const = 0;
virtual podio::version::Version currentFileVersion() const = 0;
virtual std::optional<podio::version::Version> currentFileVersion(const std::string& name) const = 0;
@@ -44,16 +44,17 @@ class Reader {

~ReaderModel() = default;

podio::Frame readNextFrame(const std::string& name) override {
auto maybeFrame = m_reader->readNextEntry(name);
podio::Frame readNextFrame(const std::string& name, const std::vector<std::string>& collsToRead) override {
auto maybeFrame = m_reader->readNextEntry(name, collsToRead);
if (maybeFrame) {
return maybeFrame;
}
throw std::runtime_error("Failed reading category " + name + " (reading beyond bounds?)");
}

podio::Frame readFrame(const std::string& name, size_t index) override {
auto maybeFrame = m_reader->readEntry(name, index);
podio::Frame readFrame(const std::string& name, size_t index,
const std::vector<std::string>& collsToRead) override {
auto maybeFrame = m_reader->readEntry(name, index, collsToRead);
if (maybeFrame) {
return maybeFrame;
}
@@ -105,46 +106,55 @@ class Reader {
/// Read the next frame of a given category
///
/// @param name The category name for which to read the next frame
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns A fully constructed Frame with the contents read from file
///
/// @throws std::invalid_argument in case the category is not available or in
/// case no more entries are available
podio::Frame readNextFrame(const std::string& name) {
return m_self->readNextFrame(name);
podio::Frame readNextFrame(const std::string& name, const std::vector<std::string>& collsToRead = {}) {
return m_self->readNextFrame(name, collsToRead);
}

/// Read the next frame of the "events" category
///
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns A fully constructed Frame with the contents read from file
///
/// @throws std::invalid_argument in case no (more) events are available
podio::Frame readNextEvent() {
return readNextFrame(podio::Category::Event);
podio::Frame readNextEvent(const std::vector<std::string>& collsToRead = {}) {
return readNextFrame(podio::Category::Event, collsToRead);
}

/// Read a specific frame for a given category
///
/// @param name The category name for which to read the next entry
/// @param index The entry number to read
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns A fully constructed Frame with the contents read from file
///
/// @throws std::invalid_argument in case the category is not available or in
/// case the specified entry is not available
podio::Frame readFrame(const std::string& name, size_t index) {
return m_self->readFrame(name, index);
podio::Frame readFrame(const std::string& name, size_t index, const std::vector<std::string>& collsToRead = {}) {
return m_self->readFrame(name, index, collsToRead);
}

/// Read a specific frame of the "events" category
///
/// @param index The event number to read
/// @param collsToRead (optional) the collection names that should be read. If
/// not provided (or empty) all collections will be read
///
/// @returns A fully constructed Frame with the contents read from file
///
/// @throws std::invalid_argument in case the desired event is not available
podio::Frame readEvent(size_t index) {
return readFrame(podio::Category::Event, index);
podio::Frame readEvent(size_t index, const std::vector<std::string>& collsToRead = {}) {
return readFrame(podio::Category::Event, index, collsToRead);
}

/// Get the number of entries for the given name
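
The way the `Reader` interface forwards `collsToRead` through its type-erased `ReaderConcept` / `ReaderModel` pair can be sketched in isolation. The following is a simplified illustration rather than the podio code: `DummyBackend` and `MiniReader` are hypothetical, and the functions return the list of collection names instead of a `podio::Frame`. As in the diff above, only the public wrapper carries the `= {}` default, while the virtual function takes the argument without one.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backend standing in for a concrete reader implementation.
// It returns the names it would read instead of actual frame data.
struct DummyBackend {
  std::vector<std::string> readNextEntry(const std::string&, const std::vector<std::string>& collsToRead = {}) {
    return collsToRead.empty() ? std::vector<std::string>{"hits", "tracks"} : collsToRead;
  }
};

// Hypothetical, reduced version of a type-erased reader
class MiniReader {
  struct Concept {
    virtual ~Concept() = default;
    // No default argument on the virtual function
    virtual std::vector<std::string> readNextFrame(const std::string& name, const std::vector<std::string>&) = 0;
  };

  template <typename T>
  struct Model final : Concept {
    explicit Model(std::unique_ptr<T> r) : m_reader(std::move(r)) {
      // A null backend would be a programming error
      assert(m_reader != nullptr);
    }
    std::vector<std::string> readNextFrame(const std::string& name,
                                           const std::vector<std::string>& collsToRead) override {
      // Forward the selection unchanged to the wrapped backend
      return m_reader->readNextEntry(name, collsToRead);
    }
    std::unique_ptr<T> m_reader;
  };

  std::unique_ptr<Concept> m_self;

public:
  template <typename T>
  explicit MiniReader(std::unique_ptr<T> r) : m_self(std::make_unique<Model<T>>(std::move(r))) {
  }

  // The default lives only here, so existing one-argument calls keep working
  std::vector<std::string> readNextFrame(const std::string& name, const std::vector<std::string>& collsToRead = {}) {
    return m_self->readNextFrame(name, collsToRead);
  }
};
```

Keeping the default argument on the non-virtual wrapper only avoids the usual pitfall of default arguments on virtual functions, which are resolved from the static type of the call.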