PROV-IO is an I/O-centric provenance management framework for scientific data and workflows. It provides an interface for tracking data provenance and stores provenance as RDF triples. The PROV-IO data model extends W3C PROV-DM. PROV-IO has been integrated with the HDF5 vol-provenance connector to track the data provenance of HDF5 applications. PROV-IO has been tested on Ubuntu 18.04 and Cray Linux.
Please cite the following papers if you find our work useful:
PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems (TPDS'24)
PROV-IO: An I/O-Centric Provenance Framework for Scientific Data on HPC Systems (HPDC'22) [Bibtex]
Other publications:
Towards A Practical Provenance Framework for Scientific Data on HPC Systems (poster@FAST'22)
The easiest way to try out PROV-IO is through Docker. The PROV-IO Docker image is available at rzhan/prov-io. The image is based on Debian 11 with Python 3.9 installed. Download the basic PROV-IO Docker image:
docker pull rzhan/prov-io:1.0
We also publish a Docker image of Megatron-LM instrumented with PROV-IO as an example use case. Download the instrumented Megatron-LM Docker image:
docker pull rzhan/prov-io:megatron-lm
This section is for building PROV-IO from scratch.
The PROV-IO library needs to be built with libtool. Install it, along with the other build tools, by:
sudo apt-get install -y gcc make
sudo apt-get install -y autoconf automake libtool pkg-config gtk-doc-tools
PROV-IO's RDF schema is based on Redland librdf (including raptor2-2.0.15, rasqal-0.9.33, librdf-1.0.17) and its Python binding (redland-bindings-1.0.17.1). Install their dependencies first:
sudo apt-get install -y libltdl-dev libxml2 libxml2-dev flex bison swig uuid uuid-dev
We provide specific releases of librdf at: https://github.com/hpc-io/prov-io/tree/master/packages. Unzip and install them in this order: raptor2-2.0.15 -> rasqal-0.9.33 -> librdf-1.0.17 -> redland-bindings-1.0.17.1.
For example, install raptor2-2.0.15 and export its pkg-config path:
cd raptor2-2.0.15
./autogen.sh
./configure --prefix=<your_prov_io_path>/lib/lib-raptor
make && make install
export PKG_CONFIG_PATH=<your_prov_io_path>/lib/lib-raptor/lib/pkgconfig:$PKG_CONFIG_PATH
Then install rasqal-0.9.33 and librdf-1.0.17 using similar commands, giving each its own install path.
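The repeated build steps can be wrapped in a small helper. This is only a sketch: the `lib-rasqal` and `lib-rdf` prefix names and the `PROVIO_HOME` variable are assumptions modeled on the raptor2 example above, not names defined by PROV-IO, so adjust them to your layout.

```shell
#!/usr/bin/env bash
# Sketch: build the Redland packages in dependency order.
# The lib-rasqal and lib-rdf prefix names are assumptions that mirror the
# lib-raptor prefix used in the raptor2 example.

# Map a package directory name to an install prefix under the PROV-IO root.
redland_prefix() {
    local pkg="$1" root="$2"
    case "$pkg" in
        raptor2-*) echo "$root/lib/lib-raptor" ;;
        rasqal-*)  echo "$root/lib/lib-rasqal" ;;   # assumed name
        librdf-*)  echo "$root/lib/lib-rdf" ;;      # assumed name
        *)         return 1 ;;
    esac
}

# Configure, build, and install one package, then add its pkg-config
# directory to PKG_CONFIG_PATH so the next package can find it.
build_redland_pkg() {
    local pkg="$1" root="$2" prefix
    prefix="$(redland_prefix "$pkg" "$root")" || return 1
    ( cd "$pkg" && ./autogen.sh && ./configure --prefix="$prefix" \
        && make && make install ) || return 1
    export PKG_CONFIG_PATH="$prefix/lib/pkgconfig:$PKG_CONFIG_PATH"
}

# Example (run from the directory that holds the unpacked packages;
# PROVIO_HOME is a hypothetical variable holding <your_prov_io_path>):
#   for pkg in raptor2-2.0.15 rasqal-0.9.33 librdf-1.0.17; do
#       build_redland_pkg "$pkg" "$PROVIO_HOME" || break
#   done
```

Building inside a subshell keeps the caller's working directory unchanged, and stopping at the first failure avoids configuring a later package against a missing dependency.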
Finally, install the Python binding (redland-bindings-1.0.17.1):
./autogen.sh
./configure --with-python
make && make install
The PROV-IO Python library tracks provenance information defined in the PROV-IO Extensible class. Follow the instructions in /python to use it.
The PROV-IO C library tracks low-level I/O information. Build the PROV-IO C library and export its path:
cd c/provio
make
export LD_LIBRARY_PATH=<your_prov_io_path>/c/provio:$LD_LIBRARY_PATH
To run a basic PROV-IO test, in the same directory:
export PROVIO_CONFIG=<your_prov_io_path>/doc/example_config/prov.cfg
./provio_test
Check out the provenance file (prov.turtle) and stat file (prov.stat) generated by PROV-IO.
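A quick way to confirm that the test produced both files is a small check like the one below. It is a minimal sketch that assumes the files land in the directory you pass in; the actual output location is controlled by the configuration file, so point it wherever your prov.cfg writes them. The function name `check_provio_outputs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify that prov.turtle and prov.stat exist and are non-empty.
# The directory argument defaults to the current directory (an assumption;
# the real location depends on your PROV-IO configuration).
check_provio_outputs() {
    local dir="${1:-.}" f status=0
    for f in prov.turtle prov.stat; do
        if [ -s "$dir/$f" ]; then
            echo "ok: $f ($(wc -l < "$dir/$f") lines)"
        else
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
#   ./provio_test && check_provio_outputs .
```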
PROV-IO HDF5 Lib Connector is used to track HDF5 I/O. Follow instructions to build it:
- Build and install HDF5 (provided in the repo: https://github.com/hpc-io/prov-io/tree/master/packages).
- Make sure HDF5 is correctly installed and its path is correctly exported, then build the HDF5 vol-connector (dynamic library) instrumented with PROV-IO:
cd c/vol-provenance
make
- To redirect HDF5 I/Os to provenance vol-connector, set these environment variables:
export HDF5_VOL_CONNECTOR="provenance under_vol=0;under_info={};path=<trace_file_path>/my_trace.log;level=2;format="
export HDF5_PLUGIN_PATH=<hdf5_vol_connector_path>
Note: HDF5_VOL_CONNECTOR contains the original provenance file (plain text) configuration of the HDF5 provenance vol-connector. The PROV-IO configuration lives in its own configuration file, located at $PROVIO_CONFIG. <hdf5_vol_connector_path> is the path that holds libh5prov.so.
- Run a testcase application (VPIC) under the same directory:
./vpicio_uni_h5.exe ./my_data.dat 2 2 1 ./my_trace.log
You may compare the default plain text provenance file generated by vol-connector with the RDF provenance file generated by PROV-IO.
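The environment setup from the steps above can be guarded with a check that the connector library is actually present before HDF5 is asked to load it. This is a sketch only: `setup_provio_vol` is a hypothetical helper, and the connector string simply reuses the values shown above with the trace path substituted in.

```shell
#!/usr/bin/env bash
# Sketch: export the HDF5 VOL variables only if libh5prov.so is present.
# Arguments: <hdf5_vol_connector_path> <trace file path>
setup_provio_vol() {
    local plugin_dir="$1" trace_file="$2"
    if [ ! -f "$plugin_dir/libh5prov.so" ]; then
        echo "libh5prov.so not found in $plugin_dir" >&2
        return 1
    fi
    export HDF5_PLUGIN_PATH="$plugin_dir"
    # Same connector settings as in the instructions above.
    export HDF5_VOL_CONNECTOR="provenance under_vol=0;under_info={};path=${trace_file};level=2;format="
}

# Example (run from c/vol-provenance after `make`):
#   setup_provio_vol "$PWD" "$PWD/my_trace.log" \
#       && ./vpicio_uni_h5.exe ./my_data.dat 2 2 1 ./my_trace.log
```

Failing early with a clear message is easier to debug than an HDF5 plugin-loading error at application start.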
More provenance traces collected from the real workflows described in the paper are provided at: https://github.com/hpc-io/prov-io/tree/master/example_provenance.
The PROV-IO syscall wrapper is developed to track high-frequency POSIX I/O APIs such as open, write, fsync, etc. It is based on LLNL's GOTCHA project.
The syscall wrapper is still under testing. Stay tuned!
Check out query engine at: https://github.com/hpc-io/prov-io/tree/master/user_engine/query.
Check out visualizer at: https://github.com/hpc-io/prov-io/tree/master/user_engine/visualizer. An example of visualized RDF provenance is also given.
If you run into issues when using PROV-IO, please email me at hanrz AT iastate DOT edu. I'm more than happy to help.