If you are coming from our EDBT Industrial submission, please check out the edbt
branch to access the following features, which are not merged into main yet:
- Discovery of association rules using ECLAT and FP-Growth algorithms adapted from Christian Borgelt’s implementations
- Discovery of conditional functional dependencies using the CTANE algorithm and its variations
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. The currently supported data patterns are:
- Functional dependencies, both exact and approximate
- Conditional functional dependencies
- Association rules
It also lets you run data cleaning scenarios using these algorithms. At the moment, we have implemented a typo detection scenario that uses exact and approximate functional dependency discovery algorithms.
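To make the link between functional dependencies and typos concrete, here is a minimal sketch. It is only an illustration of the idea, not how Desbordante implements the scenario. A dependency lhs -> rhs holds exactly when every lhs value is paired with a single rhs value; if it holds on all but a few rows, those rows are natural typo candidates. The function name `fd_violations` is hypothetical, and the sketch assumes a simple comma-separated file with a header row and no quoted fields.

```shell
# List the rows that violate the functional dependency lhs -> rhs.
# Usage: fd_violations FILE LHS_COLUMN RHS_COLUMN   (column numbers are 1-based)
fd_violations() {
  awk -F',' -v l="$2" -v r="$3" '
    NR == 1 { next }                      # skip the header row
    {
      key = $l
      if (!(key in first)) {              # first time this lhs value appears:
        first[key] = $r                   # remember the rhs value paired with it
        next
      }
      if ($r != first[key])               # a different rhs value breaks lhs -> rhs
        print NR ": " $0
    }
  ' "$1"
}
```

For a file with the rows `101,Paris`, `101,Paris`, `101,Pariss`, calling `fd_violations file.csv 1 2` reports the third data row. The sketch compares each row with the first value seen for a key, so a typo that happens to come first would flag the correct rows instead; a real detector would more likely compare against the most frequent value.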
Desbordante's algorithms are implemented in C++ to maximize performance. They can be run either from a console version or from a web application with an easy-to-use interface.
You can try the deployed version here. You have to register in order to process your own datasets. Keep in mind that, due to high demand, time and memory limits are enforced, and a task is killed if it exceeds them.
A brief introduction to the tool and its use cases is presented here (in Russian; an English version is in the works).
This project supports installation with and without a web application. To install it with the web application, you also need the dependencies listed for installation without it.
The following instructions were tested on Ubuntu 18.04.4 LTS.
Prior to cloning the repository and attempting to build the project, ensure that you have the following software:
- GNU g++ compiler, version 10+
- CMake, version 3.13+
- Boost library, version 1.72.0+
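The compiler and CMake requirements above can be checked before building with a short script such as the following. This is a sketch, not part of the project: the helper names `version_ge` and `check_tool` are hypothetical, `sort -V` assumes GNU coreutils (as shipped with Ubuntu), and the Boost version is not checked because its install location varies.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  # sort -V orders version strings; if the required version sorts first
  # (or both are equal), the found version is new enough.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print a one-line verdict for a tool, given its found and required versions.
check_tool() {
  name="$1"; found="$2"; required="$3"
  if version_ge "$found" "$required"; then
    echo "$name $found: ok"
  else
    echo "$name $found: need $required or newer"
  fi
}

# g++ prints its full version (for example 10.3.0) with these two flags combined.
if command -v g++ >/dev/null 2>&1; then
  check_tool "g++" "$(g++ -dumpfullversion -dumpversion)" "10"
fi

# The first line of `cmake --version` reads "cmake version X.Y.Z".
if command -v cmake >/dev/null 2>&1; then
  check_tool "cmake" "$(cmake --version | head -n 1 | awk '{print $3}')" "3.13"
fi
```

A tool that is not installed at all produces no line, so an empty result for g++ or cmake also means the prerequisite is missing.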
First, navigate to the desired directory. Then clone the repository, cd into the project directory, and launch the build script:
git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
./build.sh
The script generates the following file structure in /path/to/Desbordante/build/target:
├───input_data
│   └───some-sample-csvs.csv
├───Desbordante_test
├───Desbordante_run
The input_data directory contains several .csv files that may be used by Desbordante_test. Run Desbordante_test to perform unit testing:
cd build/target
./Desbordante_test
The tool itself is launched via the following line:
./Desbordante_run --algo=tane --data=<dataset_name>.csv
The <dataset_name>.csv, which is a user-provided dataset, should be placed in the /path/to/Desbordante/build/target directory.
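The two steps above (placing the dataset in the target directory, then launching the tool) can be combined in a small wrapper. This is a sketch under assumptions: the function name `run_tane` is hypothetical, the dataset is assumed to live outside the target directory, and only the `--algo` and `--data` options shown above are used.

```shell
# Copy a dataset next to the binary and run TANE on it.
# Usage: run_tane /path/to/dataset.csv [/path/to/Desbordante/build/target]
run_tane() {
  dataset="$1"
  target_dir="${2:-build/target}"   # default assumes you are in the project root

  [ -f "$dataset" ] || { echo "No such dataset: $dataset" >&2; return 1; }
  cp "$dataset" "$target_dir/" || return 1

  # Run from the target directory and pass only the file name,
  # matching the launch line shown above.
  (cd "$target_dir" && ./Desbordante_run --algo=tane --data="$(basename "$dataset")")
}
```

For example, `run_tane ~/data/my_table.csv` from the project root copies the file into build/target and runs the tool on it.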
The following instructions were tested on Windows 10.
Prior to cloning the repository and attempting to build the project, ensure that you have the following software:
- Microsoft Visual Studio 2019
- CMake, version 3.13+
- Boost library, version 1.65.1+
The recommended way to install Boost is with Chocolatey.
First, launch the command prompt and navigate to the desired directory. Then clone the repository, cd into the project directory, and launch the build script:
git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
git checkout windows-compatible
build.bat
Note: to compile the project, the script uses a hard-coded path to the MSVC developer command prompt, which is located by default at C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat. Change the path in the script if yours differs from the default.
The script generates the following file structure in \path\to\Desbordante\build\target:
├───input_data
│   └───some-sample-csvs.csv
├───Desbordante_test.exe
├───Desbordante_run.exe
The input_data directory contains several .csv files that may be used by Desbordante_test. Run Desbordante_test to perform unit testing:
cd build\target
Desbordante_test.exe
The tool itself is launched via the following line:
Desbordante_run.exe --algo=tane --data=<dataset_name>.csv
The <dataset_name>.csv, which is a user-provided dataset, should be placed in the \path\to\Desbordante\build\target directory.
Requires Docker and docker-compose.
git clone https://github.com/vs9h/Desbordante.git
cd Desbordante/
git checkout origin/web-app
./install_web.sh
- Modify the .env file in Desbordante/
- Set the following variables:
- POSTGRES_PASSWORD
- POSTGRES_USER
- POSTGRES_DB
- KAFKA_ADMIN_CLIENT_ID
- CONSUMER_TL_SEC
- CONSUMER_ML_MB
- HOST_SERVER_IP
- Create your Grafana user
sudo htpasswd -c grafana-users user1
docker-compose up --force-recreate
After the launch it will be available at http://localhost:3000/
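Before running docker-compose, it can help to confirm that every variable from the .env step is actually set. The following is a minimal sketch, not part of the project: the function name `check_env` is hypothetical, and it assumes the file uses plain `NAME=value` lines.

```shell
# Report which of the required variables are missing or empty in an .env file.
# Usage: check_env [path/to/.env]
check_env() {
  env_file="${1:-.env}"
  [ -f "$env_file" ] || { echo "No such file: $env_file"; return 1; }

  missing=""
  for name in POSTGRES_PASSWORD POSTGRES_USER POSTGRES_DB \
              KAFKA_ADMIN_CLIENT_ID CONSUMER_TL_SEC CONSUMER_ML_MB HOST_SERVER_IP; do
    # A variable counts as set when a line "NAME=<something>" exists.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      missing="$missing $name"
    fi
  done

  if [ -n "$missing" ]; then
    echo "Missing or empty:$missing"
    return 1
  fi
  echo "All variables are set"
}
```

Running `check_env` in Desbordante/ prints either the names that still need a value or a confirmation that all seven are present. It checks only that a value exists, not that the value is valid.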
Kirill Stupakov — Client side of the web application
Anton Chizhov — Server side of the web application
Alexandr Smirnov — DFD implementation
Ilya Shchuckin — FD_Mine implementation
Michael Polyntsov — FastFDs implementation
Ilya Vologin — core classes
Maxim Strutovsky — team lead, Pyro & TANE implementation
Nikita Bobrov — product owner, consult, papers
Kirill Smirnov — product owner, code quality, infrastructure, consult
George Chernishev — product owner, consult, papers
If you use this software for research, please cite the paper (https://fruct.org/publications/fruct29/files/Strut.pdf, https://ieeexplore.ieee.org/document/9435469) as follows:
M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.