Add docs about dataframe tests
paulineribeyre committed Mar 12, 2024
1 parent b3894dd commit d370a69
Showing 2 changed files with 25 additions and 3 deletions.
5 changes: 4 additions & 1 deletion docs/run_tube_tests_locally.md
@@ -1,8 +1,11 @@
# Run Tube tests locally
# Run Tube integrated tests locally

> NOTE: Tube local tests currently cannot be configured on ARM-based Mac computers (M1 or M2 Macs) due to limitations with old Spark and Elasticsearch Docker container images. This document will be updated if and when they become available.
> This doc is for running integrated tests. See https://github.com/uc-cdis/tube/blob/master/tests/README.md
## Initial Setup

* Running Tube locally requires `python 3.7`, `postgresql`, `docker`, and `jq` to be installed on the local machine.
* It is recommended to create a Python virtual environment for all the required dependencies to avoid complications.
* Tube uses poetry for dependency management. Install poetry in the virtual env using pip, then install all the dependencies using the following two commands
23 changes: 21 additions & 2 deletions tests/README.md
@@ -2,6 +2,25 @@

The tests in directory `standalone_tests` can be run with pytest after installing the project, without additional setup.

The tests in directory `integrated_tests` require Spark and ElasticSearch to be running.
The tests in directory `integrated_tests` require Spark and Elasticsearch to be running. See [this doc](/docs/run_tube_tests_locally.md) for instructions on running these tests locally.

The tests in directory `dataframe_tests` require Spark to be running. These tests exercise Tube function by function and need the dataframes produced in every transformation step. These dataframes are created by running `python run_etl -c PreTest`. Each test case should have an `input_df` and an `output_df`. The test simply compares and ensures that, given the `input_df`, we get the expected `output_df`.
The tests in directory `dataframe_tests` require Spark to be running.

## Dataframe tests

The dataframe tests are used to test Tube function by function. These tests require dataframes to be produced in every transformation step. A Spark cluster must be running, since the tests submit the ETL mapping and the dataframes to the Spark cluster, where the transformation happens.
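
As a rough sketch (not Tube's actual test setup), a test session connected to a locally running standalone Spark cluster might be created as follows; the master URL and app name are assumptions:

```
from pyspark.sql import SparkSession

# Connect to a locally running standalone Spark cluster. The master URL is an
# assumption and depends on how the cluster was started.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("tube-dataframe-tests")
    .getOrCreate()
)
```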

Each test case should have an `input_df` and an `output_df`. When testing, simply compare and ensure that with the given `input_df`, we get the expected `output_df`.
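
For illustration only, a test following this pattern could look like the sketch below; `transform_step` is a placeholder for the Tube function under test, and the Parquet paths are assumptions:

```
from pyspark.sql import DataFrame, SparkSession


def transform_step(df: DataFrame) -> DataFrame:
    # Placeholder for the Tube transformation under test; a real test calls
    # the actual Tube function here.
    return df


def assert_dataframes_equal(expected_df: DataFrame, actual_df: DataFrame):
    # Same schema and same rows, ignoring row order.
    assert expected_df.schema == actual_df.schema
    assert sorted(expected_df.collect(), key=str) == sorted(actual_df.collect(), key=str)


def test_transformation_step():
    spark = SparkSession.builder.getOrCreate()  # or a session connected to the cluster as above
    input_df = spark.read.parquet("input_df.parquet")    # assumed paths to the
    output_df = spark.read.parquet("output_df.parquet")  # pre-generated dataframes
    assert_dataframes_equal(output_df, transform_step(input_df))
```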

Use `df.show(truncate=False)` to view the contents of a dataframe.

### How to generate test dataframes

We need to submit the original test data file to Sheepdog in a QA environment or a Gen3 instance running locally. Make sure to use the appropriate data dictionary; for example, use the MIDRC dictionary to run tests marked as `schema_midrc`.

**Note:** all existing Sheepdog data must be cleared from the database before the test data is submitted.
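
One way to submit the test data (a sketch only; the commons URL, credentials file, program/project names, and data file below are all assumptions) is with the `gen3` Python SDK:

```
import json

from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

# API key ("credentials.json") downloaded from the target QA or local commons.
auth = Gen3Auth(refresh_file="credentials.json")
submission = Gen3Submission("https://qa.example-commons.org", auth)

with open("test_data.json") as f:
    records = json.load(f)

# Submit the test records under a program/project that exists in that environment.
submission.submit_record("DEV", "test", records)
```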

When running the ETL, use the "PreTest" running mode (see modes [here](https://github.com/uc-cdis/tube/blob/cac298e/tube/enums.py#L1-L4) for reference) to export the intermediate result of each step to a Parquet file and store it in Hadoop HDFS:
```
python run_etl -c PreTest
```
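
As an illustration, an intermediate dataframe exported by a PreTest run can then be loaded back (for example as a test's `input_df` or `output_df`); the HDFS URL and file name below are assumptions that depend on the local Hadoop setup:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-pretest-output").getOrCreate()

# Assumed HDFS location; use the path where the PreTest run wrote the
# intermediate Parquet files.
df = spark.read.parquet("hdfs://localhost:9000/output/<step_name>.parquet")
df.show(truncate=False)
```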
