Skip to content

Commit

Permalink
Merge pull request #17 from luciorq/cl-readme
Browse files Browse the repository at this point in the history
docs: update readme
  • Loading branch information
luciorq authored Nov 8, 2024
2 parents 7471373 + 022a2af commit 5af7f31
Show file tree
Hide file tree
Showing 8 changed files with 398 additions and 85 deletions.
5 changes: 5 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
^renv$
^renv\.lock$
^condathis\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^\.Rprofile$
^justfile$
^data-raw$
^.*\.Rproj$
Expand All @@ -14,3 +17,5 @@
^justfile$
^data-raw$
^vignettes/*_files$
^README\.Rmd$
^README\.html$
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,7 @@
.Rdata
.httr-oauth
.DS_Store
renv
renv.lock
.Rprofile
README.html
12 changes: 9 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,13 @@ Version: 0.0.6.9007
Authors@R: c(
person("Lucio", "Queiroz", , "luciorqueiroz@gmail.com",
role = c("aut", "cre", "cph"),
comment = c(ORCID = "0000-0002-6090-1834"))
)
comment = c(ORCID = "0000-0002-6090-1834")),
person(given = "Claudio",
family = "Zanettini",
role = c("aut", "ctb"),
email = "claudio.zanettini@gmail.com",
comment = c(ORCID = "0000-0001-5043-8033")
))
Description: Simplifies the execution of command line tools within isolated and
reproducible environments.
It enables users to effortlessly manage Conda environments,
Expand All @@ -30,5 +35,6 @@ Imports:
withr
Suggests:
testthat (>= 3.0.0),
curl
curl,
dplyr
Config/testthat/edition: 3
197 changes: 197 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,197 @@
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# condathis

<!-- badges: start -->
[![R-CMD-check](https://github.com/luciorq/condathis/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/luciorq/condathis/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

Run system command line interface (CLI) tools in a **reproducible** and **isolated** environment **within R**.

## Get started

Install package from [R-Universe](https://luciorq.r-universe.dev/condathis):

```r
install.packages("condathis", repos = c("https://luciorq.r-universe.dev", getOption("repos")))
```

### Installing the development version

``` r
remotes::install_github("luciorq/condathis")
```

## Motivation

One of the main disadvantages of calling CLI tools within `R` is that they are system-specific. This affects the replicability of your code, making it dependent on the system it’s run on. Additionally, using multiple CLI tools increases the likelihood of encountering version conflicts, where different tools require different versions of the same library. Therefore, relying on system-specific tools within `R` is generally not recommended.

The package `{condathis}` lets you call CLI tools within R while keeping things reproducible and isolated.

This means you can use `R` alongside other tools without the drawback of having system-specific code. It opens up the possibility of creating code and pipelines in `R` that integrate multiple CLI tools. This is especially useful for bioinformatics and other fields that rely on many software tools for conducting complex analysis.

## Reproducibility: An Example

### The issue with `system`

Suppose you're writing a pipeline or just a script for some analysis, and you want to use [`fastqc`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) — a program to check the quality of FASTQ files. You've installed `fastqc` and use `system2` to run it.

The `fastqc` command synopsis is `fastqc <path-to-fastq-file> -o <output-dir>`. The output directory is where `fastqc` saves its quality control reports.



```{r eval=FALSE}
fastq_file <- system.file("extdata", "sample1_L001_R1_001.fastq.gz", package = "condathis")
temp_out_dir <- tempdir()
system2(command = "fastqc", args = c(fastq_file, "-o", temp_out_dir))
```


The `fastqc` program generates several output files, including a zip file that is 424KB in size. To get information about one of the output files, we can use:

```{r eval=FALSE}
library(fs)
library(dplyr)
file_info(fs::dir_ls(temp_out_dir, glob = "*zip")) |>
mutate(file_name = path_file(path)) |>
select(file_name, size)
```


```{r echo=FALSE, message=FALSE}
fastq_file <- system.file("extdata", "sample1_L001_R1_001.fastq.gz", package = "condathis")
temp_out_dir <- tempdir()
condathis::create_env(packages = "fastqc==0.11.2", env_name = "fastqc-0.11.2")
condathis::run("fastqc", fastq_file, "-o", temp_out_dir, env_name = "fastqc-0.11.2")
library(fs)
library(dplyr)
file_info(fs::dir_ls(temp_out_dir, glob = "*zip")) |>
mutate(file_name = path_file(path)) |>
select(file_name, size)
```

Now, let's consider the scenario where you share your code with someone else or revisit it yourself after a year. There's no guarantee the code will run because it relies on a specific CLI tool installed on the system. In the worst case, it might run without throwing any errors but produce different results, so you might not even realize that.

The exact same code run on the same system but with an updated version of `fastqc` (0.12.1 instead of 0.11.2) generates a different file, and its size is different as well: *446k instead of 424k*.


```{r echo=FALSE}
temp_out_dir_2 <- tempdir()
condathis::create_env(packages = "fastqc==0.12.1", env_name = "fastqc-0.12.1")
condathis::run("fastqc", fastq_file, "-o", temp_out_dir, env_name = "fastqc-0.12.1")
condathis::remove_env("fastqc-0.12.1")
file_info(fs::dir_ls(temp_out_dir_2, glob = "*zip")) |>
mutate(file_name = path_file(path)) |>
select(file_name, size)
```

This discrepancy limits the workflow, pipelines, and scripts to using only `R` packages!

What can we do about it? We can use `{condathis}`!

The package **`{condathis}`** ensures that the code you share and the results from running `fastqc` will be **consistent across different systems and over time**!


### The solution with `{condathis}`

We would first create an isolated environment containing a specific version of the package `fastqc` (0.12.1). The command automatically manages all the library dependencies of `fastqc`, making sure that they are compatible with the specific operating system.


```{r echo=FALSE}
rm(temp_out_dir_2)
```


```{r}
condathis::create_env(packages = "fastqc==0.12.1", env_name = "fastqc_env", verbose = "output")
```

Then we run the command inside the environment just created which contains a version 0.12.1 of `fastqc`.

```{r echo=FALSE}
# dir of output files
temp_out_dir_2 <- tempdir()
out <- condathis::run("fastqc", fastq_file, "-o", temp_out_dir_2, # command
env_name = "fastqc_env" # environment
)
```

In our temp directory, `fastqc` generated the output files as expected.

```{r}
out
```

In the our temp dir, `fastqc`generated the output files as expected.

```{r}
fs::dir_ls(temp_out_dir_2)
```

The code that we created with `{condathis}` **uses a system CLI tool but is reproducible**.

## Isolation: an example

Another key feature of `{condathis}` is the ability to run CLI tools in **independent, isolated environments**. This allows you to run packages within R that would have conflicting dependencies. This makes it possible for `{condathis}` to run two versions of the same CLI tool simultaneously!

For example, the system's `curl` is of a specific version:

```{r}
libcurlVersion()
```

However, we can choose to use a different version of `curl` run in a different environment. Here, for example, we are installing a different version of `curl` in a separate environment, and checking the version of the newly installed `curl`.

```{r}
condathis::create_env(packages = "curl==8.10.1", env_name = "curl_env", verbose = "output")
out <- condathis::run("curl", "--version",
env_name = "curl_env" # environment
)
cat(out$stdout)
```

This isolation feature of `{condathis}` allows not only running different versions of the same CLI tools but also different tools that have **incompatible dependencies**. One common example is CLI tools that rely on different versions of Python.

## Details

The package `{condathis}` relies on [**`micromamba`**](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) to bring **reproducibility and isolation**. `micromamba` is a lightweight, fast, and efficient package manager that "does not need a base environment and does not come with a default version of Python".

The integration of `micromamba` into `R` is handled using the `processx` and `withr` packages. The package `processx` runs external processes and manages their input and output, ensuring that commands to `micromamba` are executed correctly from within R. The package `withr` temporarily modifies environment variables and settings, allowing `micromamba` to run smoothly without permanently altering your `R` environment.

## Known limitations

Special characters in CLI commands are interpreted as literals and not expanded.

- It is not supported the use of output redirections in commands, e.g. "|" or ">".
- Instead of redirects (e.g. ">"), use the argument `stdout = "<FILENAME>.txt"`.
Instead of Pipes ("|"), simple run multiple calls to `condathis::run()`,
using `stdout` argument to control the output and input of each command.
- File paths should not use special characters for relative paths, e.g. "~", ".", "..".
- Expand file paths directly in R, using `base` functions
or functions from the `fs` package.




Loading

0 comments on commit 5af7f31

Please sign in to comment.