Relative paths to data files are not accessible by Singularity containers #184

Open
jgoodson opened this issue Oct 3, 2022 · 4 comments

Comments

@jgoodson

jgoodson commented Oct 3, 2022

This error occurs while using the atac-seq-pipeline, but I do not believe it is specific to that pipeline. When using Singularity, relative paths to local input files are not included in the Singularity bindpath. Jobs then fail when the input files are symlinked back to their original location (symlinking is the default first-priority localization strategy), because the base directory is not bound into the container. Changing the paths to absolute paths fixes this issue.

This error will look something like:

Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_trim_adapter.py", line 214, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_trim_adapter.py", line 157, in main
    args.adapters[i][0] = detect_most_likely_adapter(fastqs[0])
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 49, in detect_most_likely_adapter
    fname)
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 26, in detect_adapters_and_cnts
    with open_gz(fname) as fp:
  File "/software/atac-seq-pipeline/src/detect_adapter.py", line 16, in open_gz
    return gzip.open(fname) if fname.endswith('.gz') else open(fname, 'rb')
  File "/usr/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1/attempt-2/inputs/135737268/ENCFF641SFZ.subsampled.400.fastq.gz'

In this example, that file is a symlink to a file in /gpfs/gsfs8/users/goodsonjr/encode-atac/input/. The submitted script invokes singularity with this command:

singularity exec --cleanenv --home=/gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1 --bind=/fdb/encode-atac-seq-pipeline/v3/hg38,/vf/db/encode-atac-seq-pipeline/v3, https://encode-pipeline-singularity-image.s3.us-west-2.amazonaws.com/atac-seq-pipeline_v2.2.0.sif /bin/bash /gpfs/gsfs8/users/goodsonjr/encode-atac/atac/a72c0a26-b2e0-4bdc-8e63-31a52e65a332/call-align/shard-1/attempt-2/execution/script

This generated bindpaths for the atac.genome_tsv file as it was an absolute path, but the relative paths to the FastQ files are discarded. Since the original files aren't in --bind or the --home path, the container cannot read the file.
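The failure mode described above can be checked outside the container. The following is a minimal sketch, not part of Caper: `is_covered_by_binds` is a hypothetical helper that resolves a localized input through its symlinks and tests whether the real file sits under any of the directories passed to `--bind` or `--home`. Empty entries, such as the one produced by a trailing comma in the bind list, are skipped.

```python
import os


def is_covered_by_binds(path, bind_dirs):
    """Return True if the symlink-resolved `path` lies under one of `bind_dirs`.

    Hypothetical diagnostic helper; it only inspects the host filesystem and
    does not call Singularity.
    """
    real = os.path.realpath(path)
    for d in bind_dirs:
        if not d:
            # Skip empty entries, e.g. from a trailing comma in --bind.
            continue
        real_dir = os.path.realpath(d)
        # commonpath equals the directory itself only when `real` is inside it.
        if os.path.commonpath([real, real_dir]) == real_dir:
            return True
    return False
```

Calling it with the localized FastQ path from the traceback and the `--bind` and `--home` directories from the command above would be expected to return False in the situation reported here, since the symlink target lives under the input directory, which is not bound.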

Details: Looking at the Caper code, it runs caper.singularity.find_bindpath() on the input JSON file to determine what paths to bind-mount. This function calls autouri.AbsPath to determine whether each value is a valid path. It then uses some logic to determine which parent directories to bind-mount. Since this function takes the relative paths and directly generates an autouri AbsPath, the relative paths don't generate valid URIs and won't be included in the bind-path generation logic. I get a comparable result when calling this function directly. This conditional:

def find_dirname(s):
    u = AbsPath(s)
    if u.is_valid:

evaluates to False when fed a relative path. This means the path won't be included in all_dirnames and won't contribute to the bindpath.

This issue does not seem to arise with the plain local backend without Slurm. I haven't figured out why, but when using Slurm, Caper creates symlinks in the workflow run directory, while with the local backend the files get copied or hardlinked, despite the generated backend.conf having the same order for backend.providers.Local.config.filesystem.local.localization.
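To compare the two backends, it can help to check how each localized input actually ended up in the run directory. This is a small sketch under the assumption that both the localized file and the original are visible on the host; `localization_kind` is a hypothetical name, and the labels it returns simply mirror the strategy names used in the backend configuration.

```python
import os


def localization_kind(localized, original):
    """Guess how `localized` was produced from `original`.

    Hypothetical helper: returns "soft-link", "hard-link", or "copy".
    """
    if os.path.islink(localized):
        return "soft-link"
    # Not a symlink but the same inode on the same device: a hard link.
    if os.path.samefile(localized, original):
        return "hard-link"
    return "copy"
```

Running this over the files under a task's inputs directory for a Slurm run and a local run would show whether the two backends really pick different strategies.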

I looked but was unable to find any documentation concerning absolute vs relative paths, and the descriptions of the input JSON format use either web URIs or relative local paths. I am not sure what to suggest, although using os.path.abspath() to convert relative paths to absolute before generating the autouri.AbsPath might potentially resolve this.
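As a rough illustration of the os.path.abspath() idea, the sketch below rewrites an input JSON structure before it is handed to anything else. It is not Caper's code: `absolutize_local_paths` and its `base_dir` parameter are hypothetical, and the rule it uses (a relative string is treated as a path only if it exists on disk) is an assumption chosen to avoid rewriting plain strings. A plain string that happens to match an existing file name would still be rewritten, so this is a heuristic.

```python
import os


def absolutize_local_paths(inputs, base_dir=None):
    """Return a copy of `inputs` with existing relative paths made absolute.

    Hypothetical pre-processing step. Strings that contain "://", are already
    absolute, are empty, or do not exist under `base_dir` are left untouched.
    """
    base_dir = base_dir or os.getcwd()

    def convert(value):
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        if isinstance(value, list):
            return [convert(v) for v in value]
        if (isinstance(value, str) and value
                and "://" not in value
                and not os.path.isabs(value)):
            candidate = os.path.join(base_dir, value)
            if os.path.exists(candidate):
                return os.path.abspath(candidate)
        return value

    return convert(inputs)
```

The result could be written to a new JSON file with json.dump and used in place of the original input file.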

@leepc12
Contributor

leepc12 commented Oct 11, 2022

I will add documentation about absolute paths to the README.

Caper's localization engine, autouri, cannot distinguish between a relative path and a plain string, so it was not able to add relative paths to Singularity's bindpath. For now, it's recommended to use only absolute paths in the input JSON, particularly with Singularity.

Thanks for reporting, I will look into this and fix it soon. Please use absolute paths until it's fixed.

@xk42

xk42 commented Mar 22, 2023

This is affecting our environment as well. By default, Cromwell localizes inputs using hard links, soft links, and file copies. Most institutes have multiple labs and multiple filesystems that share data, and in this case the Singularity container seems to be unable to access data that has been localized by Cromwell and turned into relative paths.

@sidwekhande

We faced this issue as well, and using absolute paths did not work for us. To solve this, I created a custom backend and changed the order of the localization list to:

localization = [
    "hard-link"
    "copy"
    "soft-link"
]

and then passed the custom backend file using --backend-file on the command line.

@leepc12
Contributor

leepc12 commented Sep 26, 2023

@sidwekhande Are you using the Slurm backend? It's odd that changing the order fixed the problem. I think you can fix it by not including linked (symlinked or hardlinked) files in the input JSON.

For example, if you have an original file at /home/me/original/genome.tsv that is symlinked to /home/me/linked/genome.tsv, you should not use /home/me/linked/genome.tsv. Simply using the original paths in the input JSON will fix the problem.

Make sure that none of the files defined in genome.tsv are linked (soft or hard) either, e.g. the FASTA, genome size files, and bowtie2 indices.

I will add this to my ToDo list. I will need to edit caper.singularity.find_bindpath() function to recognize linked files and find original ones (recursively for files in genome TSV too).
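One possible shape for that change, shown only as a sketch and not as the actual find_bindpath() logic: collect both the directory that holds each input path and the directory that holds its symlink-resolved target. `dirs_to_bind` is a hypothetical name. It handles symlinks only (a hard link has no separate "original" path to resolve) and does not recurse into the files listed inside a genome TSV.

```python
import os


def dirs_to_bind(paths):
    """Return sorted directories covering each path and its symlink target.

    Hypothetical sketch: paths that do not exist on the host are skipped.
    """
    dirs = set()
    for p in paths:
        abs_p = os.path.abspath(p)
        if not os.path.lexists(abs_p):
            continue
        # Directory where the (possibly linked) file is referenced.
        dirs.add(os.path.dirname(abs_p))
        # Directory of the real file after following all symlinks.
        dirs.add(os.path.dirname(os.path.realpath(abs_p)))
    return sorted(dirs)
```

With the example paths from this comment, passing /home/me/linked/genome.tsv would yield both /home/me/linked and /home/me/original, assuming the link exists as described.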

{
   "atac.genome_tsv": "/home/me/original/genome.tsv"
}

Please let me know if you can run it without adding your own --backend-file.
