-
-
Notifications
You must be signed in to change notification settings - Fork 70
Use an NCBI Assembly Report to Update Contig Names in FASTA VCF GFF BED IntervalList Delimited Files
Want to change the contig names (e.g. to UCSC's names) in your FASTA, use fgbio
!
- Download an NCBI assembly report. For example, the assembly report for GRCh38.p14:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
- Download the corresponding FASTA. For example, the FASTA for GRCh38.p14:
Don't forget to also de-compress (gunzip
) it!
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz
gunzip GCA_000001405.29_GRCh38.p14_genomic.fna.gz
- Create a sequence dictionary with contig aliases from (1) with
fgbio CollectAlternateContigNames
:
The sequencing dictionary is just a SAM file header! Use the --primary
argument to specify the column in the assembly report to use as the primary contig names. Use --alternates
to specify the names of other columns to use as alternate contig names (aliases). Use --sequence-roles
to define which contig types you want to include.
fgbio CollectAlternateContigNames \
--input GCA_000001405.29_GRCh38.p14_assembly_report.txt \
--output GCA_000001405.29_GRCh38.p14_assembly_report.dict \
--primary UcscName \
--alternates RefSeqAccession GenBankAccession AssignedMolecule \
--sequence-roles AssembledMolecule UnlocalizedScaffold UnplacedScaffold AltScaffold
This should create a .dict
file:
$ head GCA_000001405.29_GRCh38.p14_assembly_report.dict
@HD VN:1.6
@SQ SN:chr1 LN:248956422 am:1 ga:CM000663.2 sn:1 ra:na un:chr1 AN:CM000663.2,1 sr:assembled-molecule
@SQ SN:chr2 LN:242193529 am:2 ga:CM000664.2 sn:2 ra:na un:chr2 AN:CM000664.2,2 sr:assembled-molecule
@SQ SN:chr3 LN:198295559 am:3 ga:CM000665.2 sn:3 ra:na un:chr3 AN:CM000665.2,3 sr:assembled-molecule
@SQ SN:chr4 LN:190214555 am:4 ga:CM000666.2 sn:4 ra:na un:chr4 AN:CM000666.2,4 sr:assembled-molecule
@SQ SN:chr5 LN:181538259 am:5 ga:CM000667.2 sn:5 ra:na un:chr5 AN:CM000667.2,5 sr:assembled-molecule
@SQ SN:chr6 LN:170805979 am:6 ga:CM000668.2 sn:6 ra:na un:chr6 AN:CM000668.2,6 sr:assembled-molecule
@SQ SN:chr7 LN:159345973 am:7 ga:CM000669.2 sn:7 ra:na un:chr7 AN:CM000669.2,7 sr:assembled-molecule
@SQ SN:chr8 LN:145138636 am:8 ga:CM000670.2 sn:8 ra:na un:chr8 AN:CM000670.2,8 sr:assembled-molecule
@SQ SN:chr9 LN:138394717 am:9 ga:CM000671.2 sn:9 ra:na un:chr9 AN:CM000671.2,9 sr:assembled-molecule
Notice the alternate contig names in the AN
tag (as per the SAM spec). Lower case tags contain additional information about each contig.
- Index the FASTA with
samtools faidx
:
samtools faidx GCA_000001405.29_GRCh38.p14_genomic.fna
- Update the FASTA with
fgbio UpdateFastaContigNames
:
Below we use --skip-missing
to skip outputting contigs in the input FASTA that are not in our input sequence dictionary, which is required because we did not select all the contigs from the assembly report in step (2). The --sort-by-dict
option is useful if you want to change the order of the contigs in the sequence dictionary before running this step (some folks like Mitochondria first, who can say).
fgbio UpdateFastaContigNames \
--input GCA_000001405.29_GRCh38.p14_genomic.fna \
--dict GCA_000001405.29_GRCh38.p14_assembly_report.dict \
--output GRCh38.p14_genomic.fna \
--sort-by-dict true \
--skip-missing
- Rename/copy the sequence dictionary in to place, and index the output FASTA:
cp -v GCA_000001405.29_GRCh38.p14_assembly_report.dict GRCh38.p14_genomic.dict
samtools faidx GRCh38.p14_genomic.fna
- Update other file types with the
fgbio Update*ContigNames
tools:
In particular, the fgbio UpdateDelimitedFileContigNames
tool can update any delimited file (e.g. BED file).