Unified Human Gut Virome Catalog (UHGV)

The UHGV is a comprehensive genomic resource of viruses from the human microbiome. Genomes were derived from 12 independent data sources and annotated using a uniform bioinformatics pipeline.

Table of contents

  1. Methods
  2. Data availability
  3. Bioinformatics tools that use the UHGV

Methods

Data sources

We constructed the UHGV by integrating gut virome collections from a number of recent studies.

Bioinformatics pipeline

Sequences from these studies were combined and run through the following bioinformatics pipeline:

  • geNomad, viralVerify, and CheckV were used to remove sequences from cellular organisms and plasmids, as necessary
  • CheckV was used to trim remaining bacterial DNA from virus ends, estimate completeness, and identify closed genomes. Sequences >10 kb or >50% complete were retained and classified as either complete, high-quality (>90% complete), medium-quality (50-90% complete), or low-quality (<50% complete)
  • BLASTN was used to calculate the average nucleotide identity between viruses using a custom script
  • DIAMOND was used to blast proteins between viral genomes. Pairwise alignments were used to calculate a genome-wide protein-based similarity metric.
  • MCL was used to cluster genomes into viral operational taxonomic units (vOTUs) at approximately the species, subgenus, genus, subfamily, and family-level ranks using a combination of genome-wide ANI for the species level and genome-wide proteomic similarity for higher ranks
  • A representative genome was selected for each species-level vOTU based on the presence of terminal repeats, completeness, and the ratio of viral to non-viral genes
  • ICTV taxonomy was inferred using a best-genome-hit approach to phage genomes from INPHARED and using taxon-specific marker genes from geNomad
  • CRISPR spacer matching and k-mer matching with PHIST were used to connect viruses and host genomes. A voting procedure was then used to identify the host taxon at the lowest taxonomic rank comprising at least 70% of connections
  • HumGut genomes and MAGs from a Hadza hunter-gatherer population were used for host prediction and read mapping (HumGut contains all genomes from the UHGG v1.0 combined with NCBI genomes detected in gut metagenomes)
  • GTDB r207 and GTDB-Tk were used to assign taxonomy to all prokaryotic genomes
  • BACPHLIP was used to predict phage lifestyle, together with integrases from the PHROG database and prophage information from geNomad. Note: BACPHLIP tends to overclassify viral genome fragments as lytic
  • Prodigal-gv was used to identify protein-coding genes and alternative genetic codes
  • eggNOG-mapper, PHROGs, KOfam, Pfam, UniRef90, PADLOC, and the AcrCatalog were used for phage gene functional annotation
  • PhANNs was used to infer phage structural genes
  • DGRscan was used to identify diversity-generating retroelements on viruses containing reverse transcriptases
  • Bowtie2 was used to align short reads from 1,798 whole metagenomes and 673 viral-enriched metagenomes against the UHGV and a database of prokaryotic genomes. ViromeQC was used to select human gut viromes. CoverM was used to estimate the breadth of coverage, and we applied a 50% threshold for classifying virus presence-absence
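
The host voting step above can be illustrated with a short sketch. This is not the authors' implementation: the function name `vote_host_taxon`, the rank list, and the choice to count every connection in the denominator are assumptions made for illustration, and the example lineages are made up. Each connection (a CRISPR spacer or k-mer match) contributes the lineage of the matched host genome, and the function returns the lowest rank at which a single taxon accounts for at least 70% of connections.

```python
from collections import Counter

# Ranks ordered from most general to most specific (assumed GTDB-style lineages).
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def vote_host_taxon(lineages, min_fraction=0.7):
    """Return (rank, taxon, fraction) for the lowest rank at which one taxon
    accounts for at least min_fraction of all connections, else None."""
    if not lineages:
        return None
    total = len(lineages)
    # Walk from the most specific rank up to the most general one.
    for depth in range(len(RANKS), 0, -1):
        # Count lineage prefixes so that identically named taxa under
        # different parents are never merged.
        counts = Counter(tuple(l[:depth]) for l in lineages if len(l) >= depth)
        if not counts:
            continue
        prefix, n = counts.most_common(1)[0]
        if n / total >= min_fraction:
            return RANKS[depth - 1], prefix[-1], n / total
    return None


# Illustrative connections for one virus (hypothetical data).
base = ("d__Bacteria", "p__Bacteroidota", "c__Bacteroidia",
        "o__Bacteroidales", "f__Bacteroidaceae")
connections = (
    [base + ("g__Bacteroides", "s__Bacteroides uniformis")] * 4
    + [base + ("g__Bacteroides", "s__Bacteroides fragilis")] * 4
    + [base + ("g__Phocaeicola", "s__Phocaeicola vulgatus")] * 2
)
result = vote_host_taxon(connections)
print(result)
```

In this example no species reaches 70% of the ten connections, so the vote falls back to the genus level, where Bacteroides holds eight of ten.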

For additional details, please refer to our manuscript: (in preparation).

Data availability

The entire resource is freely available at: https://portal.nersc.gov/UHGV

We provide genomes for three quality tiers:

  • Full: >50% complete or >10 kb, high-confidence & uncertain viral predictions
  • Medium-quality: >50% complete, high-confidence viral predictions
  • High-quality: >90% complete, high-confidence viral predictions
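
The three tiers are nested, which a small sketch makes explicit. The function below is illustrative only: the argument names `completeness` (percent), `length_bp`, and `high_confidence` are hypothetical and do not refer to actual metadata columns, and the labels `full`, `mq_plus`, and `hq_plus` simply mirror the file name suffixes used in genome_catalogs/.

```python
def catalog_tiers(completeness, length_bp, high_confidence):
    """Return the catalog tiers a genome would belong to under the
    thresholds listed above (hypothetical inputs, not real column names)."""
    # Treat a missing completeness estimate as 0% (an assumption).
    completeness = 0.0 if completeness is None else completeness
    tiers = []
    # Full: >50% complete or >10 kb, regardless of prediction confidence.
    if completeness > 50 or length_bp > 10_000:
        tiers.append("full")
    # Medium-quality and above: >50% complete and high-confidence.
    if high_confidence and completeness > 50:
        tiers.append("mq_plus")
    # High-quality and above: >90% complete and high-confidence.
    if high_confidence and completeness > 90:
        tiers.append("hq_plus")
    return tiers


print(catalog_tiers(95.0, 40_000, True))
print(catalog_tiers(30.0, 12_000, False))
```

A genome in the high-quality tier is therefore also in the medium-quality and full tiers, while a long but incomplete fragment appears only in the full tier.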

Additionally, we provide data for:

  • vOTU representatives
  • All genomes in each vOTU

Recommended files

For most analyses, we recommend using these files:

All available files:

  • metadata/

    • uhgv_full_metadata.tsv : detailed information on each of the 874,104 UHGV genome sequences
    • votus_full_metadata.tsv : detailed information on each of the 168,570 species-level viral clusters
    • votus_metadata_extended.tsv: additional information on each vOTU
    • host_metadata.tsv : taxonomy and other info for prokaryotic genomes (completeness, contamination, N50)
  • genome_catalogs/

    • uhgv_full.[fna|faa].gz : sequences for all genomes >10 kb or >50% completeness
    • uhgv_mq_plus.[fna|faa].gz : sequences for all genomes with >50% completeness
    • uhgv_hq_plus.[fna|faa].gz : sequences for all genomes with >90% completeness
    • votus_full.[fna|faa].gz : sequences for vOTU representatives >10 kb or >50% completeness
    • votus_mq_plus.[fna|faa].gz : sequences for vOTU representatives with >50% completeness
    • votus_hq_plus.[fna|faa].gz : sequences for vOTU representatives with >90% completeness
  • votu_reps/

    • [genome_id].fna : DNA sequence FASTA file of the genome assembly of the species representative
    • [genome_id].faa : protein sequence FASTA file of the species representative
    • [genome_id].gff : genome GFF file with various sequence annotations
    • [genome_id]_emapper.tsv : eggNOG-mapper annotations of the protein-coding sequences
    • [genome_id]_annotations.tsv : tab-delimited file containing diverse protein-coding annotations (PHROG, Pfam, UniRef90, eggNOG-mapper, PhANNs, KEGG)
  • host_predictions/

    • crispr_spacers.fna : 5,318,089 CRISPR spacers from UHGG (3,143,456), NCBI (1,568,807), and Hadza genomes (605,826)
    • host_genomes_info.tsv : GTDB r207 taxonomy for genomes from the UHGG (286,387), NCBI (123,500), and Hadza genomes (54,779)
    • host_assignment_crispr.tsv : detailed information for host prediction with CRISPR spacers
    • host_assignment_kmers.tsv : detailed information for host prediction with PHIST k-mer matching
  • annotations/

    • functional annotation matrices: vOTUs x functions (PHROG, Pfam, KOfam, PADLOC)
  • read_mapping/

    • metagenomes_prok_vir_counts_matrix.tsv.gz : CoverM mapping statistics for viruses and bacteria across bulk metagenomes

    • viromes_prok_vir_counts_matrix.tsv.gz : CoverM mapping statistics for viruses and bacteria across viral-enriched metagenomes

    • sample_metadata.tsv: human sample metadata (country, lifestyle, age, gender, bmi, study)

    • fastq_summary.tsv: information on sequencing reads (SRA accession, bulk/virome metagenome, ViromeQC enrichment, read counts)

    • study_metadata.tsv: information on individual studies for read mapping

    • bowtie2_indexes/

      • prokaryote_reps.fna.gz: FASTA of prokaryotic genomes used for read mapping
      • prokaryote_metadata_table.tsv.gz: metadata for the prokaryotic genomes
      • prokaryote_reps.1.bt*: Bowtie2 index files
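
As described in the Methods, a virus is called present in a sample when its breadth of coverage passes a 50% threshold. The sketch below shows that rule on a toy example. It does not parse the matrix files listed above: the nested dictionary layout, the sample and vOTU names, the 0 to 1 breadth scale, and the use of "greater than or equal to" at the threshold are all assumptions made for illustration.

```python
def call_presence(breadth_by_sample, min_breadth=0.5):
    """Map each sample to the set of genomes whose breadth of coverage
    (fraction of the genome covered by reads, 0 to 1) meets the threshold."""
    return {
        sample: {genome for genome, breadth in genomes.items()
                 if breadth >= min_breadth}
        for sample, genomes in breadth_by_sample.items()
    }


def prevalence(presence, genome):
    """Fraction of samples in which the genome was called present."""
    if not presence:
        return 0.0
    return sum(genome in found for found in presence.values()) / len(presence)


# Hypothetical breadth values for two samples and two vOTUs.
breadth = {
    "sample_A": {"vOTU_1": 0.92, "vOTU_2": 0.10},
    "sample_B": {"vOTU_1": 0.49, "vOTU_2": 0.75},
}
presence = call_presence(breadth)
print(presence)
print(prevalence(presence, "vOTU_1"))
```

Using breadth rather than read counts for the presence call helps avoid counting a virus as present when reads pile up on only a small, possibly conserved, region of its genome.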

Code availability

Contig-level taxonomic classification with the UHGV toolkit

  • Code to assign viral genomes to taxonomic groups from the UHGV
  • View the README for download and usage instructions.

Read-level abundance profiling with Phanta

  • Phanta (https://github.com/bhattlab/phanta) is a fast and accurate virus-inclusive profiler of human gut metagenomes based on the classification of short reads with Kraken2.
  • Follow the instructions on the Phanta GitHub page to install the software
  • Download a custom-built UHGV database for Phanta:
    • HQ plus: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz
    • MQ plus: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz
    • These databases are similar to Phanta's default database, as described in Phanta's manuscript, but with the viral portion replaced by the UHGV.
  • Phanta can be executed based on the instructions on its GitHub page.

Genome visualization
