HUMAnN Tutorial with BiobakeryUtils.jl

Installation and setup

If you haven't already, check out the "Getting Started" page to install julia, create an environment, and install BiobakeryUtils.jl, and hook up or install the HUMAnN v3 command line tools.

This tutorial assumes:

  1. You are running julia v1.6 or greater
  2. You have activated a julia Project that has BiobakeryUtils.jl installed
  3. The humann python package is installed, and accessible from your PATH.

If any of those things aren't true, or you don't know if they're true, go back to "Getting Started" to see if you skipped a step. If you're still confused, please ask (see 3rd bullet point at the top)!

HUMAnN Databases

HUMAnN requires a number of specialized databases to work correctly. When you first install it, it comes with some demo databases that are much smaller, but can be used to complete this tutorial. However, for actually running real data, you'll want to take the time to download them - they're BIG! See here for more information.

For now, the easiest way to do this for now is via the shell, which you can access from the julia REPL by typing ;:

;humann_databases --help
usage: humann_databases [-h] [--available]
                        [--download <database> <build> <install_location>]
                        [--update-config {yes,no}]
                        [--database-location DATABASE_LOCATION]

HUMAnN Databases

optional arguments:
  -h, --help            show this help message and exit
  --available           print the available databases
  --download <database> <build> <install_location>
                        download the selected database to the install location
  --update-config {yes,no}
                        update the config file to set the new database as the default [DEFAULT: yes]
  --database-location DATABASE_LOCATION
                        location (local or remote) to pull the database

;humann_databases --available
HUMANnN2 Databases ( database : build = location )
chocophlan : full = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.v296_201901b.tar.gz
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/DEMO_chocophlan.v296_201901b.tar.gz
uniref : uniref50_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901b_ful
l.tar.gz
uniref : uniref90_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_annotated_v201901b_ful
l.tar.gz
uniref : uniref50_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref50_ec_filte
red_201901b_subset.tar.gz
uniref : uniref90_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref90_ec_filte
red_201901b_subset.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_DEMO_diamond_v201901b.tar.
gz
utility_mapping : full = http://huttenhower.sph.harvard.edu/humann_data/full_mapping_v201901b.tar.gz

For example, if you'd like to install these databases to /BigDrive/humann/, you could run

shell> humann_databases --download chocophlan full /BigDrive/humann/chocophlan
# ... lots of output

shell> humann_databases --download uniref uniref90_diamond /BigDrive/humann/uniref
# ... lots of output

shell> humann_databases --download utility_mapping full /BigDrive/humann/utility_mapping
# ... lots of output

At some point, I'll write some functions to automate this, but for now, doing this will update a configuration file, so you shouldn't have to worry about it again.

Running HUMAnN

Official tutorial link

Some example files you can use to run this tutorial are available from the MetaPhlAn repo, and can be downloaded using the Downloads standard library in julia:

julia> using Downloads: download

julia> base_url = "https://github.com/biobakery/humann/raw/master/examples/";

julia> files = [
           "demo.fastq.gz",
           "demo.sam",
           "demo.m8"
       ];

julia> for file in files
           download(joinpath(base_url, file), file)
       end

julia> readdir()
5-element Vector{String}:
 "Manifest.toml"
 "Project.toml"
 "demo.fastq.gz"
 "demo.sam"
 "demo.m8"

For convenience, this package has the humann() function, which can be used in your julia scripts to build and call the humann command line tool.

For example, rather than call

$ humann --input demo.fastq.gz --output demo_fastq

You can do

julia> humann("demo.fastq.gz", "demo_fastq")
[ Info: Running command: humann -i demo.fastq.gz -o demo_fastq
Creating output directory: /home/kevin/my_project/demo_fastq
Output files will be written to: /home/kevin/my_project/demo_fastq
Decompressing gzipped file ...
# ... etc

First, humann will run metaphlan to generate taxonomic profiles, then will use that taxonomic profile to run a custom gene search.

Default outputs

Official tutorial link

To load a profile generated by humann, use the humann_profile function:

julia> gfs = humann_profile("demo_fastq/demo_genefamilies.tsv")
Microbiome.CommunityProfile{Float64, Microbiome.GeneFunction, Microbiome.MicrobiomeSample} with 589 things in 1 places

Sample names:
UNMAPPED, UniRef90_G1UL42, UniRef90_I9QXW8...UniRef90_A6LH06, UniRef90_D0TRR5

Feature names:
demo_genefamilies

HUMAnN generates "stratified" gene function profiles - in other words, each gene function is also split into the species that contributed it. By default, human_profile skips the stratified rows (they can get big!):

julia> first(features(gfs), 5)
5-element Vector{GeneFunction}:
 GeneFunction("UNMAPPED", missing)
 GeneFunction("UniRef90_G1UL42", missing)
 GeneFunction("UniRef90_I9QXW8", missing)
 GeneFunction("UniRef90_A0A174QBF2", missing)
 GeneFunction("UniRef90_A0A078RDY6", missing)

The missing component of the GeneFunction means that these gene functions are not associated with a particular taxon.

If you want to hang onto the taxon information, use the keyword argument stratified = true:

julia> gfs_strat = humann_profile("demo_fastq/demo_genefamilies.tsv", stratified=true)
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 1416 features in 1 samples

Feature names:
UNMAPPED, UniRef90_G1UL42, UniRef90_G1UL42...UniRef90_D0TRR5, UniRef90_D0TRR5

Sample names:
demo_genefamilies



julia> first(features(gfs_strat), 5)
5-element Vector{GeneFunction}:
 GeneFunction("UNMAPPED", missing)
 GeneFunction("UniRef90_G1UL42", missing)
 GeneFunction("UniRef90_G1UL42", Taxon("Bacteroides_dorei", :species))
 GeneFunction("UniRef90_I9QXW8", missing)
 GeneFunction("UniRef90_I9QXW8", Taxon("Bacteroides_dorei", :species))

Here, we can see that the uniref90 "G1UL42" was contributed by Bacteroides dorei.

The object returned by humann_profile is a CommunityProfile type from Microbiome.jl, and has a bunch of useful properties.

Indexing

For CommunityProfiles, you can select features and samples with strings (or regular expressions) representing names of features (rows) or samples (columns):

julia> samplenames(gfs_strat)
1-element Vector{String}:
 "demo_genefamilies"

julia> featurenames(gfs_strat)
1416-element Vector{String}:
 "UNMAPPED"
 "UniRef90_G1UL42"
 "UniRef90_G1UL42"
 "UniRef90_I9QXW8"
 โ‹ฎ
 "UniRef90_A6LH06"
 "UniRef90_D0TRR5"
 "UniRef90_D0TRR5"
 "UniRef90_D0TRR5"

julia> slice = gfs_strat["UniRef90_A0A174NIB7", :]
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 3
features in 1 samples

Feature names:
UniRef90_A0A174NIB7, UniRef90_A0A174NIB7, UniRef90_A0A174NIB7

Sample names:
demo_genefamilies



julia> features(slice)
3-element Vector{GeneFunction}:
 GeneFunction("UniRef90_A0A174NIB7", missing)
 GeneFunction("UniRef90_A0A174NIB7", Taxon("Bacteroides_dorei", :species))
 GeneFunction("UniRef90_A0A174NIB7", Taxon("Bacteroides_vulgatus", :species))

Using a string (or regular expression) to index will return all rows whose name matches, regardless of the taxon. If you just want a single value, you can use a GeneFunction directly:

julia> gfs_strat[GeneFunction("UniRef90_A0A174NIB7", "s__Bacteroides_dorei"), 1]
0.8271298594

You can even pass an array of strings as the row index to get a slice with multiple gene functions:

julia> features(gfs_strat[["UniRef90_A0A174NIB7", "UniRef90_A6L100"], :])
5-element Vector{GeneFunction}:
 GeneFunction("UniRef90_A6L100", missing)
 GeneFunction("UniRef90_A6L100", Taxon("Bacteroides_vulgatus", :species))
 GeneFunction("UniRef90_D0TRR5", missing)
 GeneFunction("UniRef90_D0TRR5", Taxon("Bacteroides_vulgatus", :species))
 GeneFunction("UniRef90_D0TRR5", Taxon("Bacteroides_dorei", :species))

For more information about indexing and accessing components of the data, see the Microbiome.jl docs

Manipulating tables

Official tutorial link

There are several ways to manipulate CommunityProfiles, both using julia and using utilities provided by humann.

One example of the former is using filter, which takes a boolean function as the first argument, and returns a new CommunityProfile containing only rows that returned true.

For example, given a stratified table like gfs_strat, if you want to get only rows that have a taxon associated with them, you can do:

julia> gfs_strat_only = filter(hastaxon, gfs_strat)
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 827 features in 1 samples

Feature names:
UniRef90_G1UL42, UniRef90_I9QXW8, UniRef90_A0A174QBF2...UniRef90_D0TRR5, UniRef90_D0TRR5

Sample names:
demo_genefamilies

Uh oh! We've now lost the "UNMAPPED" row, which means that we won't have the reads that couldn't be mapped to a gene function represented. No matter, we can use julia's anonymous function (also sometimes called "lambda function") syntax to roll our own function.

In the following example, gf -> indicates a function that takes a single argument (in this case, our GeneFunction), then askes if it's name is "UNMAPPED" with name(gf) == "UNMAPPED", OR (|| is a short-circuiting OR operator) if it has a taxon:

julia> gfs_with_unmapped = filter(
                            gf-> name(gf) == "UNMAPPED" || hastaxon(gf),
                            gfs_strat)
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 828 features in 1 samples

Feature names:
UNMAPPED, UniRef90_G1UL42, UniRef90_I9QXW8...UniRef90_D0TRR5, UniRef90_D0TRR5

Sample names:
demo_genefamilies

Normalize RPK to relative abundance

Official tutorial link

Some humann utility scripts have convenience functions in BiobakeryUtils.jl. for example, if you want to renormalize your table into relative abundance, you could use humann_renorm_table from the command line, or call humann_renorm:

julia> gfs_renormed = humann_renorm(gfs_strat; units="relab")
Loading table from: /tmp/jl_9WL33H
  Treating /tmp/jl_9WL33H as stratified output, e.g. ['UniRef90_G1UL42', 'Bacteroides_dorei']
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 589 features in 1 samples

Feature names:
UNMAPPED, UniRef90_G1UL42, UniRef90_I9QXW8...UniRef90_A6LH06, UniRef90_D0TRR5

Sample names:
demo_genefamilies

julia> abundances(gfs_strat[1:5, 1])
5ร—1 SparseArrays.SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 17556.0
   333.333
   333.333
   333.333
   333.333

julia> abundances(gfs_renormed[1:5, 1])
5ร—1 SparseArrays.SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 0.665379
 0.0126335
 0.0126335
 0.00758008
 0.00631673

Regrouping genes to other functional categories

Official tutorial link

Similarly, if we want to regroup our uniref90s into another gene function category like ecs or KOs, we can use humann_regroup

julia> gfs_rxn = humann_regroup(gfs_strat, inkind="uniref90", outkind="rxn")
Loading table from: /tmp/jl_SA9rCQ
  Treating /tmp/jl_SA9rCQ as stratified output, e.g. ['UniRef90_G1UL42', 'Bacteroides_dorei']
Loading mapping file from: /home/kevin/.julia/conda/3/envs/biobakery/lib/python3.7/site-packages/humann/tools/../data/pathways/meta
cyc_reactions_level4ec_only.uniref.bz2
Original Feature Count: 589; Grouped 1+ times: 78 (13.2%); Grouped 2+ times: 20 (3.4%)
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 174 features in 1 samples

Feature names:
UNMAPPED, UNGROUPED, 1.7.7.2-RXN...UDPNACETYLGLUCOSAMACYLTRANS-RXN, UROGENDECARBOX-RXN

Sample names:
demo_genefamilies

julia> first(features(gfs_strat), 5)
5-element Vector{GeneFunction}:
 GeneFunction("UNMAPPED", missing)
 GeneFunction("UniRef90_G1UL42", missing)
 GeneFunction("UniRef90_G1UL42", Taxon("Bacteroides_dorei", :species))
 GeneFunction("UniRef90_I9QXW8", missing)
 GeneFunction("UniRef90_I9QXW8", Taxon("Bacteroides_dorei", :species))

julia> first(features(gfs_rxn), 5)
5-element Vector{GeneFunction}:
 GeneFunction("UNMAPPED", missing)
 GeneFunction("UNGROUPED", missing)
 GeneFunction("1.7.7.2-RXN", missing)
 GeneFunction("1.8.1.4-RXN", missing)
 GeneFunction("2.4.1.135-RXN", missing)

Note - to get other feature types, you may have to download the requisite databases using humann_databases at the command line. See Using Conda.jl

Attaching names to features

Official tutorial link

You can attach names to features using humann_rename.

HUMAnN for multiple samples

Official tutorial link

You can easily run multiple files in a loop in julia. First, download the files (if you already did this in the MetaPhlAn tutorial, no need to repeat it).

julia> base_url = "https://github.com/biobakery/biobakery/raw/master/demos/biobakery_demos/data/metaphlan3/input/";

julia> files = [
           "SRS014476-Supragingival_plaque.fasta.gz",
           "SRS014494-Posterior_fornix.fasta.gz",
           "SRS014459-Stool.fasta.gz",
           "SRS014464-Anterior_nares.fasta.gz",
           "SRS014470-Tongue_dorsum.fasta.gz",
           "SRS014472-Buccal_mucosa.fasta.gz"
       ];

julia> for file in files
           download(joinpath(base_url, file), file)
       end

Then, just write a normal loop with humann:

julia> for file in files
           humann(file, "hmp_subset")
       end
[ Info: Running command: humann -i SRS014476-Supragingival_plaque.fasta.gz -o hmp_subset
Creating output directory: /home/kevin/my_project/hmp_subset
Output files will be written to: /home/kevin/my_project/hmp_subset
Decompressing gzipped file ...
# ... etc

On my decently powerful laptop, this took about 10 min.

To merge them using humann_join_tables, use the convenient julia function, humann_join:

julia> humann_join("hmp_subset", "hmp_subset_genefamilies.tsv"; file_name="genefamilies")
Gene table created: /home/kevin/my_project/hmp_subset_genefamilies.tsv
Process(`humann_join_tables -i hmp_subset -o hmp_subset_genefamilies.tsv --file_name genefamilies`, ProcessExited(0))

This will write a new file that you can then load with humann_profiles

julia> humann_profiles("hmp_subset_genefamilies.tsv"; stratified=true)

Alternatively, you can load each profile into a CommunityProfile, then merge them using the Microbiome.jl function commjoin:

# anonymous function passed to filter files that contain "genefamilies"
julia> hmp_files = filter(f-> contains(f, "genefamilies"),
                              readdir("hmp_subset"; join=true))
6-element Vector{String}:
 "hmp_subset/SRS014459-Stool_genefamilies.tsv"
 "hmp_subset/SRS014464-Anterior_nares_genefamilies.tsv"
 "hmp_subset/SRS014470-Tongue_dorsum_genefamilies.tsv"
 "hmp_subset/SRS014472-Buccal_mucosa_genefamilies.tsv"
 "hmp_subset/SRS014476-Supragingival_plaque_genefamilies.tsv"
 "hmp_subset/SRS014494-Posterior_fornix_genefamilies.tsv"

julia> commjoin(humann_profile.(hmp_files)...)
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 1 features in 6 samples

Feature names:
UNMAPPED

Sample names:
SRS014459-Stool_genefamilies, SRS014464-Anterior_nares_genefamilies, SRS014470-Tongue_dorsum_genefamilies...SRS014476-Supragingival
_plaque_genefamilies, SRS014494-Posterior_fornix_genefamilies

Plotting

Official tutorial link

BiobakeryUtils.jl does not come with plotting recipes (yet), but there are several excellent plotting packages that you can use. Alternatively, you can use the wrapped humann_barplot script.

First, download the pcl file used in the HUMAnN tutorial.

julia> download("https://raw.githubusercontent.com/biobakery/biobakery/master/demos/biobakery_demos/data/humann2/input/hmp_pathabund.pcl", "hmp_pathabund.pcl")

Using humann_barplot

It's probably a good idea to read the tutorial link above that describes the dataset. here are the equivalent julia commands to generate the plots described there.

julia> gfs = read_pcl("hmp_pathabund.pcl"; last_metadata="STSite")



julia> humann_barplot(gfs, "plot1.png"; focal_metadata="STSite", focal_feature="METSYN-PWY")
Process(`humann_barplot --i /tmp/jl_tKWIfW -o plot1.png --last-metadata STSite --focal-metadata STSite --focal-feature METSYN-PWY`,

julia> humann_barplot(gfs, "plot2.png"; focal_metadata="STSite", focal_feature="METSYN-PWY", 
                      sort="sum")
Process(`humann_barplot --i /tmp/jl_vF6GHe -o plot2.png --last-metadata STSite --focal-metadata STSite --focal-feature METSYN-PWY -

julia> humann_barplot(gfs, "plot3.png"; focal_metadata="STSite", focal_feature="METSYN-PWY",
                      sort=["sum", "metadata"],
                      scaling="logstack")
Process(`humann_barplot --i /tmp/jl_VVl3zD -o plot3.png --last-metadata STSite --focal-metadata STSite --focal-feature METSYN-PWY -
-sort sum metadata --scaling logstack`, ProcessExited(0))

julia> humann_barplot(gfs, "plot4.png"; focal_metadata="STSite", focal_feature="COA-PWY",
                      sort="sum")
Process(`humann_barplot --i /tmp/jl_XePIBD -o plot4.png --last-metadata STSite --focal-metadata STSite --focal-feature COA-PWY --so
rt sum`, ProcessExited(0))

julia> humann_barplot(gfs, "plot5.png"; focal_metadata="STSite", focal_feature="COA-PWY",
                      sort="braycurtis",
                      scaling="logstack",
                      as_genera=true,
                      remove_zeros=true)

On the last call, notice that "flag arguments" (eg --as-genera) that don't take arguments on the command line must be set to true in the julia version.

Using julia plotting

Use the read_pcl function to load the pcl file into julia, which will add all of the metadata encoded in the PCL to the resulting CommunityProfile

julia> gfs = read_pcl("hmp_pathabund.pcl", last_metadata="STSite")
CommunityProfile{Float64, GeneFunction, MicrobiomeSample} with 5606 features in 378 samples

Feature names:
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis, 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis, 1CMET2-PWY: N10-formyl-
tetrahydrofolate biosynthesis...VALSYN-PWY: L-valine biosynthesis, VALSYN-PWY: L-valine biosynthesis

Sample names:
SRS011084, SRS011086, SRS011090...SRS058213, SRS058808



julia> first(samples(gfs))
MicrobiomeSample("SRS011084", {:STSite = "Stool"})

For plotting, I tend to use Makie, but there are many other options.

Functions and types

BiobakeryUtils.humann โ€” Method
humann(inputfile, output[, flags]; kwargs...)

Run humann command line tool on inputfile, putting outputs in output. Requires humann to be installed and accessible in the PATH (see Getting Started).

humann flag options (those that don't have a parameter) can be passed in an array, and other options can be passed with keyword arguments. For example, if on the command line you would run:

$ humann -i $INPUTFILE -o $OUTPUT --bypass-tranlated-search --input-format fastq.gz --output-format biom

using this function, you would write:

humann(INTPUTFILE, OUTPUT, ["bypass_translated_search"]; input_formal="fastq.gz", output_format="biom")
BiobakeryUtils.humann_barplot โ€” Method
humann_barplot(comm::CommunityProfile, outpath; kwargs...)

Wrapper for humann_barplot script, to generate plots from functional data. pass keyword arguments for script options. Flag arguments should be set to true. eg

julia> humann_barplot(comm, "plot.png"; focal_metadata="STSite", focal_feature="COA-PWY",
                      sort="braycurtis",
                      scaling="logstack",
                      as_genera=true,
                      remove_zeros=true)

Requires installation of humann available in ENV["PATH"]. See "Using Conda" for more information.

BiobakeryUtils.humann_join โ€” Method
humann_renorm(comm::AbstractDataFrame; units::String="cpm")

Wrapper for humann_renorm_table script, to renormalize from RPKM (reads per kilobase per million) to "cpm" (counts per million) or "relab" (relative abundance).

Requires installation of humann available in ENV["PATH"]. See "Using Conda" for more information.

BiobakeryUtils.humann_profile โ€” Method
humann_profile(path::AbstractString; sample=basename(first(splitext(path))), stratified=false)

Load a single functional profile generated by HUMAnN. By default, skips rows that have species-stratified content, use stratified=true to keep them.

BiobakeryUtils.humann_regroup โ€” Method
function humann_regroup(comm::CommunityProfile; inkind="uniref90", outkind::String="ec")

Wrapper for humann_regroup_table script to convert table from one kind of functional mapping to another.

Requires installation of humann available in ENV["PATH"]. See "Using Conda" for more information.

BiobakeryUtils.humann_rename โ€” Method
humann_rename(comm::AbstractDataFrame; kind::String="ec")

Wrapper for humann_rename_table script, returning a CommunityProfile with re-named features.

Requires installation of humann available in ENV["PATH"]. See "Using Conda" for more information.

BiobakeryUtils.humann_renorm โ€” Method
humann_renorm(comm::AbstractDataFrame; units::String="cpm")

Wrapper for humann_renorm_table script, to renormalize from RPKM (reads per kilobase per million) to "cpm" (counts per million) or "relab" (relative abundance).

Requires installation of humann available in ENV["PATH"]. See "Using Conda" for more information.

BiobakeryUtils.read_pcl โ€” Method
read_pcl(infile; last_metadata=2)

Reads a PCL file and generates a CommunityProfile with metadata attached to the samples.

last_metadata may be a row number or a string representing the final metadatum.

BiobakeryUtils.write_pcl โ€” Method
write_pcl(infile; usemetadata=:all)

Writes a PCL file from a CommunityProfile with metadata attached to the samples.

usemetadata may be :all or a vector of symbols.