MetaPhlAn Tutorial with BiobakeryUtils.jl

Installation and setup

If you haven't already, check out the "Getting Started" page to install julia, create an environment,xd and install BiobakeryUtils.jl, and hook up or install the MetaPhlAn v3 command line tools.

This tutorial assumes:

  1. You are running julia v1.6 or greater
  2. You have activated a julia Project that has BiobakeryUtils.jl installed
  3. The metaphlan python package is installed, and accessible from your PATH.

If any of those things aren't true, or you don't know if they're true, go back to "Getting Started" to see if you skipped a step. If you're still confused, please ask (see 3rd bullet point at the top)!

Bowtie2 database

The first time you run metaphlan, it needs to download and unpack the marker database. If you don't care where this goes, don't worry about it - by default it will go into a subdirectory of your conda environment.

If your home folder has limited space, or you want to install it to a particular location (eg a faster drive), you can either pass the kewword argument bowtie2db="/path/to/location" to all metaphlan commands, or set the environment variable METAPHLAN_BOWTIE2_DB.

Input files

Official tutorial link

Some example files you can use to run this tutorial are available from the MetaPhlAn repo, and can be downloaded using the Downloads standard library in julia:

julia> using Downloads: download

julia> base_url = "";

julia>  files = [

julia> for file in files
           download(joinpath(base_url, file), file)

julia> readdir()
9-element Vector{String}:

Run a single sample

Official tutorial link

For convenience, this package has the metaphlan() function, which can be used in your julia scripts to build and call the metaphlan command line tool.

For example, rather than call this from the shell:

$ metaphlan SRS014476-Supragingival_plaque.fasta.gz --input_type fasta > SRS014476-Supragingival_plaque_profile.txt

you can instead call this from the julia REPL:

julia> metaphlan("SRS014476-Supragingival_plaque.fasta.gz",
                 "SRS014476-Supragingival_plaque_profile.tsv"; input_type="fasta")

The first time you run this command, metaphlan will download its database and build Bowtie2 indices for aligning marker genes. It may take a while... maybe go for a walk ๐Ÿ™‚.

Output files

Official tutorial link

You can now load this profile using metaphlan_profile:

julia> mp = metaphlan_profile("SRS014476-Supragingival_plaque_profile.tsv"; sample="SRS014476")
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 11 features in 1 samples

Feature names:
Bacteria, Actinobacteria, Actinobacteria...Corynebacterium_matruchotii, Rothia_dentocariosa

Sample names:

This generates a CommunityProfile type from Microbiome.jl, which is a matrix-like object with MicrobiomeSamples as column headers, and Taxons as row headers.

The samples can be accessed with samples or samplenames:

julia> samples(mp)
1-element Vector{MicrobiomeSample}:
 MicrobiomeSample("SRS014476", {})

julia> samplenames(mp)
1-element Vector{String}:

Notice that in addition to the sample name ("SRS014476"), there's an additional field - that's a metadata dictionary that we can add values to.

julia> plaque = first(samples(mp))
MicrobiomeSample("SRS014476", {})

julia> set!(plaque, :STSite, "Supragingival Plaque")
MicrobiomeSample("SRS014476", {:STSite = "Supragingival Plaque"})

The taxa (Microbiome.jl uses the generic term "features") can be accessed with features or featurenames:

julia> features(mp)
11-element Vector{Taxon}:
 Taxon("Bacteria", :kingdom)
 Taxon("Actinobacteria", :phylum)
 Taxon("Actinobacteria", :class)
 Taxon("Corynebacteriales", :order)
 Taxon("Micrococcales", :order)
 Taxon("Corynebacteriaceae", :family)
 Taxon("Micrococcaceae", :family)
 Taxon("Corynebacterium", :genus)
 Taxon("Rothia", :genus)
 Taxon("Corynebacterium_matruchotii", :species)
 Taxon("Rothia_dentocariosa", :species)

julia> featurenames(mp)
11-element Vector{String}:

Run on multiple cores

Official tutorial link

Other keywords can be passed to metaphlan() as well. For example, to speed things up a bit, try nproc=4:

julia> metaphlan("SRS014459-Stool.fasta.gz",
                        "SRS014459-Stool_profile.tsv"; input_type="fasta", nproc=4)
[ Info: Running command: metaphlan SRS014459-Stool.fasta.gz SRS014459-Stool_profile.tsv --input_type fasta --nproc 4
# ... 

Run multiple samples

Official tutorial link

You can use julia to run metaphlan in a loop to do the rest of the files. Here, we find the output path by replacing .fasta.gz with _profile.tsv, then check if it exists already, and continue if it does (&& means "AND" - if isfile(prof) is true, then it will do the thing on the right).

julia> for f in files
           prof = replace(f, ".fasta.gz"=>"_profile.tsv")
           isfile(prof) && continue
           metaphlan(f, prof; input_type="fasta", nproc=4)
โ”Œ Info: Running command: metaphlan SRS014494-Posterior_fornix.fasta.gz SRS014494-Posterior_fornix_profile.tsv --input_type fasta
โ”” --nproc 4

Merge outputs

Official tutorial link

There are several ways to merge these profiles into one large table. One way is to use, which has a convenient wrapper, metaphlan_merge:

julia> profiles = filter(f-> contains(f, "profile"), readdir())
6-element Vector{String}:

julia> metaphlan_merge(profiles, "merged_abundance_table.tsv")
Process(` -o merged_abundance_table.tsv SRS014459-Stool_profile.tsv SRS014464-Anterior_nares_profile.tsv S
RS014470-Tongue_dorsum_profile.tsv SRS014472-Buccal_mucosa_profile.tsv SRS014476-Supragingival_plaque_profile.tsv SRS014494-Posteri
or_fornix_profile.tsv`, ProcessExited(0))

Then, you can load it as a CommunityProfile using metaphlan_profiles():

julia> mps = metaphlan_profiles("merged_abundance_table.tsv"; samplestart=3)
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 62 features in 6 samples

Feature names:
Bacteria, Actinobacteria, Actinobacteria...Moraxella, Moraxella_nonliquefaciens

Sample names:
SRS014494-Posterior_fornix, SRS014476-Supragingival_plaque, SRS014472-Buccal_mucosa...SRS014464-Anterior_nares, SRS014459-Stool

Note, the samplestart=3 argument is necessary because the second column file generated by contains NCBI taxonomy IDs.

An alternative is to load each individual sample as a CommunityProfile, and then merge them with commjoin:

julia> mps2 = commjoin([metaphlan_profile(p) for p in profiles]...)
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 62 features in 6 samples

Feature names:
Bacteria, Firmicutes, Bacteroidetes...Lactobacillus_crispatus, Lactobacillus_iners

Sample names:
SRS014459-Stool_profile, SRS014464-Anterior_nares_profile, SRS014470-Tongue_dorsum_profile...SRS014476-Supragingival_plaque_profile
, SRS014494-Posterior_fornix_profile

If you'd like that saved to disk, you can use CSV.jl:

julia> using CSV

julia> CSV.write("merged_abundance_table2.tsv", mps2)

One benefit of doing it the later way is that as we're loading the tables, we can attach some metadata to them. For example:

julia> my_profiles = [] # an empty vector

julia> for p in profiles
           file_wo_ext = first(splitext(p))
           (srs, site) = split(file_wo_ext, '-')
           site = replace(site, "_profile"=> "")

           sample = MicrobiomeSample(srs)
           set!(sample, :STSite, site)
           set!(sample, :filename, p)
           push!(my_profiles, metaphlan_profile(p; sample))

julia> mps3 = commjoin(my_profiles...)
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 62 features in 6 samples

Feature names:
Bacteria, Firmicutes, Bacteroidetes...Lactobacillus_crispatus, Lactobacillus_iners

Sample names:
SRS014459, SRS014464, SRS014470...SRS014476, SRS014494

julia> get(mps3)
6-element Vector{NamedTuple{(:sample, :STSite, :filename), Tuple{String, String, String}}}:
 (sample = "SRS014459", STSite = "Stool", filename = "SRS014459-Stool_profile.tsv")
 (sample = "SRS014464", STSite = "Anterior_nares", filename = "SRS014464-Anterior_nares_profile.tsv")
 (sample = "SRS014470", STSite = "Tongue_dorsum", filename = "SRS014470-Tongue_dorsum_profile.tsv")
 (sample = "SRS014472", STSite = "Buccal_mucosa", filename = "SRS014472-Buccal_mucosa_profile.tsv")
 (sample = "SRS014476", STSite = "Supragingival_plaque", filename = "SRS014476-Supragingival_plaque_profile.tsv")
 (sample = "SRS014494", STSite = "Posterior_fornix", filename = "SRS014494-Posterior_fornix_profile.tsv")

See Microbiome.jl docs for more info on metadata and CommunityProfiles)

Analyze Results

Official tutorial link

With the profile loaded, you can use many julia packages to analyze or visualize the results. Inside the CommunityProfile is a sparce matrix, which you can access with abundances. This means that all of julia's powerful statistics and ML libraries are easy to use with your microbiome data.

For more information about indexing and accessing components of the data, see the Microbiome.jl docs

Performing PCoA analysis

To demonstrate this, we'll use a couple of other julia packages to perform and plot a principal coordinates analysis (PCoA): Distances.jl and MulitvariateStats.jl.

(Actually some convenent functions for this are re-exported from Microbiome.jl eg braycurtis and pcoa โ€“ this is just meant to show how easy it is to use the underlying data for whatever you like)

You can install these by opening the Pkg REPL (type ']') and using add:

(my_project) pkg> add MultivariateStats Distances
   Resolving package versions...
    Updating `~/my_project/Project.toml`
  [b4f34e82] + Distances v0.10.4
  [6f286f6a] + MultivariateStats v0.8.0
  No Changes to `~/my_project/Manifest.toml`

julia> # type <backspace> to get back to normal REPL

julia> using MultivariateStats

julia> using Distances

Since abundances gives us a normal matrix, we can easily use all the power of other julia packages to do analysis. Here, we'll create a distance matrix use Bray-Curtis dissimilarity, and then do multi dimensional scaling to get our PCoA ordination[note].

julia> dm = pairwise(BrayCurtis(), abundances(mps3), dims=2) # parwise distances of columns (dimension 2)
6ร—6 Matrix{Float64}:
 0.0       0.853268  0.662373  0.758712  0.857143  0.758712
 0.853268  0.0       0.845517  0.841645  0.857143  0.845517
 0.662373  0.845517  0.0       0.77874   0.857143  0.787196
 0.758712  0.841645  0.77874   0.0       0.857143  0.461648
 0.857143  0.857143  0.857143  0.857143  0.0       0.857143
 0.758712  0.845517  0.787196  0.461648  0.857143  0.0

julia> pc = fit(MDS, dm, distances=true)
Classical MDS(indim = NaN, outdim = 5)

For plotting, I use Makie, but there are many other options.

(my_project) pkg> add CairoMakie
# ...

julia> using CairoMakie

julia> sites = [m.STSite for m in get(mps3)] # or get(m, :STSite)
6-element Vector{String}:

julia> clrs = [:lightgreen, :cyan, :dodgerblue, :orange, :salmon, :purple]
6-element Vector{Symbol}:

julia> loads = sqrt.(pc.ฮป)' .* projection(pc) # loadings for pcoa, columns are axes

julia> fig, ax, plt = scatter(loads[:,1], loads[:, 2], color=clrs,
                                  title="Demo PCoA"

julia> leg = Legend(fig[1,2], [MarkerElement(color = c, marker=:circle) for c in clrs], sites)

julia> fig

Metaphlan PCoA

Stacked bar

One more example - let's plot the proportion of different phyla in each sample. First, we'll filter the table to keep only rows that contain phyla. The filter acts on a CommunityProfile by applying the predicate to the features of the profile.

julia> phyl = filter(t-> taxrank(t) == :phylum, mps3)
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 4 features in 6 samples

Feature names:
Firmicutes, Bacteroidetes, Proteobacteria, Actinobacteria

Sample names:
SRS014459, SRS014464, SRS014470...SRS014476, SRS014494
julia> phylumnames = featurenames(phyl)
4-element Vector{String}:

julia> fig2 = Figure()

julia> ax2 = Axis(fig2[1,1], title="Phyla in samples", xticks=(1:nsamples(phyl), sites));

julia> ax2.xticklabelrotation = ฯ€ / 4 # rotations are in radians, type 'ฯ€' by doing `\pi<tab>`

julia> y = Float64[]

julia> for sample in samples(phyl)
           abs = abundances(phyl[:, sample])
           abs = abs ./ sum(abs)
           append!(y, abs)

julia> x = repeat(1:nsamples(phyl), inner=nfeatures(phyl))
24-element Vector{Int64}:

julia> sitenums = repeat(1:nfeatures(phyl), outer=nsamples(phyl))
24-element Vector{Int64}:

julia> barplot!(ax2, x, y, stack=sitenums,
               color=sitenums, colormap=:Accent_5)
Combined{Makie.barplot, Tuple{Vector{Point{2, Float32}}}}

julia> leg = Legend(fig2[1,2], [MarkerElement(color = c, marker=:rect) for c in to_colormap(:Accent_5, 4)], phylumnames)

julia> fig2

Metaphlan PCoA

Functions and Types

BiobakeryUtils.metaphlan โ€” Method
metaphlan(inputfile, outputfile; kwargs...)

Run metaphlan command line tool on inputfile, creating outputfile. Requires metaphlan to be installed and accessible in the PATH (see Getting Started).

metaphlan options can be passed via keyword arguments. For example, if on the command line you would run:

$ metaphlan some.fastq.gz output/some_profile.tsv --input_type fastq --nproc 8

using this function, you would write:

metaphlan("some.fastq.gz", "output/some_profile.tsv"; input_type="fastq", nproc=8)

Note: the input_type keyword is required.

Set the environmental variable "METAPHLAN_BOWTIE2_DB"to specify the location where the markergene database is/will be installed, or passbowtie2db = "some/path"` as a keyword argument.

BiobakeryUtils.metaphlan_merge โ€” Method
metaphlan_merge(paths, outputfile; kwargs...)

Run merge_metaphlan_tables command line tool on the files in paths, creating outputfile. Requires metaphlan to be installed and accessible in the PATH (see Getting Started).

BiobakeryUtils.metaphlan_profile โ€” Function
metaphlan_profile(path::AbstractString, rank::Union{Int, Symbol}=:all; sample::AbstractString=basename(first(splitext(path))))

Creates a CommunityProfile from a MetaPhlAn output file. Can select data according to taxonomic rank. If rank not given, all data is used. Sample name of the CommunityProfile can be specified by passing a sample argument. If name not given, the name of the file becomes the Sample name.

Levels may be given either as numbers or symbols:

  • 1 = :kingdom
  • 2 = :phylum
  • 3 = :class
  • 4 = :order
  • 5 = :family
  • 6 = :genus
  • 7 = :species
  • 8 = :subspecies
BiobakeryUtils.metaphlan_profiles โ€” Function
metaphlan_profiles(path::AbstractString, rank::Union{Int, Symbol}=:all; keepunidentified=false)

Reads in MetaPhlAn profiles from a merged table into a CommunityProfile. Can select data according to taxonomic rank. If rank not given, all data is used. Set keepunidentified flag to true to keep UNIDENTIFIED data.

Levels may be given either as numbers or symbols:

  • 1 = :kingdom
  • 2 = :phylum
  • 3 = :class
  • 4 = :order
  • 5 = :family
  • 6 = :genus
  • 7 = :species
  • 8 = :subspecies
BiobakeryUtils.metaphlan_profiles โ€” Function
metaphlan_profiles(paths::Array{<:AbstractString, 1}, rank::Union{Int, Symbol}=:all)

Combines MetaPhlAn profiles from multiple single tables into a CommunityProfile.

BiobakeryUtils.parsetaxa โ€” Method
parsetaxa(taxstring::AbstractString; throw_error=true)

Given a string representing taxonmic ranks as formatted by MetaPhlAn (eg "kBacteria|pProteobacteria..."), separates taxonomic ranks into elements of type Taxon in a vector.

BiobakeryUtils.parsetaxon โ€” Method
parsetaxon(taxstring::AbstractString, rank::Union{Int, Symbol})

Finds given taxonomic rank in a string (as formatted by MetaPhlAn (eg "kBacteria|pProteobacteria...")) and returns the name and taxonomic rank as a Taxon. If taxon rank not given, function will return the most specific (lowest) taxonomic rank available.

Levels may be given either as numbers or symbols:

  • 1 = :kingdom
  • 2 = :phylum
  • 3 = :class
  • 4 = :order
  • 5 = :family
  • 6 = :genus
  • 7 = :species
  • 8 = :subspecies
  • noteRight now, the table contains all taxonomic levels, so this doesn't make much sense. For a real analysis, you'd probably want to restrict to a single rank (eg species). You can easily do this with filter: spec = filter(t-> taxrank(t) == :species, mps3). We're not doing this for the demo dataset, because there's not any taxonomic overlap at the species level.