Gut-to-soil axis 16S rRNA analysis tutorial 💩🌱 - The gut-to-soil axis microbiome analysis tutorial

Background¶

In this tutorial you’ll learn an end-to-end microbiome marker-gene data science workflow, building on data presented in Meilander et al. (2024): Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. The data used here is a subset (a single sequencing run) of that generated for the paper, specifically selected so that this tutorial can be run quickly on a personal computer. The full data set for the paper can be found in the study’s Artifact Repository. In the final step (not yet written, as of 17 April 2025), you’ll learn how to adapt the workflow for use in analyzing your own data using Provenance Replay.

The data used in this tutorial was generated using the Earth Microbiome Project protocol. Specifically, the hypervariable region 4 (V4) of the 16S rRNA gene was amplified using the F515-R806 primers - a broad-coverage primer pair for Bacteria that also amplifies some Archaea. Paired-end sequencing was performed on an Illumina MiSeq. Full details are presented in Meilander et al. (2024).

Sample metadata¶

Before starting the analysis, explore the sample metadata to familiarize yourself with the samples used in this study. The following command will download the sample metadata as tab-separated text and save it in the file sample-metadata.tsv. This sample-metadata.tsv file is used throughout the rest of the tutorial.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

wget -O 'sample-metadata.tsv' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'

from qiime2 import Metadata
from urllib import request

url = 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'
fn = 'sample-metadata.tsv'
request.urlretrieve(url, fn)
sample_metadata_md = Metadata.load(fn)

library(reticulate)

Metadata <- import("qiime2")$Metadata
request <- import("urllib")$request

url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'
fn <- 'sample-metadata.tsv'
request$urlretrieve(url, fn)
sample_metadata_md <- Metadata$load(fn)

sample-metadata.tsv | download

QIIME 2’s metadata plugin provides a Visualizer called tabulate that generates a convenient view of a sample metadata file. Let’s run this, and then we’ll look at the result. Here’s the first QIIME 2 command that you should run in this tutorial:

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime metadata tabulate \
  --m-input-file sample-metadata.tsv \
  --o-visualization sample-metadata.qzv

Using the qiime2 metadata tabulate tool:

For "input":
- Perform the following steps.
  1. Leave as Metadata from TSV
  2. Set "Metadata Source" to sample-metadata.tsv
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 metadata tabulate [...] : visualization.qzv`	`sample-metadata.qzv`

sample-metadata.qzv | download | view

This will generate a QIIME 2 Visualization. Visualizations can be viewed by loading them with QIIME 2 View. Navigate to QIIME 2 View, and drag and drop the visualization that was created to view it.

Access already-imported QIIME 2 data¶

This tutorial begins with paired-end read sequencing data that has already been demultiplexed and imported into a QIIME 2 Artifact. Because sequence data can be delivered to you in many different forms, it’s not possible to cover the varieties here. Instead we refer you to How to import data for use with QIIME 2 to learn how to import your data. If you want to learn why importing is necessary, refer to Why importing is necessary.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

wget -O 'demux.qza' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/demux.qza'

Artifact <- import("qiime2")$Artifact

url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/demux.qza'
fn <- 'demux.qza'
request$urlretrieve(url, fn)
demux <- Artifact$load(fn)

demux.qza | download | view

Summarize demultiplexed sequences¶

When you have demultiplexed sequence data, the next step is typically to generate a visual summary of it. This allows you to determine how many sequences were obtained per sample, and also to get a summary of the distribution of sequence qualities at each position in your sequence data.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

Using the qiime2 demux summarize tool:

Set "data" to #: demux.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 demux summarize [...] : visualization.qzv`	`demux.qzv`

demux.qzv | download | view

Upstream data analysis¶

Generally, the term “upstream” is used to refer to data analysis pre-feature-asv_table, and “downstream” is used to refer to data analysis post-feature-asv_table. Let’s jump into our upstream analysis.

Sequence quality control and feature table construction¶

QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering. In this tutorial we present this step using DADA2. The result of this method will be a FeatureTable[Frequency] QIIME 2 artifact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent.

DADA2 is a pipeline for detecting and correcting (where possible) Illumina amplicon sequence data. As implemented in the dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, filter chimeric sequences, and merge paired end reads.

The denoise-paired action, which we’ll use here, requires four parameters that are used in quality filtering:

trim-left-f a, which trims off the first a bases of each forward read
trunc-len-f b which truncates each forward read at position b
trim-left-r c, which trims off the first c bases of each forward read
trunc-len-r d which truncates each forward read at position d This allows the user to remove low quality regions of the sequences. To determine what values to pass for these parameters, you should review the Interactive Quality Plot tab in the demux.qzv file that was generated above.

Solution to Exercise 2

The quality of the initial bases seems to be high, so I choose not to trim any bases from the beginning of the sequences. The quality also seems good all the way out to the end, though maybe dropping off after 250 bases. I’ll therefore truncate at 250. I’ll keep these values the same for both the forward and reverse reads, though that is not a requirement.

Now run your DADA2 command. This step may take up to 10 minutes to complete - it’s the longest running step in this tutorial.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 250 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 250 \
  --o-representative-sequences asv-seqs.qza \
  --o-table asv-table.qza \
  --o-denoising-stats stats.qza

import qiime2.plugins.dada2.actions as dada2_actions

asv_table, asv_seqs, stats = dada2_actions.denoise_paired(
    demultiplexed_seqs=demux,
    trim_left_f=0,
    trunc_len_f=250,
    trim_left_r=0,
    trunc_len_r=250,
)

Using the qiime2 dada2 denoise-paired tool:

Set "demultiplexed_seqs" to #: demux.qza
Set "trunc_len_f" to 250
Set "trunc_len_r" to 250
Expand the additional options section
1. Leave "trim_left_f" as its default value of 0
2. Leave "trim_left_r" as its default value of 0
Press the Execute button.

Once completed, for each new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 dada2 denoise-paired [...] : table.qza`	`asv-table.qza`
`#: qiime2 dada2 denoise-paired [...] : representative_sequences.qza`	`asv-seqs.qza`
`#: qiime2 dada2 denoise-paired [...] : denoising_stats.qza`	`stats.qza`

dada2_actions <- import("qiime2.plugins.dada2.actions")

action_results <- dada2_actions$denoise_paired(
    demultiplexed_seqs=demux,
    trim_left_f=0L,
    trunc_len_f=250L,
    trim_left_r=0L,
    trunc_len_r=250L,
)
asv_seqs <- action_results$representative_sequences
asv_table <- action_results$table
stats <- action_results$denoising_stats

asv_seqs, asv_table, stats = use.action(
    use.UsageAction(plugin_id='dada2',
                    action_id='denoise_paired'),
    use.UsageInputs(demultiplexed_seqs=demux,
                    trim_left_f=0,
                    trunc_len_f=250,
                    trim_left_r=0,
                    trunc_len_r=250),
    use.UsageOutputNames(representative_sequences='asv_seqs',
                         table='asv_table',
                         denoising_stats='stats'))

asv-seqs.qza | download | view
asv-table.qza | download | view
stats.qza | download | view

One of the outputs created by DADA2 is a summary of the denoising run. That is generated as an Artifact, so can’t be viewed directly. However this is one of many QIIME 2 types that can be viewed as Metadata - a very powerful concept that we’ll use again later in this tutorial. Learning to view artifacts as Metadata creates nearly infinite possibilities for how you can explore your microbiome data with QIIME 2.

Here, we’ll again use the metadata plugins tabulate visualizer, but this time we’ll apply it to the DADA2 statistics.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv

Using the qiime2 metadata tabulate tool:

For "input":
- Perform the following steps.
  1. Change to Metadata from Artifact
  2. Set "Metadata Source" to stats.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 metadata tabulate [...] : visualization.qzv`	`stats.qzv`

stats_as_md = use.view_as_metadata('stats_as_md', stats)

use.action(
    use.UsageAction(plugin_id='metadata',
                    action_id='tabulate'),
    use.UsageInputs(input=stats_as_md),
    use.UsageOutputNames(visualization='stats'))

stats.qzv | download | view

Solution to Exercise 4

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime metadata tabulate \
  --m-input-file sample-metadata.tsv stats.qza \
  --o-visualization sample-metadata-w-dada2-stats.qzv

Using the qiime2 metadata tabulate tool:

For "input":
1. Perform the following steps.
  1. Leave as Metadata from TSV
  2. Set "Metadata Source" to sample-metadata.tsv
2. Press the + Insert input button to set up the next steps.
  1. Change to Metadata from Artifact
  2. Set "Metadata Source" to stats.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 metadata tabulate [...] : visualization.qzv`	`sample-metadata-w-dada2-stats.qzv`

sample_metadata_and_dada2_stats_md_md <- sample_metadata_md$merge(stats_as_md_md)
action_results <- metadata_actions$tabulate(
    input=sample_metadata_and_dada2_stats_md_md,
)
sample_metadata_w_dada2_stats_viz <- action_results$visualization

sample_metadata_and_dada2_stats_md = use.merge_metadata('sample_metadata_and_dada2_stats_md', sample_metadata, stats_as_md)

use.action(
    use.UsageAction(plugin_id='metadata',
                    action_id='tabulate'),
    use.UsageInputs(input=sample_metadata_and_dada2_stats_md),
    use.UsageOutputNames(visualization='sample_metadata_w_dada2_stats'))

sample-metadata-w-dada2-stats.qzv | download | view

Feature table and feature data summaries¶

After DADA2 completes, you’ll want to explore the resulting data. You can do this using the following two commands, which will create visual summaries of the data. The feature-table summarize action command will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table summarize-plus \
  --i-table asv-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary asv-table.qzv \
  --o-sample-frequencies sample-frequencies.qza \
  --o-feature-frequencies asv-frequencies.qza

import qiime2.plugins.feature_table.actions as feature_table_actions

asv_frequencies, sample_frequencies, asv_table_viz = feature_table_actions.summarize_plus(
    table=asv_table,
    metadata=sample_metadata_md,
)

Using the qiime2 feature-table summarize-plus tool:

Set "table" to #: asv-table.qza
Expand the additional options section
- For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Leave as Metadata from TSV
    2. Set "Metadata Source" to sample-metadata.tsv
Press the Execute button.

Once completed, for each new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza`	`asv-frequencies.qza`
`#: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza`	`sample-frequencies.qza`
`#: qiime2 feature-table summarize-plus [...] : summary.qzv`	`asv-table.qzv`

feature_table_actions <- import("qiime2.plugins.feature_table.actions")

action_results <- feature_table_actions$summarize_plus(
    table=asv_table,
    metadata=sample_metadata_md,
)
asv_table_viz <- action_results$summary
sample_frequencies <- action_results$sample_frequencies
asv_frequencies <- action_results$feature_frequencies

_, _, asv_frequencies = use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='summarize_plus'),
    use.UsageInputs(table=asv_table,
                    metadata=sample_metadata),
    use.UsageOutputNames(summary='asv_table',
                         sample_frequencies='sample_frequencies',
                         feature_frequencies='asv_frequencies'))

asv-table.qzv | download | view
sample-frequencies.qza | download | view
asv-frequencies.qza | download | view

The feature-table tabulate-seqs action command will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database. We can also include the feature frequency information in this visualization by passing it as metadata, similar to how we merged metadata in the exercise above. In this case, however, we’re looking a feature metadata, as opposed to sample metadata. As far as QIIME 2 is concerned, there is no difference between these two - in our case, it’ll only be the identifiers that differ.

This visualization will be very useful later in the tutorial, when you want to learn more about specific features that are important in the data set.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table tabulate-seqs \
  --i-data asv-seqs.qza \
  --m-metadata-file asv-frequencies.qza \
  --o-visualization asv-seqs.qzv

Using the qiime2 feature-table tabulate-seqs tool:

Set "data" to #: asv-seqs.qza
Expand the additional options section
- For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Change to Metadata from Artifact
    2. Set "Metadata Source" to asv-frequencies.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table tabulate-seqs [...] : visualization.qzv`	`asv-seqs.qzv`

asv_frequencies_md_md <- asv_frequencies$view(Metadata)
action_results <- feature_table_actions$tabulate_seqs(
    data=asv_seqs,
    metadata=asv_frequencies_md_md,
)
asv_seqs_viz <- action_results$visualization

asv_frequencies_as_md = use.view_as_metadata('asv_frequencies_md',
                                                 asv_frequencies)

use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='tabulate_seqs'),
    use.UsageInputs(data=asv_seqs,
                    metadata=asv_frequencies_as_md),
    use.UsageOutputNames(visualization='asv_seqs'),
)

asv-seqs.qzv | download | view

Filtering features from a feature table¶

If you review the tabulated feature sequences, or the feature detail table of the feature table summary, you’ll notice that there are many sequences that are observed in only a single sample. Let’s filter those out to reduce the number of sequences we’re working with - this will speed up several slower steps that are coming up.

This is a two-step process. First we filter our feature table, and then we use the new feature table to filter our sequences to only the ones that are contained in the new table.

Tip

I include _ms2 in the new output names here to remind us that these are filtered to those features in a minimum of 2 samples. Generally speaking, file names are a convenient place to store information like this, but they’re unreliable. File names can easily be changed, and therefore could be modified to contain inaccurate information about the data they contain.

Luckily, QIIME 2’s provenance tracking system records all of the information that we need about how results were generated. We’re therefore free to include information like this in file names if it’s helpful for us, but we shouldn’t ever rely on the file names. If in doubt -- and always be in doubt 🕵️‍♀️ -- refer to the data provenance.

QIIME 2’s data provenance is your source for the truth about how a result was created.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table filter-features \
  --i-table asv-table.qza \
  --p-min-samples 2 \
  --o-filtered-table asv-table-ms2.qza

Using the qiime2 feature-table filter-features tool:

Set "table" to #: asv-table.qza
Expand the additional options section
- Set "min_samples" to 2
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table filter-features [...] : filtered_table.qza`	`asv-table-ms2.qza`

asv_table_ms2, = use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='filter_features'),
    use.UsageInputs(table=asv_table,
                    min_samples=2),
    use.UsageOutputNames(filtered_table='asv_table_ms2'),
)

asv-table-ms2.qza | download | view

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table filter-seqs \
  --i-data asv-seqs.qza \
  --i-table asv-table-ms2.qza \
  --o-filtered-data asv-seqs-ms2.qza

Using the qiime2 feature-table filter-seqs tool:

Set "data" to #: asv-seqs.qza
Expand the additional options section
- Set "table" to #: asv-table-ms2.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table filter-seqs [...] : filtered_data.qza`	`asv-seqs-ms2.qza`

asv_seqs_ms2, = use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='filter_seqs'),
    use.UsageInputs(data=asv_seqs,
                    table=asv_table_ms2),
    use.UsageOutputNames(filtered_data='asv_seqs_ms2'),
)

asv-seqs-ms2.qza | download | view

Solution to Exercise 7

Here’s the command you would use:

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table summarize-plus \
  --i-table asv-table-ms2.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary asv-table-ms2.qzv \
  --o-sample-frequencies sample-frequencies-ms2.qza \
  --o-feature-frequencies asv-frequencies-ms2.qza

Using the qiime2 feature-table summarize-plus tool:

Set "table" to #: asv-table-ms2.qza
Expand the additional options section
- For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Leave as Metadata from TSV
    2. Set "Metadata Source" to sample-metadata.tsv
Press the Execute button.

Once completed, for each new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza`	`asv-frequencies-ms2.qza`
`#: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza`	`sample-frequencies-ms2.qza`
`#: qiime2 feature-table summarize-plus [...] : summary.qzv`	`asv-table-ms2.qzv`

action_results <- feature_table_actions$summarize_plus(
    table=asv_table_ms2,
    metadata=sample_metadata_md,
)
asv_table_ms2_viz <- action_results$summary
sample_frequencies_ms2 <- action_results$sample_frequencies
asv_frequencies_ms2 <- action_results$feature_frequencies

_, _, asv_frequencies_ms2 = use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='summarize_plus'),
    use.UsageInputs(table=asv_table_ms2,
                    metadata=sample_metadata),
    use.UsageOutputNames(summary='asv_table_ms2',
                         sample_frequencies='sample_frequencies_ms2',
                         feature_frequencies='asv_frequencies_ms2'))

asv-table-ms2.qzv | download | view
sample-frequencies-ms2.qza | download | view
asv-frequencies-ms2.qza | download | view

Be sure to run this as we’re going to use one of the results below.

Taxonomic annotation¶

Before we complete our upstream analysis steps, we’ll generate taxonomic annotations for our sequences using the feature-classifier plugin.

Note

The taxonomic classifier used here is very specific to the sample preparation and sequencing protocols used for this study, and Greengenes 13_8, which it is trained on, is an outdated reference database. The reason we use it here is because the reference data is relatively small, so classification can be run on most modern computers with this classifier^[1].

When you’re ready to work on your own data, one of the choices you’ll need to make is what classifier to use for your data. You can find pre-trained classifiers the QIIME 2 Library Resources page. If you don’t find a classifier that will work for you there, you may be able to find one on the Forum, or you can learn how to train your own by referencing the RESCRIPt documentation.

We strongly recommend the use of environment-weighted classifiers, as described in Kaehler et al., (2019), and you can find builds of these on the Resources page.

First, we’ll download a pre-trained classifier artifact.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

wget -O 'suboptimal-16S-rRNA-classifier.qza' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'

url = 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn = 'suboptimal-16S-rRNA-classifier.qza'
request.urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier = Artifact.load(fn)

url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn <- 'suboptimal-16S-rRNA-classifier.qza'
request$urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier <- Artifact$load(fn)

def classifier_factory():
    from urllib import request
    from qiime2 import Artifact
    fp, _ = request.urlretrieve(
        'https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes/gg-13-8-99-515-806-nb-classifier.qza')

    return Artifact.load(fp)

classifier = use.init_artifact('suboptimal-16S-rRNA-classifier', classifier_factory)

suboptimal-16S-rRNA-classifier.qza | download | view

Then, we’ll apply it to our sequences using classify-sklearn.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-classifier classify-sklearn \
  --i-classifier suboptimal-16S-rRNA-classifier.qza \
  --i-reads asv-seqs-ms2.qza \
  --o-classification taxonomy.qza

import qiime2.plugins.feature_classifier.actions as feature_classifier_actions

taxonomy, = feature_classifier_actions.classify_sklearn(
    classifier=suboptimal_16S_rRNA_classifier,
    reads=asv_seqs_ms2,
)

Using the qiime2 feature-classifier classify-sklearn tool:

Set "reads" to #: asv-seqs-ms2.qza
Set "classifier" to #: suboptimal-16S-rRNA-classifier.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-classifier classify-sklearn [...] : classification.qza`	`taxonomy.qza`

feature_classifier_actions <- import("qiime2.plugins.feature_classifier.actions")

action_results <- feature_classifier_actions$classify_sklearn(
    classifier=suboptimal_16S_rRNA_classifier,
    reads=asv_seqs_ms2,
)
taxonomy <- action_results$classification

taxonomy, = use.action(
    use.UsageAction(plugin_id='feature_classifier',
                    action_id='classify_sklearn'),
    use.UsageInputs(classifier=classifier,
                    reads=asv_seqs_ms2),
    use.UsageOutputNames(classification='taxonomy'))

taxonomy.qza | download | view

Then, to get an initial look at our taxonomic classifications, let’s integrate taxonomy in the sequence summary, like the one we generated above.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table tabulate-seqs \
  --i-data asv-seqs-ms2.qza \
  --i-taxonomy Greengenes_13_8:taxonomy.qza \
  --m-metadata-file asv-frequencies-ms2.qza \
  --o-visualization asv-seqs-ms2.qzv

from qiime2 import ResultCollection

asv_frequencies_md = asv_frequencies_ms2.view(Metadata)
taxonomy_collection_artifact_collection = ResultCollection({
    'Greengenes_13_8': taxonomy,
})
asv_seqs_ms2_viz, = feature_table_actions.tabulate_seqs(
    data=asv_seqs_ms2,
    taxonomy=taxonomy_collection_artifact_collection,
    metadata=asv_frequencies_md,
)

Using the qiime2 feature-table tabulate-seqs tool:

Set "data" to #: asv-seqs-ms2.qza
Expand the additional options section
1. Set "taxonomy" to #: taxonomy-collection/
2. For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Change to Metadata from Artifact
    2. Set "Metadata Source" to asv-frequencies-ms2.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table tabulate-seqs [...] : visualization.qzv`	`asv-seqs-ms2.qzv`

ResultCollection <- import("qiime2")$ResultCollection

asv_frequencies_md <- asv_frequencies_ms2$view(Metadata)
taxonomy_collection_artifact_collection = ResultCollection(dict(
    'Greengenes_13_8' = taxonomy
))
action_results <- feature_table_actions$tabulate_seqs(
    data=asv_seqs_ms2,
    taxonomy=taxonomy_collection_artifact_collection,
    metadata=asv_frequencies_md,
)
asv_seqs_ms2_viz <- action_results$visualization

asv_frequencies_ms2_as_md = use.view_as_metadata('asv_frequencies',
                                                 asv_frequencies_ms2)

taxonomy_collection = use.construct_artifact_collection(
    'taxonomy_collection', {'Greengenes_13_8': taxonomy}
)

use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='tabulate_seqs'),
    use.UsageInputs(data=asv_seqs_ms2,
                    taxonomy=taxonomy_collection,
                    metadata=asv_frequencies_ms2_as_md),
    use.UsageOutputNames(visualization='asv_seqs_ms2'),
)

asv-seqs-ms2.qzv | download | view

Building a tree for phylogenetic diversity calculations¶

QIIME supports several phylogenetic diversity metrics, including Faith’s Phylogenetic Diversity and weighted and unweighted UniFrac. In addition to counts of features per sample (i.e., the data in the FeatureTable[Frequency] QIIME 2 artifact), these metrics require a rooted phylogenetic tree relating the features to one another. The amplicon distribution offers a few ways to build these trees, including a reference-based approach in the fragment-insertion plugin and de novo (i.e., reference-free) approaches in the phylogeny plugin.

The reference based approach, by default, is specific to 16S rRNA marker gene analysis. We could use that here, but the runtime is too long for our documentation.^[1] If you’d like to see this demonstrated, you can refer to the Parkinson’s Mouse tutorial.

The de novo approach is known to generate low quality trees when very short sequences are used as input, but can be used with any phylogenetically informative marker gene (not just 16S). If you’d like to see this demonstrated, you can refer to the Moving Pictures tutorial.

For those reasons, we’re going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here.

Downstream data analysis¶

As mentioned above, we tend to think of “downstream” analysis as beginning with a feature table, taxonomic annotation of our features, and optionally a phylogenetic tree. Now that we have those (with the exception of the tree, which we won’t use here), let’s jump in. This is where it starts to get fun! ⛷️

Kmer-based diversity analysis¶

As mentioned above, we’re going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here. This will use two stand-alone QIIME 2 plugins, q2-boots and q2-kmerizer, which are integrated through the kmer-diversity action in q2-boots. To learn about kmerization of features and how this relates to phylogenetic diversity metrics, read the q2-kmerizer paper. q2-boots provides actions that mirror the interface of the diversity metric calculation actions in the diversity plugin, but generates more robust results because it integrates rarefaction and/or bootstrapping. You can learn more about this in the q2-boots paper.

kmer-diversity is a qiime2.Pipeline: a type of action that links multiple other QIIME 2 actions together for convenience. As such, it does a lot of work. Here are the steps that it takes:

Resample the input feature table (i.e., the ASV table) to contain exactly sampling-depth sequences per sample, either by bootstrapping or rarefaction, n times. Samples with fewer than sampling-depth sequences will be removed from the feature table and not included in the subsequent steps. This will result in n feature tables.
For each feature table resulting from step 1, using the input sequences (i.e., the ASV sequences), kmerize all sequences into kmers of length kmer-size. ^[2] Use this information to create one kmer table per resampled feature table. This will result in n feature tables, where the features are kmers (instead of ASVs, as in the input feature table).
Compute the user-requested alpha- and beta-diversity metrics on each of the kmer tables resulting from step 2. ^[3] The metrics computed by default are:
- Alpha diversity
  - Shannon’s diversity index (a quantitative measure of community richness)
  - Observed Features (a qualitative measure of community richness)
  - Evenness (i.e., Pielou’s Evenness; a measure of community evenness)
- Beta diversity
  - Jaccard distance (a qualitative measure of community dissimilarity)
  - Bray-Curtis distance (a quantitative measure of community dissimilarity)
For each diversity metric, average the results computed across the n kmer tables. These results can be used in subsequent analysis steps (e.g., ordination, statistical modeling, machine learning).
Perform PCoA ordination on the averaged beta diversity distance matrices resulting from Step 4. ^[4]
Generate an interactive q2-vizard scatter plot that contains all user-provided sample metadata, all averaged alpha diversity metrics, and the first three ordination axes for all PCoA matrices computed in step 5.

A key parameter that needs to be provided to kmer-diversity is sampling-depth, which is the even sampling (i.e., bootstrapping or rarefaction) depth. Because most diversity metrics are sensitive to different sampling depths (i.e., sequence counts) across different samples, the tables are randomly subsampled such that the total frequency for each sample is the user-specified sampling depth. For example, if you set sampling-depth=500, this step will subsample the sequence counts in each sample so that each sample in the resulting table has a total frequency of 500. If the total frequency (i.e., the number of sequences observed) for any sample(s) is smaller than this value, those samples will be dropped from the diversity analysis. Choosing this value is tricky. We recommend making your choice by reviewing the information presented in the asv-table-ms2.qzv file that you created above.

I’m going to choose values that is around the first quartile of the sample total frequencies.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime boots kmer-diversity \
  --i-table asv-table-ms2.qza \
  --i-sequences asv-seqs-ms2.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-sampling-depth 96 \
  --p-n 10 \
  --p-replacement \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --output-dir boots-kmer-diversity

import qiime2.plugins.boots.actions as boots_actions

action_results = boots_actions.kmer_diversity(
    table=asv_table_ms2,
    sequences=asv_seqs_ms2,
    metadata=sample_metadata_md,
    sampling_depth=96,
    n=10,
    replacement=True,
    alpha_average_method='median',
    beta_average_method='medoid',
)
bootstrap_tables_artifact_collection = action_results.resampled_tables
kmer_tables_artifact_collection = action_results.kmer_tables
bootstrap_alpha_diversities_artifact_collection = action_results.alpha_diversities
bootstrap_distance_matrices_artifact_collection = action_results.distance_matrices
bootstrap_pcoas_artifact_collection = action_results.pcoas
kmer_diversity_scatter_plot_viz = action_results.scatter_plot

Using the qiime2 boots kmer-diversity tool:

Set "table" to #: asv-table-ms2.qza
Set "sequences" to #: asv-seqs-ms2.qza
Set "sampling_depth" to 96
For "metadata":
- Perform the following steps.
  1. Leave as Metadata from TSV
  2. Set "Metadata Source" to sample-metadata.tsv
Set "n" to 10
Set "replacement" to Yes
Expand the additional options section
1. Leave "alpha_average_method" as its default value of median
2. Set "beta_average_method" to medoid
Press the Execute button.

Once completed, for each new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 boots kmer-diversity [...] : resampled_tables.qza`	`bootstrap-tables/`
`#: qiime2 boots kmer-diversity [...] : kmer_tables.qza`	`kmer-tables/`
`#: qiime2 boots kmer-diversity [...] : alpha_diversities.qza`	`bootstrap-alpha-diversities/`
`#: qiime2 boots kmer-diversity [...] : distance_matrices.qza`	`bootstrap-distance-matrices/`
`#: qiime2 boots kmer-diversity [...] : pcoas.qza`	`bootstrap-pcoas/`
`#: qiime2 boots kmer-diversity [...] : scatter_plot.qzv`	`kmer-diversity-scatter-plot.qzv`

boots_actions <- import("qiime2.plugins.boots.actions")

action_results <- boots_actions$kmer_diversity(
    table=asv_table_ms2,
    sequences=asv_seqs_ms2,
    metadata=sample_metadata_md,
    sampling_depth=96L,
    n=10L,
    replacement=TRUE,
    alpha_average_method='median',
    beta_average_method='medoid',
)
bootstrap_tables_artifact_collection <- action_results$resampled_tables
kmer_tables_artifact_collection <- action_results$kmer_tables
bootstrap_alpha_diversities_artifact_collection <- action_results$alpha_diversities
bootstrap_distance_matrices_artifact_collection <- action_results$distance_matrices
bootstrap_pcoas_artifact_collection <- action_results$pcoas
kmer_diversity_scatter_plot_viz <- action_results$scatter_plot

use.action(
    use.UsageAction(plugin_id='boots',
                    action_id='kmer_diversity'),
    use.UsageInputs(table=asv_table_ms2,
                    sequences=asv_seqs_ms2,
                    metadata=sample_metadata,
                    sampling_depth=96,
                    n=10,
                    replacement=True,
                    alpha_average_method='median',
                    beta_average_method='medoid'),
    use.UsageOutputNames(
        resampled_tables='bootstrap_tables',
        kmer_tables='kmer_tables',
        alpha_diversities='bootstrap_alpha_diversities',
        distance_matrices='bootstrap_distance_matrices',
        pcoas='bootstrap_pcoas',
        scatter_plot='kmer_diversity_scatter_plot')
)

boots-kmer-diversity/resampled_tables | download
boots-kmer-diversity/kmer_tables | download
boots-kmer-diversity/alpha_diversities | download
boots-kmer-diversity/distance_matrices | download
boots-kmer-diversity/pcoas | download
boots-kmer-diversity/scatter_plot.qzv | download | view

After computing diversity metrics, we can begin to explore the microbial composition of the samples in the context of the sample metadata. You can review the sample metadata using one of the tabulated views of this file that we created above.

Alpha rarefaction plotting¶

In this section we’ll explore alpha diversity as a function of sampling depth using thealpha-rarefaction action. This visualizer computes one or more alpha diversity metrics at multiple sampling depths, in steps between 1 (optionally controlled with min-depth) and the value provided as max-depth. At each sampling depth step, 10 rarefied tables will be generated, and the diversity metrics will be computed for all samples in the tables. The number of iterations (rarefied tables computed at each sampling depth) can be controlled with the iterations parameter. Average diversity values will be plotted for each sample at each even sampling depth, and samples can be grouped based on metadata in the resulting visualization if sample metadata is provided.

The value that you provide for max-depth should be determined by reviewing the “Frequency per sample” information presented in the asv-table-ms2.qzv file. In general, choosing a value that is somewhere around the median frequency seems to work well, but you may want to increase that value if the lines in the resulting rarefaction plot don’t appear to be leveling out, or decrease that value if you seem to be losing many of your samples due to low total frequencies closer to the minimum sampling depth than the maximum sampling depth.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime diversity alpha-rarefaction \
  --i-table asv-table-ms2.qza \
  --p-max-depth 260 \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization alpha-rarefaction.qzv

import qiime2.plugins.diversity.actions as diversity_actions

alpha_rarefaction_viz, = diversity_actions.alpha_rarefaction(
    table=asv_table_ms2,
    max_depth=260,
    metadata=sample_metadata_md,
)

Using the qiime2 diversity alpha-rarefaction tool:

Set "table" to #: asv-table-ms2.qza
Set "max_depth" to 260
Expand the additional options section
- For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Leave as Metadata from TSV
    2. Set "Metadata Source" to sample-metadata.tsv
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 diversity alpha-rarefaction [...] : visualization.qzv`	`alpha-rarefaction.qzv`

diversity_actions <- import("qiime2.plugins.diversity.actions")

action_results <- diversity_actions$alpha_rarefaction(
    table=asv_table_ms2,
    max_depth=260L,
    metadata=sample_metadata_md,
)
alpha_rarefaction_viz <- action_results$visualization

use.action(
    use.UsageAction(plugin_id='diversity',
                    action_id='alpha_rarefaction'),
    use.UsageInputs(table=asv_table_ms2,
                    max_depth=260,
                    metadata=sample_metadata),
    use.UsageOutputNames(visualization='alpha_rarefaction'))

alpha-rarefaction.qzv | download | view

The visualization will have two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to “level out” (i.e., approach a slope of zero) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in the plot don’t level out, this may be because the richness of the samples hasn’t been fully observed yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).

The bottom plot in this visualization is important when grouping samples by metadata. It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth. If a given sampling depth d is larger than the total frequency of a sample s (i.e., the number of sequences that were obtained for sample s), it is not possible to compute the diversity metric for sample s at sampling depth d. If many of the samples in a group have lower total frequencies than d, the average diversity presented for that group at d in the top plot will be unreliable because it will have been computed on relatively few samples. When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable.

Taxonomic analysis¶

Next, we can view the taxonomic composition of our samples with interactive bar plots. Generate those plots with the following command and then open the visualization.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime taxa barplot \
  --i-table asv-table-ms2.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

Using the qiime2 taxa barplot tool:

Set "table" to #: asv-table-ms2.qza
Expand the additional options section
1. Set "taxonomy" to #: taxonomy.qza
2. For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Leave as Metadata from TSV
    2. Set "Metadata Source" to sample-metadata.tsv
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 taxa barplot [...] : visualization.qzv`	`taxa-bar-plots.qzv`

taxa_actions <- import("qiime2.plugins.taxa.actions")

action_results <- taxa_actions$barplot(
    table=asv_table_ms2,
    taxonomy=taxonomy,
    metadata=sample_metadata_md,
)
taxa_bar_plots_viz <- action_results$visualization

use.action(
    use.UsageAction(plugin_id='taxa',
                    action_id='barplot'),
    use.UsageInputs(table=asv_table_ms2,
                    taxonomy=taxonomy,
                    metadata=sample_metadata),
    use.UsageOutputNames(visualization='taxa_bar_plots'))

taxa-bar-plots.qzv | download | view

Solution to Exercise 13

We use the ASV table here because the feature ids in that table are the same as the ones used in the FeatureData[Taxonomy] artifact. Additionally, kmerization of our data is a tool used for computing diversity metrics - not something we generally intend to use throughout our analyses.

Differential abundance testing with ANCOM-BC2¶

ANCOM-BC2 is a compositionally-aware linear regression model that allows testing for differentially abundant features across sample groups while also implementing bias correction. This can be accessed using the ancombc2 action in the composition plugin.

Differential abundance testing is easy to get wrong! ☠️

Accurately identifying individual features that are differentially abundant across sample types in microbiome data is a challenging problem and an open area of research, particularly if you don’t have an a priori hypothesis about which feature(s) are differentially abundant. A q-value that suggests that you’ve identified a feature that is differentially abundant across sample groups should be considered a hypothesis, not a conclusion, and you need new samples to test that new hypothesis.

In addition to the methods contained in the composition plugin, new approaches for differential abundance testing are regularly introduced. It’s worth assessing the current state of the field when performing differential abundance testing to see if there are new methods that might be useful for your data. If in doubt, consult a statistician.

We’ll perform this analysis in a few steps.

Filter samples from the feature table¶

First, we’ll filter our samples from our feature table such that we only have the three groups that we have the most samples for.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime feature-table filter-samples \
  --i-table asv-table-ms2.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where '[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")' \
  --o-filtered-table asv-table-ms2-dominant-sample-types.qza

asv_table_ms2_dominant_sample_types, = feature_table_actions.filter_samples(
    table=asv_table_ms2,
    metadata=sample_metadata_md,
    where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)

Using the qiime2 feature-table filter-samples tool:

Set "table" to #: asv-table-ms2.qza
Expand the additional options section
1. For "metadata":
  - Press the + Insert metadata button to set up the next steps.
    1. Leave as Metadata from TSV
    2. Set "Metadata Source" to sample-metadata.tsv
2. Set "where" to [SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 feature-table filter-samples [...] : filtered_table.qza`	`asv-table-ms2-dominant-sample-types.qza`

action_results <- feature_table_actions$filter_samples(
    table=asv_table_ms2,
    metadata=sample_metadata_md,
    where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)
asv_table_ms2_dominant_sample_types <- action_results$filtered_table

asv_table_ms2_dominant_sample_types, = use.action(
    use.UsageAction(plugin_id='feature_table',
                    action_id='filter_samples'),
    use.UsageInputs(table=asv_table_ms2,
                    metadata=sample_metadata,
                    where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")'),
    use.UsageOutputNames(filtered_table='asv_table_ms2_dominant_sample_types'))

asv-table-ms2-dominant-sample-types.qza | download | view

Apply differential abundance testing¶

Then, we’ll apply ANCOM-BC2 to see which ASV are differentially abundant across those sample types. I specify a reference level here as this defines what each group is compared against. Since the focus of this study is HEC, I choose that as my reference level. That will let us see what ASVs are over- or under-represented in the other two sample groups (Human Excrement and Food Compost) relative to HEC, as HEC defines the “global intercept” that will be measured against.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime composition ancombc2 \
  --i-table asv-table-ms2-dominant-sample-types.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-fixed-effects-formula SampleType \
  --p-reference-levels 'SampleType::Human Excrement Compost' \
  --o-ancombc2-output ancombc2-results.qza

import qiime2.plugins.composition.actions as composition_actions

ancombc2_results, = composition_actions.ancombc2(
    table=asv_table_ms2_dominant_sample_types,
    metadata=sample_metadata_md,
    fixed_effects_formula='SampleType',
    reference_levels=['SampleType::Human Excrement Compost'],
)

Using the qiime2 composition ancombc2 tool:

Set "table" to #: asv-table-ms2-dominant-sample-types.qza
For "metadata":
- Perform the following steps.
  1. Leave as Metadata from TSV
  2. Set "Metadata Source" to sample-metadata.tsv
Set "fixed_effects_formula" to SampleType
Expand the additional options section
- For "reference_levels":
  1. Set "element" to SampleType::Human Excrement Compost
  2. (Do not insert additional values.)
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 composition ancombc2 [...] : ancombc2_output.qza`	`ancombc2-results.qza`

composition_actions <- import("qiime2.plugins.composition.actions")

action_results <- composition_actions$ancombc2(
    table=asv_table_ms2_dominant_sample_types,
    metadata=sample_metadata_md,
    fixed_effects_formula='SampleType',
    reference_levels=list('SampleType::Human Excrement Compost'),
)
ancombc2_results <- action_results$ancombc2_output

ancombc2_results, = use.action(
    use.UsageAction(plugin_id='composition',
                    action_id='ancombc2'),
    use.UsageInputs(table=asv_table_ms2_dominant_sample_types,
                    metadata=sample_metadata,
                    fixed_effects_formula='SampleType',
                    reference_levels=['SampleType::Human Excrement Compost']),
    use.UsageOutputNames(ancombc2_output='ancombc2_results'))

ancombc2-results.qza | download | view

Finally, we’ll visualize the results. Taxonomic annotations can optionally be provided here: this helps make ASVs ids more interpretable.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime composition ancombc2-visualizer \
  --i-data ancombc2-results.qza \
  --i-taxonomy taxonomy.qza \
  --o-visualization ancombc2-barplot.qzv

Using the qiime2 composition ancombc2-visualizer tool:

Set "data" to #: ancombc2-results.qza
Expand the additional options section
- Set "taxonomy" to #: taxonomy.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 composition ancombc2-visualizer [...] : visualization.qzv`	`ancombc2-barplot.qzv`

use.action(
    use.UsageAction(plugin_id='composition',
                    action_id='ancombc2_visualizer'),
    use.UsageInputs(data=ancombc2_results,
                    taxonomy=taxonomy),
    use.UsageOutputNames(visualization='ancombc2-barplot'))

ancombc2-barplot.qzv | download | view

Solution to Exercise 16

To collapse our ASVs into genera (i.e. level 6 of the Greengenes taxonomy), we can use the following command.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime taxa collapse \
  --i-table asv-table-ms2-dominant-sample-types.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table genus-table-ms2-dominant-sample-types.qza

Using the qiime2 taxa collapse tool:

Set "table" to #: asv-table-ms2-dominant-sample-types.qza
Set "taxonomy" to #: taxonomy.qza
Set "level" to 6
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 taxa collapse [...] : collapsed_table.qza`	`genus-table-ms2-dominant-sample-types.qza`

genus_table_ms2_dominant_sample_types, = use.action(
    use.UsageAction(plugin_id='taxa',
                    action_id='collapse'),
    use.UsageInputs(table=asv_table_ms2_dominant_sample_types,
                    taxonomy=taxonomy,
                    level=6),
    use.UsageOutputNames(collapsed_table='genus_table_ms2_dominant_sample_types'))

genus-table-ms2-dominant-sample-types.qza | download | view

We can then provide the resulting table as the input to ANCOM-BC2.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime composition ancombc2 \
  --i-table genus-table-ms2-dominant-sample-types.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-fixed-effects-formula SampleType \
  --p-reference-levels 'SampleType::Human Excrement Compost' \
  --o-ancombc2-output genus-ancombc2-results.qza

genus_ancombc2_results, = composition_actions.ancombc2(
    table=genus_table_ms2_dominant_sample_types,
    metadata=sample_metadata_md,
    fixed_effects_formula='SampleType',
    reference_levels=['SampleType::Human Excrement Compost'],
)

Using the qiime2 composition ancombc2 tool:

Set "table" to #: genus-table-ms2-dominant-sample-types.qza
For "metadata":
- Perform the following steps.
  1. Leave as Metadata from TSV
  2. Set "Metadata Source" to sample-metadata.tsv
Set "fixed_effects_formula" to SampleType
Expand the additional options section
- For "reference_levels":
  1. Set "element" to SampleType::Human Excrement Compost
  2. (Do not insert additional values.)
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 composition ancombc2 [...] : ancombc2_output.qza`	`genus-ancombc2-results.qza`

action_results <- composition_actions$ancombc2(
    table=genus_table_ms2_dominant_sample_types,
    metadata=sample_metadata_md,
    fixed_effects_formula='SampleType',
    reference_levels=list('SampleType::Human Excrement Compost'),
)
genus_ancombc2_results <- action_results$ancombc2_output

genus_ancombc2_results, = use.action(
    use.UsageAction(plugin_id='composition',
                    action_id='ancombc2'),
    use.UsageInputs(table=genus_table_ms2_dominant_sample_types,
                    metadata=sample_metadata,
                    fixed_effects_formula='SampleType',
                    reference_levels=['SampleType::Human Excrement Compost']),
    use.UsageOutputNames(ancombc2_output='genus_ancombc2_results'))

genus-ancombc2-results.qza | download | view

And finally, we can visualize the results. Notice that in this case we’re not providing the taxonomy, because we’ve already intergrated that information by collapsing at the genus level.

[Command Line]

[Python API]

[Galaxy]

[R API]

[View Source]

qiime composition ancombc2-visualizer \
  --i-data genus-ancombc2-results.qza \
  --o-visualization genus-ancombc2-barplot.qzv

Using the qiime2 composition ancombc2-visualizer tool:

Set "data" to #: genus-ancombc2-results.qza
Press the Execute button.

Once completed, for the new entry in your history, use the Edit button to set the name as follows:

(Renaming is optional, but it will make any subsequent steps easier to complete.)

History Name	"Name" to set (be sure to press [Save])
`#: qiime2 composition ancombc2-visualizer [...] : visualization.qzv`	`genus-ancombc2-barplot.qzv`

use.action(
    use.UsageAction(plugin_id='composition',
                    action_id='ancombc2_visualizer'),
    use.UsageInputs(data=genus_ancombc2_results),
    use.UsageOutputNames(visualization='genus-ancombc2-barplot'))

genus-ancombc2-barplot.qzv | download | view

That’s it for now, but more is coming soon!¶

In the near future (as of 17 April 2025) we plan to integrate analyses using the sample-classifier plugin and longitudinal plugin. In the meantime, here are some suggestions to continue your learning:

Build a machine learning classifier that classifies samples accordining to the three dominant sample types in the feature table that we used with ANCOM-BC. (Hint: see classify-samples.)
Perform a longitudinal analysis that tracks samples from different buckets over time. Which taxa change most over time? (Hint: see feature-volatility.)
Remember that the full data set (five sequencing runs) are available in the gut-to-soil Artifact Repository. Grab one of the larger sequencing runs (we worked with a small sequencing run that was generated as a preliminary test), and adapt the commands in this tutorial to work on a bigger data set.

We’re also in the process of refactoring our statistical methods for assessing alpha and beta diversity across groups, using the new stats plugin. We’re therefore holding off on integrating statistical analysis until we have that ready. In the meantime, you can refer to you can refer to the Moving Pictures tutorial, as well as the sample-classifier and longitudinal tutorials.

Replay provenance (work in progress!)¶

You might next want to try to adapt the commands presented in this tutorial to your own data, adjusting parameter settings and metadata column headers as is relevant. QIIME 2’s provenance replay functionality can help with this. Assuming that you ran all of the steps above in a directory called gut-to-soil/, run the following command to generate a template script that you can adapt for your workflow:

qiime tools replay-provenance \
  --in-fp gut-to-soil/taxa-bar-plots.qzv \
  --out-fp g2s-replayed.bash

If you need help, head over to the QIIME 2 Forum.

Footnotes¶

The resource requirements exceed those provided by the Read the Docs (RTD) build system, which is used to build the documentation that you’re reading. RTD provides systems with 7GB of RAM for 30 minutes maximum to build documentation. That’s a very reasonable (and generous) allocation for building documentation, so we choose to work within those contraints rather than creating our own documentation build system like we’ve had in the past (e.g., for https://docs.qiime2.org).
↩↩
kmerization of biological sequences is described in the Database Searching chapter of An Introduction to Applied Bioinformatics.
↩
Learn more about the available metrics in this QIIME 2 Forum post.
↩
This process is discussed in the Machine Learning in Bioinformatics chapter of An Introduction to Applied Bioinformatics.
↩

References¶

Raspet, I., Gehret, E., Herman, C., Meilander, J., Manley, A., Simard, A., Bolyen, E., & Caporaso, J. G. (2025). Facilitating bootstrapped and rarefaction-based microbiome diversity analysis with q2-boots. F1000Research, 14, 87. 10.12688/f1000research.156295.1
Bokulich, N. A. (2025). Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting. mSystems, 10(3). 10.1128/msystems.01550-24
Meilander, J., Herman, C., Manley, A., Augustine, G., Birdsell, D., Bolyen, E., Celona, K. R., Coffey, H., Cocking, J., Donoghue, T., Draves, A., Erickson, D., Foley, M., Gehret, L., Hagen, J., Hepp, C., Ingram, P., John, D., Kadar, K., … Caporaso, J. G. (2024). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. arXiv. 10.48550/ARXIV.2411.04148
Caporaso, J. G., & Meilander, J. (2025). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis (supporting data). Zenodo. 10.5281/ZENODO.13887456
Keefe, C. R., Dillon, M. R., Gehret, E., Herman, C., Jewell, M., Wood, C. V., Bolyen, E., & Caporaso, J. G. (2023). Facilitating bioinformatics reproducibility with QIIME 2 Provenance Replay. PLOS Computational Biology, 19(11), e1011676. 10.1371/journal.pcbi.1011676
Caporaso, J. G., Lauber, C. L., Walters, W. A., Berg-Lyons, D., Huntley, J., Fierer, N., Owens, S. M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J. A., Smith, G., & Knight, R. (2012). Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. The ISME Journal, 6(8), 1621–1624. 10.1038/ismej.2012.8
Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. 10.1038/nmeth.3869
Amir, A., McDonald, D., Navas-Molina, J. A., Kopylova, E., Morton, J. T., Zech Xu, Z., Kightley, E. P., Thompson, L. R., Hyde, E. R., Gonzalez, A., & Knight, R. (2017). Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems, 2(2). 10.1128/msystems.00191-16
Bokulich, N. A., Subramanian, S., Faith, J. J., Gevers, D., Gordon, J. I., Knight, R., Mills, D. A., & Caporaso, J. G. (2013). Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods, 10(1), 57–59. 10.1038/nmeth.2276
Kaehler, B. D., Bokulich, N. A., McDonald, D., Knight, R., Caporaso, J. G., & Huttley, G. A. (2019). Species abundance information improves sequence taxonomy classification accuracy. Nature Communications, 10(1). 10.1038/s41467-019-12669-6
Lin, H., & Peddada, S. D. (2023). Multigroup analysis of compositions of microbiomes with covariate adjustments and repeated measures. Nature Methods, 21(1), 83–91. 10.1038/s41592-023-02092-7

The gut-to-soil axis microbiome analysis tutorial

Why study human excrement composting?