Background¶
In this tutorial youâll learn an end-to-end microbiome marker-gene data science workflow, building on data presented in Meilander et al. (2024): Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. The data used here is a subset (a single sequencing run) of that generated for the paper, specifically selected so that this tutorial can be run quickly on a personal computer. The full data set for the paper can be found in the studyâs Artifact Repository. In the final step (not yet written, as of 17 April 2025), youâll learn how to adapt the workflow for use in analyzing your own data using Provenance Replay.
The data used in this tutorial was generated using the Earth Microbiome Project protocol. Specifically, the hypervariable region 4 (V4) of the 16S rRNA gene was amplified using the F515-R806 primers - a broad-coverage primer pair for Bacteria that also amplifies some Archaea. Paired-end sequencing was performed on an Illumina MiSeq. Full details are presented in Meilander et al. (2024).
Sample metadata¶
Before starting the analysis, explore the sample metadata to familiarize yourself with the samples used in this study.
The following command will download the sample metadata as tab-separated text and save it in the file sample-metadata.tsv
.
This sample-metadata.tsv
file is used throughout the rest of the tutorial.
To learn more about metadata in QIIME 2, including how it should be formatted, refer to Using QIIME 2âs Metadata file format.
To learn more about metadata standards, you can refer to Chloe Hermanâs video on this topic, which was developed in collaboration with the National Microbiome Data Collaborative (NMDC).
wget -O 'sample-metadata.tsv' \
'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'
from qiime2 import Metadata
from urllib import request
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'
fn = 'sample-metadata.tsv'
request.urlretrieve(url, fn)
sample_metadata_md = Metadata.load(fn)
- Using the
Upload Data
tool: - On the first tab (Regular), press the
Paste/Fetch
data button at the bottom.- Set "Name" (first text-field) to:
sample-metadata.tsv
- In the larger text-area, copy-and-paste: https://
gut -to -soil -tutorial .readthedocs .io /en /2025 .4 /data /gut -to -soil /sample -metadata .tsv - ("Type", "Genome", and "Settings" can be ignored)
- Set "Name" (first text-field) to:
- Press the
Start
button at the bottom.
- On the first tab (Regular), press the
library(reticulate)
Metadata <- import("qiime2")$Metadata
request <- import("urllib")$request
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/sample-metadata.tsv'
fn <- 'sample-metadata.tsv'
request$urlretrieve(url, fn)
sample_metadata_md <- Metadata$load(fn)
sample_metadata = use.init_metadata_from_url(
'sample-metadata',
'https://www.dropbox.com/scl/fi/irosimbb1aud1aa7frzxf/sample-metadata.tsv?rlkey=f45jpxzajjz9xx9vpvfnf1zjx&st=nahafuvy&dl=1')
sample-metadata.tsv
| download
QIIME 2âs metadata plugin provides a Visualizer called tabulate
that generates a convenient view of a sample metadata file.
Letâs run this, and then weâll look at the result.
Hereâs the first QIIME 2 command that you should run in this tutorial:
qiime metadata tabulate \
--m-input-file sample-metadata.tsv \
--o-visualization sample-metadata.qzv
import qiime2.plugins.metadata.actions as metadata_actions
sample_metadata_viz, = metadata_actions.tabulate(
input=sample_metadata_md,
)
- Using the
qiime2 metadata tabulate
tool: - For "input":
- Perform the following steps.
- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Perform the following steps.
- Press the
Execute
button.
- For "input":
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 metadata tabulate [...] : visualization.qzv
sample-metadata.qzv
metadata_actions <- import("qiime2.plugins.metadata.actions")
action_results <- metadata_actions$tabulate(
input=sample_metadata_md,
)
sample_metadata_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=sample_metadata),
use.UsageOutputNames(visualization='sample_metadata')
)
You can learn more about viewing Visualizations, including alternatives to QIIME 2 View if you canât use that for any reason, in Using QIIME 2.
This will generate a QIIME 2 Visualization. Visualizations can be viewed by loading them with QIIME 2 View. Navigate to QIIME 2 View, and drag and drop the visualization that was created to view it.
Access already-imported QIIME 2 data¶
This tutorial begins with paired-end read sequencing data that has already been demultiplexed and imported into a QIIME 2 Artifact. Because sequence data can be delivered to you in many different forms, itâs not possible to cover the varieties here. Instead we refer you to How to import data for use with QIIME 2 to learn how to import your data. If you want to learn why importing is necessary, refer to Why importing is necessary.
You should ask your sequencing center to provide data already demultiplexed. In some cases, they may be able to provide a âQIIME 2 demux artifactsâ. If not, ask them to provide a fastq manifest file for your sequencing data. It should be easy for them to generate.
wget -O 'demux.qza' \
'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/demux.qza'
from qiime2 import Artifact
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/demux.qza'
fn = 'demux.qza'
request.urlretrieve(url, fn)
demux = Artifact.load(fn)
- Using the
Upload Data
tool: - On the first tab (Regular), press the
Paste/Fetch
data button at the bottom.- Set "Name" (first text-field) to:
demux.qza
- In the larger text-area, copy-and-paste: https://
gut -to -soil -tutorial .readthedocs .io /en /2025 .4 /data /gut -to -soil /demux .qza - ("Type", "Genome", and "Settings" can be ignored)
- Set "Name" (first text-field) to:
- Press the
Start
button at the bottom.
- On the first tab (Regular), press the
Artifact <- import("qiime2")$Artifact
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/demux.qza'
fn <- 'demux.qza'
request$urlretrieve(url, fn)
demux <- Artifact$load(fn)
demux = use.init_artifact_from_url(
'demux',
'https://www.dropbox.com/scl/fi/73f6rmcq7lelug5qbuhl6/demux-10p.qza?rlkey=s0hoh326fes3z2usvgs62tv3c&st=caz1avkn&dl=1')
Summarize demultiplexed sequences¶
When you have demultiplexed sequence data, the next step is typically to generate a visual summary of it. This allows you to determine how many sequences were obtained per sample, and also to get a summary of the distribution of sequence qualities at each position in your sequence data.
qiime demux summarize \
--i-data demux.qza \
--o-visualization demux.qzv
import qiime2.plugins.demux.actions as demux_actions
demux_viz, = demux_actions.summarize(
data=demux,
)
- Using the
qiime2 demux summarize
tool: - Set "data" to
#: demux.qza
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 demux summarize [...] : visualization.qzv
demux.qzv
demux_actions <- import("qiime2.plugins.demux.actions")
action_results <- demux_actions$summarize(
data=demux,
)
demux_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='demux',
action_id='summarize'),
use.UsageInputs(data=demux),
use.UsageOutputNames(visualization='demux'))
How many samples are represented in this sequencing data? What is the median number of sequence reads obtained per sample? What is the median quality score at position 200 of the forward reads?
Upstream data analysis¶
Generally, the term âupstreamâ is used to refer to data analysis pre-feature-asv_table, and âdownstreamâ is used to refer to data analysis post-feature-asv_table. Letâs jump into our upstream analysis.
Sequence quality control and feature table construction¶
QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering.
In this tutorial we present this step using DADA2.
The result of this method will be a FeatureTable[Frequency]
QIIME 2 artifact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence]
QIIME 2 artifact, which maps feature identifiers in the FeatureTable
to the sequences they represent.
DADA2 is a pipeline for detecting and correcting (where possible) Illumina amplicon sequence data. As implemented in the dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, filter chimeric sequences, and merge paired end reads.
The denoise-paired
action, which weâll use here, requires four parameters that are used in quality filtering:
trim-left-f a
, which trims off the firsta
bases of each forward readtrunc-len-f b
which truncates each forward read at positionb
trim-left-r c
, which trims off the firstc
bases of each forward readtrunc-len-r d
which truncates each forward read at positiond
This allows the user to remove low quality regions of the sequences. To determine what values to pass for these parameters, you should review the Interactive Quality Plot tab in thedemux.qzv
file that was generated above.
Based on the plots you see in the demux.qzv
file that was generated above, what values would you choose for trim-left-f
and trunc-len-f
in this case?
What about trim-left-r
and trunc-len-r
?
Solution to Exercise 2
The quality of the initial bases seems to be high, so I choose not to trim any bases from the beginning of the sequences. The quality also seems good all the way out to the end, though maybe dropping off after 250 bases. Iâll therefore truncate at 250. Iâll keep these values the same for both the forward and reverse reads, though that is not a requirement.
Now run your DADA2 command. This step may take up to 10 minutes to complete - itâs the longest running step in this tutorial.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 0 \
--p-trunc-len-f 250 \
--p-trim-left-r 0 \
--p-trunc-len-r 250 \
--o-representative-sequences asv-seqs.qza \
--o-table asv-table.qza \
--o-denoising-stats stats.qza
import qiime2.plugins.dada2.actions as dada2_actions
asv_table, asv_seqs, stats = dada2_actions.denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250,
)
- Using the
qiime2 dada2 denoise-paired
tool: - Set "demultiplexed_seqs" to
#: demux.qza
- Set "trunc_len_f" to
250
- Set "trunc_len_r" to
250
- Expand the
additional options
section- Leave "trim_left_f" as its default value of
0
- Leave "trim_left_r" as its default value of
0
- Leave "trim_left_f" as its default value of
- Press the
Execute
button.
- Set "demultiplexed_seqs" to
- Once completed, for each new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 dada2 denoise-paired [...] : table.qza
asv-table.qza
#: qiime2 dada2 denoise-paired [...] : representative_sequences.qza
asv-seqs.qza
#: qiime2 dada2 denoise-paired [...] : denoising_stats.qza
stats.qza
dada2_actions <- import("qiime2.plugins.dada2.actions")
action_results <- dada2_actions$denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0L,
trunc_len_f=250L,
trim_left_r=0L,
trunc_len_r=250L,
)
asv_seqs <- action_results$representative_sequences
asv_table <- action_results$table
stats <- action_results$denoising_stats
asv_seqs, asv_table, stats = use.action(
use.UsageAction(plugin_id='dada2',
action_id='denoise_paired'),
use.UsageInputs(demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250),
use.UsageOutputNames(representative_sequences='asv_seqs',
table='asv_table',
denoising_stats='stats'))
One of the outputs created by DADA2 is a summary of the denoising run. That is generated as an Artifact, so canât be viewed directly. However this is one of many QIIME 2 types that can be viewed as Metadata - a very powerful concept that weâll use again later in this tutorial. Learning to view artifacts as Metadata creates nearly infinite possibilities for how you can explore your microbiome data with QIIME 2.
Here, weâll again use the metadata plugins tabulate
visualizer, but this time weâll apply it to the DADA2 statistics.
qiime metadata tabulate \
--m-input-file stats.qza \
--o-visualization stats.qzv
stats_as_md_md = stats.view(Metadata)
stats_viz, = metadata_actions.tabulate(
input=stats_as_md_md,
)
- Using the
qiime2 metadata tabulate
tool: - For "input":
- Perform the following steps.
- Change to
Metadata from Artifact
- Set "Metadata Source" to
stats.qza
- Change to
- Perform the following steps.
- Press the
Execute
button.
- For "input":
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 metadata tabulate [...] : visualization.qzv
stats.qzv
stats_as_md_md <- stats$view(Metadata)
action_results <- metadata_actions$tabulate(
input=stats_as_md_md,
)
stats_viz <- action_results$visualization
stats_as_md = use.view_as_metadata('stats_as_md', stats)
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=stats_as_md),
use.UsageOutputNames(visualization='stats'))
Which three samples had the smallest percentage of input reads passing the quality filter?
Refer back to your tabulated metadata: what do you know about those samples (e.g., what values do they have in the SampleType
column)?
(Hint: the sample-id
is the key that connects data across these two visualizations.)
If itâs annoying to switch back and forth between visualizations to answer the question in the last exercise, you can create a combined tabulation of the metadata. Try to do that by adapting the instructions in How to merge metadata.
This is also useful if you want to create a large tabular summary describing your samples following analysis, as you can include as many different metadata objects as youâd like in these summaries.
Solution to Exercise 4
qiime metadata tabulate \
--m-input-file sample-metadata.tsv stats.qza \
--o-visualization sample-metadata-w-dada2-stats.qzv
sample_metadata_and_dada2_stats_md_md = sample_metadata_md.merge(stats_as_md_md)
sample_metadata_w_dada2_stats_viz, = metadata_actions.tabulate(
input=sample_metadata_and_dada2_stats_md_md,
)
- Using the
qiime2 metadata tabulate
tool: - For "input":
- Perform the following steps.
- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
+ Insert input
button to set up the next steps.- Change to
Metadata from Artifact
- Set "Metadata Source" to
stats.qza
- Change to
- Perform the following steps.
- Press the
Execute
button.
- For "input":
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 metadata tabulate [...] : visualization.qzv
sample-metadata-w-dada2-stats.qzv
sample_metadata_and_dada2_stats_md_md <- sample_metadata_md$merge(stats_as_md_md)
action_results <- metadata_actions$tabulate(
input=sample_metadata_and_dada2_stats_md_md,
)
sample_metadata_w_dada2_stats_viz <- action_results$visualization
sample_metadata_and_dada2_stats_md = use.merge_metadata('sample_metadata_and_dada2_stats_md', sample_metadata, stats_as_md)
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=sample_metadata_and_dada2_stats_md),
use.UsageOutputNames(visualization='sample_metadata_w_dada2_stats'))
Feature table and feature data summaries¶
After DADA2 completes, youâll want to explore the resulting data.
You can do this using the following two commands, which will create visual summaries of the data.
The feature-table summarize
action command will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics.
qiime feature-table summarize-plus \
--i-table asv-table.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary asv-table.qzv \
--o-sample-frequencies sample-frequencies.qza \
--o-feature-frequencies asv-frequencies.qza
import qiime2.plugins.feature_table.actions as feature_table_actions
asv_frequencies, sample_frequencies, asv_table_viz = feature_table_actions.summarize_plus(
table=asv_table,
metadata=sample_metadata_md,
)
- Using the
qiime2 feature-table summarize-plus
tool: - Set "table" to
#: asv-table.qza
- Expand the
additional options
section- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
- For "metadata":
- Press the
Execute
button.
- Set "table" to
- Once completed, for each new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza
asv-frequencies.qza
#: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza
sample-frequencies.qza
#: qiime2 feature-table summarize-plus [...] : summary.qzv
asv-table.qzv
feature_table_actions <- import("qiime2.plugins.feature_table.actions")
action_results <- feature_table_actions$summarize_plus(
table=asv_table,
metadata=sample_metadata_md,
)
asv_table_viz <- action_results$summary
sample_frequencies <- action_results$sample_frequencies
asv_frequencies <- action_results$feature_frequencies
_, _, asv_frequencies = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize_plus'),
use.UsageInputs(table=asv_table,
metadata=sample_metadata),
use.UsageOutputNames(summary='asv_table',
sample_frequencies='sample_frequencies',
feature_frequencies='asv_frequencies'))
What is the total number of sequences represented in the feature table? What is the identifier of the feature that is observed the most number of times (i.e., has the highest frequency)?
The feature-table tabulate-seqs
action command will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database.
We can also include the feature frequency information in this visualization by passing it as metadata, similar to how we merged metadata in the exercise above.
In this case, however, weâre looking a feature metadata, as opposed to sample metadata.
As far as QIIME 2 is concerned, there is no difference between these two - in our case, itâll only be the identifiers that differ.
This visualization will be very useful later in the tutorial, when you want to learn more about specific features that are important in the data set.
qiime feature-table tabulate-seqs \
--i-data asv-seqs.qza \
--m-metadata-file asv-frequencies.qza \
--o-visualization asv-seqs.qzv
asv_frequencies_md_md = asv_frequencies.view(Metadata)
asv_seqs_viz, = feature_table_actions.tabulate_seqs(
data=asv_seqs,
metadata=asv_frequencies_md_md,
)
- Using the
qiime2 feature-table tabulate-seqs
tool: - Set "data" to
#: asv-seqs.qza
- Expand the
additional options
section- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Change to
Metadata from Artifact
- Set "Metadata Source" to
asv-frequencies.qza
- Change to
- Press the
- For "metadata":
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table tabulate-seqs [...] : visualization.qzv
asv-seqs.qzv
asv_frequencies_md_md <- asv_frequencies$view(Metadata)
action_results <- feature_table_actions$tabulate_seqs(
data=asv_seqs,
metadata=asv_frequencies_md_md,
)
asv_seqs_viz <- action_results$visualization
asv_frequencies_as_md = use.view_as_metadata('asv_frequencies_md',
asv_frequencies)
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='tabulate_seqs'),
use.UsageInputs(data=asv_seqs,
metadata=asv_frequencies_as_md),
use.UsageOutputNames(visualization='asv_seqs'),
)
What is the taxonomy associated with the most frequently observed feature, based on a BLAST search?
Filtering features from a feature table¶
If you review the tabulated feature sequences, or the feature detail table of the feature table summary, youâll notice that there are many sequences that are observed in only a single sample. Letâs filter those out to reduce the number of sequences weâre working with - this will speed up several slower steps that are coming up.
This is a two-step process. First we filter our feature table, and then we use the new feature table to filter our sequences to only the ones that are contained in the new table.
I include _ms2
in the new output names here to remind us that these are filtered to those features in a minimum of 2 samples.
Generally speaking, file names are a convenient place to store information like this, but theyâre unreliable.
File names can easily be changed, and therefore could be modified to contain inaccurate information about the data they contain.
Luckily, QIIME 2âs provenance tracking system records all of the information that we need about how results were generated. Weâre therefore free to include information like this in file names if itâs helpful for us, but we shouldnât ever rely on the file names. If in doubt -- and always be in doubt đ”ïžââïž -- refer to the data provenance.
QIIME 2âs data provenance is your source for the truth about how a result was created.
qiime feature-table filter-features \
--i-table asv-table.qza \
--p-min-samples 2 \
--o-filtered-table asv-table-ms2.qza
asv_table_ms2, = feature_table_actions.filter_features(
table=asv_table,
min_samples=2,
)
- Using the
qiime2 feature-table filter-features
tool: - Set "table" to
#: asv-table.qza
- Expand the
additional options
section- Set "min_samples" to
2
- Set "min_samples" to
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table filter-features [...] : filtered_table.qza
asv-table-ms2.qza
action_results <- feature_table_actions$filter_features(
table=asv_table,
min_samples=2L,
)
asv_table_ms2 <- action_results$filtered_table
asv_table_ms2, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_features'),
use.UsageInputs(table=asv_table,
min_samples=2),
use.UsageOutputNames(filtered_table='asv_table_ms2'),
)
qiime feature-table filter-seqs \
--i-data asv-seqs.qza \
--i-table asv-table-ms2.qza \
--o-filtered-data asv-seqs-ms2.qza
asv_seqs_ms2, = feature_table_actions.filter_seqs(
data=asv_seqs,
table=asv_table_ms2,
)
- Using the
qiime2 feature-table filter-seqs
tool: - Set "data" to
#: asv-seqs.qza
- Expand the
additional options
section- Set "table" to
#: asv-table-ms2.qza
- Set "table" to
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table filter-seqs [...] : filtered_data.qza
asv-seqs-ms2.qza
action_results <- feature_table_actions$filter_seqs(
data=asv_seqs,
table=asv_table_ms2,
)
asv_seqs_ms2 <- action_results$filtered_data
asv_seqs_ms2, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_seqs'),
use.UsageInputs(data=asv_seqs,
table=asv_table_ms2),
use.UsageOutputNames(filtered_data='asv_seqs_ms2'),
)
Now that you have a second (filtered) feature table, create your own command to summarize it, like we did for the original feature table. How many features did we lose as a result of this filter? How many total sequences did we lose?
Solution to Exercise 7
Hereâs the command you would use:
qiime feature-table summarize-plus \
--i-table asv-table-ms2.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary asv-table-ms2.qzv \
--o-sample-frequencies sample-frequencies-ms2.qza \
--o-feature-frequencies asv-frequencies-ms2.qza
asv_frequencies_ms2, sample_frequencies_ms2, asv_table_ms2_viz = feature_table_actions.summarize_plus(
table=asv_table_ms2,
metadata=sample_metadata_md,
)
- Using the
qiime2 feature-table summarize-plus
tool: - Set "table" to
#: asv-table-ms2.qza
- Expand the
additional options
section- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
- For "metadata":
- Press the
Execute
button.
- Set "table" to
- Once completed, for each new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza
asv-frequencies-ms2.qza
#: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza
sample-frequencies-ms2.qza
#: qiime2 feature-table summarize-plus [...] : summary.qzv
asv-table-ms2.qzv
action_results <- feature_table_actions$summarize_plus(
table=asv_table_ms2,
metadata=sample_metadata_md,
)
asv_table_ms2_viz <- action_results$summary
sample_frequencies_ms2 <- action_results$sample_frequencies
asv_frequencies_ms2 <- action_results$feature_frequencies
_, _, asv_frequencies_ms2 = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize_plus'),
use.UsageInputs(table=asv_table_ms2,
metadata=sample_metadata),
use.UsageOutputNames(summary='asv_table_ms2',
sample_frequencies='sample_frequencies_ms2',
feature_frequencies='asv_frequencies_ms2'))
Be sure to run this as weâre going to use one of the results below.
Taxonomic annotation¶
Before we complete our upstream analysis steps, weâll generate taxonomic annotations for our sequences using the feature-classifier plugin.
The taxonomic classifier used here is very specific to the sample preparation and sequencing protocols used for this study, and Greengenes 13_8, which it is trained on, is an outdated reference database. The reason we use it here is because the reference data is relatively small, so classification can be run on most modern computers with this classifier[1].
When youâre ready to work on your own data, one of the choices youâll need to make is what classifier to use for your data. You can find pre-trained classifiers the QIIME 2 Library Resources page. If you donât find a classifier that will work for you there, you may be able to find one on the Forum, or you can learn how to train your own by referencing the RESCRIPt documentation.
We strongly recommend the use of environment-weighted classifiers, as described in Kaehler et al., (2019), and you can find builds of these on the Resources page.
First, weâll download a pre-trained classifier artifact.
wget -O 'suboptimal-16S-rRNA-classifier.qza' \
'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn = 'suboptimal-16S-rRNA-classifier.qza'
request.urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier = Artifact.load(fn)
- Using the
Upload Data
tool: - On the first tab (Regular), press the
Paste/Fetch
data button at the bottom.- Set "Name" (first text-field) to:
suboptimal-16S-rRNA-classifier.qza
- In the larger text-area, copy-and-paste: https://
gut -to -soil -tutorial .readthedocs .io /en /2025 .4 /data /gut -to -soil /suboptimal -16S -rRNA -classifier .qza - ("Type", "Genome", and "Settings" can be ignored)
- Set "Name" (first text-field) to:
- Press the
Start
button at the bottom.
- On the first tab (Regular), press the
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/2025.4/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn <- 'suboptimal-16S-rRNA-classifier.qza'
request$urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier <- Artifact$load(fn)
def classifier_factory():
from urllib import request
from qiime2 import Artifact
fp, _ = request.urlretrieve(
'https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes/gg-13-8-99-515-806-nb-classifier.qza')
return Artifact.load(fp)
classifier = use.init_artifact('suboptimal-16S-rRNA-classifier', classifier_factory)
Then, weâll apply it to our sequences using classify-sklearn
.
qiime feature-classifier classify-sklearn \
--i-classifier suboptimal-16S-rRNA-classifier.qza \
--i-reads asv-seqs-ms2.qza \
--o-classification taxonomy.qza
import qiime2.plugins.feature_classifier.actions as feature_classifier_actions
taxonomy, = feature_classifier_actions.classify_sklearn(
classifier=suboptimal_16S_rRNA_classifier,
reads=asv_seqs_ms2,
)
- Using the
qiime2 feature-classifier classify-sklearn
tool: - Set "reads" to
#: asv-seqs-ms2.qza
- Set "classifier" to
#: suboptimal-16S-rRNA-classifier.qza
- Press the
Execute
button.
- Set "reads" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-classifier classify-sklearn [...] : classification.qza
taxonomy.qza
feature_classifier_actions <- import("qiime2.plugins.feature_classifier.actions")
action_results <- feature_classifier_actions$classify_sklearn(
classifier=suboptimal_16S_rRNA_classifier,
reads=asv_seqs_ms2,
)
taxonomy <- action_results$classification
taxonomy, = use.action(
use.UsageAction(plugin_id='feature_classifier',
action_id='classify_sklearn'),
use.UsageInputs(classifier=classifier,
reads=asv_seqs_ms2),
use.UsageOutputNames(classification='taxonomy'))
Then, to get an initial look at our taxonomic classifications, letâs integrate taxonomy in the sequence summary, like the one we generated above.
If you want to compare taxonomic annotations achieved with different classifiers, you can do that with the feature-table tabulate-seqs
action by passing in multiple FeatureData[Taxonomy]
artifacts.
See an example of what that result might look like here.
While you have that visualization loaded, take a look at the data provenance. The complexity of that data provenance should give you an idea of why itâs helpful to have the computer record all of this information, rather than trying to embed it all in file names or keep track of it in your written notes.
What was used as the DADA2 trim and trunc parameters for the data leading to this visualization? (Hint: use the provenance search feature).
qiime feature-table tabulate-seqs \
--i-data asv-seqs-ms2.qza \
--i-taxonomy Greengenes_13_8:taxonomy.qza \
--m-metadata-file asv-frequencies-ms2.qza \
--o-visualization asv-seqs-ms2.qzv
from qiime2 import ResultCollection
asv_frequencies_md = asv_frequencies_ms2.view(Metadata)
taxonomy_collection_artifact_collection = ResultCollection({
'Greengenes_13_8': taxonomy,
})
asv_seqs_ms2_viz, = feature_table_actions.tabulate_seqs(
data=asv_seqs_ms2,
taxonomy=taxonomy_collection_artifact_collection,
metadata=asv_frequencies_md,
)
- Using the
qiime2 feature-table tabulate-seqs
tool: - Set "data" to
#: asv-seqs-ms2.qza
- Expand the
additional options
section- Set "taxonomy" to
#: taxonomy-collection/
- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Change to
Metadata from Artifact
- Set "Metadata Source" to
asv-frequencies-ms2.qza
- Change to
- Press the
- Set "taxonomy" to
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table tabulate-seqs [...] : visualization.qzv
asv-seqs-ms2.qzv
ResultCollection <- import("qiime2")$ResultCollection
asv_frequencies_md <- asv_frequencies_ms2$view(Metadata)
taxonomy_collection_artifact_collection = ResultCollection(dict(
'Greengenes_13_8' = taxonomy
))
action_results <- feature_table_actions$tabulate_seqs(
data=asv_seqs_ms2,
taxonomy=taxonomy_collection_artifact_collection,
metadata=asv_frequencies_md,
)
asv_seqs_ms2_viz <- action_results$visualization
asv_frequencies_ms2_as_md = use.view_as_metadata('asv_frequencies',
asv_frequencies_ms2)
taxonomy_collection = use.construct_artifact_collection(
'taxonomy_collection', {'Greengenes_13_8': taxonomy}
)
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='tabulate_seqs'),
use.UsageInputs(data=asv_seqs_ms2,
taxonomy=taxonomy_collection,
metadata=asv_frequencies_ms2_as_md),
use.UsageOutputNames(visualization='asv_seqs_ms2'),
)
What is the taxonomy associated with the most frequently observed feature, based on a BLAST search? How does that compare to the taxonomy assigned by our feature classifier?
In general, we tend to trust the results of our feature classifier over those generated by BLAST, though the BLAST results are good for a quick look or a sanity check. The reason for this is that the reference databases used with our feature classifiers tend to be specifically designed for this purpose and in some cases all of the sequences included are vetted. The BLAST databases can contain mis-annotations that may negatively impact the classifications.
Recall that our asv-seqs.qzv
visualization allows you to easily BLAST the sequence associated with each feature against the NCBI nt database.
Using that visualization and the taxonomy.qzv
visualization created here, compare the taxonomic assignments with the taxonomy of the best BLAST hit for a few features.
How similar are the assignments?
If theyâre dissimilar, at what taxonomic level do they begin to differ (e.g., species, genus, family, ...)?
Building a tree for phylogenetic diversity calculations¶
QIIME supports several phylogenetic diversity metrics, including Faithâs Phylogenetic Diversity and weighted and unweighted UniFrac.
In addition to counts of features per sample (i.e., the data in the FeatureTable[Frequency]
QIIME 2 artifact), these metrics require a rooted phylogenetic tree relating the features to one another.
The amplicon distribution offers a few ways to build these trees, including a reference-based approach in the fragment-insertion plugin and de novo (i.e., reference-free) approaches in the phylogeny plugin.
The reference based approach, by default, is specific to 16S rRNA marker gene analysis. We could use that here, but the runtime is too long for our documentation.[1] If youâd like to see this demonstrated, you can refer to the Parkinsonâs Mouse tutorial.
The de novo approach is known to generate low quality trees when very short sequences are used as input, but can be used with any phylogenetically informative marker gene (not just 16S). If youâd like to see this demonstrated, you can refer to the Moving Pictures tutorial.
For those reasons, weâre going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here.
Downstream data analysis¶
As mentioned above, we tend to think of âdownstreamâ analysis as beginning with a feature table, taxonomic annotation of our features, and optionally a phylogenetic tree. Now that we have those (with the exception of the tree, which we wonât use here), letâs jump in. This is where it starts to get fun! â·ïž
Kmer-based diversity analysis¶
As mentioned above, weâre going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here.
This will use two stand-alone QIIME 2 plugins, q2-boots and q2-kmerizer, which are integrated through the kmer-diversity
action in q2-boots.
To learn about kmerization of features and how this relates to phylogenetic diversity metrics, read the q2-kmerizer paper.
q2-boots provides actions that mirror the interface of the diversity metric calculation actions in the diversity plugin, but generates more robust results because it integrates rarefaction and/or bootstrapping.
You can learn more about this in the q2-boots paper.
When using q2-kmerizer, normalization should be done at the ASV level before kmerization, not at the kmer level.
This is automated by the kmer-diversity
action that weâre using here.
This is because you are trying to normalize by sequencing depth, not kmer complexity.
In the end, the difference should not be huge but this distinction could be important for some metrics.
kmer-diversity
is a qiime2.Pipeline
: a type of action that links multiple other QIIME 2 actions together for convenience.
As such, it does a lot of work.
Here are the steps that it takes:
- Resample the input feature table (i.e., the ASV table) to contain exactly
sampling-depth
sequences per sample, either by bootstrapping or rarefaction,n
times. Samples with fewer thansampling-depth
sequences will be removed from the feature table and not included in the subsequent steps. This will result inn
feature tables. - For each feature table resulting from step 1, using the input sequences (i.e., the ASV sequences), kmerize all sequences into kmers of length
kmer-size
. [2] Use this information to create one kmer table per resampled feature table. This will result inn
feature tables, where the features are kmers (instead of ASVs, as in the input feature table). - Compute the user-requested alpha- and beta-diversity metrics on each of the kmer tables resulting from step 2. [3]
The metrics computed by default are:
- Alpha diversity
- Shannonâs diversity index (a quantitative measure of community richness)
- Observed Features (a qualitative measure of community richness)
- Evenness (i.e., Pielouâs Evenness; a measure of community evenness)
- Beta diversity
- Jaccard distance (a qualitative measure of community dissimilarity)
- Bray-Curtis distance (a quantitative measure of community dissimilarity)
- Alpha diversity
- For each diversity metric, average the results computed across the
n
kmer tables. These results can be used in subsequent analysis steps (e.g., ordination, statistical modeling, machine learning). - Perform PCoA ordination on the averaged beta diversity distance matrices resulting from Step 4. [4]
- Generate an interactive q2-vizard scatter plot that contains all user-provided sample metadata, all averaged alpha diversity metrics, and the first three ordination axes for all PCoA matrices computed in step 5.
q2-kmerizer decomposes each sequence into its constituent kmers. This should be carefully considered when interpreting alpha-diversity in particular, as the number of observed features (for example) would correspond to the number of unique kmers observed in a sample (representing the genetic diversity), not the number of unique sequences or taxa. For more information, read the q2-kmerizer paper.
A key parameter that needs to be provided to kmer-diversity
is sampling-depth
, which is the even sampling (i.e., bootstrapping or rarefaction) depth.
Because most diversity metrics are sensitive to different sampling depths (i.e., sequence counts) across different samples, the tables are randomly subsampled such that the total frequency for each sample is the user-specified sampling depth.
For example, if you set sampling-depth=500
, this step will subsample the sequence counts in each sample so that each sample in the resulting table has a total frequency of 500.
If the total frequency (i.e., the number of sequences observed) for any sample(s) is smaller than this value, those samples will be dropped from the diversity analysis.
Choosing this value is tricky.
We recommend making your choice by reviewing the information presented in the asv-table-ms2.qzv
file that you created above.
This video provides guidance on choosing an even sampling depth.
View the asv-table-ms2.qzv
QIIME 2 artifact, and in particular the Interactive Sample Detail tab in that visualization.
What value would you choose to pass for sampling-depth
?
How many samples will be excluded from your analysis based on this choice?
How many total sequences will you be analyzing in the kmer-diversity
command?
In many Illumina runs youâll observe a few samples that have very low sequence counts. You will typically want to exclude those from the analysis by choosing a larger value for the sampling depth at this stage.
Iâm going to choose values that is around the first quartile of the sample total frequencies.
qiime boots kmer-diversity \
--i-table asv-table-ms2.qza \
--i-sequences asv-seqs-ms2.qza \
--m-metadata-file sample-metadata.tsv \
--p-sampling-depth 96 \
--p-n 10 \
--p-replacement \
--p-alpha-average-method median \
--p-beta-average-method medoid \
--output-dir boots-kmer-diversity
import qiime2.plugins.boots.actions as boots_actions
action_results = boots_actions.kmer_diversity(
table=asv_table_ms2,
sequences=asv_seqs_ms2,
metadata=sample_metadata_md,
sampling_depth=96,
n=10,
replacement=True,
alpha_average_method='median',
beta_average_method='medoid',
)
bootstrap_tables_artifact_collection = action_results.resampled_tables
kmer_tables_artifact_collection = action_results.kmer_tables
bootstrap_alpha_diversities_artifact_collection = action_results.alpha_diversities
bootstrap_distance_matrices_artifact_collection = action_results.distance_matrices
bootstrap_pcoas_artifact_collection = action_results.pcoas
kmer_diversity_scatter_plot_viz = action_results.scatter_plot
- Using the
qiime2 boots kmer-diversity
tool: - Set "table" to
#: asv-table-ms2.qza
- Set "sequences" to
#: asv-seqs-ms2.qza
- Set "sampling_depth" to
96
- For "metadata":
- Perform the following steps.
- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Perform the following steps.
- Set "n" to
10
- Set "replacement" to
Yes
- Expand the
additional options
section- Leave "alpha_average_method" as its default value of
median
- Set "beta_average_method" to
medoid
- Leave "alpha_average_method" as its default value of
- Press the
Execute
button.
- Set "table" to
- Once completed, for each new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 boots kmer-diversity [...] : resampled_tables.qza
bootstrap-tables/
#: qiime2 boots kmer-diversity [...] : kmer_tables.qza
kmer-tables/
#: qiime2 boots kmer-diversity [...] : alpha_diversities.qza
bootstrap-alpha-diversities/
#: qiime2 boots kmer-diversity [...] : distance_matrices.qza
bootstrap-distance-matrices/
#: qiime2 boots kmer-diversity [...] : pcoas.qza
bootstrap-pcoas/
#: qiime2 boots kmer-diversity [...] : scatter_plot.qzv
kmer-diversity-scatter-plot.qzv
boots_actions <- import("qiime2.plugins.boots.actions")
action_results <- boots_actions$kmer_diversity(
table=asv_table_ms2,
sequences=asv_seqs_ms2,
metadata=sample_metadata_md,
sampling_depth=96L,
n=10L,
replacement=TRUE,
alpha_average_method='median',
beta_average_method='medoid',
)
bootstrap_tables_artifact_collection <- action_results$resampled_tables
kmer_tables_artifact_collection <- action_results$kmer_tables
bootstrap_alpha_diversities_artifact_collection <- action_results$alpha_diversities
bootstrap_distance_matrices_artifact_collection <- action_results$distance_matrices
bootstrap_pcoas_artifact_collection <- action_results$pcoas
kmer_diversity_scatter_plot_viz <- action_results$scatter_plot
use.action(
use.UsageAction(plugin_id='boots',
action_id='kmer_diversity'),
use.UsageInputs(table=asv_table_ms2,
sequences=asv_seqs_ms2,
metadata=sample_metadata,
sampling_depth=96,
n=10,
replacement=True,
alpha_average_method='median',
beta_average_method='medoid'),
use.UsageOutputNames(
resampled_tables='bootstrap_tables',
kmer_tables='kmer_tables',
alpha_diversities='bootstrap_alpha_diversities',
distance_matrices='bootstrap_distance_matrices',
pcoas='bootstrap_pcoas',
scatter_plot='kmer_diversity_scatter_plot')
)
Remember that weâre working with a small subset of a full sequencing run in this tutorial, to keep the runtime short for the tutorial.
As a result the value used for sampling-depth
here is very low.
Often this might be closer to 10,000 for an Illumina MiSeq run (as of 2025), but this is highly dependent on the sequencing run, the number of samples included, and other factors.
After computing diversity metrics, we can begin to explore the microbial composition of the samples in the context of the sample metadata. You can review the sample metadata using one of the tabulated views of this file that we created above.
Open the scatter plot that was generated by the previous command, and plot the first two ordination axes computed from the Bray-Curtis distances.
Experiment with plotting different information on the x and y axes. Change the colorBy and colorPalette selections. Which of the metadata categories results in samples grouping most by color?
When plotting Bray-Curtis PCoA axes 1 and 2 and coloring by SampleType, is the HEC more similar to the food compost or HE samples?
What sample type is the Microbe Mix most similar to? The inside of the toilet pre-use? The bulking material?
What other interesting relationships do you see when changing the x- and y-axes and sample coloring? Which diversity axes (PCoA or alpha diversity results) appear to be correlated with one another?
Which sample has the lowest microbiome richness?
Alpha rarefaction plotting¶
In this section weâll explore alpha diversity as a function of sampling depth using thealpha-rarefaction
action.
This visualizer computes one or more alpha diversity metrics at multiple sampling depths, in steps between 1 (optionally controlled with min-depth
) and the value provided as max-depth
.
At each sampling depth step, 10 rarefied tables will be generated, and the diversity metrics will be computed for all samples in the tables.
The number of iterations (rarefied tables computed at each sampling depth) can be controlled with the iterations
parameter.
Average diversity values will be plotted for each sample at each even sampling depth, and samples can be grouped based on metadata in the resulting visualization if sample metadata is provided.
The value that you provide for max-depth
should be determined by reviewing the âFrequency per sampleâ information presented in the asv-table-ms2.qzv
file.
In general, choosing a value that is somewhere around the median frequency seems to work well, but you may want to increase that value if the lines in the resulting rarefaction plot donât appear to be leveling out, or decrease that value if you seem to be losing many of your samples due to low total frequencies closer to the minimum sampling depth than the maximum sampling depth.
qiime diversity alpha-rarefaction \
--i-table asv-table-ms2.qza \
--p-max-depth 260 \
--m-metadata-file sample-metadata.tsv \
--o-visualization alpha-rarefaction.qzv
import qiime2.plugins.diversity.actions as diversity_actions
alpha_rarefaction_viz, = diversity_actions.alpha_rarefaction(
table=asv_table_ms2,
max_depth=260,
metadata=sample_metadata_md,
)
- Using the
qiime2 diversity alpha-rarefaction
tool: - Set "table" to
#: asv-table-ms2.qza
- Set "max_depth" to
260
- Expand the
additional options
section- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
- For "metadata":
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 diversity alpha-rarefaction [...] : visualization.qzv
alpha-rarefaction.qzv
diversity_actions <- import("qiime2.plugins.diversity.actions")
action_results <- diversity_actions$alpha_rarefaction(
table=asv_table_ms2,
max_depth=260L,
metadata=sample_metadata_md,
)
alpha_rarefaction_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='diversity',
action_id='alpha_rarefaction'),
use.UsageInputs(table=asv_table_ms2,
max_depth=260,
metadata=sample_metadata),
use.UsageOutputNames(visualization='alpha_rarefaction'))
The visualization will have two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to âlevel outâ (i.e., approach a slope of zero) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in the plot donât level out, this may be because the richness of the samples hasnât been fully observed yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).
The bottom plot in this visualization is important when grouping samples by metadata.
It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth.
If a given sampling depth d
is larger than the total frequency of a sample s
(i.e., the number of sequences that were obtained for sample s
), it is not possible to compute the diversity metric for sample s
at sampling depth d
.
If many of the samples in a group have lower total frequencies than d
, the average diversity presented for that group at d
in the top plot will be unreliable because it will have been computed on relatively few samples.
When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable.
When grouping samples by âSampleTypeâ and viewing the alpha rarefaction plot for the âobserved_featuresâ metric, which sample types (if any) appear to exhibit sufficient diversity coverage (i.e., their rarefaction curves level out)?
Taxonomic analysis¶
Next, we can view the taxonomic composition of our samples with interactive bar plots. Generate those plots with the following command and then open the visualization.
qiime taxa barplot \
--i-table asv-table-ms2.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file sample-metadata.tsv \
--o-visualization taxa-bar-plots.qzv
import qiime2.plugins.taxa.actions as taxa_actions
taxa_bar_plots_viz, = taxa_actions.barplot(
table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata_md,
)
- Using the
qiime2 taxa barplot
tool: - Set "table" to
#: asv-table-ms2.qza
- Expand the
additional options
section- Set "taxonomy" to
#: taxonomy.qza
- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
- Set "taxonomy" to
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 taxa barplot [...] : visualization.qzv
taxa-bar-plots.qzv
taxa_actions <- import("qiime2.plugins.taxa.actions")
action_results <- taxa_actions$barplot(
table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata_md,
)
taxa_bar_plots_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='taxa',
action_id='barplot'),
use.UsageInputs(table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata),
use.UsageOutputNames(visualization='taxa_bar_plots'))
Why are we using the ASV table when generating our taxonomic barplot, rather than the kmer table that weâve been using in the last few steps?
Solution to Exercise 13
We use the ASV table here because the feature ids in that table are the same as the ones used in the FeatureData[Taxonomy]
artifact.
Additionally, kmerization of our data is a tool used for computing diversity metrics - not something we generally intend to use throughout our analyses.
Visualize the samples at Level 2 (which corresponds to the phylum level in this analysis), and then sort the samples by SampleType
.
What are the dominant phyla in each in SampleType
?
Differential abundance testing with ANCOM-BC2¶
ANCOM-BC2 is a compositionally-aware linear regression model that allows testing for differentially abundant features across sample groups while also implementing bias correction.
This can be accessed using the ancombc2
action in the composition plugin.
Accurately identifying individual features that are differentially abundant across sample types in microbiome data is a challenging problem and an open area of research, particularly if you donât have an a priori hypothesis about which feature(s) are differentially abundant. A q-value that suggests that youâve identified a feature that is differentially abundant across sample groups should be considered a hypothesis, not a conclusion, and you need new samples to test that new hypothesis.
In addition to the methods contained in the composition plugin, new approaches for differential abundance testing are regularly introduced. Itâs worth assessing the current state of the field when performing differential abundance testing to see if there are new methods that might be useful for your data. If in doubt, consult a statistician.
Weâll perform this analysis in a few steps.
Filter samples from the feature table¶
First, weâll filter our samples from our feature table such that we only have the three groups that we have the most samples for.
qiime feature-table filter-samples \
--i-table asv-table-ms2.qza \
--m-metadata-file sample-metadata.tsv \
--p-where '[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")' \
--o-filtered-table asv-table-ms2-dominant-sample-types.qza
asv_table_ms2_dominant_sample_types, = feature_table_actions.filter_samples(
table=asv_table_ms2,
metadata=sample_metadata_md,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)
- Using the
qiime2 feature-table filter-samples
tool: - Set "table" to
#: asv-table-ms2.qza
- Expand the
additional options
section- For "metadata":
- Press the
+ Insert metadata
button to set up the next steps.- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Press the
- Set "where" to
[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")
- For "metadata":
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 feature-table filter-samples [...] : filtered_table.qza
asv-table-ms2-dominant-sample-types.qza
action_results <- feature_table_actions$filter_samples(
table=asv_table_ms2,
metadata=sample_metadata_md,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)
asv_table_ms2_dominant_sample_types <- action_results$filtered_table
asv_table_ms2_dominant_sample_types, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_samples'),
use.UsageInputs(table=asv_table_ms2,
metadata=sample_metadata,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")'),
use.UsageOutputNames(filtered_table='asv_table_ms2_dominant_sample_types'))
Apply differential abundance testing¶
Then, weâll apply ANCOM-BC2 to see which ASV are differentially abundant across those sample types. I specify a reference level here as this defines what each group is compared against. Since the focus of this study is HEC, I choose that as my reference level. That will let us see what ASVs are over- or under-represented in the other two sample groups (Human Excrement and Food Compost) relative to HEC, as HEC defines the âglobal interceptâ that will be measured against.
qiime composition ancombc2 \
--i-table asv-table-ms2-dominant-sample-types.qza \
--m-metadata-file sample-metadata.tsv \
--p-fixed-effects-formula SampleType \
--p-reference-levels 'SampleType::Human Excrement Compost' \
--o-ancombc2-output ancombc2-results.qza
import qiime2.plugins.composition.actions as composition_actions
ancombc2_results, = composition_actions.ancombc2(
table=asv_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
fixed_effects_formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost'],
)
- Using the
qiime2 composition ancombc2
tool: - Set "table" to
#: asv-table-ms2-dominant-sample-types.qza
- For "metadata":
- Perform the following steps.
- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Perform the following steps.
- Set "fixed_effects_formula" to
SampleType
- Expand the
additional options
section- For "reference_levels":
- Set "element" to
SampleType::Human Excrement Compost
- (Do not insert additional values.)
- Set "element" to
- For "reference_levels":
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 composition ancombc2 [...] : ancombc2_output.qza
ancombc2-results.qza
composition_actions <- import("qiime2.plugins.composition.actions")
action_results <- composition_actions$ancombc2(
table=asv_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
fixed_effects_formula='SampleType',
reference_levels=list('SampleType::Human Excrement Compost'),
)
ancombc2_results <- action_results$ancombc2_output
ancombc2_results, = use.action(
use.UsageAction(plugin_id='composition',
action_id='ancombc2'),
use.UsageInputs(table=asv_table_ms2_dominant_sample_types,
metadata=sample_metadata,
fixed_effects_formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost']),
use.UsageOutputNames(ancombc2_output='ancombc2_results'))
Finally, weâll visualize the results. Taxonomic annotations can optionally be provided here: this helps make ASVs ids more interpretable.
qiime composition ancombc2-visualizer \
--i-data ancombc2-results.qza \
--i-taxonomy taxonomy.qza \
--o-visualization ancombc2-barplot.qzv
ancombc2_barplot_viz, = composition_actions.ancombc2_visualizer(
data=ancombc2_results,
taxonomy=taxonomy,
)
- Using the
qiime2 composition ancombc2-visualizer
tool: - Set "data" to
#: ancombc2-results.qza
- Expand the
additional options
section- Set "taxonomy" to
#: taxonomy.qza
- Set "taxonomy" to
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 composition ancombc2-visualizer [...] : visualization.qzv
ancombc2-barplot.qzv
action_results <- composition_actions$ancombc2_visualizer(
data=ancombc2_results,
taxonomy=taxonomy,
)
ancombc2_barplot_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='composition',
action_id='ancombc2_visualizer'),
use.UsageInputs(data=ancombc2_results,
taxonomy=taxonomy),
use.UsageOutputNames(visualization='ancombc2-barplot'))
Which genus is most enriched in HEC relative to Food Compost? Which genus is most enriched in HEC relative to Human Excrement?
Which genus is most depleted in HEC relative to Food Compost? Which genus is most depleted in HEC relative to Human Excrement?
You might be interested in performing ANCOM-BC2 (or other analyses) with ASVs grouped based on the genus they are derived from, rather than on the ASVs themselves.
This may or may not increase your statistical power -- for example, if all organisms in a genus have a similar impact on their host or environment, you may want to know if that genus is differentially abundance across some category of interest.
Take a look at the documentation for the collapse
action.
How would you use this to collapse our ASV table at the genus level, and then run ANCOM-BC2 on that table?
Solution to Exercise 16
To collapse our ASVs into genera (i.e. level 6 of the Greengenes taxonomy), we can use the following command.
qiime taxa collapse \
--i-table asv-table-ms2-dominant-sample-types.qza \
--i-taxonomy taxonomy.qza \
--p-level 6 \
--o-collapsed-table genus-table-ms2-dominant-sample-types.qza
genus_table_ms2_dominant_sample_types, = taxa_actions.collapse(
table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6,
)
- Using the
qiime2 taxa collapse
tool: - Set "table" to
#: asv-table-ms2-dominant-sample-types.qza
- Set "taxonomy" to
#: taxonomy.qza
- Set "level" to
6
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 taxa collapse [...] : collapsed_table.qza
genus-table-ms2-dominant-sample-types.qza
action_results <- taxa_actions$collapse(
table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6L,
)
genus_table_ms2_dominant_sample_types <- action_results$collapsed_table
genus_table_ms2_dominant_sample_types, = use.action(
use.UsageAction(plugin_id='taxa',
action_id='collapse'),
use.UsageInputs(table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6),
use.UsageOutputNames(collapsed_table='genus_table_ms2_dominant_sample_types'))
We can then provide the resulting table as the input to ANCOM-BC2.
qiime composition ancombc2 \
--i-table genus-table-ms2-dominant-sample-types.qza \
--m-metadata-file sample-metadata.tsv \
--p-fixed-effects-formula SampleType \
--p-reference-levels 'SampleType::Human Excrement Compost' \
--o-ancombc2-output genus-ancombc2-results.qza
genus_ancombc2_results, = composition_actions.ancombc2(
table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
fixed_effects_formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost'],
)
- Using the
qiime2 composition ancombc2
tool: - Set "table" to
#: genus-table-ms2-dominant-sample-types.qza
- For "metadata":
- Perform the following steps.
- Leave as
Metadata from TSV
- Set "Metadata Source" to
sample-metadata.tsv
- Leave as
- Perform the following steps.
- Set "fixed_effects_formula" to
SampleType
- Expand the
additional options
section- For "reference_levels":
- Set "element" to
SampleType::Human Excrement Compost
- (Do not insert additional values.)
- Set "element" to
- For "reference_levels":
- Press the
Execute
button.
- Set "table" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 composition ancombc2 [...] : ancombc2_output.qza
genus-ancombc2-results.qza
action_results <- composition_actions$ancombc2(
table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
fixed_effects_formula='SampleType',
reference_levels=list('SampleType::Human Excrement Compost'),
)
genus_ancombc2_results <- action_results$ancombc2_output
genus_ancombc2_results, = use.action(
use.UsageAction(plugin_id='composition',
action_id='ancombc2'),
use.UsageInputs(table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata,
fixed_effects_formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost']),
use.UsageOutputNames(ancombc2_output='genus_ancombc2_results'))
And finally, we can visualize the results. Notice that in this case weâre not providing the taxonomy, because weâve already intergrated that information by collapsing at the genus level.
qiime composition ancombc2-visualizer \
--i-data genus-ancombc2-results.qza \
--o-visualization genus-ancombc2-barplot.qzv
genus_ancombc2_barplot_viz, = composition_actions.ancombc2_visualizer(
data=genus_ancombc2_results,
)
- Using the
qiime2 composition ancombc2-visualizer
tool: - Set "data" to
#: genus-ancombc2-results.qza
- Press the
Execute
button.
- Set "data" to
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: - (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name "Name" to set (be sure to press [Save]) #: qiime2 composition ancombc2-visualizer [...] : visualization.qzv
genus-ancombc2-barplot.qzv
action_results <- composition_actions$ancombc2_visualizer(
data=genus_ancombc2_results,
)
genus_ancombc2_barplot_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='composition',
action_id='ancombc2_visualizer'),
use.UsageInputs(data=genus_ancombc2_results),
use.UsageOutputNames(visualization='genus-ancombc2-barplot'))
Thatâs it for now, but more is coming soon!¶
In the near future (as of 17 April 2025) we plan to integrate analyses using the sample-classifier plugin and longitudinal plugin. In the meantime, here are some suggestions to continue your learning:
- Build a machine learning classifier that classifies samples accordining to the three dominant sample types in the feature table that we used with ANCOM-BC.
(Hint: see
classify-samples
.) - Perform a longitudinal analysis that tracks samples from different buckets over time. Which taxa change most over time? (Hint: see
feature-volatility
.) - Remember that the full data set (five sequencing runs) are available in the gut-to-soil Artifact Repository. Grab one of the larger sequencing runs (we worked with a small sequencing run that was generated as a preliminary test), and adapt the commands in this tutorial to work on a bigger data set.
Weâre also in the process of refactoring our statistical methods for assessing alpha and beta diversity across groups, using the new stats plugin. Weâre therefore holding off on integrating statistical analysis until we have that ready. In the meantime, you can refer to you can refer to the Moving Pictures tutorial, as well as the sample-classifier and longitudinal tutorials.
Replay provenance (work in progress!)¶
This section is a work in progress as of 17 April 2025. Right now this section only demonstrates replaying provenance from a single visualization, but this will be expanded to reflect the full analysis. Also, note that all commands for training the feature classifier are included in the resulting replay script (in case there are some commands in there that you donât remember running). More on this coming soon!
You might next want to try to adapt the commands presented in this tutorial to your own data, adjusting parameter settings and metadata column headers as is relevant.
QIIME 2âs provenance replay functionality can help with this.
Assuming that you ran all of the steps above in a directory called gut-to-soil/
, run the following command to generate a template script that you can adapt for your workflow:
qiime tools replay-provenance \
--in-fp gut-to-soil/taxa-bar-plots.qzv \
--out-fp g2s-replayed.bash
If you need help, head over to the QIIME 2 Forum.
The resource requirements exceed those provided by the Read the Docs (RTD) build system, which is used to build the documentation that youâre reading. RTD provides systems with 7GB of RAM for 30 minutes maximum to build documentation. Thatâs a very reasonable (and generous) allocation for building documentation, so we choose to work within those contraints rather than creating our own documentation build system like weâve had in the past (e.g., for
https://docs.qiime2.org
).kmerization of biological sequences is described in the Database Searching chapter of An Introduction to Applied Bioinformatics.
Learn more about the available metrics in this QIIME 2 Forum post.
This process is discussed in the Machine Learning in Bioinformatics chapter of An Introduction to Applied Bioinformatics.
- Raspet, I., Gehret, E., Herman, C., Meilander, J., Manley, A., Simard, A., Bolyen, E., & Caporaso, J. G. (2025). Facilitating bootstrapped and rarefaction-based microbiome diversity analysis with q2-boots. F1000Research, 14, 87. 10.12688/f1000research.156295.1
- Bokulich, N. A. (2025). Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting. mSystems, 10(3). 10.1128/msystems.01550-24
- Meilander, J., Herman, C., Manley, A., Augustine, G., Birdsell, D., Bolyen, E., Celona, K. R., Coffey, H., Cocking, J., Donoghue, T., Draves, A., Erickson, D., Foley, M., Gehret, L., Hagen, J., Hepp, C., Ingram, P., John, D., Kadar, K., ⊠Caporaso, J. G. (2024). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. arXiv. 10.48550/ARXIV.2411.04148
- Caporaso, J. G., & Meilander, J. (2024). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis (supporting data). Zenodo. 10.5281/ZENODO.13887456
- Keefe, C. R., Dillon, M. R., Gehret, E., Herman, C., Jewell, M., Wood, C. V., Bolyen, E., & Caporaso, J. G. (2023). Facilitating bioinformatics reproducibility with QIIME 2 Provenance Replay. PLOS Computational Biology, 19(11), e1011676. 10.1371/journal.pcbi.1011676
- Caporaso, J. G., Lauber, C. L., Walters, W. A., Berg-Lyons, D., Huntley, J., Fierer, N., Owens, S. M., Betley, J., Fraser, L., Bauer, M., Gormley, N., Gilbert, J. A., Smith, G., & Knight, R. (2012). Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. The ISME Journal, 6(8), 1621â1624. 10.1038/ismej.2012.8
- Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581â583. 10.1038/nmeth.3869
- Amir, A., McDonald, D., Navas-Molina, J. A., Kopylova, E., Morton, J. T., Zech Xu, Z., Kightley, E. P., Thompson, L. R., Hyde, E. R., Gonzalez, A., & Knight, R. (2017). Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems, 2(2). 10.1128/msystems.00191-16
- Bokulich, N. A., Subramanian, S., Faith, J. J., Gevers, D., Gordon, J. I., Knight, R., Mills, D. A., & Caporaso, J. G. (2013). Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods, 10(1), 57â59. 10.1038/nmeth.2276
- Kaehler, B. D., Bokulich, N. A., McDonald, D., Knight, R., Caporaso, J. G., & Huttley, G. A. (2019). Species abundance information improves sequence taxonomy classification accuracy. Nature Communications, 10(1). 10.1038/s41467-019-12669-6
- Lin, H., & Peddada, S. D. (2023). Multigroup analysis of compositions of microbiomes with covariate adjustments and repeated measures. Nature Methods, 21(1), 83â91. 10.1038/s41592-023-02092-7