VhgBoxplot {Virusparies}    R Documentation

VhgBoxplot: Generate box plots comparing E-values, identity or contig length (Gatherer only) for each virus group

Description

VhgBoxplot generates box plots comparing either E-values, identity or contig length (Gatherer only) for each group from VirusHunter or VirusGatherer hittable results.

Usage

VhgBoxplot(
  file,
  x_column = "best_query",
  taxa_rank = "Family",
  y_column = "ViralRefSeq_E",
  contiglen_log10_scale = FALSE,
  cut = 1e-05,
  add_cutoff_line = TRUE,
  cut_colour = "#990000",
  reorder_criteria = "median",
  theme_choice = "linedraw",
  flip_coords = TRUE,
  add_mean_point = FALSE,
  mean_color = "white",
  mean_point_size = 2,
  title = "default",
  title_size = 16,
  title_face = "bold",
  title_colour = "#2a475e",
  subtitle = "default",
  subtitle_size = 12,
  subtitle_face = "bold",
  subtitle_colour = "#1b2838",
  xlabel = NULL,
  ylabel = NULL,
  axis_title_size = 12,
  xtext_size = 10,
  x_angle = NULL,
  ytext_size = 10,
  y_angle = NULL,
  remove_group_labels = FALSE,
  legend_title = "Phylum",
  legend_position = "bottom",
  legend_title_size = 12,
  legend_title_face = "bold",
  legend_text_size = 10,
  facet_ncol = NULL,
  group_unwanted_phyla = NULL
)

Arguments

file

A data frame containing VirusHunter or VirusGatherer hittable results.

x_column

(optional): A character specifying the column containing the groups (default: "best_query"). Note: Gatherer hittables do not have a "best_query" column, so provide an appropriate grouping column when plotting Gatherer results.

taxa_rank

(optional): When x_column is set to "ViralRefSeq_taxonomy", specify the taxonomic rank to group your data by. Supported ranks are:

  • "Subphylum"

  • "Class"

  • "Subclass"

  • "Order"

  • "Suborder"

  • "Family" (default)

  • "Subfamily"

  • "Genus" (including Subgenus)

y_column

A character specifying the column containing the values to be compared. Supported columns are "ViralRefSeq_ident", "contig_len" (Gatherer hittable only) and "ViralRefSeq_E" (default: "ViralRefSeq_E").

contiglen_log10_scale

(optional): When y_column is set to "contig_len", this parameter enables logarithmic scaling (log10) of the y-axis (TRUE). By default, this feature is disabled (FALSE).

cut

(optional): The significance cutoff value for E-values (default: 1e-5).

add_cutoff_line

(optional): Whether to add a horizontal line based on cut for "ViralRefSeq_E" column (default: TRUE).

cut_colour

(optional): The color for the significance cutoff line (default: "#990000").

reorder_criteria

Character string specifying the criterion for reordering the x-axis: 'max', 'min', 'median' (default), 'mean' or 'phylum'. NULL sorts alphabetically. A criterion can also be given with the 'phylum_' prefix (e.g., 'phylum_median') to sort by phylum first and then by the specified statistic within each phylum.
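As a rough illustration of what reordering by a statistic means, the following base R sketch orders the levels of a grouping column by the per-group median. The data frame is a made-up toy table, and the increasing sort direction is an assumption; the package may order groups differently internally.

```r
# Toy hittable with hypothetical values (not package data)
hits <- data.frame(
  best_query = c("A", "A", "B", "B", "C", "C"),
  ViralRefSeq_ident = c(90, 80, 40, 50, 70, 60)
)

# Reorder the grouping factor by the median identity of each group.
# stats::reorder() sorts levels by increasing score.
hits$best_query <- reorder(hits$best_query, hits$ViralRefSeq_ident, FUN = median)

# Group medians are A = 85, B = 45, C = 65
levels(hits$best_query)
```

Swapping `FUN = median` for `mean`, `min` or `max` gives the analogue of the other statistics listed above.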

theme_choice

(optional): A character indicating the ggplot2 theme to apply. Options include "minimal", "classic", "light", "dark", "void", "grey" (or "gray"), "bw", "linedraw" (default), and "test". Append "_dotted" to any theme to add custom dotted grid lines (e.g., "classic_dotted").

flip_coords

(optional): Logical indicating whether to flip the coordinates of the plot (default: TRUE).

add_mean_point

(optional): Logical indicating whether to add mean points to the box plot (default: FALSE).

mean_color

(optional): Change color of point indicating mean value in box plot (default: "white").

mean_point_size

(optional): Change size of point indicating mean value in box plot (default: 2).

title

(optional): A character specifying the title of the plot. Default title is set based on y_column.

title_size

(optional): Numeric specifying the size of the title text (default: 16).

title_face

(optional): A character specifying the font face for the title text (default: "bold").

title_colour

(optional): A character specifying the color for the title text (default: "#2a475e").

subtitle

(optional): A character specifying the subtitle of the plot. Default subtitle is set based on y_column.

subtitle_size

(optional): Numeric specifying the size of the subtitle text (default: 12).

subtitle_face

(optional): A character specifying the font face for the subtitle text (default: "bold").

subtitle_colour

(optional): A character specifying the color for the subtitle text (default: "#1b2838").

xlabel

(optional): A character specifying the label for the x-axis (default: "Virus found in query").

ylabel

(optional): A character specifying the label for the y-axis. Default is set based on y_column.

axis_title_size

(optional): Numeric specifying the size of the axis title text (default: 12).

xtext_size

(optional): Numeric specifying the size of the x-axis tick labels (default: 10).

x_angle

(optional): An integer specifying the angle (in degrees) for the x-axis text labels. Default is NULL, meaning no change.

ytext_size

(optional): Numeric specifying the size of the y-axis tick labels (default: 10).

y_angle

(optional): An integer specifying the angle (in degrees) for the y-axis text labels. Default is NULL, meaning no change.

remove_group_labels

(optional): If TRUE, the group labels will be removed; if FALSE or omitted, the labels will be displayed.

legend_title

(optional): A character specifying the title for the legend (default: "Phylum").

legend_position

(optional): A character specifying the position of the legend (default: "bottom").

legend_title_size

(optional): Numeric specifying the size of the legend title text (default: 12).

legend_title_face

(optional): A character specifying the font face for the legend title text (default: "bold").

legend_text_size

(optional): Numeric specifying the size of the legend text (default: 10).

facet_ncol

(optional): The number of columns for faceting (default: NULL). It is recommended to specify this when the number of viral groups is high, to ensure they fit well in one plot.

group_unwanted_phyla

(optional): A character string specifying which group of viral phyla to retain in the analysis. Valid values are:

"rna"

Retain only the phyla specified for RNA viruses.

"smalldna"

Retain only the phyla specified for small DNA viruses.

"largedna"

Retain only the phyla specified for large DNA viruses.

"others"

Retain only the phyla that match small DNA, large DNA and RNA viruses.

All other phyla not in the specified group will be grouped into a single category: "Non-RNA-virus" for "rna", "Non-Small-DNA-Virus" for "smalldna", "Non-Large-DNA-Virus" for "largedna", or "Other Viruses" for "others".

Details

VhgBoxplot generates box plots comparing either E-values, identity, or contig length (Gatherer only) for each virus group from the VirusHunter or VirusGatherer hittable.

The 'y_column' argument selects which of the three box plot types is generated: E-values, identity, or contig length (Gatherer only). By default, 'y_column' is set to "ViralRefSeq_E" and the reference E-value is plotted on the y-axis. Grouping on the x-axis is controlled by the 'x_column' argument, which defaults to "best_query".

Additionally, the function calculates summary statistics and identifies outliers for further analysis ("ViralRefSeq_E" and "contig_len" only). When 'y_column' is set to "ViralRefSeq_E", the output also includes 'rows_belowthres', which contains the hittable filtered for the rows below the threshold specified in the 'cut' argument.

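The 'cut' argument is used differently depending on the 'y_column' value:

  • For "contig_len" or "ViralRefSeq_ident", 'cut' filters the data so that only rows with a "ViralRefSeq_E" value below the threshold are plotted.

  • For "ViralRefSeq_E", the rows are not filtered. Instead, a line marking the cutoff is drawn in the plot (see 'add_cutoff_line').

This allows the user to plot only the significant contig lengths and identities while also visualizing the number of non-significant and significant values for comparison.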
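The effect of 'cut' on a "contig_len" or "ViralRefSeq_ident" plot can be pictured as a filter on the E-value column. The base R sketch below uses a made-up toy table and assumes a strict comparison (<); it only illustrates the idea and is not the package's internal code.

```r
# Toy hittable with hypothetical values (not package data)
hits <- data.frame(
  best_query = c("A", "A", "B", "B"),
  ViralRefSeq_E = c(1e-30, 0.02, 1e-08, 1e-04),
  ViralRefSeq_ident = c(95.1, 40.2, 78.5, 55.0)
)
cut <- 1e-05

# Keep only rows whose E-value is below the cutoff (strict '<' is an assumption)
significant <- hits[hits$ViralRefSeq_E < cut, ]

# Identity values that would remain available for plotting
significant$ViralRefSeq_ident
```

Here only the first and third rows pass the cutoff, so two of the four identity values remain.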

Warning: In some cases, E-values might be exactly 0. When these values are transformed using -log10, R returns Inf. To avoid this, all E-values equal to 0 are replaced with the smallest E-value that is greater than 0. If that smallest E-value is above the user-defined cutoff, the zeros are replaced with cutoff * 10^-10 instead.
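The zero-replacement rule from the warning can be sketched in base R as follows. The helper name replace_zero_evalues is hypothetical and not part of Virusparies, and using the cutoff-based fallback when no positive E-value exists at all is an assumption of this sketch.

```r
# Hypothetical helper illustrating the rule; not a Virusparies function
replace_zero_evalues <- function(evalues, cut = 1e-05) {
  # Smallest E-value strictly greater than 0 (Inf if there is none: assumption)
  positive <- evalues[which(evalues > 0)]
  smallest <- if (length(positive) > 0) min(positive) else Inf

  # If the smallest positive value is above the cutoff, use cut * 10^-10
  replacement <- if (smallest > cut) cut * 1e-10 else smallest

  evalues[which(evalues == 0)] <- replacement
  evalues
}

e <- c(0, 1e-40, 3e-12, 0.5)
-log10(e)                        # the zero becomes Inf
-log10(replace_zero_evalues(e))  # all values are finite
```

Replacing the zeros before the -log10 transformation keeps every point on a finite axis, so hits with an E-value of 0 still appear in the box plot.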

Value

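A list containing:

  • the generated box plot.

  • the summary statistics.

  • the outliers ("ViralRefSeq_E" and "contig_len" only).

  • 'rows_belowthres': the hittable rows below the threshold given in 'cut' ("ViralRefSeq_E" only).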

Author(s)

Sergej Ruff

See Also

VirusHunterGatherer is available here: https://github.com/lauberlab/VirusHunterGatherer.

Examples

path <- system.file("extdata", "virushunter.tsv", package = "Virusparies")
file <- ImportVirusTable(path)

# plot 1 for E-values
plot1 <- VhgBoxplot(file, x_column = "best_query", y_column = "ViralRefSeq_E")
plot1

# plot 2 for identity
plot2 <- VhgBoxplot(file, x_column = "best_query", y_column = "ViralRefSeq_ident")
plot2

# plot 3 custom arguments used
plot3 <- VhgBoxplot(file,
                  x_column = "best_query",
                  y_column = "ViralRefSeq_E",
                  theme_choice = "grey",
                  subtitle = "Custom subtitle: E-values for custom query",
                  xlabel = "Custom x-axis label: Custom query",
                  ylabel = "Custom y-axis label: Viral Reference Evalue in -log10 scale",
                  legend_position = "right")
plot3

# import gatherer files
path2 <- system.file("extdata", "virusgatherer.tsv", package = "Virusparies")
vg_file <- ImportVirusTable(path2)


# plot 4: VirusGatherer plot for ViralRefSeq_taxonomy against contig length
plot4 <- VhgBoxplot(vg_file, x_column = "ViralRefSeq_taxonomy", y_column = "contig_len")
plot4




[Package Virusparies version 1.1.0 Index]