Project ArHa: Arenaviruses and Hantaviruses, globally distributed rodent borne zoonoses

Authors

Affiliations

David Simons

The Royal Veterinary College

Steph Seifert

Washington State University

Published

February 28, 2025

Abstract

Current host-pathogen association datasets provide synthesised information on hosts and their pathogens but do not contain temporal or geographic information. These resources often provide linking information to publications reporting the association but information including accession numbers of archived sequences, number of individuals tested and measures of prevalence among sampled populations are not immediately retrievable. Using these resources for inference beyond host-pathogen associations is therefore limited. Here, we aim to produce a database of host-pathogen associations for two viral families of small mammals which contain several known zoonoses, namely Arenaviridae and Hantaviridae. This database can be used to explore the distribution of small mammal hosts of known and suspected pathogens and the extent to which they have been sampled. Further, linkage to sequence data of known zoonoses will support analysis of the risk of viral reassortment between viral species within geographically co-located host species.

Project aims

Review the literature to produce a synthesised dataset on Arenaviridae and Hantaviridae among small mammals
Implement visualisation tools to explore sampling extent for hosts and pathogens to further understanding of host-pathogen associations
Investigate the relative contributions of ecological, geographic, and genomic factors leading to cross-species transmission and reassortment in rodent-associated arenaviruses and hantaviruses

Protocol pre-print

The protocol for this project has been submitted to PLOS Neglected Tropical Diseases and is currently available as a pre-print at bioRxiv.

Abstract

Arenaviruses and Hantaviruses, primarily hosted by rodents and shrews, represent significant public health threats due to their potential for zoonotic spillover into human populations. Despite their global distribution, the full impact of these viruses on human health remains poorly understood, particularly in regions like Africa, where data is sparse. Both virus families continue to emerge, with pathogen evolution and spillover driven by anthropogenic factors such as land use change, climate change, and biodiversity loss. Recent research highlights the complex interactions between ecological dynamics, host species, and environmental factors in shaping the risk of pathogen transmission and spillover. This underscores the need for integrated ecological and genomic approaches to better understand these zoonotic diseases. A comprehensive, spatially and temporally explicit dataset, incorporating host-pathogen dynamics and human disease data, is crucial for improving risk assessments, enhancing disease surveillance, and guiding public health interventions. Such a dataset (ArHa) would also support predictive modelling efforts aimed at mitigating future spillover events. This paper proposes the development of this unified database for small-mammal hosts of Arenaviruses and Hantaviruses, identifying gaps in current research and promoting a more comprehensive understanding of pathogen prevalence, spillover risk, and viral evolution.

Strengths and Limitations of this study

This dataset combines detailed spatial and temporal information, providing a unique resource for understanding geographic and temporal trends in Arenavirus and Hantavirus host-pathogen relationships.
By explicitly quantifying sampling biases and detection efforts, the dataset allows more robust and accurate assessments of pathogen prevalence and distribution.
The dataset offers a platform for linking ecological data with human health outcomes, supporting the identification of spillover hotspots.
The dataset relies on published material, which may vary in terms of detail, accuracy and completeness. Missing or imprecise information may limit the reliability of subsequent analyses.
The dataset will be produced as a static resource which could limit its relevance over time as emerging data will not be added.

Current status of the project

We are currently nearing completion of the first round of data extraction. The slides below are a presentation I delivered to collaborators to summarise the current state of the database.

Data extraction

Included studies

Information about each included study is extracted into the `descriptive` sheet.

Table 1: Study information extraction sheet

Column name	Description
study_id	A unique identifier for the included study
pubmed_id	The PubMed ID of the included study (if available)
DOI	The digital object identifier of the study (if available); if a DOI is not available, a weblink to a persistent page for the reference will be included
first_author_surname	The first author's surname
title	The title of the manuscript, report or book section
journal	The name of the journal, report or book
year	The year of publication
study_design	A free text entry succinctly describing the study design for the rodents or pathogens
sampling_effort	A free text entry to capture the effort of sampling, ideally in trap nights for rodent studies
data_access	Whether study data is available in complete form or whether summarised data only are available
linked_manuscripts	The DOI or weblink to other studies including the same dataset, either in its entirety or as a subset; this will be used to attempt to de-duplicate data

Rodent sampling

Rodent data is extracted into the rodent sheet. This includes the timing of data collection, the location of data collection, and the small mammal species detected, recorded either as detection/non-detection or as the number detected. Where a study does not report a species at a location but has detected it at other locations, the species will be entered as not detected at that location.

Table 2: Rodent information extraction sheet

Column name	Description
rodent_record_id	A unique identifier for the rodent species, at a specific location or timepoint reported by the study dependent on the level of aggregation reported in the study
study_id	A unique identifier to link a study to the `descriptive` sheet entry for that study
date	The period in which data collection of rodent samples occurred; this will be extracted at the highest temporal resolution provided
genus	The genus of the small-mammal as reported in the study
species	The species of the small-mammal as reported in the study
location	The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used
country	Country where trapping occurred; for multinational studies where numbers are not disaggregated by country, all countries will be included
habitat_type	High level habitat type will be recorded here at the scale for which trapping is recorded
coordinate_resolution	The description of coordinate levels provided in the study, e.g. aggregated at study site or study region
latitude	Latitude will be converted from coordinates presented to EPSG:4326
longitude	Longitude will be converted from coordinates presented to EPSG:4326
number	The number of detected individuals; for capture-mark-recapture studies the number of distinct individuals will be entered. For studies not explicitly reporting non-detection, values of 0 for a species or genus will be entered if it is detected elsewhere in the study
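
The non-detection rule described above can be sketched in base R. This is only an illustration, not the project's processing code: the data frame, its values and the `implicit_zero` flag are hypothetical, and the column names are borrowed from Table 2.

```r
# Hypothetical extract from one study: species counts per trapping location.
# Column names follow Table 2; the values are invented for illustration.
rodent <- data.frame(
  study_id = "study_1",
  location = c("village_a", "village_a", "village_b"),
  genus    = c("Mastomys", "Rattus", "Mastomys"),
  species  = c("Mastomys natalensis", "Rattus rattus", "Mastomys natalensis"),
  number   = c(12, 3, 7),
  stringsAsFactors = FALSE
)

# Every species reported anywhere in the study, crossed with every location.
species_seen <- unique(rodent[, c("study_id", "genus", "species")])
locations    <- unique(rodent[, c("study_id", "location")])
all_pairs    <- merge(species_seen, locations, by = "study_id")

# Attach the reported counts; pairs the study never reported come back as NA.
filled <- merge(all_pairs, rodent,
                by = c("study_id", "genus", "species", "location"),
                all.x = TRUE)

# Flag implicit non-detections (an assumed helper column) before writing 0.
filled$implicit_zero <- is.na(filled$number)
filled$number[filled$implicit_zero] <- 0

filled <- filled[order(filled$location, filled$species), ]
```

Keeping a flag such as `implicit_zero` is one possible design choice: it lets later analyses separate zeros a study reported explicitly from zeros inferred by this rule.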

Pathogen sampling

Pathogen assays are extracted into the pathogen sheet. This includes information on the host the sample originated from, the pathogen family and species being assayed for, and the method of the assay. For studies conducting multiple assays on the same samples for different pathogens, an additional record will be added for each assay. Similarly, if both antibody and direct detection are performed for the same pathogen on the same samples, additional records will be added.

Table 3: Pathogen information extraction sheet

Column name	Description
pathogen_record_id	A unique identifier for the group of samples from the same rodent species, at a specific location or timepoint, tested for the same pathogen using the same method
study_id	A unique identifier to link a study to the `descriptive` sheet entry for that study
date	The period in which data collection of rodent samples occurred; this will be extracted at the highest temporal resolution provided
host_genus	The genus of the small-mammal from which the sample originated as reported in the study
host_species	The species of the small-mammal from which the sample originated as reported in the study
location	The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used
country	Country where trapping occurred; for multinational studies where numbers are not disaggregated by country, all countries will be included
habitat_type	High level habitat type will be recorded here at the scale for which trapping is recorded
coordinate_resolution	The description of coordinate levels provided in the study, e.g. aggregated at study site or study region
latitude	Latitude will be converted from coordinates presented to EPSG:4326
longitude	Longitude will be converted from coordinates presented to EPSG:4326
pathogen_family	The family of virus being assayed for (i.e. Arenaviridae or Hantaviridae)
pathogen_species	The species of virus being assayed for if a specific test is being used. For assays unable to differentiate between multiple viral species, `Multiple` will be entered
detection_method	Whether the assay detects antibody, detects virus directly (e.g. PCR), or uses another method
number_tested	The number of distinct samples tested
number_negative	The number of reported negative samples
number_positive	The number of reported positive samples
number_inconclusive	The number of samples with inconclusive results
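
As a rough sketch of how the count columns in Table 3 could be turned into a prevalence estimate, the base R example below computes the proportion positive with an exact binomial (Clopper-Pearson) confidence interval from `binom.test()`. The records, the helper `prevalence_ci()` and the choice of interval are assumptions made for illustration, not part of the protocol.

```r
# Hypothetical assay records; column names follow Table 3, values are invented.
pathogen <- data.frame(
  pathogen_record_id = c("p1", "p2", "p3"),
  number_tested      = c(120, 40, 0),
  number_positive    = c(6, 0, 0),
  stringsAsFactors = FALSE
)

# Proportion positive with an exact binomial 95% confidence interval.
# Records with no samples tested have no defined prevalence and return NA.
prevalence_ci <- function(positive, tested) {
  if (is.na(tested) || tested == 0) {
    return(c(prevalence = NA_real_, lower = NA_real_, upper = NA_real_))
  }
  ci <- binom.test(positive, tested)$conf.int
  c(prevalence = positive / tested, lower = ci[1], upper = ci[2])
}

# mapply() returns one column per record, so transpose to one row per record.
estimates <- t(mapply(prevalence_ci,
                      pathogen$number_positive,
                      pathogen$number_tested))
pathogen <- cbind(pathogen, estimates)
```

Reporting an interval alongside the point estimate makes it clearer that zero positives from a small number of samples is weak evidence of absence.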

Pathogen sequences

If studies include linkage to complete or partial sequences of viruses archived in NCBI, they will be linked through the pathogen_sequences sheet.

Table 4: Pathogen sequences extraction sheet

Column name	Description
sequence_record_id	A unique identifier for the sequence record
study_id	A unique identifier to link a study to the `descriptive` sheet entry for that study
host_genus	The genus of the small-mammal from which the sample originated as reported in the study
host_species	The species of the small-mammal from which the sample originated as reported in the study
pathogen_species	The species of the pathogen
accession_number	The accession number for each record archived by the study
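
One way the sequence sheet could be joined back to the assay records is sketched below in base R. The join keys (`study_id`, `host_species`, `pathogen_species`) are an assumption based on the columns the two sheets share, and all values, including the accession numbers, are placeholders rather than real records.

```r
# Hypothetical assay records (subset of the Table 3 columns).
pathogen <- data.frame(
  pathogen_record_id = c("p1", "p2"),
  study_id           = c("study_1", "study_1"),
  host_species       = c("Mastomys natalensis", "Rattus rattus"),
  pathogen_species   = c("Lassa virus", "Seoul virus"),
  number_positive    = c(6, 1),
  stringsAsFactors = FALSE
)

# Hypothetical sequence records (Table 4 columns); accessions are placeholders.
sequences <- data.frame(
  sequence_record_id = c("s1", "s2", "s3"),
  study_id           = "study_1",
  host_species       = c("Mastomys natalensis", "Mastomys natalensis", "Rattus rattus"),
  pathogen_species   = c("Lassa virus", "Lassa virus", "Seoul virus"),
  accession_number   = c("XX000001", "XX000002", "XX000003"),
  stringsAsFactors = FALSE
)

# Left join: keep every assay record, attaching any sequences that match
# on study, host species and pathogen species.
linked <- merge(pathogen, sequences,
                by = c("study_id", "host_species", "pathogen_species"),
                all.x = TRUE)

# Count the archived sequences linked to each assay record.
n_sequences <- aggregate(sequence_record_id ~ pathogen_record_id,
                         data = linked, FUN = length)
```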

Zoonosis status

Finally, an additional sheet known_zoonoses will be produced containing all of the viral species sampled and whether they are known to cause disease in humans.

Table 5: Zoonosis status extraction sheet

Column name	Description
pathogen_id	A unique identifier for the pathogen species
pathogen_family	The family of pathogen
pathogen_species	The species of pathogen
known_zoonosis	A logical statement of pathogenicity among humans
disease_name	The name of the disease caused by the viral species; multiple diseases are entered separated by commas
icd_10	The ICD-10 associated name of the disease (if known)
disease_reference	A DOI of a publication that supports the `known_zoonosis` statement

Data processing

Raw data will be downloaded from Google Sheets using the googledrive package in R, with date-stamped files stored locally. Data will be imported into R for processing, cleaning and formatting to produce a dataset suitable for further analysis.
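
The date-stamped storage step might look something like the base R sketch below. It leaves out the Google Sheets download itself and only illustrates the file naming, which follows the `<date>_data.rds` pattern used by the mapping code later on this page. The temporary directory and the contents of `combined_data` are placeholders.

```r
# Placeholder location; a project would use its own raw data folder.
raw_dir <- file.path(tempdir(), "raw_data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the sheets after download: one data frame per sheet.
combined_data <- list(
  descriptive = data.frame(study_id = "study_1"),
  rodent      = data.frame(study_id = "study_1", number = 12)
)

# Stamp the file with the download date in ISO 8601 format.
search_date <- format(Sys.Date(), "%Y-%m-%d")
out_file <- file.path(raw_dir, paste0(search_date, "_data.rds"))
saveRDS(combined_data, out_file)

# ISO dates sort chronologically as text, so the last file is the newest.
snapshots <- sort(list.files(raw_dir,
                             pattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2}_data\\.rds$",
                             full.names = TRUE))
latest <- readRDS(snapshots[length(snapshots)])
```

Keeping each download as a separate date-stamped file means an analysis can be re-run against the exact extract it was originally based on.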

Outputs

Data visualisation

An RShiny web-based application will be produced to visualise the database and support future analysis. The source code of the app is available from the GitHub repository and it is currently hosted through shinyapps.io here

Rodent sampling

An example of a map produced from rodent sampling data included in the initial 9 studies is shown below.

Code

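# Packages used by the mapping code
library(here)        # project-relative file paths
library(readr)       # read_rds()
library(dplyr)       # select(), mutate(), group_by(), ...
library(tidyr)       # drop_na()
library(sf)          # st_as_sf()
library(htmltools)   # HTML()
library(colorspace)  # diverging_hcl()
library(leaflet)     # interactive maps

# Load the date-stamped extract and set the coordinate reference system
search_date <- "2023-01-06"
combined_data <- read_rds(here("others", "data", "raw_data", paste0(search_date, "_data.rds")))
project_crs <- "EPSG:4326"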
rodent_locations <- combined_data$rodent %>%
  select(genus, species, latitude, longitude, number) %>%
  mutate(number = as.numeric(number)) %>%
  drop_na(number) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = project_crs) %>%
  group_by(geometry, species) %>%
  mutate(number = sum(number)) %>%
  ungroup() %>%
  distinct() %>%
  rowwise() %>%
  mutate(labels = HTML(paste0("Species: ", species, "<br>", "Number detected: ", number)))
pal <- colorFactor(diverging_hcl(n = length(sort(unique(rodent_locations$genus))), palette = "Berlin"),
                   domain = sort(unique(rodent_locations$genus)))
leaflet(data = rodent_locations) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~ifelse(sqrt(number) == 0, 1, sqrt(number)),
                   color = ~pal(genus),
                   stroke = FALSE,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels) %>%
  addLegend("bottomright", pal = pal, values = ~genus, title = "Host genus", opacity = 0.8)

An interactive map displaying the locations of small mammal species detected in included studies. Selecting clustered points will expand the data points at those coordinates. Point colour indicates small mammal genus, and point size varies with the number of individuals detected. Hovering over a point or selecting it will show the species name and the number detected at the location. Data is currently shown for 9 studies.

Pathogen detection

A similar map can be produced for pathogen detection, with separate layers for acute infection (i.e. PCR-positive samples) and evidence of prior infection (i.e. antibody-positive samples).

Code

pathogen_locations <- combined_data$pathogen %>%
  select(detection_method, pathogen_family, pathogen_species, host_species, latitude, longitude, number_tested, number_positive) %>%
  mutate(number_tested = as.numeric(number_tested),
         number_positive = as.numeric(number_positive)) %>%
  drop_na(number_tested, number_positive) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = project_crs) %>%
  group_by(geometry, detection_method, pathogen_family, pathogen_species, host_species) %>%
  mutate(number_tested = sum(number_tested),
         number_positive = sum(number_positive)) %>%
  ungroup() %>%
  distinct() %>%
  mutate(percent_positive = case_when(is.nan(number_positive/number_tested * 100) ~ "NA",
                                      number_positive == 0 ~ "NA",
                                      number_positive/number_tested * 100 < 1 ~ "< 1%",
                                      TRUE ~ paste0(as.character(round(number_positive/number_tested * 100, 2)), "%")),
         detection_method = case_when(detection_method == "pcr" ~ "Viral",
                                      detection_method == "antibody" ~ "Antibody"),
         detection = case_when(number_positive >= 1 ~ "Detected",
                               TRUE ~ "Not detected")) %>%
  rowwise() %>%
  mutate(labels = HTML(paste0("Pathogen family: ", pathogen_family, "<br>",
                              "Pathogen species: ", pathogen_species, "<br>",
                              "Host species: ", host_species, "<br>",
                              "Number tested: ", number_tested, "<br>",
                              "Number positive (%): ", number_positive, " (", percent_positive, ")", "<br>")))
path_pal <- colorFactor(diverging_hcl(n = length(unique(pathogen_locations$pathogen_species)), palette = "Berlin"),
                        domain = sort(unique(pathogen_locations$pathogen_species)))
stroke_pal <- colorFactor(palette = c("white", "gray"), levels = c("Detected", "Not detected"))
leaflet(data = pathogen_locations) %>%
  addTiles() %>%
  addCircleMarkers(data = pathogen_locations %>%
                     filter(detection_method == "Viral") %>%
                     filter(number_tested != 0),
                   radius = ~ifelse(sqrt(number_tested) == 0, 1, sqrt(number_tested)),
                   fill = TRUE,
                   fillColor = ~path_pal(pathogen_species),
                   stroke = TRUE,
                   color = ~stroke_pal(detection),
                   weight = 2,
                   opacity = 1,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels,
                   group = "Viral detection") %>%
  addCircleMarkers(data = pathogen_locations %>%
                     filter(detection_method == "Antibody") %>%
                     filter(number_tested != 0),
                   radius = ~ifelse(sqrt(number_tested) == 0, 1, sqrt(number_tested)),
                   fill = TRUE,
                   fillColor = ~path_pal(pathogen_species),
                   stroke = TRUE,
                   color = ~stroke_pal(detection),
                   weight = 2,
                   opacity = 1,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels,
                   group = "Antibody detection") %>%
  addLayersControl(overlayGroups = c("Viral detection", "Antibody detection"),
                   options = layersControlOptions(collapsed = FALSE)) %>%
  addLegend("bottomright", pal = path_pal, values = ~pathogen_species, title = "Pathogen species", opacity = 0.8)

An interactive map showing the locations of pathogen sampling. Selecting clustered points will expand the data. Each individual circle represents a single host species-pathogen combination at that sampling location. Circle size scales with the number of samples tested. Selecting a point will display information about the host-pathogen association, including the number of samples tested and the number of positive samples. A white border around the circle designates host-pathogen associations with at least one positive result. The colour of the circle indicates the pathogen species (where known). Two layers are available, for viral detection and antibody detection.

Host pathogen associations

Code

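# Packages used by the association plot
library(dplyr)    # select(), mutate(), left_join(), ...
library(tidyr)    # drop_na()
library(caret)    # preProcess()
library(ggplot2)  # ggplot(), geom_tile(), ...
library(plotly)   # ggplotly()

# Assay records with at least one sample tested, joined to zoonosis status
pathogen_summary <- combined_data$pathogen %>%
  select(genus = host_genus,
         species = host_species,
         pathogen_family, pathogen_species,
         number_tested, number_positive,
         detection_method) %>%
  mutate(number_tested = as.numeric(number_tested),
         number_positive = as.numeric(number_positive)) %>%  # needed for sum() below
  filter(number_tested >= 1) %>%
  left_join(combined_data$zoonoses %>%
              select(pathogen_species, known_zoonosis),
            by = c("pathogen_species"))
# Total tests and positives per host species, pathogen species and detection method
hp_associations <- combined_data$rodent %>%
  select(genus, species) %>%
  left_join(pathogen_summary %>%
              group_by(detection_method, species, pathogen_species, known_zoonosis) %>%
              mutate(number_tested = sum(number_tested),
                     number_positive = sum(number_positive)) %>%
              ungroup(), by = c("genus", "species")) %>%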
  drop_na(number_tested) %>%
  distinct() %>%
  filter(!pathogen_species %in% c("Multiple", "Puumala virus/Saaremaa virus")) %>%
  arrange(pathogen_family, pathogen_species, genus, species) %>%
  mutate(association = case_when(number_positive >= 1 ~ TRUE,
                                 TRUE ~ FALSE),
         detection_method = case_when(detection_method == "pcr" ~ "Viral",
                                      detection_method == "antibody" ~ "Antibody"))
scaled_n <- preProcess(as.data.frame(log10(hp_associations$number_tested)), method = "range")
hp_associations$alpha <- predict(scaled_n, as.data.frame(log10(hp_associations$number_tested)))[, 1]
host_pathogen_plot <- hp_associations %>%
  ggplot(aes(x = pathogen_species, y = species)) +
  geom_tile(aes(fill = association, colour = known_zoonosis, alpha = alpha,
                text = paste0("Pathogen species: ", pathogen_species, "<br>",
                              "Host species: ", species, "<br>",
                              "Number tested: ", number_tested, "<br>",
                              "Number positive: ", number_positive, "<br>",
                              "Known zoonosis: ", known_zoonosis))) +
  scale_fill_viridis_d() +
  scale_colour_manual(values = c("white", "black")) +
  facet_wrap(~ detection_method) +
  guides(alpha = "none",
         colour = "none") +
  labs(fill = "H-P association",
       x = NULL,
       y = NULL) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(host_pathogen_plot, tooltip = "text")

An interactive plot displaying observed host-pathogen associations. Facets are produced for antibody-based assays and viral detection assays. The rodent species are listed in alphabetical order on the y-axis, and detected pathogens are displayed in alphabetical order on the x-axis. Purple tiles represent host-pathogen associations that were not observed for that detection type; yellow tiles represent observed host-pathogen associations. The strength of the colour (the alpha) is scaled to the number of assays performed for that host-pathogen pair. A black line surrounding the tile indicates that the pathogen is a known zoonosis. Hovering over the relevant tile will show the number of assays and the number of positive samples.
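
The alpha scaling in the plot code is a min-max rescaling of the log10 number of samples tested, which `preProcess(..., method = "range")` maps onto the interval 0 to 1 by default. A base R sketch of the same arithmetic is given below. The helper `rescale_log10()` is hypothetical, and its handling of the case where all values are equal is an added assumption.

```r
# Min-max rescaling of log10 counts onto [0, 1].
# Assumes every count is at least 1, matching the filter(number_tested >= 1)
# step in the plotting code.
rescale_log10 <- function(n) {
  log_n <- log10(n)
  span  <- max(log_n) - min(log_n)
  # If every pair has the same count there is nothing to scale; use full alpha.
  if (span == 0) {
    return(rep(1, length(n)))
  }
  (log_n - min(log_n)) / span
}

# log10 of these counts is 0, 1, 2, 3, so the rescaled values are
# 0, 1/3, 2/3 and 1.
alpha <- rescale_log10(c(1, 10, 100, 1000))
```

Working on the log scale stops a few very heavily sampled host-pathogen pairs from compressing all other tiles towards the faint end of the range.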

Citation

BibTeX citation:

@online{simons2025,
  author = {Simons, David and Seifert, Steph},
  title = {Project {ArHa:} {Arenaviruses} and {Hantaviruses,} Globally
    Distributed Rodent Borne Zoonoses},
  date = {2025-02-28},
  langid = {en},
  abstract = {Current host-pathogen association datasets provide
    synthesised information on hosts and their pathogens but do not
    contain temporal or geographic information. These resources often
    provide linking information to publications reporting the
    association but information including accession numbers of archived
    sequences, number of individuals tested and measures of prevalence
    among sampled populations are not immediately retrievable. Using
    these resources for inference beyond host-pathogen associations is
    therefore limited. Here, we aim to produce a database of
    host-pathogen associations for two viral families of small mammals
    which contain several known zoonoses, namely Arenaviridae and
    Hantaviridae. This database can be used to explore the distribution
    of small mammal hosts of known and suspected pathogens and the
    extent to which they have been sampled. Further, linkage to sequence
    data of known zoonoses will support analysis of the risk of viral
    reassortment between viral species within geographically co-located
    host species.}
}

For attribution, please cite this work as:

Simons, David, and Steph Seifert. 2025. “Project ArHa: Arenaviruses and Hantaviruses, Globally Distributed Rodent Borne Zoonoses.” February 28, 2025.