Synthesising Arena- and Hantavirus data from rodents to understand current known host distributions and viral pathogens.

Rodent
Ecology
Zoonosis
Arenaviridae
Hantaviridae
Open Data
Authors
Affiliations

David Simons

The Royal Veterinary College

Steph Seifert

Washington State University

Published

January 27, 2023

Abstract

Current host-pathogen association datasets provide synthesised information on hosts and their pathogens but do not contain temporal or geographic information. These resources often provide linking information to publications reporting the association but information including accession numbers of archived sequences, number of individuals tested and measures of prevalence among sampled populations are not immediately retrievable. Using these resources for inference beyond host-pathogen associations is therefore limited. Here, we aim to produce a database of host-pathogen associations for two viral families of small mammals which contain several known zoonoses, namely Arenaviridae and Hantaviridae. This database can be used to explore the distribution of small mammal hosts of known and suspected pathogens and the extent to which they have been sampled. Further, linkage to sequence data of known zoonoses will support analysis of the risk of viral reassortment between viral species within geographically co-located host species.

ONGOING WORK

Project aims

  1. Review the literature to produce a synthesised dataset on Arenaviridae and Hantaviridae among small mammals
  2. Implement visualisation tools to explore sampling extent for hosts and pathogens to further understanding of host-pathogen associations
  3. Investigate the relative contributions of ecological, geographic, and genomic factors leading to cross-species transmission and reassortment in rodent-associated arenaviruses and hantaviruses

Method

Search strategy

An initial search was run on NCBI PubMed on 2023-01-06. The search terms and the number of records returned for each were:

  1. rodent* - 165,513
  2. arenavir* - 2,367
  3. hantavir*.mp. - 4,022
  4. 2 OR 3 - 6,308
  5. 1 AND 4 - 1,842

Citations were downloaded as a text file and imported into R for processing. Deduplication by PubMed ID resulted in 1,821 distinct citations.
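The record counts above can be cross-checked with simple arithmetic. The sketch below uses only the figures reported in this section to derive how many records matched both virus terms and how many duplicate PubMed IDs were removed. The toy `citations` data frame is hypothetical and only illustrates one way of deduplicating by `pmid`; it is not the processing code used for the project.

```r
# Counts reported for the 2023-01-06 search
n_arena <- 2367     # arenavir*
n_hanta <- 4022     # hantavir*
n_either <- 6308    # 2 OR 3
n_final <- 1842     # 1 AND 4
n_distinct <- 1821  # after deduplication by PubMed ID

# Inclusion-exclusion: records matching both virus terms
n_both <- n_arena + n_hanta - n_either

# Records dropped when deduplicating by PubMed ID
n_duplicates <- n_final - n_distinct

# Hypothetical toy example: keep the first record for each pmid
citations <- data.frame(pmid = c("101", "102", "102", "103"),
                        title = c("A", "B", "B", "C"))
distinct_citations <- citations[!duplicated(citations$pmid), ]
```

On these figures, 81 records match both virus terms and 21 records are removed as duplicates.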

Code
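# Packages assumed for import and wrangling (the setup chunk is not shown on this page)
library(tidyverse)
library(here)
search_date <- "2023-01-06"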
articles <- read_rds(here("others", "data", "search", paste0("citations_", search_date, ".rds")))
articles
# A tibble: 1,821 × 14
# Groups:   pmid [1,821]
   pmid  doi   title abstract year  month day   jabbrv journal keywords lastname
   <chr> <chr> <chr> <chr>    <chr> <chr> <chr> <chr>  <chr>   <chr>    <chr>   
 1 1003… ""    [Ven… "Venezu… 1998  1     1     Acta … Acta c… <NA>     Salas   
 2 1007… "10.… Hant… "We pre… 1999  2     1     South… Southe… <NA>     Schussl…
 3 1007… "10.… Spec… "Hantav… 1999  1     1     Annu … Annual… <NA>     Peters  
 4 1007… "10.… Isol… "Dobrav… 1999  2     1     J Gen… The Jo… <NA>     Nemirov 
 5 1008… "10.… Gene… "The 19… 1999  1     1     Emerg… Emergi… <NA>     Monroe  
 6 1008… "10.… Long… "Hantav… 1999  1     1     Emerg… Emergi… <NA>     Mills   
 7 1008… "10.… Long… "For 35… 1999  1     1     Emerg… Emergi… <NA>     Abbott  
 8 1008… "10.… A lo… "We det… 1999  1     1     Emerg… Emergi… <NA>     Kuenzi  
 9 1008… "10.… Stat… "A long… 1999  1     1     Emerg… Emergi… <NA>     Parment…
10 1008… "10.… Natu… "A mark… 1999  1     1     Emerg… Emergi… <NA>     Calisher
# ℹ 1,811 more rows
# ℹ 3 more variables: firstname <chr>, address <chr>, email <chr>

The returned citations were then screened to identify nine studies with which to trial data extraction.

Data extraction

Included studies

Information about each included study is extracted into a descriptive sheet.

Table 1: Study information extraction sheet
| Column name | Description |
|---|---|
| study_id | A unique identifier for the included study |
| pubmed_id | The PubMed ID of the included study (if available) |
| DOI | The digital object identifier of the study (if available); if no DOI is available, a weblink to a persistent page for the reference will be included |
| first_author_surname | The first author's surname |
| title | The title of the manuscript, report or book section |
| journal | The name of the journal, report or book |
| year | The year of publication |
| study_design | A free text entry succinctly describing the study design for the rodents or pathogens |
| sampling_effort | A free text entry to capture the effort of sampling, ideally in trap nights for rodent studies |
| data_access | Whether study data are available in complete form or only as summarised data |
| linked_manuscripts | The DOI or weblink to other studies including the same dataset, either in its entirety or as a subset; this will be used to attempt to de-duplicate data |

Rodent sampling

Rodent data are extracted into a rodent sheet. This includes the timing and location of data collection and the small mammal species detected, recorded either as detection/non-detection or as the number of individuals detected. If a study does not report a species at one location but reports it at other locations, the species is entered as not detected at that location.
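The non-detection rule described above can be sketched in base R as follows. The records, location names and species labels are hypothetical, and the approach (expanding to every location-by-species combination within a study and filling gaps with 0) is one possible implementation rather than the project's own processing code.

```r
# Hypothetical records from one study: "Species B" is not reported at village_2
reported <- data.frame(
  location = c("village_1", "village_1", "village_2"),
  species = c("Species A", "Species B", "Species A"),
  number = c(12, 3, 7)
)

# Every location-by-species combination seen anywhere in the study
all_combinations <- expand.grid(
  location = unique(reported$location),
  species = unique(reported$species),
  stringsAsFactors = FALSE
)

# Keep reported counts; enter 0 where a species was not reported at a location
rodent_records <- merge(all_combinations, reported, all.x = TRUE)
rodent_records$number[is.na(rodent_records$number)] <- 0
```

The result has one row per location and species, with an explicit 0 for "Species B" at village_2.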

Table 2: Rodent information extraction sheet
| Column name | Description |
|---|---|
| rodent_record_id | A unique identifier for the rodent species at a specific location or timepoint reported by the study, dependent on the level of aggregation reported in the study |
| study_id | A unique identifier to link a study to the descriptive sheet entry for that study |
| date | The period in which data collection of rodent samples occurred; this will be extracted at the highest temporal resolution provided |
| genus | The genus of the small mammal as reported in the study |
| species | The species of the small mammal as reported in the study |
| location | The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used |
| country | Country where trapping occurred; for multinational studies where numbers are not disaggregated by country, all countries will be included |
| habitat_type | High-level habitat type, recorded at the scale at which trapping is reported |
| coordinate_resolution | The description of coordinate levels provided in the study, e.g. aggregated at study site or study region |
| latitude | Latitude, converted from the coordinates presented to EPSG:4326 |
| longitude | Longitude, converted from the coordinates presented to EPSG:4326 |
| number | The number of detected individuals; for capture-mark-recapture studies the number of distinct individuals will be entered. For studies not explicitly reporting non-detection, a value of 0 will be entered for a species or genus if it is detected elsewhere in the study |
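EPSG:4326 stores positions as latitude and longitude in decimal degrees. As an illustration of the conversion step, the sketch below assumes a study reports coordinates in degrees, minutes and seconds on the same datum; the helper name `dms_to_decimal` and the example values are hypothetical. Coordinates reported in a projected system would instead need a transformation, for example with `sf::st_transform()`.

```r
# Convert degrees, minutes and seconds to decimal degrees.
# `hemisphere` is one of "N", "S", "E" or "W"; southern and western values are negative.
dms_to_decimal <- function(degrees, minutes = 0, seconds = 0, hemisphere = "N") {
  decimal <- degrees + minutes / 60 + seconds / 3600
  ifelse(hemisphere %in% c("S", "W"), -decimal, decimal)
}

# Hypothetical trapping site reported as 8 deg 30' 0" N, 11 deg 15' 0" W
latitude <- dms_to_decimal(8, 30, 0, "N")
longitude <- dms_to_decimal(11, 15, 0, "W")
```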

Pathogen sampling

Pathogen assays are extracted into the pathogen sheet. This includes information on the host from which the sample originated, the pathogen family and species being assayed for, and the assay method. For studies conducting multiple assays on the same samples for different pathogens, an additional record will be added for each assay. Similarly, if antibody and direct detection assays for the same pathogen are performed on the same samples, additional records will be added.

Table 3: Pathogen information extraction sheet
| Column name | Description |
|---|---|
| pathogen_record_id | A unique identifier for the group of samples from the same rodent species, at a specific location or timepoint, tested for the same pathogen using the same method |
| study_id | A unique identifier to link a study to the descriptive sheet entry for that study |
| date | The period in which data collection of rodent samples occurred; this will be extracted at the highest temporal resolution provided |
| host_genus | The genus of the small mammal from which the sample originated, as reported in the study |
| host_species | The species of the small mammal from which the sample originated, as reported in the study |
| location | The location of sampling effort. Depending on how data are presented in the study, this will match the coordinates given for trapping effort, e.g. if trapping is aggregated at village level, village names will be used |
| country | Country where trapping occurred; for multinational studies where numbers are not disaggregated by country, all countries will be included |
| habitat_type | High-level habitat type, recorded at the scale at which trapping is reported |
| coordinate_resolution | The description of coordinate levels provided in the study, e.g. aggregated at study site or study region |
| latitude | Latitude, converted from the coordinates presented to EPSG:4326 |
| longitude | Longitude, converted from the coordinates presented to EPSG:4326 |
| pathogen_family | The family of virus being assayed for (i.e. Arenaviridae or Hantaviridae) |
| pathogen_species | The species of virus being assayed for if a specific test is being used. For assays unable to differentiate between multiple viral species, "Multiple" will be entered |
| detection_method | Whether the assay attempts to detect antibody, to detect virus directly (e.g. PCR), or uses another method |
| number_tested | The number of distinct samples tested |
| number_negative | The number of reported negative samples |
| number_positive | The number of reported positive samples |
| number_inconclusive | The number of samples with inconclusive results |
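The count columns in Table 3 lend themselves to a simple consistency check before prevalence is calculated. The sketch below uses hypothetical records and assumes that negative, positive and inconclusive counts should sum to the number tested; that rule is an assumption for illustration and is not stated in the extraction protocol.

```r
# Hypothetical pathogen records using the column names from Table 3
pathogen_records <- data.frame(
  pathogen_record_id = c("p_1", "p_2", "p_3"),
  number_tested = c(40, 25, 10),
  number_negative = c(36, 25, 7),
  number_positive = c(3, 0, 2),
  number_inconclusive = c(1, 0, 0)
)

# Assumed rule: negative + positive + inconclusive equals the number tested
pathogen_records$counts_consistent <- with(
  pathogen_records,
  number_negative + number_positive + number_inconclusive == number_tested
)

# Crude percentage of tested samples that were positive
pathogen_records$percent_positive <- with(
  pathogen_records,
  round(number_positive / number_tested * 100, 1)
)
```

Here the third record is flagged because its counts sum to 9 rather than 10, which would prompt a re-check against the source publication.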

Pathogen sequences

If studies link to complete or partial viral sequences archived in NCBI, these will be linked through the pathogen_sequences sheet.

Table 4: Pathogen sequences extraction sheet
| Column name | Description |
|---|---|
| sequence_record_id | A unique identifier for the sequence record |
| study_id | A unique identifier to link a study to the descriptive sheet entry for that study |
| host_genus | The genus of the small mammal from which the sample originated, as reported in the study |
| host_species | The species of the small mammal from which the sample originated, as reported in the study |
| pathogen_species | The species of the pathogen |
| accession_number | The accession number for each record archived by the study |

Zoonosis status

Finally, an additional sheet known_zoonoses will be produced containing all of the viral species sampled and whether they are known to cause disease in humans.

Table 5: Zoonosis status extraction sheet
| Column name | Description |
|---|---|
| pathogen_id | A unique identifier for the pathogen species |
| pathogen_family | The family of the pathogen |
| pathogen_species | The species of the pathogen |
| known_zoonosis | A logical value indicating whether the pathogen is known to cause disease in humans |
| disease_name | The name of the disease caused by the viral species; multiple diseases are entered separated by commas |
| icd_10 | The ICD-10 name associated with the disease (if known) |
| disease_reference | The DOI of a publication that supports the known_zoonosis statement |

Data processing

Raw data will be downloaded from Google Sheets using the googledrive package in R, with date-stamped files stored locally. Data will then be imported into R for processing, cleaning and formatting to produce a dataset suitable for further analysis.
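A minimal sketch of the date-stamped download step is shown below. The helper names, the `raw_data` directory and the `sheet_id` argument are hypothetical placeholders, not the project's actual file layout. `googledrive::drive_download()` needs an authenticated session, so it is wrapped in a function and not called here. Exporting a Google Sheet as csv is expected to return only its first worksheet, so a workbook with several sheets may be better exported as "xlsx".

```r
# Build a date-stamped local path for a downloaded sheet.
# `sheet_name`, `dir` and `ext` are illustrative assumptions.
date_stamped_path <- function(sheet_name, date = Sys.Date(), dir = "raw_data", ext = "csv") {
  file.path(dir, paste0(format(as.Date(date), "%Y-%m-%d"), "_", sheet_name, ".", ext))
}

# Download one sheet to a date-stamped file.
# `sheet_id` stands in for the Google Sheets file ID; `dir` must already exist.
download_sheet <- function(sheet_id, sheet_name, date = Sys.Date(), ext = "csv") {
  googledrive::drive_download(
    file = googledrive::as_id(sheet_id),
    path = date_stamped_path(sheet_name, date, ext = ext),
    type = ext,
    overwrite = TRUE
  )
}

# Path that a download of the rodent sheet on the search date would use
rodent_path <- date_stamped_path("rodent", "2023-01-06")
```

Keeping the date in the file name means earlier extractions stay on disk and an analysis can be re-run against the exact snapshot it was built on.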

Outputs

Data visualisation

An RShiny web-based application will be produced to visualise the database and support future analysis. The source code of the app is available from the GitHub repository, and the app is currently hosted on shinyapps.io.

Rodent sampling

An example of a map produced from the rodent sampling data in the initial nine studies is shown below.

Code
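# Packages assumed for spatial data and mapping (the setup chunk is not shown on this page)
library(sf)
library(leaflet)
library(htmltools)
library(colorspace)
# Load the extracted data and set the coordinate reference system used throughout
combined_data <- read_rds(here("others", "data", "raw_data", paste0(search_date, "_data.rds")))
project_crs <- "EPSG:4326"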
rodent_locations <- combined_data$rodent %>%
  select(genus, species, latitude, longitude, number) %>%
  mutate(number = as.numeric(number)) %>%
  drop_na(number) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = project_crs) %>%
  group_by(geometry, species) %>%
  mutate(number = sum(number)) %>%
  ungroup() %>%
  distinct() %>%
  rowwise() %>%
  mutate(labels = HTML(paste0("Species: ", species, "<br>", "Number detected: ", number)))
pal <- colorFactor(diverging_hcl(n = length(sort(unique(rodent_locations$genus))), palette = "Berlin"),
                   domain = sort(unique(rodent_locations$genus)))
leaflet(data = rodent_locations) %>%
  addTiles() %>%
  addCircleMarkers(radius = ~ifelse(sqrt(number) == 0, 1, sqrt(number)),
                   color = ~pal(genus),
                   stroke = FALSE,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels) %>%
  addLegend("bottomright", pal = pal, values = ~genus, title = "Host genus", opacity = 0.8)

An interactive map displaying the locations of small mammal species detected in the included studies. Selecting a cluster will expand the data points at those coordinates. Point colour indicates small mammal genus, and point size varies with the number of individuals detected. Hovering over or selecting a point will show the species name and the number detected at that location. Data are currently shown for nine studies.

Pathogen detection

A similar map can be produced for pathogen detection, with separate layers for acute infection (i.e. PCR-positive samples) and evidence of prior infection (i.e. antibody-positive samples).

Code
pathogen_locations <- combined_data$pathogen %>%
  select(detection_method, pathogen_family, pathogen_species, host_species, latitude, longitude, number_tested, number_positive) %>%
  mutate(number_tested = as.numeric(number_tested),
         number_positive = as.numeric(number_positive)) %>%
  drop_na(number_tested, number_positive) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = project_crs) %>%
  group_by(geometry, detection_method, pathogen_family, pathogen_species, host_species) %>%
  mutate(number_tested = sum(number_tested),
         number_positive = sum(number_positive)) %>%
  ungroup() %>%
  distinct() %>%
  mutate(percent_positive = case_when(is.nan(number_positive/number_tested * 100) ~ "NA",
                                      number_positive == 0 ~ "NA",
                                      number_positive/number_tested * 100 < 1 ~ "< 1%",
                                      TRUE ~ paste0(as.character(round(number_positive/number_tested * 100, 2)), "%")),
         detection_method = case_when(detection_method == "pcr" ~ "Viral",
                                      detection_method == "antibody" ~ "Antibody"),
         detection = case_when(number_positive >= 1 ~ "Detected",
                               TRUE ~ "Not detected")) %>%
  rowwise() %>%
  mutate(labels = HTML(paste0("Pathogen family: ", pathogen_family, "<br>",
                              "Pathogen species: ", pathogen_species, "<br>",
                              "Host species: ", host_species, "<br>",
                              "Number tested: ", number_tested, "<br>",
                              "Number positive (%): ", number_positive, " (", percent_positive, ")", "<br>")))
path_pal <- colorFactor(diverging_hcl(n = length(sort(unique(pathogen_locations$pathogen_species))), palette = "Berlin"),
                        domain = sort(unique(pathogen_locations$pathogen_species)))
stroke_pal <- colorFactor(palette = c("white", "gray"), levels = c("Detected", "Not detected"))
leaflet(data = pathogen_locations) %>%
  addTiles() %>%
  addCircleMarkers(data = pathogen_locations %>%
                     filter(detection_method == "Viral") %>%
                     filter(number_tested != 0),
                   radius = ~ifelse(sqrt(number_tested) == 0, 1, sqrt(number_tested)),
                   fill = TRUE,
                   fillColor = ~path_pal(pathogen_species),
                   stroke = TRUE,
                   color = ~stroke_pal(detection),
                   weight = 2,
                   opacity = 1,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels,
                   group = "Viral detection") %>%
  addCircleMarkers(data = pathogen_locations %>%
                     filter(detection_method == "Antibody") %>%
                     filter(number_tested != 0),
                   radius = ~ifelse(sqrt(number_tested) == 0, 1, sqrt(number_tested)),
                   fill = TRUE,
                   fillColor = ~path_pal(pathogen_species),
                   stroke = TRUE,
                   color = ~stroke_pal(detection),
                   weight = 2,
                   opacity = 1,
                   fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(spiderfyOnMaxZoom = TRUE,
                                                         spiderLegPolylineOptions = list(length = 6),
                                                         spiderfyDistanceMultiplier = 2),
                   label = ~labels,
                   popup = ~labels,
                   group = "Antibody detection") %>%
  addLayersControl(overlayGroups = c("Viral detection", "Antibody detection"),
                   options = layersControlOptions(collapsed = FALSE)) %>%
  addLegend("bottomright", pal = path_pal, values = ~pathogen_species, title = "Pathogen species", opacity = 0.8)

An interactive map showing the locations of pathogen sampling. Selecting clustered points will expand the data. Each circle represents a single host species-pathogen combination at that sampling location, and circle size reflects the number of samples tested. Selecting a point will display information about the host-pathogen association, including the number of samples tested and the number of positive samples. A white border around the circle designates host-pathogen associations with at least one positive result. The colour of the circle indicates the pathogen species (where known). Two layers are available, one for viral detection and one for antibody detection.

Host pathogen associations

Code
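# Packages assumed for scaling and interactive plotting (the setup chunk is not shown on this page)
library(caret)
library(plotly)
pathogen_summary <- combined_data$pathogen %>%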
  select(genus = host_genus,
         species = host_species,
         pathogen_family, pathogen_species,
         number_tested, number_positive,
         detection_method) %>%
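  # Convert both count columns so they can be summed later
  mutate(number_tested = as.numeric(number_tested),
         number_positive = as.numeric(number_positive)) %>%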
  filter(number_tested >= 1) %>%
  left_join(combined_data$zoonoses %>%
              select(pathogen_species, known_zoonosis),
            by = c("pathogen_species"))
hp_associations <- combined_data$rodent %>%
  select(genus, species) %>%
  left_join(pathogen_summary %>%
              group_by(detection_method, species, pathogen_species, known_zoonosis) %>%
              mutate(number_tested = sum(number_tested),
                     number_positive = sum(number_positive)), by = c("genus", "species")) %>%
  drop_na(number_tested) %>%
  distinct() %>%
  filter(!pathogen_species %in% c("Multiple", "Puumala virus/Saaremaa virus")) %>%
  arrange(pathogen_family, pathogen_species, genus, species) %>%
  mutate(association = case_when(number_positive >= 1 ~ TRUE,
                                 TRUE ~ FALSE),
         detection_method = case_when(detection_method == "pcr" ~ "Viral",
                                      detection_method == "antibody" ~ "Antibody"))
scaled_n <- preProcess(as.data.frame(log10(hp_associations$number_tested)), method = "range")
hp_associations$alpha <- predict(scaled_n, as.data.frame(log10(hp_associations$number_tested)))[, 1]
host_pathogen_plot <- hp_associations %>%
  ggplot(aes(x = pathogen_species, y = species)) +
  geom_tile(aes(fill = association, colour = known_zoonosis, alpha = alpha,
                text = paste0("Pathogen species: ", pathogen_species, "<br>",
                              "Host species: ", species, "<br>",
                              "Number tested: ", number_tested, "<br>",
                              "Number positive: ", number_positive, "<br>",
                              "Known zoonosis: ", known_zoonosis))) +
  scale_fill_viridis_d() +
  scale_colour_manual(values = c("white", "black")) +
  facet_wrap(~ detection_method) +
  guides(alpha = "none",
         colour = "none") +
  labs(fill = "H-P association",
       x = element_blank(),
       y = element_blank()) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(host_pathogen_plot, tooltip = "text")

An interactive plot displaying observed host-pathogen associations. Facets are produced for antibody-based assays and viral detection assays. Rodent species are listed in alphabetical order on the y-axis and pathogen species in alphabetical order on the x-axis. Purple tiles represent host-pathogen associations that were not observed for that detection type; yellow tiles represent observed host-pathogen associations. The strength of the colour (the alpha) is scaled to the number of assays performed for that host-pathogen pair. A black line surrounding the tile indicates that the pathogen is a known zoonosis. Hovering over the relevant tile will show the number of assays and the number of positive samples.
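The alpha scaling described in the caption is produced in the code above with `caret::preProcess(method = "range")` applied to the log10 of the number tested. A base R sketch of the same min-max idea is given below. The function name `scale_alpha_range` and the handling of the case where every pair has the same number of assays are assumptions made for illustration.

```r
# Rescale log10(number tested) to the range [0, 1] for use as tile transparency
scale_alpha_range <- function(number_tested) {
  logged <- log10(number_tested)
  span <- max(logged) - min(logged)
  # Assumed fallback: if all pairs were sampled equally, use full strength
  if (span == 0) {
    return(rep(1, length(logged)))
  }
  (logged - min(logged)) / span
}

# Hypothetical numbers of assays for three host-pathogen pairs
alpha_values <- scale_alpha_range(c(1, 10, 100))
```

Working on the log scale means a pair tested 10 times sits midway between pairs tested once and 100 times, so a few heavily sampled pairs do not wash out the rest.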

Citation

BibTeX citation:
@online{simons2023,
  author = {Simons, David and Seifert, Steph},
  title = {Synthesising {Arena-} and {Hantavirus} Data from Rodents to
    Understand Current Known Host Distributions and Viral Pathogens.},
  date = {2023-01-27},
  url = {https://www.dsimons.org/others/arena_hanta.html},
  langid = {en},
  abstract = {Current host-pathogen association datasets provide
    synthesised information on hosts and their pathogens but do not
    contain temporal or geographic information. These resources often
    provide linking information to publications reporting the
    association but information including accession numbers of archived
    sequences, number of individuals tested and measures of prevalence
    among sampled populations are not immediately retrievable. Using
    these resources for inference beyond host-pathogen associations is
    therefore limited. Here, we aim to produce a database of
    host-pathogen associations for two viral families of small mammals
    which contain several known zoonoses, namely Arenaviridae and
    Hantaviridae. This database can be used to explore the distribution
    of small mammal hosts of known and suspected pathogens and the
    extent to which they have been sampled. Further, linkage to sequence
    data of known zoonoses will support analysis of the risk of viral
    reassortment between viral species within geographically co-located
    host species.}
}
For attribution, please cite this work as:
Simons, David, and Steph Seifert. 2023. “Synthesising Arena- and Hantavirus Data from Rodents to Understand Current Known Host Distributions and Viral Pathogens.” January 27, 2023. https://www.dsimons.org/others/arena_hanta.html.