COVID-19 Forecast About

Acknowledgments: This project has been funded by the School of Data Science, University of North Carolina at Charlotte.


Descriptions of Variables Displayed on the Maps:


State Level:


County Level:


Software Used: R version 4.0.1 and R packages ABSEIR, AMOEBA, aws.s3, dplyr, ggplot2, lubridate, matrixStats, RCurl, rgdal, rlecuyer, R.utils, spatstat.utils, snow, snowfall, spdep, tidyverse, and zoo.


Data Sources: The databases and resources used for this project are acknowledged here:


Data Cleaning: Standardized datasets of COVID-19 confirmed cases and deaths at the county and state levels in the United States were created by scraping data from the Johns Hopkins University GitHub repository. The data were cleaned by: 1. removing counties with zero population; 2. renaming columns to more straightforward names; 3. removing missing values; 4. modifying the date formats; and 5. identifying data reporting errors, such as decreases in cumulative counts, and replacing each incorrect cumulative count with the value from the previous day. In addition, one county that is an outlier in case fatality rate (Garfield County, Washington) was removed. This county has a reporting error, with 1 reported case and 2 reported deaths starting from March 22, 2020.
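The correction for decreasing cumulative counts can be illustrated with a minimal Python sketch. This is not the project's R code: the function name and the sample numbers are hypothetical, and the sketch assumes that "the value from the previous day" means the previous day's already corrected value, so a correction carries forward until the reported count catches up.

```python
def fix_cumulative_decreases(counts):
    """Return a copy of a cumulative series with decreases removed.

    Assumption: a value lower than the previous (already corrected) day
    is treated as a reporting error and replaced with that previous value.
    """
    fixed = []
    for value in counts:
        if fixed and value < fixed[-1]:
            fixed.append(fixed[-1])  # carry the previous day's value forward
        else:
            fixed.append(value)
    return fixed


# Hypothetical cumulative case counts for one county over seven days.
reported = [10, 12, 11, 15, 14, 14, 20]
cleaned = fix_cumulative_decreases(reported)
print(cleaned)  # [10, 12, 12, 15, 15, 15, 20]
```

The result is non-decreasing by construction, which is the property a cumulative count should have.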


Data Summaries: Data sets at the county and state levels were created for the following:

Hotspots and Coldspots: Hotspots and coldspots were identified using the Gi statistic, a spatial analytic method developed by Getis and Ord (1992). The analysis is run on several measures of COVID-19 incidence:


This statistic identifies areas in which low values (coldspots) or high values (hotspots) cluster over space. The search radius used to label counties by degree of coldness/hotness is 100 miles as the crow flies (straight-line distance), and inverse distance weighting is used.

Hotspot and coldspot analysis is done using the R function localG.
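For illustration only, the calculation behind a hotspot score can be sketched in Python rather than R. The sketch below computes a standardized Gi score (the form that excludes each location's own value) using inverse distance weights within a 100-mile great-circle radius. It is not the localG implementation used for this website; the function names, the toy coordinates and values, and the Earth radius constant are all assumptions made for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def local_g(points, radius_miles=100.0):
    """Standardized Gi score for each (lat, lon, value) point.

    Weights are 1/distance within the radius and 0 beyond it.
    Returns None for a point whose score is undefined
    (for example, no neighbors within the radius).
    """
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")
    scores = []
    for i, (lat_i, lon_i, _) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        weights = []
        for lat_j, lon_j, _ in others:
            d = haversine_miles(lat_i, lon_i, lat_j, lon_j)
            weights.append(1.0 / d if 0 < d <= radius_miles else 0.0)
        values = [p[2] for p in others]
        w_sum = sum(weights)
        w_sq_sum = sum(w * w for w in weights)
        mean_i = sum(values) / (n - 1)
        var_i = sum(v * v for v in values) / (n - 1) - mean_i ** 2
        denom_sq = var_i * ((n - 1) * w_sq_sum - w_sum ** 2) / (n - 2)
        if denom_sq <= 0:
            scores.append(None)
            continue
        numerator = sum(w * v for w, v in zip(weights, values)) - w_sum * mean_i
        scores.append(numerator / math.sqrt(denom_sq))
    return scores


# Hypothetical data: three nearby high-incidence points, then three
# nearby low-incidence points several hundred miles away.
toy_counties = [
    (35.0, -80.0, 90.0), (35.2, -80.0, 100.0), (35.0, -80.3, 95.0),
    (40.0, -90.0, 5.0), (40.2, -90.0, 10.0), (40.0, -90.3, 8.0),
]
scores = local_g(toy_counties)
for (lat, lon, value), score in zip(toy_counties, scores):
    print(f"({lat}, {lon}) value={value} Gi={score:.2f}")
```

In this toy example the first three points receive positive scores (high values surrounded by high values) and the last three receive negative scores (low values surrounded by low values), which is how hotspots and coldspots are distinguished.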

High Incidence Clusters: Clusters of high incidence of COVID-19 are detected using A Multidirectional Optimum Ecotope-Based Algorithm (AMOEBA), proposed by Aldstadt and Getis (2006), on the following measures:

The AMOEBA procedure is an iterative spatial data mining technique that incorporates the Gi statistic. It is capable of generating clusters of any shape.

It is conducted using the R function AMOEBA.

Social Distancing Data from Safegraph: The social distancing metric provided by this website (Time Spent at Home) is derived from the Safegraph Social Distancing Metrics, which are generated from GPS pings from anonymous mobile devices, grouped by census block group. For the purposes of this website, we have aggregated the data to the county and state levels. Safegraph describes the metric that we examined, median_home_dwell_time, as the median time in minutes that phones sent pings from their home location in each census block group. The nighttime location of each phone determines its assigned “home location.” Safegraph sums the minutes each device spent at its home location on each day and then takes the median of these values for each census block group.

To derive our county and state level social distancing metric, “Time Spent at Home”, we calculated a weighted median of median_home_dwell_time across the census block groups in each county or state, weighted by the device_count of each block group. This value is an interpolated median: an estimate of the median that takes into account the percentage of data points above and below the median, which gives a result with finer granularity than the scale of the underlying values permits. Any result that fell outside three standard deviations of that day’s data was replaced with the 3 SD value. Counties that did not provide data for a particular day were marked as “N/A” on those dates.
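The two steps described above can be sketched in Python under some stated assumptions. The sketch uses a plain weighted median, not the interpolated median used for this website, and it assumes that "three standard deviations" is measured from the daily mean using the population standard deviation. The function names and all sample values are hypothetical.

```python
import statistics


def weighted_median(values, weights):
    """Plain weighted median: the smallest value at which the cumulative
    weight reaches half of the total weight (no interpolation)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2.0:
            return value


def cap_at_three_sd(values):
    """Replace values beyond mean +/- 3 SD with the nearest 3 SD bound.

    Assumption: the bounds are centered on the mean and use the
    population standard deviation.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    low, high = mean - 3 * sd, mean + 3 * sd
    return [min(max(v, low), high) for v in values]


# Hypothetical census block groups in one county on one day:
# median_home_dwell_time (minutes) and device_count.
dwell_minutes = [600, 720, 480, 900]
device_counts = [50, 200, 30, 20]
county_value = weighted_median(dwell_minutes, device_counts)
print(county_value)  # 720

# Hypothetical county-level results for one day, with one extreme value.
daily_results = [700.0] * 20 + [5000.0]
capped = cap_at_three_sd(daily_results)
print(capped[-1])  # the extreme value is pulled down to the upper bound
```

Weighting by device_count means that block groups with more observed devices have more influence on the county value than sparsely observed ones.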

The states and counties used for these calculations include the 48 contiguous states, Alaska, Hawaii, and Washington, D.C. Areas that were not included are American Samoa, Guam, the Northern Mariana Islands, Puerto Rico, and the U.S. Virgin Islands.

Additional data for this website were sourced from the U.S. Census Bureau. The 2019 TIGER/Line Shapefiles were used to derive the county name, county code, state name, and state code for each census block group present in the Safegraph dataset. Census block groups are defined by the U.S. Census Bureau as statistical divisions of census tracts covering a contiguous area. All names and legal boundaries are current as of January 1, 2019.



Aldstadt, J., and A. Getis, (2006). “Using AMOEBA to Create a Spatial Weights Matrix and Identify Spatial Clusters,” Geographical Analysis, 38(4), 327-343.

Centers for Disease Control and Prevention. (2018, May 18). Principles of Epidemiology in Public Health Practice. An Introduction to Applied Epidemiology and Biostatistics.

Getis, A., and J.K. Ord, (1992). “The Analysis of Spatial Association by Use of Distance Statistics,” Geographical Analysis, 24(3), 189-206.

