COVID-19 Data Project

The goal of the COVID-19 Data Project was to create a research-ready data set. Our data is sourced from local public health departments across the country and then validated by our team of over one hundred interns through an extensive quality-assurance procedure. Our data is updated regularly, but to allow time for this procedure before publication, the last three days of data are sourced from the Johns Hopkins University data set.

About the BroadStreet COVID-19 Project

 

The BroadStreet COVID-19 Data Project began on March 16, 2020. The project currently tracks COVID-19 cases and deaths at the county level for all U.S. states and territories, including historical data going back to January 20, 2020. BroadStreet is releasing this dataset to the public to help scientists, public health workers, educators, and government personnel understand, predict, and learn from the COVID-19 outbreak. In the coming weeks, this document will be accompanied by an in-depth, peer-reviewed scientific account of the BroadStreet COVID-19 Data Project. Our hope is that this will help guide disease surveillance in future epidemics and pandemics.

 

On a day-to-day basis, the BroadStreet COVID-19 Data Project team of approximately 125 volunteers compiles and organizes data on COVID-19 case and death counts at the county level from all 50 states and the District of Columbia. Within the company, many focus groups use this data to lead research, investigations, and further data collection in areas of study such as policy and racial equity.

 

Report Contributors

 

 

Tom Schmitt, PhD

Tracy Flood, MD PhD

Zach Sturgeon

Samin Charepoo

April Miller

Cedonia Thomas

Grace Gibbon

Sabrine Benzakour

Allison Muir, MHA

Background

 

In public health emergencies, the collection of data on a county, state, or regional basis is critical to enable public health officials to identify hotspots, restrict large gatherings, limit movement, and determine health care policy. [1] Health IT infrastructure in Singapore, Taiwan, and Korea has contributed to the ability of those countries to test, track, trace, and quarantine individuals with COVID-19, allowing cases and deaths to fall more rapidly than previously projected. [2-4]

 

In response to this public health emergency, BroadStreet is releasing data files with cumulative counts of coronavirus cases and deaths in the United States and its territories, at the county level, from January 20th, 2020 to the present. Over 120 interns and volunteers were recruited via the BroadStreet website [5] and through institutional referral of interested undergraduate and graduate students. The project grew quickly as the virus spread, leading to a large data set involving historical and present data collection, input, and analysis. The data team tracked daily numbers of deaths and cases throughout the entire United States at the county level.

 

The BroadStreet COVID-19 Data Project data is unique in that it is updated and corrected by quality assurance reviews in the days, weeks, and months following initial entry. We feel this provides an alternate method of collecting COVID-19 data and an alternate historical snapshot. This data allows individuals to visualize the impact of the pandemic on counties, states, and the United States as a whole. By including daily case totals from the first known case of the virus, it gives the public access to knowledge regarding the rate of transmission and spread of the virus.

 

BroadStreet COVID-19 Data Project Handbook

 

Due to the extenuating circumstances of the COVID-19 pandemic and availability of emergency data relief interns, new recruits are being accepted at the beginning of each month. To assist with the recruitment process, an educational handbook was created. This handbook contains information regarding the region and block system, data collection methods, and data quality assurance methods, along with examples of each and frequently asked questions.  

 

Data Collection

 

Data for dates prior to March 9th was taken from Johns Hopkins University. Data from March 9th to March 16th was compiled by researching historical reports from news sources and health department websites. Data from March 16th onward was entered daily using numbers reported by state and county health departments. Quality assurance teams compared recorded data to values reported by other aggregators, including the New York Times [15], Johns Hopkins University [16], and USA Facts [17], in order to identify errors and areas where further research was needed.
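The date ranges above can be summarized as a simple lookup. The following is a minimal sketch, not the project's actual tooling: the function name `source_for_date` and the label strings are hypothetical, the year 2020 is assumed from the project start date, and March 16th (which the text lists in two ranges) is assigned here to daily entry.

```python
from datetime import date

# Cutoff dates taken from the text above; the year 2020 is an assumption.
HISTORIC_RESEARCH_START = date(2020, 3, 9)
DAILY_ENTRY_START = date(2020, 3, 16)


def source_for_date(day: date) -> str:
    """Return a (hypothetical) label for how data for `day` was collected."""
    if day < HISTORIC_RESEARCH_START:
        return "Johns Hopkins University"
    if day < DAILY_ENTRY_START:
        return "historic research (news sources and health department websites)"
    # March 16th and later: assumed to be daily entry.
    return "daily entry from state and county health departments"


print(source_for_date(date(2020, 3, 12)))
```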

Findings & Highlights 

 

Counting Deaths and Cases

For this project, the specific definitions of cases and deaths often depend on the definitions used by the reporting state health departments. On April 16, we began including probable cases and deaths in our data wherever available, in accordance with CDC reporting recommendations. The CDC defines probable cases as meeting any of the following three requirements: 1) Incidents that meet clinical criteria and epidemiologic evidence without a positive molecular test result. 2) Incidents that meet presumptive laboratory evidence and either clinical criteria or epidemiologic evidence. 3) Incidents that meet vital records criteria; however, no confirmatory laboratory testing is performed for COVID-19. [14]
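The three conditions can be read as a boolean rule. The sketch below encodes them as worded in this document; it is not an official CDC implementation, and the class `CaseEvidence`, its field names, and the function `is_probable_case` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseEvidence:
    """Hypothetical flags describing the evidence on file for one incident."""
    clinical_criteria: bool = False
    epidemiologic_evidence: bool = False
    presumptive_lab_evidence: bool = False
    vital_records_criteria: bool = False
    positive_molecular_test: bool = False
    confirmatory_testing_performed: bool = False


def is_probable_case(e: CaseEvidence) -> bool:
    """Return True if any of the three conditions listed above holds."""
    # 1) Clinical criteria and epidemiologic evidence, no positive molecular test.
    cond1 = (e.clinical_criteria and e.epidemiologic_evidence
             and not e.positive_molecular_test)
    # 2) Presumptive lab evidence plus clinical criteria or epidemiologic evidence.
    cond2 = e.presumptive_lab_evidence and (
        e.clinical_criteria or e.epidemiologic_evidence)
    # 3) Vital records criteria with no confirmatory lab testing performed.
    cond3 = e.vital_records_criteria and not e.confirmatory_testing_performed
    return cond1 or cond2 or cond3


print(is_probable_case(CaseEvidence(clinical_criteria=True,
                                    epidemiologic_evidence=True)))
```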

 

This is how the first case in the United States would be represented at the county level for January 20, 2020: 05000US53061, Snohomish County Washington, WA, 1, 0. This is how the first case would be represented at the state level: 04000US53, Washington (state), WA, 1, 0. Both the county and state records begin with the county or state identification number, followed by the county or state name, the state abbreviation, the number of positive cases, and the number of confirmed deaths.
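The two sample records above can be read with a few lines of code. This is a minimal sketch that assumes each record is a single comma-separated line with exactly five fields and that names contain no commas; the released files may be laid out differently. The names `CountRecord`, `parse_record`, and `is_county` are hypothetical.

```python
from typing import NamedTuple


class CountRecord(NamedTuple):
    geo_id: str      # e.g. "05000US53061" (county) or "04000US53" (state)
    name: str        # county or state name
    state_abbr: str  # two-letter state abbreviation
    cases: int       # cumulative positive cases
    deaths: int      # cumulative confirmed deaths


def parse_record(line: str) -> CountRecord:
    """Split one comma-separated record into typed fields."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    geo_id, name, state_abbr, cases, deaths = parts
    return CountRecord(geo_id, name, state_abbr, int(cases), int(deaths))


def is_county(record: CountRecord) -> bool:
    """In the sample records, county identifiers start with '05000US'."""
    return record.geo_id.startswith("05000US")


county = parse_record("05000US53061, Snohomish County Washington, WA, 1, 0")
state = parse_record("04000US53, Washington (state), WA, 1, 0")
print(county.name, county.cases, is_county(county))
print(state.name, state.cases, is_county(state))
```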

 

Data Challenges

 

Accessibility of Data: Formatting

State websites reported data related to the COVID-19 pandemic in various formats, complicating any efforts to automate data entry. Formats used to report COVID-19 cases include .csv, .pdf, GIS, Microsoft Power BI, Tableau, and plain text. Consistent reporting in machine-readable formats would improve general access to public health data.

 

Reliability of Data: Lab Delays and Lab Reports Missing Data

Gathering accurate, real-time reports was complicated by an infrastructure not designed for the volume of tests required [11]. Many commercial labs were unable to process tests quickly, and both the state health departments and commercial labs encountered difficulties in issuing and processing reports of COVID-19 cases. As a result, state health departments may take anywhere from a day to several months to report a positive COVID-19 case. [12, 13]

 

Laboratories testing for COVID-19 are required to report test results to state or county health departments, depending on local laws [8]. Despite efforts by researchers and labs to collect data properly, commercial and federal laboratories have not consistently reported test results as required. Arizona offers one example of both reporting delays and missing data: “due to the large volume of results being processed, there was a slight delay in the result” and reports were described as “being incomplete because lab partners did not meet submission deadlines” [7].

Lab reports issued to health departments were frequently incomplete, requiring case interviewers to obtain missing information. This delayed inclusion of COVID-19 cases in the data for their county of residence [9, 10].

 

Interpretation of Data: Unclear Criteria

Health departments used varying criteria to define what constituted a COVID-19 case. This information was frequently hard to find or absent from health department websites, making it difficult to resolve differences between data reported by the state and by the county. In many cases we had to contact health departments directly to determine what criteria they were using to define a COVID-19 case.

 

Exceptions

The BroadStreet COVID-19 Data Project aimed to create a research-ready data set of confirmed COVID-19 cases and deaths in all U.S. states and territories. Primary data sources for this project included state and local health departments. Methodologies used for reporting COVID-19 cases and deaths differed among states and territories. Geographic exceptions are instances where reported cases did not map to standard county boundaries. Data entry exceptions are instances where reporting methods varied within a state. Data lags are instances where data was not updated every 24 hours; these discrepancies were present at the state and county levels. Case definition exceptions arose because only some states reported probable cases and deaths. Reporting exceptions occurred when states altered the methods they used to report cases and deaths. Notable geographic, data entry, case definition, and reporting exceptions have been organized into the Exception Table.

 

Anomalies/Trend Break Estimates

Quality assurance teams assessed the validity of the collected data by conducting investigatory research. Changes in previously reported cases and deaths led to anomalies or trend breaks within the data. Trend breaks were identified using automated statistical analysis. Anomalies were investigated and corrected where possible; however, not all could be corrected because of data limitations. Over 2,000 discrepancies were identified and fixed by the BroadStreet team.
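The statistical method used to find trend breaks is not described here. As one illustration of the idea, the sketch below flags days where a cumulative count decreases or where the daily increase is far larger than the recent median increase. The function name `find_trend_breaks` and the `window` and `factor` thresholds are hypothetical choices, not the project's actual parameters.

```python
from statistics import median
from typing import List


def find_trend_breaks(cumulative: List[int], window: int = 7,
                      factor: float = 5.0) -> List[int]:
    """Return indices into `cumulative` that look suspicious.

    An index is flagged when the count decreases, or when the daily
    increase exceeds `factor` times the median increase over the
    previous `window` days (only if that median is positive).
    """
    increases = [b - a for a, b in zip(cumulative, cumulative[1:])]
    flagged = []
    for i, inc in enumerate(increases):
        day = i + 1  # increases[i] is the change arriving at cumulative[i + 1]
        if inc < 0:
            # Cumulative counts should never go down.
            flagged.append(day)
            continue
        recent = increases[max(0, i - window):i]
        if len(recent) == window:
            typical = median(recent)
            if typical > 0 and inc > factor * typical:
                flagged.append(day)
    return flagged


series = [10, 12, 14, 16, 18, 20, 22, 24, 60, 58, 60]
print(find_trend_breaks(series))
```

For this made-up series, the check flags index 8 (the jump from 24 to 60) and index 9 (the drop from 60 to 58). Flagged days would still need manual investigation, since a jump can reflect a genuine backlog release rather than an entry error.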

Conclusion 

 

In the midst of these unpredictable and challenging times, society is in great need of a reliable, accessible, and consistent data source about the pandemic, which BroadStreet provides. By consistently collecting and tracking death counts and case numbers for every state at the county level, the BroadStreet team has been able to systematically produce a public data source that is accessible across systems in a reliable, machine-readable format. Although many datasets are available, many are inconsistent and contain data discrepancies that pose problems for researchers. The BroadStreet COVID-19 Data Project addresses these discrepancies.

Future Releases

 

This is the first release of the data and this document. As we continue to aggregate and release COVID-19 data, these release notes will be updated to document our methods. We feel it is essential to document the good and the bad of the process so that responses to future pandemics can benefit from what we have learned.





Additional Projects

 

Policy data collection

The COVID-19 Data Project and BroadStreet have partnered with Temple University’s Center for Public Health Law Research (CPHLR) to examine executive orders (i.e., policy) relating to COVID-19 longitudinally. CPHLR’s process has been applied to the crowd-sourcing capabilities of the COVID-19 Data Project in order to gather up-to-date executive orders and proclamations for each state, convert them into machine-readable files, and visually quantify the qualitative data. Each order is read and coded by Policy Teams in line with the scoping criteria of CPHLR, with the corresponding sections of the orders highlighted as a form of citation. Topics considered “In Scope” include Mask Requirement, Gathering Ban, School and Restaurant Restrictions and/or Closure, Traveler Restrictions, and more. Additional topics that are considered “Out of Scope” but are found to be common between states are recognized and noted. The effective period and any amending feature of each order are also noted. On a weekly basis, around 200 executive orders and their original sources are gathered, converted, read, and coded before a quality assurance process is completed to verify that all of the applicable information has been captured. These files and data can be used for quantitative analysis and further research projects.

 

Equity data collection

The COVID-19 Data Project’s newest creation, the Health Equity data track, is the first project of its kind collecting data on COVID-19 cases by race and ethnicity at the county level. The Equity teams conduct a monthly evaluation of every county in the United States and U.S. territories to determine which counties report their cases by race (where the options are White, Black/African American, Asian, American Indian/Alaska Native, Native Hawaiian/Pacific Islander, two or more races, Other, and Unknown) and by ethnicity (where the options are Hispanic (all races), Non-Hispanic, or Not Specified). After obtaining this information, the Equity teams document cumulative case numbers by race and ethnicity for each county on a daily basis. A quality assurance process is also implemented to ensure that the numbers have been recorded accurately. This data can be used for quantitative analysis and further research projects.

 

References

 

 

(1) Effler P, Ching-Lee M, Bogard A, Ieong M-C, Nekomoto T, Jernigan D. Statewide System of Electronic Notifiable Disease Reporting From Clinical Laboratories: Comparing Automated Reporting With Conventional Methods. JAMA. 1999;282(19):1845-1850. doi:10.1001/jama.282.19.1845

(2) Wang CJ, Ng CY, Brook RH. Response to COVID-19 in Taiwan: Big Data Analytics, New Technology, and Proactive Testing. JAMA. 2020;323(14):1341-1342. doi:10.1001/jama.2020.3151

(3) Wong JEL, Leo YS, Tan CC. COVID-19 in Singapore—Current Experience: Critical Global Issues That Require Attention and Action. JAMA. 2020;323(13):1243-1244. doi:10.1001/jama.2020.2467

(4) Information Technology–Based Tracing Strategy in Response to COVID-19 in South Korea—Privacy Controversies. JAMA. Accessed June 23, 2020. https://jamanetwork.com/journals/jama/fullarticle/2765252

(5) Sittig DF, Singh H. COVID-19 and the Need for a National Health Information Technology Infrastructure. JAMA. 2020;323(23):2373-2374. doi:10.1001/jama.2020.7239

(6) CDC. Coronavirus Disease 2019 (COVID-19). Centers for Disease Control and Prevention. Published February 11, 2020. Accessed June 14, 2020. https://www.cdc.gov/coronavirus/2019-ncov/php/reporting-pui.html

(7) Lab giant Sonora Quest missing from Arizona’s COVID-19 report. KTAR.com. Published June 29, 2020. Accessed July 12, 2020. https://ktar.com/story/3347600/with-incomplete-data-arizona-reports-625-coronavirus-cases-0-deaths/

(8) CDC. How to report COVID-19 Laboratory Data. Centers for Disease Control and Prevention. Published June 25, 2020. Accessed July 12, 2020. https://www.cdc.gov/coronavirus/2019-ncov/lab/reporting-lab-data.html#who-must-report

(9) Lindquist, Scott. Letter to Lab Directors/Managers and Clinical Partners in Washington state. Published June 8, 2020. Accessed July 12, 2020. https://www.doh.wa.gov/portals/1/documents/1500/clr-labprovdemorptltr.pdf

(10) Carlson, Cheri. Missing data leaves unknowns about who gets tested for COVID-19 in Ventura County. Ventura County Star. Published May 18, 2020. Accessed July 12, 2020. https://www.clarionledger.com/story/news/local/2020/05/18/ventura-county-california-coronavirus-covid-19-testing-data-missing/5191614002/

(11) Byers, Paul. Updated Guidelines for Submitting COVID-19 Reports to MSDH. Published June 24, 2020. Accessed July 12, 2020. https://msdh.ms.gov/msdhsite/_static/resources/8677.pdf

(12) Louisiana Department of Health Updates for 5/21/2020. Published May 21, 2020. Accessed July 12, 2020. http://ldh.la.gov/index.cfm/newsroom/detail/5598

(13) City of St. Louis Department of Health. Delays Continue in Processing COVID-19 Test Results. Published July 8, 2020. Accessed July 12, 2020. https://www.stlouis-mo.gov/government/departments/health/news/delays-continue-processing-covid-19-test-results.cfm

(14) CDC. Coronavirus Disease 2019 (COVID-19) 2020 Interim Case Definition. Published April 5, 2020. Accessed July 13, 2020. https://wwwn.cdc.gov/nndss/conditions/coronavirus-disease-2019-covid-19/case-definition/2020/

(15) Coronavirus in the U.S.: Latest Map and Case Count - The New York Times. Accessed May 14, 2020. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

(16) COVID-19 United States Cases by County. Johns Hopkins Coronavirus Resource Center. Accessed May 12, 2020. https://coronavirus.jhu.edu/us-map

(17) Coronavirus Locations: COVID-19 Map by County and State - USA Facts. Accessed July 14, 2020. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/