COVID-19 Data Project

The goal of the COVID-19 Data Project was to create a research-ready data set. Our data is sourced from local public health departments across the country and then validated by our team of over one hundred interns through an extensive quality-assurance procedure. Our data is updated regularly, but to allow time for this procedure before publication, the last three days of data are sourced from the Johns Hopkins University data set.

You can also download our codebook here.

About the BroadStreet COVID-19 Project

 

The BroadStreet COVID-19 Data Project began on March 16, 2020. The project currently tracks COVID-19 cases and deaths at the county level for all U.S. states and territories, including historical data going back to January 20, 2020. BroadStreet is releasing this dataset to the public to help scientists, public health workers, educators, and government personnel understand, predict, and learn from the COVID-19 outbreak. In the coming weeks, this document will be accompanied by an in-depth, peer-reviewed scientific account of the BroadStreet COVID-19 Data Project. Our hope is that this will help guide disease surveillance in future epidemics and pandemics.

 

On a day-to-day basis, the BroadStreet COVID-19 Data Project team of over 120 volunteers compiles and organizes data on COVID-19 case and death numbers at the county level from all 50 states and the District of Columbia. Within the company, several focus groups use this data to lead research, investigations, and data collection in areas of study such as policy and racial equality, among others.

 

Report Contributors

 

 

Tom Schmitt, PhD

Tracy Flood, MD PhD

Zach Sturgeon

Samin Charepoo

April Miller

Cedonia Thomas

Grace Gibbon

Sabrine Benzakour

Allison Muir, MHA

Tracey Yao

 

Background

 

In public health emergencies, the collection of data on a county, state, or regional basis is critical to enable public health officials to identify hotspots, restrict large gatherings, limit movement, and determine health care policy. 1 Health IT infrastructure in Singapore, Taiwan, and South Korea has contributed to the ability of those countries to test, track, trace, and quarantine individuals with COVID-19, allowing cases and deaths to fall more rapidly than previously projected. 2-4

In response to this public health emergency, BroadStreet is releasing data files with cumulative counts of coronavirus cases and mortalities in the United States and its territories, at the county level, from January 20th, 2020 to present. Over 120 interns and volunteers were recruited via the BroadStreet website 5, and through institutional reference for interested undergraduate or graduate students. The project grew quickly as the virus spread, leading to a large data set involving historical and present data collection, input, and analysis. The data team tracked daily numbers of deaths and cases throughout the entire United States at the county level.

The BroadStreet COVID-19 Data Project data is unique in that it is updated and corrected by quality assurance reviews in the days, weeks, and months following initial entry. We feel this provides an alternate method of collecting COVID-19 data and an alternate historical snapshot. This data allows individuals to visualize the impact of the pandemic on counties, states, and the United States as a whole. Because it includes daily case totals from the first known case of the virus, it gives the public access to knowledge about the rate of transmission and spread of the virus.

 

BroadStreet COVID-19 Data Project Handbook

 

Due to the extenuating circumstances of the COVID-19 pandemic and availability of emergency data relief interns, new recruits are being accepted at the beginning of each month. To assist with the recruitment process, an educational handbook was created. This handbook contains information regarding the region and block system, data collection methods, and data quality assurance methods, along with examples of each and frequently asked questions.  

 

Data Collection

 

Data entered prior to March 9th was taken from Johns Hopkins University. Data from March 9th to March 16th was gathered by researching historic data from news sources and health department websites. Data from March 16th onward was entered daily using numbers reported by state and county health departments. Quality assurance teams compared recorded data to values reported by other aggregators, including the New York Times 15, Johns Hopkins University 16, and USAFacts 17, in order to identify errors and areas where further research was needed.

Beginning on March 16, 2020, the BroadStreet team (consisting of approximately 120 volunteers) began tracking confirmed cumulative case and death totals. Volunteers were organized into regional groups consisting of a data entry team and a quality assurance team. Each team was supervised by a designated lead who held team members accountable for their duties.

Data entry team members entered data from state or county health departments each day into a shared Google Sheet. Team leads then did a first check to ensure that the data was filled in completely and matched reported state totals. Quality assurance team members then compared what was entered into our dataset with the totals reported by our comparison sets: Johns Hopkins University 16, the New York Times 15, and USAFacts 17. These were used to identify potential errors. If the value entered for any county was inconsistent with what these comparison sets reported, quality assurance team members researched the case or death totals in that county to determine which value was correct and left a comment on the relevant cell with the results of their research.
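The comparison step described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function name `flag_discrepancies`, the dictionary layout, the short source labels, and all of the numbers are hypothetical, chosen only to show how an entered value might be flagged for research when a comparison set disagrees.

```python
from typing import Dict, List


def flag_discrepancies(
    entered: Dict[str, int],
    comparison_sets: Dict[str, Dict[str, int]],
) -> Dict[str, List[str]]:
    """Map each county ID to the comparison sources that disagree with our value.

    Counties missing from a comparison source are skipped for that source,
    since there is nothing to compare against.
    """
    flagged: Dict[str, List[str]] = {}
    for county_id, value in entered.items():
        differing = [
            source
            for source, counts in comparison_sets.items()
            if county_id in counts and counts[county_id] != value
        ]
        if differing:
            flagged[county_id] = differing
    return flagged


# Made-up cumulative case totals for a single day (illustration only).
entered = {"05000US53061": 10, "05000US53033": 25}
comparison_sets = {
    "JHU": {"05000US53061": 10, "05000US53033": 27},
    "NYT": {"05000US53061": 10, "05000US53033": 25},
    "USAFacts": {"05000US53061": 10},
}

flagged = flag_discrepancies(entered, comparison_sets)
print(flagged)  # {'05000US53033': ['JHU']}
```

A flagged county is only a prompt for further research; as described above, a team member still had to determine which value was correct.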

Cumulative totals should, by definition, never decrease from one day to the next. In any instance where the case or death totals decreased from the previous day, we investigated the cause of the revision and corrected historic data accordingly. However, health departments frequently do not report the information needed to correct these “trend breaks.”
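A trend break in this sense is any day on which a cumulative total is lower than the day before. The following Python sketch shows one simple way to find such days. The function name `find_trend_breaks` and the sample series are hypothetical, and this basic day-over-day check is only an illustration, not the statistical procedure the project used.

```python
from typing import List, Tuple


def find_trend_breaks(cumulative: List[int]) -> List[Tuple[int, int, int]]:
    """Return (day_index, previous_total, current_total) for every decrease.

    `cumulative` is a list of cumulative totals in date order, one per day.
    """
    breaks = []
    for day in range(1, len(cumulative)):
        previous, current = cumulative[day - 1], cumulative[day]
        if current < previous:
            breaks.append((day, previous, current))
    return breaks


# Made-up cumulative case totals for one county (illustration only).
series = [0, 1, 3, 7, 6, 9, 9, 12]

breaks = find_trend_breaks(series)
print(breaks)  # [(4, 7, 6)]
```

A day on which the total is unchanged (9 followed by 9) is not flagged, because a cumulative total may stay flat without being revised downward.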

 

Findings & Highlights 

 

Counting Deaths and Cases

 

For this project, the specific definitions of cases and deaths often depend on the definitions used by the reporting state health departments. On April 16, we began including probable cases and deaths in our data wherever available, in accordance with CDC reporting recommendations. The CDC defines a probable case as one meeting any of the following three requirements: 1) Incidents that meet clinical criteria and epidemiologic evidence without a positive molecular test result. 2) Incidents that meet presumptive laboratory evidence and either clinical criteria or epidemiologic evidence. 3) Incidents that meet vital records criteria but for which no confirmatory laboratory testing is performed for COVID-19. 14

 

The data is structured in wide form. This is how the first case in the United States would be represented at the county level for January 20, 2020: 05000US53061, Snohomish County Washington, WA, 1, 0, 0, 0. This is how the first case would be represented at the state level: 04000US53, Washington (state), WA, 1, 0. Both the county and state records begin with the county or state identification number, followed by the county or state name and the state abbreviation. At the county level, the counts recorded are, in order, confirmed positive cases, probable positive cases, confirmed deaths, and probable deaths. At the state level, the counts recorded are the total number of cases followed by the total number of deaths.
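To make the county-level field order concrete, here is a minimal Python sketch that parses the example record quoted above. The class name `CountyRecord`, its field names, and the function `parse_county_row` are hypothetical labels introduced for illustration; they are not taken from the project's codebook. The sketch also assumes that the county name contains no commas; a real file would be better read with a CSV parser.

```python
from dataclasses import dataclass


@dataclass
class CountyRecord:
    geo_id: str            # county identification number, e.g. "05000US53061"
    name: str              # county name
    state_abbr: str        # two-letter state abbreviation
    confirmed_cases: int
    probable_cases: int
    confirmed_deaths: int
    probable_deaths: int


def parse_county_row(row: str) -> CountyRecord:
    """Parse one comma-separated county record in the order described above."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    geo_id, name, state_abbr = fields[:3]
    counts = [int(field) for field in fields[3:]]
    return CountyRecord(geo_id, name, state_abbr, *counts)


record = parse_county_row(
    "05000US53061, Snohomish County Washington, WA, 1, 0, 0, 0"
)
print(record.name, record.confirmed_cases, record.probable_deaths)
# Snohomish County Washington 1 0
```

A state-level record could be handled the same way with five fields: the identification number, name, abbreviation, total cases, and total deaths.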

 

Data Challenges

 

Changes in Process

During the early months, we found that multiple reliable sources often reported conflicting case totals for the same county. The most frequent disagreement was between county health departments and state health departments. There are several potential causes for this. The two may use different criteria to report cases, such as one source including probable cases or prisoners where the other does not. Most commonly, the difference depends on the path that information about a case takes in that particular state. If the state is notified of cases by individual county health departments, there may be a delay in the state’s ability to report these cases as it processes new reports.

Quality assurance team members would regularly check multiple reliable sources and use the highest case total reported for that day. As the virus spread and more counties began reporting cases, checking multiple sources for each county became an untenable strategy. In late June, we began choosing designated sources to enter data from in each county to simplify this process. More details on specific sources can be found here.

On 4/16/20, we began including probable cases in our data wherever they are reported. This is consistent with CDC interim guidelines published on 4/5/20.

On 7/27/20, we began separating probable cases and deaths from confirmed cases and deaths in our data. Prior to this, we had been including them in the same figure. In some states, probable and confirmed cases cannot be separated because the health department reports only total cases.

On 9/4/20, the CDC published guidance on the use of rapid antigen tests, stating that they are a valuable tool for screening for the presence of COVID-19 and can be useful as a diagnostic tool if properly interpreted by clinicians. Some states have since begun including individuals who test positive via rapid antigen tests in either confirmed or probable case totals. These are included in our data where reported.

 

Accessibility of Data: Formatting

State websites reported data related to the COVID-19 pandemic in various formats, complicating any efforts to automate data entry. Formats used to report COVID-19 cases include .csv, .pdf, GIS, Microsoft Power BI, Tableau, and plain text. Consistent reporting in machine-readable formats would improve general access to public health data.

 

Reliability of Data: Lab Delays and Lab Reports Missing Data

Gathering accurate, real-time reports was complicated by an infrastructure not designed for the volume of tests required 11. Many commercial labs were unable to process tests quickly, and both the state health departments and commercial labs encountered difficulties in issuing and processing reports of COVID-19 cases. As a result, state health departments may take anywhere from a day to several months to report a positive COVID-19 case. 12, 13

Laboratories that test for COVID-19 are required to report test results to state or county health departments, depending on local laws 8. Although researchers and labs have tried to take the necessary measures for proper data collection, commercial and federal laboratories have not been able to report test results consistently and completely. Arizona offers one example of both laboratory reporting delays and missing data, where “due to the large volume of results being processed, there was a slight delay in the result” and reports were described as “being incomplete because lab partners did not meet submission deadlines” 7.

Lab reports issued to health departments were frequently incomplete, requiring case interviewers to obtain missing information. This delayed inclusion of COVID-19 cases in the data for their county of residence 9, 10.

 

Interpretation of Data: Unclear Criteria

Health departments used varying criteria to define what constituted a COVID-19 case. This information was frequently hard to find or absent from health department websites, making it difficult to resolve differences between data reported by the state and the county. In many cases we needed to contact health departments directly to determine what criteria they were using to define a COVID-19 case.

 

Exceptions

The BroadStreet COVID-19 Data Project aimed to create a research-ready data set of confirmed COVID cases and deaths in all U.S. states and territories. Primary data sources for this project included state and local health departments. Methodologies used for reporting COVID-19 cases and deaths differed among states and territories. Geographic exceptions were defined as instances where reported cases did not map to standard county boundaries. Data entry exceptions included instances within states where reporting methods varied. Data lags were instances where data was not updated every 24 hours; these discrepancies were present at the state and county levels. Case definitions differed among states, as some reported probable cases and deaths. Reporting exceptions occurred when states altered the methods they used to report cases and deaths. Notable geographic, data entry, case definition, and reporting exceptions have been organized into the Exception Table.

 

Anomalies/Trend Break Estimates

Quality assurance teams assessed the validity of the data collected by conducting investigatory research. Changes in previously reported cases and deaths led to anomalies or trend breaks within the data. Trend breaks were identified using automated statistical analysis. Anomalies were investigated and corrected where possible; however, not all could be corrected because of data limitations. Over 2,000 discrepancies were identified and fixed by the BroadStreet team.

Conclusion 

 

In the midst of these unpredictable and challenging times, society is in great need of a reliable, accessible, and consistent data source about the pandemic, which BroadStreet provides. By consistently collecting and tracking death counts and case numbers for every state at the county level, the BroadStreet team has been able to systematically produce a public data source that is accessible across systems in a reliable, machine-readable format. Although many datasets are available, many are inconsistent and contain discrepancies that pose problems for users. The BroadStreet COVID-19 Data Project is important because it addresses these discrepancies.

 

Future Releases

 

The most recent release of this document occurred on September 29th. Previous releases of this data and document occurred on July 17th and August 27th. As we continue to aggregate and release COVID-19 data, these release notes will be updated to document our methods. We feel it is essential to document both the good and the bad of the process so that responses to future pandemics can benefit from what we have learned.

 

 

References

 

 

  1.  Effler P, Ching-Lee M, Bogard A, Ieong M-C, Nekomoto T, Jernigan D. Statewide System of Electronic Notifiable Disease Reporting From Clinical Laboratories: Comparing Automated Reporting With Conventional Methods. JAMA. 1999;282(19):1845-1850. doi:10.1001/jama.282.19.1845

  2.  Wang CJ, Ng CY, Brook RH. Response to COVID-19 in Taiwan: Big Data Analytics, New Technology, and Proactive Testing. JAMA. 2020;323(14):1341-1342. doi:10.1001/jama.2020.3151

  3.  Wong JEL, Leo YS, Tan CC. COVID-19 in Singapore—Current Experience: Critical Global Issues That Require Attention and Action. JAMA. 2020;323(13):1243-1244. doi:10.1001/jama.2020.2467

  4.  Information Technology–Based Tracing Strategy in Response to COVID-19 in South Korea—Privacy Controversies | Global Health | JAMA | JAMA Network. Accessed June 23, 2020. https://jamanetwork.com/journals/jama/fullarticle/2765252

  5.  Sittig DF, Singh H. COVID-19 and the Need for a National Health Information Technology Infrastructure. JAMA. 2020;323(23):2373-2374. doi:10.1001/jama.2020.7239

  6.  CDC. Coronavirus Disease 2019 (COVID-19). Centers for Disease Control and Prevention. Published February 11, 2020. Accessed June 14, 2020. https://www.cdc.gov/coronavirus/2019-ncov/php/reporting-pui.html

  7.  Lab giant Sonora Quest missing from Arizona’s COVID-19 report. KTAR.com. Published June 29, 2020. Accessed July 12, 2020. https://ktar.com/story/3347600/with-incomplete-data-arizona-reports-625-coronavirus-cases-0-deaths/

  8.  CDC. How to report COVID-19 Laboratory Data. Centers for Disease Control and Prevention. Published June 25, 2020. Accessed July 12, 2020. https://www.cdc.gov/coronavirus/2019-ncov/lab/reporting-lab-data.html#who-must-report

  9.  Lindquist, Scott. Letter to Lab Directors/Managers and Clinical Partners in Washington state. Published June 8, 2020. Accessed July 12, 2020. https://www.doh.wa.gov/portals/1/documents/1500/clr-labprovdemorptltr.pdf

  10.  Carlson, Cheri. Missing data leaves unknowns about who gets tested for COVID-19 in Ventura County. Ventura County Star. Published May 18, 2020. Accessed July 12, 2020. https://www.clarionledger.com/story/news/local/2020/05/18/ventura-county-california-coronavirus-covid-19-testing-data-missing/5191614002/

  11.  Byers, Paul. Updated Guidelines for Submitting COVID-19 Reports to MSDH. Published June 24, 2020. Accessed July 12, 2020. https://msdh.ms.gov/msdhsite/_static/resources/8677.pdf

  12.  Louisiana Department of Health Updates for 5/21/2020. Published May 21, 2020. Accessed July 12, 2020. http://ldh.la.gov/index.cfm/newsroom/detail/5598

  13.  City of St. Louis Department of Health. Delays Continue in Processing COVID-19 Test Results. Published July 8, 2020. Accessed July 12, 2020. https://www.stlouis-mo.gov/government/departments/health/news/delays-continue-processing-covid-19-test-results.cfm

  14.  CDC. Coronavirus Disease 2019 (COVID-19) 2020 Interim Case Definition. Published April 5, 2020. Accessed July 13, 2020. https://wwwn.cdc.gov/nndss/conditions/coronavirus-disease-2019-covid-19/case-definition/2020/

  15.  Coronavirus in the U.S.: Latest Map and Case Count - The New York Times. Accessed May 14, 2020. https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

  16.  COVID-19 United States Cases by County. Johns Hopkins Coronavirus Resource Center. Accessed May 12, 2020. https://coronavirus.jhu.edu/us-map

  17.  Coronavirus Locations: COVID-19 Map by County and State - USAFacts. Accessed July 14, 2020. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/