
COVID-19 and Workplace Closure

In this notebook, we study the relationship between daily COVID-19 cases and workplace closures and how it varies over space and time. The goal is to set a template for studying how adherence to nonpharmaceutical interventions (NPIs) affects the spread of COVID-19. Per the CDC, an NPI is any approach other than vaccination or medication to prevent the spread of pandemic illnesses. In response to COVID-19, these included mask mandates and workplace closures. As NPIs can be implemented before vaccines and therapies are developed, it is crucial to understand which NPIs are adhered to and which are effective. The specific nonpharmaceutical intervention studied here is workplace closure, which we consider at a daily, country-level granularity between February and November 2020. We will use workplace mobility data as a proxy for adherence to this NPI and daily case counts as a proxy for its efficacy.

In [1]:
# general data manipulation
import pandas as pd

# ENSIGN tools
import ensign.csv2tensor as c2t
import ensign.cp_decomp as cpd
from ensign.visualize import plot_component

# inline plotting
import matplotlib
%matplotlib inline

# map plotting
import pycountry
import plotly.express as px

Data

In order to study the above relationships, it is necessary to bring together data about workplace closures, work attendance, and case counts on a daily, per country basis. These data are found, respectively, from:

  • University of Oxford: Coronavirus Government Response Tracker
  • Google: COVID-19 Community Mobility Reports
  • Johns Hopkins University: COVID-19 Data Repository

They were downloaded using the C3.ai COVID-19 API and joined by date and country so that for each country and each date between 2020-02-15 and 2020-11-29 there is an entry with an indication of workplace closure, a measure of workplace mobility, and a daily case count. Workplace closure policies are represented by 0 (no closure), 1 (closure recommended), 2 (partial closure in effect), or 3 (all-but-essential closure in effect). The workplace mobility measurement provided by Google gives the percent change from baseline (January 1, 2020) in traffic to workplaces. Finally, the case count is the change in cases on the date 5 days after the entry date (the count may be negative to indicate a decrease in cases). The reason for the lag is to account for the median incubation period for COVID-19.
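The join described above can be sketched as follows. This is an illustrative reconstruction, not the actual download code: the frames, column names (`policy`, `mobility`, `new_cases`), and toy values are hypothetical stand-ins for the data retrieved through the C3.ai COVID-19 API, and the 5-day shift is applied to a single country for brevity (the full dataset would shift within each country group).

```python
import pandas as pd

# Hypothetical per-source frames standing in for the three data sources.
policy = pd.DataFrame({'country': ['A', 'A'],
                       'date': ['2020-02-15', '2020-02-16'],
                       'work_closure': [0, 1]})
mobility = pd.DataFrame({'country': ['A', 'A'],
                         'date': ['2020-02-15', '2020-02-16'],
                         'work_mobility': [-28, 4]})
cases = pd.DataFrame({'country': ['A'] * 7,
                      'date': pd.date_range('2020-02-15', periods=7).strftime('%Y-%m-%d'),
                      'new_cases': [0, 0, 1, 2, 3, 5, 8]})

# Shift case counts back 5 days so each row pairs a policy/mobility reading
# with the case change observed 5 days later (the median incubation period).
# In the full dataset this shift would be done per country group.
cases['cases'] = cases['new_cases'].shift(-5)

joined = (policy.merge(mobility, on=['country', 'date'])
                .merge(cases[['country', 'date', 'cases']], on=['country', 'date']))
```

Each row of `joined` then carries the closure indicator, mobility change, and the 5-day-lagged case count for one country-date pair.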

The joined dataset is previewed below.

For more information, see the C3.ai COVID-19 Data Lake documentation.

In [2]:
pd.read_csv('data/covid_data.csv')
Out[2]:
country date work_closure work_mobility cases
0 Afghanistan 2020-02-15 0 -28 0
1 Afghanistan 2020-02-16 0 4 0
2 Afghanistan 2020-02-17 0 5 0
3 Afghanistan 2020-02-18 0 6 0
4 Afghanistan 2020-02-19 0 5 1
... ... ... ... ... ...
56895 Zimbabwe 2020-11-25 1 8 128
56896 Zimbabwe 2020-11-26 1 6 0
56897 Zimbabwe 2020-11-27 1 10 0
56898 Zimbabwe 2020-11-28 1 17 0
56899 Zimbabwe 2020-11-29 1 18 0

56900 rows × 5 columns

Tensor Construction and Decomposition

The joined dataset was constructed with the exact columns to be used in the tensor, so all columns are chosen and their types are specified. The most important part of the tensor construction in this experiment is the choice of binnings. Neither the country nor date are binned as the spatial and temporal resolution were set to the appropriate level in the data construction. The workplace closure orders are not binned as they represent discrete, distinct events. The workplace mobility change is binned with a binsize of 10 so that similar levels of change in work traffic are considered the same. Finally, the case count is binned logarithmically in order to consider the order of magnitude of the change in cases.
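To make the two binning choices concrete, here is an illustrative reimplementation of their effect on single values. These helpers are sketches of the binning semantics, not ENSIGN internals; in particular, how ENSIGN assigns zero or negative case counts under `log10` binning is not shown here, so those are lumped into a single bin purely for illustration.

```python
import numpy as np

def bin_fixed(x, binsize=10):
    # binsize=10: mobility changes of, e.g., -28 and -21 land in the same bin
    return int(np.floor(x / binsize))

def bin_log10(count):
    # Bin label is the order of magnitude: label 1 covers 10-100 cases,
    # matching the component labels discussed later in this notebook.
    if count < 1:
        return 0  # zero/negative changes lumped together (illustrative only)
    return int(np.floor(np.log10(count)))

bin_fixed(-28)   # same bin as bin_fixed(-21)
bin_log10(57)    # same bin as any count in [10, 100)
```

Binning mobility by tens treats similar traffic changes as identical, while logarithmic binning of cases captures only the order of magnitude of the change.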

A rank-100 decomposition isolated distinct relationships in the data. A higher rank yielded neither more interesting nor cleaner components.

In [3]:
tensor = c2t.csv2tensor('data/covid_data.csv',
                        columns=['country', 'date', 'work_closure', 'work_mobility', 'cases'],
                        types=['str', 'date', 'int64', 'int64', 'int64'],
                        binning=['none', 'none', 'none', 'binsize=10', 'log10'])
decomp = cpd.cp_apr(tensor, 100)

tensor.write('covid_decomposition')
cpd.write_cp_decomp_dir('covid_decomposition', decomp)

Evaluating the decomposition quality: The CPDecomp object exposes a dictionary field, metrics, that contains information about the decomposition: running time, various quality metrics, and the number of completed iterations. One simple quality metric is the fit, a similarity score between the original tensor and the tensor reconstructed from the decomposition. While the highest possible score is 1, it is neither necessary nor desirable to achieve such a fit: tensor decomposition's ability to find patterns in data comes precisely from computing a low-rank approximation of the data. The fit achieved by this decomposition is good, and coupled with the high cosine similarity, it indicates a quality decomposition.
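A common definition of fit, which the following sketch implements, is one minus the relative Frobenius-norm error between the tensor and its reconstruction. This is an assumption for illustration: ENSIGN's exact fit formula for CP-APR may differ, but the intuition — 1.0 means perfect reconstruction, lower values mean a coarser approximation — carries over.

```python
import numpy as np

def fit_score(tensor, reconstruction):
    # 1 - relative Frobenius error: 1.0 means an exact reconstruction
    err = np.linalg.norm(tensor - reconstruction)
    return 1.0 - err / np.linalg.norm(tensor)

X = np.arange(24, dtype=float).reshape(2, 3, 4)
fit_score(X, X)        # exact reconstruction gives 1.0
fit_score(X, 0.9 * X)  # uniform 10% shrinkage gives 0.9
```

A fit around 0.5, as seen above, is typical of a useful low-rank model: the decomposition keeps the dominant patterns while discarding noise.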

In [4]:
decomp.metrics
Out[4]:
{'time': 28.12104892730713,
 'fit': 0.5148133181031846,
 'cosine_sim': 0.874503860012878,
 'norm_scaling': 0.8617321151256395,
 'coverage': 1.0,
 'cp_total_iter': 100}

Interpreting Components

We can visualize each component by plotting the scores in each mode vector involved in the outer product reconstructing that component. The labels along each mode correspond to the binned values created during tensor construction. For example, the label '1' in the case count mode corresponds to 10 to 100 cases as we chose logarithmic binning. Any tuple of scoring indices in the outer product is a tensor index involved in the pattern described by the component. Therefore, the labels of the scoring indices describe the pattern. Specifically, the country and date modes describe the spatiotemporal extent of the pattern, while the scoring indices in the remaining modes describe the adherence to workplace closures and the efficacy in terms of reducing cases. Reading these plots of components in this manner allows us to describe coherent trends in the data. For example, the pre-pandemic behavior of all countries is clustered into a single component and another component captures the second wave of cases.
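The "top scoring indices" reading described above can also be done programmatically. The helper below is a hypothetical convenience function (the name `top_labels` is ours, not ENSIGN's); it relies only on the `decomp.factors` and `decomp.labels` fields already used elsewhere in this notebook, where `factors[mode]` is a matrix whose columns are components and `labels[mode]` gives the label for each row index.

```python
import numpy as np

def top_labels(decomp, comp_id, mode, k=5):
    """Return the k highest-scoring (label, score) pairs in one mode of a component."""
    scores = decomp.factors[mode][:, comp_id]
    order = np.argsort(scores)[::-1][:k]
    return [(decomp.labels[mode][i], float(scores[i])) for i in order]
```

For example, `top_labels(decomp, 5, 1)` would list the dates most involved in component 5, giving the same information as the annotations beside each mode plot.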

As this study has a geospatial element, we can also produce a choropleth map representing the degree to which each country is involved in the patterns. The shades of the countries are determined by the scores in that mode. A custom plotting function to this end is defined below.

Custom Plotting

Plotting components on a world map according to the scores in the country mode. This function corrects country names in the label set according to what pycountry expects. The scores are scaled as a fraction of the maximum score in the component.

In [5]:
# Map pycountry names to the spellings used in the dataset's country labels.
NAME_FIXES = {
    'Taiwan, Province of China': 'Taiwan*',
    'Russian Federation': 'Russia',
    'Bolivia, Plurinational State of': 'Bolivia',
    "Lao People's Democratic Republic": 'Laos',
    'Venezuela, Bolivarian Republic of': 'Venezuela',
    'Hong Kong': 'HongKong_China',
    'Korea, Republic of': 'KoreaSouth',
    'Puerto Rico': 'PuertoRico_UnitedStates',
    'Aruba': 'Aruba_Netherlands',
    'Moldova, Republic of': 'Moldova',
    'Viet Nam': 'Vietnam',
    'Tanzania, United Republic of': 'Tanzania',
}

def plot_world(decomp, comp_id):
    countries = decomp.labels[0]
    # Scale scores as a fraction of the maximum score in the component.
    scores = decomp.factors[0][:, comp_id] / decomp.factors[0][:, comp_id].max()

    df = pd.DataFrame({'country': countries, 'score': scores})

    # Build a name -> ISO alpha-3 lookup, correcting pycountry spellings to
    # match the dataset and stripping spaces from all remaining names.
    iso = {NAME_FIXES.get(c.name, c.name.replace(' ', '')): c.alpha_3
           for c in pycountry.countries}
    df['iso_alpha'] = df['country'].map(iso).fillna('Unknown code')

    fig = px.choropleth(df, locations='iso_alpha',
                        color='score',
                        hover_name='country',
                        color_continuous_scale=px.colors.sequential.OrRd,
                        range_color=[0, 1])
    fig.update_geos(fitbounds='locations')
    fig.update_layout(margin={'r': 0, 't': 0, 'l': 0, 'b': 0})
    fig.show()

Pre-pandemic Component

The following component captures the initial state before COVID-19 spread globally. We determine this by noting the scoring indices in each mode. To make reading the graphs easier, top-scoring indices, labels, and scores appear to the right of each mode in this plot and those that follow. Also, in all modes, the scores sum to 1. Nearly every index in the country mode has a non-trivial score, so this pattern involves most countries. The scoring indices in the time mode correspond to dates in February, right before COVID-19 was declared a pandemic. Finally, the last three modes respectively show that this pattern involves no workplace closures, no significant change in work traffic, and no cases. These interpretations together indicate that this component clusters the pre-pandemic data points.

The map also shows the global extent of this pattern as most countries are colored. Note that countries in Africa are colored darker and thus are most represented by this component. This is explained by the fact that it took longer for COVID-19 to appear on that continent.

In [6]:
plot_component(decomp, 5)
Out[6]:
In [7]:
plot_world(decomp, 5)