Today we will make the plots that are present in this webpage that tracks the situation in New York City - New York City Coronavirus Map and Case Count

There are two main plots -

  1. Cases per capita
  2. Plot analyzing the relationship b/w average household size, income and cases per capita

They have since corrected the graph -

Old

old chart

New

new chart

#hide_output
import pandas as pd
import altair as alt
import geopandas as gpd
alt.renderers.set_embed_options(actions=False)

Fortunately NYC Department of Health and Mental Hygiene publishes their data in their GitHub Repo. It has all the data from cases to the shapefiles too.

But we will use the data from NYC Open Data which is equally good. We will use the NYC Department of Health and Mental Hygiene GitHub repo only for their COVID data per MODZCTA.

The location of the dataset on NYC Open Data portal is https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk which we will export as geojson.

import requests
resp = requests.get('https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=GeoJSON')
data = resp.json()
nyc_zip = gpd.GeoDataFrame.from_features(data['features'])
nyc_zip.head()
geometry label modzcta pop_est zcta
0 MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ... 10001, 10118 10001 23072 10001, 10119, 10199
1 MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ... 10002 10002 74993 10002
2 MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ... 10003 10003 54682 10003
3 MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ... 10004 10004 3028 10004
4 MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ... 10005 10005 8831 10005, 10271
# If you have the data locally
#modzcta = 'MODZCTA/Modified Zip Code Tabulation Areas.geojson'
#nyc_zip_shp = gpd.read_file(modzcta)

This is how it looks when colored based on population -

alt.Chart(nyc_zip).mark_geoshape().encode(
    color='pop_est:Q'
)

Now we will get the Median Household Income data from Census Reporter's wonderful website. In particular the url query for the 5 boroughs in NYC is https://censusreporter.org/data/table/?table=B19013&geo_ids=860%7C05000US36061,860%7C05000US36047,860%7C05000US36081,860%7C05000US36005,860%7C05000US36085 which I came to know thanks to this comment in NYC Health covid data repo.

Download it and export it to work with it further (I chose the Shapefile, but I suppose others would be fine too).

median_income_path = 'MODZCTA/median_income_nyc/acs2018_5yr_B19013_86000US11417/acs2018_5yr_B19013_86000US11417.shp' #push it to datasets repo and link here
median_income = gpd.read_file(median_income_path)
median_income.head()
geoid name B19013001 B19013001e geometry
0 86000US10001 10001 88526.0 8060.0 POLYGON ((-74.00828 40.75027, -74.00783 40.751...
1 86000US10002 10002 35859.0 2149.0 POLYGON ((-73.99750 40.71407, -73.99709 40.714...
2 86000US10003 10003 112131.0 13190.0 POLYGON ((-73.99937 40.73132, -73.99911 40.731...
3 86000US10004 10004 157645.0 17195.0 MULTIPOLYGON (((-73.99814 40.70152, -73.99617 ...
4 86000US10005 10005 173333.0 33390.0 POLYGON ((-74.01251 40.70677, -74.01195 40.707...

The regions in median income data that are not part of NTC's modzcta can be seen as follows(basically we are finding out if both datasets show the same regions) -

median_income[median_income['name'].isin(nyc_zip['modzcta']) == False].plot()
<AxesSubplot:>

Fun Fact - the areas marked above are also NOT SHOWN in NYT's Charts

If you want you can plot the Median Income Data too using -

alt.Chart(median_income).mark_geoshape().encode(
    color='B19013001:Q'
)

Now we will get the COVID data per MODZCTA

covid_zip_uri = 'https://raw.githubusercontent.com/nychealth/coronavirus-data/master/data-by-modzcta.csv'
covid_zip = pd.read_csv(covid_zip_uri)
covid_zip.head()
MODIFIED_ZCTA NEIGHBORHOOD_NAME BOROUGH_GROUP COVID_CASE_COUNT COVID_CASE_RATE POP_DENOMINATOR COVID_DEATH_COUNT COVID_DEATH_RATE PERCENT_POSITIVE TOTAL_COVID_TESTS
0 10001 Chelsea/NoMad/West Chelsea Manhattan 440 1867.33 23563.03 26 110.34 7.18 6124
1 10002 Chinatown/Lower East Side Manhattan 1318 1717.14 76755.41 161 209.76 8.33 15831
2 10003 East Village/Gramercy/Greenwich Village Manhattan 534 992.54 53801.62 35 65.05 3.91 13671
3 10004 Financial District Manhattan 40 1095.71 3650.61 1 27.39 5.33 751
4 10005 Financial District Manhattan 94 1119.57 8396.11 2 23.82 4.91 1914

Merge the covid data with NYC shapefile data to be able to see the number of cases per modzcta

def modify_merge(nyc_zip, covid_zip):
    covid_zip = covid_zip.loc[:, ['MODIFIED_ZCTA', 'BOROUGH_GROUP', 'COVID_CASE_COUNT', 'COVID_DEATH_COUNT', 'NEIGHBORHOOD_NAME']]
    nyc_zip['pop_est'] = nyc_zip['pop_est'].astype(int)
    nyc_zip['modzcta'] = nyc_zip['modzcta'].astype(int)
    covid_zip.rename(columns = {'MODIFIED_ZCTA': 'modzcta'}, inplace=True)
    covid_nyc_zip = nyc_zip.merge(covid_zip, how='left', on='modzcta')
    return covid_nyc_zip
covid_nyc_zip = modify_merge(nyc_zip, covid_zip)
covid_nyc_zip.head()
geometry label modzcta pop_est zcta BOROUGH_GROUP COVID_CASE_COUNT COVID_DEATH_COUNT NEIGHBORHOOD_NAME
0 MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ... 10001, 10118 10001 23072 10001, 10119, 10199 Manhattan 440.0 26.0 Chelsea/NoMad/West Chelsea
1 MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ... 10002 10002 74993 10002 Manhattan 1318.0 161.0 Chinatown/Lower East Side
2 MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ... 10003 10003 54682 10003 Manhattan 534.0 35.0 East Village/Gramercy/Greenwich Village
3 MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ... 10004 10004 3028 10004 Manhattan 40.0 1.0 Financial District
4 MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ... 10005 10005 8831 10005, 10271 Manhattan 94.0 2.0 Financial District

Customary covid cases visualization per modfied zip code tabulation area -

alt.Chart(covid_nyc_zip).mark_geoshape().encode(
    color='COVID_CASE_COUNT'
)

Now we will merge median income data with the covid cases per zip code tabulation area data we just derieved above -

def modify_merge(covid_nyc_zip, median_income):
    median_income.rename(columns={'name': 'modzcta', 'B19013001': 'median_income'}, inplace=True)
    median_income['modzcta'] = median_income['modzcta'].astype(int)
    median_income = median_income.drop(columns=['geometry', 'B19013001e'])
    covid_income_zip = covid_nyc_zip.merge(median_income, how='inner', on='modzcta')
    covid_income_zip =  covid_income_zip.assign(
        #case_per_people = covid_income_zip['COVID_CASE_COUNT']/covid_income_zip['POP_DENOMINATOR'], #their earlier calculation
        case_per_people = covid_income_zip['COVID_CASE_COUNT']/covid_income_zip['pop_est'], # current calculation
    )
    return covid_income_zip
covid_income_zip = modify_merge(covid_nyc_zip, median_income)
covid_income_zip.head()
geometry label modzcta pop_est zcta BOROUGH_GROUP COVID_CASE_COUNT COVID_DEATH_COUNT NEIGHBORHOOD_NAME geoid median_income case_per_people
0 MULTIPOLYGON (((-73.98774 40.74407, -73.98819 ... 10001, 10118 10001 23072 10001, 10119, 10199 Manhattan 440.0 26.0 Chelsea/NoMad/West Chelsea 86000US10001 88526.0 0.019071
1 MULTIPOLYGON (((-73.99750 40.71407, -73.99709 ... 10002 10002 74993 10002 Manhattan 1318.0 161.0 Chinatown/Lower East Side 86000US10002 35859.0 0.017575
2 MULTIPOLYGON (((-73.98864 40.72293, -73.98876 ... 10003 10003 54682 10003 Manhattan 534.0 35.0 East Village/Gramercy/Greenwich Village 86000US10003 112131.0 0.009766
3 MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ... 10004 10004 3028 10004 Manhattan 40.0 1.0 Financial District 86000US10004 157645.0 0.013210
4 MULTIPOLYGON (((-74.00783 40.70309, -74.00786 ... 10005 10005 8831 10005, 10271 Manhattan 94.0 2.0 Financial District 86000US10005 173333.0 0.010644

Now let's plot this data and we will find that it exactly MATCHES NYT's chart for cases per people v/s median income -

alt.Chart(covid_income_zip).mark_circle().encode(
    color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('median_income:Q'),
    y='case_per_people:Q'
)

Getting the number of people per household data

This data can be calculated from Household Type by Household Size data, again from Census Reporter - the url query would be https://censusreporter.org/data/table/?table=B11016&geo_ids=860|05000US36061,860|05000US36047,860|05000US36081,860|05000US36005,860|05000US36085

Getting data like these honestly deserves its own blog post. It took me a very long time to find the proper data. Perhaps I will write about it some other time.

household_path = 'MODZCTA/household_type_by_household_size/csvofsame/acs2018_5yr_B11016_86000US11417.csv'
household = pd.read_csv(household_path)
household.head()
geoid name Total Total, Error Family Households Family Households, Error 2-person household B11016003, Error 3-person household B11016004, Error ... 3-person nonfamily household B11016012, Error 4-person nonfamily household B11016013, Error 5-person nonfamily household B11016014, Error 6-person nonfamily household B11016015, Error 7 or more-person nonfamily household B11016016, Error
0 86000US10001 10001 12431 521 3838 413 2330 338 819 295 ... 253 97 63 55 0 22 0 22 0 22
1 86000US10002 10002 33540 614 16565 738 8090 620 4287 432 ... 621 157 84 87 0 28 0 28 0 28
2 86000US10003 10003 26124 703 7946 551 5182 472 1413 316 ... 586 212 94 70 38 38 0 28 0 28
3 86000US10004 10004 1659 238 709 183 410 160 196 106 ... 42 31 0 12 0 12 0 12 0 12
4 86000US10005 10005 4374 337 1614 316 944 260 358 182 ... 199 96 15 25 0 17 0 17 0 17

5 rows × 34 columns

def consolidated_household_per_zip(df):
    household_data = pd.DataFrame()
    household_data = household_data.assign(
        one_person = df['1-person nonfamily household'],
        two_person = df['2-person household'] + df['2-person nonfamily household'],
        three_person = df['3-person household'] + df['3-person nonfamily household'],
        four_person = df['4-person household'] + df['4-person nonfamily household'],
        five_person = df['5-person household'] + df['5-person nonfamily household'],
        six_person = df['6-person household'] + df['6-person nonfamily household'],
        seven_or_more_person = df['7 or more-person household'] + df['7 or more-person nonfamily household'],
        modzcta = df['name'],
        total = df['Total']
        #avg_people_per_household = df.apply(lambda x: (x[1]+2*x[2]+3*x[3]+4*x[4]+5*x[5]+6*x[6]+7*x[7])/(x['total']), axis=1)
    )
    return household_data

household_data = consolidated_household_per_zip(household)
household_data.head()
one_person two_person three_person four_person five_person six_person seven_or_more_person modzcta total
0 6710 3897 1072 429 307 16 0 10001 12431
1 14319 10041 4908 2796 1162 95 219 10002 33540
2 14377 8265 1999 1367 109 7 0 10003 26124
3 794 524 238 95 8 0 0 10004 1659
4 1674 1816 557 314 13 0 0 10005 4374

Now we will merge the household data too to arrive at the final data for plotting -

plot_data = covid_income_zip.merge(household_data, how='inner', on='modzcta')

Calculating Average people per household is simple enough, jsut divide the population with the number of households -

plot_data['avg_person_per_household'] = plot_data['pop_est']/plot_data['total']

Finally let's plot the data side by side - horizontal concatenatioon using | operator -

income = alt.Chart(plot_data).mark_circle().encode(
    color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('median_income:Q'),
    y='case_per_people:Q'
)

household_size = alt.Chart(plot_data).mark_circle().encode(
    #color=alt.Color('median_income:Q', sort = "descending"),
    x=alt.X('avg_person_per_household:Q'),
    y='case_per_people:Q'
)

income|household_size

Let's beautify it a little and add interactivity so that whenever you select an area in one of the graphs, the corresponding points gets highlighted in the second graph and you can drag the selection around to see that actually that the outbreak is worse in poorer areas where more people live together.

brush = alt.selection_interval(encodings=['x'])

chart = alt.Chart(plot_data).mark_circle(size=100).encode(
    color=alt.condition(brush, 'BOROUGH_GROUP:N', alt.value('lightgray'), legend=alt.Legend(title="Borough")),
    y=alt.Y('case_per_people:Q')
).add_selection(brush).properties(width=525, height=480)

alt.hconcat(
    chart.encode(
        x=alt.X('median_income:Q', axis=alt.Axis(tickCount=3, format="$d",titleOpacity=0.6, title='Higher Median Income ->')), 
        y=alt.Y('case_per_people:Q', axis=alt.Axis(format="%",titleOpacity=0.6, tickCount=4, title="Cases in % of population"))
    ),
    chart.encode(
        x=alt.X('avg_person_per_household:Q', scale=alt.Scale(zero=False), axis=alt.Axis(tickCount=3,titleOpacity=0.6, title='More People per Household ->')), 
        y=alt.Y('case_per_people:Q', axis=alt.Axis(format="%", tickCount=4, title=None))
    )  
).configure_view(strokeWidth=0).configure_axis(grid=True,gridOpacity=0.6,labelOpacity=0.5)

Way to reproduce the old incorrect graph -

Get the total population data as 1*(1 person households) + .... + 7*(7 or more person households)

Hope you noticed the error over there, but it perfectly captures the old graph as you can see in my old tweet which I had implemented that way -


Gif of the above implementation -
old map gif

Now let's make the geospatial chloropleth plot cases per capita like the second chart at the top of the post

Since we have the data in order, - we will use the covid_income_zip data beause it already has the "case_per_people" field that we have to plot - potting is as simple as -

alt.Chart(covid_income_zip).mark_geoshape().encode(
    color='case_per_people'
)

Using their color scheme -

def color_it(x):
    if x<(1/50):
        return "less than 1/50"
    elif x<(1/40):
        return "1/50 to 1/40"
    elif x<(1/30):
        return "1/40 to 1/30"
    else:
        return "more than 1/30"


covid_income_zip = covid_income_zip.assign(share_of_population=covid_income_zip['case_per_people'].apply(color_it))

alt.Chart(covid_income_zip).mark_geoshape().encode(
    color=alt.Color('share_of_population',  scale=alt.Scale(domain=['less than 1/50', '1/50 to 1/40', '1/40 to 1/30', 'more than 1/30'], range=['#f2df91', '#ffae43', '#ff6e0b', '#ce0a05']))
).properties(width=600, height=600)

How do we add borough names on top?

I love this part - time for some actual spatial analysis. To show the Borough names we will actually dissolve the areas by borough and get the centroid and finally place our text mark there

borough = covid_income_zip.dissolve(by='BOROUGH_GROUP')
borough['lon'] = borough['geometry'].centroid.x
borough['lat'] = borough['geometry'].centroid.y
borough = borough.reset_index()
borough
BOROUGH_GROUP geometry label modzcta pop_est zcta COVID_CASE_COUNT COVID_DEATH_COUNT NEIGHBORHOOD_NAME geoid median_income case_per_people share_of_population lon lat
0 Bronx MULTIPOLYGON (((-73.84290 40.82857, -73.84287 ... 10451 10451 47798 10451 1737.0 152.0 Concourse/Melrose 86000US10451 28921.0 0.036340 more than 1/30 -73.866550 40.853688
1 Brooklyn POLYGON ((-73.93190 40.59469, -73.93193 40.594... 11201 11201 62823 11201 818.0 95.0 Brooklyn Heights/DUMBO/Downtown Brooklyn 86000US11201 124996.0 0.013021 less than 1/50 -73.949561 40.645380
2 Manhattan MULTIPOLYGON (((-74.00760 40.70299, -74.00765 ... 10001, 10118 10001 23072 10001, 10119, 10199 440.0 26.0 Chelsea/NoMad/West Chelsea 86000US10001 88526.0 0.019071 less than 1/50 -73.966228 40.778413
3 Queens MULTIPOLYGON (((-73.86496 40.56663, -73.86509 ... 11004, 11005 11004 19216 11004, 11005, 11040 633.0 68.0 Bellerose/Douglaston-Little Neck 86000US11004 92657.0 0.032941 1/40 to 1/30 -73.819049 40.710385
4 Staten Island POLYGON ((-74.20999 40.51055, -74.21013 40.510... 10301 10301 38733 10301 1314.0 107.0 Silver Lake/St. George 86000US10301 55802.0 0.033925 more than 1/30 -74.153793 40.581313

Let's plot it and put the legend on top -

a = alt.Chart(covid_income_zip).mark_geoshape().encode(
    color=alt.Color(
        'share_of_population',  
        scale=alt.Scale(domain=['less than 1/50', '1/50 to 1/40', '1/40 to 1/30', 'more than 1/30'], range=['#f2df91', '#ffae43', '#ff6e0b', '#ce0a05']),
        legend=alt.Legend(title="Share of Population", orient='top', symbolType='square', columnPadding=20, symbolSize=200)
    ),
    tooltip = ['BOROUGH_GROUP', 'NEIGHBORHOOD_NAME', 'modzcta', 'COVID_CASE_COUNT', 'COVID_DEATH_COUNT']
).properties(width=600, height=600)

b = alt.Chart(borough).mark_text().encode(
    longitude='lon',
    latitude='lat',
    text='BOROUGH_GROUP:N'
)
a+b