Session 2: Parsing Data and Creating Basic Graphs

You can find a copy of the original Jupyter Notebook file here, and here’s a video recording of the session:

We know subway usage in New York City decreased dramatically during the first two weeks of the current COVID-19 outbreak. But by how much exactly? And is it the same all around the city? This tutorial will guide you through the basic steps of parsing the notoriously difficult MTA turnstile data and giving it geographic attributes. Our final product will be a Jupyter Notebook with maps and graphs exploring the magnitude and extent of this decrease, as well as its correlation with income levels across the city.

This analysis is based on previous work by Ben Wellington on his I Quant NY blog and an article on The City (map below).

The City Subway Usage Map

There are many ways of parsing this data. Our tutorial uses Python and Pandas, but you could also use, for example, an SQL database. If you would like to try it that way, please take a look at Chris Whong’s post Taming the MTA’s Unruly Turnstile Data.

You can also just download a pre-processed version of the data from this Qri repository.

1. Importing libraries

In this tutorial we will use the following libraries: Numpy, Pandas, Geopandas, Altair, and Shapely.

Pandas will be our primary library, which will allow us to load our data into dataframes, filter it and merge it with other datasets. Pandas is built on top of Numpy so we need to load that one too.

Geopandas expands Pandas' functionality to include geographic attributes and operations. And Shapely complements Geopandas and will help us transform our MTA data into points with geographic attributes.

Finally, Altair will allow us to create graphs with our data, including maps.

Specifically, we are importing all of Numpy (as np), all of Pandas (as pd), all of Geopandas (as gpd), and all of Altair (as alt), but only Point from Shapely.

import numpy as np
import pandas as pd
import altair as alt
import datetime

Google Colab doesn’t include Geopandas so we need to install it ourselves. In addition, in order to perform spatial joins we need to install two more libraries, libspatialindex-dev and rtree.

# !pip install geopandas
# !apt install libspatialindex-dev
# !pip install rtree

Now we can import Geopandas and Shapely.

import geopandas as gpd
from shapely.geometry import Point

2. Loading the base data

The data for this tutorial comes from two sources:

  • The MTA’s weekly turnstile usage files.
  • A geocoded list of turnstile units, compiled by Chris Whong, matching each unit to its station coordinates.

When working with the MTA’s turnstile data, make sure you check out their documentation. Here’s what each of the fields contains:

  • C/A = Control Area (A002)
  • UNIT = Remote Unit for a station (R051)
  • SCP = Subunit Channel Position represents a specific address for a device (02-00-00)
  • STATION = Represents the station name the device is located at
  • LINENAME = Represents all train lines that can be boarded at this station. Normally lines are represented by one character. LINENAME 456NQR represents service for the 4, 5, 6, N, Q, and R trains.
  • DIVISION = Represents the division the station originally belonged to: BMT, IRT, or IND
  • DATE = Represents the date (MM-DD-YY)
  • TIME = Represents the time (hh:mm:ss) for a scheduled audit event
  • DESC = Represents the “REGULAR” scheduled audit event (normally occurs every 4 hours):
    1. Audits may occur more than 4 hours apart due to planning or troubleshooting activities.
    2. Additionally, there may be a “RECOVR AUD” entry: This refers to a missed audit that was recovered.
  • ENTRIES = The cumulative entry register value for a device
  • EXITS = The cumulative exit register value for a device
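
The last two fields are the key to this analysis: because ENTRIES and EXITS are cumulative register values, the usage over any period is the difference between two readings. For example, if a turnstile’s ENTRIES register reads 1,000 at the last audit of one day and 1,450 at the last audit of the next, roughly 450 people entered through it in between.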

Let’s read the data directly from its location online. We load the data into a Pandas dataframe using the read_csv function, providing the location of each file (in this case its URL) and specifying the delimiter the data uses (in this case a comma).

You can read more about Pandas here, and there are simple tutorials here.

In this next line we set the ending dates for the two weeks we will compare (this needs to match the way the MTA names their files), and the specific days we will compare.

week1_endingDate = '200307'
week2_endingDate = '200321'
day1 = pd.to_datetime('03/06/2020')
day2 = pd.to_datetime('03/20/2020')
week1 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_' + week1_endingDate + '.txt', delimiter=',')
week2 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_' + week2_endingDate + '.txt', delimiter=',')

It’s always a good idea to check your data after you load it. In this case we do this by printing its head (its first 5 rows) and its tail (its last 5 rows). You could also do head(25) or tail(25) to print the first or last 25 rows of the dataframe.

week1.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 03:00:00 REGULAR 7394747 2508844
1 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 07:00:00 REGULAR 7394755 2508857
2 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 11:00:00 REGULAR 7394829 2508945
3 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 15:00:00 REGULAR 7395039 2509013
4 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 19:00:00 REGULAR 7395332 2509058
week2.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 00:00:00 REGULAR 7409147 2514552
1 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 04:00:00 REGULAR 7409154 2514554
2 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 08:00:00 REGULAR 7409160 2514587
3 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 12:00:00 REGULAR 7409214 2514672
4 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 16:00:00 REGULAR 7409371 2514718
week1.tail()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
207878 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/06/2020 04:00:00 REGULAR 5554 507
207879 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/06/2020 08:00:00 REGULAR 5554 507
207880 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/06/2020 12:00:00 REGULAR 5554 507
207881 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/06/2020 16:00:00 REGULAR 5554 507
207882 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/06/2020 20:00:00 REGULAR 5554 507
week2.tail()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
206740 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/20/2020 05:00:00 REGULAR 5554 508
206741 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/20/2020 09:00:00 REGULAR 5554 508
206742 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/20/2020 13:00:00 REGULAR 5554 508
206743 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/20/2020 17:00:00 REGULAR 5554 508
206744 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 03/20/2020 21:00:00 REGULAR 5554 508

Printing the shape of the dataframe will give us its number of rows and columns.

week1.shape
(207883, 11)
week2.shape
(206745, 11)

Now let’s read the geocoded turnstile data, but because the file doesn’t include a header row we should specify the column names. We do this by adding a third argument to the read_csv function which will contain the header names.

geocoded_turnstiles = pd.read_csv('https://raw.githubusercontent.com/chriswhong/nycturnstiles/master/geocoded.csv', delimiter=',', names=['unit','id2','stationName','division','line','lat','lon'])
geocoded_turnstiles.head()
unit id2 stationName division line lat lon
0 R470 X002 ELTINGVILLE PK Z SRT 40.544600 -74.164581
1 R544 PTH02 HARRISON 1 PTH 40.738879 -74.155533
2 R165 S102 TOMPKINSVILLE 1 SRT 40.636948 -74.074824
3 R070 S101 ST. GEORGE 1 SRT 40.643738 -74.073622
4 R070 S101A ST. GEORGE 1 SRT 40.643738 -74.073622
geocoded_turnstiles.tail()
unit id2 stationName division line lat lon
763 R549 PTH18 NEWARK BM BW 1 PTH NaN NaN
764 R549 PTH19 NEWARK C 1 PTH NaN NaN
765 R549 PTH20 NEWARK HM HE 1 PTH NaN NaN
766 R550 PTH07 CITY / BUS 1 PTH NaN NaN
767 R550 PTH16 LACKAWANNA 1 PTH NaN NaN
geocoded_turnstiles.shape
(768, 7)

3. Filtering by date and time

Once we’ve loaded the data, we need to get the number of entries and exits for each individual turnstile, for the two days we are comparing. To do this we will compare the last reading on the previous day for each date with the last reading on the actual date. We will do this for every turnstile in the dataset.

This is not a perfect method, though. The data has errors, and some measurements were taken at 11pm while others were taken at 10pm or midnight. So for some turnstiles, our comparison method will only give us the number of entries or exits that happened over 22 or 23 hours, but hopefully most of them will return the usage over 24 hours.
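
If you are curious how much the audit times actually vary, one quick optional check is to count the readings by time of day; most REGULAR audits should fall on a fixed four-hour schedule:

# How are the scheduled audit times distributed?
week1[week1['DESC'] == 'REGULAR']['TIME'].value_counts().head(10)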

First, to perform this comparison, we need to join the TIME and DATE columns and create a new column of type DateTime.

week1['DATETIME'] = pd.to_datetime(week1['DATE'] + ' ' + week1['TIME'])
week2['DATETIME'] = pd.to_datetime(week2['DATE'] + ' ' + week2['TIME'])
week1.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME
0 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 03:00:00 REGULAR 7394747 2508844 2020-02-29 03:00:00
1 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 07:00:00 REGULAR 7394755 2508857 2020-02-29 07:00:00
2 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 11:00:00 REGULAR 7394829 2508945 2020-02-29 11:00:00
3 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 15:00:00 REGULAR 7395039 2509013 2020-02-29 15:00:00
4 A002 R051 02-00-00 59 ST NQR456W BMT 02/29/2020 19:00:00 REGULAR 7395332 2509058 2020-02-29 19:00:00
week2.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME
0 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 00:00:00 REGULAR 7409147 2514552 2020-03-14 00:00:00
1 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 04:00:00 REGULAR 7409154 2514554 2020-03-14 04:00:00
2 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 08:00:00 REGULAR 7409160 2514587 2020-03-14 08:00:00
3 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 12:00:00 REGULAR 7409214 2514672 2020-03-14 12:00:00
4 A002 R051 02-00-00 59 ST NQR456W BMT 03/14/2020 16:00:00 REGULAR 7409371 2514718 2020-03-14 16:00:00

Next, we need to create two new dataframes with just the data for the previous and actual dates we are comparing.

twoDays1 = week1[(week1['DATETIME'].dt.date == (day1 - pd.Timedelta(days=1)).date()) | (week1['DATETIME'].dt.date == day1.date())]
twoDays2 = week2[(week2['DATETIME'].dt.date == (day2 - pd.Timedelta(days=1)).date()) | (week2['DATETIME'].dt.date == day2.date())]
twoDays1.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME
30 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 03:00:00 REGULAR 7399943 2510880 2020-03-05 03:00:00
31 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 07:00:00 REGULAR 7399957 2510928 2020-03-05 07:00:00
32 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 11:00:00 REGULAR 7400074 2511205 2020-03-05 11:00:00
33 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 15:00:00 REGULAR 7400256 2511283 2020-03-05 15:00:00
34 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 19:00:00 REGULAR 7400949 2511369 2020-03-05 19:00:00
twoDays2.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME
30 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 00:00:00 REGULAR 7411352 2515627 2020-03-19 00:00:00
31 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 04:00:00 REGULAR 7411358 2515630 2020-03-19 04:00:00
32 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 08:00:00 REGULAR 7411365 2515672 2020-03-19 08:00:00
33 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 12:00:00 REGULAR 7411397 2515741 2020-03-19 12:00:00
34 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 16:00:00 REGULAR 7411494 2515769 2020-03-19 16:00:00

After this, we will create a new column with a composite ID that will allow us to identify each single turnstile in the data. This composite ID will be a concatenation of the C/A, UNIT and SCP columns. Having this ID will let us loop through the dataset.

twoDays1['ID'] = twoDays1['C/A'] + ' ' + twoDays1['UNIT'] + ' ' + twoDays1['SCP']
twoDays2['ID'] = twoDays2['C/A'] + ' ' + twoDays2['UNIT'] + ' ' + twoDays2['SCP']
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
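
These warnings appear because twoDays1 and twoDays2 are slices (views) of week1 and week2, so Pandas can’t guarantee the assignment behaves the way you might expect. The assignments do work in this case, but you can avoid the warning by making explicit copies before adding the column; a minimal sketch:

# Work on explicit copies so the column assignments don't trigger
# SettingWithCopyWarning on a slice of the original dataframe
twoDays1 = twoDays1.copy()
twoDays2 = twoDays2.copy()
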
twoDays1.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME ID
30 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 03:00:00 REGULAR 7399943 2510880 2020-03-05 03:00:00 A002 R051 02-00-00
31 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 07:00:00 REGULAR 7399957 2510928 2020-03-05 07:00:00 A002 R051 02-00-00
32 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 11:00:00 REGULAR 7400074 2511205 2020-03-05 11:00:00 A002 R051 02-00-00
33 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 15:00:00 REGULAR 7400256 2511283 2020-03-05 15:00:00 A002 R051 02-00-00
34 A002 R051 02-00-00 59 ST NQR456W BMT 03/05/2020 19:00:00 REGULAR 7400949 2511369 2020-03-05 19:00:00 A002 R051 02-00-00
twoDays2.head()
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS DATETIME ID
30 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 00:00:00 REGULAR 7411352 2515627 2020-03-19 00:00:00 A002 R051 02-00-00
31 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 04:00:00 REGULAR 7411358 2515630 2020-03-19 04:00:00 A002 R051 02-00-00
32 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 08:00:00 REGULAR 7411365 2515672 2020-03-19 08:00:00 A002 R051 02-00-00
33 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 12:00:00 REGULAR 7411397 2515741 2020-03-19 12:00:00 A002 R051 02-00-00
34 A002 R051 02-00-00 59 ST NQR456W BMT 03/19/2020 16:00:00 REGULAR 7411494 2515769 2020-03-19 16:00:00 A002 R051 02-00-00

Now, using the new ID column we can create a list of unique IDs and use it to loop through the dataset.

idList1 = twoDays1['ID'].unique()
idList2 = twoDays2['ID'].unique()
idList1
array(['A002 R051 02-00-00', 'A002 R051 02-00-01', 'A002 R051 02-03-00',
       ..., 'TRAM2 R469 00-03-01', 'TRAM2 R469 00-05-00',
       'TRAM2 R469 00-05-01'], dtype=object)
idList2
array(['A002 R051 02-00-00', 'A002 R051 02-00-01', 'A002 R051 02-03-00',
       ..., 'TRAM2 R469 00-03-01', 'TRAM2 R469 00-05-00',
       'TRAM2 R469 00-05-01'], dtype=object)

We will use loops now to iterate over the list of individual turnstiles we created above and produce a new dataframe with the difference in entries and exits for each individual turnstile.

Specifically we will do the following for each of our main dataframes:

  • We will create a new empty dataframe with three columns: UNIT, ENTRIES_Day1, and EXITS_Day1
  • We will loop through each of the turnstile IDs in the list we created above and for each element in that list we will:
    1. Extract from the main dataframe just the rows for that turnstile
    2. From that subset get just the rows for the previous day
    3. From that sub-subset get the row with the latest DATETIME value. This means we will get the last reading for those dates
    4. Do the same for the actual dates we are comparing
    5. Calculate the difference in the turnstile counter effectively giving us the actual number of entries and exits for that (hopefully) 24 hour period.
    6. Add those values to the new dataframe we created at the beginning
  • All of this wrapped in a try / except statement, which will help us catch any errors without breaking the loop.

oneDay1 = pd.DataFrame(columns=['UNIT', 'ENTRIES_Day1', 'EXITS_Day1'])
for turnstile in idList1:
    try:
        df = twoDays1.loc[(twoDays1['ID'] == turnstile)]
        df2 = df.loc[df['DATETIME'].dt.date == (day1 - pd.Timedelta(days=1)).date()]
        df2 = df2[df2['DATETIME'] == df2['DATETIME'].max()]
        df3 = df.loc[df['DATETIME'].dt.date == day1.date()]
        df3 = df3[df3['DATETIME'] == df3['DATETIME'].max()]
        entries = df3.iloc[0]['ENTRIES'] - df2.iloc[0]['ENTRIES']
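        # Note: the raw MTA file pads the EXITS column header with trailing
        # whitespace, so it must be referenced with that exact padded name.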
        exits = df3.iloc[0]['EXITS                                                               '] - df2.iloc[0]['EXITS                                                               ']
        oneDay1 = oneDay1.append({'UNIT': df2.iloc[0]['UNIT'],'ENTRIES_Day1': entries,'EXITS_Day1':exits}, ignore_index=True)
    except IndexError:
        print('IndexError')
IndexError
IndexError
IndexError
IndexError
oneDay2 = pd.DataFrame(columns=['UNIT', 'ENTRIES_Day2', 'EXITS_Day2'])
for turnstile in idList2:
    try:
        df = twoDays2.loc[(twoDays2['ID'] == turnstile)]
        df2 = df.loc[df['DATETIME'].dt.date == (day2 - pd.Timedelta(days=1)).date()]
        df2 = df2[df2['DATETIME'] == df2['DATETIME'].max()]
        df3 = df.loc[df['DATETIME'].dt.date == day2.date()]
        df3 = df3[df3['DATETIME'] == df3['DATETIME'].max()]
        entries = df3.iloc[0]['ENTRIES'] - df2.iloc[0]['ENTRIES']
        exits = df3.iloc[0]['EXITS                                                               '] - df2.iloc[0]['EXITS                                                               ']
        oneDay2 = oneDay2.append({'UNIT': df2.iloc[0]['UNIT'],'ENTRIES_Day2': entries,'EXITS_Day2':exits}, ignore_index=True)
    except IndexError:
        print('IndexError')
IndexError
IndexError
IndexError
IndexError

These loops might have returned some errors. These errors occur because some turnstiles might not have measurements for one of the days we are looking at. The try / except syntax allows us to run the loop, catch the errors, and continue running the loop; otherwise, Python would stop at the first error.
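
If you prefer to avoid the explicit loops, the same differences can be computed in a vectorized way. Here is a minimal sketch for the week-1 entries (an alternative, not part of this tutorial’s method), using the twoDays1 and day1 variables defined above; turnstiles missing a reading on either day simply become NaN and are dropped, instead of raising an IndexError:

# Keep each turnstile's last reading per calendar day, put the two
# days side by side, and subtract the register values
tmp = twoDays1.copy()
tmp['DAY'] = tmp['DATETIME'].dt.date
last_readings = tmp.sort_values('DATETIME').groupby(['ID', 'DAY'], as_index=False).last()
entries = last_readings.pivot(index='ID', columns='DAY', values='ENTRIES')
entries_day1 = (entries[day1.date()] - entries[(day1 - pd.Timedelta(days=1)).date()]).dropna()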

4. Cleaning up the data

Now that we have the individual turnstile data, we will create a couple of quick graphs to take a look at its distribution, and given that distribution we will exclude certain values which appear to be errors.

Here we use Altair to create horizontal boxplots of entries and exits.

chart1 = alt.Chart(oneDay1).mark_boxplot().encode(x='ENTRIES_Day1:Q').properties(
    width=1000,
    title='Turnstile Entries ' + str(day1.date()))
chart2 = alt.Chart(oneDay1).mark_boxplot().encode(x='EXITS_Day1:Q').properties(
    width=1000,
    title='Turnstile Exits ' + str(day1.date()))
alt.vconcat(chart1, chart2)

We see some negative values in the data. We remove them, along with some implausibly large values that appear to be errors, with the following line.

oneDay1 = oneDay1[(oneDay1['ENTRIES_Day1'] > 0) & (oneDay1['EXITS_Day1'] > 0) & (oneDay1['ENTRIES_Day1'] < 10000) & (oneDay1['EXITS_Day1'] < 10000)]

Now let’s graph the data again to make sure it looks good.

chart1 = alt.Chart(oneDay1).mark_boxplot().encode(x='ENTRIES_Day1:Q').properties(
    width=1000,
    title='Turnstile Entries ' + str(day1.date()))
chart2 = alt.Chart(oneDay1).mark_boxplot().encode(x='EXITS_Day1:Q').properties(
    width=1000,
    title='Turnstile Exits ' + str(day1.date()))
alt.vconcat(chart1, chart2)

Let’s do the same thing for the data for the second day.

chart1 = alt.Chart(oneDay2).mark_boxplot().encode(x='ENTRIES_Day2:Q').properties(
    width=1000,
    title='Turnstile Entries ' + str(day2.date()))
chart2 = alt.Chart(oneDay2).mark_boxplot().encode(x='EXITS_Day2:Q').properties(
    width=1000,
    title='Turnstile Exits ' + str(day2.date()))
alt.vconcat(chart1, chart2)

We can get rid of any negative values or outliers for the second day with the following line.

oneDay2 = oneDay2[(oneDay2['ENTRIES_Day2'] > 0) & (oneDay2['EXITS_Day2'] > 0) & (oneDay2['ENTRIES_Day2'] < 10000) & (oneDay2['EXITS_Day2'] < 10000)]

Again, let’s check that the data for the second day looks good.

chart1 = alt.Chart(oneDay2).mark_boxplot().encode(x='ENTRIES_Day2:Q').properties(
    width=1000,
    title='Turnstile Entries ' + str(day2.date()))
chart2 = alt.Chart(oneDay2).mark_boxplot().encode(x='EXITS_Day2:Q').properties(
    width=1000,
    title='Turnstile Exits ' + str(day2.date()))
alt.vconcat(chart1, chart2)

Now, again, let’s print the head and tail of the two dataframes to make sure they are alright.

oneDay1.head()
UNIT ENTRIES_Day1 EXITS_Day1
0 R051 1245 508
1 R051 1184 269
2 R051 706 1843
3 R051 746 1875
4 R051 757 1370
oneDay2.head()
UNIT ENTRIES_Day2 EXITS_Day2
0 R051 291 165
1 R051 235 91
2 R051 100 446
3 R051 326 530
4 R051 268 374
oneDay1.tail()
UNIT ENTRIES_Day1 EXITS_Day1
4905 R468 437 9
4907 R469 1087 14
4908 R469 1124 24
4909 R469 197 13
4910 R469 177 18
oneDay2.tail()
UNIT ENTRIES_Day2 EXITS_Day2
4907 R468 34 3
4909 R469 294 12
4910 R469 353 17
4911 R469 36 8
4912 R469 35 12

5. Aggregating by station

Now we can aggregate the data by station using the UNIT column. Here we will use Pandas’ groupby function, specifying that we are aggregating the values using a sum.

day1Stations = oneDay1.groupby('UNIT').agg({'ENTRIES_Day1':'sum', 'EXITS_Day1':'sum'})
day2Stations = oneDay2.groupby('UNIT').agg({'ENTRIES_Day2':'sum', 'EXITS_Day2':'sum'})
day1Stations.head()
ENTRIES_Day1 EXITS_Day1
UNIT
R001 27053 22042
R003 1548 985
R004 3950 2813
R005 3796 1946
R006 4929 2912
day2Stations.head()
ENTRIES_Day2 EXITS_Day2
UNIT
R001 5708 5310
R003 630 394
R004 1460 997
R005 1625 784
R006 1910 1191

You will notice that the dataframes appear to have gained a second header row, and now the UNIT column is working as the index. Before we move on we should reset the index so we get only one header row, the UNIT value as a column, not an index, and a regular index. We do this using the reset_index function, and we specify inplace=True so that we don’t create a new dataframe but overwrite our current one.

day1Stations.reset_index(inplace=True)
day2Stations.reset_index(inplace=True)
day1Stations.head()
UNIT ENTRIES_Day1 EXITS_Day1
0 R001 27053 22042
1 R003 1548 985
2 R004 3950 2813
3 R005 3796 1946
4 R006 4929 2912
day2Stations.head()
UNIT ENTRIES_Day2 EXITS_Day2
0 R001 5708 5310
1 R003 630 394
2 R004 1460 997
3 R005 1625 784
4 R006 1910 1191

6. Merging the datasets

In this section we will merge both dataframes and calculate the difference in entries and exits between the two dates. We will use the merge function with an outer method, which keeps the keys from both dataframes: if one of them doesn’t have a matching key, that row is still kept in the merged dataframe. We also specify the UNIT column as the one to use to perform the merge.

This link has some illustrative diagrams on the different types of merges you can perform with Pandas.
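
As a quick illustration with toy data (not part of our dataset), here is how an outer merge keeps unmatched keys from both sides, filling the missing values with NaN:

left = pd.DataFrame({'UNIT': ['R001', 'R003'], 'a': [1, 2]})
right = pd.DataFrame({'UNIT': ['R003', 'R004'], 'b': [3, 4]})
left.merge(right, how='outer', on='UNIT')
#    UNIT    a    b
# 0  R001  1.0  NaN
# 1  R003  2.0  3.0
# 2  R004  NaN  4.0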

daysDiff = day1Stations.merge(day2Stations, how='outer', on='UNIT')
daysDiff.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2
0 R001 27053 22042 5708 5310
1 R003 1548 985 630 394
2 R004 3950 2813 1460 997
3 R005 3796 1946 1625 784
4 R006 4929 2912 1910 1191

Now we can calculate the relative difference between the entries and exits for the two dates. Note the result is a fraction, not a percentage: a value of -0.79, for example, means a 79% decrease.

daysDiff['ENTRIES_DIFF'] = (daysDiff['ENTRIES_Day2'] - daysDiff['ENTRIES_Day1']) / daysDiff['ENTRIES_Day1']
daysDiff['EXITS_DIFF'] = (daysDiff['EXITS_Day2'] - daysDiff['EXITS_Day1']) / daysDiff['EXITS_Day1']
daysDiff.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF
0 R001 27053 22042 5708 5310 -0.789007 -0.759096
1 R003 1548 985 630 394 -0.593023 -0.600000
2 R004 3950 2813 1460 997 -0.630380 -0.645574
3 R005 3796 1946 1625 784 -0.571918 -0.597122
4 R006 4929 2912 1910 1191 -0.612497 -0.591003

The describe function will give you some basic statistics on a specific column (or all) of the dataset.

daysDiff['ENTRIES_DIFF'].describe()
count    467.000000
mean      -0.695458
std        0.119255
min       -0.928777
25%       -0.797668
50%       -0.689748
75%       -0.598546
max       -0.446575
Name: ENTRIES_DIFF, dtype: float64
daysDiff['EXITS_DIFF'].describe()
count    467.000000
mean      -0.647893
std        0.142439
min       -0.968966
25%       -0.765893
50%       -0.647394
75%       -0.541357
max       -0.131313
Name: EXITS_DIFF, dtype: float64
daysDiff.shape
(467, 7)

Now let’s get the rows containing the maximum and minimum values for our calculated differences. It’s always good practice to see the edge cases and verify that they make sense.

daysDiff[daysDiff['ENTRIES_DIFF'] == daysDiff['ENTRIES_DIFF'].min()]
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF
446 R464 1390 191 99 42 -0.928777 -0.780105
daysDiff[daysDiff['ENTRIES_DIFF'] == daysDiff['ENTRIES_DIFF'].max()]
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF
419 R433 2190 2484 1212 1436 -0.446575 -0.4219
daysDiff[daysDiff['EXITS_DIFF'] == daysDiff['EXITS_DIFF'].min()]
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF
460 R549 24967 26519 6400 823 -0.743662 -0.968966
daysDiff[daysDiff['EXITS_DIFF'] == daysDiff['EXITS_DIFF'].max()]
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF
280 R292 3901 594 1594 516 -0.591387 -0.131313
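
If you prefer a more concise lookup, loc combined with idxmin / idxmax pulls the same rows (the double brackets keep the result a dataframe):

daysDiff.loc[[daysDiff['ENTRIES_DIFF'].idxmin()]]
daysDiff.loc[[daysDiff['EXITS_DIFF'].idxmax()]]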

Now let’s make a couple of histogram plots so we can see the distribution of our data.

entriesChart = alt.Chart(daysDiff).mark_bar().encode(
    alt.X("ENTRIES_DIFF:Q", bin=True),
    y='count()'
    ).properties(
    title='Difference in Entries ' + str(day1.date()) + ' vs ' + str(day2.date()))
exitsChart = alt.Chart(daysDiff).mark_bar().encode(
    alt.X('EXITS_DIFF:Q', bin=True), y='count()'
    ).properties(
    title='Difference in Exits ' + str(day1.date()) + ' vs ' + str(day2.date()))

alt.hconcat(entriesChart, exitsChart)

7. Merging the turnstile data with the geocoded dataset

In order to get the turnstile data on a map we need to give it geographic information. To do this we will merge our turnstile dataset with the geocoded turnstile data we downloaded from Chris Whong. As we did above we will use Pandas’ merge function, but this time we will do it with the left method, which means that we will only keep the records that exist in our turnstile dataset. Whatever exists in the geocoded dataset but is not in ours will not be merged.

Before that, though, we need to get rid of the duplicates in the geocoded turnstile dataset.

geocoded_turnstiles = geocoded_turnstiles.drop_duplicates(['unit', 'lat', 'lon'])
tdDiff_geo = daysDiff.merge(geocoded_turnstiles, how='left', left_on='UNIT', right_on='unit')
tdDiff_geo.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF unit id2 stationName division line lat lon
0 R001 27053 22042 5708 5310 -0.789007 -0.759096 R001 A060 WHITEHALL ST R1 BMT 40.703082 -74.012983
1 R003 1548 985 630 394 -0.593023 -0.600000 R003 J025 CYPRESS HILLS J BMT 40.689945 -73.872564
2 R004 3950 2813 1460 997 -0.630380 -0.645574 R004 J028 ELDERTS LANE JZ BMT 40.691320 -73.867135
3 R005 3796 1946 1625 784 -0.571918 -0.597122 R005 J030 FOREST PARKWAY J BMT 40.692304 -73.860151
4 R006 4929 2912 1910 1191 -0.612497 -0.591003 R006 J031 WOODHAVEN BLVD JZ BMT 40.693866 -73.851568
tdDiff_geo.shape
(470, 14)
tdDiff_geo.tail()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF unit id2 stationName division line lat lon
465 R551 19462 19990 2282 2957 -0.882746 -0.852076 R551 PTH04 GROVE STREET 1 PTH 40.719876 -74.042616
466 R552 23480 13797 6533 4939 -0.721763 -0.642024 R552 PTH03 JOURNAL SQUARE 1 PTH 40.732102 -74.063915
467 R570 30474 27474 6944 6600 -0.772134 -0.759773 NaN NaN NaN NaN NaN NaN NaN
468 R571 25031 17372 4012 3583 -0.839719 -0.793749 NaN NaN NaN NaN NaN NaN NaN
469 R572 19028 13987 4381 3671 -0.769760 -0.737542 NaN NaN NaN NaN NaN NaN NaN

As you can see from the tail of our new merged dataset, there are some units which do not have geographic information. We will drop those rows.

tdDiff_geo.dropna(inplace=True)
tdDiff_geo.shape
(460, 14)

8. Creating a basic map of change in subway usage

Now we will finally create some maps showing the amount of change in subway usage. To do this we first need to create a list of point objects (using the Shapely library), which we will then use to transform our turnstile dataframe into a geodataframe with geographic information.

We will go through the following steps:

  1. Create a list of point objects using the lat and lon columns in our data.
  2. Specify the coordinate reference system (CRS) our data uses.
  3. Use the list of points and the coordinate reference system to transform our dataframe into a geodataframe.
  4. Reproject our geodataframe to the appropriate coordinate reference system for New York City.

For reference, here is Geopandas’ official documentation page.

First, let’s create the list of point objects.

geometry = [Point(xy) for xy in zip(tdDiff_geo.lon, tdDiff_geo.lat)]
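
As an aside, newer versions of Geopandas (0.5 and up) include a points_from_xy helper that builds the same geometry in a single call:

# Equivalent one-liner; note that x (longitude) comes first
geometry = gpd.points_from_xy(tdDiff_geo.lon, tdDiff_geo.lat)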

Next, we need to create the geodataframe (using Geopandas), making sure the coordinate reference system (CRS) is the appropriate one for the type of data we have. EPSG:4326 corresponds to the World Geodetic System (WGS) 1984 CRS, which works with latitude and longitude (decimal degrees).

geocodedTurnstileData = gpd.GeoDataFrame(tdDiff_geo, crs='epsg:4326', geometry=geometry)
geocodedTurnstileData.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF unit id2 stationName division line lat lon geometry
0 R001 27053 22042 5708 5310 -0.789007 -0.759096 R001 A060 WHITEHALL ST R1 BMT 40.703082 -74.012983 POINT (-74.01298 40.70308)
1 R003 1548 985 630 394 -0.593023 -0.600000 R003 J025 CYPRESS HILLS J BMT 40.689945 -73.872564 POINT (-73.87256 40.68995)
2 R004 3950 2813 1460 997 -0.630380 -0.645574 R004 J028 ELDERTS LANE JZ BMT 40.691320 -73.867135 POINT (-73.86714 40.69132)
3 R005 3796 1946 1625 784 -0.571918 -0.597122 R005 J030 FOREST PARKWAY J BMT 40.692304 -73.860151 POINT (-73.86015 40.69230)
4 R006 4929 2912 1910 1191 -0.612497 -0.591003 R006 J031 WOODHAVEN BLVD JZ BMT 40.693866 -73.851568 POINT (-73.85157 40.69387)

Now we should convert this dataset to the appropriate CRS for a map of New York City. We will use EPSG:2263, which corresponds to the North American Datum (NAD) 1983 / New York Long Island (ftUS).

geocodedTurnstileData = geocodedTurnstileData.to_crs('epsg:2263')

Next, let’s just plot the points to make sure they look alright.

points = alt.Chart(geocodedTurnstileData).mark_circle(size=5).encode(
    longitude='lon:Q',
    latitude='lat:Q')

points

The next thing to do is to create a new column with a value for the marker size. We will scale the absolute value of the ENTRIES_DIFF field by its maximum, mapping the values to a range between 0 and 50. This will give us our circle sizes.

maxEntriesValue = max(abs(geocodedTurnstileData['ENTRIES_DIFF'].max()),abs(geocodedTurnstileData['ENTRIES_DIFF'].min()))
print(maxEntriesValue)
0.9287769784172661
geocodedTurnstileData['Size'] = (abs(geocodedTurnstileData['ENTRIES_DIFF']) / maxEntriesValue) * 50
geocodedTurnstileData.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF unit id2 stationName division line lat lon geometry Size
0 R001 27053 22042 5708 5310 -0.789007 -0.759096 R001 A060 WHITEHALL ST R1 BMT 40.703082 -74.012983 POINT (980650.229 195428.470) 42.475577
1 R003 1548 985 630 394 -0.593023 -0.600000 R003 J025 CYPRESS HILLS J BMT 40.689945 -73.872564 POINT (1019590.877 190667.713) 31.924955
2 R004 3950 2813 1460 997 -0.630380 -0.645574 R004 J028 ELDERTS LANE JZ BMT 40.691320 -73.867135 POINT (1021095.700 191170.902) 33.936013
3 R005 3796 1946 1625 784 -0.571918 -0.597122 R005 J030 FOREST PARKWAY J BMT 40.692304 -73.860151 POINT (1023031.906 191532.416) 30.788759
4 R006 4929 2912 1910 1191 -0.612497 -0.591003 R006 J031 WOODHAVEN BLVD JZ BMT 40.693866 -73.851568 POINT (1025411.115 192105.415) 32.973334
geocodedTurnstileData['Size'].min()
24.041042836072705
geocodedTurnstileData['Size'].max()
50.0

Now let’s map the stations with the new Size value. In addition, we will also color the dots based on the ENTRIES_DIFF column, using a plasma color scheme.

points = alt.Chart(geocodedTurnstileData).mark_circle(opacity=1).encode(
    longitude = 'lon:Q',
    latitude = 'lat:Q',
    size = alt.Size('Size:Q'),
    color = alt.Color('ENTRIES_DIFF:Q', scale = alt.Scale(scheme = 'plasma', domain=[maxEntriesValue * -1, -0.4])),
    ).properties(
    width=600, height=600,
    title='Change in turnstile entries ' + str(day1.date()) + ' vs ' + str(day2.date())).configure_view(
    strokeWidth=0)

points

9. Adding income data

From this map we can already see that the stations with the greatest decrease were the ones nearest Manhattan and around Midtown and Downtown, while the ones with the smallest decrease were in the Bronx, outer Queens and outer Brooklyn. But the variation in decrease could also be related to the difference in income levels between those parts of the city. To explore this relationship, let’s bring in a median household income dataset and create some more maps and graphs.

  • Median household income comes from the U.S. Census Bureau (table B19013) American Community Survey 5-year Estimates, 2018 for all Census Tracts in New York City.

  • New York State census tracts shapefile comes from the U.S. Census Bureau Tiger/line shapefiles.

Let’s first read the income data file. Note that we need to skip the second row (row 1), which contains human-readable header descriptions. Column B19013_001E contains the median household income and column B19013_001M the margin of error.

incomeData = pd.read_csv('https://brown-institute-assets.s3.amazonaws.com/Objects/VirtualProgramming/IncomeData.csv', delimiter=',', skiprows=[1])
incomeData.head()
GEO_ID NAME B19013_001E B19013_001M
0 1400000US36005042902 Census Tract 429.02, Bronx County, New York 31678 11851
1 1400000US36005033000 Census Tract 330, Bronx County, New York 38023 6839
2 1400000US36005035800 Census Tract 358, Bronx County, New York 80979 17275
3 1400000US36005037100 Census Tract 371, Bronx County, New York 27925 8082
4 1400000US36005038500 Census Tract 385, Bronx County, New York 20466 4477

Now let’s read the census tract shapefiles.

Note that here we will read it directly into a Geopandas geodataframe and that we read the whole zip file. Shapefiles are a specific file format developed by Esri a while back which has become the standard format for geographic data. One particularity of shapefiles is that they are actually made up of several (between 3 and 12) individual files with the same name but different extensions (you can find a detailed explanation of what each individual file is here), and that’s why we need to read them as a single zip file.

censusTracts = gpd.read_file('ftp://ftp2.census.gov/geo/tiger/TIGER2018/TRACT/tl_2018_36_tract.zip')
censusTracts.head()
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 36 007 012702 36007012702 127.02 Census Tract 127.02 G5020 S 65411230 228281 +42.0350532 -075.9055509 POLYGON ((-75.95917 42.00852, -75.95914 42.008...
1 36 007 012800 36007012800 128 Census Tract 128 G5020 S 12342855 259439 +42.1305335 -075.9181575 POLYGON ((-75.94766 42.13727, -75.94453 42.137...
2 36 007 012900 36007012900 129 Census Tract 129 G5020 S 14480253 63649 +42.1522758 -075.9766029 POLYGON ((-76.02229 42.15699, -76.01917 42.157...
3 36 007 013000 36007013000 130 Census Tract 130 G5020 S 9934434 381729 +42.1236499 -076.0002197 POLYGON ((-76.02417 42.12590, -76.02409 42.125...
4 36 007 013202 36007013202 132.02 Census Tract 132.02 G5020 S 2451349 3681 +42.1238924 -076.0311921 POLYGON ((-76.03996 42.13284, -76.03862 42.132...
censusTracts.shape
(4918, 13)

Now, since we actually imported all the census tracts for New York State, let’s filter them down to just the census tracts for New York City, based on their county ID (the COUNTYFP column). We use the following codes:

  • Bronx: 005
  • Kings (Brooklyn): 047
  • New York (Manhattan): 061
  • Queens: 081
  • Richmond (Staten Island): 085

censusTracts = censusTracts[(censusTracts['COUNTYFP'] == '061') | (censusTracts['COUNTYFP'] == '081') | (censusTracts['COUNTYFP'] == '085') | (censusTracts['COUNTYFP'] == '047') | (censusTracts['COUNTYFP'] == '005')]
censusTracts.shape
(2167, 13)

Notice that both the income data and the census tract dataframes have a geographic ID column (GEOID in the case of the census tracts and GEO_ID in the case of the income data). However, the two columns are not identical: the census tract IDs contain 11 characters while the income data IDs contain 20. If we take only the characters after the 'US' in the income data, we get a value equivalent to the census tract GEOID. So we need to create a new column with the last 11 characters of the GEO_ID column from the income data.

incomeData['GEOID'] = incomeData['GEO_ID'].str[-11:]
incomeData.head()
GEO_ID NAME B19013_001E B19013_001M GEOID
0 1400000US36005042902 Census Tract 429.02, Bronx County, New York 31678 11851 36005042902
1 1400000US36005033000 Census Tract 330, Bronx County, New York 38023 6839 36005033000
2 1400000US36005035800 Census Tract 358, Bronx County, New York 80979 17275 36005035800
3 1400000US36005037100 Census Tract 371, Bronx County, New York 27925 8082 36005037100
4 1400000US36005038500 Census Tract 385, Bronx County, New York 20466 4477 36005038500

Now let’s merge the income data with the census tracts based on their GEOID columns.

nyc_income = censusTracts.merge(incomeData, how='left', left_on="GEOID", right_on="GEOID")
nyc_income.head()
STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry GEO_ID NAME_y B19013_001E B19013_001M
0 36 081 046200 36081046200 462 Census Tract 462 G5020 S 249611 0 +40.7098547 -073.7879749 POLYGON ((-73.79203 40.71107, -73.79101 40.711... 1400000US36081046200 Census Tract 462, Queens County, New York 53295 9902
1 36 081 045000 36081045000 450 Census Tract 450 G5020 S 172164 0 +40.7141840 -073.8047700 POLYGON ((-73.80782 40.71591, -73.80767 40.715... 1400000US36081045000 Census Tract 450, Queens County, New York 90568 14835
2 36 081 045400 36081045400 454 Census Tract 454 G5020 S 230996 0 +40.7126504 -073.7960120 POLYGON ((-73.79870 40.71066, -73.79792 40.710... 1400000US36081045400 Census Tract 454, Queens County, New York 45958 10460
3 36 081 045500 36081045500 455 Census Tract 455 G5020 S 148244 0 +40.7363111 -073.8623266 POLYGON ((-73.86583 40.73664, -73.86546 40.736... 1400000US36081045500 Census Tract 455, Queens County, New York 48620 10239
4 36 081 045600 36081045600 456 Census Tract 456 G5020 S 162232 0 +40.7166298 -073.7939898 POLYGON ((-73.79805 40.71743, -73.79697 40.717... 1400000US36081045600 Census Tract 456, Queens County, New York 76563 20682
nyc_income.shape
(2167, 17)

There are a couple of issues we need to fix before we can map this new dataset:

  • First, there are some census tracts that don’t have income information. These are, for example, parks or purely industrial census tracts. Income here is noted with a -.
  • Second, census tracts where the median household income is more than $250,000 have a value of 250,000+ which Pandas will understand as a string (text) and won’t know how to symbolize in the map.

To solve the first problem we need to drop the rows with no income information. And to solve the second one we will just replace 250,000+ with 250000. This is obviously not completely accurate, but it shouldn’t cause any major distortions in the data.

nyc_income = nyc_income[nyc_income['B19013_001E'] != '-']
nyc_income['B19013_001E'] = nyc_income['B19013_001E'].replace('250,000+', '250000')
nyc_income.shape
(2101, 17)

Next we will create a new column called MHHI with the same values as the B19013_001E column but of type integer (currently they are treated as strings).

nyc_income['MHHI'] = nyc_income['B19013_001E'].astype('int')
nyc_income.head()
STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry GEO_ID NAME_y B19013_001E B19013_001M MHHI
0 36 081 046200 36081046200 462 Census Tract 462 G5020 S 249611 0 +40.7098547 -073.7879749 POLYGON ((-73.79203 40.71107, -73.79101 40.711... 1400000US36081046200 Census Tract 462, Queens County, New York 53295 9902 53295
1 36 081 045000 36081045000 450 Census Tract 450 G5020 S 172164 0 +40.7141840 -073.8047700 POLYGON ((-73.80782 40.71591, -73.80767 40.715... 1400000US36081045000 Census Tract 450, Queens County, New York 90568 14835 90568
2 36 081 045400 36081045400 454 Census Tract 454 G5020 S 230996 0 +40.7126504 -073.7960120 POLYGON ((-73.79870 40.71066, -73.79792 40.710... 1400000US36081045400 Census Tract 454, Queens County, New York 45958 10460 45958
3 36 081 045500 36081045500 455 Census Tract 455 G5020 S 148244 0 +40.7363111 -073.8623266 POLYGON ((-73.86583 40.73664, -73.86546 40.736... 1400000US36081045500 Census Tract 455, Queens County, New York 48620 10239 48620
4 36 081 045600 36081045600 456 Census Tract 456 G5020 S 162232 0 +40.7166298 -073.7939898 POLYGON ((-73.79805 40.71743, -73.79697 40.717... 1400000US36081045600 Census Tract 456, Queens County, New York 76563 20682 76563

Now we can at last map the income data.

incomeMap = alt.Chart(nyc_income).mark_geoshape().encode(
    color = alt.Color('MHHI:Q', scale = alt.Scale(scheme = 'Blues'))
    ).properties(
    width=600, height=600,
    title='Median Household Income (2018)').configure_view(
    strokeWidth=0)

incomeMap

Now we can overlay our turnstile data on top of our income map to see where the greatest and smallest change happened.

incomeMap = alt.Chart(nyc_income).mark_geoshape().encode(
    color = alt.Color('MHHI:Q', scale = alt.Scale(scheme = 'Greys'))
    ).properties(
    width=600, height=600)

points = alt.Chart(geocodedTurnstileData).mark_circle(opacity=1).encode(
    longitude = 'lon:Q',
    latitude = 'lat:Q',
    size = alt.Size('Size:Q'),
    color = alt.Color('ENTRIES_DIFF:Q', scale = alt.Scale(scheme='plasma', domain=[maxEntriesValue * -1, -0.4]))
    ).properties(
    width=600,
    height=600,
    title='Median Household Income (2018) & Change in Subway Entries (' + str(day1.date()) + ' vs ' + str(day2.date()) + ')')

alt.layer(incomeMap,points).resolve_scale(color='independent').configure_view(
    strokeWidth=0)

Finally, we can join the income data to the turnstile data. That way we will be able to directly compare the change in entries or exits versus the median household income for that station.

Here we will perform a merge based on the spatial location of both datasets. This is also called a spatial join, and it works pretty much like a traditional merge, but instead of joining records based on a common field, it joins them based on a spatial relationship: intersects, within, or contains. In our case we will use the intersects operation. Here’s the documentation for Geopandas’ spatial join function.
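
One caveat: in recent Geopandas releases the op argument has been renamed predicate, so depending on your version you may need to write gpd.sjoin(geocodedTurnstileData, nyc_income_2263, how='inner', predicate='intersects') instead.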

nyc_income_2263 = nyc_income.to_crs('epsg:2263')
stationsIncome = gpd.sjoin(geocodedTurnstileData, nyc_income_2263, how='inner', op='intersects')
stationsIncome.head()
UNIT ENTRIES_Day1 EXITS_Day1 ENTRIES_Day2 EXITS_Day2 ENTRIES_DIFF EXITS_DIFF unit id2 stationName ... FUNCSTAT ALAND AWATER INTPTLAT INTPTLON GEO_ID NAME_y B19013_001E B19013_001M MHHI
0 R001 27053 22042 5708 5310 -0.789007 -0.759096 R001 A060 WHITEHALL ST ... S 302097 473628 +40.6987242 -074.0059997 1400000US36061000900 Census Tract 9, New York County, New York 195313 28360 195313
2 R004 3950 2813 1460 997 -0.630380 -0.645574 R004 J028 ELDERTS LANE ... S 224878 0 +40.6896831 -073.8647717 1400000US36081000400 Census Tract 4, Queens County, New York 59423 10691 59423
3 R005 3796 1946 1625 784 -0.571918 -0.597122 R005 J030 FOREST PARKWAY ... S 161256 0 +40.6907493 -073.8592987 1400000US36081001000 Census Tract 10, Queens County, New York 91350 10136 91350
4 R006 4929 2912 1910 1191 -0.612497 -0.591003 R006 J031 WOODHAVEN BLVD ... S 146397 0 +40.6913525 -073.8495160 1400000US36081002000 Census Tract 20, Queens County, New York 84901 10785 84901
5 R007 2630 1366 1104 529 -0.580228 -0.612738 R007 J034 104 ST ... S 148030 0 +40.6972603 -073.8430922 1400000US36081002600 Census Tract 26, Queens County, New York 83571 19721 83571

5 rows × 34 columns

stationsIncome.shape
(441, 34)

Now we can create a couple of scatter plots to compare change in entries and exits versus median household income.

entriesChart = alt.Chart(stationsIncome).mark_circle(opacity=1,size=50,color='black').encode(
    x='ENTRIES_DIFF',
    y='MHHI').properties(title='Income vs Change in Subway Usage')

exitsChart = alt.Chart(stationsIncome).mark_circle(opacity=1,size=50,color='black').encode(
    x='EXITS_DIFF',
    y='MHHI').properties(title='Income vs Change in Subway Usage')

alt.hconcat(entriesChart,exitsChart)

And we can include a locally weighted (LOESS) regression line in these scatter plots.

entriesChart = alt.Chart(stationsIncome).mark_circle(opacity=1,size=50,color='black').encode(
    x='ENTRIES_DIFF',
    y='MHHI').properties(title='Median Household Income vs. Change in Subway Usage')

entriesRegression = entriesChart + entriesChart.transform_loess(
    'ENTRIES_DIFF',
    'MHHI'
    ).mark_line(
    color="red").properties(title='Median Household Income vs. Change in Subway Usage')

exitsChart = alt.Chart(stationsIncome).mark_circle(opacity=1,size=50,color='black').encode(
    x='EXITS_DIFF',
    y='MHHI')

exitsRegression = exitsChart + exitsChart.transform_loess(
    'EXITS_DIFF',
    'MHHI'
    ).mark_line(
    color="red")

alt.hconcat(entriesRegression, exitsRegression)