Finding "Patterns of Life" in NYC Taxi Traffic

In this notebook, we use tensor decompositions to find coherent "patterns of life" in NYC taxi trip records. We show that ENSIGN, when applied to spatiotemporal, entity-based data, can extract distinct travel, work, and leisure activities.

In [1]:
# data manipulation 
import numpy as np 
import pandas as pd 

# ENSIGN tools
from ensign.csv2tensor import csv2tensor 
from ensign.cp_decomp import cp_apr, read_cp_decomp_dir, write_cp_decomp_dir
from ensign.visualize import plot_component 

# custom plotting
import plotly.express as px 

# Needed to display visuals in Jupyter Notebook
%matplotlib inline 

Data

We explore data on taxi rides provided by the NYC Taxi & Limousine Commission (TLC). The data includes features such as pickup and dropoff times and locations, the distance of the trip, the number of passengers, the payment amount and method, and more. As we are specifically interested in "patterns of life", or the types of trips and reasons for trips, we will focus on the trip times and locations. We consider just one week of data, from June 13-19, 2016, but several years of data can be found at the TLC site.

In [2]:
pd.read_csv('data/taxi_data.csv')
Out[2]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2016-06-13 00:02:54 2016-06-13 00:23:07 1 10.30 -73.853882 40.759350 2 N -73.970634 40.793297 2 52.0 0.0 0.5 0.00 5.54 0.3 58.34
1 2 2016-06-13 00:02:54 2016-06-13 00:20:45 2 9.47 -73.874580 40.773991 1 N -73.997139 40.736641 1 27.5 0.5 0.5 6.87 5.54 0.3 41.21
2 1 2016-06-13 00:02:55 2016-06-13 00:07:19 1 1.30 -73.956161 40.771927 1 N -73.967941 40.755821 1 6.0 0.5 0.5 1.45 0.00 0.3 8.75
3 1 2016-06-13 00:02:55 2016-06-13 00:08:56 1 1.00 -73.984879 40.748096 1 N -73.991730 40.754707 2 6.0 0.5 0.5 0.00 0.00 0.3 7.30
4 2 2016-06-13 00:02:55 2016-06-13 00:03:14 1 0.04 -73.950432 40.826599 1 N -73.950233 40.826557 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2572351 2 2016-06-19 23:45:21 2016-06-19 23:52:13 1 1.40 -73.968925 40.760899 1 N -73.991943 40.770603 1 7.5 0.5 0.5 1.00 0.00 0.3 9.80
2572352 1 2016-06-19 23:46:20 2016-06-19 23:50:48 1 0.60 0.000000 0.000000 1 Y 0.000000 0.000000 2 5.0 0.5 0.5 0.00 0.00 0.3 6.30
2572353 2 2016-06-19 23:46:39 2016-06-19 23:48:56 2 0.62 -74.002419 40.750240 1 N -73.994347 40.752590 2 4.0 0.5 0.5 0.00 0.00 0.3 5.30
2572354 2 2016-06-19 23:47:13 2016-06-19 23:59:23 5 4.30 -74.004814 40.725609 1 N -73.950722 40.723656 1 15.0 0.5 0.5 3.26 0.00 0.3 19.56
2572355 2 2016-06-19 23:49:58 2016-06-19 23:56:11 1 1.47 -73.984680 40.748310 1 N -73.979019 40.762081 2 7.0 0.5 0.5 0.00 0.00 0.3 8.30

2572356 rows × 19 columns

Tensor Construction

When constructing a tensor for decomposition, the main considerations are which features to select and how to discretize them. This discretization process, known as binning, ensures that similar values in the chosen dimensions have the same tensor index and results in more coherent patterns. The relevant data here are the time of the trip and the starting and ending locations, so we select five columns: pickup time, pickup longitude, pickup latitude, dropoff longitude, and dropoff latitude. We bin the times by hour and round the coordinates to three decimal places. Moreover, we fuse each longitude with its latitude so that the indices in those modes represent specific locations. The spatial binning, together with the fusing operation, results in mode indices corresponding roughly to city blocks.

In [3]:
tensor = csv2tensor(
    filepaths='data/taxi_data.csv', 
    columns=[ 
        'tpep_pickup_datetime', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'
    ],
    types=['datetime', 'float64', 'float64', 'float64', 'float64'], 
    binning=['hour', 'round=3', 'round=3', 'round=3', 'round=3'],
    sort=['tpep_pickup_datetime'],
    fuse_columns=[['pickup_longitude', 'pickup_latitude'], ['dropoff_longitude', 'dropoff_latitude']]
)
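
To make the binning and fusing concrete, here is a minimal standard-library sketch of the same idea on three rows copied from the table preview above. It is not ENSIGN's implementation: the helper bin_trip is hypothetical, truncating the timestamp to the hour is an assumption about how the 'hour' binning behaves, and the '__' separator is taken from the fused labels used later in this notebook.

```python
from collections import Counter
from datetime import datetime

# (pickup time, pickup lon, pickup lat, dropoff lon, dropoff lat), from the preview above
rows = [
    ('2016-06-13 00:02:54', -73.853882, 40.759350, -73.970634, 40.793297),
    ('2016-06-13 00:02:55', -73.956161, 40.771927, -73.967941, 40.755821),
    ('2016-06-13 00:02:55', -73.984879, 40.748096, -73.991730, 40.754707),
]

def bin_trip(pickup_time, p_lon, p_lat, d_lon, d_lat):
    """Hypothetical helper: map one trip to (hour, pickup, dropoff) index labels."""
    # Assumption: the hour bin keeps the date and hour and drops minutes and seconds.
    parsed = datetime.strptime(pickup_time, '%Y-%m-%d %H:%M:%S')
    hour = parsed.strftime('%Y-%m-%d %H:00')
    # Round to three decimal places, then fuse longitude and latitude into one label.
    pickup = f'{p_lon:.3f}__{p_lat:.3f}'
    dropoff = f'{d_lon:.3f}__{d_lat:.3f}'
    return hour, pickup, dropoff

# Each distinct (hour, pickup, dropoff) triple is one tensor index;
# its entry is the number of trips that fall into that bin.
counts = Counter(bin_trip(*row) for row in rows)
```

Trips that share an hour and both rounded locations collapse onto the same index, which is what turns raw records into a three-mode tensor of counts.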

Tensor Decomposition

We use CP-APR to decompose the tensor because its entries are counts, which suits the Poisson model underlying CP-APR. A rank-100 decomposition extracted coherent patterns, and raising the rank did not yield further components of interest.

In [4]:
decomp = cp_apr(tensor, 100, mem_limit_gb=16)
write_cp_decomp_dir('taxi_decomposition', decomp, write_tensor=True)
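
For reference, the reasoning behind choosing CP-APR for count data can be written out. The notation below (x, m, λ, a, b, c, R) is ours, not ENSIGN's, and follows the standard formulation of CP alternating Poisson regression; ENSIGN's implementation details may differ.

```latex
% Rank-R CP model of the count tensor (modes: hour i, pickup j, dropoff k)
m_{ijk} = \sum_{r=1}^{R} \lambda_r \, a_{ir} \, b_{jr} \, c_{kr}, \qquad R = 100

% Each observed count is modeled as a Poisson draw with mean m_{ijk}
x_{ijk} \sim \mathrm{Poisson}(m_{ijk})

% CP-APR minimizes the Poisson negative log-likelihood (up to a constant)
% over nonnegative weights and factor matrices
\min_{\lambda, A, B, C \ge 0} \; \sum_{i,j,k} \left( m_{ijk} - x_{ijk} \log m_{ijk} \right)
```

Each term indexed by r is one component: a weight and one score vector per mode.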

Evaluating the decomposition quality: The CPDecomp object provides a dictionary field, metrics, that contains information on the decomposition: running time, various quality metrics, and the number of completed iterations. A perfect fit of 1 is not necessary for high-quality, interpretable components. Here, a fit on the order of 10^-1, coupled with a high cosine similarity, indicates good decomposition results. Together with a coverage close to 1, this gives us confidence that the decomposition components capture almost all of the activity in the original data.

In [5]:
decomp.metrics
Out[5]:
{'time': 441.4710237979889,
 'fit': 0.41096085146631833,
 'cosine_sim': 0.8081654036306326,
 'norm_scaling': 0.7982438044274881,
 'coverage': 0.998795747756958,
 'cp_total_iter': 100}
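
To illustrate why a modest fit can coexist with a high cosine similarity, the sketch below computes both on a toy example. It assumes the common definitions (fit as one minus the relative Frobenius error, cosine similarity as the normalized inner product); ENSIGN may compute its metrics differently, and the arrays x and m are made up for illustration.

```python
import numpy as np

def fit(x, m):
    """Assumed definition: 1 - ||x - m||_F / ||x||_F."""
    return float(1.0 - np.linalg.norm(x - m) / np.linalg.norm(x))

def cosine_sim(x, m):
    """Assumed definition: <x, m> / (||x||_F * ||m||_F)."""
    return float(np.sum(x * m) / (np.linalg.norm(x) * np.linalg.norm(m)))

# Toy "data" and a toy "reconstruction" that recovers only one of two entries.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
m = np.array([[1.0, 0.0], [0.0, 0.0]])

toy_fit = fit(x, m)         # penalizes the missing mass in absolute terms
toy_cos = cosine_sim(x, m)  # measures only how well the directions agree
```

In this toy case the fit is about 0.29 while the cosine similarity is about 0.71: the reconstruction points in largely the right direction even though it does not match the data entry for entry.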

Component Visualization and Interpretation

We can visualize each component by plotting the scores in each mode vector involved in the outer product reconstructing that component. The labels along each mode correspond to the binned values created during tensor construction. Any tuple of scoring indices in the outer product is a tensor index involved in the pattern described by the component. Therefore, the labels of the scoring indices describe the pattern. Specifically, the hour mode indicates when the trips occur, and the pickup and dropoff modes indicate where the described trips started and ended. Reading these plots of components in this manner allows us to describe coherent trends in the data.
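
The outer-product reading described above can be sketched with NumPy on a made-up component. The score vectors and labels below are hypothetical and much smaller than the real modes; they only show how a tuple of scoring indices maps back to a pattern.

```python
import numpy as np

# Hypothetical toy component: one score vector per mode (time, pickup, dropoff).
hour_scores = np.array([0.0, 0.7, 0.3])
pickup_scores = np.array([0.9, 0.1])
dropoff_scores = np.array([0.2, 0.8])

# Hypothetical labels, in the style produced by the binning step.
hour_labels = ['2016-06-17 22:00', '2016-06-18 01:00', '2016-06-18 02:00']
pickup_labels = ['-74.006__40.740', '-73.988__40.720']
dropoff_labels = ['-73.988__40.720', '-74.000__40.733']

# Rank-one tensor: entry (i, j, k) is the product of the three mode scores.
component = np.einsum('i,j,k->ijk', hour_scores, pickup_scores, dropoff_scores)

# The largest entry combines the top-scoring index of every mode.
i, j, k = np.unravel_index(np.argmax(component), component.shape)
strongest = (hour_labels[i], pickup_labels[j], dropoff_labels[k])
```

An index with a zero score in any one mode (the first hour here) contributes nothing to the component, which is why zero scores can be ignored when reading the plots.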

As latitude and longitude location data is not easily human readable, for selected components, we plot the high-scoring pickup and dropoff locations. The pickup and dropoff locations are color-coded and scaled by their score in the component.

In [6]:
# Custom plotting function for geographic visualization of this taxi decomposition
def taxi_plot(decomp, comp_id, zoom_level=12):
    # Build one table per location mode (1 = pickup, 2 = dropoff), then stack them
    frames = []
    for mode, kind in [(1, 'Pickup Location'), (2, 'Dropoff Location')]:
        # Fused labels have the form '<longitude>__<latitude>'
        coords = np.array([l.split('__') for l in decomp.labels[mode]], dtype=float)
        frames.append(pd.DataFrame({
            'score': np.asarray(decomp.factors[mode][:, comp_id], dtype=float),
            'type': kind,
            'lon': coords[:, 0],
            'lat': coords[:, 1]
        }))
    df = pd.concat(frames, ignore_index=True)

    # Keep only the locations that participate in the component
    df = df[df['score'] != 0.0]

    # Marker size encodes the score; marker color distinguishes pickups from dropoffs
    fig = px.scatter_mapbox(df, lat="lat", lon="lon", zoom=zoom_level, height=800, size="score", color="type")
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()

Nightlife

Component 27 clusters together the taxi rides associated with nightlife. By plotting the scores in the mode vectors of the component, we see that the non-zero scores in the time mode are concentrated in the Friday and Saturday night hours. Additionally, the high-scoring pickup and dropoff locations are in the Meatpacking District, Greenwich Village, and the Lower East Side, all of which are known for their bars and restaurants. The pickup and dropoff locations are plotted on the map below.

In [7]:
plot_component(decomp, 27)
Out[7]:

The datetime pattern is especially isolated here, peaking at 1-2am on Saturday and Sunday mornings. Also, we can see a few people starting the weekend early on Thursday night!
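
When the plot is hard to read, the peak hours can also be listed directly. The helper below is a hypothetical convenience function, not part of ENSIGN, shown on made-up scores and labels; the real label format of the time mode may differ.

```python
import numpy as np

def top_labels(scores, labels, n=5):
    """Hypothetical helper: the n highest-scoring (label, score) pairs, skipping zeros."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1][:n]  # indices sorted by descending score
    return [(labels[i], float(scores[i])) for i in order if scores[i] > 0]

# Toy stand-in for one column of the time-mode factor matrix.
toy_scores = [0.0, 0.05, 0.40, 0.55]
toy_labels = ['2016-06-16 23:00', '2016-06-17 23:00', '2016-06-18 01:00', '2016-06-19 01:00']

peaks = top_labels(toy_scores, toy_labels, n=3)
```

On the real decomposition this could be called as top_labels(decomp.factors[0][:, 27], decomp.labels[0]), assuming mode 0 is the time mode, which is consistent with taxi_plot reading the pickup and dropoff modes from indices 1 and 2.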

The other two modes are better interpreted with the geographic visualization below.

In [8]:
taxi_plot(decomp, 27)