Exploratory data analysis (EDA) is the first and most crucial step in building predictive models from data. It lets you draw conclusions by confirming or invalidating most of the assumptions you make about your data. It also helps you understand the relationships between your variables.
In this post, I will demonstrate my thought process through EDA. The dataset I will use represents the customers of a Bike Shop. Through EDA, I will try to provide insights on bike user activity and behavior. Such insights can be useful for the shop's team or any other interested party.
Let's get started, shall we?
import pandas as pd
import statsmodels.api as sms
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
DFtrips = pd.read_csv('data.csv', parse_dates=['start_date'])
The code above handles the necessary imports and reads the data into the variable DFtrips. Let's take a look at the data.
As we can see, the features are all variables that describe a biker's trip. start_date, start_station, end_date, end_station and the rest of the features help identify the user by the trip. I wanted to take a closer look at subscription_type, and it turns out there are two types: subscriber and customer.
To make proper use of the start_date column, I will extract the hour, date, day-of-week and month into their own designated columns. I'm doing this because I want to gain insight into bike trips along each of those dimensions, and having them as separate columns makes the EDA much simpler.
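A quick way to run this kind of first inspection is sketched below. The rows are a made-up stand-in for DFtrips so the snippet runs on its own; only the column names come from the dataset described above.

```python
import pandas as pd

# Toy stand-in for DFtrips (made-up rows; the real data is read from data.csv).
DFtrips = pd.DataFrame({
    "start_date": pd.to_datetime(["2013-08-29 09:00", "2013-08-29 17:30", "2013-08-31 11:15"]),
    "start_station": ["A", "B", "A"],
    "end_date": pd.to_datetime(["2013-08-29 09:20", "2013-08-29 17:50", "2013-08-31 11:45"]),
    "end_station": ["B", "A", "C"],
    "subscription_type": ["Subscriber", "Customer", "Subscriber"],
})

# First rows and column types give a feel for what each feature holds.
print(DFtrips.head())
print(DFtrips.dtypes)

# Distinct subscription types and how many trips each one accounts for.
subscription_counts = DFtrips["subscription_type"].value_counts()
print(subscription_counts)
```

value_counts() is handy here because it shows both the distinct categories and how unbalanced they are in one call.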
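Here is a minimal sketch of that extraction using the pandas .dt accessor. The 'hour' and 'date' column names are my assumption ('month' and 'dayofweek' are the names used in this post), and the three timestamps are made up for illustration.

```python
import pandas as pd

# Toy stand-in for DFtrips with made-up timestamps.
DFtrips = pd.DataFrame({
    "start_date": pd.to_datetime(["2013-08-29 09:00", "2013-08-31 11:15", "2013-09-02 17:30"]),
})

# Split the timestamp into separate columns so each can be used as a grouping key.
DFtrips["hour"] = DFtrips["start_date"].dt.hour
DFtrips["date"] = DFtrips["start_date"].dt.date
DFtrips["dayofweek"] = DFtrips["start_date"].dt.dayofweek  # pandas numbering: Monday=0 ... Sunday=6
DFtrips["month"] = DFtrips["start_date"].dt.month

print(DFtrips)
```

Note that this requires start_date to be a datetime column, which is why parse_dates was passed to read_csv.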
As you can see above, once I had separated the start_date information into its own columns, I was able to use the newly created 'month' column in a groupby.
After grouping by 'month', I used the '.count()' method to count the number of rides per month. As we can see from the graph, ride counts were lowest in August and highest in October. Another thing to notice is that there are only seven months of data. The head and tail of the data show that it runs from 8/29/13 to 2/28/14, so the reason we see such a low number in August is that there are only three days' worth of data for it.
Now let's take a look at the daily user count.
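The monthly count can be sketched as follows. The rows are made up so the snippet runs standalone, and the bar chart is just one reasonable way to draw it, not necessarily the plot used in this post.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for DFtrips (made-up rows, not the real data).
DFtrips = pd.DataFrame({"start_date": pd.to_datetime([
    "2013-08-29 09:00", "2013-08-31 11:15",
    "2013-09-02 08:05", "2013-09-15 17:40", "2013-09-20 12:00",
    "2013-10-01 08:30",
])})
DFtrips["month"] = DFtrips["start_date"].dt.month

# Number of rides per month. Grouping on the month number sorts January first,
# so data that spans a new year will not appear in chronological order.
monthly_counts = DFtrips.groupby("month")["start_date"].count()

# Checking the date range explains unusually small months (e.g. a partial August).
first_day, last_day = DFtrips["start_date"].min(), DFtrips["start_date"].max()
print(monthly_counts)
print(first_day, last_day)

fig, ax = plt.subplots()
ax.bar(monthly_counts.index.astype(str), monthly_counts.values)
ax.set_xlabel("month")
ax.set_ylabel("number of rides")
```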
I've also marked the mean and the mean +/- 1.5 * standard deviation as horizontal lines on the plot. This will help identify the outliers in the data. The code for the plot is below. You can conclude from the plot above that trip counts vary based on whether it's a weekend or a weekday.
Now that we've seen the daily counts, I will plot the distribution of the daily user counts for all months as a histogram. I will also fit a kernel density estimate (KDE) to the histogram. If you don't know what a KDE is, this link does a great job explaining it.
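As a rough sketch of both steps, the snippet below counts rides per day, draws the mean and mean +/- 1.5 standard deviation lines, and then fits a Gaussian KDE over the histogram of daily counts. The trips are synthetic (four identical made-up weeks), and variable names such as daily_counts are my own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Synthetic trips (made-up): four weeks starting Monday 2013-09-02,
# with more rides on weekdays than on weekends.
days = pd.date_range("2013-09-02", periods=28, freq="D")
rides_per_day = np.tile([50, 55, 48, 52, 60, 15, 12], 4)
DFtrips = pd.DataFrame({"start_date": days.repeat(rides_per_day)})

# Daily ride counts and the reference lines.
daily_counts = DFtrips.groupby(DFtrips["start_date"].dt.date).size()
mean_count = daily_counts.mean()
std_count = daily_counts.std()
upper = mean_count + 1.5 * std_count
lower = mean_count - 1.5 * std_count

fig, (ax_daily, ax_hist) = plt.subplots(1, 2, figsize=(12, 4))
ax_daily.plot(pd.to_datetime(daily_counts.index), daily_counts.values)
ax_daily.axhline(mean_count, linestyle="-", label="mean")
ax_daily.axhline(upper, linestyle="--", label="mean + 1.5 std")
ax_daily.axhline(lower, linestyle="--", label="mean - 1.5 std")
ax_daily.set_ylabel("rides per day")
ax_daily.legend()

# Histogram of daily counts with a Gaussian KDE drawn on top.
kde = gaussian_kde(daily_counts.to_numpy(dtype=float))
xs = np.linspace(daily_counts.min(), daily_counts.max(), 200)
density = kde(xs)
ax_hist.hist(daily_counts, bins=10, density=True, alpha=0.5)  # density=True matches the KDE's scale
ax_hist.plot(xs, density)
ax_hist.set_xlabel("rides per day")
```

Passing density=True to the histogram matters: without it the bars show raw counts and the KDE curve would be squashed against the x-axis.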
From the plot, we can see that it is a bimodal distribution. That might be because the number of users during the week differs from the number on the weekend. So let's re-plot the distribution after dividing the data into weekday and weekend rides. This means we also have to re-fit the KDEs to their respective histograms.
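Note: a value greater than 5 in the 'dayofweek' feature is a weekend. (This holds when days are numbered 1 to 7 starting on Monday; with pandas' dt.dayofweek, which starts at 0 for Monday, the weekend values are 5 and 6.)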
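A sketch of the weekday/weekend split with one KDE per group follows. One assumption to flag: I compute 'dayofweek' with pandas' dt.dayofweek, which counts from Monday=0, so the weekend test in this snippet is >= 5 rather than "greater than 5"; adjust the threshold if your column is numbered from 1. The daily counts are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Toy daily counts (made-up): four weeks starting Monday 2013-09-02.
daily = pd.DataFrame({
    "date": pd.date_range("2013-09-02", periods=28, freq="D"),
    "rides": np.tile([50, 55, 48, 52, 60, 15, 12], 4),
})
daily["dayofweek"] = daily["date"].dt.dayofweek  # Monday=0 ... Sunday=6 (assumed numbering)

# Split the daily counts into the two groups.
is_weekend = daily["dayofweek"] >= 5
weekday_counts = daily.loc[~is_weekend, "rides"].to_numpy(dtype=float)
weekend_counts = daily.loc[is_weekend, "rides"].to_numpy(dtype=float)

# Fit one KDE per group and evaluate both on a shared grid.
xs = np.linspace(daily["rides"].min(), daily["rides"].max(), 200)
weekday_density = gaussian_kde(weekday_counts)(xs)
weekend_density = gaussian_kde(weekend_counts)(xs)

fig, ax = plt.subplots()
ax.hist(weekday_counts, bins=8, density=True, alpha=0.4, label="weekday")
ax.hist(weekend_counts, bins=8, density=True, alpha=0.4, label="weekend")
ax.plot(xs, weekday_density)
ax.plot(xs, weekend_density)
ax.set_xlabel("rides per day")
ax.legend()
```

If the two fitted curves peak in clearly different places, that supports the idea that the two modes of the combined distribution come from weekday and weekend traffic.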
Now we're going to explore user activity based on hourly trends. First, I will group the bike rides by date and hour, then count the number of rides in each hour on each date. I will use a box plot for this one, as it shows the spread of ride counts within each hour most clearly.
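The grouping and box plot described above might look like the sketch below. The timestamps are made up, and the 'rides' column name is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for DFtrips with made-up timestamps.
DFtrips = pd.DataFrame({"start_date": pd.to_datetime([
    "2013-09-02 08:05", "2013-09-02 08:40", "2013-09-02 17:10",
    "2013-09-03 08:15", "2013-09-03 17:20", "2013-09-03 17:45", "2013-09-03 17:50",
])})
DFtrips["date"] = DFtrips["start_date"].dt.date
DFtrips["hour"] = DFtrips["start_date"].dt.hour

# One row per (date, hour) pair with the number of rides in that hour.
# Hours with no rides on a given date simply do not appear (they are not counted as 0).
hourly = DFtrips.groupby(["date", "hour"]).size().reset_index(name="rides")
print(hourly)

# One box per hour of the day, built from that hour's counts across all dates.
hours = sorted(hourly["hour"].unique())
data = [hourly.loc[hourly["hour"] == h, "rides"].to_numpy() for h in hours]

fig, ax = plt.subplots()
ax.boxplot(data, positions=hours)
ax.set_xlabel("hour of day")
ax.set_ylabel("rides per hour")
```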
Given the wide variation we've seen in the box plot, let's try to understand why this is the case. There are two types of bike users, Subscriber and Customer. Using this information together with the weekday/weekend categorization, let's plot and inspect the user activity trends. This could be helpful for the shop's product team if they want to run some kind of promo.
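One way to break the hourly activity down by user type and day type is sketched below. This is my own take under stated assumptions: made-up trips, a hypothetical 'day_type' column, pandas' Monday=0 day numbering, and the average rides per hour as the summary statistic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for DFtrips (made-up rows). 2013-09-07 and 2013-09-08 fall on a weekend.
DFtrips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2013-09-02 08:05", "2013-09-02 08:30", "2013-09-02 17:15", "2013-09-03 08:10",
        "2013-09-07 13:00", "2013-09-07 13:20", "2013-09-07 14:05", "2013-09-08 13:30",
        "2013-09-08 10:00",
    ]),
    "subscription_type": [
        "Subscriber", "Subscriber", "Subscriber", "Subscriber",
        "Customer", "Customer", "Customer", "Customer",
        "Subscriber",
    ],
})
DFtrips["date"] = DFtrips["start_date"].dt.date
DFtrips["hour"] = DFtrips["start_date"].dt.hour
# 'day_type' is a hypothetical helper column; dt.dayofweek uses Monday=0, so 5 and 6 are the weekend.
DFtrips["day_type"] = np.where(DFtrips["start_date"].dt.dayofweek >= 5, "weekend", "weekday")

# Rides per (user type, day type, date, hour), then averaged over dates.
counts = DFtrips.groupby(["subscription_type", "day_type", "date", "hour"]).size()
avg_hourly = counts.groupby(level=["subscription_type", "day_type", "hour"]).mean()
print(avg_hourly)

# One line per (user type, day type) combination.
fig, ax = plt.subplots()
for (sub_type, day_type), series in avg_hourly.groupby(level=["subscription_type", "day_type"]):
    ax.plot(series.index.get_level_values("hour"), series.to_numpy(),
            marker="o", label=f"{sub_type}, {day_type}")
ax.set_xlabel("hour of day")
ax.set_ylabel("average rides per hour")
ax.legend()
```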