In [1]:

```
import pandas as pd
import numpy as np
import scipy.stats as sp
# %matplotlib notebook
%matplotlib inline
import seaborn as sns; sns.set_style('dark')
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

In [2]:

```
df = pd.read_csv('turnstile_data_master_with_weather.csv')
df.index = pd.to_datetime(df.pop('DATEn') +' '+ df.pop('TIMEn'))
df.sort_index(inplace=True)
del df['Unnamed: 0']
```

In [10]:

```
df.head()
```

Out[10]:

- https://class.coursera.org/statistics-003
- https://www.udacity.com/course/intro-to-data-science--ud359
- http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables
- https://en.wikipedia.org/wiki/Coefficient_of_determination
- http://napitupulu-jon.appspot.com/posts/inference-diagnostic-mlr-coursera-statistics.html

In [6]:

```
df.groupby('rain',as_index=False).ENTRIESn_hourly.mean()
```

Out[6]:

In this data we can see summary statistics of hourly ridership, represented by the `ENTRIESn_hourly` variable, between rainy days and non-rainy days. The independent variable is `rain`: non-rainy days form the control group and rainy days the experiment group. We want to know how rainy days affect ridership, so the dependent variable is `ENTRIESn_hourly`.

We can see that mean hourly ridership on non-rainy days is 1090, while on rainy days it is 1105. The difference is small, and we're going to test whether ridership on rainy days is significantly higher, using a one-tailed p-value. I'm using 0.05 as the p-critical value.

- H0: $P_{(\text{rain} > \text{non-rain})} = 0.5$ : Population ridership on rainy days and non-rainy days is equal.
- HA: $P_{(\text{rain} > \text{non-rain})} > 0.5$ : Population ridership on rainy days is higher than on non-rainy days.

The conditions within groups have been validated: the sample size in this data is greater than 30 and less than 10% of the population.
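The test itself can be sketched as follows on synthetic, right-skewed samples; the sizes and scales here are invented for illustration, and `alternative='less'` expresses the one-tailed hypothesis with the non-rainy sample passed first:

```python
# A minimal sketch of the one-tailed Mann-Whitney U test on synthetic data;
# the samples are made up, mimicking right-skewed ridership counts.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
non_rainy = rng.exponential(scale=1090, size=500)
rainy = rng.exponential(scale=1105, size=300)

# alternative='less' tests H_A: the non-rainy distribution is shifted below rainy
u_stat, p_value = mannwhitneyu(non_rainy, rainy, alternative='less')
significant = p_value < 0.05  # compare against the p-critical value
```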


In [16]:

```
sp.mannwhitneyu(df.loc[df.rain == 0, 'ENTRIESn_hourly'],
                df.loc[df.rain == 1, 'ENTRIESn_hourly'],
                alternative='less')  # one-tailed: non-rainy shifted below rainy
```

Out[16]:

- OLS using Statsmodels or Scikit Learn
- Gradient descent using Scikit Learn
- Or something different?

In [6]:

```
length = df.shape[0]
subset = df.take(np.random.permutation(length)[:int(length*0.1)]).reset_index()
dummy_hours = pd.get_dummies(subset['Hour'], prefix='hour')
dummy_units = pd.get_dummies(subset['UNIT'], prefix='unit')
# features = subset.join(dummy_units).join(dummy_hours)
features = subset
banned = ['ENTRIESn_hourly','UNIT','Hour','DESCn','EXITSn_hourly','index']
candidates = [e for e in features.columns if e not in banned]
```

In [7]:

```
def test_adjusted_R_squared(col):
    """Testing one variable with the already approved predictors"""
    reg = sm.OLS(features['ENTRIESn_hourly'], features[predictors + [col]])
    result = reg.fit()
    return result.rsquared_adj
```
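The selection step above scores each candidate by adjusted R-squared. For reference, a hedged sketch of the formula it relies on (the helper name is mine, not from the notebook):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept not counted).

    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1): it penalizes extra
    predictors, so it only rises when a variable genuinely improves the fit.
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Adding a predictor that leaves R-squared unchanged lowers this score, which is exactly what stops the forward-selection loop.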

In [8]:

```
predictors = []
topr2 = 0
for i in range(len(candidates)):
    filtered = [x for x in candidates if x not in predictors]
    list_r2 = [test_adjusted_R_squared(col) for col in filtered]
    highest, curr_topr2 = max(zip(filtered, list_r2), key=lambda x: x[1])
    if curr_topr2 > topr2:
        topr2 = round(curr_topr2, 10)
    else:
        print("Adjusted R-squared can't go any higher. Stopping")
        break
    predictors.append(highest)
    print('Step {}: Adjusted R-squared = {} + {}'.format(i, topr2, highest))
```

These are the non-dummy features after performing forward selection:

In [9]:

```
predictors
```

Out[9]:

To check for collinearity among my numerical features, I use a scatter matrix.

In [176]:

```
numerics = ['maxpressurei', 'mintempi', 'precipi']
print('Scatter Matrix of features and predictors to test collinearity')
pd.plotting.scatter_matrix(features[numerics], figsize=(10, 10));
```

I can see no collinearity among the predictors.
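Beyond eyeballing a scatter matrix, variance inflation factors (VIFs) quantify collinearity; values well above roughly 5-10 are usually a concern. A hedged sketch on synthetic columns (the names `a`, `b`, `c` are invented):

```python
# VIF check: column 'c' is built to be nearly collinear with 'a'.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
X = pd.DataFrame({'a': rng.normal(size=300), 'b': rng.normal(size=300)})
X['c'] = X['a'] + 0.1 * rng.normal(size=300)  # nearly a copy of 'a'

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
```

The collinear pair shows inflated VIFs while the independent column stays near 1.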

Next I join the non-dummy features and the dummy features into `features_dummy` and create the model.

In [10]:

```
features_dummy = features[predictors].join(dummy_units).join(dummy_hours)
model = sm.OLS(features['ENTRIESn_hourly'],features_dummy).fit()
```

In [11]:

```
filter_cols = lambda col: not col.startswith('unit') and not col.startswith('hour')
model.params[model.params.index.map(filter_cols)]
```

Out[11]:

In [12]:

```
model.rsquared
```

Out[12]:

*R2 is often interpreted as the proportion of response variation "explained" by the regressors in the model.* So we can say 61.67% of the variability in hourly subway ridership can be explained by the model.
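That interpretation follows from the definition R2 = 1 - SS_res / SS_tot; a small self-contained check (the numbers below are made up):

```python
import numpy as np

def r_squared(y, fitted):
    ss_res = np.sum((y - fitted) ** 2)      # variation left unexplained
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
fitted = np.array([1.1, 1.9, 3.2, 3.8])
```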

In [92]:

```
fig, axes = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True, squeeze=False)
filtered = df.loc[df.ENTRIESn_hourly < 10000]
for i in range(2):
    axes[0][i].set_xlabel('Number of ridership hourly')
    axes[0][i].set_ylabel('Frequency')
filtered.loc[filtered.rain == 0, 'ENTRIESn_hourly'].hist(ax=axes[0][0], bins=50)
axes[0][0].set_title('Non-rainy days')
filtered.loc[filtered.rain == 1, 'ENTRIESn_hourly'].hist(ax=axes[0][1], bins=50)
axes[0][1].set_title('Rainy days')
```

In [96]:

```
# weekday() < 5 stands in for the removed pd.datetools.isBusinessDay
(df.ENTRIESn_hourly
 .resample('1D').mean()
 .groupby(lambda x: 1 if x.weekday() < 5 else 0)
 .plot(legend=True))
plt.legend(['Not Business Day', 'Business Day'])
plt.xlabel('By day in May 2011')
plt.ylabel('Average number of ridership hourly')
plt.title('Average number of ridership every day in May 2011');
```

In [5]:

```
df['BusinessDay'] = df.index.map(lambda x: 0 if x.weekday() < 5 else 1)  # weekday test replaces pd.datetools.isBusinessDay
```

In [153]:

```
df.rain.resample('1D').mean().value_counts()
```

Out[153]:

Using the statistical test: if in fact there were no difference in average hourly ridership between non-rainy and rainy days, the probability of drawing samples of size 44104 (rainy days) and 87847 (non-rainy days) with an average difference of 15 riders would be 0.025. Such a small probability means that rain is a significant predictor, and the difference is not due to chance.

Using linear regression, we can say that, all else held constant, the model predicts hourly ridership on non-rainy days to be 117 people higher than on rainy days, on average.

So where will this lead us? We can see that on an average day, ridership still follows a pattern. But it's not clear how this changes across seasons, since we only have limited data: it covers only May 2011, so we have no idea how winter, autumn, summer, and spring affect ridership. The data could be expanded to cover a full year.

There is more analysis that can be done. With the statistical test, I only analyzed the effect of rain; the difference could also be driven by factors other than rain. Fog may make a significant difference, and as shown in the Visualization section, ridership differs between business days and non-business days.

We have also seen that the distribution of hourly entries (`ENTRIESn_hourly`) is right-skewed, so we could apply a transformation to make it more normal. The ridership pattern across business and non-business days is also not linear; it appears to be cyclical.
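One such transformation is a log; a hedged sketch on synthetic data, where the exponential sample merely mimics the right skew of `ENTRIESn_hourly`:

```python
# log1p pulls in a long right tail (and handles zero counts safely).
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
entries = rng.exponential(scale=1100, size=5000)  # right-skewed stand-in
log_entries = np.log1p(entries)
```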

The model may not really be linear. To test the fit of our model we can check the following conditions:

- linear relationship between every numerical explanatory variable and the response
- nearly normal residuals with mean 0
- constant variability of residuals
- independent residuals

Our model is not a good fit if at least one of these diagnostics fails, which it does.

In [150]:

```
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True, squeeze=False)
numerics = ['maxpressurei', 'mintempi', 'precipi']
for i in range(len(numerics)):
    axes[0][i].scatter(x=features[numerics[i]], y=model.resid, alpha=0.1)
    axes[0][i].set_xlabel(numerics[i])
axes[0][0].set_ylabel('final model residuals')
axes[0][1].set_title('linear relationships between features and residuals, alpha 0.1')
fig.set_size_inches(12, 5);
```

`maxpressurei` and `mintempi` show random scatter. But `precipi` is not a good candidate for a linear relationship in the model; it does not appear to be randomly scattered.

In [145]:

```
fig,axes = plt.subplots(nrows=1,ncols=2,squeeze=False)
sp.probplot(model.resid,plot=axes[0][0])
model.resid.hist(bins=20,ax=axes[0][1]);
axes[0][1].set_title('Histogram of residuals')
axes[0][1].set_xlabel('Residuals')
axes[0][1].set_ylabel('Frequency');
```

In [154]:

```
fig, axes = plt.subplots(nrows=1, ncols=2, squeeze=False)
axes[0][0].scatter(x=model.fittedvalues, y=model.resid, alpha=0.1)
axes[0][1].scatter(x=model.fittedvalues, y=abs(model.resid), alpha=0.1)
axes[0][0].set_xlabel('fitted_values')
axes[0][1].set_xlabel('fitted_values')
axes[0][0].set_ylabel('residuals')
axes[0][1].set_ylabel('Abs(residuals)');
fig.set_size_inches(13,5)
```

In [107]:

```
resids = pd.DataFrame(model.resid.copy())
resids.columns = ['residuals']
resids.index = pd.to_datetime(features['index'])
resids.sort_index(inplace=True)
hourly = resids.resample('1H').mean()
plt.plot_date(x=hourly.index, y=hourly.residuals);
plt.xlabel('Time Series')
plt.ylabel('residuals')
plt.title('Residuals Variability across time');
```

Finally, our model's residuals should be independent across time. We can check this by plotting the residuals as a time series and seeing whether they show constant variability, randomly scattered around zero. In this plot the variability is fairly constant across May 2011. But since we only have limited data, one month of one year, we can't be sure the model predicts accurately in any other month or year.

As linear regression is not a good model here, there could be another model, and some additional independent variables, that would better fit this problem.