The goal of this project is to identify the persons of interest (POIs), that is, the people who actually committed the fraud at Enron. Their crimes included selling assets to shell companies at the end of each month and buying them back at the beginning of the next to avoid reporting accounting losses. Ideally, even for people who are not in the dataset, the machine learning model could tell from the financial features and emails whether a person is a POI.

There are 146 persons in the dataset, 18 of whom are labeled as persons of interest (there are actually about 35 POIs in total). Since the email data is only a sample, some POI data is missing, which may make the predictions a little worse. There are 21 features in the dataset.
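
As a quick check, here is a minimal sketch of how these counts can be reproduced, assuming the project's final_project_dataset.pkl pickle; the DataFrame df used in the cells below is assumed to be built this way.

import pickle
import pandas as pd

# Load the project dataset: a dict mapping each person's name to a dict of features.
with open('final_project_dataset.pkl', 'rb') as f:
    data_dict = pickle.load(f)

# One row per person, one column per feature.
df = pd.DataFrame(data_dict).T

print('persons :', len(df))                # 146
print('features:', df.shape[1])            # 21
print('POIs    :', int(df['poi'].sum()))   # 18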

The dataset is not free of errors, especially in the financial features. Because not all POIs are in the dataset, we might want to add the missing ones by hand and simply record missing values for their financial information. But this could itself introduce an error, because the model might learn to predict whether a person is a POI from the NaN pattern. So the financial features still have to be treated with care. Below is the proportion of non-NaN values for each column.

In [29]:
import numpy as np

# The raw data uses the string 'NaN' for missing values, so convert those to
# real NaN before counting the non-missing entries in each column.
nan_summary = pd.DataFrame({'size': df.count(),
                            'no-nan': df.applymap(lambda x: np.nan if x == 'NaN' else x).count()},
                           index=df.columns)
nan_summary['no-nan-proportion'] = nan_summary['no-nan'] / nan_summary['size']
nan_summary
Out[29]:
                           no-nan  size  no-nan-proportion
salary                         95   146           0.650685
to_messages                    86   146           0.589041
deferral_payments              39   146           0.267123
total_payments                125   146           0.856164
exercised_stock_options       102   146           0.698630
bonus                          82   146           0.561644
restricted_stock              110   146           0.753425
shared_receipt_with_poi        86   146           0.589041
restricted_stock_deferred      18   146           0.123288
total_stock_value             126   146           0.863014
expenses                       95   146           0.650685
loan_advances                   4   146           0.027397
from_messages                  86   146           0.589041
other                          93   146           0.636986
from_this_person_to_poi        86   146           0.589041
poi                           146   146           1.000000
director_fees                  17   146           0.116438
deferred_income                49   146           0.335616
long_term_incentive            66   146           0.452055
email_address                 111   146           0.760274
from_poi_to_this_person        86   146           0.589041

We can see that of all the features in the dataset, only the poi feature, which is the label for this task, has no missing values. This is good: the model needs the label, and without it the data would be meaningless. On the other hand, a feature with too many missing values, like loan_advances, would not benefit the model.
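
As an illustration (not the exact preprocessing I used), features whose non-NaN proportion falls below some cutoff could be dropped like this; the 0.1 threshold is an arbitrary example.

# Hypothetical cutoff: drop columns where 10% or fewer of the values are present.
threshold = 0.1
sparse_cols = nan_summary.index[nan_summary['no-nan-proportion'] <= threshold]
df_reduced = df.drop(columns=sparse_cols)   # with this cutoff, only loan_advances is dropped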

The dataset contains an outlier called 'TOTAL'. It is the spreadsheet total of the numerical features across every person in the Enron data, yet it is stored as if it were a person, so it is an outlier we should exclude: it is not a data point we care about. I then looked at the remaining outliers and found that 2 of the 4 extreme values belong to POIs. Since these are exactly the data points we are paying attention to, I do not exclude them.
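
A minimal sketch of removing that row, assuming the data lives both in the data_dict dictionary and in the DataFrame df:

# 'TOTAL' is the spreadsheet total row, not a real person, so remove it.
data_dict.pop('TOTAL', None)
df = df.drop('TOTAL', errors='ignore')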

I added new features: the fraction of this person's emails sent to POIs and the fraction of their incoming emails that came from POIs. The intuition is that someone who exchanges email with POIs at a high rate may well turn out to be a POI themselves. These features, however, were filtered out by SelectPercentile and therefore had no effect on performance. I also added a text feature based on the words in each person's emails.
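
A sketch of the two email-ratio features; the column names fraction_to_poi and fraction_from_poi are my own labels.

# Convert the raw counts (which may still be the string 'NaN') to numbers first.
to_poi   = pd.to_numeric(df['from_this_person_to_poi'], errors='coerce')
from_poi = pd.to_numeric(df['from_poi_to_this_person'], errors='coerce')
sent     = pd.to_numeric(df['from_messages'], errors='coerce')
received = pd.to_numeric(df['to_messages'], errors='coerce')

df['fraction_to_poi'] = to_poi / sent          # share of sent mail that goes to POIs
df['fraction_from_poi'] = from_poi / received  # share of received mail that comes from POIs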

Without the text feature I achieve: Precision: 0.27753 Recall: 0.24700

With the text feature I achieve: Precision: 0.36853 Recall: 0.35950

I scaled all numerical features. The reason is that the algorithm I am using, SGDClassifier, does not treat the features independently of one another the way linear regression does through its per-feature coefficients, so features on very different scales can dominate the updates. SGDClassifier also applies an l2 penalty, but since I saw that scaling makes the model better, I decided to scale.
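
A minimal scaling sketch; the choice of MinMaxScaler and the numeric matrix features (e.g. the output of the starter code's featureFormat helper) are assumptions on my part.

from sklearn.preprocessing import MinMaxScaler

# Rescale every numerical feature into [0, 1] so that dollar amounts in the
# millions do not dominate features measured in email counts.
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)   # features: 2-D array of numeric values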

I select numerical features with SelectPercentile at the 21st percentile. I tried a range of percentiles, looking to maximize both precision and recall; when the two traded off against each other, I picked the percentile with the highest F1 score (a sketch of the selection step follows the final feature list below).

Range of percentiles used and the corresponding precision and recall:

  • 10% : Precision: 0.34855 Recall: 0.34350
  • 20% : Precision: 0.34731 Recall: 0.34800
  • 21% : Precision: 0.36853 Recall: 0.35950 BEST F1 Score!
  • 30% : Precision: 0.35031 Recall: 0.37150
  • 40% : Precision: 0.34158 Recall: 0.35900
  • 50% : Precision: 0.34586 Recall: 0.36350

Final features used are:

['deferred_income',
 'bonus',
 'total_stock_value',
 'salary',
 'exercised_stock_options']
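
The sketch of the selection step referenced above, assuming the scaled matrix features_scaled, the label array labels, and a list feature_names in matching column order; f_classif as the scoring function is my assumption.

from sklearn.feature_selection import SelectPercentile, f_classif

# Keep the top 21% of features ranked by the ANOVA F-score between feature and label.
selector = SelectPercentile(f_classif, percentile=21)
features_selected = selector.fit_transform(features_scaled, labels)

# Map the surviving columns back to their names.
selected_names = [n for n, keep in zip(feature_names, selector.get_support()) if keep]
print(selected_names)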

I initially chose Gaussian Naive Bayes, as it gives the best default performance of any classifier I tried. The performance of each algorithm (default, or tuned where noted) is as follows:

from sklearn.naive_bayes import GaussianNB            # Default (tuned): Precision: 0.29453  Recall: 0.43650
from sklearn.tree import DecisionTreeClassifier       # Default:         Precision: 0.14830  Recall: 0.05450
from sklearn.ensemble import RandomForestClassifier   # Default:         Precision: 0.47575  Recall: 0.20600, longer training time
from sklearn.linear_model import SGDClassifier        # Tuned:           Precision: 0.35534  Recall: 0.34450, BEST!
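
A rough sketch of how such a comparison can be run with cross-validated precision and recall; the actual numbers above come from the project's tester.py, so treat this only as illustrative.

from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
for candidate in (GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(), SGDClassifier()):
    p = cross_val_score(candidate, features_selected, labels, cv=cv, scoring='precision').mean()
    r = cross_val_score(candidate, features_selected, labels, cv=cv, scoring='recall').mean()
    print('{:<24} precision={:.3f} recall={:.3f}'.format(type(candidate).__name__, p, r))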

Since the algorithm I now use is SGDClassifier, I tune its parameters. Tuning is important because the best estimator and its parameter values vary from problem to problem; by tuning, we fit the parameters to our specific problem. By default the estimator uses the hinge loss, which makes it a linear SVM. The alpha parameter is the regularization strength, and with the default 'optimal' schedule it also sets the learning rate: too small and the model learns very slowly, too large and the updates overshoot so the model never reaches the best parameters.

I use GridSearchCV to tune the algorithm, though I do not hand every parameter over to GridSearchCV: for the text learning, the l2 penalty is a must, since it regularizes the sparse features. The default cv parameter uses stratified folds (StratifiedKFold), which is in line with the stratified splitting that tester.py uses. Stratification matters when the data is skewed, because each fold keeps the label proportions, and resampling across folds gives a more stable estimate. The scoring method used is the F1 score.
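
A sketch of the tuning step; the parameter grid below is my assumption, and only the fixed l2 penalty and the F1 scoring come from the text above.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# The penalty is fixed to l2 by hand; the loss and the regularization strength
# alpha are handed to the grid search. cv defaults to stratified folds for classifiers.
param_grid = {'loss': ['hinge', 'modified_huber'],
              'alpha': [1e-5, 1e-4, 1e-3, 1e-2]}
grid = GridSearchCV(SGDClassifier(penalty='l2', random_state=42),
                    param_grid, scoring='f1')
grid.fit(features_selected, labels)
print(grid.best_params_)
clf = grid.best_estimator_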

Validation is important when we want to know how the model will perform on future data. The drawback is that we have less data to train on, but it is worth it to get an honest estimate of performance. We cannot train the model on the whole dataset and then test it on the same data: the model would already know what it is up against and would perform deceptively well, which is called cheating in machine learning. I use a 70:30 train/test split and validate the performance against precision and recall.
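
A minimal sketch of this validation, assuming features_selected, labels, and the tuned classifier clf from above; the stratify argument is my addition to make sure both splits contain POIs.

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Hold out 30% of the people as a test set.
features_train, features_test, labels_train, labels_test = train_test_split(
    features_selected, labels, test_size=0.3, stratify=labels, random_state=42)

clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print('precision:', precision_score(labels_test, pred))
print('recall   :', recall_score(labels_test, pred))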

I use precision and recall as my evaluation metrics, since they measure performance on skewed data better than plain accuracy. The scores I obtained show both good recall, meaning the model catches a good share of the real POIs, and good precision, meaning that when it flags someone as a POI it has a good probability of being right.

StratifiedShuffleSplit is used to handle the skewed data while keeping the proportion of labels in every split. With an ordinary train/test split, the test set, or even worse the train set, could end up with no POI labels at all, which would leave the model poorly trained or poorly evaluated. If, for example, the StratifiedShuffleSplit has ten folds, every fold contains the same proportion of POI vs. non-POI.
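
A sketch of that evaluation loop, similar in spirit to tester.py; the fold count and split size are assumptions, and features_selected and labels are taken to be numpy arrays.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 1000 random splits, each preserving the POI / non-POI proportion.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)
true_pos = false_pos = false_neg = 0
for train_idx, test_idx in sss.split(features_selected, labels):
    clf.fit(features_selected[train_idx], labels[train_idx])
    pred = clf.predict(features_selected[test_idx])
    true_pos  += np.sum((pred == 1) & (labels[test_idx] == 1))
    false_pos += np.sum((pred == 1) & (labels[test_idx] == 0))
    false_neg += np.sum((pred == 0) & (labels[test_idx] == 1))

print('precision:', true_pos / float(true_pos + false_pos))
print('recall   :', true_pos / float(true_pos + false_neg))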