Random Forest

Use the random forest algorithm to train and test a model that predicts whether an NBA player’s career lasts longer than five years.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

nba_data = pd.read_csv('../output/new_nba_data.csv')
nba_data.sample(10, random_state=50)

	Name	GamesPlayed	MinutesPlayed	PointsPerGame	FieldGoalsMade	FieldGoalAttempts	FieldGoalPercent	3PointMade	3PointAttempts	3PointAttemptsPercent	...	FreeThrowAttempts	FreeThrowPercent	OffensiveRebounds	DefensiveRebounds	Rebounds	Assists	Steals	Blocks	Turnovers	CareerLongerThan5Years
456	Brandon Davies	51	11.3	2.8	1.1	2.5	42.2	0.0	0.2	20.0	...	1.0	64.2	0.7	1.4	2.1	0.5	0.5	0.2	0.7	0.0
303	Chris Washburn	35	11.0	3.8	1.6	4.1	39.3	0.0	0.0	0.0	...	1.5	35.3	1.0	1.9	2.9	0.5	0.2	0.2	1.1	0.0
772	Devin Booker	76	27.7	13.8	4.8	11.4	42.3	1.3	3.8	34.3	...	3.4	84.0	0.4	2.1	2.5	2.6	0.6	0.3	2.1	0.0
1298	Corie Blount	67	10.3	3.0	1.1	2.6	43.7	0.0	0.0	0.0	...	1.1	61.3	1.1	1.8	2.9	0.8	0.3	0.5	0.8	1.0
829	Allan Ray	47	15.1	6.2	2.1	5.5	38.6	1.0	2.5	41.4	...	1.2	76.4	0.4	1.1	1.5	0.9	0.4	0.1	0.9	0.0
262	Milt Wagner	40	9.5	3.8	1.6	3.7	42.2	0.1	0.3	20.0	...	0.7	89.7	0.1	0.6	0.7	1.5	0.1	0.1	0.6	0.0
1147	Adreian Payne	32	23.1	6.7	2.8	6.9	41.4	0.0	0.3	11.1	...	1.4	65.2	1.5	3.6	5.1	0.9	0.6	0.3	1.4	0.0
1181	Ray Owes	57	10.4	3.1	1.3	3.2	41.7	0.0	0.1	20.0	...	0.8	56.5	1.1	1.7	2.9	0.3	0.3	0.3	0.4	0.0
196	Tom Garrick	71	21.1	6.4	2.5	5.1	49.0	0.0	0.2	0.0	...	1.8	80.3	0.5	1.7	2.2	3.4	1.1	0.1	1.6	0.0
854	Charlie Villanueva	81	29.1	13.0	5.4	11.6	46.3	0.9	2.6	32.7	...	2.0	70.6	2.2	4.2	6.4	1.1	0.7	0.8	1.2	1.0

10 rows × 21 columns

The players’ names carry no predictive information, so we drop that column.

nba_data.drop('Name', inplace=True, axis=1)

X = nba_data[nba_data.columns[:-1]]
y = nba_data['CareerLongerThan5Years']
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=50,stratify=y)
dt_model = RandomForestClassifier(random_state=50)
dt_model.fit(train_X, train_y)
pred_y = dt_model.predict(test_X)

print('Accuracy: {:.2%}'.format(metrics.accuracy_score(test_y, pred_y))) 
print('Recall: {:.2%}'.format(metrics.recall_score(test_y, pred_y))) 
print('Precision: {:.2%}'.format(metrics.precision_score(test_y, pred_y))) 
print('F1 Score: {:.2%}'.format(metrics.f1_score(test_y, pred_y)))

Accuracy: 69.25%
Recall: 78.85%
Precision: 73.54%
F1 Score: 76.10%
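
As a sanity check on the metrics above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. A minimal sketch in plain Python follows; the helper name `f1_from` is ours and is not part of scikit-learn.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed above for the train-test split.
f1 = f1_from(precision=0.7354, recall=0.7885)
print('F1 Score: {:.2%}'.format(f1))
```

The result agrees with the F1 score reported by `metrics.f1_score`. Recall is higher than precision here, so the model finds most of the positive class at the cost of some false positives.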

An accuracy of 69% is fairly low. Let’s use cross-validation to check whether the particular random split affected the result.

scores = cross_val_score(dt_model, X, y, scoring='accuracy', cv=5)
scores.mean()

0.6813432835820896

With 5-fold cross-validation, the mean accuracy is 68%, so the low score is not caused by one unlucky split. Let’s take a look at the feature importances.
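
To make the cross-validation step concrete, the sketch below splits row indices into k contiguous folds and summarises per-fold scores. It is a simplified illustration in plain Python: `k_fold_indices` and the example fold scores are hypothetical, and the folds are not stratified by class, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds.

```python
from statistics import mean, stdev


def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
fold_sizes = [len(test) for _, test in folds]

# Hypothetical per-fold accuracies, only to show how they are summarised.
example_scores = [0.66, 0.70, 0.68, 0.69, 0.67]
summary = (mean(example_scores), stdev(example_scores))
```

In the notebook, reporting `scores.std()` next to `scores.mean()` would show how much the accuracy varies from fold to fold.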

feature_importances = pd.Series(dt_model.feature_importances_, index=nba_data.columns[:-1])
feature_importances.sort_values(ascending=False)

GamesPlayed              0.120035
PointsPerGame            0.072762
FieldGoalPercent         0.069790
FreeThrowPercent         0.063547
MinutesPlayed            0.060657
FieldGoalsMade           0.060537
FreeThrowMade            0.054399
FieldGoalAttempts        0.054145
DefensiveRebounds        0.051616
FreeThrowAttempts        0.050959
Rebounds                 0.048327
Assists                  0.046434
3PointAttemptsPercent    0.046243
OffensiveRebounds        0.044278
Turnovers                0.039989
3PointAttempts           0.033868
Blocks                   0.030166
Steals                   0.030091
3PointMade               0.022158
dtype: float64

feature_importances.nlargest(5).plot(kind='bar')

[Figure: bar chart of the five largest feature importances]
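
Random forest importances in scikit-learn are normalised to sum to 1, so each value can be read as a share of the total. The sketch below uses the five largest values printed above to show how much of the total they cover.

```python
# Five largest importances, copied from the output above.
top_five = {
    'GamesPlayed': 0.120035,
    'PointsPerGame': 0.072762,
    'FieldGoalPercent': 0.069790,
    'FreeThrowPercent': 0.063547,
    'MinutesPlayed': 0.060657,
}

top_five_share = sum(top_five.values())
print('Top 5 features: {:.1%} of total importance'.format(top_five_share))
```

No single feature dominates: even GamesPlayed, the largest, accounts for only about 12% of the total.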

Next, let’s use a train-validation-test split to tune the hyperparameters. We’ll use a 70%-15%-15% split.

X_train, X_test_full, y_train, y_test_full = train_test_split(X, y, random_state=50, test_size=0.3, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_test_full, y_test_full, random_state=50, test_size=0.5, stratify=y_test_full)

pd.DataFrame([len(y_train), len(y_val), len(y_test)], index=['train','validation', 'test']).plot(kind='pie', subplots=True)

[Figure: pie chart of the train, validation, and test set sizes]
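
The 70%-15%-15% proportions follow from the two chained splits: 30% is held out first, and that holdout is then halved. A small sketch of the arithmetic, assuming the test portion is rounded up as `train_test_split` does for a fractional `test_size`; `split_sizes` and the row count are hypothetical.

```python
import math


def split_sizes(n_samples: int, holdout: float = 0.3, test_share: float = 0.5):
    """Return (train, validation, test) sizes for the two-step split."""
    n_holdout = math.ceil(holdout * n_samples)   # first split: 30% held out
    n_test = math.ceil(test_share * n_holdout)   # second split: half of the holdout
    n_val = n_holdout - n_test
    n_train = n_samples - n_holdout
    return n_train, n_val, n_test


# n_samples=1000 is a hypothetical row count, not the size of the NBA data set.
sizes = split_sizes(1000)
print(sizes)
```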

Let’s verify that the split preserved the class proportions in each subset.

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12,6))
y_train.value_counts().plot(kind='pie', ax=ax1, title='train')
y_val.value_counts().plot(kind='pie', ax=ax2, title='val')
y_test.value_counts().plot(kind='pie', ax=ax3, title='test')

[Figure: three pie charts of the CareerLongerThan5Years class balance in the train, val, and test sets]
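
Pie charts make small differences hard to see, so a numeric check of the class shares can complement them. A minimal sketch in plain Python; `class_shares` and the label lists are hypothetical stand-ins for `y_train` and `y_val`.

```python
from collections import Counter


def class_shares(labels):
    """Map each class label to its fraction of the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical label lists standing in for y_train and y_val.
example_train = [1.0] * 62 + [0.0] * 38
example_val = [1.0] * 31 + [0.0] * 19

shares_train = class_shares(example_train)
shares_val = class_shares(example_val)
```

With the notebook's pandas Series, `y_train.value_counts(normalize=True)` gives the same shares directly.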

dt_model = RandomForestClassifier(random_state=50) 
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_val)
accuracy_score = metrics.accuracy_score(y_val, y_pred)
print('Accuracy: {:.2%}'.format(accuracy_score)) 

Accuracy: 69.65%

Let’s play with the hyperparameters and see if we can get a better result. Let’s start with max depth.

def train_and_find_best_depth(X_train, X_val, y_train, y_val, do_print): 
    result = None
    accuracy_max = -1
    for curr_max_depth in range(1, 20):
        dt_model = RandomForestClassifier(max_depth=curr_max_depth,random_state=50) 
        dt_model.fit(X_train, y_train)
        y_pred = dt_model.predict(X_val)
        accuracy_score = metrics.accuracy_score(y_val, y_pred)

        if accuracy_score >= accuracy_max: 
            accuracy_max = accuracy_score 
            result = curr_max_depth
            if do_print:
                print('max depth {}: {:.2%} accuracy on validation set.'.format(curr_max_depth, accuracy_score))
                print('-' * 20)
    print('best max depth {} has {:.2%} accuracy.'.format(result,accuracy_max))

    return result
    
best_max_depth = train_and_find_best_depth(X_train, X_val, y_train, y_val, True)

max depth 1: 71.64% accuracy on validation set.
--------------------

max depth 2: 72.14% accuracy on validation set.
--------------------

max depth 3: 73.13% accuracy on validation set.
--------------------

max depth 4: 73.63% accuracy on validation set.
--------------------

max depth 5: 73.63% accuracy on validation set.
--------------------

best max depth 5 has 73.63% accuracy.
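
Because the search loop compares with `>=`, a tie is resolved in favour of the later (deeper) candidate, which is why depth 5 is reported although depth 4 scores the same. The sketch below replays the printed accuracies under both tie rules; `best_depth` is a hypothetical helper, not the notebook's function.

```python
# Validation accuracies for depths 1 to 5, copied from the output above.
accuracy_by_depth = {1: 0.7164, 2: 0.7214, 3: 0.7313, 4: 0.7363, 5: 0.7363}


def best_depth(scores, prefer_larger_on_tie):
    """Return the depth with the highest score under the chosen tie rule."""
    best, best_score = None, -1.0
    for depth in sorted(scores):
        score = scores[depth]
        improved = score >= best_score if prefer_larger_on_tie else score > best_score
        if improved:
            best, best_score = depth, score
    return best


print(best_depth(accuracy_by_depth, prefer_larger_on_tie=True))   # 5, as in the notebook
print(best_depth(accuracy_by_depth, prefer_larger_on_tie=False))  # 4, the shallower tie
```

Using a strict `>` would prefer the shallower, simpler forest when two depths score equally.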

Let’s continue with n_estimators, the number of trees in the forest.

def train_and_find_n_estimators(X_train, X_val, y_train, y_val, do_print): 
    result = None
    accuracy_max = -1
    for n_estimators in range(1, 100):
        dt_model = RandomForestClassifier(n_estimators=n_estimators,random_state=50) 
        dt_model.fit(X_train, y_train)
        y_pred = dt_model.predict(X_val)
        accuracy_score = metrics.accuracy_score(y_val, y_pred)

        if accuracy_score >= accuracy_max: 
            accuracy_max = accuracy_score 
            result = n_estimators
            if do_print:
                print('n estimators {}: {:.2%} accuracy on validation set.'.format(n_estimators, accuracy_score))
                print('-' * 20)
    print('n estimators {} has {:.2%} accuracy.'.format(result,accuracy_max))

    return result
    
best_n_estimators = train_and_find_n_estimators(X_train, X_val, y_train, y_val, True)

n estimators 1: 59.20% accuracy on validation set.
--------------------
n estimators 2: 61.19% accuracy on validation set.
--------------------
n estimators 3: 68.66% accuracy on validation set.
--------------------
n estimators 7: 68.66% accuracy on validation set.
--------------------
n estimators 8: 72.14% accuracy on validation set.
--------------------

n estimators 10: 73.13% accuracy on validation set.
--------------------

n estimators 10 has 73.13% accuracy.
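
The differences between candidates are small, so it helps to estimate how noisy a validation accuracy is. The sketch below computes the binomial standard error of an accuracy estimate. The validation set size of 201 rows is an assumption inferred from the printed percentages (for example, 69.65% is approximately 140/201); the source does not state it.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


# Assumption: 201 validation rows, inferred from the printed percentages.
N_VAL = 201
se = accuracy_standard_error(0.7313, N_VAL)
print('73.13% +/- {:.1%} (one standard error)'.format(se))
```

Under that assumption, one validation row corresponds to about half a percentage point, so gains of one or two points between settings may be within the noise.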

Now let’s adjust the min_samples_split parameter, which is the minimum number of samples a node must contain before it can be split.

def train_and_find_min_samples_split(X_train, X_val, y_train, y_val, do_print): 
    result = None
    accuracy_max = -1
    for min_samples_split in range(2, 20):
        dt_model = RandomForestClassifier(min_samples_split=min_samples_split,random_state=50) 
        dt_model.fit(X_train, y_train)
        y_pred = dt_model.predict(X_val)
        accuracy_score = metrics.accuracy_score(y_val, y_pred)

        if accuracy_score >= accuracy_max: 
            accuracy_max = accuracy_score 
            result = min_samples_split
            if do_print:
                print('min samples split {}: {:.2%} accuracy on validation set.'.format(min_samples_split, accuracy_score))
                print('-' * 20)
    print('min samples split {} has {:.2%} accuracy.'.format(result,accuracy_max))

    return result

min_samples_split = train_and_find_min_samples_split(X_train, X_val, y_train, y_val, True)

min samples split 2: 69.65% accuracy on validation set.
--------------------

min samples split 4: 70.15% accuracy on validation set.
--------------------

min samples split 8: 70.65% accuracy on validation set.
--------------------

min samples split 9: 71.14% accuracy on validation set.
--------------------

min samples split 11: 72.14% accuracy on validation set.
--------------------

min samples split 19: 72.64% accuracy on validation set.
--------------------
min samples split 19 has 72.64% accuracy.

Let’s try the max_features parameter. It is the maximum number of features considered when looking for the best split at a node.

def train_and_find_max_features(X_train, X_val, y_train, y_val, do_print): 
    result = None
    accuracy_max = -1
    for max_features in range(1, 20):
        dt_model = RandomForestClassifier(max_features=max_features,random_state=50) 
        dt_model.fit(X_train, y_train)
        y_pred = dt_model.predict(X_val)
        accuracy_score = metrics.accuracy_score(y_val, y_pred)

        if accuracy_score >= accuracy_max: 
            accuracy_max = accuracy_score 
            result = max_features
            if do_print:
                print('max features split {}: {:.2%} accuracy on validation set.'.format(max_features, accuracy_score))
                print('-' * 20)
    print('max features split {} has {:.2%} accuracy.'.format(result,accuracy_max))

    return result

max_features = train_and_find_max_features(X_train, X_val, y_train, y_val, True)

max features split 1: 68.66% accuracy on validation set.
--------------------

max features split 2: 70.65% accuracy on validation set.
--------------------

max features split 2 has 70.65% accuracy.
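
The four search functions above differ only in the hyperparameter name and its range, so the loop could be written once. The sketch below is one possible generic helper; `find_best` and `toy_score` are hypothetical names, and the scoring function is a stand-in so that the example runs without the data set.

```python
def find_best(candidates, evaluate):
    """Return (best_candidate, best_score); ties go to the later candidate."""
    best, best_score = None, -1.0
    for candidate in candidates:
        score = evaluate(candidate)
        if score >= best_score:  # same tie rule as the functions above
            best, best_score = candidate, score
    return best, best_score


# Stand-in scoring function so the sketch runs on its own; in the notebook it
# would fit a RandomForestClassifier and return the validation accuracy.
def toy_score(value):
    return 1.0 - abs(value - 7) / 10


best_value, best_score = find_best(range(1, 20), toy_score)
```

In the notebook, `evaluate` could be a function that fits `RandomForestClassifier(max_depth=d, random_state=50)` on `X_train` and returns `metrics.accuracy_score` on `X_val`.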

dt_model = RandomForestClassifier(max_depth=best_max_depth,n_estimators=best_n_estimators, random_state=50, min_samples_split=min_samples_split,max_features=max_features)
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)
accuracy_score_imputed = metrics.accuracy_score(y_test, y_pred) 
print('Accuracy: {:.2%}'.format(metrics.accuracy_score(y_test, y_pred))) 
print('Recall: {:.2%}'.format(metrics.recall_score(y_test, y_pred))) 
print('Precision: {:.2%}'.format(metrics.precision_score(y_test, y_pred))) 
print('F1 Score: {:.2%}'.format(metrics.f1_score(y_test, y_pred)))

Accuracy: 70.15%
Recall: 84.68%
Precision: 71.92%
F1 Score: 77.78%
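
The final model above combines values that were each tuned while the other hyperparameters stayed at their defaults. Hyperparameters can interact, so an alternative is to score all combinations jointly. The sketch below does this with `itertools.product`; the score table is invented purely for illustration, and scikit-learn's `GridSearchCV` offers the same idea with cross-validation built in.

```python
from itertools import product

# Hypothetical validation scores for (max_depth, n_estimators) pairs.
toy_scores = {
    (4, 10): 0.70, (4, 50): 0.74,
    (5, 10): 0.73, (5, 50): 0.71,
}

param_grid = {'max_depth': [4, 5], 'n_estimators': [10, 50]}

best_params, best_score = None, -1.0
for depth, n_trees in product(param_grid['max_depth'], param_grid['n_estimators']):
    score = toy_scores[(depth, n_trees)]
    if score > best_score:
        best_params = {'max_depth': depth, 'n_estimators': n_trees}
        best_score = score

print(best_params, best_score)
```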

Imputing data is not required. The only missing values were the three-point percentages of players with zero three-point attempts.

With the train-validation-test method and default hyperparameters, the model reached 69.65% accuracy on the validation set. With the tuned hyperparameters, it reached 70.15% on the test set, which is slightly higher than with the simple train-test split (69.25%).
