Improving Model Performance

In this blog post, we'll explore various techniques to improve the performance of a machine learning model using the famous Titanic dataset. We'll walk through each step of the process, from feature engineering to hyperparameter tuning, and analyze the results.

1. Introduction to the Dataset

The Titanic dataset is a classic in the world of machine learning. It contains information about passengers on the Titanic, including whether they survived or not. Our goal is to predict survival based on various features like age, sex, passenger class, and more.
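
The dataset can be loaded directly from the public CSV referenced in the complete code at the end of this post. A quick look at the columns shows what we have to work with before we separate out the 'Survived' target:

# Load the data and separate features from the target
import pandas as pd

train_data = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
print(train_data.columns.tolist())

X = train_data.drop('Survived', axis=1)
y = train_data['Survived']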

2. Feature Engineering

One of the most crucial steps in improving model performance is feature engineering. We created several new features:

  • FamilySize: The total number of family members aboard (including the passenger)

  • IsAlone: A binary feature indicating if the passenger is traveling alone

  • Age*Class: The product of age and passenger class

These new features can help capture important relationships that might not be evident in the original features.

# Feature Engineering
def create_new_features(X):
    X['FamilySize'] = X['Siblings/Spouses Aboard'] + X['Parents/Children Aboard'] + 1
    X['IsAlone'] = (X['FamilySize'] == 1).astype(int)
    X['Age*Class'] = X['Age'] * X['Pclass']
    return X

X = create_new_features(X)

3. Data Preprocessing

Handling Missing Values

We used SimpleImputer to fill any missing values in the numeric columns with each column's median, so no rows are lost to missing information. We also dropped the free-text 'Name' column, which isn't usable as a raw model input.

Encoding Categorical Variables

We used one-hot encoding (pd.get_dummies with drop_first=True) for the 'Sex' feature, converting it into a single binary column the model can work with.

Feature Scaling

We applied StandardScaler to standardize our features. Tree ensembles like Random Forests are largely insensitive to feature scale, but scaling matters here because SMOTE (used in the next step) relies on nearest-neighbour distances, and it keeps the pipeline reusable with scale-sensitive models.

# Handling missing values (numeric columns only; drop the free-text 'Name' column)
X = X.drop('Name', axis=1)
num_cols = X.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
X[num_cols] = imputer.fit_transform(X[num_cols])

# Encoding categorical variables
X = pd.get_dummies(X, columns=['Sex'], drop_first=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

4. Handling Class Imbalance

The Titanic dataset is moderately imbalanced, with noticeably more passengers who did not survive than who did. We used SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class on the training set only, so the model learns from both outcomes more evenly while the test set keeps its natural class distribution.

# Handle Class Imbalance
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
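
As a quick sanity check, we can compare the class counts before and after resampling; the exact numbers depend on the train/test split, but the resampled classes should come out equal in size:

# Class distribution before and after SMOTE
print(pd.Series(y_train).value_counts())
print(pd.Series(y_train_resampled).value_counts())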

5. Model Selection and Hyperparameter Tuning

We chose a Random Forest Classifier for this task due to its ability to handle complex relationships and its robustness to overfitting. To find the optimal hyperparameters, we used GridSearchCV, which performs an exhaustive search over a specified parameter grid.

The hyperparameters we tuned include:

  • Number of trees (n_estimators)

  • Maximum depth of trees (max_depth)

  • Minimum number of samples required to split an internal node (min_samples_split)

  • Minimum number of samples required to be at a leaf node (min_samples_leaf)

# Hyperparameter Tuning with Cross-Validation
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_resampled, y_train_resampled)

# Get best model
best_rf = grid_search.best_estimator_
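
Once the search finishes, it is worth printing the combination GridSearchCV selected along with its cross-validated score, since the winning values often differ from the defaults:

# Inspect the winning hyperparameters and their cross-validated score
print(grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")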

6. Cross-Validation

We used 5-fold cross-validation throughout our process to ensure our model's performance is robust and not overly dependent on a particular split of the data.

# Cross-validation on the best model
cv_scores = cross_val_score(best_rf, X_train_resampled, y_train_resampled, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean()}")

7. Results and Evaluation
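
Finally, we evaluate the tuned model on the held-out test set, which was never touched during resampling or hyperparameter tuning: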

# Final evaluation on test set
y_pred = best_rf.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

8. Feature Importance

One of the advantages of using a Random Forest is that we can easily interpret which features are most important for making predictions. Here are the top 10 most important features:

# Feature Importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))

9. Conclusions and Next Steps

Through this process, we've significantly improved our model's performance on the Titanic dataset. The most impactful techniques were:

  1. Feature engineering, especially the creation of the 'FamilySize' and 'IsAlone' features

  2. Handling class imbalance with SMOTE

  3. Hyperparameter tuning with GridSearchCV

For further improvements, we could:

  • Experiment with other algorithms (e.g., XGBoost, Neural Networks); a quick sketch follows this list

  • Perform more advanced feature engineering

  • Collect more data or use more sophisticated data augmentation techniques
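
As a starting point for the first idea, here is a minimal sketch using XGBoost's scikit-learn-compatible XGBClassifier. It assumes the xgboost package is installed, and the hyperparameters shown are illustrative rather than tuned:

# Baseline XGBoost model on the same resampled training data (hyperparameters are illustrative)
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, random_state=42)
xgb.fit(X_train_resampled, y_train_resampled)
print(f"XGBoost test accuracy: {accuracy_score(y_test, xgb.predict(X_test_scaled)):.3f}")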

Remember, the goal of machine learning is not just to achieve high accuracy, but to gain insights that can drive decision-making. In this case, our model has helped us understand the factors that were most important in determining survival on the Titanic, providing valuable historical insights.

Happy modeling!

Complete Code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
train_data = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

# Feature Engineering
def create_new_features(X):
    X['FamilySize'] = X['Siblings/Spouses Aboard'] + X['Parents/Children Aboard'] + 1
    X['IsAlone'] = (X['FamilySize'] == 1).astype(int)
    X['Age*Class'] = X['Age'] * X['Pclass']
    return X

X = create_new_features(X)

# Handling missing values (numeric columns only; drop the free-text 'Name' column)
X = X.drop('Name', axis=1)
num_cols = X.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
X[num_cols] = imputer.fit_transform(X[num_cols])

# Encoding categorical variables
X = pd.get_dummies(X, columns=['Sex'], drop_first=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Handle Class Imbalance
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Initialize model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter Tuning with Cross-Validation
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_resampled, y_train_resampled)

# Get best model
best_rf = grid_search.best_estimator_

# Cross-validation on the best model
cv_scores = cross_val_score(best_rf, X_train_resampled, y_train_resampled, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean()}")

# Final evaluation on test set
y_pred = best_rf.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Feature Importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()