1. Data Loading and Exploration
Objective: Learn how to load and explore datasets in scikit-learn.
Examples:
Loading Built-in Datasets:
from sklearn.datasets import load_iris, load_wine, fetch_california_housing
import pandas as pd

# Load Iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = iris.target

# Load California Housing dataset
# (load_boston was removed in scikit-learn 1.2; fetch_california_housing is the
# recommended replacement and downloads the data on first use)
housing = fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target

# Load Wine dataset
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_wine = wine.target
Exploring the Dataset:
print(X_iris.head())
print(y_iris[:5])
Notes:
Scikit-learn provides several built-in datasets, including load_iris and load_wine. The load_boston dataset was deprecated and removed in scikit-learn 1.2; fetch_california_housing is a common replacement for regression examples.
You can load datasets using the load_* (and fetch_*) functions and convert them to pandas DataFrames for easier manipulation.
Exercise:
- Explore the features of different datasets and understand their structure.
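As a starting point for this exercise, here is a minimal exploration sketch using the iris data loaded above (the same pattern applies to the other datasets):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Shape, column types, and summary statistics
print(X_iris.shape)
print(X_iris.info())
print(X_iris.describe())

# Class balance of the target
print(pd.Series(iris.target).value_counts())

# Full dataset description shipped with scikit-learn
print(iris.DESCR[:500])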
2. Data Preprocessing
Objective: Learn data preprocessing techniques including handling missing values, encoding categorical variables, and scaling features.
Examples:
Handling Missing Values:
from sklearn.impute import SimpleImputer
import numpy as np

# Create sample data with missing values
data = np.array([[1, 2, np.nan], [3, np.nan, 4], [5, 6, 7]])

# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Encoding Categorical Variables:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})

# Encode a categorical column as integer labels
le_color = LabelEncoder()
df['color_encoded'] = le_color.fit_transform(df['color'])
print(df)
Scaling Features:
from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print(data_scaled)
Notes:
Missing values can be handled using SimpleImputer.
Categorical variables can be encoded using LabelEncoder or OneHotEncoder. Note that LabelEncoder is intended for encoding target labels; for input features, OneHotEncoder (or OrdinalEncoder) is usually the better choice (see the sketch below).
Feature scaling can be performed using StandardScaler or MinMaxScaler.
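Since the notes mention OneHotEncoder and MinMaxScaler but the examples above do not use them, here is a minimal sketch of both on the same small sample data:

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import pandas as pd
import numpy as np

# One-hot encode a categorical feature column
df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})
ohe = OneHotEncoder()
color_onehot = ohe.fit_transform(df[['color']]).toarray()  # default output is a sparse matrix
print(ohe.get_feature_names_out(['color']))
print(color_onehot)

# Scale numeric features into the [0, 1] range
data = np.array([[1, 2], [3, 4], [5, 6]])
mms = MinMaxScaler()
print(mms.fit_transform(data))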
Exercise:
- Apply preprocessing steps to a real dataset and analyze their impact on model performance.
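One way to approach this exercise is to chain the preprocessing steps above into a single transformer. Below is a minimal sketch using ColumnTransformer on a small, made-up DataFrame (the column names and values are purely illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd
import numpy as np

# Hypothetical mixed-type data with a missing numeric value
df = pd.DataFrame({
    'age': [25, np.nan, 47, 35],
    'income': [40000, 52000, 61000, 58000],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

numeric_features = ['age', 'income']
categorical_features = ['city']

# Impute and scale numeric columns, one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

X_prepared = preprocessor.fit_transform(df)
print(X_prepared)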
3. Feature Selection and Engineering
Objective: Understand feature selection techniques and feature engineering.
Examples:
Feature Selection with SelectKBest:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new)
Feature Engineering:
# Create a new feature: ratio of two existing features
# (a small value is added to avoid division by zero)
X_new_feature = X[:, 0] / (X[:, 1] + 1e-10)
Notes:
SelectKBest selects the best features based on univariate statistical tests.
Feature engineering involves creating new features from existing ones to improve model performance.
Exercise:
- Experiment with different feature selection methods and engineered features to see their impact on model performance.
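As one possible starting point for this exercise, here is a minimal sketch comparing two alternatives to the F-test used above on the iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Univariate selection with mutual information instead of the F-test
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
mi_selector.fit(X, y)
print('Mutual information picks:', mi_selector.get_support())

# Recursive feature elimination with a simple linear model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print('RFE picks:', rfe.get_support())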
4. Model Training and Validation
Objective: Learn how to train and validate machine learning models.
Examples:
Train-Test Split:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Model Training:
from sklearn.ensemble import RandomForestClassifier

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
Model Validation:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', cr)
Notes:
train_test_split is used to divide the dataset into training and testing subsets.
Models are trained with .fit() and validated using metrics such as accuracy and the classification report.
Exercise:
- Compare the performance of different models (e.g., RandomForestClassifier vs. LogisticRegression) using cross-validation.
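A minimal sketch for this exercise, comparing the two models with 5-fold cross-validation on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Same evaluation protocol for both models, so scores are directly comparable
for name, clf in [('RandomForest', RandomForestClassifier(n_estimators=100, random_state=42)),
                  ('LogisticRegression', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})')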
5. Hyperparameter Tuning
Objective: Optimize model performance through hyperparameter tuning.
Examples:
Grid Search for Hyperparameter Tuning:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

# Create model
model = SVC()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
Random Search for Hyperparameter Tuning:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define parameter distributions
# ('auto' is no longer a valid max_features value for random forests in recent
# scikit-learn versions; use 'sqrt', 'log2', or None instead)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

# Create model
model = RandomForestClassifier()

# Perform random search
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X, y)
print(f'Best parameters: {random_search.best_params_}')
print(f'Best score: {random_search.best_score_}')
Notes:
GridSearchCV and RandomizedSearchCV are used to find the best hyperparameters for a model.
Grid search evaluates all parameter combinations, while random search samples a fixed number of parameter settings.
Exercise:
- Experiment with different hyperparameters for various models and compare their performance.
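For this exercise, one sensible pattern is to tune on the training split only and then evaluate the refit best model on held-out data. A minimal sketch, assuming X_train, X_test, y_train, y_test from the earlier train/test split:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Tune only on the training data so the test set stays untouched
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Best parameters:', grid_search.best_params_)
print(f'Held-out accuracy: {accuracy_score(y_test, y_pred):.2f}')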
6. Model Evaluation
Objective: Evaluate model performance using various metrics.
Examples:
Confusion Matrix and Classification Report:
from sklearn.metrics import confusion_matrix, classification_report

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)

# Classification report
cr = classification_report(y_test, y_pred)
print('Classification Report:\n', cr)
Cross-Validation:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Model
model = RandomForestClassifier()

# Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-validation scores: {scores}')
print(f'Mean score: {scores.mean():.2f}')
Notes:
The confusion matrix and classification report provide detailed insights into model performance.
Cross-validation helps to assess the model's performance across different subsets of the dataset.
Exercise:
- Use different evaluation metrics such as ROC-AUC for binary classification tasks.
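Iris has three classes, so for this exercise a binary dataset is easier to work with. A minimal sketch using the built-in breast cancer data and roc_auc_score:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# ROC-AUC needs scores or probabilities for the positive class, not hard labels
y_proba = clf.predict_proba(X_test)[:, 1]
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}')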
7. Advanced Topics
Objective: Explore advanced techniques such as ensemble methods, pipelines, and clustering.
Examples:
Ensemble Methods:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create models
clf1 = GradientBoostingClassifier(n_estimators=100)
clf2 = RandomForestClassifier(n_estimators=100)

# Voting Classifier (soft voting averages the predicted class probabilities)
voting_clf = VotingClassifier(estimators=[('gb', clf1), ('rf', clf2)], voting='soft')
voting_clf.fit(X, y)

# Evaluate model
scores = cross_val_score(voting_clf, X, y, cv=5)
print(f'Voting Classifier scores: {scores}')
Pipelines:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create pipeline: scaling followed by a linear SVM
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Train model (X_train, y_train come from the earlier train/test split)
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
Clustering:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/raphaelmw/mall-customer-segmentation/master/Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Apply K-Means (n_init set explicitly for consistent results across scikit-learn versions)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)

# Plot each cluster in its own color, then the centroids
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for cluster, color in enumerate(colors):
    plt.scatter(X.iloc[y_kmeans == cluster, 0], X.iloc[y_kmeans == cluster, 1],
                s=100, c=color, label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
plt.title('Customer Clusters')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Notes:
Ensemble methods combine the predictions of multiple models to improve performance.
Pipelines streamline workflows by chaining preprocessing and model training steps.
Clustering algorithms like K-Means group similar data points without labeled outputs.
Exercise:
- Experiment with different ensemble methods and clustering algorithms on various datasets.
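As a starting point for the clustering part of this exercise, here is a minimal sketch that compares different numbers of clusters with the silhouette score, assuming the X DataFrame of income and spending score loaded above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f'k={k}: silhouette score = {silhouette_score(X, labels):.3f}')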
This comprehensive guide covers various scikit-learn features and their applications in data science and machine learning. Each section provides examples and notes to help you understand and apply these techniques effectively. Feel free to explore and experiment with different datasets and models to deepen your understanding.