Comprehensive Guide to scikit-learn Features


1. Data Loading and Exploration

Objective: Learn how to load and explore datasets in scikit-learn.

Examples:

  1. Loading Built-in Datasets:

     from sklearn.datasets import load_iris, load_diabetes, load_wine
     import pandas as pd

     # Load Iris dataset (classification)
     iris = load_iris()
     X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
     y_iris = iris.target

     # Load Diabetes dataset (regression); load_boston was removed in scikit-learn 1.2
     diabetes = load_diabetes()
     X_diabetes = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
     y_diabetes = diabetes.target

     # Load Wine dataset (classification)
     wine = load_wine()
     X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
     y_wine = wine.target
    
  2. Exploring the Dataset:

     print(X_iris.head())
     print(y_iris[:5])
    

Notes:

  • Scikit-learn provides several built-in datasets, including load_iris, load_diabetes, and load_wine (load_boston was deprecated and removed in scikit-learn 1.2).

  • You can load datasets with the load_* functions and convert them to pandas DataFrames for easier manipulation.
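
  Recent scikit-learn versions (0.23 and later) can also return a pandas DataFrame directly through the as_frame argument, which avoids the manual pd.DataFrame conversion above; a minimal sketch:

     from sklearn.datasets import load_iris

     # as_frame=True makes .data a DataFrame and .target a Series
     iris = load_iris(as_frame=True)
     X_iris = iris.data
     y_iris = iris.target
     print(X_iris.head())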

Exercise:

  • Explore the features of different datasets and understand their structure.

2. Data Preprocessing

Objective: Learn data preprocessing techniques including handling missing values, encoding categorical variables, and scaling features.

Examples:

  1. Handling Missing Values:

     from sklearn.impute import SimpleImputer
     import numpy as np
    
     # Create sample data with missing values
     data = np.array([[1, 2, np.nan], [3, np.nan, 4], [5, 6, 7]])
    
     # Impute missing values with mean
     imputer = SimpleImputer(strategy='mean')
     data_imputed = imputer.fit_transform(data)
     print(data_imputed)
    
  2. Encoding Categorical Variables:

     from sklearn.preprocessing import LabelEncoder
     import pandas as pd

     # Sample data
     df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})

     # Encode each category as an integer
     # (LabelEncoder is intended for target labels; for input features,
     # OrdinalEncoder or OneHotEncoder is usually preferred)
     le_color = LabelEncoder()
     df['color_encoded'] = le_color.fit_transform(df['color'])

     print(df)
    
  3. Scaling Features:

     from sklearn.preprocessing import StandardScaler
    
     # Sample data
     data = np.array([[1, 2], [3, 4], [5, 6]])
    
     # Scale features
     scaler = StandardScaler()
     data_scaled = scaler.fit_transform(data)
     print(data_scaled)
    

Notes:

  • Missing values can be handled using SimpleImputer.

  • Categorical variables can be encoded with OrdinalEncoder or OneHotEncoder; LabelEncoder is intended for target labels.

  • Feature scaling can be performed using StandardScaler or MinMaxScaler.
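
  As a complement to the examples above, here is a minimal sketch of OneHotEncoder and MinMaxScaler on small toy arrays (the data here is made up for illustration):

     import numpy as np
     from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

     # One-hot encode a single categorical column (the result is sparse, so convert it for display)
     colors = np.array([['red'], ['blue'], ['green'], ['blue']])
     ohe = OneHotEncoder()
     print(ohe.fit_transform(colors).toarray())

     # Rescale numeric features to the [0, 1] range
     data = np.array([[1, 2], [3, 4], [5, 6]])
     minmax = MinMaxScaler()
     print(minmax.fit_transform(data))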

Exercise:

  • Apply preprocessing steps to a real dataset and analyze their impact on model performance.

3. Feature Selection and Engineering

Objective: Understand feature selection techniques and feature engineering.

Examples:

  1. Feature Selection with SelectKBest:

     from sklearn.feature_selection import SelectKBest, f_classif
     from sklearn.datasets import load_iris
    
     # Load dataset
     iris = load_iris()
     X = iris.data
     y = iris.target
    
     # Feature selection
     selector = SelectKBest(score_func=f_classif, k=2)
     X_new = selector.fit_transform(X, y)
     print(X_new)
    
  2. Feature Engineering:

     # Create new feature: ratio of two existing features
     X_new_feature = X[:, 0] / (X[:, 1] + 1e-10)  # Adding small value to avoid division by zero
    

Notes:

  • SelectKBest is used for selecting the best features based on statistical tests.

  • Feature engineering involves creating new features from existing ones to improve model performance.
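
  To check which features SelectKBest actually kept, you can inspect its scores_ and get_support() attributes after fitting; a minimal sketch building on the iris example above:

     import numpy as np
     from sklearn.datasets import load_iris
     from sklearn.feature_selection import SelectKBest, f_classif

     iris = load_iris()
     selector = SelectKBest(score_func=f_classif, k=2).fit(iris.data, iris.target)

     # Per-feature ANOVA F-scores and the names of the selected features
     print(selector.scores_)
     print(np.array(iris.feature_names)[selector.get_support()])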

Exercise:

  • Experiment with different feature selection methods and engineered features to see their impact on model performance.

4. Model Training and Validation

Objective: Learn how to train and validate machine learning models.

Examples:

  1. Train-Test Split:

     from sklearn.model_selection import train_test_split
    
     # Load dataset
     iris = load_iris()
     X = iris.data
     y = iris.target
    
     # Split dataset
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
  2. Model Training:

     from sklearn.ensemble import RandomForestClassifier
    
     # Train model
     model = RandomForestClassifier(n_estimators=100)
     model.fit(X_train, y_train)
    
  3. Model Validation:

     from sklearn.metrics import accuracy_score, classification_report
    
     # Make predictions
     y_pred = model.predict(X_test)
    
     # Evaluate model
     accuracy = accuracy_score(y_test, y_pred)
     cr = classification_report(y_test, y_pred)
     print(f'Accuracy: {accuracy:.2f}')
     print('Classification Report:\n', cr)
    

Notes:

  • train_test_split is used to divide the dataset into training and testing subsets.

  • Models are trained using .fit() and validated using metrics like accuracy and classification report.

Exercise:

  • Compare the performance of different models (e.g., RandomForestClassifier vs. LogisticRegression) using cross-validation.
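
  As a starting point for this exercise, here is a minimal sketch comparing two classifiers with 5-fold cross-validation on the iris data (cross_val_score is covered in Section 6):

     from sklearn.datasets import load_iris
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.linear_model import LogisticRegression
     from sklearn.model_selection import cross_val_score

     X, y = load_iris(return_X_y=True)

     # Report mean cross-validated accuracy for each model
     models = [('RandomForest', RandomForestClassifier(n_estimators=100)),
               ('LogisticRegression', LogisticRegression(max_iter=1000))]
     for name, model in models:
         scores = cross_val_score(model, X, y, cv=5)
         print(f'{name}: {scores.mean():.2f} +/- {scores.std():.2f}')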

5. Hyperparameter Tuning

Objective: Optimize model performance through hyperparameter tuning.

Examples:

  1. Grid Search for Hyperparameter Tuning:

     from sklearn.model_selection import GridSearchCV
     from sklearn.svm import SVC
    
     # Define parameter grid
     param_grid = {
         'C': [0.1, 1, 10],
         'kernel': ['linear', 'rbf']
     }
    
     # Create model
     model = SVC()
    
     # Perform grid search
     grid_search = GridSearchCV(model, param_grid, cv=5)
     grid_search.fit(X, y)
     print(f'Best parameters: {grid_search.best_params_}')
     print(f'Best score: {grid_search.best_score_}')
    
  2. Random Search for Hyperparameter Tuning:

     from sklearn.ensemble import RandomForestClassifier
     from sklearn.model_selection import RandomizedSearchCV
     from scipy.stats import randint

     # Define parameter distributions to sample from
     param_dist = {
         'n_estimators': randint(50, 200),
         'max_features': ['sqrt', 'log2', None],  # 'auto' was deprecated and later removed
         'bootstrap': [True, False]
     }

     # Create model
     model = RandomForestClassifier()
    
     # Perform random search
     random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
     random_search.fit(X, y)
     print(f'Best parameters: {random_search.best_params_}')
     print(f'Best score: {random_search.best_score_}')
    

Notes:

  • GridSearchCV and RandomizedSearchCV are used to find the best hyperparameters for a model.

  • Grid search evaluates all parameter combinations, while random search samples a fixed number of parameter settings.
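
  Note that both examples above search over the full dataset; searching on a training split and then evaluating the refit best_estimator_ on a held-out test split gives a less optimistic estimate. A minimal sketch of that pattern:

     from sklearn.datasets import load_iris
     from sklearn.model_selection import GridSearchCV, train_test_split
     from sklearn.svm import SVC

     X, y = load_iris(return_X_y=True)
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

     # Search on the training split only, then score the refit best model on the test split
     param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
     grid_search = GridSearchCV(SVC(), param_grid, cv=5)
     grid_search.fit(X_train, y_train)
     print(grid_search.best_params_)
     print(f'Test accuracy: {grid_search.best_estimator_.score(X_test, y_test):.2f}')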

Exercise:

  • Experiment with different hyperparameters for various models and compare their performance.

6. Model Evaluation

Objective: Evaluate model performance using various metrics.

Examples:

  1. Confusion Matrix and Classification Report:

     from sklearn.metrics import confusion_matrix, classification_report
    
     # Compute confusion matrix
     cm = confusion_matrix(y_test, y_pred)
     print('Confusion Matrix:\n', cm)
    
     # Classification report
     cr = classification_report(y_test, y_pred)
     print('Classification Report:\n', cr)
    
  2. Cross-Validation:

     from sklearn.model_selection import cross_val_score
     from sklearn.ensemble import RandomForestClassifier
    
     # Load dataset
     iris = load_iris()
     X = iris.data
     y = iris.target
    
     # Model
     model = RandomForestClassifier()
    
     # Cross-validation
     scores = cross_val_score(model, X, y, cv=5)
     print(f'Cross-validation scores: {scores}')
     print(f'Mean score: {scores.mean():.2f}')
    

Notes:

  • The confusion matrix and classification report provide detailed insights into model performance.

  • Cross-validation helps to assess the model's performance across different subsets of the dataset.

Exercise:

  • Use different evaluation metrics such as ROC-AUC for binary classification tasks.
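
  As a starting point for this exercise, here is a minimal sketch computing ROC-AUC on the built-in breast cancer dataset, which is a binary classification problem (the model and split settings are arbitrary choices for illustration):

     from sklearn.datasets import load_breast_cancer
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.metrics import roc_auc_score
     from sklearn.model_selection import train_test_split

     X, y = load_breast_cancer(return_X_y=True)
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

     # ROC-AUC is computed from the predicted probability of the positive class
     model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
     y_proba = model.predict_proba(X_test)[:, 1]
     print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.2f}')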

7. Advanced Topics

Objective: Explore advanced techniques such as ensemble methods, pipelines, and clustering.

Examples:

  1. Ensemble Methods:

     from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
     from sklearn.model_selection import cross_val_score
    
     # Load dataset
     iris = load_iris()
     X = iris.data
     y = iris.target
    
     # Create models
     clf1 = GradientBoostingClassifier(n_estimators=100)
     clf2 = RandomForestClassifier(n_estimators=100)
    
     # Voting Classifier
     voting_clf = VotingClassifier(estimators=[('gb', clf1), ('rf', clf2)], voting='soft')
     voting_clf.fit(X, y)
    
     # Evaluate model
     scores = cross_val_score(voting_clf, X, y, cv=5)
     print(f'Voting Classifier scores: {scores}')
    
  2. Pipelines:

     from sklearn.metrics import accuracy_score
     from sklearn.pipeline import Pipeline
     from sklearn.preprocessing import StandardScaler
     from sklearn.svm import SVC
    
     # Create pipeline
     pipeline = Pipeline([
         ('scaler', StandardScaler()),
         ('svc', SVC(kernel='linear'))
     ])
    
     # Train the pipeline on the iris train/test split from Section 4
     pipeline.fit(X_train, y_train)
    
     # Predict and evaluate
     y_pred = pipeline.predict(X_test)
     print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
    
  3. Clustering:

     import matplotlib.pyplot as plt
     import pandas as pd
     from sklearn.cluster import KMeans
    
     # Load dataset
     df = pd.read_csv('https://raw.githubusercontent.com/raphaelmw/mall-customer-segmentation/master/Mall_Customers.csv')
     X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
    
     # Apply K-Means
     kmeans = KMeans(n_clusters=5, random_state=42)
     y_kmeans = kmeans.fit_predict(X)
    
     # Plot clusters
     plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
     plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
     plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
     plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
     plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')
     plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
     plt.title('Customer Clusters')
     plt.xlabel('Annual Income (k$)')
     plt.ylabel('Spending Score (1-100)')
     plt.legend()
     plt.show()
    

Notes:

  • Ensemble methods combine the predictions of multiple models to improve performance.

  • Pipelines streamline workflows by chaining preprocessing and model training steps.

  • Clustering algorithms like K-Means group similar data points without labeled outputs.
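
  The clustering example above fixes n_clusters=5 up front; one common heuristic for choosing this value is to compare the model's inertia_ (within-cluster sum of squares) across several values of k and look for an "elbow". A minimal sketch on synthetic make_blobs data so the snippet is self-contained:

     import matplotlib.pyplot as plt
     from sklearn.cluster import KMeans
     from sklearn.datasets import make_blobs

     # Synthetic 2-D data with a known cluster structure
     X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

     # Record inertia for a range of cluster counts
     ks = range(1, 9)
     inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

     plt.plot(list(ks), inertias, marker='o')
     plt.xlabel('Number of clusters (k)')
     plt.ylabel('Inertia')
     plt.title('Elbow method for choosing k')
     plt.show()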

Exercise:

  • Experiment with different ensemble methods and clustering algorithms on various datasets.

This comprehensive guide covers various scikit-learn features and their applications in data science and machine learning. Each section provides examples and notes to help you understand and apply these techniques effectively. Feel free to explore and experiment with different datasets and models to deepen your understanding.