1. Data Loading and Exploration
Objective: Learn how to load and explore datasets in scikit-learn.
Examples:
Loading Built-in Datasets:
from sklearn.datasets import load_iris, load_wine, fetch_california_housing
import pandas as pd

# Load Iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = iris.target

# Load California Housing dataset
# (load_boston was removed in scikit-learn 1.2; fetch_california_housing is the
# recommended replacement and downloads the data on first use)
housing = fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target

# Load Wine dataset
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_wine = wine.target
Exploring the Dataset:
print(X_iris.head())
print(y_iris[:5])
Notes:
Scikit-learn provides several built-in datasets, including load_iris and load_wine. The load_boston dataset was deprecated and removed in scikit-learn 1.2; fetch_california_housing is a common replacement for regression examples.
You can load datasets using the load_* (and fetch_*) functions and convert them to pandas DataFrames for easier manipulation.
Exercise:
- Explore the features of different datasets and understand their structure.
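As a starting point for this exercise, here is a minimal exploration sketch using the iris data loaded above (the same pattern applies to the other datasets):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Shape, column types, and summary statistics
print(X_iris.shape)
print(X_iris.info())
print(X_iris.describe())

# Class balance of the target
print(pd.Series(iris.target).value_counts())

# Full dataset description shipped with scikit-learn
print(iris.DESCR[:500])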
2. Data Preprocessing
Objective: Learn data preprocessing techniques including handling missing values, encoding categorical variables, and scaling features.
Examples:
Handling Missing Values:
from sklearn.impute import SimpleImputer
import numpy as np

# Create sample data with missing values
data = np.array([[1, 2, np.nan], [3, np.nan, 4], [5, 6, 7]])

# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
print(data_imputed)
Encoding Categorical Variables:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})

# Encode a categorical column as integer labels
le_color = LabelEncoder()
df['color_encoded'] = le_color.fit_transform(df['color'])
print(df)
Scaling Features:
from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print(data_scaled)
Notes:
Missing values can be handled using SimpleImputer.
Categorical variables can be encoded using LabelEncoder or OneHotEncoder. Note that LabelEncoder is intended for encoding target labels; for input features, OneHotEncoder (or OrdinalEncoder) is usually the better choice (see the sketch below).
Feature scaling can be performed using StandardScaler or MinMaxScaler.
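Since the notes mention OneHotEncoder and MinMaxScaler but the examples above do not use them, here is a minimal sketch of both on the same small sample data:

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import pandas as pd
import numpy as np

# One-hot encode a categorical feature column
df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})
ohe = OneHotEncoder()
color_onehot = ohe.fit_transform(df[['color']]).toarray()  # default output is a sparse matrix
print(ohe.get_feature_names_out(['color']))
print(color_onehot)

# Scale numeric features into the [0, 1] range
data = np.array([[1, 2], [3, 4], [5, 6]])
mms = MinMaxScaler()
print(mms.fit_transform(data))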
Exercise:
- Apply preprocessing steps to a real dataset and analyze their impact on model performance.
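One way to approach this exercise is to chain the preprocessing steps above into a single transformer. Below is a minimal sketch using ColumnTransformer on a small, made-up DataFrame (the column names and values are purely illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd
import numpy as np

# Hypothetical mixed-type data with a missing numeric value
df = pd.DataFrame({
    'age': [25, np.nan, 47, 35],
    'income': [40000, 52000, 61000, 58000],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

numeric_features = ['age', 'income']
categorical_features = ['city']

# Impute and scale numeric columns, one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

X_prepared = preprocessor.fit_transform(df)
print(X_prepared)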
3. Feature Selection and Engineering
Objective: Understand feature selection techniques and feature engineering.
Examples:
Feature Selection with SelectKBest:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new)
Feature Engineering:
# Create a new feature: ratio of two existing features
# (a small value is added to avoid division by zero)
X_new_feature = X[:, 0] / (X[:, 1] + 1e-10)
Notes:
SelectKBest selects the best features based on univariate statistical tests.
Feature engineering involves creating new features from existing ones to improve model performance.
Exercise:
- Experiment with different feature selection methods and engineered features to see their impact on model performance.
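As one possible starting point for this exercise, here is a minimal sketch comparing two alternatives to the F-test used above on the iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Univariate selection with mutual information instead of the F-test
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
mi_selector.fit(X, y)
print('Mutual information picks:', mi_selector.get_support())

# Recursive feature elimination with a simple linear model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print('RFE picks:', rfe.get_support())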
4. Model Training and Validation
Objective: Learn how to train and validate machine learning models.
Examples:
Train-Test Split:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Model Training:
from sklearn.ensemble import RandomForestClassifier

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
Model Validation:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', cr)
Notes:
train_test_split is used to divide the dataset into training and testing subsets.
Models are trained with .fit() and validated using metrics such as accuracy and the classification report.
Exercise:
- Compare the performance of different models (e.g., RandomForestClassifier vs. LogisticRegression) using cross-validation.
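A minimal sketch for this exercise, comparing the two models with 5-fold cross-validation on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Same evaluation protocol for both models, so scores are directly comparable
for name, clf in [('RandomForest', RandomForestClassifier(n_estimators=100, random_state=42)),
                  ('LogisticRegression', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})')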
5. Hyperparameter Tuning
Objective: Optimize model performance through hyperparameter tuning.
Examples:
Grid Search for Hyperparameter Tuning:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

# Create model
model = SVC()

# Perform grid search
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
Random Search for Hyperparameter Tuning:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define parameter distributions
# ('auto' is no longer a valid max_features value for random forests in recent
# scikit-learn versions; use 'sqrt', 'log2', or None instead)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

# Create model
model = RandomForestClassifier()

# Perform random search
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X, y)
print(f'Best parameters: {random_search.best_params_}')
print(f'Best score: {random_search.best_score_}')
Notes:
GridSearchCV and RandomizedSearchCV are used to find the best hyperparameters for a model.
Grid search evaluates all parameter combinations, while random search samples a fixed number of parameter settings.
Exercise:
- Experiment with different hyperparameters for various models and compare their performance.
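For this exercise, one sensible pattern is to tune on the training split only and then evaluate the refit best model on held-out data. A minimal sketch, assuming X_train, X_test, y_train, y_test from the earlier train/test split:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Tune only on the training data so the test set stays untouched
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Best parameters:', grid_search.best_params_)
print(f'Held-out accuracy: {accuracy_score(y_test, y_pred):.2f}')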
6. Model Evaluation
Objective: Evaluate model performance using various metrics.
Examples:
Confusion Matrix and Classification Report:
from sklearn.metrics import confusion_matrix, classification_report

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)

# Classification report
cr = classification_report(y_test, y_pred)
print('Classification Report:\n', cr)
Cross-Validation:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Model
model = RandomForestClassifier()

# Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-validation scores: {scores}')
print(f'Mean score: {scores.mean():.2f}')
Notes:
The confusion matrix and classification report provide detailed insights into model performance.
Cross-validation helps to assess the model's performance across different subsets of the dataset.
Exercise:
- Use different evaluation metrics such as ROC-AUC for binary classification tasks.
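Iris has three classes, so for this exercise a binary dataset is easier to work with. A minimal sketch using the built-in breast cancer data and roc_auc_score:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# ROC-AUC needs scores or probabilities for the positive class, not hard labels
y_proba = clf.predict_proba(X_test)[:, 1]
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}')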
7. Advanced Topics
Objective: Explore advanced techniques such as ensemble methods, pipelines, and clustering.
Examples:
Ensemble Methods:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create models
clf1 = GradientBoostingClassifier(n_estimators=100)
clf2 = RandomForestClassifier(n_estimators=100)

# Voting Classifier (soft voting averages the predicted class probabilities)
voting_clf = VotingClassifier(estimators=[('gb', clf1), ('rf', clf2)], voting='soft')
voting_clf.fit(X, y)

# Evaluate model
scores = cross_val_score(voting_clf, X, y, cv=5)
print(f'Voting Classifier scores: {scores}')
Pipelines:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create pipeline: scaling followed by a linear SVM
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Train model (X_train, y_train come from the earlier train/test split)
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
Clustering:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/raphaelmw/mall-customer-segmentation/master/Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Apply K-Means (n_init set explicitly for consistent results across scikit-learn versions)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)

# Plot each cluster in its own color, then the centroids
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for cluster, color in enumerate(colors):
    plt.scatter(X.iloc[y_kmeans == cluster, 0], X.iloc[y_kmeans == cluster, 1],
                s=100, c=color, label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
plt.title('Customer Clusters')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Notes:
Ensemble methods combine the predictions of multiple models to improve performance.
Pipelines streamline workflows by chaining preprocessing and model training steps.
Clustering algorithms like K-Means group similar data points without labeled outputs.
Exercise:
- Experiment with different ensemble methods and clustering algorithms on various datasets.
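As a starting point for the clustering part of this exercise, here is a minimal sketch that compares different numbers of clusters with the silhouette score, assuming the X DataFrame of income and spending score loaded above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f'k={k}: silhouette score = {silhouette_score(X, labels):.3f}')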
This comprehensive guide covers various scikit-learn features and their applications in data science and machine learning. Each section provides examples and notes to help you understand and apply these techniques effectively. Feel free to explore and experiment with different datasets and models to deepen your understanding.