Machine Learning Technique: Ensemble Learning with the scikit-learn Library
In this article, we delve into the world of ensemble learning and demonstrate its application using the Titanic dataset from Kaggle.
**Data Loading and Splitting** First, we load the Titanic dataset and split it into training and test sets, essential for model evaluation.
**Preprocessing** Next, we define separate pipelines for numeric and categorical features, including imputation and encoding. The ColumnTransformer combines these steps to process all features in one go.
**Ensemble Models** We combine Random Forest and Gradient Boosting into a VotingClassifier, which can use either majority ('hard') or probability-based ('soft') voting for final predictions.
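For instance, a minimal sketch of the soft-voting variant, assuming the same two estimators used in the full example below (the `weights` values here are illustrative, not tuned):

```python
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
)

# Soft voting averages the predicted class probabilities, so every
# estimator must implement predict_proba (both of these do)
soft_ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft',
    weights=[2, 1],  # illustrative: give the first model more influence
)
```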
**Pipeline** The preprocessing and ensemble steps are combined into a single scikit-learn Pipeline, ensuring reproducibility and easy deployment.
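For example, once fitted, the entire pipeline (preprocessing plus ensemble) can be persisted and reloaded as a single object. A minimal sketch using joblib; the `pipe` and `X_test` names refer to the full example below, and the filename is illustrative:

```python
import joblib

# Persist the fitted pipeline (preprocessing + ensemble) as one artifact
joblib.dump(pipe, 'titanic_ensemble.joblib')

# Reload it later and predict directly on raw, unprocessed rows
loaded_pipe = joblib.load('titanic_ensemble.joblib')
predictions = loaded_pipe.predict(X_test)
```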
**Evaluation** The pipeline is trained, predictions are made on the test set, and accuracy is reported.
**Customization**
- Add More Ensemble Models: include other sklearn-compatible classifiers in the VotingClassifier.
- Hyperparameter Tuning: use GridSearchCV or RandomizedSearchCV with the pipeline for parameter optimization (see the sketch after the full example below).
- Feature Engineering: add more complex feature engineering steps as needed for your specific dataset.
This approach is modular and scalable, leveraging scikit-learn's Pipeline for clean, maintainable, and reproducible machine learning workflows.
Here's a step-by-step example using the Titanic dataset. This code assumes you have already downloaded train.csv from Kaggle; the pipeline itself handles the basic cleaning (dropping uninformative columns, imputing missing values, and encoding categorical features).
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
# Load data
df = pd.read_csv('train.csv')
# Feature selection: drop identifier-like columns and separate the target
X = df.drop(['PassengerId', 'Name', 'Ticket', 'Survived'], axis=1)
y = df['Survived']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Define numeric and categorical features
numeric_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Cabin']
# Preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
# Define ensemble models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Build a VotingClassifier with majority voting
ensemble = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb)],
    voting='hard'  # or 'soft' for probability-based voting
)
# Build full pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('ensemble', ensemble)
])
# Tune hyperparameters through the full pipeline so preprocessing is
# re-fit inside each cross-validation fold (tuning the bare classifiers
# on the raw data would fail before imputation and encoding)
param_grid = {
    'ensemble__rf__n_estimators': [50, 100, 150],
    'ensemble__gb__n_estimators': [50, 100, 150],
    # Add more hyperparameters as needed
}
grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
best_pipe = grid.best_estimator_

# Evaluate the tuned ensemble on the held-out test set
y_pred = best_pipe.predict(X_test)
print(f"Ensemble Accuracy: {accuracy_score(y_test, y_pred):.4f}")
```
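As a sketch of the customization ideas listed earlier, the snippet below continues from the example above, adds a third classifier to the ensemble, and switches to RandomizedSearchCV. The LogisticRegression estimator and the sampled parameter ranges are illustrative assumptions, not part of the original workflow:

```python
from scipy.stats import randint
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Add another sklearn-compatible classifier to the VotingClassifier
pipe.set_params(ensemble__estimators=[
    ('rf', rf),
    ('gb', gb),
    ('lr', LogisticRegression(max_iter=1000)),  # illustrative addition
])

# Randomized search samples n_iter parameter combinations rather than
# exhaustively trying a full grid
param_distributions = {
    'ensemble__rf__n_estimators': randint(50, 300),
    'ensemble__gb__learning_rate': [0.01, 0.05, 0.1],
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=20, cv=5, random_state=42
)
search.fit(X_train, y_train)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```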
This article builds on the pipelines introduced in a previous article on streamlining machine learning workflows. By combining models such as Random Forest and Gradient Boosting through ensemble learning in scikit-learn, we can make more accurate predictions on the Titanic dataset.
The approach is also easy to customize: include other sklearn-compatible classifiers in the VotingClassifier, and fine-tune parameters with GridSearchCV or RandomizedSearchCV for optimal performance.