Scikit-learn Overview & Index
Scikit-learn is a premier Python library for machine learning, built on top of NumPy, SciPy, and matplotlib. It provides simple and efficient tools for predictive data analysis.
Core Capabilities
Scikit-learn is organized into several key areas:
1. Supervised Learning
- Classification: Identifying which category an object belongs to (e.g., SVM, Random Forest, Logistic Regression).
- Regression: Predicting a continuous-valued attribute (e.g., Ridge, Lasso). See also Tree and SVM for regressor variants.
- Ensemble Methods: Combining the predictions of several base estimators (e.g., Boosting, Bagging).
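As a small illustration of the supervised side, a minimal sketch (dataset and estimator choices here are illustrative, not prescribed by this cheatsheet) showing that a regressor uses the same fit/score interface as a classifier:

```python
# Illustrative sketch: Ridge regression on a built-in toy dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = Ridge(alpha=1.0)            # L2-regularized linear regression
reg.fit(X_train, y_train)         # same fit() interface as classifiers
score = reg.score(X_test, y_test) # R^2 on held-out data
print(score)
```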
2. Unsupervised Learning
- Clustering: Automatic grouping of similar objects into sets (e.g., K-Means, Spectral Clustering).
- Decomposition: Reducing the number of random variables to consider (e.g., PCA, ICA).
- Covariance Estimation: Estimating the covariance structure among features, including robust and shrinkage variants (e.g., Ledoit-Wolf, Minimum Covariance Determinant).
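The unsupervised tools above can be sketched on a toy dataset (the dataset and hyperparameters here are illustrative assumptions):

```python
# Illustrative sketch: clustering and decomposition without using labels.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# K-Means: group the samples into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# PCA: project the 4 features onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

print(labels.shape, X_2d.shape)  # (150,) (150, 2)
```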
3. Model Building & Selection
- Preprocessing: Feature extraction and normalization.
- Model Selection: Comparing, validating, and choosing parameters and models (e.g., Grid Search, Cross-Validation).
- Pipeline/Compose: Chaining estimators and transformers.
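Putting these three areas together, a hedged sketch (the scaler, estimator, and parameter grid are illustrative choices) of chaining a transformer and an estimator in a Pipeline, then tuning it with grid search. Note that pipeline step parameters are addressed with the `<step>__<param>` naming convention:

```python
# Illustrative sketch: preprocessing + estimator in one Pipeline, tuned via GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),  # preprocessing step
    ("svm", SVC()),               # final estimator
])

# Step parameters are referenced as <step_name>__<parameter>
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```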
Typical Workflow (The fit/predict Pattern)
Almost all objects in Scikit-learn share a uniform interface:
- Estimators:
model.fit(X_train, y_train)
- Predictors:
model.predict(X_test)
- Transformers:
transformer.transform(X) or transformer.fit_transform(X)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# 0. Load data and split into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# 1. Initialize
model = RandomForestClassifier(random_state=42)
# 2. Fit
model.fit(X_train, y_train)
# 3. Predict
predictions = model.predict(X_test)
# 4. Evaluate
print(accuracy_score(y_test, predictions))
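The transformer side of the interface can be sketched the same way (the scaler and toy array below are illustrative assumptions):

```python
# Illustrative sketch: fit_transform() learns parameters and applies them in one call.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # equivalent to scaler.fit(X).transform(X)
print(X_scaled.mean(axis=0))        # each column standardized to mean 0
```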
Detailed Cheatsheets
For deeper dives into specific modules, see the following:
- Data Handling: Datasets, Preprocessing, Impute
- Supervised: Linear Models, SVM, Tree, Neighbors, Neural Networks
- Unsupervised: Manifold, Mixture Models, Biclustering
- Advanced: Feature Selection, Inspection, Kernel Approximation
- Meta-Estimators: Multiclass, Multioutput, Calibration
- Developer Guide: Developing Estimators
Maintained in the sklearn/ directory.