Scikit-learn Cheatsheet: Clustering

Clustering is an unsupervised learning task that groups a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.

Key Algorithms

  1. KMeans:
    • Simplest and most common. Minimizes within-cluster sum-of-squares.
    • MiniBatchKMeans: Faster version for large datasets.
  2. DBSCAN / HDBSCAN:
    • Density-based. Can find non-spherical clusters and marks outliers as noise (label -1). HDBSCAN additionally handles clusters of varying density.
  3. AgglomerativeClustering:
    • Hierarchical. Can incorporate “connectivity constraints” to group only adjacent points.
  4. MeanShift:
    • Centroid-based. Finds peaks in a distribution; chooses number of clusters automatically.
  5. AffinityPropagation:
    • Based on message passing between data points.

Evaluation Metrics (Internal)

Internal metrics need only the data and the predicted labels (no ground truth):

  • Silhouette Coefficient: ranges from -1 to 1; higher is better.
  • Calinski-Harabasz Index: ratio of between- to within-cluster dispersion; higher is better.
  • Davies-Bouldin Index: average similarity of each cluster to its closest cluster; lower is better.

Computational Complexity

  • KMeans / MiniBatchKMeans: scale to large n_samples; MiniBatchKMeans trades some cluster quality for speed.
  • DBSCAN: roughly O(n log n) with a spatial index; worst case O(n²).
  • AgglomerativeClustering and AffinityPropagation: at least O(n²) memory, so impractical for large datasets.

Code Snippet: Clustering & Evaluation

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Scaling is CRITICAL for distance-based clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 1. K-Means
kmeans = KMeans(n_clusters=3, n_init='auto', random_state=42)
labels_kmeans = kmeans.fit_predict(X_scaled)
print("K-Means Silhouette:", silhouette_score(X_scaled, labels_kmeans))

# 2. DBSCAN
# eps: maximum distance between two samples for one to be considered as in the neighborhood of the other.
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_dbscan = dbscan.fit_predict(X_scaled)  # noise points are labeled -1
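When scoring DBSCAN output, remember that noise points carry the label -1 and should usually be excluded before computing internal metrics. A self-contained sketch (synthetic blobs, not the X above; eps is illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Drop noise (-1) before scoring; silhouette needs at least 2 clusters
mask = labels != -1
n_clusters = len(set(labels[mask]))
print("Clusters found:", n_clusters, "| noise points:", int((~mask).sum()))

if n_clusters >= 2:
    print("Silhouette (non-noise):", silhouette_score(X_scaled[mask], labels[mask]))
```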

Credits: This cheatsheet is based on the scikit-learn documentation and examples, which are licensed under the BSD 3-Clause License. Copyright (c) 2007 - 2026 The scikit-learn developers. All rights reserved.