
10-DataMining: This repo implements the Mean Shift clustering algorithm, which finds clusters by shifting points toward higher density areas without needing a preset number of clusters. It includes implementation code, comparisons with K-Means, and applications like video tracking and face recognition.


Quantum-Software-Development/10-DataMining_MeanShift






Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics






Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.









This repository contains a detailed study and implementation of the Mean Shift clustering algorithm focusing on its average displacement mechanism. Mean Shift is an unsupervised learning algorithm widely used for discovering clusters in a dataset, especially when the number of clusters is unknown a priori.




Mean Shift is a non-parametric, iterative clustering algorithm that identifies cluster centers by locating peaks (modes) in the data's probability density. Unlike K-Means, it does not require specifying the number of clusters in advance, which makes it flexible for arbitrarily shaped clusters and robust to outliers.

The algorithm works by iteratively shifting data points toward regions of higher density, making it a powerful tool for unsupervised learning and exploratory data analysis.



The core concept of Mean Shift revolves around "average displacement": iteratively computing the center of mass (mean) of the data points within a window (kernel) around each candidate point and shifting the point to this new average position.




  1. Initialization: Consider each data point as a potential cluster center.

  2. Density Estimation: For each point, define a window (kernel) with a bandwidth radius and calculate the mean of the points within this window.

  3. Shift: Move the point to the calculated mean position, effectively shifting it towards higher data density.

  4. Convergence: Repeat the estimating and shifting steps until the movement (displacement) between iterations falls below a threshold, indicating convergence at a local density maximum.

  5. Clustering: Points settle around local density peaks, and clusters are formed based on proximity.
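The five steps above can be sketched in plain NumPy. This is a minimal illustration, not the repository's implementation: it assumes a flat kernel (every point inside the window counts equally), and the function name `mean_shift`, the tolerance `tol`, and the rule of merging modes that land within one bandwidth of each other are choices made only for this example.

```python
import numpy as np

def mean_shift(X, bandwidth, tol=1e-4, max_iter=300):
    """Flat-kernel Mean Shift sketch; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()                      # Step 1: every point is a candidate center
    for i in range(len(modes)):
        x = modes[i]
        for _ in range(max_iter):
            # Step 2: mean of the points inside the window around x
            in_window = np.linalg.norm(X - x, axis=1) <= bandwidth
            new_x = X[in_window].mean(axis=0)
            displacement = np.linalg.norm(new_x - x)
            x = new_x                     # Step 3: shift to the window mean
            if displacement < tol:        # Step 4: converged
                break
        modes[i] = x

    # Step 5: points whose modes ended up close together share a cluster
    centers = []
    labels = np.empty(len(X), dtype=int)
    for i, mode in enumerate(modes):
        for k, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[i] = k
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Assumed toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = mean_shift(X, bandwidth=1.5)
print(len(centers), "clusters found")
```

Each point is shifted independently here, which keeps the code close to the step list but costs O(n²) distance computations per iteration.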



  • No need to define the number of clusters in advance.

  • Can discover clusters of arbitrary shapes.

  • Robust to outliers and noise due to density-based approach.



Mathematically, given a point $x$, the mean shift vector $m(x)$ at iteration $t$ is:



$$ \Huge m(x^{(t)}) = \frac{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)}) x_i}{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)})} - x^{(t)} $$



Where: $N(x^{(t)})$ is the neighborhood within bandwidth, and $K$ is the kernel function weighting points by proximity. The new position is then:



$$ \Huge x^{(t+1)} = x^{(t)} + m(x^{(t)}) $$
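As a small worked example of the two formulas, the sketch below computes $m(x^{(t)})$ and $x^{(t+1)}$ for four one-dimensional points. The helper name `mean_shift_vector` is hypothetical. By default it assumes a flat kernel ($K = 1$ inside the bandwidth); the optional `kernel` argument is applied to the distance divided by the bandwidth, which is one possible convention among several.

```python
import numpy as np

def mean_shift_vector(x, points, bandwidth, kernel=None):
    """m(x): weighted mean of the neighbors N(x) minus x."""
    x = np.asarray(x, dtype=float)
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - x, axis=1)
    mask = dist <= bandwidth                     # N(x): points within the bandwidth
    if kernel is None:
        weights = np.ones(mask.sum())            # flat kernel, K = 1
    else:
        weights = kernel(dist[mask] / bandwidth) # K evaluated on the scaled distance
    weighted_mean = (weights[:, None] * points[mask]).sum(axis=0) / weights.sum()
    return weighted_mean - x

points = np.array([[1.0], [2.0], [3.0], [10.0]])
x_t = np.array([1.0])

# Flat kernel: neighbors are 1, 2 and 3, their mean is 2, so m = 2 - 1 = 1
m = mean_shift_vector(x_t, points, bandwidth=2.5)
x_next = x_t + m                                 # x^(t+1) = x^(t) + m(x^(t)) = 2

# Gaussian kernel: closer neighbors weigh more, so the step is shorter
gaussian = lambda u: np.exp(-0.5 * u ** 2)
m_gauss = mean_shift_vector(x_t, points, bandwidth=2.5, kernel=gaussian)
```

The point at 10 lies outside the window and does not influence the step, which is the mechanism behind the robustness to outliers mentioned earlier.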



Tip

The process groups points around local density maxima, producing clusters that naturally fit the shape of the data without a predefined number of clusters.




In this repository, we will implement the Mean Shift algorithm using Python. The process will include:


- Creating synthetic datasets for testing.

- Estimating the bandwidth parameter using heuristics.

- Running the Mean Shift iterations based on average displacement until convergence.

- Visualizing clusters and their centers.

- Comparing results with K-Means clustering to evaluate differences.
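One common bandwidth heuristic is to average, over all points, the distance to the k-th nearest neighbor, with k chosen as a fraction (quantile) of the dataset size. The sketch below illustrates that idea in NumPy. The helper `knn_bandwidth` is hypothetical and only loosely modeled on the `quantile` parameter of scikit-learn's `estimate_bandwidth`; it is not claimed to reproduce that function's output.

```python
import numpy as np

def knn_bandwidth(X, quantile=0.3):
    """Mean distance from each point to its k-th nearest neighbor (self included)."""
    X = np.asarray(X, dtype=float)
    k = max(1, int(len(X) * quantile))
    # Full pairwise distance matrix: fine for small data, O(n^2) memory
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)            # column 0 is each point's zero distance to itself
    return dists[:, k - 1].mean()

# Assumed toy data: two groups of 100 points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

bandwidths = {q: knn_bandwidth(X, q) for q in (0.1, 0.3, 0.5)}
for q, bw in bandwidths.items():
    print(f"quantile={q}: bandwidth={bw:.2f}")
```

A larger quantile gives a larger window, which tends to merge nearby density peaks into fewer clusters; a smaller one tends to split them.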





  • Image Segmentation: Clustering pixels based on color or intensity to segment images without fixed segment numbers.

  • Object Tracking in Video: Dynamically identifying and tracking objects in video streams, useful in surveillance and autonomous driving.

  • Customer Segmentation in Marketing: Analyzing behavioral and demographic data to identify customer groups for targeted marketing strategies.

  • Face Recognition: Segmenting facial features or tracking faces dynamically in video frames.



Important

Its robustness to noise and its adaptability to complex cluster shapes make Mean Shift valuable wherever accurate, flexible clustering is needed, in domains such as computer vision, video analytics, and personalized marketing.
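To make the image segmentation use case concrete, the sketch below clusters the pixel colors of a tiny synthetic image with scikit-learn's `MeanShift`. The image, the noise level, and the bandwidth of 0.3 are assumptions chosen for the example; a real image would normally need its bandwidth estimated and its pixels subsampled.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)

# Synthetic 20x20 RGB image (assumed data): reddish left half, bluish right half
image = np.empty((20, 20, 3))
image[:, :10] = [0.9, 0.1, 0.1]
image[:, 10:] = [0.1, 0.1, 0.9]
image += rng.normal(0.0, 0.02, size=image.shape)

# Each pixel becomes one sample with three color features
pixels = image.reshape(-1, 3)
model = MeanShift(bandwidth=0.3, bin_seeding=True)
segments = model.fit_predict(pixels).reshape(20, 20)

n_segments = len(model.cluster_centers_)
print("segments found:", n_segments)
```

No segment count is passed in: the number of segments follows from how many color modes the bandwidth can resolve.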





| Feature | Mean Shift | K-Means |
| --- | --- | --- |
| Number of clusters | Automatically determined | Must be predefined |
| Cluster shape | Can find arbitrarily shaped clusters | Assumes spherical clusters |
| Sensitivity to outliers | Robust due to density-based approach | Sensitive to outliers |
| Convergence | Iterative shifting until displacement below threshold | Iterative centroid update |
| Use case example | Object tracking in video, image segmentation | Market segmentation with known groups |
| Computational complexity | Generally higher, depends on bandwidth and dataset size | Efficient for large datasets |
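The "number of clusters" row can be illustrated with a short comparison. The data below (three compact groups), the bandwidth of 2.0, and the deliberately wrong choice of `n_clusters=2` for K-Means are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Assumed toy data: three compact groups of 60 points each
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([c + rng.normal(0.0, 0.4, size=(60, 2)) for c in blob_centers])

# Mean Shift only receives a bandwidth; the cluster count is an output
mean_shift = MeanShift(bandwidth=2.0).fit(X)

# K-Means returns exactly the number of clusters it is asked for
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

n_found_by_mean_shift = len(mean_shift.cluster_centers_)
n_returned_by_kmeans = len(kmeans.cluster_centers_)
print("Mean Shift:", n_found_by_mean_shift, "| K-Means:", n_returned_by_kmeans)
```

With a wrong `n_clusters`, K-Means has to merge or split groups, while Mean Shift shifts the same kind of decision onto the bandwidth, which is why the bandwidth choice matters so much.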



  • Adaptive clustering of unknown or complex data distributions

  • Video object tracking and dynamic recognition

  • Image processing for medical imaging, remote sensing

  • Customer behavior analysis without pre-labeled segments



Mean Shift excels in video analysis scenarios. Its dynamic window shift method can locate and follow moving objects frame-by-frame. This capability is essential in face recognition systems where the face position must be continuously updated despite changes in pose or lighting.
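The window-shifting idea behind tracking can be sketched without any video library. In the example below each "frame" is a synthetic weight map with one bright spot, and the window is repeatedly moved to the weighted centroid of what it currently covers. The helpers `shift_window` and `make_frame` are hypothetical, and in a real tracker the weight map would come from an appearance model of the target (for example a color histogram) rather than from a known position.

```python
import numpy as np

def shift_window(weights, center, radius, max_iter=20, tol=0.5):
    """Move a square window to the weighted centroid until it stops moving."""
    height, width = weights.shape
    cy, cx = center
    for _ in range(max_iter):
        y0, y1 = max(0, int(round(cy)) - radius), min(height, int(round(cy)) + radius + 1)
        x0, x1 = max(0, int(round(cx)) - radius), min(width, int(round(cx)) + radius + 1)
        patch = weights[y0:y1, x0:x1]
        total = patch.sum()
        if total == 0:
            break                                  # nothing to follow in this window
        ys, xs = np.mgrid[y0:y1, x0:x1]
        new_cy = (ys * patch).sum() / total        # weighted mean row
        new_cx = (xs * patch).sum() / total        # weighted mean column
        moved = np.hypot(new_cy - cy, new_cx - cx)
        cy, cx = new_cy, new_cx
        if moved < tol:
            break
    return cy, cx

def make_frame(obj_y, obj_x, shape=(60, 60), sigma=3.0):
    """Synthetic weight map: a Gaussian bump at the object's position."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((ys - obj_y) ** 2 + (xs - obj_x) ** 2) / (2 * sigma ** 2))

# The object moves 3 rows and 4 columns per frame
true_path = [(10 + 3 * t, 10 + 4 * t) for t in range(10)]
center = (10.0, 10.0)
track = []
for obj_y, obj_x in true_path:
    # Start each search from the position found in the previous frame
    center = shift_window(make_frame(obj_y, obj_x), center, radius=8)
    track.append(center)
```

The window only follows the object as long as the per-frame motion stays within its reach, which is why the window size plays the same role here as the bandwidth does in clustering.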




```python
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Create a synthetic dataset
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1, random_state=27)
df = pd.DataFrame(X, columns=["C1", "C2"])

# Estimate bandwidth
bandwidth = estimate_bandwidth(df, quantile=0.2, n_samples=500)

# Fit Mean Shift
model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
model.fit(df)
labels = model.labels_
centers = model.cluster_centers_

# Plot results
plt.scatter(df["C1"], df["C2"], c=labels, cmap="plasma", marker="p")
plt.scatter(centers[:, 0], centers[:, 1], s=250, c="blue", marker="X")
plt.title("Mean Shift")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
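As a possible follow-up to the example above, a fitted model can report how many clusters it found and assign new points to the nearest learned center. The sketch refits the same synthetic data without plotting; the two `new_points` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Same synthetic dataset and settings as in the example above
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1, random_state=27)
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
model = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

# The number of clusters is a result of the fit, not an input
n_clusters = len(model.cluster_centers_)
print("clusters found:", n_clusters)

# Assign previously unseen points (arbitrary values) to the nearest center
new_points = np.array([[0.0, 0.0], [5.0, -5.0]])
new_labels = model.predict(new_points)
print("labels for new points:", new_labels)
```

Because the count depends on the estimated bandwidth, it may differ from the five centers used to generate the data.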











Copyright 2025 Quantum Software Development. Code released under the MIT License.
