10- Data Mining / Mean Shift
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access the Data Mining Main Repository
If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.
- Introduction
- What Is Mean Shift?
- Step-by-Step Algorithm
- Key Advantages
- Applications
- Implementation Example
- References
This repository contains a detailed study and implementation of the Mean Shift clustering algorithm, focusing on its average displacement mechanism. Mean Shift is an unsupervised learning algorithm widely used to discover clusters in a dataset, especially when the number of clusters is not known in advance.
Mean Shift is a non-parametric, iterative clustering algorithm that identifies cluster centers by locating peaks in the data's probability density. Unlike K-Means, it does not require specifying the number of clusters, which makes it flexible for arbitrarily shaped clusters and robust to outliers.
The algorithm works by iteratively shifting data points toward regions of higher density, making it a powerful tool for unsupervised learning and exploratory data analysis.
The core concept of Mean Shift revolves around "average displacement": iteratively computing the center of mass (mean) of the data points within a window (kernel) around each candidate point and shifting the point to this new average position.
- Initialization: Consider each data point as a potential cluster center.
- Density Estimation: For each point, define a window (kernel) with a bandwidth radius and calculate the mean of the points within this window.
- Shift: Move the point to the calculated mean position, effectively shifting it towards higher data density.
- Convergence: Repeat the estimation and shift steps until the movement (displacement) between iterations falls below a threshold, indicating convergence at a local density maximum.
- Clustering: Points settle around local density peaks, and clusters are formed based on proximity.
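The five steps above can be sketched in plain NumPy. This is a minimal illustration rather than the repository's implementation: it assumes a flat (uniform) kernel, and the function name `mean_shift`, the tolerance `tol`, and the rule of merging converged points that lie within one bandwidth of each other are choices made only for this example.

```python
import numpy as np

def mean_shift(X, bandwidth, tol=1e-4, max_iter=300):
    """Flat-kernel mean shift (illustrative sketch). Returns (modes, labels)."""
    # Initialization: every data point starts as a candidate center
    points = X.astype(float).copy()
    for _ in range(max_iter):
        max_move = 0.0
        for i in range(len(points)):
            p = points[i].copy()
            # Density estimation: original data points inside the window around p
            in_window = np.linalg.norm(X - p, axis=1) <= bandwidth
            new_p = X[in_window].mean(axis=0)
            max_move = max(max_move, np.linalg.norm(new_p - p))
            # Shift: move the candidate to the window mean
            points[i] = new_p
        # Convergence: stop once the largest displacement is below the threshold
        if max_move < tol:
            break
    # Clustering: converged points closer than one bandwidth share a mode
    modes = []
    labels = np.empty(len(X), dtype=int)
    for i, p in enumerate(points):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth:
                labels[i] = k
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return np.array(modes), labels

# Two well-separated synthetic groups (values chosen for illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(50, 2)),
])
modes, labels = mean_shift(X, bandwidth=1.5)
print(len(modes), "modes found")
```

Each candidate is shifted using the original data `X`, not the other moving candidates, so the window mean always reflects the fixed density of the dataset.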
- No need to define the number of clusters in advance.
- Can discover clusters of arbitrary shapes.
- Robust to outliers and noise due to its density-based approach.
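Mathematically, given a point $x^{(t)}$ at iteration $t$, the mean shift vector $m(x^{(t)})$ is:

$$m(x^{(t)}) = \frac{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)})\, x_i}{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)})} - x^{(t)}$$

where $N(x^{(t)})$ is the neighborhood of $x^{(t)}$ within the bandwidth, and $K$ is the kernel function that weights points by proximity. The new position is then:

$$x^{(t+1)} = x^{(t)} + m(x^{(t)})$$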
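As a small numerical illustration of the update rule, the sketch below computes the mean shift vector as the kernel-weighted mean of the neighbors of $x$ minus $x$ itself, and then repeats the update. The Gaussian form of $K$, the helper name `mean_shift_vector`, and the five sample points are assumptions made for this example only.

```python
import numpy as np

def mean_shift_vector(x, X, bandwidth):
    """Kernel-weighted mean of the neighbors of x, minus x (illustrative)."""
    dist = np.linalg.norm(X - x, axis=1)
    neighbors = dist <= bandwidth                      # N(x): points within the bandwidth
    # K: Gaussian kernel, assumed here; closer points get larger weights
    weights = np.exp(-dist[neighbors] ** 2 / (2 * bandwidth ** 2))
    weighted_mean = weights @ X[neighbors] / weights.sum()
    return weighted_mean - x

# Four corners of a unit square plus one distant point
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
x = np.array([0.0, 0.0])

first_step = mean_shift_vector(x, X, bandwidth=2.0)
print("first mean shift vector:", first_step)

# Repeated updates: x <- x + m(x)
for _ in range(50):
    x = x + mean_shift_vector(x, X, bandwidth=2.0)
print("position after 50 updates:", x)
```

By symmetry the iteration settles at the center of the square, (0.5, 0.5). The point at (10, 10) lies outside the bandwidth and therefore has no influence on the result.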
Tip
The process groups points around local density maxima, producing clusters that follow the shape of the data without a predefined number of clusters.
In this repository, we will implement the Mean Shift algorithm using Python. The process will include:
- Creating synthetic datasets for testing.
- Estimating the bandwidth parameter using heuristics.
- Running the Mean Shift iterations based on average displacement until convergence.
- Visualizing clusters and their centers.
- Comparing results with K-Means clustering to evaluate differences.
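The last item in the list, the comparison with K-Means, can be sketched as follows. This is only one possible setup: the blob centers, `quantile=0.2`, and the use of the adjusted Rand index as the comparison metric are assumptions for this example, not choices documented by the repository.

```python
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (centers chosen for illustration)
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [4, 8]],
    cluster_std=0.5,
    random_state=0,
)

# Mean Shift: the number of clusters follows from the estimated bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)
ms = MeanShift(bandwidth=bandwidth).fit(X)

# K-Means: the number of clusters must be supplied up front
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

n_ms = len(ms.cluster_centers_)
ari_ms = adjusted_rand_score(y_true, ms.labels_)
ari_km = adjusted_rand_score(y_true, km.labels_)
print(f"Mean Shift: {n_ms} clusters, ARI = {ari_ms:.3f}")
print(f"K-Means:    3 clusters, ARI = {ari_km:.3f}")
```

On data this clean both methods should agree closely with the true labels. The practical difference is that K-Means was told there are three groups, while Mean Shift inferred its cluster count from the bandwidth.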
Mean Shift's capacity to discover clusters without prior assumptions makes it powerful in diverse real-world scenarios:
- Image Segmentation: Clustering pixels based on color or intensity to segment images without fixed segment numbers.
- Object Tracking in Video: Dynamically identifying and tracking objects in video streams, useful in surveillance and autonomous driving.
- Customer Segmentation in Marketing: Analyzing behavioral and demographic data to identify customer groups for targeted marketing strategies.
- Face Recognition: Segmenting facial features or tracking faces dynamically in video frames.
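To make the image segmentation case concrete, here is a minimal sketch that clusters the pixel colors of a small synthetic image. The image itself, the three-color palette, and the fixed `bandwidth=0.3` are assumptions for illustration; a real image would usually need its bandwidth estimated or tuned.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
height, width = 30, 30
palette = np.array([[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9]])

# Synthetic RGB image: three vertical color bands plus a little noise
image = np.empty((height, width, 3))
for band, color in enumerate(palette):
    image[:, band * 10:(band + 1) * 10] = color
image = np.clip(image + rng.normal(0.0, 0.02, image.shape), 0.0, 1.0)

# Treat every pixel as a point in RGB space and cluster by color
pixels = image.reshape(-1, 3)
model = MeanShift(bandwidth=0.3, bin_seeding=True).fit(pixels)

# Map each pixel to the color of its cluster center
label_map = model.labels_.reshape(height, width)
segmented = model.cluster_centers_[label_map]
n_segments = len(model.cluster_centers_)
print("segments found:", n_segments)
```

No segment count is passed in. The number of segments comes from how many color modes are separated by more than the bandwidth.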
Important
Its robustness to noise and adaptability to complex cluster shapes make Mean Shift valuable wherever accurate, flexible clustering is needed, in domains such as computer vision, video analytics, and personalized marketing.
| Feature | Mean Shift | K-Means |
|---|---|---|
| Number of clusters | Automatically determined | Must be predefined |
| Cluster shape | Can find arbitrarily shaped clusters | Assumes spherical clusters |
| Sensitivity to outliers | Robust due to density-based approach | Sensitive to outliers |
| Convergence | Iterative shifting until displacement below threshold | Iterative centroid update |
| Use case example | Object tracking in video, image segmentation | Market segmentation with known groups |
| Computational complexity | Generally higher, depends on bandwidth and dataset size | Efficient for large datasets |
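The "Sensitivity to outliers" row can be illustrated with a deliberately simple experiment: one dense group plus a single extreme point. The data, the fixed `bandwidth=2.0`, and the use of a one-cluster K-Means as the baseline are assumptions for this sketch. It relies on scikit-learn's `min_bin_freq` and `cluster_all=False` options so that an isolated point is neither seeded as a center nor forced into a cluster.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

rng = np.random.default_rng(1)
blob = rng.normal(0.0, 1.0, size=(100, 2))   # one dense group around the origin
outlier = np.array([[50.0, 50.0]])           # a single extreme point
X = np.vstack([blob, outlier])

# K-Means centroid is the mean of all assigned points, outlier included
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)

# Mean Shift: seed only bins with at least 2 points, leave orphans unlabeled (-1)
ms = MeanShift(
    bandwidth=2.0, bin_seeding=True, min_bin_freq=2, cluster_all=False
).fit(X)

blob_mean = blob.mean(axis=0)
km_error = np.linalg.norm(km.cluster_centers_[0] - blob_mean)
ms_error = np.linalg.norm(ms.cluster_centers_[0] - blob_mean)
outlier_label = ms.labels_[-1]

print(f"K-Means center offset from the group mean:    {km_error:.3f}")
print(f"Mean Shift center offset from the group mean: {ms_error:.3f}")
print("Mean Shift label of the outlier:", outlier_label)
```

The K-Means centroid is pulled toward the extreme point, while the Mean Shift center only sees the points inside its window. With `cluster_all=False`, points farther than one bandwidth from every center, including the outlier, receive the label `-1`.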
- Adaptive clustering of unknown or complex data distributions
- Video object tracking and dynamic recognition
- Image processing for medical imaging, remote sensing
- Customer behavior analysis without pre-labeled segments
Mean Shift excels in video analysis scenarios. Its dynamic window shift method can locate and follow moving objects frame-by-frame. This capability is essential in face recognition systems where the face position must be continuously updated despite changes in pose or lighting.
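The window-shifting idea behind tracking can be sketched with a toy simulation: in each frame, the window starts at the previous position and is shifted to the local mean of the new frame's points. Everything here is hypothetical (the helper `track_step`, the per-frame `velocity`, the clutter, and the window `radius`), and the 2D points merely stand in for pixels that a real tracker would weight by a color or feature model.

```python
import numpy as np

def track_step(points, center, radius, tol=1e-3, max_iter=50):
    """Shift a window center to the local mean of `points` until it stops moving."""
    for _ in range(max_iter):
        in_window = np.linalg.norm(points - center, axis=1) <= radius
        if not in_window.any():        # target lost: keep the last known position
            break
        new_center = points[in_window].mean(axis=0)
        moved = np.linalg.norm(new_center - center)
        center = new_center
        if moved < tol:
            break
    return center

rng = np.random.default_rng(42)
true_pos = np.array([10.0, 10.0])
velocity = np.array([0.8, 0.5])        # hypothetical motion per frame
center = true_pos.copy()               # tracker starts on the object
track = []

for frame in range(20):
    true_pos = true_pos + velocity
    target = rng.normal(true_pos, 0.5, size=(80, 2))    # points on the object
    clutter = rng.uniform(0.0, 40.0, size=(40, 2))      # background noise
    points = np.vstack([target, clutter])
    # Start from the previous position and climb to the nearby density peak
    center = track_step(points, center, radius=3.0)
    track.append(center)

print("true position:   ", true_pos)
print("tracked position:", center)
```

This works only while the object moves less than the window radius between frames. A faster object would leave the window and the tracker would lock onto whatever density remains nearby.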
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Create a synthetic dataset
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1, random_state=27)
df = pd.DataFrame(X, columns=["C1", "C2"])
# Estimate bandwidth
bandwidth = estimate_bandwidth(df, quantile=0.2, n_samples=500)
# Fit Mean Shift
model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
model.fit(df)
labels = model.labels_
centers = model.cluster_centers_
# Plot results
plt.scatter(df["C1"], df["C2"], c=labels, cmap="plasma", marker="p")
plt.scatter(centers[:, 0], centers[:, 1], s=250, c="blue", marker="X")
plt.title("Mean Shift")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
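The example above fixes `quantile=0.2` when estimating the bandwidth. Since the bandwidth largely determines how many clusters Mean Shift reports, a short sweep such as the one below can help when choosing it. The list of quantiles is an arbitrary choice for illustration, and the resulting counts depend on the data, so none are stated here.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Same synthetic dataset as in the example above
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1, random_state=27)

results = {}
for quantile in (0.05, 0.1, 0.2, 0.3, 0.5):
    # A larger quantile averages distances to more distant neighbors,
    # which yields a larger bandwidth
    bandwidth = estimate_bandwidth(X, quantile=quantile, n_samples=500)
    model = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
    results[quantile] = (bandwidth, len(model.cluster_centers_))
    print(f"quantile={quantile:.2f}  bandwidth={bandwidth:.2f}  "
          f"clusters={results[quantile][1]}")
```

Small bandwidths tend to split the data into many small clusters, while large ones merge neighboring groups, so it is worth inspecting a few values before settling on one.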
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
- THOMAS, C. Data Mining. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.
Copyright 2025 Quantum Software Development. Code released under the MIT License.