10- Data Mining / Mean Shift

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.

Introduction

This repository contains a detailed study and implementation of the Mean Shift clustering algorithm, focusing on its average displacement mechanism. Mean Shift is an unsupervised learning algorithm widely used for discovering clusters in a dataset, especially when the number of clusters is unknown a priori.

What is Mean Shift?

Mean Shift is a non-parametric, iterative clustering algorithm that identifies cluster centers by locating peaks in the data’s probability density. Unlike K-Means, it does not require specifying the number of clusters, making it flexible for arbitrarily shaped clusters and robust to outliers.

The algorithm works by iteratively shifting data points toward regions of higher density, making it a powerful tool for unsupervised learning and exploratory data analysis.

Average Displacement Explained

The core concept of Mean Shift is "average displacement": iteratively computing the center of mass (mean) of the data points within a window (kernel) around each candidate point, then shifting the point to this new average position.

Step-by-Step Algorithm:

1. Initialization: Consider each data point as a potential cluster center.
2. Density Estimation: For each point, define a window (kernel) with a bandwidth radius and calculate the mean of points within this window.
3. Shift: Move the point to the calculated mean position, effectively shifting it towards higher data density.
4. Convergence: Repeat the steps of estimating and shifting until the movement (displacement) between iterations falls below a threshold, indicating convergence at a local density maximum.
5. Clustering: Points settle around local density peaks and clusters are formed based on proximity.
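The five steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: it assumes a flat (uniform) kernel, and the names `mean_shift`, `group_modes`, `tol`, and `merge_radius` are hypothetical choices made for this example.

```python
import math


def mean_shift(points, bandwidth, tol=1e-4, max_iter=100):
    """Shift a copy of every point toward its local density peak (flat kernel)."""
    modes = []
    for start in points:              # 1. every data point is a candidate center
        x = list(start)
        for _ in range(max_iter):
            # 2. Density estimation: points inside the window of radius `bandwidth`.
            window = [p for p in points if math.dist(p, x) <= bandwidth]
            # 3. Shift: move to the mean of the window.
            new_x = [sum(coords) / len(window) for coords in zip(*window)]
            displacement = math.dist(new_x, x)
            x = new_x
            # 4. Convergence: stop once the displacement is below the threshold.
            if displacement < tol:
                break
        modes.append(x)
    return modes


def group_modes(modes, merge_radius):
    """5. Clustering: converged positions closer than `merge_radius` share a label."""
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) <= merge_radius:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Two small, well-separated groups (illustrative data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
modes = mean_shift(points, bandwidth=1.0)
centers, labels = group_modes(modes, merge_radius=0.5)
print(len(centers), labels)
```

Note that the number of clusters is never passed in; it falls out of the bandwidth and the data.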

Key Advantages

- No need to define the number of clusters in advance.
- Can discover clusters of arbitrary shapes.
- Robust to outliers and noise due to density-based approach.

Mathematically, given a point $x$ at position $x^{(t)}$ in iteration $t$, the mean shift vector $m(x^{(t)})$ is:

$$ \Huge m(x^{(t)}) = \frac{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)}) x_i}{\sum_{x_i \in N(x^{(t)})} K(x_i - x^{(t)})} - x^{(t)} $$

Where: $N(x^{(t)})$ is the neighborhood of $x^{(t)}$, that is, the data points $x_i$ within one bandwidth of it, and $K$ is the kernel function weighting points by proximity. The new position is then:

$$ \Huge x^{(t+1)} = x^{(t)} + m(x^{(t)}) $$
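To make the two formulas concrete, the sketch below evaluates a single update step. It assumes a Gaussian kernel $K(u) = \exp(-\lVert u \rVert^2 / (2h^2))$ with bandwidth $h$ and a neighborhood of radius $h$; both are illustrative assumptions, and `mean_shift_vector` is a hypothetical helper name.

```python
import math


def mean_shift_vector(x, points, bandwidth):
    """Return m(x): kernel-weighted mean of the neighborhood N(x), minus x."""
    num = [0.0] * len(x)
    den = 0.0
    for p in points:
        d = math.dist(p, x)
        if d > bandwidth:                              # p lies outside N(x)
            continue
        w = math.exp(-d * d / (2 * bandwidth ** 2))    # assumed Gaussian K(x_i - x)
        num = [n + w * c for n, c in zip(num, p)]
        den += w
    if den == 0.0:                                     # empty neighborhood: no shift
        return [0.0] * len(x)
    return [n / den - xc for n, xc in zip(num, x)]


# One-dimensional illustration: 10.0 is too far away to influence the step.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
x_t = (0.5,)
m = mean_shift_vector(x_t, points, bandwidth=2.0)
x_next = [xc + mc for xc, mc in zip(x_t, m)]           # x^(t+1) = x^(t) + m(x^(t))
print(m, x_next)
```

The shift is positive here because the point at 2.0 pulls the weighted mean to the right of 0.5, while the distant point at 10.0 is excluded from the neighborhood.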

Tip

The process groups points around local density maxima, producing clusters that follow the shape of the data without a predefined number of clusters.

Implementation Overview

In this repository, we will implement the Mean Shift algorithm using Python. The process will include:

- Creating synthetic datasets for testing.

- Estimating the bandwidth parameter using heuristics.

- Running the Mean Shift iterations based on average displacement until convergence.

- Visualizing clusters and their centers.

- Comparing results with K-Means clustering to evaluate differences.
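As an illustration of what a bandwidth heuristic can look like, the sketch below averages each point's distance to its k-th nearest neighbor, where k is a fraction (`quantile`) of the dataset size. It is loosely inspired by quantile-based estimators and is not the code of any particular library; `quantile_bandwidth` is a hypothetical name.

```python
import math


def quantile_bandwidth(points, quantile=0.3):
    """Average distance from each point to its k-th nearest neighbor."""
    n = len(points)
    k = max(1, int(n * quantile))          # neighbor rank implied by the quantile
    total = 0.0
    for p in points:
        # Sorted distances; index 0 is the point itself at distance 0.
        dists = sorted(math.dist(p, q) for q in points)
        total += dists[min(k, n - 1)]
    return total / n


# Four evenly spaced points on a line (illustrative data).
line = [(0.0,), (1.0,), (2.0,), (3.0,)]
bw_small = quantile_bandwidth(line, quantile=0.25)
bw_large = quantile_bandwidth(line, quantile=0.5)
print(bw_small, bw_large)
```

A larger quantile yields a wider window, which tends to merge nearby density peaks into fewer clusters.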

Applications and Market Uses

Mean Shift's capacity to discover clusters without prior assumptions makes it powerful in diverse real-world scenarios:

- Image Segmentation: Clustering pixels based on color or intensity to segment images without fixed segment numbers.
- Object Tracking in Video: Dynamically identifying and tracking objects in video streams, useful in surveillance and autonomous driving.
- Customer Segmentation in Marketing: Analyzing behavioral and demographic data to identify customer groups for targeted marketing strategies.
- Face Recognition: Segmenting facial features or tracking faces dynamically in video frames.

Important

Its robustness to noise and adaptability to complex cluster shapes make Mean Shift valuable in markets needing accurate, flexible clustering solutions in domains like computer vision, video analytics, and personalized marketing.

Comparison with K-Means

| Feature | Mean Shift | K-Means |
| --- | --- | --- |
| Number of clusters | Automatically determined | Must be predefined |
| Cluster shape | Can find arbitrarily shaped clusters | Assumes spherical clusters |
| Sensitivity to outliers | Robust due to density-based approach | Sensitive to outliers |
| Convergence | Iterative shifting until displacement below threshold | Iterative centroid update |
| Use case example | Object tracking in video, image segmentation | Market segmentation with known groups |
| Computational complexity | Generally higher, depends on bandwidth and dataset size | Efficient for large datasets |
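The first row of the table can be seen directly in code. The sketch below assumes scikit-learn is installed; the blob centers, `cluster_std`, and `bandwidth=3.0` are illustrative values chosen for this example, not settings taken from the repository's notebooks.

```python
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (illustrative values).
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# K-Means: the number of clusters is an input.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mean Shift: only a bandwidth is given; the cluster count is an outcome.
mean_shift = MeanShift(bandwidth=3.0).fit(X)

n_kmeans = len(kmeans.cluster_centers_)
n_mean_shift = len(mean_shift.cluster_centers_)
print("K-Means clusters (given):", n_kmeans)
print("Mean Shift clusters (found):", n_mean_shift)
```

With a poorly chosen bandwidth, Mean Shift may merge or split groups, so the bandwidth plays a role comparable to the choice of k in K-Means.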

Use Cases

- Adaptive clustering of unknown or complex data distributions
- Video object tracking and dynamic recognition
- Image processing for medical imaging, remote sensing
- Customer behavior analysis without pre-labeled segments

Video Application: Object Tracking and Face Recognition

Mean Shift excels in video analysis scenarios. Its dynamic window shift method can locate and follow moving objects frame-by-frame. This capability is essential in face recognition systems where the face position must be continuously updated despite changes in pose or lighting.
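The window-shifting idea can be sketched without any video library. The example below assumes each frame has already been converted into a 2D weight map (for instance, how closely each pixel matches the target's color); `make_frame` and `track_window` are hypothetical helpers written for this illustration. In practice, OpenCV provides `cv2.meanShift` for this task.

```python
def make_frame(size, top, left, side):
    """Build a size x size weight map with a square target of weight 1.0."""
    return [
        [1.0 if top <= i < top + side and left <= j < left + side else 0.0
         for j in range(size)]
        for i in range(size)
    ]


def track_window(weights, window, max_iter=20):
    """Shift a (row, col, height, width) window onto the centroid of its weights."""
    r, c, h, w = window
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        total = sum_r = sum_c = 0.0
        for i in range(r, min(r + h, rows)):
            for j in range(c, min(c + w, cols)):
                total += weights[i][j]
                sum_r += i * weights[i][j]
                sum_c += j * weights[i][j]
        if total == 0:
            break                       # nothing to follow inside the window
        # Re-center on the weighted centroid, rounded and clamped to the frame.
        new_r = min(max(int(sum_r / total - (h - 1) / 2 + 0.5), 0), rows - h)
        new_c = min(max(int(sum_c / total - (w - 1) / 2 + 0.5), 0), cols - w)
        if (new_r, new_c) == (r, c):
            break                       # window stopped moving: converged
        r, c = new_r, new_c
    return r, c, h, w


# Frame 1: target at rows 4-6, cols 4-6; the initial window only touches its corner.
frame1 = make_frame(10, top=4, left=4, side=3)
window1 = track_window(frame1, (2, 2, 3, 3))

# Frame 2: the target moved one pixel down and right; start from the last window.
frame2 = make_frame(10, top=5, left=5, side=3)
window2 = track_window(frame2, window1)
print(window1, window2)
```

Starting each frame from the previous window is what makes the method cheap: only a few shifts are needed as long as the target moves less than one window size between frames.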

Implementation Example

```python
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Create a synthetic dataset
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1, random_state=27)
df = pd.DataFrame(X, columns=["C1", "C2"])

# Estimate bandwidth
bandwidth = estimate_bandwidth(df, quantile=0.2, n_samples=500)

# Fit Mean Shift
model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
model.fit(df)
labels = model.labels_
centers = model.cluster_centers_

# Plot results
plt.scatter(df["C1"], df["C2"], c=labels, cmap="plasma", marker="p")
plt.scatter(centers[:, 0], centers[:, 1], s=250, c="blue", marker="X")
plt.title("Mean Shift")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```

Bibliography

1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.

3. Larson & Farber (2015). Applied Statistics. Pearson.

Complementary Bibliography

THOMAS, C. Data Mining. IntechOpen, 2018.
HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.

💌 Let the data flow... Ping Me !

🛸๋ My Contacts Hub

Copyright 2025 Quantum Software Development. Code released under the MIT License.
