An intelligent data profiling tool powered by LLMs that provides deep, contextual analysis of your datasets beyond traditional statistical metrics.
This tool performs comprehensive data profiling through a 7-step workflow:
- Duplicate Detection - Identifies and analyzes duplicate rows with recommendations
- Table Summary - Generates high-level description of what your data represents
- Column Descriptions - Analyzes each column with meaningful descriptions and naming suggestions
- Data Type Analysis - Recommends optimal data types for each column
- Missing Values Analysis - Categorizes missing values as meaningful vs problematic
- Uniqueness Analysis - Identifies potential unique identifier columns
- Unusual Values Detection - Detects outliers, anomalies, and data quality issues
- Install dependencies:
pip install -r requirements.txt
- Set up your LLM:
The tool uses OpenAI by default. Set your API key:
export OPENAI_API_KEY="your-key-here"
To use your own LLM or different providers, check out the PocketFlow LLM documentation and modify utils/call_llm.py
accordingly.
Test your LLM setup:
python utils/call_llm.py
python main.py
By default, it analyzes the sample patient dataset in test/patients.csv
. To analyze your own data, modify main.py
:
# Replace this line:
df = pd.read_csv("test/patients.csv")
# With your data:
df = pd.read_csv("path/to/your/data.csv")
The tool generates:
- Console summary with key statistics
- Markdown report saved as
data_profiling_report.md
with comprehensive analysis
From the sample patient dataset (60 rows, 27 columns):
- ✅ Detected invalid SSN formats (test data with "999" prefix)
- ✅ Identified name contamination (numeric suffixes in names)
- ✅ Found meaningful missing patterns (83% missing death dates = living patients)
- ✅ Recommended data type conversions (dates to datetime64, categories for demographics)
- ✅ Identified unique identifiers (UUID primary key, SSN)
Built with PocketFlow - a minimalist LLM framework:
- Workflow pattern for sequential processing pipeline
- BatchNode for efficient parallel column analysis
- YAML-based structured outputs with validation
- Intelligent LLM analysis for contextual understanding
├── main.py # Entry point
├── flow.py # Flow orchestrator
├── nodes.py # All profiling nodes
├── utils/
│ └── call_llm.py # LLM utility (customize for your provider)
├── test/
│ └── patients.csv # Sample dataset
└── docs/
└── design.md # Design documentation
Edit utils/call_llm.py
to use your preferred LLM:
- Claude (Anthropic)
- Google Gemini
- Azure OpenAI
- Local models (Ollama)
See the PocketFlow LLM guide for examples.
The tool works with any pandas DataFrame. You can:
- Load from CSV, Excel, JSON, Parquet
- Connect to databases
- Use API data
Just ensure your data is loaded as a pandas DataFrame before running the flow.
This project demonstrates Agentic Coding with PocketFlow. Want to learn more?
- Check out the Agentic Coding Guidance
- Watch the YouTube Tutorial
This project is a tutorial example for PocketFlow.