A simple Python tool I built to help read PDF and DOCX files directly in Cursor IDE. I was tired of not being able to process documents in my AI workflows, so I created this tool to extract text content from PDFs and Word documents.
- Reads PDF files (using pdfplumber, with PyPDF2 as a backup)
- Reads DOCX files (using python-docx)
- Extracts metadata like title, author, creation date
- Outputs in JSON or plain text format
- Handles errors gracefully
- Preserves document structure (pages/paragraphs)
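The primary-plus-backup behavior listed above can be pictured as a small fallback chain. This is only a minimal sketch of the idea, not the tool's actual code: `extract_with_fallback` and the `Reader` type are hypothetical names, and the readers are passed in as plain callables so the example does not depend on pdfplumber or PyPDF2 being installed.

```python
from typing import Callable, Iterable, Optional

# Hypothetical reader type: takes a file path and returns the extracted text.
Reader = Callable[[str], str]


def extract_with_fallback(path: str, readers: Iterable[Reader]) -> Optional[str]:
    """Try each reader in order and return the first non-empty result."""
    for reader in readers:
        try:
            text = reader(path)
        except Exception:
            # A failing reader should not abort the run; move on to the next one.
            continue
        if text and text.strip():
            return text
    # Every reader failed or returned only whitespace.
    return None
```

In a real setup the list would hold one function wrapping pdfplumber and one wrapping PyPDF2, in that order.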
- Clone this repo
- Install the Python dependencies:
pip install -r requirements.txt
That's it! No complex setup needed.
Just run it with a PDF or DOCX file:
# Read a PDF (defaults to JSON output)
python pdf_docx_reader.py document.pdf
# Read a DOCX file
python pdf_docx_reader.py document.docx
# Get plain text instead of JSON
python pdf_docx_reader.py document.pdf --output-format text
You can also pipe the output to files or use it in scripts:
# Save to file
python pdf_docx_reader.py document.pdf > output.json
# Process multiple files
for file in *.pdf; do
  python pdf_docx_reader.py "$file" --output-format text > "${file%.pdf}.txt"
done
- file_path: The PDF or DOCX file to read (required)
- --output-format: Choose json (default) or text
- --help: Show help
- --version: Show version
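If you would rather call the command line from a Python script than from the shell, something like the following might work. It is a sketch under a few assumptions: `build_command` and `read_document` are hypothetical helper names, and the script is assumed to sit in the current working directory as pdf_docx_reader.py. Only the --output-format flag and its json and text values come from the options above.

```python
import json
import subprocess
import sys
from typing import List


def build_command(file_path: str, output_format: str = "json") -> List[str]:
    """Build the argument list for the CLI (script location is an assumption)."""
    if output_format not in ("json", "text"):
        raise ValueError(f"unsupported output format: {output_format}")
    return [sys.executable, "pdf_docx_reader.py", file_path,
            "--output-format", output_format]


def read_document(file_path: str) -> dict:
    """Run the CLI with JSON output and parse what it prints to stdout."""
    completed = subprocess.run(
        build_command(file_path, "json"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.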
{
  "file_path": "/path/to/document.pdf",
  "file_type": "PDF",
  "pages": [
    {
      "page_number": 1,
      "text": "Page content here...",
      "char_count": 150
    }
  ],
  "full_text": "Complete document text...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "creation_date": "2024-01-01",
    "modification_date": "2024-01-02"
  },
  "page_count": 1
}
File: /path/to/document.pdf
Type: PDF
Pages: 1
Metadata:
title: Document Title
author: Author Name
creation_date: 2024-01-01
Content:
Complete document text here...
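To show how the JSON form might be consumed in a script, here is a minimal sketch that turns a result into a one-line summary. `summarize` is a hypothetical helper, and the sample data simply mirrors the field names from the JSON example above.

```python
import json


def summarize(result: dict) -> str:
    """Return a one-line summary built from the JSON fields shown above."""
    metadata = result.get("metadata") or {}
    title = metadata.get("title") or "(untitled)"
    # Sum the per-page character counts reported by the tool.
    total_chars = sum(page.get("char_count", 0) for page in result.get("pages", []))
    return f"{title}: {result.get('page_count', 0)} page(s), {total_chars} characters"


sample = json.loads("""
{
  "file_type": "PDF",
  "pages": [{"page_number": 1, "text": "Page content here...", "char_count": 150}],
  "metadata": {"title": "Document Title", "author": "Author Name"},
  "page_count": 1
}
""")

print(summarize(sample))  # Document Title: 1 page(s), 150 characters
```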
I built this specifically for Cursor IDE, so it works great there. Just open the terminal in Cursor and run:
python pdf_docx_reader.py your_document.pdf
You can also create a simple wrapper script if you want:
#!/bin/bash
# pdf_reader.sh
python /path/to/pdf_docx_reader.py "$1" --output-format text
Or use it in your Python code:
from pdf_docx_reader import FileReader
reader = FileReader()
data = reader.read_file("document.pdf")
print(data['full_text'])
The extension now includes automatic detection and processing for AI models! AI can now seamlessly process PDF/DOCX files without manual intervention.
- pdfDocxReader.autoDetectAndProcess - Automatically detects and processes any PDF/DOCX file
- pdfDocxReader.processForAI - Returns AI-optimized data structure with enhanced context
- pdfDocxReader.getAIReadyContent - Returns clean, AI-ready text content
// AI can now automatically process documents
const result = await vscode.commands.executeCommand(
  'pdfDocxReader.processForAI',
  '/path/to/document.pdf'
);
const data = JSON.parse(result);
if (data.ai_ready) {
  console.log(`Processing ${data.file_type} document:`);
  console.log(`Summary: ${data.summary}`);
  console.log(`Content: ${data.content}`);
}
The new AI commands return data in this enhanced format:
{
  "ai_ready": true,
  "file_path": "/path/to/document.pdf",
  "file_type": "PDF",
  "content": "Full document text...",
  "summary": "This is a PDF document with 5 pages containing 1,234 words...",
  "metadata": { "title": "Document Title", "author": "Author" },
  "structure": {
    "page_count": 5,
    "char_count": 12345,
    "word_count": 1234
  },
  "processed_at": "2024-01-01T12:00:00.000Z"
}
Perfect for AI workflows! 🚀
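The "structure" block in the enhanced format can be approximated from the plain CLI output. The sketch below is an assumption about how such counts could be derived, not a description of what the extension does internally: `structure_stats` is a hypothetical helper, and words are counted by splitting on whitespace.

```python
def structure_stats(result: dict) -> dict:
    """Derive page, character, and word counts from a CLI JSON result."""
    text = result.get("full_text", "")
    return {
        # Fall back to counting page entries if page_count is missing.
        "page_count": result.get("page_count", len(result.get("pages", []))),
        "char_count": len(text),
        # Assumption: a "word" is any run of non-whitespace characters.
        "word_count": len(text.split()),
    }
```

Whitespace splitting is crude for languages without spaces between words, so treat the word count as a rough figure.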
If something goes wrong:
- "PDF reading libraries not available"
  - Run pip install -r requirements.txt
- "File is not a PDF/DOCX"
  - Make sure the file has a .pdf or .docx extension
- Empty text extraction
  - Some PDFs are just images - you'll need OCR for those
  - Try the other PDF reader (it switches between pdfplumber and PyPDF2)
- Permission errors
  - Make sure the file isn't locked by another app
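For the empty-extraction case, a quick check on the JSON output can tell you which pages came back without text and so probably need OCR. This is a small sketch, assuming the "pages" layout from the JSON output format; `pages_without_text` is a hypothetical helper name.

```python
from typing import List


def pages_without_text(result: dict) -> List[int]:
    """Return the page numbers whose extracted text is empty or whitespace."""
    return [
        page["page_number"]
        for page in result.get("pages", [])
        if not page.get("text", "").strip()
    ]
```

If every page number shows up in the returned list, the PDF is most likely a scan made up of images.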
- pdfplumber>=0.9.0 - Main PDF reader
- PyPDF2>=3.0.0 - Backup PDF reader
- python-docx>=0.8.11 - DOCX reader
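To see which of these packages are missing before running the tool, a check along these lines may help. It is a sketch that only tests whether each module can be found, not whether its version meets the minimum above. `missing_packages` is a hypothetical helper, and note that python-docx is imported under the module name docx.

```python
import importlib.util
from typing import Dict, List

# Maps the pip package name to the module name it is imported as.
REQUIRED: Dict[str, str] = {
    "pdfplumber": "pdfplumber",
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```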
Reading a research paper:
python pdf_docx_reader.py research_paper.pdf --output-format text
Batch processing:
for pdf in *.pdf; do
  echo "Processing: $pdf"
  python pdf_docx_reader.py "$pdf" --output-format text > "${pdf%.pdf}.txt"
done
Get just the metadata:
python pdf_docx_reader.py document.pdf | jq '.metadata'
- Open VS Code or Cursor IDE
- Go to Extensions (Ctrl/Cmd + Shift + X)
- Search for "PDF/DOCX Reader"
- Click Install
- Download the latest VSIX from Releases
- Install from VSIX file
MIT License - feel free to use it however you want.
Found a bug? Have an idea? Open an issue or send a PR. I'm always looking to improve this tool.
Available on VS Code Marketplace: PDF/DOCX Reader