The Web Page Content Query System retrieves web page content and lets you query it in natural language. Built with LangChain, Ollama, and Chroma, it provides both a command-line interface and an interactive Streamlit web interface: load a web page, let the system process its content, and ask questions to receive AI-generated answers grounded in that content. Both embedding and answer generation run locally through Ollama.
Features:
- Load and analyze the content of any web page
- Split content into manageable chunks for processing
- Generate embeddings using Ollama's local embedding model
- Store and retrieve relevant content using Chroma vector database
- Query content using natural language questions
- Get AI-generated answers using local Ollama models
- Interactive web interface with Streamlit
- Colorful and intuitive UI design
Built with:
- LangChain v0.2+: Framework for building LLM applications
- Ollama: Runtime for running large language models locally, used for both embeddings and answer generation
- Chroma: Vector database for storing and retrieving embeddings
- Streamlit: Web interface framework
- BeautifulSoup4: Web scraping and HTML parsing
- Python 3.x: Programming language
Prerequisites:
- Python 3.x installed
- Ollama installed and running locally
- Git (for cloning the repository)
Installation:
- Clone the repository:
```bash
git clone <repository-url>
cd <repository-name>
```
- Install required packages:
```bash
pip install -r requirements.txt
```
- Ensure Ollama is running locally with the required models:
```bash
ollama pull llama3.1
```
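Answer generation uses the chat model pulled above. If the project's embedding step uses a dedicated Ollama embedding model rather than llama3.1 itself (the exact model name is an assumption here; nomic-embed-text is a common choice), pull that as well:

```bash
ollama pull nomic-embed-text
```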
Run the command-line version:
```bash
python rag_app.py
```
Follow the prompts to:
- Enter a webpage URL
- Ask questions about the content
- Type 'new' to analyze a different webpage
- Type 'quit' to exit
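In outline, that interaction loop looks like the sketch below. The helpers `build_vectorstore` and `answer_question` are hypothetical placeholders for the real loading and querying logic in rag_app.py (see the end-to-end sketch later in this README for what they would do):

```python
# Hypothetical outline of rag_app.py's prompt loop; function names and
# messages are illustrative, not the actual implementation.
def build_vectorstore(url: str):
    """Placeholder: load the page, split, embed, and index it."""
    return url

def answer_question(store, question: str) -> str:
    """Placeholder: retrieve relevant chunks and generate an answer."""
    return f"(answer about {store} for: {question})"

def main() -> None:
    while True:
        url = input("Enter a webpage URL: ").strip()
        store = build_vectorstore(url)
        while True:
            q = input("Question ('new' = different page, 'quit' = exit): ").strip()
            if q.lower() == "quit":
                return
            if q.lower() == "new":
                break
            print(answer_question(store, q))

if __name__ == "__main__":
    main()
```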
Run the Streamlit interface:
```bash
streamlit run streamlit_app.py
```
The web interface provides:
- URL input field for loading web pages
- Question input for querying content
- Clear button to reset the application
- Visual feedback for successful/failed operations
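A minimal skeleton for such an interface, using Streamlit's standard widgets. The widget labels and the imports from rag_app are hypothetical stand-ins for the project's real loading and querying functions:

```python
# Hypothetical skeleton of streamlit_app.py; labels and the rag_app
# imports are assumptions for illustration.
import streamlit as st

from rag_app import build_vectorstore, answer_question  # hypothetical helpers

st.title("Web Page Content Query System")

url = st.text_input("Web page URL")
if st.button("Load page") and url:
    try:
        st.session_state["store"] = build_vectorstore(url)
        st.success(f"Loaded {url}")  # visual feedback on success
    except Exception as exc:
        st.error(f"Failed to load page: {exc}")  # visual feedback on failure

question = st.text_input("Ask a question about the page")
if question and "store" in st.session_state:
    st.write(answer_question(st.session_state["store"], question))

if st.button("Clear"):
    st.session_state.clear()  # reset the application state
    st.rerun()
```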
Project structure:
- rag_app.py: Core RAG functionality and CLI interface
- streamlit_app.py: Streamlit web interface
- requirements.txt: Project dependencies
- chroma_db/: Directory for vector database storage
How it works (a compact end-to-end sketch follows the list):
- Web Page Loading: The application fetches and parses web page content using WebBaseLoader
- Content Processing: The text is split into manageable chunks using RecursiveCharacterTextSplitter
- Embedding Generation: Each chunk is converted to an embedding using Ollama's local embedding model
- Vector Storage: The embeddings are stored in a Chroma vector database
- Query Processing: The user's question is embedded and the most similar chunks are retrieved from the vector store
- Answer Generation: Ollama generates an answer from the retrieved chunks and the user's question
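Here is that pipeline as one compact sketch, assuming the LangChain v0.2 community integrations and a locally running Ollama server. The model names and chunking parameters are illustrative assumptions, not necessarily rag_app.py's settings:

```python
# End-to-end sketch of the six steps above (assumed models and sizes).
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Fetch and parse the page (WebBaseLoader uses BeautifulSoup4 under the hood).
docs = WebBaseLoader("https://example.com/article").load()

# 2. Split the text into manageable, overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3 & 4. Embed each chunk and persist the vectors in a local Chroma store.
embeddings = OllamaEmbeddings(model="nomic-embed-text")  # assumed embedding model
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")

# 5. Embed the question and retrieve the most similar chunks.
question = "What is the page's main argument?"
relevant = vectorstore.as_retriever(search_kwargs={"k": 4}).invoke(question)

# 6. Have the local model answer from the retrieved context.
context = "\n\n".join(doc.page_content for doc in relevant)
llm = Ollama(model="llama3.1")
print(llm.invoke(f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"))
```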
The application includes error handling for:
- Failed webpage loading
- API errors
- Invalid URLs
- Query processing issues
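The shape of that handling for the loading path might look like the sketch below; the exception types and messages are illustrative assumptions, not the actual code:

```python
# Illustrative error handling for page loading; rag_app.py's actual
# checks and messages may differ.
import requests
from langchain_community.document_loaders import WebBaseLoader

def load_page(url: str):
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Invalid URL: {url}")
    try:
        docs = WebBaseLoader(url).load()
    except requests.RequestException as exc:  # network or HTTP failure
        raise RuntimeError(f"Failed to load {url}: {exc}") from exc
    if not docs or not docs[0].page_content.strip():
        raise RuntimeError(f"No readable content found at {url}")
    return docs
```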
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.