DataTalk

DataTalk is an intelligent document analysis system built using Retrieval-Augmented Generation (RAG). It enables users to interact with documents through natural language while ensuring context-aware and accurate responses.

SYSTEM ARCHITECTURE

Document Ingestion

Loads PDF and plain-text files with LangChain's PyPDFLoader and TextLoader, which extract the text of each file into document objects for the rest of the pipeline.
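As a rough sketch of how a file could be routed to the right loader, the snippet below picks a loader by file extension. The names LOADER_BY_SUFFIX, loader_name_for, and load_documents are hypothetical and not taken from the DataTalk code. The lazy import assumes the langchain-community package is installed (plus pypdf for PDFs).

```python
from pathlib import Path

# Hypothetical mapping from file extension to LangChain loader class name.
LOADER_BY_SUFFIX = {".pdf": "PyPDFLoader", ".txt": "TextLoader"}


def loader_name_for(path: str) -> str:
    """Return the loader class name for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return LOADER_BY_SUFFIX[suffix]


def load_documents(path: str):
    """Load a file into LangChain Document objects.

    Assumption: langchain-community is installed. The import is done
    lazily so the dispatch logic above works without it.
    """
    from langchain_community import document_loaders

    loader_cls = getattr(document_loaders, loader_name_for(path))
    return loader_cls(path).load()
```

Calling load_documents("report.pdf") would then return a list of documents, while an unsupported extension fails early with a ValueError.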

Preprocessing Pipeline

Cleans and normalizes text while preserving semantic structure for downstream processing.
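The exact cleaning rules are not listed here, so the following is only a minimal sketch of what such a step might look like. It assumes that paragraph breaks (blank lines) are the structure worth preserving; clean_text is a hypothetical helper, and the hyphen-rejoining rule in particular is an assumption that can occasionally merge genuinely hyphenated words.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize extracted text while keeping paragraph breaks."""
    # Fold compatibility characters (ligatures, full-width forms).
    text = unicodedata.normalize("NFKC", raw)
    # Use a single newline convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line break (assumption).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Treat three or more newlines as one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # A single newline inside a paragraph is just a line wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()
```

Keeping the blank line between paragraphs matters for the next stage, because a separator-aware splitter can then prefer paragraph boundaries when it cuts the text.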

Chunking Strategy

Splits text with RecursiveCharacterTextSplitter (chunk size 300, overlap 30), so that neighboring chunks share their boundary text and context is not lost at the split points.
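To illustrate what a chunk size of 300 and an overlap of 30 mean, here is a simplified fixed-size sliding window. It is not the recursive algorithm itself: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line breaks before falling back to smaller units. The sketch assumes sizes are measured in characters, and sliding_window_chunks is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # each chunk starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks
```

With these settings each new chunk starts 270 characters after the previous one, so the last 30 characters of one chunk are repeated at the start of the next.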

Embedding Generation

Generates dense vector embeddings using all-MiniLM-L6-v2 (384-dimensional semantic vectors).

Vector Database

Stores embeddings in ChromaDB with efficient similarity-based indexing.

Semantic Retrieval

Uses Maximal Marginal Relevance (MMR) to retrieve document chunks that are relevant to the query but not redundant with one another.
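The selection rule behind MMR can be sketched in a few lines of plain Python. This is an illustrative implementation over small lists of floats, not the code path DataTalk runs. The names cosine and mmr_select are hypothetical, and lambda_mult is an assumed trade-off parameter, where 1.0 means pure relevance and lower values favor diversity.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Penalize similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Given two near-duplicate chunks and one distinct chunk, the rule picks the most relevant chunk first and then prefers the distinct one over the duplicate. In LangChain, a comparable behavior is typically requested from a vector store retriever with search_type="mmr".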

Context Optimization

Refines queries and ensures high-quality contextual input before response generation.

Answer Generation

Generates responses using LLaMA 3.1 (via Groq) with context-aware prompting.
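One way to picture context-aware prompting is a function that packs the retrieved chunks into the prompt ahead of the question. The template wording, the character budget, and the name build_prompt are all assumptions made for illustration; the actual prompt used with the model is not shown in this document.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 2000) -> str:
    """Assemble a grounded prompt from retrieved chunks and a user question."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop adding chunks once the (assumed) context budget is exceeded.
        if used + len(entry) > max_context_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

The resulting string would be sent to the model as the user message. Numbering the chunks makes it possible to ask the model to cite which passage supports its answer.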

TECH STACK

Python

Reflex

LangChain

Groq

ChromaDB

Transformers

RETRIEVAL-AUGMENTED GENERATION

RAG enhances traditional language models by grounding responses in external knowledge sources. Instead of relying solely on model memory, it retrieves relevant document context and combines it with generation, resulting in more accurate and reliable outputs.

KEY FEATURES

Multi-Format

PDF and TXT support

Semantic Search

Context-aware retrieval

Data Privacy

Local embedding and vector storage (answer generation calls the hosted Groq API)

Fast Response

Low-latency queries
