Compliance Document Management System
A powerful document management system designed for regulatory compliance, using MongoDB for document storage and Azure OpenAI for intelligent document analysis and question answering
Overview
This application helps organizations manage, analyze, and extract insights from compliance-related documents. It combines robust document storage with advanced AI capabilities to provide:
- Intelligent Document Storage: Store and organize documents with automatic metadata extraction
- Semantic Search: Find relevant document sections using natural language queries
- Document Q&A: Ask questions about your compliance documents and get accurate answers with citations
- Document Analysis: Extract key concepts, entities, and relationships from documents
- Vector-based Similarity: Find similar documents and concepts using embedded vector representations
- Table Parsing: Automatically detect and parse tabular compliance data
The system uses MongoDB for document storage and Azure OpenAI for AI capabilities, with a modular architecture that allows for flexible deployment and scaling.
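As an illustration of the vector-based similarity idea above, here is a minimal sketch that ranks stored chunks by cosine similarity to a query vector. The names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are assumptions made for this example; they are not part of this project's API, and real embeddings would come from the Azure OpenAI deployment.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk ids with toy embeddings (illustration only).
chunks = {
    "access-control": [1.0, 0.0, 0.0],
    "data-protection": [0.0, 1.0, 0.0],
    "mixed": [1.0, 1.0, 0.0],
}
ranked = top_k([1.0, 0.1, 0.0], chunks, k=2)
print(ranked)  # "access-control" ranks first, then "mixed"
```

A production deployment would more likely delegate this ranking to a vector index than scan every chunk in Python; the sketch only shows the scoring step.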
Key Features
- Document Processing Pipeline: Parse various document formats (PDF, DOCX, CSV, etc.)
- Smart Chunking: Break documents into semantic chunks for better retrieval
- Table Detection: Automatically identify and parse tabular compliance data
- Vector Embeddings: Store document embeddings for semantic similarity search
- MongoDB Integration: Scalable document storage with full-text search capabilities
- Azure OpenAI Integration: Advanced LLM capabilities for document understanding
- Type-Safe Codebase: Fully type-annotated Python with MyPy support
- Modular Architecture: Easily extensible with new document types and features
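To make the chunking feature more concrete, the following is a minimal sketch of fixed-size chunking with overlap. It is an assumption-laden stand-in, not the project's `TextProcessor`: the function `chunk_text` is hypothetical, it splits on character counts only, and the "smart" semantic boundaries described above are not modeled.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.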
Project Structure
compliance/
├── app/
│   ├── __init__.py              # Main package initialization
│   ├── config/                  # Configuration settings
│   │   ├── __init__.py
│   │   └── env.py
│   ├── database/                # Database connection and models
│   │   ├── __init__.py
│   │   ├── connection.py
│   │   ├── documents.py         # Document storage models
│   │   ├── vector_store.py      # Vector embedding storage
│   │   ├── test_connection.py
│   │   └── types.py
│   ├── schemas/                 # Data validation and normalization
│   │   ├── __init__.py
│   │   ├── document.py          # Document schema definitions
│   │   └── user.py
│   ├── services/                # Business logic
│   │   ├── __init__.py
│   │   ├── question_service.py  # Document Q&A service
│   │   └── document_service.py  # Document processing service
│   └── utils/                   # Utility functions
│       ├── __init__.py
│       └── text_processor.py    # Document chunking and processing
├── scripts/                     # Utility scripts
│   ├── setup_uv.sh              # Setup script for Unix/macOS
│   └── setup_uv.bat             # Setup script for Windows
├── main.py                      # Application entry point
├── requirements.txt             # Project dependencies
├── requirements-dev.txt         # Development dependencies
├── setup.py                     # Package installation script
├── mypy.ini                     # Type checking configuration
└── README.md                    # Project documentation
Prerequisites
- Python 3.11 or higher
- MongoDB running locally or accessible via connection string
- Azure OpenAI API access
- uv package manager (recommended)
Installation
Using the Setup Scripts (Recommended)
The easiest way to set up the project is to use the provided setup scripts:
On Unix/macOS:
./scripts/setup_uv.sh
On Windows:
scripts\setup_uv.bat
This will:
- Install uv if not already installed
- Create a virtual environment
- Install dependencies and development dependencies
- Install the package in development mode
Using uv Manually
If you prefer to set up manually:
- Install uv if you haven't already:
  pip install uv
- Create and activate a virtual environment:
  uv venv
  source .venv/bin/activate  # On Unix/macOS
  # or
  .venv\Scripts\activate  # On Windows
- Install dependencies:
  uv pip install -r requirements.txt
- For development, install additional dependencies:
  uv pip install -r requirements-dev.txt
- Install the package in development mode:
  uv pip install -e .
Configuration
The application uses the following environment variables:
Required
- OPENAI_API_KEY: Azure OpenAI API key
- OPENAI_ENDPOINT: Azure OpenAI endpoint
- OPENAI_DEPLOYMENT_NAME: Azure OpenAI model deployment name
Optional
- MONGO_URI: MongoDB connection string (default: mongodb://localhost:27017)
- MONGO_DB: MongoDB database name (default: compliance_db)
- DEBUG: Enable debug mode (default: False)
- ENVIRONMENT: Application environment (default: development)
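One way these variables could be read is sketched below. This is not the contents of `app/config/env.py`: the `Settings` dataclass, the `load_settings` function, and the rule for parsing `DEBUG` as a boolean are all assumptions made for illustration. Only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("OPENAI_API_KEY", "OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_NAME")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    openai_api_key: str
    openai_endpoint: str
    openai_deployment_name: str
    mongo_uri: str
    mongo_db: str
    debug: bool
    environment: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if a required value is absent."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        openai_endpoint=env["OPENAI_ENDPOINT"],
        openai_deployment_name=env["OPENAI_DEPLOYMENT_NAME"],
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        mongo_db=env.get("MONGO_DB", "compliance_db"),
        # Assumption: common truthy spellings enable debug mode.
        debug=env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        environment=env.get("ENVIRONMENT", "development"),
    )


# Example with placeholder values passed explicitly instead of the real environment.
example = load_settings({
    "OPENAI_API_KEY": "placeholder-key",
    "OPENAI_ENDPOINT": "https://example.invalid",
    "OPENAI_DEPLOYMENT_NAME": "placeholder-deployment",
})
print(example.mongo_uri, example.mongo_db, example.debug, example.environment)
```

Passing the mapping as a parameter keeps the loader testable without touching the process environment.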
Usage
Running the Application
python main.py
Document Processing and Storage
from app.database import DocumentStorage, VectorStore
from app.services.question_service import DocumentQA
from app.utils.text_processor import TextProcessor

# Initialize storage
doc_storage = DocumentStorage()
vector_store = VectorStore()

# Initialize text processor and QA service
text_processor = TextProcessor(chunk_size=1000, chunk_overlap=200)
qa_service = DocumentQA(doc_storage, vector_store)

# Process documents
# NOTE: `parser` must be a document parser instance created beforehand;
# its construction is not shown in this README.
results = qa_service.process_documents(
    ["path/to/document1.pdf", "path/to/document2.docx", "path/to/compliance_table.txt"],
    parser,
    text_processor
)

# Ask questions about your documents
answer = qa_service.answer_question(
    "What are the key compliance requirements for Identity & Access Management?",
    n_results=5
)
print(answer["answer"])

# Access source information
for source in answer["sources"]:
    print(f"Source: {source['source']}, Relevance: {source['relevance']}")
Table Parsing
The system can automatically detect and parse compliance tables in this format:
1 | Identity & Access Management | YES | Sybill uses Role-Based Access Control and the principle of least privilege... | access-control-policy
2 | Data Protection | PARTIAL | Encryption is implemented for data in transit, but at-rest encryption... | data-protection-policy
Each table row is stored as a separate chunk with structured metadata:
# Retrieve documents with tabular data
tabular_docs = doc_storage.find_chunks_by_metadata({"is_table_row": True})

# Filter by compliance category
identity_controls = doc_storage.find_chunks_by_metadata({
    "is_table_row": True,
    "control_category": "Identity & Access Management"
})

# Find all fully compliant controls
compliant_controls = doc_storage.find_chunks_by_metadata({
    "is_table_row": True,
    "compliance_status": "YES"
})
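For readers curious how a row in the pipe-delimited format above might become a chunk with metadata, here is a minimal sketch. The function `parse_table_row` is hypothetical and is not the project's parser. The metadata keys `is_table_row`, `control_category`, and `compliance_status` come from the queries above; `row_number` and `policy_reference` are assumed names for the remaining columns.

```python
from typing import Any, Dict, Optional


def parse_table_row(line: str) -> Optional[Dict[str, Any]]:
    """Parse 'number | category | status | description | reference' into a chunk dict.

    Returns None when the line does not look like a compliance table row.
    """
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 5 or not parts[0].isdigit():
        return None
    number, category, status, description, reference = parts
    return {
        "text": description,
        "metadata": {
            "is_table_row": True,
            "control_category": category,
            "compliance_status": status.upper(),
            "row_number": int(number),      # assumed key name
            "policy_reference": reference,  # assumed key name
        },
    }


row = parse_table_row(
    "1 | Identity & Access Management | YES | "
    "Sybill uses Role-Based Access Control and the principle of least privilege... | "
    "access-control-policy"
)
print(row["metadata"])
print(parse_table_row("This is ordinary prose, not a table row."))  # None
```

Note that this simple split would misparse a description that itself contains a `|` character; a real parser would need an escaping rule or a stricter column grammar.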
Getting Structured JSON Responses
# Define a schema for the response
schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "key_points": {
            "type": "array",
            "items": {"type": "string"}
        },
        "citations": {
            "type": "array",
            "items": {"type": "integer"}
        }
    }
}

# Get a structured response
result = qa_service.answer_question(
    "What are the key compliance requirements for data protection?",
    n_results=5,
    json_response=True,
    json_schema=schema
)
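Because the structured answer is generated by a language model, a caller may want to check that it matches the requested schema before using it. The sketch below is a deliberately small, standard-library-only check that understands just the `object`, `array`, `string`, and `integer` keywords used in the schema above; a full JSON Schema validator library would be the more usual choice. The function `matches_schema` and the sample payloads are hypothetical, and the exact shape of what `answer_question` returns is not specified in this README.

```python
from typing import Any, Dict

_TYPES = {"string": str, "integer": int, "object": dict, "array": list}


def matches_schema(value: Any, schema: Dict[str, Any]) -> bool:
    """Check value against a tiny subset of JSON Schema (type, properties, items)."""
    expected = schema.get("type")
    if expected is not None:
        py_type = _TYPES.get(expected)
        if py_type is None:
            raise ValueError(f"unsupported type keyword: {expected}")
        if not isinstance(value, py_type):
            return False
        if expected == "integer" and isinstance(value, bool):
            return False  # bool is a subclass of int in Python, but not a JSON integer
    if expected == "object":
        # Properties are optional here, as in JSON Schema without "required".
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not matches_schema(value[key], sub_schema):
                return False
    if expected == "array" and "items" in schema:
        return all(matches_schema(item, schema["items"]) for item in value)
    return True


schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
        "citations": {"type": "array", "items": {"type": "integer"}},
    },
}

# Hypothetical payloads, for illustration only.
good = {"answer": "Encrypt data in transit.", "key_points": ["encryption"], "citations": [1, 2]}
bad = {"answer": "Encrypt data in transit.", "key_points": ["encryption"], "citations": ["1"]}
print(matches_schema(good, schema), matches_schema(bad, schema))  # True False
```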
Using the Database Module
To use the MongoDB connection in other parts of your application:
# Import the db instance
from app.database import db
# Get a collection
collection = db.get_collection("your_collection_name")
# Perform operations
collection.insert_one({"key": "value"})
documents = collection.find({"key": "value"})
# Close connection when done (typically at application exit)
db.close()
Development
Type Checking
The project includes type annotations and can be type-checked using mypy:
mypy app
Running Tests
pytest
License
This project is licensed under the MIT License - see the LICENSE file for details.