Document Technical Notes
This page covers advanced topics for document management, including the processing pipeline, the embedding system, graph-based access control, and troubleshooting for common issues.
Document Processing Pipeline
The document ingestion pipeline handles multiple file formats through specialized extractors.
Text Extraction
- PDF documents - Uses pdf-parse for text extraction and page number preservation
- Microsoft Word - Uses mammoth to extract text while maintaining document structure
- Plain text and Markdown - Processed directly with UTF-8 encoding
Chunking
After extraction, the document is chunked into overlapping segments using a token-based approach. Each chunk maintains metadata about its position in the original document, including page numbers (for PDFs) and character offsets.
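The overlapping-window idea can be sketched as follows. This is a minimal illustration, not the pipeline's actual chunker: it treats whitespace-delimited words as a stand-in for model tokens, and the function name chunkText and the default chunkSize and overlap values are assumptions chosen for the example.

```typescript
// Position metadata kept for each chunk (field names mirror the storage section).
interface Chunk {
  position: number;   // 0-indexed sequence number
  content: string;    // chunk text
  charOffset: number; // offset of the chunk's first character in the source text
}

// Split text into overlapping windows of `chunkSize` tokens.
// Assumption: whitespace-delimited words approximate model tokens.
function chunkText(text: string, chunkSize = 200, overlap = 40): Chunk[] {
  if (chunkSize < 1 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize >= 1 and 0 <= overlap < chunkSize");
  }

  // Record the start and end offset of every token.
  const tokens: { start: number; end: number }[] = [];
  const tokenPattern = /\S+/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(text)) !== null) {
    tokens.push({ start: match.index, end: match.index + match[0].length });
  }

  const chunks: Chunk[] = [];
  const step = chunkSize - overlap; // how far each window advances
  for (let i = 0; i < tokens.length; i += step) {
    const last = Math.min(i + chunkSize, tokens.length) - 1;
    const start = tokens[i].start;
    chunks.push({
      position: chunks.length,
      content: text.slice(start, tokens[last].end),
      charOffset: start,
    });
    if (last === tokens.length - 1) break; // final window reached
  }
  return chunks;
}
```

Because each window advances by chunkSize minus overlap tokens, consecutive chunks share exactly `overlap` tokens, which keeps sentences that straddle a boundary searchable from either side.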
Embedding Generation
Embeddings are generated using OpenAI's text-embedding-3-small model (1536 dimensions). These vector representations capture the semantic meaning of each chunk, enabling similarity-based search that goes beyond keyword matching.
Chunk Storage
Document chunks are stored as nodes in Neo4j with the following properties:
- chunkId - Unique identifier for the chunk
- documentId - Parent document identifier
- content - The actual text content of the chunk
- position - Sequential position in the document (0-indexed)
- pageNumber - Page number (for PDFs)
- charOffset - Character offset in the original document
- embedding - 1536-dimensional vector embedding
- createdAt - Timestamp of chunk creation
Chunks are connected to their parent document via a CHUNK_OF relationship, and documents are connected to graphs via IN_GRAPH relationships. This structure enables efficient traversal and access control queries.
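To make the node and relationship layout concrete, here is a hedged sketch of the Cypher such a store might use, wrapped in TypeScript. The property names and the CHUNK_OF and IN_GRAPH relationship types come from the description above; the node labels (Chunk, Document, Graph), the graphId property, the relationship directions, and the helper buildCreateChunk are assumptions made for illustration.

```typescript
// Properties stored on each chunk node, as listed above.
interface ChunkNode {
  chunkId: string;
  documentId: string;
  content: string;
  position: number;          // 0-indexed
  pageNumber: number | null; // assumed null for non-PDF sources
  charOffset: number;
  embedding: number[];       // 1536 values
  createdAt: string;         // assumed ISO 8601 string
}

const EMBEDDING_DIMENSIONS = 1536;

// Assumed labels: Chunk and Document. Only CHUNK_OF is named in the text.
const CREATE_CHUNK_QUERY = [
  "MATCH (d:Document {documentId: $documentId})",
  "CREATE (c:Chunk)",
  "SET c = $props",
  "CREATE (c)-[:CHUNK_OF]->(d)",
  "RETURN c.chunkId AS chunkId",
].join("\n");

// Assumed: a Graph label with a graphId property, IN_GRAPH pointing document -> graph.
// One traversal returns only the chunks visible in a given graph.
const CHUNKS_IN_GRAPH_QUERY = [
  "MATCH (c:Chunk)-[:CHUNK_OF]->(d:Document)-[:IN_GRAPH]->(g:Graph {graphId: $graphId})",
  "RETURN c.chunkId AS chunkId, c.content AS content",
  "ORDER BY d.documentId, c.position",
].join("\n");

// Validate the embedding size and bundle the query with its parameters.
function buildCreateChunk(chunk: ChunkNode): {
  query: string;
  params: { documentId: string; props: ChunkNode };
} {
  if (chunk.embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(
      "expected " + EMBEDDING_DIMENSIONS + " dimensions, got " + chunk.embedding.length
    );
  }
  return {
    query: CREATE_CHUNK_QUERY,
    params: { documentId: chunk.documentId, props: chunk },
  };
}
```

The returned query and params pair is shaped so it could be handed to a Neo4j driver session, but nothing here talks to a database.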
Graph-Based Access Control
All document operations respect graph-based permissions:
Graph Types
- Personal Graph (user:{userId}) - Private documents accessible only by the owner
- Group Graphs (group:{groupId}) - Documents shared with group members
- User Contact Graphs (user_contact:{contactId}) - Documents shared in direct conversations
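A small parser makes the three graph ID formats explicit and is a cheap way to catch malformed IDs before sending a request. This is a sketch: parseGraphId and the GraphRef type are hypothetical names, and the assumption that the identifier part is any non-empty string is not confirmed by the source.

```typescript
type GraphType = "user" | "group" | "user_contact";

interface GraphRef {
  type: GraphType;
  id: string; // the userId, groupId, or contactId portion
}

// Parse "user:{userId}", "group:{groupId}", or "user_contact:{contactId}".
// Returns null when the string matches none of the three formats.
function parseGraphId(graphId: string): GraphRef | null {
  const match = /^(user|group|user_contact):(.+)$/.exec(graphId);
  if (match === null) return null;
  return { type: match[1] as GraphType, id: match[2] };
}
```

For example, parseGraphId("group:42") yields a reference with type "group" and id "42", while an unknown prefix such as "team:42" yields null.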
Sharing Model
Documents can exist in multiple graphs through the sharing mechanism:
- Documents are initially uploaded to a single "primary" graph
- Documents can be shared to additional graphs
- Users can only access documents in graphs they are members of
- Deleting a document removes it from all graphs
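The four rules above can be modeled in memory as a map from each document to the graphs it belongs to. This is an illustrative model only, with a hypothetical DocumentShares class; the real system stores these links as IN_GRAPH relationships in Neo4j.

```typescript
class DocumentShares {
  // documentId -> graph IDs containing the document (first entry is the primary graph)
  private graphsByDoc = new Map<string, string[]>();

  // Rule 1: a document starts in a single primary graph.
  upload(documentId: string, primaryGraphId: string): void {
    this.graphsByDoc.set(documentId, [primaryGraphId]);
  }

  // Rule 2: a document can be shared to additional graphs.
  share(documentId: string, graphId: string): void {
    const graphs = this.graphsByDoc.get(documentId);
    if (graphs === undefined) throw new Error("unknown document: " + documentId);
    if (graphs.indexOf(graphId) === -1) graphs.push(graphId);
  }

  // Rule 3: access requires membership in at least one graph holding the document.
  canAccess(documentId: string, memberOf: string[]): boolean {
    const graphs = this.graphsByDoc.get(documentId) ?? [];
    return graphs.some((g) => memberOf.indexOf(g) !== -1);
  }

  // Rule 4: deleting removes the document from every graph at once.
  delete(documentId: string): void {
    this.graphsByDoc.delete(documentId);
  }
}
```

Note the asymmetry the model makes visible: sharing adds one graph at a time, but deletion is global, so a document deleted by its owner disappears for every graph it was shared to.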
Search Quality
Similarity Scoring
Search results are ranked by cosine similarity (0.0 to 1.0):
- 0.9-1.0 - Highly relevant, very similar semantic meaning
- 0.8-0.9 - Relevant, similar concepts
- 0.7-0.8 - Somewhat relevant, related topics
- < 0.7 - May not be relevant, adjust threshold
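For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it and maps a score onto the bands listed above. The function names are hypothetical, and treating each band's lower bound as inclusive is an assumption, since the ranges as written share their endpoints.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map a score onto the relevance bands (lower bounds assumed inclusive).
function relevanceBand(score: number): string {
  if (score >= 0.9) return "highly relevant";
  if (score >= 0.8) return "relevant";
  if (score >= 0.7) return "somewhat relevant";
  return "may not be relevant";
}
```

In general cosine similarity can be negative for vectors pointing in opposing directions; the 0.0 to 1.0 range above describes the scores this search reports.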
Query Optimization
Good queries:
- "What were the key revenue drivers in Q4 2024?"
- "Describe the customer acquisition strategy"
- "How does the pricing model work?"
Poor queries:
- "revenue" (too short, lacks context)
- "What is everything about sales and marketing and customer success?" (too broad)
- Bare keywords intended for exact matching (search is semantic, so phrase the query as a natural-language question instead)
Supported File Formats
PDF
MIME Type: application/pdf
Limitations:
- Encrypted or password-protected PDFs cannot be extracted
- Scanned documents (images only) require OCR (not currently supported)
- Very old PDF versions (pre-1.4) may fail extraction
Microsoft Word
MIME Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Limitations:
- Only .docx format supported (not .doc)
- Complex formatting may not be preserved
- Embedded images are ignored
Plain Text
MIME Type: text/plain
Encoding: UTF-8
Markdown
MIME Type: text/markdown
Features:
- Syntax preserved for code blocks and headings
- Links and formatting maintained
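Since the MIME type must match the file format exactly, a client can derive it from the file extension before uploading. The lookup below is a sketch: the four MIME types are the ones listed above, but the extension mapping (in particular .txt and .md) and the helper name mimeTypeFor are assumptions.

```typescript
// Supported MIME types keyed by file extension (extension mapping is assumed).
const MIME_BY_EXTENSION: { [ext: string]: string } = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  txt: "text/plain",
  md: "text/markdown",
};

// Return the MIME type for a filename, or null if the format is unsupported.
function mimeTypeFor(filename: string): string | null {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return null; // no extension at all
  const ext = filename.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)
    ? MIME_BY_EXTENSION[ext]
    : null;
}
```

A legacy .doc file returns null here, which matches the limitation that only .docx is supported.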
Troubleshooting
Upload Failures
When document upload fails, verify:
- File encoding - Ensure file is properly Base64 encoded without newlines
- File size - Check that file is under 10MB limit
- MIME type - Verify MIME type matches file format exactly
- Graph permissions - Confirm you have write access to target graph
- File corruption - Test that original file opens correctly
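The first three checks can be run on the client before the request is sent. The following is a minimal sketch, assuming the 10MB limit means 10 × 1024 × 1024 bytes of decoded content; validateUpload is a hypothetical helper, not part of the API.

```typescript
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // assumption: "10MB" means 10 MiB, decoded

const SUPPORTED_MIME_TYPES = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "text/plain",
  "text/markdown",
];

// Return a list of problems; an empty list means the payload passed every check.
function validateUpload(base64: string, mimeType: string): string[] {
  const problems: string[] = [];

  if (/[\r\n]/.test(base64)) {
    problems.push("Base64 payload contains newlines");
  }

  // Check alphabet and padding on a whitespace-free copy.
  const compact = base64.replace(/\s/g, "");
  if (compact.length % 4 !== 0 || !/^[A-Za-z0-9+\/]*={0,2}$/.test(compact)) {
    problems.push("Payload is not valid Base64");
  }

  // Every 4 Base64 characters decode to 3 bytes, minus padding.
  const padding = /==$/.test(compact) ? 2 : /=$/.test(compact) ? 1 : 0;
  const decodedBytes = Math.floor(compact.length / 4) * 3 - padding;
  if (decodedBytes > MAX_UPLOAD_BYTES) {
    problems.push("File exceeds the 10MB limit");
  }

  if (SUPPORTED_MIME_TYPES.indexOf(mimeType) === -1) {
    problems.push("Unsupported MIME type: " + mimeType);
  }
  return problems;
}
```

Graph permissions and file corruption cannot be checked this way; those still need a server response or a manual test of the original file.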
Extraction Errors
When text extraction fails:
- PDF protection - Remove password protection before upload
- Scanned documents - Use external OCR tools to generate text first
- Corrupt files - Verify file is not corrupted by opening in native application
- Unsupported versions - Convert old formats to newer versions
Search Quality Issues
When search results are not relevant:
- Threshold too high - Lower similarity threshold (try 0.6 or 0.65)
- Query too short - Use more descriptive queries with 5-10 words
- Wrong graph - Verify you're searching the correct graph ID
- Recent upload - Allow a few seconds for embedding generation
- Language mismatch - Embedding model works best with English text
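The threshold advice can be applied automatically by starting strict and relaxing only when nothing passes. The sketch below assumes search hits carry a numeric score; the SearchHit shape, the function name, and the starting threshold of 0.7 are illustrative assumptions, while 0.65 and 0.6 are the fallback values suggested above.

```typescript
interface SearchHit {
  chunkId: string;
  score: number; // cosine similarity
}

// Try each threshold from strictest to loosest and return the first non-empty result set.
function filterWithFallback(
  hits: SearchHit[],
  thresholds: number[] = [0.7, 0.65, 0.6]
): { threshold: number; hits: SearchHit[] } {
  if (thresholds.length === 0) throw new Error("at least one threshold is required");
  for (const threshold of thresholds) {
    const kept = hits
      .filter((h) => h.score >= threshold)
      .sort((a, b) => b.score - a.score); // best match first
    if (kept.length > 0) return { threshold, hits: kept };
  }
  // Nothing passed even the loosest threshold.
  return { threshold: thresholds[thresholds.length - 1], hits: [] };
}
```

Returning the threshold that was actually used lets the caller tell the user when results came from a relaxed match.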
Permission Errors
When you cannot access a document:
- Graph membership - Verify you are a member of the graph
- Document removed - Document may have been unshared or deleted
- Incorrect graph ID - Double-check graph ID format
- Authentication - Ensure API key is valid and not expired
Performance Optimization
Batch Operations
Upload multiple documents concurrently for better performance.
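One way to do this without overwhelming the server is a small worker pool with a fixed concurrency limit. In the sketch below the upload function is passed in as a parameter, because the actual client call is not shown on this page; uploadAll, the Uploader type, and the default limit of 4 are all assumptions.

```typescript
// Any function that uploads one document and resolves to its ID (hypothetical shape).
type Uploader<T> = (doc: T) => Promise<string>;

interface BatchResult<T> {
  succeeded: string[];
  failed: { doc: T; error: unknown }[];
}

// Upload all documents with at most `concurrency` requests in flight.
async function uploadAll<T>(
  docs: T[],
  upload: Uploader<T>,
  concurrency = 4
): Promise<BatchResult<T>> {
  const result: BatchResult<T> = { succeeded: [], failed: [] };
  let next = 0; // index of the next document to claim

  // Each worker claims documents until none remain.
  async function worker(): Promise<void> {
    while (next < docs.length) {
      const doc = docs[next++];
      try {
        result.succeeded.push(await upload(doc));
      } catch (error) {
        result.failed.push({ doc, error }); // one failure does not stop the batch
      }
    }
  }

  const width = Math.max(1, Math.min(concurrency, docs.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < width; i++) workers.push(worker());
  await Promise.all(workers);
  return result;
}
```

Failures are collected rather than thrown, so a single rejected file can be retried or reported without re-uploading the rest. Successful IDs arrive in completion order, not input order.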
Pagination
Always use limit and offset when listing large document collections.
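A typical limit/offset loop keeps requesting pages until one comes back short. The page-fetching function is injected here because the listing call's real signature is not shown on this page; listAll, the ListPage type, and the default page size of 50 are assumptions.

```typescript
// Any function that returns one page of results (hypothetical shape).
type ListPage<T> = (limit: number, offset: number) => Promise<T[]>;

// Collect every item by walking the collection one page at a time.
async function listAll<T>(listPage: ListPage<T>, limit = 50): Promise<T[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const all: T[] = [];
  let offset = 0;
  while (true) {
    const page = await listPage(limit, offset);
    for (const item of page) all.push(item);
    if (page.length < limit) break; // a short page means the end was reached
    offset += limit;
  }
  return all;
}
```

When the collection size is an exact multiple of the limit, the loop makes one extra request that returns an empty page; that is the cost of not needing a total count from the server.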
Caching
Cache frequently accessed document metadata locally.
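A simple time-to-live cache is usually enough for metadata, since it bounds how stale an entry can get after a document is unshared or deleted. The TtlCache class below is a hypothetical sketch; the clock is injectable so expiry can be tested without waiting.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` defaults to the system clock; pass a fake clock in tests.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Call after deleting or unsharing a document so stale metadata is not served.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

Keep the TTL short for anything tied to permissions: a cached entry can outlive the user's access to the document by up to one TTL unless it is invalidated explicitly.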
Best Practices
Document Lifecycle
- Upload - Add documents to personal graph first
- Verify - Check processing status before sharing
- Share - Share selectively to appropriate graphs
- Search - Use semantic search to find relevant content
- Unshare - Remove access when collaboration completes
- Archive - Delete outdated documents to keep search relevant
Security
- Never upload sensitive documents without proper access controls
- Use personal graphs for private documents
- Audit sharing regularly using listGraphs()
- Remove documents completely when no longer needed
- Tag documents appropriately for easier access control