Document Technical Notes

This page covers advanced topics for document management, including the processing pipeline, embedding system, graph-based access control, and troubleshooting common issues.

Document Processing Pipeline

The document ingestion pipeline handles multiple file formats through specialized extractors.

Text Extraction

PDF documents - Uses pdf-parse for text extraction and page number preservation
Microsoft Word - Uses mammoth to extract text while maintaining document structure
Plain text and Markdown - Processed directly with UTF-8 encoding

After extraction, the document is chunked into overlapping segments using a token-based approach. Each chunk maintains metadata about its position in the original document, including page numbers (for PDFs) and character offsets.
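
The overlapping-window idea can be sketched as follows. This is a minimal illustration, not the pipeline's actual chunker: whitespace-delimited words stand in for tokenizer tokens, and the function name chunkText and the chunkSize and overlap defaults are assumptions chosen for the example.

```javascript
// Sketch only: whitespace-delimited words approximate real tokenizer tokens.
// chunkSize and overlap are illustrative values, not the pipeline's settings.
function chunkText(text, chunkSize = 200, overlap = 40) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }

  // Record where each token starts and ends in the original text.
  const tokens = [];
  const wordPattern = /\S+/g;
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    tokens.push({ start: match.index, end: match.index + match[0].length });
  }

  const chunks = [];
  const step = chunkSize - overlap; // consecutive windows share `overlap` tokens
  for (let i = 0; i < tokens.length; i += step) {
    const windowTokens = tokens.slice(i, i + chunkSize);
    const charOffset = windowTokens[0].start;
    const end = windowTokens[windowTokens.length - 1].end;
    chunks.push({
      position: chunks.length, // 0-indexed position in the document
      charOffset,              // character offset in the original text
      content: text.slice(charOffset, end),
    });
    if (i + chunkSize >= tokens.length) break; // this window reached the end
  }
  return chunks;
}
```

Because each window starts chunkSize minus overlap tokens after the previous one, a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.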

Embedding Generation

Embeddings are generated using OpenAI's text-embedding-3-small model (1536 dimensions). These vector representations capture the semantic meaning of each chunk, enabling similarity-based search that goes beyond keyword matching.
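
For orientation, a request to OpenAI's embeddings endpoint for a batch of chunk texts might look like the sketch below. The function name embedChunks and the injectable fetchImpl parameter are assumptions made for illustration; how the pipeline actually batches, retries, or authenticates is not described on this page.

```javascript
// Sketch: request embeddings for a batch of chunk texts.
// fetchImpl is injectable (an assumption of this example) so the function
// can be exercised without network access.
async function embedChunks(texts, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: texts }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // Each item carries the index of its input; sort so output order matches input order.
  return payload.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```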

Chunk Storage

Document chunks are stored as nodes in Neo4j with the following properties:

chunkId - Unique identifier for the chunk
documentId - Parent document identifier
content - The actual text content of the chunk
position - Sequential position in the document (0-indexed)
pageNumber - Page number (for PDFs)
charOffset - Character offset in the original document
embedding - 1536-dimensional vector embedding
createdAt - Timestamp of chunk creation

Chunks are connected to their parent document via a CHUNK_OF relationship, and documents are connected to graphs via IN_GRAPH relationships. This structure enables efficient traversal and access control queries.
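
As a rough illustration of this structure, the sketch below builds a parameterized Cypher statement that creates one chunk node and links it to its parent document. The node labels Chunk and Document, the relationship direction, and the helper name buildChunkInsert are assumptions; only the property names and the CHUNK_OF relationship type come from this page.

```javascript
// Sketch: build a parameterized Cypher statement for storing one chunk.
// Labels (Chunk, Document) and the arrow direction are assumed, not confirmed.
function buildChunkInsert(chunk) {
  const query = `
    MATCH (d:Document {documentId: $documentId})
    CREATE (c:Chunk {
      chunkId: $chunkId,
      documentId: $documentId,
      content: $content,
      position: $position,
      pageNumber: $pageNumber,
      charOffset: $charOffset,
      embedding: $embedding,
      createdAt: datetime()
    })
    CREATE (c)-[:CHUNK_OF]->(d)
    RETURN c.chunkId AS chunkId
  `;
  const params = {
    chunkId: chunk.chunkId,
    documentId: chunk.documentId,
    content: chunk.content,
    position: chunk.position,
    pageNumber: chunk.pageNumber ?? null, // only PDFs carry page numbers
    charOffset: chunk.charOffset,
    embedding: chunk.embedding,
  };
  return { query, params };
}
```

With the official neo4j-driver package, the returned pair could be passed to session.run(query, params). Keeping values in parameters rather than interpolating them into the query string avoids injection and lets the database reuse the query plan.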

Graph-Based Access Control

All document operations respect graph-based permissions:

Graph Types

Personal Graph (user:{userId}) - Private documents accessible only by the owner
Group Graphs (group:{groupId}) - Documents shared with group members
User Contact Graphs (user_contact:{contactId}) - Documents shared in direct conversations

Documents can exist in multiple graphs through the sharing mechanism:

Documents are initially uploaded to a single "primary" graph
Documents can be shared to additional graphs
Users can only access documents in graphs they are members of
Deleting a document removes it from all graphs
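
The membership rule above can be expressed in a few lines. The sketch below is illustrative only: parseGraphId and canAccess are hypothetical helper names, and the real check presumably runs server-side against the graph database rather than on in-memory arrays.

```javascript
// Sketch of the membership rule: a user may read a document if at least one
// graph containing the document is a graph the user belongs to.
const GRAPH_ID_PATTERN = /^(user|group|user_contact):(.+)$/;

// Split a graph ID such as "group:team" into its type and identifier.
// Returns null when the ID does not follow one of the three documented formats.
function parseGraphId(graphId) {
  const match = GRAPH_ID_PATTERN.exec(graphId);
  if (!match) return null;
  return { type: match[1], id: match[2] };
}

function canAccess(userGraphIds, documentGraphIds) {
  const memberships = new Set(userGraphIds);
  return documentGraphIds.some((graphId) => memberships.has(graphId));
}
```

For example, canAccess(['user:1', 'group:team'], ['group:team']) is true, while a user whose only graph is 'user:1' cannot read the same document.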

Search Quality

Similarity Scoring

Search results are ranked by cosine similarity, reported as a score from 0.0 to 1.0:

0.9-1.0 - Highly relevant, very similar semantic meaning
0.8-0.9 - Relevant, similar concepts
0.7-0.8 - Somewhat relevant, related topics
< 0.7 - Often not relevant; tune the similarity threshold to include or exclude these results
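
To make the scoring concrete, here is a small sketch of cosine similarity and threshold-based ranking. The function names and the 0.7 default threshold are illustrative assumptions; in production the comparison is presumably performed by the database rather than in application code.

```javascript
// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query, drop results below the threshold,
// and return the rest with the best match first.
function rankChunks(queryEmbedding, chunks, threshold = 0.7) {
  return chunks
    .map((chunk) => ({
      ...chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .filter((result) => result.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```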

Query Optimization

Good queries:

"What were the key revenue drivers in Q4 2024?"
"Describe the customer acquisition strategy"
"How does the pricing model work?"

Poor queries:

"revenue" (too short, lacks context)
"What is everything about sales and marketing and customer success?" (too broad)
Bare keyword lists intended for exact matching (search is semantic, so phrase the query as a natural-language question or statement)

Supported File Formats

PDF

MIME Type: application/pdf

Limitations:

Encrypted or password-protected PDFs cannot be extracted
Scanned, image-only documents require OCR, which is not currently supported
Very old PDF versions (pre-1.4) may fail extraction

Microsoft Word

MIME Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

Limitations:

Only .docx format supported (not .doc)
Complex formatting may not be preserved
Embedded images are ignored

Plain Text

MIME Type: text/plain

Encoding: UTF-8

Markdown

MIME Type: text/markdown

Features:

Syntax preserved for code blocks and headings
Links and formatting maintained
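
Since uploads must declare one of these MIME types, a client-side lookup from file extension to MIME type can be handy. The sketch below is an assumption-laden convenience: the helper name mimeTypeFor is hypothetical, and the .txt and .md extensions are the conventional ones for these formats rather than something this page specifies.

```javascript
// Sketch: map a filename's extension to one of the four supported MIME types.
const MIME_TYPES = {
  '.pdf': 'application/pdf',
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.txt': 'text/plain',
  '.md': 'text/markdown',
};

// Returns the MIME type, or null when the extension is missing or unsupported.
function mimeTypeFor(filename) {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return null;
  const extension = filename.slice(dot).toLowerCase();
  return MIME_TYPES[extension] ?? null;
}
```

A legacy file such as report.doc yields null, which matches the limitation that only .docx is supported.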

Troubleshooting

Upload Failures

When document upload fails, verify:

File encoding - Ensure file is properly Base64 encoded without newlines
File size - Check that the file is under the 10 MB limit
MIME type - Verify MIME type matches file format exactly
Graph permissions - Confirm you have write access to target graph
File corruption - Test that original file opens correctly
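
Several of these checks can be run locally before calling the API. The following sketch is one way to do that in Node.js; validateUpload is a hypothetical helper, and it assumes the limit means 10 MiB measured on the decoded file, which this page does not confirm.

```javascript
// Assumption: the 10 MB limit is 10 MiB and applies to the decoded file.
const MAX_BYTES = 10 * 1024 * 1024;

const SUPPORTED_MIME_TYPES = new Set([
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'text/plain',
  'text/markdown',
]);

// Returns a list of human-readable problems; an empty list means the
// payload passed every local check.
function validateUpload({ base64, mimeType }) {
  const problems = [];
  if (/\s/.test(base64)) {
    problems.push('Base64 payload contains whitespace or newlines');
  } else if (base64.length % 4 !== 0 || !/^[A-Za-z0-9+\/]*={0,2}$/.test(base64)) {
    problems.push('Payload is not valid Base64');
  } else if (Buffer.from(base64, 'base64').length > MAX_BYTES) {
    problems.push('Decoded file exceeds the size limit');
  }
  if (!SUPPORTED_MIME_TYPES.has(mimeType)) {
    problems.push(`Unsupported MIME type: ${mimeType}`);
  }
  return problems;
}
```

Graph permissions and file corruption cannot be verified this way; those still require a round trip to the service or opening the file in its native application.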

Extraction Errors

When text extraction fails:

PDF protection - Remove password protection before upload
Scanned documents - Use external OCR tools to generate text first
Corrupt files - Verify file is not corrupted by opening in native application
Unsupported versions - Convert old formats to newer versions

Search Quality Issues

When search results are not relevant:

Threshold too high - Lower similarity threshold (try 0.6 or 0.65)
Query too short - Use more descriptive queries with 5-10 words
Wrong graph - Verify you're searching the correct graph ID
Recent upload - Allow a few seconds for embedding generation
Language mismatch - Embedding model works best with English text
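
The first suggestion, lowering the threshold, can be automated by retrying with progressively lower values. The sketch below assumes a caller-supplied searchFn(query, threshold) that resolves to an array of results; that function, its signature, and the helper name searchWithFallback are hypothetical and do not describe the SDK's search API.

```javascript
// Sketch: try each threshold in turn and stop at the first one that
// produces results. searchFn is supplied by the caller (hypothetical).
async function searchWithFallback(searchFn, query, thresholds = [0.7, 0.65, 0.6]) {
  for (const threshold of thresholds) {
    const results = await searchFn(query, threshold);
    if (results.length > 0) {
      return { threshold, results };
    }
  }
  // Nothing matched even at the lowest threshold.
  return { threshold: thresholds[thresholds.length - 1], results: [] };
}
```

Returning the threshold that succeeded makes it easy to log how often the stricter setting comes back empty, which is a useful signal when choosing a default.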

Permission Errors

When you cannot access a document:

Graph membership - Verify you are a member of the graph
Document removed - Document may have been unshared or deleted
Incorrect graph ID - Double-check graph ID format
Authentication - Ensure API key is valid and not expired

Performance Optimization

Batch Operations

Upload multiple documents concurrently for better performance:

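// Assumes `mirra` is an initialized client and `files` holds prepared upload payloads.
// Promise.all rejects as soon as one upload fails; use Promise.allSettled to collect per-file outcomes.
const results = await Promise.all(
  files.map(file => mirra.documents.upload(file))
);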
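
For large batches, starting every upload at once can run into rate limits or exhaust memory. A bounded worker pool is one common alternative; the sketch below is a generic pattern, with uploadAll, the uploadFn parameter, and the default concurrency of 3 all chosen for illustration rather than taken from the SDK.

```javascript
// Sketch: run at most `concurrency` uploads at a time and record each
// outcome in the same shape Promise.allSettled uses.
async function uploadAll(files, uploadFn, concurrency = 3) {
  const results = new Array(files.length);
  let next = 0; // index of the next file to claim

  async function worker() {
    while (next < files.length) {
      const index = next++; // claimed synchronously, so no two workers share a file
      try {
        results[index] = { status: 'fulfilled', value: await uploadFn(files[index]) };
      } catch (error) {
        results[index] = { status: 'rejected', reason: error };
      }
    }
  }

  const workerCount = Math.min(concurrency, files.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

It could be called as uploadAll(files, (file) => mirra.documents.upload(file)), after which failed entries can be retried individually without repeating the successful uploads.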

Pagination

Always use limit and offset when listing large document collections:

const docs = await mirra.documents.list({
  graphId: 'group:team',
  limit: 20,
  offset: 0
});
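
To walk an entire collection, the offset can be advanced until a short page comes back. The sketch below assumes a caller-supplied listPage function that accepts the same options as above and resolves to an array of documents; the actual return shape of mirra.documents.list is not shown on this page, so adapt the wrapper accordingly.

```javascript
// Sketch: fetch every page of a graph's documents by advancing the offset.
// listPage is supplied by the caller and is assumed to resolve to an array.
async function listAll(listPage, graphId, pageSize = 20) {
  const all = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await listPage({ graphId, limit: pageSize, offset });
    all.push(...page);
    if (page.length < pageSize) break; // a short page means nothing is left
  }
  return all;
}
```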

Caching

Cache frequently accessed document metadata locally:

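// In-memory cache keyed by document ID. Entries never expire, so clear or
// bypass the cache after a document is updated, unshared, or deleted.
const docCache = new Map();

async function getDoc(id) {
  if (docCache.has(id)) return docCache.get(id);
  const doc = await mirra.documents.get(id);
  docCache.set(id, doc);
  return doc;
}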
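
Where stale metadata is a concern, for instance after a document is unshared, a cache with a time-to-live is a common refinement. The sketch below is one possible variant: createDocCache, the fetchDoc parameter, the 60-second default, and the injectable clock are all assumptions made for illustration.

```javascript
// Sketch: cache lookups for ttlMs milliseconds and share in-flight requests,
// so concurrent callers asking for the same ID trigger only one fetch.
function createDocCache(fetchDoc, ttlMs = 60000, now = Date.now) {
  const entries = new Map();

  return async function getDoc(id) {
    const hit = entries.get(id);
    if (hit && hit.expiresAt > now()) return hit.promise;

    const promise = Promise.resolve()
      .then(() => fetchDoc(id))
      .catch((error) => {
        // Do not keep failed lookups; let the next call retry.
        const current = entries.get(id);
        if (current && current.promise === promise) entries.delete(id);
        throw error;
      });
    entries.set(id, { promise, expiresAt: now() + ttlMs });
    return promise;
  };
}
```

It could be wired up as createDocCache((id) => mirra.documents.get(id)). Storing the promise rather than the resolved value is what lets simultaneous callers share a single request.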

Best Practices

Document Lifecycle

Upload - Add documents to personal graph first
Verify - Check processing status before sharing
Share - Share selectively to appropriate graphs
Search - Use semantic search to find relevant content
Unshare - Remove access when collaboration completes
Archive - Delete outdated documents to keep search relevant

Security

Never upload sensitive documents without proper access controls
Use personal graphs for private documents
Audit sharing regularly using listGraphs()
Remove documents completely when no longer needed
Tag documents appropriately for easier access control