Document Technical Notes

This page covers advanced topics for document management, including the processing pipeline, the embedding system, graph-based access control, and troubleshooting for common issues.

Document Processing Pipeline

The document ingestion pipeline handles multiple file formats through specialized extractors.

Text Extraction

  • PDF documents - Uses pdf-parse for text extraction and page number preservation
  • Microsoft Word - Uses mammoth to extract text while maintaining document structure
  • Plain text and Markdown - Processed directly with UTF-8 encoding
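
The routing described above can be sketched as a lookup from MIME type to extractor. This is a minimal illustration, not the pipeline's actual code: `createTextExtractor`, `extractPdf`, and `extractDocx` are hypothetical names, and the PDF and Word extractors are injected so the sketch makes no claim about how pdf-parse or mammoth are wired in.

```javascript
// Sketch: route a file buffer to an extractor based on its MIME type.
// `extractPdf` and `extractDocx` are hypothetical wrappers (for example
// around pdf-parse and mammoth); they are injected to keep this self-contained.
function createTextExtractor({ extractPdf, extractDocx }) {
  // Plain text and Markdown are decoded directly as UTF-8.
  const decodeUtf8 = async (buffer) => buffer.toString('utf8');

  const extractors = new Map([
    ['application/pdf', extractPdf],
    ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', extractDocx],
    ['text/plain', decodeUtf8],
    ['text/markdown', decodeUtf8],
  ]);

  return async function extractText(buffer, mimeType) {
    const extractor = extractors.get(mimeType);
    if (!extractor) {
      throw new Error(`Unsupported MIME type: ${mimeType}`);
    }
    return extractor(buffer);
  };
}
```

Keeping the table keyed by MIME type means an unsupported format fails early with a clear error instead of producing empty text.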

Chunking

After extraction, the document is chunked into overlapping segments using a token-based approach. Each chunk maintains metadata about its position in the original document, including page numbers (for PDFs) and character offsets.
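
To make the overlap and position metadata concrete, here is a minimal sketch of an overlapping chunker. It rests on assumptions: tokens are approximated by whitespace-delimited words, and the `chunkSize` and `overlap` defaults are illustrative, since the actual tokenizer and window sizes are not stated on this page.

```javascript
// Sketch: split text into overlapping chunks, tracking position and charOffset.
// Assumption: "tokens" are whitespace-delimited words; the real tokenizer,
// chunk size, and overlap are not documented here.
function chunkText(text, { chunkSize = 200, overlap = 40 } = {}) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  // Collect each token together with its character offset in the source text.
  const tokens = [];
  for (const match of text.matchAll(/\S+/g)) {
    tokens.push({ value: match[0], offset: match.index });
  }

  const chunks = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < tokens.length; start += step) {
    const windowTokens = tokens.slice(start, start + chunkSize);
    const first = windowTokens[0];
    const last = windowTokens[windowTokens.length - 1];
    chunks.push({
      position: chunks.length, // 0-indexed sequence number
      charOffset: first.offset, // where the chunk starts in the original text
      content: text.slice(first.offset, last.offset + last.value.length),
    });
    if (start + chunkSize >= tokens.length) break; // final window reached
  }
  return chunks;
}
```

Each chunk after the first repeats the last `overlap` tokens of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.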

Embedding Generation

Embeddings are generated using OpenAI's text-embedding-3-small model (1536 dimensions). These vector representations capture the semantic meaning of each chunk, enabling similarity-based search that goes beyond keyword matching.
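
For readers who want to reproduce comparable vectors outside the service, the sketch below calls the OpenAI embeddings endpoint with the same model. The function name `embedChunks` and the injectable `fetchImpl` parameter are illustrative; how the pipeline batches or retries these requests internally is not documented here.

```javascript
// Sketch: request embeddings for a batch of chunk texts from the OpenAI
// embeddings endpoint. `fetchImpl` is injectable so the function can be
// tested without network access; it defaults to the global fetch.
async function embedChunks(texts, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: texts }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // Sort by index so the vectors line up with the input order.
  return [...payload.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```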


Chunk Storage

Document chunks are stored as nodes in Neo4j with the following properties:

  • chunkId - Unique identifier for the chunk
  • documentId - Parent document identifier
  • content - The actual text content of the chunk
  • position - Sequential position in the document (0-indexed)
  • pageNumber - Page number (for PDFs)
  • charOffset - Character offset in the original document
  • embedding - 1536-dimensional vector embedding
  • createdAt - Timestamp of chunk creation

Chunks are connected to their parent document via a CHUNK_OF relationship, and documents are connected to graphs via IN_GRAPH relationships. This structure enables efficient traversal and access control queries.
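
As an illustration of this layout, the sketch below builds a parameterized Cypher statement for storing one chunk. The node labels `Chunk` and `Document`, the relationship direction, and the helper name `buildChunkInsert` are assumptions; only the property names and the CHUNK_OF relationship come from this page.

```javascript
// Sketch: build a parameterized Cypher statement that stores one chunk and
// links it to its parent document. The labels `Chunk` and `Document` and the
// chunk -> document direction are assumptions, not the documented schema.
function buildChunkInsert(chunk) {
  const query = `
    MATCH (d:Document {documentId: $documentId})
    CREATE (c:Chunk {
      chunkId: $chunkId,
      documentId: $documentId,
      content: $content,
      position: $position,
      pageNumber: $pageNumber,
      charOffset: $charOffset,
      embedding: $embedding,
      createdAt: datetime()
    })
    CREATE (c)-[:CHUNK_OF]->(d)
    RETURN c.chunkId AS chunkId
  `;
  const params = {
    chunkId: chunk.chunkId,
    documentId: chunk.documentId,
    content: chunk.content,
    position: chunk.position,
    pageNumber: chunk.pageNumber ?? null, // only PDFs carry page numbers
    charOffset: chunk.charOffset,
    embedding: chunk.embedding,
  };
  return { query, params };
}
```

The returned pair can be handed to a Neo4j driver session, for example `session.run(query, params)`. Passing values as parameters rather than interpolating them keeps chunk content from being interpreted as Cypher.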


Graph-Based Access Control

All document operations respect graph-based permissions.

Graph Types

  • Personal Graph (user:{userId}) - Private documents accessible only by the owner
  • Group Graphs (group:{groupId}) - Documents shared with group members
  • User Contact Graphs (user_contact:{contactId}) - Documents shared in direct conversations
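
Because the graph type is encoded as a prefix, client code can validate an ID before sending a request. The helper below is a hypothetical sketch; the prefixes come from the list above, while the validation rules are assumptions.

```javascript
// Sketch: split a graph ID into its type prefix and identifier.
// The prefixes come from the list above; the validation rules are assumptions.
const GRAPH_TYPES = new Set(['user', 'group', 'user_contact']);

function parseGraphId(graphId) {
  // Split on the first colon only, so identifiers may contain colons.
  const separator = graphId.indexOf(':');
  if (separator <= 0 || separator === graphId.length - 1) {
    throw new Error(`Malformed graph ID: ${graphId}`);
  }
  const type = graphId.slice(0, separator);
  const id = graphId.slice(separator + 1);
  if (!GRAPH_TYPES.has(type)) {
    throw new Error(`Unknown graph type: ${type}`);
  }
  return { type, id };
}
```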

Sharing Model

Documents can exist in multiple graphs through the sharing mechanism:

  1. Documents are initially uploaded to a single "primary" graph
  2. Documents can be shared to additional graphs
  3. Users can only access documents in graphs they are members of
  4. Deleting a document removes it from all graphs
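
The four rules above can be modeled in a few lines. The class below is an in-memory illustration of the semantics only; `DocumentShares` and its method names are hypothetical and do not reflect how the service stores documents.

```javascript
// Sketch: an in-memory model of the sharing rules above. It illustrates the
// semantics only; it is not how the service stores documents.
class DocumentShares {
  constructor() {
    this.graphsByDocument = new Map(); // documentId -> Set of graph IDs
  }

  // Rule 1: a document starts in a single primary graph.
  upload(documentId, primaryGraphId) {
    this.graphsByDocument.set(documentId, new Set([primaryGraphId]));
  }

  // Rule 2: sharing adds the document to another graph.
  share(documentId, graphId) {
    const graphs = this.graphsByDocument.get(documentId);
    if (!graphs) throw new Error(`Unknown document: ${documentId}`);
    graphs.add(graphId);
  }

  // Rule 3: access requires membership in at least one graph holding the document.
  canAccess(documentId, memberGraphIds) {
    const graphs = this.graphsByDocument.get(documentId);
    if (!graphs) return false;
    return memberGraphIds.some((graphId) => graphs.has(graphId));
  }

  // Rule 4: deleting removes the document from every graph at once.
  delete(documentId) {
    this.graphsByDocument.delete(documentId);
  }
}
```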

Search Quality

Similarity Scoring

Search results are ranked by cosine similarity (0.0 to 1.0):

  • 0.9-1.0 - Highly relevant, very similar semantic meaning
  • 0.8-0.9 - Relevant, similar concepts
  • 0.7-0.8 - Somewhat relevant, related topics
  • Below 0.7 - May not be relevant; adjust the similarity threshold to include or exclude these results
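
For reference, the score bands above can be reproduced with a short sketch. `cosineSimilarity` is the standard formula; `relevanceBand` is a hypothetical helper that only restates the table and is not part of the API.

```javascript
// Sketch: cosine similarity between two equal-length vectors, plus a helper
// that maps a score to the bands listed above.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // undefined for zero vectors; treat as no match
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function relevanceBand(score) {
  if (score >= 0.9) return 'highly relevant';
  if (score >= 0.8) return 'relevant';
  if (score >= 0.7) return 'somewhat relevant';
  return 'may not be relevant';
}
```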

Query Optimization

Good queries:

  • "What were the key revenue drivers in Q4 2024?"
  • "Describe the customer acquisition strategy"
  • "How does the pricing model work?"

Poor queries:

  • "revenue" (too short, lacks context)
  • "What is everything about sales and marketing and customer success?" (too broad)
  • Exact keyword strings (search matches on meaning, so phrase the query as a natural-language question instead)

Supported File Formats

PDF

MIME Type: application/pdf

Limitations:

  • Encrypted or password-protected PDFs cannot be extracted
  • Scanned documents (images only) require OCR (not currently supported)
  • Very old PDF versions (pre-1.4) may fail extraction

Microsoft Word

MIME Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

Limitations:

  • Only .docx format supported (not .doc)
  • Complex formatting may not be preserved
  • Embedded images are ignored

Plain Text

MIME Type: text/plain

Encoding: UTF-8

Markdown

MIME Type: text/markdown

Features:

  • Syntax preserved for code blocks and headings
  • Links and formatting maintained

Troubleshooting

Upload Failures

When document upload fails, verify:

  • File encoding - Ensure the file is properly Base64-encoded, without newlines
  • File size - Check that the file is under the 10 MB limit
  • MIME type - Verify that the MIME type matches the file format exactly
  • Graph permissions - Confirm you have write access to the target graph
  • File corruption - Test that the original file opens correctly
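
Several of these checks can run on the client before the request is sent. The sketch below is a hypothetical pre-flight helper: the MIME types are the ones listed under Supported File Formats, while reading the size limit as 10 * 1024 * 1024 bytes of decoded content is an assumption.

```javascript
// Sketch: client-side checks mirroring the list above. Interpreting the limit
// as 10 * 1024 * 1024 bytes of decoded content is an assumption.
const MAX_BYTES = 10 * 1024 * 1024;
const SUPPORTED_MIME_TYPES = new Set([
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'text/plain',
  'text/markdown',
]);

// Returns a list of human-readable problems; an empty list means no issue found.
function validateUpload({ base64, mimeType }) {
  const problems = [];
  if (/\s/.test(base64)) {
    problems.push('Base64 payload contains whitespace or newlines');
  }
  const compact = base64.replace(/\s/g, '');
  if (compact.length % 4 !== 0 || !/^[A-Za-z0-9+/]*={0,2}$/.test(compact)) {
    problems.push('Payload is not valid Base64');
  }
  if (Buffer.from(compact, 'base64').length > MAX_BYTES) {
    problems.push('File exceeds the 10 MB limit');
  }
  if (!SUPPORTED_MIME_TYPES.has(mimeType)) {
    problems.push(`Unsupported MIME type: ${mimeType}`);
  }
  return problems;
}
```

Graph permissions and file corruption cannot be checked this way; those still surface as errors from the upload call itself.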

Extraction Errors

When text extraction fails:

  • PDF protection - Remove password protection before upload
  • Scanned documents - Use external OCR tools to generate text first
  • Corrupt files - Verify the file is not corrupted by opening it in its native application
  • Unsupported versions - Convert old formats to newer versions (for example, .doc to .docx, or a pre-1.4 PDF to a current PDF version)

Search Quality Issues

When search results are not relevant:

  • Threshold too high - Lower similarity threshold (try 0.6 or 0.65)
  • Query too short - Use more descriptive queries with 5-10 words
  • Wrong graph - Verify you're searching the correct graph ID
  • Recent upload - Allow a few seconds for embedding generation
  • Language mismatch - The embedding model works best with English text
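
The threshold advice above can be automated by retrying with progressively lower thresholds. In this sketch, `search` is a hypothetical function of the shape `({ query, threshold }) => Promise<Array>`; it stands in for whatever search call your client exposes and is not a documented Mirra method. The default thresholds are illustrative.

```javascript
// Sketch: retry a search with progressively lower similarity thresholds until
// enough results come back. `search` is a hypothetical injected function
// ({ query, threshold }) => Promise<Array>, not a documented API call.
async function searchWithFallback(search, query, thresholds = [0.75, 0.65, 0.6], minResults = 1) {
  let results = [];
  for (const threshold of thresholds) {
    results = await search({ query, threshold });
    if (results.length >= minResults) {
      return { threshold, results };
    }
  }
  // Nothing met the bar; report the last (lowest) threshold tried.
  return { threshold: thresholds[thresholds.length - 1], results };
}
```

Returning the threshold that succeeded makes it easy to log how often queries need the lower settings.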

Permission Errors

When you cannot access a document:

  • Graph membership - Verify you are a member of the graph
  • Document removed - Document may have been unshared or deleted
  • Incorrect graph ID - Double-check graph ID format
  • Authentication - Ensure API key is valid and not expired

Performance Optimization

Batch Operations

Upload multiple documents concurrently for better performance:

// Start every upload at once; Promise.all rejects as soon as any upload fails.
const results = await Promise.all(
  files.map((file) => mirra.documents.upload(file))
);
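
`Promise.all` rejects on the first failure, which discards the results of uploads that succeeded. When partial success is acceptable, a variant based on `Promise.allSettled` can report both outcomes. The helper name `uploadAll` is hypothetical, and `upload` is any promise-returning function, for example `(file) => mirra.documents.upload(file)`.

```javascript
// Sketch: upload files concurrently but keep going when some fail.
// `upload` is an injected promise-returning function.
async function uploadAll(files, upload) {
  // The async wrapper turns synchronous throws into rejections as well.
  const settled = await Promise.allSettled(files.map(async (file) => upload(file)));
  const succeeded = [];
  const failed = [];
  settled.forEach((result, index) => {
    if (result.status === 'fulfilled') {
      succeeded.push(result.value);
    } else {
      failed.push({ file: files[index], error: result.reason });
    }
  });
  return { succeeded, failed };
}
```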

Pagination

Always use limit and offset when listing large document collections:

const docs = await mirra.documents.list({
  graphId: 'group:team',
  limit: 20,
  offset: 0
});
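
To walk an entire collection, the same call can be repeated with an increasing offset. This sketch assumes that the list call resolves to an array and that a page shorter than `limit` marks the end; the actual response shape may differ. `list` is injected, for example `(params) => mirra.documents.list(params)`.

```javascript
// Sketch: fetch every page of a document listing.
// Assumptions: `list` resolves to an array, and a page shorter than `limit`
// means there are no more results. The real response shape may differ.
async function listAllDocuments(list, graphId, limit = 20) {
  const all = [];
  for (let offset = 0; ; offset += limit) {
    const page = await list({ graphId, limit, offset });
    all.push(...page);
    if (page.length < limit) break; // short (or empty) page: done
  }
  return all;
}
```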

Caching

Cache frequently accessed document metadata locally:

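// Simple in-memory cache keyed by document ID (no expiry or invalidation).
const docCache = new Map();

async function getDoc(id) {
  // Serve from the cache when the document was fetched before.
  if (docCache.has(id)) return docCache.get(id);
  const doc = await mirra.documents.get(id);
  docCache.set(id, doc);
  return doc;
}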
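
A cache without expiry can serve stale metadata after a document is unshared or deleted. One possible refinement, sketched here under stated assumptions, adds a time-to-live and caches the in-flight promise so concurrent callers share a single request. `createDocCache`, the `ttlMs` default, and the injectable `now` clock are all illustrative; `fetchDoc` would be something like `(id) => mirra.documents.get(id)`.

```javascript
// Sketch: a TTL cache that stores promises, so concurrent lookups for the
// same ID share one request. `fetchDoc` and `now` are injected for testing.
function createDocCache(fetchDoc, ttlMs = 60000, now = Date.now) {
  const entries = new Map(); // id -> { promise, expiresAt }

  return function getDoc(id) {
    const cached = entries.get(id);
    if (cached && cached.expiresAt > now()) {
      return cached.promise;
    }
    const promise = Promise.resolve()
      .then(() => fetchDoc(id))
      .catch((error) => {
        // Do not keep failures cached; let the next call retry.
        const current = entries.get(id);
        if (current && current.promise === promise) entries.delete(id);
        throw error;
      });
    entries.set(id, { promise, expiresAt: now() + ttlMs });
    return promise;
  };
}
```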

Best Practices

Document Lifecycle

  1. Upload - Add documents to personal graph first
  2. Verify - Check processing status before sharing
  3. Share - Share selectively to appropriate graphs
  4. Search - Use semantic search to find relevant content
  5. Unshare - Remove access when collaboration completes
  6. Archive - Delete outdated documents to keep search relevant

Security

  • Never upload sensitive documents without proper access controls
  • Use personal graphs for private documents
  • Audit sharing regularly using listGraphs()
  • Remove documents completely when no longer needed
  • Tag documents appropriately for easier access control
