Ingestion pipeline — bRRAIn Docs
How documents are processed: extraction, classification, graph connection, and custom pipelines.
Ingestion pipeline
Every document upload flows through the same deterministic pipeline. The pipeline guarantees exactly-once ingestion, atomic graph updates, and compliance-grade audit trails.
Pipeline overview
1. Client upload
↓
2. Envelope encryption (user-derived key)
↓
3. Metadata extraction
├─ Filename, size, mime type
├─ Content hash (SHA-256)
└─ Preview thumbnail generation
↓
4. Handler summarization + entity extraction
├─ Key-term extraction via domain adapter
├─ Classification (public/internal/confidential/restricted)
└─ POPE entity detection (people, orgs, places, events)
↓
5. Gate 1: policy check
├─ PII / credential detection
├─ LLM provenance validation
└─ Rate limit check
↓
6. Consolidator merge (graph update prepared)
↓
7. Gate 2: final policy check
├─ Schema validation
└─ Classification lock
↓
8. Atomic write
├─ Vault: encrypted document + metadata
├─ Graph: document node + entity edges
└─ Audit: immutable event log
↓
9. Notifier (subscribed workspaces)
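The staged flow above can be sketched as a sequence of stage functions applied in order, where any stage error aborts the run before the atomic write. This is an illustrative sketch, not the portal API; the `Stage` type and stage names here are assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is one step of the ingestion pipeline; a non-nil error aborts the run.
type Stage struct {
	Name string
	Run  func(doc map[string]string) error
}

// Ingest applies stages in order and reports which stages completed
// before the pipeline stopped.
func Ingest(doc map[string]string, stages []Stage) (completed []string, err error) {
	for _, s := range stages {
		if err := s.Run(doc); err != nil {
			return completed, fmt.Errorf("%s: %w", s.Name, err)
		}
		completed = append(completed, s.Name)
	}
	return completed, nil
}

func main() {
	stages := []Stage{
		{Name: "encrypt", Run: func(d map[string]string) error { d["encrypted"] = "true"; return nil }},
		{Name: "gate1", Run: func(d map[string]string) error {
			if d["pii"] == "true" {
				return errors.New("PII detected: quarantined") // Gate 1 stops the run
			}
			return nil
		}},
		{Name: "atomic-write", Run: func(d map[string]string) error { d["stored"] = "true"; return nil }},
	}
	done, err := Ingest(map[string]string{"pii": "true"}, stages)
	fmt.Println(done, err)
}
```

Because stages run strictly in order, a Gate 1 rejection means the atomic write (stage 8) never starts, which is what makes quarantine safe.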
SLA per stage
| Stage | Target p95 | Typical |
| --- | --- | --- |
| Encryption | < 50 ms / 10 MB | 20 ms |
| Metadata extraction | < 200 ms | 100 ms |
| Handler summarization | < 2 s | 800 ms |
| Graph update | < 100 ms | 40 ms |
| Total | < 3 s for files ≤ 10 MB | 1.5 s |
Large files (e.g., 100 GB+ video) are streamed through the pipeline in chunks; summarization runs on extracted transcripts or keyframes.
What the Handler extracts
For each document, the Handler produces:
```json
{
  "classification": "confidential",
  "summary": "Q1 2026 portfolio review covering hedge strategy...",
  "key_terms": ["hedge", "Q1 2026", "Vanguard", "portfolio"],
  "entities": {
    "people": ["alice@lawfirm.io", "Bob Smith"],
    "organizations": ["Vanguard", "Acme Corp"],
    "places": ["New York office"],
    "events": ["Q1 2026 earnings call"]
  },
  "relationships": [
    {"type": "about", "target": "project:portfolio-2026"},
    {"type": "authored_by", "target": "user:alice"}
  ]
}
```
Custom pipelines
For specialized domains (legal, mining, healthcare) you can register a custom pipeline step that runs between stages 4 and 5.
Example: medical PHI redactor
```go
portal.RegisterPipelineStep(ctx, &portal.PipelineStep{
	Name:      "phi-redactor",
	RunAfter:  portal.StageEntityExtraction, // stage 4
	RunBefore: portal.StageGate1,            // stage 5
	Handler: func(doc *portal.Document) error {
		return redactPHI(doc) // redacts MRN, SSN, DOB in place
	},
})
```
Custom steps are sandboxed and must complete within 500 ms or the pipeline escalates to a Librarian review queue.
Retry and failure modes
| Failure | Retry | Escalation |
| --- | --- | --- |
| Encryption failure | 3x with exponential backoff | Tier 2 admin on all fail |
| Handler timeout | 2x with reduced context | Manual classification queue |
| Gate 1 PII detected | None — quarantined | Librarian review |
| Graph write failure | Atomic rollback | Tier 2 escalation + incident |
| Storage quota exceeded | None — rejected | User notification |
Observability
Every ingestion event produces metrics and traces visible in the observability dashboard:
- `portal.ingestion.duration` (histogram, per stage)
- `portal.ingestion.errors` (counter, per stage)
- `portal.ingestion.classification` (counter, per class)
- `portal.ingestion.handler_tokens` (counter)
Drill-down links connect each event to its audit log entry and the resulting graph node.
Reingestion
Documents can be reingested to rerun Handler extraction (e.g., after a new domain adapter ships):
```
brrain docs reingest --workspace legal-team --since 2026-01-01
```
Reingestion is idempotent — existing graph nodes are updated in place, and audit entries record the reingest reason.
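Idempotent reingestion amounts to an upsert keyed on a stable identity such as the stage-3 content hash: an existing node is updated in place rather than duplicated. A minimal sketch (the `Node` type and `Upsert` helper are illustrative):

```go
package main

import "fmt"

// Node is a graph document node keyed by content hash.
type Node struct {
	Hash    string
	Summary string
	Version int
}

// Upsert updates an existing node in place (idempotent reingest) or
// creates it on first ingestion; the key is the document's content hash.
func Upsert(graph map[string]*Node, hash, summary string) *Node {
	if n, ok := graph[hash]; ok {
		n.Summary = summary // update in place; node identity is preserved
		n.Version++
		return n
	}
	n := &Node{Hash: hash, Summary: summary, Version: 1}
	graph[hash] = n
	return n
}

func main() {
	graph := map[string]*Node{}
	Upsert(graph, "sha256:abc", "first pass summary")
	n := Upsert(graph, "sha256:abc", "new adapter summary") // reingest
	fmt.Println(len(graph), n.Version)
}
```

Because the key is content-derived, rerunning the same reingest command any number of times leaves exactly one node per document, which is what makes the operation safe to script.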