Ingestion pipeline — bRRAIn Docs

How documents are processed: extraction, classification, graph connection, and custom pipelines.

Ingestion pipeline

Every document upload flows through the same deterministic pipeline. The pipeline guarantees exactly-once ingestion, atomic graph updates, and compliance-grade audit trails.

Pipeline overview

1. Client upload
   ↓
2. Envelope encryption (user-derived key)
   ↓
3. Metadata extraction
   ├─ Filename, size, mime type
   ├─ Content hash (SHA-256)
   └─ Preview thumbnail generation
   ↓
4. Handler summarization + entity extraction
   ├─ Key-term extraction via domain adapter
   ├─ Classification (public/internal/confidential/restricted)
   └─ POPE entity detection (people, orgs, places, events)
   ↓
5. Gate 1: policy check
   ├─ PII / credential detection
   ├─ LLM provenance validation
   └─ Rate limit check
   ↓
6. Consolidator merge (graph update prepared)
   ↓
7. Gate 2: final policy check
   ├─ Schema validation
   └─ Classification lock
   ↓
8. Atomic write
   ├─ Vault: encrypted document + metadata
   ├─ Graph: document node + entity edges
   └─ Audit: immutable event log
   ↓
9. Notifier (subscribed workspaces)

SLA per stage

| Stage | Target p95 | Typical |
| --- | --- | --- |
| Encryption | < 50 ms / 10 MB | 20 ms |
| Metadata extraction | < 200 ms | 100 ms |
| Handler summarization | < 2 s | 800 ms |
| Graph update | < 100 ms | 40 ms |
| Total | < 3 s for files ≤ 10 MB | 1.5 s |

Large files (e.g., 100 GB+ video) are streamed through the pipeline in chunks; summarization runs on extracted transcripts or keyframes.

What the Handler extracts

For each document, the Handler produces:

{
    "classification": "confidential",
    "summary": "Q1 2026 portfolio review covering hedge strategy...",
    "key_terms": ["hedge", "Q1 2026", "Vanguard", "portfolio"],
    "entities": {
        "people":        ["alice@lawfirm.io", "Bob Smith"],
        "organizations": ["Vanguard", "Acme Corp"],
        "places":        ["New York office"],
        "events":        ["Q1 2026 earnings call"]
    },
    "relationships": [
        {"type": "about", "target": "project:portfolio-2026"},
        {"type": "authored_by", "target": "user:alice"}
    ]
}

Custom pipelines

For specialized domains (legal, mining, healthcare) you can register a custom pipeline step that runs between stages 4 and 5.

Example: medical PHI redactor

portal.RegisterPipelineStep(ctx, &portal.PipelineStep{
    Name:      "phi-redactor",
    RunAfter:  portal.StageEntityExtraction,
    RunBefore: portal.StageGate1,
    Handler: func(doc *portal.Document) error {
        return redactPHI(doc) // redacts MRN, SSN, DOB in place
    },
})

Custom steps are sandboxed and must complete within 500 ms or the pipeline escalates to a Librarian review queue.

Retry and failure modes

| Failure | Retry | Escalation |
| --- | --- | --- |
| Encryption failure | 3x with exponential backoff | Tier 2 admin after final failure |
| Handler timeout | 2x with reduced context | Manual classification queue |
| Gate 1 PII detected | None — quarantined | Librarian review |
| Graph write failure | Atomic rollback | Tier 2 escalation + incident |
| Storage quota exceeded | None — rejected | User notification |

Observability

Every ingestion event produces metrics and traces visible in the observability dashboard:

  • portal.ingestion.duration (histogram, per stage)
  • portal.ingestion.errors (counter, per stage)
  • portal.ingestion.classification (counter, per class)
  • portal.ingestion.handler_tokens (counter)

Drill-down links connect each event to its audit log entry and the resulting graph node.

Reingestion

Documents can be reingested to rerun Handler extraction (e.g., after a new domain adapter ships):

brrain docs reingest --workspace legal-team --since 2026-01-01

Reingestion is idempotent — existing graph nodes are updated in place, and audit entries record the reingest reason.