Self-hosted PostgreSQL data masking
PrivaCI is an open-source engine for PostgreSQL data masking. It streams your PostgreSQL database into a safe, referentially intact copy for staging, CI, and demos, entirely inside your own VPC. No PII ever leaves your network.
Apache-grade engineering, Elastic License 2.0. Read the docs or browse the source.
The problem
Every team needs realistic data to test against — and most shortcuts to get it create risk you cannot take back.
pg_dump plus sed/UPDATE scripts break the moment the schema changes, silently leak new PII columns, and nobody owns them.
Naive masking randomizes each table independently, so joins dangle and staging stops mirroring production behavior.
Real customer data in lower environments is the most common avoidable compliance finding — and the hardest to undo.
Most managed tools exfiltrate rows to their cloud to mask them. That is the exact thing you are trying to prevent.
How it works
Read the source catalog — tables, columns, foreign keys, partitions, and implied keys — straight from pg_catalog.
Recreate the DDL on an empty target in dependency order, deferring constraints to break FK cycles.
COPY-binary rows through an in-memory three-tier masking pipeline and load them in foreign-key order.
Commit a per-table checkpoint every batch so an interrupted run resumes exactly where it stopped.
Write every masking decision to the _privaci schema on the target for compliance review.
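The dependency-ordering step can be illustrated with a small sketch. This is not PrivaCI's implementation; it only shows one way to order tables parents-first and pick foreign keys to defer when a cycle appears, using Python's standard graphlib module. The function name load_order and the input shape (child table mapped to the parent tables it references) are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def load_order(foreign_keys):
    """Return (tables in load order, FK edges to defer as (child, parent)).

    foreign_keys maps each child table to the set of parent tables it
    references. Hypothetical input shape, chosen for this sketch only.
    """
    # Copy so the caller's mapping is left untouched.
    deps = {child: set(parents) for child, parents in foreign_keys.items()}
    deferred = []

    # A self-referencing table (e.g. a manager_id column) always needs deferral.
    for table, parents in deps.items():
        if table in parents:
            parents.discard(table)
            deferred.append((table, table))

    while True:
        try:
            # static_order() yields every table after all of its parents.
            return list(TopologicalSorter(deps).static_order()), deferred
        except CycleError as exc:
            cycle = exc.args[1]  # nodes in the cycle; first and last are equal
            a, b = cycle[0], cycle[1]
            # Drop one edge of the cycle, whichever direction it is stored in,
            # and remember it so the constraint can be deferred at load time.
            if a in deps.get(b, set()):
                deps[b].discard(a)
                deferred.append((b, a))
            else:
                deps[a].discard(b)
                deferred.append((a, b))
```

Calling load_order({"orders": {"customers"}, "customers": {"orders"}}) returns both tables plus one deferred edge. In a real run, that edge would correspond to a constraint checked at commit rather than per row.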
See it run
Declare PII columns in YAML, dry-run against your schema, preview masked samples, then stream a referentially-intact copy — with verification and a signed audit trail on the way out.
Prefer Docker? Run docker run --rm ghcr.io/boundarylogic/privaci:latest --help, or generate a CI workflow with privaci generate-ci.
Open source, auditable, yours to run
Streaming, foreign-key integrity, PII auto-detection, deterministic masking, crash-safe resume, and the audit log are all open source under ELv2. You can read every line.
Masking is a pure function of (config, salt, row). The same input always maps to the same output, so related rows stay consistent across every table.
COPY TO STDOUT (binary) → mask → COPY FROM STDIN, with both legs running concurrently. At most one batch (default 10,000 rows) is ever in RAM.
Scans column names, types, and pg_stats to classify PII with high/medium/low confidence. Ships patterns for email, SSN, phone, names, cards, secrets, and freeform text.
L1 deterministic rules plus L2 local spaCy NER for freeform text, all driven by one YAML file. Pattern libraries cover email, SSN, phone, cards, secrets, and more.
dry-run --report produces a reviewable plan before any write. verify audits a completed run without re-reading PII. detect-drift (commercial) blocks CI when the schema changes.
Topological table ordering loads parents before children; cycles are broken with deferred constraints. Sequences and identity columns are re-synced.
Per-batch checkpoints mean privaci resume continues from the last committed row — and refuses to resume if the source schema, config, or salt drifted.
_privaci.runs and _privaci.audit_log record every run and every column decision, with stable run identity for evidence.
Salt is required at startup (no silent default), PII never appears in logs, nothing is written to disk mid-run, and all SQL is parameterized.
generate-ci emits ready-to-commit GitHub Actions, GitLab CI, or Kubernetes CronJob workflows with least-privilege secrets baked in.
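To make the deterministic masking property described above concrete, here is a minimal sketch of a salted, keyed-hash mask. It assumes HMAC-SHA256 as the keyed function and an example.com pseudonym format; both are illustrative choices, not PrivaCI's documented algorithm, and mask_email is a hypothetical name.

```python
import hashlib
import hmac


def mask_email(value: str, salt: bytes) -> str:
    """Map an email address to a stable pseudonym.

    The output depends only on (salt, value), so the same address masks to
    the same pseudonym in every table that stores it.
    """
    # Lowercasing is an assumption of this sketch, so that "A@x.com" and
    # "a@x.com" are treated as the same identity.
    digest = hmac.new(salt, value.lower().encode("utf-8"), hashlib.sha256)
    return f"user_{digest.hexdigest()[:12]}@example.com"


salt = b"example-salt-change-me"  # placeholder; a real salt is a secret
users_row = mask_email("Ada@Example.org", salt)    # e.g. users.email
orders_row = mask_email("ada@example.org", salt)   # e.g. orders.contact_email
```

Because both calls use the same salt and normalized input, users_row and orders_row are identical, which is what keeps joins on masked values intact. Changing the salt changes every pseudonym.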
How it compares
| Capability | pg_dump + scripts | SaaS maskers | PrivaCI OSS | PrivaCI Commercial |
|---|---|---|---|---|
| Runs entirely in your VPC | Partial | No | Yes | Yes |
| Preserves foreign keys | No | Partial | Yes | Yes |
| Streams 100 GB+ in bounded memory | No | Partial | Yes | Yes |
| Auto-detects PII columns | No | Yes | Yes | Yes |
| Audit log of every change | No | Partial | Yes | Yes |
| Crash-safe resume | No | No | Yes | Yes |
| Schema-drift detection | No | Partial | No | Yes |
| FK-aware data subsetting | No | Partial | No | Yes |
| JSONB path masking | No | Partial | No | Yes |
| Signed compliance reports | No | Partial | No | Yes |
| CI preview & policy diff | No | Partial | No | Yes |
Commercial v1 plugs into the same engine and adds tamper-evident compliance reports, FK-aware subsetting, JSONB path masking, schema-drift detection, and CI preview — billed through AWS Marketplace.
Answers
Yes. The engine runs entirely inside your network and never sends source data or PII to any external service. It does not phone home and works fully offline. The source is auditable under the Elastic License 2.0.
Yes. Masking is deterministic from a salt, so the same input always maps to the same output. Foreign keys, composite keys, and cross-table relationships stay consistent, and sequences are re-synced.
PrivaCI streams large databases with bounded memory, auto-detects PII, preserves referential integrity, resumes after a crash, and writes an audit log of exactly what was masked — with no bespoke scripts to maintain as your schema evolves.
It is built for 100 GB+ sources. COPY-binary streaming keeps at most one batch (default 10,000 rows) in memory, so memory stays flat regardless of table size.
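The bounded-memory claim follows from processing one fixed-size batch at a time. The sketch below shows that pattern over any row iterator. The names batches and stream, and the mask and write callbacks, are hypothetical stand-ins for the COPY legs, which are not reproduced here.

```python
from itertools import islice


def batches(rows, size=10_000):
    """Yield lists of at most `size` rows from any iterable, lazily."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def stream(rows, mask, write, size=10_000):
    """Mask and write rows batch by batch; return the number of rows written.

    Only the current batch is held in memory, so peak usage depends on
    `size` and row width, not on the total number of rows.
    """
    total = 0
    for batch in batches(rows, size):
        write([mask(row) for row in batch])
        total += len(batch)
    return total
```

With a generator as the source, for example stream((i for i in range(25)), mask=str, write=print, size=10), the writer receives batches of 10, 10, and 5 rows.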
Every batch commits a checkpoint inside the same transaction as the data write. privaci resume continues from the last committed row and refuses to resume if the source schema, config, or salt changed.
Auto-detect inspects column names, types, and pg_stats against a built-in pattern library, scoring each column high/medium/low. Run dry-run --report for a reviewable plan, or --strict-autodetect to fail CI on any uncovered PII column.
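As a rough illustration of confidence scoring, the sketch below combines a column-name hint with a check on sample values. The patterns, the scoring rule (name and values agree means high, values only means medium, name only means low), and the use of a plain list of samples in place of pg_stats are all assumptions for this example; the built-in pattern library is not reproduced here.

```python
import re

# Hypothetical patterns: (regex for column names, regex for whole values).
PATTERNS = {
    "email": (re.compile(r"e[-_]?mail", re.I),
              re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    "ssn": (re.compile(r"(^|_)ssn($|_)|social_security", re.I),
            re.compile(r"\d{3}-\d{2}-\d{4}")),
    "phone": (re.compile(r"phone|mobile", re.I),
              re.compile(r"\+?[\d\s().-]{7,}")),
}
TEXT_TYPES = {"text", "varchar", "bpchar", "citext"}
RANK = {"low": 1, "medium": 2, "high": 3}


def classify(column, pg_type, samples):
    """Return (kind, confidence) for a column, or None if nothing matches."""
    if pg_type not in TEXT_TYPES:
        return None  # this sketch only considers text-like columns
    best = None
    for kind, (name_re, value_re) in PATTERNS.items():
        name_hit = name_re.search(column) is not None
        value_hit = bool(samples) and all(value_re.fullmatch(s) for s in samples)
        if name_hit and value_hit:
            level = "high"
        elif value_hit:
            level = "medium"
        elif name_hit:
            level = "low"
        else:
            continue
        if best is None or RANK[level] > RANK[best[1]]:
            best = (kind, level)
    return best
```

Under these rules, a column named contact whose samples all look like email addresses scores medium. A column named phone_number with no samples scores low.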
The full masking engine — streaming, FK integrity, auto-detect, L1/L2 masking, dry-run, verify, resume, and the audit log — is open source under ELv2. Commercial v1 adds signed compliance reports, drift detection, FK-aware subsetting, JSONB path masking, CI preview, notifiers, and Marketplace entitlement.
PostgreSQL today, including partitioned tables, identity/serial columns, and deferred constraints. The architecture is Postgres-native by design.