Self-hosted PostgreSQL data masking

Give every engineer production-realistic data. Leak nothing.

PrivaCI is an open-source engine for PostgreSQL data masking. It streams your PostgreSQL database into a safe, referentially intact copy for staging, CI, and demos, entirely inside your own VPC. No PII ever leaves your network.

Apache-grade engineering, Elastic License 2.0. Read the docs or browse the source.

0 bytes of PII leave your VPC: no phone-home, runs offline
100 GB+ databases stream in bounded memory (one batch in RAM)
100% referential integrity: deterministic masking keeps FKs intact
Every column recorded in a tamper-aware audit log on the target

The problem

Real customer data does not belong in staging

Every team needs realistic data to test against — and most shortcuts to get it create risk you cannot take back.

Hand-rolled scripts rot

pg_dump plus sed/UPDATE scripts break the moment the schema changes, silently leak new PII columns, and have no owner.

Broken foreign keys break tests

Naive masking randomizes each table independently, so joins dangle and staging stops mirroring production behavior.

Copying prod to staging is a breach waiting to happen

Real customer data in lower environments is among the most common avoidable compliance findings, and among the hardest to undo.

SaaS maskers want your data

Most managed tools exfiltrate rows to their cloud to mask them. That is the exact thing you are trying to prevent.

How it works

Source database in, safe database out — in five passes

01 Introspect

Read the source catalog (tables, columns, foreign keys, partitions, and implied keys) straight from pg_catalog.

02 Replicate

Recreate the DDL on an empty target in dependency order, deferring constraints to break FK cycles.

03 Stream & mask

Stream rows in COPY binary format through an in-memory two-tier masking pipeline and load them in foreign-key order.

04 Checkpoint

Commit a per-table checkpoint every batch so an interrupted run resumes exactly where it stopped.

05 Audit

Write every masking decision to the _privaci schema on the target for compliance review.

See it run

One command in your pipeline

Declare PII columns in YAML, dry-run against your schema, preview masked samples, then stream a referentially-intact copy — with verification and a signed audit trail on the way out.

[Figure: PrivaCI terminal session showing dry-run, preview of masked samples, the mask job, integrity verification, and export of a signed compliance report]

Prefer Docker? Run docker run --rm ghcr.io/boundarylogic/privaci:latest --help, or generate a CI workflow with privaci generate-ci.

Open source, auditable, yours to run

The whole engine is free. Forever.

Streaming, foreign-key integrity, PII auto-detection, deterministic masking, crash-safe resume, and the audit log are all open source under ELv2. You can read every line.

Deterministic, FK-safe faking

Masking is a pure function of (config, salt, row). The same input always maps to the same output, so related rows stay consistent across every table.
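To make the idea concrete, here is a minimal sketch of salt-keyed deterministic masking. It is not PrivaCI's actual implementation: the function name mask_email, the HMAC-SHA256 construction, and the sample tables are all assumptions chosen for illustration. It shows that when a parent column and a child column are masked with the same keyed function, every child row still joins to its parent.

```python
import hashlib
import hmac


def mask_email(value: str, salt: bytes) -> str:
    """Map an email to a stable fake address derived from (salt, value).

    Hypothetical helper: a keyed hash means the mapping cannot be
    recomputed without the salt. Truncating to 12 hex characters keeps
    the output short; a real tool would need to consider collisions.
    """
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


salt = b"demo-salt-not-for-production"

# Illustrative data: orders reference customers by email.
customers = [{"email": "alice@corp.test"}, {"email": "bob@corp.test"}]
orders = [
    {"id": 1, "customer_email": "alice@corp.test"},
    {"id": 2, "customer_email": "bob@corp.test"},
    {"id": 3, "customer_email": "alice@corp.test"},
]

# Mask each table independently, using the same function and salt.
masked_customers = {mask_email(c["email"], salt) for c in customers}
masked_orders = [
    {**o, "customer_email": mask_email(o["customer_email"], salt)} for o in orders
]

# Every masked order still points at a masked customer.
dangling = [o["id"] for o in masked_orders if o["customer_email"] not in masked_customers]
print(dangling)  # []
```

Changing the salt changes every output, which is one reason a run should refuse to continue if the salt differs from the one it started with.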

Constant-memory streaming

COPY TO STDOUT (binary) → mask → COPY FROM STDIN, both legs concurrent. At most one batch (default 10,000 rows) is ever in RAM.
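The bounded-memory property can be sketched with a plain Python generator. This is an illustration only: stream_masked, the in-memory source, and the lambda standing in for a masking rule are hypothetical, and a real pipeline would read from COPY TO STDOUT and write to COPY FROM STDIN rather than iterate Python tuples.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

Row = tuple


def stream_masked(
    rows: Iterable[Row],
    mask: Callable[[Row], Row],
    batch_size: int = 10_000,
) -> Iterator[list[Row]]:
    """Yield masked rows one batch at a time.

    Only the current batch is materialized; the source iterator is
    consumed lazily, so memory does not grow with table size.
    """
    it = iter(rows)
    while True:
        batch = [mask(row) for row in islice(it, batch_size)]
        if not batch:
            return
        yield batch


def source() -> Iterator[Row]:
    """Stand-in for a streamed table: 25 rows produced on demand."""
    for i in range(25):
        yield (i, f"user{i}@corp.test")


sizes = []
for batch in stream_masked(source(), lambda r: (r[0], "masked@example.com"), batch_size=10):
    sizes.append(len(batch))  # a real loader would write the batch to the target here

print(sizes)  # [10, 10, 5]
```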

PII auto-detection

Scans column names, types, and pg_stats to classify PII with high/medium/low confidence. Ships patterns for email, SSN, phone, names, cards, secrets, and freeform text.
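As a rough illustration of name-and-type classification, the sketch below scores a column from its name and declared type. The patterns, the TEXT_TYPES set, and the classify function are assumptions invented for this example, not PrivaCI's shipped pattern library, and the pg_stats signal is left out entirely.

```python
import re
from typing import Optional

# Hypothetical patterns: (regex on column name, PII kind, confidence).
NAME_PATTERNS = [
    (re.compile(r"(^|_)e_?mail(_|$)"), "email", "high"),
    (re.compile(r"(^|_)(ssn|social_security)(_|$)"), "ssn", "high"),
    (re.compile(r"(^|_)(phone|mobile|tel)(_|$)"), "phone", "medium"),
    (re.compile(r"(^|_)(first|last|full)?_?name(_|$)"), "name", "medium"),
    (re.compile(r"(^|_)(notes?|comments?|description)(_|$)"), "freeform", "low"),
]

# Simplification: this sketch only considers text-like columns.
TEXT_TYPES = {"text", "varchar", "character varying", "citext"}


def classify(column: str, pg_type: str) -> Optional[tuple[str, str]]:
    """Return (kind, confidence) for a likely PII column, else None."""
    if pg_type not in TEXT_TYPES:
        return None
    name = column.lower()
    for pattern, kind, confidence in NAME_PATTERNS:
        if pattern.search(name):
            return kind, confidence
    return None


for col, typ in [
    ("customer_email", "text"),
    ("last_name", "varchar"),
    ("filename", "text"),
    ("created_at", "timestamptz"),
]:
    print(col, classify(col, typ))
```

Anchoring the patterns on underscores or string boundaries is a deliberate choice here: it keeps a column such as filename from matching the name pattern.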

Two-tier masking pipeline

L1 deterministic rules plus L2 local spaCy NER for freeform text, all driven by one YAML file. Pattern libraries cover email, SSN, phone, cards, secrets, and more.

Dry-run, verify, and drift gates

dry-run --report produces a reviewable plan before any write. verify audits a completed run without re-reading PII. detect-drift (commercial) blocks CI when the schema changes.

Referential integrity by design

Topological table ordering loads parents before children; cycles are broken with deferred constraints. Sequences and identity columns are re-synced.
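The parent-before-child ordering can be sketched with Python's standard graphlib module. The table names and the load_order helper are hypothetical; the sketch only detects a cycle and reports it, whereas the engine described above breaks cycles with deferred constraints.

```python
from graphlib import CycleError, TopologicalSorter

# Illustrative FK graph: each table maps to the set of tables it references.
fk_parents = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}


def load_order(parents: dict[str, set[str]]) -> list[str]:
    """Return tables so that every parent precedes the tables referencing it."""
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"FK cycle needs deferred constraints: {exc.args[1]}") from exc


order = load_order(fk_parents)
print(order.index("customers") < order.index("orders") < order.index("order_items"))  # True
print(order.index("products") < order.index("order_items"))  # True
```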

Crash-safe resume

Per-batch checkpoints mean privaci resume continues from the last committed row — and refuses to resume if the source schema, config, or salt drifted.
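One way to implement a refuse-on-drift check is to store a fingerprint of the run's inputs and compare it at resume time. The sketch below is an assumption about how such a guard could work, not a description of PrivaCI internals; run_fingerprint, can_resume, and the sample schema and config are hypothetical.

```python
import hashlib
import json


def run_fingerprint(schema: dict, config: dict, salt: bytes) -> str:
    """Hash the three inputs that must not change between a run and its resume."""
    payload = json.dumps(
        {
            "schema": schema,
            "config": config,
            # Store only a digest of the salt so the secret itself is never serialized.
            "salt_sha256": hashlib.sha256(salt).hexdigest(),
        },
        sort_keys=True,  # stable key order makes the hash reproducible
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_resume(stored: str, schema: dict, config: dict, salt: bytes) -> bool:
    """Allow a resume only if the current inputs match the stored fingerprint."""
    return stored == run_fingerprint(schema, config, salt)


schema = {"users": {"id": "bigint", "email": "text"}}
config = {"users.email": "fake_email"}
salt = b"demo-salt-not-for-production"
stored = run_fingerprint(schema, config, salt)

drifted_schema = {"users": {"id": "bigint", "email": "text", "phone": "text"}}
print(can_resume(stored, schema, config, salt))          # True
print(can_resume(stored, drifted_schema, config, salt))  # False
```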

Audit trail for compliance

_privaci.runs and _privaci.audit_log record every run and every column decision, with stable run identity for evidence.

Secure by default

Salt is required at startup (no silent default), PII never appears in logs, nothing is written to disk mid-run, and all SQL is parameterized.

Drop into any CI

generate-ci emits ready-to-commit GitHub Actions, GitLab CI, or Kubernetes CronJob workflows with least-privilege secrets baked in.

How it compares

Why teams stop hand-rolling this

Capability                          pg_dump + scripts   SaaS maskers   PrivaCI OSS   PrivaCI Commercial
Runs entirely in your VPC           Partial             No             Yes           Yes
Preserves foreign keys              No                  Partial        Yes           Yes
Streams 100 GB+ in bounded memory   No                  Partial        Yes           Yes
Auto-detects PII columns            No                  Yes            Yes           Yes
Audit log of every change           No                  Partial        Yes           Yes
Crash-safe resume                   No                  No             Yes           Yes
Schema-drift detection              No                  Partial        No            Yes
FK-aware data subsetting            No                  Partial        No            Yes
JSONB path masking                  No                  Partial        No            Yes
Signed compliance reports           No                  Partial        No            Yes
CI preview & policy diff            No                  Partial        No            Yes

Commercial

Need signed reports, subsetting, and CI gates?

Commercial v1 plugs into the same engine and adds tamper-evident compliance reports, FK-aware subsetting, JSONB path masking, schema-drift detection, and CI preview — billed through AWS Marketplace.

Answers

Frequently asked questions

Is it safe to run in our own VPC?

Yes. The engine runs entirely inside your network and never sends source data or PII to any external service. It does not phone home and works fully offline. The source is auditable under the Elastic License 2.0.

Does it preserve foreign keys and referential integrity?

Yes. Masking is deterministic for a given salt, so the same input always maps to the same output. Foreign keys, composite keys, and cross-table relationships stay consistent, and sequences are re-synced.

How is this different from pg_dump plus some scripts?

PrivaCI streams large databases with bounded memory, auto-detects PII, preserves referential integrity, resumes after a crash, and writes an audit log of exactly what was masked — with no bespoke scripts to maintain as your schema evolves.

How big a database can it handle?

It is built for 100 GB+ sources. COPY-binary streaming keeps at most one batch (default 10,000 rows) in memory, so memory stays flat regardless of table size.

What happens if a run is interrupted?

Every batch commits a checkpoint inside the same transaction as the data write. privaci resume continues from the last committed row and refuses to resume if the source schema, config, or salt changed.
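The reason for putting the checkpoint in the same transaction as the data can be shown with a small, self-contained sketch. It uses SQLite from the Python standard library purely as a stand-in for the PostgreSQL target; the table names and the write_batch helper are hypothetical. A simulated crash rolls back both the rows and the checkpoint, so the two can never disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_masked (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE checkpoint (table_name TEXT PRIMARY KEY, last_id INTEGER)")
conn.commit()


def write_batch(conn: sqlite3.Connection, rows: list[tuple], fail: bool = False) -> None:
    """Write one batch and its checkpoint atomically.

    `with conn` commits if the block succeeds and rolls back if it raises,
    so the data and the checkpoint are either both stored or both discarded.
    """
    with conn:
        conn.executemany("INSERT INTO users_masked (id, email) VALUES (?, ?)", rows)
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (table_name, last_id) VALUES ('users', ?)",
            (rows[-1][0],),
        )


write_batch(conn, [(1, "a@example.com"), (2, "b@example.com")])
try:
    write_batch(conn, [(3, "c@example.com"), (4, "d@example.com")], fail=True)
except RuntimeError:
    pass  # the interrupted batch left nothing behind

rows_written = conn.execute("SELECT COUNT(*) FROM users_masked").fetchone()[0]
last_id = conn.execute(
    "SELECT last_id FROM checkpoint WHERE table_name = 'users'"
).fetchone()[0]
print(rows_written, last_id)  # 2 2
```

A resume would then start from the row after last_id, knowing that no partially written batch exists on the target.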

How does it find PII we didn't configure?

Auto-detect inspects column names, types, and pg_stats against a built-in pattern library, scoring each column high/medium/low. Run dry-run --report for a reviewable plan, or --strict-autodetect to fail CI on any uncovered PII column.

What is open source versus commercial?

The full masking engine — streaming, FK integrity, auto-detect, L1/L2 masking, dry-run, verify, resume, and the audit log — is open source under ELv2. Commercial v1 adds signed compliance reports, drift detection, FK-aware subsetting, JSONB path masking, CI preview, notifiers, and Marketplace entitlement.

What database engines are supported?

PostgreSQL today, including partitioned tables, identity/serial columns, and deferred constraints. The architecture is Postgres-native by design.