Why most DLP rollouts fail
The standard DLP failure mode: deploy the tool in blocking mode, get 200 false positives on day one (marketing attaching logos to emails, finance exporting CSVs to approved destinations), watch the security team get flooded with appeals, watch the business demand the tool be turned off. Two weeks in, DLP is disabled or set to log-only indefinitely.
The root cause isn't the tool — it's deploying policy before you understand your data flows. You can't write rules for what you haven't observed.
System architecture
The DLP system is a Python engine with two detection layers: regex patterns for structured data like card numbers and ID formats, and YARA rules for content signatures that need more context. The two layers run in sequence on every file, clipboard snapshot, or outbound stream.
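The two-layer sequence can be sketched as follows. This is a minimal illustration, not the engine's real API: the rule names, patterns, and the stand-in `yara_layer` function are all hypothetical.

```python
import re

# Layer 1: regex patterns for structured formats (illustrative rules only).
REGEX_LAYER = {
    "pan": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{24,}\b"),
}

def yara_layer(content: str) -> list[str]:
    # Stand-in for compiled YARA rules; a real engine would call
    # rules.match() from yara-python here. "PROJECT-ORION" is a made-up
    # client-specific signature.
    hits = []
    if "PROJECT-ORION" in content:
        hits.append("client_project_name")
    return hits

def scan(content: str) -> list[str]:
    # Run both layers in sequence on every piece of content.
    findings = [name for name, rx in REGEX_LAYER.items() if rx.search(content)]
    findings.extend(yara_layer(content))
    return findings
```

Running the layers in sequence keeps the cheap regex pass first; the more context-aware signature pass only adds findings on top of it.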
The pattern library
Regex patterns cover the structured formats that have fixed shapes. Context matching (requiring at least 2 neighbouring fields or a label within 50 characters of the match) cuts false positives on card numbers significantly: random strings that look like PANs but have no supporting context don't fire.
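The label-within-50-characters heuristic can be sketched like this. The label list and the simplified PAN regex are assumptions for illustration, not the engine's actual pattern library:

```python
import re

# Simplified PAN candidate pattern (real rules would be stricter).
PAN_RE = re.compile(r"\b\d{13,16}\b")
# Labels that count as supporting context (illustrative list).
LABEL_RE = re.compile(r"card|pan|credit|visa|mastercard", re.IGNORECASE)

def pan_with_context(text: str, window: int = 50) -> list[str]:
    """Return PAN candidates that have a label within `window` characters."""
    hits = []
    for m in PAN_RE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        if LABEL_RE.search(text[start:end]):
            hits.append(m.group())
    return hits
```

A bare 16-digit string in a log file stays quiet; the same string next to the word "card" fires.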
YARA for custom signatures
YARA lets analysts describe file content in a readable, domain-specific rule language — string and hex patterns combined with boolean conditions — rather than raw regex alone. This is useful for client-specific classifications: internal project names, contract language, proprietary data structures.
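A hypothetical rule for a client-specific classification might look like this (the project name and marker string are invented examples):

```yara
rule client_internal_project
{
    meta:
        description = "Hypothetical client-specific content signature"
    strings:
        $name = "PROJECT-ORION" nocase
        $marker = "Internal - Do Not Distribute" nocase
    condition:
        any of them
}
```

The rule reads close to its intent: fire if either the project codename or the distribution marker appears anywhere in the file.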
The phased rollout
Monitor only
All rules active, no blocking, no alerting. Log everything. Understand what's moving where before writing any enforcement policy.
Alert on critical
Secrets (API keys, tokens) and financial data move to alert mode. Teams are notified, not blocked. Build the whitelist from legitimate flows.
Block on high-confidence
Only the highest-confidence rules (verified true-positive rate >95%) enter blocking mode. Everything else stays at alert until the baseline is solid.
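The three phases above amount to a per-category enforcement map that tightens over time. A minimal sketch, with made-up category names and an assumed default-to-monitor policy:

```python
from enum import Enum

class Mode(Enum):
    MONITOR = "monitor"   # phase 1: log only
    ALERT = "alert"       # phase 2: notify, don't block
    BLOCK = "block"       # phase 3: verified TP rate > 95% only

# Example policy partway through the rollout (categories are illustrative).
POLICY = {
    "api_keys": Mode.BLOCK,
    "pan": Mode.ALERT,
    "internal_docs": Mode.MONITOR,
}

def enforce(category: str) -> Mode:
    # Unknown categories default to monitor: observe before enforcing.
    return POLICY.get(category, Mode.MONITOR)
```

Defaulting unknown categories to monitor rather than block keeps a new rule from repeating the day-one failure mode described at the top.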
Managing false positives
False positive management is the real work. The engine supports per-path, per-user, and per-pattern exemptions. An exemption is not a permanent bypass — it's logged and reviewed monthly.
After 8 weeks, the engine was running in mixed mode: 6 rule categories in block, 12 in alert, with a stable false-positive rate under 3%. The project is linked on the projects page.