
DLP without the drama: building and rolling out the system.

How I built the Python + YARA DLP engine, why most rollouts die by week two, and the phased approach that surfaces real exfiltration without breaking anyone's workflow.

Why most DLP rollouts fail

The standard DLP failure mode: deploy the tool in blocking mode, get 200 false positives on day one (marketing attaching logos to emails, finance exporting CSVs to approved destinations), watch the security team get flooded with appeals, watch the business demand the tool be turned off. Two weeks in, DLP is disabled or set to log-only indefinitely.

The root cause isn't the tool — it's deploying policy before you understand your data flows. You can't write rules for what you haven't observed.

System architecture

The DLP system is a Python engine with two detection layers: regex patterns for structured data like card numbers and ID formats, and YARA rules for content signatures that need more context. The two layers run in sequence on every file, clipboard snapshot, or outbound stream.

// engine structure
# dlp_engine.py — simplified
class DLPEngine:
    def scan(self, content, source):
        findings = []
        findings += self.regex_scan(content)  # PAN, SSN, Aadhaar...
        findings += self.yara_scan(content)   # custom rules
        if findings:
            self.enforce(findings, source)    # monitor|alert|block
        return findings

The pattern library

Regex patterns cover the structured formats that have fixed shapes. Context matching (at least 2 neighbouring fields or a label within 50 characters) cuts false positives on card numbers significantly — random strings that look like PANs without context don't fire.

// patterns.py — excerpt
PATTERNS = {
    "credit_card": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b',
    "aadhaar": r'\b[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}\b',
    "pan_india": r'\b[A-Z]{5}[0-9]{4}[A-Z]{1}\b',
    "aws_key": r'AKIA[0-9A-Z]{16}',
    "github_token": r'gh[pousr]_[A-Za-z0-9]{36}',
}
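The context rule is easy to sketch. This is a hedged illustration, not the engine's actual code: the helper name `card_hits_with_context` and the label list are mine; the 50-character window is from the rule described above.

```python
import re

# Illustrative context check: a card-number hit only counts if a label
# like "card" or "pan" appears within 50 characters of the match.
CARD_RE = re.compile(r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b')
LABELS = re.compile(r'card|pan|visa|mastercard', re.IGNORECASE)

def card_hits_with_context(text, window=50):
    hits = []
    for m in CARD_RE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        if LABELS.search(text[start:end]):
            hits.append(m.group())
    return hits

# A bare 16-digit string with no nearby label does not fire:
print(card_hits_with_context("id 4111111111111111 in row 7"))   # []
print(card_hits_with_context("card no: 4111111111111111"))      # ['4111111111111111']
```

The window is deliberately applied per match rather than per document, so one labelled card number in a file does not validate every other digit run in it.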

YARA for custom signatures

YARA lets analysts write declarative, human-readable rules that describe file content, not just isolated string patterns. This is useful for client-specific classifications: internal project names, contract language, proprietary data structures.

// sample.yar
rule ConfidentialContract {
    strings:
        $a = "STRICTLY CONFIDENTIAL" nocase
        $b = "Non-Disclosure Agreement" nocase
        $c = /Contract No\.\s*[A-Z]{2}-\d{6}/
    condition:
        2 of them
}
</rule>
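In the engine this rule runs through YARA itself, but the "2 of them" condition is simple enough to sketch in plain Python. The function name and threshold wiring below are illustrative, not the engine's API:

```python
import re

# Re-implementation of ConfidentialContract's logic in plain Python:
# fire when at least 2 of the 3 signatures appear in the text.
SIGNATURES = [
    re.compile(r'STRICTLY CONFIDENTIAL', re.IGNORECASE),     # $a, nocase
    re.compile(r'Non-Disclosure Agreement', re.IGNORECASE),  # $b, nocase
    re.compile(r'Contract No\.\s*[A-Z]{2}-\d{6}'),           # $c
]

def confidential_contract(text, threshold=2):
    matched = sum(1 for sig in SIGNATURES if sig.search(text))
    return matched >= threshold  # YARA's "2 of them"

print(confidential_contract("STRICTLY CONFIDENTIAL. Contract No. IN-123456"))  # True
print(confidential_contract("Just a routine memo"))                            # False
```

The "2 of them" threshold is what keeps this rule quiet: a stray "confidential" watermark alone never fires.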

The phased rollout

// week 1

Monitor only

All rules active, no blocking, no alerting. Log everything. Understand what's moving where before writing any enforcement policy.

// week 2

Alert on critical

Secrets (API keys, tokens) and financial data move to alert mode. Teams are notified, not blocked. Build the whitelist from legitimate flows.

// week 3+

Block on high-confidence

Only the highest-confidence rules (verified true-positive rate >95%) enter blocking mode. Everything else stays at alert until the baseline is solid.
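The three phases above reduce to a per-category mode table that enforce() dispatches on. A minimal sketch, assuming a mode table and return values of my own invention (the real engine's interface may differ):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dlp")

# Hypothetical per-category modes as they might look mid-rollout:
# week 1 everything is "monitor"; categories graduate upward over time.
RULE_MODES = {
    "aws_key": "block",       # high-confidence, verified >95% true-positive
    "credit_card": "alert",
    "pan_india": "monitor",
}

def enforce(category, source):
    mode = RULE_MODES.get(category, "monitor")  # unknown rules observe only
    if mode == "block":
        log.warning("BLOCK %s from %s", category, source)
        return "blocked"
    if mode == "alert":
        log.info("ALERT %s from %s (user notified, not blocked)", category, source)
        return "alerted"
    log.info("MONITOR %s from %s", category, source)
    return "logged"

print(enforce("aws_key", "clipboard"))   # blocked
print(enforce("credit_card", "email"))   # alerted
```

Defaulting unknown categories to "monitor" matches the rollout's bias: when in doubt, observe rather than block.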

The week-two trap: The temptation is to move everything to block after seeing the monitor data. Resist. You'll block a legitimate business flow you didn't know about and spend a week apologising. Build the whitelist first.

Managing false positives

False positive management is the real work. The engine supports per-path, per-user, and per-pattern exemptions. An exemption is not a permanent bypass — it's logged and reviewed monthly.

// exemptions.json
{
  "credit_card": {
    "paths": ["/finance/exports/approved/"],
    "users": ["finance_team"],
    "expires": "2026-06-01",
    "approved_by": "ciso"
  }
}
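A hedged sketch of how an exemption like this might be evaluated at scan time. `is_exempt`, its argument shapes, and the requirement that both path and user match are my assumptions, not the engine's documented semantics:

```python
from datetime import date

# Exemptions as loaded from exemptions.json (inlined here for the sketch).
EXEMPTIONS = {
    "credit_card": {
        "paths": ["/finance/exports/approved/"],
        "users": ["finance_team"],
        "expires": "2026-06-01",
        "approved_by": "ciso",
    }
}

def is_exempt(pattern, path, user, today=None):
    rule = EXEMPTIONS.get(pattern)
    if rule is None:
        return False
    # Expired exemptions fall back to normal enforcement.
    today = today or date.today()
    if today > date.fromisoformat(rule["expires"]):
        return False
    path_ok = any(path.startswith(p) for p in rule["paths"])
    user_ok = user in rule["users"]
    return path_ok and user_ok  # assumption: both must match

print(is_exempt("credit_card", "/finance/exports/approved/q3.csv",
                "finance_team", today=date(2026, 1, 15)))  # True
print(is_exempt("credit_card", "/tmp/dump.csv",
                "finance_team", today=date(2026, 1, 15)))  # False
```

The expiry check is what keeps exemptions honest: a lapsed entry silently reverts to enforcement instead of living forever.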

After 8 weeks, the engine was running in mixed mode: 6 rule categories in block, 12 in alert, with a stable false-positive rate under 3%. The project is linked on the projects page.
