Discovering Every Piece of PII Across Your Modern Data Stack
A mid-stage SaaS company with 50 microservices, three cloud providers, and a data warehouse built over four years will typically store personally identifiable information in more than 200 distinct locations. Most privacy teams know about fewer than half of them. The rest sit in logging pipelines, analytics caches, staging databases, and third-party integrations that no single engineer fully maps. This is not a niche edge case. According to IBM, the average enterprise data breach in 2024 cost $4.88 million, with shadow data — data organizations did not know existed — being a significant contributing factor.
Personally identifiable information is the atomic unit of privacy engineering. Every consent decision, every data subject access request, every retention policy, and every AI training pipeline depends on knowing exactly where PII lives, what type it is, and how it flows. When that knowledge is incomplete, everything downstream breaks — not loudly, not immediately, but structurally.
This article examines why managing personally identifiable information is fundamentally an infrastructure question, what specifically goes wrong when organizations treat it as anything less, and what becomes possible when that awareness is built into the data stack itself.
What Is Personally Identifiable Information (PII)?
PII is any data element that can identify a specific individual, either on its own or when combined with other available information. The U.S. Office of Management and Budget defines it as information that can be used to distinguish or trace an individual's identity, either alone or when combined with other information that is linked or linkable to a specific individual.
That definition matters because it extends well beyond the obvious. A name qualifies. A Social Security number qualifies. But so does a device fingerprint, an IP address paired with a timestamp, or a behavioral pattern that narrows identification to a single user within a dataset.
What PII Includes
Direct identifiers include full names, email addresses, phone numbers, government-issued ID numbers, biometric records, and financial account numbers — data points that are identifiable in isolation.
Indirect identifiers become personally identifiable through combination: ZIP codes, dates of birth, job titles, device IDs, geolocation coordinates, and browsing histories. Research by Latanya Sweeney has shown that roughly 87% of the U.S. population can be uniquely identified using only ZIP code, date of birth, and gender. Three fields, none of which looks personally identifiable on its own.
Sensitive PII carries elevated regulatory and ethical weight: health records, racial or ethnic origin, sexual orientation, religious beliefs, genetic data, and precise geolocation. Under the GDPR, these categories trigger additional processing requirements and demand explicit consent or a narrow set of legal bases.
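The re-identification risk of combined quasi-identifiers is easy to demonstrate. The sketch below, using a small hypothetical dataset, measures what fraction of records a given field combination singles out uniquely:

```python
from collections import Counter

# Hypothetical sample records: each field looks harmless in isolation.
records = [
    {"zip": "10001", "dob": "1987-03-14", "gender": "F"},
    {"zip": "10001", "dob": "1987-03-14", "gender": "M"},
    {"zip": "94110", "dob": "1990-07-02", "gender": "F"},
    {"zip": "94110", "dob": "1990-07-02", "gender": "F"},
    {"zip": "60614", "dob": "1975-11-30", "gender": "M"},
]

def unique_combinations(rows, quasi_identifiers):
    """Return the fraction of rows whose quasi-identifier tuple is unique,
    i.e. rows that the field combination re-identifies within this dataset."""
    counts = Counter(tuple(r[f] for f in quasi_identifiers) for r in rows)
    unique = sum(
        1 for r in rows
        if counts[tuple(r[f] for f in quasi_identifiers)] == 1
    )
    return unique / len(rows)

# No single field is identifying, but the combination singles out 3 of 5 rows.
print(unique_combinations(records, ["gender"]))
print(unique_combinations(records, ["zip", "dob", "gender"]))
```

The same measurement scales to real tables, which is why privacy teams evaluate columns in combination rather than one at a time.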
What Is Not PII
Data that cannot identify an individual — even in combination with other datasets — falls outside the boundary. Fully anonymized datasets where re-identification is statistically infeasible, aggregated metrics such as average session duration across 10,000 users, and publicly available non-personal data like weather records or stock prices do not qualify.
The critical nuance: pseudonymized data is still personally identifiable under most regulatory frameworks, including the GDPR. If a re-identification key exists anywhere in the organization, the pseudonymized dataset retains its PII classification. This distinction catches many engineering teams off guard.
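A minimal sketch makes the point concrete. Here emails are tokenized with a keyed hash (HMAC); the key and addresses are illustrative stand-ins. As long as the key exists, anyone holding it can rebuild the email-to-token mapping, so the tokens remain personal data:

```python
import hashlib
import hmac

# Stand-in key; in practice this would live in a secrets manager.
PSEUDONYMIZATION_KEY = b"org-wide-secret"

def pseudonymize(email: str) -> str:
    """Replace an email with a deterministic keyed-hash token."""
    return hmac.new(
        PSEUDONYMIZATION_KEY, email.lower().encode(), hashlib.sha256
    ).hexdigest()

token = pseudonymize("ada@example.com")

# Anyone with the key can regenerate tokens for known addresses and
# re-identify the dataset -- which is why the tokens stay classified as PII.
known_users = ["ada@example.com", "grace@example.com"]
reidentified = {pseudonymize(e): e for e in known_users}
print(reidentified[token])
```

Only when the key is destroyed and re-identification becomes statistically infeasible does the data approach true anonymization.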
PII Under the GDPR and Across Regulatory Frameworks
The GDPR uses the term "personal data" rather than PII, but the conceptual overlap is substantial. Article 4(1) defines personal data as any information relating to an identified or identifiable natural person — a deliberately broad definition that captures online identifiers, location data, and factors specific to a person's physical, physiological, genetic, mental, economic, cultural, or social identity.
This breadth has practical consequences. Under the GDPR, cookie identifiers are personal data. Hashed email addresses are personal data. Internal user IDs that map to a person through any lookup table are personal data. The California Consumer Privacy Act (CCPA) and its amendment, the California Privacy Rights Act (CPRA), adopt a similarly expansive approach, as do Brazil's LGPD and newer U.S. state privacy laws in Texas, Oregon, and Montana.
For engineering teams operating across jurisdictions, the takeaway is clear: the definition of PII is a moving target that expands with each new regulation, and it is always broader than intuition suggests.
The Infrastructure Gap in Managing PII
Most organizations manage PII through one of two patterns. The first is manual data mapping: privacy teams interview system owners, build spreadsheets, and attempt to maintain a living inventory of where personal data resides. The second is point-solution scanning: a tool runs periodic scans against known databases and returns a report.
Both approaches share the same structural flaw. They produce a snapshot, not a system. The moment a new microservice deploys, a new vendor integration activates, or a data pipeline changes its schema, the inventory is stale. In organizations shipping code daily, that happens within hours.
This is not a workflow gap. It is an infrastructure gap. The organization lacks a persistent, automated mechanism for discovering personal data as it appears, classifying it according to a consistent taxonomy, and propagating that classification to every system that needs it.
Why Manual Discovery Breaks Down
Manual data mapping depends on institutional knowledge. It assumes that the engineer who built a service two years ago remembers every field that stores user data, that the analytics team documents every new event property, and that third-party vendors accurately disclose their data schemas.
At a company with 100 engineers and 30 SaaS integrations, these assumptions are unreliable. At a company with 500 engineers and 150 integrations, they are fiction. The result is an inventory that captures the data landscape as it existed at the time of the last audit — a landscape that no longer exists.
Why Periodic Scanning Falls Short
Periodic scanning improves on manual mapping by automating detection, but it introduces its own constraint: latency. A weekly scan means that personal data introduced on Monday is unclassified until the following Monday. During that window, the data may be replicated to a warehouse, fed into a model training pipeline, or shared with a third-party processor — each of those downstream movements happening without the governance context that classification provides.
The deeper issue is that periodic scans treat discovery as a batch job. Modern data infrastructure is continuous. Data flows through event streams, real-time pipelines, and API integrations that do not pause for weekly audits. Discovery needs to operate at the same cadence as the infrastructure it monitors.
Building PII Awareness Into the Data Stack
The alternative to snapshots and batch scans is continuous, infrastructure-level discovery and classification — embedding detection directly into the data layer so that every new data element is classified at the point of ingestion, and that classification persists as data moves through the stack.
Helios represents this approach in practice. It automates the discovery and classification of personal data across data systems, databases, and SaaS applications, maintaining a live data map that updates as the infrastructure changes. Rather than asking privacy teams to chase data across systems, Helios brings the data map to them — continuously and automatically.
The mechanics matter. Automated discovery works by inspecting data schemas, sampling data values, and applying classification models that identify patterns across structured and unstructured data. When a new column appears in a database table or a new field appears in an API payload, the system classifies it and adds it to the inventory without human intervention.
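The value-sampling step can be sketched in a few lines. This is a deliberately minimal illustration, not how any particular product implements classification: production systems combine schema metadata, statistical checks, and ML models, and the patterns and labels below are assumptions for the example.

```python
import re

# Patterns are checked in insertion order; more specific patterns come first
# (an SSN would also match the loose phone pattern).
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{8,}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Label a column by the PII pattern that most sampled values match."""
    non_null = [v for v in sample_values if v]
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        if non_null and hits / len(non_null) >= threshold:
            return label
    return None  # no confident PII classification

print(classify_column(["a@x.com", "b@y.org", "c@z.io"]))
print(classify_column(["123-45-6789", "987-65-4321"]))
```

Running this kind of check against every new column or payload field, at ingestion time rather than on a weekly schedule, is what turns discovery from a batch job into infrastructure.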
This continuous classification creates the foundation that every other privacy operation depends on. Consent enforcement, data subject request fulfillment, retention policy execution, and AI governance all require the same underlying knowledge: where is the personal data, what type is it, and what rules apply to it.
PII Removal and Data Subject Request Fulfillment
Removal of personally identifiable information is one of the most operationally demanding privacy requirements. Under the GDPR's right to erasure (Article 17), the CCPA's right to delete, and equivalent provisions in dozens of other laws, organizations must locate and remove an individual's personal data across every system where it resides — not just the primary database.
Without a live data map, fulfilling a deletion request becomes a manual investigation. Engineering teams query system by system, often relying on institutional knowledge to identify which services store user data.
Lethe automates this process by orchestrating data subject requests across connected systems, executing deletions and de-identification operations based on the classifications Helios maintains. A deletion request triggers an automated workflow that reaches every system where that individual's data exists, verified against the live data map.
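The orchestration pattern itself is straightforward to sketch. The code below is a hypothetical illustration of fan-out deletion driven by a data map, not Lethe's actual API; the system names, fields, and connector interface are all assumptions:

```python
# Illustrative data map: system -> PII fields it stores for each user.
DATA_MAP = {
    "postgres_users": ["email", "name"],
    "warehouse_events": ["email"],
    "crm_vendor": ["email", "phone"],
}

def fulfill_erasure(subject_email, connectors):
    """Fan a deletion request out to every system the data map lists,
    returning a per-system status for audit purposes."""
    results = {}
    for system, fields in DATA_MAP.items():
        delete = connectors[system]  # callable that erases the subject's rows
        results[system] = delete(subject_email, fields)
    return results

# Stub connectors standing in for real database/SaaS integrations.
connectors = {name: (lambda email, fields: "deleted") for name in DATA_MAP}
print(fulfill_erasure("ada@example.com", connectors))
```

The essential dependency is visible in the loop: the workflow is only as complete as the data map that drives it, which is why continuous discovery has to come first.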
Consent as a Governance Layer
Classification alone is necessary but not sufficient. The same data element may be permissible to process for one purpose and impermissible for another, depending on the consent the individual has provided. A user who consents to email marketing has not consented to behavioral profiling. A patient who consents to treatment has not consented to research use of their health data.
Consent decisions must therefore propagate to every system that processes personal data, in real time. Janus handles this by orchestrating consent management across data systems, ensuring that when a user updates their preferences, every downstream system respects the change immediately — not after a batch sync, not after a manual update.
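The enforcement pattern can be sketched as a consent gate that each processing step passes through before it runs. This is a generic illustration, not Janus's actual interface; the purpose names and store are assumptions:

```python
# Illustrative consent store: user_id -> purposes the user has consented to.
consent_store = {
    "user_1": {"email_marketing"},
}

class ConsentDenied(Exception):
    pass

def require_consent(user_id, purpose):
    """Raise unless the user's current preferences cover this purpose."""
    if purpose not in consent_store.get(user_id, set()):
        raise ConsentDenied(f"{user_id} has not consented to {purpose}")

def send_marketing_email(user_id):
    require_consent(user_id, "email_marketing")
    return "sent"

def run_behavioral_profiling(user_id):
    # Consent to marketing does not imply consent to profiling.
    require_consent(user_id, "behavioral_profiling")
    return "profiled"

print(send_marketing_email("user_1"))
try:
    run_behavioral_profiling("user_1")
except ConsentDenied as e:
    print(e)
```

Because the gate reads the store at call time rather than a cached copy, a preference change takes effect on the very next processing attempt, with no batch sync in between.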
What Infrastructure-Level Awareness Makes Possible
When discovery, classification, consent enforcement, and data subject request fulfillment operate as connected infrastructure, the organizational dynamics shift in specific, measurable ways.
Privacy engineering teams stop spending time on data archaeology. Instead of investigating where personal data lives before they can respond to a request or assess a new data use, they operate against a live, accurate map. The time between receiving a data subject request and completing it drops from days or weeks to hours.
Engineering teams ship faster because privacy review no longer requires a manual data audit for every new feature. When classification is automatic and continuous, the privacy implications of a new data flow are visible at deployment time — engineers can see exactly what personal data their service handles and what governance rules apply, before the code reaches production.
Fides provides the open-source framework that encodes these governance rules as infrastructure. Privacy policies are defined as code, version-controlled alongside the systems they govern, and enforced automatically. When privacy teams, governance specialists, and engineers all operate against the same taxonomy and the same policy definitions, governance becomes distributed and automatic rather than centralized and manual.
The Compounding Value of Accurate Data Maps
An accurate, continuously updated data map is not just a privacy asset — it is a data quality asset. It tells the organization exactly what personal data it holds, where it flows, and how it is used. That knowledge informs data architecture decisions, vendor evaluations, M&A due diligence, and AI model governance.
Organizations building AI systems face a particularly acute version of this need. Training data governance requires knowing whether a dataset contains personal data, what consent basis covers its use, and whether specific individuals have requested deletion. Without infrastructure-level awareness, every AI project begins with a manual data audit. With it, the governance context is already present in the data catalog.
From Awareness to Data Confidence
The question facing privacy engineering teams is not whether they need to manage personally identifiable information — every regulation, every user expectation, and every AI governance framework demands it. The question is whether that management operates at the same level of automation, accuracy, and continuity as the data infrastructure it governs.
When it does, the organization gains something more valuable than regulatory coverage. It gains data confidence: the ability to answer, at any moment, exactly what personal data exists in the system, who it belongs to, what consent governs its use, and how to act on any individual's request. That confidence is what allows engineering teams to build quickly, privacy teams to govern effectively, and the organization to treat personal data as a resource to be respected rather than a liability to be managed reactively.
Ethyca's platform — Helios for discovery, Fides for policy-as-code, Janus for consent orchestration, and Lethe for automated data subject requests — connects these capabilities into a unified privacy infrastructure. The goal is not the cost savings that follow, though those are real. The goal is an organization where every system knows what personal data it holds and every policy is enforced at the infrastructure level, continuously and automatically.
That is what infrastructure-level awareness makes possible. Not just compliance. Confidence.