Your Data Catalog Is Not a Privacy Tool (Yet)

Data catalogs are metadata indexes. They tell you what data exists and where to find it, not what you're permitted to do with it. When organizations extend their catalog's mandate into privacy operations, they fill the gap between what the catalog provides and what compliance demands with manual process. That manual layer becomes the binding constraint as data volume and request volume grow. The catalog is an input layer. Privacy requires an execution layer built on top of it.

Authors: Ethyca Team
Topic: Data Engineering
Published: May 22, 2026

Key takeaways

  • Data catalogs are metadata indexes. They make data findable and understandable. They were not designed to enforce privacy policy, execute data subject requests, or classify personal data at the field-level precision that regulations require.
  • Most organizations extend their catalog's mandate into privacy operations by default, then fill the gap between what the catalog provides and what compliance demands with manual process. That manual layer becomes the binding constraint as data volume and request volume grow.
  • Privacy regulations require execution, not documentation. A deletion request under GDPR Article 17 must propagate to every system holding the individual's data, be verified, and be documented. A catalog can help identify where to look. It cannot do the rest.
  • The data map and the data catalog are related but distinct. A catalog is a general-purpose metadata inventory. A privacy data map is purpose-built, enriched with consent context, legal basis, processing purpose, and individual-level linkage that catalogs do not capture.
  • The right architecture uses the catalog as an input layer, not a control plane. Privacy infrastructure consumes catalog metadata, enriches it with automated classification, applies policy-as-code controls, and executes data subject operations without manual intervention.

By 2033, organizations will spend over $12 billion annually on data catalog platforms. According to Grand View Research, the data catalog market was valued at $736.2 million in 2022 and is projected to reach $3.86 billion by 2030 at a 23.2% compound annual growth rate. Every major cloud provider now ships a catalog service. Every data team has one on the roadmap if not already in production.

Yet privacy incidents keep pace. Regulatory enforcement actions are climbing. Data subject requests often require significant manual effort to fulfill. The catalog is everywhere, but the privacy infrastructure it was supposed to enable remains absent in most organizations.

The gap is a structural consequence of treating the data catalog as something it was never designed to be: a privacy control plane.

The data catalog: Ubiquitous, but misunderstood

The adoption curve for data catalog tools tells a clear story about demand. Organizations want visibility into their data estates. They want to know what data exists, where it lives, who owns it, and how it flows. These are reasonable goals, and a well-implemented catalog delivers on them.

The misunderstanding begins when organizations extend that mandate. Because the catalog knows where data lives, teams assume it can also govern how that data is used, who can access it, and whether it complies with privacy regulations. This assumption is widespread, and it is also incorrect.

A data catalog is an index. It organizes metadata, surfaces lineage, and enables search and discovery across distributed systems. What it does not do, by design, is enforce policy, execute data subject requests, or apply privacy-specific classifications at the granularity that regulations demand.

The distinction matters because organizations are making architectural decisions based on the conflation. They invest in catalog platforms expecting privacy outcomes. When those outcomes do not materialize, the response is typically to add more manual processes on top of the catalog rather than to question whether the catalog was ever the right layer for privacy operations.

What a data catalog is and what it is not

A data catalog is a centralized metadata management system that inventories an organization's data assets: schemas, ownership, lineage, and usage patterns. Think of it as a library card catalog for an organization's entire data estate — it tells you what exists and where to find it, but not what you are permitted to do with it.

In practice, a catalog ingests metadata from databases, data lakes, warehouses, APIs, and SaaS applications, applies tags, tracks lineage from source to consumption, and provides a search interface for analysts and engineers. A data catalog example in a mid-size SaaS company might look like this: the engineering team connects PostgreSQL databases, a Snowflake warehouse, and Salesforce to the catalog. The catalog crawls these systems, extracts schema information, and presents a unified view. An analyst searching for 'customer email' can see every table and system where that field appears, who owns it, and how it flows downstream.

This is genuinely useful for data governance, analytics, and operational efficiency. The catalog can tell you that a field called user_email exists in fourteen systems. It cannot tell you whether that field is being processed lawfully under GDPR Article 6, whether a specific data subject has requested its deletion, or whether the retention period has expired.

A data catalog differs from a data dictionary in scope and dynamism. A data dictionary defines the structure and meaning of data elements: column names, data types, valid values, and business definitions. It is typically static. A catalog is dynamic: it discovers data across systems, tracks lineage, records usage, and evolves as the data estate changes. The dictionary tells you what a field means. The catalog tells you where it lives and how it moves. Neither tells you what you are permitted to do with it under a given regulatory framework.

Within a broader data governance framework, the catalog serves as the metadata backbone supporting data stewardship, quality monitoring, and access management. These are governance functions. Privacy governance overlaps with general data governance but adds categorically different requirements: individual-level data operations, consent management, purpose limitation enforcement, and cross-border transfer controls. A catalog that serves general governance well may still be entirely insufficient for privacy governance.

Why data catalogs are not privacy infrastructure

The distinction between cataloging and privacy operations becomes concrete when you examine what privacy regulations actually require.

GDPR, CCPA, and their successors require organizations to act on personal data in specific, auditable, time-bound ways. They demand execution, not documentation. A data subject access request under GDPR Article 15 requires an organization to locate all personal data related to an individual, compile it, and deliver it within one month. A deletion request under Article 17 requires the organization to remove that data from every system where it exists, verify the deletion, and document the action.

A data catalog can support the first step: identifying where personal data might reside. But catalogs operate at the schema level. They know that a table called users has a column called email. They do not reliably know that a specific individual's email address also appears in log files, analytics events, third-party integrations, backup systems, and cached API responses.

Privacy operations require field-level, record-level, and sometimes value-level precision. They require the ability to traverse not just the catalog's metadata but the actual data across every system, including systems the catalog does not index. And they require the ability to take action: retrieve, redact, delete, anonymize, or restrict processing.
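The gap between schema-level and record-level knowledge can be sketched in a few lines. This is a minimal illustration, not any catalog vendor's API: the CatalogEntry type, the CATALOG and LIVE_DATA contents, and both search functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One row of schema-level metadata (hypothetical shape)."""
    system: str
    table: str
    column: str


# Hypothetical catalog contents: column names only, no data values.
CATALOG = [
    CatalogEntry("postgres", "users", "email"),
    CatalogEntry("snowflake", "dim_customer", "user_email"),
]

# Hypothetical live data, including a log store the catalog never indexed.
LIVE_DATA = {
    ("postgres", "users"): [
        {"id": 1, "email": "ada@example.com"},
        {"id": 2, "email": "bob@example.com"},
    ],
    ("snowflake", "dim_customer"): [
        {"customer_key": 10, "user_email": "ada@example.com"},
    ],
    ("app_logs", "events"): [
        {"line": "login ok for ada@example.com"},
    ],
}


def catalog_search(term):
    """Schema-level lookup: which columns have names containing the term?"""
    return [entry for entry in CATALOG if term in entry.column]


def record_search(value):
    """Record-level lookup: which rows, in which systems, contain the value?"""
    hits = []
    for (system, table), rows in LIVE_DATA.items():
        for row in rows:
            if any(value in str(field) for field in row.values()):
                hits.append((system, table, row))
    return hits
```

In this toy setup, catalog_search("email") points at two columns, while record_search("ada@example.com") also surfaces the log line in a system the catalog does not cover. A real implementation would query each system through its own connector rather than scanning in-memory rows.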

No major data catalog platform provides this execution layer natively. The catalog is a map, but privacy requires a map, a navigation system, and an engine.

What is a data catalog in data governance?

Within a data governance framework, the catalog serves as the metadata backbone. It supports data stewardship by making ownership visible, enables quality monitoring by tracking lineage, and facilitates access management by documenting who uses what. These are governance functions.

Privacy governance overlaps with general data governance but adds requirements that are categorically different: individual-level data operations, consent management, purpose limitation enforcement, and cross-border transfer controls. A catalog that serves general governance well may still be entirely insufficient for privacy governance.

Four reasons why catalog-centric approaches don’t work at scale

Most organizations build the same architecture: a data catalog at the center, connected to major data stores, with privacy and compliance teams manually identifying relevant systems when a DSR arrives. An engineer queries each system individually, compiles the results, reviews them, and delivers the response. This works at low request volumes. It does not hold as volume grows.

  • Coverage is never complete

Shadow IT, legacy systems, third-party processors, and unstructured data stores routinely fall outside the catalog's scope. A single missed system produces an incomplete or non-compliant response, and the catalog has no mechanism to alert teams to what it does not know about.

  • Identity resolution is manual

The catalog knows that users.email exists in Snowflake. It does not know that the same email address, hashed differently, exists in an analytics pipeline, a marketing automation platform, and a machine learning feature store. Resolving identity across systems requires identity graph capabilities that catalogs do not provide.

  • Execution is human-dependent

Every DSR requires an engineer to write queries, verify results, and coordinate across teams. As request volume grows, this becomes a headcount problem. Organizations receiving high DSR volumes face the equivalent of several full-time employees dedicated solely to request fulfillment, with accuracy degrading as volume increases.

  • Auditability degrades

Manual processes produce audit trails made of tickets, emails, and spreadsheets. Demonstrating compliance to a regulator requires reconstructing the entire chain of actions for each request, a manual exercise that compounds the cost of every fulfilled request.
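The identity resolution problem above can be made concrete with a small sketch. It assumes each system stores a known, deterministic transformation of the email address; the system names, the SCHEMES table, and the salt value are hypothetical. When the transformation is unknown or the identifier is not derived from the email at all, this approach does not work and a real identity graph is needed.

```python
import hashlib


def normalize(email):
    """Trim whitespace and lowercase so the same address hashes the same way."""
    return email.strip().lower()


def sha256_hex(email):
    """Plain SHA-256 of the normalized address."""
    return hashlib.sha256(normalize(email).encode("utf-8")).hexdigest()


def salted_sha256_hex(email, salt="marketing-salt"):
    """SHA-256 with a hypothetical, system-specific salt prefix."""
    return hashlib.sha256((salt + normalize(email)).encode("utf-8")).hexdigest()


# Hypothetical mapping: how each system derives its stored identifier.
SCHEMES = {
    "warehouse": normalize,
    "analytics": sha256_hex,
    "marketing": salted_sha256_hex,
}


def derive_identifiers(email):
    """Compute the identifier each system would hold for this address."""
    return {system: scheme(email) for system, scheme in SCHEMES.items()}


def find_matches(email, stored):
    """Return the systems whose stored identifiers include this individual.

    `stored` maps a system name to the set of identifiers found in it.
    """
    identifiers = derive_identifiers(email)
    return sorted(
        system
        for system, values in stored.items()
        if identifiers.get(system) in values
    )
```

A schema-level index sees three unrelated columns here. Only by applying each system's derivation to the requester's address can the three be linked to one individual.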

The catalog provides a starting point. The distance between that starting point and a compliant, auditable privacy operation is filled entirely by manual work, and that manual layer becomes the binding constraint as data volume, system count, and request volume grow.

Scaling the catalog itself is a well-understood engineering exercise: add connectors, automate metadata ingestion, improve search. Scaling catalog-dependent privacy operations is a fundamentally different undertaking. It requires automated execution: the ability to programmatically locate, retrieve, modify, and delete personal data across every connected system without human intervention at each step. Organizations that attempt to scale privacy by scaling their catalog alone find they have built a very detailed map of a territory they still cannot navigate.

Infrastructure-first privacy: Moving beyond the catalog with Ethyca

The alternative is to treat privacy as an infrastructure concern rather than a catalog extension. This means building systems that can discover personal data automatically, classify it against privacy-specific taxonomies, enforce policies programmatically, and execute data subject operations at machine speed.

This is the architecture Ethyca builds, and it operates across four distinct but integrated layers.

The first layer is automated discovery and classification. Helios continuously scans an organization's data estate, identifying personal data at the field level across databases, SaaS applications, data warehouses, and cloud storage. Unlike catalog-level schema tagging, this classification operates on actual data values, detecting personal information even when column names are ambiguous or systems are undocumented. The result is a living, privacy-specific data map that updates as the data estate evolves.
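To illustrate the difference between tagging by column name and classifying by data value, here is a minimal sketch. It is not a description of how Helios works internally. The regular expression, the 0.8 threshold, the "email" label, and the function names are all assumptions chosen for the example, and production classifiers cover many more data categories with far more robust detection.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_column(values, threshold=0.8):
    """Label a column 'email' if enough of its non-empty string values match."""
    non_empty = [v for v in values if isinstance(v, str) and v.strip()]
    if not non_empty:
        return None
    hits = sum(1 for v in non_empty if EMAIL_RE.match(v.strip()))
    return "email" if hits / len(non_empty) >= threshold else None


def classify_table(rows):
    """Classify every column of a table from sampled rows, ignoring names."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    labels = {}
    for name, values in columns.items():
        label = classify_column(values)
        if label is not None:
            labels[name] = label
    return labels
```

Given sampled rows where a column named contact_ref holds email addresses, classify_table flags that column even though nothing in its name suggests personal data, which is the case name-based tagging misses.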

The second layer is policy-as-code. Fides provides a framework for defining privacy policies in code, not in documents. These policies specify what data can be collected, for what purposes, under what legal bases, and with what retention periods. Because the policies are code, they can be version-controlled, tested, reviewed in pull requests, and enforced automatically. Engineering teams work within clearly defined boundaries that are technically enforced, not just documented in policy manuals.
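The idea of a policy expressed as code can be sketched as follows. This is not Fides syntax. The PolicyRule type, the POLICY table, the category and purpose names, and the retention figure are hypothetical, chosen only to show how such a rule becomes something a test suite or a pull request check can evaluate.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PolicyRule:
    """One machine-readable policy rule (hypothetical shape)."""
    data_category: str
    allowed_purposes: frozenset
    legal_basis: str
    retention_days: int


# Hypothetical policy: email may be used for two purposes, kept for one year.
POLICY = {
    "email": PolicyRule(
        data_category="email",
        allowed_purposes=frozenset({"account_management", "billing"}),
        legal_basis="contract",
        retention_days=365,
    ),
}


def is_use_allowed(category, purpose):
    """Deny by default: a use is allowed only if a rule explicitly permits it."""
    rule = POLICY.get(category)
    return rule is not None and purpose in rule.allowed_purposes


def is_retention_expired(category, collected_on, today):
    """True once the data has been held longer than the rule's retention period."""
    rule = POLICY[category]
    return today > collected_on + timedelta(days=rule.retention_days)
```

Because the rule is ordinary data plus functions, a proposed new use such as is_use_allowed("email", "marketing") can fail a check in continuous integration before the change ships, rather than being caught in a later manual review.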

The third layer is automated execution. Lethe handles the operational fulfillment of data subject requests: access, deletion, rectification, and portability. When a request arrives, Lethe traverses the data map, identifies every system containing the individual's data, executes the appropriate action, and generates an auditable record. The entire process runs programmatically, with no engineer writing a query and no analyst compiling a spreadsheet.
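The request flow described above (traverse systems, act, verify, record) can be outlined in a short sketch. It does not represent Lethe's actual interface. InMemoryConnector, execute_erasure, and the audit record fields are hypothetical, and a real connector would issue deletes against a live database or SaaS API rather than filtering an in-memory list.

```python
from datetime import datetime, timezone


class InMemoryConnector:
    """Hypothetical connector wrapping one system's records."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = list(rows)

    def delete_subject(self, email):
        """Delete the subject's rows and return how many were removed."""
        before = len(self.rows)
        self.rows = [r for r in self.rows if r.get("email") != email]
        return before - len(self.rows)

    def count_subject(self, email):
        """Count rows still held for the subject, used to verify deletion."""
        return sum(1 for r in self.rows if r.get("email") == email)


def execute_erasure(email, connectors):
    """Run an erasure across every connector and return one audit entry each."""
    audit = []
    for connector in connectors:
        deleted = connector.delete_subject(email)
        remaining = connector.count_subject(email)
        audit.append({
            "system": connector.name,
            "action": "erasure",
            "records_deleted": deleted,
            "verified": remaining == 0,
            "completed_at": datetime.now(timezone.utc).isoformat(),
        })
    return audit
```

The audit entries are produced by the same code path that performs the deletion, so the evidence for a regulator does not have to be reconstructed afterwards from tickets and spreadsheets.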

The fourth layer is AI governance. Astralis extends privacy controls into AI and machine learning systems, enforcing policies on training data, model inputs, and inference outputs. As organizations adopt AI, the surface area for personal data processing expands dramatically. Astralis ensures that privacy policies apply consistently across traditional data systems and AI pipelines alike, at the infrastructure level rather than as an afterthought.

Together, these layers form a privacy infrastructure stack that uses catalog-level metadata as an input but does not depend on the catalog as a control mechanism. The catalog tells you what exists, and the infrastructure acts on it.

FAQs

What is a data catalog and what are its limitations for privacy?

A data catalog is a metadata management system that inventories data assets, tracks lineage, and enables search across an organization's data estate. Its limitation for privacy is structural: it operates at the schema level, not the record level, and it was not designed to enforce policy, execute data subject requests, or classify personal data with the granularity privacy regulations require.

What is the difference between a data catalog and a data map?

A data catalog is a general-purpose metadata inventory serving analytics, engineering, and governance teams. A privacy data map is purpose-built, enriched with consent context, legal basis, processing purpose, and individual-level linkage. The catalog can feed the data map, but the map requires privacy-specific enrichment and operational capability that catalog tools do not provide natively.

Why can a data catalog not fulfill a GDPR deletion request on its own?

A deletion request under GDPR Article 17 requires locating every instance of an individual's data across all systems, executing the deletion, verifying it, and documenting the action within one month. A catalog identifies where data schemas exist. It does not execute operations against live data stores, resolve identity across multiple systems, or generate the audit trail regulators require.

What is the difference between a data catalog and a data dictionary?

A data dictionary defines the structure and meaning of data elements: column names, data types, and business definitions. It is typically static and descriptive. A data catalog is dynamic: it discovers data across systems, tracks how it moves, records usage, and updates as the data estate changes. Both describe data. Neither governs what you are permitted to do with it.

How should a data catalog fit into a privacy infrastructure architecture?

The catalog should serve as an input layer, not a control plane. Privacy infrastructure consumes catalog metadata, enriches it with automated field-level classification, applies policy-as-code controls, and executes data subject operations programmatically. The catalog does what it does best. Purpose-built privacy infrastructure handles execution, enforcement, and audit.
