Large Language Model (LLM)

A type of AI system trained on massive text datasets to generate, summarize, classify, and reason over natural language. Examples include GPT, Claude, Gemini, and Llama. LLMs can ingest personal data at training time and can output personal data at inference time, creating distinct privacy obligations at each stage.

A Large Language Model is a neural network trained on massive corpora of text to predict the next token given preceding context. Modern LLMs (GPT, Claude, Gemini, Llama, and others) are built on the transformer architecture and run to hundreds of billions of parameters, allowing them to perform a wide range of language tasks — generation, summarization, classification, translation, code synthesis, reasoning — without task-specific training.
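To make "predict the next token given preceding context" concrete, here is a toy sketch. It is not how a transformer works internally: it is a simple bigram frequency counter, used only to illustrate the prediction objective. The corpus and the function names (`train_bigram`, `predict_next`) are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat" (follows "the" twice, "mat" once)
```

A real LLM replaces the frequency table with a neural network that conditions on a long context window rather than a single preceding token, but the training objective is the same kind of next-token prediction.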

From a data-protection standpoint, LLMs create privacy obligations at multiple stages of their lifecycle. Training data may contain personal data scraped from the web, raising questions about lawful basis, data subject rights, and whether personal data can be erased once a model has been trained on it. Inputs at inference time (prompts and retrieved context) often contain personal data that a user, employee, or customer is sharing in real time. Outputs can contain personal data, whether accurate or hallucinated, about individuals who never consented to be the subject of the output.

Treating an LLM as a "model" rather than a "data system" understates the compliance surface. Each stage above has its own retention, lawful basis, and data subject rights considerations. Organizations deploying LLMs need a governance posture that covers all three: training data governance, prompt-time data minimization, and output controls.
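As a rough illustration of prompt-time data minimization, the sketch below masks obvious identifiers before a prompt would be sent to a model. Everything in it is an assumption made for illustration: the `PATTERNS` table, the `redact` function, and the two regular expressions, which catch only simple email and phone formats. A production control would typically rely on a dedicated PII detection layer rather than hand-written patterns.

```python
import re

# Hypothetical, deliberately simple patterns. They will miss many real-world
# formats and are here only to show the shape of the control.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace matched identifiers with placeholders.

    Returns the redacted text and a count of matches per category,
    which can be logged for audit without storing the values themselves.
    """
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = len(matches)
            text = pattern.sub(f"[{label}]", text)
    return text, found


prompt = "Email jane.doe@example.com or call +1 415 555 0100 about the refund."
safe_prompt, report = redact(prompt)
print(safe_prompt)  # Email [EMAIL] or call [PHONE] about the refund.
print(report)       # {'EMAIL': 1, 'PHONE': 1}
```

The same function could be run over model outputs before they are shown or stored, which gives a simple form of output control. Training data governance is a separate problem that pattern matching at inference time does not address.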