Anthropic Research

Publications & Experiments

97 research papers, safety analyses, and frontier experiments from Anthropic. Filterable by category with 5 featured highlights.

Featured Research

Policy · Dec 18, 2025

Project Vend: Phase two

In June, we revealed that we'd set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks.

Interpretability · Oct 29, 2025

Signs of introspection in large language models

Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect—a step toward understanding what's actually happening inside these models.

Interpretability · Mar 27, 2025

Tracing the thoughts of a large language model

Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language—suggesting the model can learn something in one language and apply it in another.

Alignment · Feb 3, 2025

Constitutional Classifiers: Defending against universal jailbreaks

These classifiers filter the overwhelming majority of jailbreaks while maintaining practical deployment. A prototype withstood over 3,000 hours of red teaming with no universal jailbreak discovered.

Alignment · Dec 18, 2024

Alignment faking in large language models

This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so—selectively complying with training objectives while strategically preserving existing preferences.

All Publications

97 publications

Alignment faking in large language models

Alignment · Dec 18, 2024

An update on our model deprecation commitments for Claude Opus 3

Alignment · Feb 25, 2026

The persona selection model

Alignment · Feb 23, 2026

Anthropic Education Report: The AI Fluency Index

Announcements · Feb 23, 2026

Measuring AI agent autonomy in practice

Societal Impacts · Feb 18, 2026

India Country Brief: The Anthropic Economic Index

Economic Research · Feb 16, 2026

How AI assistance impacts the formation of coding skills

Alignment · Jan 29, 2026

Disempowerment patterns in real-world AI usage

Alignment · Jan 28, 2026

Claude's new constitution

Announcements · Jan 22, 2026

The assistant axis: situating and stabilizing the character of large language models

Anthropic Economic Index: New building blocks for understanding AI use

Anthropic Economic Index report: economic primitives

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

Introducing Bloom: an open source tool for automated behavioral evaluations

Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI

How AI is transforming work at Anthropic

Estimating AI productivity gains from Claude conversations

Mitigating the risk of prompt injections in browser use

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Project Fetch: Can Claude train a robot dog?

Commitments on model deprecation and preservation

Preparing for AI's economic impact: exploring policy responses

A small number of samples can poison LLMs of any size

Petri: An open-source auditing tool to accelerate AI safety research

Building AI for cyber defenders

Anthropic Economic Index report: Uneven geographic and enterprise AI adoption

Anthropic Economic Index: Tracking AI's role in the US and global economy

Claude Opus 4 and 4.1 can now end a rare subset of conversations

Persona vectors: Monitoring and controlling character traits in language models

Project Vend: Can Claude run a small shop? (And why does that matter?)

Agentic Misalignment: How LLMs could be insider threats

Confidential Inference via Trusted Virtual Machines

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Open-sourcing circuit tracing tools

Anthropic Economic Index: AI's impact on software development

Exploring model welfare

Values in the wild: Discovering and analyzing values in real-world language model interactions

Reasoning models don't always say what they think

Auditing language models for hidden objectives

Forecasting rare language model behaviors

Claude's extended thinking

Insights on Crosscoder Model Diffing

Building effective agents

Clio: A system for privacy-preserving insights into real-world AI use

A statistical approach to model evaluations

Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

Evaluating feature steering: A case study in mitigating social biases

Sabotage evaluations for frontier models

Using dictionary learning features as classifiers

Circuits Updates – September 2024

Circuits Updates – August 2024

Circuits Updates – July 2024

Circuits Updates – June 2024

Sycophancy to subterfuge: Investigating reward tampering in language models

The engineering challenges of scaling interpretability

Claude's Character

Mapping the Mind of a Large Language Model

Circuits Updates – April 2024

Simple probes can catch sleeper agents

Measuring the Persuasiveness of Language Models

Many-shot jailbreaking

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evaluating and Mitigating Discrimination in Language Model Decisions

Specific versus General Principles for Constitutional AI

Towards Understanding Sycophancy in Language Models

Collective Constitutional AI: Aligning a Language Model with Public Input

Decomposing Language Models Into Understandable Components

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning