Anthropic Research

Publications & Experiments

97 research papers, safety analyses, and frontier experiments from Anthropic, including 5 featured highlights.

All Publications

97 publications

Project Vend: Phase two

Featured · Policy · Dec 18, 2025

In June, we revealed that we'd set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks.

Signs of introspection in large language models

Featured · Interpretability · Oct 29, 2025

Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect—a step toward understanding what's actually happening inside these models.

Tracing the thoughts of a large language model

Featured · Interpretability · Mar 27, 2025

Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language—suggesting the model can learn something in one language and apply it in another.

Constitutional Classifiers: Defending against universal jailbreaks

Featured · Alignment · Feb 3, 2025

These classifiers filter the overwhelming majority of jailbreaks while maintaining practical deployment. A prototype withstood over 3,000 hours of red teaming with no universal jailbreak discovered.

Alignment faking in large language models

Featured · Alignment · Dec 18, 2024

This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so—selectively complying with training objectives while strategically preserving existing preferences.

An update on our model deprecation commitments for Claude Opus 3

Alignment · Feb 25, 2026

The persona selection model

Alignment · Feb 23, 2026

Anthropic Education Report: The AI Fluency Index

Announcements · Feb 23, 2026

Measuring AI agent autonomy in practice

Societal Impacts · Feb 18, 2026

India Country Brief: The Anthropic Economic Index

Economic Research · Feb 16, 2026

How AI assistance impacts the formation of coding skills

Alignment · Jan 29, 2026

Disempowerment patterns in real-world AI usage

Alignment · Jan 28, 2026

Claude's new constitution

Announcements · Jan 22, 2026

The assistant axis: situating and stabilizing the character of large language models

Interpretability · Jan 19, 2026

Anthropic Economic Index: New building blocks for understanding AI use

Economic Research · Jan 15, 2026

Anthropic Economic Index report: economic primitives

Other

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

Other

Introducing Bloom: an open source tool for automated behavioral evaluations

Other

Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI

Other

How AI is transforming work at Anthropic

Other

Estimating AI productivity gains from Claude conversations

Other

Mitigating the risk of prompt injections in browser use

Other

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Other

Project Fetch: Can Claude train a robot dog?

Other

Commitments on model deprecation and preservation

Other

Preparing for AI's economic impact: exploring policy responses

Other

A small number of samples can poison LLMs of any size

Other

Petri: An open-source auditing tool to accelerate AI safety research

Other

Building AI for cyber defenders

Other

Anthropic Economic Index report: Uneven geographic and enterprise AI adoption

Other

Anthropic Economic Index: Tracking AI's role in the US and global economy

Other

Claude Opus 4 and 4.1 can now end a rare subset of conversations

Other

Persona vectors: Monitoring and controlling character traits in language models

Other

Project Vend: Can Claude run a small shop? (And why does that matter?)

Other

Agentic Misalignment: How LLMs could be insider threats

Other

Confidential Inference via Trusted Virtual Machines

Other

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Other

Open-sourcing circuit tracing tools

Other

Anthropic Economic Index: AI's impact on software development

Other

Exploring model welfare

Other

Values in the wild: Discovering and analyzing values in real-world language model interactions

Other

Reasoning models don't always say what they think

Other

Auditing language models for hidden objectives

Other

Forecasting rare language model behaviors

Other

Claude's extended thinking

Other

Insights on Crosscoder Model Diffing

Other

Building effective agents

Other

Clio: A system for privacy-preserving insights into real-world AI use

Other

A statistical approach to model evaluations

Other

Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

Other

Evaluating feature steering: A case study in mitigating social biases

Other

Sabotage evaluations for frontier models

Other

Using dictionary learning features as classifiers

Other

Circuits Updates – September 2024

Interpretability

Circuits Updates – August 2024

Interpretability

Circuits Updates – July 2024

Interpretability

Circuits Updates – June 2024

Interpretability

Sycophancy to subterfuge: Investigating reward tampering in language models

Other

The engineering challenges of scaling interpretability

Interpretability

Claude's Character

Other

Mapping the Mind of a Large Language Model

Interpretability

Circuits Updates – April 2024

Interpretability

Simple probes can catch sleeper agents

Other

Measuring the Persuasiveness of Language Models

Other

Many-shot jailbreaking

Alignment

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Alignment

Evaluating and Mitigating Discrimination in Language Model Decisions

Other

Specific versus General Principles for Constitutional AI

Other

Towards Understanding Sycophancy in Language Models

Other

Collective Constitutional AI: Aligning a Language Model with Public Input

Other

Decomposing Language Models Into Understandable Components

Interpretability

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Interpretability

Challenges in evaluating AI systems

Other

Tracing Model Outputs to the Training Data

Other

Studying Large Language Model Generalization with Influence Functions

Other

Measuring Faithfulness in Chain-of-Thought Reasoning

Other

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Other

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Other

Circuits Updates — May 2023

Interpretability

Interpretability Dreams

Interpretability

Distributed Representations: Composition & Superposition

Interpretability

Privileged Bases in the Transformer Residual Stream

Interpretability

The Capacity for Moral Self-Correction in Large Language Models

Other

Superposition, Memorization, and Double Descent

Interpretability

Discovering Language Model Behaviors with Model-Written Evaluations

Other

Constitutional AI: Harmlessness from AI Feedback

Alignment

Measuring Progress on Scalable Oversight for Large Language Models

Other

Toy Models of Superposition

Interpretability

Red Teaming Language Models to Reduce Harms

Other

Language Models (Mostly) Know What They Know

Other

Softmax Linear Units

Interpretability

Scaling Laws and Interpretability of Learning from Repeated Data

Other

Training a Helpful and Harmless Assistant with RLHF

Other

In-context Learning and Induction Heads

Interpretability

Predictability and Surprise in Large Generative Models

Other

A Mathematical Framework for Transformer Circuits

Interpretability

A General Language Assistant as a Laboratory for Alignment

Other

Research by Category