
Field Notes · Chapter III

Chapter 3: From GPT-2 to Hermes — The Road LLMs Walked


AI is, in my view, the fastest-mutating technology humanity has ever built. The frontier model shifts month to month. API prices halve quarter to quarter. The default pair-programmer baked into every IDE swaps out every six months. In an industry like this, working through a textbook chapter on “what is AI” or memorizing the current Claude Code manual is — and I want to be honest — not a useful thing to do. Six months from now, that manual is deprecated. A year from now, the tool itself has been acquired, rebranded, or just buried. The reigning paradigm changes roughly every quarter.

So what I actually want to recommend to anyone trying to learn AI is this: read the history. Starting from the prelude — transformers, BERT, GPT-2 — through ChatGPT’s public release on November 30, 2022, all the way to the Harness Engineering era we’re standing in right now. Which models came out, from which lab, at which moment. What part of the market each one shook. What the people who shipped it were looking toward as the next frontier when they did. If you carry that coordinate system in your head, the next new model or tool or paradigm you bump into becomes legible: you can see, almost automatically, where in some current it sits, what came before, and what it’s pulling toward. Without that coordinate system, all you have is tool literacy — and the day the tool disappears, you go to zero.

This essay is an attempt to compress that coordinate system onto a single page. The vertical timeline below stitches together the events I think mattered, in order, from BERT (October 2018) to Claude Opus 4.7 (April 2026). Click any node and the longer context unfolds. I included model releases, but also tools, cultural moments, policy turning points, and market shocks — they all belong on the same axis. By the time you reach the bottom, I hope what jumps out is that what looks like a chaotic weekly drumbeat is actually a small number of thick current-lines flowing through everything.

Prelude · 2018 – 2022 · 「I. Birth of the Models」

When transformers quietly ate language

Four years before the public knew the word ChatGPT, transformers had already taken over academic NLP, and OpenAI was so cautious about GPT-2 that they refused to release the full weights at first. The largest single event of this era, in my view, isn't a model release at all — it's the founding of Anthropic as a spinoff. RLHF and Constitutional AI, the two great branches of alignment philosophy, diverge from a single trunk right here.

Google 2018 · 10 BERT — the moment transformers became the default architecture of NLP model

Google's BERT (Bidirectional Encoder Representations from Transformers) absorbed almost the entire NLP task zoo — question answering, sentiment, NER, classification — into a single pretrain-then-finetune paradigm. From this point forward, "language model = transformer" became academia's baseline assumption. BERT was encoder-only, so it couldn't generate text; that gap got filled four months later by GPT-2.

OpenAI 2019 · 02 GPT-2 — the model OpenAI said was "too dangerous to release" model

OpenAI initially refused to release the full 1.5B-parameter weights of GPT-2, citing risk of misuse. From today's vantage point that looks comically conservative, but the decision itself created a new category: "releasing an AI model is a political event." They staged the release, opening progressively larger weights until November. This was the first moment lay readers felt that something had crossed a threshold — that a machine could actually compose sentences. A bridge to GPT-3.

OpenAI 2020 · 05 GPT-3 — 175 billion parameters, the first shock of the scaling law model

"175B parameters" started circulating outside academic papers for the first time. Weights kept private, API-only. Even though most users never directly touched the model, the *way of using* it — few-shot prompting — became the standard from this point on. And the people who couldn't get behind the way this model was aligned left OpenAI a year later and started something else.

Anthropic 2021 · Spring Anthropic founded — the Amodei siblings and a wave of OpenAI's core research staff culture

Dario Amodei (OpenAI VP of Research) and his sister Daniela Amodei (VP of Safety & Policy), along with Tom Brown (lead author of the GPT-3 paper), Jared Kaplan (scaling laws), Chris Olah (interpretability), and a cohort of OpenAI's alignment-and-research core, walked out together and founded Anthropic. The stated reason was "directional disagreement." The substantive reason was: they wanted to build models with safety as the first axis, not the second.

From this moment, the LLM industry splits into two branches: OpenAI = product-led, scale-first; Anthropic = safety-led, character-first. Almost everything that comes later is a downstream consequence of this fork.

OpenAI 2022 · 01 InstructGPT — the first production-grade RLHF, and ChatGPT's actual technical parent model

"GPT-3 doesn't answer well" — that weakness was the problem InstructGPT solved, by applying Reinforcement Learning from Human Feedback to a production model for the first time. ChatGPT, which would explode ten months later, is essentially "InstructGPT plumbed into a chat UI." OpenAI internally didn't expect ChatGPT to be a major event when they shipped it — which makes more sense once you realize the *technical* leap had already happened here, in January.

GitHub 2022 · 06 GitHub Copilot GA — AI coding becomes a mainstream tool for the first time tool

GitHub Copilot, which had been in technical preview since June 2021, went GA on June 21, 2022. Backend: OpenAI Codex (a code-specialized fine-tune of GPT-3). It slipped into VS Code as a sidekick, and from this point on, *"asking an AI to write code"* became normal behavior. Worth noting that this happened five months *before* ChatGPT — the coding world had already put one foot into AI by the time the rest of the world found out.

OpenAI 2022 · 07 FIM (Fill-in-the-Middle) paper — the technical foundation of AI + coding tool

Bavarian et al. at OpenAI publish "Efficient Training of Language Models to Fill in the Middle" on July 28, 2022. The trick is almost embarrassingly simple: split each document into (prefix, suffix, middle), rearrange it with sentinel tokens <PRE> <SUF> <MID>, and train left-to-right as usual. The model learns — for free, with no degradation in standard generation — the ability to look at what's before *and* after the cursor and fill in between.

This single technique becomes the technical mother of the entire code-autocompletion industry. Before FIM, autocomplete was either statically-typed IntelliSense or naïve left-to-right GPT-3 generation. After FIM — and after Code Llama, StarCoder, DeepSeek Coder, Qwen2.5-Coder, and Codestral all adopt it as default in 2023 — every modern IDE's ghost-text autocompletion is, structurally, a FIM model. GitHub Copilot's inline suggestions, Cursor's Tab, Windsurf's Cascade previews, Codeium's everything. The lived experience of "AI writes code with me" runs through this paper.
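The transformation is simple enough to sketch end-to-end. Below is a toy version of the PSM (prefix-suffix-middle) rearrangement, using plain sentinel strings in place of real tokenizer specials; as in the paper, the split points are sampled at random:

```python
# Toy sketch of the FIM data transform (PSM order) from Bavarian et al. 2022.
# Sentinel strings stand in for special tokens; real implementations operate
# on token sequences, not characters.
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def to_fim(doc: str, rng: random.Random) -> str:
    """Rearrange a document into <PRE>prefix<SUF>suffix<MID>middle."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Still trained strictly left-to-right: the model learns to emit
    # `middle` after having seen both sides of the gap.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

example = to_fim("def add(a, b):\n    return a + b\n", random.Random(0))
```

Because the rearranged text is trained with the ordinary next-token objective, the model gets infilling "for free", which is exactly why the paper's title calls the training efficient.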

Act I · 2022.11 – 2023.12 · 「II. Model Refinement (Early)」

ChatGPT shock, peak scaling law, a year of fierce parallel emergence

From the day ChatGPT hit one million users in five days — to roughly a year later — the AI market dropped a year's worth of news every month. Claude, Llama, GPT-4, Gemini, Mistral, Qwen, ChatGLM all surfaced in sequence, and scaling law (more parameters + more data + more compute → predictably better) hardened into the field's absolute premise. Cursor reframed what an IDE looks like, and Perplexity did the same for search — the first generation of AI tools showed their shape here. In November, OpenAI's board fired Sam Altman and undid it five days later, in a drama that shook the entire industry. And in December, Mixtral and Mamba arrived within weeks of each other and, in two completely different ways, cracked the transformer's complacency.

OpenAI 2022 · 11.30 ChatGPT launches — 1M users in 5 days, 100M in two months model

OpenAI called ChatGPT a "research preview" and shipped it without much marketing. It became the fastest consumer product in history to reach 100M users (by January 2023). The underlying model was GPT-3.5 (a chat-tuned variant of the InstructGPT line); the decisive thing wasn't the model. It was "free, chat-based UI, anyone can use it." From this moment forward, the word "AI" entered non-engineering daily speech.

Perplexity 2022 · 12.07 Perplexity ships — "search + LLM" as a brand-new category tool

Aravind Srinivas (formerly OpenAI / DeepMind / Google Brain) shipped the Perplexity search engine one week after ChatGPT launched. The new category was "an LLM whose answers cite their sources." Google felt this product more sharply than any other early threat — but it would take Google until 2024 to respond with AI Overviews.

Meta 2023 · 02.24 LLaMA 1 announced — and one week later, weights leaked to 4chan model

Meta announced LLaMA 1 in four sizes (7B / 13B / 33B / 65B), initially gated behind an academic-only application process. On March 3, the entire weight set leaked through 4chan. Meta sent takedown requests, but it was too late. In hindsight, this is the Big Bang of the open-weights LLM ecosystem — llama.cpp, Alpaca, Vicuna all stand on those leaked weights.

Anthropic 2023 · 03 Claude 1 first opens — "an AI that talks like a person, trained on books" model

Anthropic opens Claude and Claude Instant to invite-only access. At this point, Anthropic is openly playing a different game than OpenAI: building a model that talks the way a person talks. Not simply a model that answers well — a model that can sustain a long arc of conversation with a coherent persona. To get there, they introduced Constitutional AI as the alignment method, weighted book-length text heavily in pretraining, and ran a separate character training step where Claude's tone, values, and self-conception were shaped explicitly.

Looking back, this early choice becomes the DNA running through everything Anthropic ships later — Claude Code, Cowork, Skills. The first lab to differentiate on "model character" rather than "model capability."

Cursor 2023 · 03 Cursor launches — forking VS Code to build an AI-native IDE tool

Anysphere, an OpenAI accelerator alum, ships Cursor. The decisive move wasn't writing an autocomplete plugin — it was forking the IDE itself. The implicit bet: "an IDE where AI is the partner has to be a different shape from the start." Where GitHub Copilot was a guest in someone else's host, Cursor redesigned the host.

智谱 2023 · 03.14 ChatGLM-6B open-sourced — the opening of China's open-source LLM era model

Zhipu AI (now Z.ai) and Tsinghua's KEG group announce ChatGLM-130B and, on the same day, release ChatGLM-6B as open source. The first Chinese chat LLM small enough to run on consumer GPUs. From this point on, China's AI camp settles into a clear strategy: "frontier models will come from the US; we'll generalize through open-source." GLM continues through 4.5, 4.6, 4.7 as one of the more durable open-source code-tuned families.

OpenAI 2023 · 03.14 GPT-4 released — the first model that really *felt* like reasoning model

OpenAI declined to publish a paper — only a technical report, with parameter count, architecture, and training data all withheld. Even so, GPT-4 was the first model to score above the human average on tests like the SAT, the bar exam, and the USMLE. From here, the sentiment "AGI might not be far" moves into the mainstream conversation in earnest.

2023 · 03 AutoGPT goes viral — the first agent fever, mostly swinging at air tool

A single-person GitHub project hits 100k stars in two weeks. The pitch — "give GPT-4 a goal and it'll execute" — was viral enough that the word *agent* entered the general developer vocabulary right here. In reality, most AutoGPT runs got stuck in infinite loops or burned API credits on nonsense, but the category was now named. It would take about eighteen months before agents that actually work — Devin, Claude Code, Cowork — arrived.

Alibaba 2023 · 08.03 Qwen-7B open-sourced — Alibaba enters China's open-source race in earnest model

Alibaba Cloud open-sources the first model in its Tongyi Qianwen (通义千问) line, Qwen-7B. Over the next two years, Qwen builds the most diverse open-source lineup by size and domain (coding, vision, audio) and earns a reputation for unusually natural Korean and other non-Latin-script output. Qwen3 ships in April 2025 under Apache 2.0.

Anthropic 2023 · 07 Claude 2 — 100k context window, Anthropic's first public model model

Claude 2 is the first Anthropic model open to everyone. The headline number: a 100k token context window, more than 12× GPT-4's base 8k (and over 3× its 32k variant) at the time. Long-context becomes Anthropic's permanent differentiator from here — eventually growing to 1M with Opus 4.6 (Feb 2026) and "Infinite Chats" with Opus 4.5 onwards.

Meta 2023 · 07.18 Llama 2 — Meta makes open weights official policy model

The Meta that sent takedowns when Llama 1 leaked, five months later, deliberately released Llama 2 with open weights and a permissive commercial license. Microsoft was named as a distribution partner. Open-source LLMs go from "leaked, picked up quietly" to "officially supplied by a Big Tech." A genuine paradigm flip.

Mistral AI 2023 · 09 / 12 Mistral 7B → Mixtral 8x7B — Europe's answer, and the start of the MoE era model

Ex-Meta and ex-DeepMind researchers founded Mistral AI in Paris and shipped Mistral 7B in September and Mixtral 8x7B (sparse MoE) in December. Mixtral has 46.7B total parameters but only activates 12.9B per token, beats Llama 2 70B on most benchmarks, and serves at 6× lower inference cost. "Frontier doesn't have to be one giant dense model — it can be a sparse committee of small experts" moved from research curiosity to industry default. Every open-source frontier model that follows — DeepSeek-V3, Qwen3 235B, MiniMax-M1/M2, Kimi K2 — is essentially a descendant of this turn.
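The routing mechanism behind those numbers fits in a few lines. This is a toy top-2 gate, not Mixtral's implementation: the scalar "experts", shapes, and gate scores are all stand-ins, but the structural point — only k experts execute per token, so compute scales with k rather than with the total expert count — is the real one:

```python
# Minimal sketch of sparse top-2 MoE routing (the mechanism Mixtral uses).
# Per token: a gate scores every expert, only the top 2 run, and their
# outputs are mixed by renormalized gate weights.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def moe_layer(token, gate_scores, experts, k=2):
    """Run only the top-k experts for this token; mix by gate weight."""
    topk = sorted(range(len(experts)), key=lambda i: gate_scores[i])[-k:]
    weights = softmax([gate_scores[i] for i in topk])
    # Compute cost scales with k (here 2), not with len(experts) (here 8).
    return sum(w * experts[i](token) for w, i in zip(weights, topk))

# 8 toy "experts" (scalar functions); only 2 ever execute per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
out = moe_layer(2.0, [0.1, 3.0, 0.2, 2.0, 0.0, 0.5, 0.1, 0.3], experts)
```

This is how a 46.7B-parameter model can price and serve like a ~13B one: total parameters set memory, active parameters set per-token compute.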

🐍 2023 · 12 Mamba paper — the first serious attempt to escape the transformer model

Albert Gu (CMU) and Tri Dao (Princeton, FlashAttention author) publish "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." The core trick: throw out attention entirely, replace it with a *selective state space model*, and the transformer's quadratic cost in sequence length collapses to linear. Mamba beats same-size transformers on language modeling, matches transformers twice its size, and serves 5× higher inference throughput.

This was the first serious answer to a long-running suspicion: "the transformer can't be the last architecture." The downstream wave — Jamba (Mamba + Transformer hybrid), Striped Mamba, Falcon Mamba — followed in 2024. By 2026, frontier commercial models are still largely transformer-shaped, but elements of the Mamba lineage are quietly threaded into hybrid attention designs in long-context models like MiniMax's lightning attention. An important reminder that architectural frontiers don't freeze.
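The core recurrence is easy to caricature. The sketch below is a scalar toy of the *idea* — an input-dependent ("selective") gated recurrence with one fixed-size state update per step — not Mamba's actual parameterization, which involves discretized state-space matrices and a hardware-aware parallel scan:

```python
# Toy scalar sketch of a selective state-space-style recurrence:
# h_t depends on h_{t-1} and x_t through coefficients that are themselves
# functions of the input. One O(1) state update per step, so the whole
# sequence costs O(length), versus attention's O(length^2).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def selective_scan(xs):
    h, ys = 0.0, []
    for x in xs:
        a = sigmoid(x)              # input-dependent retention ("selective")
        h = a * h + (1.0 - a) * x   # gated recurrence; state size is constant
        ys.append(h)                # readout is just the state in this toy
    return ys

ys = selective_scan([1.0, -1.0])
```

The selectivity is the whole trick: a plain linear SSM with fixed coefficients can't decide what to remember, while making the coefficients depend on the input recovers much of what attention buys, at linear cost.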

OpenAI 2023 · 11.06 OpenAI DevDay — GPT-4 Turbo, GPTs, Assistants API culture

OpenAI's first developer conference. GPT-4 Turbo (128k context, 3× cheaper), GPTs (custom GPT store), Assistants API (early agent scaffold). The implicit declaration: "OpenAI is going to be a platform." The fact that the board fired Sam Altman eleven days later makes the timing read, in retrospect, like a strange split-screen.

OpenAI 2023 · 11.17 – 22 Sam Altman fired → reinstated in 5 days — the OpenAI board coup that failed culture

On November 17, OpenAI's board fired Altman, citing that the board "no longer has confidence in his ability to lead." The company nearly disintegrated over the following five days. More than 700 of OpenAI's ~770 employees signed a letter saying they'd quit if Altman didn't return. Microsoft offered to absorb him. By November 22, Altman was back, with a reshuffled board. This was the event that showed everyone how an AI company actually works in terms of internal power. It was also the moment the alignment faction (Ilya Sutskever and the people closest to him) effectively lost their grip inside OpenAI.

Google 2023 · 12.06 Gemini 1.0 — Google drops the Bard brand model

Pichai and Hassabis announced Gemini 1.0 (Ultra / Pro / Nano) in a virtual press conference. The whole Google AI product line — formerly Bard — collapsed into the Gemini brand. After getting dragged by ChatGPT for a year and a week, this was Google's first proper counter. The launch was hurt by reporting that the demo video had been edited (Gemini hadn't actually answered as fluently and quickly as the video suggested), and the trust damage stuck. Google's *real* counter started in 2025 with Gemini 2.5 Pro and nano-banana.

Act II · 2024 · 「II. Refinement (Late) → III. Agentic methodology arrives」

The reasoning age, and the first standards for agentic methodology

Scaling law had its last full year of glory through the first half of 2024. But the moment OpenAI didn't go straight to GPT-5 from GPT-4 and instead detoured sideways into GPT-4o (May), the market started to ask the question for the first time: "Is scaling law saturating?" The answer came in September with OpenAI o1 — the formal opening of the reasoning era. Techniques like CoT, which had previously been prompt-side tricks, were absorbed into the model's own internal mechanism. In July, Llama 3.1 baked tool calling into the prompt template, fixing the open-source standard for tool use. On November 25, Anthropic published MCP — and from there, the industry's frontier started visibly migrating from "model IQ" toward "agentic methodology."

Anthropic 2024 · 03.04 Claude 3 Opus / Sonnet / Haiku — the first Claude to clearly beat GPT-4 at the frontier model

Anthropic ships the first three-tier family (most-capable Opus / balanced Sonnet / fast Haiku). Opus is the first Claude to clearly overtake GPT-4 on core benchmarks like MMLU and GPQA. From this moment on, "Claude vs ChatGPT" becomes a real user-side comparison, not a vendor pitch. Sonnet quietly settles in as the daily driver everyone actually loves.

D 2024 · 03.12 Devin appears — "the first AI software engineer" lands as marketing tool

Cognition Labs comes out of stealth with Devin. SWE-Bench 13.86% (3× the prior SOTA), a demo showing the agent actually completing an Upwork job. The video went viral, and "developers are about to be replaced" became a broadly held anxiety for the first time. General access took much longer to arrive, and Claude Code eventually took the actual market, but Devin manufactured the category — that's the part of the legacy that matters.

OpenAI 2024 · 05.13 GPT-4o — instead of jumping to GPT-5, OpenAI sidesteps into multimodal model

In hindsight this is a very interesting inflection. Instead of pushing straight to GPT-5, OpenAI stays in the 4-series and expands sideways into multimodal native (text, voice, vision in one model). The ChatGPT daily UX — voice conversations, point-the-camera-and-ask — gets redesigned around this. This is also when the market first starts quietly wondering, "Can they actually build a 5?" Four months later, o1 answers that question by changing the question.

Anthropic 2024 · 06.20 Claude 3.5 Sonnet — Claude crushes GPT at coding for the first time model

Same price, Opus-level capability. On coding benchmarks (HumanEval, SWE-Bench) it doesn't just win — it leads. From this point onward, "if it's code, it's Claude" hardens as the developer consensus. Cursor and every other IDE start defaulting to Claude.

SSI 2024 · 06 Ilya Sutskever leaves OpenAI → founds Safe Superintelligence (SSI) culture

OpenAI co-founder and the central figure of the failed board coup, Ilya Sutskever, resigned in May and founded SSI in June. The mission, in one line: build *safe* superintelligence. After Anthropic's 2021 spinoff, this is the second mass departure of OpenAI's alignment camp. At this point the alignment team inside OpenAI is effectively dismantled.

Meta 2024 · 07.23 Llama 3.1 (405B) — the first open-source model to put tool calling in the prompt template model

Meta releases Llama 3.1 in three sizes (8B / 70B / 405B). The 405B is the first open-weight model that reaches GPT-4 parity — proof that frontier is reproducible in the open. But the bigger industry consequence was elsewhere. Llama 3.1's prompt template standardized tool calling as a first-class primitive with special tokens like <|python_tag|>, <|eom_id|>, <|eot_id|>.

Before this, tool use lived in OpenAI Function Calling or in LangChain's ad-hoc wrappers. Once Llama 3.1's prompt format dropped, and once vLLM, llama.cpp, and every other inference server supported it natively, tool use became a standard primitive in the open-source world. If MCP (four months later) standardized "how a tool is defined," Llama 3.1 standardized "how a tool is called." The format keeps evolving through 3.2 (Sep) and 3.3 (Dec).
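The shape of a tool-calling turn can be sketched directly. The special tokens below are the real Llama 3.1 ones; everything around them is abbreviated (the actual template also carries system and date headers), and the `wolfram_alpha.call` invocation is an illustrative example of the built-in-tool style:

```python
# Simplified sketch of Llama 3.1's tool-calling turn structure.
# <|python_tag|> marks a tool invocation; <|eom_id|> ("end of message",
# not end of turn) tells the serving layer a tool result is expected next;
# tool output comes back under the dedicated "ipython" role.
def assistant_tool_call(code: str) -> str:
    return ("<|start_header_id|>assistant<|end_header_id|>\n\n"
            f"<|python_tag|>{code}<|eom_id|>")

def tool_result(output: str) -> str:
    return ("<|start_header_id|>ipython<|end_header_id|>\n\n"
            f"{output}<|eot_id|>")

turn = assistant_tool_call('wolfram_alpha.call(query="solve x^2 - 1 = 0")')
turn += tool_result('{"result": "x = -1, x = 1"}')
```

Once this exact string format was fixed, any inference server could implement the tool loop by watching for `<|python_tag|>` and `<|eom_id|>` — no vendor SDK in the path, which is precisely why it spread.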

OpenAI 2024 · 09.12 OpenAI o1 (Strawberry) — reasoning era opens for real model

Long known internally as "Strawberry," the model ships as o1-preview / o1-mini. The new paradigm: generate a long chain-of-thought *before* responding. On AIME math, GPT-4o scored 13%; o1-preview scored 83%. A new axis of scaling law — not just pretraining compute, but inference-time compute — opens. This discovery becomes the direct seed for DeepSeek R1 four months later.

xAI 2024 · 09 xAI Colossus goes live — 100K-GPU single cluster culture

Elon Musk's xAI brings up a 100,000-H100 single cluster in Memphis in 122 days. World's largest single cluster at the time. After this, "compute is the frontier" hardens as the industry's reigning belief, and every frontier lab starts treating single-cluster scale as a competitive axis of its own. Stargate (January 2025) is, in essence, the federal-scale escalation of this.

Apple 2024 · 10 Apple Intelligence — on-device LLMs enter the OS tool

Apple Intelligence shipped with iOS 18.1. The core fact: a model that didn't need a cloud round-trip was now baked into the OS. Writing Tools, Summarize, ChatGPT fallback in the new Siri. The initial release underwhelmed, but a new product category — on-device, privacy-first models — got installed at mobile mainstream scale.

Anthropic 2024 · 10.22 Claude 3.5 Sonnet upgrade + Computer Use — Claude takes the mouse and keyboard model

Alongside the upgraded Claude 3.5 Sonnet and the new Haiku, Anthropic shipped Computer Use: Claude reads the screenshot, computes mouse coordinates, clicks, types. Any app with a screen could now have an agent strapped to it. This was the technical seed of Cowork a year later.

Anthropic 2024 · 11.05 Claude Desktop ships (Windows / macOS) tool

Claude moves from web-only to native OS app. At the time this looked unremarkable. Three weeks later, MCP shipped — and the desktop app became the hub through which "a Claude that talks to your local files and tools" actually worked.

Windsurf 2024 · 11.13 Windsurf launches — Codeium's frontal challenge to Cursor tool

Codeium ships a standalone IDE, Windsurf, with an agent called Cascade as a first-class collaborator. The company would rename itself Windsurf in April 2025. The signal: "AI IDE is one category, and it has room for more than one player."

Anthropic 2024 · 11.25 Model Context Protocol (MCP) — Anthropic takes the standard protocol

Anthropic publishes MCP, an open protocol for connecting LLMs to external tools and data sources. The bet: "what an agent should connect to" should be solved by a *protocol*, not by per-vendor SDKs. At first it attracted modest attention. Six months later, OpenAI, Google, and Microsoft all natively support MCP. By the one-year mark (Nov 2025), Anthropic had donated MCP to the Linux Foundation's Agentic AI Foundation. The industry consensus settled around "MCP is the HTTP of LLMs."
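What "protocol, not SDK" means concretely: MCP rides on JSON-RPC 2.0. A server advertises tools as name + description + JSON Schema, and any client can invoke one with a `tools/call` request. The message shapes below follow the spec; the `read_file` tool itself is a made-up example:

```python
# Hedged sketch of the wire shapes MCP standardizes (JSON-RPC 2.0).
# The "read_file" tool is invented for illustration; the field names
# (tools, name, description, inputSchema, tools/call, arguments) are
# the protocol's.
import json

tools_list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "read_file",
            "description": "Read a file from the local workspace",
            "inputSchema": {                 # standard JSON Schema
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }]
    },
}

tool_call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "notes.md"}},
}

wire = json.dumps(tool_call_request)
```

Because both sides only have to agree on these JSON shapes, a tool written once works with any MCP-speaking model client — the property that let OpenAI, Google, and Microsoft adopt a competitor's protocol.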

OpenAI 2024 · 12.09 Sora opens — the video-generation era begins in earnest model

The Sora that had been a short demo in February 2024 finally opened to ChatGPT Pro / Plus, ten months later. Video models had existed (Runway, Pika), but OpenAI putting video generation into its own product line is what made the moment industrial. The video content industry felt directly threatened by AI for the first time at this point.

Act III · 2025 · 「IV. Agents bloom → V. Pax Anthropica + SaaS Phobia」

The year of explosions — Manus opens the agent era, Claude Code locks in Pax Anthropica, and SaaS feels its first chill

Two currents ran in parallel through 2025. First, the blooming of the agent era — on March 6, Manus went viral hard enough that invitation codes traded for ¥100,000 ($14,000) on Chinese secondary markets, and finally a real product moved into the category Devin had built with marketing a year before. Google A2A (April) added the agent ↔ agent layer. Second, the start of Pax Anthropica — Claude 3.7 + Claude Code preview (Feb), Claude 4 + Claude Code GA (May), Sonnet 4.5 + Skills (Sep), Opus 4.5 + 80% price cut (Nov). Anthropic systematically took one domain after another, from coding to enterprise. In parallel, SaaS phobia — Salesforce, Workday, ServiceNow waking up to the threat that "an AI agent can just route around the SaaS lock-in," the same anxiety that surfaced through Benioff's public statements (see chapter 2) — became a shared industry conversation.

DeepSeek 2025 · 01.20 DeepSeek R1 ships — and one week later, $1T in US tech market cap evaporates model

A reasoning model from a Chinese company spun out of a hedge fund matches o1's performance — and the weights are open. Reported training cost: $5.6M (the number is contested, but it was clearly not billions). On January 27, NVIDIA dropped 17% (market cap −$589B, the largest single-day loss for any single company in stock-market history). Total US tech market cap evaporated: roughly $1T. The Hill and The Guardian called it AI's "Sputnik moment." The thesis that "compute is the frontier" cracked for the first time.

OpenAI 2025 · 01 OpenAI Operator ships — the first native computer-use agent product tool

OpenAI shipped Operator inside ChatGPT Pro: a direct response to Anthropic Computer Use (Oct 2024). The model drives the browser itself. Lots of clumsy moments at first, but it was OpenAI's answer to *what an agent UX should look like.*

OpenAI 2025 · 01.21 Stargate $500B announced — Trump + OpenAI + Oracle + SoftBank protocol

Right after Trump's inauguration, OpenAI + Oracle + SoftBank + MGX announce a $500B, four-year buildout of US AI infrastructure: the Stargate Project. $100B deployed immediately. SoftBank carries financial responsibility, OpenAI carries operational responsibility. The timing — six days after the DeepSeek shock — was not subtle. "Defend the frontier with compute" as the American answer. By September 2025, five additional US data center sites pushed planned capacity past ~7GW.

2025 · 02.02 Andrej Karpathy tweets "vibe coding" into existence — naming the moment AI started eating coding culture

Karpathy (OpenAI co-founder, former Tesla AI lead) throws a single tweet onto X: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good." By his own admission, it was "a shower of thoughts throwaway tweet that I just fired off." And yet it became one of the industry's defining words of the year.

What made the phrase stick wasn't novelty — it was that it named, in three syllables, a real paradigm shift in developer experience: "you stop reading the code, you steer the vibe, and something working comes out." Twenty-two days later, Claude 3.7 + Claude Code gave the phrase a product to inhabit. Before the year was out, Collins English Dictionary named "vibe coding" its 2025 Word of the Year.

Anthropic 2025 · 02.24 Claude 3.7 Sonnet + Claude Code preview — vibe coding gets a real product model

Anthropic ships Claude 3.7 Sonnet (a hybrid reasoning model) and the Claude Code research preview. Natural-language commands in the terminal, with Claude directly working the filesystem and git. This becomes the *product realization* of the phrase "vibe coding" that Karpathy tweeted into the world three weeks earlier. GA in May with Claude 4. For the year that follows, Claude Code plus Cursor and Windsurf set the new default of the development workflow.

2025 · 03.06 Manus launches — the signal event for the blooming of the agent era tool

Butterfly Effect (China-origin, Singapore-based) opens Manus to invitation-only beta — a general-purpose agent. The demo video (resume screening, stock analysis, trip planning all completed autonomously end-to-end) crossed a million views in 24 hours. Invitation codes started trading on Chinese secondary markets for ¥50,000–100,000 (USD $7,000–13,800). For the first time since the 2023 AutoGPT mania, "agent" the word matched up to "agent" the actual working product.

One year after Devin manufactured the category with marketing, Manus put a usable product into it. MIT Technology Review wrote in June that "Manus kick-started a Chinese AI agent boom." In December, Meta acquired Manus — Zuckerberg made the founder a VP. From that point on, the current Manus had started was absorbed into US Big Tech.

OpenAI 2025 · 03.26 ChatGPT Studio Ghibli-fication goes viral — "our GPUs are melting" culture

GPT-4o's native image generation opens, and within a week, X, Instagram, and LinkedIn fill up with family photos turned into Studio Ghibli-style anime cels. Sam Altman tweeted "our GPUs are melting" and OpenAI threw on rate limits. Miyazaki's old "AI is an insult to life itself" line recirculated. This was the first moment the broad public consumed image generation not as an API but as a social-media meme.

Google 2025 · 04.09 Google A2A + ADK — the orchestration era is declared protocol

At Google Cloud Next, Google announced the Agent2Agent (A2A) protocol and the Agent Development Kit (ADK). If MCP is "agent ↔ tool," A2A is "agent ↔ agent." Over 150 organizations signed on immediately (Atlassian, Box, Salesforce, ServiceNow, SAP, etc.). Donated to the Linux Foundation in June. This was the inflection where "orchestration" became the keyword of the year.

Alibaba 2025 · 04.28 Qwen3 family — Apache 2.0, very strong on non-English model

Alibaba open-sources the entire Qwen3 series under Apache 2.0, from 0.6B all the way to 235B MoE — the broadest size range available in open source. For local-LLM setups, Qwen3 has become a default choice, particularly in non-English contexts where its output quality is unusually strong. The core family powering the Mac mini local-LLM wave.

Anthropic 2025 · 05 Claude 4 (Opus 4 / Sonnet 4) + Claude Code GA model

Anthropic's 4th generation. SWE-Bench Verified breaks into the 70s. Claude Code goes GA at the same time. From this point on, "Claude Code is just part of the developer's default setup" — Cursor and Windsurf use Claude as the model, and the terminal uses Claude Code itself.

Google 2025 · 06.17 Gemini 2.5 Pro — 1M token context window model

The moment Google's counter actually starts landing. Stronger reasoning + 1M token context. The pitch — "feed an entire codebase into context in one shot" — lands well in enterprise and coding. (1M context itself was first introduced in Gemini 1.5 Pro, but 2.5 Pro is the first to bring it at production quality together with reasoning capability.)

MiniMax 2025 · 06 MiniMax-M1 open-sourced — 1M context, hybrid attention model

Shanghai's MiniMax open-sources a hybrid MoE model under Apache 2.0 with 1M input / 80k output context. 456B total / 45.9B active. A novel "lightning attention" variant cuts long-context cost meaningfully. Chinese open-source enters frontier-class context competition for real.

2025 · 06 – 07 Gemini CLI · Qwen Code · OpenAI Codex CLI — Big Tech swarms the Claude Code category tool

A month after Claude Code GA (May 2025) defined the "agentic coding CLI that lives in the terminal" category, the whole industry piles in. June: Gemini CLI (Google, Apache 2.0, Gemini 2.5 Pro backend with 1M context). July: Qwen Code (Alibaba, forked from Gemini CLI with Qwen3-Coder strapped in). OpenAI's Codex CLI, which had shipped quietly back in April, gets re-based onto GPT-5 after its August release. Add Aider, Cline, Goose (Block's agent), and Sourcegraph Amp to the pile, and by 2026 the cumulative count of CLI agents on the market crosses thirty.

The signal is clear: the coding-tools frontier shifted from IDE plugins to terminal CLIs. Cursor and Windsurf had built IDE-native AI. What came after Claude Code is rawer: "don't go through the IDE — talk directly to an agent in the terminal." The default UX of vibe coding migrated from in-IDE chat to a CLI prompt at exactly this inflection.

Moonshot AI 2025 · 07 Moonshot Kimi K2 — 1-trillion-parameter open-source MoE model

Moonshot AI open-sources a 1T total / 32B active MoE under a modified MIT license. Strong on coding benchmarks. September: K2-Instruct-0905 extends context from 128k → 256k. January 2026: multimodal Kimi K2.5. The single clearest signal that the Chinese open-source camp has reached the frontier.

OpenAI 2025 · 08.05 gpt-oss-120b / 20b — OpenAI returns to open source model

The first OpenAI weights release since GPT-2 — six years. Apache 2.0. The 120B runs on a single 80GB GPU at roughly o4-mini-grade reasoning; the 20B fits a 16GB device at roughly o3-mini grade. The US open-source camp's counter-punch to DeepSeek / Llama / Qwen — and a partial recovery of the company's literal name.

OpenAI 2025 · 08.07 GPT-5 ships — the major-version jump, finally, after 2.5 years model

After GPT-4 (March 2023) and the long detour through GPT-4o, o1, and o3, the major version number finally moves. 400K token context. The core architectural move: GPT-5 is a system, not a single model — a fast model, a deep-reasoning model, and a real-time router that picks between them based on the conversation. The user no longer toggles reasoning on or off; the model decides. ChatGPT's default UX is rebuilt around this.
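The routing idea itself is simple enough to sketch. The heuristic and the model labels below are invented for illustration; this is not OpenAI's actual routing logic:

```python
# Toy sketch of a "system, not a single model" router:
# cheap requests go to a fast model, hard ones to a reasoning model.

def needs_deep_reasoning(message: str) -> bool:
    """Invented heuristic: long or explicitly analytical requests go deep."""
    markers = ("prove", "step by step", "debug", "why")
    return len(message) > 400 or any(m in message.lower() for m in markers)

def route(message: str) -> str:
    """Pick a backend label for the incoming message."""
    return "reasoning-model" if needs_deep_reasoning(message) else "fast-model"
```

A production router would also weigh conversation history, tool availability, and latency budgets; the point is only that model choice moves from a user toggle to a per-message decision.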

Google 2025 · 08.26 Nano Banana (Gemini 2.5 Flash Image) — the new image-edit standard model

Google's image model — anonymously A/B tested on lmarena under the codename "nano-banana" — gets its real release as Gemini 2.5 Flash Image. Character consistency was decisively better than the competition's, and it became the default image model for social-media users and content creators almost overnight. Nano Banana Pro followed in November, and Nano Banana 2 on February 26, 2026.

Anthropic 2025 · 09 Claude Sonnet 4.5 — the coding workhorse settles model

Sonnet 4.5 ships as the price-performance king for coding and agent tasks — beating Opus 4 in many real cases. Anthropic's tier strategy of "frontier performance at Sonnet pricing" is fully validated here. Almost every coding tool — Cursor, Windsurf, Cline, Claude Code — moves its default to Sonnet 4.5.

OpenAI 2025 · 09.30 Sora 2 + iOS app model

OpenAI ships Sora 2 alongside a standalone iOS app designed for short-form video consumption in a feed UI — a native-AI run at TikTok and Reels. The app was shut down in April 2026, ending the first experiment in mainstream AI video, but it remains the most serious attempt to answer the question "can AI video live inside daily content consumption?"

Anthropic 2025 · 10.16 Claude Skills — the other standard that fills in MCP's gap protocol

Anthropic ships Skills. If MCP is the standard for connectivity ("how the agent reaches outside"), Skills is the standard for expertise ("how the agent does a given thing well"). Bundle task-specific instructions, required tools, scripts, and few-shot examples into a folder anchored by a SKILL.md manifest, and Claude auto-loads it when the task matches. Progressive disclosure keeps context windows tight: a skill's full contents load only when needed.

What's interesting about this release is the meta-move: Anthropic patching the weakness of its own MCP with its own Skills. Anyone who has tried building agents on MCP alone hits the gap quickly — you have tool access, but no built-in expertise about *when*, *why*, and *how* to use which tool. Stuffing that into the prompt every time is wasteful; putting it in the system prompt doesn't scale per task. Skills fills the gap. Simon Willison wrote that Skills "might be a bigger deal than MCP." Two months later (December 18), Anthropic published Agent Skills as an open standard — running the MCP playbook a second time.

From this moment, the industry's keyword sequence becomes legible: prompt engineering → context engineering → skill / harness engineering.

OpenAI 2025 · 10.30 OpenAI Aardvark (→ Codex Security) — frontier models officially enter the cybersecurity front tool

OpenAI ships Aardvark: a GPT-5-powered autonomous security researcher. The agent monitors repos and commits, finds vulnerabilities, validates them with a working exploit, and proposes patches. The novel move is methodological — instead of fuzzing or static analysis, it uses LLM reasoning and tool use to understand code behavior and reason about vulnerabilities. On "golden repository" benchmarks, Aardvark catches 92% of known and synthetic vulnerabilities.

Later, Aardvark is rebranded as Codex Security and folded directly into Codex (web), surfaced as a first-class feature in ChatGPT Enterprise / Business / Edu. The signal from this quarter is unmistakable: one of the next axes of differentiation for frontier models is security. Anthropic's Mythos Preview, six months later, is the second event on this same line.
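Returning to Skills for a moment: a skill is a small folder anchored by a SKILL.md file, YAML frontmatter (name, description) followed by markdown instructions. The skill below, a hypothetical invoice-checking skill, is invented for illustration:

```markdown
---
name: invoice-check
description: Check incoming invoices against the purchase-order log before approval. Use when asked to review or approve an invoice.
---

# Invoice check

1. Extract the vendor, amount, and PO number from the invoice.
2. Look up the PO number in po_log.csv, bundled alongside this file.
3. Flag any vendor or amount mismatch; otherwise draft an approval note.
```

Only the frontmatter (name and description) sits in context up front; the body and any bundled files load when the description matches the task, which is the progressive disclosure mentioned above.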

MiniMax 2025 · 10.27 MiniMax-M2 — the open-weight SOTA throne model

230B total / 10B active. According to Artificial Analysis's independent evaluation, ranked #1 among all open-weight systems on the Intelligence Index. MIT license. The point where Chinese open-source moves from "frontier-adjacent" to "frontier."

Anthropic 2025 · 11.24 Claude Opus 4.5 — Anthropic cuts price 80% and aims at Wall Street model

Performance up vs. Opus 4. Price slashed: $5 / $25 per million tokens, an 80% drop from Opus 4's $15 / $75. "Infinite Chats" introduced as automatic context window management. At this price, token-heavy enterprises — finance, legal, consulting — start large-scale migration to Claude. This is the opening shot of the "Wall Street Claude shock." Two months later, Cowork lands and completes the move.

🐾 2025 · 11 Clawdbot first release (Peter Steinberger) — the project that becomes OpenClaw tool

Peter Steinberger (formerly of PSPDFKit) open-sources a CLI called Clawdbot: the pitch is "with a Claude API key, run a Claude Code-class agent in any environment you want." Anthropic sends a trademark complaint, the project renames to Moltbot on January 27, 2026, then renames again to OpenClaw three days later. This is the seed that becomes the "OpenClaw mania + Mac mini shortage" of spring 2026.

Act IV · 2026.01 – 2026.05 · 「VI. The Personal-Agent Age」

The Harness era — OpenClaw mania, the Mac mini on the desk, the moment everyone gets their own agent

The headline keyword of 2026 is harness engineering, and the product shape it takes is personal agents. Less about model IQ; more about the five surrounding layers — instructions, tools, memory, feedback loops, orchestration — and how carefully you weave them. Hermes Agent gave the concept its flag. OpenClaw mania brought it down to the level of "a single person runs a 24/7 agent on the Mac mini on their desk." Anthropic, meanwhile, was aiming at enterprise with Cowork and Opus 4.5/4.6/4.7. And in April, Anthropic's policy split — "Claude Code subscriptions can no longer bankroll third-party harnesses" — drew the first explicit power line between model providers and harness providers.

Anthropic 2026 · 01.12 Claude Cowork research preview — Claude Code's office-worker sibling tool

If Claude Code was an agent for developers, Cowork is an agent for office workers. It runs on the desktop, drives local files and apps directly, and finishes multi-step work end-to-end. From February, Cowork connects to enterprise tools — Google Drive, Gmail, DocuSign, FactSet, Zoom. CNBC framed it as "Anthropic building a productivity boost for the average office worker." Combined with the Opus 4.5 price cut, this is what completed the Wall Street Claude shock — finance and consulting adoption went vertical.

🐾 2026 · 01.27–30 Clawdbot → Moltbot → OpenClaw rename theater tool

Anthropic's trademark complaint forces Clawdbot to rename — first to Moltbot on January 27, then to OpenClaw three days later. Two renames in one week. Each one took the Hacker News front page. The actual concept — "with one Claude API key, run a Claude Code-class agent anywhere" — got cemented in the general developer mind over those few days. The fuse for the April drama.

Anthropic 2026 · 02.05 Claude Opus 4.6 — 1M context window arrives on Opus model

Anthropic finally puts a 1M token context window into the Opus line. Gemini 2.5 Pro's long-context lead gets neutralized. Combined with Cowork, the usage pattern of "feed the whole document / the whole codebase / the whole company data into context at once" goes mainstream.

2026 · 02.25 Hermes Agent ships (Nous Research) — the flag of harness engineering tool

Nous Research releases Hermes Agent, an open-source "self-improving agent" harness. Around the same time, the LangChain team published an experiment showing that holding the model constant and varying only the surrounding harness — instructions, constraints, feedback loops, memory, orchestration — meaningfully changes agent performance. "Harness engineering" hardens into the industry keyword, and Hermes is the most ambitious answer to that question — 95k+ GitHub stars in seven weeks, a v0.9 "everywhere release" by April 13. The fastest-growing open-source agent framework of 2026.
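The experiment's premise (fix the model, vary only the surrounding layers) can be sketched in a few lines. Everything below, from the Harness class to the stub model, is illustrative and is not Hermes' or LangChain's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

def frozen_model(prompt: str) -> str:
    # Stand-in for one fixed LLM; reports how much context reached it.
    return f"model saw {len(prompt)} chars of context"

@dataclass
class Harness:
    """The layers around the model: instructions, tools, memory."""
    instructions: str
    tools: dict[str, Callable[[str], str]]
    memory: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # 1. Assemble context from instructions, memory, and the task.
        prompt = "\n".join([self.instructions, *self.memory, task])
        # 2. Call the fixed model (the constant in the experiment).
        answer = frozen_model(prompt)
        # 3. Feedback loop: remember the outcome for the next call.
        self.memory.append(f"done: {task}")
        return answer

# Same model, two harnesses: only the surrounding layers differ.
bare = Harness(instructions="", tools={})
rich = Harness(
    instructions="You are a careful coding agent. Plan before acting.",
    tools={"read_file": lambda path: f"<contents of {path}>"},
)
```

The model sees a different world through each harness, which is the whole claim: with the weights frozen, agent behavior is a function of what you wrap around them.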

Google 2026 · 02.26 Nano Banana 2 — Gemini chat / Search / Lens integration model

Nano Banana Pro's quality + Gemini Flash's speed, combined and integrated across Gemini chat, Search AI Mode, and Lens. Image generation stops being a standalone product and becomes "part of search," "part of the camera."

Apple 2026 · Spring Mac mini shortage — global hardware bottleneck driven by OpenClaw mania culture

The setup — "one Claude API key + a local Qwen3 or Kimi K2 + an OpenClaw / Ollama runtime = a 24/7 personal agent server" — went viral. The M4 Mac mini, with the best performance-per-watt in the category, sold out worldwide. In Korea, used-market listings were snapped up within hours, Apple's official store fell behind on delivery, and marketplace prices on Coupang climbed past MSRP (Danawa's price charts tracked the spike). The mental model of "AI server = data center" pivoted, in one quarter, to "AI server = the box on my desk."

Anthropic 2026 · 04.04 Anthropic separates OpenClaw billing — third-party harness usage gets walled off culture

Anthropic announces that, starting April 4, Claude Code subscription limits can no longer be used for third-party harnesses like OpenClaw or NanoClaw — those become pay-as-you-go on a separate bill. Six days later (April 10), Steinberger gets temporarily banned, then restored. On April 14, Steinberger announces he's joining OpenAI. The first time the question "how much harness ecosystem does the model company allow on top of itself?" became a live industry issue.

Anthropic 2026 · 04.07 Claude Mythos Preview + Project Glasswing — a new tier above Opus, and a cybersecurity consortium model

Anthropic announces Claude Mythos Preview, a new model tier above Opus — not an Opus upgrade, a separate tier. They explicitly choose not to release it publicly. Instead, access is granted only through Project Glasswing, an industry consortium with AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto Networks, and roughly 40 other organizations as founding members.

The reason for the gating is specific. In the weeks before the announcement, Anthropic used Mythos to find thousands of zero-day vulnerabilities across every major OS and every major browser. The framing Anthropic chose: "AI's coding capability has crossed the point where it can surpass nearly all human security professionals at finding and exploiting software vulnerabilities, and openly releasing a frontier model with that capability is no longer safe."

The implication is clear. From here, "how widely to release" becomes a more important question than IQ for frontier models. And cybersecurity — following OpenAI Aardvark six months earlier — has firmly settled in as a frontier camp axis. This is the most consequential event sitting at the trailing edge of this chapter.

Anthropic 2026 · 04.16 Claude Opus 4.7 GA — the current frontier model

Keeps the 1M context window, adds further gains on coding and long-horizon agent tasks. As I write this on May 11, 2026, this is the current latest Opus. GitHub Copilot rolled Opus 4.7 to GA on the same day — Claude is now the default option in essentially every major IDE.

Where on this coordinate system are we standing right now

Anyone who reads to the bottom of that timeline will see one thing on their own. The roughly two years from November 2022 to fall 2024 were an era of the model itself. From the moment MCP was announced in November 2024, the industry’s attention started visibly migrating to everything that surrounds the model. Tools (IDEs, agents, CLIs), protocols (MCP, A2A, Skills), context engineering, and finally harness engineering. Model IQ is still climbing every quarter, but it has stopped being where the differentiation lives.

If you pull back a little further, the path LLMs have walked so far organizes itself into the following seven stages. A familiar curve: an industry discovers a new layer, commoditizes it, then climbs to the next layer above. Compilers commoditized → OS engineering era. OS commoditized → framework engineering era. Frameworks commoditized → product engineering era. Same shape.

Stage I

Birth of the Models

2018 – 2022

BERT · GPT-2/3 · Anthropic spinoff · InstructGPT (RLHF) · GitHub Copilot · FIM

Stage II

Model Refinement

2022.11 – 2024 fall

ChatGPT shock · peak scaling law · GPT-4 · Claude (books · persona) · CoT / ReAct / ToT · Cursor · Perplexity (Gen-1 AI tools) · MoE (Mixtral) · Mamba · o1 opens reasoning era

Stage III

Saturation → Agentic Methodology

2024.07 – 2025.04

Llama 3.1 (tool-call standard) · MCP (Anthropic) · Computer Use · A2A + ADK (Google) · context engineering replaces prompt engineering

Stage IV

Agents Bloom

2025.02 – 2025.06

Manus mainstreams the category Devin opened · Operator · Claude Code preview → GA · Cursor / Windsurf pivot to agent mode

Stage V

Pax Anthropica + SaaS Phobia

2025.05 – 2026.02

Claude Code GA · Skills (patching MCP) · Opus 4.5 −80% price · Cowork · Wall Street Claude shock · CLI proliferation (Gemini / Qwen / Codex) · Salesforce · Workday SaaS lock-in at risk

Stage VI · NOW

Personal Agents + Cybersecurity Frontier

2026.01 – 2026.05

OpenClaw mania · Mac mini shortage · Hermes Agent (harness engineering) · 24/7 personal agent on the desk · first power-line between model providers and harnesses · Aardvark / Mythos + Project Glasswing — cybersecurity emerges as a frontier axis

Stage VII

What's Next?

2026.06 –

Settled equilibrium between model and harness? · agent-to-agent economy? · personal AI servers as mainstream? · the layer we haven't named yet

Looking at this diagram, two things show up at once. First, every transition between stages happens the same way: “only after the previous layer is sufficiently commoditized does the climb to the next layer begin.” Stage III started when model IQ first began to saturate. Stage VI started when Claude API keys themselves became, effectively, a commodity. Second, when you combine that curve with the empirical fact that frontier model IQ is rapidly turning into a commodity right now — Opus 4.5’s 80% price cut, gpt-oss and DeepSeek and Qwen and MiniMax all giving away SOTA — Stage VII is already visible above the horizon.

Layered on top of all of this is one more current: SaaS phobia. The mood that started forming in Stage V — once Cowork started plugging directly into Google Drive, Gmail, DocuSign, and FactSet — settled into a clear realization for enterprise SaaS like Salesforce, Workday, and ServiceNow: “an agent doesn’t have to fight your lock-in. It just routes around it.” Benioff’s public statements (covered in chapter 2) are surface readings of that mood. How this current resolves in Stage VII — whether SaaS becomes agent-native and survives, or whether agents quietly absorb the seat SaaS used to sit in — is going to be the single biggest enterprise question for the next year or two.

So, three observation coordinates I’d recommend to anyone trying to follow this industry from here. One: the precision of the harness will explain more of product quality than frontier model IQ. Two: “which model I use” will matter less than “which context · tool · memory system I work inside” for individual productivity. Three: the power-and-policy frontier between model providers (Anthropic / OpenAI / Google) and harness providers (Hermes, OpenClaw, LangChain) will be the most interesting industry story of the next quarter — Anthropic’s OpenClaw billing split was the first small skirmish on that line.

Finally — tools change weekly, models change monthly, but this seven-stage coordinate system doesn’t move that much. If you build the small habit of asking, every time a new release lands, “which era of which current does this sit in?”, you can keep observing this industry for the rest of your career.

So, what’s next?

AI is an industry. Which means we should be listening, attentively, to what each frontier lab is actually saying out loud. We’re lucky to be doing this in the age of YouTube, LinkedIn, and Twitter — their statements, talks, and offhand tweets are unusually easy to access, and I think that’s a gift worth being grateful for. I’ll try to organize that — what each lab is saying, and what it implies — in the next post.

A confession in closing: this timeline is curated by what I personally judged to be important, and I want to be honest about that. Plenty of contemporaneous events didn’t make it in — EU AI Act passage, NVIDIA Blackwell, Microsoft Recall, Adobe Firefly, ElevenLabs, Replit Agent, Korea’s Naver HyperCLOVA X, Kakao Kanana, and many others — and their absence reveals my own attention bias. There’s also a heavy Anthropic tint to the piece, which is the natural shadow of being an FDE whose daily driver is Claude.