
The AI Stack in 2026

Agents, MCP, evals, context engineering, vibe coding: everything that actually matters if you're building with AI right now.

#AI #LLMs #Agents

I keep having the same conversation. A friend who’s been deep in their own product for the last year asks me what they should actually be learning right now. They’ve used Copilot a bit, maybe tried Claude or ChatGPT for some things, but they can tell something bigger happened and they missed it.

They’re not wrong. I wanted to write down what I’d tell them over coffee.

This isn’t a post about how transformers work or what a neural network is. There are great resources for that already. This is the stuff that actually matters if you’re building software with AI today.

Agents changed everything

The biggest thing that happened is that AI went from “get a response” to “do a task.”

But think about what last year actually looked like. You’d grab some broken code, throw it into ChatGPT, get something back, paste it in, realize it didn’t quite work, go back again. Sometimes you’d go back and forth three or four times before it worked. Now you can just tell an agent “the /settings page is broken on mobile, fix it” and it goes off and reads your code, edits a bunch of files, runs the tests, sees what failed, fixes those too, and opens a PR.

The first time I saw this work end to end, I kept expecting it to fall over. It did fall over sometimes! But way less than I expected. And the pace of improvement is fast enough that things which didn’t work last month work now.

So what actually makes an agent different from just… chatting with a model? It’s the loop. A chatbot is one pass: input goes in, output comes back. With an agent, there’s a goal. The model figures out a plan, picks some tools it thinks it needs (your file system, a browser, whatever), and starts executing. After each step it looks at what happened and decides whether to keep going or try something else. Sometimes it loops three times, sometimes thirty.
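The loop itself is simple enough to sketch. Everything here is a stand-in: the `pickNextStep` policy plays the role of the model, and tools are plain functions. But the shape is the whole idea: ask for a step, execute it, look at the result, repeat.

```typescript
// A minimal agent loop sketch. The model, tools, and stopping logic are
// all stand-ins; a real agent would call an LLM API at each step.
type Tool = (input: string) => string;

function runAgent(
  goal: string,
  tools: Record<string, Tool>,
  pickNextStep: (goal: string, history: string[]) => { tool: string; input: string } | null,
  maxSteps = 30
): string[] {
  const history: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    // Ask the "model" what to do next, given everything that happened so far
    const step = pickNextStep(goal, history);
    if (step === null) break; // the model decides the goal is met
    const result = tools[step.tool](step.input);
    history.push(`${step.tool}(${step.input}) -> ${result}`);
  }
  return history;
}
```

The `maxSteps` cap matters in practice: without it, a confused agent will happily loop forever.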

What’s even more interesting is multi-agent systems. Instead of one model trying to do everything, you set up a few agents working together. One does research, one writes code, one reviews. You’re orchestrating a little squad. The patterns have names now (ReAct, Reflection, Planning, Human-in-the-loop), and if you’ve done microservices work, the mental model is surprisingly familiar.
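A toy version of that squad, with each agent reduced to a plain function rather than a model call with its own system prompt, looks like this:

```typescript
// A toy multi-agent pipeline: each "agent" is just a function here.
// In a real system each would be a model call with its own prompt and tools.
type Agent = (input: string) => string;

function pipeline(agents: Agent[], task: string): string {
  // Each agent's output becomes the next agent's input,
  // like services passing a request down a chain.
  return agents.reduce((work, agent) => agent(work), task);
}
```

This sequential hand-off is the simplest pattern; the named ones above add loops (Reflection), interleaved tool use (ReAct), or a person in the chain (Human-in-the-loop).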

Here’s the part that doesn’t get talked about enough, though.

Reviewing agent work is genuinely hard. I’ve gotten back a 500-line diff from an agent and realized I had no idea if half of it was correct. It’s this weird thing where checking the work sometimes takes longer than just doing the work would have. And if you’re just skimming the diff and hitting merge, you’re not saving time. You’re borrowing it.

What I’ve found works: keep the loops tight. Let the agent make one change, run the tests, look at what happened. Don’t kick off a big task and walk away for an hour. That’s when you come back to a mess.

Reasoning models

You might have heard of chain-of-thought prompting. You ask a model to think step by step before answering. That trick is now just built in to certain models.

OpenAI’s o-series, DeepSeek R1, Claude’s extended thinking. These models actually spend time working through a problem before responding. It’s not instant anymore. The model thinks, backtracks sometimes, tries a different approach, and then gives you the answer.

For engineering tasks, this is a big deal. These models are meaningfully better at debugging, at catching edge cases you didn’t think of, at writing code that doesn’t just handle the happy path. I find myself reaching for a reasoning model whenever the problem has any real complexity to it.

The trade-off is tokens and latency. A reasoning model response might cost 10x what a regular one does. So you do what engineers always do: pick the right tool. I’m not using an expensive reasoning model for simple extraction or classification. That would be like spinning up a new database connection when the answer’s already in cache.

// Pick the right model for the job
const model = task.requiresReasoning ? 'claude-opus-4-6' : 'claude-haiku-4-5'

MCP

If you only learn one new thing this year, make it MCP.

The Model Context Protocol is an open standard for connecting AI to external tools and data. The problem it solves is boring but important: before MCP, if you wanted your agent to talk to Postgres, you’d write a custom integration. GitHub? Another one. Slack? Another one. Every tool was its own bespoke wiring.

MCP gives you one protocol. Plug in an MCP server for Postgres, another for GitHub, another for Slack. Your agent discovers what’s available and decides when to use each one. People keep calling it “USB-C” for AI and honestly the analogy is pretty accurate.

// Sketch, not a real SDK call: the point is one protocol, many servers
const client = new MCPClient({
  servers: [
    postgresServer,
    githubServer,
    slackServer,
    browserServer
  ]
})

Context is the whole game

You’ve heard of prompt engineering. Context engineering is what that grew up into.

The thing I keep trying to explain to people is that what you get out of a model depends almost entirely on what you put in. I don’t mean the prompt specifically. I mean everything. The documents you retrieved, the tools you made available, the conversation history, stuff it remembers from last time, examples of what good output looks like, rules you set up.

To make this concrete:

"Summarize this document"

versus:

System: You write for a developer audience.
Short paragraphs. Lead with the takeaway.
Here are examples of good summaries: [examples]
The reader is evaluating whether to adopt this tool.
They have 5 minutes.
Document: [attached]

Completely different output. Same model both times. The only thing that changed is what you gave it to work with. I keep seeing teams blame the model when the real issue is that they’re not giving it enough to work with.

People I’ve talked to who are building really good AI products right now all seem to have figured this out. They spend more time thinking about what goes into the model than which model to use. It reminds me of database design, honestly. If you get the schema wrong, everything built on top of it is going to be painful. Context is like that.

MCP, RAG, memory systems, rules files, structured system prompts. These are all just different ways of feeding context to a model.

RAG is still the workhorse here. The idea is dead simple: before you ask the model anything, go search your own data and pull back whatever’s relevant. Then you shove those chunks into the prompt alongside the question. That way you’re not hoping the model magically knows about your internal docs or your API. You looked it up for it. It sounds simple but getting retrieval right is its own whole thing. Chunking strategy, embedding model choice, re-ranking, hybrid search. Most production AI products I’ve seen are basically a RAG pipeline with a nice UI on top. If you’ve shipped a real AI feature to users, you’ve probably written more retrieval code than prompt code.
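To make the retrieve-then-prompt shape concrete, here’s a toy version. It scores chunks by word overlap with the query instead of embeddings, and `retrieve` and `buildPrompt` are invented names; a real pipeline would use an embedding model and a vector store, but the flow is the same.

```typescript
// Toy retrieval: score chunks by word overlap with the query, keep the
// top k, and splice them into the prompt ahead of the question.
function retrieve(query: string, chunks: string[], k = 3): string[] {
  const queryWords = new Set(query.toLowerCase().split(/\W+/));
  return chunks
    .map(chunk => ({
      chunk,
      // Count how many query words appear in this chunk
      score: chunk.toLowerCase().split(/\W+/).filter(w => queryWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}

function buildPrompt(query: string, chunks: string[]): string {
  // Retrieved context goes in front of the question, so the model
  // answers from your data instead of guessing
  return `Context:\n${retrieve(query, chunks).join("\n")}\n\nQuestion: ${query}`;
}
```

Swapping the overlap score for cosine similarity over embeddings is where the real engineering starts, along with chunking and re-ranking.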

How code gets written

Andrej Karpathy coined “vibe coding” in early 2025. It caught on fast enough that Collins Dictionary made it Word of the Year. The short version is you just tell the AI what you want and it writes the code.

People are actually shipping software this way. But the name is a bit misleading about the range of what’s happening.

Pure vibe coding is one end of it. You describe what you want, you take whatever the model gives you, you don’t look at the code too carefully. This is great for prototypes and personal tools. I use this for stuff I don’t plan to maintain.

Most professional engineers are doing something more like AI-assisted development. The model writes code, you read it, you understand what it did, you make changes where needed. You own the architecture. You still have to know what you’re doing. The model just saves you a lot of typing.

And then there’s something people are calling spec-driven development. The idea is you write a proper spec first, covering inputs and outputs and edge cases and constraints, before you even open the AI tool. Then you hand that spec to an agent and it implements the whole thing. I’ve been seeing more teams go this direction, especially when the code needs to actually work in production. Which, you know, it usually does. Turns out that if you spend more time on the spec, the agent does a way better job. It’s just the context thing again.

Simon Willison said something I keep coming back to: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book. That’s using an LLM as a typing assistant.”

The tools are good now. Cursor, Windsurf, Claude Code, GitHub Copilot, Codex CLI. They understand your whole project, not just whatever file is open. They can run your tests, check linting, use a browser to look at your running app, and iterate until things work.

But here’s something I think people need to hear. METR ran a randomized controlled trial and found that experienced open-source developers were actually 19% slower when using AI coding tools. Before the experiment, they predicted they’d be faster. After the experiment, they still believed they’d been faster. That’s a weird result and it stuck with me.

I don’t think it means the tools are bad. My read is that they’re good enough to make you feel fast even when you’re not actually being fast. You’re producing more code, sure. But someone still has to read all that code. Someone has to catch the bugs in it. And if you’re not doing that carefully, you end up spending the time later anyway, just in a less fun way.

Evals or it didn’t happen

Here’s the thing nobody talks about at the demo stage but everyone talks about in production: how do you know your AI feature actually works?

Not “works on the three examples I tried.” Works across the weird inputs your users send it. Works after you swap in a new model. Works after you tweak the system prompt because a customer complained. This is the evals problem and I’m genuinely surprised how many teams skip it.

An eval is just a test for your AI system. You have inputs, expected behaviors, and some way of scoring the output. Sometimes that’s deterministic, like “did the function call return valid JSON.” Sometimes you need an LLM to judge the output, which feels weird but actually works pretty well if you write a clear rubric. Anthropic published a whole guide on this recently and it’s worth reading.

The reason this matters so much is that AI systems fail silently. A traditional bug crashes or throws an error. An LLM just confidently gives you a worse answer and nothing in your monitoring lights up. I’ve seen teams push a prompt change that tanked quality on 15% of queries and not notice for weeks because they had no evals running. Nobody wants to be that team.

What I’ve settled into: treat evals like tests. Run them in CI. If you change the prompt, the model, the retrieval logic, or the tool schema, run the suite before you deploy. Have a set of golden examples that cover your important cases and your known failure modes. You don’t need anything sophisticated to start. Fifty examples you picked by hand, with basic pass/fail logic, will already catch most of the bad stuff.
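A starter harness really can be that small. The example shape and checks below are made up for illustration, but fifty entries like this running in CI will already catch most regressions:

```typescript
// A bare-bones eval harness: golden examples plus a pass/fail check.
// The "system under test" is whatever function wraps your model call.
type Example = { input: string; check: (output: string) => boolean };

function runEvals(systemUnderTest: (input: string) => string, examples: Example[]) {
  const failures = examples.filter(ex => !ex.check(systemUnderTest(ex.input)));
  return {
    passed: examples.length - failures.length,
    failed: failures.length,
    passRate: (examples.length - failures.length) / examples.length,
  };
}
```

In CI you’d fail the build when `passRate` drops below a threshold, exactly like a normal test suite.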

For agents it gets harder because the outputs aren’t just text, they’re actions. Did the agent call the right tool? Did it call it with the right arguments? Did it try something dumb on step 3 that happened to work out by step 7? You need traces for this, not just final outputs.
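Asserting on a trace might look something like this. The trace shape is invented for illustration; real tracing tools each have their own format.

```typescript
// Eval helpers that check what an agent *did*, not just what it said.
type TraceStep = { tool: string; args: Record<string, unknown> };

// Did the agent ever call this tool with (at least) these arguments?
function calledToolWith(trace: TraceStep[], tool: string, args: Record<string, unknown>): boolean {
  return trace.some(
    step => step.tool === tool &&
      Object.entries(args).every(([k, v]) => step.args[k] === v)
  );
}

// Did one tool call happen before another?
function calledInOrder(trace: TraceStep[], first: string, second: string): boolean {
  const i = trace.findIndex(s => s.tool === first);
  const j = trace.findIndex(s => s.tool === second);
  return i !== -1 && j !== -1 && i < j;
}
```

Checks like “ran the tests before opening the PR” catch exactly the step-3-was-dumb failures that final-output evals miss.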

Which brings up observability. If you’re running agents in production and you don’t have tracing set up, you’re flying blind. Tools like LangSmith, Langfuse, Braintrust, and Arize give you a full replay of what the agent did: which tools it called, what came back, how long each step took, how many tokens it burned. When something breaks in production (and it will), you want to be able to pull up that trace and see exactly which step went off the rails. This isn’t optional anymore. The LangChain survey from late 2025 showed almost 90% of teams with agents in production have observability set up. It’s the same lesson we learned with microservices ten years ago: distributed systems need distributed tracing.

I know evals and observability aren’t the exciting part. Nobody writes blog posts about their eval suite. But every team I’ve seen that’s actually shipping reliable AI stuff? They invested in evals and tracing way earlier than felt necessary. That’s usually what separates the thing that works in a demo from the thing that works on a Tuesday afternoon when a user sends it something weird.

Everything is multimodal now

This one is simple. Every serious model in 2026 handles text, images, audio, video, and code natively. Text-only AI already feels like a limitation.

Your users expect this now. If your AI product can’t look at a screenshot or process a PDF, it feels incomplete. The good news is the APIs have gotten clean enough that adding multimodal support is genuinely just a few lines:

const response = await model.analyze({
  image: screenshot,
  prompt: "Extract form field labels and values as JSON"
});

What I’m more excited about is generative UI. Instead of the model always returning text, it can return actual rendered components. You ask about your data and get back a chart, not a paragraph describing the numbers. The model produces the interface. It’s early but it’s one of the more interesting design spaces opening up.

Memory

Two years ago, AI didn’t remember you. Every conversation started from zero. That’s changed.

Models now maintain memory across sessions. Your preferences, your project context, your team’s conventions. Combine that with context windows that can hold hundreds of thousands of tokens and the experience becomes something different. You stop re-explaining things. The model just… knows your setup.

For agents, this is a big deal. There’s a real difference between a tool you have to re-explain everything to every time and a collaborator that already knows your codebase and conventions. Three types of memory are emerging: episodic (what happened in past conversations), semantic (general knowledge about your project), and procedural (learned habits, like “always run the test suite first”).
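One way to sketch that three-way split, with all the storage details waved away (real systems back these with a database or files; the names and shapes here are just illustrative):

```typescript
// The three memory types as separate stores. The split is the
// interesting part, not the storage.
type Memory = {
  episodic: string[];               // what happened in past sessions
  semantic: Record<string, string>; // facts about the project
  procedural: string[];             // learned habits and rules
};

// Flatten memory into context for the next session's system prompt,
// keeping only the most recent episodes to stay within budget
function buildSystemContext(memory: Memory, recentLimit = 3): string {
  return [
    "Project facts:",
    ...Object.entries(memory.semantic).map(([k, v]) => `- ${k}: ${v}`),
    "Habits:",
    ...memory.procedural.map(r => `- ${r}`),
    "Recent sessions:",
    ...memory.episodic.slice(-recentLimit).map(e => `- ${e}`),
  ].join("\n");
}
```

Which is really just the context point from earlier again: memory is another feed into the prompt.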

If you’re building an AI product and it forgets the user between sessions, it’s going to feel bad compared to everything else on the market. Memory went from a nice feature to something people notice when it’s missing.

Picking models

Quick orientation if you haven’t been tracking every release.

GPT-5, Claude 4.5, Gemini 3, Llama 4. A year ago I would have just picked one and stuck with it. Now I switch between them all the time. Claude writes really well. Codex is good for long autonomous tasks. Gemini can hold a crazy amount of context. Haiku is fast and cheap when I just need something simple done.

Open source is legitimately competitive now. DeepSeek R1 does solid reasoning at a fraction of the cost of the closed models. Qwen, Mistral, and the Llama family are real options if you need to self-host for privacy or cost reasons.

Mixture-of-Experts (MoE) is the architecture trend to understand. Instead of the entire model activating for every request, it routes to specialized sub-networks. You get the capability of a much larger model without paying for all of it on every call. This matters a lot for keeping inference costs sane.
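The routing idea fits in a few lines. This toy version takes gate scores as an input rather than learning them, and the experts are plain functions instead of neural sub-networks, but it shows why you only pay for the experts that run:

```typescript
// Toy top-k routing, the core idea behind MoE: a gate scores each
// expert, and only the top k actually execute.
type Expert = (x: number) => number;

function moeForward(
  x: number,
  experts: Expert[],
  gateScores: number[], // in a real model these come from a learned gating network
  k = 2
): number {
  // Pick the k highest-scoring experts
  const top = gateScores
    .map((score, i) => ({ score, i }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  // Mix their outputs, weighted by normalized gate scores;
  // the remaining experts never run at all
  const total = top.reduce((sum, t) => sum + t.score, 0);
  return top.reduce((sum, t) => sum + (t.score / total) * experts[t.i](x), 0);
}
```

That skipped work is the whole economics of MoE: total parameters can be huge while per-request compute stays small.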

Small models are worth knowing about too. A 7B parameter model running locally on a laptop is surprisingly capable for things like classification, autocomplete, and simple extraction. And it’s fast, basically free, and you don’t send any data to an API. Not everything needs the biggest model.

The real shift is that picking a model is now an engineering decision with actual trade-offs. What are you optimizing for? Speed? Cost? Quality? Privacy? The answer changes depending on what you’re building.

What’s coming

Three things I keep thinking about.

Physical AI is starting to get real investment. Making language models bigger is running into diminishing returns, and the frontier is shifting to models that can perceive and act in the physical world. Not humanoid robots (that’s further out). More like warehouse automation, inspection drones, surgical assistance. Narrow applications where the payoff is obvious.

Edge AI is getting practical. Running models locally on phones and laptops instead of calling an API every time. Lower latency, better privacy, no usage costs. Apple, Google, and Qualcomm are shipping dedicated hardware for this. When every phone can run a decent model locally, there will be entire categories of apps we haven’t thought of yet.

And then there’s governance. Not the fun topic, but if your AI is sending emails or booking flights or making purchases on behalf of users, you need to care about this. The EU AI Act is live. You’re going to need audit trails, some form of explainability, fairness testing. It’s the kind of thing that’s way easier to build in from the start than to bolt on after something breaks and someone asks why.

People worth following

People I learn the most from:

  • Simon Willison. Easily the most useful daily writing about AI tools and what’s actually happening. I read everything he posts.
  • Andrej Karpathy. If you want to really understand what’s going on under the hood, his YouTube lectures are still the best place to start.
  • Latent Space (swyx and Alessio). Best AI engineering podcast. Real technical depth with interesting guests.
  • Sebastian Raschka. Research-level ML writing. His book Build a Large Language Model (From Scratch) is great if you want to go deeper.
  • Ethan Mollick. Focused on how AI is changing how people actually work. Practical and clear.
  • Addy Osmani. Comprehensive guide to AI-assisted development.
  • MCP docs. Honestly just go read the spec and build something with it.
  • Research blogs from OpenAI, Anthropic, and Google DeepMind.

The thing that strikes me about where we are is that the fundamentals didn’t really change. It’s still transformers. Still predicting the next token. But the layer where we spend our time as engineers moved way up. We’re not crafting prompts anymore. We’re building systems where AI components have goals and memory and tools and can take action on their own.

I think the people who are going to do well with this are not necessarily the ones who understand the model internals the deepest. It’s the people who can figure out the right context to give a model, pick the right model for the job, stay close to what the agents are doing, and actually ship. That’s always been the job, honestly. The tools are different now but the work isn’t.