
Code is Not Text

April 2026 · 4 min read

Git is one of the greatest pieces of software ever written. The content-addressable object model, the DAG of commits, the branching system that makes parallel work feel natural. It solved distributed version control so thoroughly that nobody seriously tries to replace it anymore, and for good reason. When people complain about git, they're usually complaining about its CLI, not its architecture. The architecture is brilliant.

But git was designed to track text files, and source code, while stored as text, has structure that plain text doesn't. A Python file isn't just a sequence of lines. It's a collection of functions and classes, each with defined inputs and outputs, connected to other functions and classes in other files through import statements and function calls. Git doesn't know any of this, and it was never supposed to. That wasn't the problem git set out to solve. But it means there's a layer of understanding that's missing from the tools we use every day, and once you start thinking about what that layer could do, especially for AI agents, the implications are surprisingly deep.

Consider what happens when a diff is expressed in lines versus entities. A typical file has L lines but only E entities, where E is much smaller than L, often by an order of magnitude. A file with 300 lines might contain 15 functions. When someone changes a few of those functions, git reports the diff in terms of L: you see some number of lines added, removed, or modified, grouped by file. But the actual semantic content of the change, the thing you need to understand in order to review it, is proportional to E, because what changed is some number of functions. Everything else in the diff is context, noise, or cosmetic reformatting. The ratio of signal to noise in a line-level diff is roughly E/L, and for most files that's a small number. You're asking your reviewer, whether human or agent, to wade through a lot of lines to find the few entities that actually matter.

This is especially important for AI agents, and understanding why requires thinking about how agents process code. An agent reviewing a pull request pays a cost proportional to the number of tokens it has to read. Line-level diffs are expensive: every changed line, every context line, every reformatted line costs tokens. But the number of decisions the agent actually needs to make is proportional to E, the number of changed entities. If you feed the agent a line diff, it spends most of its token budget on noise. If you feed it an entity diff, it spends almost all of its budget on signal. The difference isn't marginal. In a codebase where L/E is 20, you're asking the agent to do 20x the work for the same amount of understanding. That's 20x the cost, 20x the latency, and 20x the opportunity for the model to get confused by irrelevant context.
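The arithmetic above is easy to make concrete. This is a back-of-the-envelope sketch with illustrative numbers (300 lines, 15 entities), not measurements from any real codebase:

```python
# Illustrative cost model for line diffs vs. entity diffs.
# The numbers are assumptions for the sake of the example.

lines_in_file = 300      # L: lines in the file
entities_in_file = 15    # E: functions/classes in the file

# Signal-to-noise ratio of a line-level diff: roughly E/L.
signal_ratio = entities_in_file / lines_in_file

# If reading cost is proportional to tokens, the overhead of a
# line diff relative to an entity diff is roughly L/E.
overhead = lines_in_file / entities_in_file

print(f"signal ratio: {signal_ratio:.2f}, line-diff overhead: {overhead:.0f}x")
```

With these numbers the agent reads roughly 20 lines for every entity it actually needs to judge, which is where the 20x figure comes from.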

That's the gap we set out to fill with sem. It sits on top of git and adds a layer that understands the structure of code. Instead of seeing a file as a sequence of lines, sem sees it as a collection of entities: functions, classes, methods. Each entity gets a structural hash computed from its AST rather than its text, so two versions of a function that look different but do the same thing produce the same hash. Reformatting doesn't change the hash. Renaming a local variable doesn't change it. Adding a comment doesn't change it. Only changes to the actual logic register as changes. This means you can finally answer the question that matters: did behavior change, or just appearance?

But the most important thing sem adds, and the thing that matters most for agents, is the dependency graph. Because sem understands entities, it can build a cross-file graph of which functions call which other functions, across the entire codebase. And once you have that graph, you can answer a question that no amount of LLM reasoning can reliably answer: if I change this function, what else might break?

Think about why this is hard for an LLM. If you change a function f, and some other function g in a different file calls f, the agent needs to know about g. An LLM can try to figure this out by reading the codebase, but it can only search what it can see in its context window, it might miss indirect callers, and it might hallucinate dependencies that don't exist. The graph doesn't have any of these problems. If f has D transitive dependents, the graph gives you all D of them deterministically, in milliseconds, with zero hallucination. The agent doesn't need to reason about dependencies at all. It just asks the graph. This means the agent can focus its reasoning capacity on the hard part, understanding whether the change is correct, instead of wasting it on the mechanical part of figuring out what the change affects.
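The graph query itself is mechanical, which is the whole point. Here is a toy version, again using only the stdlib `ast` module: it matches calls by top-level function name, which is far cruder than what a real tool would do (sem presumably resolves imports, methods, and aliases), but it shows the shape of the "who depends on f?" query:

```python
# Toy reverse call graph: for each function name, record which
# functions call it, then walk that graph to find transitive dependents.
import ast
from collections import defaultdict, deque

def build_reverse_call_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each called name to the set of functions that call it."""
    callers: dict[str, set[str]] = defaultdict(set)
    for _filename, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                for call in ast.walk(node):
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                        callers[call.func.id].add(node.name)
    return callers

def transitive_dependents(graph: dict[str, set[str]], changed: str) -> set[str]:
    """Every function that directly or indirectly calls `changed`."""
    seen: set[str] = set()
    queue = deque(graph.get(changed, ()))
    while queue:
        fn = queue.popleft()
        if fn not in seen:
            seen.add(fn)
            queue.extend(graph.get(fn, ()))
    return seen

# Hypothetical three-file codebase: h calls g, g calls f.
sources = {
    "util.py": "def f(x):\n    return x + 1\n",
    "api.py":  "def g(x):\n    return f(x) * 2\n",
    "app.py":  "def h(x):\n    return g(x) - 1\n",
}
graph = build_reverse_call_graph(sources)
print(sorted(transitive_dependents(graph, "f")))  # ['g', 'h']
```

The answer is deterministic and comes from a traversal, not from reasoning, which is exactly why an agent can trust it in a way it can't trust its own recall of the codebase.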

There's a more subtle point here about how entity-level thinking changes the economics of agent-driven code review. Say a pull request modifies E entities, and the total number of entities transitively affected is E + D, where D is the number of distinct transitive dependents. An agent doing line-level review has to read the entire diff and then try to reason about impact from the raw text. An agent with access to the entity graph can immediately see which E entities changed, look up their D dependents, and focus its attention on the E + D entities that actually matter, ignoring everything else in the codebase. The reduction in scope is usually dramatic, because most changes touch a small number of entities that affect a small fraction of the total codebase. The agent ends up reviewing less code, more precisely, at lower cost, with better results.
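To put hypothetical numbers on that scope reduction (these are illustrative assumptions, not data from a real repository):

```python
# Illustrative review-scope calculation: a PR touching E entities with
# D transitive dependents, in a codebase of N entities total.
changed = 6        # E: entities modified by the PR
dependents = 24    # D: distinct transitive dependents of those entities
total = 4_000      # N: entities in the whole codebase

review_scope = changed + dependents   # entities worth reading at all
fraction = review_scope / total       # share of the codebase in scope

print(f"review {review_scope} of {total} entities")
```

Under these assumptions the agent reads 30 entities instead of reasoning over a 4,000-entity codebase, and everything outside that set is provably out of scope rather than merely assumed to be.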

We originally built sem because we needed exactly this for our own agents. But it turns out that what agents need and what humans need are the same thing, and probably always have been. Both agents and humans have limited attention. Both want to know what changed, whether it's a real change or cosmetic, and what it affects. The only difference is that agents hit the limitation harder because their attention is literally metered in tokens. Code has always had structure. Our version control tools have always tracked text. There's room for a layer in between that bridges that gap, and once you add it, both humans and agents wonder how they ever worked without it.