The last five years have shown us that large language models like ChatGPT, Claude, and DeepSeek can effectively write programs in many domains. This is an impressive capability, given that writing programs requires a formal understanding of program semantics. But though we know that these large models understand programs to an extent, there is still much we don't know about that understanding. We don't know where they have deep understanding and where they rely on heuristic reasoning, how they represent program knowledge, or what kinds of situations will challenge their capabilities.
Fortunately, recent work in model interpretability and representation engineering has produced promising results that bring us closer to understanding the internal thought processes of LLMs. Here at d_{model}, we can think of no better place to apply these new techniques than formal methods, where many abstract properties can be extracted with static analysis. The vast body of work in programming language theory over the past hundred years provides many tools for scaling an understanding of the internal thought processes of language models as they write code.
In that spirit, we wanted to start with a simple property that comes up in every programming language: nullability. Nullable values are represented differently across languages: as null pointers in C or C++, as explicit Option types in Rust, and as special nil or None values in dynamic languages like JavaScript, Lisp, or Python. In every case, understanding which values can be nullable is necessary for writing even basic code, and misunderstanding where they are nullable is a common source of bugs.
Do our models understand when a value is nullable? They must, to some degree, to write code that deals with nullable values, but we haven't known what form this knowledge takes or what situations are likely to confuse the model. Until now.
In this work:
We introduce a microbenchmark of 15 programs that test basic model understanding of the flow of nullability through a program. (Sec. 2.1)
We find that models begin to understand nullability in a local scope, satisfying many requirements of the python typechecker, before they start to understand how nullability flows across the program. (Sec. 2.5)
We find that models develop an internal concept of nullability as they scale up and are trained for longer. (Sec. 3)
We end with a demo: after we train a probe that determines whether the model thinks a variable read corresponds to a nullable variable, we can visualize that internal knowledge in a reading diagram:
We begin with a “skyline” estimate of model understanding of nullability (a la Feng and Steinhardt (2024)). That is, we evaluate model nullability understanding externally, at the token level (as opposed to internally, at the activation level, using interpretability techniques). In this section, we first explain the task of nullability understanding (Sec. 2.1). Then we formally decompose the reasoning steps required to reason about nullability both inside (Sec. 2.2) and across (Sec. 2.3) functions. Finally, we present the results of our “skyline” analysis (Sec. 2.5).
NullabilityEval
To measure nullability understanding externally, we ask the model to complete simple partial programs that each require an understanding of nullability. We refer to this suite of programs as NullabilityEval. All of the tests in this benchmark suite are composed of three functions or fewer, and the largest function is seven lines long.
In our experiments we focus on the Pythia model suite (Biderman et al. 2023), as they have checkpoints available across training runs and various scales. For measuring performance at larger sizes, we also include Qwen2.5-Coder-32b, Llama 3.1 405b Instruct, and DeepSeek-V3 (671b).
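The Pythia checkpoints publish their intermediate training states as revisions on Hugging Face, so evaluating a model partway through training is just a matter of picking a revision. A minimal sketch of loading one (here the final step-143000 revision of Pythia 2.8b, as an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Each Pythia size exposes intermediate checkpoints as git revisions,
# e.g. "step1000", ..., "step143000" (the final training step).
MODEL = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, revision="step143000")
model.eval()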
Each partial program is constructed such that there are a very limited number of valid next lines in the program, and all of them demonstrate some knowledge of the concept of nullability.
For example, our first test is:
Test 1

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    result: list[int] = []
    for num in some_numbers:
We would like the model to generate code that performs some operation on the elements of some_numbers. The simplest way to do that is to introduce a branch with if num is None, but several variants are also valid: if num is not None, if num == None, if isinstance(num, int). That is, this example is constructed such that there is a small space of valid next lines, and all of them require some understanding that num may not be an int.2
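To make the pass criterion concrete, a minimal checking harness for this test might look like the following sketch. The regular expression is the one given in the footnote; generate_completion is a hypothetical stand-in for sampling next lines from whichever model is under test.

import re

# Any completion that checks num against None (or its type) counts as
# demonstrating nullability understanding (regex from the footnote).
NONE_CHECK = re.compile(r"num\s*(is\s*(not)?|==)\s*None|isinstance\(num")

PARTIAL_PROGRAM = '''\
def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    result: list[int] = []
    for num in some_numbers:
'''

def passes_test_1(generate_completion) -> bool:
    # generate_completion: prompt string -> generated continuation string.
    completion = generate_completion(PARTIAL_PROGRAM)
    return bool(NONE_CHECK.search(completion))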
We find that Pythia models as small as 2.8b can successfully complete Test 1, and that they learn to complete the test in the first third of training. Consistent with observations that larger models are more sample-efficient, larger Pythia models learn to complete this test earlier, with Pythia 12b able to complete the test 20% of the way into training and Pythia 2.8b able to complete it 28% of the way into training.3
These results indicate that these models understand nullability to some extent, but how deep is this understanding? To quantify this, we give a syntax and semantics of a minimalist subset of python that captures nullability in Appendix B. We can then classify each partial program by which program constructs and rules determine the nullability of the target variable. For instance, Test 1 uses the (List), (Var), and (For) rules.
So, do Pythia models 2.8b and up understand the semantics of these three rules? As it turns out, not exactly. LLMs pick up on many relationships in their training data that are statistically correlated with a concept without being causally related to it. In practice, this means the models use features like variable names, whitespace, statement orderings, and constant values to guess at programming concepts.
For example, while many of the Pythia models can complete:
some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
result: list[int] = []
for num in some_numbers:
with a proper None check, changing the variable names and the constants into:
foo = [60, None, -33]
bar: list[int] = []
for corejoice in foo:
causes all the Pythia models to fail the test.
To test the reliance on these features more robustly, we generated 100 random variable namings and constant values for the above program, as well as for small programs that test other typing rules. We also balanced these tests with programs that made use of the same language constructs, but where the loop variable was not optional. We found that when lists and for loops were involved, Pythia 2.8b was only able to correctly label 57% of the programs, only slightly better than chance. Even Pythia 12b could only correctly label 68% of the tests. These results show that the Pythia models rely heavily on these non-semantic features when reasoning about for loops.
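The generation itself can be quite simple. Below is a sketch of the kind of renaming-and-constant randomization described above; the helper names are ours, and we avoid the constant 0 for the truthiness reasons discussed in Appendix B.

import random
import string

def random_name(rng: random.Random) -> str:
    # Short lowercase identifiers that carry no semantic hint.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))

def make_for_loop_test(rng: random.Random, nullable: bool) -> str:
    """Generate one (List)/(Var)/(For) test, with or without None elements."""
    list_name, result_name, loop_var = (random_name(rng) for _ in range(3))
    # Nonzero constants only, so a bare `if x:` is never a valid way to dodge a None check.
    constants = [str(rng.choice([n for n in range(-99, 100) if n != 0]))
                 for _ in range(rng.randint(2, 6))]
    if nullable:
        # Replace some elements with None so the loop variable is Optional[int].
        for i in rng.sample(range(len(constants)), k=max(1, len(constants) // 3)):
            constants[i] = "None"
    return (
        f"def main() -> None:\n"
        f"    {list_name} = [{', '.join(constants)}]\n"
        f"    {result_name}: list[int] = []\n"
        f"    for {loop_var} in {list_name}:\n"
    )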
Fortunately, many other typing rules do not exhibit such a strong reliance on variable naming and constants. In the same setting of generated programs, we found that Pythia 2.8b was able to correctly label programs using the Abs, Var, App, and IfOut rules a much greater proportion of the time (99%, 93%, 86%, and 98%, respectively). Stay tuned in the future (Sec. 4) for a more in-depth exploration of how the models behave on individual typing rules when we increase the variability of our program generation.
We can further challenge the model by adding layers of procedural indirection between the source and sink of nullable values, thereby testing the model's interprocedural understanding. First, we demonstrate how to write such tests, and the difficulties of writing tests that may be too easy (Sec. 2.3.1). Then, we present a harder problem (Sec. 2.3.2) and introduce a stronger type system, mypy++, to formalize the needed reasoning (Sec. 2.3.3).
Here’s a simple test for interprocedural analyses:
Test 2

from typing import Optional

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    print(positive_numbers(some_numbers))

def positive_numbers(numbers: list[Optional[int]]) -> list[int]:
    result: list[int] = []
    for num in numbers:
In this test, we've taken the same logic and spread it across main and another function, positive_numbers. Ideally, the model would have to think a little harder to understand that some_numbers is flowing from main through the function call into the body of positive_numbers, causing the for loop body to need a None check. In practice, however, we find this test is actually easier for the model to pass than Test 1, with models as small as Pythia 1b passing it, and Pythia 12b learning to pass it 13% of the way into training.

Because of the type annotations on the positive_numbers function, the model doesn't need to pay attention to main at all. It can just look at positive_numbers, and use the type annotation to know that numbers contains Optional[int]s. Then, using the For rule, it can reason that num is nullable, so it must be checked for None before proceeding. Looking at the type annotation turns out to be easier for the model than scanning through a list to determine whether it holds None and non-None values, resulting in an easier test overall.
So how would we actually test for interprocedural nullability understanding in the model? Well, type annotations on Python functions aren’t required4, so we can instead provide the model with an unannotated function, and see if it still understands the flow of nullability. Here’s a test that does that:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
Our base set of typing rules (listed as “Common Rules” in Appendix B.1) doesn't handle unannotated functions, though, so we're going to have to add some more, and here we're faced with a choice. The typing rules for normal Python say that functions without return type annotations return the Any type, and arguments without a type annotation have the type Any. In fact, normal mypy will not check unannotated functions at all, even for internal consistency; the --check-untyped-defs option adds some checking back, but the argument and return types will still be Any. In Python, a value of any type can be converted to an Any, and an Any can be converted to any value type.
This means that it would be technically type safe to do anything in the body of process_value, including just returning the argument, without a static type error. All Pythia models with at least 410 million parameters are able to make use of this extra flexibility to write code for Test 3 that typechecks under mypy. But at runtime, code that exploits it would still fail.
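For instance, the identity completion below is essentially what Pythia 12b produces (see Appendix C). Under the Any rules, nothing at the call site is flagged statically, yet the call still fails at runtime:

def process_value(value):
    # Unannotated, so mypy types this as Any -> Any and accepts the body.
    return value

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    # mypy sees process_value(value) as Any, so Any + 1 raises no static error,
    # but at runtime value is a str or None, and adding 1 to either is a TypeError.
    x = process_value(value) + 1
    print(x)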
If we want code that does not raise a TypeError at runtime, we can strengthen our typechecker by requiring that there be some valid, non-Any type for the function that typechecks both at the call site and in the function body. This new typechecker still checks unannotated functions, but passes fewer of them.5 We'll call this augmented type system mypy++.
In Appendix B.2, we formalize the unannotated function rules for mypy vs mypy++.
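By contrast, a completion in the spirit of Pythia 6.9b's (Appendix C) does satisfy mypy++, because there is a concrete, non-Any type for the function, Optional[str] -> int, under which both the body and the call site in main typecheck:

def process_value(value):
    # mypy++ can assign this function the type Optional[str] -> int:
    # the body handles both None and str, and the call site
    # process_value(value) + 1 then typechecks and runs without error.
    if value is None:
        return 0
    return len(value)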
There’s no consistent threshold of size at which Pythia models can pass Test 3. Pythia 1b, 2.8b, and 6.9b pass the test in their final revisions, but Pythia 410m, 1.4b, and 12b don’t. The models with at least 1 billion parameters all have points in training where they can pass the test, but only intermittently. Even 6.9b, the best performing size on this test, fails the test in its second-to-last available revision6. You can see how this evolves over scale in fig. 2 and time in fig. 3. See Sec. 3.5 for further discussion of performance over time.
What the models can do well, however, is learn to pass these tests in the mypy type system (as opposed to mypy++). In that system, where they don’t need to reason globally about the functions but only locally, this test is one of the easiest for the models to complete.7
Since this test suite is meant to be informative beyond the sizes of the Pythia models, we add another layer of indirection for more difficulty in this test:
Test 4

def handle_value(value, guard):
    if guard:
        return process_value("Foobar") + 1
    else:
        return process_value(value) + 1

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = handle_value(value, x < 10)
    print(x)

def process_value(value):
In Test 4, instead of main calling process_value directly, it calls handle_value, which itself calls process_value in a branch. With two layers of indirection, we start to hit the limits of the capabilities of even frontier models. Llama 405b is unable to pass this test, as are smaller models like Qwen Coder 32b, while DeepSeek V3 (671b parameters) is able to pass it. However, Pythia 6.9b is still able to pass this test with some consistency.
Finally, we can test how well the models write type annotations for functions. Here, the trailing colon makes a type expression the only valid completion.
Test 5

def program_48() -> None:
    number: Optional[int] = None
    square = get_square(number)
    if square is not None:
        print(f"Square of the number is {square}")
    else:
        print("No number provided to square")

def get_square(number:
None of the Pythia models pass this test. Qwen Coder 32b is also incapable of passing this test, but both Llama 405b and DeepSeek V3 pass it.
We would indeed expect writing type annotations to be more difficult than merely implicitly reasoning about the types of Python programs, as only a small fraction of Python programs in the wild are annotated this way.
We wrote three variations of each of these tests, resulting in 15 tests total. In Fig. 2 below, we can see the number of passing tests for each model under mypy and mypy++.
We can see the number of fully passing tests for each model in blue. Generally speaking, models get better with scale: Pythia-2.8b can pass about half the tests, but we need the much larger and more parameter-efficient Llama-405b to pass all of them. This matches our expectation that eval scores should scale logarithmically, indicating that these tests are well distributed.
We can also see the test results for the Pythia models under the weaker mypy success criterion. As we expected, the mypy results (red bar) are (almost8) always above the mypy++ results (blue bar), as mypy++ is a stricter type system. There are six tests in the dataset involving non-annotated functions, and using the weaker mypy type system causes up to five more tests to pass than using mypy++.9
Next, we want to understand the training dynamics at play here. Below, we can see how Pythia 6.9b performs on the tests during training, from step 2 to step 143000:10
We see that each individual model learns to write code which typechecks under mypy before it learns to write code which typechecks under mypy++ and throws no type errors at runtime.
At this point, we've figured out how to roughly measure nullability understanding in the output of various language models, but we still don't know what their internal representations look like or when they emerge. Next, we detail how we train reading vectors (Sec. 3.3) using prompts designed to make the model think about the phenomenon of interest (Sec. 3.2). Finally, in Sec. 3.5, we validate that these probes track an understanding of nullability that improves over the course of pretraining to the level we expect from the external, token-level evals described in the previous section.
In this section, we review representation engineering (Zou et al. 2025) techniques that we will use to look for linear representations of nullability inside the model.
Zou et al. (2025) shows how representations can be extracted for concepts like “happiness”, “honesty”, and “fairness”. First, they construct many prompts which cause the model to act in a way aligned with the concept, and many which cause the model to act in a way aligned against the concept. For instance, they might prompt the model with “Pretend you’re a dishonest person. The Eiffel Tower is” and “Pretend you’re an honest person. The Eiffel Tower is”. Then, they take the internal activations which correspond to each, and try to extract a high-dimensional vector which points towards the states which are aligned, and away from the states that are not aligned. This vector can then be used as a linear model to measure how much of the concept is activated in the model during any given forward pass (e.g. for honesty, this gives us a lie-detector).
We avoid dealing with the ambiguities of natural language by working in a setting where the model needs only to analyze the nullability of individual variable occurrences. Specifically, we probe for “the variable I just generated refers to a nullable quantity”, so our prompts look like:
def program_1() -> None:
    def find_value(data: List[int], target: int) -> Optional[int]:
        for value in data:
            if value == target:
                return value
        return None

    data = [1, 2, 3, 4, 5]
    target = 3
    result = find_value(data, target)
    if result
We queried o1, o1-mini, deepseek-coder, and claude-3.5-sonnet with the following prompt:
Generate me 100 typed python programs that use the Optional type, along with type annotations for any undefined functions that they use. Make sure that the programs are unique, and each involves at least eight lines of logic. Number each program from 1 to 100. Please put all the programs in a single file, with a main function that tests each. Don’t include any text before or after the code, just the code. I will be using mypy with the --strict option to check the code.
We label each variable read occurrence with nullability information derived from mypy, and prompt the “model under test” with a prompt consisting of the tokens up to and including the variable read occurrence.
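The post doesn't spell out its extraction code, but with the Pythia checkpoints on Hugging Face, collecting the per-layer activations at the end of each such prompt only takes a few lines. A minimal sketch (we keep the hidden state of every layer at the final token of each prompt):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def activations_at_last_token(model_name: str, prompts: list[str]) -> torch.Tensor:
    """Returns a (n_prompts, n_layers, d_model) tensor of hidden states
    at the final token of each prompt (embedding layer included)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    rows = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            # hidden_states is a tuple of (1, seq_len, d_model) tensors,
            # one per layer; keep the last position of each.
            rows.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(rows)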
Prior work focused its probing on a single layer, often handpicked based on prior papers. We probe all layers instead. We use mass-mean shift probing11 for each layer, because it has been shown empirically (Li et al. 2024) to generalize better in high-dimensional spaces than logistic regression.
We then tested two methods for determining the relative importance of the different layers — either allowing the magnitude of the difference of means vector to determine the importance of the layer in the final probe (MM), or to learn coefficients for each layer using linear regression (MM-LR). We found that which method is more accurate on test data varies over both model size and number of training steps.
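Concretely, given per-example, per-layer activations, the two schemes might be implemented as below. This is a sketch under one plausible reading of MM (leave the per-layer projections unnormalized and sum them, so layers with larger difference-of-means vectors contribute more); the helper names are ours, and the mass-mean direction is the difference of class means described in the footnote.

import numpy as np
from sklearn.linear_model import LinearRegression

def mass_mean_directions(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """acts: (n_examples, n_layers, d_model); labels: (n_examples,) of 0/1.
    Returns one difference-of-means reading vector per layer."""
    pos = acts[labels == 1].mean(axis=0)   # (n_layers, d_model)
    neg = acts[labels == 0].mean(axis=0)
    return pos - neg                       # mass-mean shift, per layer

def layer_projections(acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project each example's per-layer activations onto that layer's direction."""
    return np.einsum("nld,ld->nl", acts, directions)   # (n_examples, n_layers)

def mm_score(acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    # MM: raw projections summed, so each layer's weight is set by the
    # magnitude of its difference-of-means vector.
    return layer_projections(acts, directions).sum(axis=1)

def fit_mm_lr(train_acts: np.ndarray, train_labels: np.ndarray, directions: np.ndarray):
    # MM-LR: learn per-layer weights with a linear regression over the projections.
    return LinearRegression().fit(layer_projections(train_acts, directions), train_labels)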
In Fig. 4, we can see that pure mass means probing gives lower loss for smaller models (those with less than 410 million parameters), but that for larger models weighting layers using linear regression gives lower loss consistently.
Let us return to the reading diagram from the introduction, reproduced below.
This diagram is adapted from the style of reading diagram in Zou et al. (2025), but only shows the activations on tokens that represent variable loads12. For each position of interest, we prompt the model with the partial program consisting of all tokens up to and including that position. We then probe the model at the following token. We color the box above that position based on the output of the probe and a scoring threshold inferred at train-time13.
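Rendering the diagram is then just a matter of mapping each probe score to a color relative to that threshold. A small sketch (the scale parameter controlling how quickly the colors saturate is our own knob):

import matplotlib.colors as mcolors

def score_to_color(score: float, threshold: float, scale: float = 1.0) -> str:
    """Red for scores well below the train-time threshold, green for scores
    well above it, fading through white near the threshold."""
    # Squash the signed distance from the threshold into [-1, 1].
    t = max(-1.0, min(1.0, (score - threshold) / scale))
    if t >= 0:
        rgb = (1.0 - t, 1.0, 1.0 - t)   # white -> green
    else:
        rgb = (1.0, 1.0 + t, 1.0 + t)   # white -> red
    return mcolors.to_hex(rgb)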
In this program, there are sixteen tokens that correspond to variable loads, and (correctly) all but one are marked as non-optional. The only nullable variable in this program is result, since it comes from find_value, which returns Optional[int].14
In this section, we study the performance of our nullability probes across time and scale (Tigges et al. 2024). We use mass-means shift probing (Li et al. 2024) on all layers, and a linear regression to determine the weights of each layer.15
In fig. 6, we plot loss against scale and time. While we measured accuracy for every available Pythia model size, we exclude the smallest (14m) from this plot since it would exist entirely above the top of the plot.
One interesting thing to note is that models up to 1b reach a minimum loss before loss on this task begins climbing again. Charitably, this may be because the features become more complex beyond this point: less linear, or representing more subtle concepts. Cynically, it reflects the fact that models, small models in particular, do not uniformly improve at this task over training, as we observed in Sec. 2.3.3.
Our suspicion is that this pattern would continue even for the larger models if we continued to overtrain them for longer.
We hope to get a better understanding of this phenomenon by studying the decomposed nullability reasoning as described in Sec. 2.2 and Appendix B.1.
Our decision to use Pythia to study feature evolution across time and scale is inspired by Tigges et al. (2024). They focus on classic circuits-centered interpretability tasks such as IOI (Wang et al. 2022), Gendered-Pronoun (Mathwin et al., n.d.), Greater-Than (Hanna, Liu, and Variengien 2023), and SVA (Linzen, Dupoux, and Goldberg 2016).
In our setting, we are more interested in how activations vary across inputs, to extract representations of nullability. Zou et al. (2025) surveys techniques for representation engineering with linear probes. We apply similar techniques, but to program semantics and dataflow instead of natural language.
Feng, Russell, and Steinhardt (2024) also study LLMs' ability to reason about propositions, but in a natural language setting, rather than a formal one.
Several techniques exist for constructing linear probes, but after experimental measurement we followed the mass means shift from Li et al. (2024). Li et al. (2024) and Zhong et al. (2023) discuss why mass mean probing might outperform linear regression.
We thank Leo Gao, Chelsea Voss, and Zhanna Kaufman for their comments and suggestions during the drafting process.
@article{dmodel2025null,
title = {How Language Models Understand Nullability},
author = {Sanchez-Stern, Alex and Tondwalkar, Anish},
journal = {dmodel},
year = {2025},
month = mar,
url = {https://dmodel.ai/nullability/},
}
To see how nullability appears in Python, let's look at an example program that uses it:
class User:
    def __init__(self, name: str, email: Optional[str]) -> None:
        self.name = name
        self.email = email

def get_user_email(user: Optional[User]) -> Optional[str]:
    if user is not None and user.email is not None:
        return user.email
    else:
        return None
In the get_user_email function above, we can see from the type signature that it takes a nullable User value and returns a nullable string. The first thing the function does is check whether the input user is None. This program was actually generated by o1-mini, so we can already see that the model understands that the user object is nullable, and that it needs to be checked for None-ness before anything else can be done.
We can say that there are five variable “occurrences” in the program, each of which can be either nullable or not. There's the first user in the if statement, the second user inside the user.email to the right of the and, the user.email expression itself, the user in the return statement, and finally the user.email in the return statement. If we use Python's typechecker, mypy, to check the types of each of these occurrences, we find that they have types Optional[User], User, Optional[str], User, and str, respectively. That is, the first and third are nullable, and the rest are not.
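You can reproduce these judgments directly with mypy's reveal_type; splitting the and into nested ifs gives us a place to ask about each occurrence. (reveal_type is interpreted by mypy and isn't meant to be executed; the exact wording of the revealed types depends on your mypy version.)

from typing import Optional

class User:
    def __init__(self, name: str, email: Optional[str]) -> None:
        self.name = name
        self.email = email

def get_user_email(user: Optional[User]) -> Optional[str]:
    reveal_type(user)                # Revealed type is "Union[User, None]"
    if user is not None:
        reveal_type(user)            # narrowed: Revealed type is "User"
        reveal_type(user.email)      # Revealed type is "Union[str, None]"
        if user.email is not None:
            reveal_type(user.email)  # narrowed: Revealed type is "str"
            return user.email
    return None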
Here we define the syntax and semantics of the Python subset we work with:
But before we try to measure nullability understanding, we'll want to be more precise about exactly what we're measuring. To that end, we'll take the notion of different “rules” for nullability that we discussed informally in the overview and bring it into a formal type system. We're not going to try to describe all the typing rules of Python, so we'll restrict ourselves in a couple of ways.
First, we’ll reduce the number of python features that we handle by actually working in a subset of python. This means we can skip worrying about the semantics of complicated features, and just focus on the features necessary for understanding optionality in a semi-realistic setting.
Second, we’ll define all non-None values as converting to the boolean value True, instead of numbers converting to False when they are zero and strings converting to False when they are empty. This is a necessary practicality, because otherwise, the model could circumvent our type tests by doing bare ifs which work on both optional and non-optional types. But to prevent bare ifs from ever being the correct completion for a non-optional type, we’ll design our tests so that there are never any values that would convert to False, namely the number zero and the empty string.
\begin{aligned} \left[Program\right] =& [\epsilon] {\huge|} \left[\begin{array}{l}Stmt\\Program\end{array}\right]\\ [Stmt] =& [Import] | [FnDef] | [Assn] | [ForLoop] | [If] | [Expr] \\ &| [\texttt{return }Expr]\\ [Import] =& [\texttt{from } Ident \texttt{ import } Ident]\\ [FnDef] =& \left[\begin{array}{l} \texttt{def $Ident$($DeclArgs$):}\\ \quad Program \end{array}\right]\\ &{\huge|} \left[\begin{array}{l} \texttt{def $Ident$($DeclArgs$) -> $Type$}:\\ \quad Program \end{array}\right]\\ [DeclArgs] =& [\epsilon] | [Ident] | [Ident : Type]\\ &| [Ident \texttt{,} DeclArgs] | [Ident : Type\texttt{,} DeclArgs]\\ [Type] =& [\texttt{int}] | [\texttt{str}] | [\texttt{bool}] | [\texttt{None}] \\ &| [\texttt{Optional[}Type\texttt{]}] | [\texttt{List[}Type\texttt{]}]\\ [Assn] =& [Ident \texttt{ = } Expr]\\ [ForLoop] =& \left[\begin{array}{l} \texttt{for $ident$ in $ident$:}\\ \quad Program \end{array}\right]\\ [If] =& \left[\begin{array}{l} \texttt{if $expr$:}\\ \quad Program \end{array}\right]\\ [Expr] =& [Ident] | [Constant] | [Expr\texttt{ }Op\texttt{ }Expr]\\ &| [Ident \texttt{(} ParamsList \texttt{)}]| [\texttt{$Expr$ if $Expr$ else $Expr$}]\\ &| [\texttt{[$ListConts$]}] | [\texttt{[]}]\\ [ParamsList] =& [\epsilon] | [Expr] | [Expr\texttt{, }ParamsList]\\ [Op] =& [\texttt{+}] | [\texttt{-}] | [\texttt{*}] | [\texttt{/}] | [\texttt{<}] | [\texttt{>}] | [\texttt{<=}] | [\texttt{>=}] | [\texttt{is}] | [\texttt{is not}] | [\texttt{==}] | [\texttt{!=}]\\ [ListConts] =& [Expr] | [Expr,ListConts]\\ \end{aligned}
We won’t worry too much about the semantics on a formal level, since it’s the same as Python’s semantics on this subset, and we’ll mostly be using it informally to justify typing rules here. But we do want to formally define our typing rules, since we’ll be measuring the models’ understanding of particular rules. We’ll be using two propositions for typing in this language. There’s \Gamma \vdash \left[p\right] \vartriangleright \texttt{ok}, which says that the program p is well typed in environment \Gamma. And there’s \Gamma \vdash e : t, which says that e is well typed with type t in environment \Gamma. When we exclude the \Gamma, we mean that the judgement is true under any typing environment, including the empty one.
\tag{Const} \frac{Constant\neq\texttt{None}}{\vdash Constant: \text{Atom}} \hspace{1.0cm} \frac{}{\forall t, \vdash \texttt{None}: \texttt{Optional[$t$]}}
\tag{List} \frac{\Gamma \vdash x_1 : t, x_2 : t, ...} {\Gamma \vdash \texttt{[$x_1$, $x_2$, ...]} : \texttt{List[$t$]}} \hspace{1.0cm} \frac{}{\forall t, \vdash \texttt{[]}: \texttt{List[$t$]}}
\tag{Weaken} \frac{\Gamma \vdash x: t}{\Gamma \vdash x: \texttt{Optional[$t$]}}
\tag{Var} \Gamma \vdash e: t \hspace{1cm} \Gamma, x: t \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l}\texttt{$x$ = $e$}\\ p\end{array}\right] \vartriangleright \text{ok}
\begin{gather} \Gamma, x_1: t_1, x_2: t_2, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{Def} \Gamma, f : t_1 \rightarrow t_2 ... \rightarrow t_r \vdash [p] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \text{Returns($b$, $t_r$)} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$: $t_1$, $x_2$: $t_2$, ...) -> $t_r$:}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{App} \frac{\Gamma \vdash f : t_1 \rightarrow t_2 ... \rightarrow t_r \hspace{1cm} \Gamma \vdash x_1 : t_1, x_2 : t_2, ...} {\Gamma \vdash f(x_1, x_2, ...) : t_r}
\begin{gather} \Gamma \vdash x : \texttt{Optional[$t$]} \hspace{1cm} \Gamma, x : t \vdash [p] \vartriangleright \text{ok} \nonumber\\ \tag{IfIn} \bigotimes \in \{\texttt{is not, !=}\} \over \Gamma \vdash \left[\begin{array}{l} \texttt{if $x$:}\\ \quad p \end{array}\right] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \left[\begin{array}{l} \texttt{if $x$ $\bigotimes$ None:}\\ \quad p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{IfOut} \Gamma \vdash \left[\begin{array}{l} p_1\\ p_3 \end{array}\right] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \left[\begin{array}{l} p_2\\ p_3 \end{array}\right] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{if $e$:}\\ \quad p_1\\ \texttt{else:}\\ \quad p_2\\ p_3 \end{array}\right] \vartriangleright \text{ok}
\tag{IfExpr} \frac{\Gamma \vdash (e_b : \text{Optional[$t_0$]} \lor e_b : \texttt{bool}), e_1 : t, e_2 : t} {\Gamma \vdash \texttt{$e_1$ if $e_b$ else $e_2$} : t}
\tag{For} \Gamma \vdash y\texttt{ : List } t \qquad \Gamma, x : t \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{for $x$ in $y$:}\\ \quad p \end{array}\right] \vartriangleright \text{ok}
\tag{OpInt} \Gamma \vdash x_1 : \texttt{int}, x_2 : \texttt{int}\qquad \bigotimes \in \{\texttt{+, -, *, /}\} \over \Gamma \vdash x_1 \bigotimes x_2 : \texttt{int}
\hspace{-1.5cm} \tag{OpString} \frac{\Gamma \vdash s_1 : \texttt{str}, s_2 : \texttt{str}} {\Gamma \vdash s_1 + s_2 : \texttt{str}} \hspace{1cm} \frac{\Gamma \vdash s : \texttt{str}, x: \texttt{int}} {\Gamma \vdash s * x : \texttt{str}}
\hspace{-1.5cm} \tag{OpEquality} \frac{\Gamma \vdash x_1 : t, x_2 : t\qquad \bigotimes \in \{\texttt{==, !=, is, is not}\}} {\Gamma \vdash x_1 \bigotimes x_2 : \texttt{bool}}
\hspace{-2cm} \tag{OpComparison} \frac{\Gamma \vdash x_1 : \texttt{int}, x_2 : \texttt{int}\qquad \bigotimes \in \{\texttt{<, >, <=, >=}\}} {\Gamma \vdash x_1 \bigotimes x_2 : \texttt{bool}}
With:
\tag{ReturnReturn} \frac{\Gamma \vdash e : t}{\Gamma \vdash \text{Returns(\texttt{return $e$}, $t$)}} \tag{NoReturnExpression} \frac{\Gamma \vdash e : t}{\vdash \text{NoReturn([$e$])}} \tag{NoReturnAssign} \frac{}{\vdash \text{NoReturn([\texttt{$i$ = $e$}])}} \tag{ReturnIf} \frac{\Gamma \vdash \text{Returns($p$, $t$)}} {\Gamma \vdash \text{Returns(}\left[\begin{array}{l}\texttt{if $e$:}\\ \quad p\end{array}\right],t\text{)}} \hspace{1cm}
\begin{gather} \Gamma \vdash \left(\text{Returns($p_1$, $t$)} \land \text{Returns($p_2$, $t$)}\right)\lor \nonumber\\ \Gamma \vdash \left(\text{Returns($p_1$, $t$)} \land \text{NoReturns($p_2$)}\right) \lor\nonumber\\ \tag{ReturnIfElse} \Gamma \vdash \left(\text{NoReturns($p_1$)} \land \text{Returns($p_2$, $t$)}\right) \over \Gamma \vdash \text{Returns(}\left[\begin{array}{l} \texttt{if $e$:}\\ \quad p_1\\ \texttt{else:}\\ \quad p_2 \end{array}\right]\text{, $t$)} \end{gather}
\tag{ReturnFor} \frac{\Gamma \vdash \text{Returns($p$, $t$)}} {\Gamma \vdash \text{Returns(}\left[\begin{array}{l} \texttt{for $x$ in $y$:}\\ \quad p \end{array}\right]\text{, $t$)}}
The rules above are sufficient for our purposes for dealing with fully type annotated programs. But what about programs with functions that aren’t type annotated, or are only partially annotated? The mypy type system that Python uses treats unannotated functions as dynamically typed, allowing them to typecheck as any type. Similarly, function parameters without a type annotation are implicitly convertible to any type. So, for normal mypy typechecking, we add the following typing rules:
\begin{gather} \Gamma, x_1 : Any, x_2 : Any, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{DynDef} \Gamma, f : Any \rightarrow ... \rightarrow Any \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$, $x_2$, ...):}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{DynConvert} \frac{\Gamma \vdash x : t} {\Gamma \vdash x : Any} \hspace{0.5cm} \frac{\Gamma \vdash x : Any} {\Gamma \vdash x : t}
These rules allow more python programs to check statically, but they mean that some python programs will pass the static checks but still throw type errors at runtime. Importantly for us, they can throw runtime type errors about optionality. A well-written piece of code would not just satisfy this type system then, but actually a stronger one that would prevent runtime type errors. We’ll call this stronger type system mypy++. Instead of the Dyn- rules, it has this one:
\begin{gather} \Gamma, x_1 : t_1, x_2 : t_2, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{InferDef} \Gamma, f : t_1 \rightarrow t_2 \rightarrow ... \rightarrow t_r \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$, $x_2$, ...):}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
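As a small illustration of the difference (the function names here are ours): the program below is accepted by the Dyn rules, because f is typed Any -> Any and Any converts freely, but it is rejected by InferDef, since no single non-Any type for f makes both the body and the call site check. And indeed, it raises a TypeError at runtime.

def f(x):
    return x + 1          # forces x to be an int under any concrete typing of f

def main() -> None:
    s: str = f("hello")   # mypy (Dyn rules): f returns Any, so this checks;
                          # mypy++ (InferDef): f would need type str -> str here,
                          # which contradicts the body, so the program is rejected.
    print(s)

main()                    # at runtime: TypeError: can only concatenate str (not "int") to str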
You can see in fig. 2 that the smallest three models don't pass any tests. After that, the trend line for tests passed goes upwards, but it's not monotonic: Pythia 410m passes only one test while Pythia 160m passes three, and Pythia 6.9b passes the most tests, with 9, while Pythia 12b passes only 7. For instance, given the partial code:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
Pythia 6.9b completes process_value as:
def process_value(value):
    if value is None:
        return 0
    elif isinstance(value, str):
        return int(value)
    ...
While Pythia 12b completes it as:
def process_value(value):
    return value
The code generated by Pythia 6.9b handles values of both NoneType and str and always returns an int, whereas the code generated by Pythia 12b is the identity function, returning None when the input is None and returning a string when the input is a string. Neither of those return values is correct, as the main function expects the result to be a number it can add one to.
We see in fig. 7 that while the model generally gets better at passing these tests during training, its performance is not always increasing. Zooming in on a particular test, we can see what this looks like. When asked to complete this code:
Test 12

def handle_value(value, guard):
    if guard:
        return count_os("Foobar") * 2
    else:
        return count_os(value) // 2

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    print(handle_value(value, x < 10))

def count_os(value):
Pythia 6.9b at training step 104000 produces the following definition of count_os:
def count_os(value):
    if value is None:
        return 1
    elif isinstance(value, str):
        return len(value)
    ...
While 1000 steps later, it produces a very different definition:
def count_os(value):
    return len([s for s in os.listdir(os.path.join(os.getcwd(), value)) if s.start ...
While the first definition handles value being None or a string, the second definition not only assumes it is a string, but that it is the name of a folder in the current directory. In some sense the model “knows” that the value parameter is nullable at step 104000, but “forgets” it during further training. It regains the ability to pass the test at several points during training, but by the end of training, it's back to treating value as if it were always a string.
As expected, type annotations are particularly difficult for these models. When asked to complete the type of get_square in the following code, no model in the Pythia series can successfully output the Optional argument type:
Test 5

def program_48() -> None:
    number: Optional[int] = None
    square = get_square(number)
    if square is not None:
        print(f"Square of the number is {square}")
    else:
        print("No number provided to square")

def get_square(number:
These models also find it much easier to deal with a list that contains some None elements than with an atomic value that might be None. Pythia 6.9b consistently passes this test in the second half of training:
Test 2

from typing import Optional

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    print(positive_numbers(some_numbers))

def positive_numbers(numbers: list[Optional[int]]) -> list[int]:
    result: list[int] = []
    for num in numbers:
But is much more challenged by this code:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
, intermittently being unable to complete it in a type-safe way, including in the second-to-last training step. When another level of function indirection is added, the model becomes much better at completing it correctly; Pythia 6.9b consistently completes the following after step 93000:
Test 4

def handle_value(value, guard):
    if guard:
        return process_value("Foobar") + 1
    else:
        return process_value(value) + 1

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = handle_value(value, x < 10)
    print(x)

def process_value(value):
The names of functions are highly influential in the models' understanding of their type semantics. When Test 3 is changed so that process_value is instead called count_chars, Pythia 6.9b becomes unable to correctly complete it on any revision; it likely has a bias from its training data that functions called count_chars always take non-nullable strings. Similarly, when Test 4 is changed so that process_value becomes string_complexity, it goes from consistently passing to almost always failing.
We were initially very surprised to find that mass means probing would perform better than linear regression. After all, linear regression is a much more powerful technique for fitting data. And mass means probing can be seen as giving the direction of best fit in each dimension independently, without considering other dimensions. The more dimensions you consider at one time, the better your model fit can be. But repeatedly in our data, we found that mass means probing outperformed linear regression on the test data.
The loop indicates that the next few lines need to process num in some way, and the fact that it comes from some_numbers means it has the type Optional[int]. num can't be directly appended to result, because result is declared to only contain ints, not Optional[int]s. None of the normal operations on ints will work on Nones, so before num can be processed, something has to be done to separate Nones from normal ints.↩︎
In particular, we can use this program to test models for nullability understanding by asking them to complete the program's next lines, then checking whether those lines match the regular expression num\s*(is\s*(not)?|==)\s*None|isinstance\(num.↩︎
Note that these results differ substantially from those of Tigges et al. (2024), who find that for circuit analyses (rather than representational analyses like ours), circuit parts are learned at roughly the same point during training across scale.↩︎
Technically this is known as “Optional typing”, but that’s confusing in the context of this post. Not to be confused with Gradual Typing, as introduced by Siek et al.↩︎
In particular, it won't pass a process_value body that just returns the value argument, since the call site will fail at runtime, trying to add one to a nullable string, while vanilla mypy will pass such a body.↩︎
Despite this, it passes the test in 40% of the available revisions, about triple what the closest other sizes accomplish.↩︎
This indicates that if you were to train a model using typechecking as reinforcement feedback, you would want to use mypy++ and not mypy.↩︎
DeepSeek has one model output that demonstrates complete understanding of nullability and runs fine at runtime, but fails the typechecker. This is because the code it generates catches the TypeError and changes control flow instead of checking for None values up front.↩︎
We don’t see all six non-annotated function tests passing under mypy, because models can still fail these tests by producing invalid syntax.↩︎
smoothed with a rolling average with a window size of 5↩︎
That is, we set the reading vector to the difference between the mean activation for positive class and the mean activation for the negative class.↩︎
This is, of course, where we trained our probes, but there is also a practical reason: right after the model has generated a variable that will be written to, it often does not have access to the assigning expression or type annotation, giving it no way to determine whether the value will be optional or not.↩︎
Red tokens are significantly below the threshold, and green tokens are significantly above it; tokens that scored near the threshold would have a near-white color, but no such tokens appear in this example.↩︎
When this variable appears for the first time, it is in the if statement that checks whether it's None. At that point, the model knows it is nullable, and the results of the probe reflect that understanding. But when it appears a second time on the next line, in the format string of print, the body of this if statement only runs if it is not None. The model understands this as well, and the probe accurately reflects it.↩︎
Li et al. (2024) and Zou et al. (2025) suggest that mass means probes are best for reading, while the direction perpendicular to the separating hyperplane is best for intervention. However, previous work leaves open the question of cross-layer weights. We use LR on the cross-layer weights, thanks to our investigations above.↩︎