The last five years have shown us that large language models like ChatGPT, Claude, and DeepSeek can effectively write programs in many domains. This is an impressive capability, given that writing programs requires a formal understanding of program semantics. But though we know that these large models understand programs to an extent, there is still much we don't know about that understanding. We don't know where they have deep understanding and where they rely on heuristic reasoning, how they represent program knowledge, or what kinds of situations will challenge their capabilities.
Fortunately, recent work in model interpretability and representation engineering has produced promising results that bring us closer to understanding the internal thought processes of LLMs. Here at d_{model}, we can think of no better place to apply these new techniques than formal methods, where many abstract properties can be extracted with static analysis. The vast body of work in programming language theory over the past hundred years provides many tools for scaling an understanding of the internal thought processes of language models as they write code.
In that spirit, we wanted to start with a simple property that comes up in every programming language: nullability. Nullable values are represented differently across languages: as null pointers in C or C++, as explicit Option types in Rust, and as special nil or None values in dynamic languages like JavaScript, Lisp, or Python. In every case, understanding which values can be nullable is necessary for writing even basic code, and misunderstanding where they are nullable is a common source of bugs.
Do our models understand when a value is nullable? They must, to some degree, to write code that deals with nullable values, but we haven't known what form this knowledge takes or what situations are likely to confuse the model. Until now.
In this work:
We introduce a microbenchmark of 15 programs that test basic model understanding of the flow of nullability through a program. (Sec. 2.1)
We find that models begin to understand nullability in a local scope, satisfying many requirements of the python typechecker, before they start to understand how nullability flows across the program. (Sec. 2.5)
We find that models develop an internal concept of nullability as they scale up and are trained for longer. (Sec. 3)
We end with a demo: after we train a probe that determines whether the model thinks a variable read corresponds to a nullable variable, we can visualize that internal knowledge in a reading diagram:
We begin with a “skyline” estimate of model understanding of nullability (a la Feng and Steinhardt (2024)). That is, we evaluate model nullability understanding externally, at the token level (as opposed to internally, at the activation level, using interpretability techniques). In this section, we first explain the task of nullability understanding (Sec. 2.1). Then we formally decompose the reasoning steps required to reason about nullability both inside (Sec. 2.2) and across (Sec. 2.3) functions. Finally, we present the results of our “skyline” analysis (Sec. 2.5).
NullabilityEval
To measure nullability understanding externally, we ask the model to complete simple partial programs that each require an understanding of nullability. We refer to this suite of programs as NullabilityEval. All of the tests in this benchmark suite are composed of three functions or fewer, and the largest function is seven lines long.
In our experiments we focus on the Pythia model suite (Biderman et al. 2023), as they have checkpoints available across training runs and various scales. For measuring performance at larger sizes, we also include Qwen2.5-Coder-32b, Llama 3.1 405b Instruct, and DeepSeek-V3 (671b).
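The Pythia checkpoints publish their intermediate training states as revisions on Hugging Face, so evaluating a model partway through training is just a matter of picking a revision. A minimal sketch of loading one (here the final step-143000 revision of Pythia 2.8b, as an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Each Pythia size exposes intermediate checkpoints as git revisions,
# e.g. "step1000", ..., "step143000" (the final training step).
MODEL = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, revision="step143000")
model.eval()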
Each partial program is constructed such that there are a very limited number of valid next lines in the program, and all of them demonstrate some knowledge of the concept of nullability.
For example, our first test is:
Test 1

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    result: list[int] = []
    for num in some_numbers:
We would like the model to generate code that performs some operation on the elements of some_numbers. The simplest way to do that is to introduce a branch with if num is None, but several variants are also valid: if num is not None, if num == None, if isinstance(num, int). That is, this example is constructed such that there is a small space of valid next lines, and all of them require some understanding that num may not be an int.2
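To make the pass criterion concrete, a minimal checking harness for this test might look like the following sketch. The regular expression is the one given in the footnote; generate_completion is a hypothetical stand-in for sampling next lines from whichever model is under test.

import re

# Any completion that checks num against None (or its type) counts as
# demonstrating nullability understanding (regex from the footnote).
NONE_CHECK = re.compile(r"num\s*(is\s*(not)?|==)\s*None|isinstance\(num")

PARTIAL_PROGRAM = '''\
def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    result: list[int] = []
    for num in some_numbers:
'''

def passes_test_1(generate_completion) -> bool:
    # generate_completion: prompt string -> generated continuation string.
    completion = generate_completion(PARTIAL_PROGRAM)
    return bool(NONE_CHECK.search(completion))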
We find that Pythia models as small as 2.8b can successfully complete Test 1, and that they learn to complete the test in the first third of training. Consistent with observations that larger models are more sample-efficient, larger Pythia models learn to complete this test earlier, with Pythia 12b able to complete the test 20% of the way into training and Pythia 2.8b able to complete it 28% of the way into training.3
These results indicate that these models understand nullability to some extent, but how deep is this understanding? To quantify this, we give a syntax and semantics of a minimalist subset of python that captures nullability in Appendix B. We can then classify each partial program by which program constructs and rules determine the nullability of the target variable. For instance, Test 1 uses the (List), (Var), and (For) rules.
So, do Pythia models 2.8b and up understand the semantics of these three rules? As it turns out, not exactly. LLMs pick up on many relationships in their training data that are statistically correlated with a concept without being causally related to it. In practice, this means the models use features like variable names, whitespace, statement orderings, and constant values to guess at programming concepts.
For example, while many of the Pythia models can complete:
some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
result: list[int] = []
for num in some_numbers:
with a proper None check, changing the variable names and the constants into:
foo = [60, None, -33]
bar: list[int] = []
for corejoice in foo:
causes all the Pythia models to fail the test.
To test the reliance on these features more robustly, we generated 100 random variable namings and constant values for the above program, as well as for small programs that test other typing rules. We also balanced these tests with programs that made use of the same language constructs, but where the loop variable was not optional. We found that when lists and for loops were involved, Pythia 2.8b was only able to correctly label 57% of the programs, only slightly better than chance. Even Pythia 12b could only correctly label 68% of the tests. These results show that the Pythia models rely heavily on these non-semantic features when reasoning about for loops.
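The generation itself can be quite simple. Below is a sketch of the kind of renaming-and-constant randomization described above; the helper names are ours, and we avoid the constant 0 for the truthiness reasons discussed in Appendix B.

import random
import string

def random_name(rng: random.Random) -> str:
    # Short lowercase identifiers that carry no semantic hint.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))

def make_for_loop_test(rng: random.Random, nullable: bool) -> str:
    """Generate one (List)/(Var)/(For) test, with or without None elements."""
    list_name, result_name, loop_var = (random_name(rng) for _ in range(3))
    # Nonzero constants only, so a bare `if x:` is never a valid way to dodge a None check.
    constants = [str(rng.choice([n for n in range(-99, 100) if n != 0]))
                 for _ in range(rng.randint(2, 6))]
    if nullable:
        # Replace some elements with None so the loop variable is Optional[int].
        for i in rng.sample(range(len(constants)), k=max(1, len(constants) // 3)):
            constants[i] = "None"
    return (
        f"def main() -> None:\n"
        f"    {list_name} = [{', '.join(constants)}]\n"
        f"    {result_name}: list[int] = []\n"
        f"    for {loop_var} in {list_name}:\n"
    )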
Fortunately, many other typing rules do not exhibit such a strong reliance on variable naming and constants. In the same setting of generated programs, we found that Pythia 2.8b was able to correctly label programs using the Abs, Var, App, and IfOut rules a much greater proportion of the time (99%, 93%, 86%, and 98%, respectively). Stay tuned in the future (Sec. 4) for a more in-depth exploration of how the models behave on individual typing rules when we increase the variability of our program generation.
We can further challenge the model by adding layers of procedural indirection between the source and sink of nullable values, thereby testing the model's interprocedural understanding. First, we demonstrate how to write such tests, and the difficulties of writing tests that may be too easy (Sec. 2.3.1). Then, we present a harder problem (Sec. 2.3.2) and introduce a stronger type system, mypy++, to formalize the needed reasoning (Sec. 2.3.3).
Here’s a simple test for interprocedural analyses:
Test 2

from typing import Optional

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    print(positive_numbers(some_numbers))

def positive_numbers(numbers: list[Optional[int]]) -> list[int]:
    result: list[int] = []
    for num in numbers:
In this test, we've taken the same logic and spread it across main and another function, positive_numbers. Ideally, the model would have to think a little harder to understand that some_numbers is flowing from main through the function call into the body of positive_numbers, causing the for loop body to need a None check. In practice, however, we find this test is actually easier for the model to pass than Test 1, with models as small as Pythia 1b passing it, and Pythia 12b learning to pass it 13% of the way into training.

Because of the type annotations on the positive_numbers function, the model doesn't need to pay attention to main at all. It can just look at positive_numbers, and use the type annotation to know that numbers contains Optional[int]s. Then, using the For rule, it can reason that num is nullable, so it must be checked for None before proceeding. Looking at the type annotation turns out to be easier for the model than scanning through a list to determine whether it holds None and non-None values, resulting in an easier test overall.
So how would we actually test for interprocedural nullability understanding in the model? Well, type annotations on Python functions aren’t required4, so we can instead provide the model with an unannotated function, and see if it still understands the flow of nullability. Here’s a test that does that:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
Our base set of typing rules (listed as “Common Rules” in Appendix B.1) doesn't handle unannotated functions, though, so we're going to have to add some more, and here we're faced with a choice. The typing rules for normal Python say that functions without return type annotations return the Any type, and arguments without a type annotation have the type Any. In fact, normal mypy will not check unannotated functions at all, even for internal consistency; the --check-untyped-defs option adds some checking back, but the argument and return types will still be Any. In Python, a value of any type can be converted to an Any, and an Any can be converted to any value type.
This means that it would be technically type safe to do anything in the body of process_value, including just returning the argument, without a static type error. All Pythia models with at least 410 million parameters are able to make use of this extra flexibility to write code for Test 3 that typechecks under mypy. But at runtime, code that exploits it would still fail.
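For instance, the identity completion below is essentially what Pythia 12b produces (see Appendix C). Under the Any rules, nothing at the call site is flagged statically, yet the call still fails at runtime:

def process_value(value):
    # Unannotated, so mypy types this as Any -> Any and accepts the body.
    return value

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    # mypy sees process_value(value) as Any, so Any + 1 raises no static error,
    # but at runtime value is a str or None, and adding 1 to either is a TypeError.
    x = process_value(value) + 1
    print(x)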
If we want code that does not raise a TypeError at runtime, we can strengthen our typechecker by requiring that there be some valid, non-Any type for the function that typechecks both at the call site and in the function body. This new typechecker still checks unannotated functions, but passes fewer of them.5 We'll call this augmented type system mypy++.
In Appendix B.2, we formalize the unannotated function rules for mypy vs mypy++.
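By contrast, a completion in the spirit of Pythia 6.9b's (Appendix C) does satisfy mypy++, because there is a concrete, non-Any type for the function, Optional[str] -> int, under which both the body and the call site in main typecheck:

def process_value(value):
    # mypy++ can assign this function the type Optional[str] -> int:
    # the body handles both None and str, and the call site
    # process_value(value) + 1 then typechecks and runs without error.
    if value is None:
        return 0
    return len(value)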
There’s no consistent threshold of size at which Pythia models can pass Test 3. Pythia 1b, 2.8b, and 6.9b pass the test in their final revisions, but Pythia 410m, 1.4b, and 12b don’t. The models with at least 1 billion parameters all have points in training where they can pass the test, but only intermittently. Even 6.9b, the best performing size on this test, fails the test in its second-to-last available revision6. You can see how this evolves over scale in fig. 2 and time in fig. 3. See Sec. 3.5 for further discussion of performance over time.
What the models can do well, however, is learn to pass these tests in the mypy type system (as opposed to mypy++). In that system, where they don’t need to reason globally about the functions but only locally, this test is one of the easiest for the models to complete.7
Since this test suite is meant to be informative beyond the sizes of the Pythia models, we add another layer of indirection for more difficulty in this test:
Test 4

def handle_value(value, guard):
    if guard:
        return process_value("Foobar") + 1
    else:
        return process_value(value) + 1

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = handle_value(value, x < 10)
    print(x)

def process_value(value):
In Test 4, instead of main calling process_value directly, it calls handle_value, which itself calls process_value in a branch. With two layers of indirection, we start to hit the limits of the capabilities of even frontier models. Llama 405b is unable to pass this test, as are smaller models like Qwen Coder 32b, while DeepSeek V3 (671b parameters) is able to pass it. However, Pythia 6.9b is still able to pass this test with some consistency.
Finally, we can test how well the models write type annotations for functions. Here, the trailing colon makes a type expression the only valid completion.
Test 5

def program_48() -> None:
    number: Optional[int] = None
    square = get_square(number)
    if square is not None:
        print(f"Square of the number is {square}")
    else:
        print("No number provided to square")

def get_square(number:
None of the Pythia models pass this test. Qwen Coder 32b is also incapable of passing this test, but both Llama 405b and DeepSeek V3 pass it.
We would indeed expect writing type annotations to be more difficult than merely implicitly reasoning about the types of Python programs, as only a small fraction of Python programs in the wild are annotated this way.
We wrote three variations of each of these tests, resulting in 15 tests total. In Fig. 2 below, we can see the number of passing tests for each model under mypy and mypy++.
We can see the number of fully passing tests for each model in blue. Generally speaking, models get better with scale: Pythia-2.8b can pass about half the tests, but we need the much larger and more parameter-efficient Llama-405b to pass all of them. This matches our expectation that eval scores should scale logarithmically, indicating that these tests are well distributed.
We can also see the test results for the Pythia models under the weaker mypy success criterion. As we expected, the mypy results (red bar) are (almost8) always above the mypy++ results (blue bar), as mypy++ is a stricter type system. There are six tests in the dataset involving non-annotated functions, and using the weaker mypy type system causes up to five more tests to pass than using mypy++.9
Next, we want to understand the training dynamics at play here. Below, we can see how Pythia 6.9b performs on the tests during training, from step 2 to step 143000:10
We see that each individual model learns to write code which typechecks under mypy before it learns to write code which typechecks under mypy++ and throws no type errors at runtime.
At this point, we've figured out how to roughly measure nullability understanding in the output of various language models, but we still don't know what their internal representations look like or when they emerge. Next, we detail how we train reading vectors (Sec. 3.3) using prompts designed to make the model think about the phenomenon of interest (Sec. 3.2). Finally, in Sec. 3.5, we validate that these probes track an understanding of nullability that improves over the course of pretraining to the level we expect from the external, token-level evals described in the previous section.
In this section, we review representation engineering (Zou et al. 2025) techniques that we will use to look for linear representations of nullability inside the model.
Zou et al. (2025) shows how representations can be extracted for concepts like “happiness”, “honesty”, and “fairness”. First, they construct many prompts which cause the model to act in a way aligned with the concept, and many which cause the model to act in a way aligned against the concept. For instance, they might prompt the model with “Pretend you’re a dishonest person. The Eiffel Tower is” and “Pretend you’re an honest person. The Eiffel Tower is”. Then, they take the internal activations which correspond to each, and try to extract a high-dimensional vector which points towards the states which are aligned, and away from the states that are not aligned. This vector can then be used as a linear model to measure how much of the concept is activated in the model during any given forward pass (e.g. for honesty, this gives us a lie-detector).
We avoid dealing with the ambiguities of natural language by working in a setting where the model needs only to analyze the nullability of individual variable occurrences. Specifically, we probe for “the variable I just generated refers to a nullable quantity”, so our prompts look like:
def program_1() -> None:
    def find_value(data: List[int], target: int) -> Optional[int]:
        for value in data:
            if value == target:
                return value
        return None

    data = [1, 2, 3, 4, 5]
    target = 3
    result = find_value(data, target)
    if result
We queried o1, o1-mini, deepseek-coder, and claude-3.5-sonnet with the following prompt:
Generate me 100 typed python programs that use the Optional type, along with type annotations for any undefined functions that they use. Make sure that the programs are unique, and each involves at least eight lines of logic. Number each program from 1 to 100. Please put all the programs in a single file, with a main function that tests each. Don’t include any text before or after the code, just the code. I will be using mypy with the --strict option to check the code.
We label each variable read occurrence with nullability information derived from mypy, and prompt the “model under test” with a prompt consisting of the tokens up to and including the variable read occurrence.
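The post doesn't spell out its extraction code, but with the Pythia checkpoints on Hugging Face, collecting the per-layer activations at the end of each such prompt only takes a few lines. A minimal sketch (we keep the hidden state of every layer at the final token of each prompt):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def activations_at_last_token(model_name: str, prompts: list[str]) -> torch.Tensor:
    """Returns a (n_prompts, n_layers, d_model) tensor of hidden states
    at the final token of each prompt (embedding layer included)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    rows = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            # hidden_states is a tuple of (1, seq_len, d_model) tensors,
            # one per layer; keep the last position of each.
            rows.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(rows)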
Prior work focused its probing on a single layer, often handpicked based on prior papers. We probe all layers instead. We use mass-mean shift probing11 for each layer, because it has been shown empirically (Li et al. 2024) to generalize better in high-dimensional spaces than logistic regression.
We then tested two methods for determining the relative importance of the different layers — either allowing the magnitude of the difference of means vector to determine the importance of the layer in the final probe (MM), or to learn coefficients for each layer using linear regression (MM-LR). We found that which method is more accurate on test data varies over both model size and number of training steps.
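Concretely, given per-example, per-layer activations, the two schemes might be implemented as below. This is a sketch under one plausible reading of MM (leave the per-layer projections unnormalized and sum them, so layers with larger difference-of-means vectors contribute more); the helper names are ours, and the mass-mean direction is the difference of class means described in the footnote.

import numpy as np
from sklearn.linear_model import LinearRegression

def mass_mean_directions(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """acts: (n_examples, n_layers, d_model); labels: (n_examples,) of 0/1.
    Returns one difference-of-means reading vector per layer."""
    pos = acts[labels == 1].mean(axis=0)   # (n_layers, d_model)
    neg = acts[labels == 0].mean(axis=0)
    return pos - neg                       # mass-mean shift, per layer

def layer_projections(acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project each example's per-layer activations onto that layer's direction."""
    return np.einsum("nld,ld->nl", acts, directions)   # (n_examples, n_layers)

def mm_score(acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    # MM: raw projections summed, so each layer's weight is set by the
    # magnitude of its difference-of-means vector.
    return layer_projections(acts, directions).sum(axis=1)

def fit_mm_lr(train_acts: np.ndarray, train_labels: np.ndarray, directions: np.ndarray):
    # MM-LR: learn per-layer weights with a linear regression over the projections.
    return LinearRegression().fit(layer_projections(train_acts, directions), train_labels)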
In Fig. 4, we can see that pure mass means probing gives lower loss for smaller models (those with less than 410 million parameters), but that for larger models weighting layers using linear regression gives lower loss consistently.
Let us return to the reading diagram from the introduction, reproduced below.
This diagram is adapted from the style of reading diagram in Zou et al. (2025), but only shows the activations on tokens that represent variable loads12. For each position of interest, we prompt the model with the partial program consisting of all tokens up to and including that position. We then probe the model at the following token. We color the box above that position based on the output of the probe and a scoring threshold inferred at train-time13.
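Rendering the diagram is then just a matter of mapping each probe score to a color relative to that threshold. A small sketch (the scale parameter controlling how quickly the colors saturate is our own knob):

import matplotlib.colors as mcolors

def score_to_color(score: float, threshold: float, scale: float = 1.0) -> str:
    """Red for scores well below the train-time threshold, green for scores
    well above it, fading through white near the threshold."""
    # Squash the signed distance from the threshold into [-1, 1].
    t = max(-1.0, min(1.0, (score - threshold) / scale))
    if t >= 0:
        rgb = (1.0 - t, 1.0, 1.0 - t)   # white -> green
    else:
        rgb = (1.0, 1.0 + t, 1.0 + t)   # white -> red
    return mcolors.to_hex(rgb)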
In this program, there are sixteen tokens that correspond to variable loads, and (correctly) all but one are marked as non-optional. The only nullable variable in this program is result, since it comes from find_value, which returns Optional[int].14
In this section, we study the performance of our nullability probes across time and scale (Tigges et al. 2024). We use mass-means shift probing (Li et al. 2024) on all layers, and a linear regression to determine the weights of each layer.15
In fig. 6, we plot loss against scale and time. While we measured accuracy for every available Pythia model size, we exclude the smallest (14m) from this plot since it would exist entirely above the top of the plot.
One interesting thing to note is that models up to 1b reach a minimum loss before loss on this task begins climbing again. Charitably, this may be because the features become more complex beyond this point: less linear, or representing more subtle concepts. Cynically, it reflects the fact that models, small models in particular, do not uniformly improve at this task over training, as we observed in Sec. 2.3.3.
Our suspicion is that this pattern would continue even for the larger models if we continued to overtrain them for longer.
We hope to get a better understanding of this phenomenon by studying the decomposed nullability reasoning as described in Sec. 2.2 and Appendix B.1.
Our decision to use Pythia to study feature evolution across time and scale is inspired by Tigges et al. (2024). They focus on classic circuits-centered interpretability tasks such as IOI (Wang et al. 2022), Gendered-Pronoun (Mathwin et al., n.d.), Greater-Than (Hanna, Liu, and Variengien 2023), and SVA (Linzen, Dupoux, and Goldberg 2016).
In our setting, we are more interested in how activations vary across inputs, to extract representations of nullability. Zou et al. (2025) surveys techniques for representation engineering with linear probes. We apply similar techniques, but to program semantics and dataflow instead of natural language.
Feng, Russell, and Steinhardt (2024) also study LLMs' ability to reason about propositions, but in a natural language setting, rather than a formal one.
Several techniques exist for constructing linear probes, but after experimental measurement we followed the mass means shift from Li et al. (2024). Li et al. (2024) and Zhong et al. (2023) discuss why mass mean probing might outperform linear regression.
We thank Leo Gao, Chelsea Voss, and Zhanna Kaufman for their comments and suggestions during the drafting process.
@article{dmodel2025null,
title = {How Language Models Understand Nullability},
author = {Sanchez-Stern, Alex and Tondwalkar, Anish},
journal = {dmodel},
year = {2025},
month = mar,
url = {https://dmodel.ai/nullability/},
}
To see how nullability appears in Python, let's look at an example program that uses it:
class User:
    def __init__(self, name: str, email: Optional[str]) -> None:
        self.name = name
        self.email = email

def get_user_email(user: Optional[User]) -> Optional[str]:
    if user is not None and user.email is not None:
        return user.email
    else:
        return None
In the get_user_email function above, we can see from the type signature that it takes a nullable User value and returns a nullable string. The first thing the function does is check whether the input user is None. This program was actually generated by o1-mini, so we can already see that the model understands that the user object is nullable, and that it needs to be checked for None-ness before anything else can be done.
We can say that there are five variable “occurrences” in the program, each of which can be either nullable or not. There's the first user in the if statement, the second user inside the user.email to the right of the and, the user.email expression itself, the user in the return statement, and finally the user.email in the return statement. If we use Python's typechecker, mypy, to check the types of each of these occurrences, we find that they have types Optional[User], User, Optional[str], User, and str, respectively. That is, the first and third are nullable, and the rest are not.
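You can reproduce these judgments directly with mypy's reveal_type; splitting the and into nested ifs gives us a place to ask about each occurrence. (reveal_type is interpreted by mypy and isn't meant to be executed; the exact wording of the revealed types depends on your mypy version.)

from typing import Optional

class User:
    def __init__(self, name: str, email: Optional[str]) -> None:
        self.name = name
        self.email = email

def get_user_email(user: Optional[User]) -> Optional[str]:
    reveal_type(user)                # Revealed type is "Union[User, None]"
    if user is not None:
        reveal_type(user)            # narrowed: Revealed type is "User"
        reveal_type(user.email)      # Revealed type is "Union[str, None]"
        if user.email is not None:
            reveal_type(user.email)  # narrowed: Revealed type is "str"
            return user.email
    return None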
Here we define the syntax and semantics of the Python subset we work with:
But before we try to measure nullability understanding, we'll want to be more precise about exactly what we're measuring. To that end, we'll take the notion of different “rules” for nullability that we discussed informally in the overview and bring it into a formal type system. We're not going to try to describe all the typing rules of Python, so we'll restrict ourselves in a couple of ways.
First, we’ll reduce the number of python features that we handle by actually working in a subset of python. This means we can skip worrying about the semantics of complicated features, and just focus on the features necessary for understanding optionality in a semi-realistic setting.
Second, we’ll define all non-None values as converting to the boolean value True, instead of numbers converting to False when they are zero and strings converting to False when they are empty. This is a necessary practicality, because otherwise, the model could circumvent our type tests by doing bare ifs which work on both optional and non-optional types. But to prevent bare ifs from ever being the correct completion for a non-optional type, we’ll design our tests so that there are never any values that would convert to False, namely the number zero and the empty string.
\begin{aligned} \left[Program\right] =& [\epsilon] {\huge|} \left[\begin{array}{l}Stmt\\Program\end{array}\right]\\ [Stmt] =& [Import] | [FnDef] | [Assn] | [ForLoop] | [If] | [Expr] \\ &| [\texttt{return }Expr]\\ [Import] =& [\texttt{from } Ident \texttt{ import } Ident]\\ [FnDef] =& \left[\begin{array}{l} \texttt{def $Ident$($DeclArgs$):}\\ \quad Program \end{array}\right]\\ &{\huge|} \left[\begin{array}{l} \texttt{def $Ident$($DeclArgs$) -> $Type$}:\\ \quad Program \end{array}\right]\\ [DeclArgs] =& [\epsilon] | [Ident] | [Ident : Type]\\ &| [Ident \texttt{,} DeclArgs] | [Ident : Type\texttt{,} DeclArgs]\\ [Type] =& [\texttt{int}] | [\texttt{str}] | [\texttt{bool}] | [\texttt{None}] \\ &| [\texttt{Optional[}Type\texttt{]}] | [\texttt{List[}Type\texttt{]}]\\ [Assn] =& [Ident \texttt{ = } Expr]\\ [ForLoop] =& \left[\begin{array}{l} \texttt{for $ident$ in $ident$:}\\ \quad Program \end{array}\right]\\ [If] =& \left[\begin{array}{l} \texttt{if $expr$:}\\ \quad Program \end{array}\right]\\ [Expr] =& [Ident] | [Constant] | [Expr\texttt{ }Op\texttt{ }Expr]\\ &| [Ident \texttt{(} ParamsList \texttt{)}]| [\texttt{$Expr$ if $Expr$ else $Expr$}]\\ &| [\texttt{[$ListConts$]}] | [\texttt{[]}]\\ [ParamsList] =& [\epsilon] | [Expr] | [Expr\texttt{, }ParamsList]\\ [Op] =& [\texttt{+}] | [\texttt{-}] | [\texttt{*}] | [\texttt{/}] | [\texttt{<}] | [\texttt{>}] | [\texttt{<=}] | [\texttt{>=}] | [\texttt{is}] | [\texttt{is not}] | [\texttt{==}] | [\texttt{!=}]\\ [ListConts] =& [Expr] | [Expr,ListConts]\\ \end{aligned}
We won’t worry too much about the semantics on a formal level, since it’s the same as Python’s semantics on this subset, and we’ll mostly be using it informally to justify typing rules here. But we do want to formally define our typing rules, since we’ll be measuring the models’ understanding of particular rules. We’ll be using two propositions for typing in this language. There’s \Gamma \vdash \left[p\right] \vartriangleright \texttt{ok}, which says that the program p is well typed in environment \Gamma. And there’s \Gamma \vdash e : t, which says that e is well typed with type t in environment \Gamma. When we exclude the \Gamma, we mean that the judgement is true under any typing environment, including the empty one.
\tag{Const} \frac{Constant\neq\texttt{None}}{\vdash Constant: \text{Atom}} \hspace{1.0cm} \frac{}{\forall t, \vdash \texttt{None}: \texttt{Optional[$t$]}}
\tag{List} \frac{\Gamma \vdash x_1 : t, x_2 : t, ...} {\Gamma \vdash \texttt{[$x_1$, $x_2$, ...]} : \texttt{List[$t$]}} \hspace{1.0cm} \frac{}{\forall t, \vdash \texttt{[]}: \texttt{List[$t$]}}
\tag{Weaken} \frac{\Gamma \vdash x: t}{\Gamma \vdash x: \texttt{Optional[$t$]}}
\tag{Var} \Gamma \vdash e: t \hspace{1cm} \Gamma, x: t \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l}\texttt{$x$ = $e$}\\ p\end{array}\right] \vartriangleright \text{ok}
\begin{gather} \Gamma, x_1: t_1, x_2: t_2, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{Def} \Gamma, f : t_1 \rightarrow t_2 ... \rightarrow t_r \vdash [p] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \text{Returns($b$, $t_r$)} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$: $t_1$, $x_2$: $t_2$, ...) -> $t_r$:}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{App} \frac{\Gamma \vdash f : t_1 \rightarrow t_2 ... \rightarrow t_r \hspace{1cm} \Gamma \vdash x_1 : t_1, x_2 : t_2, ...} {\Gamma \vdash f(x_1, x_2, ...) : t_r}
\begin{gather} \Gamma \vdash x : \texttt{Optional[$t$]} \hspace{1cm} \Gamma, x : t \vdash [p] \vartriangleright \text{ok} \nonumber\\ \tag{IfIn} \bigotimes \in \{\texttt{is not, !=}\} \over \Gamma \vdash \left[\begin{array}{l} \texttt{if $x$:}\\ \quad p \end{array}\right] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \left[\begin{array}{l} \texttt{if $x$ $\bigotimes$ None:}\\ \quad p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{IfOut} \Gamma \vdash \left[\begin{array}{l} p_1\\ p_3 \end{array}\right] \vartriangleright \text{ok} \hspace{1cm} \Gamma \vdash \left[\begin{array}{l} p_2\\ p_3 \end{array}\right] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{if $e$:}\\ \quad p_1\\ \texttt{else:}\\ \quad p_2\\ p_3 \end{array}\right] \vartriangleright \text{ok}
\tag{IfExpr} \frac{\Gamma \vdash (e_b : \text{Optional[$t_0$]} \lor e_b : \texttt{bool}), e_1 : t, e_2 : t} {\Gamma \vdash \texttt{$e_1$ if $e_b$ else $e_2$} : t}
\tag{For} \Gamma \vdash y\texttt{ : List } t \qquad \Gamma, x : t \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{for $x$ in $y$:}\\ \quad p \end{array}\right] \vartriangleright \text{ok}
\tag{OpInt} \Gamma \vdash x_1 : \texttt{int}, x_2 : \texttt{int}\qquad \bigotimes \in \{\texttt{+, -, *, /}\} \over \Gamma \vdash x_1 \bigotimes x_2 : \texttt{int}
\hspace{-1.5cm} \tag{OpString} \frac{\Gamma \vdash s_1 : \texttt{str}, s_2 : \texttt{str}} {\Gamma \vdash s_1 + s_2 : \texttt{str}} \hspace{1cm} \frac{\Gamma \vdash s : \texttt{str}, x: \texttt{int}} {\Gamma \vdash s * x : \texttt{str}}
\hspace{-1.5cm} \tag{OpEquality} \frac{\Gamma \vdash x_1 : t, x_2 : t\qquad \bigotimes \in \{\texttt{==, !=, is, is not}\}} {\Gamma \vdash x_1 \bigotimes x_2 : \texttt{bool}}
\hspace{-2cm} \tag{OpComparison} \frac{\Gamma \vdash x_1 : \texttt{int}, x_2 : \texttt{int}\qquad \bigotimes \in \{\texttt{<, >, <=, >=}\}} {\Gamma \vdash x_1 \bigotimes x_2 : \texttt{bool}}
With:
\tag{ReturnReturn} \frac{\Gamma \vdash e : t}{\Gamma \vdash \text{Returns(\texttt{return $e$}, $t$)}} \tag{NoReturnExpression} \frac{\Gamma \vdash e : t}{\vdash \text{NoReturn([$e$])}} \tag{NoReturnAssign} \frac{}{\vdash \text{NoReturn([\texttt{$i$ = $e$}])}} \tag{ReturnIf} \frac{\Gamma \vdash \text{Returns($p$, $t$)}} {\Gamma \vdash \text{Returns(}\left[\begin{array}{l}\texttt{if $e$:}\\ \quad p\end{array}\right],t\text{)}} \hspace{1cm}
\begin{gather} \Gamma \vdash \left(\text{Returns($p_1$, $t$)} \land \text{Returns($p_2$, $t$)}\right)\lor \nonumber\\ \Gamma \vdash \left(\text{Returns($p_1$, $t$)} \land \text{NoReturns($p_2$)}\right) \lor\nonumber\\ \tag{ReturnIfElse} \Gamma \vdash \left(\text{NoReturns($p_1$)} \land \text{Returns($p_2$, $t$)}\right) \over \Gamma \vdash \text{Returns(}\left[\begin{array}{l} \texttt{if $e$:}\\ \quad p_1\\ \texttt{else:}\\ \quad p_2 \end{array}\right]\text{, $t$)} \end{gather}
\tag{ReturnFor} \frac{\Gamma \vdash \text{Returns($p$, $t$)}} {\Gamma \vdash \text{Returns(}\left[\begin{array}{l} \texttt{for $x$ in $y$:}\\ \quad p \end{array}\right]\text{, $t$)}}
The rules above are sufficient for our purposes for dealing with fully type annotated programs. But what about programs with functions that aren’t type annotated, or are only partially annotated? The mypy type system that Python uses treats unannotated functions as dynamically typed, allowing them to typecheck as any type. Similarly, function parameters without a type annotation are implicitly convertible to any type. So, for normal mypy typechecking, we add the following typing rules:
\begin{gather} \Gamma, x_1 : Any, x_2 : Any, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{DynDef} \Gamma, f : Any \rightarrow ... \rightarrow Any \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$, $x_2$, ...):}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
\tag{DynConvert} \frac{\Gamma \vdash x : t} {\Gamma \vdash x : Any} \hspace{0.5cm} \frac{\Gamma \vdash x : Any} {\Gamma \vdash x : t}
These rules allow more python programs to check statically, but they mean that some python programs will pass the static checks but still throw type errors at runtime. Importantly for us, they can throw runtime type errors about optionality. A well-written piece of code would not just satisfy this type system then, but actually a stronger one that would prevent runtime type errors. We’ll call this stronger type system mypy++. Instead of the Dyn- rules, it has this one:
\begin{gather} \Gamma, x_1 : t_1, x_2 : t_2, ... \vdash [b] \vartriangleright \text{ok}\nonumber\\ \tag{InferDef} \Gamma, f : t_1 \rightarrow t_2 \rightarrow ... \rightarrow t_r \vdash [p] \vartriangleright \text{ok} \over \Gamma \vdash \left[\begin{array}{l} \texttt{def $f$($x_1$, $x_2$, ...):}\\ \quad b\\ p \end{array}\right] \vartriangleright \text{ok} \end{gather}
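As a small illustration of the difference (the function names here are ours): the program below is accepted by the Dyn rules, because f is typed Any -> Any and Any converts freely, but it is rejected by InferDef, since no single non-Any type for f makes both the body and the call site check. And indeed, it raises a TypeError at runtime.

def f(x):
    return x + 1          # forces x to be an int under any concrete typing of f

def main() -> None:
    s: str = f("hello")   # mypy (Dyn rules): f returns Any, so this checks;
                          # mypy++ (InferDef): f would need type str -> str here,
                          # which contradicts the body, so the program is rejected.
    print(s)

main()                    # at runtime: TypeError: can only concatenate str (not "int") to str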
You can see in fig. 2 that the smallest three models don't pass any tests. After that, the trend line for tests passed goes upwards, but it's not monotonic: Pythia 410m passes only one test while Pythia 160m passes three, and Pythia 6.9b passes the most tests, with 9, while Pythia 12b passes only 7. For instance, given the partial code:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
Pythia 6.9b completes process_value as:
def process_value(value):
    if value is None:
        return 0
    elif isinstance(value, str):
        return int(value)
    ...
While Pythia 12b completes it as:
def process_value(value):
    return value
The code generated by Pythia 6.9b handles values of both NoneType and str and always returns an int, whereas the code generated by Pythia 12b is the identity function, returning None when the input is None and returning a string when the input is a string. Neither of those return values is correct, as the main function expects the result to be a number it can add one to.
We see in fig. 7 that while the model generally gets better at passing these tests during training, its performance is not always increasing. Zooming in on a particular test, we can see what this looks like. When asked to complete this code:
Test 12

def handle_value(value, guard):
    if guard:
        return count_os("Foobar") * 2
    else:
        return count_os(value) // 2

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    print(handle_value(value, x < 10))

def count_os(value):
Pythia 6.9b at training step 104000 produces the following definition of count_os:
def count_os(value):
    if value is None:
        return 1
    elif isinstance(value, str):
        return len(value)
    ...
While 1000 steps later, it produces a very different definition:
def count_os(value):
    return len([s for s in os.listdir(os.path.join(os.getcwd(), value)) if s.start ...
While the first definition handles value being None or a string, the second definition not only assumes it is a string, but that it is the name of a folder in the current directory. In some sense the model “knows” that the value parameter is nullable at step 104000, but “forgets” it during further training. It regains the ability to pass the test at several points during training, but by the end of training, it's back to treating value as if it were always a string.
As expected, type annotations are particularly difficult for these models. When asked to complete the type of get_square in the following code, no model in the Pythia series can successfully output the Optional argument type:
Test 5

def program_48() -> None:
    number: Optional[int] = None
    square = get_square(number)
    if square is not None:
        print(f"Square of the number is {square}")
    else:
        print("No number provided to square")

def get_square(number:
These models also find it much easier to deal with a list that contains some None elements than with an atomic value that might be None. Pythia 6.9b consistently passes this test in the second half of training:
Test 2

from typing import Optional

def main() -> None:
    some_numbers = [1, -4, None, -3, 10, -1, None, None, 8]
    print(positive_numbers(some_numbers))

def positive_numbers(numbers: list[Optional[int]]) -> list[int]:
    result: list[int] = []
    for num in numbers:
But is much more challenged by this code:
Test 3

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = process_value(value) + 1
    print(x)

def process_value(value):
, intermittently being unable to complete it in a type-safe way, including in the second-to-last training step. When another level of function indirection is added, the model becomes much better at completing it correctly; Pythia 6.9b consistently completes the following after step 93000:
Test 4

def handle_value(value, guard):
    if guard:
        return process_value("Foobar") + 1
    else:
        return process_value(value) + 1

def main(x: int) -> None:
    if x > 0:
        value = "*" * x
    else:
        value = None

    x = handle_value(value, x < 10)
    print(x)

def process_value(value):
The names of functions are highly influential in the models' understanding of their type semantics. When Test 3 is changed so that process_value is instead called count_chars, Pythia 6.9b becomes unable to correctly complete it on any revision; it likely has a bias from its training data that functions called count_chars always take non-nullable strings. Similarly, when Test 4 is changed so that process_value becomes string_complexity, it goes from consistently passing to almost always failing.
We were initially very surprised to find that mass means probing would perform better than linear regression. After all, linear regression is a much more powerful technique for fitting data. And mass means probing can be seen as giving the direction of best fit in each dimension independently, without considering other dimensions. The more dimensions you consider at one time, the better your model fit can be. But repeatedly in our data, we found that mass means probing outperformed linear regression on the test data.
The loop indicates that the next few lines need to process num in some way, and the fact that it comes from some_numbers means it has the type Optional[int]. num can't be directly appended to result, because result is declared to only contain ints, not Optional[int]s. None of the normal operations on ints will work on Nones, so before num can be processed, something has to be done to separate Nones from normal ints.↩︎
In particular, we can use this program to test models for nullability understanding by asking them to complete the program's next lines, then checking whether those lines match the regular expression num\s*(is\s*(not)?|==)\s*None|isinstance\(num.↩︎
Note that these results differ substantially from those of Tigges et al. (2024), who find that for circuit analyses (rather than representational analyses like ours), circuit parts are learned at roughly the same point during training across scale.↩︎
Technically this is known as “Optional typing”, but that’s confusing in the context of this post. Not to be confused with Gradual Typing, as introduced by Siek et al.↩︎
In particular, it won't pass a process_value body that just returns the value argument, since the call site will fail at runtime, trying to add one to a nullable string, while vanilla mypy will pass such a body.↩︎
Despite this, it passes the test in 40% of the available revisions, about triple what the closest other sizes accomplish.↩︎
This indicates that if you were to train a model using typechecking as reinforcement feedback, you would want to use mypy++ and not mypy.↩︎
DeepSeek has one model output that demonstrates complete understanding of nullability and runs fine at runtime, but fails the typechecker. This is because the code it generates catches the TypeError and changes control flow instead of checking for None values up front.↩︎
We don’t see all six non-annotated function tests passing under mypy, because models can still fail these tests by producing invalid syntax.↩︎
smoothed with a rolling average with a window size of 5↩︎
That is, we set the reading vector to the difference between the mean activation for positive class and the mean activation for the negative class.↩︎
This is, of course, where we trained our probes, but there is also a practical reason: right after the model has generated a variable that will be written to, it often does not have access to the assigning expression or type annotation, giving it no way to determine whether the value will be optional or not.↩︎
Red tokens are significantly below the threshold, and green tokens are significantly above it; tokens that scored near the threshold would have a near-white color, but no such tokens appear in this example.↩︎
When this variable appears for the first time, it is in the if statement that checks whether it's None. At that point, the model knows it is nullable, and the results of the probe reflect that understanding. But when it appears a second time on the next line, in the format string of print, the body of this if statement only runs if it is not None. The model understands this as well, and the probe accurately reflects it.↩︎
Li et al. (2024) and Zou et al. (2025) suggest that mass means probes are best for reading, while the direction perpendicular to the separating hyperplane is best for intervention. However, previous work leaves open the question of cross-layer weights. We use LR on the cross-layer weights, thanks to our investigations above.↩︎