Why Pydantic became indispensable for LLMs.
How a validation library accidentally became a translator between code and English.
Pydantic is the most widely used data validation library for Python. It’s long been a favorite of Python pedants, who want to define, share, and enforce “contracts” between systems. With Pydantic I can define what a User, Book, or Pizza is in my code, share that definition with another system, and validate whether incoming data adheres to my definitions.
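For example, a minimal sketch using the Pydantic v2 API (the Pizza model here is illustrative):

```python
from pydantic import BaseModel, ValidationError

class Pizza(BaseModel):
    size: str
    toppings: list[str]

try:
    # Incoming data that adheres to the contract validates cleanly.
    pizza = Pizza.model_validate({"size": "large", "toppings": ["pepperoni"]})
    print(pizza)
except ValidationError as e:
    print(e)  # a missing field or a topping that isn't a string lands here
```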
So why is it emerging as critical infrastructure for working with LLMs?
Pydantic lets you share your data models by generating clear, standardized representations of their definitions a la JSON Schema. In doing so, Pydantic makes it trivial for your spaghetti code to interact with my fettuccini code, since it handles the translation to a global standard. These open standards are ubiquitous in nearly every public description of a data model or web service.
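Continuing the illustrative Pizza model, this is all it takes to emit that shared representation (output abbreviated; exact fields vary by Pydantic version):

```python
import json
from pydantic import BaseModel

class Pizza(BaseModel):
    size: str
    toppings: list[str]

# One call turns the Python contract into a standard the whole web speaks.
print(json.dumps(Pizza.model_json_schema(), indent=2))
# {
#   "properties": {
#     "size": {"title": "Size", "type": "string"},
#     "toppings": {"items": {"type": "string"}, "title": "Toppings", "type": "array"}
#   },
#   "required": ["size", "toppings"],
#   "title": "Pizza",
#   "type": "object"
# }
```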
It’s unsurprising, then, that in training LLMs like ChatGPT on the entire internet, they’ve seen millions of well-documented JSON schemas and human descriptions of what those services do. This has given LLMs the uncanny ability to reason back and forth between unstructured and structured data. After all, they’ve seen hundreds of millions of examples of “if you want to send a Hawaiian pizza to our service, you need to send {'toppings': ['Canadian bacon', 'pineapple', 'garbage']}.”
Breeding LLMs to be both creative and conforming usually means constraining their choices, most often by interleaving them with rigid classical software, i.e. code. We’ll let the LLM listen to you prattle on about what you want for dinner, and when you’re done talking we’ll make it produce a valid pizza for our order_pizza function to carry out. We’ll tell it what it means to be valid with a JSON schema it understands, and we’ll use Pydantic to validate that your pizza isn’t Hawaiian before ordering it.
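A hedged sketch of that loop, with a hard-coded string standing in for whatever JSON the LLM produced (the PizzaOrder model and its house rule are illustrative):

```python
from pydantic import BaseModel, ValidationError, field_validator

class PizzaOrder(BaseModel):
    size: str
    toppings: list[str]

    @field_validator("toppings")
    @classmethod
    def no_hawaiian(cls, toppings: list[str]) -> list[str]:
        # Enforce the house rule from the prose: no Hawaiian pizzas.
        if {"Canadian bacon", "pineapple"} <= set(toppings):
            raise ValueError("no Hawaiian pizzas")
        return toppings

# Stand-in for the LLM's reply after hearing you prattle on about dinner.
llm_output = '{"size": "large", "toppings": ["Canadian bacon", "pineapple"]}'

try:
    order = PizzaOrder.model_validate_json(llm_output)
except ValidationError as e:
    print(e)  # the order never reaches order_pizza
```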
If LLMs speak JSON schemas, it’s unsurprising that the spoils go to any language or utility that can easily broker a conversation between code and schema. Pydantic has earned its place in the Python ecosystem, but it also benefited from being one of the few sane options for brokering this conversation in Python when developers discovered how well LLMs speak JSON schemas.
If we prod deeper, this victory for Pydantic teaches us a more general lesson about the importance of quantizing your decision space. Put another way: today’s LLMs are better at paint-by-numbers than they are at making a masterpiece on a blank canvas. They’re better at speaking JSON schemas than your spaghetti code, they’re better at text-to-ORM than text-to-SQL, they’re better at text-to-component-library than text-to-HTML. The recipe behind many successful projects is finding the correct tessellation for an LLM to creatively in-paint, rather than telling it to paint your users another Picasso. For this moment in time, when LLMs are better at paint-by-numbers, AI Engineers distinguish themselves by how well they can quantize these design decisions so the final product still appears to the end user as a masterpiece.
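In Pydantic terms, one way to quantize the decision space is to enumerate the legal values in the model itself, so the generated schema hands the LLM a paint-by-numbers grid rather than a blank canvas (a sketch; the fields and values are illustrative):

```python
from typing import Literal
from pydantic import BaseModel

class PizzaOrder(BaseModel):
    # Every field spells out its legal choices, so the model fills in
    # blanks instead of free-painting.
    size: Literal["small", "medium", "large"]
    crust: Literal["thin", "deep dish"]
    toppings: list[Literal["pepperoni", "mushrooms", "onions", "olives"]]

# The emitted schema enumerates every valid value for the LLM to pick from.
print(PizzaOrder.model_json_schema())
```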
Looking forward, there are a few hiccups on the horizon for Pydantic and its cousins in other languages (e.g. Zod in TypeScript). LLMs are overfit to the hundreds of millions of examples that conform to the JSON Schema 2019-09 draft. In December 2020, JSON Schema released its 2020-12 draft, which is sufficiently different to cause headaches (e.g. in how it handles enums). Pydantic’s latest major release conforms to this new specification, and many LLM frameworks that migrated to Pydantic V2 saw regressions in ChatGPT’s performance as a result.
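You can see one such difference for yourself by inspecting what your Pydantic version emits for an enum. A rough illustration, and an assumption worth checking against your own install, since the exact output varies across Pydantic versions:

```python
from enum import Enum
from pydantic import BaseModel

class Topping(str, Enum):
    PEPPERONI = "pepperoni"
    PINEAPPLE = "pineapple"

class Pizza(BaseModel):
    topping: Topping

schema = Pizza.model_json_schema()
# Pydantic V2 nests the enum under "$defs", where V1 used "definitions";
# small draft-level shifts like this are exactly what can throw off a model
# trained mostly on older schemas.
print(schema.get("$defs"))
```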
This presents a beautiful conflict and a form of amplification bias. JSON Schema draft 2019-09 is more ubiquitous in the training data of today’s LLMs, so LLM applications will have more success using the outdated standard: the model was trained on it and is slow to change. LLMs’ sticky training data risks creating a real bias toward standards from 2017-2021, which in turn discourages adoption of new standards, products, and libraries, which is ironically exactly what we’re all trying to build.