Today OpenAI announced support for functions. When you ask it a question, you can now supply and describe a set of functions to GPT-3.5 and GPT-4. The model decides whether, and which, of those functions it would be prudent to call to confidently answer your question. If it decides it needs one of your supplied functions, it’ll even do you the favor of formatting the input to that function. OpenAI’s choice to format inputs seems, at first, like a step in the wrong direction; after all, for the last year developers have focused on getting structured outputs from LLMs. Turns out, it’s all the same.
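Concretely, you describe each function as JSON Schema, and the model may answer with a `function_call` payload instead of prose. Here’s a minimal sketch against the chat completions API as announced (the `get_weather` function is a made-up example, not part of OpenAI’s API):

```python
import json

import openai  # openai-python 0.27+, where function support landed

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Chicago?"}],
    functions=[
        {
            "name": "get_weather",  # made-up example function
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
)

message = response.choices[0].message
if message.get("function_call"):
    # The model decided the function is worth calling and formatted its input.
    args = json.loads(message["function_call"]["arguments"])  # e.g. {"city": "Chicago"}
```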
Consider the usual fare of constraining a large language model to do your bidding:
```text
Here's this email I got:

{email_text}

Extract the following information:
num_puns: int, worst_puns: list[str], best_pun: Optional[str] = None

Output as JSON.
```
This works okay, at first. But as the number of requirements increases or the depth or ‘nestedness’ of the schema grows, this starts to fail and fail often.
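To make the failure mode concrete, here’s a sketch of the prompt-and-pray version (variable names are mine, for illustration):

```python
import json

import openai

email_text = "..."  # the pun-laden email

prompt = (
    f"Here's this email I got:\n\n{email_text}\n\n"
    "Extract the following information:\n"
    "num_puns: int, worst_puns: list[str], best_pun: Optional[str] = None\n"
    "Output as JSON."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

# Fragile: the reply is free text, so this raises the moment the model
# wraps its JSON in prose, markdown fences, or a trailing apology.
data = json.loads(response.choices[0].message["content"])
```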
But we can recast this problem as a tool choice problem. OpenAI now lets me pass tools to it in service of a question. If it chooses one of those tools, it will format the data that needs to be passed to it. So all we need to do is:
1. Create a tool whose “inputs” are num_puns, worst_puns, and best_pun (sketched just after this list).
2. Trick OpenAI into using that tool.
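In the raw API, step 1 is just a JSON Schema description of those inputs. A sketch, with the field names from the prompt above and everything else (including the variable name) assumed:

```python
extract_puns_function = {
    "name": "extract_puns",
    "description": "Record pun statistics extracted from an email.",
    "parameters": {
        "type": "object",
        "properties": {
            "num_puns": {"type": "integer"},
            "worst_puns": {"type": "array", "items": {"type": "string"}},
            "best_pun": {"type": "string"},
        },
        "required": ["num_puns", "worst_puns"],  # best_pun stays optional
    },
}
```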
Luckily, OpenAI exposes a parameter that lets you force the model to use a tool. So really, it’s as simple as telling OpenAI “I want to extract the following information from {email_text}” and passing it:
```python
extract_puns(num_puns: int, worst_puns: list[str], best_pun: Optional[str])
```
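Step 2 is the `function_call` parameter, which forces the model’s hand. Continuing the sketch (`email_text` and `extract_puns_function` come from the snippets above):

```python
import json

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {
            "role": "user",
            "content": f"I want to extract the following information from {email_text}",
        }
    ],
    functions=[extract_puns_function],
    function_call={"name": "extract_puns"},  # not "auto": the model must use our tool
)

# The "arguments" field comes back as a JSON string shaped by our schema.
puns = json.loads(response.choices[0].message["function_call"]["arguments"])
```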
What do you get back? Beautiful, bulletproof, formatted, boring, typesafe, atomic-unit-of-trust JSON.
We shipped this in our latest version of Marvin today, but it’s all under the hood - as it should be. It’s open source, free to use, and all our code is publicly available. You’ll pay your tithing to wherever you buy your inference, but it’s now 15x cheaper to extract structured data from unstructured text and documents.
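For reference, here’s roughly what that dance collapses into with Marvin’s `ai_model` decorator (a sketch; the exact interface may differ by version, so check the docs):

```python
from typing import Optional

from marvin import ai_model
from pydantic import BaseModel


@ai_model
class PunReport(BaseModel):
    num_puns: int
    worst_puns: list[str]
    best_pun: Optional[str] = None


report = PunReport(email_text)  # one call: extraction, formatting, validation
```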