A Julia DSL for language models

@svilupp and I were talking on the Slack about what a DSL might look like for generative models. I wanted to put this on the forum as a more permanent, conversational place, so I hope this can serve as a kind of evolving document.

Copied at the end is the original Slack conversation.

The gist is that @svilupp is tinkering with tools for writing mini LLM programs, and I have also been intermittently tinkering with the same. @svilupp is adding functionality to PromptingTools.jl that processes an arbitrary program composed of a series of language model calls.

Here’s his example:

@aimodel function my_model(n=2; model="gpt3t")
    # add a soft check for our AI task
    # syntax: @aisuggest CODEBLOCK CONDITION FEEDBACK
    # or simply @aisuggest CONDITION FEEDBACK if already INSIDE a CODEBLOCK
    @aisuggest begin
        # airetry will simply re-run if the call fails
        @airetry greeting = ai("Say hi $(n)-times"; model)

        # Nested @aiassert - hard check ("hi" is 2 characters, hence the ÷ 2)
        count_hi = (length(greeting) - length(replace(greeting, "hi" => ""))) ÷ 2 == n
        @aiassert count_hi "There must be exactly $(n) 'hi' in the greeting"

        greeting_check = occursin("John", greeting)
    end greeting_check "Greeting must include the name John"

    return greeting
end
This program would repeatedly re-run the language model until the conditions are met; in this case, that's the model saying "hi" exactly n times (twice, by default). I love this framework and wanted to contribute a generalization of the code above, to think about what an abstract spec of this might look like.

I could imagine more powerful programs like this one, run on a robot assistant named JerryBot.

User: My mom left her keys, wallet, and glasses on the table at McDonalds. Could you run back and get them?

Now, JerryBot has stuff to do. It has to follow the flow starting from the input from this user, which is a request to pick some stuff up from a table at McDonalds.

JerryBot has to figure out

  1. What did I just get? Is this a request, a statement, or other? If it’s a statement, save it to memory for later. We’ll ignore “other” for now, but in principle you can add separate control flows for other.
  2. If it’s a request, you can enter a separate control block. In this case, you need to know a few more things. Here’s a small list of some you might consider:
    • Who asked? I may only respond to certain people.
    • What kind of request is this? Retrieval, shut down, other?
  3. If it’s a retrieval:
    • What do I have to get? Extract a list if multiple.
    • Where is it?
    • Do I need to know anything else?

and more. The idea here is that you can build arbitrary programmatic flow by contextualizing, extracting text, etc., until you have some kind of result for any arbitrary query.
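To make that concrete, here is a minimal sketch of what the routing might look like in plain Julia. The `classify_kind` helper is a made-up stand-in for a real LLM classification call (here it just keyword-matches), and `handle_input` is an invented name; both are purely illustrative.

```julia
# Hypothetical stand-in for an LLM classifier; a real version would ask the
# model to pick from the allowed choices. Here it just keyword-matches.
function classify_kind(text)
    if occursin("?", text) || startswith(text, r"(?i)(could|can|would|please)")
        return "request"
    else
        return "statement"
    end
end

function handle_input(text)
    kind = classify_kind(text)
    if kind == "statement"
        return (:remember, text)   # step 1: save statements to memory
    else
        # steps 2-3: a real program would classify the request type and
        # extract the item list and location with further LLM calls
        return (:retrieve, text)
    end
end

handle_input("Could you run back and get them?")   # routes to :retrieve
handle_input("My mom left her keys on the table.") # routes to :remember
```

The point is not the keyword matching, but that each branch is an ordinary Julia control-flow construct wrapping a (potential) model call.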

As a practical example, I could imagine creating an auto-documenter that goes through every person’s Julia source code and follows a series of steps to iteratively refine the documentation:

  1. Does this repo have Documenter set up? If no, do so. Otherwise, proceed.
  2. What does this package do? Please provide a list of steps to generate documentation for the package. This would be something like write a home page, add the API spec, manuals for X use cases, etc.
  3. For each step, recursively generate subtasks. Each task is run until the model decides it knows the answer directly and no further subtasks are generated.
  4. Validate each step. Check your work – did you accomplish the goal? If not, please redo.
  5. If there is any code you have written, please execute it to ensure that it works. This would basically entail giving your LLM a REPL to use, and it can look at the callstacks and maybe automatically determine what to fix.
  6. Save the code, prepare a pull request.
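The recursive expansion in step 3 can be sketched as a tiny Julia function. Both helpers here are invented stubs: a real `subtasks` would ask the LLM to break a task down (returning an empty list once it can answer directly), and a real `answer` would produce the final text.

```julia
# Hypothetical helpers, stubbed for illustration.
subtasks(task) = task == "write docs" ? ["write home page", "add API spec"] : String[]
answer(task) = "done: $task"

# Step 3: recursively expand each step until no further subtasks remain.
function solve(task)
    subs = subtasks(task)
    isempty(subs) && return [answer(task)]
    return reduce(vcat, solve(s) for s in subs)
end

solve("write docs")  # expands into the two leaf tasks and answers each
```

Steps 4-5 (validation and code execution) would then wrap `solve` in exactly the kind of retry/assert constructs discussed above.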

I think that this type of program is extremely powerful, and I think that Julia is an extraordinary tool for working with this kind of thing. We’re good at DSLs, and I think you can make some absolutely gorgeous programs with language models when you give them a focused application and guidance.

I’m starting some rough tinkering with colang, and @svilupp is approaching it in his delightful way. My idea is to try to define some kind of type system for language. For example, I should be able to pose queries like

  • Is this true or false?
  • Would you say yes or no, assuming you are [insert perspective]?
  • Does this seem to be about Z?
  • Is this a list? For this one, you can also imagine a nested call that extracts the items in a list using structured text extraction.

and receive a wrapper type around the expected response type, containing both the raw response (“yes”) and a reduction of that response (true). There are maybe a few other cases (categorical selection, code evaluation, etc.).
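One way such a wrapper might look (a sketch; `AIResult` and `asbool` are names invented here, not part of any existing package):

```julia
# A wrapper holding the raw model text alongside a typed reduction of it.
struct AIResult{T}
    raw::String    # the literal model output, e.g. "Yes, that's right."
    value::T       # the reduction, e.g. true
end

# Example reduction: collapse a free-form yes/no answer into a Bool.
asbool(raw::AbstractString) =
    AIResult(String(raw), startswith(lowercase(strip(raw)), "yes"))

asbool("Yes, definitely.")   # AIResult{Bool} with value == true
```

Keeping the raw string around matters: downstream code can dispatch on `value` while logging or debugging against `raw`.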

I’m very interested in working on this, so I hope to hear some perspectives from people here.

Slack Log

Jan Siml 23 hours ago

Over the weekend, I’ve been playing with writing a compiler/DSL for writing mini LLM programs. The purpose is two-fold: being more declarative and being less verbose. The hope is that once we know what the user wants to do, we could help them optimize the function (eg, the prompt, call parameters). I discovered too late that @goto/@label doesn’t work how I wanted, so I’ll need to rewrite a bunch of stuff. I figured it’s a good opportunity to ask if I’m just wasting my time or going in the wrong direction.

Jan Siml 23 hours ago

Would you be able to use something like this?

@aimodel function my_model(n=2; model="gpt3t") # add a soft check for our AI task # syntax: @aisuggest CODEBLOCK CONDITION FEEDBACK # or simply @aisuggest CONDITION FEEDBACK if already INSIDE a CODEBLOCK @aisuggest begin # airetry will simply re-run if the call fails @airetry greeting = ai("Say hi $(n)-times"; model) # Nested @aiassert - hard check count_hi = length(greeting)-length(replace(greeting,"hi"=>"")) == n @aiassert count_hi "There must be exactly $(x) 'hi' in the greeting" greeting_check = occursin("John", z) end greeting_check "Greeting must include the name John" return greeting end

It would effectively rewrite into a proper function and add the necessary boilerplate. Motivation and explanation of the syntax here: https://github.com/svilupp/PromptingTools.jl/blob/add-compiler-macro/src/Experimental/AgentTools/DSL_README.md EDIT: There is no code to test yet - it doesn’t work yet and I’m too embarrassed about its current state (edited)

Jan Siml 23 hours ago

I’m keen to learn:

  • would it be broadly useful (thinking about agentic workflows/automations)
  • is something too hard to understand? what would simplify it? (within the limitations of what’s reasonably doable with macros)

SixZero 21 hours ago

I am interested, as I think many of us are. I see some ideas written about where it would be useful, but I would still want to see examples of when these happen. Also, figuring out intuitive ways to use it is, I think, somewhat important (edited)

Jan Siml 20 hours ago

Have you seen: GitHub - stanfordnlp/dspy: DSPy: The framework for programming—not prompting—foundation models ? It’s very much based on that

SixZero 20 hours ago

Yeah, opened it, although couldn’t really read it over, but looked promising.

Jan Siml 19 hours ago

I’d say it depends on what you’re using GenAI for…

  • If you don’t need to chain a few AI calls together (eg, extract something, use it in the next call as inputs), you don’t need it… or if you call OpenAI 100x at once, chances are that one of the calls will fail, so it matters how you handle that
  • if you don’t mind writing prompts and tweaking them “manually”, the optimization is useless for you (and hence so is using the DSL)

Maybe it’s less “broadly” useful than I thought, which means I can focus on other projects!
That’s also super valuable feedback

Cameron Pfiffer 12 hours ago

Oh I love this!

Cameron Pfiffer 12 hours ago

I have some stuff in colang for structured text in control flow, like yes/no, contains x, etc. Would be lovely to have that as well

SixZero 11 hours ago

No, I actually think these issues are super reasonable and anyone’s issues, so it is definitely worth it @svilupp

SixZero 11 hours ago

I just wanted to know how our life would be easier with it… eventually I want things to work in an agent-centric manner, to improve their solution step by step…

SixZero 11 hours ago

Just not quite right there tho at this point

Jan Siml 11 hours ago

I want things to work in an agent-centric manner

Can you say more? Do you have a specific example of how you run something today vs what would make things simpler?

With “agents”, I struggle to find anything super useful besides a few chained AI calls with some validation (I think validation / self-retry is actually the most useful bit of the DSL above).

What’s your take?

SixZero 11 hours ago

Actually, I still don’t have a specific example, more of a dream; I am thinking about how things would be useful…
Also, there are some somewhat good solutions to this, e.g. gpt-pilot… I guess some of the value of this agent thing is:

  • It has a lot of time to work, it can work the whole day.
  • It needs to be “applyable” easily to whole projects, and work on them.
  • Accuracy of its suggestions must be improved; making the code worse is just not fun. Probably the problem here is that we also don’t really know what is better code and what is not.

gpt-pilot is not easy to apply, in my opinion.
The 3rd point might be improvable if the system could come up with 3-4 different solutions the next day for solving things. Also, what I see in Copilot is that it gives a pretty nice git diff on the code changes… this is a fairly good format for a human to look at, to decide whether something got improved or not… I don’t know how hard showing multiple solutions in this manner would be…


SixZero 11 hours ago

Copilot excels at the 2nd point !

SixZero 11 hours ago

Accuracy could be improved… also it could be used the whole day for improving things…


Jan Siml 11 hours ago

I have some stuff in colang for structured text in control flow, like yes/no, contains x, etc. Would be lovely to have that as well

Can you expand on that? I’d love to see a mock of what more complicated control flow looks like!

Re. structured extraction, the idea here is that aiextract is verbose, so you can write, eg, z = ai(…)::MyFancyType → that makes it obvious that the compiler should use aiextract. Plus, z will already be your MyFancyType instance (no need to call the AIMessage().content accessor).

For yes/no, we should probably leverage aiclassify(), which uses the logit_bias trick to answer in one token - either true/false (you could tweak it to be yes/no)… We could add to the compiler: ai(…)::Bool → aiclassify.

For “contains x” AI calls, we could have a rule that if @aiassert/@aisuggest are missing a condition, the “feedback” text becomes something to pass to an LLM judge → an aiclassify() call that will answer true/false (with low temperature):

@aiassert begin … my code … end "<statement>" → @aiassert begin … my code … end aiclassify("statement") "<statement> is not true"

For string-based occursin conditions, it would be simply @aiassert CODE_BLOCK occursin(…) "feedback".

In general, what would be a killer feature/use case that would make it worth switching to this syntax? Personally, I can write all these control flows very quickly, so it’s hard for me to see if the cost of learning a new syntax (eg, everything is an ai() call and the @ai… macros) is worth it for users (edited)


Cameron Pfiffer 10 hours ago

Okay, I have some thoughts here but will type em up later. Wonder if this is Discourse-able?


Random thoughts on LLM DSLs that may or may not be useful:

In general I feel like LangChain is a good case study for APIs for what not to do.

First I’ll say that I obviously have huge respect for the LangChain people for working so hard for the community in this space.

The issue with LangChain is they basically added way too many abstractions, way too early. Right now there are like 10 different APIs that can do pretty much the same thing but in slightly different ways. In each of these ways I don’t really understand what is going on under the hood (in case I need to modify it). From using LangChain for a bit, I often get frustrated with all of the abstractions and just end up writing code from scratch that manipulates strings and calls llm.query(my_string).

So from this experience I would just want to add a comment that you should try to keep an LLM domain specific language as “close to the metal” (i.e., string manipulation) as possible - only adding abstractions where there is a huge reduction in code complexity - but leaving the rest up to the user manipulating strings.

For example, the stuff that the library Outlines does is a nice example of this. This is an example that would be really annoying for a user to need to re-create from scratch, so their abstractions help quite a bit, and are worth it.

For example in your code above I think the @airetry and @aiassert are perhaps examples of early abstractions that might not be needed - if you instead expose an API that simply makes it easier to generate and extract structured data from an LLM, then I can simply write a while loop there for a specific extracted variable, and it will be easier to understand and modify. Similar for @aiassert, I can write an @assert following some structured extraction call if needed.

Another area is retrieval augmented generation (either from databases or code itself). There’s probably a good API that is “close to the metal” and doesn’t abstract away too much, so the user can tweak different parts, and still be aware at all times of the exact string being passed to the LLM.

I think also some of the stuff that AutoGPT does would be pretty annoying to code up from scratch. So making an API that handles this (perhaps by passing schema to control different commands) would be quite nice. So for example, any sort of documentation generation could be implemented with that sort of API.


This is all great! Thank you for adding your comments.

This has been my sense too. LangChain is wonderful but I feel like it’s never really exactly what I want to be using.

Hm okay, I agree, and it’s a good point. As with any language the tradeoff between flexibility and expressiveness is a meaningful one. I think here it would be worth going through a set of primitives to see if they are meaningful to implement.

As you say, Outlines does this incredibly well. This example is just amazing:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

In this case we have a “choice” primitive.

I think for us it could be reasonably easy to put into a format like

if choice(context, model, prompt, ["yes", "no"]) == "yes"
    . . .
end


  • context represents the evaluation context, i.e. RAG documents, memories, previous evaluation results, etc. In principle this could be kind of a larger type that includes the model, the prompt, and any additional information you want each line to have.
  • model is the model to use.
  • prompt is the specific request to use before making a choice.
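A rough sketch of what that primitive could reduce to in Julia. Everything here is invented for illustration: `Context` is a placeholder type, and the `llm` keyword stubs out the real model query so the flow is visible and testable without an API call.

```julia
# Sketch of a `choice` primitive (all names are hypothetical).
struct Context
    docs::Vector{String}   # RAG documents, memories, previous results, ...
end

function choice(context::Context, model, prompt, options::Vector{String};
        llm = (model, p) -> first(options))  # stand-in for the real LLM query
    full_prompt = join(context.docs, "\n") * "\n" * prompt *
                  "\nAnswer with exactly one of: " * join(options, ", ")
    answer = strip(llm(model, full_prompt))
    # Constrain the answer to the allowed set; a retry loop could go here instead.
    return answer in options ? String(answer) : first(options)
end

ctx = Context(["User asked JerryBot to fetch keys from McDonalds."])
choice(ctx, "gpt3t", "Is this a request?", ["yes", "no"])
```

The key design point is that `choice` always returns one of `options`, so the caller can branch on it with a plain `if`, exactly as in the snippet above.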

This might also be a macro-style approach:

@choose prompt "yes" "no"

which would optionally handle context passing but could be overridden with @choose new_context ....

Another type of abstraction that would be useful is a loop

I agree here – it is a little heavy handed and may go a bit above what a user might like, but it does address this problem where LLMs don’t always do quite what you want. It’d be nice to provide simpler tooling for this, since it is essentially a while loop.

Yes! This would be very cool. Lots of us are working with RAG tools but I don’t think there’s been a concerted effort yet. I would love to have a local SQLite or Chroma database that the language model can use as memory for their program.

It would also be cool to provide memory management tooling to provide arbitrary queries that you think are helpful for some kind of code evaluation. I.e., “I want to know if this code violates our code of conduct”, which can go query the code of conduct and place it in the local context for an evaluation.

I don’t use AutoGPT currently – I’ll do my own tinkering, but does anything stand out to you as a feature that you think would be very useful to have?

Again, thanks for the detailed response! I appreciate it.


I’m not familiar with LLM APIs so my question will be naive.

This looks very string based, whereas I normally like my programs to be typed. Is it possible to make a user-facing interface generate from a more structured set of options, like true | false or a parametrized SumTypes.jl enum like Just{Int}(4) | None or a color like RGBA(0,0,0,0) — or can they only make strings?

Great conversation!

Agreed. I thought those proposed primitives meet that criteria but I might be wrong.

So let’s pick on aiassert/aisuggest, I have an actual use case (inspired by your Notion tweet):

  • given a task from the user, extract fields matching the schema of my database
  • I want to validate some fields, eg,
  1. hard check: no task should take longer than 3hrs,
  2. hard check: the tag must be only from the allowed list (we could use an enum in aiextract, but let’s ignore that for now because there are situations where it’s harder)
  3. soft check: llm check if it’s open source/coding, then it shouldn’t be category “work” (soft because there could be some edge cases, so let it pass)
  4. soft check: the reworded task should be less than 10 words (again, preferred but not strict)
  • I want to also have a retry on the call if it fails for http/busy server reason
  • I’m happy for the checks to run up to 2-3 times each, but not all of them, so probably some total budget of calls - let’s say 10 total calls
  • also, it would be nice to tell the model that it failed and what the failure was. Otherwise, it won’t improve much on the subsequent calls. We need a different feedback addition for each condition check.

How would you write that mini program?

I think that’s a simple and a realistic task, no?

  • why the checks and not a better prompt? Runtime checks always win, and I’d rather save an hour of tuning a prompt that works 50% of the time and pay 10 cents instead of 1 cent per query. It’s easy to overfit your prompt to one example
  • why soft checks? I have certain preference that is worth paying for and saves me time in the task cleanup, but I can see that in the diverse tasks it might happen

On hand-rolling your while loops, I think they might be the only choice for this control flow. But somehow I find them harder to get correct/write quickly than for loops (less “finite”/clear). I think in my code and code around me it’s maybe 1:50, for vs while, or something crazy like that. Am I the only one? Especially, if you have multiple of them and some checks stop the program, some just continue.
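A sketch of the budgeted-retry flow described above, written with a `for` loop rather than a `while` (in the spirit of the preference just stated). All names are invented; `extract` stands in for an LLM extraction call, and `checks` pairs each condition with the feedback it would send back to the model.

```julia
# A call budget shared across all checks, as in the mini program above.
function with_budget(extract, checks; budget = 10)
    result = nothing
    for call in 1:budget
        result = extract(call)
        failed = [fb for (cond, fb) in checks if !cond(result)]
        isempty(failed) && return (result, call)   # all checks passed
        # a real version would append the `failed` feedback to the next prompt
    end
    return (result, budget)   # budget exhausted; return the last attempt
end

# Stub "model" that only satisfies the check on its third call.
extract(i) = (; duration_hrs = i < 3 ? 5 : 2, tag = "coding")
checks = [(r -> r.duration_hrs <= 3, "no task should take longer than 3hrs")]

with_budget(extract, checks)   # passes on the third call
```

The finite `1:budget` range makes termination obvious at a glance, which is exactly the "less finite/clear" complaint about hand-rolled `while` loops.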

Btw, the biggest advantage of the above DSL was meant to be the nice inputs for the prompt optimization later on. It would be harder to do if users hand-roll all checks and asserts themselves - we’re losing a lot of information!

On being close to the metal, I agree! I want the macros to allow users to operate on “strings” and forget that there are some LLM calls, lot of it is boilerplate.

So back to the program above: what if there are 2-3 AI calls (extract → generate → generate)? How do you check what happens between the calls? You’ll probably need to start tracking outputs of the various calls and writing a bunch of the same stuff. That’s what @aimodel was meant to enable; definitely not needed for one vanilla call with no validations.


EDIT: Check out this tweet for the rationale behind assert/suggest constructs: https://x.com/arnav_thebigman/status/1758554162449789309?s=46&t=LqkQn2Q2J-NjCeYA4p2Dbg

I was thinking about something conditional like that, but in the spirit of the above, that might be left to the standard ai calls?

If you use OpenAI, that’s a simple “aiextract” with Enum return type (a lot of wasted tokens though). It basically builds the “function call”.

There is also aiclassify, which is preset for true/false. It exploits the “logit bias” trick and asking for only 1 token to return.
We could tweak it to behave like Enum selection for any arbitrary list provided. It would be cheaper and faster, but with perhaps a small loss of performance since we just eagerly grab the first token.

This aiclassify should be possible for llama.cpp models, Ollama doesn’t expose it yet, but maybe it could?

EDIT: Sorry about formatting and typos, I’m on a phone.

EDIT: I’ve added the convenience function to encode “choices” to aiclassify, so you can now do:

choices = ["animal", "plant"]
input = "Palm tree"
aiclassify(:InputClassifier; choices, input) # Output: "plant"

# or add choices with descriptions
choices = [("A", "any animal or creature"), ("P", "for any plant or tree"), ("O", "for everything else")]
input = "spider"
aiclassify(:InputClassifier; choices, input) # Output: "A"

See the details in ?aiclassify or here. To see what’s happening under the hood simply get the whole conversation (return_all=true) and pretty print it with preview(conv).

It’s a good ask, but at the moment I’m focused on getting LLMs to do what I want. That’s much easier in string space thanks to its “fuzziness”.

Once we get a better handle on that, we should definitely explore something more Julian!


The thing that’s standing out to me here is that the aiextract stuff for structured text extraction is probably where we’d want to end up focusing for non-raw string stuff. This gives us a bunch of type information as well.

In a DSL structure we can probably slim down the aiextract stuff so that

  1. it dispatches on the return type (currently I believe it is a keyword argument, so not type stable)
  2. it’s a little less verbose to use, especially in the return type. Inside a DSL I don’t necessarily want to be calling thing.content all the time, so maybe we can have a wrapper function that accesses only the raw result of the call.

An example here might be

@enum CreatureType animal person other

function f(x)
    # Check first if this is about a person or an animal. The return type of
    # creaturetype must be CreatureType in this case.
    creaturetype = @classify x CreatureType

    # Dispatch on returned value
    if creaturetype == animal
        return "It's an animal."
    elseif creaturetype == person
        return "It's a person."
    else
        return "It's something else."
    end
end

which is relatively simple to use, though I’m not sure that the macro here is useful. I think it might be nice to add stuff to make @classify use multiple queries to poll the same model multiple times or multiple different models, which could help you have more of a probabilistic understanding of the result.

I think polling is not a terrible idea here, mostly because I think it’d be fun to experiment with.

In that case, we’d have the return type be something like

using Statistics

@enum CreatureType animal person other

struct ClassifiedProbability
    classifications::Vector{CreatureType}
end

# Probability of a specific class
function probof(classified::ClassifiedProbability, t::CreatureType)
    # Return the share of classifications equal to `t`
    return mean(x -> x == t, classified.classifications)
end

probof(ClassifiedProbability([animal, animal, person, animal]), animal)
# 0.75

# Get all probabilities
function probabilities(classified::ClassifiedProbability)
    # Return a vector of probabilities, indexed by enum value
    num_instances = length(instances(CreatureType))
    probs = zeros(num_instances)
    for c in classified.classifications
        probs[Int(c) + 1] += 1.0
    end
    return probs ./ length(classified.classifications)
end

probabilities(ClassifiedProbability([animal, animal, person, animal]))
# [0.75, 0.25, 0.0]

Nitpick: I would try to do as much as you can without macros. Macros make it harder for downstream users to integrate it into custom code as it’s not clear what types it expects, whether it evaluates at compile time or runtime, what code actually gets called, whether I can precompile it, etc. And static analysis tools are not as useful. It’s not even necessarily more convenient nowadays since people just have CoPilot fill in the tedious parts.

I think it’s also always better to not necessarily add all the features yourself, but just make things modular and idiomatic enough so a user can easily extend things and do things in normal Julia code. At a given level of complexity it’s often easier to write things from scratch than read the docs — so it’s good to just make that stage easier for the user. This is why I think PyTorch is such a popular framework; they only add the basic ingredients (nn.Linear, F.relu, Adam), with some additional APIs for annoying stuff like FSDP, but make it amazingly simple and robust for the user to put together custom pipelines and forward models in plain Python code. I can debug my PyTorch code with good ol’ print.

I get the ai"prompt"; that’s convenient enough to be worth it and clear what type it needs. But the

@classify x CreatureType

is perhaps unnecessary. You can just do it with classify(x, CreatureType) and take advantage of dispatch. But this type of thing I might just do in a general schema approach and let the user specify a method if they want to explicitly do classification.
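A sketch of what that dispatch-based `classify(x, T)` might look like, built from the enum's own instances. This is hypothetical API design, not existing PromptingTools code; the `llm` keyword stubs out the real model call so the shape is testable.

```julia
@enum CreatureType animal person other

# Dispatch on the return type: build a choice prompt from the enum instances.
function classify(x, ::Type{T}; llm = prompt -> "person") where {T <: Enum}
    options = collect(string.(instances(T)))
    prompt = "Classify the following as one of $(join(options, ", ")): $x"
    raw = String(strip(llm(prompt)))
    idx = findfirst(==(raw), options)
    # fall back to the last instance ("other" here) on an unparseable answer
    return idx === nothing ? instances(T)[end] : instances(T)[idx]
end

classify("my neighbor Bob", CreatureType)   # returns the enum value directly
```

Because the return value is already a `CreatureType`, the `if creaturetype == animal` dispatch from the earlier example works with no `.content` unwrapping.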


One example: how would I use tokenizer A for the first 20% of the input; and tokenizer B for the last part? LLMs are such an active area of research you kind of need to expose the low-levels of processing to the user at some level, so we can do very custom stuff. Always better to lean hard into the base programming language than invent a new one :wink:

I think specifying some kind of interface that users can customise would be nice. See Interfaces · The Julia Language. Maybe there should be an abstract interface for LLM stuff.


Following some of the feedback here, I’ve added the retry functionality to PromptingTools 0.13, macro-free.

It’s called airetry! and it allows you to provide a condition as a function and feedback for the model as a string/function, roughly like this: airetry!(f_cond::Function, aicall, feedback::Union{String,Function}).

It leverages the “lazy” ai* function calls (eg, aigenerate → AIGenerate), which can remember their inputs and trigger the LLM call only on run! (but can do so repeatedly).

Example 1: Catch API issues

# API failure because of a non-existent model
out = AIGenerate("say hi!"; config = RetryConfig(; catch_errors = true),
    model = "NOTEXIST")
run!(out) # fails

# We ask to wait 2s between retries and retry 2 times (can be set in `config` in the aicall as well)
# It will throw an error in the end; alternatively, you could check for `aicall.success == false`
airetry!(isvalid, out; retry_delay = 2, max_retries = 2, throw=true)

Example 2: Validate outputs (with normal functions or even other LLM calls as an LLM judge)

out = AIGenerate(
    "Guess what color I'm thinking. It could be: blue, red, black, white, yellow. Answer with 1 word only";
    verbose = false,
    config = RetryConfig(; n_samples = 2), api_kwargs = (; n = 2))

## Let's ensure that the output is in lowercase - simple and short
airetry!(x -> all(islowercase, last_output(x)), out, "You must answer in lowercase.")
# [ Info: Condition not met. Retrying...

A more heavyweight example, but think about how hard this would be to build with simple ai* calls:
Example 3: LLM guesser program


## Mini program to guess the number provided by the user (between 1-100).
function llm_guesser(user_number::Int)
    @assert 1 <= user_number <= 100
    prompt = """
I'm thinking a number between 1-100. Guess which one it is. 
You must respond only with digits and nothing else. 
Your guess:"""
    ## 2 samples at a time, max 5 fixing rounds
    out = AIGenerate(prompt; config = RetryConfig(; n_samples = 2, max_retries = 5),
        api_kwargs = (; n = 2)) |> run!
    ## Check the proper output format - must parse to Int, use do-syntax
    ## We can provide feedback via a function!
    function feedback_f(aicall)
        "Output: $(last_output(aicall))
Feedback: You must respond only with digits!!"
    end
    airetry!(out, feedback_f) do aicall
        !isnothing(tryparse(Int, last_output(aicall)))
    end
    ## Give a hint on bounds
    lower_bound = (user_number ÷ 10) * 10
    upper_bound = lower_bound + 10
    airetry!(
        out, "The number is between or equal to $lower_bound to $upper_bound.") do aicall
        guess = tryparse(Int, last_output(aicall))
        !isnothing(guess) && lower_bound <= guess <= upper_bound
    end
    ## You can make at most 3 more guesses now -- if there are retries left in `config.max_retries`
    max_retries = out.config.retries + 3
    function feedback_f2(aicall)
        guess = tryparse(Int, last_output(aicall))
        "Your guess of $(guess) is wrong, it's $(abs(guess - user_number)) numbers away."
    end
    airetry!(out, feedback_f2; max_retries) do aicall
        tryparse(Int, last_output(aicall)) == user_number
    end

    ## Evaluate the best guess
    @info "Results: Guess: $(last_output(out)) vs User: $user_number (Number of calls made: $(out.config.calls))"
    return out
end

# Let's play the game
out = llm_guesser(33)
[ Info: Condition not met. Retrying...
[ Info: Condition not met. Retrying...
[ Info: Condition not met. Retrying...
[ Info: Condition not met. Retrying...
[ Info: Results: Guess: 33 vs User: 33 (Number of calls made: 10)

There are a few tricks under the hood (see the examples in the docs, ?airetry!), eg, to retry from the “best” attempt, I implemented a lightweight Monte Carlo Tree Search and also a Bandit-like sampler. They will have different use cases (thinking about prompt optimization in the future).

@cpfiffer as per your example, I’ve added “multi-sampling”, so you can ask for several samples per round to increase the chances of passing (+ with OpenAI and many commercial providers you can generate all these samples in a single call → api_kwargs = (; n = 3)). It saves time and money (you pay for the prompt only once).

On the classification discussion above, the communication always happens in strings. It’s up to the user to decide what to convert them into. With airetry!, they now have a simple way to validate and retry (eg, convert, tryparse, …).

Currently there are two mechanisms to achieve more “robust/stable” outputs:

  • aiclassify → Uses logit_bias hack. It encodes into the prompt “Category 1 is XYZ, …”, and forces model to return one out of the X categories (we restrict output to 1 token). So you have a guarantee of what you’ll get back:
choices = [("A", "any animal or creature"), ("P", "for any plant or tree"), ("O", "for everything else")]
input = "spider"
aiclassify(:InputClassifier; choices, input) 
# Output: "A" --> convert to your fixed type?
  • aiextract → Uses the tools kwarg (when available). It encodes your return_type into a JSON schema and leverages function-calling to get structured JSON back. I’d say the majority of commercial providers support it. For locally-hosted models, you need to use JSON mode + airetry!.

I think aiclassify is a good starter if you have simple fixed outputs, aiextract is for the cases when you need to extract “arguments”.

Thinking about DSL, I don’t see how we would save more than 1 line of code, eg,

  • @ai "my prompt" MyType → aiextract and unwrap it
  • @ai "my prompt" ["choice A", "choice B"] → aiclassify and unwrap it
Would it be that beneficial?

In any case, for the “lazy” AI calls, I’ve added methods last_output and last_message to make it easier to pipe them around and get only the generated text.

Interesting! I haven’t heard about it yet and can’t really think of a use case - do you have some specific ideas?

Agreed in general with your point, but I think it applies more to something like Outlines, which is closer to the “backend” and actually operates on the model itself.

I wish someone defined some LLM interfaces :smiley: My struggle is that it’s easy for well-understood domains where you have an idea of what people want/need. I don’t think LLM applications are mature enough for it (with exception of open-source instruct prompt templates → I feel very strongly that everyone should just use ChatML… :sob: )

I agree here – those are kind of an example of a general syntax, but I agree that classify(stuff) is probably the way to go. There might be a place for macros for more complicated flows later.

I see the value in having things be print-debuggable, good point! This is a downside of airetry, though in principle you could probably pass print logic into the validation function to handle this. airetry is just a convenience function around a loop of calls, though, so I’m 100% for it since the user can always just roll their own loop.
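For reference, the hand-rolled version is just a loop like this (a sketch; `ai` and the validator stand in for whatever call and check you use, and the @info line is where you’d hook in print debugging):

```julia
# Sketch of a DIY airetry-style loop; `ai` is a stand-in for your LLM call
function retry_ai(prompt, isvalid; max_retries = 3)
    for attempt in 1:max_retries
        response = ai(prompt)
        isvalid(response) && return response
        @info "Attempt $attempt failed validation, retrying..." response
    end
    error("No valid response after $max_retries attempts")
end
```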

I would actually really love this. We built something like this for AbstractMCMC.jl in the Turing universe, and it was delightful to work on.

We’d basically need significantly more access to internals though I think, i.e. llama.cpp or a richer implementation/loading of arbitrary language model weights in Llama2.jl or Llama.jl.

We should probably spin that out into a separate thread though.

I think I see the shape of this as being very useful, but it is quite difficult to read. This is, I think, why I want tools that provide even higher levels of abstraction. There are not that many things that I want out of a language model. The only things are:

  • Free-form response, i.e. “write me an essay about…”
  • A structured extraction that is still free form, i.e. JSON
  • An exact response from a list, i.e. yes/no, red/blue, cat/dog/hamster.

That’s a short list. I would definitely want those in types, since that makes it a lot easier to work with. @svilupp already has one of these types (the general-purpose one, AIMessage).

That leads me to expect there should be three atomic types falling under some abstract type like AbstractResponse:

  • Response for arbitrary text.
  • StructuredResponse for structured text.
  • CategoricalResponse for classified text.

Each of these should store the raw response, tokens used, and perhaps some additional debug information if needed (maybe model used?).
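Sketched out (all names and fields here are tentative):

```julia
abstract type AbstractResponse end

struct Response <: AbstractResponse
    raw::String      # raw model output
    tokens::Int      # tokens used
    model::String    # which model produced it, for debugging
end

struct StructuredResponse{T} <: AbstractResponse
    raw::String
    value::T         # the extracted object (e.g. parsed from JSON)
    tokens::Int
    model::String
end

struct CategoricalResponse{T} <: AbstractResponse
    raw::String
    choice::T        # one of the permitted options
    tokens::Int
    model::String
end
```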

Any practical feedback is welcome! Tbf, it requires reading the docs to understand what it does and why.
Think of it as assert on steroids:


  • it can retry automatically, not just throw an error
  • it manages the “conversation” (list of messages) for you

I guess that equivalent code to achieve similar functionality would be 2-3x longer.
The example might be excessive, but it’s effectively 3x airetry! checking different things.

There are already distinct types for the first two cases you mention:

  • Free-form response, i.e. “write me an essay about…” → AIMessage
  • A structured extraction that is still free form, i.e. JSON → DataMessage
  • An exact response from a list, i.e. yes/no, red/blue, cat/dog/hamster. → this would be AIMessage because it’s simply text.

Can you say more about why you would want a different output type for a response from an option list (CategoricalResponse)? It’s still text, not an object. If you need a different type for each option, you could simply use a Dict to map string -> MyType().
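i.e., something like this (Animal/Plant/Other are placeholder types):

```julia
struct Animal end
struct Plant end
struct Other end

# Map the single-token label returned by aiclassify onto your own types
label_to_type = Dict("A" => Animal(), "P" => Plant(), "O" => Other())
label_to_type["A"]  # Animal()
```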

If all you want is to avoid doing the Dict lookup, it could be added easily to aiclassify. There is already a function that decodes the token returned by the LLM into a string. There’s no reason why it couldn’t return a type instead (but it would break a lot of downstream workflows).

I guess we can discuss it tmrw!

lol I wish I had some right now – apologies for the unsubstantiated criticism. Reducing verbosity is hard and I don’t have a good idea yet!

Don’t get me wrong, I think for the moment this should be available – let’s operate on the assumption that we’ll have something like airetry available. It’s a useful abstraction layer. People who don’t want it don’t have to use it, and it helps reduce a lot of annoying boilerplate.

I wanted separate types for exact response for dispatch purposes – I know we already discussed that, but I’m still always looking to provide simple wrappers so people who want to do Julian stuff can do so easily.

For categorical stuff, it need not actually be just a string – an enum, for example, is not a string, and in principle you’ve got a lot of working code that lets you select arbitrary structs. We should permit non-string return values.
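E.g., decoding straight into an enum instead of a string (a sketch; the labels mirror the aiclassify choices from earlier in the thread):

```julia
@enum Species animal plant other

# Map the model's single-token label onto the enum
const LABELS = Dict("A" => animal, "P" => plant, "O" => other)
decode(label::AbstractString) = LABELS[label]

decode("P")  # plant
```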

The input knows the desired type and so it might be useful to provide that type when the value is returned.