April 14, 2024 Léopold Mebazaa

Instructions is all you need

When LLMs meet DSLs

AI agents may need custom programming languages to succeed

AI agents promise a future where repetitive tasks that are traditionally performed manually are automated. But, in the present, these agents are slow, clunky, and error-prone. Why is that?

Some believe the current failures of AI agents will disappear as LLMs increase in size and reliability. I suspect that won’t be sufficient: AI agents, as they’re designed today, are conceptually flawed.

The impasse of the “Choose Your Own Adventure” agents

For those unfamiliar, there are two main types of AI agents. I refer to the first type as a “Choose Your Own Adventure” agent. One seminal example is natbot, a CYOA agent that controls the web browser. There have been many successors to natbot, but the core mechanics tend to stay the same.

CYOA agents require the user, at the outset, to provide a series of actions, such as clicking, scrolling, or typing, that the agent will use to accomplish the prompted task. At each step, the agent chooses one of these actions until it succeeds. For example, one could ask a CYOA agent to book a flight between San Francisco and Hawaii. The agent would work by selecting one of the predefined actions at every step (in this case, on every page) until the plane ticket is booked and the task is deemed accomplished.
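The core loop described above can be sketched in a few lines. This is a minimal illustration, not natbot’s actual code: the action names and the `choose_action` stub are hypothetical stand-ins for the LLM call a real agent would make at each step.

```python
# Minimal sketch of a "Choose Your Own Adventure" agent loop.
# The action vocabulary and choose_action stub are hypothetical,
# not the API of natbot or any real agent.

ACTIONS = ["click", "scroll", "type", "done"]

def choose_action(page_state, goal):
    """Stand-in for an LLM call: pick one predefined action per step.

    A real agent would send the current page state and the goal to an
    LLM and parse its answer back into one of the allowed actions.
    """
    return ("done", None)  # this stub declares success immediately

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        action, arg = choose_action("<page state>", goal)
        history.append(action)
        if action == "done":  # the agent deems the task accomplished
            break
    return history

print(run_agent("book a flight SFO -> HNL"))
```

Note how every iteration requires a fresh model call before anything happens on the page, which is where the latency discussed below comes from.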

While these agents are quite impressive, they have not proven to be useful. Why not?

First, CYOA agents are inefficient. These agents, in general, take several seconds to decide on the best course of action. Performing the same tasks manually is faster, especially given how error-prone these agents still are.

Moreover, CYOA agents are not designed for repetition. Automating tasks is only useful if one can repeat actions efficiently. Unfortunately, there has been very little development in this area. Currently, executing the same task requires the user to ask the CYOA agent the same thing, every time. This is a lengthy and expensive process, even if the agent succeeds every time. As it stands, at scale, writing a script is still cheaper and more efficient.

Finally, CYOA agents don’t model complex behavior. They are designed to mimic human behavior, and can thus only execute a linear sequence of actions. This means they can’t handle loops or other control flow.

The perils of code-generating agents

The second type of AI agent generates code (e.g. Python) before executing it. Currently, the most ubiquitous code-generating agent is OpenAI’s Code Interpreter. Code Interpreter is generally used for math and data analysis, but it can do many other tasks.

Code-generating agents are theoretically more powerful than CYOA agents because they are not limited to performing a series of predefined actions. But these agents also present significant problems.

First, code-generating agents do not have enough formal guardrails. Code that is written and executed by AI with little human supervision raises the possibility of a security disaster. While OpenAI’s Code Interpreter operates within the confines of some guardrails, other code-generating agents might not.

Guardrails around AI agents are essential for more than simply preventing a rogue AI. For example, companies that wish to automate their customer service systems will want to ensure that these agents comply with all of the company’s internal policies. Code-generating agents will therefore not be widely adopted without strong guarantees of formal guardrails. Agents that freestyle in Python simply cannot promise these guarantees.

Secondly, code-generating agents, like CYOA agents, are inefficient. Take the example of web browsing. Code intended to scrape and crawl websites usually runs to hundreds of lines or more. Large parts of that code are boilerplate and unoptimized, not to mention unsafe. And because the first pass of code is often rife with errors, the process of correcting it through trial and error becomes long and costly. Using Code Interpreter through OpenAI’s API currently costs five cents per session.

Thirdly, bigger and better code-generating models might in fact create new problems. As AI code-generating agents advance, we may reach a point where these agents produce so much code that manual review becomes all but impossible. This problem will exist even if formal guardrails are implemented, and even if these LLMs are able to produce safe, optimized code on the cheap. Traditional programming languages might not be the right tool for agents to code in, as they will produce code that is too long for human review.

Why custom programming languages might be the missing piece

I think that AI agents might be improved by the introduction of domain-specific languages (DSLs). More specifically, there should be a collection of domain-focused DSLs that agents can use for whatever task they are given.

An agent that generates code in a DSL would synthesize the strengths of both types of agents: like a CYOA agent, it would be restricted to a small set of safe, domain-specific operations; like a code-generating agent, it could express loops and control flow; and its programs would be short enough for humans to review.
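To make the idea concrete, here is one possible shape such a system could take. The DSL syntax, the operation names, and the interpreter below are entirely hypothetical — no such language exists yet — but they illustrate how a whitelist of domain operations acts as a formal guardrail while keeping programs short and reviewable.

```python
# Hypothetical browsing DSL: each statement is one whitelisted operation.
# Because the interpreter rejects anything outside the whitelist, the
# agent cannot "freestyle" the way it could in general-purpose Python.

ALLOWED = {"open", "fill", "click"}  # hypothetical operation set

PROGRAM = """
open https://example.com/flights
fill origin SFO
fill destination HNL
click search
"""

def run(program):
    trace = []
    for line in program.strip().splitlines():
        op, *args = line.split()
        if op not in ALLOWED:
            # Formal guardrail: unknown operations are refused, not executed.
            raise ValueError(f"operation {op!r} is not permitted")
        trace.append((op, args))  # a real interpreter would drive a browser here
    return trace

print(run(PROGRAM))
```

A four-line program like this is trivially reviewable by a human, and the interpreter, not the model, is what enforces the policy — which is the guarantee general-purpose code generation cannot offer.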

Over the past decades, as the need to understand the internal workings of the computer has declined, computer programming has become far more abstract. We went from zeros and ones to assembly to C to Python. Many settled on the latter because it is a good choice for manually written code that is reviewable by others. As we enter a new era with AI, we might need to reach a new level: DSLs for AI-written code that is reviewable by the rest of us. And I’m working for Columbia University until this fall to build precisely that.

If you want to chat – My email is lemeb ‘at’ this domain.

Thanks to Natasha Esponda, Fred Kjolstad, Alex Ozdemir, Eric Pennington, and Michael Völske for their feedback.