4. Allows the model to execute code to analyze things on the fly, so the model can simply write bash/python/perl script to accomplish things where appropriate
5. A lot of context curation and opportunistic context updates, i.e. put into context anything that you are certain model would ask next
deskamess 14 hours ago [-]
I always wondered why ASTs weren't more of a part of both editing and scoping changes/parsing code. I think I read an article where they said 'grep' was just as effective. It kinda made sense for the case they were talking about.
miki123211 6 hours ago [-]
I think we should use ASTs more, not for performance, but for easier code review.
Changes that are primarily code refactorings, like breaking up a large module into a bunch of smaller ones, or renaming a commonly-used class, are extremely tedious to review, both in LLM generated diffs and human-written PRs. You still have to do it; LLMs have a habit of mangling comments when moving code across files, while for a human, an unassuming "rename FooAPIClient to LegacyFooAPIClient" PR is the best place to leave a backdoor when taking over a developer's account. Nevertheless, many developers just LGTM changes like this because of the tedium involved in reviewing them.
If one could express such changes as a simple AST-wrangling script in a domain-specific language, which would then be executed in a trusted environment after being reviewed, that would decrease the review burden considerably.
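A toy sketch of the idea using Python's stdlib `ast` (the class names come from the example above; a real engine would also need scope analysis so shadowed or unrelated same-named identifiers aren't renamed):

```python
import ast

class Rename(ast.NodeTransformer):
    """Rename every reference to `old` as `new` (no scope analysis)."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_ClassDef(self, node):
        self.generic_visit(node)          # rewrite references in the body too
        if node.name == self.old:
            node.name = self.new
        return node

src = "class FooAPIClient:\n    pass\n\nclient = FooAPIClient()\n"
tree = Rename("FooAPIClient", "LegacyFooAPIClient").visit(ast.parse(src))
renamed = ast.unparse(tree)
```

Reviewing the ~15-line transformer once is much cheaper than reviewing the thousand-line diff it produces.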
I believe that with agentic development, the most important constraint we have is human time. Making the LLM better and faster won't help us much if the human still needs to spend a majority of their time reading code. We should do what we can to give us less code to read, without losing confidence in the changes that the LLM makes.
GodelNumbering 14 hours ago [-]
Grep is effective for the most part, except for situations like when you have huge codebases and the thing you're looking for is used in too many places both as symbol and non-symbol.
Another annoying thing about plain grep: LLMs often end up matching inside bundled/minified packages, where a single line can be large enough to ruin the context window
embedding-shape 14 hours ago [-]
> Grep is effective for the most part
It's very effective in well-written and well-designed code bases, where concepts tend to be well-formed enough not to be named the same as everything else, so grepping for symbols gives you good search results.
Projects where the god-object or core concepts have generic names like "Tree", "Node" or other things that are used everywhere tend to be next to impossible to search with grep and friends.
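To see the gap concretely, here is a hedged sketch in Python: a plain regex for `Node` hits comments, strings and docstrings, while an AST walk counts only real symbol references:

```python
import ast
import re

src = '''\
# walk every Node in the tree
class Node:
    """A tree Node."""

def visit(node: Node):
    print("visiting Node")  # string, not a reference
'''

grep_hits = len(re.findall(r"Node", src))   # matches text, not symbols
tree = ast.parse(src)
ast_hits = sum(
    (isinstance(n, ast.Name) and n.id == "Node")
    or (isinstance(n, ast.ClassDef) and n.name == "Node")
    for n in ast.walk(tree)
)
print(grep_hits, ast_hits)   # grep finds 5 hits, only 2 are symbol references
```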
sigbottle 11 hours ago [-]
It's not intuitive to humans, even after learning parsing theory. I can do basic name refactorings. I've even written neovim plugins to do 1 specific thing with the AST (dfs down and delete one subtree which I understand). Those are fine.
I would not be comfortable doing an on-the-fly "rewrite all subtrees that match this pattern" kind of edit.
It seems like a tool that's good for LLM's though.
spullara 9 hours ago [-]
"rewrite all subtrees that match this pattern" works really well in JetBrains; they call it Structural Search and Replace.
lukeundtrug 8 hours ago [-]
Happened to have written both a tool and a blog post about the topic. It's more about the different technical approaches for solving the problem, but it might still interest you :)
I just realized that the fact that LLMs work so well for me in Clojure might be partly because of the clojure-mcp tools. They provide structural browsing and editing.
tmzt 8 hours ago [-]
Has anybody thought about encoding AST tokens as LLM tokens, similar to how different words can have different meanings and that's reflected in their embedding?
janalsncm 6 hours ago [-]
Language keywords are almost definitely individual tokens. But I think you mean more than that. Basically replacing identifiers with special tokens as well. It’s worth a shot but there’s some practical problems.
Immediate downside is that mapping variable name to token and back would probably require indexing the whole codebase. You’d need a 1:1 mapping for every name that was in scope, and probably need to be clever about disambiguating names that come in and out of scope.
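A hedged sketch of that indexing step (a hypothetical scheme, purely illustrative): interning per (scope, name) pairs shows why disambiguation matters, since the same surface name must map to different tokens in different scopes:

```python
import ast

def index_identifiers(src):
    """Assign a placeholder token id per (enclosing function, name) pair."""
    table = {}
    for fn in ast.walk(ast.parse(src)):
        if not isinstance(fn, ast.FunctionDef):
            continue
        for n in ast.walk(fn):
            if isinstance(n, ast.Name):
                key = (fn.name, n.id)
                table.setdefault(key, f"<ID{len(table)}>")
    return table

src = "def f(x):\n    y = x + 1\n    return y\n\ndef g(x):\n    return x\n"
table = index_identifiers(src)
# the same surface name `x` maps to different tokens in f and g
```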
messh 12 hours ago [-]
Anchor-based editing requires injecting new anchors into the context, and Dirac does so via a diff. So how is this more efficient (token-wise) than search and replace, even at a single token per hash? Also, code is read more than written, so these just add up. I experimented once with stable anchors, albeit longer than a single token, and found it a downgrade.
My conclusion is that the efficiency Dirac sees comes mainly from showing a file skeleton by default.
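For reference, the anchor mechanism in question, as a rough sketch (4-hex-char hashes here; Dirac reportedly uses single-token anchors): edits address a line by content hash, so they survive unrelated insertions that would invalidate plain line numbers:

```python
import hashlib

def line_hash(line):
    """Short, content-derived anchor for one line."""
    return hashlib.sha1(line.encode()).hexdigest()[:4]

def apply_edit(text, anchor, replacement):
    """Replace the unique line whose content hash matches `anchor`."""
    lines = text.splitlines()
    hits = [i for i, l in enumerate(lines) if line_hash(l) == anchor]
    assert len(hits) == 1, "anchor must match exactly one line"
    lines[hits[0]] = replacement
    return "\n".join(lines) + "\n"

src = "def add(a, b):\n    return a - b\n"
anchor = line_hash("    return a - b")
fixed = apply_edit(src, anchor, "    return a + b")
# the same edit still applies after an unrelated line is inserted above
shifted = apply_edit("# comment\n" + src, anchor, "    return a + b")
```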
hedgehog 9 hours ago [-]
I'm not sure one way or another but I've been using a related tool called Tilth by another poster here. It doesn't do anchor-based editing, but it does do syntax-aware search and will e.g. report the line range for function definitions, provide file outlines with line numbers on a file name match, etc.
I have six patches that I will at some point upstream; the main bug/surprise is that the .gitignore behavior is not what's documented, but even without that it seems to work quite well.
gchamonlive 8 hours ago [-]
> My conclusion is that the efficiency dirac sees comes mainly from showing file skeleton by default
how hard do you think it would be to bring this optimization to oh-my-pi and opencode? I am testing dirac and it's very cool, but the tooling isn't there yet compared to oh-my-pi in terms of UX.
GodelNumbering 4 hours ago [-]
Would love some more feedback on this. Where do you think are major gaps?
gchamonlive 3 hours ago [-]
Thinking back, I might have jumped the gun here. I can't objectively evaluate UX without spending more time with the tool. I'll try to daily drive it a bit before I can form an opinion.
jbellis 12 hours ago [-]
> Batches all operations. Does large number of reads/edits simultaneously...
I wasn't sure what this meant, so I looked at the source. It seems to be referring to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience btw, models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)
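In other words, roughly this shape (a hypothetical tool, not Dirac's actual API): one call takes a list parameter instead of hoping the model issues N parallel `read_file` calls:

```python
import json
import pathlib
import tempfile

def read_files(paths):
    """Batched tool: one call reads many files and returns all contents."""
    out = {}
    for p in paths:
        try:
            out[p] = pathlib.Path(p).read_text()
        except OSError as e:
            out[p] = f"<error: {e}>"   # per-file errors don't fail the batch
    return json.dumps(out)

# demo: one batched call replaces two single-file reads
d = pathlib.Path(tempfile.mkdtemp())
(d / "a.txt").write_text("alpha")
(d / "b.txt").write_text("beta")
result = json.loads(read_files([str(d / "a.txt"), str(d / "b.txt")]))
```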
verdverm 12 hours ago [-]
I think Anthropic may have mentioned this first; this pattern is also something my custom agent's tools are designed around, and I'm pretty sure I picked it up from them.
The agent would work even without a language parser; it's just that the AST-based functionality won't work
gavinray 12 hours ago [-]
Yes
sally_glance 12 hours ago [-]
Is there a complete list of the tools somewhere? I'm interested in how you chose to expose the AST specifically. In my own harness attempts I wanted to keep the number of tools absolutely minimal and briefly experimented with including an AST lib to use via an execute_python tool (plus some examples in the system prompt). Results were mixed though, with most models preferring ripgrep.
rgbrgb 11 hours ago [-]
It would be really cool to do a causality investigation to determine which one of these boosts it so much / quantify how much each matters. Who knows, they may all interact in a sum-is-greater-than-parts way that only improves the score when shipped altogether.
blurbleblurble 12 hours ago [-]
Did you consider incorporating ast-grep or gritql?
Congratulations, great work.
sally_glance 12 hours ago [-]
Can't speak for OP but I tried providing ast-grep in the execution context of an execute_bash tool, but even with pretty aggressive steering most models just don't seem to use it a lot. More expensive/SOTA models or higher reasoning increases the chances but lowers speed and raises cost. Maybe due to training bias for exploration tasks?
blurbleblurble 12 hours ago [-]
Yes, I've tried this passive approach too and didn't dig much further after that. I thought maybe they'd figured out something more intentional in the prompting to enable these kinds of approaches.
sally_glance 12 hours ago [-]
I have a hunch model proficiency for a given CLI tool very much correlates with how many StackOverflow answers and blog entries providing examples for it there are...
blurbleblurble 11 hours ago [-]
My sense is that we're at a tipping point where instruction following is getting good enough to disrupt these old habits
GodelNumbering 11 hours ago [-]
Not really, but interested in trying them out for a future version, especially gritql.
drakythe 10 hours ago [-]
How are the two-token anchors chosen when the initial 1700 single-token anchors run out? I'm assuming just a two-token combination from the 1700.
GodelNumbering 4 hours ago [-]
That's correct
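So, roughly this shape, as a sketch (the actual pairing and ordering in Dirac may differ):

```python
from itertools import product

def anchor_stream(vocab):
    """Yield single-token anchors first, then ordered pairs from the pool."""
    yield from vocab
    for a, b in product(vocab, repeat=2):
        yield a + b

vocab = [f"t{i}" for i in range(1700)]   # stand-in for the 1700 tokens
gen = anchor_stream(vocab)
first = [next(gen) for _ in range(1702)]
# 1700 singles, then pairs: capacity grows to 1700 + 1700**2 anchors
```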
deaux 12 hours ago [-]
1. Would be good to benchmark at least one other model from a different family to see if it indeed generalizes. Minimax 2.7 seems a good candidate to keep it affordable. Until then we can't really tell if it's just overfit on Gemini 3 Flash.
2. Until then your landing page needs to mention all the numbers are just from running on Gemini 3 Flash. Currently there's no mention at all of Gemini.
3. Assuming that cheaper also means faster in this case, where the model is equal? If so, then why not add this to the benchmarks to highlight another advantage: time until completion of the tasks. If it's the opposite and it takes longer (seems unlikely), then it would be transparent to note this.
4. Would be good to note if it does or does not support skills, (nested) AGENTS.md, MCP and so on for people considering migrating.
GodelNumbering 11 hours ago [-]
Good points.
1. I have been trying to benchmark open-weights models but keep running into timeouts due to slow inference (terminal bench tasks have strict timeouts that you are not allowed to modify). Posted my frustration here: https://www.reddit.com/r/LocalLLaMA/comments/1stgt39/the_fru...
2. Done (updated github readme)
3. Yes, on average the times were shorter, but I did not benchmark it because at random times the model outputs get slower, so it is not a rigorous benchmark
4. Added info on this too
deaux 8 hours ago [-]
1. Good point, didn't know about the timeouts, that's rough for the benchmarks. Though IMO they don't necessarily need to be "SWE-official" to have value, if the only difference is disabling those.
3. Maybe you could instead provide a measure of output tokens used (including thinking), as that's a reasonable proxy for speed. I guess input tokens would be similar unless the AST usage and hashes etc. increase them a lot? Seems unlikely.
mdasen 13 hours ago [-]
It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.
Is there a leaderboard out there comparing harness results using the same models?
manx 12 hours ago [-]
We probably want to compare the cartesian product of model+harness.
culi 10 hours ago [-]
Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized intelligence where more focus is placed on making the harness itself "smart"
dominotw 10 hours ago [-]
That would be counter to AI company goals. They want harness to be dumb and models to be smart so they can sell models.
History indicates you can't tool and harness your way to effectively competing against a smarter model with more compute.
satvikpendem 7 hours ago [-]
Not really. Anthropic for example sells both the harness and the models as a unified kit via Claude Code, it is in their best interest to make sure both parts work as well as possible, via reinforcement learning of previous usage as well for new model performance increases.
nikcub 5 hours ago [-]
the most cited is terminal bench 2.0, but it's also plagued by cheating accusations and benchmaxxing.
somewhat remarkably, claude code ranks last for Opus 4.6 - which may say something about cc, or say something about the benchmark
I really wish there was! I thought of even creating one but it would be conflict of interest
adyavanapalli 13 hours ago [-]
I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension api is quite extensive. Hash anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project and will be checking it out later. Cheers!
GodelNumbering 13 hours ago [-]
A few months ago one afternoon I was very frustrated with how slow Cline was being so decided to look under the hood. Decided to make a couple of changes. Got sucked in. About 70k lines of change, another 40k lines of deletions and two months later, here we are.
mring33621 11 hours ago [-]
The best kind of project. I'm trying this today. I've been happily using OpenCode so far.
satvikpendem 6 hours ago [-]
I've been looking into local LLMs and new harnesses recently. How good is Pi compared to OpenCode? I'm seeing claims that it's a lot better. What are the best models and customizations to fully utilize it?
adyavanapalli 11 hours ago [-]
I had a chance to look at this and noticed you were sending telemetry to an endpoint you control: https://dirac.run/v1/event. It doesn't seem like you're sending anything obviously sensitive or doing anything in bad faith (though, I do see api errors being sent, which could potentially leak sensitive info), but you gotta admit that that's scary seeing you as the sole dev for this. Plus, it's opt out too. Sorry, it's no go for me.
GodelNumbering 11 hours ago [-]
Thanks! Since it is a Cline fork, the telemetry mechanism is inherited. I left it as it might help debug issues. There is no evil purpose behind it nor does it create or store any PII
deviation 11 hours ago [-]
Here's all the telemetry:
1. Telemetry to dirac.run/v1/event — Sends machine ID, token usage, model info, events, errors (first 500 chars), and platform info. Hardcoded API key. Defaults to opt-in (setting is "unset", not "disabled").
2. Feature flags from dirac.run/v1/event/decide — Polls every 60 minutes with your machine ID. Always enabled, independent of telemetry opt-out. No way to disable without code changes.
3. Web tools route through api.dirac.run — Web search and web fetch tools proxy through Dirac's own API server, sending your request content plus system headers (platform, version, machine ID).
4. Model list fetches — Calls OpenRouter, HuggingFace, Groq, etc. for model listings even when using the Anthropic provider.
GodelNumbering 11 hours ago [-]
> Web tools route through api.dirac.run
This is something that needs to be deprecated entirely. The web fetch tool no longer is used or works. There is nothing even listening at api.dirac.run. This was the result of me stretching my capacity too thin and bulk renaming cline.bot to dirac.run
UPDATE (+1h): both Web search and web fetch tools are now nuked.
avereveard 13 hours ago [-]
"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens, the harness is what's being measured, the model is just the substrate.
that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, the way tool use obsoleted RAG context injection from question embeddings.
himata4113 12 hours ago [-]
That's why ARC-AGI-3 doesn't allow the use of a harness. The model has to create the harness instead.
grzracz 9 hours ago [-]
Seems completely backwards to me. This is like judging Formula 1 just by the raw power of the engine. The rest of the car has just as much engineering, if not more.
wyre 8 hours ago [-]
ARC-AGI is testing raw intelligence, like the raw power of a Formula 1 engine. The rest of the car is the harness.
gchamonlive 8 hours ago [-]
Maybe there is a complex relationship between harness, model and the emergent perceived intelligence we just can't access by isolating the model alone to evaluate "raw intelligence". I don't think it's absurd to imagine a model that by itself wouldn't be that impressive, but would outperform other models given the right harness. It's also not absurd to think of a model that has incredible raw intelligence, but would not scale much with different harnesses. Model performance given different scenarios depend a LOT on dataset and training strategies, so we need to account for these complex relationships, otherwise measuring "raw intelligence" would be the next AI benchmark that is purely for show.
vova_hn2 10 hours ago [-]
The model is not allowed to create a harness either, I think.
himata4113 7 hours ago [-]
it can, it just has to be within the same 'session', but it's mostly limited to scratch notes afaik since there's no python or bash. yeah, if there's no way to execute code there's no real way to build a harness.
gobdovan 12 hours ago [-]
Very interesting, especially the harness point: how much of performance is in the wrapper tools. (When I almost run out of credits, I change my model to a smaller one and give it more structured prompts; very often gpt-5.4-mini with structure works better than gpt-5.4 with vibes.)
This inspired me to start a "skill distillery" [0] where I take good agent workflow ideas and turn them into small, inspectable/installable skills.
The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone though; it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness. Just a small AST helper and the workflow discipline as a portable skill.
I also dogfooded it on the Dirac repo itself and included a short report.
Would appreciate feedback from the original author, if the prompts and tools [1] are representative.
I am using Dirac with Kimi 2.6 to refactor a Rust codebase. I have a Clean Architecture design which is being reinforced.
The scope of work is laid out in a Beads epic with sub-issues.
The planning was done with gpt-5.5, and gpt-5.5 is checking that the work is complete.
I have found that Dirac is more productive on large-codebase refactoring than OpenCode, which actually trashed the .rs file, and I had to revert the code.
amunozo 10 hours ago [-]
For gpt-5.5, do you use Dirac, too?
sally_glance 12 hours ago [-]
Great job and congrats! Working on my own harness has been one of my favorite side projects in the past couple of weeks, of course I never finish anything... But I'm very interested in your experience with the following:
1. Context management - specifically pruning old tool call responses, truncation of tool output and automatic compaction. Those have worked pretty great for me, benefits of reducing context greatly seem to outweigh gains from "remembering" everything. I always leave short summaries though.
2. "Subagents" - my latest attempts revolve around not exposing any tools for the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries this would automatically keep the parent agent context clean for much longer. Still experimenting though, writing prompts for subagents may also be too far outside of the current training sets.
GodelNumbering 11 hours ago [-]
Thanks.
1. Context management - Don't bother with pruning unless your API doesn't support caching. Every prune breaks the cache and you lose the 90% discounted caching rate
2. I did some work improving Cline's subagent feature that Dirac inherited. In my experience, not all models are trained effectively to delegate work, so YMMV. A common pitfall to watch is, what happens if one or more subagents get stuck in a loop or for whatever reason don't return? You need a mechanism to control them from the main agent
sally_glance 8 hours ago [-]
1. For me pruning is a bit less about cost than performance. Recent research suggests lower context size is nearly always better, and many harnesses implement a sliding window for tool output pruning. Also not every provider supports caching, and if they do it might have expired (especially on restored sessions).
2. That's a good hint, I'm currently only trying with tighter turn and token limits for subagents and an error summary on exceeding them. Not sure how else (besides steering and prompt engineering) to ensure the subagent doesn't go wild...
hedgehog 10 hours ago [-]
It depends where you prune and how the specific prefix cache you're targeting works. Pruning or condensing recent items that are unnecessary probably pays for itself.
bryanhogan 14 hours ago [-]
If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?
GodelNumbering 14 hours ago [-]
Yes, plan+act mode is one thing I loved about Cline!
Mashimo 14 hours ago [-]
Interesting. Would love a comparison to pi.dev (Not Ohmypi)
How does this perform in day to day coding tasks, outside of benchmarks?
The README has an eval of 8 tasks over 7 agents (including both pi and omp). pi-mono costs second-lowest across the 8 tasks (after Dirac) but occasionally produces incomplete changes.
Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST symbol understanding (e.g. find all instances of things that refer to this symbol and change them). Since pi relies on bash-type tooling, it missed some occurrences
howdareme 14 hours ago [-]
Going to assume you didn't capture the data, but could you add time taken to completion for each if you have it?
messh 13 hours ago [-]
re: bash-type tooling, it doesn't mean an agent can't use ASTs; using the tree-sitter CLI this should be perfectly possible
Cilvic 7 hours ago [-]
I assume that these benchmarks were done without any modifications to the default open-sourced harness. The tree-sitter CLI would be an extra plugin for pi-mono, but I'd be equally curious whether it would accomplish the task.
martinald 14 hours ago [-]
Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped for LSPs in Claude Code it turned out to be very underwhelming (for many of the reasons that they can be annoying in a "real" IDE, ie static analysis starts firing mid edit and complaining and cached analysis getting stuck).
Curious to know if this has been an issue with your AST approach on larger projects?
The hash line based numbering is very interesting too (though I see on Opus 4.5+ far far fewer editing errors).
I've often thought that even if model progress stopped today, we'd still have _years_ of improvements thru harness iteration.
GodelNumbering 14 hours ago [-]
Wrt LSP, it uses the default LSP mechanism of the IDE provider.
To keep performance fast, it stores the symbols DB (using SQLite) in the workspace's directory and incrementally updates it based on timestamps. Then it uses this DB to resolve symbol queries.
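Something along these lines (a toy reconstruction of the described mechanism, not Dirac's actual schema):

```python
import ast
import os
import sqlite3
import tempfile
from pathlib import Path

con = sqlite3.connect(":memory:")   # Dirac keeps this in the workspace dir
con.execute("CREATE TABLE symbols (name TEXT, file TEXT, line INT, mtime REAL)")

def extract(path):
    """Toy extractor: function/class definitions with their line numbers."""
    tree = ast.parse(Path(path).read_text())
    return [(n.name, n.lineno) for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.ClassDef))]

def reindex(con, path):
    """Re-parse a file only if its mtime is newer than the indexed rows."""
    mtime = os.path.getmtime(path)
    (latest,) = con.execute(
        "SELECT MAX(mtime) FROM symbols WHERE file = ?", (path,)).fetchone()
    if latest is not None and latest >= mtime:
        return False                       # index is fresh, skip the parse
    con.execute("DELETE FROM symbols WHERE file = ?", (path,))
    con.executemany("INSERT INTO symbols VALUES (?, ?, ?, ?)",
                    [(n, path, ln, mtime) for n, ln in extract(path)])
    return True

path = os.path.join(tempfile.mkdtemp(), "mod.py")
Path(path).write_text("def foo():\n    pass\n")
did_parse = reindex(con, path)        # first pass parses and indexes
skipped = not reindex(con, path)      # unchanged file: parse is skipped
```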
martinald 14 hours ago [-]
Yes I understand, but do you not have issues that it drifts out of date and confuses the agents (especially on longer running tasks)?
Like even "full" Visual Studio and Resharper have issues with this. Eg, you start editing file x, 'intellisense' runs, says there are loads of errors... because you haven't finished editing yet.
GodelNumbering 10 hours ago [-]
It does a before/after comparison. Fetch the LSP error state, apply all edits, fetch it again, diff
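Conceptually something like this (a toy sketch, with diagnostics as tuples rather than real LSP payloads):

```python
def introduced_by_edits(before, after):
    """Keep only the diagnostics that appeared after the edit batch."""
    return sorted(set(after) - set(before))

before = {("main.py", 10, "unused import os")}            # pre-existing noise
after = {("main.py", 10, "unused import os"),
         ("main.py", 42, "undefined name 'clinet'")}      # caused by the edits
new = introduced_by_edits(before, after)
```

Diffing the two snapshots filters out pre-existing warnings, so the agent only sees errors its own edits introduced.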
tuo-lei 13 hours ago [-]
same issue from the other side. when a human is editing, the LSP fires mid-keystroke and shows bogus errors for a second, whatever. with an agent doing 5 edits in a row, the symbol DB is always behind by one edit, so the next lookup pulls stale references. you can re-index synchronously after each edit but that kills the batching speed.
deviation 11 hours ago [-]
Nice work. I adopted this to use with my workplace's LLM proxy with a few small changes to the api/config files. Works flawlessly.
nzoschke 12 hours ago [-]
I haven't had great experiences with Gemini for coding yet. I'm doing reasonably simple full-stack Go apps. Tried Gemini CLI, Antigravity, Pi.
The problems I've experienced: it's less adept at picking the right bash commands to build and test the Go app, and it doesn't follow idiomatic Go or codebase patterns for changes.
A skill hasn’t helped much.
Will need to try this and open code next.
anandkrshnn 9 hours ago [-]
Really impressive results. The point about the harness mattering more than the model is spot on — we've seen similar patterns in our own work.
One thing that stood out to me is your use of hash-anchored edits + AST-based context selection. We're building something in a similar direction with the Sovereign AI Stack, but with a stronger focus on governance and verification.
Curious — did you run into issues with context drift when using AST queries on very large codebases? We found that combining it with incremental symbol DB updates helped a lot.
Congrats on the results!
all2 7 hours ago [-]
I'd be curious to hear more about your work on the 'Sovereign AI Stack'. I'm also working on a project that prioritizes governance and verification and I'd love to compare notes.
dbxfb 7 hours ago [-]
Sounds great, I saw your first post but haven't had the time to try it out.
It's unfortunate that this is explicitly a fork of Cline but the commit history stops at a gigantic "initial commit". There is not even the cline history + a giant "initial version of Dirac" commit on top, which is a bit sad :)
all2 7 hours ago [-]
Importing git history is ugly but do-able. I had to do that at a previous job (splitting a git repo in two pieces or importing commit history from SVN).
I can take a look and try to create a PR around this if there is interest.
nthypes 14 hours ago [-]
Couldn't OpenCode reach the same results just by developing this as a feature or plugin? Like anchored edits?
mdasen 13 hours ago [-]
Sure. Dirac is just a fork of the Cline harness and obviously OpenCode could take the same techniques and implement them. I don't know how difficult it would be to implement them in OpenCode, but given that Dirac and OpenCode are both open source, a future version of OpenCode could always be a re-branded Dirac (I'm sure there are ways to implement Dirac's techniques without having to completely replace OpenCode's underlying code base, but this illustrates that at the extreme, they could clearly just take Dirac in its entirety to get the same results).
_ink_ 5 hours ago [-]
Is it still advisable to use something like codebase-memory-mcp for large codebases, or is Dirac doing fine without that?
GodelNumbering 4 hours ago [-]
Dirac doesn't use any of that but the memory part may be something I explore in future
davidkunz 10 hours ago [-]
I would like if some of that functionality is extracted in CLI tools. Then every coding agent can use it.
blueTiger33 14 hours ago [-]
Starred it, will try it later. One question though, to make it simpler for me: on what tasks does this shine, and how does it improve the score?
I already use some skills to cut down CC costs, like caveman, rtk cli and a few others. Just want to understand.
GodelNumbering 13 hours ago [-]
I did limited testing using Sonnet on CC vs Sonnet on Dirac. I could not confirm the costs however
2001zhaozhao 9 hours ago [-]
Very cool and interesting direction. I'm interested to see how easy it is to extend the harness's language support.
scoopdewoop 12 hours ago [-]
The Hash-anchor edit guy! Sincerely great idea, I used it in my own toy harness to good effect. I just checked this out, never tried it before, and its great! Clearly a well-iterated design with good choices made.
It is so refreshing to see real FOSS and not a grift. Simple openrouter api key, and I'm going.
This is what I'm using from now on. You are doing the best work in this space.
Aeroi 12 hours ago [-]
harness definitely makes a difference for the benchmarks. I ran my agent Camera Search against a few benchmarks and was able to beat Opus 4.7.
I created a real-world benchmark for mining, oil & gas, construction etc. called FieldOps-bench, and it basically proves that vertical agents with specialized harnesses, tools, and systems still outperform SOTA models alone.
gchamonlive 8 hours ago [-]
hey there! thanks for the project!
I was intrigued with the claims so I wanted to test it myself.
Then I went in to see what's what, but there isn't support for gemini-cli login, and importing from opencode doesn't work, failing with the message "Something went wrong. Could not read API keys from OpenCode config.". `dirac auth --verbose` doesn't seem to do anything.
Sorry for reporting it here, but it seems that GitHub is throwing a tantrum again and your issues page's been knocked out.
It was able to login with my OpenAI sub though, so let's see how's that.
EDIT: heads-up, only gpt-5.4 seems to work; gpt-5.4-pro and gpt-5.5-2026-04-23 both throw API error 400, maybe through no fault of your own. OpenAI has been deliberately hindering third-party agents lately, as oh-my-pi ceased to work last week with all gpt models, either throwing an error or having a ludicrously low API rate limit.
GodelNumbering 4 hours ago [-]
Hey thanks for doing this! I will be looking into the gpt versions.
It doesn't support Gemini CLI login because Google seems to ban users for using it; there was a big controversy about this some time ago, so I decided to leave it alone for now. Also, feel free to reach out to me if you want to discuss anything specific
redrove 13 hours ago [-]
I keep trying to use dirac-cli with codex and it won't work: Error: Codex API error: Codex API request failed: 400.
Any ideas?
GodelNumbering 13 hours ago [-]
Assuming you logged in with OAuth, I am guessing you are trying to use gpt-5.5?
In my tests, it worked using gpt-5.4 for me and I assumed gpt-5.5 is not available to me because I am on the free plan
Do you have the subscription that allows 5.5? If so, I can look into what changed in API. Sorry I rarely use openAI so it is a bit of an untrodden path
redrove 12 hours ago [-]
Yes I'm on ChatGPT Pro (OAuth) and I'm trying to use gpt-5.5-xhigh.
That was the issue, 5.4 works just fine.
Support for service: priority (GPT /fast mode) would also be cool!
GodelNumbering 4 hours ago [-]
Will fix this soon. Please feel free to create a github issue in the meantime.
dur-randir 9 hours ago [-]
How do I connect it to a local llama.cpp instance?
GodelNumbering 7 hours ago [-]
It supports LMStudio or you can start a local endpoint, then run
Wow, looks very good. I'm wondering if you do any optimizations for the CLI in general, since you're not using MCP. I'm building my own CLI for AI agents and was always concerned with context rot.
snqb 14 hours ago [-]
how well does it do on frontier models like Opus 4.6?
GodelNumbering 13 hours ago [-]
I have only done functionality testing, no benchmark testing on Opus (decided to pay my rent instead)
aetherspawn 14 hours ago [-]
Sorry I couldn’t really figure out if this was a harness, a fine tuned model, or both. Can we use Qwen with this for example? Is the performance expected to be better in that case?
Since Dirac is a heavily modified fork of Cline, it supports all models Cline supported, including Qwen and all popular open/closed models
As a matter of fact, I am trying to run terminal bench 2.0 using some OSS models at the moment but the slow inference speeds are causing tasks to timeout
npodbielski 12 hours ago [-]
Ha! I had an idea to do something like that myself over the weekend, after trying Junie and Mistral to write some tests for my personal project. That took literally hours, because the Qwen 3.5 I am using locally needs about 10 minutes for a 10k-token prompt. Which should not be the case if the agent asked really simple questions like:
- what tool you need?
- what would be parameters for the tool
- what method you want to read?
instead of sending a few kilobytes of build output and waiting for a response.
Oh well... good thing someone already did that!
neonstatic 13 hours ago [-]
I am a bit confused. What languages does it help with? You mention AST manipulation, so I am assuming it's not universally applicable, e.g. to Rust?
1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)
2. Utilizes language's AST to decide what to fetch into context, entirely avoids large code file reads
3. Batches all operations. Does large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)
4. Allows the model to execute code to analyze things on the fly, so the model can simply write bash/python/perl script to accomplish things where appropriate
5. A lot of context curation and opportunistic context updates, i.e. put into context anything that you are certain model would ask next
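To make (1) concrete, here is a minimal sketch of the hash-anchor idea (my simplification based on the linked post; the real scheme also involves Myers diffs and single-token anchors, and may differ in detail): address each line by a short hash of its content rather than by a line number, so an edit stays valid even when the model's notion of line numbers drifts.

```python
import hashlib

def anchor(line: str, n: int = 6) -> str:
    """Short content hash used to address a line (sketch, not Dirac's exact scheme)."""
    return hashlib.sha256(line.encode()).hexdigest()[:n]

def apply_edit(text: str, target_anchor: str, replacement: str) -> str:
    """Replace the unique line whose anchor matches target_anchor."""
    lines = text.splitlines()
    matches = [i for i, l in enumerate(lines) if anchor(l) == target_anchor]
    if len(matches) != 1:
        # Ambiguous or stale anchor: refuse rather than guess.
        raise ValueError(f"anchor {target_anchor!r} matched {len(matches)} lines")
    lines[matches[0]] = replacement
    return "\n".join(lines)
```

The model only ever emits `(anchor, replacement)` pairs; if the file changed underneath it, the anchor fails loudly instead of silently editing the wrong line.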
Another annoying thing about plain grep: LLMs often end up pulling in bundled packages when grepping, where a single line is large enough to ruin the context window.
It's very effective in well-written and well-designed code bases, where concepts tend to be well formed enough not to share names with everything else, so grepping for symbols gives you good search results.
Projects where the god object or core concepts have generic names like "Tree" or "Node" that are used everywhere tend to be all but impossible to search with grep and friends.
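One cheap mitigation for the bundled-package problem mentioned above (my own workaround sketch, not necessarily what Dirac does): truncate matches on overlong lines, which are almost always minified or generated code, keeping only a small window around the match.

```python
def grep_file(path: str, needle: str, max_line_len: int = 500):
    """Yield (lineno, line) matches, truncating lines that would blow up the context."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f, 1):
            if needle in line:
                line = line.rstrip("\n")
                if len(line) > max_line_len:
                    # Likely minified/bundled code: keep a window around the match.
                    j = line.find(needle)
                    line = line[max(0, j - 40): j + len(needle) + 40] + " …[truncated]"
                yield i, line
```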
I would not be comfortable doing an on-the-fly "rewrite all subtrees that match this pattern" kind of edit.
It seems like a good tool for LLMs, though.
https://www.context-master.dev/blog/deterministic-semantic-c...
Let me know what you think.
Immediate downside is that mapping variable name to token and back would probably require indexing the whole codebase. You’d need a 1:1 mapping for every name that was in scope, and probably need to be clever about disambiguating names that come in and out of scope.
My conclusion is that the efficiency Dirac sees comes mainly from showing file skeletons by default.
https://github.com/jahala/tilth
How hard do you think it would be to bring this optimization to oh-my-pi and opencode? I am testing Dirac and it's very cool, but the tooling isn't there yet compared to oh-my-pi in terms of UX.
I wasn't sure what this meant, so I looked at the source. It seems to be referring to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience btw, models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)
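A toy illustration of that design (the tool name, schema, and index are mine, not Dirac's): the tool accepts a list of targets in one call, so one invocation covers what would otherwise require N parallel calls the model may be reluctant to emit.

```python
# Toy symbol index standing in for a real AST/symbol database.
INDEX = {
    ("app.py", "main"): "def main(): ...",
    ("lib.py", "helper"): "def helper(): ...",
}

def read_symbols(requests: list[dict]) -> list[dict]:
    """Batched tool: one call resolves many (file, symbol) targets.

    requests: [{"file": ..., "symbol": ...}, ...] -> one result dict per request.
    """
    out = []
    for req in requests:
        key = (req["file"], req["symbol"])
        if key in INDEX:
            out.append({"ok": True, "body": INDEX[key]})
        else:
            out.append({"ok": False, "error": f"{req['symbol']} not found"})
    return out
```

The per-request result objects matter: one missing symbol shouldn't fail the whole batch.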
Does that mean that it's only going to work with certain languages for which it has parsers available?
The agent would work even without a language parser; it's just that the AST-based functionality won't work.
Congratulations, great work.
2. Until then, your landing page needs to mention that all the numbers are just from running on Gemini 3 Flash. Currently there's no mention of Gemini at all.
3. Does cheaper also mean faster in this case, where the model is equal? If so, why not add time-to-completion to the benchmarks to highlight another advantage. If it's the opposite and it takes longer (seems unlikely), it would be transparent to note that.
4. Would be good to note if it does or does not support skills, (nested) AGENTS.md, MCP and so on for people considering migrating.
1. I have been trying to benchmark openweights models but keep running into timeouts due to slow inference (terminal bench tasks have strict timeouts that you are not allowed to modify). Posted my frustration here https://www.reddit.com/r/LocalLLaMA/comments/1stgt39/the_fru...
2. Done (updated github readme)
3. Yes, on average the times were shorter, but I did not benchmark it because model output gets slower at random times, so it would not be a rigorous benchmark.
4. Added info on this too
3. Maybe you could instead provide a measure of output tokens used (including thinking), as that's a reasonable proxy for speed. I guess input tokens would be similar, unless the AST usage and hashes etc. increase them a lot? Seems unlikely.
Is there a leaderboard out there comparing harness results using the same models?
History indicates you can't tool and harness your way to effectively competing against a smarter model with more compute.
Somewhat remarkably, Claude Code ranks last for Opus 4.6, which may say something about CC, or say something about the benchmark.
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0
1. Telemetry to dirac.run/v1/event — Sends machine ID, token usage, model info, events, errors (first 500 chars), and platform info. Hardcoded API key. On by default (the setting ships as "unset", not "disabled").
2. Feature flags from dirac.run/v1/event/decide — Polls every 60 minutes with your machine ID. Always enabled, independent of telemetry opt-out. No way to disable without code changes.
3. Web tools route through api.dirac.run — Web search and web fetch tools proxy through Dirac's own API server, sending your request content plus system headers (platform, version, machine ID).
4. Model list fetches — Calls OpenRouter, HuggingFace, Groq, etc. for model listings even when using the Anthropic provider.
This is something that needs to be deprecated entirely. The web fetch tool is no longer used and no longer works; there is nothing even listening at api.dirac.run. This was the result of me stretching my capacity too thin while bulk-renaming cline.bot to dirac.run.
UPDATE (+1h): both Web search and web fetch tools are now nuked.
That said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, just as tool use obsoleted RAG-style context injection from question embeddings.
This inspired me to start a "skill distillery" [0], where I take good agent workflow ideas and turn them into small, inspectable/installable skills.
The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone, though: it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness. Just a small AST helper and the workflow discipline as a portable skill.
I also dogfooded it on the Dirac repo itself and included a short report.
Would appreciate feedback from the original author, if the prompts and tools [1] are representative.
[0] https://github.com/ouatu-ro/skill-distillery
[1] https://github.com/ouatu-ro/skill-distillery/blob/main/skill...
1. Context management - specifically pruning old tool call responses, truncating tool output, and automatic compaction. These have worked pretty well for me; the benefits of reducing context seem to greatly outweigh the gains from "remembering" everything. I always leave short summaries, though.
2. "Subagents" - my latest attempts revolve around not exposing any tools to the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries, this automatically keeps the parent agent's context clean for much longer. Still experimenting, though; writing prompts for subagents may also be too far outside of the current training sets.
1. Context management - Don't bother with pruning unless your API doesn't support caching. Every prune breaks the cache and you lose the 90% discounted caching rate
2. I did some work improving Cline's subagent feature that Dirac inherited. In my experience, not all models are trained effectively to delegate work, so YMMV. A common pitfall to watch is, what happens if one or more subagents get stuck in a loop or for whatever reason don't return? You need a mechanism to control them from the main agent
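To put rough numbers on the caching point in (1) above (the prices and the 90% cache discount here are illustrative assumptions, not any provider's actual rates): a prune that shrinks the context but invalidates the cached prefix can cost several times more than leaving the context alone.

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_tok: float = 3e-6, cache_discount: float = 0.9) -> float:
    """Illustrative input cost with prefix caching (prices are made up)."""
    uncached = total_tokens - cached_tokens
    return uncached * price_per_tok + cached_tokens * price_per_tok * (1 - cache_discount)

# 100k-token context fully cached, vs pruned to 80k with the cache broken:
kept = input_cost(100_000, cached_tokens=100_000)   # everything hits the cache
pruned = input_cost(80_000, cached_tokens=0)        # prune rewrote the prefix
```

Under these assumptions the pruned request costs 8x the fully cached one, despite being 20% shorter, which is the gist of "every prune breaks the cache".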
2. That's a good hint, I'm currently only trying with tighter turn and token limits for subagents and an error summary on exceeding them. Not sure how else (besides steering and prompt engineering) to ensure the subagent doesn't go wild...
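A minimal sketch of such a control mechanism (my own structure, not Cline's or Dirac's): cap the subagent's turns and cumulative tokens, and surface an error summary to the parent when a cap trips, so a looping subagent can never wedge the main agent.

```python
class SubagentBudgetExceeded(Exception):
    pass

def run_subagent(step_fn, max_turns: int = 10, max_tokens: int = 20_000) -> str:
    """Drive a subagent until it returns a summary or exhausts its budget.

    step_fn() -> (done, summary_or_partial_text, tokens_used_this_turn)
    """
    used = 0
    for turn in range(1, max_turns + 1):
        done, text, tokens = step_fn()
        used += tokens
        if done:
            return text
        if used > max_tokens:
            raise SubagentBudgetExceeded(
                f"stopped after {turn} turns / {used} tokens; last output: {text[:200]}")
    raise SubagentBudgetExceeded(f"no result after {max_turns} turns ({used} tokens)")
```

The parent catches `SubagentBudgetExceeded` and folds the error summary into its own context instead of hanging forever.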
How does this perform in day to day coding tasks, outside of benchmarks?
The README has an eval of 8 tasks over 7 agents (including both pi and omp). Pi-mono costs second lowest across the 8 tasks (after Dirac) but occasionally produces incomplete changes.
Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST symbol understanding (e.g., find all instances of things that refer to this symbol and change them). Since pi relies on bash-type tooling, it missed some occurrences.
Curious to know if this has been an issue with your AST approach on larger projects?
The hash line based numbering is very interesting too (though I see on Opus 4.5+ far far fewer editing errors).
I've often thought that even if model progress stopped today, we'd still have _years_ of improvements through harness iteration.
For AST, it uses tree-sitter WASMs (ships them with the package), and maintains queries (https://github.com/dirac-run/dirac/tree/master/src/services/...)
To keep performance fast, it stores the symbols DB (using sqlite) in the workspace's directory and incrementally updates it based on timestamps. Then it uses this DB to resolve symbol queries
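That incremental pattern can be sketched in a few lines (a simplification of mine; Dirac's actual schema and parser integration live in the repo): store each file's mtime next to its symbols and reparse only files whose mtime changed.

```python
import os, sqlite3

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS files(path TEXT PRIMARY KEY, mtime REAL);
        CREATE TABLE IF NOT EXISTS symbols(name TEXT, path TEXT, line INT);
    """)
    return db

def index_file(db: sqlite3.Connection, path: str, parse) -> bool:
    """Reparse `path` only if its mtime changed. `parse(path)` -> [(name, line), ...]."""
    mtime = os.stat(path).st_mtime
    row = db.execute("SELECT mtime FROM files WHERE path=?", (path,)).fetchone()
    if row and row[0] == mtime:
        return False  # up to date, skip the (expensive) parse
    db.execute("DELETE FROM symbols WHERE path=?", (path,))
    db.executemany("INSERT INTO symbols VALUES (?,?,?)",
                   [(name, path, line) for name, line in parse(path)])
    db.execute("INSERT OR REPLACE INTO files VALUES (?,?)", (path, mtime))
    return True
```

Symbol queries then hit the sqlite index directly, and the cost of keeping it fresh is proportional to the number of files that actually changed.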
Like, even "full" Visual Studio and ReSharper have issues with this. E.g., you start editing file x, 'intellisense' runs and says there are loads of errors... because you haven't finished editing yet.
The problems I've experienced: it is less adept at picking the right bash commands to build and test the Go app, and it doesn't follow idiomatic Go or code base patterns for changes.
A skill hasn’t helped much.
Will need to try this and opencode next.
One thing that stood out to me is your use of hash-anchored edits + AST-based context selection. We're building something in a similar direction with the Sovereign AI Stack, but with a stronger focus on governance and verification.
Curious — did you run into issues with context drift when using AST queries on very large codebases? We found that combining it with incremental symbol DB updates helped a lot.
Congrats on the results!
It's unfortunate that this is explicitly a fork of Cline but the commit history stops at a gigantic "initial commit". There is not even the cline history + a giant "initial version of Dirac" commit on top, which is a bit sad :)
I can take a look and try to create a PR around this if there is interest.
It is so refreshing to see real FOSS and not a grift. A simple OpenRouter API key, and I'm going.
This is what I'm using from now on. You are doing the best work in this space.
I created a real-world benchmark for mining, oil & gas, construction, etc., called FieldOps-bench, and it basically proves that vertical agents and specialized harnesses, tools, and systems still outperform SOTA models alone.
I was intrigued with the claims so I wanted to test it myself.
First I (vibe)made an AUR package I could use to install it from git source, from master: https://aur.archlinux.org/packages/dirac-cli-git
Then I went in to see what's what, but there isn't support for gemini-cli login, and importing from opencode doesn't work, failing with the message "Something went wrong. Could not read API keys from OpenCode config.". `dirac auth --verbose` doesn't seem to do anything.
Sorry for reporting it here, but it seems that GitHub is throwing a tantrum again and your issues page's been knocked out.
It was able to login with my OpenAI sub though, so let's see how's that.
EDIT: heads-up, only gpt-5.4 seems to work; gpt-5.4-pro and gpt-5.5-2026-04-23 both throw API error 400, maybe through no fault of your own. OpenAI has been deliberately hindering third-party agents lately; oh-my-pi ceased to work last week with all GPT models, either throwing an error or hitting a ludicrously low API rate limit.
OPENAI_COMPATIBLE_CUSTOM_KEY="xxx" dirac -y --provider "https://localhost/v1" --model <model_name> "hi..."
Harness was https://www.npmjs.com/package/dirac-cli
Forge Code is awesome and I plan to test Dirac, too.