
How Solita BenchForge for Databricks Genie turns Genie Space curation into a benchmark-driven workflow

Mohamed Elnamoury, Machine Learning Engineer

Published 05 May 2026

Reading time 5 min

In many conversational analytics deployments, the hardest part is not making changes to the solution. The hardest part is knowing whether the changes actually helped.

At Solita, we have been working on a new accelerator for governed, production Genie deployments called Solita BenchForge for Databricks Genie. The goal is to help teams manage Genie Space quality more systematically by introducing a benchmark-driven workflow around curation, revision, execution, and comparison over time. The benchmark loop stands on its own; optimization techniques like GEPA (Genetic‑Pareto Optimizer) can optionally be layered on top once teams want to automate revisions.

BenchForge helps teams answer a simple but critical question: did this Genie Space change actually improve user outcomes – without breaking something else?

Teams can always add another instruction, another sample question, or another example. What they usually lack is a disciplined way to evaluate whether those changes improved the Genie Space, preserved existing behavior, or introduced regressions elsewhere. BenchForge is built around that problem.

The core idea is simple: treat Genie Space curation as a measurable workflow rather than a manual tuning exercise.

This does not mean changing Genie’s underlying model. It means improving the controllable context around Genie – instructions, examples, semantic guidance, and benchmark feedback – and then measuring the effect. 

Why this matters

Conversational analytics quality is often framed as a model problem. In practice, a large part of the user experience is shaped by the curation layer around the model:

  • how the space is described
  • which sample questions are provided
  • how business terms are clarified
  • how well benchmark questions reflect real user intent

When those elements are changed without a consistent evaluation loop, teams are left with intuition, ad hoc testing, and anecdotal feedback. That makes it hard to scale quality across domains.

Solita BenchForge for Databricks Genie addresses that by turning Genie Space improvement into a benchmark-driven operating model.

The operating model

At a high level, the accelerator follows a simple pattern: export → patch → run → inspect → measure

Instead of treating a Genie Space as a static configuration artifact, BenchForge treats it as an evaluation surface that can be revised, tested, and compared over time.

This framing matters because it makes the system understandable in engineering terms:

  • inputs shape how Genie should behave
  • the loop applies and evaluates a candidate revision
  • outputs determine whether the revision is kept, rejected, or refined further (see the sketch below)
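
To make that concrete, here is a minimal sketch of one curation step in Python. Everything in it is a hypothetical placeholder rather than the BenchForge API; the stub functions simply stand in for the export → patch → run → inspect → measure steps.

```python
# Illustrative sketch only: hypothetical stubs, not the BenchForge API.

def export_space(space_id: str) -> dict:
    """Stub: capture the current Genie Space configuration as the baseline."""
    return {"instructions": "Report revenue in EUR.", "sample_questions": []}

def apply_patch(space: dict, patch: dict) -> dict:
    """Stub: apply a controlled context revision (instructions, examples)."""
    return {**space, **patch}

def run_benchmark(space: dict, questions: list[str]) -> dict[str, float]:
    """Stub: replay benchmark questions against a space, scoring each in [0, 1]."""
    return {q: 1.0 for q in questions}

def curation_step(space_id: str, patch: dict, questions: list[str]) -> bool:
    baseline = export_space(space_id)               # export
    candidate = apply_patch(baseline, patch)        # patch
    before = run_benchmark(baseline, questions)     # run against the baseline
    after = run_benchmark(candidate, questions)     # run against the candidate
    # inspect: this is where query and result artifacts would be reviewed
    # measure: keep the revision only if nothing regressed and something improved
    no_regression = all(after[q] >= before[q] for q in questions)
    improved = sum(after.values()) > sum(before.values())
    return no_regression and improved
```

A real run would of course record richer evidence than a boolean, but the decision shape is the point: every revision is evaluated against the same benchmark before it is kept.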

What the accelerator actually does

Solita BenchForge for Databricks Genie is designed to give teams a controlled lifecycle around Genie Space curation.

Its core capabilities are:

  1. Inspect the current Genie Space state: capture the current space configuration and treat it as the baseline
  2. Apply controlled context updates: revise the parts of the Genie Space teams can actually influence, such as instructions, sample questions, and related context
  3. Run benchmark questions against the updated space: evaluate the revised space against a fixed contract rather than relying on informal spot checks (see the sketch below)
  4. Inspect runtime artifacts: review the query and result evidence behind each answer path, focusing on observable behavior rather than Genie internals
  5. Compare before and after behavior: determine whether the revision improved quality, stayed neutral, or caused a regression

This is the heart of the accelerator: a repeatable evaluation loop around Genie Space curation.
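
As one illustration of what the fixed contract in step 3 could contain, a benchmark entry might pair a user-style question with the evidence expected from the answer path. The field names below are assumptions made for this sketch, not BenchForge's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative only: one possible shape for a benchmark entry; the field
# names are assumptions for this sketch, not BenchForge's actual schema.
@dataclass
class BenchmarkEntry:
    question: str                              # phrased the way real users ask
    expected_sql_contains: list[str]           # fragments the generated SQL should include
    expected_row_count: int | None = None      # optional check on the result shape
    tags: list[str] = field(default_factory=list)  # e.g. ["finance"] or ["supply-chain"]

entry = BenchmarkEntry(
    question="What was net revenue per region last quarter?",
    expected_sql_contains=["net_revenue", "GROUP BY region"],
    tags=["finance"],
)
```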

Why regression detection is a feature, not a failure

One of the most important properties of a system like this is the ability to reject a bad revision. If every change is assumed to be progress, then the workflow is not really measuring quality – it is just documenting edits. A benchmark-driven loop becomes valuable when it can detect that a candidate revision made one part of the space better, another part worse, or simply failed to generalize. A change might, for example, improve finance queries while breaking supply-chain terminology.

That is why before/after comparison matters so much in BenchForge. The system is designed to create evidence, not just activity.
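
A small, self-contained sketch shows how a per-domain comparison can surface exactly that failure mode; the scores and tags below are invented for illustration.

```python
from collections import defaultdict

# Invented before/after scores per benchmark question, tagged by domain.
before = {"q1": 0.6, "q2": 0.7, "q3": 0.9, "q4": 0.8}
after  = {"q1": 0.9, "q2": 0.9, "q3": 0.5, "q4": 0.4}
tags   = {"q1": "finance", "q2": "finance", "q3": "supply-chain", "q4": "supply-chain"}

deltas = defaultdict(list)
for q in before:
    deltas[tags[q]].append(after[q] - before[q])

for domain, ds in deltas.items():
    mean_delta = sum(ds) / len(ds)
    verdict = "regression" if mean_delta < 0 else "improvement"
    print(f"{domain}: {mean_delta:+.2f} ({verdict})")
# finance: +0.25 (improvement)
# supply-chain: -0.40 (regression) -> evidence to reject or refine the revision
```

An aggregate score over all four questions would have hidden the supply-chain regression entirely; breaking the comparison down per domain is what turns the run into evidence.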

Optional: automating candidate generation

Solita BenchForge for Databricks Genie does not depend on DSPy or GEPA to be useful; the core loop described in the previous sections stands on its own. GEPA becomes valuable when teams want to automate candidate generation more systematically: it can sit on top of the benchmark loop as an optimization engine that proposes stronger revisions to the controllable context around Genie.

That can happen through DSPy, a prompt-optimization framework, or through a standalone GEPA path. Either way, the benchmark loop remains the foundation.
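
For teams that take the DSPy route, the integration might look roughly like the sketch below. It assumes DSPy's GEPA optimizer interface, which can differ between versions, and the model name, program, metric, and examples are all illustrative placeholders rather than BenchForge internals.

```python
import dspy

# Assumed setup: any DSPy-supported endpoint; the model name is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Hypothetical stand-in for the controllable context around Genie: GEPA
# evolves this module's instructions, not Genie's underlying model.
program = dspy.Predict("question -> sql")

# Tiny illustrative benchmark; real entries would mirror the space's contract.
trainset = [
    dspy.Example(
        question="What was net revenue per region last quarter?",
        sql="SELECT region, SUM(net_revenue) FROM sales GROUP BY region",
    ).with_inputs("question"),
]

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Placeholder scoring for the sketch: a real metric would compare query
    # results, the way the benchmark loop does, not just the SQL text.
    return float(pred.sql.strip().lower().startswith("select"))

# GEPA proposes revised instructions and keeps the candidates that score
# better on the benchmark; `auto` sets the optimization budget.
optimizer = dspy.GEPA(metric=metric, auto="light",
                      reflection_lm=dspy.LM("openai/gpt-4o-mini"))
optimized = optimizer.compile(program, trainset=trainset, valset=trainset)
```

If the exact GEPA signature differs in your DSPy version, the surrounding idea still holds: the optimizer only ever sees the benchmark metric, so the benchmark loop remains the source of truth.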

The broader engineering pattern behind this approach is already visible in public writing from industry teams. Databricks and Dropbox have both published technical accounts showing how prompt or instruction optimization can be improved when it is tied to measurable objectives and explicit evaluation.

Those examples are useful reference points for the pattern. They are not BenchForge results, but they do reinforce the idea that evaluation-driven optimization is a real engineering discipline, not just a prompt-writing habit.

What this means for teams using Genie

For teams deploying conversational analytics on governed enterprise data, this model changes the conversation.

Instead of asking, “How do we manually tune this space again?” teams can ask:

  • what is the baseline?
  • what exactly changed?
  • what did the benchmark show?
  • should this revision of the Genie Space be accepted or rejected?

That makes Genie Space improvement more:

  • programmatic
  • measurable
  • repeatable
  • governable

What Solita BenchForge for Databricks Genie introduces

Solita BenchForge for Databricks Genie introduces a benchmark-driven loop around existing Genie Spaces and supports evidence-based comparison of candidate revisions. Think of it as a benchmark-driven curation and evaluation system.

The most important idea behind BenchForge is not that it adds more automation for its own sake. It is that Genie Space curation can be treated as a measurable engineering workflow. And once teams can compare revisions against a benchmark instead of relying on intuition alone, conversational analytics becomes easier to improve, easier to govern, and easier to scale.

We at Solita are happy to discuss how this fits your current Genie deployment.