In recent months, one of the common scenarios for using AI has been building an agent that answers your questions by looking into your data. The data source could be SQL based or, for example, a Power BI semantic model. I will not elaborate in this blog on how to build such an agent. My intention is to address one important aspect of a data agent: reliability.
Data Agent trust challenges
We built a data agent for one customer on top of a Power BI model. However, one of the unpleasant aspects of using an LLM to create queries (in our case DAX) is that it can produce different DAX for the same user question. If you repeat the same question three times, it is quite probable you will get three more or less different DAX queries. There are methods to increase consistency in the agent implementation, but you cannot eliminate the creativity of the LLM.
Another perspective is the choice of model or model version as such. I would like to test my agent with different models and compare cost and reliability, because I naturally want to pay as little as possible.
For those reasons at least, you should implement automated testing on top of your agents.
It will not give you 100% confidence that the agent never hallucinates, but your level of confidence should be as high as possible.
Agent overview
This is a high-level overview of agent architecture.

The agent receives a message (question) from the user, translates the natural language into a DAX query based on the model metadata, and executes the DAX against the model. It then takes the result set and interprets it for the user (table, chart, description, …).
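To make the flow concrete, here is a minimal Python sketch of such a pipeline. It is not our actual implementation – the three callables (generate_dax, execute_dax, interpret_result) are hypothetical placeholders standing in for the LLM calls and the query execution against the semantic model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentAnswer:
    dax: str          # the generated DAX query
    rows: list[dict]  # the result set returned by the model
    summary: str      # the interpretation presented to the user


def answer_question(
    question: str,
    model_metadata: str,
    generate_dax: Callable[[str, str], str],             # LLM call: (question, metadata) -> DAX
    execute_dax: Callable[[str], list[dict]],             # runs the DAX against the semantic model
    interpret_result: Callable[[str, list[dict]], str],   # LLM call: describe/format the result set
) -> AgentAnswer:
    """Hypothetical data-agent pipeline: question -> DAX -> result set -> interpretation."""
    dax = generate_dax(question, model_metadata)   # 1. translate natural language to DAX
    rows = execute_dax(dax)                        # 2. execute the DAX query against the model
    summary = interpret_result(question, rows)     # 3. interpret the result for the user
    return AgentAnswer(dax=dax, rows=rows, summary=summary)
```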
Testing
There are several aspects to testing AI agents, such as performance, safety of the answers, and so on. We will focus on the most important aspect of a data agent – is the agent giving me correct data?
You basically have two options for how to do it:
- Get the dataset out of the agent – for example, our agent has an option to download the answer as CSV.
- Get the query behind the dataset. (Our agent can give us the DAX query which is executed when the user asks a question.)
We chose option 2 for several reasons:
- We can run the comparison at the same point in time (reducing the risk of the underlying data changing).
- We save some tokens (as we do not run the agent's steps of interpreting and formatting the data).
So, this is what we need for testing:
- A set of test questions with correct, curated DAX queries as answers.
- An agent which can give us DAX for natural language questions.
- A script which iterates through the test questions, gets the DAX from the agent, and builds the CAT testing project file.
- A tool which compares the datasets for us – CAT.
- Reporting on top of the test results.
We could draft the overall solution with testing via CAT like this:

In the end, the solution is simple: we have natural language test questions and curated DAX answers. The script iterates over the agent and gets the DAX for each question. So now we have two DAX queries per question:
- The curated DAX from the test set
- The DAX from the agent
When we are done calling the agent and have the agent's DAX for all test questions, the script creates a YAML file (the project file for CAT). Each test then compares the dataset returned by the curated DAX against the dataset returned by the agent's DAX and visualizes the result in a dashboard like the one below.
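A minimal sketch of that script, in Python, could look like the following. The ask_agent_for_dax callable and the YAML keys are assumptions for illustration only; the real project file has to follow whatever schema your version of CAT expects. Each test-set entry pairs a question with its curated DAX (an example entry is shown in the next section).

```python
from typing import Callable

import yaml  # PyYAML; assumed here only for writing the project file


def build_cat_project(
    test_set: list[dict],
    ask_agent_for_dax: Callable[[str], str],  # hypothetical hook into the agent's DAX generation
    output_path: str = "cat_project.yaml",
) -> None:
    """Collect the agent's DAX for every test question and write an illustrative CAT-style project file."""
    tests = []
    for case in test_set:
        agent_dax = ask_agent_for_dax(case["question"])  # the agent's answer to the test question
        tests.append(
            {
                "name": case["id"],
                # Compare the dataset returned by the curated DAX with the one
                # returned by the agent's DAX (keys are illustrative, not CAT's schema).
                "expected_query": case["curated_dax"],
                "actual_query": agent_dax,
            }
        )
    with open(output_path, "w", encoding="utf-8") as f:
        yaml.safe_dump({"tests": tests}, f, sort_keys=False)
```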

Traps of Testing Data Agent
As the agent is creative when we allow it to be, we need to be quite specific in the questions. For example, the following test question is not good enough:
Which opportunities did Joe Smith win in season 2025 S2?
Why it is not good enough:
- It does not specify which attributes to return from Opportunity, so sometimes the agent gives you just the Opportunity Name and sometimes more columns. This would make the test fail, as the curated DAX is fixed.
- It does not specify sorting. (It may not matter to you as a person, but CAT needs the dataset to be sorted for comparison.)
Therefore, the right version of this question for testing could be:
Which opportunities did Joe Smith win in season 2025 S2? Bring only the Opportunity name and sort by it ascending.
This question will give us a more consistent answer.
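For illustration, such a refined test case could be stored in the test set together with its curated DAX. The table and column names below ('Opportunity', [Owner], [Status], [Season]) are hypothetical; the real curated query depends entirely on your semantic model. The only important points are that it returns just the Opportunity Name and sorts it explicitly.

```python
# Illustrative test-set entry; table and column names are hypothetical.
TEST_CASE = {
    "id": "opportunities_won_joe_smith_2025_s2",
    "question": (
        "Which opportunities did Joe Smith win in season 2025 S2? "
        "Bring only the Opportunity name and sort by it ascending."
    ),
    # Curated DAX that returns only the Opportunity Name, sorted ascending,
    # so the dataset comparison in CAT is deterministic.
    "curated_dax": """
EVALUATE
CALCULATETABLE(
    VALUES('Opportunity'[Opportunity Name]),
    'Opportunity'[Owner] = "Joe Smith",
    'Opportunity'[Status] = "Won",
    'Opportunity'[Season] = "2025 S2"
)
ORDER BY 'Opportunity'[Opportunity Name] ASC
""",
}
```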
After a successful test run, you can review the results in the test dashboard and identify why any failing tests failed.
Summary
Automated testing is a crucial part of developing and maintaining a data agent throughout its whole lifecycle. Testing and monitoring of results for agents in production should be a mandatory process, at least weekly if not daily, so we can take corrective action if anything changes.

