Who's wrong, the Human or the LLM?

2024-08-28 - What happens when the LLM turns out to be too creative in a data analytics task?

Recently we received some very interesting feedback from one of the very early users of AskYourData.

They told us something like:

AskYourData is not reliable at answering the question:
"Show the athletes with Italian-sounding names."

AskYourData's goal has always been to be a data analyst companion using natural language.

We have worked (and are still working) hard to make AskYourData's answers as reliable as possible.

So this was an interesting challenge to dig into.

A challenge

The user was right.

On that question, AskYourData dropped from an average of more than 95% correct and reliable answers to about 50%.

The answers alternated between three or four different outcomes that seemed to be picked at random.

Looking at the generated code made the root cause clear.

The LLM was simply not sure how to compute something as vague as "Italian-sounding", so it got creative, proposing different methods and generating different outcomes.

We confirmed this by adding a note to the question:

"Show the athletes with Italian-sounding names. Use whether the last character of the name is a vowel."

Suddenly the answers became stable, with the usual success rate.

What happened? We simply gave the LLM a rule to use in that particular context, a formula to apply. And it followed it correctly.
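Once the rule is pinned down, it is trivial to express in code, which is why the answers stopped varying. A minimal sketch in plain Python (the sample names and the function name are illustrative, not AskYourData's actual generated code):

```python
# The user's disambiguating rule: a name "sounds Italian"
# when its last character is a vowel.
VOWELS = set("aeiou")

def italian_sounding(name: str) -> bool:
    """Return True when the last character of the name is a vowel."""
    return name.strip().lower()[-1] in VOWELS

# Hypothetical sample data for illustration.
athletes = ["Rossi", "Smith", "Bianco", "Johnson"]
matches = [n for n in athletes if italian_sounding(n)]
# matches == ["Rossi", "Bianco"], on every run
```

With an explicit formula there is nothing left for the model to improvise, so every generation converges on the same computation.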

What we learned

From this investigation we learned an important lesson:

The LLM should take these ambiguous cases into account and handle them properly, by introducing a "confidence" level into the equation.

What we changed

After this exploration, we've updated the Chat component to check its confidence before answering. If its confidence in providing a reliable answer is too low, it informs the user and asks for more information or instructions.
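The gating logic can be sketched as follows. This is a simplified illustration, assuming a hypothetical confidence estimate and threshold, not AskYourData's actual implementation:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, not the real value

def answer_or_clarify(question: str, confidence: float, generate) -> str:
    """Answer only when the confidence estimate is high enough;
    otherwise ask the user for a rule or more context."""
    if confidence < CONFIDENCE_THRESHOLD:
        return ("I'm not confident I can answer this reliably. "
                "Could you specify the rule or formula to use?")
    return generate(question)

# Usage sketch with a stubbed-out answer generator.
reply = answer_or_clarify(
    "Show the athletes with Italian-sounding names.",
    confidence=0.5,
    generate=lambda q: "…generated analysis…",
)
# reply asks the user for clarification instead of guessing
```

The design choice here is to fail loudly: a request for clarification costs the user one message, while a silently improvised answer costs their trust.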

This is the first step towards a better integration between a traditional tool and LLM capabilities.

Adding a chat interface is only the beginning. Designing guardrails, not only to avoid prompt injection and similar issues but also to better respect the user during cognitive activities, is both essential and the most challenging task.

Do you want to try exploring data in natural language? (it's dope!)

Go to the App