We trained our own models for some of them, and combined them with some well-known NLP metrics (like Gruen [1]) to make this work.
You're right that it's hard to know how much to "trust" these metrics. But you shouldn't treat them as an objective number about your app's performance. They're more a way to detect deltas - regressions or changes in performance. When you get more alerts or more negative results (or fewer alerts / fewer negative results), you can tell whether things are getting worse or better. And in my view this applies to tools like RAGAS as well as our own metrics.
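To make the delta idea concrete, here's a minimal sketch (all names and thresholds are hypothetical, not from any particular tool): score the same test set on two versions of your app, count how many outputs fall below an alert threshold, and compare the counts rather than reading either score as an absolute quality measure.

```python
# Hypothetical sketch of delta-based eval: metric scores are treated as
# regression signals between runs, not absolute quality numbers.
# The 0.5 threshold and the score values below are illustrative.

def count_alerts(scores, threshold=0.5):
    """Count outputs whose metric score falls below the alert threshold."""
    return sum(1 for s in scores if s < threshold)

def detect_delta(baseline_scores, candidate_scores, threshold=0.5):
    """Compare alert counts across two eval runs of the same test set.

    A positive delta means the candidate triggers more alerts than the
    baseline (a likely regression); a negative delta suggests improvement.
    """
    return count_alerts(candidate_scores, threshold) - count_alerts(baseline_scores, threshold)

# Example: per-output scores from a metric scaled to [0, 1]
baseline = [0.82, 0.61, 0.45, 0.90, 0.73]
candidate = [0.80, 0.40, 0.44, 0.88, 0.38]

print(detect_delta(baseline, candidate))  # 2 -> two more alerts than baseline
```

The point is the comparison: run the same inputs through both versions, and watch the direction of the delta rather than the raw score.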
[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...