The biggest problem with retrieval is actually semantic relevance. I think most embedding models don't really capture sentence-level semantic content and instead act more like bag-of-words models averaging local word-level information.
Consider this simple test I’ve been running:
Anchor: “A background service listens to a task queue and processes incoming data payloads using a custom rules engine before persisting output to a local SQLite database.”
Option A (Lexical Match): “A background service listens to a message queue and processes outgoing authentication tokens using a custom hash function before transmitting output to a local SQLite database.”
Option B (Semantic Match): “An asynchronous worker fetches jobs from a scheduling channel, transforms each record according to a user-defined logic system, and saves the results to an embedded relational data store on disk.”
Any decent LLM (e.g., Gemini 2.5 Pro, GPT-4/5) immediately recognizes that the Anchor and Option B describe the same concept, just with different words. But when I test embedding models like gemini-embedding-001 (currently top of MTEB), they consistently rate Option A as more similar by cosine similarity. They’re getting tricked by surface-level word overlap.
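You can see the failure mode in miniature with a literal bag-of-words model. This is just a toy sketch (it says nothing about how gemini-embedding-001 works internally), but it shows why cosine similarity over word-overlap-dominated vectors favors Option A:

```python
# Toy bag-of-words cosine similarity: vectors are raw word counts,
# so surface word overlap dominates the score.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

anchor = ("A background service listens to a task queue and processes incoming "
          "data payloads using a custom rules engine before persisting output "
          "to a local SQLite database")
option_a = ("A background service listens to a message queue and processes "
            "outgoing authentication tokens using a custom hash function before "
            "transmitting output to a local SQLite database")
option_b = ("An asynchronous worker fetches jobs from a scheduling channel, "
            "transforms each record according to a user-defined logic system, and "
            "saves the results to an embedded relational data store on disk")

print(bow_cosine(anchor, option_a))  # high: heavy word overlap
print(bow_cosine(anchor, option_b))  # low: almost no shared words
```

If an embedding model's scores track this toy baseline rather than the LLM's judgment, that's evidence it is behaving more like averaged word vectors than a sentence-level encoder.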
I put together a small GitHub repo that uses ChatGPT to generate and test these “semantic triplets”:
I’m not sure what the “biggest” problem is, but I do think diversity is vastly underappreciated compared to relevance.
You can have maximally relevant search results that are horrible, because most users (and LLMs) want to understand the range of options, not just one type of relevant option.
Searching for “shoes” and only seeing athletic shoes is a bad experience. You’ll sell more shoes, and keep the user engaged, if you show a diverse range of shoes.
I liked how Karpathy explained part of this problem as "silent collapse" in his recent Dwarkesh podcast: the models tend to fall into a local minimum where they reuse a few output wording templates for a large number of similar questions, and this lack of entropy becomes a tough, hard-to-detect problem when doing distillation or synthetic data generation in general. These algorithms, packaged as nice Python functions, are also useful when repurposed for labeling parts of ontologies, topic clusters, etc. [1]. Will definitely star and keep an eye on the repo!
Nice, I actually read that Jina article when it was published, but forgot they use facility location as well! The saturated coverage algorithm looks pretty interesting, I'll have a look at how feasible it would be to add that to Pyversity.
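For anyone unfamiliar with the facility-location objective mentioned here, a minimal greedy sketch looks like this (this is a generic illustration of the objective, not the actual Pyversity or Jina implementation; the similarity matrix is made up):

```python
# Greedy facility-location selection: pick k items so that every candidate
# is as similar as possible to at least one selected item, i.e. maximize
# sum_j max_{i in S} sim[i, j]. The greedy algorithm gives a (1 - 1/e)
# approximation because the objective is submodular.
import numpy as np

def facility_location_select(sim: np.ndarray, k: int) -> list[int]:
    n = sim.shape[0]
    selected: list[int] = []
    covered = np.zeros(n)  # best similarity of each candidate to the selected set
    for _ in range(k):
        # marginal coverage gain of adding each candidate
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf  # never re-pick
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

# Items 0 and 1 are near-duplicates; item 2 is distinct.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(facility_location_select(sim, 2))  # → [0, 2]
```

The greedy step picks the item that best "covers" still-uncovered candidates, which is why it skips the near-duplicate and jumps to the distinct item.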
Producing a diverse list of results may still help in a couple of ways here.
* If there are a lot of lexical matches, real semantic matches may still be in the list but far down the list. A diverse set of, say, 20 results may have a better chance of including a semantic match than the top 20 results by some score.
* There might be a lot of semantic matches, but a vast majority of the semantic matches follow a particular viewpoint. A diverse set of results has a better chance of including the viewpoint that solves the problem.
Yes, semantic matching is important, but this is solving an orthogonal and complementary problem. Both are important.
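One standard way to produce the kind of "diverse top 20" described above is Maximal Marginal Relevance (MMR), which trades relevance off against redundancy with already-selected results. A minimal sketch, with invented similarity scores (in practice these would come from your embedding model and retrieval scores):

```python
# MMR: score(i) = lam * relevance(i) - (1 - lam) * max similarity to
# anything already selected. lam = 1 reduces to plain relevance ranking.
import numpy as np

def mmr(query_sims: np.ndarray, doc_sims: np.ndarray,
        k: int, lam: float = 0.7) -> list[int]:
    selected: list[int] = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((doc_sims[i, j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates (both very relevant); doc 2 is a
# different but still-relevant viewpoint.
query_sims = np.array([0.90, 0.85, 0.50])
doc_sims = np.array([[1.00, 0.95, 0.10],
                     [0.95, 1.00, 0.10],
                     [0.10, 0.10, 1.00]])
print(mmr(query_sims, doc_sims, k=2))  # → [0, 2]
```

With lam = 0.7, the second pick skips the near-duplicate doc 1 in favor of the distinct doc 2, which is exactly the "include the other viewpoint" behavior the bullet points describe.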
This seems like a good template to generate synthetic data, with positive/negative examples, allowing an embedding model to be aligned more semantically to underlying concepts.
Anyways, I'd hope reranking models do better, have you tried those?
This really feels like a missed opportunity to build something genuinely new, something that actually plays to the strengths of LLMs, instead of just embedding a fixed set of app screens inside chat.
Ideally, users should be able to describe a task, and the AI would figure out which tools to use, wire them together, and show the result as an editable workflow or inline canvas the user can tweak. Frameworks like LlamaIndex’s Workflow or LangGraph already let you define these directed graphs manually in Python where each node can do something specific, branch, or loop. But the AI should be able to generate those DAGs on the fly, since it’s just code underneath.
And given that LLMs are already quite good at generating UI code and following a design system (see v0.app), there’s not much reason to hardcode screens at all. The model can just create and adapt them as needed.
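To make the "it's just code underneath" point concrete, here is a hypothetical sketch of the kind of artifact an LLM could emit: nodes as plain functions, dependencies as a mapping, and a tiny runner. This is not the LlamaIndex Workflow or LangGraph API; all node names and logic are invented for illustration.

```python
# A workflow DAG as plain data: node name -> function, node -> dependencies.
# An LLM generating this on the fly only has to emit the functions and the
# edge mapping; the runner is fixed.
from graphlib import TopologicalSorter

def fetch(_state):     return {"raw": [3, 1, 2]}
def transform(state):  return {"clean": sorted(state["raw"])}
def summarize(state):  return {"summary": f"{len(state['clean'])} items, max {state['clean'][-1]}"}

nodes = {"fetch": fetch, "transform": transform, "summarize": summarize}
edges = {"transform": {"fetch"}, "summarize": {"transform"}}  # node -> deps

def run(nodes, edges):
    state: dict = {}
    # static_order() yields dependencies before dependents
    for name in TopologicalSorter(edges).static_order():
        state.update(nodes[name](state))
    return state

print(run(nodes, edges)["summary"])  # → "3 items, max 3"
```

Because the whole workflow is ordinary data plus functions, it can also be rendered back to the user as an editable graph, which is the "inline canvas the user can tweak" idea.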
Really hope Google doesn’t follow OpenAI down this path.
Actually these giant companies have proven innovation is impossible. Any company that tries just gets stepped on by the bigger papa company stealing their idea and putting them out of business.
(Also read the documentation, they specifically mention that you can tell it to create new flow paths)
Can you please add support to add descriptions of each column and enumerated types?
For example, if a column contains 0 or 1 encoding the absence or presence of something, LLMs need to know what 0 and 1 stand for. Same goes for column names, because they can be cryptic in production databases.
This is not a difficult problem to solve. We can add the schema, columns, and column descriptions to the system prompt. It can significantly improve performance.
All it will take is a form where the user supplies details about each column and relation. For some reason, most LLM based apps don't add this simple feature.
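The "form where the user supplies details" idea boils down to rendering user-supplied column notes into the system prompt. A minimal sketch; the table/column names, enum meanings, and rendering format here are all invented for illustration:

```python
# User-supplied column documentation (what the hypothetical form collects).
schema_notes = {
    "users.is_act": {
        "description": "Whether the account is active",
        "values": {0: "inactive / soft-deleted", 1: "active"},
    },
    "users.reg_src": {
        "description": "Acquisition channel the user signed up through",
        "values": {1: "organic", 2: "paid ad", 3: "referral"},
    },
}

def render_schema_notes(notes: dict) -> str:
    """Render column docs into a block for the system prompt."""
    lines = ["Column reference (use these meanings when writing SQL):"]
    for col, info in notes.items():
        lines.append(f"- {col}: {info['description']}")
        for value, meaning in info.get("values", {}).items():
            lines.append(f"    {value} = {meaning}")
    return "\n".join(lines)

system_prompt = "You are a SQL assistant.\n\n" + render_schema_notes(schema_notes)
print(system_prompt)
```

This keeps the column glossary out of the user's question and in the system prompt, so every query the model writes sees the same decoding of cryptic names and enum values.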
It’s not a difficult problem to solve; I did it last year, with 3.5, and it didn’t help. That’s not to say that newer models wouldn’t do better, but I have tried this approach. It is a difficult problem to actually get working.
I have not tried it on a very complex database myself, so I can't comment on how well it will work in production systems. I have tried this approach with a single BigQuery table and it worked pretty well for my toy example.
If by 3.5 you mean ChatGPT 3.5 you should absolutely try it with newer models, there is a huge difference in capabilities.
Yes, ChatGPT 3.5, this testing was a while back. I’m sure it has improved but I doubt it’s solid enough for me to trust.
Example/clean/demo datasets it does very well on. Incredibly impressive, even. But on real-world schema/data for an app developed over many years, it struggled. Even when I could finally prompt my way into getting it to work for one type of query, my other queries would randomly break.
It would have been easier to just provide tools for hard-coded queries if I wanted to expose a chat interface to the data.