TommasoMoroHtx
Contributor

Description

This pull request fixes a bug in the get() method of the lancedb document store where documents retrieved by ID were returned in an arbitrary order (typically insertion order), rather than in the order of the input list of IDs.

As a result, when vector retrieval results were paired with documents via zip(docs, scores), each score could be attached to the wrong document. Queries returned low-relevance documents that did not correspond to the top vector matches. Because of insertion order, these were often chunks from the first pages of a document rather than the most relevant content.

Consider the following code:

_, scores, ids = self.vector_store.query(
    embedding=emb, top_k=top_k_first_round, **kwargs
)
docs = self.doc_store.get(ids)
result = [
    RetrievedDocument(**doc.to_dict(), score=score)
    for doc, score in zip(docs, scores)
]

This PR modifies get() to return documents in the same order as ids, ensuring that the score-document mapping remains accurate.
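
For illustration only, here is a minimal sketch of the reordering idea, not the code from this PR. The helper name `reorder_by_ids`, the `id` key, and the sample rows are hypothetical; plain dictionaries stand in for the rows a store might return in insertion order.

```python
from typing import Dict, List


def reorder_by_ids(rows: List[dict], ids: List[str], id_key: str = "id") -> List[dict]:
    """Return rows arranged to match the order of ids (hypothetical helper).

    Ids that have no matching row are skipped.
    """
    # Index rows by id so each lookup is O(1).
    by_id: Dict[str, dict] = {row[id_key]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# Hypothetical rows, stored in insertion order.
rows = [
    {"id": "a", "text": "page 1"},
    {"id": "b", "text": "page 2"},
    {"id": "c", "text": "page 3"},
]

# Ids and scores as a vector query might return them, best match first.
ids = ["c", "a"]
scores = [0.91, 0.42]

# A lookup that filters by id but keeps insertion order yields "a" before "c".
fetched = [row for row in rows if row["id"] in ids]

# Reordering restores the query order, so zip pairs each score correctly.
ordered = reorder_by_ids(fetched, ids)
paired = list(zip([row["text"] for row in ordered], scores))
```

Note that this sketch silently skips ids with no matching row, which would shift the later documents against their scores in zip(docs, scores). A caller relying on positional pairing may prefer to raise an error or keep a placeholder in that case.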

Type of change

  • New features (non-breaking change).
  • Bug fix (non-breaking change).
  • Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

  • I have performed a self-review of my code.
  • I have added thorough tests if it is a core feature.
  • There is a reference to the original bug report and related work.
  • I have commented on my code, particularly in hard-to-understand areas.
  • The feature is well documented.

@phv2312 phv2312 merged commit 833982a into Cinnamon:main Jun 5, 2025
@chunlampang chunlampang mentioned this pull request Jun 9, 2025