Fts milvus dataset #705

Denise2004 · 2026-01-28T07:39:59Z

This PR migrates the FTS test dataset source from local files to remote automatic download.

Main Achievements

New IR_DATASETS Data Source: Integrated ir_datasets library to enable automatic download and conversion of MS MARCO datasets

Ground Truth Evaluation Optimization: Switched FTS evaluation ground truth from qrels to scoreddocs for more comprehensive retrieval result assessment

Core Features

1. Remote Dataset Download

Added IR_DATASETS data source type
Integrated ir_datasets.load("msmarco-passage/dev/small") API
Automatic download and conversion to TSV format files
Support for queries.small.tsv, docs.small.tsv, qrels.small.tsv, scoreddocs.small.tsv

2. Ground Truth Data Optimization

Switched evaluation ground truth from qrels (human-annotated relevant documents) to scoreddocs (BM25 top-1000 retrieval results)

3. Environment Configuration Optimization

Set IR_DATASETS_HOME and IR_DATASETS_TMP environment variables
Redirect dataset cache and temporary files to project tmp directory

sre-ci-robot · 2026-01-28T07:40:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Denise2004
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

XuanYang-cn

Code Review: FTS (Full-Text Search) Implementation for Milvus

Thanks for adding FTS benchmarking support! The overall approach follows existing patterns well. A few issues to address before merging:

Critical Issues (Must Fix)

1. Duplicated code in vectordb_bench/models.py:347-360

The if/else branches have identical try/except blocks - only the index_value assignment differs. Refactor to:

if case_config.get("case_id") in [502, 503, 504]:
    index_value = IndexType.FTS_AUTOINDEX

try:
    task_config["db_case_config"] = db.case_config_cls(index_type=index_value)(**raw_case_cfg)
except Exception:
    log.exception(f"Couldn't get class for index '{index_value}' ({full_path})")
    task_config["db_case_config"] = EmptyDBCaseConfig(**raw_case_cfg)

2. Magic numbers for case IDs (models.py:348)

if case_config.get("case_id") in [502, 503, 504]:

Case IDs 502 and 504 are referenced but only 503 (FTSmsmarcoPerformance) is defined in CaseType. Consider using enum values instead of magic numbers, or remove undefined IDs.

Important Issues (Should Fix)

3. Resource leak in FtsDocumentIterator (dataset.py)

The file handle opened in __next__ may leak on exception. __del__ is unreliable for cleanup. Consider making it a context manager or ensuring proper cleanup.

4. Assertion used for validation (milvus/milvus.py)

assert self.col is not None

Replace with proper validation that raises RuntimeError - assertions can be disabled in production.

5. Hardcoded dataset name in _get_ir_datasets_name() (dataset.py)

Returns "msmarco-passage/dev/small" regardless of self.data properties. This won't scale when additional FTS datasets are added.

Minor Suggestions

Missing period in description (cases.py:668-670) - "...MS MARCO dataset " should end with a period
Chinese comments in milvus/config.py:441-475 - consider translating for consistency
Unused logging in milvus/cli.py - log = logging.getLogger(__name__) is defined but not used
BM25 parameter validation - comments document ranges [1.2, 2.0] for bm25_k1 and [0.0, 1.0] for bm25_b, but no validators enforce them
Unnecessary hasattr checks in milvus/config.py - these are class attributes, so hasattr always returns True

Positives

Good separation of FTS classes from vector search classes
Follows existing codebase patterns well
Proper context manager usage in Milvus init()
Comprehensive metric support (MRR, FTS-specific recall/NDCG)
Good bug fixes for Qdrant logging included

milvus fts bm25

e4ca67c

Add FTS dataset upload from ir

956a26c

Denise2004 force-pushed the fts-milvus-dataset branch from 15625a4 to 956a26c Compare January 28, 2026 08:56

Merge branch 'zilliztech:main' into fts-milvus-dataset

32c6202

XuanYang-cn requested changes Jan 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fts milvus dataset #705

Fts milvus dataset #705

Uh oh!

Denise2004 commented Jan 28, 2026

Uh oh!

sre-ci-robot commented Jan 28, 2026

Uh oh!

XuanYang-cn left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fts milvus dataset #705

Are you sure you want to change the base?

Fts milvus dataset #705

Uh oh!

Conversation

Denise2004 commented Jan 28, 2026

Main Achievements

Core Features

1. Remote Dataset Download

2. Ground Truth Data Optimization

3. Environment Configuration Optimization

Uh oh!

sre-ci-robot commented Jan 28, 2026

Uh oh!

XuanYang-cn left a comment

Choose a reason for hiding this comment

Code Review: FTS (Full-Text Search) Implementation for Milvus

Critical Issues (Must Fix)

Important Issues (Should Fix)

Minor Suggestions

Positives

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants