Skip to content

Conversation

@Denise2004
Copy link
Contributor

This PR migrates the FTS test dataset source from local files to remote automatic download.

Main Achievements

New IR_DATASETS Data Source: Integrated ir_datasets library to enable automatic download and conversion of MS MARCO datasets

Ground Truth Evaluation Optimization: Switched FTS evaluation ground truth from qrels to scoreddocs for more comprehensive retrieval result assessment

Core Features

1. Remote Dataset Download

  • Added IR_DATASETS data source type
  • Integrated ir_datasets.load("msmarco-passage/dev/small") API
  • Automatic download and conversion to TSV format files
  • Support for queries.small.tsv, docs.small.tsv, qrels.small.tsv, scoreddocs.small.tsv

2. Ground Truth Data Optimization

  • Switched evaluation ground truth from qrels (human-annotated relevant documents) to scoreddocs (BM25 top-1000 retrieval results)

3. Environment Configuration Optimization

  • Set IR_DATASETS_HOME and IR_DATASETS_TMP environment variables
  • Redirect dataset cache and temporary files to project tmp directory

@sre-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Denise2004
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Collaborator

@XuanYang-cn XuanYang-cn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: FTS (Full-Text Search) Implementation for Milvus

Thanks for adding FTS benchmarking support! The overall approach follows existing patterns well. A few issues to address before merging:


Critical Issues (Must Fix)

1. Duplicated code in vectordb_bench/models.py:347-360

The if/else branches have identical try/except blocks - only the index_value assignment differs. Refactor to:

if case_config.get("case_id") in [502, 503, 504]:
    index_value = IndexType.FTS_AUTOINDEX

try:
    task_config["db_case_config"] = db.case_config_cls(index_type=index_value)(**raw_case_cfg)
except Exception:
    log.exception(f"Couldn't get class for index '{index_value}' ({full_path})")
    task_config["db_case_config"] = EmptyDBCaseConfig(**raw_case_cfg)

2. Magic numbers for case IDs (models.py:348)

if case_config.get("case_id") in [502, 503, 504]:

Case IDs 502 and 504 are referenced but only 503 (FTSmsmarcoPerformance) is defined in CaseType. Consider using enum values instead of magic numbers, or remove undefined IDs.


Important Issues (Should Fix)

3. Resource leak in FtsDocumentIterator (dataset.py)

The file handle opened in __next__ may leak on exception. __del__ is unreliable for cleanup. Consider making it a context manager or ensuring proper cleanup.

4. Assertion used for validation (milvus/milvus.py)

assert self.col is not None

Replace with proper validation that raises RuntimeError - assertions can be disabled in production.

5. Hardcoded dataset name in _get_ir_datasets_name() (dataset.py)

Returns "msmarco-passage/dev/small" regardless of self.data properties. This won't scale when additional FTS datasets are added.


Minor Suggestions

  • Missing period in description (cases.py:668-670) - "...MS MARCO dataset " should end with a period
  • Chinese comments in milvus/config.py:441-475 - consider translating for consistency
  • Unused logging in milvus/cli.py - log = logging.getLogger(__name__) is defined but not used
  • BM25 parameter validation - comments document ranges [1.2, 2.0] for bm25_k1 and [0.0, 1.0] for bm25_b, but no validators enforce them
  • Unnecessary hasattr checks in milvus/config.py - these are class attributes, so hasattr always returns True

Positives

  • Good separation of FTS classes from vector search classes
  • Follows existing codebase patterns well
  • Proper context manager usage in Milvus init()
  • Comprehensive metric support (MRR, FTS-specific recall/NDCG)
  • Good bug fixes for Qdrant logging included

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants