Dataset cache name doesn't seem to include the tables used when cache_dir is set. #833

@jhnwu3

Description

TL;DR:

if self._cache_dir is None:
    # Only creates UUID with tables if NO cache_dir provided
    id_str = json.dumps({
        "root": self.root,
        "tables": sorted(self.tables),  # <-- Only used here
        "dataset_name": self.dataset_name,
        "dev": self.dev,
    })
    cache_dir = Path(...) / str(uuid.uuid5(uuid.NAMESPACE_DNS, id_str))
else:
    # If cache_dir IS provided explicitly, just use it as-is
    cache_dir = Path(self._cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

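When the user specifies cache_dir, no UUID is computed at all: the path is used as-is, so self.tables, self.dev, and self.dataset_name play no part in where the cache lives. Two dataset instances that share a cache_dir but differ in tables (or in dev or dataset_name) therefore resolve to the same cache location, which can cause confusion during dataset initialization about why certain downstream tasks fail.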
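One possible direction, sketched below, is to always compute the UUID and treat a user-supplied cache_dir as the base directory rather than the final path. This is a minimal standalone sketch, not the library's actual code: `resolve_cache_dir` and its `default_base` parameter are hypothetical names introduced here for illustration, and the real default base path (elided as `Path(...)` in the snippet above) is not reproduced.

```python
import json
import uuid
from pathlib import Path
from typing import Optional, Sequence


def resolve_cache_dir(
    root: str,
    tables: Sequence[str],
    dataset_name: str,
    dev: bool,
    cache_dir: Optional[str] = None,
    default_base: str = ".cache_sketch",  # hypothetical placeholder, not the real default
) -> Path:
    """Return a cache path that always depends on the dataset configuration."""
    # Same fingerprint as in the issue snippet: sorted tables make the
    # result independent of the order in which tables were listed.
    id_str = json.dumps(
        {
            "root": root,
            "tables": sorted(tables),
            "dataset_name": dataset_name,
            "dev": dev,
        }
    )
    config_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, id_str))

    # Assumption: a user-supplied cache_dir becomes the parent directory,
    # and the configuration UUID is appended as a subdirectory.
    base = Path(cache_dir) if cache_dir is not None else Path(default_base)
    return base / config_id


if __name__ == "__main__":
    full = resolve_cache_dir("/data", ["patients", "admissions"], "demo", False, cache_dir="/tmp/cache")
    fewer = resolve_cache_dir("/data", ["patients"], "demo", False, cache_dir="/tmp/cache")
    print(full != fewer)  # different table sets no longer share a cache path
```

With this layout, changing the tables, the dev flag, or the dataset name under the same cache_dir yields a different subdirectory. The trade-off is that the user-provided path is no longer the exact cache location, which may matter to anyone relying on the current behavior.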

Labels: bug (Something isn't working), core (Core functionality: Patient API, BaseDataset, event stream format, etc.)
