best strategy when using add_dataset #17

@edg1983

Hi,

Thanks for the great tool. I'm really interested in the concept, and I'm trying to optimize the process to analyze a large dataset (up to 1M cells). Digging into the repository, I found you provide the add_dataset.py script, which should help analyze large numbers of cells by allowing the analysis to be split into chunks.

However, it seems that each chunk must then be analyzed sequentially, so the run time remains impractical. If I understand correctly, for each new chunk I have to provide the folder of previous results, the mtx files for all previously analyzed chunks, and the new chunk; the tool then apparently performs exactly the same process as in a single-chunk run, so each new chunk takes the same time to analyze. My current loop looks roughly like the sketch below.
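
For reference, this is approximately what I am running now. The flags, paths, and chunk layout here are placeholders for illustration, not the actual add_dataset.py command line:

```python
import subprocess
from pathlib import Path

# Placeholder layout: one 1000-cell mtx file per chunk.
chunks = sorted(Path("chunks").glob("chunk_*.mtx"))
results = Path("results")

for i, chunk in enumerate(chunks):
    # Each run is fed the previous results folder plus ALL earlier
    # mtx files, so every chunk repeats the full analysis (~1 h each).
    previous = [str(c) for c in chunks[:i]]
    subprocess.run(
        ["python", "add_dataset.py",
         "--previous-results", str(results),
         "--new-chunk", str(chunk),
         *previous],
        check=True,
    )
```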

In my case, a 1000-cell chunk takes about 1 h with 32 cores (the maximum I can use per node); at that rate, the full 1M-cell dataset would take ~1000 h (~42 days).

Am I doing something wrong here?

Is there any way to analyze all the small chunks independently in parallel and then "merge" the results? The sketch below illustrates the kind of workflow I have in mind.
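
To illustrate, assuming purely independent per-chunk runs were valid and some merge step existed, I imagine something like the following. The --out flag and merge_results.py are hypothetical, and the process pool just stands in for submitting separate cluster jobs:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(chunk: Path) -> Path:
    """Hypothetical independent single-chunk run (flags are made up)."""
    out = Path("results") / chunk.stem
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "add_dataset.py",
         "--new-chunk", str(chunk),
         "--out", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    chunks = sorted(Path("chunks").glob("chunk_*.mtx"))
    # Stand-in for independent cluster jobs, one per chunk.
    with ProcessPoolExecutor() as pool:
        partial = list(pool.map(analyze, chunks))
    # Imaginary merge step: this is exactly what I am asking about.
    subprocess.run(
        ["python", "merge_results.py", *map(str, partial)],
        check=True,
    )
```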

Thanks!
