best strategy when using add_dataset #17

@edg1983

Hi,

Thanks for the great tool. I'm really interested in the concept, and I'm trying to optimize the process to analyze a large dataset (up to 1M cells). Digging into the repository, I found you provide the add_dataset.py script, which should help analyze large numbers of cells by allowing the analysis to be split into chunks.

However, it seems that each chunk must then be analyzed sequentially, so the run time remains impractical. If I understand correctly, for each new chunk I have to provide the folder of previous results, the mtx files for all previously analyzed chunks, and the new chunk; the tool then apparently performs exactly the same process as in a single-chunk run, so each new chunk takes the same time to analyze. My current loop looks roughly like the sketch below.
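
For reference, this is approximately what I am running now. The flags, paths, and chunk layout here are placeholders for illustration, not the actual add_dataset.py command line:

```python
import subprocess
from pathlib import Path

# Placeholder layout: one 1000-cell mtx file per chunk.
chunks = sorted(Path("chunks").glob("chunk_*.mtx"))
results = Path("results")

for i, chunk in enumerate(chunks):
    # Each run is fed the previous results folder plus ALL earlier
    # mtx files, so every chunk repeats the full analysis (~1 h each).
    previous = [str(c) for c in chunks[:i]]
    subprocess.run(
        ["python", "add_dataset.py",
         "--previous-results", str(results),
         "--new-chunk", str(chunk),
         *previous],
        check=True,
    )
```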

In my case, a 1000-cell chunk takes about 1 h with 32 cores (the maximum I can use per node); at that rate, the full 1M-cell dataset would take ~1000 h (~42 days).

Am I doing something wrong here?

Is there any way to analyze all the small chunks independently in parallel and then "merge" the results? The sketch below illustrates the kind of workflow I have in mind.
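
To illustrate, assuming purely independent per-chunk runs were valid and some merge step existed, I imagine something like the following. The --out flag and merge_results.py are hypothetical, and the process pool just stands in for submitting separate cluster jobs:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(chunk: Path) -> Path:
    """Hypothetical independent single-chunk run (flags are made up)."""
    out = Path("results") / chunk.stem
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "add_dataset.py",
         "--new-chunk", str(chunk),
         "--out", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    chunks = sorted(Path("chunks").glob("chunk_*.mtx"))
    # Stand-in for independent cluster jobs, one per chunk.
    with ProcessPoolExecutor() as pool:
        partial = list(pool.map(analyze, chunks))
    # Imaginary merge step: this is exactly what I am asking about.
    subprocess.run(
        ["python", "merge_results.py", *map(str, partial)],
        check=True,
    )
```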

Thanks!
