Hi and thank you for releasing the Infinity-Instruct dataset!
I noticed several subsets listed on Hugging Face and in the dataset viewer (see screenshot below), including:
- 0625 (660k rows)
- 3M (3.46M rows)
- 7M (7.45M rows)
- 7M_core (7.45M rows)
- 7M_domains (7.45M rows)
- Gen (~1.4M rows)
Could you please clarify the following:
- What are the differences among these subsets?
- Are there overlaps between them? For example, is 3M a subset of 7M?
- What distinguishes 7M, 7M_core, and 7M_domains, especially given that they all have 7.45M rows?
- What exactly is the Gen subset? Does it include examples from the above sets, or is it a separate dataset?
Any documentation or summary would be greatly appreciated. Thanks again for your great work!