Skip to content

Clarification on differences and overlaps among dataset subsets #3

@yileitu

Description

@yileitu

Hi and thank you for releasing the Infinity-Instruct dataset!

I noticed several subsets listed on Hugging Face and in the dataset viewer (see screenshot below), including:

  • 0625 (660k rows)
  • 3M (3.46M rows)
  • 7M (7.45M rows)
  • 7M_core (7.45M rows)
  • 7M_domains (7.45M rows)
  • Gen (~1.4M rows)

Could you please clarify the following:

  1. What are the differences among these subsets?
  2. Are there overlaps between them? For example, is 3M a subset of 7M?
  3. What distinguishes 7M, 7M_core, and 7M_domains, especially given they all have 7.45M rows?
  4. What exactly is the Gen subset? Does it include examples from the above sets, or is it a separate dataset?

Any documentation or summary would be greatly appreciated. Thanks again for your great work!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions