Contributor

@tenpercent tenpercent commented Jan 16, 2026

Summary

Replace the recursive sequence_map_inverse implementation, whose template instantiation depth is O(N), with a pack-expansion implementation whose depth is O(1).

Approach

  • Use a constexpr loop in find_source_index to locate each inverse index of the permutation
  • Expand the results with a pack expansion, which keeps template instantiation depth at O(1)

Why It Works

Template recursion requires N template instantiations for N iterations, each with its own overhead. Constexpr loops execute within a single template instantiation, avoiding per-instantiation overhead.
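
To make the contrast concrete, here is a minimal C++17 sketch, not the code from this PR. The local Sequence and index_t definitions, find_recursive, and invert are assumptions made for illustration; only the name find_source_index is borrowed from the description above.

```cpp
#include <type_traits>
#include <utility>

using index_t = int; // assumption: a signed index type

template <index_t... Is>
struct Sequence {};

// Recursive style: locating Target instantiates one class template
// specialization per element inspected.
template <index_t Target, index_t Pos, index_t... Vs>
struct find_recursive;

template <index_t Target, index_t Pos>
struct find_recursive<Target, Pos>
{
    static constexpr index_t value = -1; // pack exhausted, not found
};

template <index_t Target, index_t Pos, index_t V, index_t... Vs>
struct find_recursive<Target, Pos, V, Vs...>
{
    static constexpr index_t value =
        (V == Target) ? Pos : find_recursive<Target, Pos + 1, Vs...>::value;
};

// Loop style: one function template instantiation per map; the search
// itself is an ordinary loop evaluated by the constant evaluator.
template <index_t... Vs>
constexpr index_t find_source_index(index_t target)
{
    const index_t values[] = {Vs..., 0}; // trailing 0 keeps the array non-empty
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Vs)); ++i)
    {
        if(values[i] == target)
            return i;
    }
    return -1;
}

// Pack expansion over the positions 0..N-1 builds the whole result at once.
template <index_t... Vs, index_t... Ps>
constexpr auto invert(Sequence<Vs...>, std::integer_sequence<index_t, Ps...>)
{
    return Sequence<find_source_index<Vs...>(Ps)...>{};
}

template <index_t... Vs>
constexpr auto invert(Sequence<Vs...> s)
{
    return invert(s, std::make_integer_sequence<index_t, sizeof...(Vs)>{});
}

static_assert(std::is_same_v<decltype(invert(Sequence<2, 0, 1>{})), Sequence<1, 2, 0>>);
static_assert(find_recursive<0, 0, 2, 0, 1>::value == find_source_index<2, 0, 1>(0));
```

Both versions compute the same index; the difference is how many template specializations the compiler has to create to get there.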

Build Performance Impact

Template Instantiation Reduction (measured on device_grouped_conv3d_fwd_bias_bnorm_clamp_instance target, 248 files):

This confirms the optimization successfully reduces template instantiation overhead by eliminating recursive template patterns in favor of pack expansion.

Test Plan

  • Existing SequenceMapInverse.InverseMap and SequenceMapInverse.InverseIdentityMap tests validate correctness
  • CI

Notes

@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 59f0c32 to 5190578 Compare January 16, 2026 17:34
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from 6d792da to f5ada17 Compare January 16, 2026 20:16
@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 5190578 to 887bdf2 Compare January 16, 2026 20:16
@tenpercent tenpercent marked this pull request as ready for review January 17, 2026 03:41
@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 887bdf2 to 02e42dc Compare January 17, 2026 03:51
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from f5ada17 to 9942fd6 Compare January 17, 2026 03:51
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from 9d67d0d to c4d95f7 Compare January 21, 2026 23:43
@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 82b6016 to 602c127 Compare January 21, 2026 23:56
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from c4d95f7 to 631df4f Compare January 21, 2026 23:57
@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 602c127 to 1713ea7 Compare January 22, 2026 01:00
@tenpercent tenpercent changed the base branch from mpodkory/generate-tuple-optimizations to develop January 22, 2026 01:04
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch 2 times, most recently from cbaf07b to 3b8b37d Compare January 22, 2026 19:52
@tenpercent tenpercent changed the title Rewrite O(N) recursive templates with O(1) pack expansion Replace O(N) recursive sequence_map_inverse with O(1) pack expansion Jan 22, 2026
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from 3b8b37d to 7c9cdf0 Compare January 22, 2026 20:24
@tenpercent tenpercent marked this pull request as draft January 22, 2026 20:31
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch 4 times, most recently from d162e26 to f8d808e Compare January 22, 2026 21:11
@tenpercent tenpercent marked this pull request as ready for review January 22, 2026 22:29
@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from e921e01 to bd98bd1 Compare January 23, 2026 00:01
@tenpercent tenpercent requested a review from Copilot January 23, 2026 22:39
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes sequence_map_inverse by replacing O(N) recursive template instantiation with O(1) template depth using pack expansion and constexpr loops, reducing compilation overhead.

Changes:

  • Replaced recursive sequence_map_inverse_impl with ConstexprArray and constexpr loop-based find_inverse
  • Added detailed comments explaining the compilation performance benefits
  • Achieved 1.6% reduction in template instantiations (126,896 fewer instantiations)

if(values[i] == target)
return i;
}
return -1; // should not reach for valid permutation

Copilot AI Jan 23, 2026

The return value -1 for an invalid permutation is misleading since index_t is likely unsigned. Consider using a static_assert or explicit error handling to catch invalid permutations at compile-time, or document that this path should never execute for valid inputs.

Suggested change
- return -1; // should not reach for valid permutation
+ return static_cast<index_t>(-1); // should not reach for valid permutation

Collaborator

@shumway shumway left a comment

I think we want to remove the O(N^2) loop and probably add memoization, too.

if(values[i] == target)
return i;
}
return -1; // should not reach for valid permutation
Collaborator

Can we make this a compile-time error, just to catch anything that is really broken?

The two patterns I know are to use a static_assert or to call a consteval function that throws an error. We don't want to have this logic silently fail.
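
One way the throw-based pattern could look in C++17 is sketched below. The helper find_or_fail and the sample kMap are hypothetical names, not code from this PR. It relies on the rule that reaching a throw during constant evaluation makes the program ill-formed, so a bad input stops the build instead of yielding -1.

```cpp
#include <cstddef>
#include <stdexcept>

using index_t = int; // assumption: a signed index type

// Hypothetical helper: returns the position of `target`, or throws.
// When the call is constant-evaluated, reaching the throw is a compile error.
template <std::size_t N>
constexpr index_t find_or_fail(const index_t (&values)[N], index_t target)
{
    for(std::size_t i = 0; i < N; ++i)
    {
        if(values[i] == target)
            return static_cast<index_t>(i);
    }
    throw std::logic_error("target not found: input is not a valid permutation");
}

constexpr index_t kMap[] = {2, 0, 1};

// Valid lookup: evaluated entirely at compile time.
static_assert(find_or_fail(kMap, 0) == 1, "0 is stored at position 1");

// Uncommenting the next line should fail to compile, because the throw
// is reached inside a constant expression:
// static_assert(find_or_fail(kMap, 7) == 0, "");
```

With C++20, declaring the function consteval would additionally forbid any runtime call, which is the second pattern mentioned above.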

// to expand over. Sequence<0,1,2> gives us Positions = 0,1,2, which expands to:
// Sequence<find_inverse(0), find_inverse(1), find_inverse(2)>
// Without a pack, we'd need recursion to generate each element - defeating our goal.
template <index_t... Positions>
Collaborator

This looks like we have an O(N^2) evaluation of the permutation inverse. That's much better than a recursive template instantiation, but we can probably be faster:

  1. Write a direct permutation inverse:
template <index_t... Positions>
static constexpr auto compute(Sequence<Positions...>)
{
    // Build result array in one pass
    detail::ConstexprArray<index_t, sizeof...(Is)> result = {};
    index_t pos = 0;
    ((result[values[pos++]] = Positions), ...); // fold expression
    
    return Sequence<result[Positions]...>{};
}
  2. Maybe also use a static constexpr variable template to cache the inverted permutations?
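
A self-contained C++17 sketch of the single-pass idea from the first suggestion follows. It is an illustration under assumptions, not the reviewer's or the PR's code: InverseTable, make_inverse_table, and invert_direct are hypothetical names, and Sequence and index_t are local stand-ins.

```cpp
#include <type_traits>
#include <utility>

using index_t = int; // assumption: a signed index type

template <index_t... Is>
struct Sequence {};

// Hypothetical fixed-size table usable in constant expressions.
template <index_t N>
struct InverseTable
{
    index_t data[N > 0 ? N : 1];
};

// One pass: element Vs at position pos means inverse[Vs] = pos.
template <index_t... Vs>
constexpr auto make_inverse_table()
{
    constexpr index_t n = sizeof...(Vs);
    InverseTable<n> table{};
    index_t pos = 0;
    ((table.data[Vs] = pos++), ...); // comma fold runs left to right
    return table;
}

template <index_t... Vs, index_t... Ps>
constexpr auto invert_direct(Sequence<Vs...>, std::integer_sequence<index_t, Ps...>)
{
    // Computed once per distinct map, then read N times by the expansion.
    constexpr auto table = make_inverse_table<Vs...>();
    return Sequence<table.data[Ps]...>{};
}

template <index_t... Vs>
constexpr auto invert_direct(Sequence<Vs...> s)
{
    return invert_direct(s, std::make_integer_sequence<index_t, sizeof...(Vs)>{});
}

static_assert(
    std::is_same_v<decltype(invert_direct(Sequence<2, 0, 1>{})), Sequence<1, 2, 0>>);
```

An out-of-range value in the map indexes past the table during constant evaluation, which compilers are required to reject, so that class of bad input fails the build in this sketch. Repeated values are not detected here. On caching: a template specialization is instantiated once per distinct argument list, so a variable template keyed on the map would get this reuse from the compiler without extra bookkeeping.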

Contributor Author

Looking into the instantiation patterns for a conv device instances target, I see the maximum sequence length is 9 and all of them are valid permutations.

There are 12 unique instances, and most of them are identity permutations. Maybe we can start with a shortcut for inverting the identity and roll back the separate array struct. The struct doesn't seem worth the added maintenance complexity, and its constant-factor cost may be worse than the overhead of small local arrays.
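
A possible shape for the identity shortcut, sketched in C++17 with hypothetical names (is_identity, find_position, invert_general, invert_with_shortcut) and local stand-ins for Sequence and index_t. It uses a plain local array rather than a separate array struct, in the spirit of the comment above.

```cpp
#include <type_traits>
#include <utility>

using index_t = int; // assumption: a signed index type

template <index_t... Is>
struct Sequence {};

// True when the pack is exactly 0, 1, ..., N-1.
template <index_t... Vs>
constexpr bool is_identity()
{
    index_t expected = 0;
    return ((Vs == expected++) && ...);
}

// General path: linear search over a small local array.
template <index_t... Vs>
constexpr index_t find_position(index_t target)
{
    const index_t values[] = {Vs..., 0}; // trailing 0 keeps the array non-empty
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Vs)); ++i)
    {
        if(values[i] == target)
            return i;
    }
    return -1;
}

template <index_t... Vs, index_t... Ps>
constexpr auto invert_general(Sequence<Vs...>, std::integer_sequence<index_t, Ps...>)
{
    return Sequence<find_position<Vs...>(Ps)...>{};
}

// The identity is its own inverse, so the general path is only
// instantiated for non-identity maps.
template <index_t... Vs>
constexpr auto invert_with_shortcut(Sequence<Vs...> s)
{
    if constexpr(is_identity<Vs...>())
        return s;
    else
        return invert_general(s, std::make_integer_sequence<index_t, sizeof...(Vs)>{});
}

static_assert(std::is_same_v<decltype(invert_with_shortcut(Sequence<0, 1, 2>{})),
                             Sequence<0, 1, 2>>);
static_assert(std::is_same_v<decltype(invert_with_shortcut(Sequence<2, 0, 1>{})),
                             Sequence<1, 2, 0>>);
```

Whether the shortcut pays off in build time would need to be measured on the real targets; this sketch only shows that the branch can be taken without instantiating the general path.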

Collaborator

@cgmillette cgmillette left a comment

Does this handle repeated indices?
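
The thread does not record an answer. As a sketch of how repeated or out-of-range indices could be rejected up front, the hypothetical C++17 check below (is_valid_permutation is not a name from this PR) could back a static_assert placed ahead of the inversion.

```cpp
using index_t = int; // assumption: a signed index type

// Returns true only if the pack contains each of 0..N-1 exactly once.
template <index_t... Vs>
constexpr bool is_valid_permutation()
{
    constexpr index_t n = sizeof...(Vs);
    const index_t values[] = {Vs..., 0}; // trailing 0 keeps the array non-empty
    bool seen[n + 1] = {};
    for(index_t i = 0; i < n; ++i)
    {
        const index_t v = values[i];
        if(v < 0 || v >= n)
            return false; // out of range
        if(seen[v])
            return false; // repeated index
        seen[v] = true;
    }
    return true;
}

static_assert(is_valid_permutation<2, 0, 1>(), "valid permutation");
static_assert(!is_valid_permutation<0, 0, 1>(), "repeated index is rejected");
static_assert(!is_valid_permutation<0, 3, 1>(), "out-of-range index is rejected");
```

Without such a check, a search-based inverse applied to a map like 0,0,1 would find the first match for 0 and no match at all for 2, so the failure would surface only through the not-found path.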

@tenpercent tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from bd98bd1 to 7a427d0 Compare January 27, 2026 03:50

4 participants