@Kmeakin
Contributor

@Kmeakin Kmeakin commented Sep 3, 2025

Split off from #145219

Cased is a derived property: it is the union of the Lowercase property, the Uppercase property, and the Titlecase_Letter general category. We already have lookup tables for Lowercase and Uppercase, and Titlecase_Letter is very small. So instead of keeping a separate lookup table for Cased, just test each of those properties in turn.
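
For reference, a minimal sketch of what testing each property in turn could look like outside of core. The names `is_cased` and `is_titlecase_letter` are hypothetical, not the identifiers used in this PR, and the Titlecase_Letter (Lt) code points are hard-coded here rather than generated by unicode-table-generator, so treat the list as an assumption tied to current Unicode data:

```rust
/// Hypothetical stand-in for the small Titlecase_Letter (general category Lt) check.
/// Assumption: these ranges cover the 31 Lt code points in current Unicode data.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Hypothetical `Cased` check built from the existing public predicates:
/// Lowercase OR Uppercase OR Titlecase_Letter, tested in turn.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}
```

With this sketch, `is_cased('a')`, `is_cased('Σ')` and `is_cased('\u{01C5}')` are true, while `is_cased('1')` is false. The Lt check is what catches characters such as U+01C5, which are neither Lowercase nor Uppercase.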

This will probably be slower than the old approach, but Cased is not a public API: it is only used in str::to_lowercase when deciding whether a Greek capital sigma (Σ) should be mapped to ς or to σ. This is a very rare case, so it should not be performance sensitive.
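
To illustrate the sigma case, here is a small sketch that only exercises the public `str::to_lowercase`. The function name `check_final_sigma` is made up for this example. It shows the observable behaviour, not the internal Cased lookup itself:

```rust
/// Hypothetical demo function: exercises the final-sigma handling in `str::to_lowercase`.
fn check_final_sigma() {
    // A lone capital sigma is not preceded by a cased letter, so it lowers to σ (U+03C3).
    assert_eq!("Σ".to_lowercase(), "σ");

    // At the end of a word (preceded by a cased letter, not followed by one),
    // capital sigma lowers to the final form ς (U+03C2).
    assert_eq!("ΦΩΣ".to_lowercase(), "φως");
}
```

Every other character in the input takes the ordinary table-driven lowercase mapping. Only Σ needs the extra context check, which is why the Cased test is expected to be rare.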

@rustbot
Collaborator

rustbot commented Sep 3, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 3, 2025
@rustbot
Collaborator

rustbot commented Sep 3, 2025

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@Kmeakin Kmeakin changed the title from "optimization: Eliminate Cased table" to "Remove Cased Unicode table" Sep 3, 2025
@Kobzol
Member

Kobzol commented Sep 4, 2025

@bors try @rust-timer queue

rust-bors bot added a commit that referenced this pull request Sep 4, 2025
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 4, 2025
@rust-bors
Contributor

rust-bors bot commented Sep 4, 2025

☀️ Try build successful (CI)
Build commit: abd6680 (abd6680cc4021a02ca80a20bb45c967f5cb9f056, parent: 033c0a4742794f5608b19eb78458726596f8ec18)

@rust-timer
Collaborator

Finished benchmarking commit (abd6680): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @rustbot label: +perf-regression-triaged. If not, please fix the regressions and do another perf run. If its results are neutral or positive, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.2%, 0.2%] | 1 |
| Regressions ❌ (secondary) | 0.4% | [0.4%, 0.4%] | 1 |
| Improvements ✅ (primary) | -0.3% | [-0.5%, -0.2%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.1% | [-0.5%, 0.2%] | 3 |

Max RSS (memory usage)

Results (primary 0.4%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 4.3% | [4.3%, 4.3%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -1.6% | [-2.0%, -1.2%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.4% | [-2.0%, 4.3%] | 3 |

Cycles

Results (secondary 2.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.0% | [2.0%, 2.0%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

Results (primary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.1%, 0.1%] | 7 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.1% | [0.1%, 0.1%] | 7 |

Bootstrap: 466.287s -> 464.864s (-0.31%)
Artifact size: 388.39 MiB -> 388.39 MiB (0.00%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Sep 4, 2025
@bors
Collaborator

bors commented Sep 8, 2025

☔ The latest upstream changes (presumably #146173) made this pull request unmergeable. Please resolve the merge conflicts.

@Kmeakin Kmeakin force-pushed the km/unicode-data/remove-cased branch from a765086 to 4ca4c44 Compare September 8, 2025 19:23
@bors
Collaborator

bors commented Oct 4, 2025

☔ The latest upstream changes (presumably #147340) made this pull request unmergeable. Please resolve the merge conflicts.

@Kmeakin Kmeakin force-pushed the km/unicode-data/remove-cased branch from 4ca4c44 to 94be7eb Compare October 5, 2025 00:00
@Kmeakin
Contributor Author

Kmeakin commented Oct 10, 2025

@scottmcm ping?

@bors
Collaborator

bors commented Nov 1, 2025

☔ The latest upstream changes (presumably #148337) made this pull request unmergeable. Please resolve the merge conflicts.

@Kmeakin Kmeakin force-pushed the km/unicode-data/remove-cased branch from 94be7eb to 55adb69 Compare November 1, 2025 23:29
@bors
Collaborator

bors commented Nov 3, 2025

☔ The latest upstream changes (presumably #148436) made this pull request unmergeable. Please resolve the merge conflicts.

@Kmeakin Kmeakin force-pushed the km/unicode-data/remove-cased branch from 55adb69 to a9b456f Compare November 10, 2025 01:40
@rustbot
Collaborator

rustbot commented Nov 10, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@scottmcm

This comment was marked as resolved.

@rustbot rustbot assigned joboet and unassigned scottmcm Jan 27, 2026
@joboet
Member

joboet commented Jan 27, 2026

I'm surprised that this appears to increase binary size?! If that's still the case, then I don't think we should do this. Otherwise the argument about only being used in to_lower makes sense to me. But let's reevaluate perf first...

@bors try @rust-timer queue

rust-bors bot pushed a commit that referenced this pull request Jan 27, 2026
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 27, 2026
@rust-bors
Contributor

rust-bors bot commented Jan 27, 2026

☀️ Try build successful (CI)
Build commit: 97f47c0 (97f47c0b67e793194ac0fc96b2fb45c8bf97491d, parent: 94a0cd15f5976fa35e5e6784e621c04e9f958e57)

@rust-timer
Collaborator

Finished benchmarking commit (97f47c0): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 0.1%] | 3 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.1% | [-0.1%, -0.0%] | 3 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary 0.6%, secondary -1.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.3% | [2.3%, 4.4%] | 2 |
| Regressions ❌ (secondary) | 2.9% | [2.9%, 2.9%] | 1 |
| Improvements ✅ (primary) | -2.2% | [-2.4%, -1.9%] | 2 |
| Improvements ✅ (secondary) | -3.0% | [-5.9%, -1.7%] | 4 |
| All ❌✅ (primary) | 0.6% | [-2.4%, 4.4%] | 4 |

Cycles

Results (primary -2.1%, secondary 0.7%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 4.1% | [3.0%, 5.7%] | 4 |
| Improvements ✅ (primary) | -2.1% | [-2.1%, -2.1%] | 1 |
| Improvements ✅ (secondary) | -3.9% | [-4.2%, -3.6%] | 3 |
| All ❌✅ (primary) | -2.1% | [-2.1%, -2.1%] | 1 |

Binary size

Results (primary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.1%, 0.1%] | 4 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.1% | [0.1%, 0.1%] | 4 |

Bootstrap: 473.018s -> 473.56s (0.11%)
Artifact size: 385.68 MiB -> 383.70 MiB (-0.51%)

@rustbot rustbot removed perf-regression Performance regression. S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jan 27, 2026
@joboet
Member

joboet commented Jan 30, 2026

Actually, it makes sense that this yields a mixed bag of results. Code that only calls .to_lowercase() but not is_lowercase or is_uppercase now needs to include both the Lowercase and the Uppercase tables. But code that already calls is_lowercase and is_uppercase skips a table.

Given that the Lowercase and Uppercase tables are so much larger than the Cased table, I'm more inclined to favour the only-.to_lowercase() use-case (and close this PR).

@Kmeakin
Contributor Author

Kmeakin commented Jan 31, 2026

> Actually, it makes sense that this yields a mixed bag of results. Code that only calls .to_lowercase() but not is_lowercase or is_uppercase now needs to include both the Lowercase and the Uppercase tables. But code that already calls is_lowercase and is_uppercase skips a table.
>
> Given that the Lowercase and Uppercase tables are so much larger than the Cased table, I'm more inclined to favour the only-.to_lowercase() use-case (and close this PR).

If I understand correctly, what you're saying is:

Before this PR, str::to_lowercase() would pull in unicode::to_lower (11,708 bytes), unicode::Case_Ignorable (899 bytes) and unicode::Cased (401 bytes), for a total of 13,008 bytes.

After this PR, str::to_lowercase() would pull in unicode::to_lower (11,708 bytes), unicode::Case_Ignorable (899 bytes), unicode::Lowercase (943 bytes), unicode::Uppercase (799 bytes) and unicode::Lt (33 bytes) for a total of 14,382 bytes.

Calls to str::to_uppercase(), char::is_lowercase() and char::is_uppercase() will be unchanged and pull in the same number of bytes as before.

So this PR is a win for programs that call str::to_lowercase() as well as char::is_lowercase() and char::is_uppercase(). But for programs that only call str::to_lowercase(), this PR is a loss.

Is that right? If so, I agree with your reasoning.
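
The byte accounting above can be checked with a short sketch. The table sizes are the figures quoted in this comment. The function names are made up, and the case where a program also calls `is_lowercase`/`is_uppercase` is derived from those figures rather than measured. It assumes unreferenced tables are dropped at link time and ignores code size:

```rust
// Table sizes in bytes, as quoted in the comment above.
const TO_LOWER: usize = 11_708;
const CASE_IGNORABLE: usize = 899;
const CASED: usize = 401;
const LOWERCASE: usize = 943;
const UPPERCASE: usize = 799;
const LT: usize = 33;

/// Program that only calls `str::to_lowercase()`, before this PR.
fn only_to_lowercase_before() -> usize {
    TO_LOWER + CASE_IGNORABLE + CASED
}

/// Same program after this PR: Cased is replaced by Lowercase + Uppercase + Lt.
fn only_to_lowercase_after() -> usize {
    TO_LOWER + CASE_IGNORABLE + LOWERCASE + UPPERCASE + LT
}

/// Derived case: program that also calls `is_lowercase` and `is_uppercase`, before this PR.
fn with_predicates_before() -> usize {
    TO_LOWER + CASE_IGNORABLE + CASED + LOWERCASE + UPPERCASE
}

/// Derived case after this PR: the Lowercase and Uppercase tables are already present.
fn with_predicates_after() -> usize {
    TO_LOWER + CASE_IGNORABLE + LOWERCASE + UPPERCASE + LT
}
```

Under these assumptions, the only-`to_lowercase()` program grows from 13,008 to 14,382 bytes (+1,374), while a program that already uses both predicates shrinks from 14,750 to 14,382 bytes (-368).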

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.

8 participants