Remove `Cased` Unicode table #146180

Kmeakin · 2025-09-03T22:16:46Z

Split off from #145219

Cased is a derived property: it is the union of the Lowercase property, the Uppercase property, and the Titlecase_Letter general category. We already have lookup tables for Lowercase and Uppercase, and Titlecase_Letter is very small. So instead of keeping a separate, largely redundant lookup table for Cased, just test each of those properties in turn.
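A minimal sketch of the idea, for illustration only. It uses the public `char::is_lowercase` and `char::is_uppercase` methods as stand-ins for the private Lowercase and Uppercase table lookups. The `is_titlecase_letter` and `is_cased` names are hypothetical, and the actual code in `library/core` may be structured differently.

```rust
/// Hypothetical helper: true for the code points in the Titlecase_Letter (Lt)
/// general category. The set is small enough to list by hand.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Hypothetical stand-in for the private Cased check: instead of consulting a
/// dedicated table, test each constituent property in turn.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

fn main() {
    assert!(is_cased('a'));
    assert!(is_cased('Σ'));
    assert!(is_cased('\u{01C5}')); // LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    assert!(!is_cased('1'));
    assert!(!is_cased(' '));
}
```

The trade-off is up to three checks per character instead of one table lookup.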

This will probably be slower than the old approach, but it is not a public API: it is only used in str::to_lowercase when deciding whether a Greek "sigma" should be mapped to ς or to σ. This is a very rare case, so it should not be performance sensitive.
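For context, the sigma special case can be observed through the public `str::to_lowercase` method. The snippet below is only a small demonstration of that behaviour; the input strings are made up for illustration.

```rust
fn main() {
    // Word-final sigma: preceded by a cased letter and not followed by one.
    assert_eq!("ΑΣ".to_lowercase(), "ας");
    // Sigma in the middle of a word keeps the regular lowercase form.
    assert_eq!("ΑΣΑ".to_lowercase(), "ασα");
    // A lone sigma is not preceded by a cased letter, so it is not word-final.
    assert_eq!("Σ".to_lowercase(), "σ");
}
```

Only strings containing an uppercase Σ reach the Cased check, which is why the lookup is rarely exercised.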

rustbot · 2025-09-03T22:16:50Z

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-09-03T22:16:52Z

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

Kobzol · 2025-09-04T06:32:39Z

@bors try @rust-timer queue

Remove `Cased` Unicode table

rust-bors · 2025-09-04T08:49:18Z

☀️ Try build successful (CI)
Build commit: abd6680 (abd6680cc4021a02ca80a20bb45c967f5cb9f056, parent: 033c0a4742794f5608b19eb78458726596f8ec18)

rust-timer · 2025-09-04T11:46:09Z

Finished benchmarking commit (abd6680): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @rustbot label: +perf-regression-triaged. If not, please fix the regressions and do another perf run. If its results are neutral or positive, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

	mean	range	count
Regressions ❌ (primary)	0.2%	[0.2%, 0.2%]	1
Regressions ❌ (secondary)	0.4%	[0.4%, 0.4%]	1
Improvements ✅ (primary)	-0.3%	[-0.5%, -0.2%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.1%	[-0.5%, 0.2%]	3

Max RSS (memory usage)

Results (primary 0.4%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	4.3%	[4.3%, 4.3%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.6%	[-2.0%, -1.2%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.4%	[-2.0%, 4.3%]	3

Cycles

Results (secondary 2.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.0%	[2.0%, 2.0%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Binary size

Results (primary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.1%, 0.1%]	7
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.1%	[0.1%, 0.1%]	7

Bootstrap: 466.287s -> 464.864s (-0.31%)
Artifact size: 388.39 MiB -> 388.39 MiB (0.00%)

bors · 2025-09-08T11:55:01Z

☔ The latest upstream changes (presumably #146173) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2025-10-04T22:31:29Z

☔ The latest upstream changes (presumably #147340) made this pull request unmergeable. Please resolve the merge conflicts.

Kmeakin · 2025-10-10T22:53:09Z

@scottmcm ping?

bors · 2025-11-01T11:40:04Z

☔ The latest upstream changes (presumably #148337) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2025-11-03T17:30:29Z

☔ The latest upstream changes (presumably #148436) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot · 2025-11-10T01:40:10Z

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

joboet · 2026-01-27T19:46:32Z

I'm surprised that this appears to increase binary size?! If that's still the case, then I don't think we should do this. Otherwise the argument about only being used in to_lower makes sense to me. But let's reevaluate perf first...

@bors try @rust-timer queue

Remove `Cased` Unicode table

rust-bors · 2026-01-27T22:02:00Z

☀️ Try build successful (CI)
Build commit: 97f47c0 (97f47c0b67e793194ac0fc96b2fb45c8bf97491d, parent: 94a0cd15f5976fa35e5e6784e621c04e9f958e57)

rust-timer · 2026-01-27T23:11:41Z

Finished benchmarking commit (97f47c0): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.1%	[0.0%, 0.1%]	3
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.1%	[-0.1%, -0.0%]	3
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results (primary 0.6%, secondary -1.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	3.3%	[2.3%, 4.4%]	2
Regressions ❌ (secondary)	2.9%	[2.9%, 2.9%]	1
Improvements ✅ (primary)	-2.2%	[-2.4%, -1.9%]	2
Improvements ✅ (secondary)	-3.0%	[-5.9%, -1.7%]	4
All ❌✅ (primary)	0.6%	[-2.4%, 4.4%]	4

Cycles

Results (primary -2.1%, secondary 0.7%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	4.1%	[3.0%, 5.7%]	4
Improvements ✅ (primary)	-2.1%	[-2.1%, -2.1%]	1
Improvements ✅ (secondary)	-3.9%	[-4.2%, -3.6%]	3
All ❌✅ (primary)	-2.1%	[-2.1%, -2.1%]	1

Binary size

Results (primary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.1%, 0.1%]	4
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.1%	[0.1%, 0.1%]	4

Bootstrap: 473.018s -> 473.56s (0.11%)
Artifact size: 385.68 MiB -> 383.70 MiB (-0.51%)

joboet · 2026-01-30T13:13:37Z

Actually it makes sense that this yields a mixed bag of results: For code that only calls .to_lowercase() but not is_lowercase, this needs to include both the Lowercase and the Uppercase tables. But for code that already calls is_lowercase and is_uppercase, this skips a table.

Given that the Lowercase and Uppercase tables are so much larger than the Cased table, I'm more inclined to favour the only-.to_lowercase() use-case (and close this PR).

Kmeakin · 2026-01-31T16:09:31Z

> Actually it makes sense that this yields a mixed bag of results: For code that only calls .to_lowercase() but not is_lowercase, this needs to include both the Lowercase and the Uppercase tables. But for code that already calls is_lowercase and is_uppercase, this skips a table.
>
> Given that the Lowercase and Uppercase tables are so much larger than the Cased table, I'm more inclined to favour the only-.to_lowercase() use-case (and close this PR).

If I understand correctly, what you're saying is:

Before this PR, str::to_lowercase() would pull in unicode::to_lower (11,708 bytes), unicode::Case_Ignorable (899 bytes) and unicode::Cased (401 bytes), for a total of 13,008 bytes.

After this PR, str::to_lowercase() would pull in unicode::to_lower (11,708 bytes), unicode::Case_Ignorable (899 bytes), unicode::Lowercase (943 bytes), unicode::Uppercase (799 bytes) and unicode::Lt (33 bytes) for a total of 14,382 bytes.

Calls to str::to_uppercase(), char::is_lowercase() and char::is_uppercase() will be unchanged and pull in the same number of bytes as before.

So this PR is a win for programs that call str::to_lowercase() as well as char::is_lowercase() and char::is_uppercase(). But for programs that only call str::to_lowercase(), this PR will be a loss.

Is that right? If so, I agree with your reasoning.
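The arithmetic above can be checked with a short sketch. The byte counts are the ones quoted in this comment. The two scenario functions are hypothetical and assume that only the tables reachable from the functions a program calls end up in the binary, and that `char::is_lowercase` and `char::is_uppercase` each pull in only their own table.

```rust
// Table sizes in bytes, as quoted in the comment above.
const TO_LOWER: usize = 11_708;
const CASE_IGNORABLE: usize = 899;
const CASED: usize = 401;
const LOWERCASE: usize = 943;
const UPPERCASE: usize = 799;
const LT: usize = 33;

/// Bytes of tables for a program whose only case-related call is
/// `str::to_lowercase()`.
fn only_to_lowercase(with_pr: bool) -> usize {
    let base = TO_LOWER + CASE_IGNORABLE;
    if with_pr { base + LOWERCASE + UPPERCASE + LT } else { base + CASED }
}

/// Bytes of tables for a program that also calls `char::is_lowercase()` and
/// `char::is_uppercase()`, so Lowercase and Uppercase are present either way.
fn with_is_lower_and_is_upper(with_pr: bool) -> usize {
    let base = TO_LOWER + CASE_IGNORABLE + LOWERCASE + UPPERCASE;
    if with_pr { base + LT } else { base + CASED }
}

fn main() {
    // Only `to_lowercase`: 13,008 -> 14,382 bytes, a loss of 1,374 bytes.
    assert_eq!(only_to_lowercase(false), 13_008);
    assert_eq!(only_to_lowercase(true), 14_382);
    // All three calls: 14,750 -> 14,382 bytes, a saving of 368 bytes.
    assert_eq!(with_is_lower_and_is_upper(false), 14_750);
    assert_eq!(with_is_lower_and_is_upper(true), 14_382);
}
```

Under these assumptions the loss in the `to_lowercase`-only case (1,374 bytes) is several times larger than the saving in the combined case (368 bytes).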

rustbot assigned scottmcm Sep 3, 2025

rustbot added the S-waiting-on-review and T-libs labels Sep 3, 2025

Kmeakin changed the title from "optimization: Eliminate Cased table" to "Remove Cased Unicode table" Sep 3, 2025