Conversation
This automatically selects percent-encoded data URLs if that results in smaller output than base64-encoding them.
Hello Tobias,

Thank you very much for this PR! Not a stupid idea at all: base64 unnecessarily bloats plaintext, producing output something like 30% longer on average than URL encoding. Originally I created https://github.com/Y2Z/dataurl to move the code that parses and creates data URLs out of Monolith. I'll review your PR briefly and get back to you.
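The ~30% figure follows from base64's fixed 4-chars-per-3-bytes expansion; here is a quick arithmetic sketch in plain Rust (no external crates, the 150,000-byte input is just an example size):

```rust
// Standard base64 (with padding) emits 4 output chars for every 3 input
// bytes: 4 * ceil(n / 3). For ASCII text that percent-encoding would mostly
// pass through unchanged, that is a fixed ~33% overhead.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

fn main() {
    let n = 150_000; // e.g. a large stylesheet
    let b64 = base64_len(n);
    println!(
        "{} plaintext bytes -> {} base64 chars ({:.0}% larger)",
        n,
        b64,
        (b64 as f64 / n as f64 - 1.0) * 100.0
    );
}
```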
```rust
let base64 = BASE64_STANDARD.encode(data);
let urlenc = percent_encode(data, DATA_ESC).to_string();

if urlenc.len() < base64.len() {
```
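The shorter-output-wins idea in the diff can be sketched end-to-end without the `base64`/`percent-encoding` crates by comparing output lengths only; the escape predicate below is a hypothetical stand-in for `DATA_ESC`, not the actual set:

```rust
// Output length of percent-encoding: each escaped byte costs 3 chars
// ("%XX"), everything else passes through as 1 char.
fn percent_len(data: &[u8], needs_escape: impl Fn(u8) -> bool) -> usize {
    data.iter().map(|&b| if needs_escape(b) { 3 } else { 1 }).sum()
}

// Output length of standard (padded) base64: 4 chars per 3 input bytes.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

fn main() {
    // Hypothetical escape rule, NOT Monolith's DATA_ESC: keep RFC 3986
    // unreserved characters, escape everything else.
    let escape = |b: u8| !(b.is_ascii_alphanumeric() || matches!(b, b'.' | b'-' | b'_' | b'~'));
    let svg = b"<svg xmlns='http://www.w3.org/2000/svg'></svg>";
    let (p, b64) = (percent_len(svg, escape), base64_len(svg.len()));
    println!("percent: {p} chars, base64: {b64} chars");
    // Mostly-unreserved text comes out shorter percent-encoded; binary data
    // (where nearly every byte would cost 3 chars) would not.
}
```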
I like your logic, but I worry that running both base64 and percent-encoding on every single asset will eat both more CPU and RAM, and I don't see any benefit of using base64 for plaintext data anyway, even if it somehow manages to be a few bytes shorter.

I think the best way to go here is to default to percent-encoding for plaintext data and use base64 for non-printable data (fonts, non-SVG images, etc.). There's a data type detector somewhere in this codebase, I think it's called `is_plaintext()`; that should be enough for this function to decide whether it needs base64 or not.

I also believe it's not necessarily about file size; it might be more about how much CPU time it takes to decode the data URL into a blob, and something tells me base64 takes more than percent-encoding, but I might be wrong.

Last but not least, it's priceless for humans to be able to see what's in the data URL without having to decode it, so percent-encoding should be preferable here, not just because of the shorter length of the data URL.
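The type-based choice suggested above could look roughly like this; `is_plaintext` here is a naive byte-level stand-in for whatever detector the codebase actually ships, purely for illustration:

```rust
// Naive stand-in for the codebase's plaintext detector: treat data as text
// if every byte is printable ASCII or common whitespace. The real detector
// presumably inspects the media type instead of the raw bytes.
fn is_plaintext(data: &[u8]) -> bool {
    data.iter()
        .all(|&b| matches!(b, b'\t' | b'\n' | b'\r') || (0x20..0x7f).contains(&b))
}

// Decide the data URL encoding once, up front, instead of computing both
// encodings and comparing their lengths.
fn choose_encoding(data: &[u8]) -> &'static str {
    if is_plaintext(data) { "percent" } else { "base64" }
}

fn main() {
    assert_eq!(choose_encoding(b"body { color: red; }"), "percent");
    assert_eq!(choose_encoding(&[0x89, b'P', b'N', b'G']), "base64"); // PNG magic bytes
    println!("ok");
}
```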
This is a kinda stupid idea I had when I saw the size of some really large dumps (for example, from comic pages on tapas.io)...
It seems to work... In a sample page from tapas.io, it reduces the final size from 213,233,452 bytes to 170,968,176 bytes, which is about 20% smaller.
The characters which are percent-encoded come from https://datatracker.ietf.org/doc/html/rfc3986#section-2.2, since that is referenced from https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Schemes/data - I found no concrete list of characters which should be escaped in the data URL RFC itself (https://www.rfc-editor.org/rfc/rfc2397). Additionally, we escape `%`, so nested data: URLs (PNGs in CSS, anyone?) keep working, and `"`, so we don't accidentally close the quotes surrounding the data URL.

I'm not sure if the encoding is correct for exotic (non-UTF-8) charsets... Please advise if I should add more tests covering such scenarios.
(PS: Feel free to close this PR if this all sounds too stupid/mad)