Conversation
This automatically selects percent-encoded data URLs if that results in smaller output than base64-encoding them.
Hello Tobias,

Thank you very much for this PR! Not a stupid idea at all: base64 unnecessarily bloats plaintext, producing output something like 30% longer on average than URL encoding. Originally I created https://github.com/Y2Z/dataurl to move the code that parses and creates data URLs out of Monolith. I'll review your PR briefly and get back to you.
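The ~30% figure follows from base64's fixed 4-chars-per-3-bytes expansion; here is a quick arithmetic sketch in plain Rust (no external crates, the 150,000-byte input is just an example size):

```rust
// Standard base64 (with padding) emits 4 output chars for every 3 input
// bytes: 4 * ceil(n / 3). For ASCII text that percent-encoding would mostly
// pass through unchanged, that is a fixed ~33% overhead.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

fn main() {
    let n = 150_000; // e.g. a large stylesheet
    let b64 = base64_len(n);
    println!(
        "{} plaintext bytes -> {} base64 chars ({:.0}% larger)",
        n,
        b64,
        (b64 as f64 / n as f64 - 1.0) * 100.0
    );
}
```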
```rust
let base64 = BASE64_STANDARD.encode(data);
let urlenc = percent_encode(data, DATA_ESC).to_string();

if urlenc.len() < base64.len() {
```
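The shorter-output-wins idea in the diff can be sketched end-to-end without the `base64`/`percent-encoding` crates by comparing output lengths only; the escape predicate below is a hypothetical stand-in for `DATA_ESC`, not the actual set:

```rust
// Output length of percent-encoding: each escaped byte costs 3 chars
// ("%XX"), everything else passes through as 1 char.
fn percent_len(data: &[u8], needs_escape: impl Fn(u8) -> bool) -> usize {
    data.iter().map(|&b| if needs_escape(b) { 3 } else { 1 }).sum()
}

// Output length of standard (padded) base64: 4 chars per 3 input bytes.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

fn main() {
    // Hypothetical escape rule, NOT Monolith's DATA_ESC: keep RFC 3986
    // unreserved characters, escape everything else.
    let escape = |b: u8| !(b.is_ascii_alphanumeric() || matches!(b, b'.' | b'-' | b'_' | b'~'));
    let svg = b"<svg xmlns='http://www.w3.org/2000/svg'></svg>";
    let (p, b64) = (percent_len(svg, escape), base64_len(svg.len()));
    println!("percent: {p} chars, base64: {b64} chars");
    // Mostly-unreserved text comes out shorter percent-encoded; binary data
    // (where nearly every byte would cost 3 chars) would not.
}
```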
I like your logic, but I worry that running both base64 and percent-encoding on every single asset will eat both more CPU and RAM, and I don't see any benefit of using base64 for plaintext data anyway, even if it somehow manages to be a few bytes shorter.

I think the best way to go here is to default to percent-encoding for plaintext data and use base64 for non-printable data (fonts, non-SVG images, etc.). There's a data type detector somewhere in this codebase, I think it's called `is_plaintext()`; that should be enough for this function to decide whether it needs base64 or not.

I also believe it's not necessarily about file size; it might be more about how much CPU time it takes to decode the data URL into a blob, and something tells me base64 takes more than percent-encoding, but I might be wrong.

Last but not least, it's priceless for humans to be able to see what's in the data URL without having to decode it, so percent-encoding should be preferable here, not just because of the shorter length of the data URL.
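The type-based choice suggested above could look roughly like this; `is_plaintext` here is a naive byte-level stand-in for whatever detector the codebase actually ships, purely for illustration:

```rust
// Naive stand-in for the codebase's plaintext detector: treat data as text
// if every byte is printable ASCII or common whitespace. The real detector
// presumably inspects the media type instead of the raw bytes.
fn is_plaintext(data: &[u8]) -> bool {
    data.iter()
        .all(|&b| matches!(b, b'\t' | b'\n' | b'\r') || (0x20..0x7f).contains(&b))
}

// Decide the data URL encoding once, up front, instead of computing both
// encodings and comparing their lengths.
fn choose_encoding(data: &[u8]) -> &'static str {
    if is_plaintext(data) { "percent" } else { "base64" }
}

fn main() {
    assert_eq!(choose_encoding(b"body { color: red; }"), "percent");
    assert_eq!(choose_encoding(&[0x89, b'P', b'N', b'G']), "base64"); // PNG magic bytes
    println!("ok");
}
```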
This is a kinda stupid idea I had when I saw the size of some really large dumps (for example, from comic pages on tapas.io)...
It seems to work... In a sample page from tapas.io, it reduces the final size from 213,233,452 bytes to 170,968,176 bytes, which is about 20% smaller.
The characters which are percent-encoded come from https://datatracker.ietf.org/doc/html/rfc3986#section-2.2, since that is referenced from https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Schemes/data - I found no concrete list of characters which should be escaped in the data URL RFC itself (https://www.rfc-editor.org/rfc/rfc2397). Additionally, we escape `%`, so nested data: URLs (PNGs in CSS, anyone?) keep working, and `"`, so we don't accidentally close the quotes surrounding the data URL.

I'm not sure if the encoding is correct for exotic (non-UTF-8) charsets... Please advise if I should add more tests covering such scenarios.
(PS: Feel free to close this PR if this all sounds too stupid/mad)