Hi, I'm vs-123. I'm a passionate programmer with a deep-seated interest in the inner workings of computers. My interests primarily revolve around low-level development, and I enjoy bridging the gap between hardware and software.
C is my primary language of choice with a specific fondness for C99. I appreciate its efficiency and the granular control it offers over machine state, and I find its philosophy aligns perfectly with my interest in building lean & transparent systems.
- C -- C is my primary vehicle for exploring the machine state. I highly value and enjoy the transparency, lean nature, and fine control it offers me over memory and hardware
- C++ -- C++ has been my bridge from high-level abstractions to systems programming. I still use it when a project requires complex architectural patterns or modern abstractions
- Rust -- Rust used to be my daily driver. It shaped my understanding of memory safety and equipped me with key concepts such as ownership, borrowing, pointer aliasing, unsafe optimisations, etc.
- WASM -- I treat the web as just another compilation target. Sometimes I write projects in C/C++/Rust that I want to be able to use in my browser; in such cases, I compile to WASM
- Regex -- Beyond just using it, I've implemented a PCRE-compatible engine called regen from scratch in C99, in order to understand how it works deep under the hood. This has given me deep insights into not just regex, but also nondeterministic finite automata and backtracking
- Go -- Go is my favourite high-level language. I enjoy the minimalism and simplicity it offers, along with its C-ish vibes. I use it for systems involving networking
- Python -- I use Python as a scripting and glue language. I use it mostly for quick scripts that are slightly too complex for bash, quick prototypes of algorithms, visualising data, and making statistical observations and verifications, like those of ystar
- Lua -- I enjoy Lua's simplicity for a scripting language, and its embeddable nature. I use it primarily for my Neovim configuration
- Haskell -- Haskell is my playground and laboratory for parser algorithms. I use it for writing quick parser algorithms, as well as testing them. Its functional nature and expressive type system make it a perfect environment for me to test formal grammars
- HTML5, CSS, JS -- These languages were my entry point into programming; JavaScript is the first programming language I learnt. These foundations eventually led me to Electron, which piqued my interest in GUI application development
- TypeScript -- TypeScript was my first encounter with a formal type system. Initially I disliked its statically-typed nature, but after a while of using it, I eventually learnt to appreciate type systems in general. The fact that I could catch potential errors before runtime was a fascinating idea to me, coming from JavaScript, where somehow `0 == "0"` and `0 == []` are `true` but `"0" == []` is `false`, and `('b' + 'a' + + 'a' + 'a').toLowerCase()` becomes `"banana"`
- Linux & BSD -- I enjoy exploring the philosophies of various platforms, including the *nix ecosystem, which includes Linux and BSD. Although Void Linux is my daily driver, I admire the BSD family's commitment to the true UNIX philosophy. I especially admire the idea of "do one thing and do it well", which is pretty evident in the BSD userland and architecture
- Vim -- Vim is the first text editor I tried after months of VSCode. I found the idea of using my entire keyboard to manipulate text quite interesting. Vim has been my entry point into keyboard-oriented text editors
-
Emacs -- Emacs was my next daily-driver as a text editor. I enjoyed its highly customisable nature. But what stood out for me the most is the idea that it's a "self-documenting editor". I highly appreciated its built-in help features, especially
C-hchords -
- Neovim -- Neovim is my current text editor of choice; I use it for programming most of my projects, including this README. I occasionally use `ed` for making quick edits
- CMake -- CMake is my primary meta-buildsystem of choice when I write C and C++ projects. I enjoy its ability to generate build files for various buildsystems. This allows my projects, which I build with GNU Make, to be built on someone's Windows machine with Visual Studio
- Git -- Git is my favourite version control system. I use it frequently when I develop projects. It allows me to manage the evolution of my source code, which aids in keeping the project tidy and organised
- CI/CD & Docker -- I utilise GitHub Actions and Docker to automate the testing of my projects across different environments, and also to streamline binary distribution
- LaTeX & Markdown -- These are my tools of choice when it comes to technical documentation and making general notes
- regen -- A PCRE-compatible, regular-expressions engine written in C99. Supports various features including capture groups, backreferences, zero-width lookaround assertions, character classes, and more. No external libraries were used
- ystar -- An xorshift64* PRNG implementation in C99 with my own custom constants, tested and verified with various statistical tests including the Wald-Wolfowitz Runs Test and Lag-1 Correlation Test. Plots available in README
- bpvm -- A simple, fully-functional BytePusher VM implementation in C99. Screenshots available in README
- halloc -- A dead-simple, thread-safe, general-purpose, explicit-freelist heap allocator library written in C99
- barc -- A file-compression archive utility in C99: lexicographical-permutation-based BWT + local frequency-adaptive MTF transform + multi-byte chunked RLE. Uses no OS-specific/POSIX/external libraries in order to keep it portable
- mbf -- A custom BF implementation, with the addition of macros
- dstr -- A dead-simple dynamic-string library written in C89. Supports printf-style formatters for string concatenation, written from scratch
- mswpr -- Fully functional Minesweeper implementation in C99, using my own ystar PRNG for mine generation and placement. Screenshots available in README
- kernel -- A dead-simple kernel written in C99 with LLVM toolchain
- tegen -- A recursive, stochastic text generator written in Rust
- auc -- A cross-platform auto-clicker in C99
- Royal Hemlock Theme -- Soothing royal-blue light-theme for Emacs. Screenshots available in README
I find myself fascinated by the core mechanics of different programming languages and the way they're implemented. This often leads me to explore different compilers via Godbolt, where I read and analyse the assembly generated by various compilers. I enjoy learning how different compilers handle optimisation passes and architecture-specific optimisations. I treat the translation from code to machine instructions as a field of study in itself.
My time spent with Compiler Explorer has deeply influenced my development workflow. It has helped me develop a habit of making low-level optimisation decisions as I write code. This practice gives me a rough yet grounded intuition about the potential memory layout, cache locality and the precise state transitions the hardware performs during execution of my programs.
As a result of learning and analysing how different compilers target various ISAs, I've gained a solid familiarity with several architectures including, but not limited to, x86-64, x86, MIPS, ARM32/64 and MPPA. This has allowed me to reason about code performance across a wide and diverse range of platforms.
This exposure to a diverse range of assembly dialects has granted me the ability to intuitively read and understand most assembly, even for architectures I haven't formally encountered before. I find this skill incredibly helpful when debugging programs or reverse-engineering.
This grasp on assembly developed over the years, combined with my interest in language implementation, has led me to a much deeper understanding of the core mechanics of a computer. It allows me to evaluate the specific trade-offs and underlying philosophies offered by different programming languages from the POV of how they actually work under the hood.
In addition to code, I enjoy exploring the philosophies of different software and their communities. This curiosity drives my appreciation for Linux, UNIX and the *nixes, as well as the POSIX standard. I understand and appreciate the historical design patterns and the ideologies that emerged around them. I find more interest in the "why" of a system's design than in the "how" of its execution.
MIPS Compiler Magic: Branch Delay Slot
Consider the following simple C code:
```c
int add(int x, int y) {
    return x + y;
}
```

When we compile it with `-O3` targeting MIPS, we get this assembly:

```asm
add:
    jr   $31
    addu $2,$4,$5
```
Notice something interesting going on here: the addition appears after the return from the function. What's going on?
Just FYI:

- `$31` is the return address (RA)
- `$2` holds the return value (V0)
- `$4` and `$5` are arguments `x` and `y`
- `jr $31` is the jump-register (i.e. return) instruction
At first glance, it looks like the function returns before it ever adds the numbers. However, this is down to a special hardware feature of MIPS called the Branch Delay Slot (BDS).
In a general MIPS pipeline, the instruction immediately following a branch or jump is executed before the jump actually completes.
Hence, the compiler is actually being smart here. Instead of doing something like:

```asm
addu $2,$4,$5   # (cycle 1)
jr   $31        # (cycle 2)
nop             # (delay slot)
```

the compiler reorders the instructions so that the addition happens INSIDE the jump's delay slot. It essentially hides the cost of the addition within the time the processor spends resolving the jump. We get our result in what feels like a single operation.
Honestly, it's pretty interesting to see how the compiler "exploits" specific hardware quirks like this to strip off potential inefficiencies.
PPCI Compiler
Here's a simple C program:
```c
int main() {
    int y = 1;
    int z = 2;
    int x = y + z;
    return 0;
}
```

It's pretty simple: you have an integer `x` holding just the value `y + z`.
Now let's compile this with a different C compiler. Not GCC, not Clang, we will use the ppci compiler.
We'll compile using the flags --std=c89 -O0 (link if you wanna see: https://c.godbolt.org/z/Wq7vcv5ec)
This generates the following assembly:
```asm
section data
section code
main:
    push rbp
    mov rbp, rsp
    sub rsp, 24
main_block0:
    mov r8, 1
    mov [rbp, -8], r8
    mov r8, 2
    mov [rbp, -16], r8
    mov r8, [rbp, -8]
    mov r11, [rbp, -16]
    add r8, r11
    mov [rbp, -24], r8
    mov rax, 0
    jmp main_epilog
main_epilog:
    add rsp, 24
    pop rbp
    ret
```

`main:` is the `int main() {` part and `main_epilog` is the `}` part (the end of the `main()` function).
The main interesting part I wanted to talk about is in main_block0, specifically this part:
```asm
...
main_block0:
    mov r8, 1
    mov [rbp, -8], r8
    mov r8, 2
    mov [rbp, -16], r8
...
```

These are our `int y` and `int z` variables. They're stored onto the stack, using `r8` as a temporary register, at positions `rbp - 8` and `rbp - 16`.
The really interesting part is that it stores the second int at `rbp - 16` instead of what you'd normally expect: `rbp - 12`.
The thing is that most compilers that target x86-64, x86, ARM, MIPS, etc. (like clang and gcc) typically use 4 bytes for an int.
This means that, given a stack pointer `sp`, one of the variables `int y` and `int z` should be stored n bytes away from it (i.e. `sp - n`), and the other should be stored 4 bytes after that (i.e. `sp - (n+4)`), like:
```asm
main:
    mov [rbp - 8], 1
    mov [rbp - 12], 2
```

However, notice that in the case of PPCI, it stores them at `rbp - 8` and `rbp - 16` instead of `rbp - 12`. It's using an offset of 8 bytes instead of 4.
Question: Does PPCI treat ints as 8 bytes instead of 4?
Let's find out.
Here's another C program. We'll compile it with the same compiler flags using PPCI.
```c
int main() {
    return sizeof(int);
}
```

Essentially, our `main()` function should return the size of an int.
Here's the assembly for it:
```asm
section data
section code
main:
    push rbp
    mov rbp, rsp
main_block0:
    mov rax, 8
    jmp main_epilog
main_epilog:
    pop rbp
    ret
```

Notice the instruction `mov rax, 8`. That confirms that an int in PPCI is treated as 8 bytes instead of the usual 4 bytes.
So our previous code seems to be consistent. But wait: most compilers and platforms use 4 bytes for an int, yet PPCI uses 8. Does this mean PPCI is breaking the C standard?
Question: Does PPCI break the C standard?
Let's look at how C standard defines an int:

This screenshot is from the official C standard.
It essentially says that an int must be able to hold AT LEAST any number within the range [−32767, +32767].
The C standard does not say that an int must be 4 bytes; rather, it says the type must cover at least that range, which requires 16 bits. Note that it's the LEAST: an int can theoretically be as large or as small as the compiler/platform wants, it just needs to be able to hold every number within that range in order to call itself an int.
Thus, most systems use their own convention for convenience.
For example, an int in MS-DOS is 2 bytes. The same int in Windows x86 is 4 bytes.
There are two interesting conventions called LP64 and LLP64. They dictate the sizes of longs, long longs, and pointers.
The names themselves are actually shorthands for which types are 8 bytes: L = long, LL = long long, P = pointer. The 64 at the end says they take up 64 bits of storage.
The majority of UNIX and UNIX-like systems use the LP64 model, whereas Windows uses LLP64 to maintain backward compatibility.
This means that a long on macOS would be 8 bytes, but the same long on Windows would be 4 bytes.
Coming back to PPCI, what does this "an int must be at least 2 bytes" have to do with it?
It means that PPCI's int, despite being 8 bytes, is actually standard C: perfectly 100% standard-compliant.
Ok now the bigger question, why does PPCI make it 8 bytes instead of 4 bytes?
I'm not quite sure about this, but I believe that PPCI treats it as 8 bytes just so it can use the value directly in rax for simplicity.
If that's the case, there's a nice implication: since it promotes our usual 4-byte int into an 8-byte stack slot, it keeps the stack perfectly aligned for CPU processing.
This works out nicely because an x64 CPU prefers to read data in aligned 8-byte chunks, rather than the 4-byte chunks of x86 CPUs.
So yeah, whether it was intended or not, it's actually convenient for x64 CPUs.
Why XOR EAX, EAX?
Let's take a very simple C program as follows:
```c
int main(void) {
    return 42;
}
```

Compiling this using clang, targeting x86-64 with the highest optimisation level `-O3`, we get:
```asm
main:
    mov eax, 42
    ret
```

`mov eax, 42` sets eax to 42 and then `ret` returns it (as the program's exit code).
Let's try another number.
```c
int main(void) {
    return 128;
}
```

Compiling this with the same flags as the previous one, we get:
```asm
main:
    mov eax, 128
    ret
```

Notice that both assemblies are pretty much identical, except the 42 changed to 128.
Seems fair enough, let's try the usual return 0;. As you might expect, we're gonna get mov eax, 0.
```c
int main(void) {
    return 0;
}
```

Compiling this with the same setup, we get:
```asm
main:
    xor eax, eax
    ret
```

Interesting, looks like we didn't get the `mov eax, 0` we were expecting. Instead, we got `xor eax, eax`.
Why's it trying to be all fancy here? Why xor instead of the usual mov? Turns out, there's an interesting reason behind this.
If we strip away the fancy stuff, assembly languages are just syntactic sugar for machine code. Instead of manually writing stuff like 48 b8 88 77 66 55 44 33 22 11 and 48 01 d8, we write mov rax, 0x1122334455667788 and add rax, rbx because these are much more readable and easier to maintain.
In this case, when we have mov eax, 128 in x86-64, it usually encodes to something like b8 80 00 00 00, where b8 is the opcode for mov eax and 128 is encoded as 80 00 00 00 (0x80 = 128). The reason it's 80 00 00 00 instead of 00 00 00 80 is because x86 is little endian, hence the least significant byte appears first.
Similarly, mov eax, 0 would encode to b8 00 00 00 00, and it makes perfect sense why.
What about `xor eax, eax`? It actually encodes to just `31 c0`. `31` is the opcode for xor, and `c0` is the ModR/M byte. In binary, `c0` is `11 000 000`, where `11` says that it's a "register to register" kind of operation (no memory addresses involved), and both of the `000`s refer to register eax.
Okay, let's get this straight. mov eax, 0 becomes b8 00 00 00 00, and xor eax, eax becomes 31 c0. Cool, so what? Why does that make xor eax, eax any better?
The idea is that, mov eax, 0 encodes to 5 bytes, whereas 31 c0 is just two bytes. This implies that, when you have a program with thousands of zero-initialisations, the compiler's choice of xor eax, eax over mov eax, 0 actually shaves off several potential kilobytes off your binary size. This makes the binary smaller, and hence you're gonna have better I-cache density, hence fewer cache misses.
In addition to this, modern CPUs don't really "calculate" this xor eax, eax operation. The thing is that there's a logical component in your CPU called "register renamer". Basically, it's sort of a dictionary full of several "idioms". So when it sees a common pattern like xor eax, eax, the register renamer recognises it from its dictionary and immediately marks the register zero instead of sending it to the ALU. This makes the instruction execute at zero latency and also doesn't hog up an execution port.
There's also a related side effect in x86-64: a 32-bit operation like `xor eax, eax` automatically zero-extends into the full 64-bit rax register. This means `xor eax, eax` clears all 64 bits. It's also more efficient than `xor rax, rax`, because the 32-bit form avoids the REX prefix byte, saving even more space.
What's even cooler about this is that this xor operation tells the CPU that eax no longer depends on whatever its previous value was. This actually gives the CPU a green light to perform "out of order" execution, meaning it can do its next task without having to wait for the old eax to retire.
| Aspect | Choice |
|---|---|
| Environment | My Dotfiles |
| Text Editor | Neovim |
| OS | Void Linux |
| WM | Openbox |
| Shell | Zsh |

