I'm a systems-oriented engineer with a deep curiosity about how abstraction layers interact, from low-level physics to high-level software and everything in between. My background spans atomic-scale materials science, semiconductor process development, and performance-focused software engineering.
After over a decade at Micron Technology developing novel materials and scalable fabrication processes that helped enable multiple generations of high-performance DRAM and NAND Flash memory, I shifted focus to software.
Now I work near the software-hardware boundary — building tools and infrastructure that emphasize modularity, performance, and architectural clarity. I thrive in environments where performance bottlenecks aren’t just bugs to fix, but signposts to deeper design opportunities. I'm especially interested in how these principles play out in emerging hardware and next-generation AI systems.
parallel-prefix-engine
High-performance 2D prefix sum engine with CUDA and MPI backends. CUDA kernels use tile-based parallelism and shared memory to cut global memory traffic, sustaining high throughput at large input sizes. Containerized for easy deployment; a plugin-style architecture supports drop-in backend extensions. A pure-Python reference of the underlying recurrence is sketched below.
C++
CUDA
MPI
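As a reference for what the engine computes, here is a pure-Python sketch of the 2D inclusive prefix sum recurrence; the function name and example values are illustrative only, and the repository's CUDA/MPI backends implement the same recurrence with tiling and shared memory rather than this loop.

```python
def prefix_sum_2d(grid):
    """Reference 2D inclusive prefix sum: out[i][j] = sum of grid[0..i][0..j]."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Inclusion-exclusion recurrence; a tiled GPU kernel evaluates the
            # same relation in parallel, tile by tile, out of shared memory.
            out[i][j] = (grid[i][j]
                         + (out[i - 1][j] if i > 0 else 0)
                         + (out[i][j - 1] if j > 0 else 0)
                         - (out[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return out

print(prefix_sum_2d([[1, 2], [3, 4]]))   # [[1, 3], [4, 10]]
```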
docktuna
Fully containerized template for running the Optuna hyperparameter tuning framework with PostgreSQL RDB storage, powered by Docker, Conda, and Poetry. Built for reproducibility, GPU support, and secure, scalable experiment tracking. Includes full test coverage and API documentation. A minimal usage sketch appears below.
Python
Docker
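For context, a minimal sketch of the kind of Optuna study docktuna is designed to host, assuming PostgreSQL RDB storage; the study name, connection string, and objective below are placeholders rather than docktuna's actual configuration.

```python
import optuna

def objective(trial):
    # Placeholder objective: minimize a simple quadratic in one hyperparameter.
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

# RDB storage is what lets multiple containers or workers share one study.
study = optuna.create_study(
    study_name="demo-study",                              # placeholder
    storage="postgresql://user:password@db:5432/optuna",  # placeholder DSN
    load_if_exists=True,
    direction="minimize",
)
study.optimize(objective, n_trials=20)
print(study.best_params)
```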
dispatch-model-benchmarks
Benchmarking suite comparing runtime polymorphism (virtual functions) with compile-time alternatives (CRTP, C++ Concepts) across multiple compute functions and optimization levels. Shows faster execution and fewer memory accesses for the compile-time options at maximum optimization, and reveals a significant CRTP advantage over C++ Concepts at the mid-range optimization levels commonly used in production builds.
C++
hpc-collection
A set of high-performance computing (HPC) projects demonstrating optimization of memory-bound and compute-intensive workloads across distributed systems. Includes tiled matrix operations, Gauss-Seidel synchronization, an MPI-based gas simulation, and a hybrid MPI/OpenMP histogram sort. Highlights cache locality tuning, inter-process communication strategies, and scaling analysis on a production HPC cluster. A toy sketch of the histogram sort's key redistribution appears below.
C++
MPI
OpenMP
SLURM
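To illustrate one of the communication strategies, here is a toy sketch of the key redistribution step behind a histogram (bucket) sort, written with mpi4py for brevity even though the repository's implementation is C++ with MPI/OpenMP; the uniform bucket rule stands in for the sampled splitters a real histogram sort would compute.

```python
from mpi4py import MPI   # run with: mpirun -n 4 python redistribute_sketch.py
import random

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank starts with local keys drawn from a known global range [0, 1000).
local_keys = [random.randint(0, 999) for _ in range(1000)]

# Route each key to the rank that owns its slice of the key range.
outgoing = [[] for _ in range(size)]
for key in local_keys:
    outgoing[key * size // 1000].append(key)

# All-to-all exchange: afterwards every key on this rank belongs to its bucket,
# so a local sort finishes the job.
incoming = comm.alltoall(outgoing)
mine = sorted(key for chunk in incoming for key in chunk)
print(f"rank {rank}: owns {len(mine)} keys after redistribution")
```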
icu-deep-learning
AI model that uses intensive care unit (ICU) lab and vital sign data to predict patient outcomes. Achieves a 90% faster data pipeline and 60% better predictive performance than prior studies of the same dataset. Includes a custom PyTorch module for adversarial attacks, enabling batch-mode evaluation of model vulnerabilities; a generic batch-attack sketch appears below.
PyTorch
SQL
Docker
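As a generic illustration of batch-mode adversarial evaluation in PyTorch, here is an FGSM-style sketch; the toy model, feature count, and attack function are assumptions for demonstration and are not the repository's actual module or data interfaces.

```python
import torch

def fgsm_batch(model, inputs, targets, loss_fn, epsilon=0.01):
    """Perturb a whole batch in the direction that increases the loss (FGSM)."""
    inputs = inputs.clone().detach().requires_grad_(True)
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # One signed-gradient step per sample, applied to the full batch at once.
    return (inputs + epsilon * inputs.grad.sign()).detach()

# Toy stand-in: a linear classifier over 16 lab/vital features, batch of 32.
model = torch.nn.Sequential(torch.nn.Linear(16, 2))
x = torch.randn(32, 16)
y = torch.randint(0, 2, (32,))
x_adv = fgsm_batch(model, x, y, torch.nn.CrossEntropyLoss())
print((model(x_adv).argmax(dim=1) != y).float().mean())  # post-attack error rate
```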
xiangqigame
AI engine for Xiangqi (Chinese Chess) with a C++ core and a Python wrapper providing a command-line interface and data analysis suite. Implements a plugin-style architecture with compile-time polymorphism for performance-critical components, achieving a 10x speedup in decision-making during gameplay.
C++
Python
srepkg
Wraps CLI-enabled Python packages with custom build-system files so they install into isolated virtual environments, letting package distributors guard against downstream dependency conflicts. Includes an automated test suite with 99% code coverage. Available on PyPI. A conceptual sketch of the isolation idea appears below.
Python
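Purely as a conceptual illustration of the isolation idea (not srepkg's implementation or interface), the sketch below installs a CLI package into its own virtual environment and invokes it by absolute path, so its dependencies never touch the surrounding environment.

```python
# Conceptual sketch only: one package, one private venv, CLI invoked by path.
import subprocess
import venv
from pathlib import Path

env_dir = Path("isolated-env")                 # hypothetical location
venv.EnvBuilder(with_pip=True).create(env_dir)

bin_dir = env_dir / "bin"                      # "Scripts" on Windows
subprocess.run([str(bin_dir / "pip"), "install", "black"], check=True)
subprocess.run([str(bin_dir / "black"), "--version"], check=True)
```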
resticlvm
Configuration-driven CLI tool for atomic, incremental Linux backups using LVM snapshots and Restic. Follows a “Bash executes, Python orchestrates” model; a minimal orchestration sketch appears below. Installable via pip, with no dependencies outside the Python standard library.
Shell
Python
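A minimal sketch of the “Bash executes, Python orchestrates” pattern, using placeholder volume, mount point, and repository names rather than resticlvm's actual configuration or interface.

```python
import subprocess

def run(cmd):
    # Python decides what runs and in what order; external tools do the work.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Placeholder names; a real run would take these from configuration.
vg, lv, snap = "vg0", "data", "data_snap"
mount_point, repo = "/mnt/data_snap", "/srv/restic-repo"

run(["lvcreate", "--snapshot", "--size", "5G", "--name", snap, f"{vg}/{lv}"])
run(["mount", f"/dev/{vg}/{snap}", mount_point])
try:
    run(["restic", "-r", repo, "backup", mount_point])  # incremental backup of the snapshot
finally:
    run(["umount", mount_point])
    run(["lvremove", "-y", f"/dev/{vg}/{snap}"])
```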
systems-workshop
A set of tools and demos exploring systems programming and infrastructure. Topics include process and signal management, low-level I/O, access control, environment conversion, and snapshot-based backup automation. Projects are organized as Git submodules for modular exploration and reuse. A small signal-handling sketch appears below.
C
Shell
Python
Assembly
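As a small, self-contained taste of the process and signal management topics, here is a POSIX-only Python sketch; it is illustrative rather than taken from the repository, whose demos are largely C and shell.

```python
import os
import signal
import time

def on_usr1(signum, frame):
    # Runs asynchronously in the parent when SIGUSR1 arrives.
    print(f"parent {os.getpid()} caught {signal.Signals(signum).name}")

signal.signal(signal.SIGUSR1, on_usr1)

pid = os.fork()                               # POSIX-only: duplicate this process
if pid == 0:
    os.kill(os.getppid(), signal.SIGUSR1)     # child signals the parent
    os._exit(0)
else:
    time.sleep(0.5)                           # give the signal time to arrive
    os.waitpid(pid, 0)                        # reap the child to avoid a zombie
```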


