ML systems / compilers / distributed training

Building machine learning systems where compiler passes, GPU kernels, and training infrastructure meet.

I am a software engineer and Georgia Tech MSCS student focused on performance-sensitive ML systems. My work spans MLIR and LLVM compiler tooling, distributed training pipelines, and systems for efficient inference.

Portrait of Mihir Waknis

Selected work

Recent projects that show how I approach systems-heavy ML work.

Each project is structured around a concrete system problem, an implementation decision, and a measurable or verifiable outcome.

MLIR Softmax Backend

An end-to-end compiler backend that takes MLIR softmax input through lowering, NVPTX codegen, PTX emission, and CUDA Driver kernel launch.

Repository

What I built

Built the C++17 pipeline, a custom LICM strength-reduction pass, CUDA runtime integration, and FileCheck plus CTest coverage around the full compiler path.

Impact

When the denominator is loop-invariant, the pass replaces N per-iteration divisions with one division and N multiplications; correctness of the transformed kernel is then verified on the GPU.
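The transformation itself is easy to see outside MLIR. This is an illustrative Python sketch of the same strength reduction applied to softmax, not the actual compiler pass: the loop-invariant denominator is divided once, and each iteration multiplies by the reciprocal instead of dividing.

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]
    denom = sum(exps)
    return [e / denom for e in exps]      # N divisions, one per element

def softmax_strength_reduced(xs):
    exps = [math.exp(x) for x in xs]
    inv = 1.0 / sum(exps)                 # 1 division, hoisted out of the loop
    return [e * inv for e in exps]        # N multiplications
```

The two versions can differ by a few ULPs, since `e * (1/d)` is not bit-identical to `e / d`, which is why the pipeline verifies the optimized kernel against a reference on real inputs.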

C++17 · MLIR · LLVM · PTX · CUDA Driver API

Open RLHF Pipeline

A reproducible RLHF training stack covering supervised fine-tuning, reward modeling, PPO or DPO policy optimization, evaluation hooks, and a chat demo.

Repository

What I built

Organized data prep, training scripts, hardware configs, web demo tooling, and kernel experiments into a single repository that is easy to run and extend.

Impact

Documents a full alignment workflow that can run on one consumer GPU and scale to multi-GPU setups, lowering the barrier to hands-on RLHF experimentation.
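As a reference point for the DPO stage, here is a minimal sketch of the standard DPO preference loss on a single pair, written with plain floats rather than the repository's actual training code. The inputs are summed token log-probabilities under the policy and the frozen reference model; the function name and signature are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy-vs-reference log-ratios."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference does, which is what makes the method usable on a single GPU: no reward model rollout loop is needed during optimization.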

Python · PyTorch · TRL · Gradio · Triton

Image Captioning Performance with Attention and Transformer-Based Models

A comparative inference study of four image-captioning architectures on MS COCO using the Karpathy split, measuring caption quality against runtime tradeoffs.

Repository

What I built

Trained and evaluated a CNN-LSTM encoder-decoder, Bottom-Up Top-Down attention, BLIP transformer, and CLIP-based transformer mapping under a common evaluation setup.

Impact

The study found that pretrained transformer models delivered the strongest caption quality, while Bottom-Up Top-Down produced the fastest inference for real-time or resource-constrained deployment.

PyTorch · MS COCO · Transformers · Attention Models · Computer Vision

Experience

Work shaped around production systems, infrastructure, and research tooling.

Software Engineer Intern - Systems and Infrastructure

LinkedIn

May 2026 - Aug 2026

  • Incoming summer 2026 internship focused on systems and infrastructure.

Software Engineer Co-op

Itron, Inc. / Full-time alongside university studies

Sep 2021 - Present

  • Restructured a C++ time-series ingestion pipeline from array-of-structs to struct-of-arrays layout to enable SIMD vectorization across 10M+ smart meter events, reducing L2 cache misses by about 40% and end-to-end processing time by 25%.
  • Profiled a demand forecasting inference pipeline using Nsight Systems to trace kernel launch timelines and CPU-GPU synchronization points, then reduced idle time by overlapping host-to-device transfers with computation via CUDA streams.
  • Replaced PyTorch-dispatched elementwise preprocessing ops with fused CUDA kernels, eliminating per-op launch overhead. Validated memory bandwidth utilization and occupancy against roofline targets using Nsight Compute.
  • Built an automated performance regression framework in CI that tracked p95 latency and peak GPU memory allocation per commit, catching regressions before merge and cutting deployment cycles from one week to two days.
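The array-of-structs to struct-of-arrays change in the first bullet is a C++ memory-layout optimization, but the idea sketches cleanly in Python (field names here are hypothetical, not the actual meter schema). In the SoA form, each field lives in one contiguous sequence, so a per-field pass streams through a single array instead of striding over interleaved records, the access pattern SIMD lanes and cache lines want.

```python
# Array-of-structs: one record per event, fields interleaved.
events_aos = [(1000 + i, i * 0.5, i % 3) for i in range(8)]  # (ts, kwh, status)

# Struct-of-arrays: one contiguous sequence per field.
timestamps = [e[0] for e in events_aos]
kwh_values = [e[1] for e in events_aos]
statuses   = [e[2] for e in events_aos]

# A per-field reduction now touches only the array it needs,
# rather than pulling every record's unused fields through the cache.
total_kwh = sum(kwh_values)
```

In the C++ pipeline the same split lets the compiler auto-vectorize the hot loops and avoids loading timestamp and status bytes when only the kWh column is being processed.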

AI/ML Undergraduate Student Researcher

Arizona Cancer Evolution Center - ASU

Jan 2024 - May 2025

  • Built a retrieval-augmented generation pipeline for medical literature analysis, with an emphasis on throughput and reproducibility.
  • Designed an embedding ingestion and indexing pipeline for 15,000+ papers, using batching, caching, and backpressure to keep GPU utilization stable.
  • Implemented ranking and re-ranking features using citation metrics, publication dates, and author signals, then evaluated their effect on retrieval quality.
  • Benchmarked multiple LLM backends for medical QA, tracked latency and cost per query, and received the Global Impact Award at the ASU Fusion Symposium.
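The batching-and-backpressure pattern from the ingestion pipeline can be sketched with the standard library; every name below is hypothetical rather than the project's API. Fixed-size batches keep the embedder's input shape steady, and a bounded queue makes the producer block when the consumer falls behind, instead of buffering unboundedly.

```python
from queue import Queue

def make_batches(items, batch_size):
    """Group items into fixed-size batches for the embedding model."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Bounded queue = backpressure: put() blocks once 4 batches are pending,
# throttling ingestion to the embedder's pace.
pending = Queue(maxsize=4)
for b in make_batches(range(10), batch_size=3):
    pending.put(b)
```

With a separate consumer thread draining `pending`, GPU utilization stays steady because the producer can never race more than a few batches ahead.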

Capabilities

Technical areas I keep returning to across projects.

The strongest signals are the projects above. These keywords are here to make the pattern legible at a glance.

GPU Programming · Compiler Tooling · Distributed Training · Inference Optimization · MLIR / LLVM · PyTorch / Triton · Systems Profiling · Reliability Engineering

Education

Academic work that supports the systems focus.

Georgia Institute of Technology

M.S. Computer Science, Computing Systems

May 2027 (expected)

Coursework includes deep learning, GPU programming, high-performance computing, distributed systems, and compiler theory.

Arizona State University

B.S. Data Science, Computer Science concentration

May 2025

Focused on statistical learning, systems programming, and undergraduate research in LLM and retrieval systems.

Connect

Email is the fastest route for internships, full-time roles, and collaboration.

I am most interested in machine learning systems work that sits close to compiler infrastructure, GPU execution, or distributed training performance.