Thang Duong

PhD Candidate, Computer Science — University of Arizona

Expected Dec 2026 · Advisor: Prof. Chicheng Zhang · Seeking industry research / applied-science roles

I build sample-efficient learning algorithms — reinforcement learning, multi-armed bandits, and LLM reasoning — that turn theory into real benchmark wins.

Email · Resume · Google Scholar · GitHub · LinkedIn

NeurIPS 2024 first author · 2× regret reduction on real mmWave benchmarks · RL-for-LLMs: warm-starting & process reward models

About

I'm a final-year CS PhD candidate at the University of Arizona, advised by Prof. Chicheng Zhang. My research makes sequential decision-making more sample-efficient by injecting structure — shared representations, domain physics, or LLM priors — into reinforcement learning and bandit algorithms.

I work across the full stack of research: provable guarantees (a NeurIPS 2024 first-author result), real-benchmark gains (2× regret reduction on mmWave beam alignment), and reproducible systems (modular evaluation suites, large-scale experiments on H100 GPUs). Before my PhD I did industry research at Qualcomm (formerly VinAI Research).

I'm now looking for industry research and applied-science roles where this transfers directly — RL for LLMs (post-training, reasoning, RLHF), recommendation and personalization, and adaptive experimentation.

Selected Research

Published NeurIPS 2024

Beyond Task Diversity: Provable Representation Transfer for Sequential Multi-Task Linear Bandits

Thang Duong, Zhi Wang, Chicheng Zhang

First provable method to transfer a shared low-rank representation across a stream of bandit tasks WITHOUT the standard task-diversity assumption — making lifelong bandit transfer applicable to real-world task streams.

We develop an algorithm (BOSS) that learns and transfers a low-rank representation on the fly, and we prove a regret guarantee for ellipsoid action sets. Prior work, by contrast, required the tasks to uniformly span the shared subspace.
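In symbols, one common way to write this setting is sketched below. The notation (d, m, N, B, w_n, and the action set A) is assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: N tasks arrive in sequence; each is a d-dimensional
% linear bandit played for T rounds with unknown parameter \theta_n.
\theta_n = B\, w_n, \qquad
B \in \mathbb{R}^{d \times m},\quad w_n \in \mathbb{R}^{m},\quad m \ll d
\qquad \text{(shared low-rank representation)}

% Meta-regret: per-task regret summed over the whole task sequence,
% where a_{n,t} is the action played in round t of task n.
\mathrm{Regret}_N \;=\; \sum_{n=1}^{N} \sum_{t=1}^{T}
  \Big( \max_{a \in \mathcal{A}} \langle a, \theta_n \rangle
        - \langle a_{n,t}, \theta_n \rangle \Big)
```

Informally, task diversity asks that the coefficient vectors w_n seen so far cover all m directions of the subspace well, which is the requirement this result removes.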

Figure: meta-regret accumulated across the sequence of tasks.

Figure: average regret over the horizon T, across tasks.

PDF · arXiv · Code

BibTeX

@article{duong2024beyond,
  title   = {Beyond task diversity: provable representation transfer for sequential multitask linear bandits},
  author  = {Duong, Thang and Wang, Zhi and Zhang, Chicheng},
  journal = {Advances in Neural Information Processing Systems},
  volume  = {37},
  pages   = {37791--37822},
  year    = {2024}
}

Published WiOpt 2026

Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications

Hao Qin*, Thang Duong*, Ming F. Li, Chicheng Zhang

2× reduction in beam-alignment regret on the real-world DeepMIMO and DeepSense6G benchmarks — robust where unimodality/multimodality assumptions break.

We cast mmWave beam alignment as a parametric bandit tied to the Phase-Retrieval structure of the channel and design two algorithms (PR-ETC, PR-GREEDY) that exploit sparse multipath structure without restrictive assumptions on the reward function.
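To make the phase-retrieval connection concrete, here is a minimal sketch of a reward model in which the learner only observes received power, so the phase of the channel is lost. It assumes a half-wavelength uniform linear array, a noise-free mean reward, and a two-path channel. These modeling choices and all function names are illustrative assumptions, not the paper's implementation of PR-ETC or PR-GREEDY.

```python
import cmath
import math


def steering_vector(angle_rad, n_antennas):
    """Unit-norm response of a half-wavelength uniform linear array (assumed model)."""
    scale = 1.0 / math.sqrt(n_antennas)
    return [scale * cmath.exp(1j * math.pi * k * math.sin(angle_rad))
            for k in range(n_antennas)]


def multipath_channel(paths, n_antennas):
    """Sum a few (complex_gain, angle_rad) paths into one channel vector h."""
    h = [0j] * n_antennas
    for gain, angle_rad in paths:
        response = steering_vector(angle_rad, n_antennas)
        h = [h_k + gain * r_k for h_k, r_k in zip(h, response)]
    return h


def mean_reward(beam, h):
    """Noise-free received power |<beam, h>|^2; the phase of the inner product is discarded."""
    inner = sum(b_k.conjugate() * h_k for b_k, h_k in zip(beam, h))
    return abs(inner) ** 2


if __name__ == "__main__":
    n = 16
    # Hypothetical sparse channel: one strong path and one weaker path.
    h = multipath_channel([(1.0, 0.3), (0.4, -0.6)], n)
    for angle in (-0.6, 0.0, 0.3):
        print(angle, round(mean_reward(steering_vector(angle, n), h), 3))
```

Because the reward depends on h only through a squared magnitude, multiplying h by any unit-modulus constant leaves every reward unchanged. That invariance is the same ambiguity that appears in phase retrieval.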

arXiv

Working paper

Sample-Efficient Reinforcement Learning by Warm-Starting with LLMs

Thang Duong, et al.

Up to 3× faster convergence on OpenAI Gym in our experiments, by using an LLM's in-context decisions to collect warm-start data for any off-the-shelf RL algorithm.

When no offline dataset exists, we use an LLM to collect a small dataset that covers a good policy, then warm-start RL — combining low cumulative regret with high sample efficiency. (Details and code available upon publication.)

Working paper

Efficient Algorithms for Lifelong Representation Learning in Linear Bandits Beyond Task Diversity

Thang Duong, et al.

A computationally efficient algorithm (BRESS) that drops the task-diversity assumption, stays polynomial-time, and adds adversarial robustness, via a black-box reduction to online PCA.

By reducing sequential multi-task linear bandits to online principal component analysis, any low-competitive-ratio online-PCA subroutine yields low meta-regret. (Details available upon publication.)

Published RLC 2024

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

MJ Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, M. Ghavamzadeh

Meta-learns across a sequence of bandit tasks whose optimal arms lie in a small unknown subset, beating the naive Õ(√(KNT)) baseline.

Via a reduction to bandit submodular maximization, the algorithm exploits shared optimal-arm structure across tasks in both the meta-learning and non-stationary settings.
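The submodular structure can be illustrated with a small full-information sketch: if the mean reward of every arm in every task were known, the value of a subset of arms would be the total reward of the best arm in that subset for each task, and a greedy rule could pick the subset. This is only an illustration under that assumption. The paper's setting has bandit feedback with unknown means, and the function names and example numbers below are hypothetical.

```python
def coverage_value(subset, task_means):
    """Sum over tasks of the best mean reward available inside `subset`."""
    if not subset:
        return 0.0
    return sum(max(row[arm] for arm in subset) for row in task_means)


def greedy_subset(task_means, size):
    """Greedily add the arm with the largest marginal gain in coverage value."""
    n_arms = len(task_means[0])
    chosen = []
    for _ in range(size):
        candidates = [arm for arm in range(n_arms) if arm not in chosen]
        best = max(candidates,
                   key=lambda arm: coverage_value(chosen + [arm], task_means))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical instance: 4 tasks, 5 arms, optimal arm is always arm 1 or arm 3.
    task_means = [
        [0.2, 0.9, 0.1, 0.3, 0.2],
        [0.1, 0.8, 0.2, 0.4, 0.3],
        [0.3, 0.2, 0.1, 0.9, 0.2],
        [0.2, 0.3, 0.3, 0.7, 0.1],
    ]
    subset = greedy_subset(task_means, size=2)
    print(sorted(subset), coverage_value(subset, task_means))
```

The value function here is monotone and submodular (adding an arm helps less once a good arm for the same tasks is already in the subset), which is why greedy selection is a natural baseline for it.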

Figure: regret on the task (meta-learning) experiment.

Figure: regret on the optimal-arm subset experiment.

PDF · arXiv · Code

Interactive: bandit regret

A core theme of my work is sample efficiency — how fast a learner stops paying for exploration. Below, two classic strategies play the same multi-armed bandit live in your browser: UCB (adaptive, logarithmic regret) vs. Explore-Then-Commit (explores, then commits). Drag the sliders to see how the gap between arms and the horizon change the cost of learning. (Illustration — simulated here, not paper data.)

Settings shown: Arms (K) = 5 · Gap between arms (Δ) = 0.15 · Horizon (T) = 800

Strategies compared: UCB · Explore-Then-Commit
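For readers without the live demo, the comparison can be approximated offline with the minimal sketch below. It assumes Bernoulli rewards, one best arm with every other arm worse by exactly Δ, the UCB1 bonus sqrt(2 ln t / n), and a hand-picked exploration budget for Explore-Then-Commit. All of these are illustrative choices and not necessarily what the page's simulation uses.

```python
import math
import random


def make_means(n_arms, gap):
    """One best arm; every other arm is worse by exactly `gap` (an assumed instance)."""
    return [0.5 + gap / 2] + [0.5 - gap / 2] * (n_arms - 1)


def run_ucb(means, horizon, rng):
    """UCB1: pull each arm once, then maximize empirical mean plus a confidence bonus."""
    n_arms = len(means)
    counts, sums, regret = [0] * n_arms, [0.0] * n_arms, 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1
        else:
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]  # pseudo-regret of this pull
    return regret


def run_etc(means, horizon, rng, pulls_per_arm):
    """Explore-Then-Commit: round-robin for pulls_per_arm >= 1 pulls each, then commit."""
    n_arms = len(means)
    counts, sums, regret = [0] * n_arms, [0.0] * n_arms, 0.0
    committed = None
    for t in range(horizon):
        if t < n_arms * pulls_per_arm:
            arm = t % n_arms
        else:
            if committed is None:
                committed = max(range(n_arms), key=lambda a: sums[a] / counts[a])
            arm = committed
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]
    return regret


if __name__ == "__main__":
    means = make_means(n_arms=5, gap=0.15)
    runs = 200
    ucb = sum(run_ucb(means, 800, random.Random(s)) for s in range(runs)) / runs
    etc = sum(run_etc(means, 800, random.Random(s), 20) for s in range(runs)) / runs
    print(f"average regret over {runs} runs: UCB={ucb:.1f}, ETC={etc:.1f}")
```

Changing `gap`, `horizon`, or `pulls_per_arm` plays the role of the sliders: Explore-Then-Commit pays a fixed exploration cost up front and risks committing to the wrong arm when the gap is small, while UCB adapts its exploration as evidence accumulates.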

Publications

Beyond Task Diversity: Provable Representation Transfer for Sequential Multi-Task Linear Bandits

Thang Duong, Zhi Wang, Chicheng Zhang

NeurIPS, 2024 [PDF] [arXiv] [Code]

Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications

Hao Qin*, Thang Duong*, Ming F. Li, Chicheng Zhang

WiOpt, 2026 [arXiv]

Sample-Efficient Reinforcement Learning by Warm-Starting with LLMs

Thang Duong, et al.

Working paper

Efficient Algorithms for Lifelong Representation Learning in Linear Bandits Beyond Task Diversity

Thang Duong, et al.

Working paper

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

MJ Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, M. Ghavamzadeh

RLC, 2024 [PDF] [arXiv] [Code]

Association of MRI-defined Structure Features at Baseline with Knee Pain Trajectories

S. Liu, X. Sun, Y. Ge, Thang Duong, C.K. Kwoh

ACR Convergence, 2024 [Link]

Deep Regression for Precise Geometric Dimension Measurement

Thang Duong, Binh Nguyen Duc, Phuong Le Khac, Ngoc Tu Nguyen, Mai Nguyen Thi Phuong

J. Korean Soc. Precis. Eng., 2019 [PDF] [DOI]

Analyzing Seismic Signal using Support Vector Machine for Vehicle Motion Detection

Thang Duong, Nguyen Thi Phuong Mai

J. Industrial Networks and Intelligent Systems, 2019 [PDF]

Experience

Aug 2022 – Present

Tucson, AZ

Graduate Research Assistant · University of Arizona

Proved a regret guarantee that eliminates the task-diversity assumption for sequential multi-task representation transfer in bandits (NeurIPS 2024, first author).
Cut beam-alignment regret 2× on the DeepMIMO and DeepSense6G benchmarks with physics-informed bandit algorithms (cross-team with Prof. Ming Li's ECE lab).
Built a novel RL warm-start pipeline using LLM-collected demonstrations, reaching up to 3× faster convergence on OpenAI Gym in our experiments.
Built a modular evaluation suite and release scripts for large-scale H100 experiments.
Mentored one undergraduate.

May 2025 – Aug 2025

Chicago, IL

Visiting Student · Toyota Technological Institute at Chicago (TTIC)

Studied Process Reward Models (PRMs) for LLM reasoning by framing them within the Actor-Critic framework, toward provably improving PRM-guided search-based policies (with Prof. Chicheng Zhang et al.).

Dec 2019 – Jun 2022

Hanoi, Vietnam

Research Resident · Qualcomm (formerly VinAI Research)

Bandit meta-learning: built a modular multi-task bandit codebase leveraging shared structure, with improved sample efficiency across a full experiment suite (RLC 2024). Mentored by Yasin Abbasi-Yadkori, Dinh Phung, and Tung Pham.
Prototyped active-learning strategies for domain adaptation and model warm-starting.
Investigated Sim-to-Real transfer with a proof-of-concept domain-adaptation method in the CARLA simulator.

May 2018 – Jun 2019

Hanoi, Vietnam

AI Team Leader, Scrum Master · NAL Vietnam JSC

Integrated NLP models (intent classification, entity recognition) into Chatops, a commercial business chat interface.
Led a team of six to deploy facial-recognition models for a parent–child matching product across six kindergarten locations.

Education

2022 – exp. Dec 2026

Ph.D. in Computer Science

University of Arizona

Advisor: Prof. Chicheng Zhang. Focus: high-dimensional interactive learning with domain-specific inductive biases.

2012 – 2018

B.Eng. in Mechatronics Engineering (Advanced Program)

Hanoi University of Science and Technology

Skills

Research

Reinforcement Learning · Multi-armed Bandits · Online Learning · Representation / Transfer Learning · Meta-learning · LLM reasoning (RLHF / PRM) · Statistics & Optimization

Engineering

Python · PyTorch · TensorFlow · HuggingFace / Transformers · RLlib · OpenAI Gym · CUDA · Docker / Linux / Git

Get in touch

I'm looking for industry research and applied-science roles starting 2027. The fastest way to reach me is email.

thangduong@arizona.edu · Resume (PDF) · Google Scholar · GitHub · LinkedIn

Tucson, AZ (open to relocation, US)