Ninad Tongay
- BEng (Savitribai Phule Pune University, 2021)
Topic
Dynamic and Cost-Efficient Deployment of Large Language Models Using Uplift Modeling and Multi-Armed Bandits
Department of Computer Science
Date & location
- Thursday, April 24, 2025
- 10:00 A.M.
- Engineering & Computer Science Building
- Room 468 and Virtual
Reviewers
Supervisory Committee
- Dr. Sean Chester, Department of Computer Science, University of Victoria (Supervisor)
- Dr. Alex Thomo, Department of Computer Science, University of Victoria (Member)
External Examiner
- Dr. Issa Traore, Department of Electrical and Computer Engineering, University of Victoria
Chair of Oral Examination
- Dr. Tom Ruth, Department of Physics and Astronomy, University of Victoria
Abstract
The rapid advancement of large language models (LLMs) has brought about a new class of challenges in balancing performance, cost, and scalability. As organizations seek to deploy these models in production environments, a key question arises: how can we maintain the quality of responses delivered by advanced LLMs while reducing the significant computational and financial costs associated with them? Relying entirely on high-end models like GPT-4 can ensure quality but often proves economically unsustainable, while defaulting to smaller, cheaper models may sacrifice performance and user satisfaction. This tension calls for more intelligent decision-making strategies that dynamically allocate queries to the most appropriate model depending on the task's complexity and expected value. To address this, we propose a hybrid decision-making framework that brings together causal uplift modeling and multi-armed bandits to drive cost-aware, adaptive model selection. Uplift modeling enables the system to reason causally about the benefit of using a stronger model for a specific query, offering interpretable, feature-informed decisions from the outset; these predictions serve as a strong offline prior. The bandit component builds on this by adapting the policy in real time: it learns from feedback, corrects for model mispredictions, and responds to shifts in query distribution or underlying model performance. This fusion of causal inference and online learning yields a system that is not only efficient and scalable but also interpretable and responsive to real-world variability. We validate the approach through controlled simulations that mimic real deployment conditions, including concept drift, shifts in user query types, and the emergence of unseen domains. Across these scenarios, the hybrid framework consistently achieves a more favorable balance between quality and cost than baseline strategies. Furthermore, the system is designed to expose its decision-making logic, offering transparency through uplift scores and feature-based justifications, a critical requirement for high-stakes AI deployments. By combining performance, cost-awareness, and explainability, this work contributes a practical solution to the growing need for intelligent model orchestration in the multi-LLM landscape.
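
As a rough illustration of the style of routing the abstract describes, the minimal Python sketch below blends a fixed offline uplift estimate (used as a prior) with an epsilon-greedy bandit update over two arms, a cheap and a strong model. The uplift_model callable, the reward definition, and all parameter values here are hypothetical stand-ins for exposition; this is not the thesis's actual algorithm or implementation.

    import random

    class HybridRouter:
        """Sketch: route each query to a cheap or strong LLM by blending an
        offline uplift prior with online epsilon-greedy bandit estimates."""

        def __init__(self, uplift_model, cost_weight=0.5, epsilon=0.1):
            self.uplift_model = uplift_model  # hypothetical: features -> predicted quality gain of strong model
            self.cost_weight = cost_weight    # trade-off between quality gain and extra cost
            self.epsilon = epsilon            # exploration rate of the bandit layer
            self.counts = {"cheap": 0, "strong": 0}      # pulls per arm
            self.values = {"cheap": 0.0, "strong": 0.0}  # running mean reward per arm

        def choose(self, features, strong_extra_cost=1.0):
            if random.random() < self.epsilon:
                return random.choice(["cheap", "strong"])  # explore occasionally
            # exploit: offline prior (uplift minus cost penalty) plus observed online value
            prior = self.uplift_model(features) - self.cost_weight * strong_extra_cost
            score_strong = prior + self.values["strong"]
            return "strong" if score_strong > self.values["cheap"] else "cheap"

        def update(self, arm, reward):
            # incremental mean update from observed feedback (e.g., quality minus cost)
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    # toy usage: a made-up uplift model that predicts higher gain for longer queries
    router = HybridRouter(uplift_model=lambda f: min(1.0, f["query_length"] / 100))
    arm = router.choose({"query_length": 80})
    router.update(arm, reward=0.7)  # in practice, a measured quality/cost signal

In a real deployment the reward would be a measured quality score net of inference cost, and the influence of the offline prior would typically diminish as online evidence accumulates, which is what allows the bandit layer to correct the uplift model's mispredictions under drift.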