PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

Researchers conducted a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) on the Qwen2.5-3B model for telecommunications customer support, generating approximately 30,000 training examples across 1,560 problem scenarios from a glossary of 52 industry terms. Across the 16 LoRA configurations evaluated, quantitative and qualitative performance diverged sharply: the configuration with the best validation loss (0.5024) ranked only 6th-7th in the LLM-as-a-judge evaluation, while the configuration with the worst validation loss (0.6807) ranked first according to both the GPT-5.2 and Claude 4.5 Sonnet judges. The findings indicate that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and they motivate energy-performance trade-off analysis for sustainable LLM deployment.
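To illustrate what a combinatorial construction of scenarios can look like, here is a minimal Python sketch. The summary does not say how the 1,560 scenarios are derived from the 52 glossary terms; one factorization consistent with the reported counts is 52 × 30, so the sketch assumes each term is crossed with 30 problem types. The number 30, the placeholder names, and the `build_scenarios` helper are all hypothetical and are not taken from the paper.

```python
from itertools import product


def build_scenarios(terms, problem_types):
    """Cross every glossary term with every problem type (Cartesian product)."""
    return [{"term": t, "problem": p} for t, p in product(terms, problem_types)]


# Placeholder inputs: the real glossary and problem taxonomy are not given here.
terms = [f"term_{i}" for i in range(52)]                   # 52 glossary terms (from the source)
problem_types = [f"problem_type_{j}" for j in range(30)]   # 30 is an assumption

scenarios = build_scenarios(terms, problem_types)
print(len(scenarios))           # number of distinct scenarios
print(30_000 / len(scenarios))  # average training examples per scenario
```

Under this assumption, roughly 19 generated dialogues per scenario would account for the reported total of about 30,000 examples.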

arXiv:2606.05176v1

Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.
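Because the study varies both hyperparameters and target modules, it can help to see how those two choices drive the number of trainable parameters. For a frozen weight matrix of shape d_out × d_in, LoRA with rank r adds two low-rank factors with r · (d_in + d_out) trainable parameters in total. The sketch below applies that formula per layer. The projection shapes and layer count are assumptions intended to approximate Qwen2.5-3B and should be checked against the model's published configuration; the rank and the module sets are hypothetical and are not the 16 configurations from the paper.

```python
# Assumed (d_in, d_out) per projection in one transformer layer.
# These values are illustrative assumptions, not taken from the abstract.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 11008),
    "up_proj": (2048, 11008),
    "down_proj": (11008, 2048),
}
ASSUMED_NUM_LAYERS = 36  # assumption


def lora_trainable_params(shapes, target_modules, rank, num_layers):
    """Count LoRA parameters: rank * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(
        rank * (d_in + d_out)
        for name, (d_in, d_out) in shapes.items()
        if name in target_modules
    )
    return per_layer * num_layers


# Hypothetical comparison of three target-module choices at rank 8.
attention_qv = {"q_proj", "v_proj"}
attention_all = {"q_proj", "k_proj", "v_proj", "o_proj"}
all_linear = set(ASSUMED_SHAPES)

for label, modules in [
    ("q+v", attention_qv),
    ("attention", attention_all),
    ("all linear", all_linear),
]:
    count = lora_trainable_params(ASSUMED_SHAPES, modules, rank=8, num_layers=ASSUMED_NUM_LAYERS)
    print(f"{label}: {count:,} trainable parameters")
```

Under these assumed shapes, adapting the MLP projections multiplies the trainable parameter count several times over compared with adapting attention alone, which is one reason target module selection can affect both training cost and energy consumption.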