Abstract
This paper introduces a novel framework for addressing domain adaptation challenges in large language models (LLMs), emphasising privacy-preserving synthetic data generation and efficient fine-tuning. The proposed framework employs a multi-stage approach that includes document ingestion, relevance assessment, and automated dataset creation. This process reduces the need for extensive technical expertise while safeguarding data privacy. We evaluate the framework’s performance on domain-specific tasks in fields such as biobanking and public health, demonstrating that models fine-tuned using our method achieve results comparable to larger proprietary models. Crucially, these models maintain their general instruction-following capabilities, even when adapted to specialised domains, as shown through experiments with 7B and 8B parameter LLMs. Key components of the framework include continuous pre-training, supervised fine-tuning (SFT), and reinforcement learning methods such as direct preference optimisation (DPO), which together provide a flexible and configurable solution for deploying LLMs. The framework supports both local models and API-based solutions, making it scalable and accessible. By enabling privacy-preserving, domain-specific adaptation without requiring extensive expertise, this framework represents a significant step forward in the deployment of LLMs for specialised applications. The framework significantly lowers the barrier to domain adaptation for small- and medium-sized enterprises (SMEs), enabling them to utilise the power of LLMs without requiring extensive resources or technical expertise.
Original language | English |
---|---|
Article number | 172 |
Number of pages | 22 |
Journal | Computers |
Volume | 14 |
Issue number | 5 |
Early online date | 2 May 2025 |
DOIs | |
Publication status | Published - May 2025 |
Keywords
- dataset creation
- deep learning
- large language models
- model adaptation
- model fine-tuning
ASJC Scopus subject areas
- Human-Computer Interaction
- Computer Networks and Communications