A framework for domain-specific dataset creation and adaptation of large language models

George Balaskas*, Homer Papadopoulos, Dimitra Pappa, Quentin Loisel, Sebastien Chastin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

This paper introduces a framework for addressing domain adaptation challenges in large language models (LLMs), emphasising privacy-preserving synthetic data generation and efficient fine-tuning. The framework employs a multi-stage approach comprising document ingestion, relevance assessment, and automated dataset creation, which reduces the need for extensive technical expertise while safeguarding data privacy. We evaluate the framework’s performance on domain-specific tasks in fields such as biobanking and public health, demonstrating that models fine-tuned using our method achieve results comparable to larger proprietary models. Crucially, experiments with 7B and 8B parameter LLMs show that these models retain their general instruction-following capabilities even when adapted to specialised domains. Key components of the framework include continuous pre-training, supervised fine-tuning (SFT), and reinforcement learning methods such as direct preference optimisation (DPO), which together provide a flexible and configurable solution for deploying LLMs. The framework supports both local models and API-based solutions, making it scalable and accessible. By enabling privacy-preserving, domain-specific adaptation, the framework lowers the barrier to deploying LLMs in specialised applications, particularly for small- and medium-sized enterprises (SMEs) that lack extensive resources or technical expertise.
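The three stages named in the abstract (document ingestion, relevance assessment, automated dataset creation) can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`ingest`, `relevance`, `build_dataset`, `template_generator`, the `Example` record) is hypothetical, the keyword-overlap score merely stands in for whatever relevance model the framework uses, and the generator is a placeholder for a call to a local model or an API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One instruction/response pair in the generated dataset (hypothetical schema)."""
    instruction: str
    response: str
    source: str


def ingest(documents: dict[str, str], chunk_size: int = 40) -> list[tuple[str, str]]:
    """Stage 1: split each document into chunks of at most chunk_size words."""
    chunks = []
    for name, text in documents.items():
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append((name, " ".join(words[i:i + chunk_size])))
    return chunks


def relevance(chunk: str, domain_terms: set[str]) -> float:
    """Stage 2: fraction of domain terms found in the chunk.

    Assumption: a keyword overlap stands in for the real relevance assessment.
    """
    if not domain_terms:
        return 0.0
    tokens = {w.strip(".,;:()").lower() for w in chunk.split()}
    return len(tokens & domain_terms) / len(domain_terms)


def build_dataset(
    documents: dict[str, str],
    domain_terms: set[str],
    generate: Callable[[str], tuple[str, str]],
    threshold: float = 0.5,
) -> list[Example]:
    """Stage 3: keep relevant chunks and turn each into a training example."""
    dataset = []
    for source, chunk in ingest(documents):
        if relevance(chunk, domain_terms) >= threshold:
            instruction, response = generate(chunk)
            dataset.append(Example(instruction, response, source))
    return dataset


def template_generator(chunk: str) -> tuple[str, str]:
    """Placeholder generator; a real pipeline would query a local or API-based LLM."""
    return ("Summarise the following passage.", chunk)


docs = {
    "biobank.txt": "A biobank stores biological samples and consent records for research.",
    "recipe.txt": "Whisk the eggs and fold in the flour.",
}
terms = {"biobank", "samples", "consent"}
dataset = build_dataset(docs, terms, template_generator)
print([example.source for example in dataset])
```

Passing the generator in as a callable mirrors the abstract's point that the framework supports both local models and API-based solutions: only that one function would need to change.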
Original language: English
Article number: 172
Number of pages: 22
Journal: Computers
Volume: 14
Issue number: 5
Early online date: 2 May 2025
Publication status: Published - May 2025

Keywords

  • dataset creation
  • deep learning
  • large language models
  • model adaptation
  • model fine-tuning

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
