An sLLM-Powered Multi-modal RAG System for Intelligent Sign Language Avatar Docent Services in Barrier-free Museums
Name (Affiliation): Taemoon Jung (Department of Computer Software)
Date/Time: May 14, 2026 (Thu), 12:00–13:00
Venue: IT/BT Building, Room 406
Abstract
Traditional docent services at museums and art galleries face structural challenges, including a shortage of specialized staff and limited accessibility for underserved visitors. To address these issues, this study proposes a multimodal, barrier-free intelligent docent platform built on small language models (sLLMs) and retrieval-augmented generation (RAG), and verifies its effectiveness in real exhibition environments. The system centers on a multimodal retrieval-augmented generation (MM-RAG) architecture that integrates text, real-time 3D spatial information from ultra-wideband (UWB) positioning, and visual information. This enables intuitive question answering grounded in the user's location and line of sight, resolving the context of ambiguous, deictic questions such as "What is this artwork?" Using an sLLM (DocentGemma-V) fine-tuned on more than 140,000 domain-specific examples and 6,800 preference pairs, the system generates responses with high accuracy and user satisfaction even in a resource-constrained environment. In addition, the platform integrates interfaces designed for inclusive accessibility: sign language avatars for deaf and hard-of-hearing visitors, audio description (TTS) and haptic feedback for visually impaired visitors, and multilingual support for international visitors. In a two-round field validation conducted at a national museum with 410 visitors in total, the proposed system demonstrated strong performance and utility. Retrieval using multimodal context achieved an MRR of 78.54% and a Recall@1 of 63.45%, more than double the accuracy of text-only retrieval. The domain-specific sLLM recorded a benchmark accuracy of 55.60% and received user-preference ratings surpassing those of large-scale commercial models.
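The deictic-query resolution described above can be illustrated with a minimal sketch. This is not the paper's implementation; the artwork names, coordinates, and the `contextualize` helper are hypothetical, and it assumes UWB positioning yields the visitor's (x, y) coordinates on the gallery floor plan so the nearest exhibit can be attached to the retrieval query:

```python
import math

# Illustrative exhibit map (names and coordinates are invented, not from the study)
ARTWORKS = {
    "Sunrise over Hanyang": (2.0, 3.5),
    "Celadon Vase": (8.0, 1.0),
}

def nearest_artwork(visitor_xy):
    """Return the exhibit closest to the visitor's UWB-estimated position."""
    return min(ARTWORKS, key=lambda name: math.dist(visitor_xy, ARTWORKS[name]))

def contextualize(query, visitor_xy):
    """Rewrite an ambiguous deictic query into an explicit retrieval query."""
    if "this artwork" in query.lower():
        return query.replace("this artwork", f'"{nearest_artwork(visitor_xy)}"')
    return query
```

For example, a visitor standing near (1.0, 3.0) asking "What is this artwork?" would have the query rewritten to name the closest exhibit before it reaches the retriever; gaze direction could further disambiguate when two exhibits are equidistant.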
In the second field trial, the system achieved a high average user satisfaction score of 4.3 out of 5, with particularly high ratings in the accessibility category. Synchronization errors for subtitles and the sign language avatar were 15.3 ms and 28.7 ms, respectively, well below the human perception threshold, confirming that the system can deliver a stable service. These results carry academic and industrial significance: they experimentally demonstrate that a well-designed multimodal RAG system combined with a domain-specific small language model can serve as a cost-effective, high-satisfaction alternative to large general-purpose models in the public cultural services sector.
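The retrieval metrics reported in the abstract (MRR and Recall@1) have standard definitions; the following minimal sketch shows how they are computed over a set of queries, assuming each query has exactly one relevant document (function names and data are illustrative, not from the study):

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the relevant doc per query."""
    total = 0.0
    for docs, rel in zip(ranked_lists, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc == rel:
                total += 1.0 / rank
                break  # contribute 0 if the relevant doc is never retrieved
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k=1):
    """Fraction of queries whose relevant doc appears in the top-k results."""
    hits = sum(rel in docs[:k] for docs, rel in zip(ranked_lists, relevant))
    return hits / len(ranked_lists)

# Two toy queries: relevant doc "a" ranked 1st, then 2nd
ranked = [["a", "b", "c"], ["b", "a", "c"]]
gold = ["a", "a"]
print(mrr(ranked, gold))          # (1/1 + 1/2) / 2 = 0.75
print(recall_at_k(ranked, gold))  # 1 of 2 queries has "a" at rank 1 -> 0.5
```

Under these definitions, the reported MRR of 78.54% alongside a Recall@1 of 63.45% indicates that even when the correct exhibit entry was not ranked first, it usually appeared near the top of the multimodal retrieval results.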
