Diagnostics-LLaVA: A Visual Language Model for Domain-Specific
Diagnostics of Equipment
Aman Kumar, Mahbubul Alam, Ahmed Farahat, Maheshjabu Somineni and Chetan Gupta
Industrial AI Lab, Research & Development, Hitachi America Ltd., Santa Clara, CA, 95054, USA
aman.kumar@hal.hitachi.com
mahbubul.alam@hal.hitachi.com
ahmed.farahat@hal.hitachi.com
maheshjabu.somineni@hal.hitachi.com
chetan.gupta@hal.hitachi.com
ABSTRACT
Recent advancements in the area of Large Language Models (LLMs) have opened horizons for conversational assistant-based intelligent models capable of interpreting images and providing textual responses, also known as Visual Language Models (VLMs). These models can assist equipment operators and maintenance technicians in complex Prognostics and Health Management (PHM) tasks such as fault diagnostics, root cause analysis, and repair recommendations. Significant open-source contributions have been made in the area of VLMs. However, models trained on general-domain data fail to perform well in complex tasks in specialized domains such as the diagnostics and repair of industrial equipment. Therefore, in this paper, we discuss our work on the development of Diagnostics-LLaVA, a VLM suitable for interpreting images of specific industrial equipment and providing better responses than existing open-source models in PHM tasks such as fault diagnostics and repair recommendation. We introduce Diagnostics-LLaVA based on the architecture of LLaVA and create an instance of Diagnostics-LLaVA for the automotive repair domain, referred to as Automotive-LLaVA. We demonstrate that our proposed Automotive-LLaVA model performs better than state-of-the-art open-source visual language models such as mPLUG-Owl and LLaVA in both qualitative and quantitative experiments.
1. INTRODUCTION
The development of domain-specific visual language models has emerged as an important area of research due to the increasing demand for advanced artificial intelligence systems that can communicate, reason, and understand the visual world effectively (Park & Kim, 2023). A Visual Language Model (VLM) combines the capabilities of Computer Vision (CV) and Natural Language Processing (NLP) to create a system that comprehends and generates descriptions based on visual content with the help of large language models (LLMs) (Wang et al., 2023). Within the field of prognostics and health management (PHM), a domain-specific VLM tailored to the needs of equipment operators and maintenance technicians has the potential to revolutionize the maintenance and repair of equipment in various industries (Lai et al., 2024). By leveraging a domain-specific VLM, operators and technicians can seamlessly interact with such intelligent systems, which can automatically analyze equipment components, identify issues, and communicate relevant information in an efficient and intuitive manner. As technology continues to advance, such a specialized VLM will enable technicians to streamline diagnosis and repair processes, increase operations and maintenance efficiency, and ultimately enhance overall user satisfaction and safety.
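To make this interaction pattern concrete, the sketch below shows how a general-domain, off-the-shelf LLaVA checkpoint can be queried with an equipment image and a diagnostic question through the Hugging Face Transformers library. This is an illustrative example only, not the Diagnostics-LLaVA pipeline proposed in this paper; the checkpoint name, image file, and prompt are assumptions made for illustration.

# Illustrative sketch: prompting an off-the-shelf LLaVA checkpoint with an
# equipment image and a diagnostic question (not the paper's implementation).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed general-domain checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("brake_assembly.jpg")  # hypothetical equipment photo
prompt = ("USER: <image>\nThe brake pads shown here are worn unevenly. "
          "What is the likely root cause and the recommended repair? ASSISTANT:")

# Encode the image-text pair and generate a free-form textual response.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))

A general-domain checkpoint queried in this way typically produces generic answers for specialized equipment, which motivates the domain-specific adaptation pursued in this work.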
Recent advancements in Visual Language Models (VLMs) have significantly improved the integration of computer vision and natural language processing (He et al., 2024). Notable developments include the Multi-modal Instruction Tuned LLMs with Fine-Grained Visual Perception (AnyRef) model, which generates pixel-wise object perceptions and natural language descriptions from multi-modality references (X. Zhao et al., 2024). Additionally, the LLaVA model (Liu, Li, Wu, & Lee, 2024) enhances visual processing by integrating multi-granularity images and introducing a novel visual instruction tuning method for extending multi-modal LLMs (MLLMs) to perform various multi-modal tasks, surpassing previous state-of-the-art performance on multiple visual instruction tuning benchmarks. mPLUG-Owl (Ye et al., 2023) is another popular open-source VLM. mPLUG-Owl2 (Ye et al., 2024), an extension of the mPLUG-Owl model, revolutionizes multi-modal large language models by effectively leveraging modality collaboration to improve performance in both text and multi-modal tasks. Despite these advancements, some VLMs do not align with human vision illusions, particularly for question-