Diagnostics-LLaVA: A Visual Language Model for Domain-Specific
Diagnostics of Equipment
Aman Kumar, Mahbubul Alam, Ahmed Farahat, Maheshjabu Somineni and Chetan Gupta
Industrial AI Lab, Research & Development, Hitachi America Ltd., Santa Clara, CA, 95054, USA
aman.kumar@hal.hitachi.com
mahbubul.alam@hal.hitachi.com
ahmed.farahat@hal.hitachi.com
maheshjabu.somineni@hal.hitachi.com
chetan.gupta@hal.hitachi.com
ABSTRACT
Recent advancements in the area of Large Language Models (LLMs) have opened horizons for conversational assistant-based intelligent models capable of interpreting images and providing textual responses, also known as Visual Language Models (VLMs). These models can assist equipment operators and maintenance technicians in complex Prognostics and Health Management (PHM) tasks such as fault diagnostics, root cause analysis, and repair recommendations. Significant open-source contributions have been made in the area of VLMs. However, models trained on general-domain data fail to perform well in complex tasks in specialized domains such as the diagnostics and repair of industrial equipment. Therefore, in this paper, we discuss our work on the development of Diagnostics-LLaVA, a VLM suitable for interpreting images of specific industrial equipment and providing better responses than existing open-source models in PHM tasks such as fault diagnostics and repair recommendation. We introduce Diagnostics-LLaVA based on the architecture of LLaVA and create an instance of Diagnostics-LLaVA for the automotive repair domain, referred to as Automotive-LLaVA. We demonstrate that our proposed Automotive-LLaVA model performs better than state-of-the-art open-source visual language models such as mPLUG-Owl and LLaVA in both qualitative and quantitative experiments.
1. INTRODUCTION
The development of domain-specific visual language models has emerged as an important area of research due to the increasing demand for advanced artificial intelligence systems that can communicate, reason, and understand the visual world effectively (Park & Kim, 2023). A Visual Language Model (VLM) combines the capabilities of Computer Vision (CV) and Natural Language Processing (NLP) to create a system that comprehends and generates descriptions based on visual content with the help of large language models (LLMs) (Wang et al., 2023). Within the field of prognostics and health management (PHM), a domain-specific VLM tailored to the needs of equipment operators and maintenance technicians has the potential to revolutionize the maintenance and repair of equipment in various industries (Lai et al., 2024). By leveraging a domain-specific VLM, operators and technicians can seamlessly interact with such intelligent systems, which can automatically analyze equipment components, identify issues, and communicate relevant information in an efficient and intuitive manner. As technology continues to advance, such a specialized VLM will enable technicians to streamline diagnosis and repair processes, increase operations and maintenance efficiency, and ultimately enhance overall user satisfaction and safety.
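To make this interaction pattern concrete, the sketch below shows how a general-domain, off-the-shelf LLaVA checkpoint can be queried with an equipment image and a diagnostic question through the Hugging Face Transformers library. This is an illustrative example only, not the Diagnostics-LLaVA pipeline proposed in this paper; the checkpoint name, image file, and prompt are assumptions made for illustration.

# Illustrative sketch: prompting an off-the-shelf LLaVA checkpoint with an
# equipment image and a diagnostic question (not the paper's implementation).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed general-domain checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("brake_assembly.jpg")  # hypothetical equipment photo
prompt = ("USER: <image>\nThe brake pads shown here are worn unevenly. "
          "What is the likely root cause and the recommended repair? ASSISTANT:")

# Encode the image-text pair and generate a free-form textual response.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))

A general-domain checkpoint queried in this way typically produces generic answers for specialized equipment, which motivates the domain-specific adaptation pursued in this work.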
Recent advancements in Visual Language Models (VLMs) have significantly improved the integration of computer vision and natural language processing (He et al., 2024). Notable developments include the Multi-modal Instruction Tuned LLMs with Fine-Grained Visual Perception (AnyRef) model, which generates pixel-wise object perceptions and natural language descriptions from multi-modality references (X. Zhao et al., 2024). Additionally, the LLaVA model (Liu, Li, Wu, & Lee, 2024) enhances visual processing by integrating multi-granularity images and introducing a novel visual instruction tuning method for extending multi-modal LLMs (MLLMs) to perform various multi-modal tasks, surpassing previous state-of-the-art performance on multiple visual instruction tuning benchmarks. mPLUG-Owl (Ye et al., 2023) is another popular open-source VLM. mPLUG-Owl2 (Ye et al., 2024), an extension of the mPLUG-Owl model, revolutionizes multi-modal large language models by effectively leveraging modality collaboration to improve performance in both text and multi-modal tasks. Despite these advancements, some VLMs do not align with human vision illusions, particularly for question-