International Research Journal of Engineering and Technology (IRJET) Volume: 12 Issue: 10 | Oct 2025

www.irjet.net

e-ISSN: 2395-0056 p-ISSN: 2395-0072

Reinforcement Learning from Human Feedback for Trustworthy Vision–Language Models

Laraib Ahmad Siddiqui¹, Mohd Shahzad²

¹Program Control Services Analyst, Accenture, India
²AWS and DevOps Consultant, Deloitte, India

---------------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Multimodal foundation models that jointly process vision and language, such as CLIP, BLIP-2, and GPT-4V, demonstrate impressive perceptual and reasoning abilities but remain prone to hallucination, bias, and misalignment with human intent. This paper introduces an RLHF-based alignment framework for multimodal models, designed to teach them what humans actually value in visual understanding tasks. We construct a preference dataset of human-rated visual explanations across three domains: image captioning, visual question answering, and video narration. A reward model jointly optimizes for factual consistency, contextual grounding, and bias penalties, and a scalable evaluation harness built on Kubernetes enables automated comparison of pre- and post-alignment model outputs. Empirical results show measurable improvements in factual accuracy (↑ 18%), bias reduction (↓ 22%), and overall human preference alignment (↑ 25%) on multimodal benchmarks. Our findings offer a reproducible path toward trustworthy vision–language alignment, laying the groundwork for safer multimodal agents in deployment contexts.

Key Words: Reinforcement Learning from Human Feedback (RLHF), Vision–Language Models (VLMs), Human Preference Alignment, Factual Consistency, Bias Mitigation, Multimodal Model Evaluation, Visual Grounding, Trustworthy AI

1.2 Problem Statement

Traditional supervised fine-tuning optimizes likelihood on human-authored text but fails to encode the nuanced human preferences for truthfulness, contextual sensitivity, and social fairness in visual understanding. While RLHF has proven transformative for large language models, extending it to vision–language domains presents unique challenges:

• Multimodal reward modeling requires joint visual and textual grounding.
• Human preferences depend on both factual alignment and perceptual saliency.
• Evaluation must be scalable and reproducible across large, heterogeneous datasets.

1.3 Proposed Approach

We propose a Reinforcement Learning from Human Feedback (RLHF) framework tailored to multimodal models.

The approach consists of:

1. Human-Preference Dataset Creation: Collect pairwise preferences on visual outputs (captions, VQA answers, rationales).
2. Reward Modeling: Train a composite reward model combining (a) relevance to visual content, (b) factual faithfulness, and (c) bias-sensitive regularization.
3. Policy Optimization: Fine-tune a CLIP- or BLIP-based encoder–decoder via proximal policy optimization (PPO) using the learned reward.
4. Evaluation Harness: Implement scalable comparison using a Python/Kubernetes pipeline with automatic metric logging (accuracy, bias, human preference scores).
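The reward-modeling step above can be illustrated with a minimal sketch. The text does not specify how the three reward terms are combined or which loss is used on the pairwise preferences, so everything below is an assumption made for illustration: the weighted-sum form, the weights `w_rel`, `w_fact`, and `w_bias`, the [0, 1] score scale, and the choice of a Bradley-Terry pairwise loss (a common choice in RLHF reward modeling). The names `RewardComponents`, `composite_reward`, and `pairwise_preference_loss` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardComponents:
    """Per-output scores, assumed here to lie in [0, 1] (hypothetical scale)."""
    relevance: float     # (a) relevance to visual content
    faithfulness: float  # (b) factual faithfulness
    bias: float          # (c) bias score; higher means more biased


def composite_reward(c, w_rel=1.0, w_fact=1.0, w_bias=0.5):
    """Weighted sum of the positive terms minus a bias penalty.

    The additive form and the default weights are illustrative assumptions.
    """
    return w_rel * c.relevance + w_fact * c.faithfulness - w_bias * c.bias


def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for one human-labeled pair of captions for the same image.
preferred = RewardComponents(relevance=0.9, faithfulness=0.8, bias=0.1)
rejected = RewardComponents(relevance=0.7, faithfulness=0.4, bias=0.6)

r_preferred = composite_reward(preferred)
r_rejected = composite_reward(rejected)
loss = pairwise_preference_loss(r_preferred, r_rejected)
print(f"reward(preferred)={r_preferred:.2f}  reward(rejected)={r_rejected:.2f}  loss={loss:.4f}")
```

The loss shrinks as the reward assigned to the human-preferred output exceeds that of the rejected one, and equals log 2 when the two rewards tie. In a trained reward model the component scores would come from learned networks over image and text features rather than fixed numbers, and the resulting scalar reward would then drive the PPO step.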

1. INTRODUCTION

1.1 Motivation

The next generation of AI assistants must see, talk, and act safely. Multimodal models such as GPT-4V, Gemini 1.5 Pro, and Flamingo have bridged language and perception, yet their reasoning often drifts from visual evidence, producing hallucinated captions, biased predictions, or contextually implausible narratives.

As these systems increasingly drive applications in autonomous robotics, medical imaging, and content moderation, alignment with human intent becomes essential not only for user trust but also for regulatory compliance under frameworks like the EU AI Act.

1.4 Contributions

Our contributions are threefold:

1. A novel multimodal RLHF pipeline integrating human preferences directly into vision–language reasoning.

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal

Page 426