% Created: 2025-12-24
% Model: GPT-5.2 Thinking by Chat01
\hyphenation{op-tical net-works semi-conduc-tor}
\documentclass[lettersize,journal]{IEEEtran}
\usepackage{amsmath,amsfonts}
%\usepackage{algorithmic}
%\usepackage{algorithm}
\usepackage{array}
%\usepackage[caption=false,font=normalsize,labelfont=sf,textfont=sf]{subfig}
\usepackage{textcomp}
\usepackage{stfloats}
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{cite}
\usepackage[colorlinks,linkcolor=blue,anchorcolor=blue,citecolor=blue]{hyperref}
\usepackage{subfigure}
\usepackage{bbold}
\usepackage{bbm}
\usepackage{mathrsfs}
\usepackage{tikz}
\usepackage{listings}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{amsthm}
\usepackage{epstopdf}
\usepackage{setspace}
\usepackage{threeparttable}
\usepackage{lineno}
\usepackage{caption}
\usepackage{ctex}
\usepackage{ragged2e}
\usepackage{booktabs, makecell, multirow, tabularx}
\usepackage[linesnumbered, ruled, lined, boxed, commentsnumbered]{algorithm2e}
\renewcommand{\algorithmcfname}{Algorithm}
\SetKwInput{KwIn}{Input}
\SetKwInput{KwOut}{Output}
\theoremstyle{plain}
\newtheorem{thm}{Theorem}[section]
\newtheorem{lem}[thm]{Lemma}
\newtheorem{prop}[thm]{Proposition}
\newtheorem{cor}[thm]{Corollary}
\theoremstyle{definition}
\newtheorem{defn}[thm]{Definition}
\newtheorem{ques}[thm]{Question}
\newtheorem{exmp}[thm]{Example}
\newtheorem{rem}[thm]{Remark}
\newtheorem{note}[thm]{Note}
\newcommand{\bv}{\bigvee}
\newcommand{\bw}{\bigwedge}
\newcommand{\ra}{\rightarrow}
\newcommand{\la}{\leftarrow}
\newcommand{\lam}{\lambda}
\newcommand{\laa}{\longrightarrow}
\newcommand{\gam}{\gamma}
\theoremstyle{plain}
\newtheorem{bthm}{Theorem}[subsection]
\newtheorem{blem}[bthm]{Lemma}
\newtheorem{bprop}[bthm]{Proposition}
\newtheorem{bcor}[bthm]{Corollary}
\theoremstyle{definition}
\newtheorem{bdefn}[bthm]{Definition}
\newtheorem{bques}[bthm]{Question}
\newtheorem{bexmp}[bthm]{Example}
\newtheorem{brem}[bthm]{Remark}
\newtheorem{bnote}[bthm]{Note}
\usepackage{diagbox}
\makeatletter
\newcommand\figcaption{\def\@captype{figure}\caption}
\newcommand\tabcaption{\def\@captype{table}\caption}
\makeatother
\ifCLASSINFOpdf
\else
\fi
\begin{document}
\title{Title}
\markboth{IEEE TRANSACTIONS ON FUZZY SYSTEMS}%
{Shell \MakeLowercase{\textit{et al.}}: Bare Demo of IEEEtran.cls for IEEE Journals}
\maketitle
\begin{abstract}
Multimodal learning has emerged as a pivotal research direction, focusing on integrating diverse information sources to simulate human-like comprehensive understanding. However, traditional approaches often exhibit limitations in effectively modeling the inherent uncertainty and ambiguity present in complex real-world environments. Although Duan et al.\cite{duan2025fuzzy} recently introduced a Fuzzy Multimodal Learning (FUME) framework to address these issues in Cross-Modal Retrieval, we identify that this method lacks direct transferability, rendering it suboptimal for tasks with different logical structures. Addressing this gap, this paper proposes a specialized optimization of the FUME framework tailored for Visual Question Answering (VQA). Specifically, we adapt the fuzzy learning mechanism to the characteristics of the DAQUAR and Visual7W datasets. By reformulating the loss constraints to align with these tasks, our experiments demonstrate that the proposed method effectively mitigates the transferability issues, achieving enhanced robustness and accuracy on both datasets.
\end{abstract}
\begin{IEEEkeywords}
Multimodal learning, fuzzy set theory, overlap function, visual question answering.
\end{IEEEkeywords}
\section{Introduction}\label{s1}
In recent years, with the rapid advancement of artificial intelligence technology, unimodal information processing methods, such as those relying solely on text or images, have struggled to meet the growing demand for deep understanding and reliable interaction with the complex real world. Unimodal approaches are often limited by information constraints and environmental vulnerabilities—for instance, speech recognition in noisy environments is highly error-prone—making it impossible to achieve human-like cross-sensory association and comprehensive reasoning. In this context, multimodal learning has emerged as a critical solution. Its core principle lies in integrating and synergizing information from diverse sources, such as vision, language, and audio, enabling machines to acquire more comprehensive, robust, and generalizable semantic representations. This advancement drives artificial intelligence from ``perception'' toward ``cognition'' and ``creation,'' establishing multimodal learning as a key direction in the era of large-scale models. Below are some examples of multimodal learning:
\begin{enumerate}
\item \textbf{Image/Video Captioning} \\
Generating descriptive textual captions for given images or videos by understanding and translating visual content into natural language.
\item \textbf{Visual Question Answering (VQA)} \\
Answering natural language questions about a given image or video by jointly reasoning over both the visual content and the textual query.
\item \textbf{Multimodal Sentiment Analysis} \\
Determining a person's emotional state or opinion by integrating and analyzing complementary signals from text, speech, and visual expressions (e.g., face, gesture).
\item \textbf{Cross-Modal Retrieval} \\
Retrieving relevant items from one modality (e.g., images) using a query from a different modality (e.g., text), by learning a shared representation space across modalities.
\end{enumerate}
Although current multimodal learning methods have demonstrated promising performance, most of these models are deterministic and struggle to effectively handle the uncertainties arising from the complex structure of multimodal data. To address this, Duan et al.\cite{duan2025fuzzy} proposed a Fuzzy Multimodal Learning method (FUME) for cross-modal retrieval, which enables trustworthy retrieval by self-estimating cognitive uncertainty. Specifically, FUME leverages fuzzy set theory to interpret the output of a classification network as a set of membership degrees, and quantifies category credibility by combining possibility and necessity measures. Extensive experiments on five benchmark datasets show that FUME significantly improves retrieval performance and reliability, offering a promising solution for cross-modal retrieval in high-stakes applications.
\newline
However, despite the shared macro-level framework across different multimodal learning tasks, fundamental differences in datasets, algorithmic workflows, and training logic prevent the direct transfer of fuzzy multimodal learning methods from cross-modal retrieval to other tasks—particularly in the design of loss functions. Therefore, this work aims to extend FUME to the VQA task. Taking the DAQUAR and Visual7W (V7W) datasets as examples, we detail how to adapt fuzzy multimodal learning for VQA and introduce an overlap function as a key component in designing the loss function, ensuring that it aligns more closely with the principles of fuzzy operations. Section 2 introduces related work and applications of overlap functions and VQA. In Section 3, we propose VQA algorithms based on fuzzy multimodal learning, specifically tailored to the characteristics of the DAQUAR and V7W datasets, with loss functions optimized using overlap functions. In Section 4, experiments are conducted on both datasets and the results are analyzed. Section 5 draws conclusions based on the experimental outcomes and provides a focused analysis of how overlap functions optimize the loss function. Finally, Section 6 summarizes the strengths and limitations of the proposed model and suggests directions for future research.
%%\cite{2017TFS3} \cite{lu2024constructing}
\section{Related Work}\label{s2}
\subsection{Overlap function}
Fuzzy aggregation functions are a hot research topic in the field of fuzzy mathematics, among which overlap functions, as a type of fuzzy logical connective without associativity, have yielded substantial results both theoretically and in applications. Lu et al.\cite{lu2024constructing} studied overlap functions on bounded posets, lifting the continuity in the notion of overlap functions from $[0,1]$ to bounded posets, mainly from a topological perspective. Qiao et al.\cite{qiao2019homogeneous} introduced the notions of pseudo-homogeneous overlap and grouping functions, which can be regarded as generalizations of the concepts of homogeneous and quasi-homogeneous overlap and grouping functions, respectively. Zhang et al.\cite{zhang2023overlap} proposed two new groups of fuzzy mathematical morphology (FMM) operators with the aid of overlap functions, and applied them to image processing. Qiao et al.\cite{qiao2020alpha} studied $\alpha$-cross-migrativity for overlap functions. Liu et al.\cite{liu2020extensions} investigated Z-extended overlap functions and grouping functions on fuzzy truth values. Jurio et al.\cite{jurio2013some} studied under which conditions overlap and grouping functions satisfy some commonly demanded properties such as migrativity or homogeneity.
\subsection{VQA}
Multimodal learning stands as one of the leading directions in contemporary machine learning. Among its representative tasks, Visual Question Answering (VQA) has garnered extensive research attention.
Dua et al.\cite{dua2021beyond} presented a completely generative formulation in which a multi-word answer is generated for a visual query. Mishra et al.\cite{mishra2020cq} proposed CQ-VQA, a novel 2-level hierarchical but end-to-end model to solve the task of VQA. Mensink et al.\cite{mensink2023encyclopedic} proposed Encyclopedic-VQA, a large-scale visual question answering dataset featuring visual questions about detailed properties of fine-grained categories and instances. Wang et al.\cite{wang2022tag} proposed TAG, a text-aware visual question answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. Tang et al.\cite{tang2024multiple} presented Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models. Madaka et al.\cite{madaka2025vqa} proposed a pilot version, named VQA-Levels, which serves as a new benchmark dataset designed to systematically test VQA systems and support researchers in advancing the field.
\section{Preliminaries}\label{s3}
\subsection{Aggregation Functions}
\begin{defn}[$n$-ary Aggregation Function]
Let $n\in\mathbb N$. A function
\[
A:[0,1]^n\longrightarrow[0,1]
\]
is called an \emph{$n$-ary aggregation function} if it satisfies:
\begin{itemize}
\item \textbf{Boundary conditions:} $A(0,\dots,0)=0$ and $A(1,\dots,1)=1$.
\item \textbf{Monotonicity (componentwise nondecreasing):} for any $\bm x,\bm y\in[0,1]^n$, if $\bm x\le\bm y$ componentwise, then $A(\bm x)\le A(\bm y)$.
\end{itemize}
\end{defn}
\subsection{Overlap Functions}
\begin{defn}[Overlap Function]
A binary function
\[
O:[0,1]^2\longrightarrow[0,1]
\]
is called an \emph{overlap function} if for all $x,y\in[0,1]$ it holds that:
\begin{itemize}
\item[(O1)] \textbf{Commutativity:} $O(x,y)=O(y,x)$.
\item[(O2)] \textbf{Boundary condition:} $O(x,y)=0$ if and only if $xy=0$.
\item[(O3)] \textbf{Boundary condition:} $O(x,y)=1$ if and only if $xy=1$.
\item[(O4)] \textbf{Monotonicity:} if $x\le x'$, then $O(x,y)\le O(x',y)$.
\item[(O5)] \textbf{Continuity:} $O$ is continuous in both arguments (jointly continuous).
\end{itemize}
\end{defn}
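Two classical examples of overlap functions are the product $O(x,y)=xy$ and the minimum $O(x,y)=\min(x,y)$. The following NumPy sketch (illustrative only; function names are ours) spot-checks properties (O1)--(O4) on a grid; continuity (O5) cannot be tested pointwise:

```python
import numpy as np

def o_prod(x, y):      # product overlap: O(x, y) = x * y
    return x * y

def o_min(x, y):       # minimum overlap: O(x, y) = min(x, y)
    return np.minimum(x, y)

def check_overlap(O, grid=np.linspace(0.0, 1.0, 21)):
    """Spot-check (O1)-(O4) on a grid; (O5) continuity is not testable pointwise."""
    for x in grid:
        for y in grid:
            assert np.isclose(O(x, y), O(y, x))                  # (O1) commutativity
            assert bool(np.isclose(O(x, y), 0.0)) == (x * y == 0.0)  # (O2) O=0 iff xy=0
            assert bool(np.isclose(O(x, y), 1.0)) == (x * y == 1.0)  # (O3) O=1 iff xy=1
        vals = O(grid, np.full_like(grid, x))                    # vary 1st arg, fix 2nd
        assert np.all(np.diff(vals) >= -1e-12)                   # (O4) monotonicity
    return True

assert check_overlap(o_prod) and check_overlap(o_min)
```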
\begin{defn}[1-partial contraction / expansion]
Let $O$ be an overlap function.
\begin{itemize}
\item $O$ is said to satisfy \emph{1-partial contraction} if for all $x\in[0,1]$, $O(x,1)\le x$.
\item $O$ is said to satisfy \emph{1-partial expansion} if for all $x\in[0,1]$, $O(x,1)\ge x$.
\end{itemize}
\end{defn}
\subsection{VQA}
Visual Question Answering (VQA) takes an image $I$ and a natural-language question $q$ as input, and predicts an answer.
In the common \emph{classification-based} setting, the model outputs a distribution over a fixed answer set $\mathcal A=\{a_1,\dots,a_K\}$:
\begin{equation}
\hat a=\arg\max_{a\in\mathcal A}\; p(a\mid I,q).
\end{equation}
\subsubsection{Image Stream}
Extract a global visual feature vector using an image encoder (e.g., CNN):
\begin{equation}
\bm f_I=\mathrm{CNN}(I).
\end{equation}
Optionally apply $\ell_2$ normalization:
\begin{equation}
\bm f_I \leftarrow \bm f_I/\lVert \bm f_I\rVert_2.
\end{equation}
Project to a shared latent space:
\begin{equation}
\bm v=\tanh(W_I\bm f_I+\bm b_I).
\end{equation}
\subsubsection{Question Stream}
Tokenize the question $q=(q_1,\dots,q_T)$ and map tokens to embeddings. One generic way is:
\begin{equation}
\bm e_t=E[q_t],\qquad t=1,\dots,T.
\end{equation}
Then encode the sequence with a text encoder (e.g., LSTM/GRU/Transformer) to obtain a question representation:
\begin{equation}
\bm f_q=\mathrm{Enc}(\bm e_1,\dots,\bm e_T).
\end{equation}
Project to the same shared space:
\begin{equation}
\bm t=\tanh(W_Q\bm f_q+\bm b_Q).
\end{equation}
\subsubsection{Late Fusion (Element-wise Multiplication)}
Fuse the two modalities via element-wise (Hadamard) product:
\begin{equation}
\bm h=\bm v\odot\bm t.
\end{equation}
\subsubsection{Answer Prediction (MLP + Softmax)}
Apply a small MLP to the fused representation:
\begin{equation}
\bm g=\mathrm{MLP}(\bm h).
\end{equation}
Compute logits and softmax probabilities:
\begin{equation}
\bm z=W_o\bm g+\bm b_o,\qquad \bm p=\mathrm{softmax}(\bm z).
\end{equation}
Prediction (open-ended classification):
\begin{equation}
\hat k=\arg\max_{k} p_k.
\end{equation}
\subsubsection{Training Objective (Cross-Entropy)}
For a dataset (or mini-batch) of $N$ samples with one-hot labels $\bm y_i$, minimize cross-entropy:
\begin{equation}
\mathcal L_{\mathrm{CE}}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}.
\end{equation}
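The late-fusion pipeline above can be sketched end-to-end in a few lines of NumPy. All dimensions, weights, and feature values below are random toy placeholders, not trained parameters; the point is only the data flow (normalize, project, Hadamard fusion, softmax, cross-entropy):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy dimensions (illustrative only): 8-dim visual feature, 6-dim question
# feature, shared 4-dim latent space, answer vocabulary of 5 classes.
d_img, d_q, d, K = 8, 6, 4, 5
W_i, W_q = rng.normal(size=(d, d_img)), rng.normal(size=(d, d_q))
W_o = rng.normal(size=(K, d))

f_img = rng.random(d_img)                 # image encoder output (e.g., CNN)
f_q   = rng.random(d_q)                   # question encoder output (e.g., LSTM)

v = np.tanh(W_i @ (f_img / np.linalg.norm(f_img)))   # normalize + project
t = np.tanh(W_q @ f_q)                               # project question
h = v * t                                            # Hadamard (late) fusion
p = softmax(W_o @ h)                                 # answer distribution
a_hat = int(np.argmax(p))                            # predicted answer index

y = np.eye(K)[2]                                     # one-hot ground truth
ce = -float(np.sum(y * np.log(p + 1e-12)))           # cross-entropy loss
assert np.isclose(p.sum(), 1.0) and ce > 0.0
```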
\begin{figure}[!t]
\centering
% width=\linewidth 保证图片宽度自动适应单栏宽度
\includegraphics[width=\linewidth]{VQA.png}
\caption{Overview of the Visual Question Answering (VQA) task structure on DAQUAR and Visual7W datasets.}
\label{fig:vqa}
\end{figure}
\subsection{FUME: Fuzzy Multimodal Learning}
\paragraph{Notation.}
Samples are indexed by $i\in\{1,\dots,N\}$, views by $v\in\{1,\dots,V\}$, and classes by $k\in\{1,\dots,K\}$.
For view $v$, a neural network outputs logits $\bm a_i^v\in\mathbb R^K$, and the one-hot label is $\bm y_i\in\{0,1\}^K$.
\subsubsection{Logits fuzzy memberships}
\begin{align}
\bm m_i^v &= \mathrm{ReLU}\!\left(\frac{\bm a_i^v}{\lVert \bm a_i^v\rVert_p}\right)\in[0,1]^K,
\qquad
\bm m_i^v = [m_{i1}^v,\dots,m_{iK}^v]^\top .
\end{align}
\subsubsection{Category credibility (possibility + necessity)}
Define (implicit) necessity for class $k$ in view $v$ by
\begin{align}
e_{ik}^v &= 1-\max_{l\neq k} m_{il}^v,
\end{align}
and category credibility by
\begin{align}
c_{ik}^v &= \tfrac12\left(m_{ik}^v + e_{ik}^v\right)
= \tfrac12\left(m_{ik}^v + 1-\max_{l\neq k} m_{il}^v\right),
\qquad
\bm c_i^v\in[0,1]^K.
\end{align}
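The membership and credibility computations above can be made concrete with a minimal NumPy sketch (toy logits; helper names are ours, not the authors' implementation):

```python
import numpy as np

def memberships(logits, p=2):
    """ReLU of p-normalized logits -> fuzzy membership vector in [0, 1]^K."""
    a = np.asarray(logits, dtype=float)
    return np.maximum(a / np.linalg.norm(a, ord=p), 0.0)

def credibility(m):
    """c_k = (m_k + 1 - max_{l != k} m_l) / 2: average of possibility m_k
    and necessity 1 - max of the other memberships."""
    K = m.shape[0]
    c = np.empty(K)
    for k in range(K):
        c[k] = 0.5 * (m[k] + 1.0 - np.delete(m, k).max())
    return c

m = memberships([2.0, -1.0, 0.5])   # one view's logits for K = 3 classes
c = credibility(m)
assert np.all((0.0 <= c) & (c <= 1.0))
```

Note that each membership lies in $[0,1]$ automatically, since $|a_k|\le\lVert\bm a\rVert_2$ for every component.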
\subsubsection{Category-credibility learning (CCL)}
Let $l$ be the ground-truth class index (for one-hot labels).
Define training-time credibility via
\begin{align}
r_{ik}^v=
\begin{cases}
\dfrac{m_{ik}^v + 1-\max_{h\neq k} m_{ih}^v}{2}, & y_{ik}=1,\\[6pt]
\dfrac{m_{ik}^v + 1- m_{il}^v}{2}, & y_{ik}=0,
\end{cases}
\end{align}
and optimize with a BCE-style loss over a mini-batch of size $N_b$:
\begin{align}
\mathcal{L}_{\mathrm{ccl}}(\bm r^v,\bm y)
&= \frac{1}{N_b}\sum_{i=1}^{N_b}
\Big(
-\bm y_i^\top \log(\bm r_i^v)
-(\bm 1-\bm y_i)^\top \log(\bm 1-\bm r_i^v)
\Big).
\end{align}
\subsubsection{View-specific uncertainty (entropy over credibility)}
Define the binary entropy $H_{\mathrm b}(x)=-x\ln x-(1-x)\ln(1-x)$ and
\begin{align}
u_i^v &= \frac{1}{K\ln 2}\sum_{k=1}^K H_{\mathrm b}(c_{ik}^v)\in[0,1].
\end{align}
\subsubsection{Inter-view conflict (cosine disagreement of memberships)}
\begin{align}
o_i^v
&= \frac{1}{V-1}\sum_{j\neq v}\left(
1-\frac{\bm m_i^v\cdot \bm m_i^j}{\lVert \bm m_i^v\rVert\,\lVert \bm m_i^j\rVert}
\right)\in[0,1].
\end{align}
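The two reliability signals, entropy-based uncertainty and cosine disagreement, can be sketched as follows (toy membership/credibility vectors; not the released code):

```python
import numpy as np

def binary_entropy(x, eps=1e-12):
    x = np.clip(x, eps, 1.0 - eps)          # avoid log(0)
    return -(x * np.log(x) + (1.0 - x) * np.log(1.0 - x))

def view_uncertainty(c):
    """u = mean binary entropy of credibilities, scaled to [0, 1] by 1/ln 2."""
    K = c.shape[0]
    return float(binary_entropy(c).sum() / (K * np.log(2.0)))

def view_conflict(m_v, m_others):
    """Mean cosine disagreement between view v and every other view."""
    sims = [m_v @ m_j / (np.linalg.norm(m_v) * np.linalg.norm(m_j) + 1e-12)
            for m_j in m_others]
    return float(np.mean([1.0 - s for s in sims]))

m1 = np.array([0.9, 0.1, 0.0])      # view 1 memberships (toy values)
m2 = np.array([0.8, 0.2, 0.1])      # view 2 memberships
c1 = np.array([0.85, 0.1, 0.15])    # credibilities for view 1
u1, o1 = view_uncertainty(c1), view_conflict(m1, [m2])
assert 0.0 <= u1 <= 1.0 and 0.0 <= o1 <= 1.0
```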
\subsubsection{Dual-reliable fusion (DRF)}
Let $g$ be a monotone increasing function. Define
\begin{align}
w_i^v
&=
\frac{g\!\left((1-u_i^v)(1-o_i^v)\right)}
{\sum_{s=1}^V g\!\left((1-u_i^s)(1-o_i^s)\right)},
\qquad
\bm m_i^a = \sum_{v=1}^V w_i^v\,\bm m_i^v.
\end{align}
Prediction can be taken as $\hat y_i=\arg\max_{k} m_{ik}^a$.
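The dual-reliable fusion step can be sketched numerically. Here $g(x)=\exp(x)$ is an assumed choice of the monotone function, and all uncertainty/conflict values are toy inputs:

```python
import numpy as np

def drf_fuse(ms, us, os_, g=np.exp):
    """Weight each view by g((1-u)(1-o)), normalize, and fuse memberships."""
    scores = np.array([g((1.0 - u) * (1.0 - o)) for u, o in zip(us, os_)])
    w = scores / scores.sum()                      # convex fusion weights
    m_fused = sum(wi * mi for wi, mi in zip(w, ms))
    return w, m_fused

ms = [np.array([0.9, 0.1, 0.0]),    # view 1 memberships (toy)
      np.array([0.3, 0.6, 0.2])]    # view 2 memberships (toy)
# View 1 is less uncertain and less conflicted, so it should dominate:
w, m_a = drf_fuse(ms, us=[0.2, 0.7], os_=[0.1, 0.5], g=np.exp)
pred = int(np.argmax(m_a))          # fused prediction: argmax over classes
assert np.isclose(w.sum(), 1.0) and w[0] > w[1]
```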
\subsubsection{Multi-task overall objective}
Compute fused training credibility $\bm r^a$ from $\bm m_i^a$ using the same piecewise rule as above, then optimize
\begin{align}
\mathcal{L}_{\mathrm{total}}
&=
\mathcal{L}_{\mathrm{ccl}}(\bm r^a,\bm y)
+\sum_{v=1}^V \mathcal{L}_{\mathrm{ccl}}(\bm r^v,\bm y).
\end{align}
\begin{figure}[!t]
\centering
\includegraphics[width=\linewidth]{FUML.png}
\caption{The Fuzzy Multimodal Learning (FUME) framework incorporating overlap functions for optimized loss calculation.}
\label{fig:fuml}
\end{figure}
\section{Proposed method and experiment}\label{s4}
In this section, we introduce how to integrate fuzzy multimodal learning into specific VQA tasks. Given the significant differences in structure and task logic across various VQA datasets, we select DAQUAR and Visual7W as the subjects of our study. Based on the characteristics and structure of each dataset, we incorporate fuzzy multimodal learning into the corresponding VQA algorithm for fusion and implementation.
\subsection{DAQUAR}
In Visual Question Answering (VQA), problems are generally treated as open-ended answer generation tasks, but can also be transformed into a classification problem that involves selecting from a predefined set of answers. However, the DAQUAR dataset generates answers through a sequence-to-sequence approach rather than from a fixed set of categories. To introduce fuzzy set theory, we need to reinterpret the model’s output.
Specifically, the sequence generation problem can be reframed as a classification task over the vocabulary at each time step, and fuzzy set theory can be applied to model the credibility of each category (word) at every step. At each time step, the probability distribution over the vocabulary output by the model can be regarded as a fuzzy set, where the probability of each word represents its degree of membership as the correct answer at that time step.
Furthermore, the FUME framework defines category credibility by combining possibility and necessity measures. Similarly, in the sequence generation task of VQA, we can define corresponding category credibility at each time step. Since answer generation in VQA is sequential, fuzzy set theory can be applied per time step, and a loss function for the entire sequence can be constructed on this basis.
Moreover, when adapting the FUME approach from cross-modal learning tasks to the VQA task, significant changes in semantics and logic necessitate corresponding modifications to the structure of the loss function. To address this, we reformulate the original consistency loss component using an overlap function to construct a corresponding loss function that better aligns with the characteristics of the VQA task.
\subsubsection*{0) Notation and Model Architecture}
Let $B$ denote the batch size and $|\mathcal V|$ the vocabulary size. A training interaction is represented by a sample $(I^{(b)},\mathbf q^{(b)},\mathbf y^{(b)})$, consisting of an image $I^{(b)}$, a question sequence $\mathbf q^{(b)}=(q^{(b)}_1,\dots,q^{(b)}_{T_q})$, and a target answer sequence $\mathbf y^{(b)}=(y^{(b)}_1,\dots,y^{(b)}_{T_a})$. We employ teacher forcing during training, where the decoder input is denoted by $\mathbf a^{(b)}=(a^{(b)}_1,\dots,a^{(b)}_{T_a})$. The padding indicator function is defined as $\mathbb I[y^{(b)}_t\neq\texttt{pad}]$, which equals $1$ if the condition holds (i.e., not a padded token) and $0$ otherwise.
\subsubsection{Image Encoding}
We employ a VGG-16 backbone to extract global visual features. The process involves extracting the fc7 features, applying $\ell_2$ normalization, and projecting them into a shared embedding space using a linear layer with a $\tanh$ activation:
\begin{equation}
\mathbf x^{(b)}=\tanh\!\Big(W_p \cdot \mathrm{norm}_2(\mathrm{fc7}(\mathrm{VGG}(I^{(b)})))\Big) \in \mathbb R^{d_I},
\end{equation}
where $W_p$ is the projection matrix and $d_I$ is the feature dimension.
\subsubsection{Question Encoding with Visual Injection}
Unlike standard late-fusion models, we adopt an input-level fusion strategy where visual information is injected at every time step. Let $E$ be the token embedding matrix. For the encoder LSTM, the input at step $t$ is the concatenation of the image feature and the current token embedding:
\begin{equation}
\mathbf u^{(b)}_t=\big[\mathbf x^{(b)}; E[q^{(b)}_t]\big] \in \mathbb R^{d_I+d_E}.
\end{equation}
The LSTM updates its hidden state and cell state recursively:
\begin{equation}
(\mathbf h^{(b)}_t,\mathbf c^{(b)}_t)=\mathrm{LSTM}(\mathbf u^{(b)}_t,\mathbf h^{(b)}_{t-1},\mathbf c^{(b)}_{t-1}).
\end{equation}
The final states are then used to initialize the decoder.
\subsubsection{Generative Answer Decoding}
The decoder generates the answer sequence auto-regressively. Consistent with the encoder, the image feature is concatenated with the embedding of the previous token (or ground truth during training) to form the decoder input $\mathbf v^{(b)}_t$:
\begin{equation}
\mathbf v^{(b)}_t=\big[\mathbf x^{(b)}; E[a^{(b)}_t]\big].
\end{equation}
The decoder LSTM updates its states initialized by the encoder's final states:
\begin{equation}
(\mathbf s^{(b)}_t,\tilde{\mathbf c}^{(b)}_t)=\mathrm{LSTM}(\mathbf v^{(b)}_t,\mathbf s^{(b)}_{t-1},\tilde{\mathbf c}^{(b)}_{t-1}), \quad (\mathbf s^{(b)}_0,\tilde{\mathbf c}^{(b)}_0)=(\mathbf h^{(b)}_{T_q},\mathbf c^{(b)}_{T_q}).
\end{equation}
Finally, the output logits are computed via a linear projection of the hidden state: $\mathbf z^{(b)}_t=W_{\mathrm{out}}\,\mathbf s^{(b)}_t$.
\subsubsection{Fuzzy Logic Components}
To handle semantic ambiguity in VQA, we replace the standard cross-entropy loss with two fuzzy logic-based objectives. We first compute the softmax membership distribution at each step:
\begin{equation}
m^{(b)}_{t,v} = \mathrm{softmax}(\mathbf z^{(b)}_t)_v.
\end{equation}
\paragraph{Fuzzy Credibility Transformation (for FML)}
For the Fuzzy Multimodal Learning (FML) loss, we transform the membership into a credibility score to explicitly maximize the margin between the target class and the strongest distractor. Let $y$ be the target class index. We define the maximum non-target membership as $m^{(b)}_{t,\neg y}=\max_{v\neq y} m^{(b)}_{t,v}$.
The credibility is computed as:
\begin{equation}
r^{(b)}_{t,v} =
\begin{cases}
\frac{1}{2}\big(m^{(b)}_{t,y} + 1 - m^{(b)}_{t,\neg y}\big), & \text{if } v = y \text{ (Target)}, \\[4pt]
\frac{1}{2}\big(m^{(b)}_{t,v} + 1 - m^{(b)}_{t,y}\big), & \text{if } v \neq y \text{ (Non-target)}.
\end{cases}
\end{equation}
This transformation ensures that a high credibility for the target requires not only a high probability but also a low probability for the most confusing alternative.
\paragraph{Possibility and Necessity (for Aggregation Loss)}
For the Fuzzy Aggregation Loss, we define the \textit{Possibility} and \textit{Necessity} of the target class. Optionally, a temperature $\tau$ can be applied to the logits before softmax.
\begin{equation}
p^{(b)}_t = \pi^{(b)}_{t,y}, \quad q^{(b)}_t = 1 - \max_{v\neq y}\pi^{(b)}_{t,v},
\end{equation}
where $\bm\pi^{(b)}_t=\mathrm{softmax}(\mathbf z^{(b)}_t/\tau)$ is the softmax distribution (scaled by $\tau$). Here, $q^{(b)}_t$ represents the necessity, derived from the complement of the strongest contending class.
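A single decoding step's fuzzy quantities can be sketched as follows (toy logits; `step_fuzzy_terms` is our illustrative helper, not the released implementation):

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp((np.asarray(z, float) - np.max(z)) / tau)
    return e / e.sum()

def step_fuzzy_terms(logits, y, tau=1.0):
    """Per-step possibility p, necessity q, and credibility vector r."""
    m = softmax(logits, tau)                # membership distribution
    m_distract = np.max(np.delete(m, y))    # strongest non-target membership
    p = m[y]                                # possibility of the target
    q = 1.0 - m_distract                    # necessity of the target
    r = 0.5 * (m + 1.0 - m[y])              # credibility, non-target branch
    r[y] = 0.5 * (m[y] + 1.0 - m_distract)  # credibility, target branch
    return p, q, r

# Target class 0 dominates the toy logits, so its credibility is highest:
p, q, r = step_fuzzy_terms([2.0, 0.1, -1.0], y=0)
assert 0.0 < p <= 1.0 and 0.0 < q <= 1.0 and r[0] == max(r)
```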
\subsubsection{Training Objectives}
The total loss is composed of the Fuzzy Aggregation Loss ($L_{\text{agg}}$) and the FML Loss ($L_{\text{FML}}$).
\paragraph{1. Fuzzy Aggregation Loss}
This loss maximizes the consistency between the possibility and necessity of the correct answer using an \textbf{Overlap Function} $F$. In this work, we focus on the Cosine Overlap, though Geometric and Harmonic variants are also supported:
\begin{equation}
c^{(b)}_t = F_{\text{cos}}(p^{(b)}_t, q^{(b)}_t) = \frac{p^{(b)}_t q^{(b)}_t}{\sqrt{(p^{(b)}_t)^2+(q^{(b)}_t)^2+\varepsilon}+\varepsilon}.
\end{equation}
The aggregated loss is the average complement of the consistency score over valid tokens:
\begin{equation}
L_{\text{agg}} = \frac{\sum_{b,t} \mathbb I[y^{(b)}_t \neq \texttt{pad}] \cdot (1 - c^{(b)}_t)}{N_{\text{valid}}},
\end{equation}
where $N_{\text{valid}}$ is the total count of non-padded tokens.
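The masked aggregation loss can be sketched in NumPy (the padding index `PAD` and all possibility/necessity values are toy placeholders):

```python
import numpy as np

PAD = -1  # hypothetical padding index, for illustration only

def cosine_overlap(p, q, eps=1e-8):
    """Cosine overlap F_cos(p, q) with the stabilizing epsilon terms."""
    return p * q / (np.sqrt(p * p + q * q + eps) + eps)

def aggregation_loss(ps, qs, targets):
    """Mean (1 - overlap consistency) over non-padded decoding steps."""
    mask = np.asarray(targets) != PAD
    c = cosine_overlap(np.asarray(ps, float), np.asarray(qs, float))
    return float(np.sum(mask * (1.0 - c)) / mask.sum())

# Two valid steps and one padded step (toy values):
loss = aggregation_loss(ps=[0.9, 0.7, 0.0], qs=[0.8, 0.6, 0.0],
                        targets=[5, 2, PAD])
assert 0.0 <= loss <= 1.0
```

Note that $F_{\text{cos}}(p,q)\le pq/\sqrt{p^2+q^2}\le 1/\sqrt2$, so each per-step term $1-c$ stays strictly positive even for a perfectly confident step.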
\paragraph{2. FML Loss}
The FML loss applies a cross-entropy-like penalty on the transformed credibility distribution $\mathbf r^{(b)}_t$, encouraging the model to produce confident, distinct predictions:
\begin{equation}
L_{\text{FML}} = -\frac{1}{N_{\text{valid}}} \sum_{b,t} \mathbb I[y^{(b)}_t \neq \texttt{pad}] \log \frac{\exp(r^{(b)}_{t,y})}{\sum_{v\in\mathcal V}\exp(r^{(b)}_{t,v})}.
\end{equation}
\paragraph{Total Objective}
The final optimization objective completely replaces standard cross-entropy:
\begin{equation}
L_{\text{total}} = \lambda_{\text{agg}} L_{\text{agg}} + \lambda_{\text{FML}} L_{\text{FML}}.
\end{equation}
We use the Adam optimizer with gradient clipping to update the model parameters based on $L_{\text{total}}$.
\begin{algorithm}[htb]
\caption{Fuzzy Logic-based Seq2Seq VQA Training (DAQUAR)}
\label{alg:fume_vqa}
\KwIn{Batch of images $I$, questions $\mathbf q$, ground-truth answers $\mathbf y$; hyperparameters $\lambda_{\text{agg}}$, $\lambda_{\text{FML}}$, $\varepsilon$; learning rate}
\KwOut{Updated model parameters $\theta$}
\For{each batch $b \in \{1, \dots, B\}$}{
  \tcp{1. Image encoding}
  $f_{\text{raw}} \leftarrow \mathrm{fc7}(\mathrm{VGG}(I^{(b)}))$\;
  $\mathbf{x}^{(b)} \leftarrow \tanh(W_p \cdot \mathrm{norm}_2(f_{\text{raw}}))$\;
  \tcp{2. Question encoding with visual injection}
  \For{$t = 1$ \KwTo $\mathrm{length}(\mathbf{q}^{(b)})$}{
    $\mathbf{u}_t \leftarrow [\mathbf{x}^{(b)}; E[q^{(b)}_t]]$ \tcp*{concatenate image + token}
    $(\mathbf{h}_t, \mathbf{c}_t) \leftarrow \mathrm{EncoderLSTM}(\mathbf{u}_t, \mathbf{h}_{t-1}, \mathbf{c}_{t-1})$\;
  }
  Initialize decoder states: $(\mathbf{s}_0, \tilde{\mathbf{c}}_0) \leftarrow (\mathbf{h}_{\mathrm{last}}, \mathbf{c}_{\mathrm{last}})$\;
  \tcp{3. Answer decoding and fuzzy loss calculation}
  $L_{\text{batch}} \leftarrow 0$\;
  \For{$t = 1$ \KwTo $\mathrm{length}(\mathbf{y}^{(b)})$}{
    $y \leftarrow y^{(b)}_t$ \tcp*{target class index}
    \lIf{$y$ is padding}{\textbf{continue}}
    $\mathbf{v}_t \leftarrow [\mathbf{x}^{(b)}; E[y^{(b)}_{t-1}]]$ \tcp*{teacher forcing: image + previous truth}
    $(\mathbf{s}_t, \tilde{\mathbf{c}}_t) \leftarrow \mathrm{DecoderLSTM}(\mathbf{v}_t, \mathbf{s}_{t-1}, \tilde{\mathbf{c}}_{t-1})$\;
    $\mathbf{z}_t \leftarrow W_{\mathrm{out}} \mathbf{s}_t$;\quad $\mathbf{m}_t \leftarrow \mathrm{softmax}(\mathbf{z}_t)$ \tcp*{membership distribution}
    $m_{t, \neg y} \leftarrow \max_{v \neq y} m_{t,v}$ \tcp*{max distractor score}
    \tcp{A. aggregation loss component}
    $p_t \leftarrow m_{t,y}$;\quad $q_t \leftarrow 1 - m_{t, \neg y}$\;
    $c_t \leftarrow \dfrac{p_t q_t}{\sqrt{p_t^2 + q_t^2 + \varepsilon} + \varepsilon}$;\quad $\ell_{\text{agg}} \leftarrow 1 - c_t$\;
    \tcp{B. FML loss component}
    $r_{t,y} \leftarrow \frac{1}{2}(m_{t,y} + 1 - m_{t, \neg y})$\;
    \For{$v \in \mathcal{V},\ v \neq y$}{$r_{t,v} \leftarrow \frac{1}{2}(m_{t,v} + 1 - m_{t,y})$\;}
    $\ell_{\text{FML}} \leftarrow -\log \dfrac{\exp(r_{t,y})}{\sum_k \exp(r_{t,k})}$\;
    $L_{\text{batch}} \leftarrow L_{\text{batch}} + \lambda_{\text{agg}} \ell_{\text{agg}} + \lambda_{\text{FML}} \ell_{\text{FML}}$\;
  }
}
$L_{\text{total}} \leftarrow L_{\text{batch}} / N_{\text{valid}}$\;
$\theta \leftarrow \mathrm{Adam}(\theta, \nabla_\theta L_{\text{total}})$ \tcp*{with gradient clipping}
\end{algorithm}
\subsection{DAQUAR's experiment}
Based on the algorithmic workflow described above, we conducted experiments on the DAQUAR dataset. The experiments were primarily divided into two groups: \textbf{FUME}, which does not utilize the overlap function as a loss term, and \textbf{FUME+overlap}, which incorporates the overlap function.
The hyperparameters were configured as follows: a batch size of 32, 20 training epochs, and a fixed learning rate. To mitigate the impact of randomness caused by data shuffling in a multi-threaded environment, we performed the experiments using five distinct random seeds (42, 84, 126, 168, and 210). Table~\ref{tab:daquar_performance} presents a comparison of the best results achieved, where ``Baseline'' refers to the results reported in the original paper.
\begin{table}[htbp]
\centering
\caption{Comparison of the best performance of different algorithms on the DAQUAR dataset.}
\label{tab:daquar_performance}
\begin{tabular}{lccccc}
\toprule
\textbf{Method} & \multicolumn{5}{c}{\textbf{Seed}} \\
\cmidrule(lr){2-6}
& 42 & 84 & 126 & 168 & 210 \\
\midrule
FUME & 28.11\% & 28.59\% & 29.55\% & 28.71\% & 28.99\% \\
FUME+overlap & 28.87\% & 28.59\% & 29.91\% & 29.63\% & 29.79\% \\
\midrule
% Baseline is a fixed reference value, centered across all seed columns
Baseline\cite{malinowski2017ask} & \multicolumn{5}{c}{25.74\%} \\
\bottomrule
\end{tabular}
\end{table}
As shown in Table~\ref{tab:daquar_performance}, the proposed FUME algorithm demonstrates a significant performance improvement over the baseline on the DAQUAR dataset and maintains consistent performance across multiple random seeds. While the \textbf{FUME+overlap} variant (incorporating the overlap function into the loss) slightly outperforms the standard \textbf{FUME} algorithm in terms of peak accuracy, the overall margin between the two remains within 1\%. Consequently, we proceed with a more in-depth comparison between FUME and FUME+overlap to analyze their specific differences.
To validate the robustness of our proposed method (\textbf{FUME+overlap}) compared to the baseline (\textbf{FUME}), we conducted experiments using five different random seeds: 42, 84, 126, 168, and 210. The sequence accuracy curves over 50 training epochs are presented in Figure~\ref{fig:vertical_stack}.
\begin{figure}[htbp]
\centering
% each sub-image spans 0.3\textwidth; \tiny seed labels minimize vertical space
\includegraphics[width=0.3\textwidth]{42.jpg}\\ \tiny{Seed 42} \\[0.5em]
\includegraphics[width=0.3\textwidth]{84.jpg}\\ \tiny{Seed 84} \\[0.5em]
\includegraphics[width=0.3\textwidth]{126.jpg}\\ \tiny{Seed 126} \\[0.5em]
\includegraphics[width=0.3\textwidth]{168.jpg}\\ \tiny{Seed 168} \\[0.5em]
\includegraphics[width=0.3\textwidth]{210.jpg}\\ \tiny{Seed 210}
\caption{\textbf{Vertical Comparison.} FUME (blue) vs. FUME+overlap (orange) across 5 seeds.}
\label{fig:vertical_stack}
\end{figure}
The comparative results across multiple random seeds reveal a consistent pattern in the training dynamics between the baseline (FUME) and our proposed method (FUME+overlap). The analysis can be summarized into three key stages:
\begin{itemize}
\item \textbf{Early Convergence Dynamics:}
In the initial phase (approximately the first 15 epochs), the baseline model (blue line) exhibits a steeper learning curve and faster convergence. We acknowledge this objectively and attribute it to the nature of the proposed loss function. The introduction of the Overlap Loss imposes additional geometric constraints, thereby increasing the complexity of the optimization landscape and making the initial exploration more challenging compared to the unconstrained baseline.
\item \textbf{The Crossover Phenomenon:}
A defining characteristic of our method is the distinct ``crossover'' observed in the mid-training phase. Despite a slower start, the proposed method (orange line) consistently intercepts and surpasses the baseline curve in every experiment. This trajectory demonstrates the method's capacity for sustained learning (``stamina''), indicating that the model continues to refine its representations effectively even after the baseline begins to saturate.
\item \textbf{Superior Peak Performance and Stability:}
In the final stages of training, the proposed method not only achieves a higher sequence accuracy but also plateaus at a more stable level. While the baseline tends to fluctuate or stagnate, our method demonstrates enhanced robustness. This validates the effectiveness of the proposed approach, confirming that the overlap constraint acts as a beneficial regularizer that trades off initial convergence speed for superior generalization and higher final performance.
\end{itemize}
Overall, the proposed \textbf{FUME+overlap} algorithm proves superior to the conventional FUME baseline on the DAQUAR dataset, offering significantly improved stability and greater generalization capability.
\subsection{Visual7W}
Regarding the Visual7W (V7W) dataset, our focus is on its Telling sub-task. The structure of this task is as follows: the input consists of an image and a question, and the output is a single correct answer selected from four options. These four options are specifically designed for each given image-question pair.
We transform this 4-choice-1 task into four binary classification tasks to process. Specifically, each (image, question, candidate answer) combination is treated as an independent binary classification sample, predicting whether the candidate answer is the "correct answer" or an "incorrect answer."
Through this transformation, the binary classification, membership degree, and uncertainty modeling methods from the FUME framework can be directly applied to the VQA task. At the same time, following the approach used with the DAQUAR dataset, we replace the original consistency loss in the FUME method with a loss function constructed based on an overlap function, which serves as a regularization term during model training.
\subsubsection{Data Expansion (VQA Telling as Binary Classification)}
Given a dataset
\begin{equation}
\mathcal{D} = \bigl\{\bigl(x_I^{q}, \{x_T^{q,k}\}_{k=1}^{4}, k_q^{*}\bigr)\bigr\}_{q=1}^{Q},
\end{equation}
where $q$ is the question index, $x_I^{q}$ is the image associated with question $q$,
$x_T^{q,k}$ is the text of the $k$-th candidate (concatenation of question and answer choice), and $k_q^{*}$ denotes the index of the correct answer.
To formulate the problem as a binary classification task, we define per-candidate samples indexed by $i = 4(q-1) + k$:
\begin{equation}
x_I^{i} = x_I^{q}, \qquad x_T^{i} = x_T^{q,k}.
\end{equation}
Here $x_I^{i}$ and $x_T^{i}$ represent the image and text inputs for sample $i$, respectively.
We assign a binary one-hot label to each sample as follows:
\begin{equation}
y^{i} =
\begin{cases}
(1, 0), & k = k_q^{*},\\
(0, 1), & k \neq k_q^{*}.
\end{cases}
\end{equation}
The total number of training samples becomes $N = 4Q$.
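The expansion above can be sketched as follows (a minimal illustration; the tuple layout, helper name, and 0-based candidate indexing are assumptions, not the authors' code):

```python
def expand_to_binary(dataset):
    """Expand each 4-candidate question into four binary samples.

    `dataset` is a list of (image, candidates, correct_idx) tuples, where
    `candidates` holds the four question+answer texts and `correct_idx`
    is the index of the correct one (0-based here for simplicity).
    """
    samples = []
    for image, candidates, correct_idx in dataset:
        for k, text in enumerate(candidates):
            # One-hot label: (1, 0) = correct, (0, 1) = incorrect.
            label = (1, 0) if k == correct_idx else (0, 1)
            samples.append((image, text, label))
    return samples

# A toy dataset with Q = 2 questions yields N = 4Q = 8 samples.
toy = [("img0", ["a", "b", "c", "d"], 2),
       ("img1", ["e", "f", "g", "h"], 0)]
expanded = expand_to_binary(toy)
```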
\subsubsection{Common Space Projection}
Before feeding features into the fuzzy logic module, both modalities are projected into a shared semantic space $\mathbb{R}^{d}$.
\paragraph{Image Stream}
For each sample $i$, the global visual feature $v^{i} = \mathrm{VGG}(x_I^{i})$ is extracted using a pre-trained backbone (e.g., VGG-16). It is then projected to the common space via a linear layer:
\begin{equation}
z_I^{i} = W_I v^{i} + b_I,
\end{equation}
where $W_I$ and $b_I$ are learnable projection parameters.
\paragraph{Text Stream}
The text input $x_T^{i}$ is encoded using a Spatial Attention LSTM. The final hidden state $h^{i}$ captures the semantic meaning of the question-answer pair. This feature is similarly projected:
\begin{equation}
z_T^{i} = W_T h^{i} + b_T,
\end{equation}
where $W_T$ and $b_T$ map the text features to the same common space as $z_I^{i}$.
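The two projections can be sketched concretely (the dimensions and random initialization here are illustrative assumptions, not the trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                    # assumed common-space dimension
d_img, d_txt = 4096, 1024  # assumed backbone / LSTM feature sizes

# Learnable projection parameters (randomly initialized for illustration).
W_I, b_I = rng.normal(size=(d, d_img)) * 0.01, np.zeros(d)
W_T, b_T = rng.normal(size=(d, d_txt)) * 0.01, np.zeros(d)

v = rng.normal(size=d_img)  # stand-in for the global visual feature
h = rng.normal(size=d_txt)  # stand-in for the final text hidden state

z_I = W_I @ v + b_I  # image embedding in the common space
z_T = W_T @ h + b_T  # text embedding in the common space
```

Both streams end up with the same dimensionality, which is what allows the shared FUME head to process them interchangeably.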
\subsubsection{Fuzzy Multimodal Logic (FUME) Head}
The FUME head processes the embeddings $z_I^{i}$ and $z_T^{i}$ independently. For each modality $j \in \{I, T\}$:
\begin{equation}
(m_j^{i}, e_j^{i}, c_j^{i}, r_j^{i}) = g(z_j^{i}; \theta_g),
\end{equation}
where $g(\cdot\,; \theta_g)$ represents the shared FUME head layers. The outputs include:
\begin{itemize}
\item $m_j^{i}, e_j^{i}$: Fuzzy possibility and necessity scores.
\item $c_j^{i}$: Class confidence scores for the ``correct'' and ``incorrect'' classes, used for inference.
\item $r_j^{i}$: Training confidence scores regressed against the ground truth.
\end{itemize}
\subsubsection{Training Objective}
The model is optimized using a comprehensive Fuzzy Multimodal Loss $\mathcal{L}_{\mathrm{FML}}$, which integrates regressive accuracy, classification capability, and logical consistency:
\begin{equation}
\mathcal{L}_{\mathrm{FML}} = \ell_{\mathrm{mse}} + \lambda_{\mathrm{ce}}\,\ell_{\mathrm{ce}} + \lambda_{\mathrm{fuzzy}}\,\ell_{\mathrm{fuzzy}}.
\end{equation}
Here, $\lambda_{\mathrm{ce}}$ is fixed based on empirical settings, and $\lambda_{\mathrm{fuzzy}}$ controls the strength of the overlap regularization.
The three components are defined as follows:
\paragraph{1) Regression Loss ($\ell_{\mathrm{mse}}$)}
Minimizes the error between the training confidences and the label $y^{i}$:
\begin{equation}
\ell_{\mathrm{mse}} = \|r_I^{i} - y^{i}\|^{2} + \|r_T^{i} - y^{i}\|^{2}.
\end{equation}
\paragraph{2) Classification Loss ($\ell_{\mathrm{ce}}$)}
Ensures classification accuracy using the class confidences $c_I^{i}$ and $c_T^{i}$:
\begin{equation}
\ell_{\mathrm{ce}} = \mathrm{CE}(c_I^{i}, y^{i}) + \mathrm{CE}(c_T^{i}, y^{i}),
\end{equation}
where $\mathrm{CE}$ denotes the standard cross-entropy loss.
\paragraph{3) Fuzzy Aggregation Regularizer ($\ell_{\mathrm{fuzzy}}$)}
Enforces logical consistency between modalities using overlap functions. Let $s^{i}$ be the sample-level rule satisfaction derived from the overlap of confidences (as defined in Eq. [Ref-to-definitions]). This term acts as a regularizer:
\begin{equation}
\ell_{\mathrm{fuzzy}} = -\log(s^{i} + \varepsilon).
\end{equation}
The network parameters are updated via the Adam optimizer to minimize $\mathcal{L}_{\mathrm{FML}}$.
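The three components and their weighted sum can be sketched per sample in NumPy (a minimal illustration mirroring Algorithm~\ref{alg:v7w_fume}; the function name, default weight values, and the softmax-based cross-entropy are assumptions):

```python
import numpy as np

def fml_loss(r_I, r_T, c_I, c_T, y, s, lam_ce=1.0, lam_fuzzy=0.1, eps=1e-8):
    """Composite fuzzy multimodal loss for one sample.

    r_*: training confidences, c_*: class confidence logits,
    y: one-hot label, s: sample-level overlap satisfaction in (0, 1].
    The lambda values here are illustrative, not the paper's settings.
    """
    y = np.asarray(y, dtype=float)
    # 1) Regression loss: squared error of the training confidences.
    l_mse = np.sum((r_I - y) ** 2) + np.sum((r_T - y) ** 2)

    # 2) Classification loss: cross-entropy on softmaxed class confidences.
    def ce(logits):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return -np.sum(y * np.log(p + eps))
    l_ce = ce(np.asarray(c_I, dtype=float)) + ce(np.asarray(c_T, dtype=float))

    # 3) Fuzzy aggregation regularizer: penalize low rule satisfaction.
    l_fuzzy = -np.log(s + eps)

    return l_mse + lam_ce * l_ce + lam_fuzzy * l_fuzzy
```

A perfectly confident, logically consistent sample ($r = y$, sharp logits, $s = 1$) incurs near-zero loss, while a drop in overlap satisfaction $s$ raises the loss monotonically.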
\subsubsection{Inference: 4-way Answer Selection}
During the test phase, for a question $q$ with candidates $k \in \{1, \dots, 4\}$, the model computes the class confidence vectors $c_I^{q,k}$ and $c_T^{q,k}$.
We define the probability of candidate $k$ being correct by applying a softmax function and extracting the probability of the positive class (indexed as $0$):
\begin{equation}
p_I^{q,k} = \mathrm{Softmax}(c_I^{q,k})[0], \qquad p_T^{q,k} = \mathrm{Softmax}(c_T^{q,k})[0].
\end{equation}
The final ranking score is the average of these probabilities:
\begin{equation}
s_{q,k} = \tfrac{1}{2}\bigl(p_I^{q,k} + p_T^{q,k}\bigr).
\end{equation}
The system then selects the candidate with the highest score:
\begin{equation}
\hat{k}_q = \arg\max_{k}\, s_{q,k},
\end{equation}
where $\hat{k}_q$ is the predicted answer index for question $q$.
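This scoring rule can be sketched as follows (a minimal illustration; the array shapes and toy confidence values are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict(c_I, c_T):
    """4-way answer selection: average the two modalities' probabilities of
    the positive ('correct') class, indexed as 0, and take the argmax.

    c_I, c_T: arrays of shape (4, 2) with class confidences per candidate.
    """
    scores = [(softmax(ci)[0] + softmax(ct)[0]) / 2.0
              for ci, ct in zip(c_I, c_T)]
    return int(np.argmax(scores)), scores

# Toy confidences: candidate 2 is strongly 'correct' in both modalities.
c_I = np.array([[-1., 1.], [0., 0.], [3., -3.], [-2., 2.]])
c_T = np.array([[-1., 1.], [-1., 1.], [2., -2.], [0., 0.]])
k_hat, scores = predict(c_I, c_T)
```

Averaging the two per-modality probabilities means a candidate must look correct to both streams to win the ranking.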
\begin{algorithm}[htb]
\caption{Fuzzy Binary Classification for the V7W Telling Task}
\label{alg:v7w_fume}
\KwIn{Dataset $\mathcal{D} = \{(x_I^{q}, \{x_T^{q,k}\}_{k=1}^{4}, k_q^{*})\}_{q=1}^{Q}$; hyperparameters $\lambda_{\mathrm{ce}}, \lambda_{\mathrm{fuzzy}}$; learning rate $\eta$}
\KwOut{Updated model parameters $\theta$}
\tcp{Phase 1: Data Expansion (pre-processing)}
Initialize training set $\mathcal{S} \leftarrow \emptyset$\;
\For{$q \in \{1, \dots, Q\}$}{
  \For{$k \in \{1, \dots, 4\}$}{
    $x_I^{i} \leftarrow x_I^{q}$\;
    $x_T^{i} \leftarrow \mathrm{Concat}(x_T^{q}, x_T^{q,k})$ \tcp*{question + candidate}
    \eIf{$k = k_q^{*}$}{
      $y^{i} \leftarrow (1, 0)$ \tcp*{label: correct}
    }{
      $y^{i} \leftarrow (0, 1)$ \tcp*{label: incorrect}
    }
    Add sample $(x_I^{i}, x_T^{i}, y^{i})$ to $\mathcal{S}$\;
  }
}
\tcp{Phase 2: Training loop}
\For{each epoch}{
  \For{each mini-batch $\mathcal{B} \subset \mathcal{S}$}{
    $L_{\mathrm{batch}} \leftarrow 0$\;
    \For{each sample $i \in \mathcal{B}$}{
      \tcp{A. Feature projection}
      $z_I^{i} \leftarrow W_I \cdot \mathrm{VGG}(x_I^{i}) + b_I$\;
      $z_T^{i} \leftarrow W_T \cdot \mathrm{SpatialAttLSTM}(x_T^{i}) + b_T$\;
      \tcp{B. FUME head processing (shared)}
      \For{modality $j \in \{I, T\}$}{
        $(m_j^{i}, e_j^{i}, c_j^{i}, r_j^{i}) \leftarrow g(z_j^{i}; \theta_g)$\;
      }
      \tcp{C. Loss computation}
      $\ell_{\mathrm{mse}} \leftarrow \|r_I^{i} - y^{i}\|^{2} + \|r_T^{i} - y^{i}\|^{2}$\;
      $\ell_{\mathrm{ce}} \leftarrow \mathrm{CE}(c_I^{i}, y^{i}) + \mathrm{CE}(c_T^{i}, y^{i})$\;
      Compute overlap satisfaction $s^{i}$ from $m, e$\;
      $\ell_{\mathrm{fuzzy}} \leftarrow -\log(s^{i} + \varepsilon)$\;
      $L_{\mathrm{batch}} \leftarrow L_{\mathrm{batch}} + \ell_{\mathrm{mse}} + \lambda_{\mathrm{ce}}\ell_{\mathrm{ce}} + \lambda_{\mathrm{fuzzy}}\ell_{\mathrm{fuzzy}}$\;
    }
    $\theta \leftarrow \mathrm{Adam}(\theta, \nabla_{\theta} L_{\mathrm{batch}})$\;
  }
}
\tcp{Phase 3: Inference (4-way selection) for a test question $q$}
\For{$k \in \{1, \dots, 4\}$}{
  Compute $c_I^{q,k}, c_T^{q,k}$ via a forward pass\;
  $p_I \leftarrow \mathrm{Softmax}(c_I^{q,k})[0]$ \tcp*{prob.\ of the ``correct'' class}
  $p_T \leftarrow \mathrm{Softmax}(c_T^{q,k})[0]$\;
  $s_{q,k} \leftarrow \frac{1}{2}(p_I + p_T)$\;
}
\KwRet{$\hat{k}_q = \arg\max_{k} s_{q,k}$}
\end{algorithm}
\subsection{Experiments on Visual7W}
Based on the algorithmic workflow described above, we conducted experiments on the Visual7W dataset. The experiments were primarily divided into two groups: \textbf{FUME}, which does not utilize the overlap function as a loss term, and \textbf{FUME+overlap}, which incorporates the overlap function.
The hyperparameters were configured as follows: a batch size of 32, 20 epochs, and a learning rate of . To mitigate the impact of randomness caused by data shuffling in a multi-threaded environment, we performed the experiments using five distinct random seeds (42, 84, 126, 168, and 210). Table~\ref{tab:v7w_performance} presents a comparison of the best results achieved, where ``Baseline'' refers to the results reported in the original paper.
\begin{table}[htbp]
\centering
\caption{Comparison of the best performance of different algorithms on the Visual7W dataset.}
\label{tab:v7w_performance}
\begin{tabular}{lccccc}
\toprule
\textbf{Method} & \multicolumn{5}{c}{\textbf{Seed}} \\
\cmidrule(lr){2-6}
 & 42 & 84 & 126 & 168 & 210 \\
\midrule
FUME & 56.93\% & 55.42\% & 28.46\% & 30.10\% & 34.73\% \\
FUME+overlap & 57.08\% & 55.00\% & 32.81\% & 55.94\% & 54.39\% \\
\midrule
Baseline~\cite{zhu2016visual7w} & \multicolumn{5}{c}{54.0\%} \\
\bottomrule
\end{tabular}
\end{table}
\indent As indicated in Table~\ref{tab:v7w_performance}, the FUME-based algorithms achieve significantly better peak performance than the baseline on the Visual7W dataset. However, the standard FUME algorithm performs suboptimally under certain random seeds, exhibiting noticeable instability. In contrast, the \textbf{FUME+overlap} method (which utilizes the overlap function as a loss term) not only marginally outperforms standard FUME in peak accuracy (57.08\% vs.\ 56.93\%) but, more importantly, demonstrates significantly superior stability across seeds. Consequently, \textbf{FUME+overlap} is the superior algorithm, offering a robust combination of enhanced stability and higher peak performance.
\section{Discussions}\label{s5}
In this work, we successfully extended the Fuzzy Multimodal Learning (FUME) framework from Cross-Modal Retrieval (CMR) to Visual Question Answering (VQA). While the core philosophy of modeling uncertainty via possibility and necessity measures remains consistent, our experimental analysis reveals that the direct transfer of loss functions—specifically the Consistency Learning Loss $\mathcal{L}_{\mathrm{con}}$ used in retrieval—is suboptimal for VQA. Consequently, we proposed a novel optimization strategy centered on \textbf{Overlap Functions} ($\mathcal{O}$). The rationale behind this architectural shift is rooted in the fundamental mathematical and logical divergences between retrieval and reasoning tasks.
\subsection{Analysis on Generative Tasks (DAQUAR)}
For the DAQUAR dataset, which we treat as a sequence-to-sequence generation task, we replaced the standard Cross-Entropy and the original Consistency Loss with a combination of the FML Loss and the Overlap Aggregation Loss $\mathcal{L}_{\mathrm{overlap}}$. This replacement is justified by the transition from \textit{static geometric alignment} to \textit{dynamic logical reasoning}.
\textbf{1) Incompatibility of Metric Learning in Sequential Decoding:}
The original $\mathcal{L}_{\mathrm{con}}$ in CMR relies on metric learning to minimize the distance between static image and text embeddings in a shared manifold. However, in autoregressive generation, the hidden state $h_t$ at time step $t$ is dynamic. Enforcing a global static alignment via $\mathcal{L}_{\mathrm{con}}$ disrupts the temporal dependency required for sequence generation.
\textbf{2) Ambiguity Reduction via Overlap Functions:}
Standard Cross-Entropy maximizes the likelihood of the target token but fails to explicitly penalize high-confidence distractors. By introducing $\mathcal{L}_{\mathrm{overlap}}$ based on Overlap Functions $\mathcal{O}$, we impose a biconditional logical constraint. Let $p_t$ be the possibility of the target and $q_t$ be the necessity. The overlap function satisfies the boundary condition:
\begin{equation}
\mathcal{O}(p_t, q_t) \approx 1 \iff p_t \approx 1 \quad \text{AND} \quad q_t \approx 1
\end{equation}
This property acts as a soft logical ``AND'' gate, creating a dynamic margin. It forces the model to produce predictions where the target probability is high \textit{and} the most confusing distractor is significantly suppressed, thereby mitigating exposure bias and hallucination in open-ended generation.
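For concreteness, the product $\mathcal{O}(p, q) = p\,q$ is one standard overlap function exhibiting exactly this soft-AND boundary behavior (an illustrative choice; the paper's specific $\mathcal{O}$ may differ):

```python
def overlap(p, q):
    """Product overlap: symmetric, increasing in each argument,
    and O(p, q) = 1 iff p = q = 1, O(p, q) = 0 iff p * q = 0."""
    return p * q

# Soft-AND behavior: the score is near 1 only when both inputs are near 1.
assert overlap(1.0, 1.0) == 1.0
assert overlap(0.95, 0.2) < 0.2   # one weak input drags the score down
assert overlap(0.0, 1.0) == 0.0
```

Under $-\log$ penalization, a sample is thus only ``cheap'' for the model when the target's possibility and necessity are simultaneously high.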
\subsection{Analysis on Discriminative Tasks (Visual7W)}
For the Visual7W Telling task, reformulated as a binary classification problem, we substituted $\mathcal{L}_{\mathrm{con}}$ with $\mathcal{L}_{\mathrm{overlap}}$ to address the issues of \textit{dense feature spaces} and \textit{hard negative mining}.
\textbf{1) Failure of $\mathcal{L}_{\mathrm{con}}$ in Dense Spaces:}
In retrieval tasks, a positive pair and a negative pair originate from distinct instances. However, in V7W, a sample consists of the triple $(I, Q, A)$. The positive sample $(I, Q, A^{+})$ and the negative sample $(I, Q, A^{-})$ share the exact same visual input $I$ and question $Q$.
Mathematically, let $\phi$ be the embedding function. The distance between positive and negative representations in the feature space becomes vanishingly small:
\begin{equation}
\lim_{\text{sim}(A^{+}, A^{-}) \to 1} \bigl\| \phi(I, Q, A^{+}) - \phi(I, Q, A^{-}) \bigr\| \to \epsilon
\end{equation}
In such a ``crowded'' local neighborhood, the gradient contribution from the distance-based $\mathcal{L}_{\mathrm{con}}$ becomes vanishingly small or noisy, as the loss function struggles to push apart vectors that are inherently highly correlated.
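A toy numeric illustration of this collapse, using a random linear map as a stand-in for $\phi$ (all dimensions, scales, and noise levels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8)) * 0.1  # stand-in linear embedding phi

shared = rng.normal(size=8)                  # shared (I, Q) content
a_pos = shared + 0.01 * rng.normal(size=8)   # correct candidate A+
a_neg = shared + 0.01 * rng.normal(size=8)   # hard negative candidate A-
unrelated = rng.normal(size=8)               # a sample from another instance

# Positive/negative embeddings nearly coincide, unlike cross-instance pairs.
gap = np.linalg.norm(W @ a_pos - W @ a_neg)
ref = np.linalg.norm(W @ a_pos - W @ unrelated)
```

Because `gap` is far smaller than the cross-instance distance `ref`, any loss whose gradient scales with the embedding distance receives almost no signal from such hard negatives.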
\textbf{2) From Matching to Verification:}
The nature of the task shifts from matching (ranking similarity) to verification (truth validation). The Overlap Function optimizes the \textbf{internal logical consistency} of the prediction rather than relative geometric distances. It penalizes ``hesitant'' predictions where the model correctly identifies the answer (the possibility $p$ is high) but fails to reject a hard negative option (the necessity $q$ is low). This forces the network to learn fine-grained semantic distinctions to satisfy the strict condition $\mathcal{O}(p, q) \approx 1$, ensuring robustness against strong interference.
\subsection{Summary}
In summary, the introduction of Overlap Functions marks a paradigm shift from \textbf{optimizing geometric proximity} (as in CMR) to \textbf{optimizing logical coherence} (as in VQA). This work demonstrates that while fuzzy uncertainty modeling is universally applicable, the loss landscape must be tailored to the specific logical structure—generative or discriminative—of the downstream multimodal task.
\section{Conclusions}\label{s6}
This paper extends the Fuzzy Multimodal Learning (FUME) framework to Visual Question Answering (VQA) by integrating \emph{Overlap Functions} into the loss design. Our experiments on DAQUAR and Visual7W demonstrate that shifting from geometric proximity optimization to logical coherence constraints significantly enhances model robustness and accuracy, particularly in distinguishing hard negative samples in dense feature spaces.
Future research will focus on two directions: first, exploring generalized fuzzy aggregation functions (e.g., grouping functions) to handle multi-class tasks directly, thereby reducing the computational cost of binary transformations; second, extending this fuzzy logic-guided strategy to large-scale generative models to mitigate hallucinations and improve interpretability.
\bibliographystyle{IEEEtran}
\bibliography{referencea}
\end{document}