🚀 Aligning Training with Human Judgment: A Dual-Objective Optimization Framework for Image Captioning
In the ever-evolving space of Image Captioning (IC), a persistent challenge has been the loss-evaluation mismatch: models trained to minimize conventional losses such as cross-entropy often produce captions that fail to resonate with human evaluators. My recent paper, published in Springer's SN Computer Science, addresses this gap with a Dual-Objective Optimization (DOO) Framework that directly aligns training with human-centric evaluation.
Traditional image captioning models focus heavily on minimizing prediction error, usually via cross-entropy loss. In doing so, they miss what really matters to humans: captions that are linguistically rich, contextually accurate, and meaningful.
This misalignment often results in captions that are technically correct but lack depth, emotional resonance, or visual nuance.
The DOO framework simultaneously minimizes training loss and maximizes the BLEU score, an evaluation metric that correlates with human judgments of caption quality, during model training.
The optimization objective is:

L_DOO = λ · L_CE − (1 − λ) · R_BLEU

where:

L_CE is the cross-entropy loss between predicted and reference captions,
R_BLEU is the BLEU-based reward for the generated caption,
λ ∈ [0, 1] weights the trade-off between the two objectives.

The framework realizes this objective through four techniques:
Composite Loss Function: Combines traditional cross-entropy loss with a BLEU-based reward.
Gradient Approximation: Uses Gumbel-Softmax to make the BLEU score differentiable for gradient descent.
Reinforcement Learning: Employs policy gradient to optimize for long-term rewards tied to caption quality.
Multi-Objective Optimization (NSGA-II): Balances conflicting objectives for better linguistic and contextual results.
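The paper defines the exact loss; as a rough illustration only, here is a minimal pure-Python sketch of a composite objective of this shape, with a simplified unigram-precision stand-in for the BLEU reward and a hypothetical weight `lam`. (In real training, the reward term requires Gumbel-Softmax or a policy-gradient estimator, since BLEU itself is non-differentiable.)

```python
import math

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target token at each step."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def bleu1_reward(candidate, reference):
    """Simplified unigram-precision stand-in for the BLEU-based reward."""
    if not candidate:
        return 0.0
    ref_counts = {}
    for tok in reference:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    matched = 0
    for tok in candidate:
        if ref_counts.get(tok, 0) > 0:   # clip matches to reference counts
            matched += 1
            ref_counts[tok] -= 1
    return matched / len(candidate)

def doo_loss(probs, target_ids, candidate, reference, lam=0.7):
    """Composite objective: minimize cross-entropy while maximizing the reward."""
    return lam * cross_entropy(probs, target_ids) - (1 - lam) * bleu1_reward(candidate, reference)

# Toy example: two decoding steps over a 2-token vocabulary.
probs = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}]
loss = doo_loss(probs, [0, 1], ["a", "boy"], ["a", "young", "boy"])
```

A caption whose tokens all appear in the reference earns the maximum reward, pulling the composite loss below the plain cross-entropy value; a fluent but unfaithful caption earns no such credit.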
Base Caption: "A child on a swing."
DOO-Generated Caption: "A young boy is swinging on a swing at the playground."
The DOO-generated caption adds nuance and context, capturing both the scene and a more human-like expression.
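To make the comparison concrete, here is a toy sentence-level BLEU-2 check against a hypothetical human reference (brevity penalty and smoothing omitted for clarity; this is a sketch, not the paper's evaluation protocol):

```python
def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        counts[g] = counts.get(g, 0) + 1
    return counts

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference (clipped counts)."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum(min(c, ref.get(g, 0)) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def bleu2(candidate, reference):
    """Geometric mean of unigram and bigram precision."""
    p1 = modified_precision(candidate, reference, 1)
    p2 = modified_precision(candidate, reference, 2)
    return (p1 * p2) ** 0.5

reference = "a young boy swinging on a playground swing".split()  # hypothetical human caption
base = "a child on a swing".split()
doo = "a young boy is swinging on a swing at the playground".split()

print(f"base: {bleu2(base, reference):.3f}  doo: {bleu2(doo, reference):.3f}")
```

Even though both captions are plausible at the unigram level, the DOO caption shares far more bigrams with the human reference, which is exactly the kind of signal a BLEU-based reward feeds back into training.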
Feel free to read the complete paper here: https://link.springer.com/article/10.1007/s42979-025-04111-0