🚀 Aligning Training with Human Judgment: A Dual-Objective Optimization Framework for Image Captioning
In the ever-evolving space of Image Captioning (IC), a persistent challenge has been the loss-evaluation mismatch: models trained to minimize conventional losses such as cross-entropy often produce captions that fail to resonate with human evaluators. My recent paper, published in Springer's SN Computer Science, addresses this gap with a Dual-Objective Optimization (DOO) Framework that directly aligns training with human-centric evaluation.
Traditional image captioning models focus heavily on minimizing prediction error, usually via cross-entropy loss. In doing so, they miss what really matters to humans: captions that are linguistically rich, contextually accurate, and meaningful.
This misalignment often results in captions that are technically correct but lack depth, emotional resonance, or visual nuance.
The DOO framework simultaneously minimizes training loss and maximizes the BLEU score, an evaluation metric that correlates with human judgments of caption quality, during model training.
The optimization objective is:

L_DOO = λ · L_CE − (1 − λ) · R_BLEU

where:

L_CE is the cross-entropy loss between predicted and reference captions,
R_BLEU is the BLEU-based reward for the generated caption,
λ ∈ [0, 1] weights the trade-off between the two objectives.

The framework realizes this objective through four techniques:
Composite Loss Function: Combines traditional cross-entropy loss with a BLEU-based reward.
Gradient Approximation: Uses Gumbel-Softmax to make the BLEU score differentiable for gradient descent.
Reinforcement Learning: Employs policy gradient to optimize for long-term rewards tied to caption quality.
Multi-Objective Optimization (NSGA-II): Balances conflicting objectives for better linguistic and contextual results.
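The paper defines the exact loss; as a rough illustration only, here is a minimal pure-Python sketch of a composite objective of this shape, with a simplified unigram-precision stand-in for the BLEU reward and a hypothetical weight `lam`. (In real training, the reward term requires Gumbel-Softmax or a policy-gradient estimator, since BLEU itself is non-differentiable.)

```python
import math

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target token at each step."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def bleu1_reward(candidate, reference):
    """Simplified unigram-precision stand-in for the BLEU-based reward."""
    if not candidate:
        return 0.0
    ref_counts = {}
    for tok in reference:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    matched = 0
    for tok in candidate:
        if ref_counts.get(tok, 0) > 0:   # clip matches to reference counts
            matched += 1
            ref_counts[tok] -= 1
    return matched / len(candidate)

def doo_loss(probs, target_ids, candidate, reference, lam=0.7):
    """Composite objective: minimize cross-entropy while maximizing the reward."""
    return lam * cross_entropy(probs, target_ids) - (1 - lam) * bleu1_reward(candidate, reference)

# Toy example: two decoding steps over a 2-token vocabulary.
probs = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}]
loss = doo_loss(probs, [0, 1], ["a", "boy"], ["a", "young", "boy"])
```

A caption whose tokens all appear in the reference earns the maximum reward, pulling the composite loss below the plain cross-entropy value; a fluent but unfaithful caption earns no such credit.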
Base Caption: "A child on a swing."
DOO-Generated Caption: "A young boy is swinging on a swing at the playground."
The DOO-generated caption adds nuance and context, capturing both the scene and a more human-like expression.
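To make the comparison concrete, here is a toy sentence-level BLEU-2 check against a hypothetical human reference (brevity penalty and smoothing omitted for clarity; this is a sketch, not the paper's evaluation protocol):

```python
def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        counts[g] = counts.get(g, 0) + 1
    return counts

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference (clipped counts)."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum(min(c, ref.get(g, 0)) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def bleu2(candidate, reference):
    """Geometric mean of unigram and bigram precision."""
    p1 = modified_precision(candidate, reference, 1)
    p2 = modified_precision(candidate, reference, 2)
    return (p1 * p2) ** 0.5

reference = "a young boy swinging on a playground swing".split()  # hypothetical human caption
base = "a child on a swing".split()
doo = "a young boy is swinging on a swing at the playground".split()

print(f"base: {bleu2(base, reference):.3f}  doo: {bleu2(doo, reference):.3f}")
```

Even though both captions are plausible at the unigram level, the DOO caption shares far more bigrams with the human reference, which is exactly the kind of signal a BLEU-based reward feeds back into training.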
Feel free to read the complete paper here: https://link.springer.com/article/10.1007/s42979-025-04111-0