Enhancing Image Captioning with a Multi-Encoder Ensemble Framework
Abstract
Image captioning has traditionally relied on encoder-decoder architectures such as CNN-LSTM and, more recently, Transformer-based models. Although these methods have shown promise, single architectures frequently fall short, producing captions that are either biased toward dominant patterns or fluent yet semantically shallow. To overcome these limitations, we propose an ensemble framework that pairs Transformer decoders with the complementary strengths of several CNN encoders: ResNet-101, InceptionV3, and EfficientNetB3. During inference, the system generates multiple candidate captions for each image and refines them through hard voting (n-gram frequency) and soft voting (token probability averaging). This design exploits diverse visual representations while retaining the strong contextual modeling of Transformers. Evaluations on the Flickr8k and Flickr30k datasets show that our ensemble consistently outperforms the individual models on the BLEU, METEOR, ROUGE, and CIDEr metrics, producing captions that are not only more accurate but also more coherent and descriptive.
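As a concrete illustration of the two inference-time fusion schemes named above, the minimal sketch below implements soft voting (averaging per-token probabilities across encoder branches at each decoding step) and hard voting (scoring candidate captions by how often their n-grams recur across all candidates). The `CaptionModel` stub, special-token IDs, and greedy decoding loop are hypothetical stand-ins for illustration, not the paper's released implementation.

```python
# Hedged sketch of the abstract's two fusion strategies. "CaptionModel" is a
# stand-in for one CNN-encoder + Transformer-decoder branch; real branches
# would use ResNet-101, InceptionV3, and EfficientNetB3 encoders.
from collections import Counter
from typing import List

import torch
import torch.nn.functional as F

BOS, EOS, VOCAB = 1, 2, 1000  # assumed special tokens and vocabulary size


class CaptionModel(torch.nn.Module):
    """Toy stand-in: maps image features plus a token prefix to
    next-token logits. A real branch would attend over CNN features
    with a Transformer decoder."""

    def __init__(self, vocab_size: int = VOCAB):
        super().__init__()
        self.proj = torch.nn.Linear(64, vocab_size)

    def forward(self, image: torch.Tensor, prefix: torch.Tensor) -> torch.Tensor:
        # Placeholder scoring; pool spatial dims if a feature map is given.
        feats = image.mean(dim=(-1, -2)) if image.dim() > 2 else image
        return self.proj(feats)


@torch.no_grad()
def soft_vote_decode(models, image: torch.Tensor, max_len: int = 20) -> List[int]:
    """Soft voting: average per-token probability distributions across
    branches at every decoding step, then take the argmax (greedy) token."""
    tokens = [BOS]
    for _ in range(max_len):
        prefix = torch.tensor([tokens])
        probs = torch.stack(
            [F.softmax(m(image, prefix), dim=-1) for m in models]
        ).mean(dim=0)                      # (1, vocab): averaged distribution
        nxt = int(probs.argmax(dim=-1).item())
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens


def ngrams(seq: List[int], n: int) -> List[tuple]:
    return [tuple(seq[i : i + n]) for i in range(len(seq) - n + 1)]


def hard_vote(candidates: List[List[int]], n: int = 2) -> List[int]:
    """Hard voting: score each candidate caption by the corpus-wide
    frequency of its n-grams and keep the highest-scoring candidate."""
    counts = Counter(g for cand in candidates for g in ngrams(cand, n))
    return max(candidates, key=lambda c: sum(counts[g] for g in ngrams(c, n)))
```

In use, one would instantiate three branches (e.g., `models = [CaptionModel() for _ in range(3)]`), decode each image with `soft_vote_decode`, or sample several candidates per branch and keep the `hard_vote` winner. Averaging distributions before the argmax is what lets a weaker branch veto an implausible token that any single model might commit to.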
Keywords
Publication Details
- Type of Publication:
- Conference Name: 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)
- Date of Conference: 25/09/2025
- Venue: Dhaka International University, Bangladesh
- Organizer: Dhaka International University, Bangladesh Computer Society