Engineered an image captioning system that combines a Vision Transformer (ViT) with GPT-2 to generate descriptive, contextually relevant captions for images. The project integrates two transformer models, one for vision and one for language, to bridge visual data and natural language and produce image descriptions automatically.
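As a rough illustration of how a ViT and GPT-2 can be wired together, here is a minimal sketch using the Hugging Face `transformers` library. It assumes the common arrangement in which ViT acts as the image encoder and GPT-2 as the text decoder attending to the image features through cross-attention; the project's actual wiring, checkpoints, and hyperparameters are not stated here. The tiny layer sizes, the 100-token vocabulary, and the random inputs are placeholders chosen so the example runs without downloading weights.

```python
import torch
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Assumption: toy dimensions for illustration only, not the project's settings.
encoder_config = ViTConfig(
    image_size=32,
    patch_size=8,
    num_channels=3,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
decoder_config = GPT2Config(
    vocab_size=100,
    n_positions=32,
    n_embd=32,
    n_layer=1,
    n_head=2,
    bos_token_id=0,
    eos_token_id=1,
)

# Marks GPT-2 as a decoder and enables cross-attention over the ViT outputs.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)  # randomly initialised
model.eval()

# Placeholder batch: two random "images" and two 5-token caption prefixes.
pixel_values = torch.randn(2, 3, 32, 32)
decoder_input_ids = torch.randint(0, 100, (2, 5))

with torch.no_grad():
    # ViT turns each image into a sequence of patch embeddings (plus a [CLS] token).
    encoder_states = model.encoder(pixel_values=pixel_values).last_hidden_state
    # GPT-2 predicts next-token logits for the caption, conditioned on the image.
    logits = model(
        pixel_values=pixel_values, decoder_input_ids=decoder_input_ids
    ).logits

print(encoder_states.shape)  # (batch, patches + 1, hidden)
print(logits.shape)          # (batch, caption length, vocab size)
```

In a trained setup, the same model class would typically be loaded from pretrained ViT and GPT-2 checkpoints, fine-tuned on image-caption pairs, and used with `model.generate(...)` plus a tokenizer to turn predicted token ids into caption text.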