Leonardo Scales Image Captioning with the NVIDIA GH200

Leonardo and Lambda

Lambda provides cloud infrastructure for accelerating the deep learning workflows behind cutting-edge generative technology. Leonardo.Ai is a startup building and serving advanced text-to-image services; it uses Lambda’s compute resources for its production systems and research programs because of their flexibility and reliability.


  • We captioned thirty million images from our internal dataset using CogVLM-17B on the NVIDIA GH200 architecture.
  • The GH200 alleviates a communication bottleneck commonly seen in VRAM-limited A100 40GB nodes and allows us to reach a batch size that saturates GPU utilisation.
  • Porting our existing captioning pipeline to the GH200 cluster configured by Lambda took only a day and improved throughput by more than 3x.
  • Throughput gains allow finer-grained synthetic captions, which are used to train more performant text-to-image models, ultimately producing better results at inference time.

The Benchmarked Task

Our pipeline entails loading a batch of images from disk into memory, processing it with CogVLM, saving the associated image-text pairs, and uploading them back to our datastore. CogVLM’s forward pass begins by running the EVA2-CLIP-E (https://arxiv.org/abs/2303.15389) image encoder and then autoregressively producing tokens from CogVLM’s language model until the end-of-sequence token is reached.
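The pipeline above can be sketched as a simple batched loop. This is an illustrative sketch, not Leonardo’s actual code: `caption_fn` stands in for the CogVLM forward pass and autoregressive decoding, and `save_fn` stands in for persisting pairs to the datastore.

```python
# Minimal sketch of a batched captioning loop. All names here are
# hypothetical stand-ins for the real pipeline components.

def caption_batches(image_paths, batch_size, caption_fn, save_fn):
    """Load images in batches, caption each batch, and save image-text pairs."""
    for start in range(0, len(image_paths), batch_size):
        batch = image_paths[start:start + batch_size]
        # Stand-in for: image encoder + autoregressive token generation
        captions = caption_fn(batch)
        # Stand-in for: write pairs to disk, then upload to the datastore
        save_fn(list(zip(batch, captions)))

# Usage with stub functions in place of the model and storage layers:
captured = []
caption_batches(
    ["a.jpg", "b.jpg", "c.jpg"],
    batch_size=2,
    caption_fn=lambda imgs: [f"caption for {p}" for p in imgs],
    save_fn=captured.extend,
)
```

Keeping the model and storage behind plain callables like this is what made it straightforward to swap hardware underneath the pipeline.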


Porting Over

We were initially concerned that moving our pipeline over to an Arm CPU architecture would pose challenges, but ultimately found that using an NVIDIA NGC Docker image met all of our requirements. We were able to quickly begin captioning after only updating some paths and environment variables and optimising our batch size, which was a pleasant surprise!


Background: the NVIDIA GH200 chips feature 96GB of GPU memory, seven times the CPU-to-GPU bandwidth of typical accelerators, a powerful 72-core CPU, and 480GB of system memory (https://lambdalabs.com/nvidia-gh200).

In our case: we normally caption image data using pairs of A100 40GB GPUs within a cluster. This allows us to use a reasonable batch size of 16 images per GPU while sharing the model weights across the two GPUs, though splitting processing across two GPUs incurs communication overhead. The ten NVIDIA GH200 nodes that we used for captioning with Lambda allowed us to reach a doubled batch size of 32 images per GPU and a higher throughput without having to share weights between GPUs.
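Some back-of-envelope arithmetic suggests why the GH200 removes the need for weight sharing. Assuming fp16 weights (an assumption on our part; the exact precision and memory breakdown are not stated above), CogVLM-17B’s weights alone occupy about 34GB, leaving very little headroom on a 40GB A100 for activations and the KV cache at batch size 16:

```python
# Back-of-envelope memory arithmetic. fp16 weights are assumed; the
# headroom figures are illustrative, not measured values.
PARAMS = 17e9        # CogVLM-17B parameter count
BYTES_FP16 = 2
weights_gb = PARAMS * BYTES_FP16 / 1e9   # 34.0 GB of weights alone

a100_gb, gh200_gb = 40, 96
a100_headroom = a100_gb - weights_gb     # ~6 GB left for activations/KV cache
gh200_headroom = gh200_gb - weights_gb   # ~62 GB left on a single GH200
```

Under these assumptions, ~6GB of headroom is one plausible reason the weights had to be shared across a pair of A100s, while a single GH200’s ~62GB of headroom accommodates batch size 32 with no sharding and no cross-GPU traffic.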


CogVLM (https://arxiv.org/abs/2311.03079) is rarely benchmarked against models from the U.S. or Europe in papers; however, it was used to generate Stable Diffusion 3’s synthetic captions (along with those of many other research projects).

From our qualitative evaluations assessing the extent of hallucinations by running head-to-head comparisons with other captioning models (e.g. LLaVA, https://arxiv.org/abs/2304.08485) on our image data, it appears to be the open-source state of the art among vision-language models for now.

The 17-billion-parameter, instruction-finetuned version of CogVLM is relatively large, so running it on the latest hardware is ideal.


We captioned around 30 million images over one month on the Lambda GH200 cluster that we piloted. We found that we could caption at 0.4375 seconds per image per GPU on the cluster; in terms of tokens, we ran at 245 tokens per second per GPU. This was over 3 times faster per GPU compared to an A100 40GB cluster.
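The reported figures are internally consistent, which a quick sanity check shows (assuming one GH200 GPU per node and near-continuous operation; both are our assumptions):

```python
# Sanity-checking the reported throughput numbers.
sec_per_image = 0.4375                    # per image, per GPU
tokens_per_sec = 245                      # per GPU

images_per_sec_per_gpu = 1 / sec_per_image        # ~2.29 images/s per GPU
tokens_per_image = tokens_per_sec * sec_per_image # ~107 tokens per caption

# 30M images spread across 10 GH200 GPUs (one per node, assumed):
total_gpu_seconds = 30e6 * sec_per_image
gpu_days = total_gpu_seconds / 10 / 86400         # ~15 days of wall-clock time
```

About 107 tokens per caption is consistent with the fine-grained, paragraph-length captions described above, and roughly 15 days of continuous captioning fits comfortably within the one-month pilot.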

What’s Next

The fine-grained synthetic captions that we’ve generated enable us to train more performant text-to-image models, improving their ability to adhere to prompts and produce the results users are after. Lambda’s clusters will allow us both to train future text-to-image models and to serve them to our customers.