Scaling Image Captioning with the NVIDIA GH200 Grace Hopper™ Superchip

Leonardo and Lambda

Lambda provides cloud infrastructure for accelerating the deep learning workflows behind cutting-edge generative technology. Leonardo AI is a startup building and serving advanced text-to-image services, and it uses Lambda’s compute for both its production systems and its research programs because of that infrastructure’s flexibility and reliability.

TL;DR 

  • We captioned thirty million images from our internal dataset using CogVLM-17B on the NVIDIA GH200 architecture.
  • The GH200 alleviates a communication bottleneck commonly seen in VRAM-limited A100 40 GB nodes and allows us to reach a batch size that saturates GPU utilisation.
  • Porting our existing captioning pipeline to the GH200 cluster configured by Lambda took only a day and improved throughput by more than 3x.
  • The throughput gains let us generate finer-grained synthetic captions, which we use to train more performant text-to-image models that ultimately produce better results at inference time.

The Benchmarked Task

Our pipeline loads a batch of images from disk into memory, captions it with CogVLM, saves the resulting image-text pairs, and uploads them back to our data store. CogVLM’s forward pass first runs the EVA-02-CLIP-E (https://arxiv.org/abs/2303.15389) image encoder and then autoregressively generates tokens with CogVLM’s language model until the end-of-sequence token is reached.
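
For concreteness, here is a rough sketch of what the caption step looks like when driving CogVLM-17B through Hugging Face Transformers. This is a simplified, single-image illustration rather than our production code (which batches 32 images per GPU and streams results back to our data store); the prompt and generation settings are assumptions for the example.

    # Simplified sketch of the caption step, not our production pipeline.
    # Follows the public CogVLM-17B (chat) usage pattern on Hugging Face.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/cogvlm-chat-hf",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).to("cuda").eval()

    def caption(image_path: str, prompt: str = "Describe this image in detail.") -> str:
        image = Image.open(image_path).convert("RGB")
        # CogVLM's remote code provides this helper to pack the prompt tokens
        # together with the EVA-02-CLIP-E image features.
        inputs = model.build_conversation_input_ids(
            tokenizer, query=prompt, history=[], images=[image]
        )
        inputs = {
            "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
            "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
            "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
            "images": [[inputs["images"][0].to("cuda", dtype=torch.bfloat16)]],
        }
        with torch.no_grad():
            # Generation runs autoregressively until the end-of-sequence token.
            outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]
        return tokenizer.decode(outputs[0], skip_special_tokens=True)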

Porting Over

We were initially concerned that moving our pipeline over to the Arm-based NVIDIA Grace CPU architecture would pose challenges, but ultimately found that using an NVIDIA NGC Docker image met all of our requirements. We were able to begin captioning quickly after only updating some paths and environment variables and optimising our batch size, which was a pleasant surprise!

NVIDIA GH200 Grace Hopper Superchip

Background: each NVIDIA GH200 superchip features 96GB of HBM3 GPU memory, an NVLink-C2C interconnect between the GPU and CPU with roughly 7 times the bandwidth of the PCIe Gen5 links found in typical accelerated systems, a powerful 72-core Arm Neoverse V2 CPU, and 480GB of LPDDR5X system memory (https://lambdalabs.com/nvidia-gh200).

In our case: we normally caption image data using pairs of NVIDIA A100 40 GB Tensor Core GPUs within a cluster. This allows a reasonable batch size of 16 images per GPU while splitting the model weights across the two GPUs, though splitting processing across two GPUs adds communication overhead. The ten NVIDIA GH200 nodes we used for captioning with Lambda let us double the batch size to 32 images per GPU and reach higher throughput, without having to split weights between GPUs.
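
A back-of-the-envelope estimate (assumed values, not measurements) shows why the weights have to be split across two A100 40 GB GPUs but fit comfortably on a single GH200:

    # Rough memory estimate, assuming bf16 weights; not measured numbers.
    params = 17e9                    # CogVLM-17B parameter count
    bytes_per_param = 2              # bf16/fp16
    weights_gb = params * bytes_per_param / 1e9   # ~34 GB for the weights alone

    a100_gb, gh200_gb = 40, 96
    print(f"weights alone: ~{weights_gb:.0f} GB")
    print(f"headroom on one A100 40 GB: ~{a100_gb - weights_gb:.0f} GB")   # too tight for batch 16
    print(f"headroom on one GH200 96 GB: ~{gh200_gb - weights_gb:.0f} GB") # room for batch 32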

CogVLM-17B

CogVLM (https://arxiv.org/abs/2311.03079) is often left out of benchmark comparisons in papers from U.S. and European groups; however, it was used to generate Stable Diffusion 3’s synthetic captions (along with being used in many other research projects).

In our qualitative evaluations, which compared the extent of hallucinations in head-to-head runs against other captioning models (e.g. LLaVA, https://arxiv.org/abs/2304.08485) on our image data, CogVLM appears to be the open-source state of the art among vision-language models for now.

The 17-billion-parameter, instruction-finetuned version of CogVLM is relatively large, so running it on the latest hardware is ideal.

Throughput

We captioned around 30 million images over one month on the Lambda GH200 cluster that we piloted. On that cluster we averaged 0.4375 seconds per image per NVIDIA GH200 Grace Hopper Superchip; in terms of tokens, we ran at 245 tokens per second per GPU. This was over 3 times faster per GPU than our NVIDIA A100 40 GB Tensor Core GPU cluster.
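
To put those figures in context, two quantities follow directly from the reported numbers (a derivation, not an additional measurement):

    # Derived from the throughput figures above; no new measurements.
    secs_per_image = 0.4375    # per GH200 GPU
    tokens_per_sec = 245       # per GH200 GPU

    images_per_hour_per_gpu = 3600 / secs_per_image          # ~8,229 images/hour/GPU
    avg_tokens_per_caption = tokens_per_sec * secs_per_image  # ~107 tokens/caption

    print(f"~{images_per_hour_per_gpu:,.0f} images per hour per GPU")
    print(f"~{avg_tokens_per_caption:.0f} tokens per caption on average")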

What’s Next

The fine-grained synthetic captions we’ve generated let us train more performant text-to-image models, improving their ability to adhere to prompts and produce the results users are after. Lambda’s clusters will let us both train future text-to-image models and serve them to our customers.