The next generation of encoder-decoder models

T5Gemma 2 is the next evolution of our encoder-decoder family. Built on Gemma 3, it delivers our first multimodal and long-context encoder-decoder models.

Unlike T5Gemma, T5Gemma 2 ties the word embeddings (sharing them across the encoder and decoder) and fuses decoder self- and cross-attention to save model parameters. It offers compact pretrained models in sizes of 270M-270M (~370M total, excluding the vision encoder), 1B-1B (~1.7B), and 4B-4B (~7B) parameters, making them ideal for rapid experimentation and deployment in on-device applications.
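
As a rough sanity check on the ~370M figure: if each 270M tower follows Gemma 3 270M’s published split of roughly 170M embedding parameters and 100M transformer parameters (an assumption on our part for T5Gemma 2), tying the embedding table so it is stored once rather than twice gives the stated total:

```python
# Back-of-the-envelope check for the ~370M figure (assumes each tower follows
# Gemma 3 270M's published ~170M embedding / ~100M transformer split).
embedding = 170e6      # shared vocab x d_model table, stored once when tied
transformer = 100e6    # per-tower attention + feed-forward blocks
total = 2 * transformer + embedding   # encoder blocks + decoder blocks + one table
print(f"~{total / 1e6:.0f}M")         # -> ~370M
```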

Background

With the original T5Gemma, we demonstrated that we could successfully adapt modern, pre-trained decoder-only models to an encoder-decoder architecture, unlocking new versatility. By initializing with weights from a powerful decoder-only model and then applying continued pretraining, we created high-quality, inference-efficient models while bypassing the computational cost of training from scratch.

T5Gemma 2 extends this recipe to vision-language models by incorporating key innovations from Gemma 3.

What’s new

T5Gemma 2 is more than a simple refresh. It incorporates significant architectural changes while inheriting many of the powerful, next-generation features of the Gemma 3 family.

Architectural innovations for efficiency

To maximize efficiency at smaller scales, we have introduced key structural improvements:

  • Tied embeddings: We now tie the embeddings between the encoder and decoder, so the table is stored once. This significantly reduces the overall parameter count, allowing us to pack more active parameters into the same memory footprint – essential for our new compact 270M-270M model.
  • Fused attention: In the decoder, we use a fused attention mechanism that combines self- and cross-attention into a single, unified attention layer. This reduces model parameters and architectural complexity, improves model parallelization, and speeds up inference. Both ideas are sketched in the code below.
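
A minimal PyTorch sketch of the pattern behind these two changes. The sizes and module name are hypothetical illustrations, not T5Gemma 2’s actual implementation, and a real decoder would also apply a joint mask that stays causal over the decoder positions:

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL = 32_000, 512  # hypothetical sizes for illustration

# Tied embeddings: one table serves both the encoder and the decoder.
shared_embed = nn.Embedding(VOCAB, D_MODEL)

class FusedDecoderAttention(nn.Module):
    """One attention layer playing both roles: decoder queries attend over
    the concatenation of encoder outputs and decoder states."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, dec_x, enc_out, joint_mask=None):
        kv = torch.cat([enc_out, dec_x], dim=1)  # cross + self keys/values in one pass
        out, _ = self.attn(dec_x, kv, kv, attn_mask=joint_mask)
        return out

# Encoder and decoder token states built from the single shared table.
enc_out = shared_embed(torch.randint(0, VOCAB, (1, 16)))
dec_x = shared_embed(torch.randint(0, VOCAB, (1, 4)))
print(FusedDecoderAttention(D_MODEL)(dec_x, enc_out).shape)  # torch.Size([1, 4, 512])
```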

Next generation capabilities

Drawing from Gemma 3, T5Gemma 2 also represents a significant upgrade in model capabilities:

  • Multimodality: T5Gemma 2 models can understand and process images along with text. By using a highly efficient vision encoder, the models can seamlessly perform visual question answering and multimodal reasoning tasks (see the usage sketch after this list).
  • Extended long context: We’ve dramatically expanded the context window. By leveraging Gemma 3’s interleaved local and global attention mechanism (illustrated below), T5Gemma 2 can handle context windows of up to 128K tokens.
  • Massively multilingual: Trained on a larger, more diverse dataset, these models now support over 140 languages out of the box.
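
To make the multimodal workflow concrete, here is a hedged sketch using Hugging Face Transformers. The checkpoint id, auto classes, and prompt format are assumptions for illustration; consult the official model card for the real identifiers:

```python
import requests
from PIL import Image
from transformers import AutoModelForSeq2SeqLM, AutoProcessor

MODEL_ID = "google/t5gemma-2-270m-270m"  # hypothetical id; check the model card

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Visual question answering: an image plus a text question in, a text answer out.
url = "https://example.com/street_scene.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="How many bicycles are in this picture?",
                   return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```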
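And to illustrate why interleaving local and global attention makes 128K-token contexts tractable, here is a toy mask construction in NumPy. The 1024-token window and five-local-per-global ratio are Gemma 3’s reported values, which we assume carry over to T5Gemma 2; the toy uses smaller numbers and shows the causal decoder side only:

```python
import numpy as np

def sliding_window_causal(seq_len: int, window: int) -> np.ndarray:
    """Local layer: position i attends only to positions in (i - window, i]."""
    i, j = np.indices((seq_len, seq_len))
    return (j <= i) & (j > i - window)

def full_causal(seq_len: int) -> np.ndarray:
    """Global layer: position i attends to every earlier position."""
    i, j = np.indices((seq_len, seq_len))
    return j <= i

# Toy sizes; Gemma 3 reportedly uses a 1024-token window with five local
# layers per global layer.
SEQ, WINDOW, LOCAL_PER_GLOBAL = 64, 8, 5
layer_masks = [
    full_causal(SEQ) if (layer + 1) % (LOCAL_PER_GLOBAL + 1) == 0
    else sliding_window_causal(SEQ, WINDOW)
    for layer in range(12)
]
# Only every sixth layer pays the full O(seq_len^2) attention cost; the rest
# scale as O(seq_len * window), which keeps very long contexts affordable.
```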

Performance

T5Gemma 2 sets a new standard for what compact encoder-decoder models can achieve. Our new models demonstrate strong performance across key functional areas and inherit the powerful multimodal and long-context capabilities of the Gemma 3 architecture.
