Differentiable Prompt Learning for Vision Language Models

1. Rensselaer Polytechnic Institute
2. IBM Research

Abstract

Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundation models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only into the input but also into the intermediate hidden representations. Manually designed deep continuous prompts yield a remarkable improvement over the zero-shot pre-trained model on downstream tasks. Automating the design of continuous prompts remains underexplored, and a fundamental question arises: is the manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). DPL is formulated as an optimization problem that automatically determines the optimal context length of the prompt to be added to each layer, with the objective of maximizing performance. We test DPL on pre-trained CLIP. We empirically find that, using only limited data, DPL can find a deep continuous prompt configuration with high confidence. Performance on downstream tasks demonstrates the superiority of the automatic design: our method boosts the average test accuracy by 2.60% over baseline methods on 11 datasets. Moreover, our method focuses only on the prompt configuration (i.e., the context length for each layer), so it is compatible with baseline methods that adopt sophisticated designs to boost performance.

Pipeline

Differentiable prompt learning (DPL) proceeds in two stages: a Search Stage and a Train Stage. The Search Stage determines the context length for each transformer block; the Train Stage optimizes the soft prompts.

Search Stage

The DPL method automatically determines heterogeneous prompts (see Figures a and b): context lengths can differ across depths and across the text and image branches. During the search, the context lengths rapidly approach their converged values (see Figure c).
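To make the search stage concrete, below is a minimal PyTorch sketch of one way to relax the per-layer context length into a differentiable choice. It is an illustrative assumption, not the paper's exact formulation: each transformer block owns a max-length prompt bank with learnable per-token gates, and the number of gates surviving a threshold after search becomes that block's context length.

import torch
import torch.nn as nn

class SearchablePrompt(nn.Module):
    """Soft prompt bank for one transformer block with per-token gates.

    Illustrative sketch only; the paper's exact relaxation may differ.
    """

    def __init__(self, max_len: int = 8, dim: int = 512):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(max_len, dim) * 0.02)  # soft prompts
        self.gate_logits = nn.Parameter(torch.zeros(max_len))         # length parameters

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim). Soft gates scale each prompt token, so
        # "how many tokens this block uses" stays differentiable.
        gates = torch.sigmoid(self.gate_logits).unsqueeze(-1)    # (max_len, 1)
        prompts = (gates * self.tokens).unsqueeze(0)             # (1, max_len, dim)
        prompts = prompts.expand(hidden.size(0), -1, -1)
        return torch.cat([prompts, hidden], dim=1)

    def discretize(self, threshold: float = 0.5) -> int:
        # After the search stage, the gates that survive the threshold
        # determine this block's context length.
        return int((torch.sigmoid(self.gate_logits) > threshold).sum())

In a bilevel search of this kind, the prompt tokens would typically be updated on the training split and the gate logits on a held-out split; calling discretize afterwards fixes each block's context length before the Train Stage.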

Train Stage

With the searched configuration fixed, the DPL method improves the alignment between text and image features by optimizing the soft prompts.
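As a rough illustration of the optimization target, the sketch below shows a CLIP-style alignment loss over prompted features, assuming cosine-similarity logits with a fixed temperature. The function name and the temperature value are assumptions for the example, and the prompted encoders that produce the features are omitted.

import torch
import torch.nn.functional as F

def alignment_loss(image_feats: torch.Tensor,   # (batch, dim)
                   text_feats: torch.Tensor,    # (num_classes, dim)
                   labels: torch.Tensor,        # (batch,)
                   temperature: float = 0.01) -> torch.Tensor:
    # Normalize so the dot products below are cosine similarities, as in CLIP.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    # Cross-entropy pulls each image toward its class's prompted text feature.
    return F.cross_entropy(logits, labels)

Gradients flow through both encoders back into the soft prompt vectors, which are the only parameters updated in this stage.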

BibTeX

@inproceedings{huang2025dpl,
    title     = {Differentiable Prompt Learning for Vision Language Models},
    author    = {Huang, Zhenhan and Pedapati, Tejaswini and Chen, Pin-Yu and Gao, Jianxi},
    booktitle = {34th International Joint Conference on Artificial Intelligence},
    year      = {2025}
}