FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

Abstract

Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance, mainly due to the scarcity of high-quality data, insufficient utilization of facial dynamics, and the ambiguity of expression semantics. To this end, we propose a novel framework, Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), which incorporates the following designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) FineCLIPER mines useful cues from DFE videos hierarchically. Specifically, besides directly embedding video frames as input (low semantic level), we extract face segmentation masks and landmarks from each frame (middle semantic level) and use a Multi-modal Large Language Model (MLLM) to generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we adopt Parameter-Efficient Fine-Tuning (PEFT) to efficiently adapt large pre-trained models (i.e., CLIP) to this task. FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Analysis and ablation studies further validate its effectiveness.

Method

Figure 2: Overview of FineCLIPER. The framework can be divided into three main components: Label Encoder, Multi-Modal Encoders, and Similarity Calculation. The Label Encoder augments labels using PN descriptors, followed by PN adaptors within the text encoder; the Multi-Modal Encoders handle hierarchical information mined from low to high semantic levels of the human face; the Similarity Calculation module integrates the resulting representations and computes their similarities via contrastive learning.
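The positive-negative (PN) label supervision described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`pn_logits`, `cross_entropy`), the temperature value, and the choice to subtract the negative-description similarity from the positive one are all assumptions made for illustration, and the embeddings stand in for the outputs of the CLIP encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pn_logits(video_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Score each video against every class using positive and negative descriptions.

    video_emb:    (B, D) video embeddings
    pos_text_emb: (C, D) embeddings of the positive description of each class
    neg_text_emb: (C, D) embeddings of the negative description of each class
    Assumption: a class scores higher when the video is close to its positive
    description and far from its negative one (difference of cosine similarities).
    """
    v = l2_normalize(video_emb)
    p = l2_normalize(pos_text_emb)
    n = l2_normalize(neg_text_emb)
    pos_sim = v @ p.T  # (B, C) cosine similarity to positive descriptions
    neg_sim = v @ n.T  # (B, C) cosine similarity to negative descriptions
    return (pos_sim - neg_sim) / temperature

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer class labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: 3 classes in a 4-dimensional embedding space.
pos = np.eye(3, 4)            # hypothetical positive-description embeddings
neg = -pos                    # hypothetical negative-description embeddings
videos = pos[[2, 0]]          # two videos that match classes 2 and 0
labels = np.array([2, 0])

logits = pn_logits(videos, pos, neg)
loss = cross_entropy(logits, labels)
predictions = logits.argmax(axis=1)
```

In this toy setup each video coincides with the positive description of its own class, so the predicted classes match the labels and the loss is close to zero. With real encoders the same scoring would be applied to learned embeddings.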

Main Results

Table 1: Comparison of our FineCLIPER with state-of-the-art supervised DFER methods on DFEW, FERV39k, and MAFW. ^*: FineCLIPER with face parsing and landmark modalities; ^†: FineCLIPER with fine-grained text modality. The best results are highlighted in bold, and the second-best are underlined.

Table 2: Comparative analyses of accuracy across various emotion categories: FineCLIPER vs. other approaches on DFEW.

Table 3: Comparison with state-of-the-art Zero-Shot DFER methods. ^†: FineCLIPER with fine-grained text modality.

Ablation Study

Figure 3: Visualizations of class-wise cosine similarity values between video and text embeddings in DFEW, where the positive value is in green and the negative one is in red.

Figure 4: Comparison between our adaptive weighting strategy and fixed weights on the DFEW dataset, where the x-axis represents the weights of video features.
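The fixed-weight baseline in this comparison can be sketched as a convex combination of the video feature and a second feature, with the video weight swept along the x-axis. The sketch below is an illustration only: `fuse_fixed` and `fuse_adaptive` are hypothetical names, and the softmax gate in `fuse_adaptive` is an assumed stand-in for an adaptive scheme, since this page does not specify how the adaptive weights are computed.

```python
import numpy as np

def fuse_fixed(video_feat, aux_feat, w_video):
    """Fixed-weight fusion: w_video for the video feature, 1 - w_video for the other."""
    return w_video * video_feat + (1.0 - w_video) * aux_feat

def fuse_adaptive(video_feat, aux_feat, gate_logits):
    """Hypothetical adaptive fusion: weights come from a softmax over two logits.

    gate_logits would be learnable parameters in a trained model; here they are
    plain numbers supplied by the caller (an assumption for illustration).
    """
    g = np.exp(gate_logits - np.max(gate_logits))  # stable softmax numerator
    g = g / g.sum()                                # weights sum to 1
    return g[0] * video_feat + g[1] * aux_feat, g

video_feat = np.array([1.0, 0.0, 2.0])
aux_feat = np.array([0.0, 4.0, 2.0])

# Sweep the video weight as on the x-axis of the figure.
sweep = {w: fuse_fixed(video_feat, aux_feat, w) for w in (0.0, 0.5, 1.0)}

# Equal logits give equal weights, reproducing the w_video = 0.5 case.
fused, weights = fuse_adaptive(video_feat, aux_feat, np.array([0.0, 0.0]))
```

A fixed weight must be chosen by a sweep such as the one above, whereas a learned gate lets training pick the balance between modalities.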

We present a detailed analysis of our FineCLIPER framework, providing comprehensive quantitative and qualitative evaluations that further validate the superiority of FineCLIPER.

Qualitative Analysis

Figure 5: Parameter-Performance comparison on the DFEW testing set. The bubble size indicates the model size.

Figure 6: Attention visualizations for DFEW w.r.t. two ground-truth expression labels: 'Happiness' (Top) and 'Surprise' (Bottom).

Figure 7: Examples of the generated text and the refined text.

BibTeX

@article{chen2024finecliper,
  title={FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs},
  author={Chen, Haodong and Huang, Haojian and Dong, Junhao and Zheng, Mingzhe and Shao, Dian},
  journal={arXiv preprint arXiv:2407.02157},
  year={2024}
}