Paper URL: [2402.08994] CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding

Paper code: https://github.com/CLIP-MUSED/CLIP-MUSED

The English here is typed entirely by hand! It is my summarizing and paraphrasing of the original paper, so occasional spelling or grammar mistakes are hard to avoid; corrections in the comments are welcome. This post is written as reading notes, so take it with a grain of salt.

Contents

1. Thoughts

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Methodology

2.3.1. Overview

2.3.2. CLIP-based feature extraction of visual stimuli

2.3.3. Transformer-based fMRI feature extraction

2.3.4. Multi-subject shared neural response representation

2.3.5. Semantic classifier

2.3.6. Optimization objective

2.4. Experiments

2.4.1. Datasets

2.4.2. Baseline methods

2.4.3. Parameter settings

2.4.4. Evaluation metrics

2.4.5. Comparative experimental results

2.4.6. Ablation study

2.4.7. Visualization

2.5. Discussion and Conclusion

1. Thoughts

(1) Honestly, the experiments here are a bit thin

2. Section-by-Section Close Reading

2.1. Abstract

        ①Challenges: neural information encoding, and generalizing from a single subject to multiple subjects

        ②They proposed CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method

2.2. Introduction

        ①Existing fMRI decoding methods:

for two subjects viewing the same stimuli, existing works map the fMRI signals into a common voxel space, then aggregate/cluster the signals evoked by the same stimuli. However, this is hard to generalize, so another line of methods instead clusters responses by category.

2.3. Methodology

2.3.1. Overview

        ①Overall framework:

        ②{\mathcal{X}}^{(n)}: voxel space

        ③\mathbf{X}^{(n)}\in\mathbb{R}^{n_{i}\times d_{i}}: the neural responses of the n-th subject out of N in total, where n_i is the number of image stimuli and d_i is the number of voxels (then why isn't it written \mathbf{X}^{(i)}?)

        ④{\mathcal{I}}: pixel space of image stimuli

        ⑤{\mathcal{Y}}: label space of image stimuli

        ⑥Classifier: C:\mathcal{X}^{(1)}\times\cdots\times\mathcal{X}^{(N)}\to\mathcal{Y}

2.3.2. CLIP-based feature extraction of visual stimuli

        ①They extract low-level features from the first layer of the CLIP image encoder, and high-level features as the average of the last-layer outputs of the whole CLIP model (image and text encoders)

        ②Representation similarity matrices (RSMs) \mathbf{M}_{llv}^{\mathbf{I}} and \mathbf{M}_{hlv}^{\mathbf{I}} quantify the pairwise similarity between the B (batch size) visual stimuli in the low-level and high-level feature spaces; the entries \mathbf{M}_{llv}^{\mathbf{I}}[i,j] and \mathbf{M}_{hlv}^{\mathbf{I}}[i,j] denote the similarity of images i and j in \mathcal{F}_{llv} and \mathcal{F}_{hlv}, respectively
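This summary does not pin down the similarity measure used inside the RSMs; a minimal sketch using cosine similarity (the measure, sizes, and variable names are assumptions for illustration):

```python
import numpy as np

def rsm(features):
    """Representation similarity matrix: entry [i, j] is the cosine
    similarity between the feature vectors of stimuli i and j."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return f @ f.T  # shape (B, B), symmetric, ones on the diagonal

# Hypothetical CLIP features for a batch of B = 4 stimuli (random stand-ins)
rng = np.random.default_rng(0)
M_llv_I = rsm(rng.standard_normal((4, 512)))  # low-level RSM
M_hlv_I = rsm(rng.standard_normal((4, 512)))  # high-level RSM
```

The same function applied to the fMRI token representations yields \mathbf{M}_{llv}^{\mathbf{X}} and \mathbf{M}_{hlv}^{\mathbf{X}} used in the losses of Section 2.3.4.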

2.3.3. Transformer-based fMRI feature extraction

        ①Transformer process:

\begin{aligned} & \mathbf{z}_{0}= \begin{bmatrix} \mathbf{x}_{class};\mathbf{x}^{1}\mathbf{E};\cdots;\mathbf{x}^{M}\mathbf{E} \end{bmatrix}+\mathbf{E}_{pos}, \\ & \mathbf{z}_{l}^{\prime}=\mathrm{MHSA}\left(\mathrm{LN}\left(\mathbf{z}_{l-1}\right)\right)+\mathbf{z}_{l-1}, & & l=1,2,\ldots,L \\ & \mathbf{z}_{l}=\mathrm{MLP}\left(\mathrm{LN}\left(\mathbf{z}_{l}^{\prime}\right)\right)+\mathbf{z}_{l}^{\prime}, & & l=1,2,\ldots,L \\ & \mathbf{z}=\mathrm{LN}\left(\mathbf{z}_{L}^{0}\right). \end{aligned}

        ②Transformer-based fMRI feature extractor:

(a) they first reduce the dimension of the BOLD volume, then patchify and flatten it; the operations in (b) and (c) are:

\begin{aligned} & \mathbf{z}_{0}= \begin{bmatrix} \mathbf{x}_{llv};\mathbf{x}_{hlv};\mathbf{x}^{1}\mathbf{E};\cdots;\mathbf{x}^{M}\mathbf{E} \end{bmatrix}+\mathbf{E}_{pos}, \\ & \mathbf{z}_{l}^{\prime}=\mathrm{MHSA}\left(\mathrm{LN}\left(\mathbf{z}_{l-1}\right)\right)+\mathbf{z}_{l-1}, & & l=1,2,\ldots,L \\ & \mathbf{z}_{l}=\mathrm{MLP}\left(\mathrm{LN}\left(\mathbf{z}_{l}^{\prime}\right)\right)+\mathbf{z}_{l}^{\prime}, & & l=1,2,\ldots,L \\ & \mathbf{z}_{llv}=\mathrm{LN}\left(\mathbf{z}_{L}^{0}\right), \\ & \mathbf{z}_{hlv}=\mathrm{LN}\left(\mathbf{z}_{L}^{1}\right). \end{aligned}
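The forward pass above can be sketched as a toy pre-LN Transformer in NumPy, with the two learnable tokens x_llv and x_hlv prepended to the patch embeddings. All sizes, random weights, and names are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, L = 6, 16, 2  # patches, embedding dim, depth (toy sizes)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mhsa(x, W_qkv, W_o, heads=2):
    """Multi-head self-attention over a (T, D) token sequence."""
    T, D = x.shape
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    dh = D // heads
    out = []
    for h in range(heads):
        qh, kh, vh = (a[:, h * dh:(h + 1) * dh] for a in (q, k, v))
        att = qh @ kh.T / np.sqrt(dh)
        att = np.exp(att - att.max(-1, keepdims=True))
        att /= att.sum(-1, keepdims=True)          # softmax over keys
        out.append(att @ vh)
    return np.concatenate(out, -1) @ W_o

# learnable low-/high-level tokens prepended to the patch embeddings
x_llv, x_hlv = rng.standard_normal((2, 1, D))
patches = rng.standard_normal((M, D))              # x^1 E, ..., x^M E
E_pos = rng.standard_normal((M + 2, D))
z = np.concatenate([x_llv, x_hlv, patches]) + E_pos

for _ in range(L):                                 # L pre-LN blocks
    W_qkv = rng.standard_normal((D, 3 * D)) * 0.1
    W_o = rng.standard_normal((D, D)) * 0.1
    W_mlp1 = rng.standard_normal((D, 4 * D)) * 0.1
    W_mlp2 = rng.standard_normal((4 * D, D)) * 0.1
    z_p = mhsa(layer_norm(z), W_qkv, W_o) + z              # MHSA + residual
    z = np.maximum(layer_norm(z_p) @ W_mlp1, 0) @ W_mlp2 + z_p  # MLP + residual

z_llv, z_hlv = layer_norm(z[0]), layer_norm(z[1])  # the two token outputs
```

Only the first two token positions are read out at the end, mirroring z_L^0 and z_L^1 in the equations.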

2.3.4. Multi-subject shared neural response representation

        ①Similarity loss:

\mathcal{L}_{llv}=\left\|\mathbf{M}_{llv}^{\mathbf{I}}-\mathbf{M}_{llv}^{\mathbf{X}}\right\|_{F}^{2}/B^{2},\mathcal{L}_{hlv}=\left\|\mathbf{M}_{hlv}^{\mathbf{I}}-\mathbf{M}_{hlv}^{\mathbf{X}}\right\|_{F}^{2}/B^{2}.
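These losses transcribe directly into code; a minimal sketch, where `M_img` and `M_fmri` stand for the corresponding image-side and fMRI-side RSMs (names assumed):

```python
import numpy as np

def rsa_loss(M_img, M_fmri):
    """Squared Frobenius distance between two B x B RSMs, scaled by 1/B^2,
    matching the form of L_llv and L_hlv."""
    B = M_img.shape[0]
    return np.sum((M_img - M_fmri) ** 2) / B ** 2

# identical RSMs incur zero loss
assert rsa_loss(np.eye(3), np.eye(3)) == 0.0
```

The 1/B^2 factor keeps the loss scale independent of the batch size.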

2.3.5. Semantic classifier

        ①Final representation:

\begin{aligned} & \mathbf{z}=\mathrm{CONCAT}(\mathbf{z}_{llv},\mathbf{z}_{hlv}), \\ & \mathbf{\hat{y}}=\mathrm{MLP}(\mathbf{z}). \end{aligned}

        ②Semantic classification CE loss:

\mathcal{L}_{c}=-\frac{1}{C}\sum_{j=1}^{C}\left[\mathbf{y}_{j}\log(\mathbf{\hat{y}}_{j})+(1-\mathbf{y}_{j})\log(1-\mathbf{\hat{y}}_{j})\right]
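Note this is binary cross-entropy averaged over the C label dimensions (multi-label, one sigmoid output per class) rather than softmax CE; a minimal sketch with assumed names:

```python
import numpy as np

def semantic_bce(y, y_hat, eps=1e-12):
    """L_c: binary cross-entropy averaged over the C label dimensions."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard the logarithms
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])   # ground-truth multi-hot labels
perfect = semantic_bce(y, y)    # predictions equal to targets -> ~0 loss
```

Clipping the predictions is a standard numerical guard, not something specified in the paper.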

2.3.6. Optimization objective

        ①Orthogonal constraint to encourage the difference between low-level and high-level token representations:

\min\mathcal{L}_{\perp}= \begin{Vmatrix} \mathbf{z}_{llv}\mathbf{z}_{hlv}^{T} \end{Vmatrix}_{F}^{2}/B^{2}.
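A minimal sketch of this penalty over a batch of B token representations (variable names assumed):

```python
import numpy as np

def ortho_loss(z_llv, z_hlv):
    """||z_llv z_hlv^T||_F^2 / B^2: drives the low-level token
    representations to be orthogonal to the high-level ones."""
    B = z_llv.shape[0]
    return np.sum((z_llv @ z_hlv.T) ** 2) / B ** 2

# orthogonal representations give zero penalty
assert ortho_loss(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])) == 0.0
```

The penalty is zero exactly when every low-level representation is orthogonal to every high-level one in the batch.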

        ②Final loss:

\min\mathcal{L}=\mathcal{L}_{c}+\lambda_{\perp}\mathcal{L}_{\perp}+\lambda_{llv}\mathcal{L}_{llv}+\lambda_{hlv}\mathcal{L}_{hlv}.

2.4. Experiments

2.4.1. Datasets

(1)HCP

        ①Subjects: 9 of 158

        ②Visual stimuli: four dynamic movie clips, each annotated with an 859-dimensional WordNet label

        ③Categories: labels with frequency higher than 0.1

(2)NSD

        ①Subjects: 8

        ②Visual stimuli: images from MSCOCO, each with multiple labels from 80 categories; each subject saw a distinct set of 9k stimuli plus the same 1k shared across all subjects

2.4.2. Baseline methods

        ①Types of compared methods: single-subject decoding methods, multi-subject data aggregation methods, MS-EMB, and the shared response model (SRM)

2.4.3. Parameter settings

(1)HCP

        ①3D-CNN: 6 layers, producing a 7×8×7×512 feature map, reshaped to 392×512 (392 patches)

        ②Transformer blocks: 2

        ③Optimizer: Adam with 0.001 learning rate

        ④Batch size: 64

        ⑤Hyperparameters of loss weights: \lambda_{\perp}=0.001,\lambda_{hlv}=0.001,\lambda_{llv}=0.1
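As a quick shape check on the 3D-CNN output above, flattening the spatial dimensions 7 × 8 × 7 yields 392 patch tokens of dimension 512 (a zero tensor as a stand-in for the real feature map):

```python
import numpy as np

feat = np.zeros((7, 8, 7, 512))    # 3D-CNN output: 7x8x7 spatial, 512 channels
tokens = feat.reshape(-1, 512)     # flatten spatial dims into patch tokens
assert tokens.shape == (392, 512)  # 7 * 8 * 7 = 392 patches
```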

(2)NSD

        ①Embedding layers: 512, 24

        ②Head: 8

        ③Batch size: 64

        ④Optimizer: Adam with 0.0001 learning rate

        ⑤Hyperparameter of loss weights: \lambda_{\perp}=0.001,\lambda_{hlv}=0.001,\lambda_{llv}=0.0001

2.4.4. Evaluation metrics

        ①Mean Average Precision (mAP), the area under the receiver operating characteristic curve (AUC), and the Hamming distance

2.4.5. Comparative experimental results

        ①Performance on HCP:

        ②Performance on NSD:

2.4.6. Ablation study

        ①Loss ablation:

2.4.7. Visualization

        ①Attention maps of (a) the low-level tokens, (b) the high-level tokens, and (c) the MS-EMB method:

2.5. Discussion and Conclusion

        ~
