**ATD: Augmenting CP Tensor Decomposition by Self Supervision.**


Author: Chaoqi Yang

This blog introduces our recent NeurIPS 2022 paper [1]: **ATD: Augmenting CP Tensor Decomposition by Self Supervision**. We have open-sourced our code at https://github.com/ycq091044/ATD and shared the corresponding poster.

**Tensor decomposition** and **self-supervised learning** are two popular unsupervised learning methods. We combine them in this paper to achieve better predictive performance for downstream classification tasks. More specifically, our paper proposes a new canonical polyadic tensor decomposition (CPD) approach empowered by self-supervised learning (SSL), which generates unsupervised embeddings (Step 1) that can give better downstream classification performance (Step 2).

# A 5-min Summary

Tensor decomposition is an effective dimensionality reduction tool for downstream classification. However, traditional tensor decomposition methods focus on low-rank fitness and do not consider the downstream tasks (such as classification).

**Contribution 1:** This paper solves the problem of *"how to learn better tensor decomposition subspaces and generate predictive low-rank features for downstream classification."* We consider injecting class-preserving perturbations by tensor augmentation and then decomposing the tensor and the perturbed tensor together with the self-supervised loss (shown below).

**Contribution 2:** We improve the alternating least squares (ALS) optimization so that it works with our new loss functions (including the non-convex self-supervised loss). Specifically, we build a new optimization algorithm that uses least squares optimization and fixed-point iteration for solving the non-convex subproblem.

**Experimental Results:** Our method gives competitive results (with far fewer parameters) on multiple human signal datasets, compared to contrastive learning methods, autoencoders, and other tensor decomposition methods.

Next, we explain more details about the ATD method.

# 1. Feature Dimensionality Reduction

**Step 1 — Unsupervised Learning:** Tensor decomposition and self-supervised learning are two different unsupervised learning methods. They learn the encoders from unlabeled datasets and generate feature representations (e.g., 128-dim vectors) for each data sample. During the whole learning process, no label information is needed.

**Step 2 — Downstream Tasks:** The learned representations are used as the feature inputs for downstream classification where a separate linear model is trained (e.g., logistic regression, which takes 128-dim vectors as input and predicts the label).
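As a rough illustration of Step 2, the sketch below trains a logistic regression classifier on frozen feature vectors with plain gradient descent. The synthetic features, the binary labels, the learning rate, and the function names are assumptions made for illustration only; any off-the-shelf linear classifier could play the same role.

```python
import numpy as np

def train_logistic_regression(features, labels, lr=0.1, epochs=500):
    """Fit a binary logistic regression on fixed (frozen) feature vectors."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        # Per-sample gradient of the cross-entropy loss w.r.t. the logits.
        grad = probs - labels
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def predict(features, w, b):
    """Predict class 1 when the logit is positive, else class 0."""
    return (features @ w + b > 0).astype(int)

# Synthetic stand-in for 128-dim unsupervised embeddings (hypothetical data).
rng = np.random.default_rng(0)
true_direction = rng.normal(size=128)
features = rng.normal(size=(200, 128))
labels = (features @ true_direction > 0).astype(float)

w, b = train_logistic_regression(features, labels)
train_accuracy = (predict(features, w, b) == labels).mean()
```

The encoder that produced `features` is never updated here, which is what keeps Step 1 fully unsupervised.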

**Canonical polyadic tensor decomposition**

Canonical polyadic tensor decomposition (CPD) [2] is commonly used to learn the low-rank factors of a tensor (i.e., a multi-dimensional array that generalizes a matrix to more dimensions). Standard CPD follows the *fitness principle*: approximating the original tensor as closely as possible with the low-rank factors.

Formally, assume the tensor is **T**; the resulting low-rank factors can be 𝐀₁, 𝐀₂, …, 𝐀ₖ (one for each dimension). Frobenius mean square error (MSE) is commonly chosen as the fitness loss for learning the low-rank factors.
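To make the fitness principle concrete, the CP model and its Frobenius fitness loss can be written as follows. The symbols $R$ (decomposition rank), $K$ (number of tensor dimensions), and $\mathbf{a}^{(k)}_r$ (the $r$-th column of the $k$-th factor matrix) are notation introduced here for illustration; the paper [1] may use different symbols and scaling.

$$
\mathcal{T} \approx \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(K)}_r,
\qquad
\mathcal{L}_{\text{fit}} = \Big\| \mathcal{T} - \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(K)}_r \Big\|_F^2 .
$$

Here $\circ$ denotes the vector outer product, so each summand is a rank-one tensor.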

**Example:** We use multi-channel EEG signals as examples. Assume each EEG signal has two channels (two blue time series in one slice). Now, we stack *N* data samples (denoted as *𝐓₁*, *𝐓₂*, …) together and make it a 3-dimensional tensor **T**: N samples × number of channels × number of timesteps.

We use CPD to learn low-rank factors: **X** for the sample dimension, **A** for the channel dimension, and **B** for the timesteps dimension. Corresponding columns of these three matrices generate rank-one components one by one, and collectively the components approximate the tensor **T**. With more components (i.e., higher decomposition rank), the approximation accuracy often becomes better. However, more components will also likely capture unnecessary noise in the data (overfitting).

With a properly chosen rank, the learned factors can capture the low-rank structure of the tensor and will be the feature representations for different information aspects. For example, **rows in X are the representation of data samples**; each row in **A** is the representation of a channel.
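The example above can be sketched with a plain CP-ALS loop in NumPy. This is a minimal textbook-style sketch on synthetic data, not the ATD algorithm from [1]; the function names, the tensor sizes, the rank, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (I, R) and (J, R) -> (I*J, R)."""
    rank = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, rank)

def cp_als(T, rank, n_iters=200, seed=0):
    """Plain CP-ALS for a 3-way tensor T of shape (N, C, L)."""
    rng = np.random.default_rng(seed)
    N, C, L = T.shape
    X = rng.normal(size=(N, rank))   # sample factor
    A = rng.normal(size=(C, rank))   # channel factor
    B = rng.normal(size=(L, rank))   # timestep factor
    # Unfold the tensor along each of its three dimensions.
    T_samples = T.reshape(N, C * L)
    T_channels = np.moveaxis(T, 1, 0).reshape(C, N * L)
    T_steps = np.moveaxis(T, 2, 0).reshape(L, N * C)
    for _ in range(n_iters):
        # Each update is a linear least-squares solve with the other two factors fixed.
        X = T_samples @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        A = T_channels @ khatri_rao(X, B) @ np.linalg.pinv((X.T @ X) * (B.T @ B))
        B = T_steps @ khatri_rao(X, A) @ np.linalg.pinv((X.T @ X) * (A.T @ A))
    return X, A, B

def reconstruct(X, A, B):
    """Sum of rank-one components built from matching columns of X, A, B."""
    return np.einsum("nr,cr,lr->ncl", X, A, B)

# Synthetic rank-3 tensor plus small noise (hypothetical stand-in for stacked signals).
rng = np.random.default_rng(1)
N, C, L, rank = 20, 6, 30, 3
T = reconstruct(rng.normal(size=(N, rank)),
                rng.normal(size=(C, rank)),
                rng.normal(size=(L, rank)))
T = T + 0.01 * rng.normal(size=T.shape)

X, A, B = cp_als(T, rank)
relative_error = np.linalg.norm(T - reconstruct(X, A, B)) / np.linalg.norm(T)
```

Each row of `X` is then the low-rank feature vector of one data sample, which is what a downstream classifier would consume.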

**Self-supervised contrastive learning**

Self-supervised contrastive learning (SSL) [3][4] has become popular in recent years. SSL methods are mostly deep learning-based and can be used as unsupervised feature extractors as well.

First, given an unlabeled sample *𝐓ᵢ*, SSL methods will apply two class-preserving data augmentations (i.e., though we do not know the label of *𝐓ᵢ*, we know that after applying the data augmentations, *𝐓ᵢ* will change a bit while the underlying label will not change) and obtain the perturbed samples *𝐓ᵢ’* and *𝐓ᵢ’’*. Second, SSL methods apply the parameterized feature encoder Enc(⋅) on *𝐓ᵢ’* and *𝐓ᵢ’’* and obtain two representations **x’** and **x’’**. Optionally, people may also append one non-linear projection to obtain **z’** and **z’’**. Third, a common contrastive loss (e.g., noise contrastive estimation, NCE [5]) is used to maximize the similarity of **z’** and **z’’** over the similarity of **z’** and other embedding vectors from the same data batch.
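The third step can be sketched as a softmax-based, NCE-style loss over a batch of paired embeddings. This is a generic illustration rather than the exact loss of any cited method; the function name, the cosine similarity, and the `temperature` parameter are assumptions.

```python
import numpy as np

def nce_style_loss(z1, z2, temperature=0.5):
    """Softmax contrastive loss for a batch of paired embeddings.

    z1[i] and z2[i] come from two augmentations of sample i (a positive pair);
    z2[j] with j != i serve as the other embeddings from the same batch.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned_loss = nce_style_loss(z, z + 0.01 * rng.normal(size=z.shape))
shuffled_loss = nce_style_loss(z, np.roll(z, 1, axis=0))
```

When the two views of each sample agree (`aligned_loss`), the loss is lower than when the pairs are mismatched (`shuffled_loss`), which is the behavior the alignment principle relies on.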

**Example:** For the same unlabeled EEG signals, we stack them as a data batch and feed them into the typical SSL pipeline. During the learning process, the SSL model tries to align the embeddings of perturbed samples that come from the same data sample while pushing apart the embeddings of perturbed samples that come from different data samples.

In sum, following the *alignment principle*, **the SSL model can generate meaningful unsupervised features X** by leveraging the deep Enc(⋅) function.

# 2. Introducing SSL to CPD

So far, we know that both the CPD-type and the SSL-type methods can extract feature representations from data samples in an unsupervised way. The CPD-type methods need far fewer parameters than deep-learning methods, but CPD typically does not consider the downstream classification. The SSL methods are flexible and generalizable to many frameworks. However, SSL methods often need many more parameters than tensor methods.

## Augmenting Tensor Decomposition

Inspired by the above observation, **our proposed ATD method introduces the self-supervised learning concept into tensor decomposition and combines their advantages.** In the figure, each tensor slice is an unlabeled data sample. We integrate the idea of self-supervised learning by the following steps:

1. **Data augmentation:** We apply data augmentations to each tensor slice (such as bandpass filtering and coordinate rotation; see [1] for details).
2. **Tensor decomposition:** We stack the tensor *T* and the perturbed tensor *T'* together, and apply a tensor decomposition algorithm (we use CPD here). The loss functions are the standard regularizer, the fitness loss, and the self-supervised loss (we show the framework again).
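The two steps above can be sketched as follows. The FFT-based bandpass filter, the frequency band, the sampling rate, and the choice to stack along the sample axis are all assumptions made for illustration; the augmentations and the exact stacking used by ATD are described in [1].

```python
import numpy as np

def fft_bandpass(signals, low_hz, high_hz, sample_rate_hz):
    """Zero out frequency components outside [low_hz, high_hz] along the last axis."""
    n = signals.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate_hz)
    spectrum = np.fft.rfft(signals, axis=-1)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=n, axis=-1)

# Hypothetical unlabeled tensor: N samples x channels x timesteps.
rng = np.random.default_rng(0)
T = rng.normal(size=(16, 2, 256))

# Step 1: perturb every slice with a (presumed class-preserving) augmentation.
T_aug = fft_bandpass(T, low_hz=1.0, high_hz=30.0, sample_rate_hz=100.0)

# Step 2: stack the original and perturbed tensors along the sample axis, so a
# single decomposition can share the channel and timestep factors between them.
stacked = np.concatenate([T, T_aug], axis=0)
```

With this layout, the first `N` rows of the sample factor would correspond to the original samples and the last `N` rows to their perturbed copies.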

There are some technical details in the loss design and optimization. In this blog, we only sketch the main ideas; please check out our paper [1] for details.

## Challenges and Solutions in Optimization

**Challenges:** In the SSL domain, noise contrastive estimation (NCE) loss is widely used, which aims to maximize the similarity of positive pairs and minimize the similarity of negative pairs. However, NCE loss is based on the non-convex softmax form and may not be amenable to the alternating least squares type algorithms used in tensor factorizations. Also, the so-called negative samples in common SSL practice are usually just random samples, so some of them may in fact share the class of the anchor sample.

**Solutions:** In this paper, we first utilize the law of total probability to *find the negative samples in an unbiased way*. Then, we propose *a new subtraction-formed self-supervised loss*, which follows the alignment principle (maximizing the similarity of positive pairs and minimizing the similarity of negative pairs) but is amenable to traditional optimization tools. Though the new self-supervised loss is still non-convex (with respect to the unsupervised features **X** and **X’**), we propose *a combination of fixed-point iteration and least squares optimization* in the paper for solving this alternating non-convex problem. In the implementation, we do not rely on the auto-grad backpropagation function of PyTorch or TensorFlow.
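To give a feel for what a subtraction-formed loss can look like, here is a minimal sketch: the mean similarity of positive pairs is subtracted from a weighted mean similarity over all cross pairs. This is an illustrative form only, not the exact loss from [1]; the cosine similarity, the `neg_weight` parameter, and the function names are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(X, X_aug):
    """Pairwise cosine similarities between rows of X and rows of X_aug."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Xa = X_aug / np.linalg.norm(X_aug, axis=1, keepdims=True)
    return Xn @ Xa.T

def subtraction_alignment_loss(X, X_aug, neg_weight=1.0):
    """Illustrative loss: weighted mean cross-pair similarity minus mean positive similarity."""
    S = cosine_similarity_matrix(X, X_aug)
    positive = np.mean(np.diag(S))   # row i of X paired with row i of X_aug
    background = np.mean(S)          # all pairs, standing in for the batch at large
    return neg_weight * background - positive

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
aligned = subtraction_alignment_loss(X, X + 0.01 * rng.normal(size=X.shape))
shuffled = subtraction_alignment_loss(X, np.roll(X, 1, axis=0))
```

Unlike the softmax form, this expression has no log-sum-exp term, but the row normalization inside the cosine similarity still makes it non-convex in `X` and `X_aug`.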

# 3. Experiments

**Datasets:** Let us look at the performance of our proposed ATD on four human signal datasets: (i) an EEG dataset, Sleep-EDF; (ii) an ECG dataset, PTB-XL; (iii) a human activity recognition (HAR) dataset; and (iv) a proprietary EEG dataset from Massachusetts General Hospital (MGH). The first three are publicly available. Their statistics are shown below.

The overall comparison with baseline models is given in the Summary above. The results show that ATD gives comparable or better performance than the baselines. We can conclude that it is useful to consider both fitness from tensor factorization and alignment from self-supervised learning as part of the objective. The result table also shows that tensor-based models require fewer parameters, i.e., fewer than 5% of the parameters of the deep learning models.

Additionally, we show the effect of varying the amount of training data on the MGH dataset (below) while fixing the test set. As a reference, we include an end-to-end supervised CNN model, called Reference CNN. To avoid overlapping curves, we separate the comparison figure into two sub-figures: the left one compares with self-supervised and auto-encoder baselines, and the right one compares with tensor baselines and the reference model. We find that all unsupervised models outperform the supervised reference CNN model in scenarios with fewer training samples. With more training data, the performance of all models improves, especially that of the reference CNN model.

[1] Yang, Chaoqi, Cheng Qian, Navjot Singh, Cao Xiao, M. Brandon Westover, Edgar Solomonik, and Jimeng Sun. "ATD: Augmenting CP Tensor Decomposition by Self Supervision." Advances in Neural Information Processing Systems. 2022.

[2] Kolda, Tamara G., and Brett W. Bader. "Tensor decompositions and applications." SIAM review 51.3 (2009): 455–500.

[3] He, Kaiming, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[4] Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive learning of visual representations." International Conference on Machine Learning. PMLR, 2020.

[5] Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010.