TQA: Creating Valid Prediction Intervals for Cross-sectional Time Series Regression

--

Author: Zhen Lin (UIUC)

This blog describes our NeurIPS’22 paper [3]: Conformal Prediction with Temporal Quantile Adjustments.

In this blog, we will go over:

  • A brief introduction to conformal prediction, a powerful tool that provides coverage guarantees under minimal distributional assumptions.
  • The challenges in creating valid prediction intervals for time series forecasting, and our approach, Temporal Quantile Adjustment (TQA).

Resources: code, poster, paper.

The Problem

Imagine we are a hospital and want to predict a biomarker of patients. The biomarker could be, for example, white blood cell counts, blood pressure, or any continuous response variable. Since each patient typically has multiple visits, their data form a time series. Typically we have many patients, and they jointly form a cross-section. Cross-sectional time series are frequently seen in electronic health records (EHR), social science, and econometrics, where they are often referred to as panel data. We refer to such tasks as cross-sectional time series regression.

Our goal is to accurately predict the future observations of the time series with a coverage guarantee. Formally, we are interested in providing a point estimate of the response Y and a prediction interval (PI) for the estimate. In terms of the coverage guarantee, we would like to say something like, “with a probability of 90%, the patient’s blood pressure is within [a,b]”. A few questions arise: How do we decide [a,b]? Can we provide a guarantee for this? Can we construct PIs for any underlying model?

To articulate the problem, we will have to introduce some notation: denote by X_t^(i) the features and by Y_t^(i) the response of patient i at time step t, for patients i = 1, …, N and time steps t = 1, …, T.

We will use lowercase to denote realizations of these random variables. Our goal is to construct a prediction interval (PI) that will cover the corresponding response Y with probability ≥ 1−α. For example, for a 90% PI, α = 0.1. We will assume our patients are independent and identically distributed (i.i.d.), but each patient's time series has its own (potentially complex) temporal dependence. Familiar readers might realize that conformal prediction could come in handy. Don't worry if this sounds alien to you, as we will go over this simple but powerful tool in the next section.

Caveat: Panel data / cross-sectional time series are sometimes synchronous: for example, stock returns are also panel data, but we observe all stocks' returns for the same period simultaneously. We do not focus on such cases. For EHR data, for example, one could use the full history of other patients when predicting the response of a new patient (with minimal distribution shift, if any).

A Quick Tutorial on (Split) Conformal Prediction

Before we talk about the main idea of the paper, we need to take a quick detour to review some basics. Conformal prediction is a framework that provides provable coverage guarantees with minimal assumptions on the underlying model. Its flexibility has attracted many recent applications to complex deep learning models. (We skip some technical details in this section; interested readers should refer to [1] or our paper [3] for a more comprehensive treatment of conformal prediction and related work.)

Suppose for now that we focus on a particular time step t. In the simplest form, we collect the residuals from a held-out set of N patients:

Scores_t = { |y_t^(i) − ŷ_t^(i)| : i = 1, …, N }

then for a new patient with prediction ŷ at time t, we could construct:

Ĉ_t = [ ŷ − w_t, ŷ + w_t ],  where w_t = Q(1−α; Scores_t ∪ {∞})    (1)

Here, Q(β; A) means the β-quantile of the set A. With our i.i.d. assumption, this PI will cover the corresponding Y with the target probability 1−α, as formally stated by the coverage guarantee below:

P( Y_t^(N+1) ∈ Ĉ_t ) ≥ 1 − α    (2)

Here, the probability is taken over a random time series (indexed by N+1).
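Eq. (1) takes only a few lines to implement. Below is a minimal sketch in Python (the names are ours; any point predictor can produce `cal_pred` and `test_pred`):

```python
import numpy as np

def split_conformal_pi(cal_y, cal_pred, test_pred, alpha=0.1):
    """Split conformal PI at a fixed time step t, following Eq. (1).

    cal_y, cal_pred: held-out responses/predictions for the N calibration patients.
    test_pred:       the point prediction yhat for the new patient.
    """
    # Nonconformity scores (residuals) on the calibration patients.
    scores = np.sort(np.abs(np.asarray(cal_y) - np.asarray(cal_pred)))
    scores = np.append(scores, np.inf)        # the appended "infinity" in Eq. (1)
    n = len(scores) - 1                       # number of calibration patients
    k = int(np.ceil((1 - alpha) * (n + 1)))   # rank of the conformal quantile
    w = scores[k - 1]                         # Q(1-alpha; Scores_t ∪ {∞})
    return test_pred - w, test_pred + w
```

For example, `lo, hi = split_conformal_pi(y_cal, yhat_cal, yhat_new, alpha=0.1)` gives a 90% PI.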

The General Case

For a moment, let’s ignore the subscript t since it is fixed, and consider any classification or regression task trying to estimate Y from X. Our task is to use conformal prediction to construct a set of values Ĉ such that the probability of Y ∈ Ĉ is at least 1−α. Note that a prediction interval could be viewed as a subset of ℝ in this case, and Ĉ need not be an interval. In general, conformal prediction requires a “nonconformity score” function s that measures how “nonconformal” a realized datum (x, y) is. For example, we could interpret our “residual” above as s(x, y) = |y − ŷ(x)|. The larger the residual, the more nonconformal this data point is with respect to the fitted model (and, to some extent, the training set). The conformal prediction set/interval is then constructed as:

Ĉ(x) = { y : s(x, y) ≤ Q(1−α; Scores ∪ {∞}) },  where Scores = { s(x_i, y_i) : i = 1, …, N }    (3)

Note that the PI in Eq.(1) is just a special case of this general formula, Eq.(3).

We could then show that the response indeed falls in this PI at least 1−α of the time. Let’s first consider the (normalized) rank of our test point among all N+1 nonconformity scores, denoted by r:

r = (1/(N+1)) · |{ i ∈ {1, …, N+1} : s(x_i, y_i) ≤ s(x_{N+1}, y_{N+1}) }|

Due to our i.i.d. assumption, r is uniformly distributed on {1/(N+1), 2/(N+1), …, 1}. This means that for any β∈(0,1), including β=1−α in particular, the probability of the test score being no greater than the β-quantile of Scores is (essentially) β¹! Note that we do NOT know the nonconformity score for our test point, which is why we used an ∞ in Eq.(1) — if α(N+1)<1, it should be clear that the set in Eq.(3) could include arbitrarily large nonconformity scores/residuals (and thus arbitrarily large Y).
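This rank argument is easy to check empirically. A tiny simulation with i.i.d. Gaussian scores (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, beta = 100, 20_000, 0.9
hits = 0
for _ in range(trials):
    scores = rng.normal(size=N + 1)        # N calibration scores + 1 test score, exchangeable
    k = int(np.ceil(beta * (N + 1)))       # rank used by conformal prediction
    hits += scores[-1] <= np.sort(scores[:-1])[k - 1]
print(hits / trials)                       # ≈ 0.90 (slightly above, due to discreteness; cf. footnote 1)
```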

The case of classification

We focused on regression in the paper, but you might wonder what conformal prediction looks like for classification. (Yes, this is another detour from the paper.) For that, we introduce the key concept of a prediction set. Let’s assume we have a trained classifier f(y|x) that predicts the probability of x belonging to class y. We could simply use s(x, y) = 1 − f(y|x) as the nonconformity score. In this case, Ĉ is a prediction set consisting of discrete values, such as {cat, dog}. The cardinality of Ĉ increases as α decreases; for example, when α = 0, Ĉ contains all classes. In practice, we want a small prediction set Ĉ for any fixed α.
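As a quick illustration, here is a minimal sketch of this construction; `cal_probs` and `test_probs` are assumed to be predicted class probabilities from any trained classifier:

```python
import numpy as np

def classification_prediction_set(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Conformal prediction set with s(x, y) = 1 - f(y|x).

    cal_probs:  (N, K) predicted class probabilities on the calibration set
    cal_labels: (N,) true labels of the calibration points
    test_probs: (K,) predicted probabilities for the test point
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity of the true labels
    k = int(np.ceil((1 - alpha) * (n + 1)))
    threshold = np.inf if k > n else np.sort(scores)[k - 1]
    return np.where(1.0 - test_probs <= threshold)[0]    # indices of classes kept in the set
```

As α shrinks, the threshold grows and more classes survive; at α = 0 the threshold is ∞ and every class is included.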

Note that there are many choices of nonconformity scores for different settings. In fact, a very important part of conformal prediction research is about designing better nonconformity scores. In our paper, however, we try to improve the pipeline in a different way, one that works with any nonconformity score.

Temporal Quantile Adjustment

Now let’s come back to the main story of the paper.

The split conformal prediction interval is great, but are we done? Not really! When a PI covers Y at least 1−α of the time, we say it’s valid. However, from Figure 1 below, we can see that there are several validity profiles. In particular, both B and C in Figure 1 satisfy the guarantee above, which we call cross-sectional validity. There is another important notion called longitudinal validity. Let’s imagine a patient whose responses our model just does not predict very well. If her Y is outside our PI 20 out of 20 times, should we keep constructing the aforementioned PIs in the same way for her future visits? Probably not. That is exactly what happens in the first row of B in Figure 1, which is not longitudinally valid. Instead, we prefer C, whose PIs are both cross-sectionally and longitudinally valid.

Figure 1: PI estimators with different cross-sectional or longitudinal coverages/validities. Red crosses represent Y outside the corresponding PIs. C exhibits both cross-sectional and longitudinal validity, which is ideal.

The previous discussion brings us to the main idea of our paper. Intuitively, we want to achieve both validity profiles, like C in Figure 1. It turns out to be quite hard to provide a theoretical guarantee of longitudinal validity. As a practical compromise, we aim to maintain cross-sectional validity while improving longitudinal validity. Note that Y is not covered if and only if the (normalized) rank of its residual is greater than 1−α. The idea is that we can replace the queried quantile with a dynamic value:

Ĉ_t^(i) = { y : s(x, y) ≤ Q(1−α + δ_t^(i); Scores_t ∪ {∞}) }    (4)
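In code, the only change relative to the split conformal sketch above is the level at which we query the quantile. A minimal sketch, where `delta` is the adjustment δ_t^(i) (produced by the two methods below):

```python
import numpy as np

def adjusted_pi(test_pred, cal_scores, alpha=0.1, delta=0.0):
    """PI from Eq. (4): query the (1 - alpha + delta)-quantile instead of (1 - alpha)."""
    n = len(cal_scores)
    k = int(np.ceil((1 - alpha + delta) * (n + 1)))   # adjusted quantile rank
    k = min(max(k, 1), n + 1)                         # clamp into a valid range
    aug = np.append(np.sort(cal_scores), np.inf)      # append ∞ as before
    w = aug[k - 1]
    return test_pred - w, test_pred + w
```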

We could change the quantile-to-query by predicting the error, using a higher (lower) adjustment δ when we believe the error will be high (low). Below we present two temporal quantile adjustment (TQA) methods:

TQA: Budgeting

Our first method is called TQA-B (B stands for Budgeting). It is not hard to show that if our adjustment behaves like random noise with no relation to the rank r (adjustments correlated with the realized rank could be very “bad” ones), then cross-sectional validity is achieved as long as the expectation of the adjustment is non-negative. A positive expected adjustment will lead to more conservative (i.e., wider) PIs. To give efficient/narrow PIs, we set this expectation to be precisely 0. This constraint is also where the name “budgeting” comes from. We divide the pipeline into two steps:

  1. Quantile Prediction: Predict the quantile of the residual/nonconformity score of our test point. We use the rank of the Exponentially Weighted Moving Average (EWMA) of its past residuals.
  2. Budgeting: Given the predicted rank, we transform it into an adjustment δ_t^(i) that increases with the predicted rank while keeping its cross-sectional expectation exactly 0 (see the paper for the exact transformation; an illustrative sketch follows below).

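To make the two steps concrete, here is a minimal sketch in Python. Step 1 follows the description above (rank of the EWMA of past residuals); for step 2 we substitute a simple linear, mean-centered transform as an illustrative stand-in: the paper’s exact budgeting transformation differs (see [3]), and `gamma` is an assumed scale parameter.

```python
import numpy as np

def ewma(past_scores, lam=0.9):
    """Exponentially weighted moving average of one series' past residuals."""
    w = lam ** np.arange(len(past_scores))[::-1]   # most recent residual weighs most
    return float(np.sum(w * past_scores) / np.sum(w))

def tqa_b_adjustments(past_scores_per_series, gamma=0.05):
    """Illustrative TQA-B: EWMA-based predicted ranks -> mean-zero adjustments.

    past_scores_per_series: one 1-D array of past residuals per time series.
    Returns one adjustment delta per series; by construction they average to zero,
    which is the "budget" that preserves cross-sectional validity.
    """
    levels = np.array([ewma(s) for s in past_scores_per_series])
    ranks = levels.argsort().argsort() / (len(levels) - 1)   # predicted ranks in [0, 1]
    return gamma * (ranks - ranks.mean())                    # center: zero expectation
```

A series whose recent residuals are unusually large gets a positive δ (a wider PI), paid for by slightly narrower PIs elsewhere.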
The following figure (Figure 2) shows why this should help improve the coverage when our rank prediction is indeed predictive of the realized rank of the test point’s nonconformity score.

Figure 2: Coverage profiles with hypothetical realized rank vs. prediction, with α = 0.2 for readability. “Budgeting” means that the areas of “sacrificed” and “gained” are the same. If the rank is predictable and we do not perform TQA-B (left), we keep the 80% coverage. If the rank is predictable and we perform TQA-B (middle), we improve the coverage, as there are more points in “gained” than in “sacrificed”. If we perform TQA-B even when the rank is not predictable (right), we still maintain the target coverage of 80%.

TQA: Error-based Adjustment

The second method is called TQA-E (E stands for Error-based). In this version, we simply use the error (whether Y falls in the corresponding PI) to update our adjustment, handling each time series independently. To be specific, the update takes the form:

δ_{t+1}^(i) = δ_t^(i) + γ ( 𝟙[ y_t^(i) ∉ Ĉ_t^(i) ] − α ),

where γ > 0 is a step size. This update rule is inspired by [2]. Similar to [2], we also allow the adjusted miscoverage level (α − δ) to drop below 0, i.e., the queried quantile to exceed 1, in which case we output infinitely wide PIs. This means TQA-E has a better asymptotic coverage guarantee (please refer to our paper for details), but its PIs also tend to be wider and less efficient.
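A minimal sketch of one update step for a single time series; the step size `gamma = 0.005` is an illustrative value, not one from the paper:

```python
def tqa_e_update(delta, covered, alpha=0.1, gamma=0.005):
    """One TQA-E step: nudge the adjustment based on the latest coverage error.

    delta:   current adjustment to the queried quantile (1 - alpha + delta)
    covered: whether Y_t actually fell inside the PI we just issued
    """
    err = 0.0 if covered else 1.0
    return delta + gamma * (err - alpha)   # a miss widens future PIs; a cover slowly narrows them
```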

For both TQA-B and TQA-E, we could show that even if our adjustment is “bad” (e.g., when the errors for each time series are temporally independent), we still have cross-sectional validity. However, because they both try to adapt to the “abnormal-ness” (or nonconformity) of each time series, they exhibit much better longitudinal coverage empirically.

Experiments

To verify TQA’s effectiveness, we conduct experiments on several datasets, including:

  • MIMIC: White blood cell count (WBCC) prediction from patient records in the MIMIC-III dataset. Sequential visits of one patient are considered a time series.
  • CLAIM: Claim amount prediction using insurance data. Several sequential claims of one patient are considered one time series, and X includes features like ICD-10 or CPT/HCPCS codes.
  • COVID²: COVID-19 case prediction. Each time series is the COVID case count for one region in the UK.
  • EEG: Electroencephalogram (EEG) signal trajectory prediction after visual stimuli. Each time series is a short EEG recording.
  • GEFCom: Energy load data from the Probabilistic Electric Load Forecasting task in the Global Energy Forecasting Competition 2014. It has hourly temperature (X) and electricity load (Y) data of one utility for 9 years. We treat different days as the cross-section, so each time series has a length of 24.

TQA-B and TQA-E are compared with several methods (some conformal) on three metrics:

  1. Average coverage rate: The proportion of Y that falls in the corresponding Ĉ (over all i and t).
  2. Tail coverage rate: Like the average coverage rate, but computed only over the least-covered 10% of all time series (i.e., the performance on the most difficult time series).
  3. Inverse efficiency: average PI width divided by the average coverage rate. We prefer lower inverse efficiency because it means the PI could achieve the desired coverage with narrower intervals.

For a good PI, we would like the first two metrics to be as close to 90% (our target) as possible, and the last metric to be small. Note that if we go back to Figure 1, B and C both have a 90% average coverage rate, but only C, being longitudinally valid, has a high tail coverage rate.
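Concretely, the three metrics can be computed as below (a sketch assuming `covered` is a boolean numpy array of shape (num_series, T) and `widths` holds the matching finite PI widths; infinitely wide PIs would need special handling):

```python
import numpy as np

def evaluate_pis(covered, widths, tail_frac=0.1):
    """covered: (num_series, T) bool; widths: (num_series, T) PI widths."""
    avg_cov = covered.mean()                        # 1. average coverage rate
    per_series = covered.mean(axis=1)               # coverage of each individual series
    k = max(1, int(tail_frac * len(per_series)))
    tail_cov = np.sort(per_series)[:k].mean()       # 2. least-covered 10% of series
    inv_eff = widths.mean() / avg_cov               # 3. inverse efficiency
    return avg_cov, tail_cov, inv_eff
```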

In Tables 1, 2, and 3, we present the results. Both TQA-B and TQA-E significantly improve the tail coverage rate over the baselines. TQA-B also maintains very competitive efficiency.

Table 1: Average coverage rate. All conformal methods (TQA and middle columns) achieve the 90% target coverage rate.
Table 2: Tail coverage rate. TQA-B does not generate infinitely wide PIs, so it tends to achieve a lower tail coverage rate than TQA-E. However, both significantly outperform all baselines.
Table 3: Inverse efficiency. TQA-B maintains competitive efficiency, despite the fact that PIs must widen very quickly to cover extreme outliers (see our paper for more discussion).

Conclusion

We proposed Temporal Quantile Adjustment (TQA) to create prediction intervals for time series forecasting with a cross-section. TQA belongs to the framework of conformal prediction, and its main idea is to adjust the quantile to query using the temporal information collected so far. This allows TQA to work with any model and any nonconformity score design. Please check out our paper [3] for more details, including the theoretical guarantees and a comparison of a few alternative ways to perform the adjustment. We also include a demo notebook demonstrating how to apply TQA to any model you already have!

References

[1] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.

[2] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, 2021.

[3] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal Prediction with Temporal Quantile Adjustments. In Advances in Neural Information Processing Systems, 2022. arXiv

Footnotes

  1. Note that the quantiles technically can only take discrete values in {0, 1, …, N+1}/(N+1), so it is not possible to pick such a quantile for an arbitrary continuous-valued β. To achieve the desired result exactly, we will need to use the “smoothed conformal predictors” in [1].
  2. COVID could be viewed as a synchronous dataset as well (if we ignore the asynchronous update part). Here, we just followed a baseline paper (CFRNN) and used it for evaluation purposes.
