Title: | Quantile Recalibration for Regression Models |
---|---|
Description: | Enables diagnostics and enhancement of regression model calibration. It offers both global and local visualization tools for calibration diagnostics and provides one recalibration method: Torres R, Nott DJ, Sisson SA, Rodrigues T, Reis JG, Rodrigues GS (2024) <doi:10.48550/arXiv.2403.05756>. The method leverages Probability Integral Transform (PIT) values to both evaluate and perform the calibration of statistical models. For a more detailed description of the package, please refer to the bachelor's thesis available below. |
Authors: | Carolina Musso [aut, cre, cph] |
Maintainer: | Carolina Musso <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.1 |
Built: | 2025-02-18 07:14:14 UTC |
Source: | https://github.com/cmusso86/recalibratinn |
Visualizes the predicted vs. empirical cumulative distributions of PIT-values using ggplot.
This function creates a ggplot graph that compares the cumulative distributions of predicted and empirical Probability Integral Transform (PIT) values. It shows the calibration quality of a regression model by examining how well the predicted values conform to the observed values.
gg_CD_global(pit, ycal, yhat, mse)
pit |
Numeric vector of global PIT-values. It is recommended to calculate these using the PIT_global() function from this package. |
ycal |
Numeric vector representing the true observations (y-values) of the response variable from the calibration dataset. |
yhat |
Numeric vector of predicted response (y-hat-values) on the calibration dataset. |
mse |
Mean Squared Error calculated from the calibration dataset. |
A ggplot object displaying a point graph of the empirical versus predicted cumulative distributions of PIT-values.
n <- 10000
split <- 0.8

# generating heteroscedastic data
mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

pit <- PIT_global(y_cal, y_hat, MSE_cal)
gg_CD_global(pit, y_cal, y_hat, MSE_cal)
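Since the plot contrasts cumulative distributions, the same comparison can be sketched in base R. This is a minimal illustration, assuming the plot contrasts the empirical CDF of the PIT-values with the uniform CDF expected under perfect calibration (an assumption about the plot's construction, not its exact internals):

# Base-R sketch of the comparison gg_CD_global() draws (assumed, not
# the package's exact internals): empirical CDF of PIT-values vs. the
# uniform CDF expected under perfect calibration.
p_grid <- seq(0, 1, length.out = 100)
plot(p_grid, ecdf(pit)(p_grid), type = "l",
     xlab = "predicted cumulative distribution",
     ylab = "empirical cumulative distribution")
abline(0, 1, lty = 2)  # identity line = perfect calibration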
This function generates a ggplot visual representation to compare the predicted versus empirical cumulative distributions of Probability Integral Transform (PIT) values at a local level. It is useful for diagnosing the calibration in different regions within the dataset, since miscalibration patterns may differ across the covariate space. The function allows for customization of the plot layers to suit specific needs. For advanced customization of the plot layers, refer to the ggplot2 User Guide.
gg_CD_local(
  pit_local,
  mse,
  psz = 0.01,
  abline = "black",
  pal = "Set2",
  facet = FALSE,
  ...
)
pit_local |
A data frame of local PIT-values, typically obtained from the PIT_local() function. |
mse |
Mean Squared Error calculated from the calibration dataset. |
psz |
Double indicating the size of the points on the plot. Default is 0.01. |
abline |
Color of the diagonal line. Default is "black". |
pal |
Palette name from RColorBrewer for coloring the plot. Default is "Set2". |
facet |
Logical value indicating if a separate visualization for each subgroup is preferred. Default is FALSE. |
... |
Additional parameters to customize the ggplot. |
This function works with the output of the PIT_local() function, which provides the PIT-values for each subgroup of the covariate space in the appropriate format.
A ggplot object displaying the cumulative distributions of PIT-values, which can be customized as needed.
n <- 10000
split <- 0.8

mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

pit_local <- PIT_local(xcal = x_cal, ycal = y_cal, yhat = y_hat, mse = MSE_cal)
gg_CD_local(pit_local, mse = MSE_cal)
gg_CD_local(pit_local, facet = TRUE, mse = MSE_cal)
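Because the return value is a standard ggplot object, further layers can be appended with ggplot2 functions; the label below is purely illustrative:

# Append an illustrative title layer to the returned ggplot object.
library(ggplot2)
gg_CD_local(pit_local, mse = MSE_cal) +
  labs(title = "Local calibration diagnostics")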
This function generates a ggplot visual representation of the density of Probability Integral Transform (PIT) values globally. For advanced customization of the plot layers, refer to the ggplot2 User Guide.
gg_PIT_global(
  pit,
  type = "density",
  fill = "steelblue4",
  alpha = 0.8,
  print_p = TRUE
)
pit |
Vector of PIT values to be plotted. |
type |
Character string specifying the type of plot: either "density" or "histogram". This determines the representation style of the PIT values. |
fill |
Character string defining the fill color of the plot. Default is 'steelblue4'. |
alpha |
Numeric value for the opacity of the plot fill, with 0 being fully transparent and 1 being fully opaque. Default is 0.8. |
print_p |
Logical value indicating whether to print the p-value from the Kolmogorov-Smirnov test. Useful for statistical diagnostics. |
This function also tests the PIT-values for uniformity using the Kolmogorov-Smirnov test (ks.test). The p-value from the test is printed on the plot if print_p is set to TRUE.
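The same diagnostic can be run directly: under perfect calibration the PIT-values are a sample from Uniform(0, 1), so the test reduces to a one-sample ks.test against "punif":

# Kolmogorov-Smirnov test of the PIT-values against Uniform(0, 1);
# a small p-value signals miscalibration.
stats::ks.test(pit, "punif")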
A ggplot object depicting a density graph of PIT-values, which can be further customized.
n <- 10000
split <- 0.8

# generating heteroscedastic data
mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

pit <- PIT_global(ycal = y_cal, yhat = y_hat, mse = MSE_cal)
gg_PIT_global(pit)
A function based on ggplot2 to observe the density of PIT-values locally. It is recommended to use PIT-values obtained via the PIT_local function from this package, or an object of equivalent format. For advanced customization of the plot layers, refer to the ggplot2 User Guide. This function also tests the PIT-values for uniformity using the Kolmogorov-Smirnov test (ks.test). The p-value from the test is printed on the plot if facet is set to TRUE.
gg_PIT_local(
  pit_local,
  alpha = 0.4,
  linewidth = 1,
  pal = "Set2",
  facet = FALSE
)
pit_local |
A tibble with five columns: "part", "y_cal", "y_hat", "pit", and "n", representing the partitions, calibration data, predicted values, PIT-values, and the count of observations, respectively. |
alpha |
Numeric value between 0 and 1 indicating the transparency of the plot fill. Default is set to 0.4. |
linewidth |
Integer specifying the linewidth of the density line. Default is set to 1. |
pal |
A character string specifying the RColorBrewer palette to be used for coloring the plot. Default is "Set2". |
facet |
Logical indicating whether to display each subgroup in a separate facet. Default is FALSE. |
A ggplot object representing the local density distributions of PIT-values, which can be further customized through ggplot2 functions.
n <- 10000

mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 2, 20)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*0.8)]
y_train <- y[1:(n*0.8)]
x_cal <- x[(n*0.8+1):n]
y_cal <- y[(n*0.8+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

pit_local <- PIT_local(xcal = x_cal, ycal = y_cal, yhat = y_hat, mse = MSE_cal)
gg_PIT_local(pit_local)
gg_PIT_local(pit_local, facet = TRUE)
A function to calculate the Probability Integral Transform (PIT) values for any fitted model that assumes a normal distribution of the output.
PIT_global(ycal, yhat, mse)
ycal |
Numeric vector representing the true observations (y-values) of the response variable from the calibration dataset. |
yhat |
Numeric vector of predicted y-values on the calibration dataset. |
mse |
Mean Squared Error calculated from the calibration dataset. |
This function is designed to work with models that, even if only implicitly, assume a normal distribution of the response variable. This includes, but is not limited to, linear models created using lm() or neural networks trained with Mean Squared Error as the loss function. In these models, ordinary least squares (OLS) is used to minimize the residuals. This mathematical optimization also yields a probabilistic optimization when a normal distribution of the response variable is assumed, since OLS and maximum likelihood estimation are equivalent under normality. Therefore, in order to render a probabilistic interpretation of the predictions, the model is intrinsically assuming a normal distribution of the response variable.
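Conceptually, each PIT-value is the predictive CDF evaluated at the observed response; under the normality assumption above this reduces to the following minimal sketch (an assumed equivalent of what PIT_global() computes, not its verbatim source):

# Sketch of the PIT computation under normality: evaluate the Gaussian
# predictive CDF N(yhat, mse) at each observed ycal.
pit_sketch <- function(ycal, yhat, mse) {
  pnorm(ycal, mean = yhat, sd = sqrt(mse))
}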
Returns a numeric vector of PIT-values.
n <- 10000
split <- 0.8

# generating heteroscedastic data
mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

PIT_global(ycal = y_cal, yhat = y_hat, mse = MSE_cal)
This function calculates local Probability Integral Transform (PIT) values using localized subregions of the covariate space from the calibration set. The output can then be used to visualize calibration quality with the gg_CD_local() and gg_PIT_local() functions.
PIT_local(
  xcal,
  ycal,
  yhat,
  mse,
  clusters = 6,
  p_neighbours = 0.2,
  PIT = PIT_global
)
xcal |
Numeric matrix or data frame of features/covariates (x-values) from the calibration dataset. |
ycal |
Numeric vector representing the true observations (y-values) of the response variable from the calibration dataset. |
yhat |
Numeric vector of predicted response (y-hat-values) from the calibration dataset. |
mse |
Mean Squared Error calculated from the calibration dataset. |
clusters |
Integer specifying the number of partitions to create for local calibration using the k-means method. Default is set to 6. |
p_neighbours |
Proportion of xcal used to localize neighbors in the KNN method. Default is 0.2. |
PIT |
Function used to calculate the PIT-values. Default is PIT_global. |
It calculates local Probability Integral Transform (PIT) values over localized subregions of the covariate space from the calibration set. The centroids of these regions are obtained with the k-means clustering method (from the stats package), and the local area around each centroid is defined through an approximate k-nearest-neighbors search from the RANN package. Then, for each subregion, the PIT-values are calculated using the PIT function provided by the user. At the moment this function is tested to work with the PIT_global() function from this package, which assumes a Gaussian distribution; eventually, it can be used with other distributions.
A tibble with five columns containing unique names for each partition ("part"), "y_cal" (true observations), "y_hat" (predicted values), "pit" (PIT-values), and "n" (number of neighbors) for each partition.
n <- 10000
split <- 0.8

mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

model <- lm(y_train ~ x_train)
y_hat <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat - y_cal)^2)

PIT_local(xcal = x_cal, ycal = y_cal, yhat = y_hat, mse = MSE_cal)
This function offers recalibration techniques for regression models that assume a Gaussian distribution, that is, models trained with the Mean Squared Error (MSE) loss function. Based on the work by Torres R. et al. (2024), it supports both local and global recalibration approaches to provide samples from a recalibrated predictive distribution. A detailed algorithm can also be found in Musso C. (2023).
recalibrate(
  yhat_new,
  pit_values,
  mse,
  space_cal = NULL,
  space_new = NULL,
  type = c("local", "global"),
  p_neighbours = 0.1,
  epsilon = 0
)
yhat_new |
Numeric vector with predicted response values for the new (or test) set. |
pit_values |
Numeric vector of Global Probability Integral Transform (PIT) values calculated on the calibration set. We recommend using the PIT_global function. |
mse |
Mean Squared Error calculated from the calibration/validation set. |
space_cal |
Numeric matrix or data frame representing the covariates/features of the calibration/validation set, or any intermediate representation (like an intermediate layer of a neural network). |
space_new |
Similar to space_cal, but for a new set of covariates/features, ensuring they are in the same space as those in space_cal for effective local recalibration. |
type |
Character string to choose between 'local' or 'global' calibration. |
p_neighbours |
Proportion (0,1] of the calibration dataset to be considered for determining the number of neighbors in the KNN method. Default is set to 0.1. With p_neighbours=1, calibration is global but weighted by distance. |
epsilon |
Numeric value for approximation in the K-nearest neighbors (KNN) method. Default is 0, indicating exact distances. |
The calibration technique implemented here draws inspiration from Approximate Bayesian Computation and the Inverse Transform Theorem, allowing for recalibration either locally or globally. The global method employs a uniform kernel, while the local method employs an Epanechnikov kernel.
It is important to note that the least squares method only yields a probabilistic interpretation if the modeled output follows a normal distribution, and this assumption was used to implement this function.
The local recalibration method is expected to improve the predictive performance of the model, especially when the model is not able to capture the heteroscedasticity of the data. However, there is a trade-off between the refinement of localization and the Monte Carlo error, which can be controlled by the number of neighbors: the more localized the recalibration, the better it captures local changes, but the Monte Carlo error increases because of the reduced number of neighbors.
When p_neighbours = 1, recalibration is performed using the entire calibration dataset, but with distance-weighted contributions.
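A minimal sketch of the global variant under these assumptions (a hypothetical helper, not the package's verbatim internals): resample PIT-values from the calibration set and push them through the inverse Gaussian predictive CDF of each new prediction.

# Hypothetical sketch of global recalibration via the inverse transform;
# the package's actual implementation may differ.
recalibrate_global_sketch <- function(yhat_new, pit_values, mse, n_draws = 500) {
  sapply(yhat_new, function(yh) {
    u <- sample(pit_values, n_draws, replace = TRUE)  # empirical PIT draws
    qnorm(u, mean = yh, sd = sqrt(mse))               # inverse predictive CDF
  })
}
# For the local method, neighboring PIT-values would instead be weighted
# with an Epanechnikov kernel, w(d) = 0.75 * (1 - (d/h)^2) for d < h.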
A list containing the calibrated predicted mean and variance, along with samples from the recalibrated predictive distribution and their respective weights calculated using an Epanechnikov kernel over the distances obtained from KNN.
Torres R, Nott DJ, Sisson SA, Rodrigues T, Reis JG, Rodrigues GS (2024). "Model-Free Local Recalibration of Neural Networks." arXiv preprint arXiv:2403.05756. doi:10.48550/arXiv.2403.05756.
Musso C (2023). "Recalibration of Gaussian Neural Network Regression Models: The RecalibratiNN Package." Undergraduate Thesis (Bachelor in Statistics), University of Brasília. https://bdm.unb.br/handle/10483/38504.
n <- 1000
split <- 0.8

# Auxiliary functions
mu <- function(x1){ 10 + 5*x1^2 }
sigma_v <- function(x1){ 30*x1 }

# Generating heteroscedastic data.
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

# Train set
x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]

# Calibration/Validation set.
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

# New observations or the test set.
x_new <- runif(n/5, 1, 10)

# Fitting a simple linear regression, which will not capture the heteroscedasticity
model <- lm(y_train ~ x_train)

y_hat_cal <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat_cal - y_cal)^2)
y_hat_new <- predict(model, newdata = data.frame(x_train = x_new))

pit <- PIT_global(ycal = y_cal, yhat = y_hat_cal, mse = MSE_cal)

recalibrate(
  space_cal = x_cal,
  space_new = x_new,
  yhat_new = y_hat_new,
  pit_values = pit,
  mse = MSE_cal,
  type = "local"
)