Kernel-Smoothed ROC Curves with AUC and Youden Index Summary
This function estimates a smooth receiver operating characteristic (ROC) curve for a continuous biomarker by kernel smoothing the cumulative distribution functions (CDFs) of the non-diseased and diseased groups. It returns the smooth ROC curve, a kernel-based estimator of the area under the curve (AUC) with a confidence interval, and the kernel-smoothed Youden index with its confidence interval and optimal cutoff.
Usage
smoothROC(
  data,
  biomarker,
  status,
  diseased,
  kernel = c("gaussian", "biweight", "epanechnikov"),
  bw_method = c("pdf", "AL", "PB", "BHP", "AR"),
  alpha = 0.05,
  logtrans = FALSE,
  grid_n = 2000
)
# S3 method for class 'smoothROC'
print(x, ...)
# S3 method for class 'smoothROC'
plot(x, label = TRUE, ...)
# S3 method for class 'smoothROC'
summary(object, ...)
Arguments
- data
A data frame containing the biomarker and status variables.
- biomarker
Character string; name of the numeric column containing biomarker values.
- status
Character string; name of the column containing binary disease status.
- diseased
The value in status indicating the diseased class.
- kernel
Character string; kernel function for smoothing. One of "gaussian", "biweight", or "epanechnikov".
- bw_method
Character string; bandwidth selection method. One of "pdf" (Silverman rule for density), "AL" (Altman-Leger), "PB" (Polansky-Baker multistage), "BHP" (normal-reference), or "AR" (rule-of-thumb).
- alpha
Numeric; significance level for (1 - alpha) confidence intervals (default: 0.05).
- logtrans
Logical; apply log-transformation to biomarker values? (default: FALSE).
- grid_n
Integer; number of grid points for evaluating the ROC curve (default: 2000).
Value
An object of class "smoothROC" with components that include:
- curve: true- and false-positive rates on the grid of cutoff values (thresholds).
- auc: kernel-smoothed AUC estimate.
- auc_ci: confidence interval for the AUC based on the kernel-smoothed DeLong-type variance.
- youden, youden_ci: Youden index estimate and confidence interval.
- cutoff: estimated optimal cutoff associated with the Youden index.
- sensitivity, specificity: estimated sensitivity and specificity at the Youden point.
- gg: plot of the ROC curve.
- bw_x, bw_y: selected CDF bandwidths for the non-diseased and diseased groups.
Print, plot, and summary methods are available for objects of class
"smoothROC".
Details
Let \(X\) and \(Y\) denote biomarker values from non-diseased and diseased subjects with CDFs \(F\) and \(G\), and survival functions \(\bar F = 1 - F\) and \(\bar G = 1 - G\). The ROC curve is defined as $$ROC(p) = \bar G\{\bar F^{-1}(p)\}, \quad p \in [0,1],$$ that is, the true positive rate plotted against the false positive rate as the threshold varies over the real line. The AUC is the integral of the ROC curve over \([0,1]\) and can be written as $$AUC = \int_{-\infty}^{\infty} F(x)\, dG(x) = P(Y > X),$$ representing the probability that a randomly chosen diseased subject has a higher biomarker value than a randomly chosen non-diseased subject.
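The identity \(AUC = P(Y > X)\) can be checked numerically. The following is a minimal sketch on simulated data (the samples x and y are illustrative assumptions, not package data): it estimates the AUC as the proportion of (diseased, non-diseased) pairs with \(Y > X\) and compares the result with the Mann-Whitney statistic from stats::wilcox.test.

```r
# Simulated biomarker values (illustrative only, not package data)
set.seed(1)
x <- rnorm(50, mean = 0)  # non-diseased group
y <- rnorm(40, mean = 1)  # diseased group

# Empirical AUC: proportion of pairs with Y > X (ties count one half)
d <- outer(y, x, "-")
auc_emp <- mean((d > 0) + 0.5 * (d == 0))

# The Mann-Whitney statistic W counts the same pairs, so W / (m * n) is the AUC
w <- unname(wilcox.test(y, x, exact = FALSE)$statistic)
auc_mw <- w / (length(x) * length(y))

c(pairwise = auc_emp, mann_whitney = auc_mw)
```

Both numbers agree, which is the empirical counterpart of the integral representation above.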
Nonparametric ROC estimation based on the empirical distributions of \(X\) and \(Y\) yields a stepwise ROC curve that may be jagged and sensitive to sampling variability, especially in small or moderate samples. To obtain a smooth ROC curve while avoiding parametric distributional assumptions, this function applies kernel-based CDF estimators of the form $$\hat F(x) = \frac{1}{m} \sum_{i=1}^m K\!\left(\frac{x - X_i}{h_m}\right), \quad \hat G(x) = \frac{1}{n} \sum_{j=1}^n K\!\left(\frac{x - Y_j}{h_n}\right),$$ where \(K\) is the integrated kernel and \(h_m\), \(h_n\) are bandwidths for the non-diseased and diseased groups. The smooth ROC curve is then obtained by plugging \(\hat F\) and \(\hat G\) into the ROC functional.
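As a rough illustration of the plug-in construction, the sketch below evaluates Gaussian-kernel CDF estimates on a grid of thresholds and converts them to false- and true-positive rates. The helper kcdf(), the simulated data, and the fixed bandwidths are assumptions made for this example; they are not the internals of smoothROC().

```r
set.seed(1)
x <- rnorm(50, mean = 0)  # non-diseased (simulated)
y <- rnorm(40, mean = 1)  # diseased (simulated)

# Kernel CDF estimate: average of integrated kernels K centred at the data
# (hypothetical helper; K defaults to the standard normal CDF)
kcdf <- function(t, obs, h, K = pnorm) {
  vapply(t, function(ti) mean(K((ti - obs) / h)), numeric(1))
}

hx <- 0.4; hy <- 0.4  # illustrative bandwidths, not chosen by any rule
pad <- 3 * max(hx, hy)
grid <- seq(min(x, y) - pad, max(x, y) + pad, length.out = 500)

# Survival functions give the two coordinates of the smooth ROC curve
fpr <- 1 - kcdf(grid, x, hx)  # 1 - F_hat(t)
tpr <- 1 - kcdf(grid, y, hy)  # 1 - G_hat(t)

# plot(fpr, tpr, type = "l") would draw the smooth curve
```

Because each integrated kernel is a smooth, nondecreasing function, both rates decrease smoothly from about 1 to about 0 as the threshold increases.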
Bandwidth selection is critical for balancing bias and variance in the smoothed CDFs and the resulting ROC curve. The argument bw_method implements several rules:
- "pdf": a density-based rule-of-thumb (Silverman) that uses a kernel density bandwidth and is convenient but does not satisfy the usual asymptotic conditions for CDF estimation.
- "BHP": a CDF-based normal-reference bandwidth that minimizes an approximation to the integrated mean squared error of \(\hat F\), with a robust scale estimate based on min(SD, IQR/1.34).
- "AR": an adjusted CDF reference bandwidth for the Gaussian kernel, obtained by shrinking the normal-reference constant to improve performance for non-Gaussian data while preserving the \(m^{-1/3}\) CDF rate.
- "AL": a fully data-driven CDF-based bandwidth in the spirit of Altman and Leger, where the unknown roughness functional is estimated via an auxiliary kernel estimator.
- "PB": a multistage plug-in CDF-based bandwidth (two-stage version of Polansky and Baker) that uses a pilot normal-reference step followed by data-driven refinement.
These choices focus on optimal smoothing of the CDFs rather than the density, which is more directly aligned with ROC curve estimation.
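The difference between density-rate and CDF-rate bandwidths can be sketched as follows. The robust scale min(SD, IQR/1.34) and the Silverman rule match stats::bw.nrd0. The CDF rule uses the constant \(4^{1/3}\), which is the value that minimizes the asymptotic integrated mean squared error of a Gaussian-kernel CDF estimate under a normal reference distribution; whether the "BHP" option uses exactly this constant is an assumption, and the function names below are hypothetical.

```r
set.seed(1)
x <- rexp(60)  # illustrative skewed sample

# Robust scale shared by the normal-reference rules
robust_scale <- function(v) min(sd(v), IQR(v) / 1.34)

# Density rule of thumb (Silverman): shrinks at rate n^(-1/5)
bw_pdf <- function(v) 0.9 * robust_scale(v) * length(v)^(-1 / 5)

# CDF normal-reference rule for a Gaussian kernel: shrinks at rate n^(-1/3)
# (the constant 4^(1/3) is an assumption about the rule, see the text above)
bw_cdf_nr <- function(v) 4^(1 / 3) * robust_scale(v) * length(v)^(-1 / 3)

c(pdf = bw_pdf(x), cdf_normal_reference = bw_cdf_nr(x))
```

The two rules differ in their rate: as the sample size grows, the CDF bandwidth shrinks faster than the density bandwidth, which is the asymptotic condition that "pdf" does not meet.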
The smooth ROC estimators in this package are based on kernel CDF estimators constructed from a univariate kernel function \(k(u)\) and its integral \(K(u) = \int_{-\infty}^u k(v)\,dv\). The following kernels are implemented:
Gaussian kernel: \(k(u) = (2\pi)^{-1/2}\exp(-u^2/2)\), with CDF \(K(u)\) equal to the standard normal distribution function. This kernel has infinite support and is often used as a default choice in smooth distribution and ROC estimation.
Epanechnikov kernel: \(k(u) = \tfrac{3}{4}(1-u^2)\mathbf{1}_{\{|u|\le 1\}}\), with compact support on \([-1,1]\) and optimal second‑order efficiency under many mean squared error criteria.
Biweight kernel: \(k(u) = \tfrac{15}{16}(1-u^2)^2\mathbf{1}_{\{|u|\le 1\}}\), a smoother alternative with support on \([-1,1]\). It is still a second-order kernel, but its first derivative is continuous at the edges of the support, which produces more rounded ROC and CDF estimates near the boundaries.
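For reference, the integrated kernels \(K(u)\) have simple closed forms, obtained by integrating the polynomials above term by term. The sketch below writes them out and checks one value against numerical integration; the function names are hypothetical and not necessarily those used inside the package.

```r
# Integrated kernels K(u) = integral of k(v) from -Inf to u
K_gaussian <- function(u) pnorm(u)

K_epanechnikov <- function(u) {
  u <- pmin(pmax(u, -1), 1)            # K is 0 below -1 and 1 above 1
  0.5 + 0.75 * u - 0.25 * u^3
}

K_biweight <- function(u) {
  u <- pmin(pmax(u, -1), 1)
  0.5 + (15 * u - 10 * u^3 + 3 * u^5) / 16
}

# Numerical check of the Epanechnikov closed form at u = 0.3
k_epan <- function(v) 0.75 * (1 - v^2)
c(closed_form = K_epanechnikov(0.3),
  numerical   = integrate(k_epan, -1, 0.3)$value)
```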
The AUC is estimated from the kernel CDFs via $$\hat \delta = \int_{-\infty}^{\infty} \hat F(x)\, d\hat G(x),$$ and is asymptotically equivalent to the empirical AUC based on the Mann–Whitney statistic. To quantify uncertainty, a kernel-smoothed analogue of DeLong's variance estimator is used: empirical placement values are replaced by their kernel-smoothed counterparts, producing a more stable variance estimate in small samples while retaining the large-sample properties of the Mann–Whitney-based estimator.
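For the Gaussian kernel the integral has a closed form: each pair contributes \(\Phi\{(Y_j - X_i)/\sqrt{h_m^2 + h_n^2}\}\), so \(\hat\delta\) is the average of these smoothed indicators. The sketch below computes it on simulated data and forms a DeLong-type variance from the smoothed placement values. The data, the bandwidths, and the exact variance expression are assumptions for illustration; the package may differ in detail.

```r
set.seed(1)
x <- rnorm(50, mean = 0)  # non-diseased (simulated)
y <- rnorm(40, mean = 1)  # diseased (simulated)
hx <- 0.4; hy <- 0.4      # illustrative bandwidths
m <- length(x); n <- length(y)

# Smoothed indicator for every pair: rows index Y_j, columns index X_i
psi <- pnorm(outer(y, x, "-") / sqrt(hx^2 + hy^2))
auc_smooth <- mean(psi)

# Smoothed placement values, one per subject
v10 <- rowMeans(psi)  # diseased subjects
v01 <- colMeans(psi)  # non-diseased subjects

# DeLong-type variance and Wald-type 95% confidence interval
var_auc <- var(v10) / n + var(v01) / m
auc_ci <- auc_smooth + c(-1, 1) * qnorm(0.975) * sqrt(var_auc)
```

Replacing pnorm() by the indicator of a positive difference recovers the usual empirical AUC and DeLong variance.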
The Youden index is defined as $$J = \max_t \{\mathrm{sensitivity}(t) + \mathrm{specificity}(t) - 1\} = \max_t \{F(t) - G(t)\},$$ with corresponding optimal cutoff \(t_0 = \arg\max_t \{F(t) - G(t)\}\). Using the kernel CDFs, the function computes a smoothed Youden index \(\hat J\) and its maximizing cutoff \(\hat t_0\) on a search grid. When multiple cutoffs achieve the same maximum, secondary criteria can be applied (e.g., favoring higher sensitivity or higher specificity), or the median of all maximizers can be reported. A Delta-method approximation is used for the variance of \(\hat J\), from which a Wald-type confidence interval is obtained.
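The grid search for the smoothed Youden index can be sketched as below, again with simulated data, fixed bandwidths, and a hypothetical helper kcdf(). Ties are resolved here by taking the median of all maximizers, which is only one of the options mentioned above; the variance step is not shown.

```r
set.seed(1)
x <- rnorm(50, mean = 0)  # non-diseased (simulated)
y <- rnorm(40, mean = 1)  # diseased (simulated)
hx <- 0.4; hy <- 0.4      # illustrative bandwidths

# Gaussian-kernel CDF estimate (hypothetical helper)
kcdf <- function(t, obs, h) {
  vapply(t, function(ti) mean(pnorm((ti - obs) / h)), numeric(1))
}

# Evaluate F_hat(t) - G_hat(t) on a search grid and locate its maximum
grid <- seq(min(x, y), max(x, y), length.out = 2000)
J_grid <- kcdf(grid, x, hx) - kcdf(grid, y, hy)
J_hat <- max(J_grid)
t0_hat <- median(grid[J_grid == J_hat])  # median of all maximizers

# Operating characteristics at the estimated cutoff
sens <- 1 - kcdf(t0_hat, y, hy)
spec <- kcdf(t0_hat, x, hx)
```

By construction, sens + spec - 1 reproduces J_hat at the selected cutoff, up to the grid resolution.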
References
Khan, R. A., & Ghebremichael, M. (2025). Smooth ROC Curve Estimation. Journal Name. (Preprint)
Zou, K. H., Hall, W. J., & Shapiro, D. E. (1997). Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statistics in Medicine, 16(19), 2143-2156.
Lloyd, C. J. (1998). Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association, 93(444), 1356-1364.
Zhou, X.-H., & Harezlak, J. (2002). Comparison of bandwidth selection methods for kernel smoothing of ROC curves. Statistics in Medicine, 21(14), 2045-2055.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837-845.
Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32-35.
Altman, N., & Leger, C. (1995). Bandwidth selection for kernel distribution function estimation. Journal of Statistical Planning and Inference, 46(2), 195-214.
Bowman, A., Hall, P., & Prvan, T. (1998). Bandwidth selection for the smoothing of distribution functions. Biometrika, 85(4), 799-808.
Polansky, A. M., & Baker, E. R. (2000). Multistage plug-in bandwidth selection for kernel distribution function estimates. Journal of Statistical Computation and Simulation, 65(1-4), 63-80.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Andrews, D. F., & Herzberg, A. M. (2012). Data: a collection of problems from many fields for the student and research worker. Springer Science & Business Media.
The example dataset dystrophy (alias: dystrophyData) contains serum biomarkers for Duchenne muscular dystrophy carriers and non-carriers and is used to illustrate kernel-smoothed ROC analysis.
Examples
data(dystrophy)
roc <- smoothROC(
  data = dystrophy,
  biomarker = "CK",
  status = "Class",
  diseased = "carrier",
  kernel = "biweight",
  bw_method = "PB",
  alpha = 0.05,
  logtrans = TRUE,
  grid_n = 2000
)
## Basic use
print(roc) # Summary print method
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
summary(roc) # Summary method
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
plot(roc) # ROC plot with annotation label
plot(roc, label = FALSE) # ROC plot without annotation label
## Optional: direct access to list components
roc$plot # ggplot object for the ROC curve
head(roc$curve) # Threshold, FPR, TPR, and J on the grid
#> threshold FPR TPR J
#> 1 -0.2326416 1 1 0
#> 2 -0.2274720 1 1 0
#> 3 -0.2223023 1 1 0
#> 4 -0.2171326 1 1 0
#> 5 -0.2119629 1 1 0
#> 6 -0.2067933 1 1 0
roc$AUC # Kernel-smoothed AUC estimate
#> [1] 0.8646697
roc$AUC_ci # Confidence interval for AUC
#> [1] 0.8105293 0.9188102
roc$J # Youden index estimate
#> [1] 0.5723741
roc$J_ci # Confidence interval for the Youden index
#> [1] 0.4542642 0.6904840
roc$t0 # Estimated optimal cutoff (Youden point)
#> [1] 4.068528
roc$sensitivity # Sensitivity at the Youden point
#> [1] 0.7036682
roc$specificity # Specificity at the Youden point
#> [1] 0.8687059
roc$hX # Selected CDF bandwidth for the non-diseased group
#> [1] 0.3146762
roc$hY # Selected CDF bandwidth for the diseased group
#> [1] 0.9802306