Kernel-Smoothed ROC Curves with smoothROC
Ruhul Ali Khan & Musie Ghebremichael
2025-11-24
Introduction
In medical diagnostics, a central task is to evaluate how accurately a continuous biomarker distinguishes between diseased and non-diseased individuals. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are standard tools for this purpose. The ROC curve summarizes the trade-off between sensitivity and specificity over all possible decision thresholds, while the AUC provides a single-number summary of overall discriminative ability (the probability that a randomly chosen diseased subject has a higher biomarker value than a randomly chosen non-diseased subject).
Let $X$ and $Y$ denote biomarker values from non-diseased and diseased subjects with cumulative distribution functions (CDFs) $F$ and $G$, and survival functions $S_F = 1 - F$ and $S_G = 1 - G$. The ROC curve can be written as
$$\mathrm{ROC}(t) = S_G\left(S_F^{-1}(t)\right), \quad 0 \le t \le 1,$$
and the AUC as
$$\mathrm{AUC} = \int_0^1 \mathrm{ROC}(t)\,dt = P(Y > X).$$
Empirical ROC curves, constructed directly from the empirical CDFs of $X$ and $Y$, are fully nonparametric but stepwise and potentially jagged, especially in small or moderate samples. Parametric ROC models are smooth but require strong distributional assumptions that may be unrealistic in practice.
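To make the empirical baseline concrete, the following base-R sketch builds the stepwise ROC points and the empirical AUC from simulated data. The data and all variable names are illustrative assumptions and are not part of smoothROC.

```r
# Simulated biomarker values (illustrative only, not package data)
set.seed(1)
x <- rnorm(50, mean = 0, sd = 1)   # non-diseased
y <- rnorm(40, mean = 1, sd = 1)   # diseased

# Empirical ROC points: at each threshold, FPR = P(X >= c), TPR = P(Y >= c)
thresholds <- sort(unique(c(x, y)), decreasing = TRUE)
fpr <- c(0, vapply(thresholds, function(c0) mean(x >= c0), numeric(1)))
tpr <- c(0, vapply(thresholds, function(c0) mean(y >= c0), numeric(1)))

# Empirical AUC as the Mann-Whitney probability P(Y > X) + 0.5 * P(Y = X)
diffs <- outer(y, x, "-")
auc_mw <- mean((diffs > 0) + 0.5 * (diffs == 0))

# The same quantity as the trapezoidal area under the stepwise ROC points
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

c(mann_whitney = auc_mw, trapezoid = auc_trap)
```

Plotting `fpr` against `tpr` with `type = "s"` shows the jagged step shape that kernel smoothing is meant to remove.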
The smoothROC package implements a
kernel-based, distribution-free approach that produces smooth
ROC curves, a kernel-based AUC estimator with confidence intervals, and
a smooth Youden index summary with an associated optimal cutoff. The
core function is smoothROC(), which this vignette
introduces and illustrates.
The smoothROC() function
The main user-facing function is:
smoothROC(
  data,
  biomarker,
  status,
  diseased,
  kernel = c("gaussian", "biweight", "epanechnikov"),
  bw_method = c("pdf", "AL", "PB", "BHP", "AR"),
  alpha = 0.05,
  logtrans = FALSE,
  grid_n = 2000
)
It estimates a kernel-smoothed ROC curve for a continuous biomarker using kernel CDF estimators in the non-diseased and diseased groups. It returns:
- a smooth ROC curve on a fine grid of thresholds,
- a kernel-based estimator of the AUC with a confidence interval, and
- a kernel-smoothed Youden index with its optimal cutoff and confidence interval.
Arguments
data
A data frame containing the biomarker and status variables.
biomarker
Character string; name of the numeric column containing biomarker values.
status
Character string; name of the column containing binary disease status.
diseased
The value in status indicating the diseased class (e.g. "carrier").
kernel
Character string; kernel function for smoothing. One of "gaussian", "biweight", or "epanechnikov".
bw_method
Character string; bandwidth selection method. One of:
- "pdf" – density-based rule-of-thumb (Silverman),
- "BHP" – CDF-based normal-reference (Bowman–Hall–Prvan),
- "AR" – adjusted CDF reference bandwidth,
- "AL" – Altman–Leger CDF plug-in,
- "PB" – Polansky–Baker multistage plug-in.
alpha
Numeric; significance level for confidence intervals (default: 0.05).
logtrans
Logical; if TRUE, applies a log-transformation to biomarker values prior to ROC estimation (useful for right-skewed biomarkers). Default: FALSE.
grid_n
Integer; number of grid points for evaluating the ROC curve (default: 2000).
Returned value
smoothROC() returns an object of class
"smoothROC" with components including:
curve
Data frame giving threshold, false-positive rate (FPR), true-positive rate (TPR), and Youden index on the evaluation grid.
AUC, AUC_ci, AUC_ci_lo, AUC_ci_hi
Kernel-smoothed AUC estimate and its confidence interval.
J, J_ci, J_ci_lo, J_ci_hi
Kernel-smoothed Youden index estimate and its confidence interval.
t0
Estimated optimal cutoff associated with the Youden index.
sensitivity, specificity
Sensitivity and specificity at the Youden cutoff.
kernel, bandwidth_method
The chosen kernel and bandwidth selection method.
hX, hY
Selected CDF bandwidths for the non-diseased and diseased groups.
plot
A ggplot2 ROC plot object including the Youden point and a textual annotation.
Print, summary, and plot methods are available for this class and are dispatched automatically by print(), summary(), and plot().
Method overview
Kernel CDF estimators and the smooth ROC curve
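To obtain a smooth ROC curve while remaining nonparametric,
smoothROC() applies kernel-based CDF estimators of the form
$$\hat F(t) = \frac{1}{m}\sum_{i=1}^{m} W\left(\frac{t - X_i}{h_X}\right), \qquad \hat G(t) = \frac{1}{n}\sum_{j=1}^{n} W\left(\frac{t - Y_j}{h_Y}\right),$$
where $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ are the non-diseased and diseased biomarker values,
$W(u) = \int_{-\infty}^{u} K(v)\,dv$ is the integrated kernel, and $h_X$, $h_Y$
are bandwidths for the non-diseased and diseased groups, respectively.
The smooth ROC curve is then obtained by plugging $\hat F$ and $\hat G$
into the ROC functional.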
Three univariate kernels are implemented:
Gaussian kernel
$K(u) = (2\pi)^{-1/2} e^{-u^2/2}$, with CDF equal to the standard normal distribution function. This kernel has infinite support and is a default choice in many smoothing problems.
Epanechnikov kernel
$K(u) = \tfrac{3}{4}(1 - u^2)$ for $|u| \le 1$, with compact support on $[-1, 1]$ and optimal second-order efficiency under many mean squared error criteria.
Biweight kernel
$K(u) = \tfrac{15}{16}(1 - u^2)^2$ for $|u| \le 1$, a compactly supported kernel of higher polynomial degree that produces more rounded estimates near the boundaries.
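As a rough illustration of the estimator described above, the sketch below codes the integrated kernels in their standard textbook forms and evaluates kernel CDFs on a threshold grid. The helper names (`W_gaussian`, `W_epanechnikov`, `W_biweight`, `kernel_cdf`), the simulated data, and the hand-picked bandwidths are assumptions for illustration. The package's internal scaling of the kernels may differ.

```r
# Integrated kernels W(u) = integral of K from -Inf to u (standard forms,
# support [-1, 1] for the compact kernels; the package's scaling may differ)
W_gaussian <- function(u) pnorm(u)
W_epanechnikov <- function(u) {
  u <- pmin(pmax(u, -1), 1)            # clamp: W = 0 below -1 and 1 above 1
  0.5 + 0.75 * u - 0.25 * u^3
}
W_biweight <- function(u) {
  u <- pmin(pmax(u, -1), 1)
  0.5 + (15 / 16) * u - (5 / 8) * u^3 + (3 / 16) * u^5
}

# Kernel CDF estimate at points tt from a sample z with bandwidth h
kernel_cdf <- function(tt, z, h, W = W_gaussian) {
  vapply(tt, function(t0) mean(W((t0 - z) / h)), numeric(1))
}

set.seed(1)
x <- rnorm(50)            # simulated non-diseased values
y <- rnorm(40, mean = 1)  # simulated diseased values
hX <- 0.4; hY <- 0.4      # hypothetical bandwidths, fixed by hand here

# Smooth ROC points: FPR = 1 - F_hat(t), TPR = 1 - G_hat(t) over a grid
grid <- seq(min(x, y) - 1, max(x, y) + 1, length.out = 200)
fpr <- 1 - kernel_cdf(grid, x, hX, W_biweight)
tpr <- 1 - kernel_cdf(grid, y, hY, W_biweight)
```

Because each `W` is a continuous, nondecreasing function, `fpr` and `tpr` vary smoothly with the threshold instead of jumping at each observation.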
Bandwidth selection strategies
Bandwidth selection is critical for balancing bias and variance in
the smoothed CDFs and the resulting ROC curve. smoothROC()
focuses on bandwidths that are optimal for CDF estimation
(rather than densities), which aligns more directly with ROC
functionals.
The bw_method argument implements several
strategies:
"pdf"
A density-based rule-of-thumb (Silverman) using a kernel density bandwidth. Convenient and widely used, but it does not satisfy the usual asymptotic conditions for CDF estimation.
"BHP"
A CDF-based normal-reference bandwidth that approximately minimizes the integrated mean squared error of the kernel CDF estimator. It uses a robust scale estimate.
"AR"
An adjusted CDF reference bandwidth for the Gaussian kernel, obtained by shrinking the normal-reference constant to reduce oversmoothing for non-Gaussian data while preserving the CDF rate.
"AL"
A fully data-driven CDF-based bandwidth in the spirit of Altman and Leger, where the unknown roughness functional is estimated using an auxiliary kernel estimator.
"PB"
A multistage plug-in CDF-based bandwidth (a two-stage version of Polansky and Baker) that uses an initial normal-reference pilot followed by a data-driven refinement.
In simulation studies (not shown here), the "PB" method
often provides stable performance across a range of underlying
distributions, especially when sample sizes are small to moderate.
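The difference between a density-oriented and a CDF-oriented bandwidth can be seen in a small sketch. It compares Silverman's rule, as implemented in `stats::bw.nrd0()`, with the textbook normal-reference bandwidth for a Gaussian-kernel CDF estimate, $h = 4^{1/3}\sigma n^{-1/3}$. The robust scale estimate used here is an assumption. The constants inside smoothROC's "pdf", "BHP", and "AR" options are not documented in this vignette and may differ.

```r
set.seed(1)
z <- rexp(60)   # simulated right-skewed sample (illustrative only)
n <- length(z)

# Density rule of thumb (Silverman), as implemented in stats::bw.nrd0():
# 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
h_pdf <- bw.nrd0(z)

# Normal-reference bandwidth for a Gaussian-kernel CDF estimate:
# minimising the asymptotic integrated MSE under normality gives
# h = 4^(1/3) * sigma * n^(-1/3)
sigma_hat <- min(sd(z), IQR(z) / 1.349)   # robust scale (an assumption here)
h_cdf <- 4^(1 / 3) * sigma_hat * n^(-1 / 3)

c(pdf = h_pdf, cdf = h_cdf)
```

The key contrast is the rate: the density rule shrinks like $n^{-1/5}$, whereas bandwidths tuned for CDF estimation shrink like $n^{-1/3}$.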
AUC estimation and kernel DeLong-type variance
Given the kernel CDFs $\hat F$ (non-diseased) and $\hat G$ (diseased), smoothROC() computes a
kernel-based AUC estimator
$$\widehat{\mathrm{AUC}} = \int_{-\infty}^{\infty} \hat F(t)\,d\hat G(t),$$
which is asymptotically equivalent to
the empirical AUC based on the Mann–Whitney statistic. This link ensures
that classical large-sample results for the empirical AUC remain valid
in the smoothed setting.
To quantify uncertainty, smoothROC() implements a
kernel-smoothed analogue of DeLong’s variance estimator. Instead of
using empirical placement values, the variance expression replaces them
by their kernel-smoothed counterparts. This typically yields a more
stable variance estimate in small samples while retaining the
large-sample properties of the original DeLong method.
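The idea can be sketched in base R for the Gaussian kernel, where the plug-in integral has a closed form: each Mann–Whitney indicator $1\{Y_j > X_i\}$ becomes $\Phi\big((Y_j - X_i)/\sqrt{h_X^2 + h_Y^2}\big)$. The simulated data, the bandwidths, and the exact form of the variance below are assumptions for illustration and are not necessarily the package's internal formulas.

```r
set.seed(1)
x <- rnorm(50)            # simulated non-diseased values
y <- rnorm(40, mean = 1)  # simulated diseased values
m <- length(x); n <- length(y)
hX <- 0.3; hY <- 0.3      # hypothetical bandwidths
alpha <- 0.05

# Smoothed pairwise scores: Phi((Y_j - X_i) / sqrt(hX^2 + hY^2)) replaces
# the indicator 1{Y_j > X_i}; rows index diseased, columns non-diseased
psi <- pnorm(outer(y, x, "-") / sqrt(hX^2 + hY^2))
auc_smooth <- mean(psi)

# DeLong-type variance built from smoothed placement values
v10 <- rowMeans(psi)   # one value per diseased subject
v01 <- colMeans(psi)   # one value per non-diseased subject
se_auc <- sqrt(var(v10) / n + var(v01) / m)

# Wald-type confidence interval at level 1 - alpha
z_crit <- qnorm(1 - alpha / 2)
auc_ci <- auc_smooth + c(-1, 1) * z_crit * se_auc

# Empirical (unsmoothed) AUC for comparison
auc_emp <- mean(outer(y, x, ">"))
```

With small bandwidths the smoothed and empirical AUC values are close, which reflects the asymptotic equivalence noted above.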
The resulting AUC estimate and confidence interval are available via:
roc$AUC
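roc$AUC_ci
Youden index and optimal cutoff
The Youden index provides a summary of the optimal trade-off between sensitivity and specificity, with $J = \max_t \{\mathrm{sensitivity}(t) + \mathrm{specificity}(t) - 1\} = \max_t \{F(t) - G(t)\}$, where $F$ and $G$ are the CDFs of the non-diseased and diseased biomarker values. The corresponding optimal cutoff is
$$t_0 = \arg\max_t \{F(t) - G(t)\}.$$
Using the kernel CDFs $\hat F$ and $\hat G$, smoothROC() computes a smoothed
Youden index
$$\hat J = \max_t \{\hat F(t) - \hat G(t)\}$$
and the maximizing cutoff $\hat t_0$
on a grid of thresholds. When multiple cutoffs achieve the same maximum,
secondary rules (favoring higher sensitivity or specificity) or
median-based summaries can be used. A Delta-method approximation
provides the variance of $\hat J$,
from which a Wald-type confidence interval is constructed.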
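The grid search can be sketched as follows, using Gaussian-kernel CDF estimates on simulated data. The bandwidths and the median tie rule are assumptions chosen for illustration. The Delta-method variance is not reproduced here.

```r
set.seed(1)
x <- rnorm(50)            # simulated non-diseased values
y <- rnorm(40, mean = 1)  # simulated diseased values
hX <- 0.3; hY <- 0.3      # hypothetical bandwidths

# Gaussian-kernel CDF estimates evaluated on a threshold grid
grid <- seq(min(x, y), max(x, y), length.out = 2000)
F_hat <- vapply(grid, function(t0) mean(pnorm((t0 - x) / hX)), numeric(1))
G_hat <- vapply(grid, function(t0) mean(pnorm((t0 - y) / hY)), numeric(1))

# Youden curve J(t) = specificity(t) + sensitivity(t) - 1 = F_hat(t) - G_hat(t)
J_curve <- F_hat - G_hat
J_hat <- max(J_curve)

# Ties: take the median of all maximising cutoffs (one of several possible rules)
t0_hat <- median(grid[J_curve == J_hat])

# Sensitivity and specificity at the selected cutoff
sens <- 1 - mean(pnorm((t0_hat - y) / hY))
spec <- mean(pnorm((t0_hat - x) / hX))
```

The located cutoff can be off by at most about one grid step, which is why a finer grid localizes it more precisely.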
These quantities are returned as:
roc$J
roc$J_ci
roc$t0
roc$sensitivity
roc$specificity
Example: Duchenne muscular dystrophy dataset
The package includes an example dataset, dystrophy, with
biomarker measurements for Duchenne muscular dystrophy (DMD) carriers
and non-carriers. We treat the serum marker CK as the
primary biomarker and Class as the disease status.
data(dystrophy)
str(dystrophy)
#> spc_tbl_ [209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ OBS : num [1:209] 1 1 1 1 1 1 1 1 1 1 ...
#> $ HospID: num [1:209] 1007 786 778 1306 895 ...
#> $ AGE : num [1:209] 22 32 36 22 23 30 27 30 25 26 ...
#> $ M : num [1:209] 6 8 7 11 1 5 8 11 10 2 ...
#> $ Y : num [1:209] 79 78 78 79 78 79 78 78 79 79 ...
#> $ CK : num [1:209] 52 20 28 30 40 24 15 22 42 130 ...
#> $ H : num [1:209] 83.5 77 86.5 104 83 78.8 87 91 65.5 80.3 ...
#> $ PK : num [1:209] 10.9 11 13.2 22.6 15.2 9.6 13.5 17.5 13.3 17.1 ...
#> $ LD : num [1:209] 176 200 171 230 205 151 232 198 216 211 ...
#> $ Class : Factor w/ 2 levels "normal","carrier": 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "spec")=List of 3
#> ..$ cols :List of 10
#> .. ..$ OBS : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ HospID: list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ AGE : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ M : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ Y : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ CK : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ H : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ PK : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ LD : list()
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#> .. ..$ Class :List of 3
#> .. .. ..$ levels : chr [1:2] "normal" "carrier"
#> .. .. ..$ ordered : logi FALSE
#> .. .. ..$ include_na: logi FALSE
#> .. .. ..- attr(*, "class")= chr [1:2] "collector_factor" "collector"
#> ..$ default: list()
#> .. ..- attr(*, "class")= chr [1:2] "collector_guess" "collector"
#> ..$ delim : chr ","
#> ..- attr(*, "class")= chr "col_spec"
#> - attr(*, "problems")=<externalptr>
A basic smooth ROC analysis is:
roc <- smoothROC(
  data = dystrophy,
  biomarker = "CK",
  status = "Class",
  diseased = "carrier",
  kernel = "biweight",
  bw_method = "PB",
  alpha = 0.05,
  logtrans = TRUE,
  grid_n = 2000
)
Inspecting the result
The print and summary methods provide a concise summary:
roc
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
summary(roc)
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
They report the kernel and bandwidth method, AUC with confidence interval, the Youden index and its confidence interval, and the Youden point (FPR, TPR, cutoff, sensitivity, specificity).
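The printed quantities are linked by simple arithmetic, which gives a quick sanity check. The sketch below uses only the rounded values from the printed summary above. It assumes the AUC interval is a symmetric Wald-type interval, which the printed limits are consistent with.

```r
# Values copied from the printed summary above (rounded to 4 decimals)
sens <- 0.7037
spec <- 0.8687
auc_ci <- c(0.8105, 0.9188)

# Youden index at the reported cutoff: J = sensitivity + specificity - 1
J <- sens + spec - 1   # 0.5724, matching the printed J

# A symmetric Wald interval has half-width z * SE, so the SE can be backed out
se_auc <- diff(auc_ci) / (2 * qnorm(0.975))   # about 0.0276
```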
We can visualize the ROC curve:
plot(roc)
[Figure: ROC curve produced by smoothROC]
By default, the plot includes:
- the 45-degree reference line (no-discrimination),
- the smooth ROC curve,
- the Youden point marked in red, and
- a label showing AUC, its confidence interval, the Youden index, and the Youden cutoff.
If you prefer a clean ROC curve without annotation, use:
plot(roc, label = FALSE)
[Figure: ROC curve produced by smoothROC]
The underlying ROC data and key summaries can be accessed directly:
head(roc$curve) # threshold, FPR, TPR, J
#> threshold FPR TPR J
#> 1 -0.2326416 1 1 0
#> 2 -0.2274720 1 1 0
#> 3 -0.2223023 1 1 0
#> 4 -0.2171326 1 1 0
#> 5 -0.2119629 1 1 0
#> 6 -0.2067933 1 1 0
roc$AUC # AUC estimate
#> [1] 0.8646697
roc$AUC_ci # AUC confidence interval
#> [1] 0.8105293 0.9188102
roc$J # Youden index estimate
#> [1] 0.5723741
roc$J_ci # Youden index CI
#> [1] 0.4542642 0.6904840
roc$t0 # Youden cutoff
#> [1] 4.068528
roc$sensitivity # Sensitivity at Youden point
#> [1] 0.7036682
roc$specificity # Specificity at Youden point
#> [1] 0.8687059
roc$hX # Bandwidth for non-diseased CDF
#> [1] 0.3146762
roc$hY # Bandwidth for diseased CDF
#> [1] 0.9802306
Advanced options
This section summarizes the more technical aspects of
smoothROC() that may be useful for advanced users.
Log transformation
Setting logtrans = TRUE applies a natural log
transformation to the biomarker prior to ROC estimation. This is often
appropriate for biomarkers with strong right skew or multiplicative
variability (e.g. enzyme concentrations, cytokines). When
logtrans = TRUE, biomarker values must be strictly
positive.
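In the dystrophy example, the threshold grid and the reported cutoff (4.0685) look like log-scale values, since CK itself is recorded in the tens and hundreds. Assuming the cutoff is indeed reported on the log scale, it can be mapped back to the original units with `exp()`. The short vector `ck` below reuses the first few CK values shown in the `str()` output and is only illustrative.

```r
# First few CK values from the str() output; log() needs strictly positive input
ck <- c(52, 20, 28, 30, 40)
stopifnot(all(ck > 0))
log_ck <- log(ck)

# Back-transform the reported cutoff (assumed here to be on the natural-log scale)
t0_log <- 4.068528           # roc$t0 from the example with logtrans = TRUE
t0_original <- exp(t0_log)   # about 58.47 in the original CK units
```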
Controlling the ROC grid
The argument grid_n controls the resolution of the ROC
curve. Larger values produce a smoother ROC curve and more precise
localization of the Youden cutoff, at the cost of increased computation.
Reasonable values include:
- grid_n = 1000 – fast and adequate for exploratory work;
- grid_n = 2000 – default, smoother curve and better stability;
- grid_n = 5000 – more refined, useful in simulation studies.
Choosing a bandwidth method
For most applications, a good starting point is:
- kernel = "biweight",
- bw_method = "PB".
The "PB" method tends to perform well across a range of
scenarios and sample sizes. When computation time is a concern, the
"AR" method offers a simple, robust alternative that is
easy to compute.
In large samples, the differences between bandwidth methods may be minor. However, in small or moderate samples, bandwidth selection can substantially affect ROC shape, AUC estimates, and the stability of the Youden index.
References
Altman, N., & Leger, C. (1995). Bandwidth selection for kernel distribution function estimation. Journal of Statistical Planning and Inference, 46(2), 195–214.
Andrews, D. F., & Herzberg, A. M. (2012). Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer.
Bowman, A., Hall, P., & Prvan, T. (1998). Bandwidth selection for the smoothing of distribution functions. Biometrika, 85(4), 799–808.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837–845.
Khan, R. A., & Ghebremichael, M. (2025). Smooth ROC Curve Estimation. Journal Name (preprint).
Lloyd, C. J. (1998). Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association, 93(444), 1356–1364.
Polansky, A. M., & Baker, E. R. (2000). Multistage plug-in bandwidth selection for kernel distribution function estimates. Journal of Statistical Computation and Simulation, 65(1–4), 63–80.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.
Zhou, X.-H., & Harezlak, J. (2002). Comparison of bandwidth selection methods for kernel smoothing of ROC curves. Statistics in Medicine, 21(14), 2045–2055.
Zou, K. H., Hall, W. J., & Shapiro, D. E. (1997). Smooth nonparametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statistics in Medicine, 16(19), 2143–2156.