Introduction

In medical diagnostics, a central task is to evaluate how accurately a continuous biomarker distinguishes between diseased and non-diseased individuals. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are standard tools for this purpose. The ROC curve summarizes the trade-off between sensitivity and specificity over all possible decision thresholds, while the AUC provides a single-number summary of overall discriminative ability (the probability that a randomly chosen diseased subject has a higher biomarker value than a randomly chosen non-diseased subject).

Let $X$ and $Y$ denote biomarker values from non-diseased and diseased subjects, with cumulative distribution functions (CDFs) $F$ and $G$ and survival functions $\bar F = 1 - F$ and $\bar G = 1 - G$. The ROC curve can be written as
$$\mathrm{ROC}(p) = \bar G\{\bar F^{-1}(p)\}, \quad p \in [0,1],$$
and the AUC as
$$\mathrm{AUC} = \int_{-\infty}^{\infty} F(x)\,dG(x) = P(Y > X).$$
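
As a small worked check of the two expressions for the AUC, the sketch below computes both on a toy data set. The values are hypothetical and chosen so that no diseased value ties with a non-diseased one; it uses only base R and is not part of the package.

```r
# Toy biomarker values (hypothetical, no ties between the groups)
x <- c(1, 2, 3)   # non-diseased
y <- c(2.5, 4)    # diseased

# P(Y > X): proportion of (x, y) pairs in which the diseased value is larger
auc_pairs <- mean(outer(y, x, ">"))

# Integral of F dG: average the empirical CDF of x over the diseased values
auc_integral <- mean(ecdf(x)(y))

c(pairs = auc_pairs, integral = auc_integral)
# both equal 5/6, about 0.833
```

Five of the six possible pairs have the diseased value above the non-diseased one, which is why both forms return 5/6.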

Empirical ROC curves, constructed directly from the empirical CDFs of $X$ and $Y$, are fully nonparametric but stepwise and potentially jagged, especially in small or moderate samples. Parametric ROC models are smooth but require strong distributional assumptions that may be unrealistic in practice.

The smoothROC package implements a kernel-based, distribution-free approach that produces smooth ROC curves, a kernel-based AUC estimator with confidence intervals, and a smooth Youden index summary with an associated optimal cutoff. The core function is smoothROC(), which this vignette introduces and illustrates.

The smoothROC() function

The main user-facing function is:

smoothROC(
  data,
  biomarker,
  status,
  diseased,
  kernel    = c("gaussian", "biweight", "epanechnikov"),
  bw_method = c("pdf", "AL", "PB", "BHP", "AR"),
  alpha     = 0.05,
  logtrans  = FALSE,
  grid_n    = 2000
)

It estimates a kernel-smoothed ROC curve for a continuous biomarker using kernel CDF estimators in the non-diseased and diseased groups. It returns:

  • a smooth ROC curve on a fine grid of thresholds,
  • a kernel-based estimator of the AUC with a confidence interval, and
  • a kernel-smoothed Youden index with its optimal cutoff and confidence interval.

Arguments

  • data
    A data frame containing the biomarker and status variables.

  • biomarker
    Character string; name of the numeric column containing biomarker values.

  • status
    Character string; name of the column containing binary disease status.

  • diseased
    The value in status indicating the diseased class (e.g. "carrier").

  • kernel
    Character string; kernel function for smoothing. One of "gaussian", "biweight", or "epanechnikov".

  • bw_method
    Character string; bandwidth selection method. One of:

    • "pdf" – density-based rule-of-thumb (Silverman),
    • "BHP" – CDF-based normal-reference (Bowman–Hall–Prvan),
    • "AR" – adjusted CDF reference bandwidth,
    • "AL" – Altman–Leger CDF plug-in,
    • "PB" – Polansky–Baker multistage plug-in.
  • alpha
    Numeric; significance level for $(1 - \alpha)$ confidence intervals (default: 0.05).

  • logtrans
    Logical; if TRUE, applies a log-transformation to biomarker values prior to ROC estimation (useful for right-skewed biomarkers). Default: FALSE.

  • grid_n
    Integer; number of grid points for evaluating the ROC curve (default: 2000).

Returned value

smoothROC() returns an object of class "smoothROC" with components including:

  • curve
    Data frame giving threshold, false-positive rate (FPR), true-positive rate (TPR), and Youden index $J$ on the evaluation grid.

  • AUC, AUC_ci, AUC_ci_lo, AUC_ci_hi
    Kernel-smoothed AUC estimate and its confidence interval.

  • J, J_ci, J_ci_lo, J_ci_hi
    Kernel-smoothed Youden index estimate and its confidence interval.

  • t0
    Estimated optimal cutoff associated with the Youden index.

  • sensitivity, specificity
    Sensitivity and specificity at the Youden cutoff.

  • kernel, bandwidth_method
    The chosen kernel and bandwidth selection method.

  • hX, hY
    Selected CDF bandwidths for the non-diseased and diseased groups.

  • plot
    A ggplot2 ROC plot object including the Youden point and a textual annotation.

Print, summary, and plot methods are available and are invoked automatically via print(), summary(), and plot().

Method overview

Kernel CDF estimators and the smooth ROC curve

To obtain a smooth ROC curve while remaining nonparametric, smoothROC() applies kernel-based CDF estimators of the form
$$\hat F(x) = \frac{1}{m} \sum_{i=1}^m K\!\left(\frac{x - X_i}{h_m}\right), \quad \hat G(x) = \frac{1}{n} \sum_{j=1}^n K\!\left(\frac{x - Y_j}{h_n}\right),$$
where $K(u) = \int_{-\infty}^u k(v)\,dv$ is the integrated kernel, and $h_m$, $h_n$ are bandwidths for the non-diseased and diseased groups, respectively. The smooth ROC curve is then obtained by plugging $\hat F$ and $\hat G$ into the ROC functional.
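
The plug-in construction can be sketched in a few lines of base R. This is a minimal illustration on simulated data with a Gaussian kernel and fixed, hand-picked bandwidths; the helper `kernel_cdf()` and all object names are hypothetical and do not reflect the package's internal code.

```r
set.seed(1)
x <- rnorm(60, mean = 0)     # simulated non-diseased values
y <- rnorm(40, mean = 1.5)   # simulated diseased values

# Kernel CDF estimator with a Gaussian kernel: K is the standard normal CDF
kernel_cdf <- function(t, obs, h) {
  vapply(t, function(ti) mean(pnorm((ti - obs) / h)), numeric(1))
}

h_m <- 0.4   # illustrative bandwidth for the non-diseased group
h_n <- 0.4   # illustrative bandwidth for the diseased group

# Threshold grid, padded so that both smoothed CDFs reach 0 and 1
pad <- 3 * max(h_m, h_n)
thresholds <- seq(min(x, y) - pad, max(x, y) + pad, length.out = 200)

fpr <- 1 - kernel_cdf(thresholds, x, h_m)   # 1 - F_hat(t)
tpr <- 1 - kernel_cdf(thresholds, y, h_n)   # 1 - G_hat(t)

plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)   # no-discrimination reference line
```

Because both smoothed CDFs are continuous and increasing in the threshold, the resulting curve has no steps, unlike the empirical ROC curve.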

Three univariate kernels are implemented:

  • Gaussian kernel
    $k(u) = (2\pi)^{-1/2}\exp(-u^2/2)$, with CDF $K(u)$ equal to the standard normal distribution function. This kernel has infinite support and is a default choice in many smoothing problems.

  • Epanechnikov kernel
    $k(u) = \tfrac{3}{4}(1-u^2)\mathbf{1}_{\{|u|\le 1\}}$, with compact support on $[-1,1]$ and optimal second-order efficiency under many mean squared error criteria.

  • Biweight kernel
    $k(u) = \tfrac{15}{16}(1-u^2)^2\mathbf{1}_{\{|u|\le 1\}}$, a compactly supported second-order kernel that is smoother than the Epanechnikov kernel at the edges of its support and produces more rounded estimates near the boundaries.
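
For reference, the integrated kernels $K(u)$ of the three densities above have simple closed forms, obtained by integrating each $k$ from $-1$ (or $-\infty$ for the Gaussian) to $u$. The sketch below writes them as base R functions and checks the two polynomial forms numerically; the function names are hypothetical and are not taken from the package.

```r
K_gaussian <- function(u) pnorm(u)

K_epanechnikov <- function(u) {
  u <- pmin(pmax(u, -1), 1)   # clamp: K is 0 below -1 and 1 above 1
  (2 + 3 * u - u^3) / 4
}

K_biweight <- function(u) {
  u <- pmin(pmax(u, -1), 1)   # same clamping for the compact support
  (8 + 15 * u - 10 * u^3 + 3 * u^5) / 16
}

# Numerical check against the kernel densities at an arbitrary point
k_epan <- function(v) 0.75 * (1 - v^2) * (abs(v) <= 1)
k_biw  <- function(v) (15 / 16) * (1 - v^2)^2 * (abs(v) <= 1)
u0 <- 0.3
diff_epan <- K_epanechnikov(u0) - integrate(k_epan, -1, u0)$value
diff_biw  <- K_biweight(u0) - integrate(k_biw, -1, u0)$value
c(epanechnikov = diff_epan, biweight = diff_biw)   # both close to zero
```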

Bandwidth selection strategies

Bandwidth selection is critical for balancing bias and variance in the smoothed CDFs and the resulting ROC curve. smoothROC() focuses on bandwidths that are optimal for CDF estimation (rather than densities), which aligns more directly with ROC functionals.

The bw_method argument implements several strategies:

  • "pdf"
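    A density-based rule of thumb (Silverman) that uses a kernel density bandwidth. It is convenient and widely used, but it shrinks at the density rate $m^{-1/5}$ rather than the $m^{-1/3}$ rate suited to CDF estimation, so it does not satisfy the usual asymptotic conditions for CDF estimation.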

  • "BHP"
    A CDF-based normal-reference bandwidth that approximately minimizes the integrated mean squared error of $\hat F$. It uses a robust scale estimate based on $\min(\mathrm{SD}, \mathrm{IQR}/1.34)$.

  • "AR"
    An adjusted CDF reference bandwidth for the Gaussian kernel, obtained by shrinking the normal-reference constant to reduce oversmoothing for non-Gaussian data while preserving the $m^{-1/3}$ CDF rate.

  • "AL"
    A fully data-driven CDF-based bandwidth in the spirit of Altman and Leger, where the unknown roughness functional is estimated using an auxiliary kernel estimator.

  • "PB"
    A multistage plug-in CDF-based bandwidth (a two-stage version of Polansky and Baker) that uses an initial normal-reference pilot followed by a data-driven refinement.

In simulation studies (not shown here), the "PB" method often provides stable performance across a range of underlying distributions, especially when sample sizes are small to moderate.
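
To make the difference between density-based and CDF-based reference rules concrete, the sketch below computes both for one simulated group. It is an illustration only: the density rule is Silverman's rule of thumb as implemented in `stats::bw.nrd0()`, and the CDF rule uses the textbook normal-reference constant $4^{1/3}$ for a Gaussian kernel. Whether smoothROC() uses exactly these constants is an assumption and is not confirmed by this vignette.

```r
set.seed(2)
x <- rnorm(100)                    # simulated biomarker values for one group
n <- length(x)
s <- min(sd(x), IQR(x) / 1.34)     # robust scale estimate

# Silverman's density rule of thumb: shrinks at rate n^(-1/5)
h_pdf <- 0.9 * s * n^(-1/5)

# Gaussian-kernel normal-reference rule for CDF estimation: rate n^(-1/3)
# (textbook constant 4^(1/3); assumed here, not taken from the package)
h_cdf <- 4^(1/3) * s * n^(-1/3)

c(h_pdf = h_pdf, bw_nrd0 = bw.nrd0(x), h_cdf = h_cdf)
```

The first two entries agree, since `bw.nrd0()` applies the same formula. As the sample size grows, the CDF-based bandwidth shrinks faster than the density-based one because of the different exponents.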

AUC estimation and kernel DeLong-type variance

Given the kernel CDFs, smoothROC() computes a kernel-based AUC estimator
$$\hat \delta = \int_{-\infty}^{\infty} \hat F(x)\, d\hat G(x),$$
which is asymptotically equivalent to the empirical AUC based on the Mann–Whitney statistic. This link ensures that classical large-sample results for the empirical AUC remain valid in the smoothed setting.

To quantify uncertainty, smoothROC() implements a kernel-smoothed analogue of DeLong’s variance estimator. Instead of using empirical placement values, the variance expression replaces them by their kernel-smoothed counterparts. This typically yields a more stable variance estimate in small samples while retaining the large-sample properties of the original DeLong method.

The resulting AUC estimate and confidence interval are available via:

roc$AUC
roc$AUC_ci
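
The idea behind the kernel AUC and its DeLong-type variance can be sketched in base R for the Gaussian kernel, where the integral of $\hat F$ against $\hat G$ has the closed form $\frac{1}{mn}\sum_{i}\sum_{j}\Phi\{(Y_j - X_i)/\sqrt{h_m^2 + h_n^2}\}$. The function `smooth_auc()` below is a hypothetical helper written for this illustration; it is a simplified sketch under these assumptions and not the package's implementation.

```r
smooth_auc <- function(x, y, h_m, h_n, alpha = 0.05) {
  m <- length(x)
  n <- length(y)
  # Smoothed pair kernel: rows index diseased (y), columns non-diseased (x)
  psi <- pnorm(outer(y, x, "-") / sqrt(h_m^2 + h_n^2))
  auc <- mean(psi)
  v_y <- rowMeans(psi)   # smoothed placement value of each diseased subject
  v_x <- colMeans(psi)   # smoothed placement value of each non-diseased subject
  # DeLong-type variance with smoothed placement values
  se <- sqrt(var(v_y) / n + var(v_x) / m)
  z <- qnorm(1 - alpha / 2)
  list(auc = auc, se = se, ci = c(auc - z * se, auc + z * se))
}

set.seed(3)
x <- rnorm(60)               # simulated non-diseased values
y <- rnorm(40, mean = 1.5)   # simulated diseased values
res <- smooth_auc(x, y, h_m = 0.4, h_n = 0.4)
res$auc
res$ci
```

As both bandwidths shrink toward zero, the smoothed pair kernel approaches the indicator of $Y_j > X_i$, so the estimate reduces to the Mann–Whitney form of the empirical AUC.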

Youden index and optimal cutoff

The Youden index
$$J = \max_t \{\mathrm{sensitivity}(t) + \mathrm{specificity}(t) - 1\} = \max_t \{F(t) - G(t)\}$$
provides a summary of the optimal trade-off between sensitivity and specificity, with $J \in [0,1]$. The corresponding optimal cutoff is
$$t_0 = \operatorname*{arg\,max}_t \{F(t) - G(t)\}.$$

Using the kernel CDFs, smoothROC() computes a smoothed Youden index $\hat J$ and the maximizing cutoff $\hat t_0$ on a grid of thresholds. When multiple cutoffs achieve the same maximum, secondary rules (favoring higher sensitivity or specificity) or median-based summaries can be used. A delta-method approximation provides the variance of $\hat J$, from which a Wald-type confidence interval is constructed.

These quantities are returned as:

roc$J
roc$J_ci
roc$t0
roc$sensitivity
roc$specificity
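
The grid search for the smoothed Youden index can be sketched as follows. This uses simulated data, a Gaussian kernel, and hand-picked bandwidths; `kernel_cdf()` and the other names are hypothetical, ties are resolved by taking the first maximizer, and the variance step is omitted.

```r
set.seed(4)
x <- rnorm(60)               # simulated non-diseased values
y <- rnorm(40, mean = 1.5)   # simulated diseased values

# Gaussian-kernel CDF estimator (hypothetical helper)
kernel_cdf <- function(t, obs, h) {
  vapply(t, function(ti) mean(pnorm((ti - obs) / h)), numeric(1))
}

h_m <- 0.4   # illustrative bandwidths, not data-driven
h_n <- 0.4

grid  <- seq(min(x, y), max(x, y), length.out = 2000)
F_hat <- kernel_cdf(grid, x, h_m)   # specificity at each threshold
G_hat <- kernel_cdf(grid, y, h_n)   # 1 - sensitivity at each threshold

J_curve <- F_hat - G_hat
best <- which.max(J_curve)          # first maximizer if several tie

youden <- c(J = J_curve[best], t0 = grid[best],
            sensitivity = 1 - G_hat[best], specificity = F_hat[best])
youden
```

A finer grid localizes the maximizing threshold more precisely, at the cost of more evaluations of the two kernel CDFs.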

Example: Duchenne muscular dystrophy dataset

The package includes an example dataset, dystrophy, with biomarker measurements for Duchenne muscular dystrophy (DMD) carriers and non-carriers. We treat the serum marker CK as the primary biomarker and Class as the disease status.

data(dystrophy)
str(dystrophy)
#> spc_tbl_ [209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ OBS   : num [1:209] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ HospID: num [1:209] 1007 786 778 1306 895 ...
#>  $ AGE   : num [1:209] 22 32 36 22 23 30 27 30 25 26 ...
#>  $ M     : num [1:209] 6 8 7 11 1 5 8 11 10 2 ...
#>  $ Y     : num [1:209] 79 78 78 79 78 79 78 78 79 79 ...
#>  $ CK    : num [1:209] 52 20 28 30 40 24 15 22 42 130 ...
#>  $ H     : num [1:209] 83.5 77 86.5 104 83 78.8 87 91 65.5 80.3 ...
#>  $ PK    : num [1:209] 10.9 11 13.2 22.6 15.2 9.6 13.5 17.5 13.3 17.1 ...
#>  $ LD    : num [1:209] 176 200 171 230 205 151 232 198 216 211 ...
#>  $ Class : Factor w/ 2 levels "normal","carrier": 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "spec")=List of 3
#>   ..$ cols   :List of 10
#>   .. ..$ OBS   : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ HospID: list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ AGE   : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ M     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ Y     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ CK    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ H     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ PK    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ LD    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ Class :List of 3
#>   .. .. ..$ levels    : chr [1:2] "normal" "carrier"
#>   .. .. ..$ ordered   : logi FALSE
#>   .. .. ..$ include_na: logi FALSE
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_factor" "collector"
#>   ..$ default: list()
#>   .. ..- attr(*, "class")= chr [1:2] "collector_guess" "collector"
#>   ..$ delim  : chr ","
#>   ..- attr(*, "class")= chr "col_spec"
#>  - attr(*, "problems")=<externalptr>

A basic smooth ROC analysis is:

roc <- smoothROC(
  data      = dystrophy,
  biomarker = "CK",
  status    = "Class",
  diseased  = "carrier",
  kernel    = "biweight",
  bw_method = "PB",
  alpha     = 0.05,
  logtrans  = TRUE,
  grid_n    = 2000
)

Inspecting the result

The print and summary methods provide a concise summary:

roc
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
summary(roc)
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687

They report the kernel and bandwidth method, AUC with confidence interval, the Youden index and its confidence interval, and the Youden point (FPR, TPR, cutoff, sensitivity, specificity).

We can visualize the ROC curve:

plot(roc)

ROC curve produced by smoothROC

By default, the plot includes:

  • the 45-degree reference line (no-discrimination),
  • the smooth ROC curve,
  • the Youden point marked in red, and
  • a label showing AUC, its confidence interval, the Youden index, and the Youden cutoff.

If you prefer a clean ROC curve without annotation, use:

plot(roc, label = FALSE)

ROC curve produced by smoothROC

The underlying ROC data and key summaries can be accessed directly:

head(roc$curve)     # threshold, FPR, TPR, J
#>    threshold FPR TPR J
#> 1 -0.2326416   1   1 0
#> 2 -0.2274720   1   1 0
#> 3 -0.2223023   1   1 0
#> 4 -0.2171326   1   1 0
#> 5 -0.2119629   1   1 0
#> 6 -0.2067933   1   1 0
roc$AUC             # AUC estimate
#> [1] 0.8646697
roc$AUC_ci          # AUC confidence interval
#> [1] 0.8105293 0.9188102
roc$J               # Youden index estimate
#> [1] 0.5723741
roc$J_ci            # Youden index CI
#> [1] 0.4542642 0.6904840
roc$t0              # Youden cutoff
#> [1] 4.068528
roc$sensitivity     # Sensitivity at Youden point
#> [1] 0.7036682
roc$specificity     # Specificity at Youden point
#> [1] 0.8687059
roc$hX              # Bandwidth for non-diseased CDF
#> [1] 0.3146762
roc$hY              # Bandwidth for diseased CDF
#> [1] 0.9802306

Advanced options

This section summarizes the more technical aspects of smoothROC() that may be useful for advanced users.

Log transformation

Setting logtrans = TRUE applies a natural log transformation to the biomarker prior to ROC estimation. This is often appropriate for biomarkers with strong right skew or multiplicative variability (e.g. enzyme concentrations, cytokines). When logtrans = TRUE, biomarker values must be strictly positive.
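
In the dystrophy example, the reported cutoff of 4.0685 appears to be on the natural-log scale: the raw CK values are in the tens and above, and the evaluation grid contains negative thresholds. Under that assumption, the cutoff can be mapped back to the original CK scale with `exp()`. The object names below are illustrative.

```r
t0_log <- 4.068528      # roc$t0 from the example above (assumed log scale)
t0_ck  <- exp(t0_log)   # back-transform to the original CK scale
round(t0_ck, 1)
# about 58.5

# logtrans = TRUE requires strictly positive values; a quick pre-check on
# the first ten CK values shown in str(dystrophy)
ck <- c(52, 20, 28, 30, 40, 24, 15, 22, 42, 130)
stopifnot(all(ck > 0))
```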

Controlling the ROC grid

The argument grid_n controls the resolution of the ROC curve. Larger values produce a smoother ROC curve and more precise localization of the Youden cutoff, at the cost of increased computation. Reasonable values include:

  • grid_n = 1000 – fast and adequate for exploratory work;
  • grid_n = 2000 – default, smoother curve and better stability;
  • grid_n = 5000 – more refined, useful in simulation studies.

Choosing a bandwidth method

For most applications, a good starting point is:

  • kernel = "biweight",
  • bw_method = "PB".

The "PB" method tends to perform well across a range of scenarios and sample sizes. When computation time is a concern, the "AR" method offers a simple, robust alternative that is easy to compute.

In large samples, the differences between bandwidth methods may be minor. However, in small or moderate samples, bandwidth selection can substantially affect ROC shape, AUC estimates, and the stability of the Youden index.

References

  • Altman, N., & Leger, C. (1995). Bandwidth selection for kernel distribution function estimation. Journal of Statistical Planning and Inference, 46(2), 195–214.

  • Andrews, D. F., & Herzberg, A. M. (2012). Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer.

  • Bowman, A., Hall, P., & Prvan, T. (1998). Bandwidth selection for the smoothing of distribution functions. Biometrika, 85(4), 799–808.

  • DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837–845.

  • Khan, R. A., & Ghebremichael, M. (2025). Smooth ROC Curve Estimation. Journal Name (preprint).

  • Lloyd, C. J. (1998). Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association, 93(444), 1356–1364.

  • Polansky, A. M., & Baker, E. R. (2000). Multistage plug-in bandwidth selection for kernel distribution function estimates. Journal of Statistical Computation and Simulation, 65(1–4), 63–80.

  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

  • Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.

  • Zhou, X.-H., & Harezlak, J. (2002). Comparison of bandwidth selection methods for kernel smoothing of ROC curves. Statistics in Medicine, 21(14), 2045–2055.

  • Zou, K. H., Hall, W. J., & Shapiro, D. E. (1997). Smooth nonparametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statistics in Medicine, 16(19), 2143–2156.