# Kernel-Smoothed ROC Curves with smoothROC

## Introduction

In medical diagnostics, a central task is to evaluate how accurately a
continuous biomarker distinguishes between diseased and non-diseased
individuals. The receiver operating characteristic (ROC) curve and the
area under the curve (AUC) are standard tools for this purpose. The ROC
curve summarizes the trade-off between sensitivity and specificity over
all possible decision thresholds, while the AUC provides a single-number
summary of overall discriminative ability (the probability that a
randomly chosen diseased subject has a higher biomarker value than a
randomly chosen non-diseased subject).

Let $`X`$ and $`Y`$ denote biomarker values from non-diseased and
diseased subjects with cumulative distribution functions (CDFs) $`F`$
and $`G`$, and survival functions $`\bar F = 1 - F`$ and
$`\bar G = 1 - G`$. The ROC curve can be written as
``` math

\mathrm{ROC}(p)
= \bar G\{\bar F^{-1}(p)\}, \quad p \in [0,1],
```
and the AUC as
``` math

\mathrm{AUC}
= \int_{-\infty}^{\infty} F(x)\,dG(x)
= P(Y > X).
```

Empirical ROC curves, constructed directly from the empirical CDFs of
$`X`$ and $`Y`$, are fully nonparametric but stepwise and potentially
jagged, especially in small or moderate samples. Parametric ROC models
are smooth but require strong distributional assumptions that may be
unrealistic in practice.
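
As a point of reference for the smoothed estimators, the empirical ROC curve and AUC can be computed in a few lines of base R. The following is a minimal sketch on simulated data; the variable names and the normal samples are illustrative assumptions and are not part of `smoothROC`.

``` r
set.seed(1)
x <- rnorm(50, mean = 0)     # simulated non-diseased biomarker values
y <- rnorm(40, mean = 1.5)   # simulated diseased biomarker values

# Empirical ROC: at each threshold, FPR = P(X > thr) and TPR = P(Y > thr)
thresholds <- sort(unique(c(x, y)), decreasing = TRUE)
emp_roc <- data.frame(
  threshold = thresholds,
  FPR = vapply(thresholds, function(thr) mean(x > thr), numeric(1)),
  TPR = vapply(thresholds, function(thr) mean(y > thr), numeric(1))
)

# Empirical AUC: proportion of (diseased, non-diseased) pairs with Y > X
auc_emp <- mean(outer(y, x, ">"))

# The same quantity from the Mann-Whitney statistic W (no ties here)
w <- unname(wilcox.test(y, x)$statistic)
auc_mw <- w / (length(x) * length(y))
```

Both `FPR` and `TPR` are step functions of the threshold, which is the jaggedness that kernel smoothing is meant to remove.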

The **`smoothROC`** package implements a kernel-based, distribution-free
approach that produces *smooth* ROC curves, a kernel-based AUC estimator
with confidence intervals, and a smooth Youden index summary with an
associated optimal cutoff. The core function is
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md), which
this vignette introduces and illustrates.

## The `smoothROC()` function

The main user-facing function is:

``` r
smoothROC(
  data,
  biomarker,
  status,
  diseased,
  kernel    = c("gaussian", "biweight", "epanechnikov"),
  bw_method = c("pdf", "AL", "PB", "BHP", "AR"),
  alpha     = 0.05,
  logtrans  = FALSE,
  grid_n    = 2000
)
```

It estimates a kernel-smoothed ROC curve for a continuous biomarker
using kernel CDF estimators in the non-diseased and diseased groups. It
returns:

- a smooth ROC curve on a fine grid of thresholds,
- a kernel-based estimator of the AUC with a confidence interval, and
- a kernel-smoothed Youden index with its optimal cutoff and confidence
  interval.

### Arguments

- `data`  
  A data frame containing the biomarker and status variables.

- `biomarker`  
  Character string; name of the numeric column containing biomarker
  values.

- `status`  
  Character string; name of the column containing binary disease status.

- `diseased`  
  The value in `status` indicating the diseased class
  (e.g. `"carrier"`).

- `kernel`  
  Character string; kernel function for smoothing. One of `"gaussian"`,
  `"biweight"`, or `"epanechnikov"`.

- `bw_method`  
  Character string; bandwidth selection method. One of:

  - `"pdf"` – density-based rule-of-thumb (Silverman),
  - `"BHP"` – CDF-based normal-reference (Bowman–Hall–Prvan),
  - `"AR"` – adjusted CDF reference bandwidth,
  - `"AL"` – Altman–Leger CDF plug-in,
  - `"PB"` – Polansky–Baker multistage plug-in.

- `alpha`  
  Numeric; significance level $`\alpha`$. Confidence intervals have
  nominal coverage $`100(1 - \alpha)\%`$ (default: `0.05`, giving 95%
  intervals).

- `logtrans`  
  Logical; if `TRUE`, applies a log-transformation to biomarker values
  prior to ROC estimation (useful for right-skewed biomarkers). Default:
  `FALSE`.

- `grid_n`  
  Integer; number of grid points for evaluating the ROC curve (default:
  `2000`).

### Returned value

[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) returns
an object of class `"smoothROC"` with components including:

- `curve`  
  Data frame giving threshold, false-positive rate (FPR), true-positive
  rate (TPR), and Youden index $`J`$ on the evaluation grid.

- `AUC`, `AUC_ci`, `AUC_ci_lo`, `AUC_ci_hi`  
  Kernel-smoothed AUC estimate and its confidence interval.

- `J`, `J_ci`, `J_ci_lo`, `J_ci_hi`  
  Kernel-smoothed Youden index estimate and its confidence interval.

- `t0`  
  Estimated optimal cutoff associated with the Youden index.

- `sensitivity`, `specificity`  
  Sensitivity and specificity at the Youden cutoff.

- `kernel`, `bandwidth_method`  
  The chosen kernel and bandwidth selection method.

- `hX`, `hY`  
  Selected CDF bandwidths for the non-diseased and diseased groups.

- `plot`  
  A `ggplot2` ROC plot object including the Youden point and a textual
  annotation.

Print, summary, and plot methods are available:

``` r
print.smoothROC()
summary.smoothROC()
plot.smoothROC()
```

and are invoked automatically via
[`print()`](https://rdrr.io/r/base/print.html),
[`summary()`](https://rdrr.io/r/base/summary.html), and
[`plot()`](https://rdrr.io/r/graphics/plot.default.html).

## Method overview

### Kernel CDF estimators and the smooth ROC curve

To obtain a smooth ROC curve while remaining nonparametric,
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) applies
kernel-based CDF estimators of the form
``` math

\hat F(x)
= \frac{1}{m} \sum_{i=1}^m K\!\left(\frac{x - X_i}{h_m}\right),
\quad
\hat G(x)
= \frac{1}{n} \sum_{j=1}^n K\!\left(\frac{x - Y_j}{h_n}\right),
```
where $`K(u) = \int_{-\infty}^u k(v)\,dv`$ is the integrated kernel, and
$`h_m`$, $`h_n`$ are bandwidths for the non-diseased and diseased
groups, respectively. The smooth ROC curve is then obtained by plugging
$`\hat F`$ and $`\hat G`$ into the ROC functional.
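
The plug-in construction can be sketched in base R for the Gaussian kernel, where $`K`$ is `pnorm`. This is an illustration under stated assumptions, not the package's internal code: the helper `kernel_cdf()`, the simulated samples, and the hand-picked bandwidths are all hypothetical.

``` r
# Gaussian-kernel CDF estimator: mean of pnorm((x - X_i) / h) over the sample
kernel_cdf <- function(x, obs, h) {
  vapply(x, function(x0) mean(pnorm((x0 - obs) / h)), numeric(1))
}

set.seed(1)
x_nd <- rnorm(60, mean = 0)     # simulated non-diseased values
y_d  <- rnorm(50, mean = 1.5)   # simulated diseased values
h_m  <- 0.4                     # hypothetical bandwidths, fixed by hand
h_n  <- 0.4

# Evaluate both smoothed survival functions on a common threshold grid
pad  <- 4 * max(h_m, h_n)
grid <- seq(min(x_nd, y_d) - pad, max(x_nd, y_d) + pad, length.out = 500)
smooth_roc <- data.frame(
  threshold = grid,
  FPR = 1 - kernel_cdf(grid, x_nd, h_m),  # 1 - F_hat(t)
  TPR = 1 - kernel_cdf(grid, y_d, h_n)    # 1 - G_hat(t)
)
```

Plotting `TPR` against `FPR` traces the smooth ROC curve. Evaluating both survival functions at the same thresholds avoids inverting $`\hat F`$ explicitly.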

Three univariate kernels are implemented:

- **Gaussian kernel**  
  $`k(u) = (2\pi)^{-1/2}\exp(-u^2/2)`$ with CDF $`K(u)`$ equal to the
  standard normal distribution function. This kernel has infinite
  support and is a default choice in many smoothing problems.

- **Epanechnikov kernel**  
  $`k(u) = \tfrac{3}{4}(1-u^2)\mathbf{1}_{\{|u|\le 1\}}`$, with compact
  support on $`[-1,1]`$. Among non-negative second-order kernels it
  minimizes the asymptotic mean integrated squared error in density
  estimation.

- **Biweight kernel**  
  $`k(u) = \tfrac{15}{16}(1-u^2)^2\mathbf{1}_{\{|u|\le 1\}}`$, a
  second-order, compactly supported kernel. Unlike the Epanechnikov
  kernel, its derivative is continuous at $`u = \pm 1`$, which gives
  smoother estimates near the edge of the kernel's support.
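
For reference, the three kernels and their integrated versions $`K(u)`$ can be written out explicitly. The sketch below is illustrative only (the list `kernels` and its field names are assumptions, not package objects); the closed-form integrals for the compact kernels follow from integrating $`k`$ from $`-1`$ to $`u`$, and the code checks them numerically.

``` r
kernels <- list(
  gaussian = list(k = dnorm, K = pnorm, lower = -Inf),
  epanechnikov = list(
    k = function(u) 0.75 * (1 - u^2) * (abs(u) <= 1),
    K = function(u) {
      u <- pmin(pmax(u, -1), 1)            # K is 0 below -1 and 1 above 1
      0.5 + 0.75 * u - 0.25 * u^3
    },
    lower = -1
  ),
  biweight = list(
    k = function(u) (15 / 16) * (1 - u^2)^2 * (abs(u) <= 1),
    K = function(u) {
      u <- pmin(pmax(u, -1), 1)
      0.5 + (15 / 16) * (u - 2 * u^3 / 3 + u^5 / 5)
    },
    lower = -1
  )
)

# Compare each closed-form K(u0) with numerical integration of k up to u0
u0 <- 0.3
check <- vapply(kernels, function(kern) {
  c(closed_form = kern$K(u0),
    numerical   = integrate(kern$k, lower = kern$lower, upper = u0)$value)
}, numeric(2))
check
```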

### Bandwidth selection strategies

Bandwidth selection is critical for balancing bias and variance in the
smoothed CDFs and the resulting ROC curve.
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) focuses
on bandwidths that are optimal for *CDF* estimation (rather than
densities), which aligns more directly with ROC functionals.

The `bw_method` argument implements several strategies:

- `"pdf"`  
  A density-based rule-of-thumb (Silverman) using a kernel density
  bandwidth. Convenient and widely used, but its $`m^{-1/5}`$ rate is
  slower than the $`m^{-1/3}`$ rate that is optimal for CDF estimation,
  so it tends to oversmooth the CDF in large samples.

- `"BHP"`  
  A CDF-based normal-reference bandwidth that approximately minimizes
  the integrated mean squared error of $`\hat F`$. It uses a robust
  scale estimate based on $`\min(\mathrm{SD}, \mathrm{IQR}/1.34)`$.

- `"AR"`  
  An adjusted CDF reference bandwidth for the Gaussian kernel, obtained
  by shrinking the normal-reference constant to reduce oversmoothing for
  non-Gaussian data while preserving the $`m^{-1/3}`$ CDF rate.

- `"AL"`  
  A fully data-driven CDF-based bandwidth in the spirit of Altman and
  Leger, where the unknown roughness functional is estimated using an
  auxiliary kernel estimator.

- `"PB"`  
  A multistage plug-in CDF-based bandwidth (a two-stage version of
  Polansky and Baker) that uses an initial normal-reference pilot
  followed by a data-driven refinement.

In simulation studies (not shown here), the `"PB"` method often provides
stable performance across a range of underlying distributions,
especially when sample sizes are small to moderate.
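
The difference between a density-based and a CDF-based reference rule can be illustrated with a short sketch. It assumes a Gaussian kernel and a normal reference distribution, for which minimizing the asymptotic integrated mean squared error of the kernel CDF estimator gives $`h = 4^{1/3}\sigma m^{-1/3}`$. The constants used inside `smoothROC()` may differ, so treat the numbers as indicative rather than as the package's `"pdf"` or `"BHP"` output.

``` r
set.seed(1)
x <- rnorm(100)                       # simulated sample
m <- length(x)
s <- min(sd(x), IQR(x) / 1.34)        # robust scale estimate

# Silverman's density rule of thumb (identical to stats::bw.nrd0)
h_pdf <- 0.9 * s * m^(-1 / 5)

# Normal-reference CDF bandwidth for the Gaussian kernel (assumed constant 4^(1/3))
h_cdf <- 4^(1 / 3) * s * m^(-1 / 3)

# The ratio h_cdf / h_pdf shrinks with m because of the faster m^(-1/3) rate
sizes <- c(50, 500, 5000)
ratio <- (4^(1 / 3) * sizes^(-1 / 3)) / (0.9 * sizes^(-1 / 5))
data.frame(m = sizes, ratio = ratio)
```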

### AUC estimation and kernel DeLong-type variance

Given the kernel CDFs,
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) computes
a kernel-based AUC estimator
``` math

\hat \delta
= \int_{-\infty}^{\infty} \hat F(x)\, d\hat G(x),
```
which is asymptotically equivalent to the empirical AUC based on the
Mann–Whitney statistic, provided the bandwidths shrink to zero at a
suitable rate. Under that condition, classical large-sample results for
the empirical AUC carry over to the smoothed setting.

To quantify uncertainty,
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md)
implements a kernel-smoothed analogue of DeLong’s variance estimator.
Instead of using empirical placement values, the variance expression
replaces them by their kernel-smoothed counterparts. This typically
yields a more stable variance estimate in small samples while retaining
the large-sample properties of the original DeLong method.

For a fitted object `roc` (as created in the example below), the AUC
estimate and confidence interval are available via:

``` r
roc$AUC
roc$AUC_ci
```
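
To make the construction concrete, here is a hedged sketch for the Gaussian kernel. In that case the integral has a closed form: each pair contributes $`\Phi\{(Y_j - X_i)/\sqrt{h_m^2 + h_n^2}\}`$ in place of the indicator $`\mathbf{1}\{Y_j > X_i\}`$. The function `smooth_auc()` and its arguments are hypothetical, and the variance is a DeLong-style estimate built from smoothed placement values; it is meant to illustrate the idea and may not match the package's estimator in detail.

``` r
smooth_auc <- function(x, y, hx, hy, alpha = 0.05) {
  m <- length(x)                      # non-diseased sample size
  n <- length(y)                      # diseased sample size
  # psi[j, i]: smoothed version of the indicator 1{Y_j > X_i}
  psi <- pnorm(outer(y, x, "-") / sqrt(hx^2 + hy^2))
  auc <- mean(psi)
  v10 <- rowMeans(psi)                # smoothed placement values, diseased
  v01 <- colMeans(psi)                # smoothed placement values, non-diseased
  se  <- sqrt(var(v10) / n + var(v01) / m)
  z   <- qnorm(1 - alpha / 2)
  list(AUC = auc, se = se, ci = c(auc - z * se, auc + z * se))
}

set.seed(2)
x <- rnorm(60, mean = 0)              # simulated non-diseased values
y <- rnorm(50, mean = 1.5)            # simulated diseased values
fit <- smooth_auc(x, y, hx = 0.4, hy = 0.4)
fit$AUC
fit$ci
```

As the bandwidths tend to zero, `psi` reduces to the 0/1 indicator and the estimate coincides with the Mann-Whitney AUC.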

### Youden index and optimal cutoff

The Youden index
``` math

J
= \max_t \{\mathrm{sensitivity}(t) + \mathrm{specificity}(t) - 1\}
= \max_t \{F(t) - G(t)\}
```
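summarizes the best achievable trade-off between sensitivity and
specificity. The second equality holds because, at cutoff $`t`$,
sensitivity is $`\bar G(t) = 1 - G(t)`$ and specificity is $`F(t)`$. By
construction $`J \in [0,1]`$, with $`J = 0`$ indicating no
discrimination and $`J = 1`$ perfect separation. The corresponding
optimal cutoff is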
``` math

t_0
= \operatorname*{arg\,max}_t \{F(t) - G(t)\}.
```

Using the kernel CDFs,
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) computes
a smoothed Youden index $`\hat J`$ and the maximizing cutoff
$`\hat t_0`$ on a grid of thresholds. When multiple cutoffs achieve the
same maximum, secondary rules (favoring higher sensitivity or
specificity) or median-based summaries can be used. A delta-method
approximation provides the variance of $`\hat J`$, from which a
Wald-type confidence interval is constructed.

For a fitted object `roc`, these quantities are returned as:

``` r
roc$J
roc$J_ci
roc$t0
roc$sensitivity
roc$specificity
```
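
The grid search itself is simple to sketch. The code below uses Gaussian-kernel CDFs with hand-picked bandwidths on simulated data; the helper `kernel_cdf()` and all variable names are assumptions made for illustration, and ties are resolved by taking the first maximizer rather than by any of the secondary rules mentioned above.

``` r
kernel_cdf <- function(x, obs, h) {
  vapply(x, function(x0) mean(pnorm((x0 - obs) / h)), numeric(1))
}

set.seed(1)
x <- rnorm(80, mean = 0)              # simulated non-diseased values
y <- rnorm(60, mean = 2)              # simulated diseased values
hx <- 0.3                             # hypothetical bandwidths
hy <- 0.3

grid   <- seq(min(x, y), max(x, y), length.out = 2000)
F_hat  <- kernel_cdf(grid, x, hx)
G_hat  <- kernel_cdf(grid, y, hy)
J_grid <- F_hat - G_hat               # smoothed sensitivity + specificity - 1

idx    <- which.max(J_grid)           # first grid point attaining the maximum
J_hat  <- J_grid[idx]
t0_hat <- grid[idx]
sens   <- 1 - G_hat[idx]              # sensitivity at the estimated cutoff
spec   <- F_hat[idx]                  # specificity at the estimated cutoff
c(J = J_hat, t0 = t0_hat, sensitivity = sens, specificity = spec)
```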

## Example: Duchenne muscular dystrophy dataset

The package includes an example dataset, `dystrophy`, with biomarker
measurements for Duchenne muscular dystrophy (DMD) carriers and
non-carriers. We treat the serum marker **CK** as the primary biomarker
and **Class** as the disease status.

``` r
data(dystrophy)
str(dystrophy)
#> spc_tbl_ [209 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ OBS   : num [1:209] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ HospID: num [1:209] 1007 786 778 1306 895 ...
#>  $ AGE   : num [1:209] 22 32 36 22 23 30 27 30 25 26 ...
#>  $ M     : num [1:209] 6 8 7 11 1 5 8 11 10 2 ...
#>  $ Y     : num [1:209] 79 78 78 79 78 79 78 78 79 79 ...
#>  $ CK    : num [1:209] 52 20 28 30 40 24 15 22 42 130 ...
#>  $ H     : num [1:209] 83.5 77 86.5 104 83 78.8 87 91 65.5 80.3 ...
#>  $ PK    : num [1:209] 10.9 11 13.2 22.6 15.2 9.6 13.5 17.5 13.3 17.1 ...
#>  $ LD    : num [1:209] 176 200 171 230 205 151 232 198 216 211 ...
#>  $ Class : Factor w/ 2 levels "normal","carrier": 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "spec")=List of 3
#>   ..$ cols   :List of 10
#>   .. ..$ OBS   : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ HospID: list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ AGE   : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ M     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ Y     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ CK    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ H     : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ PK    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ LD    : list()
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
#>   .. ..$ Class :List of 3
#>   .. .. ..$ levels    : chr [1:2] "normal" "carrier"
#>   .. .. ..$ ordered   : logi FALSE
#>   .. .. ..$ include_na: logi FALSE
#>   .. .. ..- attr(*, "class")= chr [1:2] "collector_factor" "collector"
#>   ..$ default: list()
#>   .. ..- attr(*, "class")= chr [1:2] "collector_guess" "collector"
#>   ..$ delim  : chr ","
#>   ..- attr(*, "class")= chr "col_spec"
#>  - attr(*, "problems")=<externalptr>
```

A basic smooth ROC analysis is:

``` r
roc <- smoothROC(
  data      = dystrophy,
  biomarker = "CK",
  status    = "Class",
  diseased  = "carrier",
  kernel    = "biweight",
  bw_method = "PB",
  alpha     = 0.05,
  logtrans  = TRUE,
  grid_n    = 2000
)
```

### Inspecting the result

The print and summary methods provide a concise summary:

``` r
roc
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
summary(roc)
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
```

They report the kernel and bandwidth method, AUC with confidence
interval, the Youden index and its confidence interval, and the Youden
point (FPR, TPR, cutoff, sensitivity, specificity).

We can visualize the ROC curve:

``` r
plot(roc)
```

![ROC curve produced by
smoothROC](smoothROC_files/figure-html/roc-plot2-1.png)

ROC curve produced by smoothROC

By default, the plot includes:

- the 45-degree reference line (no-discrimination),
- the smooth ROC curve,
- the Youden point marked in red, and
- a label showing AUC, its confidence interval, the Youden index, and
  the Youden cutoff.

If you prefer a clean ROC curve without annotation, use:

``` r
plot(roc, label = FALSE)
```

![ROC curve produced by
smoothROC](smoothROC_files/figure-html/roc-plot3-1.png)

ROC curve produced by smoothROC

The underlying ROC data and key summaries can be accessed directly:

``` r
head(roc$curve)     # threshold, FPR, TPR, J
#>    threshold FPR TPR J
#> 1 -0.2326416   1   1 0
#> 2 -0.2274720   1   1 0
#> 3 -0.2223023   1   1 0
#> 4 -0.2171326   1   1 0
#> 5 -0.2119629   1   1 0
#> 6 -0.2067933   1   1 0
roc$AUC             # AUC estimate
#> [1] 0.8646697
roc$AUC_ci          # AUC confidence interval
#> [1] 0.8105293 0.9188102
roc$J               # Youden index estimate
#> [1] 0.5723741
roc$J_ci            # Youden index CI
#> [1] 0.4542642 0.6904840
roc$t0              # Youden cutoff
#> [1] 4.068528
roc$sensitivity     # Sensitivity at Youden point
#> [1] 0.7036682
roc$specificity     # Specificity at Youden point
#> [1] 0.8687059
roc$hX              # Bandwidth for non-diseased CDF
#> [1] 0.3146762
roc$hY              # Bandwidth for diseased CDF
#> [1] 0.9802306
```

## Advanced options

This section summarizes the more technical aspects of
[`smoothROC()`](https://smoothroc.local/reference/smoothROC.md) that may
be useful for advanced users.

### Log transformation

Setting `logtrans = TRUE` applies a natural log transformation to the
biomarker prior to ROC estimation. This is often appropriate for
biomarkers with strong right skew or multiplicative variability
(e.g. enzyme concentrations, cytokines). When `logtrans = TRUE`,
biomarker values must be strictly positive.
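
In the example above, `roc$t0` is 4.068528 while the CK values are in the tens to hundreds, which suggests that thresholds are reported on the log scale when `logtrans = TRUE`. Assuming that is the case, a cutoff can be mapped back to the original scale with `exp()`. The sketch below uses the first ten CK values shown in the `str()` output; the variable names are illustrative.

``` r
ck_head <- c(52, 20, 28, 30, 40, 24, 15, 22, 42, 130)  # first CK values from str(dystrophy)

# The log transformation requires strictly positive biomarker values
stopifnot(all(ck_head > 0))
log_ck <- log(ck_head)

# Back-transform the Youden cutoff, assuming it is on the natural-log scale
t0_log <- 4.068528        # roc$t0 from the example
t0_raw <- exp(t0_log)     # roughly 58.5 in CK units
t0_raw
```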

### Controlling the ROC grid

The argument `grid_n` controls the resolution of the ROC curve. Larger
values produce a smoother ROC curve and more precise localization of the
Youden cutoff, at the cost of increased computation. Reasonable values
include:

- `grid_n = 1000` – fast and adequate for exploratory work;
- `grid_n = 2000` – default, smoother curve and better stability;
- `grid_n = 5000` – more refined, useful in simulation studies.
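
The effect of grid resolution on cutoff localization can be seen in a case where the answer is known. For $`X \sim N(0,1)`$ and $`Y \sim N(2,1)`$ the population Youden curve $`F(t) - G(t)`$ is maximized at $`t = 1`$. The sketch below searches that curve on grids of different sizes; the function `youden_on_grid()` and the threshold range are hypothetical choices for illustration.

``` r
# Grid search on the population Youden curve for N(0,1) versus N(2,1)
youden_on_grid <- function(grid_n, lower = -4, upper = 7) {
  grid <- seq(lower, upper, length.out = grid_n)
  j <- pnorm(grid, mean = 0) - pnorm(grid, mean = 2)
  c(grid_n = grid_n,
    step   = grid[2] - grid[1],       # spacing between adjacent thresholds
    t0     = grid[which.max(j)])      # grid point maximizing F(t) - G(t)
}

res <- t(vapply(c(1000, 2000, 5000), youden_on_grid, numeric(3)))
res
```

The located cutoff can differ from the true maximizer by at most about one grid step, so the step size bounds the error that is due to the grid alone.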

### Choosing a bandwidth method

For most applications, a good starting point is:

- `kernel = "biweight"`,
- `bw_method = "PB"`.

The `"PB"` method tends to perform well across a range of scenarios and
sample sizes. When computation time is a concern, the `"AR"` method is a
simple reference-rule alternative, although it is derived for the
Gaussian kernel.

In large samples, the differences between bandwidth methods may be
minor. However, in small or moderate samples, bandwidth selection can
substantially affect ROC shape, AUC estimates, and the stability of the
Youden index.
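
One way to gauge this sensitivity on a given dataset is to refit with each bandwidth method and compare the headline summaries. The sketch below uses only the arguments and components documented above, with the Gaussian kernel so that every method applies. It is guarded so that it does nothing if the package is not installed, and the object names (`bw_methods`, `fits`, `comparison`) are illustrative.

``` r
bw_methods <- c("pdf", "BHP", "AR", "AL", "PB")

if (requireNamespace("smoothROC", quietly = TRUE)) {
  data("dystrophy", package = "smoothROC")

  # Fit the same model once per bandwidth method
  fits <- lapply(bw_methods, function(bw) {
    smoothROC::smoothROC(
      data      = dystrophy,
      biomarker = "CK",
      status    = "Class",
      diseased  = "carrier",
      kernel    = "gaussian",
      bw_method = bw,
      logtrans  = TRUE
    )
  })

  # Collect AUC, Youden index, cutoff, and bandwidths side by side
  comparison <- data.frame(
    bw_method = bw_methods,
    AUC = vapply(fits, function(f) f$AUC, numeric(1)),
    J   = vapply(fits, function(f) f$J, numeric(1)),
    t0  = vapply(fits, function(f) f$t0, numeric(1)),
    hX  = vapply(fits, function(f) f$hX, numeric(1)),
    hY  = vapply(fits, function(f) f$hY, numeric(1))
  )
  print(comparison)
}
```

Large differences across rows would indicate that conclusions depend on the bandwidth choice and deserve a closer look.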

## References

- Altman, N., & Leger, C. (1995). Bandwidth selection for kernel
  distribution function estimation. *Journal of Statistical Planning and
  Inference*, 46(2), 195–214.

- Andrews, D. F., & Herzberg, A. M. (2012). *Data: A Collection of
  Problems from Many Fields for the Student and Research Worker*.
  Springer.

- Bowman, A., Hall, P., & Prvan, T. (1998). Bandwidth selection for the
  smoothing of distribution functions. *Biometrika*, 85(4), 799–808.

- DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988).
  Comparing the areas under two or more correlated receiver operating
  characteristic curves: a nonparametric approach. *Biometrics*, 44(3),
  837–845.

- Khan, R. A., & Ghebremichael, M. (2025). Smooth ROC Curve Estimation.
  *Journal Name* (preprint).

- Lloyd, C. J. (1998). Using smoothed receiver operating characteristic
  curves to summarize and compare diagnostic systems. *Journal of the
  American Statistical Association*, 93(444), 1356–1364.

- Polansky, A. M., & Baker, E. R. (2000). Multistage plug-in bandwidth
  selection for kernel distribution function estimates. *Journal of
  Statistical Computation and Simulation*, 65(1–4), 63–80.

- Silverman, B. W. (1986). *Density Estimation for Statistics and Data
  Analysis*. Chapman & Hall, London.

- Youden, W. J. (1950). Index for rating diagnostic tests. *Cancer*,
  3(1), 32–35.

- Zhou, X.-H., & Harezlak, J. (2002). Comparison of bandwidth selection
  methods for kernel smoothing of ROC curves. *Statistics in Medicine*,
  21(14), 2045–2055.

- Zou, K. H., Hall, W. J., & Shapiro, D. E. (1997). Smooth nonparametric
  receiver operating characteristic (ROC) curves for continuous
  diagnostic tests. *Statistics in Medicine*, 16(19), 2143–2156.
