# Kernel-Smoothed ROC Curves with AUC and Youden Index Summary

This function estimates a smooth receiver operating characteristic (ROC)
curve for a continuous biomarker by kernel smoothing the cumulative
distribution functions (CDFs) of the non-diseased and diseased groups.
It returns the smooth ROC curve, a kernel-based estimate of the area
under the curve (AUC) with its confidence interval, and the
kernel-smoothed Youden index with its confidence interval and optimal
cutoff.

## Usage

``` r
smoothROC(
  data,
  biomarker,
  status,
  diseased,
  kernel = c("gaussian", "biweight", "epanechnikov"),
  bw_method = c("pdf", "AL", "PB", "BHP", "AR"),
  alpha = 0.05,
  logtrans = FALSE,
  grid_n = 2000
)

# S3 method for class 'smoothROC'
print(x, ...)

# S3 method for class 'smoothROC'
plot(x, label = TRUE, ...)

# S3 method for class 'smoothROC'
summary(object, ...)
```

## Arguments

- data:

  A data frame containing the biomarker and status variables.

- biomarker:

  Character string; name of the numeric column containing biomarker
  values.

- status:

  Character string; name of the column containing binary disease status.

- diseased:

  The value in `status` indicating the diseased class.

- kernel:

  Character string; kernel function for smoothing. One of `"gaussian"`,
  `"biweight"`, or `"epanechnikov"`.

- bw_method:

  Character string; bandwidth selection method. One of `"pdf"`
  (Silverman rule for density), `"AL"` (Altman-Leger), `"PB"`
  (Polansky-Baker multistage), `"BHP"` (normal-reference), or `"AR"`
  (rule-of-thumb).

- alpha:

  Numeric; significance level for the 100(1 - alpha)% confidence
  intervals (default: 0.05).

- logtrans:

  Logical; apply log-transformation to biomarker values? (default:
  FALSE).

- grid_n:

  Integer; number of grid points for evaluating the ROC curve (default:
  2000).

## Value

An object of class `"smoothROC"` with components that include:

- `curve`: data frame with the threshold, false-positive rate (`FPR`),
  true-positive rate (`TPR`), and Youden statistic (`J`) at each point
  of the cutoff grid.

- `AUC`: kernel-smoothed AUC estimate.

- `AUC_ci`: confidence interval for the AUC based on the kernel-smoothed
  DeLong-type variance.

- `J`, `J_ci`: Youden index estimate and its confidence interval.

- `t0`: estimated optimal cutoff associated with the Youden index.

- `sensitivity`, `specificity`: estimated sensitivity and specificity
  at the Youden point.

- `plot`: ggplot object of the ROC curve.

- `hX`, `hY`: selected CDF bandwidths for the non-diseased and
  diseased groups.

Print, plot, and summary methods are available for objects of class
`"smoothROC"`.

## Details

Let \(X\) and \(Y\) denote biomarker values from non-diseased and
diseased subjects with CDFs \(F\) and \(G\), and survival functions
\(\bar F = 1 - F\) and \(\bar G = 1 - G\). The ROC curve is defined as
$$\mathrm{ROC}(p) = \bar G\{\bar F^{-1}(p)\}, \quad p \in [0,1],$$ that
is, the true positive rate plotted against the false positive rate as
the threshold varies over the real line. The AUC is the integral of the
ROC curve over \([0,1]\) and can be written as $$\mathrm{AUC} =
\int_{-\infty}^{\infty} F(x)\, dG(x) = P(Y > X),$$ representing the
probability that a randomly chosen diseased subject has a higher
biomarker value than a randomly chosen non-diseased subject.
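
As a minimal base-R sketch of these two definitions (simulated data and hypothetical variable names, not the package's internal code), the empirical ROC curve and the identity between the area under it and \(P(Y > X)\) can be checked as follows:

``` r
# Simulated biomarker values (illustrative only, not package data)
set.seed(1)
x <- rnorm(60, mean = 0)    # non-diseased
y <- rnorm(40, mean = 1)    # diseased

# Empirical CDFs F and G
F_emp <- ecdf(x)
G_emp <- ecdf(y)

# Empirical ROC: TPR = 1 - G(t) against FPR = 1 - F(t) over all thresholds
thr <- sort(unique(c(-Inf, x, y, Inf)))
fpr <- 1 - F_emp(thr)
tpr <- 1 - G_emp(thr)

# AUC as P(Y > X), counting ties as 1/2 (Mann-Whitney form)
diffs  <- outer(y, x, "-")
auc_mw <- mean((diffs > 0) + 0.5 * (diffs == 0))

# The same area from trapezoidal integration of the step curve
o <- order(fpr, tpr)
auc_trap <- sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)
```

Here `auc_mw` and `auc_trap` agree up to floating-point error, which is the stepwise curve that the kernel estimators are meant to smooth.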

Nonparametric ROC estimation based on the empirical distributions of
\(X\) and \(Y\) yields a stepwise ROC curve that may be jagged and
sensitive to sampling variability, especially in small or moderate
samples. To obtain a smooth ROC curve while avoiding parametric
distributional assumptions, this function applies kernel-based CDF
estimators of the form $$\hat F(x) = \frac{1}{m} \sum_{i=1}^m
K\!\left(\frac{x - X_i}{h_m}\right), \quad \hat G(x) = \frac{1}{n}
\sum_{j=1}^n K\!\left(\frac{x - Y_j}{h_n}\right),$$ where \(K\) is
the integrated kernel and \(h_m\), \(h_n\) are bandwidths for the
non-diseased and diseased groups. The smooth ROC curve is then obtained
by plugging \(\hat F\) and \(\hat G\) into the ROC functional.
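
The plug-in construction can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the helper `kcdf()`, the simulated data, and the hand-picked bandwidths are all hypothetical, and only the Gaussian integrated kernel (`pnorm`) is used.

``` r
set.seed(2)
x <- rnorm(60, mean = 0)   # non-diseased
y <- rnorm(40, mean = 1)   # diseased

# Kernel CDF estimator with the Gaussian integrated kernel K = pnorm
kcdf <- function(t, obs, h) {
  vapply(t, function(ti) mean(pnorm((ti - obs) / h)), numeric(1))
}

# Bandwidths fixed by hand for illustration only
h_m <- 0.4
h_n <- 0.4

# Evaluate both smoothed CDFs on a common threshold grid
grid  <- seq(min(x, y) - 3 * h_m, max(x, y) + 3 * h_m, length.out = 500)
F_hat <- kcdf(grid, x, h_m)
G_hat <- kcdf(grid, y, h_n)

# Smooth ROC curve: (FPR, TPR) = (1 - F_hat, 1 - G_hat) along the grid
smooth_roc <- data.frame(threshold = grid, FPR = 1 - F_hat, TPR = 1 - G_hat)
```

Extending the grid three bandwidths beyond the data range lets the curve run close to (1, 1) and (0, 0) at its two ends.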

Bandwidth selection is critical for balancing bias and variance in the
smoothed CDFs and the resulting ROC curve. The argument `bw_method`
implements several rules:

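- `"pdf"`: a density-based rule of thumb (Silverman) that reuses a kernel
  density bandwidth. It is convenient, but its \(m^{-1/5}\) rate decays
  more slowly than the \(m^{-1/3}\) rate that is optimal for CDF
  estimation, so it does not satisfy the usual asymptotic conditions for
  CDF estimation.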

- `"BHP"`: a CDF-based normal-reference bandwidth that minimizes an
  approximation to the integrated mean squared error of \(\hat F\), with
  a robust scale estimate based on `min(SD, IQR/1.34)`.

- `"AR"`: an adjusted CDF reference bandwidth for the Gaussian kernel,
  obtained by shrinking the normal-reference constant to improve
  performance for non-Gaussian data while preserving the \(m^{-1/3}\)
  CDF rate.

- `"AL"`: a fully data-driven CDF-based bandwidth in the spirit of
  Altman and Leger, where the unknown roughness functional is estimated
  via an auxiliary kernel estimator.

- `"PB"`: a multistage plug-in CDF-based bandwidth (two-stage version of
  Polansky and Baker) that uses a pilot normal-reference step followed
  by data-driven refinement.

These choices focus on optimal smoothing of the CDFs rather than the
density, which is more directly aligned with ROC curve estimation.
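
To make the difference between density-rate and CDF-rate bandwidths concrete, the sketch below compares Silverman's rule with a textbook normal-reference CDF bandwidth for a Gaussian kernel. The constant \(4^{1/3} \approx 1.587\) is the standard result of minimizing the asymptotic integrated MSE of a Gaussian-kernel CDF estimator under normal data; it is an assumption here and may differ from the constants the package uses for `"BHP"` or `"AR"`.

``` r
set.seed(3)
x <- rnorm(80)
m <- length(x)

# Robust scale estimate, min(SD, IQR / 1.34)
sigma_hat <- min(sd(x), IQR(x) / 1.34)

# Silverman's density rule of thumb: order m^(-1/5)
# (this is the same quantity returned by stats::bw.nrd0)
h_pdf <- 0.9 * sigma_hat * m^(-1/5)

# Normal-reference CDF bandwidth for a Gaussian kernel: order m^(-1/3)
# (textbook constant; assumed, not taken from the package source)
h_cdf <- 4^(1/3) * sigma_hat * m^(-1/3)

# The ratio h_cdf / h_pdf shrinks like m^(-2/15) as the sample grows
ratio <- h_cdf / h_pdf
```

Because the ratio decreases with the sample size, a density bandwidth oversmooths the CDF more and more, relative to the CDF-optimal choice, as `m` increases.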

The smooth ROC estimators in this package are based on kernel CDF
estimators constructed from a univariate kernel function \(k(u)\) and
its integral \(K(u) = \int_{-\infty}^u k(v)\,dv\). The following
kernels are implemented:

- **Gaussian kernel**: \(k(u) = (2\pi)^{-1/2}\exp(-u^2/2)\), with CDF
  \(K(u)\) equal to the standard normal distribution function. This
  kernel has infinite support and is often used as a default choice in
  smooth distribution and ROC estimation.

- **Epanechnikov kernel**: \(k(u) =
  \tfrac{3}{4}(1-u^2)\mathbf{1}_{\{|u|\le 1\}}\), with compact
  support on \([-1,1]\). Among second-order kernels it minimizes the
  asymptotic mean integrated squared error in density estimation.

- **Biweight kernel**: \(k(u) =
  \tfrac{15}{16}(1-u^2)^2\mathbf{1}_{\{|u|\le 1\}}\), a smoother
  second-order alternative with support on \([-1,1]\) that produces
  more rounded ROC and CDF estimates near the boundaries.
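
For reference, integrating each density from \(-\infty\) to \(u\) gives closed-form integrated kernels. The sketch below writes them as standalone R functions (the function names are hypothetical and not part of the package) and checks the two compact-support forms numerically against their densities.

``` r
# Integrated Gaussian kernel: the standard normal CDF
K_gaussian <- function(u) pnorm(u)

# Integrated Epanechnikov kernel: (2 + 3u - u^3) / 4 on [-1, 1]
K_epanechnikov <- function(u) {
  u <- pmin(pmax(u, -1), 1)   # clamp so that K = 0 below -1 and K = 1 above 1
  (2 + 3 * u - u^3) / 4
}

# Integrated biweight kernel: (8 + 15u - 10u^3 + 3u^5) / 16 on [-1, 1]
K_biweight <- function(u) {
  u <- pmin(pmax(u, -1), 1)
  (8 + 15 * u - 10 * u^3 + 3 * u^5) / 16
}

# Numerical check: integrate the kernel densities up to u = 0.3
k_epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
k_biw  <- function(u) (15 / 16) * (1 - u^2)^2 * (abs(u) <= 1)
check_epan <- integrate(k_epan, -1, 0.3)$value
check_biw  <- integrate(k_biw, -1, 0.3)$value
```

Both `check_epan` and `check_biw` should match `K_epanechnikov(0.3)` and `K_biweight(0.3)` up to the tolerance of `integrate()`.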

The AUC is estimated from the kernel CDFs via $$\hat \delta =
\int_{-\infty}^{\infty} \hat F(x)\, d\hat G(x),$$ and is
asymptotically equivalent to the empirical AUC based on the Mann-Whitney
statistic. To quantify uncertainty, a kernel-smoothed analogue of
DeLong's variance estimator is used: empirical placement values are
replaced by their kernel-smoothed counterparts, producing a more stable
variance estimate in small samples while retaining the large-sample
properties of the Mann-Whitney-based estimator.
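
For the Gaussian kernel the integral has a closed form: each pair \((X_i, Y_j)\) contributes \(\Phi\{(Y_j - X_i)/\sqrt{h_m^2 + h_n^2}\}\). The sketch below uses this to compute a smoothed AUC and a DeLong-type interval from smoothed placement values. It is one plausible reading of the description above, with simulated data and hand-picked bandwidths, and it may differ in detail from the package's estimator.

``` r
set.seed(5)
x <- rnorm(60, mean = 0)   # non-diseased
y <- rnorm(40, mean = 1)   # diseased
m <- length(x)
n <- length(y)
h_m <- 0.4                 # illustrative bandwidths, chosen by hand
h_n <- 0.4
alpha <- 0.05

# Smoothed pairwise kernel: n x m matrix, rows = diseased, cols = non-diseased
psi <- pnorm(outer(y, x, "-") / sqrt(h_m^2 + h_n^2))
auc_smooth <- mean(psi)

# Empirical (Mann-Whitney) AUC for comparison
auc_emp <- mean(outer(y, x, ">"))

# DeLong-type variance with smoothed placement values (assumed form)
v_y <- rowMeans(psi)       # one placement value per diseased subject
v_x <- colMeans(psi)       # one placement value per non-diseased subject
se  <- sqrt(var(v_y) / n + var(v_x) / m)

# Wald-type confidence interval
z <- qnorm(1 - alpha / 2)
auc_ci <- c(auc_smooth - z * se, auc_smooth + z * se)
```

Replacing `psi` with the indicator matrix `outer(y, x, ">")` recovers the usual unsmoothed DeLong calculation, which makes the role of the smoothing explicit.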

The Youden index is defined as $$J = \max_t
\{\mathrm{sensitivity}(t) + \mathrm{specificity}(t) - 1\} = \max_t
\{F(t) - G(t)\},$$ with corresponding optimal cutoff \(t_0 =
\arg\max_t \{F(t) - G(t)\}\). The second equality holds because a
subject is classified as diseased when the biomarker exceeds \(t\), so
that sensitivity is \(1 - G(t)\) and specificity is \(F(t)\). Using the
kernel CDFs, the function computes a smoothed Youden index \(\hat J\)
and its maximizing cutoff \(\hat t_0\) on a search grid. When multiple
cutoffs achieve the same maximum, secondary criteria can be applied
(e.g., favoring higher sensitivity or higher specificity), or the median
of all maximizers can be reported. A delta-method approximation is used
for the variance of \(\hat J\), from which a Wald-type confidence
interval is obtained.
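
The grid search itself is short. The following sketch (Gaussian kernel, simulated data, hand-picked bandwidths, and a hypothetical `kcdf()` helper) maximizes \(\hat F(t) - \hat G(t)\) over a grid and uses the median-of-maximizers rule for ties; the delta-method variance is not reproduced here.

``` r
set.seed(6)
x <- rnorm(60, mean = 0)   # non-diseased
y <- rnorm(40, mean = 1)   # diseased
h_m <- 0.4                 # illustrative bandwidths, chosen by hand
h_n <- 0.4

# Gaussian kernel CDF estimator
kcdf <- function(t, obs, h) {
  vapply(t, function(ti) mean(pnorm((ti - obs) / h)), numeric(1))
}

# Evaluate J(t) = F_hat(t) - G_hat(t) on a search grid
grid   <- seq(min(x, y), max(x, y), length.out = 2000)
J_grid <- kcdf(grid, x, h_m) - kcdf(grid, y, h_n)

# Smoothed Youden index and its cutoff; ties resolved by the median maximizer
J_hat  <- max(J_grid)
t0_hat <- median(grid[J_grid == J_hat])

# Sensitivity and specificity at the estimated cutoff
sens_hat <- 1 - kcdf(t0_hat, y, h_n)
spec_hat <- kcdf(t0_hat, x, h_m)
```

By construction `sens_hat + spec_hat - 1` reproduces `J_hat` whenever the maximizer is unique, which is the typical case for smooth CDF estimates.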

## References

Khan, R. A., & Ghebremichael, M. (2025). Smooth ROC Curve Estimation.
*Journal Name*. (Preprint)

Zou, K. H., Hall, W. J., & Shapiro, D. E. (1997). Smooth non-parametric
receiver operating characteristic (ROC) curves for continuous diagnostic
tests. *Statistics in Medicine*, 16(19), 2143-2156.

Lloyd, C. J. (1998). Using smoothed receiver operating characteristic
curves to summarize and compare diagnostic systems. *Journal of the
American Statistical Association*, 93(444), 1356-1364.

Zhou, X.-H., & Harezlak, J. (2002). Comparison of bandwidth selection
methods for kernel smoothing of ROC curves. *Statistics in Medicine*,
21(14), 2045-2055.

DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing
the areas under two or more correlated receiver operating characteristic
curves: a nonparametric approach. *Biometrics*, 44(3), 837-845.

Youden, W. J. (1950). Index for rating diagnostic tests. *Cancer*, 3(1),
32-35.

Altman, N., & Leger, C. (1995). Bandwidth selection for kernel
distribution function estimation. *Journal of Statistical Planning and
Inference*, 46(2), 195-214.

Bowman, A., Hall, P., & Prvan, T. (1998). Bandwidth selection for the
smoothing of distribution functions. *Biometrika*, 85(4), 799-808.

Polansky, A. M., & Baker, E. R. (2000). Multistage plug-in bandwidth
selection for kernel distribution function estimates. *Journal of
Statistical Computation and Simulation*, 65(1-4), 63-80.

Silverman, B. W. (1986). *Density Estimation for Statistics and Data
Analysis*. Chapman & Hall, London.

Andrews, D. F., & Herzberg, A. M. (2012). *Data: A Collection of
Problems from Many Fields for the Student and Research Worker*. Springer
Science & Business Media.

The example dataset
[`dystrophy`](https://smoothroc.local/reference/dystrophy.md) (alias:
`dystrophyData`) contains serum biomarkers for Duchenne muscular
dystrophy carriers and non-carriers and is used to illustrate
kernel-smoothed ROC analysis.

## Examples

``` r
data(dystrophy)

roc <- smoothROC(
  data      = dystrophy,
  biomarker = "CK",
  status    = "Class",
  diseased  = "carrier",
  kernel    = "biweight",
  bw_method = "PB",
  alpha     = 0.05,
  logtrans  = TRUE,
  grid_n    = 2000
)

## Basic use
print(roc)         # Summary print method
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
summary(roc)       # Summary method
#> Kernel-smoothed ROC
#> Kernel: biweight
#> Bandwidth: PB
#> AUC = 0.8647 (95.0% CI: 0.8105, 0.9188)
#> Youden Index, J = 0.5724 (95.0% CI: 0.4543, 0.6905)
#> Youden point: (FPR = 0.1313, TPR = 0.7037)
#> Threshold (cutoff) = 4.0685
#> At Youden point Sensitivity = 0.7037, Specificity = 0.8687
plot(roc)          # ROC plot with annotation label

plot(roc, label = FALSE)  # ROC plot without annotation label


## Optional: direct slot access
roc$plot           # ggplot object for the ROC curve

head(roc$curve)    # Threshold, FPR, TPR, and J on the grid
#>    threshold FPR TPR J
#> 1 -0.2326416   1   1 0
#> 2 -0.2274720   1   1 0
#> 3 -0.2223023   1   1 0
#> 4 -0.2171326   1   1 0
#> 5 -0.2119629   1   1 0
#> 6 -0.2067933   1   1 0
roc$AUC            # Kernel-smoothed AUC estimate
#> [1] 0.8646697
roc$AUC_ci         # Confidence interval for AUC
#> [1] 0.8105293 0.9188102
roc$J              # Youden index estimate
#> [1] 0.5723741
roc$J_ci           # Confidence interval for the Youden index
#> [1] 0.4542642 0.6904840
roc$t0             # Estimated optimal cutoff (Youden point)
#> [1] 4.068528
roc$sensitivity    # Sensitivity at the Youden point
#> [1] 0.7036682
roc$specificity    # Specificity at the Youden point
#> [1] 0.8687059
roc$hX             # Selected CDF bandwidth for the non-diseased group
#> [1] 0.3146762
roc$hY             # Selected CDF bandwidth for the diseased group
#> [1] 0.9802306
```
