Skip to content

Conversation

MichaelChirico
Copy link

@MichaelChirico MichaelChirico commented May 31, 2025

{plyr} is long-superseded. The package itself only uses plyr::count(); a vignette also uses round_any().

I was pursuing just dropping {plyr} and re-implementing in base, but gave up for three main reasons:

  1. table(<data.frame>) may fail if there are very few
    duplicates and each column is of high cardinality, meaning
    table(x) would have a very large number of 0 entries that
    need to be computed and dropped (plyr::count() skips them).
  2. We can use something like interaction(..., drop=TRUE) +
    tapply() to imitate this, but it's hard to generically
    reconstruct the un-interacted levels needed to build an
    equivalent data.frame -- basically, we'd need to, for full
    generality, use a sep=<str> where <str> is not present in
    any of the unique values of any of the columns of x in order
    for strsplit(<level>, <sep>) to uniquely map back.
  3. Something like vapply(split(x, x), nrow, integer(1L)) is also
    appealingly simple, but split() always drops missing levels
    (https://bugs.r-project.org/show_bug.cgi?id=18899) --> we'd
    need an onerous/ugly loop over the columns to replace missing
    observations with a unique NA-equivalent, end-sorting sentinel.

Thus the move to {dplyr}, despite it being a non-lightweight choice.

I also applied some code quality fixes to nearby lines:

  1. T/F --> TRUE/FALSE.
  2. 1:<n> loops replaced by seq_len()/seq_along(), as appropriate.
  3. Loop like x <- c(); for (i in seq_along(y)) x[i] <- foo(y[i]) should pre-initialize x to be length(y).
  4. Move some lines around to avoid creating variables just prior to a possible early return().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant