Suggestion of new function: describe_missing()
#561
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,115 @@ | ||
| #' @title Describe Missing Values in Data According to Guidelines | ||
| #' | ||
| #' @description Provides a detailed description of missing values in a data frame. | ||
| #' This function reports both absolute and percentage missing values of specified | ||
| #' column lists or scales, following recommended guidelines. Some authors recommend | ||
| #' reporting item-level missingness per scale, as well as a participant's maximum | ||
| #' number of missing items by scale. For example, Parent (2013) writes: | ||
| #' | ||
| #' *I recommend that authors (a) state their tolerance level for missing data by scale | ||
| #' or subscale (e.g., "We calculated means for all subscales on which participants gave | ||
| #' at least 75% complete data") and then (b) report the individual missingness rates | ||
| #' by scale per data point (i.e., the number of missing values out of all data points | ||
| #' on that scale for all participants) and the maximum by participant (e.g., "For Attachment | ||
| #' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant | ||
| #' missing more than a single data point").* | ||
|
Member
This sounds a bit too much focused on survey data while this function can be interesting for all kinds of data. I'd rather keep the first or two first sentences here and move the rest in a specific section in 'Details' (but even there, this seems very field-specific).
Member
Author
I moved everything after "Some authors recommend" to Also, I think the way I see it, is that a lot of packages and functions can report basic missing data features, like
Member
Author
I suppose one question we have to answer is: do we want to have |
||
| #' | ||
| #' @param data The data frame to be analyzed. | ||
| #' @param vars Variable (or lists of variables) to check for missing values (NAs). | ||
|
Member
We use
Member
Author
Here it works a little bit differently than
Contributor
What do you mean by nested structure here? Do you mean that the names in each entry of the list are the items contributing to the scale?
Contributor
I think we should include |
||
| #' @param scales The scale names to check for missing values (as a character vector). | ||
|
rempsyc marked this conversation as resolved.
Outdated
|
||
| #' @keywords missing values NA guidelines | ||
|
rempsyc marked this conversation as resolved.
Outdated
|
||
| #' @return A dataframe with the following columns: | ||
| #' - `var`: Variables selected. | ||
| #' - `items`: Number of items for selected variables. | ||
|
Member
I think
Member
Author
Hum, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should use that instead, as I'm afraid
Member
Author
It is indeed specific as in psychology we tend to think of variables as made of several "items". So items 1-10 create a variable such as a personality trait "extroversion". I'm not sure how to call it because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to your suggestions if you have more. A more accurate name (for psychology) would be |
||
| #' - `na`: Number of missing cell values for those variables (e.g., 2 missing | ||
| #' values for the first participant + 2 missing values for the second participant | ||
| #' = total of 4 missing values). | ||
|
rempsyc marked this conversation as resolved.
Outdated
|
||
| #' - `cells`: Total number of cells (i.e., number of participants multiplied by | ||
| #' the number of variables, `items`). | ||
| #' - `na_percent`: The percentage of missing values (`na` divided by `cells`). | ||
| #' - `na_max`: The number of missing values for the participant with the most | ||
| #' missing values for the selected variables. | ||
| #' - `na_max_percent`: The amount of missing values for the participant with | ||
| #' the most missing values for the selected variables, as a percentage | ||
| #' (i.e., `na_max` divided by the number of selected variables, `items`). | ||
| #' - `all_na`: The number of participants missing 100% of items for that scale | ||
| #' (the selected variables). | ||
| #' | ||
| #' @export | ||
| #' @references Parent, M. C. (2013). Handling item-level missing | ||
| #' data: Simpler is just as good. *The Counseling Psychologist*, | ||
| #' *41*(4), 568-600. https://doi.org/10.1177/0011000012445176 | ||
| #' @examples | ||
| #' # Use the entire data frame | ||
| #' describe_missing(airquality) | ||
| #' | ||
| #' # Use selected columns explicitly | ||
| #' describe_missing(airquality, | ||
| #' vars = list( | ||
| #' c("Ozone", "Solar.R", "Wind"), | ||
| #' c("Temp", "Month", "Day") | ||
| #' ) | ||
| #' ) | ||
| #' | ||
| #' # If the questionnaire items start with the same name, e.g., | ||
| #' set.seed(15) | ||
| #' fun <- function() { | ||
| #' c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA) | ||
| #' } | ||
| #' df <- data.frame( | ||
| #' ID = c("idz", NA), | ||
| #' open_1 = fun(), open_2 = fun(), open_3 = fun(), | ||
| #' extrovert_1 = fun(), extrovert_2 = fun(), extrovert_3 = fun(), | ||
| #' agreeable_1 = fun(), agreeable_2 = fun(), agreeable_3 = fun() | ||
| #' ) | ||
|
Comment on lines +44 to +48
Member
This fails with an unclear message if there are more than one variable in The way this argument works is also not very clear to me. For instance, I'd find it more natural if the
Member
Additionally, the current implementation means that
> describe_missing(df_long, by = "dimension")
variable n_missing missing_percent complete_percent
1 agreeableness 21 50.00 50.00
2 agreeableness 0 0.00 100.00
3 agreeableness 10 23.81 76.19
4 extroversion 21 50.00 50.00
5 extroversion 0 0.00 100.00
6 extroversion 17 40.48 59.52
7 openness 21 50.00 50.00
8 openness 0 0.00 100.00
9 openness 11 26.19 73.81
10 Total 101 20.04 79.96 |
||
| #' | ||
| #' # One can list the scale names directly: | ||
| #' describe_missing(df, scales = c("ID", "open", "extrovert", "agreeable")) | ||
| describe_missing <- function(data, vars = NULL, scales = NULL) { | ||
| classes <- lapply(data, class) | ||
|
rempsyc marked this conversation as resolved.
Outdated
|
||
| if (missing(vars) && missing(scales)) { | ||
| vars.internal <- names(data) | ||
| } else if (!missing(scales)) { | ||
| vars.internal <- lapply(scales, function(x) { | ||
| grep(paste0("^", x), names(data), value = TRUE) | ||
| }) | ||
| } | ||
| if (!missing(vars)) { | ||
| vars.internal <- vars | ||
| } | ||
| if (!is.list(vars.internal)) { | ||
| vars.internal <- list(vars.internal) | ||
| } | ||
| na_df <- .describe_missing(data) | ||
| if (!missing(vars) || !missing(scales)) { | ||
| na_list <- lapply(vars.internal, function(x) { | ||
| data_subset <- data[, x, drop = FALSE] | ||
| .describe_missing(data_subset) | ||
| }) | ||
| na_df$var <- "Total" | ||
| na_df <- do.call(rbind, c(na_list, list(na_df))) | ||
| } | ||
| na_df | ||
| } | ||
|
|
||
| .describe_missing <- function(data) { | ||
| my_var <- paste0(names(data)[1], ":", names(data)[ncol(data)]) | ||
| items <- ncol(data) | ||
| na <- sum(is.na(data)) | ||
| cells <- nrow(data) * ncol(data) | ||
| na_percent <- round(na / cells * 100, 2) | ||
| na_max <- max(rowSums(is.na(data))) | ||
| na_max_percent <- round(na_max / items * 100, 2) | ||
| all_na <- sum(apply(data, 1, function(x) all(is.na(x)))) | ||
|
|
||
| data.frame( | ||
| var = my_var, | ||
| items = items, | ||
| na = na, | ||
| cells = cells, | ||
| na_percent = na_percent, | ||
| na_max = na_max, | ||
| na_max_percent = na_max_percent, | ||
| all_na = all_na | ||
| ) | ||
| } | ||
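
To make the documented return columns easier to review, here is a minimal sketch that recomputes them by hand on a toy data frame. It mirrors the arithmetic in `.describe_missing()` from this diff and the prefix matching used for `scales`. The data frame `toy` and its column names are made up for illustration, and `all_na` is computed with `rowSums()` instead of `apply()`, which should give the same count.

```r
# Hypothetical toy data: 3 participants, two "open" items, one "extrovert" item
toy <- data.frame(
  open_1 = c(1, NA, NA),
  open_2 = c(2, 5, NA),
  extrovert_1 = c(3, 4, 5)
)

# Same prefix matching as the `scales` branch in the diff
scale_cols <- grep(paste0("^", "open"), names(toy), value = TRUE)
toy_open <- toy[, scale_cols, drop = FALSE]

items <- ncol(toy_open)                            # columns selected for the scale
na <- sum(is.na(toy_open))                         # missing cells across all rows
cells <- nrow(toy_open) * ncol(toy_open)           # rows times selected columns
na_percent <- round(na / cells * 100, 2)           # share of missing cells
na_max <- max(rowSums(is.na(toy_open)))            # worst single participant
na_max_percent <- round(na_max / items * 100, 2)   # worst participant, in percent
all_na <- sum(rowSums(is.na(toy_open)) == items)   # participants missing every item

data.frame(items, na, cells, na_percent, na_max, na_max_percent, all_na)
```

In this toy case the third participant is missing both "open" items, so they are the only row counted in `all_na` and they also set `na_max`.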