Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others

r - plyr for the same analysis across different subsets

I'm new to plyr and dplyr and seriously don't get it. I have managed my way around some functions, but I struggle with really basic stuff such as the following example.

Taking mtcars, I have different overlapping subsets, such as vs = 1 and am = 1

I now want to run the same analysis, in this case median(), for one variable over the different subsets, and another analysis, such as mean(), for another variable. In the end this should give me the same results as the following code, just with much less code:

data_mt <- mtcars         # has binary dummy vars for grouping
data_vs <- data_mt[ which(data_mt$vs == 1 ), ]
data_am <- data_mt[ which(data_mt$am == 1 ), ]

median(data_mt$mpg)
median(data_vs$mpg)
median(data_am$mpg)

mean(data_mt$cyl)
mean(data_vs$cyl)
mean(data_am$cyl)

In my real example, I have an analog to data_mt, so if you have a solution starting there, without data_vs etc. that would be great.

I'm sure this is very basic, but I can't wrap my head around it - and as I have some 1500 variables that I want to look at, I'd appreciate your help =)

It may well be that my answer is already out there, but with the terminology I know, I didn't find it explained for dummies ;D


Edit:

To have a better understanding of what I am doing and what I am looking for, I hereby post my original code (not the mtcars example).

I have a dataset ds with 402 observations of 553 variables. The dataset comes from a study with human participants, some of whom opted in for additional research: mys, obs, or both.

ds$mys <- 0
ds$mys[ which(ds$staffmystery_p == "Yes" ) ] <- 1

ds$obs <- 0
ds$obs[ which( !is.na(ds$sales_time)) ] <- 1

The 553 variables are either integers (e.g. for age or years of experience) or factors (e.g. sex or yes/no answers). I now want to compare some descriptives of the full dataset with the descriptives for the subsets, and ideally also do a t-test for the difference. Currently I just have a very long list of code that reads more or less like the following (just much longer). This doesn't include t-tests.

describe(ds$age_b)
describe(dm$age_b)
describe(do$age_b)

prop.table(table(ds$sex_b))*100
prop.table(table(dm$sex_b))*100
prop.table(table(do$sex_b))*100

ds, dm and do are different datasets, but they are all based on the full dataset ds mentioned above: dm is the subset flagged by ds$mys and do is the subset flagged by ds$obs.

describe comes from the psych package and just lists descriptive statistics like mean or median etc. I don't need all of the metrics, mostly n, mean, median, sd and iqr. The expression around prop.table gives me a readout I can just copy into the Excel tables I use for the final publications. I don't want automated output because I get asked all the time to add or change things, which is really just easier in Excel than with automated output (unless you know a much superior way ;)

Thank you so much!



1 Answer

Here is an option that computes the statistics separately for each grouping column and binds the results side by side:

library(dplyr)
library(purrr)
library(stringr)
map_dfc(c('vs', 'am'), ~ 
   mtcars %>% 
     group_by(across(all_of(.x))) %>%
     summarise(!! str_c("Mean_cyl_", .x)  := mean(cyl), 
       !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop'))%>%
   mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))

Output:

# A tibble: 2 x 8
#     vs Mean_cyl_vs Median_mpg_vs    am Mean_cyl_am Median_mpg_am Mean_cyl_full Median_mpg_full
#  <dbl>       <dbl>         <dbl> <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
#1     0        7.44          15.6     0        6.95          17.3          6.19            19.2
#2     1        4.57          22.8     1        5.08          22.8          6.19            19.2

If the installed dplyr is older than 1.0.0 (the release that introduced across), we can replace the across call with group_by_at:

map_dfc(c('vs', 'am'), ~ 
   mtcars %>% 
     group_by_at(vars(.x)) %>%
     summarise(!! str_c("Mean_cyl_", .x)  := mean(cyl), 
       !! str_c("Median_mpg_", .x) := median(mpg), .groups = 'drop'))%>%
   mutate(Mean_cyl_full = mean(mtcars$cyl), Median_mpg_full = median(mtcars$mpg))
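
The question asked only for the rows where a flag equals 1, next to the full data. A minimal sketch of that variant follows; the names subsets and res are illustrative and not part of the original answer. Each subset is described by a logical filter, and the same summarise call runs once per filter.

```r
library(dplyr)
library(purrr)

# One logical filter per subset; "full" keeps every row.
# (Names here are illustrative assumptions.)
subsets <- list(
  full = rep(TRUE, nrow(mtcars)),
  vs   = mtcars$vs == 1,
  am   = mtcars$am == 1
)

# Apply the same summary to each subset, then stack the one-row
# results, using the list names as an id column.
res <- bind_rows(
  map(subsets, ~ mtcars[.x, ] %>%
        summarise(n          = n(),
                  median_mpg = median(mpg),
                  mean_cyl   = mean(cyl))),
  .id = "subset"
)

res
```

This returns one row per subset instead of one column block per grouping variable, which may be easier to paste into a spreadsheet. The vs and am rows should agree with the rows for group 1 in the output above.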

Update

Based on the update, we could place the datasets in a list, apply the transformations to each of them in one pass, and return, per dataset, a list with the descriptive statistics and the proportion table

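library(psych)   # provides describe()

out <- map(dplyr::lst(dm, ds, do), ~ {

  # Recreate the 0/1 opt-in flags. Note: an NA in staffmystery_p
  # gives NA here, whereas the which()-based code in the question gives 0.
  dat <- .x %>%
    mutate(mys = as.integer(staffmystery_p == 'Yes'),
           obs = as.integer(!is.na(sales_time)))

  age_b_desc     <- describe(dat$age_b)                  # descriptive statistics
  prop_table_out <- prop.table(table(dat$sex_b)) * 100   # percentages per level

  # lst() names the elements after the objects
  dplyr::lst(age_b_desc, prop_table_out)
})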
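
With several hundred columns, the same idea can be extended with across so that n, mean, median, sd and IQR are computed for many numeric variables at once. The sketch below uses mtcars as a stand-in for ds, with vs and am playing the role of mys and obs. The column choice num_vars and the object names are assumptions for illustration. The t-test at the end is one possible reading of "test for difference": it compares the flagged rows with the unflagged rows, because a subset and the full dataset that contains it are not independent samples.

```r
library(dplyr)
library(purrr)

dat <- mtcars   # stand-in for the real dataset (assumption)

# Logical filters for the full data and the two flagged subsets
subsets <- list(
  full = rep(TRUE, nrow(dat)),
  vs   = dat$vs == 1,
  am   = dat$am == 1
)

num_vars <- c("mpg", "hp", "wt")   # hypothetical selection of numeric columns

# Statistics named in the question; result columns are named
# <column>_<statistic>, e.g. mpg_median
stats <- list(
  n      = ~ sum(!is.na(.x)),
  mean   = ~ mean(.x, na.rm = TRUE),
  median = ~ median(.x, na.rm = TRUE),
  sd     = ~ sd(.x, na.rm = TRUE),
  iqr    = ~ IQR(.x, na.rm = TRUE)
)

# One row of descriptives per subset
desc <- bind_rows(
  map(subsets, function(keep) {
    summarise(dat[keep, ], across(all_of(num_vars), stats))
  }),
  .id = "subset"
)

# Welch two-sample t-test: rows with am == 1 versus rows with am == 0
tt <- t.test(mpg ~ am, data = dat)
```

desc can be written out with write.csv(desc, "desc.csv") and pasted into Excel; tt$p.value and tt$estimate hold the test result.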
