Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


r - Why does a mutate following a group_by(year, month) seem to miss a row?

I have a data frame of daily periodicity that I am converting to monthly periodicity, including a simple transformation based on the summarized values:

library(dplyr)
library(lubridate)

tibble(
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year = year(date),
  month = month(date)
) %>% group_by(year, month) %>% summarise(
  date = last(date),
  month.close = last(index),
) %>% mutate(
  month.change = log(month.close / lag(month.close))
)

The code seems straightforward, but when I run it I get something curious:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups:   year [2]
   year month date       month.close month.change
  <dbl> <dbl> <date>           <dbl>        <dbl>
1  2002    12 2002-12-31        403.     NA      
2  2003     1 2003-01-31        419.     NA      
3  2003     2 2003-02-28        422.      0.00572
4  2003     3 2003-03-01        417.     -0.0121 

Why doesn't row 2 have a month.change value despite rows 1 and 2 both having valid month.close values? Does the summarize() action act across both of the given dimensions separately?

I really need to understand why this behavior is happening, so please don't just tell me to use a different function for collapsing periodicities. I would like to know which part of the implementation I am misunderstanding so that I do not introduce a similar bug elsewhere in the future. I know it has something to do with grouping by two variables, because when I combine the two columns into one I get the expected behavior.

This code:

library(zoo)
tibble(
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year.month = as.yearmon(date)
) %>% group_by(year.month) %>% summarise(
  date = last(date),
  month.close = last(index),
) %>% mutate(
  month.change = log(month.close / lag(month.close))
)

returns the expected result:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 4
  year.month date       month.close month.change
  <yearmon>  <date>           <dbl>        <dbl>
1 Dec 2002   2002-12-31        405.     NA      
2 Jan 2003   2003-01-31        428.      0.0560 
3 Feb 2003   2003-02-28        421.     -0.0173 
4 Mar 2003   2003-03-01        423.      0.00513

What am I missing?



1 Answer


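When you use group_by() with summarise(), only the last level of grouping is dropped by default.

So at this stage your data is still grouped by year. The mutate() that follows therefore evaluates lag() separately within each year, and the first row of every year (December 2002 and January 2003 here) has no previous value, which is why both get NA.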

tibble(
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year = year(date),
  month = month(date)
) %>% group_by(year, month) %>% summarise(
  date = last(date),
  month.close = last(index))

# A tibble: 4 x 4
# Groups:   year [2] # <- Notice this
#   year month date       month.close
#  <int> <int> <date>           <dbl>
#1  2002    12 2002-12-31        411.
#2  2003     1 2003-01-31        393.
#3  2003     2 2003-02-28        406.
#4  2003     3 2003-03-01        398.
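
Because the example above uses rnorm(), its numbers change on every run. The same effect can be checked with fixed data. The following is a minimal sketch, assuming dplyr 1.0.0 or later; the month.close values are hypothetical and chosen only for illustration, and .groups = "drop_last" is written out to mirror the default behavior described above.

```r
library(dplyr)

# Hypothetical month-end values (illustrative only, not from the question)
monthly <- tibble(
  year = c(2002, 2003, 2003, 2003),
  month = c(12, 1, 2, 3),
  month.close = c(400, 410, 405, 415)
) %>%
  group_by(year, month) %>%
  # "drop_last" removes only the last grouping level (month)
  summarise(month.close = last(month.close), .groups = "drop_last")

# The result is still grouped by year
group_vars(monthly)

# lag() restarts inside each year, so the first row of each year is NA
with_change <- monthly %>%
  mutate(month.change = log(month.close / lag(month.close)))

is.na(with_change$month.change)
```

Here group_vars() reports "year", and the NA pattern is TRUE, TRUE, FALSE, FALSE: one NA for the only row of 2002 and one for the first row of 2003.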

To overcome this behavior, you can specify .groups = 'drop' in summarise() or call ungroup() after the step above.

tibble(
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year = year(date),
  month = month(date)
) %>% group_by(year, month) %>% summarise(
  date = last(date),
  month.close = last(index),
  .groups = 'drop'
) %>% mutate(
  month.change = log(month.close / lag(month.close))
)

#   year month date       month.close month.change
#  <int> <int> <date>           <dbl>        <dbl>
#1  2002    12 2002-12-31        399.    NA       
#2  2003     1 2003-01-31        380.    -0.0510  
#3  2003     2 2003-02-28        381.     0.00257 
#4  2003     3 2003-03-01        381.     0.000673
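
The ungroup() alternative is not shown above, so here is a minimal sketch of it, again assuming dplyr 1.0.0 or later and using hypothetical fixed values rather than the random series from the question.

```r
library(dplyr)

# Hypothetical month-end values (illustrative only)
monthly <- tibble(
  year = c(2002, 2003, 2003, 2003),
  month = c(12, 1, 2, 3),
  month.close = c(400, 410, 405, 415)
) %>%
  group_by(year, month) %>%
  summarise(month.close = last(month.close), .groups = "drop_last") %>%
  # Remove the remaining "year" grouping before computing the change
  ungroup() %>%
  mutate(month.change = log(month.close / lag(month.close)))

group_vars(monthly)          # no grouping variables remain
is.na(monthly$month.change)  # only the very first row is NA
```

Either form should give the same result. .groups = 'drop' keeps the intent inside the summarise() call, while ungroup() makes the step visible in the pipeline.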

In your second example, the data is grouped by only one key, so that grouping is dropped after summarise() and you get the expected output.
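
The single-key case can be checked the same way. This sketch assumes dplyr 1.0.0 or later and replaces the yearmon column with a hypothetical character key, so that zoo is not needed; the values are illustrative only.

```r
library(dplyr)

# Hypothetical data with a single combined year-month key
one_key <- tibble(
  year.month = c("2002-12", "2003-01", "2003-01", "2003-02"),
  index = c(400, 408, 410, 405)
) %>%
  group_by(year.month) %>%
  # With one grouping variable, dropping the last level leaves no groups
  summarise(month.close = last(index), .groups = "drop_last")

is_grouped_df(one_key)  # FALSE: the result is a plain tibble

one_key_change <- one_key %>%
  mutate(month.change = log(month.close / lag(month.close)))

is.na(one_key_change$month.change)  # only the first row is NA
```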

