Welcome to Vigges Developer Community - Open, Learning, Share

python - Most efficient method for updating multiple columns in a single dataframe row

line_profiler is showing me the (to me) surprising result that updating two columns in a single row runs faster as two separate statements than as one combined statement.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   696      6907   42029943.0   6085.1      4.7    df_work.loc[self.iRow, 'status'] = 'X'
   697      6907   68856814.0   9969.1      7.7    df_work.loc[self.iRow, 'clock'] = self.dClock
   698      6907  178155598.0  25793.5     19.9    df_work.loc[self.iRow, ['status', 'clock']] = ['L', self.dClock]

Lines 696 and 697 take a combined 11 secs vs 18 secs for the equivalent line 698, so two separate updates are roughly 40% faster than a single update statement. I see this pattern repeatedly. I assumed the single update would run faster, and before I revert my code I want to check whether there is an even more efficient method than updating one column at a time within a row. Thanks!


1 Answer

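After further research, the solution was to switch from loc to iat. iat is a scalar accessor that reads or writes a single cell by integer position, so it skips much of the general-purpose indexing work that loc performs.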

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   673      6907    5209397.0    754.2      1.7  df_work.iat[self.iRow, cols_work['clock']] = self.dClock

The per-hit time for the clock update decreased from 9969 (line 697 in the original profile) to 754.

I initialized a dictionary that maps each column name to its column number, for use with iat, as follows:

    # Map each column name to its integer position for use with iat.
    cols_work = {col: pos for pos, col in enumerate(df_work.columns)}
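For anyone who wants to try this outside the original code base, here is a minimal, self-contained sketch of the same idea. The small frame, `i_row`, and `d_clock` are hypothetical stand-ins for `df_work`, `self.iRow`, and `self.dClock`, which are not shown in the post. It assumes the frame has the default RangeIndex, so the row label used with loc and the row position used with iat are the same number.

```python
import pandas as pd

# Hypothetical stand-in for df_work; the real frame is not shown in the post.
df_work = pd.DataFrame({
    "status": ["A", "A", "A"],
    "clock": [0.0, 0.0, 0.0],
})

# Map each column name to its integer position, for use with iat.
cols_work = {col: pos for pos, col in enumerate(df_work.columns)}

i_row = 1        # positional row number (assumes the default RangeIndex)
d_clock = 12.5   # illustrative value only

# iat: scalar access by integer position (row position, column position).
df_work.iat[i_row, cols_work["status"]] = "X"
df_work.iat[i_row, cols_work["clock"]] = d_clock

# at: scalar access by label, which needs no lookup dictionary.
df_work.at[2, "status"] = "L"
df_work.at[2, "clock"] = d_clock
```

If the index is not a simple 0..n-1 range, iat and loc will address different rows for the same number, so the label-based `at` accessor may be the safer drop-in replacement for single-cell loc assignments. Whether `at` matches the iat timing above was not measured here.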
