I want to run multiple passes of a model, splitting my data by year on each pass. I have data in a pandas DataFrame covering 1/1/2015 to 12/31/2019, with these years present:
IN: df.date.dt.year.value_counts().keys()
OUT: Int64Index([2019, 2018, 2017, 2016, 2015], dtype='int64')
I want to:
- use every record from 2015-2018 to train my model, then use all of the 2019 records to test my model
- then use 2015-2017+2019 to train and test on 2018, etc. for all possible combinations (5), and finally store the order in which each year is used as the test set.
The pseudocode would look like this:
for i in df.date.dt.year.value_counts().keys():
    # I don't know how to code the next two lines pythonically so that they cover all possible combinations
    train_df = df.loc[(df.date.dt.year <= 2016) | (df.date.dt.year >= 2018)]
    test_df = df.loc[df.date.dt.year == 2017]
    ##
    X_train = train_df.drop(['target'], axis=1)
    y_train = train_df.target
    X_test = test_df.drop(['target'], axis=1)
    y_test = test_df.target
    # run through feature engineering pipeline, ensemble of algorithms, and add cv results to a plot for
    # accuracy comparison
    # Ideally, the loop would restart here and process the next combination of years (train with 4, test with the 5th)
    # Finally, I would like to store the year held out for testing, in the order the years were held out, so that when I plot my evaluation metrics my customer knows which year is being tested in each plot. Presumably it would just iterate in order, but depending on how the first section is coded, that could change. For now I assume the best way to do this would be:
    test_order = []  # This would be placed above, outside the loop
    test_order.append(i)
    # Afterwards, I would manually create five variables and assign each a value based on its position in the list, i.e. test_1 = test_order[0], unless someone knows a way to automate this as well.
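For reference, the loop described above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the toy DataFrame, the `feature` column, and the `splits` dictionary are hypothetical stand-ins for the real data, and a dictionary keyed by test year is used here in place of five separate `test_1` ... `test_5` variables.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data: one row per day.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2019-12-31", freq="D")})
df["feature"] = range(len(df))
df["target"] = df["feature"] % 2

years = df["date"].dt.year
test_order = []  # held-out years, in the order they are processed
splits = {}      # test year -> (train_df, test_df); hypothetical helper structure

# Newest year first, matching "test on 2019, then 2018, ..." in the question.
for test_year in sorted(years.unique(), reverse=True):
    train_df = df.loc[years != test_year]  # the four remaining years
    test_df = df.loc[years == test_year]   # the single held-out year

    X_train, y_train = train_df.drop(columns="target"), train_df["target"]
    X_test, y_test = test_df.drop(columns="target"), test_df["target"]

    test_order.append(int(test_year))
    splits[int(test_year)] = (train_df, test_df)
    # feature engineering, model fitting, and plotting would go here
```

Because the loop iterates over an explicitly sorted list of years, `test_order` is deterministic, and `test_order[k]` can be used directly as the label for the k-th plot.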
I am not sure how to iterate through the different combinations. It seems to me that sklearn.model_selection.LeaveOneOut is not sufficient because it does a random shuffle each time, but if anyone knows a way to adapt the LeaveOneOut method to achieve my goal, that would be great too.
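One scikit-learn splitter that may fit this use case is `LeaveOneGroupOut`, which holds out one whole group per split rather than one row (as `LeaveOneOut` does). The sketch below assumes the year is passed as the group label; the toy DataFrame and the `feature` column are hypothetical, and the `date` column is dropped from `X` only to keep the example simple.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy frame standing in for the real data: one row per day.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2019-12-31", freq="D")})
df["feature"] = np.arange(len(df))
df["target"] = df["feature"] % 2

X = df.drop(columns=["target", "date"])
y = df["target"]
groups = df["date"].dt.year  # one group label per row: the calendar year

logo = LeaveOneGroupOut()
held_out = []  # year used as the test set in each split, in iteration order

for train_idx, test_idx in logo.split(X, y, groups=groups):
    # Every test row in a split shares the same year, so read it from the first one.
    held_out.append(int(groups.iloc[test_idx].iloc[0]))

    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # feature engineering, model fitting, and plotting would go here
```

Recording the held-out year inside the loop, as `held_out` does here, avoids relying on the splitter's iteration order when labelling plots. The same splitter can also be passed as `cv=logo` together with `groups=groups` to helpers such as `cross_val_score`.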