python - iterate through unique value combinations to create multiple train test splits

Question

asked Jan 29, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to run multiple passes of a model, splitting my data by year each time. My data is in a pandas DataFrame covering 1/1/2015 to 12/31/2019, so these are the years present:

IN: df.date.dt.year.value_counts().keys()
OUT: Int64Index([2019, 2018, 2017, 2016, 2015], dtype='int64')

I want to:

use every record from 2015-2018 to train my model, then use all of the 2019 records to test it
then train on 2015-2017 plus 2019 and test on 2018, and so on for all five possible train/test combinations
finally, store the order in which each year is used as the test set

the pseudocode would look like this:

for i in df.date.dt.year.value_counts().keys():

    # I don't know how I would pythonically code a way to iterate the next two rows to account for all possible combinations #
    
    train_df = df.loc[(df.date.dt.year <= 2016) | (df.date.dt.year >= 2018)]
    test_df = df.loc[df.date.dt.year == 2017]

    ##

    X_train = train_df.drop(['target'], axis=1)
    y_train = train_df.target
    X_test = test_df.drop(['target'], axis=1)
    y_test = test_df.target
    
    # run through feature engineering pipeline, ensemble of algorithms, and add cv results to a plot for 
    # accuracy comparison

    # Ideally, the loop would restart here, and process the next combination of years (train with 4, test with 5th) #

    # Finally, I would like to store the year being held out for testing, in the order the years are held out, so that when I plot my evaluation metrics my customer knows which year is being tested in each plot. Presumably it would just iterate in order, but that could change depending on how the first section is coded. For now I assume the best way to do this is: #

    test_order = [] # This would be initialized above, outside the loop
    test_order.append(i)

    # Afterwards, I would manually create five variables and assign each a value based on its position in the list, i.e. test_1 = test_order[0], unless someone knows a way to automate this as well. #

I am not sure how to execute the iteration through the different combinations. It seems to me that sklearn.LeaveOneOut is not sufficient because it does a random shuffle each time, but if anyone knows a way to manipulate the LeaveOneOut method to achieve my goal that would be great too.

1 Answer

深蓝 · Answer 1 · 2021-01-29T04:24:24+0000

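I don't have your dataset, so you may need to adjust the syntax. The idea is to enumerate every 4-year combination for training and, for each one, take the single year that is not in the combination as the test year. With the years sorted in ascending order, the held-out year runs from 2019 down to 2015, and test_order records that sequence.

from itertools import combinations

years = sorted(df.date.dt.year.unique())  # [2015, 2016, 2017, 2018, 2019]
test_order = []  # held-out year for each pass, in loop order

# Each 4-year combination is one training set; the year left out is the test set.
for comb in combinations(years, 4):
    test_year = next(y for y in years if y not in comb)
    train_df = df.loc[df.date.dt.year.isin(list(comb))]
    test_df = df.loc[df.date.dt.year == test_year]
    test_order.append(test_year)
    # ... build X_train / y_train / X_test / y_test and fit the model here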
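
If you would rather let scikit-learn generate the splits, LeaveOneGroupOut from sklearn.model_selection may be a closer fit than LeaveOneOut, which holds out a single row at a time rather than a whole year. Passing each row's year as its group label should yield one split per year. This is only a minimal sketch, assuming a toy DataFrame in place of your data; the feature column and the splits dict are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-in for the real data: one row per month from 2015 through 2019.
# "feature" is a placeholder column, not something from the original question.
dates = pd.date_range("2015-01-01", "2019-12-31", freq="MS")
df = pd.DataFrame({
    "date": dates,
    "feature": range(len(dates)),
    "target": [i % 2 for i in range(len(dates))],
})

X = df.drop(columns=["target", "date"])
y = df["target"]
groups = df["date"].dt.year  # one group label (the year) per row

# Map each held-out year to its (train, test) frames.
splits = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=groups):
    test_year = int(groups.iloc[test_idx[0]])  # every test row shares this year
    splits[test_year] = (df.iloc[train_idx], df.iloc[test_idx])

# Dict keys keep insertion order, so this is the order the years were held out.
test_order = list(splits)
```

Here splits[2019] gives the train and test frames for the pass that holds out 2019, and test_order lists the held-out years in the order they were produced, so there is no need to create test_1 through test_5 by hand.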
