Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


python - iterate through unique value combinations to create multiple train test splits

I am seeking to run multiple passes of a model while splitting my data based on year. I have data in a pandas dataframe ranging from 1/1/2015 to 12/31/2019 in this format:

IN: df.date.dt.year.value_counts().keys()
OUT: Int64Index([2019, 2018, 2017, 2016, 2015], dtype='int64')

I want to:

  1. use every record from 2015-2018 to train my model, then use all of the 2019 records to test my model
  2. then use 2015-2017+2019 to train and test on 2018, etc. for all possible combinations (5), and finally store the order in which each year is used as the test set.

The pseudocode would look like this:

for i in df.date.dt.year.value_counts().keys():

    # I don't know how I would pythonically code a way to iterate the next two rows to account for all possible combinations #
    
    train_df = df.loc[(df.date.dt.year <= 2016) | (df.date.dt.year >= 2018)]
    test_df = df.loc[df.date.dt.year == 2017]

    ##

    X_train = train_df.drop(['target'], axis=1)
    y_train = train_df.target
    X_test = test_df.drop(['target'], axis=1)
    y_test = test_df.target
    
    # run through feature engineering pipeline, ensemble of algorithms, and add cv results to a plot for 
    # accuracy comparison

    # Ideally, the loop would restart here, and process the next combination of years (train with 4, test with 5th) #

    # Finally, I would like to store the year held out for testing on each pass, in the order the years were held out, so that when I plot my evaluation metrics my customer knows which year is being tested in each plot. Presumably the loop would just iterate in order, but that could change depending on how the first section is coded. For now, I assume the best way to do this is as follows: #

    test_order = [] # This would be placed above, outside the loop
    test_order.append(i)

    # Afterwards, I would manually create five variables and assign a value to them based on the location in the list, i.e. test_1 = test_order[0], unless someone knows a way to automate this as well. #

I am not sure how to execute the iteration through the different combinations. It seems to me that sklearn.LeaveOneOut is not sufficient because it does a random shuffle each time, but if anyone knows a way to manipulate the LeaveOneOut method to achieve my goal that would be great too.
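A side note on the LeaveOneOut point: as far as I know, scikit-learn's LeaveOneOut holds out one row at a time rather than one year, and it does not shuffle. The group-aware splitter LeaveOneGroupOut (in sklearn.model_selection) looks closer to what is described here. A minimal sketch, assuming numpy and scikit-learn are installed; the toy arrays `years`, `X`, and `y` are made up for illustration and stand in for the real data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: two rows per year, 2015-2019.
years = np.repeat([2015, 2016, 2017, 2018, 2019], 2)
X = np.arange(len(years)).reshape(-1, 1)
y = np.arange(len(years)) % 2

logo = LeaveOneGroupOut()
held_out = []      # held-out year for each split, in the order produced
train_sizes = []   # number of training rows for each split

# Each split holds out every row that belongs to one year.
for train_idx, test_idx in logo.split(X, y, groups=years):
    test_year = int(years[test_idx][0])  # all test rows share the same year
    held_out.append(test_year)
    train_sizes.append(len(train_idx))

print(held_out)
```

With the real dataframe, the groups would presumably be `df.date.dt.year`. The splitter decides the order in which years are held out, so recording it in a list such as `held_out` keeps the plots labelled correctly.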



1 Answer


This is untested because I don't have your dataset, so you may need to adjust the syntax. Essentially, it generates every possible combination of four training years, then finds the one year that is not among them. dtest is returned as a list of one element.

from itertools import combinations

years = [2019, 2018, 2017, 2016, 2015]
# Each 4-year combination is one training set; the year left over is the test set.
for comb in combinations(years, 4):
    train_df = df.loc[df.date.dt.year.isin(list(comb))]
    # The single year missing from comb, kept as a one-element list.
    dtest = [year for year in years if year not in comb]
    df_test = df.loc[df.date.dt.year == dtest[0]]
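An alternative that avoids combinations altogether is to loop over the years and hold each one out in turn. This also makes the hold-out order explicit (newest first here, matching the order described in the question), whereas the combinations loop above holds out 2015 first and 2019 last. A minimal sketch, assuming pandas is installed; the synthetic `df`, the helper `year_holdout_splits`, and the `split_sizes` dictionary are made up for illustration and are not from the original post:

```python
import pandas as pd

# Hypothetical stand-in for the real data: one row per month, 2015-2019.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2019-12-31", freq="MS")})
df["feature"] = range(len(df))
df["target"] = df["feature"] % 2


def year_holdout_splits(frame, date_col="date"):
    """Yield (test_year, train_df, test_df), holding out the newest year first."""
    row_years = frame[date_col].dt.year
    for test_year in sorted(row_years.unique(), reverse=True):
        is_test = row_years == test_year
        yield int(test_year), frame.loc[~is_test], frame.loc[is_test]


test_order = []   # years in the order they were held out
split_sizes = {}  # keyed by held-out year; evaluation metrics could go here instead

for test_year, train_df, test_df in year_holdout_splits(df):
    X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
    X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]
    # ... feature engineering, model fitting, and scoring would go here ...
    test_order.append(test_year)
    split_sizes[test_year] = (len(X_train), len(X_test))

print(test_order)
```

Here test_order ends up as [2019, 2018, 2017, 2016, 2015]. Keying results by the held-out year in a dictionary replaces the five hand-made variables (test_1 = test_order[0], and so on), since each plot can look up its year directly.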
