小能豆

How can I create a conditional column comparing two sets within a per-key window?

py

I have two tables with information on the retailers users shopped at in the pre- and post-periods. My goal is to find which retailers were new for each user in the post-period. I would like to do this within a `user_id` window, because plain `isin()` and `loc[]` solutions do not fit my task well. I also considered filtering joins (an anti-join specifically), but that did not work well either. Here's sample data:

sample1 = pd.DataFrame(
    {
        'user_id': [45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_1', 'retailer_2']
    }
)

sample2 = pd.DataFrame(
    {
        'user_id': [45, 45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_6', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_2', 'retailer_1', 'retailer_2']
    }
) 

My desired result is like this:

{'user_id': {0: 45, 1: 45, 2: 556, 3: 556, 4: 556, 5: 556, 6: 556, 7: 556, 8: 1344, 9: 1588, 10: 2063, 11: 2063, 12: 2063, 13: 2673, 14: 2673, 15: 2982, 16: 2982}, 'retailer': {0: 'retailer_1', 1: 'retailer_6', 2: 'retailer_1', 3: 'retailer_2', 4: 'retailer_3', 5: 'retailer_4', 6: 'retailer_5', 7: 'retailer_6', 8: 'retailer_3', 9: 'retailer_2', 10: 'retailer_2', 11: 'retailer_3', 12: 'retailer_7', 13: 'retailer_1', 14: 'retailer_2', 15: 'retailer_1', 16: 'retailer_2'}, 'is_new_retailer': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 1, 15: 0, 16: 0}}

2023-12-06

1 answer

小能豆

You can achieve this with pandas' merge using indicator=True, then derive a new column that flags retailers appearing only in the post-period. Here's how you can do it:

import pandas as pd

sample1 = pd.DataFrame(
    {
        'user_id': [45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_1', 'retailer_2']
    }
)

sample2 = pd.DataFrame(
    {
        'user_id': [45, 45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_6', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_2', 'retailer_1', 'retailer_2']
    }
) 

# Merge the two samples on user_id and retailer
merged_samples = pd.merge(sample1, sample2, on=['user_id', 'retailer'], how='outer', indicator=True)

# Create a new column 'is_new_retailer' based on the '_merge' column
merged_samples['is_new_retailer'] = (merged_samples['_merge'] == 'right_only').astype(int)

# Drop the '_merge' column
merged_samples = merged_samples.drop('_merge', axis=1)

# Display the result
print(merged_samples)

This gives you a DataFrame with a new column 'is_new_retailer' indicating whether each retailer is new for the user in the post-period. The result matches your desired output.
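One caveat, in case your full data differs from the sample: with how='outer', any (user_id, retailer) pair that exists only in the pre-period would also survive into the result as a left_only row. If that can happen, merging from the post-period side keeps exactly the post-period rows. A minimal sketch on a toy subset of the data above (both variants produce the same flags; the merge-free variant assumes the pre-period pairs fit in memory as a set):

```python
import pandas as pd

# Toy subset of the sample data (two users, a few retailers).
sample1 = pd.DataFrame({
    'user_id': [45, 2673],
    'retailer': ['retailer_1', 'retailer_1'],
})
sample2 = pd.DataFrame({
    'user_id': [45, 45, 2673, 2673],
    'retailer': ['retailer_1', 'retailer_6', 'retailer_1', 'retailer_2'],
})

# Variant 1: left merge from the post-period table, so the result contains
# exactly the post-period rows; 'left_only' marks (user_id, retailer) pairs
# with no match in the pre-period.
merged = sample2.merge(sample1, on=['user_id', 'retailer'],
                       how='left', indicator=True)
merged['is_new_retailer'] = (merged['_merge'] == 'left_only').astype(int)
merged = merged.drop(columns='_merge')

# Variant 2: merge-free, testing membership in a set of pre-period pairs.
pre_pairs = set(zip(sample1['user_id'], sample1['retailer']))
sample2['is_new_retailer'] = [
    int((u, r) not in pre_pairs)
    for u, r in zip(sample2['user_id'], sample2['retailer'])
]
```

If sample1 can contain duplicate (user_id, retailer) rows, call sample1.drop_duplicates() first so the left merge does not multiply post-period rows.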

2023-12-06