小能豆

Pandas: apply date_range() to a group of rows that share a fixed date?

py

I have a csv file of WTA matches that are structured like this:

season,date,name,location,level,surface,draw,round,winner,loser,status
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Maria Lindstrom,Patrizia Murgo,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Gabriela Mosca,Pascale Etchemendy,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Vicki Nelson,Xochitl Escobedo,f
...
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,SF,Gabriela Sabatini,Lori Mcneil,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,SF,Arantxa Sanchez Vicario,Mariana Perez Roldan,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,F,Gabriela Sabatini,Arantxa Sanchez Vicario,f
...
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Eva Lys,Sorana Cirstea,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Jaqueline Cristian,Celine Naef,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Patricia Maria Tig,Martha Matoula,f
...
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Tamara Korpatsch,Eva Lys,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Elena-Gabriela Ruse,Rebeka Masarova,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,F,Tamara Korpatsch,Elena-Gabriela Ruse,f

After reading the contents into a DataFrame, I would like to transform the DataFrame in such a way that for each tournament, which are unique in the combination of ['date','name'], the date is advanced by one day after the initial round is completed.

date,round
1986-12-01,R64
1986-12-02,R32
1986-12-03,R16
1986-12-04,QF
1986-12-05,SF
1986-12-06,F

The result should look like this:

season,date,name,location,level,surface,draw,round,winner,loser,status
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Maria Lindstrom,Patrizia Murgo,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Gabriela Mosca,Pascale Etchemendy,f
1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Vicki Nelson,Xochitl Escobedo,f
...
1987,1986-12-05,Argentine Open,Buenos Aires,C1,clay,56,SF,Gabriela Sabatini,Lori Mcneil,f
1987,1986-12-05,Argentine Open,Buenos Aires,C1,clay,56,SF,Arantxa Sanchez Vicario,Mariana Perez Roldan,f
1987,1986-12-06,Argentine Open,Buenos Aires,C1,clay,56,F,Gabriela Sabatini,Arantxa Sanchez Vicario,f
...
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Eva Lys,Sorana Cirstea,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Jaqueline Cristian,Celine Naef,f
2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Patricia Maria Tig,Martha Matoula,f
...
2023,2023-10-19,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Tamara Korpatsch,Eva Lys,f
2023,2023-10-19,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Elena-Gabriela Ruse,Rebeka Masarova,f
2023,2023-10-20,Transylvania Open,Cluj-Napoca,WTA250,hard,32,F,Tamara Korpatsch,Elena-Gabriela Ruse,f

The date_range() function seems like a good fit, but as I’m new to Pandas, I don’t see a straightforward way to use it. What is a good solution to this?


阅读 64

收藏
2023-12-17

共1个答案

小能豆

You can achieve this by first converting the ‘date’ column to a datetime format, then grouping the DataFrame by the unique tournament identifier ['date', 'name'], and finally, using cumcount() to create a new ‘day’ column representing the day within each tournament. After that, you can use pd.to_timedelta to add the ‘day’ column to the ‘date’ column.

Here’s how you can do it:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('your_csv_file.csv')

# Convert 'date' to datetime format
df['date'] = pd.to_datetime(df['date'])

# Group by the unique tournament identifier
grouped = df.groupby(['date', 'name'])

# Add a 'day' column representing the day within each tournament
df['day'] = grouped.cumcount()

# Add the 'day' column to the 'date' column
df['date'] += pd.to_timedelta(df['day'], unit='D')

# Drop the temporary 'day' column
df.drop('day', axis=1, inplace=True)

# Print the result
print(df)

This code should produce the desired result, with the ‘date’ column advanced by one day after each completed round within each tournament. Adjust the column names and file paths according to your actual DataFrame and CSV file.

2023-12-17