I have a csv file of WTA matches that are structured like this:
season,date,name,location,level,surface,draw,round,winner,loser,status 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Maria Lindstrom,Patrizia Murgo,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Gabriela Mosca,Pascale Etchemendy,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Vicki Nelson,Xochitl Escobedo,f ... 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,SF,Gabriela Sabatini,Lori Mcneil,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,SF,Arantxa Sanchez Vicario,Mariana Perez Roldan,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,F,Gabriela Sabatini,Arantxa Sanchez Vicario,f ... 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Eva Lys,Sorana Cirstea,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Jaqueline Cristian,Celine Naef,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Patricia Maria Tig,Martha Matoula,f ... 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Tamara Korpatsch,Eva Lys,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Elena-Gabriela Ruse,Rebeka Masarova,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,F,Tamara Korpatsch,Elena-Gabriela Ruse,f
After reading the contents into a DataFrame, I would like to transform the DataFrame in such a way that for each tournament, which are unique in the combination of ['date','name'], the date is advanced by one day after the initial round is completed.
['date','name']
round
date,round 1986-12-01,R64 1986-12-02,R32 1986-12-03,R16 1986-12-04,QF 1986-12-05,SF 1986-12-06,F
The result should look like this:
season,date,name,location,level,surface,draw,round,winner,loser,status 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Maria Lindstrom,Patrizia Murgo,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Gabriela Mosca,Pascale Etchemendy,f 1987,1986-12-01,Argentine Open,Buenos Aires,C1,clay,56,R64,Vicki Nelson,Xochitl Escobedo,f ... 1987,1986-12-05,Argentine Open,Buenos Aires,C1,clay,56,SF,Gabriela Sabatini,Lori Mcneil,f 1987,1986-12-05,Argentine Open,Buenos Aires,C1,clay,56,SF,Arantxa Sanchez Vicario,Mariana Perez Roldan,f 1987,1986-12-06,Argentine Open,Buenos Aires,C1,clay,56,F,Gabriela Sabatini,Arantxa Sanchez Vicario,f ... 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Eva Lys,Sorana Cirstea,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Jaqueline Cristian,Celine Naef,f 2023,2023-10-16,Transylvania Open,Cluj-Napoca,WTA250,hard,32,R32,Patricia Maria Tig,Martha Matoula,f ... 2023,2023-10-19,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Tamara Korpatsch,Eva Lys,f 2023,2023-10-19,Transylvania Open,Cluj-Napoca,WTA250,hard,32,SF,Elena-Gabriela Ruse,Rebeka Masarova,f 2023,2023-10-20,Transylvania Open,Cluj-Napoca,WTA250,hard,32,F,Tamara Korpatsch,Elena-Gabriela Ruse,f
The date_range() function seems like a good fit, but as I’m new to Pandas, I don’t see a straightforward way to use it. What is a good solution to this?
date_range()
You can achieve this by first converting the ‘date’ column to a datetime format, then grouping the DataFrame by the unique tournament identifier ['date', 'name'], and finally, using cumcount() to create a new ‘day’ column representing the day within each tournament. After that, you can use pd.to_timedelta to add the ‘day’ column to the ‘date’ column.
['date', 'name']
cumcount()
pd.to_timedelta
Here’s how you can do it:
import pandas as pd # Read the CSV file into a DataFrame df = pd.read_csv('your_csv_file.csv') # Convert 'date' to datetime format df['date'] = pd.to_datetime(df['date']) # Group by the unique tournament identifier grouped = df.groupby(['date', 'name']) # Add a 'day' column representing the day within each tournament df['day'] = grouped.cumcount() # Add the 'day' column to the 'date' column df['date'] += pd.to_timedelta(df['day'], unit='D') # Drop the temporary 'day' column df.drop('day', axis=1, inplace=True) # Print the result print(df)
This code should produce the desired result, with the ‘date’ column advanced by one day after each completed round within each tournament. Adjust the column names and file paths according to your actual DataFrame and CSV file.