I’m trying to split a column at the last `' - '` that is followed only by uppercase letters.
Below, I have a df whose `Value` column contains various combinations. I want to split that column into two individual columns: `First` should hold everything before that `' - '`, and `Last` should hold the uppercase letters after it.
I’ve got the `Last` column correct but not the `First` column.
```python
import pandas as pd

df = pd.DataFrame({
    'Value': [
        'Juan-Diva - HOLLS',
        'Carlos - George - ESTE BAN - BOM',
        'Javier Plain - Hotham Ham - ALPINE',
        'Yul - KONJ KOL MON',
    ],
})
```
option 1)

```python
df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True)
df['Last'] = df['Value'].str.split('- ').str[-1]
```

option 2)

```python
# Regular expression pattern
pattern = r'^(.*) - ([A-Z\s]+)$'

# Extract groups into two new columns
df[['First', 'Last']] = df['Value'].str.extract(pattern)
```

option 3)

```python
df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)
```
None of these options return the intended output.
Intended output:

```
                       First            Last
0                  Juan-Diva           HOLLS
1            Carlos - George  ESTE BAN - BOM
2  Javier Plain - Hotham Ham          ALPINE
3                        Yul    KONJ KOL MON
```
`rsplit` with `n=1` performs a single split from the right, so each string is cut at its last occurrence of `' - '`:

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [
        'Juan-Diva - HOLLS',
        'Carlos - George - ESTE BAN - BOM',
        'Javier Plain - Hotham Ham - ALPINE',
        'Yul - KONJ KOL MON',
    ],
})

# One split from the right: everything before the last ' - ' goes to First
df[['First', 'Last']] = df['Value'].str.rsplit(' - ', n=1, expand=True)
print(df)
```

This matches the intended output for rows 0, 2 and 3. Row 1 does not match: its uppercase part contains a `' - '` of its own, so the split lands inside it and `Last` ends up as `BOM`:

```
                                Value                       First          Last
0                   Juan-Diva - HOLLS                   Juan-Diva         HOLLS
1    Carlos - George - ESTE BAN - BOM  Carlos - George - ESTE BAN           BOM
2  Javier Plain - Hotham Ham - ALPINE   Javier Plain - Hotham Ham        ALPINE
3                  Yul - KONJ KOL MON                         Yul  KONJ KOL MON
```

A single right split with `n=1` is therefore enough only when the uppercase part never contains `' - '` itself.
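When the uppercase tail can itself contain `' - '` (as in `'ESTE BAN - BOM'`), a regular expression can describe the tail explicitly instead of relying on the position of the last separator. The following is a minimal sketch in plain Python, not a definitive solution. It assumes the uppercase part consists only of the letters A-Z and spaces, optionally joined by `' - '`. The names `PATTERN` and `split_value` are made up for this illustration.

```python
import re

# Assumption: the "uppercase part" is one or more runs of A-Z and spaces,
# optionally joined by ' - ', running to the end of the string.
PATTERN = re.compile(r'^(.*?) - ([A-Z ]+(?: - [A-Z ]+)*)$')


def split_value(value):
    """Return (first, last), or (None, None) if the pattern does not match."""
    match = PATTERN.match(value)
    return match.groups() if match else (None, None)


values = [
    'Juan-Diva - HOLLS',
    'Carlos - George - ESTE BAN - BOM',
    'Javier Plain - Hotham Ham - ALPINE',
    'Yul - KONJ KOL MON',
]

# The lazy (.*?) stops at the earliest ' - ' after which the rest of the
# string is entirely uppercase runs, so 'ESTE BAN - BOM' stays together.
results = [split_value(v) for v in values]
for first, last in results:
    print(f'{first!r:30} {last!r}')
```

The difference from the pattern in option 2 is the lazy first group and the optional repeated `' - '` inside the second group. The same pattern string can be passed to `df['Value'].str.extract(...)` to fill `First` and `Last` directly. Values with lowercase letters, digits, or punctuation in the tail would not match and would come back as missing.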