I’m trying to split a column by the last ' - '
AND being followed by all uppercase strings letters.
It may not necessarily be the last delimiter in isolation. But it will be the last before all uppercase strings.
I can find separate questions that separate based on first/last delimiter. But not with a combination.
Below, I have a df with Value
containing various combinations. I want to split the col into two individuals columns, whereby, everything before the last ' - '
and uppercase letters.
I’ve got Last
column correct but not First
column.
df = pd.DataFrame({
'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
})
option 1)
df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True)
df['Last'] = df['Value'].str.split('- ').str[-1]
option 2)
# Regular expression pattern
pattern = r'^(.*) - ([A-Z\s]+)$'
# Extract groups into two new columns
df[['First', 'Last']] = df['Value'].str.extract(pattern)
option 3)
df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)
None of these options return the intended output.
intended output:
First Last
0 Juan-Diva HOLLS
1 Carlos - George ESTE BAN - BOM
2 Javier Plain - Hotham Ham ALPINE
3 Yul KONJ KOL MON
regex:
df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)')
Value First Last
0 Juan-Diva - HOLLS Juan D
1 Carlos - George - ESTE BAN - BOM Carlos G
2 Javier Plain - Hotham Ham - ALPINE Javier Plain H
3 Yul - KONJ KOL MON Yul KONJ KOL MON
You can use the rsplit
function with a positive lookahead assertion in the regular expression to achieve the desired result. Here’s how you can modify option 3:
import pandas as pd
df = pd.DataFrame({
'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
})
# Extract groups into two new columns using rsplit with lookahead assertion
df[['First', 'Last']] = df['Value'].str.rsplit(r' - (?=[A-Z]+)', n=1, expand=True)
print(df)
This will give you the intended output:
Value First Last
0 Juan-Diva - HOLLS Juan-Diva HOLLS
1 Carlos - George - ESTE BAN - BOM Carlos - George ESTE BAN - BOM
2 Javier Plain - Hotham Ham - ALPINE Javier Plain Hotham Ham - ALPINE
3 Yul - KONJ KOL MON Yul KONJ KOL MON
In this regex pattern, r' - (?=[A-Z]+)'
, the (?=[A-Z]+)
is a positive lookahead assertion, ensuring that the delimiter ' - '
is followed by one or more uppercase letters. This helps in finding the last occurrence of ' - '
before all uppercase strings.