I’m trying to split a column by the last ' - ' AND being followed by all uppercase strings letters.
' - '
It may not necessarily be the last delimiter in isolation. But it will be the last before all uppercase strings.
I can find separate questions that separate based on first/last delimiter. But not with a combination.
Below, I have a df with Value containing various combinations. I want to split the col into two individuals columns, whereby, everything before the last ' - ' and uppercase letters.
Value
I’ve got Last column correct but not First column.
Last
First
df = pd.DataFrame({ 'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'], })
option 1)
df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True) df['Last'] = df['Value'].str.split('- ').str[-1]
option 2)
# Regular expression pattern pattern = r'^(.*) - ([A-Z\s]+)$' # Extract groups into two new columns df[['First', 'Last']] = df['Value'].str.extract(pattern)
option 3)
df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)
None of these options return the intended output.
intended output:
First Last 0 Juan-Diva HOLLS 1 Carlos - George ESTE BAN - BOM 2 Javier Plain - Hotham Ham ALPINE 3 Yul KONJ KOL MON
regex:
df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)') Value First Last 0 Juan-Diva - HOLLS Juan D 1 Carlos - George - ESTE BAN - BOM Carlos G 2 Javier Plain - Hotham Ham - ALPINE Javier Plain H 3 Yul - KONJ KOL MON Yul KONJ KOL MON
You can use the rsplit function with a positive lookahead assertion in the regular expression to achieve the desired result. Here’s how you can modify option 3:
rsplit
import pandas as pd df = pd.DataFrame({ 'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'], }) # Extract groups into two new columns using rsplit with lookahead assertion df[['First', 'Last']] = df['Value'].str.rsplit(r' - (?=[A-Z]+)', n=1, expand=True) print(df)
This will give you the intended output:
Value First Last 0 Juan-Diva - HOLLS Juan-Diva HOLLS 1 Carlos - George - ESTE BAN - BOM Carlos - George ESTE BAN - BOM 2 Javier Plain - Hotham Ham - ALPINE Javier Plain Hotham Ham - ALPINE 3 Yul - KONJ KOL MON Yul KONJ KOL MON
In this regex pattern, r' - (?=[A-Z]+)', the (?=[A-Z]+) is a positive lookahead assertion, ensuring that the delimiter ' - ' is followed by one or more uppercase letters. This helps in finding the last occurrence of ' - ' before all uppercase strings.
r' - (?=[A-Z]+)'
(?=[A-Z]+)