Split column by last delimiter AND uppercase values

小能豆

Split column by last delimiter AND uppercase values - python

I’m trying to split a column by the last ' - ' AND being followed by all uppercase strings letters.

It may not necessarily be the last delimiter in isolation. But it will be the last before all uppercase strings.

I can find separate questions that separate based on first/last delimiter. But not with a combination.

Below, I have a df with Value containing various combinations. I want to split the col into two individuals columns, whereby, everything before the last ' - ' and uppercase letters.

I’ve got Last column correct but not First column.

df = pd.DataFrame({
   'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
   })

option 1)

df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True)

df['Last'] = df['Value'].str.split('- ').str[-1]

option 2)

# Regular expression pattern
pattern = r'^(.*) - ([A-Z\s]+)$'

# Extract groups into two new columns
df[['First', 'Last']] = df['Value'].str.extract(pattern)

option 3)

df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)

None of these options return the intended output.

intended output:

                       First            Last
0                  Juan-Diva           HOLLS
1            Carlos - George  ESTE BAN - BOM
2  Javier Plain - Hotham Ham          ALPINE
3                        Yul    KONJ KOL MON

regex:

df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)')
                                Value         First          Last
0                   Juan-Diva - HOLLS          Juan             D
1    Carlos - George - ESTE BAN - BOM        Carlos             G
2  Javier Plain - Hotham Ham - ALPINE  Javier Plain             H
3                  Yul - KONJ KOL MON           Yul  KONJ KOL MON

阅读 113

2023-12-12

共1个答案

小能豆

You can use the rsplit function with a positive lookahead assertion in the regular expression to achieve the desired result. Here’s how you can modify option 3:

import pandas as pd

df = pd.DataFrame({
    'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
})

# Extract groups into two new columns using rsplit with lookahead assertion
df[['First', 'Last']] = df['Value'].str.rsplit(r' - (?=[A-Z]+)', n=1, expand=True)

print(df)

This will give you the intended output:

                      Value          First          Last
0           Juan-Diva - HOLLS       Juan-Diva         HOLLS
1  Carlos - George - ESTE BAN - BOM    Carlos - George    ESTE BAN - BOM
2  Javier Plain - Hotham Ham - ALPINE   Javier Plain    Hotham Ham - ALPINE
3               Yul - KONJ KOL MON         Yul         KONJ KOL MON

In this regex pattern, r' - (?=[A-Z]+)', the (?=[A-Z]+) is a positive lookahead assertion, ensuring that the delimiter ' - ' is followed by one or more uppercase letters. This helps in finding the last occurrence of ' - ' before all uppercase strings.

2023-12-12