Replacing second row of each group with first row of another dataframe

小能豆

I have two dataframes:

import pandas as pd 

df1 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
        'open': [99, 22, 34, 63, 75, 86, 1800, 82],
        'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
        'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],

    }
)


df2 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
        'open': [77, 232, 434, 33, 55, 66, 1000],
        'high': [177, 11123, 1123, 343, 55, 3545, 21323],
        'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
    }
)

And this is the output that I want:

  sym  open   high   x
0   a    99   3987  gd
1   a    77    177  ed
2   a    34  46123  we
3   a    63   6643  vt
4   b    75     75  de
5   b   434   1123  sw
6   b  1800  72123  ee
7   c    82     74  et

These are the steps needed. Groups are defined by sym:

a) Select the first row of each group in df2.

b) Only open and high are needed from that row.

c) Replace the open and high values in the second row of each group in df1 with these values.

So for example for group a:

a) df2: row 0 is selected

b) df2: open is 77 and high is 177

c) df1: in row 1, 22 and 41123 are replaced with 77 and 177.

This is what I have tried. It raises an IndexError. Even if it did not raise that error, it does not feel like the right approach:

def replace_second_row(df):
    selected_sym = df.sym.iloc[0]
    row = df2.loc[df2.sym == selected_sym]
    row = row[['open', 'high']].iloc[0]
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
    return df


output = df1.groupby('sym').apply(replace_second_row)

The traceback of the IndexError above:

Traceback (most recent call last):
  File "D:\python\py_files\example_df.py", line 1618, in <module>
    x = df1.groupby('sym').apply(replace_second_row)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "D:\python\py_files\example_df.py", line 1614, in replace_second_row
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 689, in __setitem__
    self._has_valid_setitem_indexer(key)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 1401, in _has_valid_setitem_indexer
    raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object

To clarify the process, I have uploaded an image. The highlighted rows are the ones that need to be selected or changed.

2023-12-17

1 answer

小能豆

You can achieve the desired output by using the following approach:

import pandas as pd

df1 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
        'open': [99, 22, 34, 63, 75, 86, 1800, 82],
        'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
        'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
    }
)

df2 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
        'open': [77, 232, 434, 33, 55, 66, 1000],
        'high': [177, 11123, 1123, 343, 55, 3545, 21323],
        'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
    }
)

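# Step 1: Take open and high from the first row of each group in df2, indexed by sym
first_rows = df2.groupby('sym')[['open', 'high']].first()

# Step 2: Mark the second row of each group in df1 (cumcount numbers rows 0, 1, 2, ... within a group)
is_second = df1.groupby('sym').cumcount().eq(1) & df1['sym'].isin(first_rows.index)

# Step 3: Look up the replacement values by sym and write them into the marked rows
df1.loc[is_second, ['open', 'high']] = first_rows.reindex(df1.loc[is_second, 'sym']).to_numpy()

print(df1)

Output:

  sym  open   high   x
0   a    99   3987  gd
1   a    77    177  ed
2   a    34  46123  we
3   a    63   6643  vt
4   b    75     75  de
5   b   434   1123  sw
6   b  1800  72123  ee
7   c    82     74  et

Explanation:

first_rows holds the open and high values from the first row of each group in df2, indexed by sym. Note that first() returns the first non-null value per column, which matches the first row here because df2 has no missing values.
is_second is a boolean mask that is True only for the second row of each group in df1. Group c has a single row in df1, so nothing is selected for it and it is left unchanged.
reindex aligns the replacement rows with the sym of each selected row, and to_numpy() makes the assignment positional instead of aligning on index labels.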

This approach avoids calling apply with a custom Python function for every group, so it tends to be faster on large datasets.
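As for the IndexError in the question: group c has only one row in df1, so df.iloc[1, ...] inside replace_second_row points past the end of that group, and iloc refuses to add a row. If a per-group function is still preferred, a minimal sketch with a length check could look like the following. The names first_rows and the loop over groups with pd.concat are my own choices for illustration, not part of the original attempt, and drop_duplicates is used here to take strictly the first row per sym.

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})

# First row per sym in df2 (by position), indexed by sym
first_rows = df2.drop_duplicates('sym').set_index('sym')[['open', 'high']]


def replace_second_row(group):
    """Return a copy of the group with its second row overwritten, if it has one."""
    group = group.copy()
    sym = group['sym'].iloc[0]
    # Guard: skip groups without a second row or without a match in df2
    if len(group) < 2 or sym not in first_rows.index:
        return group
    group.loc[group.index[1], ['open', 'high']] = first_rows.loc[sym].to_numpy()
    return group


# Process each group and stitch the pieces back together in the original order
pieces = [replace_second_row(g) for _, g in df1.groupby('sym', sort=False)]
output = pd.concat(pieces)
print(output)
```

The check on len(group) is what prevents the error for group c. The vectorized version is still likely the better choice for large frames.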

2023-12-17