I have two dataframes:
    import pandas as pd

    df1 = pd.DataFrame(
        {
            'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
            'open': [99, 22, 34, 63, 75, 86, 1800, 82],
            'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
            'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
        }
    )

    df2 = pd.DataFrame(
        {
            'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
            'open': [77, 232, 434, 33, 55, 66, 1000],
            'high': [177, 11123, 1123, 343, 55, 3545, 21323],
            'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
        }
    )
And this is the output that I want:
      sym  open   high   x
    0   a    99   3987  gd
    1   a    77    177  ed
    2   a    34  46123  we
    3   a    63   6643  vt
    4   b    75     75  de
    5   b   434   1123  sw
    6   b  1800  72123  ee
    7   c    82     74  et
These are the steps needed. Groups are defined by sym:
a) Select the first row of each group in df2.
b) Only open and high are needed from that row.
c) Replace the values in the second row of each group in df1 with these values.
So, for example, for group a:
a) df2: row 0 is selected.
b) df2: open is 77 and high is 177.
c) In row 1 of df1, 22 and 41123 are replaced with 77 and 177.
This is what I have tried. It raises an IndexError, and even without that error it does not feel like the right approach:
    def replace_second_row(df):
        selected_sym = df.sym.iloc[0]
        row = df2.loc[df2.sym == selected_sym]
        row = row[['open', 'high']].iloc[0]
        df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
        return df


    output = df1.groupby('sym').apply(replace_second_row)
The traceback of the above IndexError:
    Traceback (most recent call last):
      File "D:\python\py_files\example_df.py", line 1618, in <module>
        x = df1.groupby('sym').apply(replace_second_row)
      File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
        result = self._python_apply_general(f, self._selected_obj)
      File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
        keys, values, mutated = self.grouper.apply(f, data, self.axis)
      File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
        res = f(group)
      File "D:\python\py_files\example_df.py", line 1614, in replace_second_row
        df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
      File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 689, in __setitem__
        self._has_valid_setitem_indexer(key)
      File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 1401, in _has_valid_setitem_indexer
        raise IndexError("iloc cannot enlarge its target object")
    IndexError: iloc cannot enlarge its target object
To clarify the process, I have uploaded an image. The highlighted rows are the ones that need to be selected or changed.
You can achieve the desired output by using the following approach:
    import pandas as pd

    df1 = pd.DataFrame(
        {
            'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
            'open': [99, 22, 34, 63, 75, 86, 1800, 82],
            'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
            'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
        }
    )

    df2 = pd.DataFrame(
        {
            'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
            'open': [77, 232, 434, 33, 55, 66, 1000],
            'high': [177, 11123, 1123, 343, 55, 3545, 21323],
            'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
        }
    )

    # Step 1: first 'open' and 'high' of each group in df2, indexed by sym
    replacement_dict = {
        'open': df2.groupby('sym')['open'].first(),
        'high': df2.groupby('sym')['high'].first(),
    }

    # Step 2: mark the second row of each group in df1 (cumcount starts at 0)
    second = df1.groupby('sym').cumcount() == 1

    # Step 3: overwrite only those rows, looking up each row's sym per column
    for col, values in replacement_dict.items():
        df1.loc[second, col] = df1.loc[second, 'sym'].map(values)

    print(df1)

Output:

      sym  open   high   x
    0   a    99   3987  gd
    1   a    77    177  ed
    2   a    34  46123  we
    3   a    63   6643  vt
    4   b    75     75  de
    5   b   434   1123  sw
    6   b  1800  72123  ee
    7   c    82     74  et
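Explanation:
replacement_dict holds, for each of open and high, a Series with the first value of every sym group in df2.
map then looks up each row's sym in that Series to get its replacement value.
This approach avoids apply with a custom function, which usually makes it faster on large datasets.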
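As for the IndexError in the question: group c has only one row in df1, so df.iloc[1, ...] points past the end of that group, and iloc refuses to add rows ("iloc cannot enlarge its target object"). If you prefer to keep the per-group logic, a minimal sketch with a guard for short groups could look like the following. The (group, sym) signature and the explicit loop over the groupby object are choices made here for illustration, not part of the original attempt.

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})


def replace_second_row(group, sym):
    """Return a copy of group whose second row takes open/high from df2."""
    group = group.copy()  # leave df1 itself untouched
    if len(group) < 2:
        return group  # no second row (group c here), nothing to replace
    match = df2.loc[df2['sym'] == sym, ['open', 'high']]
    if match.empty:
        return group  # sym does not occur in df2
    # group.index[1] is the label of the second row of this group
    group.loc[group.index[1], ['open', 'high']] = match.iloc[0].to_numpy()
    return group


# Iterating over the groupby yields (key, sub-frame) pairs
pieces = [replace_second_row(group, sym)
          for sym, group in df1.groupby('sym', sort=False)]
output = pd.concat(pieces)
print(output)
```

With the sample data this changes only rows 1 and 5 and leaves group c as it is, since c has no second row in df1.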