小能豆

Concat DataFrame: Reindexing only valid with uniquely valued Index objects

py

I am trying to concatenate the following DataFrames:

df1

                                price   side timestamp
timestamp           
2016-01-04 00:01:15.631331072   0.7286  2   1451865675631331
2016-01-04 00:01:15.631399936   0.7286  2   1451865675631400
2016-01-04 00:01:15.631860992   0.7286  2   1451865675631861
2016-01-04 00:01:15.631866112   0.7286  2   1451865675631866

and:

df2

                                bid     bid_size offer  offer_size
timestamp               
2016-01-04 00:00:31.331441920   0.7284  4000000 0.7285  1000000
2016-01-04 00:00:53.631324928   0.7284  4000000 0.7290  4000000
2016-01-04 00:01:03.131234048   0.7284  5000000 0.7286  4000000
2016-01-04 00:01:12.131444992   0.7285  1000000 0.7286  4000000
2016-01-04 00:01:15.631364096   0.7285  4000000 0.7290  4000000

 data = pd.concat([df1,df2], axis=1)  

But I get the following output:

InvalidIndexError                         Traceback (most recent call last)
<ipython-input-38-2e88458f01d7> in <module>()
----> 1 data = pd.concat([df1,df2], axis=1)
      2 data = data.fillna(method='pad')
      3 data = data.fillna(method='bfill')
      4 data['timestamp'] =  data.index.values#converting to datetime
      5 data['timestamp'] = pd.to_datetime(data['timestamp'])#converting to datetime

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    810                        keys=keys, levels=levels, names=names,
    811                        verify_integrity=verify_integrity,
--> 812                        copy=copy)
    813     return op.get_result()
    814 

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    947         self.copy = copy
    948 
--> 949         self.new_axes = self._get_new_axes()
    950 
    951     def get_result(self):

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_new_axes(self)
   1013                 if i == self.axis:
   1014                     continue
-> 1015                 new_axes[i] = self._get_comb_axis(i)
   1016         else:
   1017             if len(self.join_axes) != ndim - 1:

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_comb_axis(self, i)
   1039                 raise TypeError("Cannot concatenate list of %s" % types)
   1040 
-> 1041         return _get_combined_index(all_indexes, intersect=self.intersect)
   1042 
   1043     def _get_concat_axis(self):

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _get_combined_index(indexes, intersect)
   6120             index = index.intersection(other)
   6121         return index
-> 6122     union = _union_indexes(indexes)
   6123     return _ensure_index(union)
   6124 

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _union_indexes(indexes)
   6149 
   6150         if hasattr(result, 'union_many'):
-> 6151             return result.union_many(indexes[1:])
   6152         else:
   6153             for other in indexes[1:]:

/usr/local/lib/python2.7/site-packages/pandas/tseries/index.pyc in union_many(self, others)
    959             else:
    960                 tz = this.tz
--> 961                 this = Index.union(this, other)
    962                 if isinstance(this, DatetimeIndex):
    963                     this.tz = tz

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in union(self, other)
   1553                 result.extend([x for x in other._values if x not in value_set])
   1554         else:
-> 1555             indexer = self.get_indexer(other)
   1556             indexer, = (indexer == -1).nonzero()
   1557 

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1890 
   1891         if not self.is_unique:
-> 1892             raise InvalidIndexError('Reindexing only valid with uniquely'
   1893                                     ' valued Index objects')
   1894 

InvalidIndexError: Reindexing only valid with uniquely valued Index objects  

I have already dropped the extra columns and removed the duplicates and NAs that might cause a conflict, but I simply cannot see what is wrong.


2024-10-05

1 answer

小能豆

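The problem comes from duplicate timestamps in the index. With axis=1, pd.concat aligns rows by index label, which means reindexing each DataFrame onto the union of the two indexes, and pandas can only do that when the index values are unique. The error is telling you that at least one of the indexes contains duplicate values.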
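To make the cause concrete, here is a minimal sketch that builds two tiny frames shaped like the ones in the question. The names `trades` and `quotes` and all values are made up for illustration; the only important detail is that two rows of `trades` share one timestamp. The exact exception type may differ between pandas versions, so the sketch prints whatever is raised instead of assuming it.

```python
import pandas as pd

# Hypothetical toy data: the second and third trade share one timestamp.
trade_times = pd.to_datetime([
    "2016-01-04 00:01:15.631331",
    "2016-01-04 00:01:15.631400",
    "2016-01-04 00:01:15.631400",  # duplicate index label
])
trades = pd.DataFrame({"price": [0.7286, 0.7286, 0.7287]}, index=trade_times)

quote_times = pd.to_datetime([
    "2016-01-04 00:01:12.131445",
    "2016-01-04 00:01:15.631364",
])
quotes = pd.DataFrame({"bid": [0.7284, 0.7285]}, index=quote_times)

# is_unique is the property that the traceback's get_indexer checks.
print(trades.index.is_unique)
print(quotes.index.is_unique)

# Column-wise concat has to align on the union of both indexes,
# which is where a non-unique index gets rejected.
try:
    pd.concat([trades, quotes], axis=1)
except Exception as exc:  # broad on purpose: the type is version-dependent
    print(type(exc).__name__, exc)
```

Note that one non-unique index is enough to trigger the failure, even when the other index is clean.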

Here is how to fix it:

Steps to resolve the problem

  1. Check for duplicates in the index: First, check whether the timestamp index of either DataFrame contains duplicates:

print(df1.index.duplicated().sum())  # Number of repeated index labels in df1
print(df2.index.duplicated().sum())  # Number of repeated index labels in df2

If either count is greater than zero, you need to deal with the duplicates before concatenating the DataFrames.
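A count only says that duplicates exist. Before deciding what to do with them it can help to look at the rows themselves. The sketch below uses a made-up frame called `trades`; passing `keep=False` to `Index.duplicated` marks every member of a duplicate group rather than only the later repeats.

```python
import pandas as pd

# Hypothetical toy data with one repeated timestamp.
idx = pd.to_datetime([
    "2016-01-04 00:01:15.631331",
    "2016-01-04 00:01:15.631400",
    "2016-01-04 00:01:15.631400",
])
trades = pd.DataFrame(
    {"price": [0.7286, 0.7286, 0.7287], "side": [2, 2, 1]},
    index=idx,
)

# Default keep='first' flags only the repeats after the first occurrence.
n_extra = int(trades.index.duplicated().sum())

# keep=False flags all rows that share a label with another row.
dupe_rows = trades[trades.index.duplicated(keep=False)]

print(n_extra)
print(dupe_rows)
```

If the duplicated rows differ in their values, as they do here, dropping one of them loses information, which is worth knowing before picking a strategy.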

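  2. Remove or handle the duplicate index values: There are a couple of options:

    Option 1: Remove duplicates from the index

    If you do not need the repeated rows, you can drop them and keep only the first row for each timestamp:

    df1 = df1[~df1.index.duplicated(keep='first')]
    df2 = df2[~df2.index.duplicated(keep='first')]

    Option 2: Reset the index

    You can also reset the index so that the rows are labelled 0, 1, 2, ... instead of by timestamp. Because df1 already has a column named timestamp, a plain df1.reset_index() would fail with a name clash, so the old index is discarded here with drop=True:

    df1 = df1.reset_index(drop=True)
    df2 = df2.reset_index(drop=True)

    After the reset the concatenation no longer raises, but the rows are paired by position rather than by time, so use this option only if positional pairing is what you actually want.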
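Which row survives a de-duplication is a modelling choice, not something pandas decides for you. The sketch below compares three ways to collapse a repeated timestamp on a made-up `trades` frame: keep the first row, keep the last row, or aggregate the group. The `groupby(level=0).mean()` variant is an alternative I am adding for illustration; it is not part of the steps above.

```python
import pandas as pd

# Hypothetical toy data: two different prices recorded at the same timestamp.
idx = pd.to_datetime([
    "2016-01-04 00:01:15.631331",
    "2016-01-04 00:01:15.631400",
    "2016-01-04 00:01:15.631400",
])
trades = pd.DataFrame({"price": [0.7286, 0.7286, 0.7287]}, index=idx)

# Keep the first row seen for each timestamp.
first = trades[~trades.index.duplicated(keep="first")]

# Keep the last row seen for each timestamp.
last = trades[~trades.index.duplicated(keep="last")]

# Collapse each timestamp to one row by averaging (level=0 groups on the index).
mean = trades.groupby(level=0).mean()

for name, frame in [("first", first), ("last", last), ("mean", mean)]:
    print(name, frame.index.is_unique, frame["price"].tolist())
```

All three results have a unique index and can be concatenated column-wise; they differ only in the value kept for the repeated timestamp.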

  3. Concatenate the DataFrames: Once the duplicates are handled (or the index is reset), the concatenation works:

data = pd.concat([df1, df2], axis=1)

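  4. Forward-fill and back-fill missing data: As in your original code, you can fill the gaps forward and then backward. ffill() and bfill() do the same thing as fillna(method='pad') and fillna(method='bfill'), which newer pandas versions deprecate:

data = data.ffill()
data = data.bfill()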
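To show what the concatenate-then-fill steps produce, here is a small sketch on made-up frames (`trades` and `quotes` are illustrative names, and both already have unique indexes). Forward filling only makes sense in chronological order, so the sketch sorts the index explicitly rather than relying on the order `pd.concat` happens to return.

```python
import pandas as pd

# Hypothetical toy data with unique, non-overlapping timestamps.
trades = pd.DataFrame(
    {"price": [0.7286, 0.7287]},
    index=pd.to_datetime([
        "2016-01-04 00:01:15.631331",
        "2016-01-04 00:01:15.631400",
    ]),
)
quotes = pd.DataFrame(
    {"bid": [0.7284, 0.7285]},
    index=pd.to_datetime([
        "2016-01-04 00:01:12.131445",
        "2016-01-04 00:01:15.631364",
    ]),
)

# Outer join on the index: each timestamp becomes one row, and the
# columns of the frame that has no row at that time are NaN.
data = pd.concat([trades, quotes], axis=1).sort_index()
print(data)

# Carry the last known value forward, then back-fill whatever is still
# missing at the very start of the series.
filled = data.ffill().bfill()
print(filled)
```

Be aware that the back-fill step copies a later value into earlier rows, which may or may not be acceptable depending on how the data is used afterwards.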

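  5. Rebuild timestamp as datetime: If you want a timestamp column of datetime objects, take it from the index as your original code does. The timestamp column that df1 already carries holds raw integers, and passing those to pd.to_datetime without a unit would not give the dates you expect:

data['timestamp'] = data.index.values
data['timestamp'] = pd.to_datetime(data['timestamp'])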
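One thing to watch when converting: the integer `timestamp` values shown for df1 in the question look like microseconds since the Unix epoch. That is an inference from the sample rows, not something the question states. `pd.to_datetime` treats bare integers as nanoseconds by default, so the unit has to be given explicitly. A minimal sketch using two of the values from the question:

```python
import pandas as pd

# Two raw values copied from the df1 sample in the question.
raw = pd.Series([1451865675631331, 1451865675631400])

# Default unit is nanoseconds, which lands in January 1970.
wrong = pd.to_datetime(raw)

# Assumption: the integers are microseconds since the epoch.
right = pd.to_datetime(raw, unit="us")

print(wrong.tolist())
print(right.tolist())
```

With `unit="us"` the first value converts to 2016-01-04 00:01:15.631331, which matches the index shown for that row in the question.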

Example solution

import pandas as pd

# Step 1: Count repeated labels in each index
print(df1.index.duplicated().sum())
print(df2.index.duplicated().sum())

# Step 2: Option 1 - keep only the first row for each duplicated timestamp
df1 = df1[~df1.index.duplicated(keep='first')]
df2 = df2[~df2.index.duplicated(keep='first')]

# Step 3: Concatenate column-wise, aligning rows on the timestamp index
data = pd.concat([df1, df2], axis=1)

# Step 4: Fill missing data forward, then backward
# (same effect as fillna(method='pad') / fillna(method='bfill'))
data = data.ffill()
data = data.bfill()

# Step 5: Rebuild the timestamp column from the index (if needed)
data['timestamp'] = data.index.values
data['timestamp'] = pd.to_datetime(data['timestamp'])

This approach should resolve the InvalidIndexError and let you concatenate the DataFrames successfully.
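As a side note, if the goal is to attach the most recent quote to each trade, `pd.merge_asof` is a different technique that may fit without de-duplicating the trades first. This is an alternative I am suggesting, not the method described above, and the frames below are made-up examples. It requires both indexes to be sorted.

```python
import pandas as pd

# Hypothetical toy data: the trades index contains a repeated timestamp.
trades = pd.DataFrame(
    {"price": [0.7286, 0.7286, 0.7287]},
    index=pd.to_datetime([
        "2016-01-04 00:01:15.631331",
        "2016-01-04 00:01:15.631400",
        "2016-01-04 00:01:15.631400",
    ]),
)
quotes = pd.DataFrame(
    {"bid": [0.7284, 0.7285]},
    index=pd.to_datetime([
        "2016-01-04 00:01:12.131445",
        "2016-01-04 00:01:15.631364",
    ]),
)

# For every trade, take the latest quote at or before the trade time
# (direction='backward' is the default). Every trade row is kept.
merged = pd.merge_asof(
    trades.sort_index(),
    quotes.sort_index(),
    left_index=True,
    right_index=True,
)
print(merged)
```

The result has one row per trade, so repeated trade timestamps survive, and no back-fill step is involved.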

2024-10-05