I am trying to concatenate the following DataFrames:
df1
                                price  side         timestamp
timestamp
2016-01-04 00:01:15.631331072  0.7286     2  1451865675631331
2016-01-04 00:01:15.631399936  0.7286     2  1451865675631400
2016-01-04 00:01:15.631860992  0.7286     2  1451865675631861
2016-01-04 00:01:15.631866112  0.7286     2  1451865675631866
and:
df2
                                  bid  bid_size   offer  offer_size
timestamp
2016-01-04 00:00:31.331441920  0.7284   4000000  0.7285     1000000
2016-01-04 00:00:53.631324928  0.7284   4000000  0.7290     4000000
2016-01-04 00:01:03.131234048  0.7284   5000000  0.7286     4000000
2016-01-04 00:01:12.131444992  0.7285   1000000  0.7286     4000000
2016-01-04 00:01:15.631364096  0.7285   4000000  0.7290     4000000
with
data = pd.concat([df1,df2], axis=1)
but I get the following output:
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-38-2e88458f01d7> in <module>()
----> 1 data = pd.concat([df1,df2], axis=1)
      2 data = data.fillna(method='pad')
      3 data = data.fillna(method='bfill')
      4 data['timestamp'] = data.index.values#converting to datetime
      5 data['timestamp'] = pd.to_datetime(data['timestamp'])#converting to datetime

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    810                        keys=keys, levels=levels, names=names,
    811                        verify_integrity=verify_integrity,
--> 812                        copy=copy)
    813     return op.get_result()
    814

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    947         self.copy = copy
    948
--> 949         self.new_axes = self._get_new_axes()
    950
    951     def get_result(self):

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_new_axes(self)
   1013                 if i == self.axis:
   1014                     continue
-> 1015                 new_axes[i] = self._get_comb_axis(i)
   1016         else:
   1017             if len(self.join_axes) != ndim - 1:

/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_comb_axis(self, i)
   1039                 raise TypeError("Cannot concatenate list of %s" % types)
   1040
-> 1041         return _get_combined_index(all_indexes, intersect=self.intersect)
   1042
   1043     def _get_concat_axis(self):

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _get_combined_index(indexes, intersect)
   6120             index = index.intersection(other)
   6121         return index
-> 6122     union = _union_indexes(indexes)
   6123     return _ensure_index(union)
   6124

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _union_indexes(indexes)
   6149
   6150         if hasattr(result, 'union_many'):
-> 6151             return result.union_many(indexes[1:])
   6152         else:
   6153             for other in indexes[1:]:

/usr/local/lib/python2.7/site-packages/pandas/tseries/index.pyc in union_many(self, others)
    959             else:
    960                 tz = this.tz
--> 961                 this = Index.union(this, other)
    962                 if isinstance(this, DatetimeIndex):
    963                     this.tz = tz

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in union(self, other)
   1553                 result.extend([x for x in other._values if x not in value_set])
   1554         else:
-> 1555             indexer = self.get_indexer(other)
   1556             indexer, = (indexer == -1).nonzero()
   1557

/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1890
   1891         if not self.is_unique:
-> 1892             raise InvalidIndexError('Reindexing only valid with uniquely'
   1893                                     ' valued Index objects')
   1894

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have already dropped the extra columns and removed the duplicates and NAs that might conflict, but I cannot work out what is wrong.
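The problem comes from duplicate timestamps in the index. A column-wise concat (axis=1) has to align the two DataFrames on their indexes, which means reindexing, and pandas can only reindex against an index whose values are unique. The error message says exactly that: at least one of the indexes contains repeated values.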
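A minimal reproduction may help make this concrete. The sketch below uses made-up timestamps and values (not the asker's data), with one repeated timestamp in `left`. The exception class raised by `pd.concat` can differ between pandas versions, so the sketch catches a generic `Exception` and prints whatever it gets.

```python
import pandas as pd

# Hypothetical toy data: `left` has a repeated timestamp in its index.
left = pd.DataFrame(
    {"price": [0.7286, 0.7286, 0.7287]},
    index=pd.to_datetime(
        ["2016-01-04 00:01:15", "2016-01-04 00:01:15", "2016-01-04 00:01:16"]
    ),
)
right = pd.DataFrame(
    {"bid": [0.7284, 0.7285]},
    index=pd.to_datetime(["2016-01-04 00:00:31", "2016-01-04 00:01:12"]),
)

print(left.index.is_unique)   # the first timestamp appears twice
print(right.index.is_unique)

# Column-wise concat must align both frames on a combined index, which
# is not possible when one of them carries repeated labels.
concat_failed = False
try:
    pd.concat([left, right], axis=1)
except Exception as exc:  # exception class varies by pandas version
    concat_failed = True
    print(type(exc).__name__, exc)
```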
Here is how to fix it:
Check the timestamp index for duplicates:
print(df1.index.duplicated().sum())  # Check for duplicates in df1
print(df2.index.duplicated().sum())  # Check for duplicates in df2
If either count is non-zero, the duplicates need to be handled before concatenating the DataFrames.
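Before deciding how to handle them, it can be useful to look at the offending rows rather than just count them. A small sketch, assuming a stand-in frame `df` with invented values in place of `df1`:

```python
import pandas as pd

# Hypothetical stand-in for df1 with one repeated timestamp.
df = pd.DataFrame(
    {"price": [0.7286, 0.7290, 0.7287], "side": [2, 1, 2]},
    index=pd.to_datetime(
        ["2016-01-04 00:01:15", "2016-01-04 00:01:15", "2016-01-04 00:01:16"]
    ),
)
df.index.name = "timestamp"

# keep=False flags every member of a duplicated group, not only the
# later repeats, so the conflicting rows can be compared side by side.
dupes = df[df.index.duplicated(keep=False)]
print(dupes)

# The default keep='first' flags only the repeats, i.e. the rows that
# dropping duplicates with keep='first' would discard.
n_repeats = int(df.index.duplicated().sum())
print(n_repeats)
```

If the flagged rows differ in their values, as they do in this toy frame, dropping all but the first one loses information, which is worth knowing before choosing an option.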
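Remove or otherwise handle the duplicate index values: there are a few options.
Option 1: Drop the duplicates from the index
If the duplicated rows are not needed, you can drop them: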
df1 = df1[~df1.index.duplicated(keep='first')]
df2 = df2[~df2.index.duplicated(keep='first')]
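If the duplicated rows do carry information, one alternative to dropping them is to collapse each timestamp into a single row with an aggregation. This is a variation on the option above, not something the original answer proposes. The column names and the choice of aggregations (`last` price, summed `qty`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with two rows sharing the first timestamp.
df = pd.DataFrame(
    {"price": [0.7286, 0.7290, 0.7287], "qty": [100, 300, 200]},
    index=pd.to_datetime(
        ["2016-01-04 00:01:15", "2016-01-04 00:01:15", "2016-01-04 00:01:16"]
    ),
)

# Group on the index itself (level=0) and reduce each group to one row:
# keep the last price seen at that timestamp and add up the quantities.
collapsed = df.groupby(level=0).agg({"price": "last", "qty": "sum"})
print(collapsed)
print(collapsed.index.is_unique)
```

The result has a unique index, so it can be passed to `pd.concat(..., axis=1)` in the same way as the de-duplicated frames.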
Option 2: Reset the index
You can also reset the index so that timestamp becomes a regular column instead of the index:
df1 = df1.reset_index()
df2 = df2.reset_index()
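After the reset, timestamp is a regular column and both DataFrames have a default integer index, so the concat no longer hits the duplicate-index error. Two caveats apply. First, df1 already has a column named timestamp, so reset_index() raises a ValueError unless that column is dropped or the index is renamed first. Second, with axis=1 the rows are then paired by position rather than by time.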
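A sketch of the reset-index route on toy frames shaped like the question's data. The names `trades`, `quotes`, `ts`, and `quote_ts` are invented for illustration. It shows the name clash with the existing `timestamp` column and the positional pairing that results from the concat.

```python
import pandas as pd

# Toy frame shaped like df1: a datetime index named 'timestamp' plus an
# integer column that is also called 'timestamp'.
trades = pd.DataFrame(
    {
        "price": [0.7286, 0.7286],
        "timestamp": [1451865675631331, 1451865675631400],
    },
    index=pd.Index(
        pd.to_datetime(["2016-01-04 00:01:15", "2016-01-04 00:01:15"]),
        name="timestamp",
    ),
)
quotes = pd.DataFrame(
    {"bid": [0.7284, 0.7285]},
    index=pd.Index(
        pd.to_datetime(["2016-01-04 00:00:31", "2016-01-04 00:01:12"]),
        name="timestamp",
    ),
)

# reset_index() tries to insert a second 'timestamp' column and refuses.
reset_failed = False
try:
    trades.reset_index()
except ValueError as exc:
    reset_failed = True
    print(exc)

# Renaming the index level first avoids the clash.
trades_flat = trades.rename_axis("ts").reset_index()
quotes_flat = quotes.rename_axis("quote_ts").reset_index()

# Both frames now share a default integer index, so the rows are paired
# by position (row 0 with row 0, and so on), not by time.
side_by_side = pd.concat([trades_flat, quotes_flat], axis=1)
print(side_by_side)
```

Because the pairing is positional, this route only makes sense when the rows of the two frames already correspond one to one. For time alignment, the de-duplicated index from option 1 is the closer fit.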
Concatenate the DataFrames: once the duplicates are handled or the index is reset, you can concatenate:
data = pd.concat([df1, df2], axis=1)
data = data.fillna(method='pad')
data = data.fillna(method='bfill')
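# With the timestamps still in the index (option 1), copy them into a column first, as the question's code does
data['timestamp'] = data.index.values
data['timestamp'] = pd.to_datetime(data['timestamp'])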
# Step 1: Check for duplicates in the index
print(df1.index.duplicated().sum())
print(df2.index.duplicated().sum())

# Step 2: Option 1 - Remove duplicates in the index
df1 = df1[~df1.index.duplicated(keep='first')]
df2 = df2[~df2.index.duplicated(keep='first')]

# Step 3: Concatenate the DataFrames
data = pd.concat([df1, df2], axis=1)

# Step 4: Fill missing data
data = data.fillna(method='pad')
data = data.fillna(method='bfill')

# Step 5: Convert timestamp to datetime (if needed); copy it from the index first, as the question's code does
data['timestamp'] = data.index.values
data['timestamp'] = pd.to_datetime(data['timestamp'])
This should resolve the InvalidIndexError and let you concatenate the DataFrames.
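For anyone who wants to try the steps without the original files, here is a self-contained sketch on toy data. The values loosely follow the question, with one deliberately repeated trade timestamp. It uses `DataFrame.ffill()` and `DataFrame.bfill()` in place of `fillna(method=...)`, since recent pandas releases deprecate the `method` argument. That substitution and the explicit `sort_index()` are choices made for this sketch.

```python
import pandas as pd

# Toy trades with a deliberately repeated timestamp.
df1 = pd.DataFrame(
    {"price": [0.7286, 0.7286, 0.7286], "side": [2, 2, 2]},
    index=pd.to_datetime(
        [
            "2016-01-04 00:01:15.631331",
            "2016-01-04 00:01:15.631331",  # duplicate
            "2016-01-04 00:01:15.631861",
        ]
    ),
)
# Toy quotes with a unique index.
df2 = pd.DataFrame(
    {"bid": [0.7284, 0.7285], "offer": [0.7285, 0.7290]},
    index=pd.to_datetime(
        ["2016-01-04 00:00:31.331442", "2016-01-04 00:01:15.631364"]
    ),
)

# Keep the first row for each timestamp in both frames.
df1 = df1[~df1.index.duplicated(keep="first")]
df2 = df2[~df2.index.duplicated(keep="first")]

# Outer-align on the timestamps, then order the rows chronologically so
# that forward filling carries the most recent earlier value.
data = pd.concat([df1, df2], axis=1).sort_index()

# Forward fill, then back fill whatever is still missing at the start.
data = data.ffill().bfill()

# The index already holds datetimes; expose it as a column as well.
data["timestamp"] = data.index
print(data)
```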