This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:
>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype = 'object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype = 'int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype = 'object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype = 'int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
In core/internals.py, we have the BlockManager method
def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))
and that last all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:
>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
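To make the failure mode concrete, here is a toy sketch in plain Python. The `Block` tuple, its fields and both helper functions are hypothetical stand-ins invented for illustration, not pandas internals, and the sort key is only one possible choice. It shows that a pairwise `zip` comparison depends on block order, while sorting both sides by a canonical key first does not.

```python
from collections import namedtuple

# Toy stand-in for a block: dtype name, column positions covered, values.
# This is NOT the real pandas Block class.
Block = namedtuple("Block", ["dtype", "locs", "values"])

int_block = Block("int64", (1,), ((1999, 2001, 2003),))
obj_block = Block("object", (0, 2), (("a", "b", "c"), ("x", "y", "z")))

blocks_self = (int_block, obj_block)   # order reported for df1
blocks_other = (obj_block, int_block)  # order reported for df5


def naive_equals(blocks_a, blocks_b):
    """Pairwise comparison that assumes both sequences share one order."""
    return all(a == b for a, b in zip(blocks_a, blocks_b))


def order_independent_equals(blocks_a, blocks_b):
    """Sort both sides by a canonical key before pairing the blocks up."""
    def key(block):
        return (block.dtype, block.locs)

    if len(blocks_a) != len(blocks_b):
        return False
    return naive_equals(sorted(blocks_a, key=key), sorted(blocks_b, key=key))


print(naive_equals(blocks_self, blocks_other))              # False
print(order_independent_equals(blocks_self, blocks_other))  # True
```

The same blocks compare unequal or equal depending only on whether they are paired up in a consistent order, which matches the symptom in the print output above.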
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
See also DataFrame.eq, which compares two DataFrame objects of the same shape and returns a DataFrame where each element is True if the respective element in each DataFrame is equal, False otherwise, and Series.eq, which compares two Series objects of the same length and returns a Series where each element is True if the element in each Series is equal, False otherwise.
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20
>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True
>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True
DataFrames df and different_data_type have different types for the same values for their elements, and will return False even though their column labels are the same values and types.
>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
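The difference between the whole-frame check and the element-wise one is easy to see side by side. This is a small sketch reusing the `df` and `different_data_type` frames from the example above; the variable name `elementwise` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({1: [10], 2: [20]})
different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})

# Element-wise comparison: 10 == 10.0 holds in every cell,
# so the result is a DataFrame of True values.
elementwise = df.eq(different_data_type)
print(elementwise)

# Whole-frame comparison: the dtypes differ (int64 vs float64),
# so equals() reports the frames as not equal.
print(df.equals(different_data_type))
```

So `eq` answers "which cells match?", while `equals` answers "are these the same frame, dtypes included?".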
While working with pandas dataframes, you may need to check whether two dataframes are the same. In this tutorial, we’ll look at how to compare two pandas dataframes for equality, along with some examples.
The pandas dataframe function equals() is used to compare two dataframes for equality. It returns True if the two dataframes have the same shape and elements. For two dataframes to be equal, the elements should have the same dtype. The column headers, however, do not need to have the same dtype. The following is the syntax:
df1.equals(df2)
1. Compare two exactly similar dataframes
import pandas as pd

# two identical dataframes
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print(df1.equals(df2))
Output:
DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
   A  B
0  1  x
1  2  y
True
In the above example, two dataframes df1 and df2 are compared for equality using the equals() method. Since the dataframes are exactly similar (1. values and datatypes of elements are the same and 2. values and datatypes of row and column labels are the same), True is returned.
Output:
DataFrame df1:
     A     B
0  1.0     x
1  NaN  None

DataFrame df2:
     A     B
0  1.0     x
1  NaN  None

Are both equal?
True
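The code that produced this output is not shown. A minimal sketch that should behave the same way is below; the input frames are an assumption, reconstructed from the printed output, and the exact display of the missing value in column B can vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed inputs, reconstructed from the printed output above
df1 = pd.DataFrame({'A': [1, np.nan], 'B': ['x', None]})
df2 = pd.DataFrame({'A': [1, np.nan], 'B': ['x', None]})

# equals() treats missing values in the same location as equal
print("Are both equal?")
print(df1.equals(df2))

# Element-wise == does not: NaN == NaN is False, so not every cell matches
all_cells_match = bool((df1 == df2).all().all())
print(all_cells_match)
```

This is the practical reason to prefer `equals()` over `(df1 == df2).all().all()` when the data may contain missing values.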
3. Compare two dataframes with equal values but different dtypes
import pandas as pd
import numpy as np

# two dataframes with equal values but different dtypes in column 'A'
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))
Output:
DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
     A  B
0  1.0  x
1  2.0  y

Are both equal?
False
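If the dtype difference should not count as a mismatch, there are at least two ways to compare on values only. The sketch below shows both; the variable names are illustrative, and which option fits depends on whether you want a boolean or an assertion.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})

# Option 1: cast df2 to df1's column dtypes, then use equals()
same_after_cast = df1.equals(df2.astype(df1.dtypes.to_dict()))

# Option 2: assert_frame_equal raises AssertionError on a mismatch;
# check_dtype=False tells it to ignore dtype differences
try:
    pd.testing.assert_frame_equal(df1, df2, check_dtype=False)
    same_ignoring_dtype = True
except AssertionError:
    same_ignoring_dtype = False

print(same_after_cast, same_ignoring_dtype)
```

Option 2 is mostly useful in tests, since a failure produces a message describing where the frames differ.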
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames.
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
#    col_id  num  name_x  id  no  name_y
# 0       1    3   linda   1   2  granpa
# 1       2    4   james   2   6   linda
# 2     NaN  NaN     NaN   3   7     sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
#    col_id  num   name
# 0       1    3  linda
# 1       2    4  james
# 2     NaN  NaN    NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    2  granpa
# 1       2    6   linda
# 2       3    7     sam
compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
#   col_id        num         name
#     self other self other   self   other
# 0      1     1    3     2  linda  granpa
# 1      2     2    4     6  james   linda
# 2    NaN     3  NaN     7    NaN     sam
left.compare(right, keep_shape=True)
#   col_id        num         name
#     self other self other   self   other
# 0    NaN   NaN    3     2  linda  granpa
# 1    NaN   NaN    4     6  james   linda
# 2    NaN     3  NaN     7    NaN     sam
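For anyone who wants to run the steps end to end, here is a self-contained sketch. The two input frames are assumptions reconstructed from the commented output above (the original inputs are not shown), and `DataFrame.compare` needs pandas 1.1 or later.

```python
import pandas as pd

# Assumed inputs, reconstructed from the commented output above
df1 = pd.DataFrame({'col_id': [1, 2],
                    'num': [3, 4],
                    'name': ['linda', 'james']})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'no': [2, 6, 7],
                    'name': ['granpa', 'linda', 'sam']})

# Step 1: outer-merge on the id columns
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')

# Step 2: split into left/right halves and give both df1's column labels
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)

# Step 3: the halves are now identically labeled, so compare() accepts them
diff = left.compare(right, keep_shape=True)
print(diff)
```

The relabeling in step 2 is what satisfies the "identically-labeled" requirement; without it, `compare` raises an error because the column labels differ.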
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam
Updated Aug 04, 2022
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
This error is usually triggered when filtering a dataframe by one or more conditions and a boolean Series ends up where Python expects a single True or False value, as in an if statement or with and, or and not.
When searching for dataframe rows that match a single condition, we can avoid the error with masking: use df[] and place the statement generating the mask within the brackets, for example, df[df['price'] < 30000].
If looking for rows that match multiple conditions, we must replace and, or and not with their respective bitwise operators, &, | and ~. For example, using the | operator in place of or returns a copy containing rows that have a True value in the mask generated by either condition.
Let's consider the example dataframe below:
import pandas as pd
df = pd.DataFrame.from_dict({
'manufacturer': ['BMW', 'Kia', 'Mercedes', 'Audi'],
'model': ['1 Series', 'Rio', 'A-Class', 'A3'],
'price': [28000, 12500, 30000, 26500],
'mileage': [1800, 4500, 400, 700]
})
if df['price'] < 20000:
    print(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-2357a7362348> in <module>
----> 1 if df['price'] < 20000:
2 print(df)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1476
1477 def __nonzero__(self):
-> 1478 raise ValueError(
1479 f"The truth value of a {type(self).__name__} is ambiguous. "
1480 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
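The comparison on its own does not produce a single True or False. It returns a boolean Series with one value per row, which is why it cannot be used directly as an if condition:
df['price'] < 20000
0    False
1     True
2    False
3    False
Name: price, dtype: bool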
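The masking approach described above can be sketched as follows, using the same example dataframe. The result variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    'manufacturer': ['BMW', 'Kia', 'Mercedes', 'Audi'],
    'model': ['1 Series', 'Rio', 'A-Class', 'A3'],
    'price': [28000, 12500, 30000, 26500],
    'mileage': [1800, 4500, 400, 700]
})

# Single condition: index the frame with the boolean mask
cheap = df[df['price'] < 20000]

# Multiple conditions: wrap each one in parentheses and combine them with
# & (and), | (or) and ~ (not) instead of the Python keywords
cheap_or_low_mileage = df[(df['price'] < 20000) | (df['mileage'] < 500)]
mid_price_low_mileage = df[(df['price'] < 30000) & (df['mileage'] < 1000)]
not_cheap = df[~(df['price'] < 20000)]

print(cheap_or_low_mileage)
```

The parentheses matter: `&` and `|` bind more tightly than `<`, so leaving them out changes what gets compared and typically raises an error of its own.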