How can pandas DataFrames appear identical but fail equals()?

This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:

>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

In core/internals.py, we have the BlockManager method

def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))

and that last all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:

>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
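
One way to sidestep the block-order question is to compare through the public API only. The sketch below is my own assumption, not the pandas fix: frames_match is a hypothetical helper that checks the axes, then the dtypes, then each column in turn, so the internal block layout never enters the comparison. The sample data is invented for illustration.

```python
import pandas as pd


def frames_match(left, right):
    """Hypothetical helper: equality check that uses only the public API."""
    # Same row and column labels, in the same order.
    if not left.index.equals(right.index):
        return False
    if not left.columns.equals(right.columns):
        return False
    # Same dtype for every column.
    if left.dtypes.tolist() != right.dtypes.tolist():
        return False
    # Same values column by column; Series.equals treats NaNs in the
    # same location as equal.
    return all(left[col].equals(right[col]) for col in left.columns)


# Illustrative data (not from the original question).
df1 = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'year': [2001, 2002, 2003, 2004, 2005],
    'director': ['p', 'q', 'r', 's', 't'],
})
# Same data, reached by reordering the columns and restoring the order.
df5 = df1[['year', 'title', 'director']][list(df1.columns)]

print(frames_match(df1, df5))                                # True
print(frames_match(df1, df1.astype({'year': 'float64'})))    # False
```

The second call returns False because only the dtype of year differs, which is the same strictness equals() applies.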

Suggestion : 2

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

Related methods: DataFrame.eq compares two DataFrame objects of the same shape and returns a DataFrame where each element is True if the respective element in each DataFrame is equal, False otherwise. Series.eq compares two Series objects of the same length and returns a Series where each element is True if the element in each Series is equal, False otherwise.

In the examples below, DataFrames df and different_data_type have different types for the same values of their elements, so equals() returns False even though their column labels are the same values and types.

>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20
>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True
>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True
>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
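
To see why two frames that print alike can still fail equals(), it can help to put the elementwise comparison and the dtypes side by side. A minimal sketch, reusing df and different_data_type from the example above; the variable names elementwise and same_dtypes are mine:

```python
import pandas as pd

df = pd.DataFrame({1: [10], 2: [20]})
different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})

# Elementwise comparison: 10 == 10.0, so every cell is True.
elementwise = df.eq(different_data_type)
print(elementwise.all().all())          # True

# equals() also requires matching dtypes, so it reports False.
print(df.equals(different_data_type))   # False

# Comparing the dtypes directly shows where the mismatch is.
same_dtypes = df.dtypes.tolist() == different_data_type.dtypes.tolist()
print(df.dtypes)
print(different_data_type.dtypes)
print(same_dtypes)                      # False
```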

Suggestion : 3

While working with pandas dataframes, you may need to check whether two dataframes are the same. This tutorial looks at how to compare two pandas dataframes for equality, with some examples, including what equals() returns when two dataframes have the same elements but different column names. When two dataframes df1 and df2 are identical, meaning (1) the values and datatypes of the elements are the same and (2) the values and datatypes of the row and column labels are the same, equals() returns True.

The pandas dataframe function equals() is used to compare two dataframes for equality. It returns True if the two dataframes have the same shape and elements. For two dataframes to be equal, the elements should have the same dtype. The column headers, however, do not need to have the same dtype. The following is the syntax:

df1.equals(df2)

1. Compare two exactly similar dataframes

import pandas as pd

# two identical dataframes
df1 = pd.DataFrame({
   'A': [1, 2],
   'B': ['x', 'y']
})
df2 = pd.DataFrame({
   'A': [1, 2],
   'B': ['x', 'y']
})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
   A  B
0  1  x
1  2  y
True

2. Compare two dataframes containing NaN values

Output:

DataFrame df1:
     A     B
0  1.0     x
1  NaN  None

DataFrame df2:
     A     B
0  1.0     x
1  NaN  None

Are both equal?
True
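
The code that produced this output is not included here. A minimal sketch that builds two frames with NaN and None in the same positions, assuming np.nan and None were the missing values used (the exact original code is unknown):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction: missing values sit in the same cells of both frames.
df1 = pd.DataFrame({
    'A': [1, np.nan],
    'B': ['x', None]
})
df2 = pd.DataFrame({
    'A': [1, np.nan],
    'B': ['x', None]
})

print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# equals() treats missing values in the same location as equal
print("\nAre both equal?")
print(df1.equals(df2))
```

By contrast, an elementwise check such as (df1 == df2).all().all() would report False here, because NaN never compares equal to itself.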

3. Compare two dataframes with equal values but different dtypes

import pandas as pd
import numpy as np

# two dataframes with equal values but different dtypes in column A
df1 = pd.DataFrame({
   'A': [1, 2],
   'B': ['x', 'y']
})
df2 = pd.DataFrame({
   'A': [1.0, 2.0],
   'B': ['x', 'y']
})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
     A  B
0  1.0  x
1  2.0  y

Are both equal?
False
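
When two frames differ only in dtype, as df1 and df2 in example 3 do, there are a couple of ways to compare the values alone. The sketch below shows two options that I believe fit this case: casting one frame to the other's dtypes before calling equals(), and pandas.testing.assert_frame_equal with check_dtype=False. The name aligned is mine.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})

# Strict comparison fails: column A is int64 in df1 and float64 in df2.
print(df1.equals(df2))       # False

# Option 1: cast df1 to df2's column dtypes, then compare.
aligned = df1.astype(df2.dtypes.to_dict())
print(aligned.equals(df2))   # True

# Option 2: assert on values only; raises AssertionError on a mismatch
# and returns None when the frames match.
pd.testing.assert_frame_equal(df1, df2, check_dtype=False)
```

Which option fits depends on whether a dtype difference should count as a real difference in your data.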

Suggestion : 4

How can Pandas DataFrames appear identical but fail equals()?
How to compare two dataframes with timestamps and create a dictionary of dataframes? Python Pandas
How to compare 2 non-identical dataframes in python
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

Outer-merge the two dfs:

merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
#   col_id  num  name_x  id  no  name_y
# 0      1    3   linda   1   2  granpa
# 1      2    4   james   2   6   linda
# 2    NaN  NaN     NaN   3   7     sam

Divide the merged frame into left/right frames and align their columns with set_axis:

cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
#   col_id  num   name
# 0      1    3  linda
# 1      2    4  james
# 2    NaN  NaN    NaN

right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
#   col_id  num    name
# 0      1    2  granpa
# 1      2    6   linda
# 2      3    7     sam

Compare the aligned left/right frames (use keep_equal=True to show equal cells):

left.compare(right, keep_shape=True, keep_equal=True)
#   col_id        num         name
#     self other self other   self   other
# 0      1     1    3     2  linda  granpa
# 1      2     2    4     6  james   linda
# 2    NaN     3  NaN     7    NaN     sam

left.compare(right, keep_shape=True)
#   col_id        num         name
#     self other self other   self   other
# 0    NaN   NaN    3     2  linda  granpa
# 1    NaN   NaN    4     6  james   linda
# 2    NaN     3  NaN     7    NaN     sam
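
The steps above can be run end to end as follows. The input frames df1 and df2 are not given in the answer, so the ones below are inferred from the merged output and should be read as an assumption. DataFrame.compare needs pandas 1.1 or later.

```python
import pandas as pd

# Input frames inferred from the merged output shown above (assumed).
df1 = pd.DataFrame({
    'col_id': [1, 2],
    'num': [3, 4],
    'name': ['linda', 'james'],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'no': [2, 6, 7],
    'name': ['granpa', 'linda', 'sam'],
})

# Outer-merge so rows present in only one frame are kept.
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')

# Split into left/right halves and give both the same column labels,
# because compare() only accepts identically-labeled frames.
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)

# keep_shape=True keeps every row and column, masking equal cells with NaN.
diff = left.compare(right, keep_shape=True)
print(diff)
```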

If I understand correctly, you want something like this:

new_df = (
    df1.drop(['name', 'num'], axis=1)
       .merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
)

Output:

>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam

Suggestion : 5

Updated Aug 04, 2022

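import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Assumption: outside a notebook where `spark` is predefined, create the session explicitly
spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")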

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

Suggestion : 6

This error is usually triggered when creating a copy of a dataframe that matches either a single or multiple conditions.

When searching for dataframe rows that only match a single condition, we can avoid the error with masking, using df[] and placing the statement generating the mask within the brackets, for example, df[df['price'] < 30000].

If looking for rows that match multiple conditions, to avoid the error, we must replace statements like and, or and not with their respective bitwise operators, &, | and ~. By using the | operator in place of or, we can return a copy containing rows that have a True value in the mask generated by either condition.

Let's consider the example dataframe below:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

import pandas as pd

df = pd.DataFrame.from_dict({
   'manufacturer': ['BMW', 'Kia', 'Mercedes', 'Audi'],
   'model': ['1 Series', 'Rio', 'A-Class', 'A3'],
   'price': [28000, 12500, 30000, 26500],
   'mileage': [1800, 4500, 400, 700]
})
if df['price'] < 20000:
    print(df)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-2357a7362348> in <module>
----> 1 if df['price'] < 20000:
      2     print(df)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1476 
   1477     def __nonzero__(self):
-> 1478         raise ValueError(
   1479             f"The truth value of a {type(self).__name__} is ambiguous. "
   1480             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
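
The comparison itself returns a boolean Series rather than a single True or False, which is why it cannot be used directly in an if statement:

df['price'] < 20000
0    False
1     True
2    False
3    False
Name: price, dtype: bool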
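
The masking approach described above can be sketched on the same example dataframe. The variable names and the mileage threshold are illustrative choices, not part of the original text:

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    'manufacturer': ['BMW', 'Kia', 'Mercedes', 'Audi'],
    'model': ['1 Series', 'Rio', 'A-Class', 'A3'],
    'price': [28000, 12500, 30000, 26500],
    'mileage': [1800, 4500, 400, 700]
})

# Single condition: use the boolean Series as a mask inside df[].
cheap = df[df['price'] < 20000]

# Multiple conditions: use |, & and ~ instead of or, and, not.
# Each condition needs its own parentheses because the bitwise
# operators bind more tightly than the comparisons.
cheap_or_low_mileage = df[(df['price'] < 20000) | (df['mileage'] < 1000)]
mid_price_low_mileage = df[(df['price'] < 30000) & (df['mileage'] < 1000)]
not_cheap = df[~(df['price'] < 20000)]

# If a single True/False is what the `if` needs, reduce the Series first.
if (df['price'] < 20000).any():
    print(cheap)
```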