I used a workaround that digs into the MagicMock instance:
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]    # positional arguments of the recorded call
call_kwargs = mock_instance.call_args[1]  # keyword arguments of the recorded call
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())
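The workaround above can be sketched end to end. This is a minimal, self-contained example: the mock call and the `dataframe` keyword name follow the snippet above, but the calling code itself is hypothetical. The point is that `DataFrame.__eq__` is element-wise, so `assert_called_with` cannot compare DataFrames by value; inspecting `call_args` and using `assert_frame_equal` can.

```python
import pandas as pd
from unittest.mock import MagicMock

# Hypothetical collaborator; in real code this would replace some dependency.
mock_instance = MagicMock()

# Code under test would invoke the mock with a DataFrame keyword argument.
mock_instance(dataframe=pd.DataFrame())

# Dig into the recorded call: compare the DataFrame by value, since
# DataFrame equality is element-wise and unusable in assert_called_with.
assert mock_instance.call_count == 1
call_kwargs = mock_instance.call_args[1]
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())
```

If `assert_frame_equal` raises no `AssertionError`, the mock received an equal DataFrame.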
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

By contrast, an element-wise comparison of two DataFrame objects of the same shape returns a DataFrame where each element is True if the respective element in each DataFrame is equal and False otherwise; comparing two Series objects of the same length likewise returns a Series of element-wise True/False results.

In the examples below, the DataFrames df and different_data_type hold different types for the same values of their elements, so equals() returns False even though their column labels have the same values and types.
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
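To make the contrast concrete, here is a small sketch (with illustrative values of my own) showing how an element-wise comparison differs from equals(): the former returns a boolean DataFrame, the latter a single bool.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right = pd.DataFrame({'a': [1, 9], 'b': [3, 4]})

# Element-wise comparison: a DataFrame of booleans, one per element.
elementwise = left.eq(right)  # equivalent to left == right
print(elementwise)

# equals() reduces to a single bool: same shape, elements, and dtypes.
print(left.equals(right))  # False, because element a[1] differs
```

Note that equals() also checks dtypes, as the different_data_type example above shows, while element-wise comparison only checks values.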
The equals() method compares two (2) DataFrames or Series against each other to determine whether they have an identical shape and elements. If identical, it returns True; otherwise it returns False. For this example, we have two (2) DataFrames containing grades for three (3) students: df_scores1 is compared against df_scores2, testing the shape and elements, and the outcome saves to result (True/False). Printing the shape of each DataFrame to the terminal then confirms the shapes are equal.
To install this library, navigate to an IDE terminal and execute the command below at the command prompt. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
The syntax for this method is as follows:
DataFrame.equals(other)
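The original listing for the grades example is not reproduced here; a minimal sketch, assuming hypothetical scores for three students (the names df_scores1 and df_scores2 follow the description above, and the grade values are illustrative), could look like this:

```python
import pandas as pd

# Hypothetical grades for three (3) students; the actual values from the
# original example are not shown, so these numbers are illustrative.
df_scores1 = pd.DataFrame({'Math':    [93, 77, 88],
                           'Science': [81, 92, 76],
                           'English': [89, 84, 90]})
df_scores2 = df_scores1.copy()

# Compare shape and elements; the outcome saves to result (True/False).
result = df_scores1.equals(df_scores2)
print(result)

# Confirm the shape of the DataFrames is equal.
print(df_scores1.shape)
print(df_scores2.shape)
```

Because df_scores2 is an exact copy, equals() returns True and both shapes print as (3, 3).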
Output

True
(3, 3)
(3, 3)
DataComPy is a package to compare two Pandas DataFrames. It originally started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, with some more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it prints out some stats, and lets you tweak how accurate matches have to be). It was then extended to carry that functionality over to Spark DataFrames.

NOTE: if you only want to validate whether a dataframe matches exactly or not, you should look at pandas.testing.assert_frame_equal. The main use case for DataComPy is when you need to interpret the difference between two dataframes.

DataComPy will try to join two dataframes either on a list of join columns or on indexes. If the two dataframes have duplicates based on join values, the match process sorts by the remaining fields and joins based on that row number.

The class validates that you passed dataframes, that they contain all of the columns in join_columns, and that they have unique column names other than that. The class also lowercases all column names to disambiguate.
pip install datacompy
from io import StringIO

import pandas as pd

import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0,               # Optional, defaults to 0
    rel_tol=0,               # Optional, defaults to 0
    df1_name='Original',     # Optional, defaults to 'df1'
    df2_name='New'           # Optional, defaults to 'df2'
)
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())
import datetime

import datacompy
from pyspark.sql import Row

# This example assumes you have a SparkSession named "spark" in your environment, as you
# do when running `pyspark` from the terminal or in a Databricks notebook (Spark v2.0 and higher)

data1 = [
    Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis',
        float_fld=14530.1555, date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth',
        float_fld=1.0, date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth',
        float_fld=None, date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw',
        float_fld=345.12, date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth',
        float_fld=None, date_fld=datetime.date(2017, 1, 1))
]

data2 = [
    Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),
    Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),
    Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),
    Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),
    Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0)
]

base_df = spark.createDataFrame(data1)
compare_df = spark.createDataFrame(data2)

comparison = datacompy.SparkCompare(spark, base_df, compare_df, join_columns=['acct_id'])

# This prints out a human-readable report summarizing differences
comparison.report()