If you pass the keys parameter to concat, the columns of the resulting dataframe will be a MultiIndex that keeps track of which original dataframe each column came from:
In[1]: c = pd.concat([df, df2], axis = 1, keys = ['df1', 'df2'])
c
Out[1]:
    df1            df2
      a    b    c    a   b   c
A    na   na   na    1  na   1
B    na    1    1   na  na  na
C    na    1   na   na   1  na
D   NaN  NaN  NaN   na   1  na
Since the underlying arrays now have the same shape, you can use == to broadcast the comparison and use the result as a mask to return all matching values:
In[171]: m = c.df1[c.df1 == c.df2]; m
Out[171]:
a b c
A NaN NaN NaN
B NaN NaN NaN
C NaN 1 NaN
D NaN NaN NaN
If your 'na' values are actually zeros, you could use a sparse matrix to reduce this to the coordinates of the matching values (though you'll lose your index and column names):
import scipy.sparse as sp
print(sp.coo_matrix(m.where(m.notnull(), 0)))
(2, 1) 1.0
Or, slightly more readable:
m = (df1 != df2)
different_indices = [(i, j)
                     for i in range(len(m.columns))
                     for j in range(len(m))
                     if m.iloc[j, i]]
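If you also want to keep the row and column labels rather than bare positions, a minimal sketch (assuming df1 and df2 are the two aligned dataframes being compared) is to feed the boolean mask to numpy and map the positions back through the index and columns:

import numpy as np

# Boolean mask of mismatching cells, then map each (row, col) position
# back to its index and column labels
m = (df1 != df2)
for i, j in np.argwhere(m.values):
    print(m.index[i], m.columns[j])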
Combine data from multiple files into a single DataFrame using merge and concat. Combine two DataFrames using a unique ID found in both DataFrames.
import pandas as pd
surveys_df = pd.read_csv("data/surveys.csv",
keep_default_na = False, na_values = [""])
surveys_df
       record_id  month  day  year  plot species  sex  hindfoot_length  weight
0              1      7   16  1977     2      NA    M               32     NaN
1              2      7   16  1977     3      NA    M               33     NaN
2              3      7   16  1977     2      DM    F               37     NaN
3              4      7   16  1977     7      DM    M               36     NaN
4              5      7   16  1977     3      DM    M               35     NaN
...          ...    ...  ...   ...   ...     ...  ...              ...     ...
35544      35545     12   31  2002    15      AH  NaN              NaN     NaN
35545      35546     12   31  2002    15      AH  NaN              NaN     NaN
35546      35547     12   31  2002    10      RM    F               15      14
35547      35548     12   31  2002     7      DO    M               36      51
35548      35549     12   31  2002     5     NaN  NaN              NaN     NaN

[35549 rows x 9 columns]
species_df = pd.read_csv("data/species.csv",
keep_default_na = False, na_values = [""])
species_df
   species_id             genus          species    taxa
0          AB        Amphispiza        bilineata    Bird
1          AH  Ammospermophilus          harrisi  Rodent
2          AS        Ammodramus       savannarum    Bird
3          BA           Baiomys          taylori  Rodent
4          CB   Campylorhynchus  brunneicapillus    Bird
..        ...               ...              ...     ...
49         UP            Pipilo              sp.    Bird
50         UR            Rodent              sp.  Rodent
51         US           Sparrow              sp.    Bird
52         ZL       Zonotrichia       leucophrys    Bird
53         ZM           Zenaida         macroura    Bird

[54 rows x 4 columns]
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
# Reset the index values so the second dataframe appends properly
survey_sub_last10 = survey_sub_last10.reset_index(drop = True)
# drop = True option avoids adding a new index column with old index values
# Stack the DataFrames on top of each other
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis = 0)
# Place the DataFrames side by side
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis = 1)
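Note that vertical_stack keeps the original row labels, so its index now contains duplicates. If you would rather have a fresh 0-to-n index, concat can rebuild it for you; a small sketch using the same objects as above (vertical_stack_reindexed is just an illustrative name):

# ignore_index = True builds a clean 0..n-1 index instead of
# carrying over the row labels from the two source dataframes
vertical_stack_reindexed = pd.concat([survey_sub, survey_sub_last10],
                                     axis = 0, ignore_index = True)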
# Write DataFrame to CSV
vertical_stack.to_csv('data/out.csv', index = False)
# For kicks read our output back into Python and make sure all looks good
new_output = pd.read_csv('data/out.csv',
                         keep_default_na = False, na_values = [""])
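If you want to confirm the round trip programmatically rather than just by eye, you can compare the re-read frame with the original; a sketch, which should print True provided the dtypes survive the CSV round trip:

# vertical_stack kept its original row labels, so reset them before comparing
print(new_output.equals(vertical_stack.reset_index(drop = True)))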
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Import a small subset of the species data designed for this part of the lesson.
# It is stored in the data folder.
species_sub = pd.read_csv('data/speciesSubset.csv',
                          keep_default_na = False, na_values = [""])
We can load these CSV files into pandas as DataFrames using the read_csv command and examine their contents using the DataFrame head() command. An inner merge / inner join, the default Pandas behaviour, keeps only the rows where the merge "on" value exists in both the left and right dataframes. For further reading, see "High performance database joins with Pandas", a comparison of merge speeds by Wes McKinney, creator of Pandas, and "Combining DataFrames with Pandas" in "Python for Ecologists" by Data Carpentry.
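As a quick illustration with the subsets read in above, an inner merge keeps only the survey rows whose species code also appears in the species subset; a sketch that assumes speciesSubset.csv uses the same species_id key column as species.csv:

# Inner join (the default): keep only rows where the species code
# exists in both survey_sub and species_sub
merged_inner = pd.merge(left = survey_sub, right = species_sub,
                        left_on = 'species', right_on = 'species_id')
merged_inner.head()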
Let's see how we can correctly add the "device" and "platform" columns to the user_usage dataframe using the Pandas merge command.
result = pd.merge(user_usage,
user_device[['use_id', 'platform', 'device']],
on = 'use_id')
result.head()
You can change the merge to a left merge with the "how" parameter of the merge command. The top of the result dataframe contains the successfully matched rows, while the bottom contains the rows in user_usage that didn't have a corresponding use_id in user_device.
result = pd.merge(user_usage,
user_device[['use_id', 'platform', 'device']],
on = 'use_id',
how = 'left')
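To see which usage rows failed to match, you can filter on one of the columns that came from user_device, since those are NaN for unmatched rows; a small sketch based on the result above:

# Rows from user_usage with no corresponding use_id in user_device
unmatched = result[result['device'].isnull()]
print(unmatched.head())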
For example's sake, we can repeat this process with a right join / right merge, simply by replacing how='left' with how='right' in the merge command.
result = pd.merge(user_usage,
user_device[['use_id', 'platform', 'device']],
on = 'use_id',
how = 'right')
Coming back to our original problem, we have already merged user_usage with user_device, so we have the platform and device for each user. Originally, we used an inner merge, the default in Pandas, and as such we only have entries for users where there is also device information. We'll redo this merge using a left join to keep all users, and then use a second left merge to finally get the device manufacturers into the same dataframe.
# First, add the platform and device to the user usage - use a left join this time.
result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on = 'use_id',
                  how = 'left')
# At this point, the platform and device columns are included
# in the result along with all columns from user_usage

# Now, based on the "device" column in result, match the "Model" column in devices.
devices.rename(columns = {"Retail Branding": "manufacturer"}, inplace = True)
result = pd.merge(result,
                  devices[['manufacturer', 'Model']],
                  left_on = 'device', right_on = 'Model',
                  how = 'left')
print(result.head())
With our merges complete, we can use the data aggregation functionality of Pandas to quickly work out the mean usage for users based on device manufacturer. Note that the small sample size creates even smaller groups, so I wouldn’t attribute any statistical significance to these particular results!
result.groupby("manufacturer").agg({
"outgoing_mins_per_month": "mean",
"outgoing_sms_per_month": "mean",
"monthly_mb": "mean",
"use_id": "count"
})
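As a small follow-up, the aggregated result is an ordinary DataFrame, so you can, for example, sort the manufacturers by their mean monthly data usage; a sketch reusing the aggregation above:

summary = result.groupby("manufacturer").agg({
    "outgoing_mins_per_month": "mean",
    "outgoing_sms_per_month": "mean",
    "monthly_mb": "mean",
    "use_id": "count"
})
# Heaviest data users at the top
print(summary.sort_values("monthly_mb", ascending = False))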
The compare method raises a ValueError when the two DataFrames don't have identical labels or shape. Related methods: Series.compare (compare with another Series and show differences), DataFrame.compare (compare to another DataFrame and show the differences), and DataFrame.equals (test whether two objects contain the same elements).
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns = ["col1", "col2", "col3"],
... )
>>> df
col1 col2 col3
0 a 1.0 1.0
1 a 2.0 2.0
2 b 3.0 3.0
3 b NaN 4.0
4 a 5.0 5.0
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
col1 col2 col3
0 c 1.0 1.0
1 a 2.0 2.0
2 b 3.0 4.0
3 b NaN 4.0
4 a 5.0 5.0
>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
>>> df.compare(df2, align_axis = 0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0
>>> df.compare(df2, keep_equal = True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0
>>> df.compare(df2, keep_shape = True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN
The compare method in pandas shows the differences between two DataFrames. It compares the two data frames row-wise and column-wise and presents the differences side by side. It can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
keep_shape: This is a boolean parameter. Setting it to True prevents any row or column from being dropped; with the default value False, compare drops the rows and columns whose elements are all equal in the two data frames.
keep_equal: This is another boolean parameter. Setting it to True shows the equal values between the two DataFrames; with the default value False, compare shows positions with equal values as NaN.
Syntax
DataFrame.compare(other, align_axis = 1, keep_shape = False, keep_equal = False)
import pandas as pd
data = [
['dom', 10],
['chibuge', 15],
['celeste', 14]
]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
data1 = [
['dom', 11],
['abhi', 17],
['celeste', 14]
]
df1 = pd.DataFrame(data1, columns = ['Name', 'Age'])
print("Dataframe 1 -- \n")
print(df)
print("-" * 5)
print("Dataframe 2 -- \n")
print(df1)
print("-" * 5)
print("Dataframe difference -- \n")
print(df.compare(df1))
print("-" * 5)
print("Dataframe difference keeping equal values -- \n")
print(df.compare(df1, keep_equal = True))
print("-" * 5)
print("Dataframe difference keeping same shape -- \n")
print(df.compare(df1, keep_shape = True))
print("-" * 5)
print("Dataframe difference keeping same shape and equal values -- \n")
print(df.compare(df1, keep_shape = True, keep_equal = True))
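If all you need is which rows differ, the index of the compared frame already gives their labels; a short sketch using df and df1 from above:

# Row labels where at least one column differs (rows 0 and 1 here)
diff_rows = df.compare(df1).index
print(list(diff_rows))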
Try the following code:
df = pd.concat([df2.set_index('currency').T, df1], axis = 0, ignore_index = True)