flexibly select pandas dataframe rows using dictionary [duplicate]


Yes, there is! Assuming m is a dictionary mapping column names to the values you want to match, you can build a query string using a simple list comprehension and pass the string to DataFrame.query() for dynamic evaluation.

# m maps column names to the values to match, e.g. m = {'color': 'red', 'year': 2016}
query = ' and '.join([f'{k} == {repr(v)}' for k, v in m.items()])
# Equivalent without f-strings:
# query = ' and '.join(['{} == {}'.format(k, repr(v)) for k, v in m.items()])
new_df = df.query(query)

print(query)
# "color == 'red' and year == 2016"

print(new_df)
  color brand  year
0   red  Ford  2016

For better performance, and to handle column names containing spaces, combine one boolean mask per key with np.logical_and.reduce:

df[np.logical_and.reduce([df[k] == v for k, v in m.items()])]

  color brand  year
0   red  Ford  2016

As a single expression:

In [728]: df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'brand': ['Ford', 'fiat', 'opel'],
    'year': [2016, 2016, 2017]
})

In [729]: d = {'color': 'red', 'year': 2016}

In [730]: df.loc[np.all(df[list(d)] == pd.Series(d), axis=1)]
Out[730]:
  brand color  year
0  Ford   red  2016
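
For reference, here is a minimal, self-contained sketch that puts the three approaches side by side; the DataFrame and the selection dictionary mirror the example above, so each line should select the same single row.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'brand': ['Ford', 'fiat', 'opel'],
    'year': [2016, 2016, 2017]
})
m = {'color': 'red', 'year': 2016}

# 1) Build a query string and evaluate it dynamically
query = ' and '.join([f'{k} == {repr(v)}' for k, v in m.items()])
print(df.query(query))

# 2) Combine one boolean mask per key with logical_and.reduce
print(df[np.logical_and.reduce([df[k] == v for k, v in m.items()])])

# 3) Compare the relevant columns against a Series built from the dict
print(df.loc[np.all(df[list(m)] == pd.Series(m), axis=1)])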

Suggestion : 2

How To Create a Pandas Dataframe from a Dictionary. Here we construct a Pandas DataFrame from a dictionary, using the Pandas constructor, since it can handle different types of data structures. Pandas can create DataFrames from many kinds of data structures without you having to write lots of lengthy code, and a dictionary is one of them. Pandas also has a pandas.DataFrame.from_dict() method; although the regular constructor already accepts dictionaries, the examples below show that from_dict() supports parameters unique to dictionaries.

If you are running virtualenv, create a new Python environment and install Pandas like this:

virtualenv py37 --python=python3.7
pip install pandas

You can check the Pandas version with:

import pandas as pd
pd.__version__

Next, let's build a DataFrame from a dictionary, supplying the row labels through the array idx.

import pandas as pd

# Use a name other than "dict" so the built-in type is not shadowed
data = {
    'scene': ["foul", "murder", "drunken", "intrigue"],
    'facade': ["fair", "beaten", "fat", "elf"]
}
idx = ['hamlet', 'lear', 'falstaff', 'puck']
dp = pd.DataFrame(data, index=idx)


Now we flip that on its side. We will make the rows the dictionary keys.

pd.DataFrame.from_dict(data, orient='index')

It’s as simple as putting the column names in an array and passing it as the columns parameter. One wonders why the earlier versions of Pandas did not have that.

pd.DataFrame.from_dict(data, orient='index', columns=idx)

       hamlet    lear falstaff      puck
scene    foul  murder  drunken  intrigue
facade   fair  beaten      fat       elf
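
To make the relationship between the two constructors concrete, here is a small sketch (my own consolidation, not from the original article, reusing the hypothetical data and idx above): the default constructor turns dict keys into columns, while from_dict(orient='index') turns them into rows, so the two results are transposes of each other.

import pandas as pd

data = {
    'scene': ["foul", "murder", "drunken", "intrigue"],
    'facade': ["fair", "beaten", "fat", "elf"]
}
idx = ['hamlet', 'lear', 'falstaff', 'puck']

by_columns = pd.DataFrame(data, index=idx)                            # dict keys become columns
by_rows = pd.DataFrame.from_dict(data, orient='index', columns=idx)   # dict keys become rows

# The two layouts hold the same values, just transposed
print(by_columns.T.equals(by_rows))   # expected: True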

Suggestion : 3

1. Remap Column Values with a Dict Using Pandas DataFrame.replace(). We are often required to remap a Pandas DataFrame column's values with a dictionary (dict), and you can achieve this with the DataFrame.replace() method. DataFrame.replace() takes different parameters and signatures; here we use the one that takes a dictionary to remap column values. In that dictionary, each key is an existing value in the column and each value is the literal you want to replace it with. You can use df.replace({"Courses": dict}) to remap/replace values in a pandas DataFrame with dictionary values, and the method also lets you replace column values with regular expressions for regex substitutions.

First, let’s create a Pandas DataFrame.

import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
   'Fee': [22000, 25000, 23000, 24000, 26000],
   'Duration': ['30days', '50days', '30days', None, np.nan],
   'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print(df)
Output:

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500

Now we will remap the values of the 'Courses' column to their respective codes using the df.replace() function.

# Define a dict with the key-value pairs to remap
dict = {
    "Spark": 'S',
    "PySpark": 'P',
    "Hadoop": 'H',
    "Python": 'P',
    "Pandas": 'P'
}
df2 = df.replace({"Courses": dict})
print(df2)
To modify the original DataFrame in place instead of returning a copy, pass inplace=True:

df.replace({"Courses": dict}, inplace=True)
print(df)
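
As an aside, a common alternative to replace() for this task is Series.map(); this is a sketch of that technique, not something shown in the article above. map() remaps values through the dictionary and turns unmapped values into NaN, so fillna() is used to keep the originals for keys missing from the mapping.

# Hypothetical alternative using Series.map()
course_codes = {"Spark": 'S', "PySpark": 'P', "Hadoop": 'H', "Python": 'P', "Pandas": 'P'}
df3 = pd.DataFrame(technologies)
# Unmapped values become NaN after map(), so fillna() restores the original entries
df3['Courses'] = df3['Courses'].map(course_codes).fillna(df3['Courses'])
print(df3)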

use df.replace({"Duration": dict_duration},inplace=True) to remap none or NaN values in pandas DataFrame with Dictionary values. To remap None/NaN values of the 'Duration‘ column by their respective codes using the df.replace() function. Read how to replace None/NaN values with empty string in pandas.

# Remap values for None & NaN
df = pd.DataFrame(technologies)
dict_duration = {
    "30days": '30',
    "50days": '50',
    "55days": '55',
    np.nan: '50'
}
df.replace({"Duration": dict_duration}, inplace=True)
print(df)

Suggestion : 4

Related questions: preserve the order of rows when converting a pandas DataFrame to a dictionary; convert a pandas DataFrame to a dictionary with the same keys over multiple rows; convert a Pandas Series containing dictionaries in its rows into a DataFrame; convert a pandas DataFrame to a dictionary.

Create Series and convert to dictionary:

print(df.set_index('col1')['col2'].to_dict())
{1: 0.5, 2: 0.75}

Or use zip with dict:

print(dict(zip(df['col1'], df['col2'])))
{1: 0.5, 2: 0.75}

As simple as zipping the columns together:

result = dict(zip(df.col1, df.col2))
print(result)

Or as an alternative:

result = dict(df[['col1', 'col2']].values)
print(result)
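
None of the snippets above show the input frame, so here is a minimal self-contained sketch; the col1/col2 names and values are assumptions chosen to match the dictionaries printed above.

import pandas as pd

# Hypothetical two-column frame matching the outputs shown above
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]})

print(df.set_index('col1')['col2'].to_dict())   # {1: 0.5, 2: 0.75}
print(dict(zip(df['col1'], df['col2'])))        # {1: 0.5, 2: 0.75}
print(dict(df[['col1', 'col2']].values))        # {1.0: 0.5, 2.0: 0.75}

Note that the last variant goes through .values, which coerces both columns to a common dtype, so the integer keys come out as floats here.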

Suggestion : 5

With a DataFrame, slicing inside of [] slices the rows; this is provided largely as a convenience, since it is such a common operation. There is one significant departure from standard Python/NumPy slicing semantics: Python/NumPy allow slicing past the end of an array without an associated error. Index also provides the infrastructure necessary for lookups, data alignment, and reindexing; the easiest way to create an Index directly is to pass a list or other sequence to Index. query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

In [1]: dates = date_range('1/1/2000', periods=8)

In [2]: df = DataFrame(randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [3]: df
Out[3]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

[8 rows x 4 columns]

In [4]: panel = Panel({'one' : df, 'two' : df - df.mean()})

In [5]: panel
Out[5]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D
In [6]: s = df['A']

In [7]: s[dates[5]]
Out[7]: -0.67368970808837059

In [8]: panel['two']
Out[8]: 
                   A         B         C         D
2000-01-01  0.409571  0.113086 -0.610826 -0.936507
2000-01-02  1.152571  0.222735  1.017442 -0.845111
2000-01-03 -0.921390 -1.708620  0.403304  1.270929
2000-01-04  0.662014 -0.310822 -0.141342  0.470985
2000-01-05 -0.484513  0.962970  1.174465 -0.888276
2000-01-06 -0.733231  0.509598 -0.580194  0.724113
2000-01-07  0.345164  0.972995 -0.816769 -0.840143
2000-01-08 -0.430188 -0.761943 -0.446079  1.044010

[8 rows x 4 columns]
In [9]: df
Out[9]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

[8 rows x 4 columns]

In [10]: df[['B', 'A']] = df[['A', 'B']]

In [11]: df
Out[11]: 
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

[8 rows x 4 columns]
In [12]: sa = Series([1, 2, 3], index=list('abc'))

In [13]: dfa = df.copy()

In [14]: sa.b
Out[14]: 2

In [15]: dfa.A
Out[15]: 
2000-01-01   -0.282863
2000-01-02   -0.173215
2000-01-03   -2.104569
2000-01-04   -0.706771
2000-01-05    0.567020
2000-01-06    0.113648
2000-01-07    0.577046
2000-01-08   -1.157892
Freq: D, Name: A, dtype: float64

In [16]: panel.one
Out[16]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

[8 rows x 4 columns]

In [17]: sa.a = 5

In [18]: sa
Out[18]: 
a    5
b    2
c    3
dtype: int64

In [19]: dfa.A = list(range(len(dfa.index)))

In [20]: dfa
Out[20]: 
            A         B         C         D
2000-01-01  0  0.469112 -1.509059 -1.135632
2000-01-02  1  1.212112  0.119209 -1.044236
2000-01-03  2 -0.861849 -0.494929  1.071804
2000-01-04  3  0.721555 -1.039575  0.271860
2000-01-05  4 -0.424972  0.276232 -1.087401
2000-01-06  5 -0.673690 -1.478427  0.524988
2000-01-07  6  0.404705 -1.715002 -1.039268
2000-01-08  7 -0.370647 -1.344312  0.844885

[8 rows x 4 columns]
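
The excerpt above mentions query() support for Python's in and not in operators but does not demonstrate it. Here is a small sketch of that usage; the DataFrame is the hypothetical color/brand/year example from the top of this page rather than the one used in the docs excerpt.

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'brand': ['Ford', 'fiat', 'opel'],
    'year': [2016, 2016, 2017]
})

# 'in' inside query() is shorthand for Series.isin()
print(df.query("color in ['red', 'green'] and year == 2016"))

# Equivalent without query()
print(df[df['color'].isin(['red', 'green']) & (df['year'] == 2016)])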

Suggestion : 6

The drop_duplicates() function returns a DataFrame with the duplicate rows removed; by default it drops all duplicates except the first occurrence. You can change this behavior through the keep parameter, which takes 'first', 'last', or False, and you can modify the DataFrame in place by passing inplace=True. Duplicates can also be identified based on just certain columns by passing them as a list to the subset parameter; for example, with subset=['Pet', 'Color'] only those two columns are considered when deciding which rows are duplicates.

The pandas dataframe drop_duplicates() function can be used to remove duplicate rows from a dataframe. It also gives you the flexibility to identify duplicates based on certain columns through the subset parameter. The following is its syntax:

df.drop_duplicates(subset=None, keep='first', inplace=False)

By default, the drop_duplicates() function identifies the duplicates taking all the columns into consideration. It then drops the duplicate rows and keeps just their first occurrence.

import pandas as pd

# create a sample dataframe with duplicate rows
data = {
   'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
   'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
   'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
}

df = pd.DataFrame(data)

# print the dataframe
print("The original dataframe:\n")
print(df)

# drop duplicates
df_unique = df.drop_duplicates()
print("\nAfter dropping duplicates:\n")
print(df_unique)

Output:

The original dataframe:

   Pet   Color   Eyes
0  Cat   Brown  Black
1  Dog  Golden  Black
2  Dog  Golden  Black
3  Dog  Golden  Brown
4  Cat   Black  Green

After dropping duplicates:

   Pet   Color   Eyes
0  Cat   Brown  Black
1  Dog  Golden  Black
3  Dog  Golden  Brown
4  Cat   Black  Green

If you want to retain the last duplicate row instead of the first one pass keep='last' to the drop_duplicates() function.

import pandas as pd

# create a sample dataframe with duplicate rows
data = {
   'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
   'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
   'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
}

df = pd.DataFrame(data)

# print the dataframe
print("The original dataframe:\n")
print(df)

# drop duplicates
df_unique = df.drop_duplicates(keep='last')
print("\nAfter dropping duplicates:\n")
print(df_unique)

If you do not want to retain any of the duplicate rows pass keep=False to the drop_duplicates() function.

import pandas as pd

# create a sample dataframe with duplicate rows
data = {
   'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
   'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
   'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
}

df = pd.DataFrame(data)

# print the dataframe
print("The original dataframe:\n")
print(df)

# drop duplicates
df_unique = df.drop_duplicates(keep=False)
print("\nAfter dropping duplicates:\n")
print(df_unique)
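
The summary above also mentions identifying duplicates on only a subset of columns, which the examples in this section never show; here is a short sketch using the same sample data with subset=['Pet', 'Color'].

import pandas as pd

data = {
   'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
   'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
   'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
}
df = pd.DataFrame(data)

# Only Pet and Color are considered, so rows 1, 2 and 3 are duplicates
# of each other and only the first of them (index 1) is retained
df_unique = df.drop_duplicates(subset=['Pet', 'Color'])
print(df_unique)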

Suggestion : 7

Starting here? This lesson is part of a full-length tutorial on using Python for data analysis; check out the beginning. Selecting columns will be important to much of the analysis you do throughout the tutorials, especially for grouping and counting events. Think about this as listing the row and column selections one after another, putting together a column selection and a row selection. You should also assign the DataFrame to a variable; since you'll only be working with one DataFrame in this lesson, you can keep it simple and just call it data.

In Mode Python Notebooks, the first cell is automatically populated with the following code to access the data produced by the SQL query:

datasets[0].head(n=5)

import pandas as pd

data = datasets[0]  # assign SQL query results to the data variable

data['url']

0    https://watsi.org/
1    https://watsi.org/team/the-meteor-chef
2    https://watsi.org/gift-cards
3    https://watsi.org/
4    https://watsi.org/
Name: url, dtype: object
import pandas as pd
data = datasets[0]  # assign SQL query results to the data variable
data = data.fillna('')  # replace missing values with strings for easier text processing
data['url']

0    https://watsi.org/
1    https://watsi.org/team/the-meteor-chef
2    https://watsi.org/gift-cards
3    https://watsi.org/
4    https://watsi.org/
Name: url, dtype: object
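
Putting a row selection and a column selection together, as described above, also fits the dictionary-based filtering from the top of this page. Here is a minimal sketch combining both in a single .loc call; the DataFrame and dictionary are the same hypothetical example used earlier, not the Mode dataset.

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'brand': ['Ford', 'fiat', 'opel'],
    'year': [2016, 2016, 2017]
})
m = {'color': 'red', 'year': 2016}

# Row selection (rows matching every key/value in m) and column selection
# (only the 'brand' column) expressed together in one .loc call
mask = (df[list(m)] == pd.Series(m)).all(axis=1)
print(df.loc[mask, 'brand'])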