df.loc filtering doesn't work with None values

Suggestion : 1

You have to use isnull for this:

In[3]:

   df[df['Project ID'].isnull()]
Out[3]:
  Project ID  State  Cost
1       None      3  1000

Or use apply:

In[5]:

   df.loc[df['Project ID'].apply(lambda x: x is None)]
Out[5]:
  Project ID  State  Cost
1       None      3  1000

Just to elaborate, it doesn't work because pandas uses np.nan to represent missing values, and:

print(np.nan == np.nan)  # False
print(np.nan == None)    # False
print(np.isnan(np.nan))  # True
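
To make the consequence concrete, here is a minimal sketch; the DataFrame construction is an assumption chosen to match the output shown above:

import numpy as np
import pandas as pd

# Hypothetical data; only row 1 is taken from the output above, the rest is made up
df = pd.DataFrame({'Project ID': ['A', None, 'B'],
                   'State': [1, 3, 5],
                   'Cost': [500, 1000, 700]})

print(df[df['Project ID'] == np.nan])   # empty DataFrame: NaN/None never compares equal
print(df[df['Project ID'].isnull()])    # row 1, as expected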

Suggestion : 2


Let’s begin by loading a sample dataframe that we’ll use throughout the tutorial.

import pandas as pd
df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx', parse_dates = ['Date'])
print(df.head())

This returns:

        Date Region                 Type  Units  Sales
0 2020-07-11   East  Children's Clothing   18.0    306
1 2020-09-23  North  Children's Clothing   14.0    448
2 2020-04-02  South     Women's Clothing   17.0    425
3 2020-02-28   East  Children's Clothing   26.0    832
4 2020-03-19   West     Women's Clothing    3.0     33

For example, if you wanted to select rows where sales were over 300, you could write:

greater_than = df[df['Sales'] > 300]
print(greater_than.head())
print(greater_than.shape)

If you want to filter on a specific date (or before/after a specific date), simply include that date in your comparison, as above:

# To filter dates following a certain date:
date_filter = df[df['Date'] > '2020-05-01']

# To filter to a specific date:
date_filter2 = df[df['Date'] == '2020-05-01']

The first piece of code shows any rows where Date is later than May 1, 2020. You can also use multiple filters to filter between two dates:

date_filter3 = df[(df['Date'] >= '2020-05-01') & (df['Date'] < '2020-06-01')]
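
Since the Date column was parsed as datetimes above, an equivalent way to keep only May 2020 rows (a sketch, not part of the original example) is the .dt accessor:

# Keep only rows whose Date falls in May 2020
may_2020 = df[(df['Date'].dt.year == 2020) & (df['Date'].dt.month == 5)]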

Suggestion : 3


For ordinary (non-missing) values, a plain equality comparison inside loc works as expected:

 num_df.loc[num_df['a'] == 2]
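
num_df is not defined in the snippet above; a minimal sketch of what such a DataFrame could look like, with hypothetical values:

import pandas as pd

# Hypothetical numeric DataFrame; any column named 'a' would do
num_df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [10, 20, 30, 40]})

# Returns the two rows where column 'a' equals 2
print(num_df.loc[num_df['a'] == 2])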

Suggestion : 4


We are going to use a dataset containing details of flights departing from NYC in 2013. This dataset has 336,776 rows and 16 columns; see the column names below. To import the dataset, we use the read_csv() function from the pandas package.

['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time',
 'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest',
 'air_time', 'distance', 'hour', 'minute']
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", usecols = range(1, 17))
Boolean indexing

newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]

query()

newdf = df.query('origin == "JFK" & carrier == "B6"')

loc

newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
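
As a further sketch (not part of the original snippets), isin() is handy when a column should match any one of several values; the second carrier here is illustrative:

# Flights departing JFK on either of two carriers
newdf = df.loc[(df.origin == "JFK") & (df.carrier.isin(["B6", "DL"]))]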
iloc - Index Position
x.iloc[0: 5]

Output
   col1
9     1
8     3
7     5
6     7
0     9
loc - Index Label
x.loc[0: 5]

Output
   col1
0     9
1    11
2    13
3    15
4    17
5    19

It is because loc does not produce output based on index position. It considers only the labels of the index, which can be alphabetic as well, and it includes both the start and the end point. Refer to the example below.

x = pd.DataFrame({
   "col1": range(1, 5)
}, index = ['a', 'b', 'c', 'd'])
x.loc['a': 'c'] # equivalent to x.iloc[0: 3]

   col1
a     1
b     2
c     3

Suggestion : 5

Filter out NaN rows (data selection) by using the DataFrame.dropna() method. dropna() can also drop rows based on a threshold: df.dropna(thresh=2) drops every row that has fewer than two non-NaN values. This section covers filtering NaN rows with DataFrame.dropna() and DataFrame.notnull(), dropping rows only when all values are NaN/None, dropping rows only when selected columns have NaN values (via the subset parameter, which restricts the check to particular columns), and the inplace parameter.

# Below are some quick examples.

# Using DataFrame.dropna(), drop all rows that have any NaN/None values.
df2 = df.dropna()

# Keep only rows that have at least two non-NaN values.
df2 = df.dropna(thresh = 2)

# Keep rows where the Duration column is not NaN.
df2 = df[df.Duration.notnull()]

# Drop rows whose values are all NaN.
df2 = df.dropna(how = 'all')

# Drop NaN rows and renumber the index with reset_index().
df2 = df.dropna().reset_index(drop = True)

# Drop rows with NaN in the Courses or Fee columns using the subset parameter.
df2 = df.dropna(subset = ['Courses', 'Fee'])

# Keep rows where Courses is not null, using the ~ (not) operator.
df2 = df[~pd.isnull(df['Courses'])]

Now, let’s create a pandas DataFrame with a few rows and columns and execute some examples to learn how to drop rows with NaN values. Our DataFrame contains the columns Courses, Fee, and Duration.

# Create a pandas DataFrame.
import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark", "Java"],
   'Fee': [22000, 25000, np.nan, np.nan, np.nan, np.nan],
   'Duration': ['30days', np.nan, '30days', 'N/A', np.nan, np.nan]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.

   Courses      Fee Duration
0    Spark  22000.0   30days
1  PySpark  25000.0      NaN
2    Spark      NaN   30days
3   Python      NaN      N/A
4  PySpark      NaN      NaN
5     Java      NaN      NaN

Filter out NaN rows (data selection) by using the DataFrame.dropna() method. dropna() can also drop rows based on a threshold: df.dropna(thresh=2) drops every row that has fewer than two non-NaN values.

# Filter out NAN data selection column by DataFrame.dropna().
df2 = df.dropna(thresh = 2)
print(df2)
Yields below output.
   Courses      Fee Duration
0    Spark  22000.0   30days
1  PySpark  25000.0      NaN
2    Spark      NaN   30days
3   Python      NaN      N/A
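
As a small follow-up sketch on the same DataFrame, the subset parameter mentioned earlier restricts the NaN check to the listed columns:

# Drop rows that have NaN in either Courses or Fee; other columns are ignored
df2 = df.dropna(subset = ['Courses', 'Fee'])
print(df2)
# Keeps only rows 0 and 1, the only rows where Fee is present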

Suggestion : 6

So, as compared to above, a scalar equality comparison against None/np.nan does not provide useful information. Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more); pandas provides a nullable integer array, which can be used by explicitly requesting the dtype. An exception to this basic propagation rule are reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons; starting from pandas 1.0, some optional data types experiment with a native NA scalar using a mask-based approach.

In[1]: df = pd.DataFrame(
   ...:    np.random.randn(5, 3),
   ...:    index = ["a", "c", "e", "f", "h"],
   ...:    columns = ["one", "two", "three"],
   ...: )

   In[2]: df["four"] = "bar"

In[3]: df["five"] = df["one"] > 0

In[4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

In[5]: df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

In[6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True
In[7]: df2["one"]
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In[8]: pd.isna(df2["one"])
Out[8]:
   a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool

In[9]: df2["four"].notna()
Out[9]:
   a True
b False
c True
d False
e True
f True
g False
h True
Name: four, dtype: bool

In[10]: df2.isna()
Out[10]:
   one two three four five
a False False False False False
b True True True True True
c False False False False False
d True True True True True
e False False False False False
f False False False False False
g True True True True True
h False False False False False
In[11]: None == None # noqa: E711
Out[11]: True

In[12]: np.nan == np.nan
Out[12]: False
In[13]: df2["one"] == np.nan
Out[13]:
   a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
In[14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64
In[15]: df2 = df.copy()

In[16]: df2["timestamp"] = pd.Timestamp("20120101")

In[17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In[18]: df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In[19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In[20]: df2.dtypes.value_counts()
Out[20]:
float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64
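
As noted above, reductions skip missing values by default; a quick sketch with the skipna parameter makes this visible:

# Reductions skip NaN by default; skipna=False lets NaN propagate instead
print(df2["one"].mean())              # mean of the non-missing values only
print(df2["one"].mean(skipna=False))  # nan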