Pandas Timedelta mean returns error "No numeric types to aggregate". Why?


According to the discussion of this issue on GitHub, you can work around it by passing numeric_only=False to the mean calculation:

pd.concat([A, B], axis = 1).groupby("status_reason")["closing_time"]\
   .mean(numeric_only = False)
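Since A and B are not shown, here is a small self-contained sketch of the same call on made-up data. The column names mirror the question; the values are my own assumption, and the behaviour described is what I would expect on a recent pandas release (older versions may still raise the error).

```python
import numpy as np
import pandas as pd

# Made-up sample data: two "Won" rows so the mean is visible,
# and an "In Progress" group that holds only missing values.
df = pd.DataFrame({
    "closing_time": ["11:35:00", "12:35:00", "07:13:00", np.nan, np.nan],
    "status_reason": ["Won", "Won", "Canceled", "In Progress", "In Progress"],
})
df["closing_time"] = pd.to_timedelta(df["closing_time"])

# numeric_only=False asks groupby not to discard the timedelta column
means = df.groupby("status_reason")["closing_time"].mean(numeric_only=False)
print(means)
```

The all-missing group should come back as NaT rather than being dropped.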

The problem may be that the In Progress group contains only NaT values, which groupby().mean() might not allow. Here is a test:

df = pd.DataFrame({
   'closing_time': ['11:35:00', '07:13:00', np.nan, np.nan, np.nan],
   'status_reason': ['Won', 'Canceled', 'In Progress', 'In Progress', 'In Progress']
})
df.closing_time = pd.to_timedelta(df.closing_time)
df.groupby('status_reason').closing_time.mean()

gives the exact error. To overcome this, do:

def custom_mean(x):
    try:
        return x.mean()
    except Exception:
        # fall back to NaT when the mean of the group cannot be computed
        return pd.NaT

df.groupby('status_reason').closing_time.apply(custom_mean)

which gives:

status_reason
Canceled      07:13:00
In Progress        NaT
Won           11:35:00
Name: closing_time, dtype: timedelta64[ns]
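To check the hypothesis that one group holds nothing but NaT, you can ask each group whether all of its values are missing. This is only a diagnostic sketch on the same test data; the variable name all_nat is mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "closing_time": ["11:35:00", "07:13:00", np.nan, np.nan, np.nan],
    "status_reason": ["Won", "Canceled", "In Progress", "In Progress", "In Progress"],
})
df["closing_time"] = pd.to_timedelta(df["closing_time"])

# True for a group only if every closing_time in it is NaT
all_nat = df["closing_time"].isna().groupby(df["status_reason"]).all()
print(all_nat)
```

Groups flagged True are the ones worth handling separately before taking the mean.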

I cannot say why groupby's mean() method does not work, but the following slight modification of your code should work. First, convert the timedelta column to seconds with the total_seconds() method, then group and take the mean, and finally convert the seconds back to a timedelta:

pd.to_timedelta(pd.concat([A.dt.total_seconds(), B], axis = 1).groupby("status_reason")["closing_time"].mean(), unit = "s")

For the example DataFrame below, the code

df = pd.DataFrame({
   'closing_time': ['2 days 11:35:00', '07:13:00', np.nan, np.nan, np.nan],
   'status_reason': ['Won', 'Canceled', 'In Progress', 'In Progress', 'In Progress']
})

# convert the timedelta column to a plain number of seconds
closing = pd.to_timedelta(df.closing_time)
df["closing_time"] = closing.dt.days * 24 * 3600 + closing.dt.seconds

# or alternatively use total_seconds() to get the total seconds in the timedelta:
# df["closing_time"] = pd.to_timedelta(df.closing_time).dt.total_seconds()

pd.to_timedelta(df.groupby("status_reason")["closing_time"].mean(), unit = "s")

produces

status_reason
Canceled      0 days 07:13:00
In Progress               NaT
Won           2 days 11:35:00
Name: closing_time, dtype: timedelta64[ns]

To overcome this, first decide how to handle the NaN values. The best approach depends on what you want to achieve. In my case, even a simple categorical result is fine, so I can do something like this:

import datetime

def define_time(row):
    # note: a value of exactly 100 days matches no branch and yields None
    if pd.isnull(row["closing_time"]):
        return "Null"
    elif row["closing_time"] < datetime.timedelta(days = 100):
        return "<100"
    elif row["closing_time"] > datetime.timedelta(days = 100):
        return ">100"

time_results = pd.concat([A, B], axis = 1).apply(lambda row: define_time(row), axis = 1)

In the end the result is like this:

In:
   time_results.value_counts()
Out:
   >100    1452
   <100    1091
   Null    1000
   dtype: int64
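If the row-wise apply is too slow on a large frame, the same bucketing can be sketched in vectorized form with numpy.select. This is my own variation, not the answer's code: the sample values are invented, and I label the upper bucket ">=100" so that a value of exactly 100 days is not left unlabeled.

```python
import numpy as np
import pandas as pd

# Invented sample values, including a missing one and an exact 100-day case
closing_time = pd.to_timedelta(
    pd.Series(["150 days", "20 days", np.nan, "100 days"])
)
threshold = pd.Timedelta(days=100)

# Conditions are checked in order; the first match wins
labels = np.select(
    [closing_time.isna().to_numpy(), (closing_time < threshold).to_numpy()],
    ["Null", "<100"],
    default=">=100",
)
time_results = pd.Series(labels, index=closing_time.index)
counts = time_results.value_counts()
print(counts)
```

Comparisons against NaT evaluate to False, which is why the missing-value check has to come first.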

Suggestion : 2

  • Pandas Timedelta mean returns error "No numeric types to aggregate". Why?
  • How can I pivot a dataframe in pandas where values are date? I get DataError: No numeric types to aggregate error
  • pandas.core.base.DataError: No numeric types to aggregate when trying to get the mean with groupby
  • `No numeric types to aggregate` error with rolling sum and timedelta type

Suggestion : 3



import pandas as pd

df1 = pd.DataFrame({
   'index': range(8),
   'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
   'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
   'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
   'result': ["on", "off", "off", "on", "on", "off", "off", "on"]
})

# the default aggfunc (mean) cannot aggregate the string column 'result';
# rows/cols are the old names of the index/columns parameters
df1.pivot_table(values = 'result', index = 'index', columns = ['variable1', 'variable2', 'variable3'])
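One possible way around the error in the pivot_table call above is to pass an aggregation that works on strings. The sketch below uses aggfunc='first', which is my assumption about the intent: it presumes each (index, column) cell holds at most one value that should be kept as-is.

```python
import pandas as pd

df1 = pd.DataFrame({
    "index": range(8),
    "variable1": ["A", "A", "B", "B", "A", "B", "B", "A"],
    "variable2": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "variable3": ["x", "x", "x", "y", "y", "y", "x", "y"],
    "result": ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# 'first' just picks the first value in each cell, so no numeric
# aggregation is attempted on the string column
table = df1.pivot_table(
    values="result",
    index="index",
    columns=["variable1", "variable2", "variable3"],
    aggfunc="first",
)
print(table)
```

Cells with no matching row are filled with NaN.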

import pandas as pd

df1 = pd.DataFrame({
   'index': range(8),
   'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
   'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
   'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
   'result': ["on", "off", "off", "on", "on", "off", "off", "on"]
})

# these are the columns to end up in the multi-index columns
unstack_cols = ['variable1', 'variable2', 'variable3']