sort bins from pandas cut


The main problem is that the ordered CategoricalIndex is lost.

import numpy as np
import pandas as pd

np.random.seed(12456)
y = pd.Series(np.random.randn(100))
x1 = pd.Series(np.sign(np.random.randn(100)))
x2 = pd.cut(pd.Series(np.random.randn(100)), bins = [-3, -0.5, 0, 0.5, 3])

model = pd.concat([y, x1, x2], axis = 1, keys = ['Y', 'X1', 'X2'])
int_output = model.groupby(['X1', 'X2']).mean().unstack()
int_output.columns = int_output.columns.get_level_values(1)

print(int_output)
X2    (-3, -0.5]   (-0.5, 0]    (0, 0.5]    (0.5, 3]
X1
-1.0    0.230060   -0.079266   -0.079834   -0.064455
 1.0   -0.451351    0.268688    0.020091   -0.280218

print(int_output.columns)
CategoricalIndex(['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'],
                 categories=['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'],
                 ordered=True, name='X2', dtype='category')

output = pd.concat(int_output.to_dict('series'), axis = 1)
print(output)
       (-0.5, 0]  (-3, -0.5]    (0, 0.5]    (0.5, 3]
X1
-1.0   -0.079266    0.230060   -0.079834   -0.064455
 1.0    0.268688   -0.451351    0.020091   -0.280218

print(output.columns)
Index(['(-0.5, 0]', '(-3, -0.5]', '(0, 0.5]', '(0.5, 3]'], dtype='object')

One possible solution is to extract the first number from each label in output.columns, build a helper Series from those numbers, and sort it. Finally, reindex the original columns by the sorted helper's index:

cat = output.columns.str.extract(r'\((.*),', expand=False).astype(float)
a = pd.Series(cat, index = output.columns).sort_values()
print(a)
(-3, -0.5]   -3.0
(-0.5, 0]    -0.5
(0, 0.5]      0.0
(0.5, 3]      0.5
dtype: float64

output = output.reindex(columns = a.index)
print(output)
      (-3, -0.5]   (-0.5, 0]    (0, 0.5]    (0.5, 3]
X1
-1.0    0.230060   -0.079266   -0.079834   -0.064455
 1.0   -0.451351    0.268688    0.020091   -0.280218
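
If the intended bin order is already known (for example, from the bins list passed to pd.cut), parsing the labels may not be needed. The following is a minimal sketch, not part of the original answer: it re-attaches an ordered CategoricalIndex to the string labels and lets sort_index do the rest. The frame `demo`, its values, and the name `ordered_labels` are made up for illustration.

```python
import pandas as pd

# Known, intended order of the bin labels (assumed to match the pd.cut bins)
ordered_labels = ['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]']

# Hypothetical frame whose columns ended up in lexicographic order
demo = pd.DataFrame(
    [[2, 1, 3, 4], [6, 5, 7, 8]],
    index=[-1.0, 1.0],
    columns=['(-0.5, 0]', '(-3, -0.5]', '(0, 0.5]', '(0.5, 3]'],
)

# Re-attach the ordering, then sort the columns by category order
demo.columns = pd.CategoricalIndex(
    demo.columns, categories=ordered_labels, ordered=True
)
restored = demo.sort_index(axis=1)
print(restored)
```

Because the sort follows the category order rather than the text of the labels, this also works for labels that cannot be parsed as numbers.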

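An easy fix to the problem you've highlighted above is to simply reorder the columns. This only helps when the column labels are Interval objects, which compare by their numeric edges; string labels such as '(-3, -0.5]' sort lexicographically, which reproduces the wrong order shown in the question: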

output[sorted(output.columns)]
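
Whether sorted helps depends on the type of the column labels. A small sketch comparing string labels with Interval objects (both lists are made up for illustration):

```python
import pandas as pd

# String labels are compared character by character,
# so '(-0.5, 0]' sorts before '(-3, -0.5]'
string_labels = ['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]']
print(sorted(string_labels))

# Interval objects are compared by their numeric edges
intervals = list(pd.IntervalIndex.from_breaks([-3, -0.5, 0, 0.5, 3]))
shuffled = [intervals[1], intervals[0], intervals[3], intervals[2]]
print(sorted(shuffled) == intervals)
```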

I made a function to do so.

def dfsortbybins(df, col):
    """
    param df: pandas dataframe
    param col: name of column containing bins
    """
    # Assumption: bins are the unique non-null labels of df[col],
    # taken as strings such as '(0, 0.5]'
    bins = [str(b) for b in df[col].dropna().unique()]
    # Map each label to its numeric left edge
    d = dict(zip(bins, [float(s.split(',')[0].split('(')[1]) for s in bins]))
    rank_col = f'{col} dfrankbybins'
    df[rank_col] = df.apply(lambda x: d[str(x[col])]
                            if not pd.isnull(x[col])
                            else x[col], axis=1)
    df = df.sort_values(rank_col).drop(rank_col, axis=1)
    return df

Here's another function. It worked for me in multiple cases, unlike the other solutions, so I'm leaving it here in the hope it comes in handy for people who run into the same issue.

def sort_bins(bin_col):
    """
    Sorts bins after using pd.cut. Increasing order. Puts "NaN" bin at the beginning.

    Input:
        bin_col: pd.Series containing bins to be sorted
    """

    # Dictionary to store first value from each bin
    vals = {}

    # Iterate through all bins
    for i, item in enumerate(bin_col.unique()):

        # Check if bin is "nan"; if yes, assign low value to put it at the beginning
        if item == "nan":
            vals[i] = -99999

        # If not "nan", get the first value from bin to sort later
        else:
            vals[i] = float(item.split(",")[0][1:])

    # Sort bins according to extracted first values
    ixs = list({k: v for k, v in
                sorted(vals.items(), key=lambda item: item[1])}.keys())

    # Make sorted list of bins
    sorted_bins = bin_col.unique()[list(ixs)]

    return sorted_bins

# Example, assuming "age_bin" column has the bins:
sorted_bins = sort_bins(df["age_bin"])
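
When the column still holds the categorical that pd.cut returned (rather than strings), the ordering is already stored in the categories, so no parsing is needed. A minimal sketch under that assumption; the ages and bin edges are made up for illustration:

```python
import pandas as pd

# Hypothetical ages; 120 falls outside the bins and becomes NaN
ages = pd.Series([45, 7, 33, 80, 12, 120])
age_bin = pd.cut(ages, bins=[0, 18, 40, 65, 100])

# The categories are already stored in increasing order
print(list(age_bin.cat.categories))

# Sorting follows that order; na_position='first' puts NaN at the beginning
sorted_col = age_bin.sort_values(na_position='first')
print(sorted_col)
```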

Suggestion : 2

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Passing an IntervalIndex for bins results in those categories exactly. Notice that values not covered by the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between two bins.

Passing a Series as an input returns a Series with mapping value. It is used to map numerically to intervals based on bins.

ordered=False will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
...        3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
...        labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']

>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
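
The IntervalIndex case described above has no example in this excerpt. A short sketch of it, using three non-adjacent right-closed bins so that 0 and 1.5 land outside every bin:

```python
import pandas as pd

# Three right-closed bins with gaps between them
bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])

# 0 lies left of the first bin and 1.5 falls in a gap, so both become NaN
result = pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
print(result)
```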

Suggestion : 3


Examples using options

import pandas as pd
my_dict = {
   'NAME': ['Ravi', 'Raju', 'Alex', 'Ron', 'King', 'Jack'],
   'ID': [1, 2, 3, 4, 5, 6],
   'MATH': [80, 40, 70, 70, 60, 30],
   'ENGLISH': [80, 70, 40, 50, 60, 30]
}
my_data = pd.DataFrame(data = my_dict)
my_data['my_cut'] = pd.cut(x = my_data['MATH'], bins = [1, 50, 70, 100])
print(my_data)
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
print(my_data['my_cut'].dtypes) # category
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=5) 
print(my_data)
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80   (70.0, 80.0]
1  Raju   2    40       70  (29.95, 40.0]
2  Alex   3    70       40   (60.0, 70.0]
3   Ron   4    70       50   (60.0, 70.0]
4  King   5    60       60   (50.0, 60.0]
5  Jack   6    30       30  (29.95, 40.0]

# bins as a sequence of scalars
my_data['my_cut'] = pd.cut(x = my_data['MATH'], bins = [1, 50, 70, 100])
print(my_data)
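
The labels option can be combined with the same bins. A minimal sketch reusing the MATH scores from above; the label names 'low', 'mid' and 'high' and the variable names are made up for illustration:

```python
import pandas as pd

math_scores = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

# One label per bin: (1, 50] -> 'low', (50, 70] -> 'mid', (70, 100] -> 'high'
grade = pd.cut(x=math_scores, bins=[1, 50, 70, 100],
               labels=['low', 'mid', 'high'])
print(grade.tolist())
```

The result is still an ordered categorical, so sorting by it follows the bin order rather than the alphabetical order of the labels.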

Suggestion : 4

How to use the cut and qcut functions in Pandas

The Pandas qcut function bins data into an equal distribution of items.

The Pandas cut function is closely related to the .qcut() function. However, it’s used to bin values into discrete intervals, which you define yourself. This, for example, can be very helpful when defining meaningful age groups or income groups. In many cases, these groupings will have some other type of meaning, such as legal or cultural.

To follow along with the tutorial, let’s use a very simple Pandas DataFrame. The data is deliberately kept simple to better understand how the data is being split. The dataset has only two columns: a Name column and an Age column. Let’s load the data using the .from_dict() method:

# Loading a Sample Pandas DataFrame
import pandas as pd

df = pd.DataFrame.from_dict({
   'Name': ['Ray', 'Jane', 'Kate', 'Nik', 'Autumn', 'Kasi', 'Mandeep', 'Evan', 'Kyra', 'Jim'],
   'Age': [12, 7, 33, 34, 45, 65, 77, 11, 32, 55]
})

print(df.head())

# Returns:
#      Name  Age
# 0     Ray   12
# 1    Jane    7
# 2    Kate   33
# 3     Nik   34
# 4  Autumn   45

The Pandas .qcut() method splits your data into equal-sized buckets, based on rank or some sample quantiles. This process is known as quantile-based discretization. Let’s take a look at the parameters available in the function:

# Parameters of the Pandas .qcut() method
pd.qcut(
   x,                   # Column to bin
   q,                   # Number of quantiles
   labels=None,         # List of labels to include
   retbins=False,       # Whether to return the bins/labels or not
   precision=3,         # The precision to store and display the bins labels
   duplicates='raise'   # If bin edges are not unique, raise a ValueError
)

The function only has two required parameters, the column to bin (x=) and the number of quantiles to generate (q=). The function returns a Series of data that can, for example, be assigned to a new column. Let’s see how we can split our Age column into four different quantiles:

# Splitting Age Column into Four Quantiles
df['Age Groups'] = pd.qcut(df['Age'], 4)
print(df.head())

# Returns:
#      Name  Age     Age Groups
# 0     Ray   12  (6.999, 17.0]
# 1    Jane    7  (6.999, 17.0]
# 2    Kate   33   (17.0, 33.5]
# 3     Nik   34   (33.5, 52.5]
# 4  Autumn   45   (33.5, 52.5]
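
To see where the quantile edges fall and how many rows land in each bucket, retbins=True can be combined with value_counts. A small sketch using the same ages as the sample DataFrame; the variable names are chosen here for illustration:

```python
import pandas as pd

ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name='Age')

# retbins=True additionally returns the computed bin edges
groups, edges = pd.qcut(ages, 4, retbins=True)
print(edges)

# Number of rows per bucket, listed in bin order
print(groups.value_counts().sort_index())
```

With ten ages and four quantiles the buckets cannot be exactly equal in size, so the counts come out as 3, 2, 2, 3.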

Rather than simply passing in the number of groupings you want to create, you can also pass in a list of quantiles. This list should run from 0 through 1; evenly spaced values split the data into equal percentages. Let’s see how we can split our data into 25% bins.

# Splitting Age Column into Four Quantiles
df['Age Groups'] = pd.qcut(
   df['Age'],
   [0, 0.25, 0.5, 0.75, 1]
)
print(df.head())

# Returns:
#      Name  Age     Age Groups
# 0     Ray   12  (6.999, 17.0]
# 1    Jane    7  (6.999, 17.0]
# 2    Kate   33   (17.0, 33.5]
# 3     Nik   34   (33.5, 52.5]
# 4  Autumn   45   (33.5, 52.5]

Right now, the bins of our dataset are descriptive, but they’re also a little hard to read. You can pass in a list of labels that you want to relabel your dataset as. The length of the list should match the number of bins being created. Let’s see how we can convert our grouped data into descriptive labels:

# Adding Labels to Pandas.qcut()
df['Age Groups'] = pd.qcut(
   df['Age'],
   [0, 0.25, 0.5, 0.75, 1],
   labels = ['0-25%', '26-49%', '51-75%', '76-100%']
)
print(df.head())

# Returns:
#      Name  Age Age Groups
# 0     Ray   12      0-25%
# 1    Jane    7      0-25%
# 2    Kate   33     26-49%
# 3     Nik   34     51-75%
# 4  Autumn   45     51-75%

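The .qcut() function has no include_lowest or right parameter, so the cut() function is needed to control which side of each interval is closed. Unlike .qcut(), cut() reads a list of bins as edges in the data’s own units rather than as quantiles, so the quantile edges are computed first. The intervals stay right-closed here because, with right=False, the largest age would fall outside the last bin:

# Quantile edges expressed in the units of the Age column
edges = df['Age'].quantile([0, 0.25, 0.5, 0.75, 1]).tolist()

df['Age Group'] = pd.cut(
   df['Age'],
   edges,
   include_lowest=True
)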