The main problem is that the ordered CategoricalIndex is lost:
import numpy as np
import pandas as pd

np.random.seed(12456)
y = pd.Series(np.random.randn(100))
x1 = pd.Series(np.sign(np.random.randn(100)))
x2 = pd.cut(pd.Series(np.random.randn(100)), bins=[-3, -0.5, 0, 0.5, 3])
model = pd.concat([y, x1, x2], axis=1, keys=['Y', 'X1', 'X2'])
int_output = model.groupby(['X1', 'X2']).mean().unstack()
int_output.columns = int_output.columns.get_level_values(1)
print(int_output)
X2    (-3, -0.5]  (-0.5, 0]  (0, 0.5]  (0.5, 3]
X1
-1.0    0.230060  -0.079266 -0.079834 -0.064455
 1.0   -0.451351   0.268688  0.020091 -0.280218
print(int_output.columns)
CategoricalIndex(['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'],
                 categories=['(-3, -0.5]', '(-0.5, 0]', '(0, 0.5]', '(0.5, 3]'],
                 ordered=True, name='X2', dtype='category')
output = pd.concat(int_output.to_dict('series'), axis=1)
print(output)
      (-0.5, 0]  (-3, -0.5]  (0, 0.5]  (0.5, 3]
X1
-1.0  -0.079266    0.230060 -0.079834 -0.064455
 1.0   0.268688   -0.451351  0.020091 -0.280218
print(output.columns)
Index(['(-0.5, 0]', '(-3, -0.5]', '(0, 0.5]', '(0.5, 3]'], dtype='object')
One possible solution is to extract the first number from output.columns, create a helper Series, sort it, and finally reindex by the sorted columns:
cat = output.columns.str.extract(r'\((.*),', expand=False).astype(float)
a = pd.Series(cat, index=output.columns).sort_values()
print(a)
(-3, -0.5]   -3.0
(-0.5, 0]    -0.5
(0, 0.5]      0.0
(0.5, 3]      0.5
dtype: float64
output = output.reindex(columns=a.index)
print(output)
      (-3, -0.5]  (-0.5, 0]  (0, 0.5]  (0.5, 3]
X1
-1.0    0.230060  -0.079266 -0.079834 -0.064455
 1.0   -0.451351   0.268688  0.020091 -0.280218
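As an aside, here is a minimal sketch (an addition, not part of the answer above): int_output still carries the ordered CategoricalIndex, so it can be reused directly to put the concatenated columns back in order:
# Reindex against the ordered CategoricalIndex that int_output kept
output = output.reindex(columns=int_output.columns)
print(output.columns)  # the ordered CategoricalIndex (name='X2') again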
An easy fix to the problem you've highlighted above is to simply reorder the columns:
output[sorted(output.columns)]
Note that this sorts the string labels lexicographically, so it only restores the intended order when that happens to coincide with the numeric order of the bins.
I made a function to do so.
def dfsortbybins(df, col):
    """
    param df: pandas dataframe
    param col: name of column containing bins
    """
    # assumes the column holds bin labels as strings, e.g. '(0, 0.5]'
    bins = df[col].dropna().unique()
    d = dict(zip(bins, [float(s.split(',')[0].split('(')[1]) for s in bins]))
    df[f'{col} dfrankbybins'] = df.apply(
        lambda x: d[x[col]] if not pd.isnull(x[col]) else x[col], axis=1)
    df = df.sort_values(f'{col} dfrankbybins').drop(f'{col} dfrankbybins', axis=1)
    return df
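For illustration, a minimal usage sketch (the tiny DataFrame below is hypothetical and assumes the bin column holds the interval labels as strings):
# Hypothetical data: 'X2' holds string bin labels like those in the examples above
tmp = pd.DataFrame({'X2': ['(0.5, 3]', '(-3, -0.5]', '(0, 0.5]'], 'val': [1, 2, 3]})
print(dfsortbybins(tmp, 'X2'))
#            X2  val
# 1  (-3, -0.5]    2
# 2    (0, 0.5]    3
# 0    (0.5, 3]    1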
Here's another function. This worked for me in multiple cases, unlike the other solutions. Figured I'd leave it here in the hope it will come in handy for people who run into the same issue in the future.
def sort_bins(bin_col):
    """
    Sorts bins after using pd.cut. Increasing order. Puts "NaN" bin at the beginning.
    Input:
        bin_col: pd.Series containing bins to be sorted
    """
    # Dictionary to store first value from each bin
    vals = {}
    # Iterate through all bins
    for i, item in enumerate(bin_col.unique()):
        # Check if bin is "nan", if yes, assign low value to put it at the beginning
        if item == "nan":
            vals[i] = -99999
        # If not "nan", get the first value from bin to sort later
        else:
            vals[i] = float(item.split(",")[0][1:])
    # Sort bins according to extracted first values
    ixs = list({k: v for k, v in sorted(vals.items(), key=lambda item: item[1])}.keys())
    # Make sorted list of bins
    sorted_bins = bin_col.unique()[list(ixs)]
    return sorted_bins

# Example, assuming "age_bin" column has the bins:
sorted_bins = sort_bins(df["age_bin"])
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable; for example, cut could convert ages to groups of age ranges. It supports binning into an equal number of bins, or a pre-specified array of bins. Passing an IntervalIndex for bins results in those categories exactly; notice that values not covered by the IntervalIndex are set to NaN (0 is to the left of the first bin, which is closed on the right, and 1.5 falls between two bins). Passing a Series as input returns a Series with the mapped values, i.e. the numeric values are mapped to intervals based on the bins. ordered=False will result in unordered categories when labels are passed; this parameter can be used to allow non-unique labels:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3., 5., 7.]))
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']
>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])
>>> s = pd.Series(np.array([2, 4, 6, 8, 10]), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
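To also illustrate the IntervalIndex case mentioned above, here is a short sketch (values not covered by the supplied intervals come back as NaN):
>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[float64, right]): [(0.0, 1.0] < (2.0, 3.0] < (4.0, 5.0]]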
Examples using pd.cut options
import pandas as pd
my_dict = {
'NAME': ['Ravi', 'Raju', 'Alex', 'Ron', 'King', 'Jack'],
'ID': [1, 2, 3, 4, 5, 6],
'MATH': [80, 40, 70, 70, 60, 30],
'ENGLISH': [80, 70, 40, 50, 60, 30]
}
my_data = pd.DataFrame(data=my_dict)
my_data['my_cut'] = pd.cut(x=my_data['MATH'], bins=[1, 50, 70, 100])
print(my_data)
   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]
print(my_data['my_cut'].dtypes) # category
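As a quick check (a small addition, not part of the original listing), the ordered bin categories behind the column can be inspected through the .cat accessor:
print(my_data['my_cut'].cat.categories)
# e.g. IntervalIndex([(1, 50], (50, 70], (70, 100]], dtype='interval[int64, right]')
print(my_data['my_cut'].cat.ordered)  # True -- pd.cut returns ordered categories by default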
my_data['my_cut'] = pd.cut(x=my_data['MATH'], bins=5)
print(my_data)
   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80   (70.0, 80.0]
1  Raju   2    40       70  (29.95, 40.0]
2  Alex   3    70       40   (60.0, 70.0]
3   Ron   4    70       50   (60.0, 70.0]
4  King   5    60       60   (50.0, 60.0]
5  Jack   6    30       30  (29.95, 40.0]
The Pandas cut function is closely related to the .qcut() function; however, it is used to bin values into discrete intervals which you define yourself. This, for example, can be very helpful when defining meaningful age groups or income groups. In many cases, these groupings will have some other type of meaning, such as legal or cultural. The Pandas qcut function, by contrast, bins data into an equal distribution of items.
To follow along with the tutorial, let's use a very simple Pandas DataFrame. The data is deliberately kept simple to better understand how the data is being split. The dataset has only two columns: a Name column and an Age column. Let's load the data using the .from_dict() method:
# Loading a Sample Pandas DataFrame
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Ray', 'Jane', 'Kate', 'Nik', 'Autumn', 'Kasi', 'Mandeep', 'Evan', 'Kyra', 'Jim'],
    'Age': [12, 7, 33, 34, 45, 65, 77, 11, 32, 55]
})
print(df.head())

# Returns:
#      Name  Age
# 0     Ray   12
# 1    Jane    7
# 2    Kate   33
# 3     Nik   34
# 4  Autumn   45
The Pandas .qcut() method splits your data into equal-sized buckets, based on rank or some sample quantiles. This process is known as quantile-based discretization. Let's take a look at the parameters available in the function:
# Parameters of the Pandas .qcut() method
pd.qcut(
    x,                    # Column to bin
    q,                    # Number of quantiles
    labels=None,          # List of labels to include
    retbins=False,        # Whether to return the bins/labels or not
    precision=3,          # The precision to store and display the bins labels
    duplicates='raise'    # If bin edges are not unique, raise a ValueError
)
The function only has two required parameters: the column to bin (x=) and the number of quantiles to generate (q=). The function returns a Series of data that can, for example, be assigned to a new column. Let's see how we can split our Age column into four different quantiles:
# Splitting Age Column into Four Quantiles
df['Age Groups'] = pd.qcut(df['Age'], 4)
print(df.head())

# Returns:
#      Name  Age     Age Groups
# 0     Ray   12  (6.999, 17.0]
# 1    Jane    7  (6.999, 17.0]
# 2    Kate   33   (17.0, 33.5]
# 3     Nik   34   (33.5, 52.5]
# 4  Autumn   45   (33.5, 52.5]
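To confirm that the quantiles hold a roughly equal number of rows (a small addition to the example above), you can count the values per bin:
print(df['Age Groups'].value_counts(sort=False))
# Each of the four bins holds 2-3 of the 10 rows:
# (6.999, 17.0]    3
# (17.0, 33.5]     2
# (33.5, 52.5]     2
# (52.5, 77.0]     3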
Rather than simply passing in the number of groupings you want to create, you can also pass in a list of quantiles. This list should range from 0 through 1, splitting the data into equal percentages. Let's see how we can split our data into 25% bins.
# Splitting Age Column into Four Quantiles
df['Age Groups'] = pd.qcut(
    df['Age'],
    [0, 0.25, 0.5, 0.75, 1]
)
print(df.head())

# Returns:
#      Name  Age     Age Groups
# 0     Ray   12  (6.999, 17.0]
# 1    Jane    7  (6.999, 17.0]
# 2    Kate   33   (17.0, 33.5]
# 3     Nik   34   (33.5, 52.5]
# 4  Autumn   45   (33.5, 52.5]
Right now, the bins of our dataset are descriptive, but they're also a little hard to read. You can pass in a list of labels to use for the bins instead. The length of the list should match the number of bins being created. Let's see how we can convert our grouped data into descriptive labels:
# Adding Labels to Pandas .qcut()
df['Age Groups'] = pd.qcut(
    df['Age'],
    [0, 0.25, 0.5, 0.75, 1],
    labels=['0-25%', '26-49%', '51-75%', '76-100%']
)
print(df.head())

# Returns:
#      Name  Age Age Groups
# 0     Ray   12      0-25%
# 1    Jane    7      0-25%
# 2    Kate   33     26-49%
# 3     Nik   34     51-75%
# 4  Autumn   45     51-75%
Since the .qcut() function doesn't allow you to specify that the lowest value of the range be included, the cut() function needs to be used.
df['Age Group'] = pd.cut(
    df['Age'],
    [0, 0.25, 0.5, 0.75, 1],
    include_lowest=True,
    right=False
)
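One way to apply this idea, as a minimal sketch (computing the quantile edges by hand is an assumption of mine, not part of the tutorial code above):
# Hypothetical variant: derive the quantile edges manually, then bin with cut
edges = df['Age'].quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
df['Age Group'] = pd.cut(
    df['Age'],
    bins=edges,
    include_lowest=True  # unlike qcut, this pulls the lowest age into the first bin
)
print(df[['Name', 'Age', 'Age Group']].head())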