python: pandas - separate a dataframe based on a column value

You can use boolean indexing:

df = pd.DataFrame({
   'Sales': [10, 20, 30, 40, 50],
   'A': [3, 4, 7, 6, 1]
})
print(df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50

s = 30

df1 = df[df['Sales'] >= s]
print(df1)
   A  Sales
2  7     30
3  6     40
4  1     50

df2 = df[df['Sales'] < s]
print(df2)
   A  Sales
0  3     10
1  4     20

It's also possible to invert the mask with ~:

mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print(df1)
   A  Sales
2  7     30
3  6     40
4  1     50

print(df2)
   A  Sales
0  3     10
1  4     20

print(mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool

print(~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool

Using groupby, you could split into two dataframes like this:

In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]

In [1048]: df1
Out[1048]:
   A  Sales
2  7     30
3  6     40
4  1     50

In [1049]: df2
Out[1049]:
   A  Sales
0  3     10
1  4     20

You can also store all of the split dataframes in a list variable and access each separated dataframe by its index.

DF = pd.DataFrame({
   'chr': ["chr3", "chr3", "chr7", "chr6", "chr1"],
   'pos': [10, 20, 30, 40, 50],
})
ans = [y for x, y in DF.groupby('chr', as_index=False)]

Access the separated dataframes like this:

ans[0]
ans[1]
ans[len(ans) - 1] # this is the last separated DF

Access a column of a separated dataframe like this:

ansI_chr = ans[i].chr
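
If you would rather look up each piece by its group value instead of a positional index, a dictionary comprehension over the same groupby is a small alternative sketch (the ans_dict name is illustrative):

# one dataframe per unique 'chr' value, keyed by that value
ans_dict = {key: group for key, group in DF.groupby('chr')}
print(ans_dict["chr3"])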

One-liner using the walrus operator (Python 3.8+):

df1, df2 = df[(mask := df['Sales'] >= 30)], df[~mask]

Consider using .copy() to avoid SettingWithCopyWarning:

df1, df2 = df[(mask := df['Sales'] >= 30)].copy(), df[~mask].copy()

Alternatively, you can use the query method:

df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()

I like to use this for speeding up searches or rolling-average .apply(lambda x: ...) type functions, so I split big files into dictionaries of dataframes:

df_dict = {
   sale_v: df[df['Sales'] == sale_v]
   for sale_v in df.Sales.unique()
}
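
As a follow-up, here is a minimal sketch of how such a dictionary might be used; the lookup key 30 and the rolling window of 2 are only illustrative:

# look up a split directly by its Sales value, no repeated boolean filtering
print(df_dict[30])

# per-split processing, e.g. a rolling mean over column 'A' of each split
smoothed = {k: v['A'].rolling(2).mean() for k, v in df_dict.items()}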

Suggestion : 2

Examples of how to slice (split) a dataframe by column value with pandas in Python: we want to slice this dataframe according to the column Year and store each slice under its own name.

Let's first create a dataframe

import pandas as pd
import random

l1 = [random.randint(1, 100) for i in range(15)]
l2 = [random.randint(1, 100) for i in range(15)]
l3 = [random.randint(2018, 2020) for i in range(15)]

data = {
   'Column A': l1,
   'Column B': l2,
   'Year': l3
}

df = pd.DataFrame(data)

print(df)

returns

    Column A  Column B  Year
0         63         9  2018
1         97        29  2018
2          1        92  2019
3         75        38  2020
4         19        50  2019
5         20        71  2019
6         59        60  2020
7         93        46  2019
8          6        17  2020
9         87        82  2018
10        36        12  2020
11        89        71  2018
12        87        69  2019
13        98        21  2018
14        82        67  2020

To find the unique value in a given column:

df['Year'].unique()

To extract dataframe rows for a given column value (for example 2018), a solution is to do:

df[df['Year'] == 2018]

Now we can slice the original dataframe, using a dictionary to store the results:

df_sliced_dict = {}

for year in df['Year'].unique():
   df_sliced_dict[year] = df[df['Year'] == year]
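
Each slice can then be retrieved by its year value; for example (the key 2018 assumes that year appears in the random data above):

print(df_sliced_dict.keys())
print(df_sliced_dict[2018])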

Suggestion : 3

Splitting a dataframe by column value is a very helpful skill to know. It can help with automating reporting or being able to parse out different values of a dataframe. The way that you'll learn to split a dataframe by its column values is by using the .groupby() method: first you create a groupby object (called grouped below) that splits the dataframe by the Name column, then you'll see how to split the dataframe into all of its possible groupings.

Let’s get started and load some data!

import pandas as pd
df = pd.DataFrame.from_dict({
   'Name': ['Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen'],
   'Year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
   'Income': [10000, 11000, 9000, 12000, 13000, 11000, 14000, 15000, 13000, 12000, 14000, 13000],
   'Gender': ['F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F']
})
print(df)

Printing out the dataframe returns the following:

       Name  Year  Income Gender
0     Jenny  2020   10000      F
1      Matt  2021   11000      M
2   Kristen  2022    9000      F
3     Jenny  2020   12000      F
4      Matt  2021   13000      M
5   Kristen  2022   11000      F
6     Jenny  2020   14000      F
7      Matt  2021   15000      M
8   Kristen  2022   13000      F
9     Jenny  2020   12000      F
10     Matt  2021   14000      M
11  Kristen  2022   13000      F

Let's see how we can split the dataframe by the Name column:

grouped = df.groupby(df['Name'])
print(grouped.get_group('Jenny'))
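
To split the dataframe into all of its possible groupings at once, one option is a dictionary comprehension over the same groupby object (a small sketch; the groups_by_name variable name is illustrative):

# one dataframe per unique Name, keyed by that name
groups_by_name = {name: group for name, group in df.groupby('Name')}
print(groups_by_name['Matt'])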

The way that we can find the midpoint of a dataframe is by finding the dataframe’s length and dividing it by two. Once we know the length, we can split the dataframe using the .iloc accessor.

>>> half_df = len(df) // 2
>>> first_half = df.iloc[:half_df, ]
>>> print(first_half)

      Name  Year  Income Gender
0    Jenny  2020   10000      F
1     Matt  2021   11000      M
2  Kristen  2022    9000      F
3    Jenny  2020   12000      F
4     Matt  2021   13000      M
5  Kristen  2022   11000      F
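
The second half can be taken the same way by starting at the midpoint (a small follow-up sketch in the same style):

>>> second_half = df.iloc[half_df:, ]
>>> print(second_half)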

Let's see how we can turn this into a function that splits the dataframe into multiple sections:

def split_dataframe_by_position(df, splits):
   """
   Takes a dataframe and an integer of the number of splits to create.
   Returns a list of dataframes.
   """
   dataframes = []
   index_to_split = len(df) // splits
   start = 0
   end = index_to_split
   for split in range(splits):
      temporary_df = df.iloc[start:end, :]
      dataframes.append(temporary_df)
      start += index_to_split
      end += index_to_split

   return dataframes

split_dataframes = split_dataframe_by_position(df, 3)
print(split_dataframes[1])
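
Note that when len(df) is not evenly divisible by splits, the function above drops the trailing rows; if that matters, numpy.array_split is one possible alternative (a sketch, assuming numpy is installed):

import numpy as np

# keeps every row; later chunks are simply one row shorter when the split is uneven
split_dataframes = np.array_split(df, 3)
print(split_dataframes[1])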

Suggestion : 4

A groupby operation involves splitting the data into groups based on some criteria, and then, for example, transformation: performing group-specific computations and returning a like-indexed object, such as standardizing data (z-score) within a group or filling NAs within groups with a value derived from each group. If there are any NaN values in the grouping key, these will be automatically excluded, so there will never be an "NA group". Starting with pandas 0.8, Index objects support duplicate values; if a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group, and the output of aggregation functions will only contain unique index values.

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])
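
For comparison, a rough pandas equivalent of the SQL statement above might look like the following sketch, assuming some_table is a dataframe with those column names:

result = some_table.groupby(['Column1', 'Column2']).agg({'Column3': 'mean', 'Column4': 'sum'})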
In [1]: df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                       'foo', 'bar', 'foo', 'foo'],
   ...:                 'B': ['one', 'one', 'two', 'three',
   ...:                       'two', 'two', 'one', 'three'],
   ...:                 'C': randn(8), 'D': randn(8)})

In [2]: df
Out[2]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

In [3]: grouped = df.groupby('A')

In [4]: grouped = df.groupby(['A', 'B'])

In [5]: def get_letter_type(letter):
   ...:     if letter.lower() in 'aeiou':
   ...:         return 'vowel'
   ...:     else:
   ...:         return 'consonant'
   ...:

In [6]: grouped = df.groupby(get_letter_type, axis=1)

In [7]: lst = [1, 2, 3, 1, 2, 3]

In [8]: s = Series([1, 2, 3, 10, 20, 30], lst)

In [9]: grouped = s.groupby(level=0)

In [10]: grouped.first()
Out[10]:
1    1
2    2
3    3
dtype: int64

In [11]: grouped.last()
Out[11]:
1    10
2    20
3    30
dtype: int64

In [12]: grouped.sum()
Out[12]:
1    11
2    22
3    33
dtype: int64
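
As a sketch of the group-wise transformation mentioned above, standardizing column C within each group of A could look like this (the z-score is the usual (x - mean) / std):

In [13]: zscore = lambda x: (x - x.mean()) / x.std()

In [14]: standardized = df.groupby('A')['C'].transform(zscore)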

Suggestion : 5

You can split strings in a dataframe column around a given separator/delimiter with the pandas Series.str.split() function and, conversely, concatenate values from multiple columns into a single column with Series.str.cat(). If your objective is to split a text column into multiple columns, make sure to pass expand=True; otherwise you get a single column of lists, which can still be split further but is less convenient.

You can use the pandas Series.str.split() function to split strings in the column around a given separator/delimiter. It is similar to the python string split() function but applies to the entire dataframe column. The following is the syntax:

# df is a pandas dataframe
# default parameters of the pandas Series.str.split() function
df['Col'].str.split(pat, n=-1, expand=False)
# to split into multiple columns by delimiter
df['Col'].str.split(delimiter, expand=True)

Let’s look at the usage of the above method with the help of some examples. First, we will create a dataframe that we will be using throughout this tutorial.

import pandas as pd

# create a dataframe
df = pd.DataFrame({
   'Address': ['4860 Sunset Boulevard,San Francisco,California',
      '3055 Paradise Lane,Salt Lake City,Utah',
      '682 Main Street,Detroit,Michigan',
      '9001 Cascade Road,Kansas City,Missouri'
   ]
})
# display the dataframe
df

Apply the pandas series str.split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.

# split column into multiple columns by delimiter
df['Address'].str.split(',', expand = True)

Output:

                       0               1           2
0  4860 Sunset Boulevard   San Francisco  California
1     3055 Paradise Lane  Salt Lake City        Utah
2        682 Main Street         Detroit    Michigan
3      9001 Cascade Road     Kansas City    Missouri

Let’s now add the three new columns resulting from the split to the dataframe df.

# split column and add new columns to df
df[['Street', 'City', 'State']] = df['Address'].str.split(',', expand = True)
# display the dataframe
df
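
The reverse operation, joining the pieces back together with Series.str.cat(), might look like this sketch; the Address_rebuilt column name is illustrative and the column names assume the split performed above:

# rebuild the full address from the three split columns
df['Address_rebuilt'] = df['Street'].str.cat([df['City'], df['State']], sep=',')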