in pandas apply method, duplicate the row based on condition

Here is one way using df.iterrows inside a list comprehension. You will need to collect the (possibly duplicated) rows in a list and then concat them back together.

def func(row):
    if row['a'] == "3":
        row2 = row.copy()
        # make edits to row2
        return pd.concat([row, row2], axis=1)
    return row

pd.concat([func(row) for _, row in df.iterrows()], ignore_index = True, axis = 1).T

   a            b
0  1            2
1  1            2
2  3  other_value
3  3  other_value

Your logic does seem mostly vectorisable. Since the order of rows in your output appears to be important, you can increment the default RangeIndex by 0.5 and then use sort_index.

def row_appends(x):
    newrows = x.loc[x['a'].isin(['3', '4', '5'])].copy()
    newrows.loc[x['a'] == '3', 'b'] = 10  # make conditional edit
    newrows.loc[x['a'] == '4', 'b'] = 20  # make conditional edit
    newrows.index = newrows.index + 0.5
    return newrows

res = pd.concat([df, df.pipe(row_appends)])\
   .sort_index().reset_index(drop = True)

print(res)

   a            b
0  1            2
1  1            2
2  3  other_value
3  3           10
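A fully vectorised alternative (not from the answers above; a sketch reusing the same toy data) duplicates rows with `Index.repeat` and then edits only the second copy of each duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '1', '3'], 'b': ['2', '2', 'other_value']})

# Repeat a row twice when column 'a' equals '3', once otherwise.
repeats = df['a'].eq('3').map({True: 2, False: 1})
res = df.loc[df.index.repeat(repeats)].reset_index(drop=True)

# Edit only the second copy of each duplicated '3' row.
dup_mask = res.duplicated() & res['a'].eq('3')
res.loc[dup_mask, 'b'] = 10
print(res)
```

This avoids Python-level iteration entirely: `duplicated()` flags the appended copies, so they can all be edited in a single `.loc` call.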

I would vectorise it, doing it category by category:

df.loc[df_condition_1, "a"] = 3   # .loc avoids chained assignment, which silently fails
df.loc[df_condition_2, "a"] = 4

duplicates = df[df_condition_3].copy()  # store the rows to duplicate
duplicates["a"] = 5

# then append the duplicates back
df = pd.concat([df, duplicates])
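A minimal runnable version of that sketch (the condition columns and values here are assumptions standing in for `df_condition_1/2/3`), using `.loc` for the in-place edits and `pd.concat` to append the duplicates:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 6, 6], 'b': ['w', 'x', 'y', 'z']})

# Hypothetical conditions standing in for df_condition_1/2/3.
cond1 = df['b'] == 'w'
cond2 = df['b'] == 'x'
cond3 = df['a'] == 6

df.loc[cond1, 'a'] = 3           # in-place edit, no chained assignment
df.loc[cond2, 'a'] = 4

duplicates = df[cond3].copy()    # store the rows to duplicate
duplicates['a'] = 5

# Append the duplicated rows back, originals first.
res = pd.concat([df, duplicates], ignore_index=True)
print(res)
```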

Suggestion : 2

The `keep` parameter of `DataFrame.duplicated()` determines which duplicates (if any) to mark: `'first'` marks duplicates as True except for the first occurrence, `'last'` marks them as True except for the last occurrence, and `False` marks all duplicates as True. By default, for each set of duplicated values, the first occurrence is set to False and all others to True.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
>>> df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(keep = 'last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(keep = False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(subset = ['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
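The boolean mask returned by `duplicated()` can be passed straight back into the `[]` operator; for example, inverting it keeps only the first occurrences (equivalent to `drop_duplicates()`):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Invert the mask to keep rows that are NOT marked as duplicates.
unique_rows = df[~df.duplicated()]
print(unique_rows)
```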

Suggestion : 3

Related questions:

  • How to check for duplicate values in the same dataframe column and apply an if condition by dropping the row based on frequency?
  • Create duplicate row in Pandas dataframe based on condition, and change values for a specific column
  • pandas add column to dataframe having the value from another row based on condition
  • Drop duplicate rows (summarizing the data) in pandas dataframe based on condition and other columns

With Multi Index and stack():

# Create the dataframe
df = [
   ["John", 22, 1, 0, 0],
   ["Pete", 54, 0, 1, 0],
   ["Lisa", 26, 1, 1, 0]
]
df = pd.DataFrame(df, columns = ["Name", "Age", "BMoney", "BTime", "BEffort"])

# Set Multi Indexing
df.set_index(["Name", "Age"], inplace = True)

# Use the fact that columns and Series can carry names and use stack to do the transformation
df.columns.name = "B"
df = df.stack()
df.name = "value"
df = df.reset_index()

# Select only the "valid" rows, remove the value column,
# and strip the leading "B" from the entries in the B column
df = df[df.value == 1]
df.drop("value", axis=1, inplace=True)
df["B"] = df.B.apply(lambda x: x[1:])
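Putting the steps above together on the sample frame gives a self-contained check of the stack-based reshape:

```python
import pandas as pd

df = pd.DataFrame(
    [["John", 22, 1, 0, 0], ["Pete", 54, 0, 1, 0], ["Lisa", 26, 1, 1, 0]],
    columns=["Name", "Age", "BMoney", "BTime", "BEffort"],
)
df.set_index(["Name", "Age"], inplace=True)
df.columns.name = "B"

# stack() moves the B* columns into the index; keep only the 1s.
long = df.stack().rename("value").reset_index()
long = long[long.value == 1].drop("value", axis=1)
long["B"] = long.B.str[1:]   # "BMoney" -> "Money", etc.
print(long)
```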

Suggestion : 4

We can use the Series.map() function to achieve the same goal. It is a straightforward method where a dictionary simply maps values to the newly added column based on the key; it substitutes each value in a Series with another value. If you need to check multiple columns to create a new column, use the DataFrame.assign() function, as shown in the examples below. You can also use transform() with a lambda function: transform() returns a self-produced result with transformed values after applying the function specified in its parameter. NOTE: you can replace values where the condition is False with the Series.where() method, which checks a Series against one or more conditions and returns the result accordingly.

# Below are some quick examples.

# Create a conditional DataFrame column with the np.where() function.
df['Discount'] = np.where(df['Courses'] == 'Spark', 1000, 2000)

# Another way to create a column conditionally: a list comprehension.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]

# Create a conditional DataFrame column with map() and a lambda.
df['Discount'] = df.Courses.map(lambda x: 1000 if x == 'Spark' else 2000)

# Create a conditional DataFrame column with the np.select() function.
conditions = [
   (df['Courses'] == 'Spark') & (df['Duration'] == '30days'),
   (df['Courses'] == 'Spark') & (df['Duration'] == '35days'),
   (df['Duration'] == '50days')
]
choices = [1000, 1050, 200]
df['Discount'] = np.select(conditions, choices, default=0)

# Using a dictionary to map new values.
Discount_dictionary = {'Spark': 1500, 'PySpark': 800, 'Python': 1200}
df['Discount'] = df['Courses'].map(Discount_dictionary)

# Create a conditional DataFrame column from a dictionary, with a default.
df['Discount'] = [Discount_dictionary.get(v, None) for v in df['Courses']]

# Using the DataFrame.assign() method.
def Courses_Discount(row):
   if row["Courses"] == "Spark":
      return 1000
   else:
      return 2000
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Conditions on multiple columns.
def Courses_Discount(row):
   if row["Courses"] == "Spark":
      return 1000
   elif row["Fee"] == 25000:
      return 2000
   else:
      return 0
df = df.assign(Discount=df.apply(Courses_Discount, axis=1))

# Using the .loc[] property for a single condition.
df.loc[df['Courses'] == "Spark", 'Discount'] = 1000

# Using .loc[] for multiple conditions.
df.loc[(df['Courses'] == "Spark") & (df['Fee'] == 23000) | (df['Fee'] == 25000), 'Discount'] = 1000

# Using the DataFrame.apply() method with a lambda function.
df['Discount'] = df['Courses'].apply(lambda x: 1000 if x == 'Spark' else 2000)

# Create a conditional column using mask():
# mask() replaces values where the condition is True.
df['Discount'] = df['Discount'].mask(df['Courses'] == 'Spark', other=1000)

# where() replaces values where the condition is False.
df['Discount'] = df['Discount'].where(df['Courses'] == 'Spark', other=1000)

# Using transform() with a lambda function.
df['Discount'] = df['Courses'].transform(lambda x: 1000 if x == 'Spark' else 2000)

Let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee and Duration.

import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
   'Fee': [22000, 25000, 23000, 24000, 26000],
   'Duration': ['30days', '50days', '30days', None, np.nan]
}
df = pd.DataFrame(technologies)

Yields below output.

   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
3   Python  24000     None
4  PySpark  26000      NaN
# Another way to create a column conditionally.
df['Discount'] = [1000 if x == 'Spark' else 2000 for x in df['Courses']]
print(df)

Similarly, you can also create by using Series.map() and lambda. The lambda functions are defined using the keyword lambda. They can have any number of arguments but only one expression. These are very helpful when we have to perform small tasks with less code.

# Create conditional DataFrame column by map() and lambda.
df['Discount'] = df.Courses.map(lambda x: 1000 if x == 'Spark' else 2000)
print(df)

Suggestion : 5

In this article we will discuss different ways to select rows in a DataFrame based on conditions on single or multiple columns:

  • Select rows based on any of multiple conditions on a column
  • Select rows for which the ‘Sale’ column contains values greater than 30 and less than 33
  • Select rows for which the ‘Product‘ column contains either ‘Grapes‘ or ‘Mangos‘

First let’s create a DataFrame,

# List of Tuples
students = [('jack', 'Apples', 34),
   ('Riti', 'Mangos', 31),
   ('Aadi', 'Grapes', 30),
   ('Sonia', 'Apples', 32),
   ('Lucy', 'Mangos', 33),
   ('Mike', 'Apples', 35)
]

# Create a DataFrame object
dfObj = pd.DataFrame(students, columns = ['Name', 'Product', 'Sale'])

    Name Product  Sale
0   jack  Apples    34
1   Riti  Mangos    31
2   Aadi  Grapes    30
3  Sonia  Apples    32
4   Lucy  Mangos    33
5   Mike  Apples    35

Select rows in the above DataFrame for which the ‘Product’ column contains the value ‘Apples’:

subsetDataFrame = dfObj[dfObj['Product'] == 'Apples']

If we pass this Series object to the [] operator of the DataFrame, it will return a new DataFrame with only those rows that have True in the passed Series object, i.e.

dfObj[dfObj['Product'] == 'Apples']

Select rows in the above DataFrame for which the ‘Product‘ column contains either ‘Grapes‘ or ‘Mangos‘, i.e.

subsetDataFrame = dfObj[dfObj['Product'].isin(['Mangos', 'Grapes'])]
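The other selections promised above follow the same pattern; note that each clause must be parenthesised when combining conditions with `&` or `|` (shown here on the same toy data):

```python
import pandas as pd

dfObj = pd.DataFrame(
    [('jack', 'Apples', 34), ('Riti', 'Mangos', 31), ('Aadi', 'Grapes', 30),
     ('Sonia', 'Apples', 32), ('Lucy', 'Mangos', 33), ('Mike', 'Apples', 35)],
    columns=['Name', 'Product', 'Sale'])

# Sale strictly between 30 and 33: parentheses around each clause are required.
mid_sales = dfObj[(dfObj['Sale'] > 30) & (dfObj['Sale'] < 33)]

# Any of several conditions, combined with |.
apples_or_cheap = dfObj[(dfObj['Product'] == 'Apples') | (dfObj['Sale'] < 31)]
print(mid_sales)
print(apples_or_cheap)
```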