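Here's an alternative way to do it, using fancy indexing. (DataFrame.append was removed in pandas 2.0; on current versions, pass the same two pieces to pd.concat instead.)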
df.append(df.iloc[[-1] * 3])
Out[757]:
            A  B  C  D
2014-01-01  1  0  0  0
2014-01-02  0  1  0  0
2014-01-03  0  0  1  0
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
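The same fancy-indexing selection can be sketched with pd.concat, which also runs on pandas 2.x. The frame construction here is an assumption borrowed from the concat answer on this page (with integer values so the printout matches the output shown), and the names last_three and extended are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=4)
df = pd.DataFrame(np.eye(4, dtype=int), index=dates, columns=list("ABCD"))

# iloc with a list of positions returns one row per listed position,
# so [-1] * 3 selects the last row three times.
last_three = df.iloc[[-1] * 3]

# Stack the original frame and the repeated rows; the index label is repeated too.
extended = pd.concat([df, last_three])
print(extended)
```

Because the index is kept, the last label appears four times in the result; pass ignore_index=True to pd.concat if a fresh index is preferred.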
You could use nested concat operations: the inner one concatenates your last row 3 times, and the outer one concatenates the result with your original df:
In[181]:
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2014', periods=4)
df = pd.DataFrame(np.eye(4, 4), index=dates, columns=['A', 'B', 'C', 'D'])
pd.concat([df, pd.concat([df[-1:]] * 3)])
Out[181]:
            A  B  C  D
2014-01-01  1  0  0  0
2014-01-02  0  1  0  0
2014-01-03  0  0  1  0
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
This could be put into a function like so:
In[182]:
def repeatRows(d, n=3):
    # Stack n copies of d on top of each other
    return pd.concat([d] * n)

pd.concat([df, repeatRows(df[-1:], 3)])
Out[182]:
            A  B  C  D
2014-01-01  1  0  0  0
2014-01-02  0  1  0  0
2014-01-03  0  0  1  0
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
2014-01-04  0  0  0  1
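One edge case worth noting: repeatRows(d, 0) ends up calling pd.concat on an empty list, which raises a ValueError. A minimal sketch of a variant that guards against this is below; the name append_last_row and its behaviour for n <= 0 are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def append_last_row(frame: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return a new frame with the last row of `frame` appended n more times."""
    if n <= 0 or frame.empty:
        # Nothing to repeat; avoid handing pd.concat an empty list of pieces.
        return frame.copy()
    # frame.iloc[-1:] is a one-row DataFrame, so the columns line up on concat.
    return pd.concat([frame] + [frame.iloc[-1:]] * n)


dates = pd.date_range("2014-01-01", periods=4)
df = pd.DataFrame(np.eye(4), index=dates, columns=["A", "B", "C", "D"])

result = append_last_row(df, 3)
print(result)
```

Building a single list of pieces and calling pd.concat once avoids the nested call while producing the same rows.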
Repeating or replicating the rows of a dataframe in pandas (creating duplicate rows) can be done in a roundabout way by using the concat() function. Let’s see how to repeat a dataframe both with and without its index.
First let’s create a dataframe
import pandas as pd
import numpy as np

# Create a DataFrame
df1 = {
    'State': ['Arizona AZ', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
    'Score': [62, 47, 55, 74, 31]
}
df1 = pd.DataFrame(df1, columns=['State', 'Score'])
print(df1)
Repeat the dataframe 3 times with the concat function. Passing ignore_index=True discards the repeated index, so a new index is created for the repeated rows.
# Repeat without index
df_repeated = pd.concat([df1] * 3, ignore_index=True)
print(df_repeated)
By default, the concat function repeats the dataframe together with its index, so the index labels are repeated as well.
# Repeat with index
df_repeated_with_index = pd.concat([df1] * 2)
print(df_repeated_with_index)
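To make the difference between the two calls concrete, the sketch below prints the resulting index in each case. It uses a shortened three-row version of df1. The row-wise variant with Index.repeat is an alternative that the text above does not cover, included here only for comparison.

```python
import pandas as pd

df1 = pd.DataFrame({
    "State": ["Arizona AZ", "Georgia GG", "Newyork NY"],
    "Score": [62, 47, 55],
})

# Whole-frame repetition: the copies are stacked one after another.
stacked = pd.concat([df1] * 2)
stacked_fresh = pd.concat([df1] * 2, ignore_index=True)

# Row-wise repetition: each index label is repeated in place and the
# rows are then selected by label with loc.
interleaved = df1.loc[df1.index.repeat(2)]

print(stacked.index.tolist())        # [0, 1, 2, 0, 1, 2]
print(stacked_fresh.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(interleaved.index.tolist())    # [0, 0, 1, 1, 2, 2]
```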
DataFrame.duplicated() returns a boolean Series denoting duplicate rows. By default all columns are used to identify duplicates; the optional subset argument restricts the comparison to certain columns. With keep='last', the last occurrence of each set of duplicated values is marked False and all others True.
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool
>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool
>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
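Since duplicated() returns a boolean Series, it can be used directly as a mask. The following sketch, which reuses the same brand/style/rating frame, counts the flagged rows and filters them out; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

mask = df.duplicated()          # True only for the second "Yum Yum cup 4.0" row
n_duplicates = int(mask.sum())  # number of rows flagged as duplicates
unique_rows = df[~mask]         # keep the rows that are not flagged

# Restricting the comparison to two columns also flags the last Indomie row.
by_brand_style = df.duplicated(subset=["brand", "style"])

print(n_duplicates)
print(unique_rows.equals(df.drop_duplicates()))
print(by_brand_style.tolist())
```

Filtering with the inverted mask gives the same rows as drop_duplicates() with its default arguments.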
By using the pandas.DataFrame.drop_duplicates() method you can drop duplicate rows from a DataFrame, either across all columns or on selected columns. This section explains several ways to drop duplicate rows, using DataFrame.drop_duplicates(), DataFrame.apply() and a lambda function, with examples.
# Below are quick examples

# Keep first duplicate row
df2 = df.drop_duplicates()

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')

# Keep last duplicate row
df2 = df.drop_duplicates(keep='last')

# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)

# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)

# Drop duplicate rows in place
df.drop_duplicates(inplace=True)

# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
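The last quick example lower-cases every value before dropping duplicates, which means the returned frame contains lower-cased strings rather than the original values. If the original values should be preserved, one possible variant is to normalise only for the comparison and filter the original frame with the resulting mask. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "spark", "Python"],
    "Fee": [20000, 20000, 22000],
})

# Normalise a copy for comparison only: every value becomes lower-case text.
normalised = df.apply(lambda col: col.astype(str).str.lower())

# Flag rows whose normalised key has already appeared, then filter the original.
is_dup = normalised.duplicated(subset=["Courses", "Fee"], keep="first")
deduped = df[~is_dup]
print(deduped)
```

Here "spark" is treated as a duplicate of "Spark", while the surviving rows keep their original capitalisation and numeric Fee column.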
Below is the syntax of the DataFrame.drop_duplicates() function that removes duplicate rows from the pandas DataFrame.
# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Now, let’s create a DataFrame with a few duplicate rows. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000]
}
df = pd.DataFrame(technologies)
print(df)
You can use DataFrame.drop_duplicates() without any arguments to drop rows that have the same values in all columns. It uses the default values subset=None and keep='first'. The example below returns four rows after removing the duplicate rows from our DataFrame.
# Keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)
This yields the output below.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
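To see how the keep and ignore_index parameters change which rows survive, the sketch below runs the variants on the same technologies data and prints the surviving index labels. The variable names are illustrative only.

```python
import pandas as pd

technologies = {
    "Courses": ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    "Fee": [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    "Duration": ["30days", "40days", "35days", "50days", "35days", "30days", "50days"],
    "Discount": [1000, 2300, 1200, 2000, 1200, 1000, 2000],
}
df = pd.DataFrame(technologies)

first = df.drop_duplicates()                # first occurrence of each row
last = df.drop_duplicates(keep="last")      # last occurrence of each row
unique_only = df.drop_duplicates(keep=False)  # rows that never repeat
renumbered = df.drop_duplicates(keep="last", ignore_index=True)

print(first.index.tolist())        # [0, 1, 2, 3]
print(last.index.tolist())         # [1, 4, 5, 6]
print(unique_only.index.tolist())  # [1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Only the PySpark row appears once in the data, which is why keep=False leaves a single row.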
Pandas DataFrame provides a function, DataFrame.append(), to add rows to a dataframe. It accepts a dictionary, a Series object, or a list of Series with the same column names as the dataframe, so one or several rows can be appended at once. You can also select a row from one dataframe by its label using loc[] and pass it to append() to add that row to another dataframe, for example the row with index label ‘b’. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples below require an older pandas version. Its signature is:
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Consider the following dataframe:
    Name  Age       City    Country
a   jack   34     Sydeny  Australia
b   Riti   30      Delhi      India
c  Vikas   31     Mumbai      India
d  Neelu   32  Bangalore      India
e   John   16   New York         US
f   Mike   17  las vegas         US
Let’s add a new row to the above dataframe by passing a dictionary:
# Pass the row elements as key value pairs to append() function
mod_df = df.append({'Name': 'Sahil', 'Age': 22}, ignore_index=True)

print('Modified Dataframe')
print(mod_df)
The complete example of adding a dictionary as a row to the dataframe is as follows:
import pandas as pd

# List of tuples
students = [('jack', 34, 'Sydeny', 'Australia'),
            ('Riti', 30, 'Delhi', 'India'),
            ('Vikas', 31, 'Mumbai', 'India'),
            ('Neelu', 32, 'Bangalore', 'India'),
            ('John', 16, 'New York', 'US'),
            ('Mike', 17, 'las vegas', 'US')]

# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'], index=['a', 'b', 'c', 'd', 'e', 'f'])

print('Original Dataframe')
print(df)

# Pass the row elements as key value pairs to append() function
mod_df = df.append({'Name': 'Sahil', 'Age': 22}, ignore_index=True)

print('Modified Dataframe')
print(mod_df)
Output:
Original Dataframe
    Name  Age       City    Country
a   jack   34     Sydeny  Australia
b   Riti   30      Delhi      India
c  Vikas   31     Mumbai      India
d  Neelu   32  Bangalore      India
e   John   16   New York         US
f   Mike   17  las vegas         US
Modified Dataframe
    Name  Age       City    Country
0   jack   34     Sydeny  Australia
1   Riti   30      Delhi      India
2  Vikas   31     Mumbai      India
3  Neelu   32  Bangalore      India
4   John   16   New York         US
5   Mike   17  las vegas         US
6  Sahil   22        NaN        NaN
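On pandas 2.x, where append() no longer exists, the same result can be sketched with pd.concat by wrapping the dictionary in a one-row DataFrame. The example uses a shortened two-row version of the students data, and the names new_row and with_b are illustrative only.

```python
import pandas as pd

students = [
    ("jack", 34, "Sydeny", "Australia"),
    ("Riti", 30, "Delhi", "India"),
]
df = pd.DataFrame(students, columns=["Name", "Age", "City", "Country"], index=["a", "b"])

# A list holding one dict becomes a one-row DataFrame; the keys that are
# missing from the dict (City, Country) come out as NaN after the concat.
new_row = pd.DataFrame([{"Name": "Sahil", "Age": 22}])
mod_df = pd.concat([df, new_row], ignore_index=True)
print(mod_df)

# Selecting with a list of labels keeps the row as a DataFrame, so an
# existing row (label 'b' here) can be appended in the same way.
with_b = pd.concat([df, df.loc[["b"]]])
print(with_b)
```

As with append(..., ignore_index=True), the original labels a and b are replaced by 0, 1, 2 in mod_df, while with_b keeps the labels and therefore contains 'b' twice.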