Append a column to a pandas DataFrame without mutating the original

What about copy()? Copy the frame first, then add the column to the copy:

df2 = df.copy()
df2['newcol'] = 123

You can do it with assign:

df2 = df.assign(newcol=123)
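
To check that neither approach touches the original, one can compare the column lists afterwards. This is only a minimal sketch: the frame `df` and its single column `a` are made-up example data, not something from the question.

```python
import pandas as pd

# Hypothetical example data; any DataFrame should behave the same way.
df = pd.DataFrame({"a": [1, 2, 3]})

# Option 1: copy first, then add the column to the copy.
df_copy = df.copy()
df_copy["newcol"] = 123

# Option 2: assign returns a new DataFrame that carries the extra column.
df_assigned = df.assign(newcol=123)

print(list(df.columns))           # ['a'] (the original is unchanged)
print(list(df_copy.columns))      # ['a', 'newcol']
print(list(df_assigned.columns))  # ['a', 'newcol']
```

Both variants leave `df` with its original columns; `assign` just saves the explicit copy step.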

Suggestion : 2

January 26, 2019 by cmdline

Let us first load the pandas library.

import pandas as pd

Let us use the gapminder data set to add a new column (a new variable) in our examples. We will use the gapminder data from the Software Carpentry website, given as data_url below.

data_url = 'http://bit.ly/2cLzoxH'
# load the gapminder data from the web as a data frame
gapminder = pd.read_csv(data_url)
# select four columns
gapminder = gapminder[['country', 'year', 'gdpPercap', 'pop']]
# view the first few rows of the data frame
print(gapminder.head(3))

       country  year   gdpPercap         pop
0  Afghanistan  1952  779.445314   8425333.0
1  Afghanistan  1957  820.853030   9240934.0
2  Afghanistan  1962  853.100710  10267083.0

# add a new column using square bracket notation
# (note: this modifies gapminder in place)
gapminder['pop_in_millions'] = gapminder['pop'] / 1e06
print(gapminder.head(3))

       country  year   gdpPercap         pop  pop_in_millions
0  Afghanistan  1952  779.445314   8425333.0         8.425333
1  Afghanistan  1957  820.853030   9240934.0         9.240934
2  Afghanistan  1962  853.100710  10267083.0        10.267083

Inspired by the mutate function in R's dplyr package, pandas has an assign method for adding new columns. It returns a new DataFrame and leaves the original unchanged, so we can simply chain assign onto the data frame.

gapminder.assign(pop_in_millions = gapminder['pop'] / 1e06).head(3)

       country  year   gdpPercap         pop  pop_in_millions
0  Afghanistan  1952  779.445314   8425333.0         8.425333
1  Afghanistan  1957  820.853030   9240934.0         9.240934
2  Afghanistan  1962  853.100710  10267083.0        10.267083

With assign, we can also pass a function that computes the new column. Here we use a lambda function to create the new column with population in millions.

gapminder.assign(pop_in_millions = lambda x: x['pop'] / 1e06).head()
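
The callable form is most useful inside a method chain, where the intermediate frame has no name to refer to. Within one assign call, later keyword arguments may also use columns created earlier in the same call. The sketch below assumes three rows typed in from the gapminder output shown earlier (to avoid the download); the `gdp_in_millions` column is a hypothetical addition for illustration only.

```python
import pandas as pd

# Three rows copied by hand from the gapminder output above.
gapminder = pd.DataFrame({
    "country": ["Afghanistan"] * 3,
    "year": [1952, 1957, 1962],
    "gdpPercap": [779.445314, 820.853030, 853.100710],
    "pop": [8425333.0, 9240934.0, 10267083.0],
})

result = (
    gapminder[gapminder["year"] > 1952]  # filtered frame has no name of its own
    .assign(
        # x is the filtered frame, not the full gapminder frame
        pop_in_millions=lambda x: x["pop"] / 1e06,
        # hypothetical derived column; it uses the column created just above
        gdp_in_millions=lambda x: x["gdpPercap"] * x["pop_in_millions"],
    )
)

print(result)
print("pop_in_millions" in gapminder.columns)  # False: original is untouched
```

Passing `gapminder['pop'] / 1e06` directly would also work here because pandas aligns on the index, but the lambda keeps the computation tied to whatever frame the chain has produced so far.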

Suggestion : 3

Create a new column by assigning the output to the DataFrame with a new column name in between the [].
To create a new column, use the [] brackets with the new column name at the left side of the assignment.
I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column.
I want to rename the data columns to the corresponding station identifiers used by openAQ.

In [1]: import pandas as pd

In [2]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

In [3]: air_quality.head()
Out[3]:
                     station_antwerp  station_paris  station_london
datetime
2019-05-07 02:00:00              NaN            NaN            23.0
2019-05-07 03:00:00             50.5           25.0            19.0
2019-05-07 04:00:00             45.0           27.7            19.0
2019-05-07 05:00:00              NaN           50.4            16.0
2019-05-07 06:00:00              NaN           61.9             NaN

In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

In [5]: air_quality.head()
Out[5]:
                     station_antwerp  station_paris  station_london  london_mg_per_cubic
datetime
2019-05-07 02:00:00              NaN            NaN            23.0               43.286
2019-05-07 03:00:00             50.5           25.0            19.0               35.758
2019-05-07 04:00:00             45.0           27.7            19.0               35.758
2019-05-07 05:00:00              NaN           50.4            16.0               30.112
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN

In [6]: air_quality["ratio_paris_antwerp"] = (
   ...:     air_quality["station_paris"] / air_quality["station_antwerp"]
   ...: )
   ...:

In [7]: air_quality.head()
Out[7]:
                     station_antwerp  station_paris  station_london  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00              NaN            NaN            23.0               43.286                  NaN
2019-05-07 03:00:00             50.5           25.0            19.0               35.758             0.495050
2019-05-07 04:00:00             45.0           27.7            19.0               35.758             0.615556
2019-05-07 05:00:00              NaN           50.4            16.0               30.112                  NaN
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN                  NaN

In [8]: air_quality_renamed = air_quality.rename(
   ...:     columns={
   ...:         "station_antwerp": "BETR801",
   ...:         "station_paris": "FR04014",
   ...:         "station_london": "London Westminster",
   ...:     }
   ...: )
   ...:

In [9]: air_quality_renamed.head()
Out[9]:
                     BETR801  FR04014  London Westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN
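
Note that the session above changes air_quality itself each time a column is added with []; only rename returns a new frame. If the original has to stay untouched, the same steps could be written as one chain of assign and rename. This is a sketch under assumptions: the first three rows of the output are typed in by hand instead of reading data/air_quality_no2.csv, and the name `derived` is made up for the example.

```python
import numpy as np
import pandas as pd

# First three rows of the air-quality output above, typed in by hand.
air_quality = pd.DataFrame(
    {
        "station_antwerp": [np.nan, 50.5, 45.0],
        "station_paris": [np.nan, 25.0, 27.7],
        "station_london": [23.0, 19.0, 19.0],
    },
    index=pd.DatetimeIndex(
        ["2019-05-07 02:00:00", "2019-05-07 03:00:00", "2019-05-07 04:00:00"],
        name="datetime",
    ),
)

# assign and rename each return a new DataFrame, so air_quality is not modified.
derived = (
    air_quality
    .assign(
        london_mg_per_cubic=lambda d: d["station_london"] * 1.882,
        ratio_paris_antwerp=lambda d: d["station_paris"] / d["station_antwerp"],
    )
    .rename(
        columns={
            "station_antwerp": "BETR801",
            "station_paris": "FR04014",
            "station_london": "London Westminster",
        }
    )
)

print(derived)
print(list(air_quality.columns))  # still only the three station columns
```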

Suggestion : 4

You can add columns using the dplyr function mutate. This function is aware of the column names and inside the function you can call them unquoted:

murders is the first argument of the select function, and the new data frame (formerly new_table) is the first argument of the filter function.

Notice that select no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the %>%.

Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument. In the code we wrote:

library(tidyverse)

We say that a data table is in tidy format if each row represents one observation and columns represent the different variables available for each of these observations. The murders dataset is an example of a tidy data frame.

#>        state abb region population total
#> 1    Alabama  AL  South    4779736   135
#> 2     Alaska  AK   West     710231    19
#> 3    Arizona  AZ   West    6392017   232
#> 4   Arkansas  AR  South    2915918    93
#> 5 California  CA   West   37253956  1257
#> 6   Colorado  CO   West    5029196    65

To see how the same information can be provided in different formats, consider the following example:

#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79

This tidy dataset provides fertility rates for two countries across the years. This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate. However, this dataset originally came in another format and was reshaped for the dslabs package. Originally, the data was in the following format:

#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79

library(dslabs)
data("murders")
# mutate returns a new data frame; reassign it to keep the rate column
murders <- mutate(murders, rate = total / population * 100000)
head(murders)
#>        state abb region population total rate
#> 1    Alabama  AL  South    4779736   135 2.82
#> 2     Alaska  AK   West     710231    19 2.68
#> 3    Arizona  AZ   West    6392017   232 3.63
#> 4   Arkansas  AR  South    2915918    93 3.19
#> 5 California  CA   West   37253956  1257 3.37
#> 6   Colorado  CO   West    5029196    65 1.29

filter(murders, rate <= 0.71)
#>           state abb        region population total  rate
#> 1        Hawaii  HI          West    1360301     7 0.515
#> 2          Iowa  IA North Central    3046355    21 0.689
#> 3 New Hampshire  NH     Northeast    1316470     5 0.380
#> 4  North Dakota  ND North Central     672591     4 0.595
#> 5       Vermont  VT     Northeast     625741     2 0.320

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
#>           state        region  rate
#> 1        Hawaii          West 0.515
#> 2          Iowa North Central 0.689
#> 3 New Hampshire     Northeast 0.380
#> 4  North Dakota North Central 0.595
#> 5       Vermont     Northeast 0.320
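
For readers coming from pandas, the mutate, select and filter steps above map fairly directly onto assign, column selection and query, and none of them modifies the starting frame. The following is a rough pandas counterpart, not part of the R material: it assumes four rows typed in by hand from the murders output above, and the name `low_rate` is made up for the example.

```python
import pandas as pd

# A hand-typed subset of the murders rows shown above (abb column omitted).
murders = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Hawaii", "Vermont"],
    "region": ["South", "West", "West", "Northeast"],
    "population": [4779736, 710231, 1360301, 625741],
    "total": [135, 19, 7, 2],
})

low_rate = (
    murders
    .assign(rate=lambda d: d["total"] / d["population"] * 100000)  # like mutate
    .loc[:, ["state", "region", "rate"]]                           # like select
    .query("rate <= 0.71")                                         # like filter
)

print(low_rate)
print("rate" in murders.columns)  # False: murders itself was not changed
```

Unlike the R code, which reassigns murders, this chain keeps the result in a separate variable, so the original frame stays available.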