python: how to assign unique ids to pandas dataframe entries


This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

>>> df['id'] = df.groupby(['LastName', 'FirstName']).ngroup()
>>> df

  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
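
As a self-contained sketch of the same idea, the snippet below builds a small frame (the rows are illustrative, not taken from the answer above) and compares the default numbering with `sort=False`. The assumption being illustrated is that the default numbers groups in sorted key order, while `sort=False` numbers them in the order they first appear.

```python
import pandas as pd

# Illustrative data; the row order is chosen so the two numberings differ.
people = pd.DataFrame({
    "FirstName": ["Tom", "Alex", "Tom", "David"],
    "LastName": ["Jones", "Thompson", "Jones", "Smith"],
})

keys = ["LastName", "FirstName"]

# Default (sort=True): groups are numbered in sorted key order.
id_sorted = people.groupby(keys).ngroup()

# sort=False: groups are numbered in the order they are first seen.
id_seen = people.groupby(keys, sort=False).ngroup()

people = people.assign(id_sorted=id_sorted, id_seen=id_seen)
print(people)
```

Which numbering is preferable depends on whether the ids should stay aligned with the original row order or with the sorted names.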

I checked timings and, for the small dataset in this example, Alexander's answer (the category-codes approach shown further down) is faster:

%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 µs per loop

%timeit df.assign(id=df.groupby(['LastName', 'FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop

However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

import faker
import pandas as pd

fakenames = faker.Faker()
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({
    'FirstName': first,
    'LastName': last
})
# Append the first 2000 rows again so that 2000 of the 7000 rows are duplicates
df2 = pd.concat([df2, df2.iloc[:2000]])
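
The comparison can be repeated outside IPython with the standard `timeit` module. The following is only a sketch: it replaces faker with hypothetical name pools (`first_pool`, `last_pool`) so that it runs without extra dependencies, and the helper names `ids_by_category` and `ids_by_ngroup` are made up for illustration. The timings will vary by machine and pandas version, so no expected numbers are given.

```python
import random
import timeit

import pandas as pd

random.seed(0)

# Hypothetical name pools standing in for the faker-generated names.
first_pool = [f"First{i}" for i in range(300)]
last_pool = [f"Last{i}" for i in range(300)]

base = pd.DataFrame({
    "FirstName": [random.choice(first_pool) for _ in range(5000)],
    "LastName": [random.choice(last_pool) for _ in range(5000)],
})
# Append the first 2000 rows again, as in the description above.
df2 = pd.concat([base, base.iloc[:2000]], ignore_index=True)


def ids_by_category(frame):
    """Ids from the category codes of the concatenated name columns."""
    key = frame["LastName"] + "_" + frame["FirstName"]
    return key.astype("category").cat.codes


def ids_by_ngroup(frame):
    """Ids from groupby().ngroup() on the two name columns."""
    return frame.groupby(["LastName", "FirstName"]).ngroup()


runs = 20
t_cat = timeit.timeit(lambda: ids_by_category(df2), number=runs) / runs
t_grp = timeit.timeit(lambda: ids_by_ngroup(df2), number=runs) / runs
print(f"category codes: {t_cat * 1e3:.2f} ms per call")
print(f"groupby ngroup: {t_grp * 1e3:.2f} ms per call")
```

The two methods may number the groups differently, but they should split the rows into the same groups.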

This approach concatenates the two name columns and uses the category codes of the combined key as the id. Of course, multiple people with the same name would have the same id.

>>> df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2

This method allows the 'id' column name to be defined with a variable. I also find it a little easier to read than the assign or groupby methods.

import pandas as pd

# Create Dataframe
df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

newIdName = 'id'  # Set new name here.

df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes

Output:

>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
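
If the id column name and the key columns both need to be configurable, the same technique can be wrapped in a small function. This is a minimal sketch; the function name `add_id_column` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def add_id_column(frame, key_columns, id_name="id", sep="_"):
    """Return a copy of frame with one integer id per distinct key combination."""
    # Join the key columns into a single string key per row.
    key = frame[key_columns].astype(str).agg(sep.join, axis=1)
    # Category codes number the distinct keys in sorted order, starting at 0.
    return frame.assign(**{id_name: key.astype("category").cat.codes})


df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

result = add_id_column(df, ["LastName", "FirstName"], id_name="person_id")
print(result)
```

One caveat of joining with a separator: keys such as ('a_b', 'c') and ('a', 'b_c') produce the same string, so the separator should be a character that cannot occur in the data.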

Suggestion : 2

This article discusses how to find the unique elements in a single column, in multiple columns, or in each column of a dataframe. To fetch the unique values in one column, such as 'Age' in the dataframe created below, call unique() on that column. To get the unique values across multiple columns, merge the contents of those columns into a single Series object and call unique() on it. If you are interested in the count of unique elements rather than the values themselves, use Series.nunique().

Series.unique() returns a NumPy array of the unique elements in the Series object:

Series.unique(self)

Series.nunique() returns the number of unique elements; by default, NaN is not counted:

Series.nunique(self, dropna=True)

First, create a dataframe:

import numpy as np
import pandas as pd

# List of Tuples
employees = [('jack', 34, 'Sydney', 5),
             ('Riti', 31, 'Delhi', 7),
             ('Aadi', 16, np.nan, 11),
             ('Mohit', 31, 'Delhi', 7),
             ('Veena', np.nan, 'Delhi', 4),
             ('Shaunak', 35, 'Mumbai', 5),
             ('Shaun', 35, 'Colombo', 11)
]

# Create a DataFrame object
empDfObj = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

print("Contents of the Dataframe : ")
print(empDfObj)

If we are interested in the count of unique elements in a column rather than the values themselves, we can use Series.nunique():

# Count unique values in column 'Age' of the dataframe
uniqueValues = empDfObj['Age'].nunique()

print('Number of unique values in column "Age" of the dataframe : ')
print(uniqueValues)

By default, nunique() does not count NaN. To include NaN in the count, pass dropna=False:

# Count unique values in column 'Age', including NaN
uniqueValues = empDfObj['Age'].nunique(dropna=False)

print('Number of unique values in column "Age" including NaN')
print(uniqueValues)
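
The calls discussed in this suggestion can be compared side by side. The sketch below uses a reduced copy of the 'Age' and 'Experience' columns (the frame name `emp` is just for illustration) and also shows one way to collect unique values across two columns by stacking them into a single Series.

```python
import numpy as np
import pandas as pd

# Reduced copy of the example data: only the numeric columns.
emp = pd.DataFrame({
    "Age": [34, 31, 16, 31, np.nan, 35, 35],
    "Experience": [5, 7, 11, 7, 4, 5, 11],
})

age_values = emp["Age"].unique()                        # distinct values, NaN included
age_count = emp["Age"].nunique()                        # NaN excluded by default
age_count_with_nan = emp["Age"].nunique(dropna=False)   # NaN counted once

# Unique values across two columns: stack them into one Series first.
combined = pd.concat([emp["Age"], emp["Experience"]], ignore_index=True)
combined_values = combined.unique()

# nunique() on a DataFrame counts the unique values in each column.
per_column = emp.nunique()

print(age_values, age_count, age_count_with_nan)
print(combined_values)
print(per_column)
```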

Suggestion : 3

I am assuming, based on your comments above, that the station id is always located at the second row of the first column in all the csv files.

import pandas as pd
from io import StringIO

# sample data
s = """Station ID,Sensor,Serial Num,
12345678,123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,"""

# read your file
df = pd.read_csv(StringIO(s), usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                 skiprows=[0, 1], names=['Variable', 'Date', 'Time', 'FPR-D Oil'])

# read it again but only get the first value of the second row
sid = pd.read_csv(StringIO(s), skiprows=1, nrows=1, header=None)[0].iloc[0]

# filter and copy so you are not assigning to a slice of a frame
new_df = df[df['Variable'] == 'Precip'].copy()

# assign sid to a new column
new_df.loc[:, 'id'] = sid

print(new_df)

  Variable        Date      Time  FPR-D Oil        id
0   Precip  02/01/2020  09:45:00      -2.19  12345678
3   Precip  02/01/2020  10:00:00      -2.19  12345678

Try this:

df['Station ID'] = 12345678
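
When several files share this layout, the same steps can be wrapped in a function and the results concatenated. The following is a rough sketch under stated assumptions: the helper `read_station_file`, the column names in `COLUMNS`, and the two in-memory files are all hypothetical, and the sample lines have no trailing commas, unlike the sample data above.

```python
from io import StringIO

import pandas as pd

COLUMNS = ["Variable", "Date", "Time", "Value"]  # hypothetical column names


def read_station_file(handle):
    """Read one logger file and tag every data row with its station ID."""
    lines = handle.read().splitlines()
    # Assumption: the station ID is the first field of the second line.
    station_id = lines[1].split(",")[0].strip()
    # Everything after the two header lines is data.
    body = StringIO("\n".join(lines[2:]))
    frame = pd.read_csv(body, header=None, names=COLUMNS)
    frame["id"] = station_id
    return frame


# Two hypothetical in-memory files standing in for CSV files on disk.
file_a = StringIO(
    "Station ID,Sensor Serial Num\n"
    "12345678,123456789\n"
    "Precip,02/01/2020,09:45:00,-2.19\n"
    "Batt Voltage,02/01/2020,09:45:00,13.4\n"
)
file_b = StringIO(
    "Station ID,Sensor Serial Num\n"
    "87654321,987654321\n"
    "Precip,02/01/2020,09:45:00,-1.05\n"
    "Batt Voltage,02/01/2020,09:45:00,12.9\n"
)

combined = pd.concat(
    [read_station_file(f) for f in (file_a, file_b)], ignore_index=True
)
precip = combined[combined["Variable"] == "Precip"]
print(precip)
```

Reading the ID from the raw text keeps it as a string, which avoids losing leading zeros if a station ID has any.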

Suggestion : 4

To get the unique values of a dataframe column, type the name of the dataframe, then the name of the column, and then unique(), separated by dots. For example, in titanic.embark_town.unique(), titanic is the name of the dataframe and embark_town is the name of the column. More generally, the method version starts with the name of the Series object you want to work with, followed by a dot and the method name, unique(). The Series can be a dataframe column or a lone Series object that exists on its own, outside of a dataframe.


index  first   last       dob
0      peter  jones  20000101
1       john    doe  19870105
2       adam  smith  19441212
3       john    doe  19870105
4      jenny   fast  19640822

index  first   last       dob          id
0      peter  jones  20000101  1244821450
1       john    doe  19870105  1742118427
2       adam  smith  19441212  1841181386
3       john    doe  19870105  1742118427
4      jenny   fast  19640822  1687411973

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

import numpy as np

np.random.seed(1)

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, axis=1).unique().tolist()

# generate ids
ids = np.random.randint(low=1e9, high=1e10, size=len(names))

# map ids to names
maps = {k: v for k, v in zip(names, ids)}

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, axis=1).map(maps)

index  first   last       dob          id
0      peter  jones  20000101  9176146523
1       john    doe  19870105  8292931172
2       adam  smith  19441212  4108641136
3       john    doe  19870105  8292931172
4      jenny   fast  19640822  6385979058

def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)
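
One caveat with the hash-based variants: Python salts the built-in hash() of strings per process unless PYTHONHASHSEED is fixed, so the resulting ids can change between runs. A possible alternative, sketched here with a hypothetical helper named `stable_id`, derives the id from a SHA-256 digest so that the same name always maps to the same number.

```python
import hashlib

import pandas as pd


def stable_id(text, digits=10):
    """Map a string to an integer below 10**digits, identical across runs."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** digits)


df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})

# Build one key per person from first and last name, then hash it.
key = df["first"] + " " + df["last"]
df["id"] = key.map(stable_id)
print(df)
```

As with any truncated hash, two different names could in principle receive the same id, so a uniqueness check on the name-to-id mapping is advisable for large datasets.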

Suggestion : 5

The examples below show three ways to assign a unique ID number to each group of identical values in a column (here x1) in R. The first groups by a variable and assigns an ID using the transform function of the basic installation of the R programming language. The second uses the group_by and mutate functions of the dplyr package. The third uses the functions of the data.table package.

data <- data.frame(x1 = rep(letters[1:3], each = 3),   # Create example data
                   x2 = 11:19)
data                                                    # Print example data

data_id1 <- transform(data,                             # Create ID by group
                      ID = as.numeric(factor(x1)))
data_id1                                                # Print data with group ID

install.packages("dplyr")                               # Install dplyr package
library("dplyr")                                        # Load dplyr

data_id2 <- data %>%                                    # Create ID by group
  group_by(x1) %>%
  dplyr::mutate(ID = cur_group_id())
data_id2                                                # Print data with group ID
# # A tibble: 9 x 3
# # Groups:   x1 [3]
#   x1       x2    ID
#   <chr> <int> <int>
# 1 a        11     1
# 2 a        12     1
# 3 a        13     1
# 4 b        14     2
# 5 b        15     2
# 6 b        16     2
# 7 c        17     3
# 8 c        18     3
# 9 c        19     3

install.packages("data.table")                          # Install data.table package
library("data.table")                                   # Load data.table

data_id3 <- data                                        # Duplicate data
setDT(data_id3)[, ID := .GRP, by = x1]                  # Create ID by group
data_id3                                                # Print data with group ID