split pandas column into two


The simplest solution is:

df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
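
As a quick check of the one-liner above, here is a minimal sketch on made-up data (the column values are assumptions for illustration, not from the original question). Passing n=1 limits the split to the first separator, so any further separators stay in the second column, and expand=True returns a DataFrame rather than a Series of lists.

```python
import pandas as pd

# Illustrative data; the second value contains an extra space on purpose.
df = pd.DataFrame({"AB": ["A1 B1", "A2 B2 extra"]})

# n=1 splits only on the first space; expand=True yields one column per piece.
df[["A", "B"]] = df["AB"].str.split(" ", n=1, expand=True)
print(df)
```

n is passed by keyword here because recent pandas releases accept only the separator positionally.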

For a simple split over a known separator (such as dashes or whitespace), the .str.split() method is enough. It operates on a column (Series) of strings and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

The .str attribute is an accessor that collects methods which treat each element in a column as a string, and then applies the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
   U  L
0  A  a
1  B  b
2  C  c

The .str accessor also supports indexing: .str[i] takes element i of each entry. This indexing interface doesn't really care whether each element is actually a string, as long as it can be indexed, so it also works on the lists produced by split:

>>> df['AB'].str.split('-', n=1).str[0]
0    A1
1    A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0    B1
1    B2
Name: AB, dtype: object

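In older pandas versions it was then a simple matter of using Python tuple unpacking of iterables. Iterating over .str has since been deprecated, so the unpacking below, shown with the output it produced in those older versions, raises a TypeError in current releases:

>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2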
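
Because unpacking the .str accessor directly is not supported in recent pandas releases, the same result can be reached with the .str[i] indexing described earlier. The following is a minimal sketch under that assumption, reusing the 'AB' example column; the intermediate name parts is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})

# Split once into a Series of lists, then pick each position with .str[i].
parts = df["AB"].str.split("-", n=1)
df["A"] = parts.str[0]
df["B"] = parts.str[1]
print(df)
```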

There might be a better way, but here's one approach:

                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL

df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

You can extract the different parts out quite neatly using a regex pattern:

In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]
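
A self-contained variant of the extraction above can be sketched as follows. Two details are assumptions of this sketch rather than part of the original answer: the outer group is made non-capturing ((?:...)) so that no unnamed column appears in the result, and a literal space is matched after the five digits so the captured names carry no leading space.

```python
import pandas as pd

df = pd.DataFrame({"row": [
    "00000 UNITED STATES",
    "01000 ALABAMA",
    "01001 Autauga County, AL",
]})

# Raw strings keep the backslashes intact; adjacent literals are concatenated.
pattern = (
    r"(?P<fips>\d{5}) "
    r"(?:(?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))"
)

# Each named group becomes a column; groups that did not match are NaN.
parts = df["row"].str.extract(pattern)
print(parts)
```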

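To explain the somewhat long regex:

(?P<fips>\d{5})

matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

does one of two things (|). Either

(?P<state>[A-Z ]*$)

matches any number (*) of capital letters or spaces ([A-Z ]) before the end of the string ($) and names this "state", or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$)

matches anything else (.*?), then a comma and a space, then the two-letter state code before the end of the string ($).

Alternatively, split on the first space only:

df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)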

You can use str.split, which splits on whitespace by default, with the parameter expand=True to get a DataFrame, and assign the result to new columns:

df = pd.DataFrame({
   'row': ['00000 UNITED STATES', '01000 ALABAMA',
      '01001 Autauga County, AL', '01003 Baldwin County, AL',
      '01005 Barbour County, AL'
   ]
})
print(df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL

df[['a', 'b']] = df['row'].str.split(n=1, expand=True)
print(df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

To also remove the original column, use DataFrame.pop:

df[['a', 'b']] = df.pop('row').str.split(n=1, expand=True)
print(df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

This is the same as:

df[['a', 'b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print(df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you omit n, you can check that the split returns a 4-column DataFrame, not only 2:

print(df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL
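
The difference in column count can be checked directly. This is a small sketch on a subset of the rows above; the variable names all_parts and two_parts are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"row": [
    "00000 UNITED STATES",
    "01000 ALABAMA",
    "01001 Autauga County, AL",
]})

# Without n, every whitespace run is a split point: one column per word.
all_parts = df["row"].str.split(expand=True)

# With n=1, only the first whitespace run is used: exactly two columns.
two_parts = df["row"].str.split(n=1, expand=True)

print(all_parts.shape)
print(two_parts.shape)
```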

The solution then is to append the new DataFrame with join:

df = pd.DataFrame({
   'row': ['00000 UNITED STATES', '01000 ALABAMA',
      '01001 Autauga County, AL', '01003 Baldwin County, AL',
      '01005 Barbour County, AL'
   ],
   'a': range(5)
})
print(df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print(df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]

Suggestion : 2

You can use the pandas Series.str.split() function to split strings in the column around a given separator/delimiter. It is similar to the Python string split() function but applies to the entire dataframe column. By default the result is a column of lists; you can still split such a column of lists into multiple columns, but if your objective is to split a text column into multiple columns it's better to pass expand=True. Conversely, just as Series.str.split() splits a column, Series.str.cat() concatenates values from multiple columns into a single column. The following is the syntax:

# df is a pandas dataframe
# default parameters of the pandas Series.str.split() function
df['Col'].str.split(pat, n=-1, expand=False)
# to split into multiple columns by delimiter
df['Col'].str.split(delimiter, expand=True)

Let’s look at the usage of the above method with the help of some examples. First, we will create a dataframe that we will be using throughout this tutorial.

import pandas as pd

# create a dataframe
df = pd.DataFrame({
   'Address': ['4860 Sunset Boulevard,San Francisco,California',
      '3055 Paradise Lane,Salt Lake City,Utah',
      '682 Main Street,Detroit,Michigan',
      '9001 Cascade Road,Kansas City,Missouri'
   ]
})
# display the dataframe
df

Apply the pandas series str.split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.

# split column into multiple columns by delimiter
df['Address'].str.split(',', expand = True)

Output:

                       0               1           2
0  4860 Sunset Boulevard   San Francisco  California
1     3055 Paradise Lane  Salt Lake City        Utah
2        682 Main Street         Detroit    Michigan
3      9001 Cascade Road     Kansas City    Missouri

Let’s now add the three new columns resulting from the split to the dataframe df.

# split column and add new columns to df
df[['Street', 'City', 'State']] = df['Address'].str.split(',', expand = True)
# display the dataframe
df
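
The reverse operation, Series.str.cat(), can be sketched on the same kind of data. This is an illustrative example rather than the tutorial's own code; the column name "Rejoined" is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Address": [
    "4860 Sunset Boulevard,San Francisco,California",
    "3055 Paradise Lane,Salt Lake City,Utah",
]})

# Split into three columns, as in the text above.
df[["Street", "City", "State"]] = df["Address"].str.split(",", expand=True)

# Concatenate them back, row by row, with the same separator.
# "Rejoined" is a hypothetical column name used only in this sketch.
df["Rejoined"] = df["Street"].str.cat([df["City"], df["State"]], sep=",")
print(df["Rejoined"])
```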

Suggestion : 3

November 10, 2018 by cmdline

We can use Pandas’ string manipulation functions to do that easily. Let us first create a simple Pandas data frame using Pandas’ DataFrame function.

# import pandas as pd
import pandas as pd
# create a new data frame
df = pd.DataFrame({
   'Name': ['Steve Smith', 'Joe Nadal',
      'Roger Federer'
   ],
   'Age': [32, 34, 36]
})
df

str.split() with the expand=True option returns a DataFrame; without it we get a pandas Series object as output.

df.Name.str.split(expand=True)
       0        1
0  Steve    Smith
1    Joe    Nadal
2  Roger  Federer

If we want to have the results in the original dataframe with specific names, we can add them as new columns as shown below.

df[['First', 'Last']] = df.Name.str.split(" ", expand=True)
df

Note that the first call above applied str.split without specifying a delimiter. By default, str.split splits on whitespace, and we can specify a different delimiter as follows. For example, if the text in our column were separated by an underscore,

df = pd.DataFrame({
   'Name': ['Steve_Smith', 'Joe_Nadal',
      'Roger_Federer'
   ],
   'Age': [32, 34, 36]
})
df
Age Name
0 32 Steve_Smith
1 34 Joe_Nadal
2 36 Roger_Federer

we can use the underscore as our delimiter to split the column into two columns.

df[['First', 'Last']] = df.Name.str.split("_", expand=True)
df
Age Name First Last
0 32 Steve_Smith Steve Smith
1 34 Joe_Nadal Joe Nadal
2 36 Roger_Federer Roger Federer
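
The difference between the default whitespace split and an explicit single-space delimiter only shows up when values contain repeated spaces. A minimal sketch, using made-up names (the double space in the first value is deliberate and is an assumption for illustration):

```python
import pandas as pd

names = pd.Series(["Steve  Smith", "Joe Nadal"])  # note the double space

# No delimiter: split on runs of whitespace, so no empty pieces appear.
default_split = names.str.split()

# Explicit " ": split on every single space, producing an empty string
# between two consecutive spaces.
space_split = names.str.split(" ")

print(default_split.tolist())
print(space_split.tolist())
```

With expand=True the second form would therefore yield three columns for this data, which is worth keeping in mind before assigning to exactly two column names.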

Suggestion : 4

We use the .str accessor to apply split() to each string in the Series. Once split, the strings are kept in two columns that we add to the dataframe: 'last_name' and 'first_name'. expand=True has to be specified, as otherwise the string will not be divided into different columns.

Most probably you’ll be acquiring your data from an API, database, text or comma separated value file. But in this example we’ll use a simple dataframe that we’ll define manually out of a dictionary.

# Python3
import pandas as pd
targets = pd.DataFrame({
   "manager": ["Johns;Tim ", "Mcgregor; Dave", "DeRocca; Leo", "Haze; Jim"],
   "target": [42000, 85000, 45000, 33000]
})

Let’s look at the data:

targets.head()

manager = targets['manager']
targets[['last_name', 'first_name']] = manager.str.split(";", n = 1, expand = True)
targets

If you would like to keep only one of the new columns we just created, you can use the following code:

targets.drop('first_name', axis = 1)
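
The sample data above has stray spaces around some names ("Johns;Tim " and "Mcgregor; Dave"), which a plain split on ";" keeps. One way to clean them, sketched here as an optional extra step rather than part of the original tutorial, is to strip each piece after splitting; the intermediate name parts is introduced for illustration.

```python
import pandas as pd

targets = pd.DataFrame({
    "manager": ["Johns;Tim ", "Mcgregor; Dave", "DeRocca; Leo", "Haze; Jim"],
    "target": [42000, 85000, 45000, 33000],
})

# Split on the first semicolon, then strip surrounding whitespace per piece.
parts = targets["manager"].str.split(";", n=1, expand=True)
targets["last_name"] = parts[0].str.strip()
targets["first_name"] = parts[1].str.strip()
print(targets)
```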

Suggestion : 5

In Pandas, a DataFrame column can contain delimited string values, that is, multiple values in a single column separated by dashes, whitespace, or commas. The apply() method can also be used to split one column's values into multiple columns: DataFrame.apply() can execute a function on all values of single or multiple columns, and inside that function we can split the string value into multiple values and assign these split values to new columns. For example,
   RollNo student_name student_address
   0 10 Reema Surat_Gujarat
   1 20 Rekha Pune_Maharastra
   2 30 Jaya Delhi_Uttar Pradesh

Here, we have the requirement to split a single column into two different columns. For example, in the above DataFrame split the student_address column to two different columns “city” and “state” like,

   RollNo student_name city state
   0 10 Reema Surat Gujarat
   1 20 Rekha Pune Maharastra
   2 30 Jaya Delhi Uttar Pradesh

Syntax of Series.str.split() method

Series.str.split(pat = None, n = -1, expand = False)

Output

   RollNo student_name      student_address
0      10        Reema        Surat_Gujarat
1      20        Rekha      Pune_Maharastra
2      30         Jaya  Delhi_Uttar Pradesh
***********
   RollNo student_name      student_address   city          state
0      10        Reema        Surat_Gujarat  Surat        Gujarat
1      20        Rekha      Pune_Maharastra   Pune     Maharastra
2      30         Jaya  Delhi_Uttar Pradesh  Delhi  Uttar Pradesh
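
The code that produced the output above is not included here. One way to obtain the same city and state columns with DataFrame.apply(), as described at the start of this suggestion, is sketched below; the helper name split_address is hypothetical, and this is not necessarily the code the tutorial used.

```python
import pandas as pd

df = pd.DataFrame({
    "RollNo": [10, 20, 30],
    "student_name": ["Reema", "Rekha", "Jaya"],
    "student_address": ["Surat_Gujarat", "Pune_Maharastra", "Delhi_Uttar Pradesh"],
})

def split_address(row):
    """Hypothetical helper: split one row's address on the first underscore."""
    city, state = row["student_address"].split("_", 1)
    return pd.Series({"city": city, "state": state})

# axis=1 calls the function once per row; the returned Series become columns.
df[["city", "state"]] = df.apply(split_address, axis=1)
print(df)
```

For a plain delimiter like this, df["student_address"].str.split("_", n=1, expand=True) gives the same columns with less code; apply() is mainly useful when the per-row logic is more involved.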

Split two different columns values into four new columns, where comma is the delimiter.

import pandas as pd

# create a Dataframe
df = pd.DataFrame({
   'RollNo': [10, 20, 30],
   'student_name': ['Reema,Thakkar', 'Rekha,Chande', 'Jaya,Sachde'],
   'student_address': ['Surat,Gujarat', 'Pune,Maharastra', 'Delhi,Uttar Pradesh']
})

# show the dataframe
print(df)

print('***********')

# Split column student_name to Name and Surname
df[['Name', 'Surname']] = df["student_name"].str.split(",", expand = True)

# Split column student_address to City and State
df[['City', 'State']] = df["student_address"].str.split(",", expand = True)

print(df)

Suggestion : 6

In Python, the pandas library includes built-in functionality that allows different tasks to be performed with only a few lines of code. One of these is splitting a text column into separate columns. We first initialize a pandas dataframe. Then we split the dataframe according to a delimiter, which in this case is a comma (,). We consider the first row of the split output as the header and the rest as data.

#importing pandas library
import pandas as pd

#initializing pandas input dataframe
df = pd.DataFrame(["ID, Name, City, Country",
      "21, Ali, Islamabad, Pakistan",
      "22, Usman, Multan, Pakistan",
      "23, Ahmad,  Karachi, Pakistan",
      "24, Arslan, Lahore, Pakistan"
   ],
   columns = ['row'])

#splitting rows based on commas
result = df.row.str.split(',', expand = True)

#making first row as header of the data
header = result.iloc[0]
result = result[1: ]
result.columns = header
print(result)
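
Splitting on a bare comma leaves the space that follows each comma attached to the values and to the header names (for example " Name"). A possible variant, sketched here under the assumption that this whitespace is unwanted, splits on a comma followed by optional whitespace; the intermediate name parts is introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(["ID, Name, City, Country",
                   "21, Ali, Islamabad, Pakistan",
                   "23, Ahmad,  Karachi, Pakistan"],
                  columns=["row"])

# A multi-character pattern is treated as a regular expression:
# a comma followed by any amount of whitespace.
parts = df["row"].str.split(r",\s*", expand=True)

# The first row holds the header; the remaining rows are data.
result = parts.iloc[1:].reset_index(drop=True)
result.columns = parts.iloc[0].tolist()
print(result)
```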

Suggestion : 7


 user_df['name'].str.split()