How do I break a DataFrame in pandas into a non-contiguous subset?


You can build up a list of column names like:

columns = ['word', 'count'] + ['norm%d' % i for i in range(1, 51)]
wordvecs_df.loc[:, columns]
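
As a minimal sketch of the column-list approach, the snippet below builds a hypothetical miniature of wordvecs_df (4 'norm' columns instead of 50, plus two 'v' columns) and selects the non-contiguous subset by label. The frame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of wordvecs_df: 4 'norm' columns instead of 50.
names = ['word', 'count'] + ['norm%d' % i for i in range(1, 5)] + ['v1', 'v2']
wordvecs_df = pd.DataFrame([[0] * len(names), [1] * len(names)], columns=names)

# Build the wanted labels explicitly, then select them in one .loc call.
columns = ['word', 'count'] + ['norm%d' % i for i in range(1, 5)]
subset = wordvecs_df.loc[:, columns]
print(list(subset.columns))
```

Because the selection is by label, it does not depend on where the columns sit in the frame.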

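Alternatively, look up the column positions and combine them with np.r_:

import numpy as np

word_loc = wordvecs_df.columns.get_loc('word')
count_loc = wordvecs_df.columns.get_loc('count')
norm1_loc = wordvecs_df.columns.get_loc('norm1')
norm50_loc = wordvecs_df.columns.get_loc('norm50')

# The end of a positional slice is exclusive, so add 1 to include 'norm50'.
# Named col_idx rather than slice to avoid shadowing the built-in.
col_idx = np.r_[word_loc, count_loc, norm1_loc:norm50_loc + 1]

wordvecs_df.iloc[:, col_idx]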
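
A minimal sketch of the positional approach on a small frame may help show why the slice end matters. The column names below are hypothetical stand-ins for the real ones, and the end of the range is bumped by one because positional slices exclude their end point.

```python
import numpy as np
import pandas as pd

# Hypothetical single-row frame; only the column layout matters here.
names = ['word', 'count', 'other', 'norm1', 'norm2', 'norm3', 'v1']
df = pd.DataFrame([list(range(len(names)))], columns=names)

first = df.columns.get_loc('norm1')
last = df.columns.get_loc('norm3')

# np.r_ concatenates scalars and slices into one integer index array.
# The slice end is exclusive, so add 1 to keep the last 'norm' column.
positions = np.r_[df.columns.get_loc('word'), df.columns.get_loc('count'), first:last + 1]
picked = df.iloc[:, positions]
print(list(picked.columns))
```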

You can use pd.concat (pass axis as a keyword; pandas 2.0 no longer accepts it positionally):

pd.concat([df[['word', 'count']], df.loc[:, 'norm1':'norm50']], axis=1)
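
The following sketch runs the pd.concat approach on a small, made-up frame. It relies on the fact that label slices in .loc include both end points, so no off-by-one adjustment is needed here.

```python
import pandas as pd

# Hypothetical single-row frame; only the column layout matters here.
names = ['word', 'count', 'other', 'norm1', 'norm2', 'norm3', 'v1']
df = pd.DataFrame([list(range(len(names)))], columns=names)

# Label slices in .loc include both endpoints, unlike positional slices.
result = pd.concat([df[['word', 'count']], df.loc[:, 'norm1':'norm3']], axis=1)
print(list(result.columns))
```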

Setup
Let's use a smaller example:

i = [0, 1]
c = range(1, 5)
wordvecs_df = pd.concat([
    pd.DataFrame(1, i, ['word', 'count']),
    pd.DataFrame(1, i, c).add_prefix('norm'),
    pd.DataFrame(1, i, c).add_prefix('v')
], axis=1)

wordvecs_df

   word  count  norm1  norm2  norm3  norm4  v1  v2  v3  v4
0     1      1      1      1      1      1   1   1   1   1
1     1      1      1      1      1      1   1   1   1   1

Solution
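Use pd.DataFrame.filter to grab every column whose name starts with 'norm' followed by a digit. Write the pattern as a raw string so that \d is not treated as a string escape:

wordvecs_df.filter(regex=r'^norm\d\d?')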

   norm1  norm2  norm3  norm4
0      1      1      1      1
1      1      1      1      1

We can tack it on to our other two columns via pd.DataFrame.join or pd.concat:

wordvecs_df[['word', 'count']].join(
    wordvecs_df.filter(regex=r'^norm\d\d?'))

   word  count  norm1  norm2  norm3  norm4
0     1      1      1      1      1      1
1     1      1      1      1      1      1
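
One detail worth a sketch: filter applies the pattern as a search, so a pattern without an end anchor also matches longer names. The frame and the column names 'norm12' and 'norm123' below are hypothetical, chosen only to show the difference between a loose and a strict pattern.

```python
import pandas as pd

# Hypothetical frame of ones with some longer 'norm' names mixed in.
df = pd.DataFrame(1, index=[0, 1],
                  columns=['word', 'count', 'norm1', 'norm12', 'norm123', 'v1'])

# No end anchor: anything starting with 'norm' plus a digit is kept.
loose = df.filter(regex=r'^norm\d\d?')

# End anchor: exactly one or two digits after 'norm', nothing else.
strict = df.filter(regex=r'^norm\d{1,2}$')

subset = df[['word', 'count']].join(strict)
print(list(loose.columns))
print(list(subset.columns))
```

If the real frame only ever has norm1 through norm50, both patterns select the same columns; the anchored form is simply safer against surprises.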

Suggestion : 2

If you want to identify and remove duplicate rows in a DataFrame, two methods help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

duplicated returns a boolean vector whose length is the number of rows and which indicates whether each row is a duplicate.

By default, the first observed row of a duplicate set is considered unique. Older pandas releases exposed a take_last parameter to keep the last observed row instead; current releases use keep='last' (or keep=False to mark every duplicate).

Occasionally you will load or create a data set into a DataFrame and want to add an index after you've already done so. There are a couple of different ways.
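
Returning to duplicate detection, here is a minimal sketch of duplicated and drop_duplicates on a made-up frame. It assumes a current pandas release, where the keep parameter selects which row of a duplicate set survives; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: rows 0, 1 and 3 share the same ('a', 'b') pair.
df = pd.DataFrame({'a': ['x', 'x', 'y', 'x'],
                   'b': [1, 1, 2, 1],
                   'c': [10, 20, 30, 40]})

# Boolean vector, one entry per row; the first occurrence is not flagged.
mask = df.duplicated(subset=['a', 'b'])

# Keep the first row of each duplicate set (the default) ...
first_kept = df.drop_duplicates(subset=['a', 'b'])

# ... or keep the last row instead.
last_kept = df.drop_duplicates(subset=['a', 'b'], keep='last')
print(mask.tolist())
```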

In [1]: dates = date_range('1/1/2000', periods=8)

In [2]: df = DataFrame(randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [3]: df
Out[3]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [4]: panel = Panel({'one' : df, 'two' : df - df.mean()})

In [5]: panel
Out[5]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D

In [6]: s = df['A']

In [7]: s[dates[5]]
Out[7]: -0.67368970808837025

In [8]: panel['two']
Out[8]:
                   A         B         C         D
2000-01-01  0.409571  0.113086 -0.610826 -0.936507
2000-01-02  1.152571  0.222735  1.017442 -0.845111
2000-01-03 -0.921390 -1.708620  0.403304  1.270929
2000-01-04  0.662014 -0.310822 -0.141342  0.470985
2000-01-05 -0.484513  0.962970  1.174465 -0.888276
2000-01-06 -0.733231  0.509598 -0.580194  0.724113
2000-01-07  0.345164  0.972995 -0.816769 -0.840143
2000-01-08 -0.430188 -0.761943 -0.446079  1.044010

In [9]: df
Out[9]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [10]: df[['B', 'A']] = df[['A', 'B']]

In [11]: df
Out[11]:
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

In [12]: sa = Series([1, 2, 3], index=list('abc'))

In [13]: dfa = df.copy()

In [14]: sa.b
Out[14]: 2

In [15]: dfa.A
Out[15]:
2000-01-01   -0.282863
2000-01-02   -0.173215
2000-01-03   -2.104569
2000-01-04   -0.706771
2000-01-05    0.567020
2000-01-06    0.113648
2000-01-07    0.577046
2000-01-08   -1.157892
Freq: D, Name: A, dtype: float64

In [16]: panel.one
Out[16]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [17]: sa.a = 5

In [18]: sa
Out[18]:
a    5
b    2
c    3
dtype: int64

In [19]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists

In [20]: dfa
Out[20]:
            A         B         C         D
2000-01-01  0  0.469112 -1.509059 -1.135632
2000-01-02  1  1.212112  0.119209 -1.044236
2000-01-03  2 -0.861849 -0.494929  1.071804
2000-01-04  3  0.721555 -1.039575  0.271860
2000-01-05  4 -0.424972  0.276232 -1.087401
2000-01-06  5 -0.673690 -1.478427  0.524988
2000-01-07  6  0.404705 -1.715002 -1.039268
2000-01-08  7 -0.370647 -1.344312  0.844885

In [21]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column

In [22]: dfa
Out[22]:
            A         B         C         D
2000-01-01  0  0.469112 -1.509059 -1.135632
2000-01-02  1  1.212112  0.119209 -1.044236
2000-01-03  2 -0.861849 -0.494929  1.071804
2000-01-04  3  0.721555 -1.039575  0.271860
2000-01-05  4 -0.424972  0.276232 -1.087401
2000-01-06  5 -0.673690 -1.478427  0.524988
2000-01-07  6  0.404705 -1.715002 -1.039268
2000-01-08  7 -0.370647 -1.344312  0.844885
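
The difference between the two assignment forms above can be sketched on a small, made-up frame. The column names 'E' and 'F' are hypothetical. The sketch assumes current pandas behaviour, where assigning a list to an attribute that is not an existing column sets a plain Python attribute (with a warning) instead of creating a column.

```python
import warnings

import pandas as pd

# Hypothetical three-row frame standing in for dfa.
dfa = pd.DataFrame({'A': [0.1, 0.2, 0.3], 'B': [1.0, 2.0, 3.0]})

# 'A' already exists, so attribute assignment overwrites the column.
dfa.A = list(range(len(dfa.index)))

# 'E' does not exist: this sets a plain attribute, not a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about this assignment
    dfa.E = list(range(len(dfa.index)))

# The bracket form creates the column whether or not it exists.
dfa['F'] = list(range(len(dfa.index)))
print(list(dfa.columns))
```

Using the bracket form throughout avoids having to remember which case applies.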

Suggestion : 3

Slice non-contiguous and contiguous columns in pandas up to the last column of a DataFrame

With numpy:

dataset.iloc[:, np.r_[0, 2:dataset.shape[1]]]

With pandas:

dataset[[dataset.columns[0], *dataset.columns[2:]]]
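
A short sketch comparing the two forms on a made-up frame: both keep the first column, skip the second, and take everything from the third column to the end. The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical single-row frame with five columns.
dataset = pd.DataFrame([[1, 2, 3, 4, 5]], columns=list('abcde'))

# Positional: column 0, then columns 2 through the last one.
by_position = dataset.iloc[:, np.r_[0, 2:dataset.shape[1]]]

# Label-based: unpack the trailing column labels into one list.
by_label = dataset[[dataset.columns[0], *dataset.columns[2:]]]
print(list(by_position.columns))
```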