construct sparse matrix using categorical data


The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:

import numpy as np
from scipy import sparse

# user_item is an (n, 2) array of (user, item) pairs
users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)

points = np.ones(len(user_item), int)
# coo_matrix takes a single (data, (row, col)) tuple
mat = sparse.coo_matrix((points, (I, J)))
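To make the label-to-coordinate idea concrete, here is a minimal self-contained sketch. The user_item array is an assumption: it simply reuses the eight sample (user, item) pairs that appear elsewhere on this page, and the explicit shape argument is added for clarity rather than taken from the answer.

```python
import numpy as np
from scipy import sparse

# Assumed sample data: the eight (user, item) pairs used elsewhere on this page.
user_item = np.array([
    ['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
    ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76'],
])

# np.unique returns the sorted labels plus, for each row, the index of its label.
users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
I, J = I.ravel(), J.ravel()  # keep the codes 1-d regardless of NumPy version

# One entry of value 1 per observed (user, item) pair.
points = np.ones(len(user_item), dtype=int)
mat = sparse.coo_matrix((points, (I, J)), shape=(len(users), len(items)))

dense = mat.toarray()  # rows follow `users`, columns follow `items`
```

Here users[i] and items[j] give the labels for row i and column j, so the original categories can always be recovered from the integer coordinates.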

pandas.get_dummies provides an easier way to convert a categorical column into indicator (one-hot) columns; pass sparse=True if the result should be stored as sparse columns:

import pandas as pd
# construct the data
x = pd.DataFrame([
      ['a', 'abc'],
      ['b', 'def'],
      ['c', 'ghi'],
      ['d', 'abc'],
      ['a', 'ghi'],
      ['e', 'fg'],
      ['f', 'f76'],
      ['b', 'f76']
   ],
   columns=['user', 'item'])
print(x)
#   user item
# 0    a  abc
# 1    b  def
# 2    c  ghi
# 3    d  abc
# 4    a  ghi
# 5    e   fg
# 6    f  f76
# 7    b  f76

# one-hot encode the 'item' column and join the indicator columns
# (iteritems() was removed in pandas 2.0, so items() is used;
#  dtype=int keeps 0/1 output where newer pandas defaults to booleans)
for col, col_data in x.items():
    if str(col) == 'item':
        col_data = pd.get_dummies(col_data, prefix=col, dtype=int)
        x = x.join(col_data)
print(x)
#   user item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0    a  abc         1         0         0        0         0
# 1    b  def         0         1         0        0         0
# 2    c  ghi         0         0         0        0         1
# 3    d  abc         1         0         0        0         0
# 4    a  ghi         0         0         0        0         1
# 5    e   fg         0         0         0        1         0
# 6    f  f76         0         0         1        0         0
# 7    b  f76         0         0         1        0         0

Moreover, you need to convert the array to a list of tuples, because itertools.product yields tuples: ('a', 'abc') in [('a', 'abc'), ('b', 'def')] returns True, but ('a', 'abc') in [['a', 'abc'], ['b', 'def']] does not, since a tuple never compares equal to a list.

import itertools as it
import numpy as np

A = np.array([
   ['a', 'abc'],
   ['b', 'def'],
   ['c', 'ghi'],
   ['d', 'abc'],
   ['a', 'ghi'],
   ['e', 'fg'],
   ['f', 'f76'],
   ['b', 'f76']
])

customers = np.unique(A[:, 0])
items = np.unique(A[:, 1])
A = [tuple(a) for a in A]
combinations = it.product(customers, items)
C = np.array([b in A for b in combinations], dtype=int)
# product() varies items fastest, so rows are customers and columns are items
C.reshape((customers.size, items.size))
>>
   array([[1, 0, 0, 0, 1],
          [0, 1, 1, 0, 0],
          [0, 0, 0, 0, 1],
          [1, 0, 0, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 1, 0, 0]])

Here is my approach using pandas, let me know if it performed better:

# create dataframe from your numpy array
x = pd.DataFrame(x, columns=['User', 'Item'])

# get rows and cols for your sparse dataframe
cols = pd.unique(x['User'])
ncols = cols.shape[0]
rows = pd.unique(x['Item'])
nrows = rows.shape[0]

# initialize your sparse dataframe
# (this is not sparse, but you can check pandas support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols), dtype=int), columns=cols, index=rows)

# define apply function (.ix was removed from pandas, so .loc is used)
def hasUser(xx):
    spdf.loc[xx.name, xx.values] = 1

# groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(lambda xx: hasUser(xx))

Here are the sample dataframes for the code above:

    spdf
    Out[71]:
         a  b  c  d  e  f
    abc  1  0  0  1  0  0
    def  0  1  0  0  0  0
    ghi  1  0  1  0  0  0
    fg   0  0  0  0  1  0
    f76  0  1  0  0  0  1

    x
    Out[72]:
      User Item
    0    a  abc
    1    b  def
    2    c  ghi
    3    d  abc
    4    a  ghi
    5    e   fg
    6    f  f76
    7    b  f76

Suggestion : 2


Use:

scipy.sparse.coo_matrix(df_dummies)

but do not forget to create df_dummies sparse in the first place...

df_dummies = pandas.get_dummies(df, sparse = True)
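As a hedged sketch of this sparse route: depending on the pandas version, handing a sparse DataFrame straight to scipy.sparse.coo_matrix may first convert it to a dense array, so the example below uses the DataFrame.sparse.to_coo() accessor instead. The sample frame and the dtype=np.int8 choice are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the (user, item) pairs used elsewhere on this page.
df = pd.DataFrame({
    'user': ['a', 'b', 'c', 'd', 'a', 'e', 'f', 'b'],
    'item': ['abc', 'def', 'ghi', 'abc', 'ghi', 'fg', 'f76', 'f76'],
})

# sparse=True stores every indicator column as a pandas sparse array.
df_dummies = pd.get_dummies(df, sparse=True, dtype=np.int8)

# The .sparse accessor converts an all-sparse DataFrame to a SciPy COO matrix.
coo = df_dummies.sparse.to_coo()

# Column labels are kept on the DataFrame, not on the SciPy matrix.
column_names = list(df_dummies.columns)
```

Each row has exactly one user indicator and one item indicator set, so the matrix holds two non-zero entries per row.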

This answer will keep the data as sparse as possible and avoids memory issues when using Pandas get_dummies.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

df = pd.DataFrame({
   'rowid': [1, 2, 3, 4, 5],
   'category': ['c1', 'c2', 'c1', 'c3', 'c1']
})

print('Input data frame\n{0}'.format(df))

print('Encode column category as numerical variables')
print(LabelEncoder().fit_transform(df.category))

print('Encode column category as dummy matrix')
print(OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1, 1)).todense())

print('Concat with the original data frame as a matrix')
dummy_matrix = OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1, 1))
# as_matrix() was removed from pandas; to_numpy() is its replacement
df_as_sparse = sparse.csr_matrix(df.drop(labels=['category'], axis=1).to_numpy())
sparse_combined = sparse.hstack((df_as_sparse, dummy_matrix), format='csr')
print(sparse_combined.todense())

Suggestion : 3

In this Vignette we will see how to transform a dense data.frame (dense = few zeroes in the matrix) with categorical variables to a very sparse matrix (sparse = lots of zero in the matrix) of numeric features.

require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
head(df)
##    ID Treatment  Sex Age Improved
## 1: 57   Treated Male  27     Some
## 2: 46   Treated Male  29     None
## 3: 77   Treated Male  30     None
## 4: 17   Treated Male  32   Marked
## 5: 36   Treated Male  46   Marked
## 6: 23   Treated Male  58   Marked
str(df)
## Classes 'data.table' and 'data.frame': 84 obs. of 5 variables:
## $ ID : int 57 46 77 17 36 23 75 39 33 55 ...
## $ Treatment: Factor w/ 2 levels "Placebo","Treated": 2 2 2 2 2 2 2 2 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 27 29 30 32 46 58 59 59 63 63 ...
## $ Improved : Ord.factor w/ 3 levels "None"<"Some"<..: 2 1 1 3 3 3 1 3 1 1 ...
## - attr(*, ".internal.selfref" )=<externalptr>
head(df[, AgeDiscret := as.factor(round(Age / 10, 0))])

Suggestion : 4

A symmetric sparse matrix arises as the adjacency matrix of an undirected graph; it can be stored efficiently as an adjacency list.

Many software libraries support sparse matrices, and provide solvers for sparse matrix equations. The following are open-source:

  • sprs implements sparse matrix data structures and linear algebra algorithms in pure Rust.
  • Trilinos, a large C++ library, with sub-libraries dedicated to the storage of dense and sparse matrices and solution of corresponding linear systems.

V = [5 8 3 6]
COL_INDEX = [0 1 2 1]
ROW_INDEX = [0 1 2 3 4]

To extract a row, we first define:

row_start = ROW_INDEX[row]
row_end = ROW_INDEX[row + 1]
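The row extraction above can be sketched in plain Python as follows. The helper name extract_row and the column count n_cols = 4 are assumptions for illustration; the three arrays themselves do not record how many columns the matrix has.

```python
# CSR arrays from the example above.
V = [5, 8, 3, 6]
COL_INDEX = [0, 1, 2, 1]
ROW_INDEX = [0, 1, 2, 3, 4]


def extract_row(row, n_cols=4):
    """Return one row as a dense list (n_cols is an assumed width)."""
    row_start = ROW_INDEX[row]
    row_end = ROW_INDEX[row + 1]
    dense = [0] * n_cols
    # The slice [row_start:row_end] holds this row's values and their columns.
    for value, col in zip(V[row_start:row_end], COL_INDEX[row_start:row_end]):
        dense[col] = value
    return dense


# One fewer row than ROW_INDEX has entries.
rows = [extract_row(r) for r in range(len(ROW_INDEX) - 1)]
```

Because ROW_INDEX[row + 1] - ROW_INDEX[row] is the number of stored entries in a row, an empty row simply yields an empty slice and an all-zero result.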

Suggestion : 5

sparse_categorical_bottleneck , sparse_categorical_speed , sparse_categorical

import numpy as np
import pysal
import scipy.sparse as sp
import itertools as iter
# note: chisqprob is only available in SciPy versions before 1.0
from scipy.stats import f, chisqprob
import numpy.linalg as la
import pandas as pd
from datetime import datetime as dt
import matplotlib.pyplot as plt
%pylab inline
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
#OLD
"""
def spcategorical2(n_cat_ids):
    '''
    Returns a dummy matrix given an array of categorical variables.

    Parameters
    ----------
    n_cat_ids : array
                A 1d vector of the categorical labels for n observations.

    Returns
    -------
    dummy     : array
                A sparse matrix of dummy (indicator/binary) variables for the
                categorical data.

    '''
    if np.squeeze(n_cat_ids).ndim == 1:
        cat_set = np.unique(n_cat_ids)
        n = len(n_cat_ids)
        C = len(cat_set)
        row_map = dict((id, np.where(cat_set == id)[0]) for id in n_cat_ids)
        indices = np.array([row_map[row] for row in n_cat_ids]).flatten()
        indptr = np.zeros((n + 1, ), dtype=int)
        indptr[:-1] = list(np.arange(n))
        indptr[-1] = n
        return sp.csr_matrix((np.ones(n), indices, indptr))
    else:
        raise IndexError("The index %s is not understood" % col)
"""
def spcategorical2(n_cat_ids):
    '''
    Returns a dummy matrix given an array of categorical variables.

    Parameters
    ----------
    n_cat_ids : array
                A 1d vector of the categorical labels for n observations.

    Returns
    -------
    dummy     : array
                A sparse matrix of dummy (indicator/binary) variables for the
                categorical data.

    '''
    if np.squeeze(n_cat_ids).ndim == 1:
        cat_set = np.unique(n_cat_ids)
        n = len(n_cat_ids)
        C = len(cat_set)
        # the labels are used directly as column indices, so they must
        # already be non-negative integer codes
        indices = n_cat_ids
        # one stored entry per row
        indptr = np.arange(n + 1, dtype=int)
        return sp.csr_matrix((np.ones(n), indices, indptr))
    else:
        # the original message interpolated an undefined name `col`
        raise IndexError("Expected a 1d array of categorical labels")
def spcategorical1(data):
    '''
    Returns a dummy matrix given an array of categorical variables.

    Parameters
    ----------
    data : array
           A 1d vector of the categorical variable.

    Returns
    -------
    dummy_matrix
           A sparse matrix of dummy (indicator/binary) variables for the
           categorical data.

    '''
    if np.squeeze(data).ndim == 1:
        tmp_arr = np.unique(data)
        # start with zero rows and stack one indicator row per category
        tmp_dummy = sp.csr_matrix((0, len(data)))
        for each in tmp_arr[:, None]:
            row = sp.csr_matrix((each == data).astype(float))
            tmp_dummy = sp.vstack([tmp_dummy, row])
        # transpose so that observations are rows and categories are columns
        tmp_dummy = tmp_dummy.T
        return tmp_dummy
    else:
        # the original message interpolated an undefined name `col`
        raise IndexError("Expected a 1d array of categorical labels")
def spcategorical1a(data):
    '''
    Returns a dummy matrix given an array of categorical variables.

    Parameters
    ----------
    data : array
           A 1d vector of the categorical variable.

    Returns
    -------
    dummy_matrix
           A sparse matrix of dummy (indicator/binary) variables for the
           categorical data.

    '''
    if np.squeeze(data).ndim == 1:
        tmp_arr = np.unique(data)
        n = len(data)
        C = len(tmp_arr)
        tmp_dummy = sp.dok_matrix((n, C))
        for each in tmp_arr[:, None]:
            row = (each == data).astype(float)
            # the category value is used as the column index, so the
            # categories must be integer codes in the range 0..C-1
            tmp_dummy[:, each[0]] = row.reshape((n, 1))
        return tmp_dummy.tocsr()
    else:
        # the original message interpolated an undefined name `col`
        raise IndexError("Expected a 1d array of categorical labels")
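The functions above that use labels directly as column indices only work when the labels are already integer codes. A minimal sketch of one way to lift that restriction is to let np.unique produce the codes first. The name spcategorical_sketch is hypothetical and is not part of the notebook.

```python
import numpy as np
import scipy.sparse as sp


def spcategorical_sketch(labels):
    """Hypothetical variant: build a CSR dummy matrix from arbitrary labels."""
    labels = np.asarray(labels).ravel()
    # cats: sorted unique labels; codes: column index of each observation
    cats, codes = np.unique(labels, return_inverse=True)
    codes = codes.ravel()
    n = labels.size
    # Exactly one stored entry per row, so row i occupies slot [i, i + 1).
    indptr = np.arange(n + 1)
    dummy = sp.csr_matrix((np.ones(n), codes, indptr), shape=(n, cats.size))
    return cats, dummy


cats, dummy = spcategorical_sketch(['x', 'y', 'x', 'z'])
dense = dummy.toarray()
```

Returning cats alongside the matrix keeps the mapping from column position back to the original category label.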

Suggestion : 6

A sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix. Representing all of these zero values would result in high memory usage, so in practice only the non-zero values of the sparse matrix are stored, which significantly reduces both the space used to represent the data and the time needed to scan the matrix.

A simple sparse matrix will be constructed to show the representation formats of a sparse matrix in Python. The sparsity of this matrix can be calculated as the ratio of zero elements to total elements:

import numpy as np
from scipy import sparse
X = np.array([
   [0, 0, 0, 3, 0, 0, 4],
   [0, 5, 0, 0, 0, 0, 0],
   [0, 0, 5, 0, 0, 4, 0],
   [4, 0, 0, 0, 0, 0, 1],
   [0, 2, 0, 0, 3, 0, 0]
])
print(X)
[[0 0 0 3 0 0 4]
 [0 5 0 0 0 0 0]
 [0 0 5 0 0 4 0]
 [4 0 0 0 0 0 1]
 [0 2 0 0 3 0 0]]
sparsity = 1.0 - (np.count_nonzero(X) / X.size)
print('The sparsity of X is ', sparsity)
The sparsity of X is  0.7428571428571429
# Convert X to a sparse matrix

S1 = sparse.csr_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S1)}

Sparse Matrix:\n {S1}

Sparse Data: {S1.data}

Indices of columns: {S1.indices}

Pointers for data: {S1.indptr}
""")
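To see how the three CSR arrays printed above describe the matrix, the following sketch rebuilds the dense array from data, indices and indptr alone. The helper csr_to_dense is a hypothetical name used only for illustration; in practice S1.toarray() does the same job.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [0, 0, 0, 3, 0, 0, 4],
    [0, 5, 0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0, 4, 0],
    [4, 0, 0, 0, 0, 0, 1],
    [0, 2, 0, 0, 3, 0, 0],
])
S1 = sparse.csr_matrix(X)


def csr_to_dense(data, indices, indptr, shape):
    """Hypothetical helper: expand CSR arrays back into a dense array."""
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        # indptr[row]:indptr[row + 1] is the slice of entries stored for this row
        start, end = indptr[row], indptr[row + 1]
        dense[row, indices[start:end]] = data[start:end]
    return dense


rebuilt = csr_to_dense(S1.data, S1.indices, S1.indptr, S1.shape)
```

Only the 9 non-zero values, their 9 column indices and 6 row pointers are stored, compared with 35 cells in the dense array.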