pandas groupby - case sensitive issues in groups

Just as a follow-up to this: you don't actually need to overwrite the Keyword column when you do the grouping. You can instead do the whole transformation in the call to groupby:

grouped = df.groupby(df['Keyword'].str.lower())

So as an example you could then have:

import pandas

df = pandas.DataFrame({
   'Keyword': ['Attorney', 'ATTORNEY', 'foo'],
   'x': [1, 2, 42]
})

df.groupby(df['Keyword'].str.lower()).sum()

Which outputs:

          x
Keyword
attorney  3
foo      42
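
If the keywords can contain non-ASCII characters, pandas 0.25+ also exposes Series.str.casefold(), which is a slightly more aggressive normalization than str.lower(). A minimal variant of the same idea, selecting the numeric column explicitly:

import pandas

df = pandas.DataFrame({
   'Keyword': ['Attorney', 'ATTORNEY', 'foo'],
   'x': [1, 2, 42]
})

# casefold() behaves like lower() but also folds characters such as "ß",
# so the grouping key is more robust for non-ASCII keywords.
print(df.groupby(df['Keyword'].str.casefold())['x'].sum())

Selecting the x column keeps the output identical across pandas versions, since newer releases no longer silently drop non-numeric columns from the aggregation.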

Suggestion : 2

The original list: ['man', 'a', 'gEek', 'for', 'GEEK', 'FoR']
The list after Categorization: [
   ['man'],
   ['a'],
   ['gEek', 'GEEK'],
   ['for', 'FoR']
]
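
The code that produced this output isn't included in the excerpt; a minimal sketch that reproduces it, keying each word on its lowercase form and preserving first-appearance order:

test_list = ['man', 'a', 'gEek', 'for', 'GEEK', 'FoR']

# Bucket words by their lowercase form; dicts keep insertion order,
# so groups appear in order of first occurrence.
groups = {}
for word in test_list:
   groups.setdefault(word.lower(), []).append(word)

res = list(groups.values())
print("The original list: " + str(test_list))
print("The list after Categorization: " + str(res))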

Suggestion : 3

These are the relevant parameters of pandas.Series.str.contains(): pat is a character sequence or regular expression; flags are passed through to the re module, e.g. re.IGNORECASE; and regex, if True, treats pat as a regular expression rather than a literal string. Be aware of this when regex is left at True: '.0' as a regex matches any character followed by a 0, not just the literal '.0'.

>>> import numpy as np
>>> import pandas as pd
>>> s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])
>>> s1.str.contains('og', regex=False)
0    False
1     True
2    False
3    False
4      NaN
dtype: object

>>> ind = pd.Index(['Mouse', 'dog', 'house and parrot', '23.0', np.nan])
>>> ind.str.contains('23', regex=False)
Index([False, False, False, True, nan], dtype='object')

>>> s1.str.contains('oG', case=True, regex=True)
0    False
1    False
2    False
3    False
4      NaN
dtype: object

>>> s1.str.contains('og', na=False, regex=True)
0    False
1     True
2    False
3    False
4    False
dtype: bool

>>> s1.str.contains('house|parrot', regex=True)
0    False
1    False
2     True
3    False
4      NaN
dtype: object

>>> import re
>>> s1.str.contains('PARROT', flags=re.IGNORECASE, regex=True)
0    False
1    False
2     True
3    False
4      NaN
dtype: object
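
Relevant to the case-sensitivity theme of this page, str.contains() also accepts a case parameter, so a simple case-insensitive match does not need re.IGNORECASE. A small sketch (not part of the quoted docs) reusing s1 from above:

>>> s1.str.contains('PARROT', case=False, regex=True)
0    False
1    False
2     True
3    False
4      NaN
dtype: object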

Suggestion : 4

Additionally, it is strongly discouraged to use case-sensitive column names; the pandas API on Spark disallows them by default, although you can turn on spark.sql.caseSensitive in the Spark configuration to enable them at your own risk. Duplicated column names are disallowed because Spark SQL does not allow them in general, and the pandas API on Spark inherits this behavior. Columns with leading and trailing __ are reserved: the pandas API on Spark uses such internal columns to handle behaviors like the index, so those names are discouraged and not guaranteed to work.

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set('spark.executor.memory', '2g')
# Pandas API on Spark automatically uses this Spark context with the configurations set.
SparkContext(conf=conf)

import pyspark.pandas as ps
...
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("pandas-on-spark")
builder = builder.config("spark.sql.execution.arrow.pyspark.enabled", "true")
# Pandas API on Spark automatically uses this Spark session with the configurations set.
builder.getOrCreate()

import pyspark.pandas as ps
...
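
Tying this back to the case-sensitivity note above, the same builder pattern can be used to turn on spark.sql.caseSensitive (at your own risk, as the docs put it); a minimal sketch:

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("pandas-on-spark")
# Off by default; enables case-sensitive column name resolution in Spark SQL.
builder = builder.config("spark.sql.caseSensitive", "true")
builder.getOrCreate()

import pyspark.pandas as ps  # picks up the session configured above
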
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'id': range(10)})
>>> psdf = psdf[psdf.id > 5]
>>> psdf.spark.explain()
== Physical Plan ==
*(1) Filter (id#1L > 5)
+- *(1) Scan ExistingRDD[__index_level_0__#0L, id#1L]
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'id': range(10)})
>>> psdf = psdf[psdf.id > 5]
>>> psdf['id'] = psdf['id'] + (10 * psdf['id'] + psdf['id'])
>>> psdf = psdf.groupby('id').head(2)
>>> psdf.spark.explain()
== Physical Plan ==
*(3) Project [__index_level_0__#0L, id#31L]
+- *(3) Filter (isnotnull(__row_number__#44) AND (__row_number__#44 <= 2))
   +- Window [row_number() windowspecdefinition(__groupkey_0__#36L, __natural_order__#16L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __row_number__#44], [__groupkey_0__#36L], [__natural_order__#16L ASC NULLS FIRST]
      +- *(2) Sort [__groupkey_0__#36L ASC NULLS FIRST, __natural_order__#16L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(__groupkey_0__#36L, 200), true, [id=#33]
            +- *(1) Project [__index_level_0__#0L, (id#1L + ((id#1L * 10) + id#1L)) AS __groupkey_0__#36L, (id#1L + ((id#1L * 10) + id#1L)) AS id#31L, __natural_order__#16L]
               +- *(1) Project [__index_level_0__#0L, id#1L, monotonically_increasing_id() AS __natural_order__#16L]
                  +- *(1) Filter (id#1L > 5)
                     +- *(1) Scan ExistingRDD[__index_level_0__#0L, id#1L]

>>> psdf = psdf.spark.local_checkpoint()  # or psdf.spark.checkpoint()
>>> psdf.spark.explain()
== Physical Plan ==
*(1) Project [__index_level_0__#0L, id#31L]
+- *(1) Scan ExistingRDD[__index_level_0__#0L, id#31L, __natural_order__#59L]
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'id': range(10)}).sort_values(by="id")
>>> psdf.spark.explain()
== Physical Plan ==
*(2) Sort [id#9L ASC NULLS LAST], true, 0
+- Exchange rangepartitioning(id#9L ASC NULLS LAST, 200), true, [id=#18]
   +- *(1) Scan ExistingRDD[__index_level_0__#8L, id#9L]
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'id': range(10)})
>>> psdf.rank().spark.explain()
== Physical Plan ==
*(4) Project [__index_level_0__#16L, id#24]
+- Window [avg(cast(_w0#26 as bigint)) windowspecdefinition(id#17L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS id#24], [id#17L]
   +- *(3) Project [__index_level_0__#16L, _w0#26, id#17L]
      +- Window [row_number() windowspecdefinition(id#17L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _w0#26], [id#17L ASC NULLS FIRST]
         +- *(2) Sort [id#17L ASC NULLS FIRST], false, 0
            +- Exchange SinglePartition, true, [id=#48]
               +- *(1) Scan ExistingRDD[__index_level_0__#16L, id#17L]

Suggestion : 5

In this article we will discuss different ways to compare strings in Python: using the == operator (with or without ignoring case), using the is operator, or using regex. To confirm that the contents of two strings are not the same, the != operator can be used as well. Sometimes the is operator is also used to compare strings, but it will not always work, because there is a fundamental difference between is and == in Python: is checks whether two names refer to the same object, while == checks whether the contents are equal.

Suppose we have two strings i.e.

firstStr = "sample"
secStr = "sample"

if firstStr == secStr:
   print('Both Strings are same')
else:
   print('Strings are not same')

Let's see some actual examples:

if "abcd" > "abcc":
   print('"abcd" is greater than "abcc"')

if "Abc" < "abc":
   print('"Abc" is less than "abc"')

if "abcdd" > "abc":
   print('"abcdd" is greater than "abc"')
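
The middle comparison holds because the operators compare strings lexicographically by Unicode code point, and uppercase letters come before lowercase letters; a quick check:

# 'A' is 65 and 'a' is 97, so "Abc" sorts before "abc".
print(ord('A'), ord('a'))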

Now let’s change the second variable secStr i.e.

secStr = "sample is".split()[0]

Now firstStr is secStr will generally return False, because secStr refers to a different string object even though its contents are identical. So, to check whether the contents of two strings are equal, we should use the == operator; for the string objects above it returns True:

if firstStr == secStr:
   print('Contents of both Strings are same')
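
The introduction also mentions comparing while ignoring case, which is exactly the pitfall behind the pandas groupby question at the top of this page. A minimal sketch of the two usual approaches: normalizing with lower()/casefold(), or using a regex with re.IGNORECASE.

import re

firstStr = "SAMple"
secStr = "sample"

# Normalize both sides before comparing; casefold() also handles
# characters like the German "ß" that lower() leaves alone.
if firstStr.casefold() == secStr.casefold():
   print('Contents of both Strings are same (ignoring case)')

# The same check with a regex, anchored via fullmatch and re.IGNORECASE.
if re.fullmatch(re.escape(secStr), firstStr, flags=re.IGNORECASE):
   print('Contents of both Strings are same (ignoring case, regex)')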

Suggestion : 6

Although database, table, and trigger names are not case-sensitive on some platforms, you should not refer to one of them using different lettercases within the same statement. By default, table aliases are case-sensitive on Unix, but not on Windows or macOS. Partition, subpartition, column, index, stored routine, event, and resource group names are not case-sensitive on any platform, nor are column aliases. Exception: if you are using InnoDB tables and want to avoid these data transfer problems, set lower_case_table_names=1 on all platforms to force names to be converted to lowercase.

Although database, table, and trigger names are not case-sensitive on some platforms, you should not refer to one of these using different cases within the same statement. The following statement would not work because it refers to a table both as my_table and as MY_TABLE:

mysql> SELECT * FROM my_table WHERE MY_TABLE.col = 1;

By default, table aliases are case-sensitive on Unix, but not so on Windows or macOS. The following statement would not work on Unix, because it refers to the alias both as a and as A:

mysql> SELECT col_name FROM tbl_name AS a
WHERE a.col_name = 1 OR A.col_name = 2;

Suggestion : 7

Much like the way pandas.DataFrame allows you to access column names as attributes as well as keys, some objects in the SWAT package also have multiple namespaces mapped to their attributes. This is especially true of CASTable objects. When setting an attribute, the name is checked against the valid parameter names for a table: if it matches a table parameter, it is set as a CAS table parameter; otherwise it is just set on the object as a standard Python attribute. When getting the groupby attribute, however, you always get back the real Python groupby method, which corresponds to pandas.DataFrame.groupby(). So in the example below the groupby method is returned when getting the attribute, but the table's groupby parameter is set when setting it. The reason for this is that the dynamic look-thru for CAS actions and table parameters only happens if there isn't a real Python attribute or method defined, and on CASTable objects the groupby method is defined to match pandas.DataFrame.groupby().

Since the SWAT API tries to blend the world of CAS and Pandas into a single world, you have to be aware of whether you are calling a CAS action or a method from the Pandas API. CAS actions will always return a CASResults object (which is a subclass of Python’s dictionary).

In[1]: out = tbl.summary()

In[2]: type(out)
Out[2]: swat.cas.results.CASResults

In[3]: out = tbl.serverstatus()
Note: Grid node action status report: 1 nodes, 11 total actions executed.

In[4]: type(out)
Out[4]: swat.cas.results.CASResults
In[5]: out = tbl.head()

In[6]: type(out)
Out[6]: swat.SASDataFrame

In[7]: out = tbl.mean()

In[8]: type(out)
Out[8]: pandas.core.series.Series

In[9]: out = tbl.Make

In[10]: type(out)
Out[10]: swat.cas.table.CASColumn

These collisions can manifest themselves in ways that seem confusing. Here is an example.

In [11]: tbl.groupby
Out[11]: <bound method CASTable.groupby of CASTable('TMPCBHHE__G', caslib='CASUSER(castest)')>

In [12]: tbl.groupby = ['Origin']

In [13]: tbl.groupby
Out[13]: <bound method CASTable.groupby of CASTable('TMPCBHHE__G', caslib='CASUSER(castest)', groupby=['Origin'])>

In [14]: tbl.params
Out[14]: {'name': 'TMPCBHHE__G', 'caslib': 'CASUSER(castest)', 'groupby': ['Origin']}
In[15]: tbl.params.groupby = ['Origin']

In[16]: tbl.params
Out[16]: {
   'name': 'TMPCBHHE__G',
   'caslib': 'CASUSER(castest)',
   'groupby': ['Origin']
}
In[17]: tbl[['Origin', 'Cylinders']].simple.groupby()
Out[17]: [Groupby]

Groupby for TMPCBHHE__G

    Origin Origin_f  Cylinders Cylinders_f  Rank
0     Asia     Asia        NaN           .    14
1     Asia     Asia        3.0           3    13
2     Asia     Asia        4.0           4    12
3     Asia     Asia        6.0           6    11
4     Asia     Asia        8.0           8    10
5   Europe   Europe        4.0           4     9
6   Europe   Europe        5.0           5     8
7   Europe   Europe        6.0           6     7
8   Europe   Europe        8.0           8     6
9   Europe   Europe       12.0          12     5
10     USA      USA        4.0           4     4
11     USA      USA        6.0           6     3
12     USA      USA        8.0           8     2
13     USA      USA       10.0          10     1

+ Elapsed: 0.0166 s, user: 0.016 s, sys: 0.003 s, mem: 4.43 mb
In [18]: tbl.groupby('Origin')
Out[18]: <swat.cas.table.CASTableGroupBy at 0x7f633e338250>