python: convert multiple instances of the same key into multiple rows


Use product from itertools and OrderedDict from collections:

from itertools import product
from collections import OrderedDict

# Each key maps to a list of values; a key with several values yields several rows.
# A list of pairs keeps the key order stable on every Python version.
a = OrderedDict([
    ('id', ['1221AB12']),
    ('s', ['Party!']),
    ('flag', ['urgent', 'english']),
    ('t', ['0523456789', '0301234567']),
    ('f', ['0412345678']),
])

# product() yields one tuple per combination of values, in key order.
res = product(*a.values())
for line in res:
    print(" ".join("%s=%s" % (m, n) for m, n in zip(a.keys(), line)))

Result (keys appear in insertion order):

id=1221AB12 s=Party! flag=urgent t=0523456789 f=0412345678
id=1221AB12 s=Party! flag=urgent t=0301234567 f=0412345678
id=1221AB12 s=Party! flag=english t=0523456789 f=0412345678
id=1221AB12 s=Party! flag=english t=0301234567 f=0412345678
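
If the rows are needed as data rather than printed text, the same product call can feed a list of dictionaries. This is a minimal sketch, assuming Python 3.7+ (where plain dicts keep insertion order); the helper name expand_rows is hypothetical and not part of the original answer:

```python
from itertools import product


def expand_rows(record):
    """Return one dict per combination of the values listed under each key."""
    keys = list(record)
    return [dict(zip(keys, combo)) for combo in product(*record.values())]


record = {
    "id": ["1221AB12"],
    "s": ["Party!"],
    "flag": ["urgent", "english"],
    "t": ["0523456789", "0301234567"],
    "f": ["0412345678"],
}

rows = expand_rows(record)
for row in rows:
    print(row)
```

Keys with a single value are simply repeated on every row, so the number of rows equals the product of the list lengths (here 1 x 1 x 2 x 2 x 1 = 4).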

Suggestion : 2

03/07/2022

A simple expansion of a single column:

datatable (a: int, b: dynamic)
[
    1, dynamic({"prop1": "a", "prop2": "b"})
]
| mv-expand b

Expanding two columns will first 'zip' the applicable columns and then expand them:

datatable (a: int, b: dynamic, c: dynamic)
[
    1, dynamic({"prop1": "a", "prop2": "b"}), dynamic([5, 4, 3])
]
| mv-expand b, c

If you want to get a Cartesian product of expanding two columns, expand one after the other:

datatable (a: int, b: dynamic, c: dynamic)
[
    1, dynamic({"prop1": "a", "prop2": "b"}), dynamic([5, 6])
]
| mv-expand b
| mv-expand c
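
The difference between zipping and chaining expansions can be mimicked in plain Python. This is only an analogy, not Kusto code: it assumes that a zipped expansion pads the shorter column with nulls (None here) and that each property of a bag becomes its own single-entry bag. The variable names are illustrative.

```python
from itertools import product, zip_longest

# Assumed shape of an expanded property bag: one single-entry bag per property.
b = [{"prop1": "a"}, {"prop2": "b"}]

# Expanding b and c together pairs elements by position ("zip").
c_zip = [5, 4, 3]
zipped = list(zip_longest(b, c_zip))  # the shorter column is padded with None

# Expanding b first and c afterwards pairs every b with every c (Cartesian product).
c_chain = [5, 6]
chained = list(product(b, c_chain))

print(zipped)
print(chained)
```

The zipped form has as many rows as the longest column, while the chained form has the product of the column lengths.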

Expansion of an array with with_itemindex:

range x from 1 to 4 step 1
| summarize x = make_list(x)
| mv-expand with_itemindex=Index x
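
For readers more familiar with Python, the closest analogue of with_itemindex is enumerate. The sketch below is an illustration only, assuming the item index is zero-based; the column names x and Index are taken from the query above.

```python
# Collect the values 1..4 into one list, then expand it back into rows
# while recording each element's position in the list.
xs = list(range(1, 5))
rows = [{"x": x, "Index": i} for i, x in enumerate(xs)]

for row in rows:
    print(row)
```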

Suggestion : 3

The simple join operator is an inner join. Only keys that are present in both pair RDDs are output. When there are multiple values for the same key in one of the inputs, the resulting pair RDD will have an entry for every possible pair of values with that key from the two input RDDs. A simple way to understand this is by looking at Example 4-17.

With combineByKey(), if a key has been seen before while processing that partition, it will instead use the provided function, mergeValue(), with the current value for the accumulator for that key and the new value.

flatMapValues(): Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization.

There are also multiple other actions on pair RDDs that save the RDD, which we will describe in Chapter 5.
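
The behaviour described above, where a repeated key produces one output entry per combination, can be sketched without a Spark cluster. The two helpers below are hypothetical plain-Python stand-ins that only imitate the semantics of an inner join and of a per-value flat map on lists of (key, value) tuples; they are not the Spark API, and real RDDs do not guarantee this output order.

```python
from collections import defaultdict


def inner_join(left, right):
    """Imitate an inner join: one entry per pair of values that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


def flat_map_values(pairs, func):
    """Imitate a per-value flat map: keep the key, emit one entry per element."""
    return [(key, item) for key, value in pairs for item in func(value)]


left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("a", "y"), ("c", "z")]
joined = inner_join(left, right)
print(joined)

tokens = flat_map_values([("doc1", "hello world"), ("doc2", "hi")], str.split)
print(tokens)
```

Key "a" appears twice on each side, so the join yields four entries for it, while "b" and "c" are dropped because each is present on one side only.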

Example 4-1. Creating a pair RDD using the first word as the key in Python

pairs = lines.map(lambda x: (x.split(" ")[0], x))

Example 4-2. Creating a pair RDD using the first word as the key in Scala

val pairs = lines.map(x => (x.split(" ")(0), x))

Example 4-3. Creating a pair RDD using the first word as the key in Java

PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
  public Tuple2<String, String> call(String x) {
    return new Tuple2(x.split(" ")[0], x);
  }
};
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

Example 4-4. Simple filter on second element in Python

result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)

Example 4-5. Simple filter on second element in Scala

pairs.filter {
   case (key, value) => value.length < 20
}

Example 4-6. Simple filter on second element in Java

Function<Tuple2<String, String>, Boolean> longWordFilter =
  new Function<Tuple2<String, String>, Boolean>() {
    public Boolean call(Tuple2<String, String> keyValue) {
      return (keyValue._2().length() < 20);
    }
  };
JavaPairRDD<String, String> result = pairs.filter(longWordFilter);

Suggestion : 4

Last Updated : 18 Jan, 2022

Output in Python:

(4, 'Linnett', 79)
(5, 'Jayden', 45)
(6, 'Sam', 63)
(7, 'Zooey', 82)
(8, 'Robb', 97)
(9, 'Jon', 38)
(10, 'Sansa', 54)
(11, 'Arya', 78)
(12, 'sarah', 90)
(13, 'Ray', 81)

Suggestion : 5

The existence of multiple row/column indices at the same time has not been mentioned within these tutorials. Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.

The air quality measurement station coordinates are stored in a data file air_quality_stations.csv, downloaded using the py-openaq package.

The air quality parameters metadata are stored in a data file air_quality_parameters.csv, downloaded using the py-openaq package.

Sorting the table on the datetime information illustrates also the combination of both tables, with the parameter column defining the origin of the table (either no2 from table air_quality_no2 or pm25 from table air_quality_pm25):

In [1]: import pandas as pd

In [2]: air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
   ...:                               parse_dates=True)
   ...:

In [3]: air_quality_no2 = air_quality_no2[["date.utc", "location",
   ...:                                    "parameter", "value"]]
   ...:

In [4]: air_quality_no2.head()
Out[4]:
                    date.utc location parameter  value
0  2019-06-21 00:00:00+00:00  FR04014       no2   20.0
1  2019-06-20 23:00:00+00:00  FR04014       no2   21.8
2  2019-06-20 22:00:00+00:00  FR04014       no2   26.5
3  2019-06-20 21:00:00+00:00  FR04014       no2   24.9
4  2019-06-20 20:00:00+00:00  FR04014       no2   21.4

In [5]: air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
   ...:                                parse_dates=True)
   ...:

In [6]: air_quality_pm25 = air_quality_pm25[["date.utc", "location",
   ...:                                      "parameter", "value"]]
   ...:

In [7]: air_quality_pm25.head()
Out[7]:
                    date.utc location parameter  value
0  2019-06-18 06:00:00+00:00  BETR801      pm25   18.0
1  2019-06-17 08:00:00+00:00  BETR801      pm25    6.5
2  2019-06-17 07:00:00+00:00  BETR801      pm25   18.5
3  2019-06-17 06:00:00+00:00  BETR801      pm25   16.0
4  2019-06-17 05:00:00+00:00  BETR801      pm25    7.5
In [8]: air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

In [9]: air_quality.head()
Out[9]:
                    date.utc location parameter  value
0  2019-06-18 06:00:00+00:00  BETR801      pm25   18.0
1  2019-06-17 08:00:00+00:00  BETR801      pm25    6.5
2  2019-06-17 07:00:00+00:00  BETR801      pm25   18.5
3  2019-06-17 06:00:00+00:00  BETR801      pm25   16.0
4  2019-06-17 05:00:00+00:00  BETR801      pm25    7.5

In [10]: print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)
Shape of the ``air_quality_pm25`` table:  (1110, 4)

In [11]: print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)
Shape of the ``air_quality_no2`` table:  (2068, 4)

In [12]: print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)
Shape of the resulting ``air_quality`` table:  (3178, 4)
In [13]: air_quality = air_quality.sort_values("date.utc")

In [14]: air_quality.head()
Out[14]:
                       date.utc            location parameter  value
2067  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0
1003  2019-05-07 01:00:00+00:00             FR04014       no2   25.0
100   2019-05-07 01:00:00+00:00             BETR801      pm25   12.5
1098  2019-05-07 01:00:00+00:00             BETR801       no2   50.5
1109  2019-05-07 01:00:00+00:00  London Westminster      pm25    8.0