How to efficiently left outer join two sorted lists


Here's a memory-efficient version that produces one key/value pair at a time:

def left_outer_join(keys, pairs, default=None):
    right = iter(pairs)
    right_key = float('-inf')  # sentinel: any left key must be larger than it
    for left_key in keys:
        if left_key == right_key:  # *keys* and *right* are in sync
            value = right_value  # from previous iteration
        elif left_key < right_key:  # *keys* is behind *right*
            value = default
        else:  # left_key > right_key: *keys* is ahead of *right*
            for right_key, right_value in right:  # catch up with *keys*
                if left_key <= right_key:  # drop while left_key > right_key
                    break
            value = right_value if left_key == right_key else default
        yield left_key, value

Example:

left_sorted_list = [1, 2, 3, 4, 5]
right_sorted_list = [
   [2, 21],
   [4, 45],
   [6, 67]
]
print(list(left_outer_join(left_sorted_list, right_sorted_list)))
# -> [(1, None), (2, 21), (3, None), (4, 45), (5, None)]
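
To illustrate the "one pair at a time" point, here is a minimal sketch of the same merge-join idea written with next() instead of a float sentinel. The name merge_left_join and the sample generators are made up for this illustration; it assumes both inputs are sorted ascending by key and that right-hand keys are unique.

```python
from itertools import count, islice

def merge_left_join(keys, pairs, default=None):
    """Lazily left-outer-join sorted keys against sorted (key, value) pairs."""
    right = iter(pairs)
    current = next(right, None)  # current right-hand pair, None once exhausted
    for key in keys:
        # Skip right-hand pairs whose key is smaller than the current left key.
        while current is not None and current[0] < key:
            current = next(right, None)
        if current is not None and current[0] == key:
            yield key, current[1]
        else:
            yield key, default

# Both inputs are generators and the left one is infinite,
# so neither side is ever materialized as a full list.
evens = ((n, n * n) for n in count(0, 2))  # (0, 0), (2, 4), (4, 16), ...
joined = list(islice(merge_left_join(count(1), evens), 5))
print(joined)
# -> [(1, None), (2, 4), (3, None), (4, 16), (5, None)]
```

Because only one right-hand pair is held at a time, this should also work when the inputs are read from files or database cursors.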

This is no more efficient than what you have, but it is more pythonic.

left_sorted_list = [1, 2, 3, 4, 5]
right_sorted_list = [
   [2, 21],
   [4, 45]
]

right_dict = dict(right_sorted_list)
left_outer_join = [
   [l, right_dict.get(l)]
   for l in left_sorted_list
]

For the result I used tuples, so there are fewer square brackets ;)

left_sorted_list = [1, 2, 3, 4, 5]
right_sorted_list = [
   [2, 21],
   [4, 45]
]

d = dict(right_sorted_list)  # if you have a list of pairs, just pass it to dict()
print([(x, d[x] if x in d else None) for x in left_sorted_list])
# -> [(1, None), (2, 21), (3, None), (4, 45), (5, None)]

Suggestion : 2

Last Updated : 24 Dec, 2018

The original list 1 is: [1, 5, 6, 9, 11]
The original list 2 is: [3, 4, 7, 8, 10]
The combined sorted list is: [1, 3, 4, 5, 6, 7, 8, 9, 10, 11]
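
The code that produced this output is not included here. One way to get the same result, as a sketch using the standard library's heapq.merge (the variable names are assumptions):

```python
from heapq import merge

list_1 = [1, 5, 6, 9, 11]
list_2 = [3, 4, 7, 8, 10]

# heapq.merge walks both already-sorted inputs in step and yields items lazily,
# so the inputs are never re-sorted.
combined = list(merge(list_1, list_2))

print("The original list 1 is:", list_1)
print("The original list 2 is:", list_2)
print("The combined sorted list is:", combined)
```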

Suggestion : 3

03/11/2022

Person magnus = new("Magnus", "Hedlund");
Person terry = new("Terry", "Adams");
Person charlotte = new("Charlotte", "Weiss");
Person arlene = new("Arlene", "Huff");

Pet barley = new("Barley", terry);
Pet boots = new("Boots", terry);
Pet whiskers = new("Whiskers", charlotte);
Pet bluemoon = new("Blue Moon", terry);
Pet daisy = new("Daisy", magnus);

// Create two lists.
List<Person> people = new() { magnus, terry, charlotte, arlene };
List<Pet> pets = new() { barley, boots, whiskers, bluemoon, daisy };

var query =
    from person in people
    join pet in pets on person equals pet.Owner into gj
    from subpet in gj.DefaultIfEmpty()
    select new
    {
        person.FirstName,
        PetName = subpet?.Name ?? string.Empty
    };

foreach (var v in query)
{
    Console.WriteLine($"{v.FirstName + ":",-15}{v.PetName}");
}

record class Person(string FirstName, string LastName);
record class Pet(string Name, Person Owner);

// This code produces the following output:
//
// Magnus:        Daisy
// Terry:         Barley
// Terry:         Boots
// Terry:         Blue Moon
// Charlotte:     Whiskers
// Arlene:

Suggestion : 4

  • We can load these CSV files as Pandas DataFrames using the Pandas read_csv command, and examine the contents using the DataFrame head() command.
  • Inner Merge / Inner join: The default Pandas behaviour, only keep rows where the merge “on” value exists in both the left and right dataframes.
  • High performance database joins with Pandas, a comparison of merge speeds by Wes McKinney, creator of Pandas.
  • Combining DataFrames with Pandas on “Python for Ecologists” by DataCarpentry

Let's see how we can correctly add the “device” and “platform” columns to the user_usage dataframe using the Pandas merge command.

result = pd.merge(user_usage,
   user_device[['use_id', 'platform', 'device']],
   on = 'use_id')
result.head()

You can change the merge to a left merge with the “how” parameter of the merge command. The top of the result dataframe contains the successfully matched items, and the bottom contains the rows in user_usage that didn’t have a corresponding use_id in user_device.

result = pd.merge(user_usage,
   user_device[['use_id', 'platform', 'device']],
   on = 'use_id',
   how = 'left')
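
The CSV files behind user_usage and user_device are not shown here, so the following is only a small sketch with made-up rows (the column values are assumptions) to make the left-merge behaviour visible. It also passes indicator=True, a standard pd.merge option that adds a _merge column saying where each row came from.

```python
import pandas as pd

# Tiny stand-ins for the real dataframes; the values are invented for illustration.
user_usage = pd.DataFrame({
    "use_id": [1, 2, 3],
    "monthly_mb": [500.0, 1200.0, 300.0],
})
user_device = pd.DataFrame({
    "use_id": [1, 3],
    "platform": ["android", "ios"],
    "device": ["GT-I9505", "iPhone7,2"],
})

# how="left" keeps every row of user_usage; unmatched rows get NaN on the right.
result = pd.merge(user_usage,
                  user_device[["use_id", "platform", "device"]],
                  on="use_id",
                  how="left",
                  indicator=True)
print(result)
```

Rows whose _merge value is left_only are the users with no device information; an inner merge would have dropped them.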

For example's sake, we can repeat this process with a right join / right merge, simply by replacing how='left' with how='right' in the Pandas merge command.

result = pd.merge(user_usage,
   user_device[['use_id', 'platform', 'device']],
   on = 'use_id',
   how = 'right')

Coming back to our original problem, we have already merged user_usage with user_device, so we have the platform and device for each user. Originally, we used an “inner merge” as the default in Pandas, and as such, we only have entries for users where there is also device information. We’ll redo this merge using a left join to keep all users, and then use a second left merge to finally get the device manufacturers in the same dataframe.

# First, add the platform and device to the user usage - use a left join this time.
result = pd.merge(user_usage,
   user_device[['use_id', 'platform', 'device']],
   on = 'use_id',
   how = 'left')

# At this point, the platform and device columns are included
# in the result along with all columns from user_usage

# Now, based on the "device" column in result, match the "Model" column in devices.
devices.rename(columns = {
   "Retail Branding": "manufacturer"
}, inplace = True)
result = pd.merge(result,
   devices[['manufacturer', 'Model']],
   left_on = 'device',
   right_on = 'Model',
   how = 'left')
print(result.head())

With our merges complete, we can use the data aggregation functionality of Pandas to quickly work out the mean usage for users based on device manufacturer. Note that the small sample size creates even smaller groups, so I wouldn’t attribute any statistical significance to these particular results!

result.groupby("manufacturer").agg({
   "outgoing_mins_per_month": "mean",
   "outgoing_sms_per_month": "mean",
   "monthly_mb": "mean",
   "use_id": "count"
})

Suggestion : 5

  • left_on: Columns or index levels from the left DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series.
  • right_on: Columns or index levels from the right DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series.
  • Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes.
  • on: Column or index level names to join on. Must be found in both the left and right DataFrame and/or Series objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames and/or Series will be inferred to be the join keys.

In [1]: df1 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A0", "A1", "A2", "A3"],
   ...:         "B": ["B0", "B1", "B2", "B3"],
   ...:         "C": ["C0", "C1", "C2", "C3"],
   ...:         "D": ["D0", "D1", "D2", "D3"],
   ...:     },
   ...:     index=[0, 1, 2, 3],
   ...: )
   ...:

In [2]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A4", "A5", "A6", "A7"],
   ...:         "B": ["B4", "B5", "B6", "B7"],
   ...:         "C": ["C4", "C5", "C6", "C7"],
   ...:         "D": ["D4", "D5", "D6", "D7"],
   ...:     },
   ...:     index=[4, 5, 6, 7],
   ...: )
   ...:

In [3]: df3 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A8", "A9", "A10", "A11"],
   ...:         "B": ["B8", "B9", "B10", "B11"],
   ...:         "C": ["C8", "C9", "C10", "C11"],
   ...:         "D": ["D8", "D9", "D10", "D11"],
   ...:     },
   ...:     index=[8, 9, 10, 11],
   ...: )
   ...:

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)

In [6]: result = pd.concat(frames, keys=["x", "y", "z"])

In [7]: result.loc["y"]
Out[7]:
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

frames = [process_your_file(f) for f in files]
result = pd.concat(frames)

Suggestion : 6

  • The matching values of the key variables in the left and right tables do not have to be in the same order. Outer joins can perform one-to-many and many-to-one matches between the key variables of the two tables. That is, a value that occurs once in a key variable of the left table can have multiple matches in the right table. Similarly, a value that occurs once in a key variable of the right table can have multiple matches in the left table.
  • Variables in table T that came from Tleft contain null values in the rows that have no match from Tright. Similarly, variables in T that came from Tright contain null values in those rows that had no match from Tleft.
  • Use the outerjoin function to create a new table, T, with data from tables Tleft and Tright. Match up rows with common values in the key variable, Key1, but also retain rows whose key values don’t have a match.
  • The vectors of row labels of Tleft and Tright can be key variables. Row labels are the row names of a table, or the row times of a timetable.

Tleft = table([5;12;23;2;15;6], ...
    {'cheerios';'pizza';'salmon';'oreos';'lobster';'pizza'}, ...
    'VariableNames',{'Age','FavoriteFood'}, ...
    'RowNames',{'Amy','Bobby','Holly','Harry','Marty','Sally'})

Tleft = 6×2 table
             Age    FavoriteFood
             ___    ____________

    Amy       5     {'cheerios'}
    Bobby    12     {'pizza'   }
    Holly    23     {'salmon'  }
    Harry     2     {'oreos'   }
    Marty    15     {'lobster' }
    Sally     6     {'pizza'   }
Tright = table({'cheerios';'oreos';'pizza';'salmon';'cake'}, ...
    [110;160;140;367;243], ...
    {'A-';'D';'B';'B';'C-'}, ...
    'VariableNames',{'FavoriteFood','Calories','NutritionGrade'})

Tright = 5×3 table
    FavoriteFood    Calories    NutritionGrade
    ____________    ________    ______________

    {'cheerios'}      110           {'A-'}
    {'oreos'   }      160           {'D' }
    {'pizza'   }      140           {'B' }
    {'salmon'  }      367           {'B' }
    {'cake'    }      243           {'C-'}
T = outerjoin(Tleft, Tright)

T = 7×5 table
    Age    FavoriteFood_Tleft    FavoriteFood_Tright    Calories    NutritionGrade
    ___    __________________    ___________________    ________    ______________

    NaN    {0x0 char  }          {'cake'    }             243       {'C-'    }
      5    {'cheerios'}          {'cheerios'}             110       {'A-'    }
     15    {'lobster' }          {0x0 char  }             NaN       {0x0 char}
      2    {'oreos'   }          {'oreos'   }             160       {'D'     }
     12    {'pizza'   }          {'pizza'   }             140       {'B'     }
      6    {'pizza'   }          {'pizza'   }             140       {'B'     }
     23    {'salmon'  }          {'salmon'  }             367       {'B'     }
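
For readers following along in Python rather than MATLAB, the outer-join behaviour described in this suggestion can be sketched roughly as below. This is not how outerjoin is implemented. The function name outer_join and the list-of-dicts layout are assumptions for illustration, NutritionGrade is left out for brevity, and unlike MATLAB the sketch keeps a single FavoriteFood column instead of one per input table.

```python
from collections import defaultdict

def outer_join(left_rows, right_rows, key):
    """Full outer join of two lists of dicts on `key`; missing fields become None."""
    left_fields = [f for f in left_rows[0] if f != key]
    right_fields = [f for f in right_rows[0] if f != key]
    right_by_key = defaultdict(list)
    for row in right_rows:
        right_by_key[row[key]].append(row)

    joined, matched = [], set()
    for lrow in left_rows:
        matches = right_by_key.get(lrow[key])
        if matches:
            matched.add(lrow[key])
            # One output row per match, so one-to-many matches are expanded.
            joined.extend({**lrow, **rrow} for rrow in matches)
        else:
            joined.append({**lrow, **dict.fromkeys(right_fields)})
    # Keep right-hand rows that no left-hand row matched.
    for rrow in right_rows:
        if rrow[key] not in matched:
            joined.append({**dict.fromkeys(left_fields), **rrow})
    return sorted(joined, key=lambda r: r[key])  # stable sort by key value

T_left = [{"Age": a, "FavoriteFood": f} for a, f in
          [(5, "cheerios"), (12, "pizza"), (23, "salmon"),
           (2, "oreos"), (15, "lobster"), (6, "pizza")]]
T_right = [{"FavoriteFood": f, "Calories": c} for f, c in
           [("cheerios", 110), ("oreos", 160), ("pizza", 140),
            ("salmon", 367), ("cake", 243)]]

T = outer_join(T_left, T_right, "FavoriteFood")
for row in T:
    print(row)
```

As in the MATLAB output, the result has 7 rows: 'cake' has no Age, 'lobster' has no Calories, and 'pizza' appears twice because two rows of the left table match one row of the right table.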