Checking for overlaps in two long lists of items in Python


To complete @Jeff's answer, we can compare the time of computation for the two methods:

import numpy as np
import time

test = np.random.randint(1, 50000, 10000)
train = np.random.randint(1, 50000, 10000)

start_list = time.time()
overlap = [e for e in test if e in train]
end_list = time.time()
print("with list comprehension: " + str(end_list - start_list))

set_test = set(test)
set_train = set(train)

start_set = time.time()
overlap = set_test.intersection(set_train)
end_set = time.time()
print("with sets: " + str(end_set - start_set))

We get the output:

with list comprehension: 0.08894968032836914
with sets: 0.0003533363342285156
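
The gap comes from how membership is tested: `e in train` scans the whole sequence for every element, while a set lookup is a hash probe. Note that the timing above starts after both sets have been built. Here is a minimal sketch that includes the set construction in the measured time, assuming plain Python lists instead of NumPy arrays and the standard `timeit` module. The function names and the list size are illustrative choices, not part of the original answer.

```python
import random
import timeit

random.seed(0)  # fixed seed so both approaches see the same data
test = [random.randint(1, 50000) for _ in range(2000)]
train = [random.randint(1, 50000) for _ in range(2000)]

def overlap_with_list(test, train):
    # O(len(test) * len(train)): each `in` scans the whole list
    return [e for e in test if e in train]

def overlap_with_sets(test, train):
    # Building the set is part of the measured work here
    return set(test).intersection(train)

t_list = timeit.timeit(lambda: overlap_with_list(test, train), number=3)
t_sets = timeit.timeit(lambda: overlap_with_sets(test, train), number=3)
print(f"list comprehension: {t_list:.4f} s, sets: {t_sets:.4f} s")
```

Absolute numbers depend on the machine, so treat them as relative only.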

There are, however, specific methods to determine the intersection and the union of two sets. As long as ordering isn't important, here's how you might do it:

train_set = set(train)  # use frozenset if no mutation is required
test_set = set(test)
common_elements = train_set & test_set # or, equivalently
common_elements = train_set.intersection(test_set)

The same idea, using the names from the timing example above:

set_test = set(test)
set_train = set(train)
overlap = set_test.intersection(set_train)
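
If the question is only whether the two lists overlap at all, or what their union is, the same set types cover both. A small sketch with made-up data (the values are purely illustrative):

```python
train = [3, 8, 15, 23, 42]
test = [4, 15, 16, 42, 99]

train_set = frozenset(train)  # frozenset: the set is never mutated below

# isdisjoint() is True when no element is shared; it accepts any iterable
has_overlap = not train_set.isdisjoint(test)

common = train_set & set(test)    # intersection
combined = train_set | set(test)  # union

print(has_overlap)       # True
print(sorted(common))    # [15, 42]
print(sorted(combined))  # [3, 4, 8, 15, 16, 23, 42, 99]
```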

You can also use NumPy's intersect1d():

import random
import numpy as np

train = [random.randint(1, 51) for _ in range(1, 9000)]  # your list
test = [random.randint(1, 51) for _ in range(1, 9000)]  # your list

train = np.array(train)  # convert the lists into NumPy arrays
test = np.array(test)

overlap = np.intersect1d(train, test)
print(overlap)
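
np.intersect1d returns the sorted, unique values common to both inputs, so duplicates and the original order are not preserved. For readers without NumPy, a rough pure-Python equivalent of that result might look like this. The helper name `intersect_sorted` is hypothetical and not part of any library.

```python
def intersect_sorted(a, b):
    """Return the sorted, de-duplicated values present in both a and b."""
    return sorted(set(a) & set(b))

train = [5, 3, 3, 9, 1, 7]
test = [7, 3, 8, 5, 5]

print(intersect_sorted(train, test))  # [3, 5, 7]
```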

Suggestion : 2

Last Updated : 01 Sep, 2021

Input:
lst1 = [15, 9, 10, 56, 23, 78, 5, 4, 9]
lst2 = [9, 4, 5, 36, 47, 26, 10, 45, 87]
Output: [9, 10, 4, 5]

Input:
lst1 = [4, 9, 1, 17, 11, 26, 28, 54, 69]
lst2 = [9, 9, 74, 21, 45, 11, 63, 28, 26]
Output: [9, 11, 26, 28]
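
One way to produce results like these is to walk the first list and test membership against a set built from the second. This is a sketch, not necessarily the method the examples were generated with, and the function name `intersection` is an assumption. It keeps the order of lst1 and drops repeated values, so for the first input the elements come out in a different order than the one listed, which looks like the order of an unordered set.

```python
def intersection(lst1, lst2):
    """Values of lst1 that also occur in lst2, in lst1 order, without repeats."""
    lookup = set(lst2)  # average O(1) membership tests
    seen = set()
    result = []
    for value in lst1:
        if value in lookup and value not in seen:
            seen.add(value)
            result.append(value)
    return result

print(intersection([4, 9, 1, 17, 11, 26, 28, 54, 69],
                   [9, 9, 74, 21, 45, 11, 63, 28, 26]))
# [9, 11, 26, 28]
```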

Working: To intersect a source list with each sublist of a nested list, the filter part of the comprehension takes each sublist's items and keeps those that are also in the source list. The list comprehension is executed once for each sublist in list2.
Output:

[[13, 32], [7, 13, 28], [1, 6]]
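
The code this walkthrough describes is not shown here. A minimal sketch that fits the description is a nested list comprehension, with lst2 playing the role of list2. The input values are an assumption, chosen so that the result matches the output listed above.

```python
# Illustrative inputs (assumed, not taken from the original post)
lst1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
lst2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

source = set(lst1)  # set lookup instead of scanning lst1 for every item

# Outer comprehension: one pass per sublist; inner filter: keep shared items
result = [[item for item in sublist if item in source] for sublist in lst2]
print(result)  # [[13, 32], [7, 13, 28], [1, 6]]
```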

Suggestion : 3

Objects of different types, except different numeric types, never compare equal. The == operator is always defined but for some object types (for example, class objects) is equivalent to is. The <, <=, > and >= operators are only defined where they make sense; for example, they raise a TypeError exception when one of the arguments is a complex number.

There are really two flavors of function objects: built-in functions and user-defined functions. Both support the same operation (to call the function), but the implementation is different, hence the different object types.

All numeric types (except complex) support the following operations (for priorities of the operations, see Operator precedence):

Python defines several iterator objects to support iteration over general and specific sequence types, dictionaries, and other more specialized forms. The specific types are not important beyond their implementation of the iterator protocol.
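
The first rule above (objects of different types, except different numeric types, never compare equal) carries over to overlap checks, because set membership relies on == together with hash(). A short sketch with made-up values shows how equal numbers of different numeric types count as the same element, while a string never matches a number:

```python
ints = [1, 2, 3]
mixed = [1.0, "2", 3 + 0j]

# 1 == 1.0 and 3 == (3+0j) are True and their hashes agree, so they overlap;
# the string "2" is not a numeric type and never equals 2
common = set(ints) & set(mixed)
print(len(common))                            # 2
print(1 in common, 2 in common, 3 in common)  # True False True

# Ordering comparisons are not defined for complex numbers
try:
    (3 + 0j) < 4
except TypeError as exc:
    print("TypeError:", exc)
```

If the two lists hold the same IDs in different types (for example ints in one and strings in the other), convert them to one type before building the sets.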

>>> n = -37
>>> bin(n)
'-0b100101'
>>> n.bit_length()
6

def bit_length(self):
    s = bin(self)        # binary representation: bin(-37) --> '-0b100101'
    s = s.lstrip('-0b')  # remove leading zeros and minus sign
    return len(s)        # len('100101') --> 6

>>> n = 19
>>> bin(n)
'0b10011'
>>> n.bit_count()
3
>>> (-n).bit_count()
3

def bit_count(self):
    return bin(self).count("1")

>>> (1024).to_bytes(2, byteorder='big')
b'\x04\x00'
>>> (1024).to_bytes(10, byteorder='big')
b'\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00'
>>> (-1024).to_bytes(10, byteorder='big', signed=True)
b'\xff\xff\xff\xff\xff\xff\xff\xff\xfc\x00'
>>> x = 1000
>>> x.to_bytes((x.bit_length() + 7) // 8, byteorder='little')
b'\xe8\x03'

>>> int.from_bytes(b'\x00\x10', byteorder='big')
16
>>> int.from_bytes(b'\x00\x10', byteorder='little')
4096
>>> int.from_bytes(b'\xfc\x00', byteorder='big', signed=True)
-1024
>>> int.from_bytes(b'\xfc\x00', byteorder='big', signed=False)
64512
>>> int.from_bytes([255, 0, 0], byteorder='big')
16711680
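
to_bytes() and from_bytes() are inverses when the same byteorder and signed arguments are used on both sides. Here is a minimal round-trip sketch. The helper names `int_to_bytes` and `int_from_bytes` are hypothetical, and the length formula is one simple choice that leaves room for the sign bit, not the only possible one.

```python
def int_to_bytes(n):
    """Encode a signed int big-endian, with room for the value and its sign."""
    length = (n.bit_length() + 8) // 8  # one extra bit for the sign, rounded up
    return n.to_bytes(length, byteorder="big", signed=True)

def int_from_bytes(data):
    """Decode bytes produced by int_to_bytes."""
    return int.from_bytes(data, byteorder="big", signed=True)

for n in (0, 127, 128, -1024, 1000):
    encoded = int_to_bytes(n)
    assert int_from_bytes(encoded) == n  # the round trip restores the value
    print(n, encoded)
```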