Selective re-memoization of DataFrames


You can split the cache and the function into two pairs, so that each query function has its own Memory object:

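from tempfile import mkdtemp
from joblib import Memory

# The cache directory is passed positionally: the keyword is `location`
# in recent joblib releases (`cachedir` in older ones).
memory1 = Memory(mkdtemp(), verbose=0)
memory2 = Memory(mkdtemp(), verbose=0)

@memory1.cache
def run_my_query1():
    df = ...  # run query_1 and build the DataFrame here
    return df

@memory2.cache
def run_my_query2():
    df = ...  # run query_2 and build the DataFrame here
    return df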

Now, you can selectively clear the cache:

memory2.clear()
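
As a minimal sketch of what selective clearing does in practice, the following counts how often each function body actually runs. The `calls` counter and the list return values are stand-ins invented for illustration (real code would run the queries and return DataFrames), and `warn=False` is only there to silence joblib's warning when clearing:

```python
from tempfile import mkdtemp

from joblib import Memory

# One Memory (and one cache directory) per function.
memory1 = Memory(mkdtemp(), verbose=0)
memory2 = Memory(mkdtemp(), verbose=0)

# Hypothetical counter: records how many times each body is executed.
calls = {"query1": 0, "query2": 0}


@memory1.cache
def run_my_query1():
    calls["query1"] += 1
    return [1, 2, 3]  # stand-in for the result of query_1


@memory2.cache
def run_my_query2():
    calls["query2"] += 1
    return [4, 5, 6]  # stand-in for the result of query_2


run_my_query1()
run_my_query2()  # both bodies run once
run_my_query1()
run_my_query2()  # both results come from the cache

memory2.clear(warn=False)  # drop only the second cache

run_my_query1()  # still served from memory1's cache
run_my_query2()  # recomputed, because memory2 was cleared

print(calls)  # {'query1': 1, 'query2': 2}
```

Keeping one Memory per function trades a little setup for the ability to invalidate each result independently.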

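You can use the call method of the decorated function to force the original function to run again. As the following example shows, its return value differs from that of a normal call: in this session it is a tuple of the result and a metadata dict, so you need to unpack it yourself.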

>>> import tempfile
>>> import joblib
>>> memory = joblib.Memory(cachedir=tempfile.mkdtemp(), verbose=0)
>>> @memory.cache
... def run(x):
...     print('called with {}'.format(x))  # for debug
...     return x
...
>>> run(1)
called with 1
1
>>> run(2)
called with 2
2
>>> run(3)
called with 3
3
>>> run(2)  # Cached
2
>>> run.call(2)  # Force call of the original function
called with 2
(2, {'duration': 0.0011069774627685547, 'input_args': {'x': '2'}})

Example code:

#!/usr/bin/env python

import sys
from shutil import rmtree

import joblib

cachedir = "joblib-cache"
memory = joblib.Memory(cachedir)

@memory.cache
def foo():
    print("running foo")
    return 42

@memory.cache
def oof():
    print("running oof")
    return 24

def main():
    # Start from an empty cache directory.
    rmtree(cachedir)

    print(f"{sys.version=}")
    print(f"{joblib.__version__=}")

    # First calls: both functions actually run.
    print(foo())
    print(oof())
    print()

    print("*" * 20 + " These should now be cached " + "*" * 20)
    print(foo())
    print(oof())
    print()

    # Clear only foo's cache; oof stays cached.
    foo.clear()
    print("*" * 20 + " `foo` should now be recaculated " + "*" * 20)
    print(foo())
    print(oof())

if __name__ == "__main__":
    main()

Output:

sys.version='3.9.6 (default, Jun 30 2021, 10:22:16) \n[GCC 11.1.0]'
joblib.__version__='1.0.1'
________________________________________________________________________________
[Memory] Calling __main__--tmp-tmp.DaQHHlsA2H-clearcache.foo...
foo()
running foo
______________________________________________________________foo - 0.0s, 0.0min
42
________________________________________________________________________________
[Memory] Calling __main__--tmp-tmp.DaQHHlsA2H-clearcache.oof...
oof()
running oof
______________________________________________________________oof - 0.0s, 0.0min
24

******************** These should now be cached ********************
42
24

WARNING:root:[MemorizedFunc(func=<function foo at 0x7f9cd7d8e040>, location=joblib-cache/joblib)]: Clearing function cache identified by __main__--tmp-tmp/DaQHHlsA2H-clearcache/foo
******************** `foo` should now be recaculated ********************
________________________________________________________________________________
[Memory] Calling __main__--tmp-tmp.DaQHHlsA2H-clearcache.foo...
foo()
running foo
______________________________________________________________foo - 0.0s, 0.0min
42
24

Suggestion : 2

Sep 23, 2014 · Say I set up memoization with Joblib as follows (using the solution provided here):


from tempfile import mkdtemp
cachedir = mkdtemp()

from joblib import Memory
# Directory passed positionally: the keyword is `location` in recent
# joblib releases (`cachedir` in older ones).
memory = Memory(cachedir, verbose=0)

@memory.cache
def run_my_query(my_query):
    ...
    return df

run_my_query(query_1)
run_my_query(query_1)  # <- Uses cached output

run_my_query(query_2)
run_my_query(query_2)  # <- Uses cached output


Suggestion : 3

Last Updated : 25 May, 2022

Output:

120
120

Output:

result saved in memory
result saved in memory
result saved in memory
result saved in memory
result saved in memory
120
returning result from saved memory
120

Explanation:
1. A function called memoize_factorial is defined. Its main purpose is to store intermediate results in a variable called memory.
2. The second function, facto, calculates the factorial. It is decorated with memoize_factorial, and the wrapper has access to the memory variable through a closure. The decoration is equivalent to writing:

facto = memoize_factorial(facto)
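
The explanation above can be sketched as follows. The names memoize_factorial, facto, and memory come from the text; the inner function name and the exact structure are assumptions, chosen so that the printed messages match the output listed earlier in this section:

```python
def memoize_factorial(f):
    # Cache of already computed results, kept alive by the closure below.
    memory = {}

    def inner(num):
        if num not in memory:
            # First time we see this argument: compute and store it.
            memory[num] = f(num)
            print("result saved in memory")
        else:
            print("returning result from saved memory")
        return memory[num]

    return inner


@memoize_factorial  # same as: facto = memoize_factorial(facto)
def facto(num):
    if num <= 1:
        return 1
    # The recursive call goes through the decorated name, so every
    # intermediate factorial is cached as well.
    return num * facto(num - 1)


print(facto(5))  # five "result saved in memory" lines, then 120
print(facto(5))  # "returning result from saved memory", then 120
```

Because the recursion calls the decorated facto, a later facto(6) only needs one new multiplication.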

Suggestion : 4

Author of jug here: jug works fine. I just tried the following and it works:

from jug import TaskGenerator
import pandas as pd
import numpy as np

@TaskGenerator
def gendata():
    return pd.DataFrame(np.arange(343440).reshape((10, -1)))

@TaskGenerator
def compute(x):
    return x.mean()

y = compute(gendata())

It is not as efficient as it could be, as it just uses pickle internally for the DataFrame (although it compresses it on the fly, so it is not horrible in terms of memory use; just slower than it could be).

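At the time of this answer DataFrames were hashable, so a generic memoizing wrapper (memoized below, defined elsewhere) worked directly. In current pandas, hash(df) raises TypeError because DataFrames are mutable, so the cache key has to be derived some other way. Here's the original example.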

In [2]: func = lambda df: df.apply(np.fft.fft)

In [3]: memoized_func = memoized(func)

In [4]: df = DataFrame(np.random.randn(1000, 1000))

In [5]: %timeit func(df)
10 loops, best of 3: 124 ms per loop

In [9]: %timeit memoized_func(df)
1000000 loops, best of 3: 1.46 us per loop