-
Hi again, here are the checks I made between the two codes, also comparing against numpy:
import numpy as np
from numba import guvectorize
import vaex
# vaex 1 // buffer within the vectorized function
@guvectorize('void(float64[:], float64[:])', '(m)->(m)', nopython=True)
def crazy(data: np.ndarray, res: np.ndarray):
    buffer = 0
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer
        buffer = res[idx]
@vaex.register_function()
def crazy_v(ar):
    result = np.zeros(len(ar))
    crazy(ar, result)
    return result
# vaex 2 // buffer outside the vectorized function
@guvectorize('void(float64[:], float64[:], float64[:])', '(m), (n)->(m)', nopython=True)
def crazy2(data: np.ndarray, buffer: np.ndarray, res: np.ndarray):
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer[0]
        buffer[0] = res[idx]
@vaex.register_function()
def crazy2_v(ar, buffer):
    result = np.zeros(len(ar))
    crazy2(ar, buffer, result)
    return result
SIZE = 1025
# vaex 1
vdf = vaex.from_arrays(x=np.arange(SIZE))
vdf['res'] = vdf.func.crazy_v(vdf.x)
resV1 = vdf['res'].values[-1]
# vaex 2
vdf2 = vaex.from_arrays(x=np.arange(SIZE))
buffer = np.array([0])
vdf2['res'] = vdf2.func.crazy2_v(vdf2.x, buffer)
resV2 = vdf2['res'].values[-1]
# numpy
resN = np.sum(np.arange(SIZE))
print(f'resV1 equals resV2: {resV1 == resV2}')
print(f'resN equals resV2: {resN == resV2}')
resV1 equals resV2: True
resN equals resV2: False
resN
Out[46]: 524800
resV1
Out[47]: 1024.0
One notices that the sum is reset at row 1025 (1024 is the value of row 1025). What surprises me is that the 'buffer strategy' of the 2nd approach does not work.
buffer
Out[45]: array([0])
@maartenbreddels I understand it may not be that simple.
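A reset like the one at row 1025 is what one would expect if the function is called once per chunk rather than once for the whole column. The sketch below is plain Python and does not use vaex internals: `running_sum`, `evaluate_in_chunks` and `CHUNK_SIZE = 1024` are hypothetical names and values, chosen only to mirror the numbers reported above.

```python
def running_sum(chunk):
    """Cumulative sum whose buffer is local to the call, like `crazy`."""
    buffer = 0.0
    out = []
    for val in chunk:
        buffer += val
        out.append(buffer)
    return out


def evaluate_in_chunks(data, chunk_size):
    """Call `running_sum` once per chunk and concatenate the pieces."""
    out = []
    for start in range(0, len(data), chunk_size):
        out.extend(running_sum(data[start:start + chunk_size]))
    return out


SIZE = 1025
CHUNK_SIZE = 1024  # hypothetical, chosen to mirror the observation above
data = list(range(SIZE))

whole = running_sum(data)                       # one call over all rows
chunked = evaluate_in_chunks(data, CHUNK_SIZE)  # buffer restarts per chunk

print(whole[-1])    # 524800.0
print(chunked[-1])  # 1024.0
```

Under these assumptions the last chunk holds a single row, so its running sum is just that row's value.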
-
Hi there, I am 'reviving' this thread.
This would allow for custom calculations.
-
I will try with
-
You need to use shift if you want access to the previous row; that is indeed what is used to implement rolling window calculations (see e.g. df.rolling.sum).
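To illustrate the idea of building a rolling window from shifted copies of a column, here is a minimal plain-Python sketch. It is not the vaex API: `shift` and `rolling_sum` are hypothetical helpers written for this example, and padding the first rows with `fill_value=0` is an assumption that may differ from how vaex treats the edges.

```python
def shift(values, periods, fill_value=0):
    """Move every value `periods` rows down, padding the top with fill_value.

    Assumes 0 <= periods <= len(values).
    """
    return [fill_value] * periods + list(values[:len(values) - periods])


def rolling_sum(values, window):
    """Sum each row with its `window - 1` predecessors via shifted copies."""
    shifted = [shift(values, k) for k in range(window)]
    return [sum(column) for column in zip(*shifted)]


x = [1, 2, 3, 4, 5]
print(shift(x, 1))        # [0, 1, 2, 3, 4]
print(rolling_sum(x, 2))  # [1, 3, 5, 7, 9]
```

Each output row depends only on input rows, never on a previously computed result, which is why this kind of calculation does not need a buffer carried between calls.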
-
Hi,
A few days ago I opened ticket #1428, in which a complementary question was raised, as pointed out by @maartenbreddels:
Can vaex row results depend on previous results?
@maartenbreddels, your answer was:
Not yet, [...] the fundamental issue in vaex to support this is that you need the chunks that are processed to overlap and carry their results to the next chunk (like in cumulative min/max)
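As an illustration of what carrying a result from one chunk to the next means for a cumulative maximum, here is a minimal plain-Python sketch. It does not reflect how vaex is implemented: `cumulative_max_chunk`, `cumulative_max` and the chunk size are hypothetical, and the chunks are assumed to be processed strictly in order.

```python
def cumulative_max_chunk(chunk, carry):
    """Cumulative max of one chunk, seeded with the previous chunk's result."""
    out = []
    for val in chunk:
        carry = val if carry is None else max(carry, val)
        out.append(carry)
    return out, carry


def cumulative_max(data, chunk_size):
    """Process chunks sequentially, handing the running max to the next one."""
    out, carry = [], None
    for start in range(0, len(data), chunk_size):
        piece, carry = cumulative_max_chunk(data[start:start + chunk_size], carry)
        out.extend(piece)
    return out


values = [3, 1, 4, 1, 5, 9, 2, 6]
print(cumulative_max(values, chunk_size=3))  # [3, 3, 4, 4, 5, 9, 9, 9]
```

The explicit `carry` argument is what makes the result independent of the chunk size; it also forces the chunks to be handled one after another.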
I am taking the opportunity to continue the discussion here.
I do understand the approach you describe (I think).
But now, what if I give vaex a 'vectorized' function that has its own buffer to keep track of results from previous rows?
I tried this in the example below and got no error message. However, I don't know how vaex uses this function.
Do you see any limitation in the example below?
Limitation 1?
Does vaex call it in batches, multiple times, meaning the buffer is reset each time the function is executed again?
If yes, I can change the code as proposed below (keeping the buffer variable outside the loop so that it is not reset when the function is executed multiple times).
Limitation 2?
The second limitation I can see, if the above code is OK, is that each execution of the vectorized function is probably distributed across different threads; hence the computations are not sequential, and the buffer does not hold the right value when a new computation starts.
In this latter case, is there a way to tell vaex not to distribute the computations?
Does either of the above codes seem workable to you?
Thank you in advance for your feedback and help.
Best,
PS: with the above codes, both approaches give the same result; I guess the array is not big enough for vaex to run the function in batches and distribute it across different threads.