Constraint in vaex to have results depending of previous rows? #1458

yohplala · 2021-07-06T20:39:14Z

yohplala
Jul 6, 2021

Hi,

A few days ago I opened ticket #1428, in which a related question came up, as @maartenbreddels pointed out:
Can vaex row results depend on previous results?

@maartenbreddels, your answer was:
Not yet, [...] the fundamental issue in vaex to support this is that you need the chunks that are processed to overlap and vary its results to the next chunk (like in cumulative min/max)
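
To make the quoted idea concrete, here is a minimal sketch in plain NumPy (not vaex internals) of a cumulative sum computed chunk by chunk, where each chunk receives the running total carried over from the previous one. The function name `chunked_cumsum` and the `chunk_size` value are illustrative assumptions.

```python
import numpy as np

def chunked_cumsum(data, chunk_size=4):
    """Cumulative sum computed one chunk at a time, in order.

    Hypothetical helper for illustration; it does not use vaex.
    """
    out = np.empty(len(data), dtype=np.float64)
    carry = 0.0  # running total handed from one chunk to the next
    for start in range(0, len(data), chunk_size):
        stop = min(start + chunk_size, len(data))
        # Local cumulative sum of this chunk, offset by the carried total.
        out[start:stop] = np.cumsum(data[start:stop]) + carry
        carry = out[stop - 1]
    return out

x = np.arange(10)
result = chunked_cumsum(x)
```

The carry is only correct if the chunks are processed strictly in order. Processing them in parallel would need an extra pass to combine the per-chunk totals.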

I would like to take the opportunity to continue the discussion here.
I think I understand the approach you describe.
But what if I give vaex a 'vectorized' function that has its own buffer to keep track of the result from previous rows?

I tried this in the example below and got no error message, but I don't know how vaex actually calls this function.
Do you see a limitation in the example below?

import numpy as np
from numba import guvectorize
import vaex

@guvectorize('void(float64[:], float64[:])', '(m)->(m)', nopython=True)
def crazy(data: np.ndarray, res: np.ndarray):
    buffer = 0
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer
        buffer = res[idx]

@vaex.register_function()
def crazy_v(ar):
    result = np.zeros(len(ar))
    crazy(ar, result)
    return result

vdf = vaex.from_arrays(x=np.arange(1000))
vdf['res'] = vdf.func.crazy_v(vdf.x)

Limitation 1?
Does vaex call it in batches, multiple times, meaning the buffer is reset each time the function is executed again?

If so, I can change the code as proposed below (keeping the buffer variable outside the function so that it is not reset when the function is executed multiple times).

import numpy as np
from numba import guvectorize
import vaex

@guvectorize('void(float64[:], float64[:], float64[:])', '(m), (n)->(m)', nopython=True)
def crazy2(data: np.ndarray, buffer: np.ndarray, res: np.ndarray):
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer[0]
        buffer[0] = res[idx]

@vaex.register_function()
def crazy2_v(ar, buffer):
    result = np.zeros(len(ar))
    crazy2(ar, buffer, result)
    return result

vdf2 = vaex.from_arrays(x=np.arange(1000))
buffer = np.array([0])
vdf2['res'] = vdf2.func.crazy2_v(vdf2.x, buffer)

Limitation 2?
The second limitation I can see, assuming the above code is OK, is that the executions of the vectorized function are probably distributed across different threads. The computations are then not sequential, and the buffer does not hold the right value when a new computation starts.
In that case, is there a way to tell vaex not to distribute the computations?

Does either of the above code samples seem workable to you?

I thank you in advance for your feedback and help.
Bests,

PS: with the code above, both approaches give the same result. I guess the array is not big enough for vaex to run the function in batches and distribute it across threads.

all(vdf['res'].values == vdf2['res'].values)
Out[16]: True

yohplala · 2021-07-07T07:22:29Z

yohplala
Jul 7, 2021
Author

Hi again,
I investigated further.
The above code indeed does not work, but the way it fails raises more questions for me.

Here are the checks I made, comparing the two versions with each other and with np.sum as the reference result.
Starting from a SIZE of 1025, the results are not OK
(at 1024, they are still OK).

import numpy as np
from numba import guvectorize
import vaex

# vaex 1 // buffer within the vectorized function
@guvectorize('void(float64[:], float64[:])', '(m)->(m)', nopython=True)
def crazy(data: np.ndarray, res: np.ndarray):
    buffer = 0
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer
        buffer = res[idx]

@vaex.register_function()
def crazy_v(ar):
    result = np.zeros(len(ar))
    crazy(ar, result)
    return result

# vaex 2 // buffer outside the vectorized function
@guvectorize('void(float64[:], float64[:], float64[:])', '(m), (n)->(m)', nopython=True)
def crazy2(data: np.ndarray, buffer: np.ndarray, res: np.ndarray):
    for idx, val in np.ndenumerate(data):
        idx, = idx
        res[idx] = val+buffer[0]
        buffer[0] = res[idx]

@vaex.register_function()
def crazy2_v(ar, buffer):
    result = np.zeros(len(ar))
    crazy2(ar, buffer, result)
    return result


SIZE = 1025
# vaex 1
vdf = vaex.from_arrays(x=np.arange(SIZE))
vdf['res'] = vdf.func.crazy_v(vdf.x)
resV1 = vdf['res'].values[-1]
# vaex 2
vdf2 = vaex.from_arrays(x=np.arange(SIZE))
buffer = np.array([0])
vdf2['res'] = vdf2.func.crazy2_v(vdf2.x, buffer)
resV2 = vdf2['res'].values[-1]
# numpy
resN = np.sum(np.arange(SIZE))

print(f'resV1 equals resV2: {resV1 == resV2}')
print(f'resN equals resV2: {resN == resV2}')
resV1 equals resV2: True
resN equals resV2: False

resN
Out[46]: 524800

resV1
Out[47]: 1024.0

Note that the sum gets reset at row 1025 (1024 is the value of row 1025).
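
The observed value is consistent with the function being called once per chunk, with fresh state each time. A small NumPy-only sketch reproduces it, assuming a chunk size of 1024 rows (inferred from the result above, not confirmed from the vaex source). `per_chunk_cumsum` is a hypothetical stand-in for what the engine may be doing.

```python
import numpy as np

CHUNK = 1024  # assumption, inferred from where the reset appears
SIZE = 1025

def per_chunk_cumsum(data, chunk_size=CHUNK):
    """Cumulative sum restarted from zero at every chunk boundary."""
    out = np.empty(len(data), dtype=np.float64)
    for start in range(0, len(data), chunk_size):
        stop = start + chunk_size  # slicing clips this to len(data)
        out[start:stop] = np.cumsum(data[start:stop])
    return out

x = np.arange(SIZE)
res = per_chunk_cumsum(x)
print(res[1023], res[-1])
```

Under that assumption the last chunk holds a single row, so its "cumulative" value is just that row's own value.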

What surprises me is that the 'buffer strategy' of the second approach does not work: the buffer variable outside the function remains 0.

buffer
Out[45]: array([0])
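
One plausible contributing factor, independent of how vaex schedules chunks: `buffer` is created as an integer array (`np.array([0])`), while `crazy2` is declared for `float64`. NumPy's ufunc machinery generally casts mismatched inputs into a temporary copy, so in-place writes would land in the copy rather than in `buffer`. The sketch below mimics that behaviour in plain NumPy with `astype(..., copy=False)`. It is an analogy, not the numba or vaex code path, and `accumulate` is a hypothetical name.

```python
import numpy as np

def accumulate(data, buffer):
    # Mimic "cast only if needed": a float64 buffer is used as-is,
    # any other dtype is converted into a new temporary array.
    work = buffer.astype(np.float64, copy=False)
    for val in data:
        work[0] += val
    return work[0]

int_buffer = np.array([0])      # integer dtype, as in the post
float_buffer = np.array([0.0])  # float64, matches the declared signature

accumulate(np.arange(5), int_buffer)
accumulate(np.arange(5), float_buffer)
print(int_buffer, float_buffer)
```

Even with a float64 buffer, whether vaex passes the same array object to every call is a separate question that this sketch does not answer.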

@maartenbreddels I understand it may not be that simple.
Do you know of any tweak that would get this working?
Thanks in advance for your help,
Bests


yohplala · 2021-10-29T13:08:01Z

yohplala
Oct 29, 2021
Author

Hi there, I am reviving this thread.
Please:

Is there any way to tweak vaex's 'execute-by-chunk' engine so that it runs sequentially instead of in parallel?
(Is this how 'apply' already works, i.e. running in chunks, but sequentially?)
Is there any way to have a shared space for all 'jobs' of the 'execute-by-chunk' engine, so that results from previous 'jobs' (i.e. iterations of the engine) can be used by the next ones? I would essentially need a pointer to a mutable container (for instance a dict, a list or a numpy array).

This would allow for custom calculations.
Thanks for any help.
Bests


yohplala · 2021-10-29T13:21:37Z

yohplala
Oct 29, 2021
Author

I will try with apply. I just noticed that vectorize=True and multiprocessing=False may be what I need, if vectorize means data is processed in chunks (arrays of 1024 rows, then?) and multiprocessing=False means 'runs sequentially'. Sorry for the newbie questions.


maartenbreddels · 2021-10-29T15:18:21Z

maartenbreddels
Oct 29, 2021
Maintainer

You need to use shift if you want access to the previous row, and that is indeed what is used to implement rolling window calculations (see df.rolling.sum e.g.)
Does this help?

1 reply

yohplala Nov 15, 2021
Author

Hi @maartenbreddels
Thanks for trying to help.
I think I have not been clear in my explanation.

The calculation of row n depends on the result of the calculation of row (n-1), and so forth. So, as I see it, the rows need to be processed sequentially. I have opened ticket #1706 for this question. I hope the example provided there is a bit clearer.
Bests
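
The difference between the two cases can be sketched in plain NumPy. A shift gives each row access to the previous input value, which can be computed for all rows at once. A recurrence needs the previous output, which forces a sequential loop. The clipped running sum below is an illustrative recurrence chosen for this sketch, not the calculation from ticket #1706.

```python
import numpy as np

x = np.array([3.0, -5.0, 4.0, 2.0, -1.0])

# Shift-based: y[n] = x[n] + x[n-1], depends only on inputs.
prev_x = np.concatenate(([0.0], x[:-1]))  # x shifted down by one row
y_shift = x + prev_x

# Recurrence: y[n] = max(0, y[n-1] + x[n]), depends on the previous output.
y_rec = np.empty_like(x)
prev_y = 0.0
for n, val in enumerate(x):
    prev_y = max(0.0, prev_y + val)
    y_rec[n] = prev_y
```

Because of the clipping, `y_rec` differs from a plain `np.cumsum(x)`, so it cannot be obtained from a cumulative sum or from shifted inputs alone.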
