[FEATURE-REQUEST] How to share results from calculations on previous rows for calculations on next rows in `apply`? #1706
Comments
To give you a view on how
import numpy as np
from numba import guvectorize

@guvectorize('void(float64[:], float64[:], float64[:])', '(m), (n)->(m)', nopython=True)
def sum_v(data: np.ndarray, buffer: np.ndarray, res: np.ndarray):
    for idx, val in np.ndenumerate(data):
        idx, = idx
        print(buffer)
        res[idx] = val+buffer[0]
        buffer[0] = res[idx]

def sum_vp(data): # does the work of partial
    buffer = np.array([0])
    res = np.empty(len(data))
    sum_v(data, buffer, res)
    return res

SIZE = 5
data = np.arange(SIZE)
res_ar = sum_vp(data)
[0.]
[0.]
[1.]
[3.]
[6.]
res_ar
Out[3]: array([ 0., 1., 3., 6., 10.])
Related issue: #1313
Hi @JovanVeljanoski
Looking at the provided and accepted SO answer, it is not the one I am looking for. That answer relies on a 'shift-like' approach: it uses data from a previous row, and this data already exists and stays unchanged during the calculation.
I spent a couple of hours on the topic to see if I could find my way through. In both cases, I am rolling out a similar approach:
Using
The calculation I intend is not exactly the one presented above (I intend to use min and max, store them temporarily in a buffer until I encounter a new min or max, or a limit value is hit, in which case the buffer values are reset). But the above example shows the intent. It is actually a cumulative sum, and other tickets are indeed open on this topic. @JovanVeljanoski Best
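To make the min/max idea concrete, here is a minimal pure-Python sketch of a stateful, strictly sequential pass. The function name `running_extrema`, the `limit` parameter, and the reset rule (reset both values to the current row once `max - min` exceeds `limit`) are assumptions made for illustration only; the actual rule intended in this issue may differ.

```python
from typing import Iterable, List, Optional, Tuple


def running_extrema(values: Iterable[float], limit: float) -> List[Tuple[float, float]]:
    """Return the (min, max) buffer after each row, processing rows in order.

    Assumed reset rule (hypothetical): when max - min exceeds `limit`,
    both buffer values are reset to the current row's value.
    """
    lo: Optional[float] = None
    hi: Optional[float] = None
    out: List[Tuple[float, float]] = []
    for v in values:
        if lo is None or hi is None:
            lo = hi = v              # first row initialises the buffer
        else:
            lo = min(lo, v)          # a new min replaces the stored one
            hi = max(hi, v)          # a new max replaces the stored one
            if hi - lo > limit:      # limit hit: reset the buffer
                lo = hi = v
        out.append((lo, hi))
    return out


history = running_extrema([1, 3, 2, 10, 9], limit=5)
```

Each row's result depends on the buffer left by the previous row, which is why this kind of calculation only works if rows are visited in order.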
Thinking out loud (I think I will go this way, but I need to spend some time on other topics, so I will get back to this a bit later).
import vaex as vx

@vx.register_function
def iter_apply(func, arguments, chunk_size):
    vdf = arguments[0].join(arguments[1:])  # re-create a DataFrame of relevant column
    arrow_list = []
    for df in vdf.to_pandas_df(chunk_size=chunk_size):  # iterate and apply func by chunk sequentially
        res = vx.from_pandas(func(df))
        file = '~/.vaex/cache/....arrow'  # use cache directory to store temporarily results
        res.to_arrow(file)
        arrow_list.append(file)
    return vx.open_many(arrow_list)
This is the rough idea; I need to check it in more detail.
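The chunk-by-chunk idea above can be sketched without any vaex calls, to show the part that matters for correctness: chunks are visited in order, and the state left by one chunk is handed to the next. The helper names `iter_chunks` and `chunked_cumsum` are hypothetical, and plain Python lists stand in for DataFrame chunks.

```python
from itertools import accumulate, islice
from typing import Iterable, Iterator, List


def iter_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most `chunk_size` items, in order."""
    it = iter(values)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_cumsum(values: Iterable[float], chunk_size: int) -> List[float]:
    """Cumulative sum computed chunk by chunk, carrying the total across chunks."""
    carry = 0.0                      # state shared between chunks
    result: List[float] = []
    for chunk in iter_chunks(values, chunk_size):
        # Seed the running sum with the carry, then drop the seed itself.
        partial = list(accumulate(chunk, initial=carry))[1:]
        carry = partial[-1]          # hand the last total to the next chunk
        result.extend(partial)
    return result


totals = chunked_cumsum([0, 1, 2, 3, 4], chunk_size=2)
```

The chunk size does not change the result here, because the only thing a chunk needs from its predecessors is the carried total.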
Description
I would like to keep results from calculations on previous rows and use them in calculations on next rows in `apply`. Please, how should I do that?
With the code below, I have a shared variable `buffer`, but it seems rows are processed in parallel and not sequentially. As a result, I don't have the right value when calculating the 'next rows'.
Is there any way to force 'sequential execution' in `apply` (while still operating on arrays, i.e. with `vectorize=True`)? I was thinking `multiprocessing=False` would do the trick, but it does not force sequential processing.
When `buffer` is printed (in the for loop), we can see that the values it stores are not OK.
We should see:
If I deactivate `vectorize` in `apply`, then the function expects int / float, not an array.
But I want to use arrays because, as a next step, I intend to speed up the execution with the `@guvectorize` decorator from numba.
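A small pure-Python sketch of the symptom described here, under the assumption that rows are split into chunks and each chunk is evaluated with its own starting state. How vaex actually schedules the work is not established in this thread; the chunk boundaries and the helper name `cumsum_per_chunk` are made up for illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def cumsum_per_chunk(chunks: Sequence[Sequence[float]]) -> List[float]:
    """Cumulative sum where every chunk restarts from an empty buffer."""
    out: List[float] = []
    for chunk in chunks:
        out.extend(accumulate(chunk))  # no state is carried over from earlier chunks
    return out


rows = [0, 1, 2, 3, 4]
chunks = [rows[:3], rows[3:]]           # assumed split into two chunks

independent = cumsum_per_chunk(chunks)  # second chunk does not see the first
sequential = list(accumulate(rows))     # one ordered pass over all rows
```

The two results agree inside the first chunk and diverge from the first row of the second chunk onward, which matches the kind of wrong `buffer` values reported above.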