
Learning data too big to fit in memory at once, how to learn? #7801

Open
Sir-hennihau opened this issue Jul 4, 2023 · 10 comments

Comments

Sir-hennihau commented Jul 4, 2023

My dataset has become too large to fit in memory at once in TensorFlow.js. What are good solutions for learning from all data entries? My data comes from a MongoDB instance and needs to be loaded asynchronously.

I tried to play with generator functions, but couldn't get async generators to work yet. I was also wondering whether fitting the model to the data in batches would be possible.

It would be great if someone could provide me with a minimal example of how to fit on data that is loaded asynchronously, either in batches or through a database cursor.

For example, when trying to yield promises from the generator, I get a TypeScript error.

    const generate = function* () {
        yield new Promise(() => {});
    };

    tf.data.generator(generate);

Argument of type '() => Generator<Promise<unknown>, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

Also, you can't use async generators. This is the error you get if you try:

tf.data.generator(async function* () {})

throws

Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

gaikwadrahul8 commented Jul 5, 2023

Hi, @Sir-hennihau

Thank you for bringing this issue to our attention. As far as I know, you can use tf.data.generator or tf.data.Dataset with either .batch or .prefetch; please also refer to this answer from Stack Overflow. Could you give it a try and let us know whether it resolves your issue?

If the issue still persists, please let us know. Thank you!
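For reference, a minimal sketch of that suggestion, assuming the data can be produced by a synchronous generator (the loop bound, shapes, and xs/ys fields are placeholders). Note that this does not by itself address the async-loading part of the question:

```ts
import * as tf from '@tensorflow/tfjs-node';

// Synchronous generator: the public typings of tf.data.generator only accept
// a sync iterator, so this sketch produces data in memory.
function* dataGenerator() {
  for (let i = 0; i < 1000; i++) {
    // xs/ys names and shapes are illustrative only
    yield { xs: tf.tensor1d([i]), ys: tf.tensor1d([i % 2]) };
  }
}

// batch() and prefetch() keep only a small window of data materialized at a time
const dataset = tf.data.generator(dataGenerator).batch(64).prefetch(2);

// await model.fitDataset(dataset, { epochs: 10 });
```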

Sir-hennihau commented Jul 5, 2023

Hey @gaikwadrahul8,
I tried to play around with the functions you suggested but couldn't get them to work yet, unfortunately.

The code snippet from Stack Overflow results in a TypeScript error, because it says that async generators are not assignable. I tried to play around a bit with // @ts-ignore, but couldn't get it to work yet. I also can't find an example, in the documentation or elsewhere online, where the dataset is populated by loading data from the network using async/await.

Just for completeness, the snippet from Stack Overflow

const dataset = tf.data.generator(async function* () {
    const dataToDownload = await fetch(/* ... */);
    while (/* ... */) {
        const moreData = await fetch(/* ... */);
        yield moreData;
    }
});

throws
Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

At that point, I don't even know whether the implementation or the typings are wrong.

Can you maybe take a look and try to produce a minimal working example where the dataset uses data loaded via async/await from some remote source? It would be highly appreciated to move this problem forward.

Sir-hennihau commented Dec 5, 2023

Bump, I still haven't found a satisfying solution to this problem yet :s
@mattsoulanille :D

@Antony-Lester

@Sir-hennihau I have just started facing the same issue
(linking MongoDB to TensorFlow with batched data using TypeScript and tfjs-node-gpu),
as I have started hitting the V8 heap limit.

If I find a solution/workaround I will share it within a week or so.

@Sir-hennihau

@Antony-Lester any news?

Antony-Lester commented Feb 8, 2024

@Sir-hennihau The only working solution I have found so far is to move to incremental batch training; I can't speak to its accuracy as I don't have a baseline for comparison.

So I hold all of the validation data and one chunk's worth of training data in heap memory:

```
const { trainCount } = await countDataPoints(db)
const batchSize = Math.ceil(trainCount / 100)
const validationDataResult = await validationData(db)
const totalBatches = Math.ceil(trainCount / batchSize)
const trainDataPipelineArray = await trainDataPipeline(db)

// Train model incrementally, one chunk at a time
for (let i = 0; i < trainCount; i += batchSize) {
    // Page through the collection with $skip/$limit so only one chunk is in memory
    const batchPipeline = [...trainDataPipelineArray, { $skip: i }, { $limit: batchSize }]
    const data = await db.collection('myCollection').aggregate(batchPipeline).toArray()

    const metricsData = data.map(item => item.metrics)
    const xs = tf.tensor2d(metricsData, [metricsData.length, metricsData[0].length])
    const resultData = data.map(item => item.result)
    const ys = tf.tensor2d(resultData, [resultData.length, 1])

    await model.fit(xs, ys, {
        epochs: epochs,
        validationData: validationDataResult,
        shuffle: true,
        batchSize: 64,
        callbacks: [],
        verbose: 1,
    })

    // Free the chunk's tensors before loading the next one
    xs.dispose()
    ys.dispose()
}
```

From Copilot:
Advantages of Incremental Batch Training:

Memory Efficiency: It's more memory-efficient as it only needs to load a small batch into memory, which is beneficial when dealing with large datasets that can't fit into memory.

Speed: It can lead to faster convergence because the model parameters are updated more frequently.

Noise: The noise in the gradient estimation can sometimes help escape shallow local minima, leading to better solutions.

Real-time Learning: It allows the model to learn from new data on-the-go without retraining from scratch.

Disadvantages of Incremental Batch Training:

Less Accurate Gradient Estimation: The gradient estimation can be less accurate because it's based on fewer examples.

Hyperparameter Sensitivity: It's more sensitive to the choice of learning rate and batch size.

Less Stable: The cost function is not guaranteed to decrease every step, and the final parameters can depend on the initial parameters (i.e., the solution can be non-deterministic).

@Sir-hennihau

Thanks @Antony-Lester. In the meantime I went with first converting my data to a CSV file and then using the CSV learning methods from tfjs. It's a shame that this seems to be needed, but the CSV learning path seems to be nicely implemented. On very large datasets this is very storage-inefficient, though.
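For anyone else landing here, a rough sketch of that CSV route, assuming the data has been exported to a local file and that `result` is the label column (the path and column names are placeholders):

```ts
import * as tf from '@tensorflow/tfjs-node';

// tf.data.csv streams the file instead of loading it into memory at once
const csvDataset = tf.data
  .csv('file://./data/train.csv', {
    columnConfigs: { result: { isLabel: true } },
  })
  // Flatten the per-row feature/label objects into arrays of numbers
  .map(({ xs, ys }: any) => ({ xs: Object.values(xs), ys: Object.values(ys) }))
  .batch(64);

// await model.fitDataset(csvDataset, { epochs: 10 });
```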

@Sir-hennihau

It would anyway be nice to get an answer from the maintainers on how to solve the issue without using workarounds like converting the data to a CSV file first.

@Antony-Lester

In the end, I spawned off Python scripts that trained the model while watching the scripts' console output. Not ideal, but I can use the whole memory now.

tharvik commented Oct 4, 2024

You can in fact simply ts-ignore the async generator; it is supported internally.

  // @ts-expect-error
  tf.data.generator(async function* () {  });

Opened #8408 to expose it.
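For completeness, a sketch of how that workaround could be combined with a MongoDB cursor; the database/collection/field names and tensor shapes are assumptions, not tested against this issue's data:

```ts
import * as tf from '@tensorflow/tfjs-node';
import { MongoClient } from 'mongodb';

function buildDataset(client: MongoClient) {
  // @ts-expect-error async generators are supported internally but not by the typings
  return tf.data.generator(async function* () {
    // Stream documents one by one through the cursor instead of calling toArray()
    const cursor = client.db('mydb').collection('myCollection').find();
    for await (const doc of cursor) {
      yield { xs: tf.tensor1d(doc.metrics), ys: tf.tensor1d([doc.result]) };
    }
  }).batch(64).prefetch(2);
}

// const dataset = buildDataset(client);
// await model.fitDataset(dataset, { epochs: 10 });
```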
