
Learning data too big to fit in memory at once, how to learn? #7801

Open
Sir-hennihau opened this issue Jul 4, 2023 · 10 comments

Comments

Sir-hennihau commented Jul 4, 2023

My dataset has become too large to fit in memory at once in TensorFlow.js. What are good solutions for learning from all data entries? My data comes from a MongoDB instance and needs to be loaded asynchronously.

I tried to play with generator functions, but couldn't get async generators to work yet. I was also wondering whether fitting the model to the data in batches would be possible.

It would be great if someone could provide me with a minimal example of how to fit on data that is loaded asynchronously, either in batches or through a database cursor.

For example, when trying to yield promises from the generator, I get a TypeScript error.

    const generate = function* () {
        yield new Promise(() => {});
    };

    tf.data.generator(generate);

Argument of type '() => Generator<Promise<unknown>, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

Also, you can't use async generators. This is the error you get if you try:

tf.data.generator(async function* () {})

throws

Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

gaikwadrahul8 commented Jul 5, 2023

Hi, @Sir-hennihau

Thank you for bringing this issue to our attention. As far as I know, you can use tf.data.generator or tf.data.Dataset with either .batch or .prefetch; please also refer to this answer from Stack Overflow. Could you give it a try and let us know whether it resolves your issue?

If the issue still persists, please let us know. Thank you!
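For reference, a minimal sketch of that suggestion, assuming the data can be produced by a synchronous generator (the loop bound, shapes, and xs/ys fields are placeholders). Note that this does not by itself address the async-loading part of the question:

```ts
import * as tf from '@tensorflow/tfjs-node';

// Synchronous generator: the public typings of tf.data.generator only accept
// a sync iterator, so this sketch produces data in memory.
function* dataGenerator() {
  for (let i = 0; i < 1000; i++) {
    // xs/ys names and shapes are illustrative only
    yield { xs: tf.tensor1d([i]), ys: tf.tensor1d([i % 2]) };
  }
}

// batch() and prefetch() keep only a small window of data materialized at a time
const dataset = tf.data.generator(dataGenerator).batch(64).prefetch(2);

// await model.fitDataset(dataset, { epochs: 10 });
```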

Sir-hennihau commented Jul 5, 2023

Hey @gaikwadrahul8,
I tried to play around with the functions you suggested but couldn't get them to work yet, unfortunately.

The code snippet from Stack Overflow results in a TypeScript error, because it says that async generators are not assignable. I tried to play around a bit with // @ts-ignore, but couldn't get it to work yet. I also can't find an example, in the documentation or elsewhere online, where the dataset is populated by loading data from the network using async/await.

Just for completeness, the snippet from Stack Overflow

const dataset = tf.data.generator(async function* () {
    const dataToDownload = await fetch(/* ... */);
    while (/* ... */) {
        const moreData = await fetch(/* ... */);
        yield moreData;
    }
});

throws
Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

At that point, I don't even know whether the implementation or the typings are wrong.

Can you maybe take a look and try to produce a minimal working example where the dataset uses data loaded via async/await from some remote source? It would be highly appreciated to move this problem forward.

Sir-hennihau commented Dec 5, 2023

Bump, I still haven't found a satisfying solution to this problem yet :s
@mattsoulanille :D

@Antony-Lester

@Sir-hennihau I have just started facing the same issue
(linking MongoDB to TensorFlow with batched data using TypeScript and tfjs-node-gpu),
as I have started hitting the V8 heap limit.

If I find a solution/workaround I will share it within a week or so.

@Sir-hennihau

@Antony-Lester any news?

Antony-Lester commented Feb 8, 2024

@Sir-hennihau The only working solution I have found so far is to move to incremental batch training; I can't speak to its accuracy as I don't have a baseline for comparison.

So I hold all of the validation data and one chunk's worth of training data in heap memory:

```
const { trainCount } = await countDataPoints(db)
const batchSize = Math.ceil(trainCount / 100)
const validationDataResult = await validationData(db)
const totalBatches = Math.ceil(trainCount / batchSize)
const trainDataPipelineArray = await trainDataPipeline(db)

// Train model incrementally, one chunk at a time
for (let i = 0; i < trainCount; i += batchSize) {
    // Page through the collection with $skip/$limit so only one chunk is in memory
    const batchPipeline = [...trainDataPipelineArray, { $skip: i }, { $limit: batchSize }]
    const data = await db.collection('myCollection').aggregate(batchPipeline).toArray()

    const metricsData = data.map(item => item.metrics)
    const xs = tf.tensor2d(metricsData, [metricsData.length, metricsData[0].length])
    const resultData = data.map(item => item.result)
    const ys = tf.tensor2d(resultData, [resultData.length, 1])

    await model.fit(xs, ys, {
        epochs: epochs,
        validationData: validationDataResult,
        shuffle: true,
        batchSize: 64,
        callbacks: [],
        verbose: 1,
    })

    // Free the chunk's tensors before loading the next one
    xs.dispose()
    ys.dispose()
}
```

From Copilot:
Advantages of Incremental Batch Training:

Memory Efficiency: It's more memory-efficient as it only needs to load a small batch into memory, which is beneficial when dealing with large datasets that can't fit into memory.

Speed: It can lead to faster convergence because the model parameters are updated more frequently.

Noise: The noise in the gradient estimation can sometimes help escape shallow local minima, leading to better solutions.

Real-time Learning: It allows the model to learn from new data on-the-go without retraining from scratch.

Disadvantages of Incremental Batch Training:

Less Accurate Gradient Estimation: The gradient estimation can be less accurate because it's based on fewer examples.

Hyperparameter Sensitivity: It's more sensitive to the choice of learning rate and batch size.

Less Stable: The cost function is not guaranteed to decrease every step, and the final parameters can depend on the initial parameters (i.e., the solution can be non-deterministic).

@Sir-hennihau

Thanks @Antony-Lester. In the meantime I went with first converting my data to a CSV file and then using the CSV learning methods from tfjs. It's a shame that this seems to be needed, but the CSV learning path seems to be nicely implemented. On very large datasets this is very storage-inefficient, though.
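For anyone else landing here, a rough sketch of that CSV route, assuming the data has been exported to a local file and that `result` is the label column (the path and column names are placeholders):

```ts
import * as tf from '@tensorflow/tfjs-node';

// tf.data.csv streams the file instead of loading it into memory at once
const csvDataset = tf.data
  .csv('file://./data/train.csv', {
    columnConfigs: { result: { isLabel: true } },
  })
  // Flatten the per-row feature/label objects into arrays of numbers
  .map(({ xs, ys }: any) => ({ xs: Object.values(xs), ys: Object.values(ys) }))
  .batch(64);

// await model.fitDataset(csvDataset, { epochs: 10 });
```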

@Sir-hennihau

It would anyway be nice to get an answer from the maintainers on how to solve the issue without using workarounds like converting the data to a CSV file first.

@Antony-Lester

In the end, I spawned off Python scripts that trained the model while watching the scripts' console output. Not ideal, but I can use the whole memory now.

tharvik commented Oct 4, 2024

You can in fact simply ts-ignore the async generator; it is supported internally.

  // @ts-expect-error
  tf.data.generator(async function* () {  });

Opened #8408 to expose it.
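For completeness, a sketch of how that workaround could be combined with a MongoDB cursor; the database/collection/field names and tensor shapes are assumptions, not tested against this issue's data:

```ts
import * as tf from '@tensorflow/tfjs-node';
import { MongoClient } from 'mongodb';

function buildDataset(client: MongoClient) {
  // @ts-expect-error async generators are supported internally but not by the typings
  return tf.data.generator(async function* () {
    // Stream documents one by one through the cursor instead of calling toArray()
    const cursor = client.db('mydb').collection('myCollection').find();
    for await (const doc of cursor) {
      yield { xs: tf.tensor1d(doc.metrics), ys: tf.tensor1d([doc.result]) };
    }
  }).batch(64).prefetch(2);
}

// const dataset = buildDataset(client);
// await model.fitDataset(dataset, { epochs: 10 });
```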
