
Execution controller OOMs when large asset is used in job #3595

Closed
godber opened this issue Apr 18, 2024 · 3 comments

godber (Member) commented Apr 18, 2024

We have been doing further testing of the S3-backed asset store, and we recently tested with an internal asset that was 60MB zipped. Unzipped, the asset had the following composition:

.
├── [ 192]  __static_assets
│   ├── [ 22M]  data1.json.gz
│   ├── [7.1K]  data2.txt
│   ├── [1.2K]  data3.json
│   └── [ 37M]  data4.json.gz
├── [ 200]  asset.json
└── [7.5M]  index.js

It should be sufficient to create a mock asset with roughly the same characteristics and get a job to start up with it; a sketch of one way to build such a mock asset follows. The execution controller should then OOM when run in k8s with the default memory limit of 512MB. We tested increasing the memory limit (to 6GB), and the execution controller did not OOM.
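
One way to build the mock (a minimal sketch, not part of Teraslice; the script name, file sizes, and use of random data are all assumptions): start from any small, working asset directory and pad its __static_assets folder until the zipped bundle lands near 60MB. Random bytes are effectively incompressible, so the zip stays roughly the same size as the padded files themselves.

    // pad-asset.ts — hypothetical helper script, e.g. `ts-node pad-asset.ts <asset-dir>`
    import { createWriteStream, mkdirSync } from 'node:fs';
    import { randomBytes } from 'node:crypto';
    import { once } from 'node:events';
    import path from 'node:path';

    const MB = 1024 * 1024;

    // Write `sizeBytes` of random data in 1MB chunks, honoring backpressure
    // so the script itself keeps a flat memory profile.
    async function writeRandomFile(dest: string, sizeBytes: number): Promise<void> {
        const out = createWriteStream(dest);
        for (let left = sizeBytes; left > 0; left -= MB) {
            if (!out.write(randomBytes(Math.min(MB, left)))) {
                await once(out, 'drain');
            }
        }
        out.end();
        await once(out, 'finish');
    }

    async function main(): Promise<void> {
        const assetDir = process.argv[2]; // path to a small, working asset directory
        const staticDir = path.join(assetDir, '__static_assets');
        mkdirSync(staticDir, { recursive: true });
        // Mirror the two large payloads from the listing above (22MB + 37MB).
        await writeRandomFile(path.join(staticDir, 'data1.json.gz'), 22 * MB);
        await writeRandomFile(path.join(staticDir, 'data4.json.gz'), 37 * MB);
    }

    main().catch((err) => { console.error(err); process.exit(1); });

Zip the resulting directory and deploy it as you would any other asset.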

Here is some of the log output:

[2024-04-17T22:41:06.009Z] DEBUG: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: getting record with id: 46f47558baf4e3f0e8f736ad5c91827a53cc4b4b from s3 minio_test1 connection, ts-assets-teraslice-tmp1 bucket. (assignment=execution_controller, module=assets_storage, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:06.533Z]  INFO: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: loading assets: a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 (assignment=execution_controller, module=asset_loader, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:06.808Z]  INFO: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: decompressing and saving asset a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 to /app/assets/a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 (assignment=execution_controller, module=asset_loader, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:10.938Z] ERROR: teraslice/7 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: Teraslice Worker shutting down due to failure! (assignment=execution_controller)
    Error: Failure to get assets, caused by exit code null
        at ChildProcess.<anonymous> (file:///app/source/packages/teraslice/dist/src/lib/workers/assets/spawn.js:45:31)
        at ChildProcess.emit (node:events:517:28)
        at maybeClose (node:internal/child_process:1098:16)
        at ChildProcess._handle.onexit (node:internal/child_process:303:5)

If necessary, I can supply the internal asset separately.

cc @busma13

godber added the bug, k8s (Applies to Teraslice in kubernetes cluster mode only.), and pkg/teraslice labels on Apr 18, 2024
godber (Member, Author) commented Apr 18, 2024

After further discussions with Peter and Joseph, there are a number of other things that limit asset size:

  • Of course in ES there are a number of limits that apply to the binary field and response sizes
  • When building an asset with earl, there is a limit at Node's internal buffer size (2GB?)

It's possible that our choice of zip archives for assets makes them not streamable ... so we might be a bit stuck there too.

Regardless, we are, at the very least, going to look at reducing overall memory usage during the asset load process to increase the asset size we can support.
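
To make the memory-usage point concrete, here is a minimal sketch of streaming an S3 object straight to disk rather than buffering it, using the AWS SDK v3 client. This is illustrative only, not Teraslice's actual s3 backend code; the endpoint, credentials, and names are assumptions for a local MinIO setup:

    import { createWriteStream } from 'node:fs';
    import { pipeline } from 'node:stream/promises';
    import type { Readable } from 'node:stream';
    import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

    const client = new S3Client({
        region: 'us-east-1',                // hypothetical; MinIO largely ignores region
        endpoint: 'http://localhost:9000',  // hypothetical local MinIO endpoint
        forcePathStyle: true,
        credentials: { accessKeyId: 'minioadmin', secretAccessKey: 'minioadmin' }, // hypothetical
    });

    // Stream the object to disk: only a small internal buffer is held in
    // memory at any moment, regardless of how large the asset is.
    async function downloadAsset(bucket: string, key: string, dest: string): Promise<void> {
        const res = await client.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
        await pipeline(res.Body as Readable, createWriteStream(dest));
    }

    // The buffering alternative holds the entire object in memory at once
    // (plus any decompressed copy), which is what blows a 512MB limit:
    // const body = Buffer.from(await res.Body.transformToByteArray());

Note that zip's central directory sits at the end of the archive, which is the streamability concern above: extraction needs random access to the file. Streaming the download to a temp file first sidesteps that while keeping memory flat.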

sotojn (Contributor) commented Apr 23, 2024

Steps to recreate this issue locally:

  1. Mock up a local teraslice in kubernetes by running yarn k8s:minio --asset-storage='s3'.

  2. Upload said 60MB zipped asset using earl, or add the zipped asset to the autoload folder to skip this step.

  3. Create and register a job that uses said 60MB asset, then start the job (a sketch of such a job follows this list).

  4. Run kubectl get pods -n ts-dev1 to view all the running pods in the namespace.

  5. The pod whose name starts with ts-exc should be seen restarting, with an OOMKilled status.
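
For step 3, a job along these lines should exercise the failure path. This is a hypothetical config inferred from the datagen-100m-noop job name in the logs above; the asset name large-asset and the exact operations are assumptions:

    {
        "name": "datagen-100m-noop",
        "lifecycle": "once",
        "workers": 1,
        "assets": ["large-asset", "standard"],
        "operations": [
            { "_op": "data_generator", "size": 100000000 },
            { "_op": "noop" }
        ]
    }

Since the OOM happens while the execution controller loads the asset, the operations never actually run; any minimal pipeline that references the large asset should reproduce the crash.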

godber pushed a commit that referenced this issue Apr 30, 2024
This PR makes the following changes:

- Improves the s3 backend get() requests to grab assets in a more memory-efficient way
- This resolves an issue where pulling and decompressing large assets from s3 would cause an OOM on the execution controller at job startup
- Adds an error message, shown when the asset loader closes with an error, that advises what to do in the case of an OOM

Ref to issue #3595
godber (Member, Author) commented Apr 30, 2024

The changes in #3598 are sufficient to resolve this issue.

godber closed this as completed on Apr 30, 2024