Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs. #28246

Closed
1 of 15 tasks
tvalentyn opened this issue Aug 31, 2023 · 29 comments · Fixed by #29255
Closed
1 of 15 tasks

[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs. #28246

tvalentyn opened this issue Aug 31, 2023 · 29 comments · Fixed by #29255
Labels
bug done & done Issue has been reviewed after it was closed for verification, followups, etc. P2 python

Comments

@tvalentyn
Copy link
Contributor

tvalentyn commented Aug 31, 2023

What happened?

We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to protobuf==4.x.x. We rootcaused this leak to protocolbuffers/protobuf#14571 and it has been remediated in Beam 2.52.0.

[update: 2023-12-19]: Due to another issue related to protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0 or switch to Beam 2.53.0 once available.

Mitigation

Until Beam 2.52.0 is released, consider any of the following workarounds:

  • Use apache-beam==2.46.0 or below.

  • Install protobuf 3.x in the submission and runtime environment. For example, you can use a --requirements_file pipeline option with a file that includes:

    protobuf==3.20.3
    grpcio-status==1.48.2
    

    For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

  • Use a python implementation of protobuf by setting a PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python environment variable in the runtime environment. This might degrade the performance since python implementation is less efficient. For example, you could create a custom Beam SDK container from a Dockerfile that looks like the following:

    FROM apache/beam_python3.10_sdk:2.47.0
    ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
    

    For more information, see: https://beam.apache.org/documentation/runtime/environments/

  • Install protobuf==4.25.0 or newer in the submission and runtime environment.

Users of Beam 2.50.0 SDK should additionally follow mitigation options for #28318.

Additional details

The leak can be reproduced by a pipeline:

  with beam.Pipeline(options=pipeline_options) as p:
    # duplicate reads to increase throughput
    inputs = []
    for i in range(32):
      inputs.append(
          p | f"Read pubsub{i}" >> ReadFromPubSub(topic='projects/pubsub-public-data/topics/taxirides-realtime', with_attributes=True)
      )

    inputs | beam.Flatten()

Dataflow pipeline options for the above pipeline: --max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32

The leak was triggered by Beam switching default protobuf package version from 3.19.x to 4.22.x in #24599. The new versions of protobuf also switched the default protobuf implemetation to a upb implementation. The upb implementation had two known leaks that have since been mitigated by protobuf team in: protocolbuffers/protobuf#10088, https://github.com/protocolbuffers/upb/issues/1243 . The latest available protobuf==4.24.4 does not yet have the fix, but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn
Copy link
Contributor Author

tvalentyn commented Sep 5, 2023

Users of Beam 2.50.0 SDK should additionally follow mitigation options for #28318. (also mentioned in the description).

@tvalentyn
Copy link
Contributor Author

Remaining work: updgrade protobuf lower bound once their fixes are released.

@chleech
Copy link

chleech commented Sep 7, 2023

I'm trying to install protobuf version 4.24.3 which contains the fixed based on

Added malloc_trim() calls to Python allocator so RSS will decrease when memory is freed (https://github.com/protocolbuffers/upb/commit/b0f5d5d94d9faafed2ab0fcaa9396cb4a984a2c1)

However, apache beam 2.50.0 depends on protobuf (>=3.20.3,<4.24.0). Is this comment meant to address that?


Just looked at the PR in detail. Will there be a patch released to include that change or is only going to get released in 2.51.0? If so, when is 2.51.0 going to be released?

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Sep 7, 2023

However, apache beam 2.50.0 depends on protobuf (>=3.20.3,<4.24.0). Is this comment meant to address that?

you should be able to force-install and use the newer version of protobuf without adverse effects in this case, even though it doens't fit the restriction.

Beam community produces a release roughly every 6 weeks.

re comment: I was hoping to have a restriction protobuf>=4.24.3, but it is a bit more involved.

@tvalentyn
Copy link
Contributor Author

@chleech note that you need to install the new version of protobuf also in the runtime environment.

@chleech
Copy link

chleech commented Sep 8, 2023

@chleech note that you need to install the new version of protobuf also in the runtime environment.

Got it thank you! Actually, is it possible to only install it in the runtime env and not the build time one?

@chleech
Copy link

chleech commented Sep 8, 2023

I'm getting dependency conflict when trying to build a runtime image with

FROM apache/beam_python3.9_sdk:2.50.0

ARG WORKDIR=/dataflow/container
ARG TEMPLATE_NAME=none
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

RUN pip install --no-cache-dir --upgrade pip \
  && pip install --no-cache-dir poetry \
  && pip check

It's failing with

tensorflow 2.13.0 has requirement typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.7.1.

is this a known issue?

@tvalentyn
Copy link
Contributor Author

Unfortunately I am still seeing a leak on protobuf==4.24.3, asked in protocolbuffers/protobuf#10088 (comment) .

@tvalentyn
Copy link
Contributor Author

I'm getting dependency conflict when trying to build a runtime image with
...
is this a known issue?

Must be side effects from poetry installation. See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies for tips on using constraint files that might help.

docker run --rm -it --entrypoint=/bin/bash apache/beam_python3.9_sdk:2.50.0
root@710cb268df2a:/# pip check
No broken requirements found.

@chleech
Copy link

chleech commented Sep 11, 2023

Unfortunately I am still seeing a leak on protobuf==4.24.3, asked in protocolbuffers/protobuf#10088 (comment) .

Same here! I let the pipeline run for 3 days and still got this plot. Am glad that I am not the only one.

My set up is

  • beam 2.50.0
  • RUN pip install --force-reinstall -v "protobuf==4.24.3"
image

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Sep 12, 2023

@chleech This is actively investigated. I encourage you to try other mitigations above in the meantime.

@chleech
Copy link

chleech commented Sep 12, 2023

were you able to get this to work with beam 2.48.0? the last I tried it didn't change anything

FROM apache/beam_python3.10_sdk:2.47.0
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Sep 12, 2023

were you able to get this to work with beam 2.48.0? the last I tried it didn't change anything

Yes, I tried that couple of times, it has an effect in the pipelines I run. The memory growth decreases significantly. Make sure you are specifying the custom image via --sdk_container_image option.

@kennknowles
Copy link
Member

What is the status of this? The release branch is cut but we can cherry pick a fix if it would otherwise make 2.51.0 unusable.

@tvalentyn
Copy link
Contributor Author

It's not yet fixed and we don't have a cherry-pick yet unfortunately, it will likely carry over to 2.51.0

@kennknowles
Copy link
Member

You mean 2.52.0?

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Sep 21, 2023

i meant the leak might carry over to 2.51.0 unless i find a fix before the release and CP it.

@chleech
Copy link

chleech commented Sep 26, 2023

hey @tvalentyn - any luck fixing the mem issue in 2.51.0?

@kennknowles kennknowles removed this from the 2.51.0 Release milestone Oct 3, 2023
@kennknowles kennknowles added this to the 2.52.0 Release milestone Oct 3, 2023
@tvalentyn
Copy link
Contributor Author

tvalentyn commented Oct 11, 2023

2.51.0 does not have the fix yet. I can confirm with fairly high confidence that memory is leaking during execution metrics collection in

for transform_id, op in self.ops.items():
tag_to_pcollection_id = self.process_bundle_descriptor.transforms[
transform_id].outputs
all_monitoring_infos_dict.update(
op.monitoring_infos(transform_id, dict(tag_to_pcollection_id)))
; I should have more info soon.

@chleech
Copy link

chleech commented Oct 11, 2023

@tvalentyn that sounds really promising. thank you for your hard work, can't wait!

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Oct 13, 2023

The memory appears to be lost when creating references here: https://github.com/apache/beam/blob/47d0fd566f86aaad35d26709c52ee555381823a4/sdks/python/apache_beam/runners/worker/bundle_processor.py#L1189C1-L1190C32 , even if we don't collect any metrics later.

Filed protocolbuffers/protobuf#14571 with a repro for protobuf folks to take a further look.

@chleech
Copy link

chleech commented Nov 2, 2023

@tvalentyn which version of apache beam should we use to get the mem fix?

@damccorm
Copy link
Contributor

damccorm commented Nov 2, 2023

It will be in version 2.52.0, which should be released in the next few weeks.

@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Nov 7, 2023
@tvalentyn
Copy link
Contributor Author

tvalentyn commented Jan 30, 2024

I'd like to add more info about the investigation process for future reference.

Edit: see also: https://cwiki.apache.org/confluence/display/BEAM/Investigating+Memory+Leaks

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Jan 30, 2024

Initially, I inspected whether leaking memory is occupied by objects allocated on the Python heap. It was not the case, but there are couple of ways how the heap could be inspected:

  • Pass the --experiments=enable_heap_dump option. Then, heap dumps will be appended to the SDK status responses, which SDK can provide to the runner. Dataflow workers serve the SDK status page on localhost:8081/sdk_status, and it can be queried via: gcloud compute ssh --zone "xx-somezone-z" "some-dataflow-gce-worker-01300848-wqox-harness-bvf7" --project "some-project-id" --command "curl localhost:8081/sdk_status" .

  • The per-workitem heap profiling options could be used to inspect the objects that are left in the heap after a bundle execution.

    parser.add_argument(
    '--profile_memory',
    action='store_true',
    help='Enable work item heap profiling.')
    parser.add_argument(
    '--profile_location',
    default=None,
    help='path for saving profiler data.')
    parser.add_argument(
    '--profile_sample_rate',
    type=float,
    default=1.0,
    help='A number between 0 and 1 indicating the ratio '
    'of bundles that should be profiled.')
    .

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Jan 30, 2024

Then, the suspicion was that the leak might happen when C/C++ memory allocations are not released. Such leak could be caused by Python extensions used by Beam SDK or its dependencies. Such leaks might not be visible when inspecting objects that live in the Python interpreter heap, but might be visible when inspecting allocations performed by the Python process using a memory profiler that tracks memory allocations.

I experimented with substituting the memory allocator library to tcmalloc. It helped to confirm the presence of the leak and attribute it to _upb_Arena_SlowMalloc call, but it wasn't very helpful to pinpoint the source of the leak in the Python portion of apache_beam / pipeline code. Posting for reference, but I'd probably use memray first (see below), if I have to do a similar analysis again.

Substituting the allocator can be done in a custom container. A Dockerfile for a custom container that substitutes memory allocator might look like the following:

FROM apache/beam_python3.10_sdk:2.53.0
RUN apt update ; apt install -y google-perftools
# Note: this enables TCMalloc globally for all applications running in the container 
ENV LD_PRELOAD /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
ENV HEAPPROFILE /tmp/profile

Analyzing the profile needs to happen in the same or identical environment where the profiled binary runs, and have access to symbols from shared libraries used by the profiled binary. To access Dataflow worker environment, one can SSH to the VM and run commands in the running docker container.

gcloud compute ssh --zone "xx-somezone-z" "some-dataflow-gce-worker-..." --project "some-project-id"

$ docker ps -a | grep python
# some_container_id   us-central1-artifactregistry....             
$ docker exec -it some_container_id /bin/bash
$ google-pprof /usr/local/bin/python profile.0740.heap --base=profile.0100.heap
Using local file /usr/local/bin/python.
Using local file profile.0740.heap.
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 37.2 MB
    36.7  98.6%  98.6%     36.7  98.6% _upb_Arena_SlowMalloc
     0.4   1.0%  99.6%      0.4   1.0% _PyMem_RawMalloc (inline)
     0.1   0.3% 100.0%      0.1   0.3% std::__shared_count::__shared_count
     0.0   0.1% 100.0%      0.0   0.1% list_resize.part.0
...

$ google-pprof --inuse_objects  /usr/local/bin/python profile.0740.heap --base=profile.0100.heap
(pprof) top
Total: 556 objects
     333  59.9%  59.9%      333  59.9% PyThread_allocate_lock.localalias
      97  17.4%  77.3%       97  17.4% _PyMem_RawMalloc (inline)
      49   8.8%  86.2%       49   8.8% _upb_Arena_SlowMalloc
      27   4.9%  91.0%       43   7.7% _PyObject_GC_Resize.localalias
      23   4.1%  95.1%       23   4.1% upb_Arena_InitSlow
...

For information on analyzing heap dumps collected with tcmalloc, see: https://gperftools.github.io/gperftools/heapprofile.html

@tvalentyn
Copy link
Contributor Author

I tried several other profilers and had most success with memray: (https://pypi.org/project/memray/).

  • In my experience, Python profilers are most effective when the Python program that leaks memory is launched by the profiler as opposed to attaching the profiler at runtime, after the process has already started.
  • It is best if the profiler can collect and output memory allocation statistics while the process is still running. Some tools only output the collected data after the process under investigation terminates, which made it more complicated to use such tools to profile a Beam pipeline.

Instrumenting Beam SDK container to use memray required changing Beam container and its entrypoint:

  1. Install memray in the container and launch SDK harness from memray: https://github.com/apache/beam/pull/30151/files.

  2. To rebuild the Beam SDK container once can use following command (from a checked out copy of Beam Repo):

./gradlew :sdks:python:container:py310:docker

However, rebuilding container from scratch is a bit slow, and to reduce the feedback loop, we can rebuild only the boot entrypoint and include updated entrypoint in a preexisting image. For example:

```
# From beam repo root, make changes to boot.go.	
your_editor sdks/python/container/boot.go 

# Rebuild the entrypoint
./gradlew :sdks:python:container:gobuild

cd sdks/python/container/build/target/launcher/linux_amd64

# Create a simple Dockerfile to install memray and use the custom entrypoint

cat >Dockerfile <<EOF
FROM apache/beam_python3.10_sdk:2.53.0
RUN pip install memray
COPY boot /opt/apache/beam/boot_modified
ENTRYPOINT ["/opt/apache/beam/boot_modified"]
EOF

# Build the image
docker build . --tag gcr.io/my-project/custom-image:tag
docker push gcr.io/my-project/custom-image:tag](http://gcr.io/my-project/custom-image:tag  
```
  1. Run a pipeline with the custom image.

@tvalentyn
Copy link
Contributor Author

tvalentyn commented Jan 30, 2024

Retrieving the profile and creating a report required SSHing to the running worker, and creating a memray report in the running SDK harness container :

$ gcloud compute --project "project-id" ssh --zone "us-west1-b" pipeline-0rc0-atta-11150311-ej0v-harness-tmj9 

# Find Python SDK container ID
$ docker ps -a
$ docker ps -a | grep python
# 8a643a4638ae   us-central1-artifactregistry....             
$ docker exec -it 8a643a4638ae /bin/bash

# locate profiler output, create reports, copy them it to GCS.
memray table --leak output.bin -o table.html --force
gsutil cp table.html gs://some-bucket/

I found the Memray Table reporter most convenient during my debugging, but other reporters can also be useful.

As a reminder, creating a report from a profile needs to happen in the same or identical environment where the profile was created. The identical environment might be a container started from the same image, but I created my reports on the running worker.

It should be possible to simplify the process of collecting and analyzing profiles, and we'll track improvements in #20298.

@tvalentyn
Copy link
Contributor Author

The table reporter attributed most of the leaked usage to a line in bundle_processor.py. With that info, I reproduced the leak in DirectRunner in a much simpler pipeline, and a very simple setup:

Given a test_pipeline.py:

import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as p:
  p | beam.Create([1]) | beam.Map(lambda x: x+1)

Run:

pip install apache-beam==2.47.0 memray==1.11.0
memray run  -o output.bin --force test_pipeline.py --direct_runner_bundle_repeat=10000
memray table --leak output.bin -o table.html --force

The leak is visible in table.html, after double-sorting the table by Size column, and is increasing with an increase in number of iteration. The leak was later attributed to a regression protobuf in protocolbuffers/protobuf#14571 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug done & done Issue has been reviewed after it was closed for verification, followups, etc. P2 python
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants