Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry module #9016

Closed
esigo opened this issue Sep 3, 2022 · 49 comments
Closed

OpenTelemetry module #9016

esigo opened this issue Sep 3, 2022 · 49 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@esigo
Copy link
Member

esigo commented Sep 3, 2022

This issue is for tracking the steps needed to be taken for using OpenTelemetry.

Currently, the OpenTelemetry compilation is done in #8585, but there is still one issue for loading the module (the init container copies the needed files to a wrong path).

Here is the list of steps:

  1. implement loading the image with the controller (fix the init container path).
  2. integration
  3. user-interface.
  4. remove currently integrated bits of opentracing, jaeger, zipkin etc.

Once all done, the user will be able to send telemetry data to any backend that receives otlp-gRPC directly, or any backend through otel-collector:

graph TD
subgraph Node
   nginx
end
nginx["nginx module"]-->|otlp-gRPC| backend["backend"]
Loading

Or:

graph TD
subgraph Node
   nginx
end
nginx["nginx module"] -->|otlp-gRPC| OTEL-collector{{"Otel Collector"}} -->backend{{"backend:<br>Zipkin<br>Jaeger <br>"}}
Loading
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Sep 3, 2022
@esigo
Copy link
Member Author

esigo commented Sep 3, 2022

Please assign the issue to me.

@longwuyuan
Copy link
Contributor

You can just type & post /assign

@longwuyuan
Copy link
Contributor

And we need to add collector code and collector compilation to the Otel bits in Makefile, I am guessing. Or is that pluggable.

@longwuyuan
Copy link
Contributor

And remove currently integrated bits of opentracing, jaeger, zipkin etc.

@esigo
Copy link
Member Author

esigo commented Sep 3, 2022

/assign

@esigo
Copy link
Member Author

esigo commented Sep 3, 2022

And we need to add collector code and collector compilation to the Otel bits in Makefile, I am guessing. Or is that pluggable.

No, We don't need to change the Makefile. Everything is in init image.

@longwuyuan
Copy link
Contributor

Will wait for your PR. You can begin the title with string "WIP" to indicate work-in-progress if you like. Looking forward to see your design

@longwuyuan
Copy link
Contributor

longwuyuan commented Sep 5, 2022

@esigo since that PR merrged, we will get image with LD_LIB_PATH. That is good for loading. But can you write here or in that PR about the collector. Is it already compiled and ready for use with the current build ? If yes, is that also under the same /mount....... path ?

@esigo
Copy link
Member Author

esigo commented Sep 5, 2022

@strongjz
Copy link
Member

strongjz commented Sep 6, 2022

/triage accepted
/priority backlog
/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-priority needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 6, 2022
@esigo esigo mentioned this issue Sep 6, 2022
11 tasks
@longwuyuan
Copy link
Contributor

longwuyuan commented Sep 8, 2022

@esigo Please test this image after path fix

image

docker pull gcr.io/k8s-staging-ingress-nginx/opentelemetry@sha256:aa079daa7efd93aa830e26483a49a6343354518360929494bad1d0ad3303142e
OR
docker pull gcr.io/k8s-staging-ingress-nginx/opentelemetry:v20220906-controller-v1.3.1-3-g981ce38a7

Based on your update, we can promote it or do other needful

@kuzaxak
Copy link

kuzaxak commented Sep 8, 2022

gcr.io/k8s-staging-ingress-nginx/opentelemetry:v20220906-controller-v1.3.1-3-g981ce38a7 has the same error, can't load grpc module when enabling load_module /modules_mount/etc/nginx/modules/otel/otel_ngx_module.so

@esigo
Copy link
Member Author

esigo commented Sep 8, 2022

Hi @longwuyuan,
Thanks, it works.
Would it be possible to show me how you promote the image? Thanks

nginx.conf:
image

opentelemetry image:
image

file on disk:
image

log:
image

@kuzaxak
Copy link

kuzaxak commented Sep 8, 2022

Screen Shot 2022-09-08 at 19 37 58

Screen Shot 2022-09-08 at 19 36 51

Strange, I have exactly the same problem.

I built my own image using this dockerfile

@longwuyuan
Copy link
Contributor

longwuyuan commented Sep 8, 2022

@kuzaxak you are using image built on 6th. The new image s built on 7th.

EDIT: sorry, the comment above from me is wrong. You are using the latest. And I think esigo mentioned line 76 problem in your environ.

@esigo
Copy link
Member Author

esigo commented Sep 8, 2022

@kuzaxak your line 76 doesn't copy all the files that are needed.

@bixu
Copy link

bixu commented Dec 23, 2022

@esigo, wanted to express my appreciation for your work here. Very nice. I have a team at $job that would be happy to help test this in live environments if you have any pre-release artifacts to share.

@esigo
Copy link
Member Author

esigo commented Jan 1, 2023

pending fix on open-telemetry/opentelemetry-cpp-contrib#252 and #9472

@esigo
Copy link
Member Author

esigo commented Mar 22, 2023

/close
as completed with #9062

@agronholm
Copy link

agronholm commented Mar 27, 2023

Was this supposed to work on v1.7.0? Doesn't for me:

Error: exit status 1
2023/03/27 07:02:57 [emerg] 41#41: dlopen() "/modules_mount/etc/nginx/modules/otel/otel_ngx_module.so" failed (Error loading shared library /modules_mount/etc/nginx/modules/otel/otel_ngx_module.so: No such file or directory) in /tmp/nginx/nginx-cfg3874777628:10
nginx: [emerg] dlopen() "/modules_mount/etc/nginx/modules/otel/otel_ngx_module.so" failed (Error loading shared library /modules_mount/etc/nginx/modules/otel/otel_ngx_module.so: No such file or directory) in /tmp/nginx/nginx-cfg3874777628:10
nginx: configuration file /tmp/nginx/nginx-cfg3874777628 test failed

One more thing: does this require HTTPS? I don't see any option to toggle between HTTPS and HTTP.
EDIT: Looks like TLS is off by default, there is just no way to toggle that with the Helm chart settings.

@fcuello-fudo
Copy link
Contributor

@agronholm probably you are using a daemonset instead of a deployment, I had the same issue, and found that the helm template for the daemonset is missing some settings compared to the deployment.

I have a local fix and I'm preparing a pull request but I need to sign CLA, etc.

@agronholm
Copy link

No, I'm using deployment.

@fcuello-fudo
Copy link
Contributor

@agronholm , did you add:

controller:
  opentelemetry:
    enabled: true

in addition to:

controller:
  config:
    enable-opentelemetry: "true"

?

@agronholm
Copy link

I didn't. I tried that and my controller starts up now. Is this documented somewhere?

@fcuello-fudo
Copy link
Contributor

fcuello-fudo commented Mar 27, 2023

@agronholm , no, I also found that by debugging the chart. There are also many mistakes in the documentation :(

@agronholm
Copy link

Okay. Thanks for the info! What mistakes did you find in the documentation?

@esigo esigo moved this from In Progress to Done in [SIG Network] Ingress NGINX Mar 27, 2023
@hargut
Copy link

hargut commented Mar 30, 2023

@fcuello-fudo
Thank you for the hint & fix for the DaemonSet issue, I was able to successfully use the patched version from your repository. 👍

@fcuello-fudo
Copy link
Contributor

Ref.: #9792

@eshizhan
Copy link

How can I enable opentelemetry without helm, only using kubectl apply command with some yaml manifests. I can not find any way to do this, is support now?

@premjha9625
Copy link

Hi , i want to export ingress controller traces to azure app insights. how can i achieve that using the open telemetry provided by ingress controller. any solution ??

@sereinity
Copy link

Hello,

I wanted to give it a try, and I got a huge CPU usage issue, every core are saturated by nginx. I simplified my setup to reduce external causes, I deleted ALL my ingress objects, and deploy the helm chart with this values.yaml:

controller:
  config:
    enable-opentelemetry: "true"
    zipkin-collector-host: otel-collector.opentelemetry.svc
  opentelemetry:
    enabled: true
  replicaCount: 2
  service:
    loadBalancerIP: 1.1.1.1 # a public IP

Yes, otel is

And there is what a get from a top in one of the pod (the other looks the same):

Mem: 6941004K used, 25938868K free, 55596K shrd, 275676K buff, 4664064K cached
CPU0:  83% usr  16% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU1: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU2: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU3: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU4: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU5: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU6: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
CPU7: 100% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 10.13 10.66 11.38 14/1028 357
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   75    26 www-data S     154m   0%   1  13% nginx: cache manager process
   30    26 www-data S     168m   0%   0  13% nginx: worker process
   39    26 www-data S     168m   0%   5  13% nginx: worker process
   35    26 www-data S     168m   0%   7  13% nginx: worker process
   32    26 www-data S     168m   0%   3  12% nginx: worker process
   33    26 www-data S     168m   0%   1  11% nginx: worker process
   36    26 www-data S     168m   0%   3   9% nginx: worker process
   34    26 www-data S     168m   0%   1   8% nginx: worker process
   31    26 www-data S     168m   0%   4   8% nginx: worker process
    7     1 www-data S     737m   2%   1   0% /nginx-ingress-controller --publish-service=ingress/ingress-nginx-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress/ingress-nginx-controller --validat
  346     0 www-data S     257m   1%   3   0% top
  352     0 www-data R     257m   1%   1   0% top
   26     7 www-data S     156m   0%   1   0% nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
    1     0 www-data S      216   0%   7   0% /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=ingress/ingress-nginx-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress/ingress-ngin

If I comment enable-opentelemetry: "true" I get a normal CPU usage:

Mem: 7518752K used, 25361120K free, 94200K shrd, 285928K buff, 5047600K cached
CPU0:  40% usr  20% sys   0% nic  40% idle   0% io   0% irq   0% sirq
CPU1:   0% usr  16% sys   0% nic  83% idle   0% io   0% irq   0% sirq
CPU2:  20% usr   0% sys   0% nic  80% idle   0% io   0% irq   0% sirq
CPU3:  20% usr   0% sys   0% nic  80% idle   0% io   0% irq   0% sirq
CPU4:   0% usr  20% sys   0% nic  80% idle   0% io   0% irq   0% sirq
CPU5:   0% usr   0% sys   0% nic 100% idle   0% io   0% irq   0% sirq
CPU6:   0% usr   0% sys   0% nic 100% idle   0% io   0% irq   0% sirq
CPU7:   0% usr   0% sys   0% nic 100% idle   0% io   0% irq   0% sirq
Load average: 5.22 9.04 10.70 5/1109 633
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  367    26 www-data S     167m   0%   4   3% nginx: worker process
    7     1 www-data S     737m   2%   1   0% /nginx-ingress-controller --publish-service=ingress/ingress-nginx-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress/ingress-nginx-controller --validat
  346     0 www-data S     257m   1%   4   0% top
  628     0 www-data R     257m   1%   2   0% top
  368    26 www-data S     167m   0%   3   0% nginx: worker process
  363    26 www-data S     167m   0%   4   0% nginx: worker process
  365    26 www-data S     167m   0%   0   0% nginx: worker process
  369    26 www-data S     167m   0%   1   0% nginx: worker process
  362    26 www-data S     167m   0%   1   0% nginx: worker process
  366    26 www-data S     167m   0%   1   0% nginx: worker process
  364    26 www-data S     167m   0%   4   0% nginx: worker process
   26     7 www-data S     156m   0%   2   0% nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
  370    26 www-data S     154m   0%   3   0% nginx: cache manager process
    1     0 www-data S      216   0%   7   0% /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=ingress/ingress-nginx-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress/ingress-ngin

Despise the huge CPU usage, I get some traces (the stack is ingress + otel + a collector + tempo + grafana).

I may have miss something, does someone have an idea?

@svmaris
Copy link

svmaris commented Apr 28, 2023

@sereinity I had the same issue. This was fixed when I set otel-schedule-delay-millis to something like 5000 and otel-max-export-batch-size to 512 in the ConfigMap. The defaults are causing every span to be sent to the collector without any batching/delay, which results in huge CPU usage.

@strongjz
Copy link
Member

strongjz commented Jun 8, 2023

Otel module has been completed, I am going to close this issue.

If folks have issues with the modules, please open separate issues tickets for those. thank you.

/close

@strongjz
Copy link
Member

strongjz commented Jun 8, 2023

@sereinity @svmaris i have created this ticket #10063

@deejgregor
Copy link

deejgregor commented Jun 9, 2023

@sereinity @svmaris as you might have seen in #10063, the performance issues were fixed with defaults changes in #9978 (see ac9a507). This was released in v1.8.0 of the controller / v4.7.0 of the Helm chart.

@deejgregor
Copy link

deejgregor commented Jun 9, 2023

Hi , i want to export ingress controller traces to azure app insights. how can i achieve that using the open telemetry provided by ingress controller. any solution ??

@premjha9625 I haven't tried this, but it looks like the way to do this today is to send your ingress-nginx trace to OpenTelemetry Collector and use the Azure Monitor Exporter as mentioned here: https://learn.microsoft.com/en-us/azure/azure-monitor/faq#can-i-use-the-opentelemetry-collector-

@Angelin01
Copy link

Wonderful work! However, the documentation could be improved a bit still. For example.

# specifies the name to use for the server span
opentelemetry-operation-name

We are also given this example elsewhere: opentelemetry-operation-name: "HTTP $request_method $service_name $uri"

But what are the available variables for this? Is this information available? If so, where can I find it?

Either way, I could create another issue and even take a crack at this if someone points me in the right direction. My random code dives haven't been fruitful, unfortunately.

@longwuyuan
Copy link
Contributor

@Angelin01 maybe you can reach to @esigo on the ingress-nginx-dev channel of kubernetes.slack.com. Register at slack.k8s.io if required.

@lsunsi
Copy link

lsunsi commented Nov 8, 2023

@eshizhan Did you find out how to enable opentelemetry using the bare apply method instead of helm method? I have the controller deployed and I can't find anyway to pass this "controller: true" people are talking about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

No branches or pull requests