Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AppInsights blocks Node.JS process from exiting for hours #958

Closed
PavelBansky opened this issue May 2, 2022 · 11 comments · Fixed by #1001
Closed

AppInsights blocks Node.JS process from exiting for hours #958

PavelBansky opened this issue May 2, 2022 · 11 comments · Fixed by #1001
Assignees

Comments

@PavelBansky
Copy link

Even very simple telemetry submission code is preventing NodeJS process from exiting. Sometimes for minutes, sometimes for hours. The problem is especially noticeable on Mac OS X, where the delays are always at least a minutes.

This is being used in Azure DevOps extension, that adds build task into the pipeline. Adding minutes or hours to the build is really a big problem.

Removing AppInisghts makes the code run under a second.

  try {
      addBuildVariableTelemetry();           
      const totalTimeTaken = Math.round((Date.now() - startTime) / 1000); // get time in seconds
      addMetric("ExecutionTime", totalTimeTaken);             

      const telemetryClient = new appInsights.TelemetryClient(TELEMETRY_KEY);
      
      /* None of this helps 
      telemetryClient.config.maxBatchIntervalMs = 0;
      telemetryClient.config.maxBatchSize = 0;        
      telemetryClient.config.httpsAgent = new https.Agent();
      telemetryClient.config.enableAutoCollectPerformance = false ; 
      telemetryClient.config.enableUseAsyncHooks = false;
      */
      
      telemetryClient.trackEvent({ name: taskName, properties: EventProperties as any, measurements: Metrics as any });
      if (Traces.length > 0) {
          telemetryClient.trackTrace({ message: Traces.join("\n"), severity: appInsights.Contracts.SeverityLevel.Information, properties: EventProperties as any});
      }

      telemetryClient.flush();        
      tl.debug("Telemetry submitted");
      hasPublishedFinalTelemetry = true;
  }
  catch (error) {
      console.log("##vso[task.LogIssue type=warning;]Telemetry wrapper:", error); // don't let telemetry crash us
  }
@hectorhdzg
Copy link
Member

@PavelBansky I need more information to determine the problem, hope you can help with that, can you enable debug and warning logs to see if there are any errors coming from the SDK?, also flush have a callback parameter that will be called when completed, can you try to wait for that callback? I wonder if this is related to the way Azure DevOps run the node.js process

const telemetryClient = new appInsights.TelemetryClient();
telemetryClient.config.enableInternalDebugLogging = true;
telemetryClient.config.enableInternalWarningLogging = true;
telemetryClient.trackEvent({
    name: "TestEvent"
});
telemetryClient.flush({ callback: () => { console.log("Done flushing") } });

@patroza
Copy link

patroza commented May 6, 2022

This is real, we have a similar experience when upgrading from 2.3.1 to 2.3.2; on GitHub actions, e2e tests that spin up the http server (per test or file), which is wrapped with application insights, only the first out of 6 files completes, all tests from the other 5 fail with test timeout.
Reverting application insights (after a long and painful process of elimination) revealed the culprit.
Don't have a small repro available though.

@PavelBansky
Copy link
Author

@hectorhdzg , I have tried both. Callback and not callback the result is always the same.

function publishTelemetryEvents(isFinalTelemetry: boolean, callback: (v: string) => void, attemptNumber = 1) {
    const maxAttempts = 5;

    if (hasPublishedFinalTelemetry) {
        // do not log telemetry after we have already logged final telemetry
        // this also stops recursive calls from process.beforeExit handler
        return;
    }

    try {
        addBuildVariableTelemetry();           
        const totalTimeTaken = Math.round((Date.now() - startTime) / 1000); // get time in seconds
        addMetric("ExecutionTime", totalTimeTaken);

        const telemetryClient = new appInsights.TelemetryClient(TELEMETRY_KEY);
        telemetryClient.trackEvent({ name: taskName, properties: EventProperties as any, measurements: Metrics as any });
        if (Traces.length > 0) {
            telemetryClient.trackTrace({ message: Traces.join("\n"), severity: appInsights.Contracts.SeverityLevel.Information, properties: EventProperties as any});
        }
        telemetryClient.flush({
            callback: (response) => {
                try {
                    const parsedObject = JSON.parse(response);
                    if (parsedObject.errors.length) {
                        throw parsedObject.errors; // this will reach the outer catch and is handled the same as an unparseable response.
                    }
                } catch (parseError) {
                    if (attemptNumber < maxAttempts) {
                        // This is a failed telemetry send case. If we encounter an error, we retry, unless we reach max attempts..
                        publishTelemetryEvents(isFinalTelemetry, callback, attemptNumber + 1);
                        return; 
                    }
                    console.log("##vso[task.LogIssue type=warning;]Telemetry flush:", parseError);
                }
                hasPublishedFinalTelemetry = hasPublishedFinalTelemetry || isFinalTelemetry;
                tl.debug("Telemetry submitted");

                callback(response);
            }
        });
    }
    catch (error) {
        console.log("##vso[task.LogIssue type=warning;]Telemetry wrapper:", error); // don't let telemetry crash us
    }
}

I will try with the debugging enabled.

@PavelBansky
Copy link
Author

@hectorhdzg : I have enabled logging and see no warnings or errors on the console.

telemetryClient.config.enableInternalDebugLogging = true;
telemetryClient.config.enableInternalWarningLogging = true;

At this point we will have to drop AppInsights all together until this is resolved :(

@hectorhdzg
Copy link
Member

@PavelBansky I don't have repro of the issue and there are not logs so not much I can do here, if you can share you Azure DevOps build privately will be helpful

@antoine-coulon
Copy link

Hey @hectorhdzg,

I'm currently having the same problem here on 2.3.3 on MacOS with Node.js 16.14.0 with a very minimalist repro:

import { post } from "superagent";
import * as appInsights from "applicationinsights";

appInsights
  .setup(
    "InstrumentationKey=ENV_KEY;IngestionEndpoint=https://westeurope-5.in.applicationinsights.azure.com"
  )
  .start();

post("https://westeurope-5.in.applicationinsights.azure.com/v2.1/track")
  .send([fakeTelemetryRequestBody])
  .then(appInsights.defaultClient.flush)
  .catch(console.error);

When starting the script, the Node.js process hangs during few minutes before exiting 7 times out of 10 (from what I could observe). It happens that the process exits directly but it's clearly not consistent.

Do you have any information to share about that?

Thanks

@hectorhdzg
Copy link
Member

@antoine-coulon flush method have a callback you can pass to ensure is done; can you try that?

(async function main() {
    post("https://www.bing.com/")
        .send([{}])
        .then(async () => {
            await new Promise((resolve) => {
                appInsights.defaultClient.flush({
                    callback: (msg) => {
                        console.log(msg);
                        resolve();
                    }
                });
            });
        })
        .catch(console.error);
})();

@antoine-coulon
Copy link

@hectorhdzg I already tried it as you were already suggesting it but it doesn't work either.
The callback is invoked with {"itemsReceived":1,"itemsAccepted":1,"errors":[]} which seems expected after the trigger of the HTTP request. However, the process still hangs afterwards.

Here is an example with an even more minimalist reproduction:

appInsights.setup(
  "InstrumentationKey=KEY;IngestionEndpoint=https://westeurope-5.in.applicationinsights.azure.com"
);

appInsights.start();
appInsights.defaultClient.flush();

Do you agree on the fact that this should allow the Node.js process to exit? Even if flush is asynchronous, as there is nothing to process everything should be flushed almost directly allowing resources to be teared down

@hectorhdzg
Copy link
Member

@antoine-coulon I cannot reproduce in Windows or Linux but I'm working on getting a MacOS machine to try it out, is there something else involved in your repro? any other packages or code that could be affecting the behavior?

@antoine-coulon
Copy link

@hectorhdzg no nothing else, I've just tested it in a raw project with only the appinsights dependency.

As a reminder here is my setup:
OS: MacOS Big Sur 11.5.1
Node.js: 16.14.0 & 18.4.0
appinsights: *

@JamieMagee
Copy link
Member

Is it possible to add timeouts to all HTTP calls? This falls under CWE-10881:

CWE-1088: Synchronous Access of Remote Resource without Timeout

The code has a synchronous call to a remote resource, but there is no timeout for the call, or the timeout is set to infinite.
This issue can prevent the software from running reliably, since an outage for the remote resource can cause the software to hang. If the relevant code is reachable by an attacker, then this reliability problem might introduce a vulnerability.

Footnotes

  1. https://cwe.mitre.org/data/definitions/1088.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants