Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular Flow: Deployment failure: Operation has timed out #5480

Open
yuliiamir opened this issue Nov 20, 2023 · 3 comments
Open

Regular Flow: Deployment failure: Operation has timed out #5480

yuliiamir opened this issue Nov 20, 2023 · 3 comments
Assignees

Comments

@yuliiamir
Copy link
Collaborator

A regular flow deployment failure has been caught during test execution on Jenkins. It is a flaky issue and when it occurs, it becomes a root cause of some test failures (switch validation, meter checks).
Two timeouts occurred:
kilda-NB
Error MessageError{correlation_id=9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5, timestamp=1700246791401, error-type=Operation has timed out, error-message=No response received, error-description=Timeout exceeded} caught.
kilda-storm
Flow 17Nov184559_878_saffloweroil4690 error - Global timeout reached for create operation on flow "17Nov184559_878_saffloweroil4690"
BUT flow was created.
Request:

curl -X POST \
http://localhost:8080/api/v2/flows \
-H 'Accept: application/json, application/*+json' \
-H 'correlation_id: fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5' \
-H 'Content-Type: application/json' \
-H 'Content-Length: 797' \
-d '{
  "flow_id" : "17Nov184559_878_saffloweroil4690",
  "source" : {
    "switch_id" : "00:00:00:00:00:01:00:01",
    "port_number" : 10,
    "vlan_id" : 3241,
    "inner_vlan_id" : 0,
    "detect_connected_devices" : {
      "lldp" : false,
      "arp" : false
    }
  },
  "destination" : {
    "switch_id" : "00:00:00:00:00:01:00:02",
    "port_number" : 10,
    "vlan_id" : 637,
    "inner_vlan_id" : 0,
    "detect_connected_devices" : {
      "lldp" : false,
      "arp" : false
    }
  },
  "maximum_bandwidth" : 500,
  "ignore_bandwidth" : false,
  "strict_bandwidth" : false,
  "periodic_pings" : false,
  "description" : "autotest flow: It seems she hangs upon the cheek of night like a rich jewel in an Ethiope's ear.",
  "max_latency" : null,
  "max_latency_tier2" : null,
  "priority" : null,
  "diverse_flow_id" : null,
  "affinity_flow_id" : null,
  "pinned" : false,
  "allocate_protected_path" : false,
  "encapsulation_type" : null,
  "path_computation_strategy" : null,
  "statistics" : null
}' 
==================== request end ====================
2023-11-17T18:46:31,464 DEBUG LoggingRequestInterceptor:57 - 
==================== response begin ====================
Status code  : 408 REQUEST_TIMEOUT
Status text  : 
Headers      : [correlation_id:"9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5", X-Content-Type-Options:"nosniff", X-XSS-Protection:"1; mode=block", Cache-Control:"no-cache, no-store, max-age=0, must-revalidate", Pragma:"no-cache", Expires:"0", X-Frame-Options:"DENY", Content-Type:"application/json", Transfer-Encoding:"chunked", Date:"Fri, 17 Nov 2023 18:46:31 GMT", Connection:"close"]
Response body: {
  "correlation_id" : "9482853d-fe70-4bd7-8a1e-bcf5c54ca944 : fn-tests-4cf5d794-e4e3-497e-bca5-d3b17518ecd5",
  "timestamp" : 1700246791401,
  "error-type" : "Operation has timed out",
  "error-message" : "No response received",
  "error-description" : "Timeout exceeded"
}

FlowDeploymentFailureTimoutIssue

@IvanChupin IvanChupin self-assigned this Jan 8, 2024
@pkazlenka
Copy link
Collaborator

Seems like there's another 'face' of this problem. Sometimes flow is stuck in 'Down' state, not in 'In Progress' one. The logs are attached.
Seems like FLOW_VALIDATION_SPEAKER_WORKER ignores message which confirms successful installation of the non-ingress rules:

FLOW_VALIDATION_SPEAKER_WORKER:32 drop worker async response. because 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273 key is not listed in pending response list [key: 33ecfdb2-cbe6-11ee-8ccc-159f8849f32d : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, message: SpeakerFlowSegmentResponse(super=SpeakerResponse(super=AbstractMessage(messageContext=MessageContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990006567), messageCookie=null), commandId=33ecfdb2-cbe6-11ee-8ccc-159f8849f32d, switchId=00:00:00:00:00:00:00:02), metadata=FlowSegmentMetadata(flowId=15Feb093937_966_saffron8334, cookie=0x2000000000011E55, multiTable=true), success=true, requestCreateTime=1707990006567, executionTime=2000000), context: CommandContext(correlationId=33b7e3f0-cbe6-11ee-8ccc-4ba1d5498525 : 48c85531-d8a2-46d6-97a9-cb5bd609cb66 : fn-tests-feef6158-a013-4581-b864-45ed811f4273, createTime=1707990008657, kafkaTopic=kilda.speaker.flowhs.priv, kafkaPartition=1, kafkaOffset=1)]

Hopefully, logs would help to investigate why does it happen.
logs.json
flow history.json

@yuliiamir
Copy link
Collaborator Author

yuliiamir commented Feb 22, 2024

Logs for another flow with Timeout error was checked and noticed that this key is not listed in pending response list' logs were after Global timeout` from Floodlight.
TimeoutDuringFlowCreation
ResponseFromFlWasSentAfterGlobalTimeout
KeyIsNotListenedInPendingResponseListAfterGlobalTimeout

elasticsearch_dump (1).json.gz

@IvanChupin
Copy link
Collaborator

Initial study of the problem shows that Global timeout of 1 minutes reached because of too long queries to Oriend Db.

On the following screenshot
Screenshot 2024-08-08 at 12 21 03
time between two messages where we are going to execute query to Oriend DB is approximately 10 seconds.

Moving further along with the algorithm of creating flow we have the following screenshot
Screenshot 2024-08-08 at 12 21 46
time between two messages where we are going to execute another query to Oriend DB is approximately 20 seconds.
Also on the screenshot we can see that there is a timeout from northbound side at 17:22:31.

And after 1 minutes here is Global timeout occured.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants