Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix 20284]: Enhance smartswitch environment variables parsing #21209

Merged
merged 1 commit into from
Jan 19, 2025

Conversation

Pterosaur
Copy link
Contributor

@Pterosaur Pterosaur commented Dec 18, 2024

Why I did it

Fix #20284

In 202405 and above, two extra steps are added before the start of every container which checks NUM_DPU and IS_DPU_DEVICE by parsing the platform.json file using the jq tool. This is only relevant for Smartswitch. However, this is adding some delay during the reconciliation phase of WR/FR resulting

When there is load on CPU, both the jq calls are adding > 1 sec to the start of swss and almost 0.5 sec to the start of syncd. There are also present in teamd and bgp container start flow which may cause extra contention on the CPU.

image

Work item tracking
  • Microsoft ADO (number only):

How I did it

Set the environment variables for systemd by systemd-sonic-generator.

How to verify it

  1. Check there is no jq command under the swss.sh start and syncd.sh start from the sonic-bootchart
  2. Regression test:
echo '{"DPUS":{"dpu0":{}, "dpu1":{}}}' | sudo tee /usr/share/sonic/device/x86_64-kvm_x86_64-r0/platform.json
docker rm -f database
sudo reboot

...

admin@vlab-01:~$ sudo systemctl list-unit-files '*midplane*'
UNIT FILE                    STATE           PRESET
midplane-network-dpu.service disabled        enabled
midplane-network-npu.service enabled-runtime  enabled


# The DPU databases should be created
admin@vlab-01:~$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS         PORTS     NAMES
4455f7ab611c   docker-database:latest               "/usr/local/bin/dock…"   5 minutes ago    Up 5 minutes             databasedpu0
8e983b5beb27   docker-database:latest               "/usr/local/bin/dock…"   5 minutes ago    Up 5 minutes             databasedpu1

# Environment variables should be inserted to the service file
admin@vlab-01:~$ cat $(sudo find /etc -name 'swss*service')
[Unit]
Description=switch state service
Requires=database@dpu0.service

Requires=database@dpu1.service

After=database@dpu0.service

After=database@dpu1.service

Requires=config-setup.service
After=config-setup.service
BindsTo=sonic.target
After=sonic.target
Before=ntp-config.service
StartLimitIntervalSec=1200

StartLimitBurst=3

[Service]
Environment="NUM_DPU=2"
Environment="IS_DPU_DEVICE=false"
User=root
Environment=sonic_asic_platform=vs
ExecStartPre=/usr/local/bin/swss.sh start
ExecStart=/usr/local/bin/swss.sh wait
ExecStop=/usr/local/bin/swss.sh stop
RestartSec=30

[Install]
WantedBy=sonic.target

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Ze Gan <ganze718@gmail.com>
@Pterosaur Pterosaur requested a review from lguohan as a code owner December 18, 2024 09:47
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur
Copy link
Contributor Author

/azpw ms_conflict

@Pterosaur
Copy link
Contributor Author

/azpw ms_conflict

@liushilongbuaa
Copy link
Contributor

/azpw ms_conflict

env_vars["IS_DPU_DEVICE"] = (smart_switch_dpu ? "true" : "false");
env_vars["NUM_DPU"] = std::to_string(num_dpus);

for (const auto& [key, value] : env_vars) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think services files such as database, swss, syncd are enabled at build time.
Are we certain that, these will be started only after systemd-sonic-generator is run atleast once?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also verify the changes on smartswitch platform and make sure multiple database instances are created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.
Yes, the generator will be run only once and before all services started.
I did confirm the smartswitch scenario and pasted the results on the PR description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing


std::unordered_map<std::string, std::string> env_vars;
env_vars["IS_DPU_DEVICE"] = (smart_switch_dpu ? "true" : "false");
env_vars["NUM_DPU"] = std::to_string(num_dpus);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good approach to solve this issue but i see one setback. If we need to add new env variables, ssg code should be updated. which IMO is not very flexible. I have an idea for generic solution, let me know what you think

  1. Add a oneshot service very early in the boot. Read static env variables (Eg: $PLATFORM, $NUM_ASIC, $NUM_DPU, $SONIC_BOOT_TYPE etc) and write them to a common file /etc/sonic/static-env-variables
  2. We can leverage ssg to write EnvironmentFile=/etc/sonic/static-env-variables option to all the services. making the ssg code minimal and flexible.
  3. We can potentially clean a lot of code under docker_image_ctl with this approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I can have a try.

Copy link
Contributor Author

@Pterosaur Pterosaur Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it, but I feel it wasn't easy yet.

  1. The shell of oneshot service you mentioned might look like the following. But at this time, the database service hasn't ready, so the sonic-cfggen would not work and I have to parse the /host/machine.conf as same as what systemd-sonic-generator did if I would like to get the variable. The SSG has done this by an efficient function, C function, why should I do it again by a shell?
SYSTEMD_ENV_FILE="/etc/sonic/static_env"

# Load platform from sonic-cfggen

PLATFORM=${PLATFORM:-`sonic-cfggen -H -v DEVICE_METADATA.localhost.platform`}

echo "PLATFORM='${PLATFORM}'" >> "$SYSTEMD_ENV_FILE"

# Parse environment from platform.json

PLATFORM_JSON=/usr/share/sonic/device/$PLATFORM/platform.json

if [ -f "$PLATFORM_JSON" ]; then

# Environment variables for Smart Switch

    NUM_DPU=$(jq -r '.DPUS | length' $PLATFORM_JSON 2>/dev/null)
    if [[ -z "$NUM_DPU" ]]; then
        NUM_DPU=0
    fi

    jq -e '.DPU' $PLATFORM_JSON >/dev/null
    if [[ $? -eq 0 ]]; then
        IS_DPU_DEVICE="true"
    else
        IS_DPU_DEVICE="false"
    fi

    echo "NUM_DPU='${NUM_DPU}'" >> "$SYSTEMD_ENV_FILE"
    echo "IS_DPU_DEVICE='${IS_DPU_DEVICE}'" >> "$SYSTEMD_ENV_FILE"
fi
  1. As your proposal, if we want to add new env variables, we have to update the shell script. I don't see any difference or challenge in doing this via SSG. If you feel that C code isn't flexible, I have to say we had an old SSG with python code previously, But we discarded it and rewrote it via C due to the efficiency issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, script couldn't use DB just like SSG

Hmm, difference is we need to run the script once. Since the oneshot service runs before other services are even started, CPU should be fairly free and should be executed quickly. we do need to benchmark this solution to measure impact.

Advantage being load on SSG is less, all it does it to add EnvironmentFIle= and with some optimization we don't even need to edit and write to the .service file after the first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, i'm okay with the current solution. We might move to the generic oneshot service if required in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR looks good to you, could you please help to approve it.

env_vars["NUM_DPU"] = std::to_string(num_dpus);

for (const auto& [key, value] : env_vars) {
tmp_file << "Environment=\"" << key << "=" << value << "\"" << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add Environment= option if the file doesn't have[Service] section in the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that the Service section might be introduced by other functions or modules in the future. And I don't see any side-effects if I define some environment variables at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure, systemd won't throw an error if we add Environment= values after the last section without [Service]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no Service section, I will add it.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@vivekrnv vivekrnv requested a review from Yakiv-Huryk December 20, 2024 06:27
@vivekrnv vivekrnv removed the request for review from Yakiv-Huryk January 13, 2025 19:49
@Pterosaur
Copy link
Contributor Author

Hi @lguohan , Please help to review or merge this PR.

@lguohan
Copy link
Collaborator

lguohan commented Jan 19, 2025

@Pterosaur , can you put delay numbers in the description?

@lguohan lguohan merged commit 7e03462 into sonic-net:master Jan 19, 2025
22 checks passed
VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025
Fix sonic-net#20284

In 202405 and above, two extra steps are added before the start of every container which checks NUM_DPU and IS_DPU_DEVICE by parsing the platform.json file using the jq tool. This is only relevant for Smartswitch. However, this is adding some delay during the reconciliation phase of WR/FR resulting

How I did it
Set the environment variables for systemd by systemd-sonic-generator.

Signed-off-by: Ze Gan <ganze718@gmail.com>
BYGX-wcr pushed a commit to BYGX-wcr/sonic-buildimage that referenced this pull request Jan 21, 2025
Fix sonic-net#20284

In 202405 and above, two extra steps are added before the start of every container which checks NUM_DPU and IS_DPU_DEVICE by parsing the platform.json file using the jq tool. This is only relevant for Smartswitch. However, this is adding some delay during the reconciliation phase of WR/FR resulting

How I did it
Set the environment variables for systemd by systemd-sonic-generator.

Signed-off-by: Ze Gan <ganze718@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Checking for DPUs in platform.json is adding delay to WR/FR reconciliation
5 participants