[workflow][kunlunxin] add klx training pre-pr-check workflow #318

Merged: 27 commits, Nov 30, 2023

@dynamicheart dynamicheart commented Nov 9, 2023

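For a complete example of a test run of the workflow, see: https://github.com/dynamicheart/FlagPerf/actions/runs/6977313432/job/18987019681?pr=2

CI guide for developers

  1. The CI machine used for KLX is 44. Prepare the dataset on 44, then supply the dataset's path on the host in step 2.

  2. The PR title (mandatory) must follow the format [kunlunxin][MODEL_NAME][DATASET_PATH] xxx for CI to be triggered automatically. The PR description may pass three optional parameters: MAX_EPOCH, PIP_SOURCE, and MAX_SAMPLES_TERMINATION.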
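
A minimal sketch of how a title in the [kunlunxin][MODEL_NAME][DATASET_PATH] xxx format could be recognized and split into its parts. This is illustrative only, not the parsing logic used by the workflow in this PR: the names `TITLE_RE`, `CiRequest`, and `parse_pr_title` are hypothetical, and the allowed character sets are an assumption chosen to reject shell metacharacters.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for "[kunlunxin][MODEL_NAME][DATASET_PATH] free text".
# Assumption: model names use letters, digits, "_", "." and "-", and the
# dataset path is an absolute path built from the same characters plus "/".
TITLE_RE = re.compile(
    r"^\[kunlunxin\]"
    r"\[(?P<model>[A-Za-z0-9_.-]+)\]"
    r"\[(?P<dataset>/[A-Za-z0-9_./-]*)\]"
    r"\s*(?P<summary>.*)$"
)


class CiRequest(NamedTuple):
    model: str
    dataset_path: str
    summary: str


def parse_pr_title(title: str) -> Optional[CiRequest]:
    """Return the parsed fields, or None if the title should not trigger CI."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return CiRequest(
        match.group("model"), match.group("dataset"), match.group("summary")
    )
```

A title that does not match, or whose fields contain characters outside the allowed sets, yields None, so the caller can skip the CI run instead of passing unchecked text to a shell.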

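Runner deployment [for CI backend deployers]

  1. A self-hosted runner must be deployed and given three labels: self-hosted, klx, and r480. A proxy must also be configured so that the runner can communicate with GitHub.

    How to deploy a self-hosted runner: link
    Notes:

    • When adding the self-hosted runner, set the proxy before running config.sh and run.sh.
      Example: http_proxy=<YOUR_PROXY_ADDR> https_proxy=<YOUR_PROXY_ADDR> ./config.sh and http_proxy=<YOUR_PROXY_ADDR> https_proxy=<YOUR_PROXY_ADDR> ./run.sh
    • When following GitHub's "Add new self-hosted runner" documentation, refresh the GitHub documentation page to obtain a fresh token before running ./config.sh --url https://github.com/<username>/FlagPerf --token <YOUR_TOKEN>. Otherwise the token may have expired, and the request returns a 404 Not Found error.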
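  2. For security, it is recommended to create a ci user (without sudo privileges) on the self-hosted runner machine and to run the self-hosted runner program as that user. Notes:

    • The new user also needs passwordless SSH login to the local machine (passwordless login to 127.0.0.1) so that run.py can launch the program in one step.
    • The ci user must be added to the docker group so that it has permission to run docker commands.
    • [Open issue] The ci user's access permissions need to be restricted.
    • [Open issue] The risk of docker mount directories may need consideration (any directory can be mounted as the dataset directory, which is open to attack).
    • [Open issue] The CI machine should be isolated from other machines, including network isolation (to avoid port scanning), and its ssh keys should be deleted.
    • [TODO] Mitigation for the security issues: CI could later require someone's approval before it is triggered.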
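  3. [Done] run.py needs to return an error code when training fails, or some other mechanism is needed so that Actions can determine whether training succeeded. (The Event.FINISHED event in the rank0.log log file might be usable for this.)

  4. [Done] A mechanism is needed (for example, setting max_epoch) so that the 1x1 and 1x8 cases stop training at an appropriate point, preventing overly long CI runs.

    • max_epoch=1 and max_samples_termination=20 are set.
  5. [Done] On each run, VERSION in run.py is set to v0.1_<model_name>_<timestamp>, which guarantees that a new image is built every time; the image is deleted when the run ends to reclaim space. In addition, a separate workflow, klx-training-remove-image, can be triggered manually to delete unused images.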
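
As a rough illustration of item 5, a tag of the form v0.1_<model_name>_<timestamp> could be built as follows. The function name `build_version` is hypothetical, and the timestamp format (whole seconds since the Unix epoch) is an assumption; the PR does not state which format the workflow uses.

```python
import time
from typing import Optional


def build_version(model_name: str, now: Optional[float] = None) -> str:
    """Return a tag of the form v0.1_<model_name>_<timestamp>.

    Assumption: the timestamp is whole seconds since the Unix epoch.
    """
    # Allow the caller to inject a time value so the result is reproducible.
    timestamp = int(time.time() if now is None else now)
    return f"v0.1_{model_name}_{timestamp}"
```

Because the timestamp differs between runs, two runs for the same model get different tags, so a previously built image is not reused.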

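Features

This workflow includes the following mechanisms:

  • Automatic CI triggering by recognizing and parsing the PR title
  • Input parameter validation, to prevent malicious parameters from causing unknown system errors
  • Determining whether training succeeded by checking whether rank0.log contains FINISHED
  • Setting the image TAG from a timestamp so that the latest image is built every time
  • Automatic cleanup of images and containers whether CI succeeds or fails, to prevent excessive disk usage
  • Limiting CI duration by setting max_epoch and max_samples_termination
  • Random port selection for distributed training to avoid port conflicts
  • Automatic collection of xpu_smi data to obtain the peak memory usage during CI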
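
The success check based on rank0.log could look roughly like the sketch below. It is an assumption-laden illustration rather than the workflow's actual implementation: the function name `training_finished` is hypothetical, and it simply treats any line containing the text FINISHED as success.

```python
from pathlib import Path


def training_finished(log_path: str, marker: str = "FINISHED") -> bool:
    """Return True if any line of the log contains the marker text."""
    path = Path(log_path)
    # A missing log is treated as a failed run rather than raising.
    if not path.is_file():
        return False
    # errors="replace" keeps the scan going if the log has stray bytes.
    with path.open("r", encoding="utf-8", errors="replace") as log_file:
        return any(marker in line for line in log_file)
```

A CI step could call this on the rank0.log of a finished run and exit with a non-zero status when it returns False, which is what lets Actions mark the job as failed.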

Problems solved:

  • Fixed the failure to start the xpu_smi monitor when run.py is run without sudo

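Other usage

[Supplement] Manual triggering:

Select the klx-training-test-manually workflow, pass in parameters such as the case and the dataset path, and trigger it manually.

Tutorial on manually triggering CI: https://docs.github.com/en/actions/using-workflows/manually-running-a-workflow

[Supplement] For an example of the image cleanup workflow, see:

https://github.com/dynamicheart/FlagPerf/actions/runs/6940580129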

@dynamicheart dynamicheart force-pushed the features/klx_workflow branch from 3242369 to 95f38ac Compare November 9, 2023 05:54
@yuzhou03 yuzhou03 changed the title add klx training pre-pr-check workflow [kulunxin] add klx training pre-pr-check workflow Nov 15, 2023

yuzhou03 commented Nov 15, 2023

Verification process

  1. On the KLX 02 machine, deploy and run run.sh to start the runner.

  2. On the GitHub page, manually run the klx-training-pre-pr-check workflow. The 1x1 case was skipped; the 1x8 case completed, but not finished.

@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 2 times, most recently from 2bb78a8 to 4892099 Compare November 16, 2023 08:03
@dynamicheart dynamicheart changed the title [kulunxin] add klx training pre-pr-check workflow [kunlunxin] add klx training pre-pr-check workflow Nov 16, 2023
@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 9 times, most recently from 19b5dd7 to 9f1076d Compare November 22, 2023 03:40
@yuzhou03

klx-training-remove-image supports two forms of input: image_id and image_name: tag_name.

by image_id
https://github.com/yuzhou03/FlagPerf/actions/runs/6966024117/job/18955543221
by image_name: tag_name
https://github.com/yuzhou03/FlagPerf/actions/runs/6966060139/job/18955633484

@yuzhou03

Verification run for the t5_small model [Succeeded]
https://github.com/yuzhou03/FlagPerf/actions/runs/6953190134/job/18917958330

@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 2 times, most recently from a2ef133 to 4eea1e0 Compare November 23, 2023 08:31

yuzhou03 commented Nov 23, 2023

If a CI run takes too long, the image is deleted automatically after the job is cancelled.

https://github.com/yuzhou03/FlagPerf/actions/runs/6967552812/job/18959709463

@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 3 times, most recently from 30f290f to c7f7fcc Compare November 24, 2023 02:00
@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 4 times, most recently from 2b6b695 to 98dae59 Compare November 24, 2023 03:50
@dynamicheart dynamicheart changed the title [kunlunxin] add klx training pre-pr-check workflow [testing][kunlunxin] add klx training pre-pr-check workflow Nov 24, 2023
@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 7 times, most recently from b41112d to 74fbe9a Compare November 24, 2023 05:45
@dynamicheart dynamicheart changed the title [testing][kunlunxin] add klx training pre-pr-check workflow [workflow][kunlunxin] add klx training pre-pr-check workflow Nov 24, 2023
@dynamicheart dynamicheart force-pushed the features/klx_workflow branch 2 times, most recently from d3ddfb1 to 21a4a09 Compare November 24, 2023 06:10
@upvenly upvenly merged commit 1ffc126 into FlagOpen:main Nov 30, 2023
4 participants