PT/TF examples vs XGBoost examples #1178
-
Hi community, I am trying to understand more about the xgboost examples.
-
Thanks for asking! Replies inline.
There are two flavors in the examples, "histogram-based" and "tree-based". Only the histogram-based approach relies on dmlc/xgboost#7778.
This is the tree-based approach. It has some limitations on training speed and model accuracy. The histogram-based approach is "lossless": trained models should be identical to those from xgboost distributed training, and speed should be comparable given fast networks.
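To make the "lossless" part concrete, here is a rough numpy sketch (not the actual xgboost or NVFlare code): each site accumulates per-bin gradient/hessian sums for its own data, and the server simply adds the histograms elementwise. The summed histogram is exactly the one a centralized run would compute over the pooled data, so split finding, and therefore the trained model, matches distributed xgboost.

```python
import numpy as np

def local_histogram(x, grad, hess, bin_edges):
    """Per-site gradient/hessian histogram for one feature."""
    bins = np.digitize(x, bin_edges)            # assign each row to a bin
    n_bins = len(bin_edges) + 1
    g = np.bincount(bins, weights=grad, minlength=n_bins)
    h = np.bincount(bins, weights=hess, minlength=n_bins)
    return g, h

rng = np.random.default_rng(0)
bin_edges = np.linspace(0.0, 1.0, 8)

# Two "sites" with private data; grad/hess would come from the booster.
sites = [(rng.random(1000), rng.normal(size=1000), rng.random(1000))
         for _ in range(2)]

# Federated: server sums the per-site histograms (one allreduce per histogram).
g_fed = sum(local_histogram(x, g, h, bin_edges)[0] for x, g, h in sites)
h_fed = sum(local_histogram(x, g, h, bin_edges)[1] for x, g, h in sites)

# Centralized: histogram over the pooled data.
x_all = np.concatenate([x for x, _, _ in sites])
g_all = np.concatenate([g for _, g, _ in sites])
h_all = np.concatenate([h for _, _, h in sites])
g_cen, h_cen = local_histogram(x_all, g_all, h_all, bin_edges)

assert np.allclose(g_fed, g_cen) and np.allclose(h_fed, h_cen)  # identical splits
```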
Yes, if you want federated TF/PyTorch to work exactly like distributed training. But so far parameter aggregation seems to work pretty well for deep models, so it may not be necessary.
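For reference, "parameter aggregation" here means FedAvg-style weighted averaging of model weights on the server. A minimal sketch (the helper name is illustrative, not NVFlare's API):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter lists (FedAvg-style).

    client_weights: one list of numpy arrays per client.
    client_sizes:   number of local training samples per client.
    """
    total = float(sum(client_sizes))
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(len(client_weights[0]))
    ]

# Two clients, each with a tiny two-layer model.
c1 = [np.ones((3, 3)), np.zeros(3)]
c2 = [np.full((3, 3), 3.0), np.ones(3)]
global_weights = fedavg([c1, c2], client_sizes=[100, 300])
# 0.25 * c1 + 0.75 * c2 -> first layer is all 2.5
```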
-
A separate question: it seems NVFlare plays the role of a Parameter Server, in which case workers communicate with the Parameter Server to exchange data and there is no worker-to-worker communication. If so, I think the NVFlare-native design would be to implement the NVFlare server as a parameter server, and the user would need some framework changes to glue NVFlare and the framework's distributed training together? That sounds like an ideal solution but not strictly necessary, since we leverage the scatter-and-gather mode now.
-
@Jeffwan The two approaches were worked on in parallel, mostly me on the histogram-based approach and the NVFlare team on the tree-based one. As I mentioned, the histogram-based approach is "lossless": any dataset you can train with distributed xgboost can be translated directly into the federated environment. However, there is currently no strong privacy guarantee, since gradients are shared freely between participants; this still needs to be worked on. Also, since each gradient sum requires a gRPC call from all the workers, it's sensitive to network latency, especially when …

The tree-based approach is more research oriented. It may work well in some scenarios, but may also suffer in model accuracy in others (e.g. data skew). Since only the trees are shared, it's probably more privacy preserving at the moment. There is also less communication, so if you are running over a very slow network it may be faster.

Deep models seem to be more forgiving with stochastic parameter updates (e.g. https://arxiv.org/abs/1106.5730), so having a "lossless" federated training approach may not be necessary.
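As a back-of-envelope illustration of the latency point (the counts below are rough assumptions, not measurements): if the histogram-based approach does roughly one gradient-histogram allreduce per tree level, while the tree-based approach exchanges one model update per boosting round, the number of server round trips differs by roughly a factor of the tree depth.

```python
def comm_round_trips(n_boost_rounds, max_depth, histogram_based):
    """Very rough round-trip count under the stated assumptions."""
    if histogram_based:
        # assume one histogram allreduce per tree level
        return n_boost_rounds * max_depth
    # assume one tree exchange per boosting round
    return n_boost_rounds

latency_s = 0.05  # assumed 50 ms round-trip latency over a WAN
for name, hist in [("histogram-based", True), ("tree-based", False)]:
    trips = comm_round_trips(n_boost_rounds=100, max_depth=8, histogram_based=hist)
    print(f"{name}: ~{trips} round trips, ~{trips * latency_s:.0f}s latency overhead")
# histogram-based: ~800 round trips, ~40s latency overhead
# tree-based:      ~100 round trips, ~5s latency overhead
```

On a fast local network this overhead is negligible, which is why the histogram-based approach is still competitive there; over a slow network the tree-based approach's lighter communication starts to pay off.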
-
Hi @Jeffwan, thanks for the discussion. @rongou already provided some good points. Let me tag @ZiyueXu77, as he is our scientist working on the algorithm side of the tree-based method and has analyzed some of the "model accuracy" aspects; maybe he can share some insights.
^As Rong replied: "Deep models seem to be more forgiving with stochastic parameter updates (e.g. https://arxiv.org/abs/1106.5730)." To add an additional point, as stated in: https://github.com/NVIDIA/NVFlare/tree/dev/examples/xgboost/tree-based#tree-based-federated-learning-for-xgboost