Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expression balancing #66

Closed
thesps opened this issue Apr 26, 2024 · 6 comments
Closed

Expression balancing #66

thesps opened this issue Apr 26, 2024 · 6 comments

Comments

@thesps
Copy link
Owner

thesps commented Apr 26, 2024

The implementation of summation over tree scores in HLS backend prevents automatic expression balancing when saturation and rounding are used in the score precision type.

    Trees:
    for(int i = 0; i < n_trees; i++){
      Classes:
      for(int j = 0; j < fn_classes(n_classes); j++){
        score[j] += scores[i][j];
      }
    }

This should be replaced with a balanced tree reduce implementation.

@francescobrivio
Copy link
Contributor

Hi @thesps!

I think I have a possible implementation of the solution in francescobrivio@341627c (but further testing is needed on this preliminary implementation!) which drastically reduce the latency in case the AP_SAT flag is used in the ScorePrecision.
However this comes at the cost of a slight increase in resources usage due to the need of "inverting"
scores[n_trees][fn_classes(n_classes)] (in order to apply the tree reduce method).

So I think this can potentially be further improved if we can change this line:

void tree_scores(input_t x, score_t scores[n_trees][fn_classes(n_classes)]) const;

to fill the scores in "reversed order", i.e. score_t scores[fn_classes(n_classes)][n_trees], but I'm not sure if this requires much deeper changes in the code and/or if it's acceptable :D

What do you think?

@francescobrivio
Copy link
Contributor

I have run a few tests using the same identical xgb model and changing the Conifer "version" and the ScorePrecision:

Conifer version ScorePrecision vsynth LUT vsynth FF Latency
master ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51936 5032 133
My commit ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
-------- -------- -------- -------- --------
master ap_fixed<11,4,AP_RND_CONV> 43329 3542 4
My commit ap_fixed<11,4,AP_RND_CONV> 42992 4504 5

So I have to correct my previous post: in case of AP_SAT my commit shows better latency and resources, while it has a slightly negative impact on both when AP_SAT is not used (to be improved).

Note:
Please note that I still have to run the inference on few events to validate the conifer scores with respect to xgb.

@thesps
Copy link
Owner Author

thesps commented May 2, 2024

Thanks for the development @francescobrivio this looks really promising! I think your proposal to swap the order of the indices of the scores array makes complete sense to avoid the inversion in silico and there should be no problem doing that.

As well as changing the BDT_rolled.cpp HLS code, you would need to change the Python writer for the fully unrolled optimization, that should be constrained to this line, I think:

newline += f'  scores[{it}][{ic}] = tree_{it}_{ic}.decision_function(x);\n'

@francescobrivio
Copy link
Contributor

Thanks for the feedback Sioni!
Ok I have implemented the "swapping" of the scores array and here are the updated results:

Conifer version ScorePrecision vsynth LUT vsynth FF Latency
master ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51936 5032 133
My commit ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
My commit V2 ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
-------- -------- -------- -------- --------
master ap_fixed<11,4,AP_RND_CONV> 43329 3542 4
My commit ap_fixed<11,4,AP_RND_CONV> 42992 4504 5
My commit V2 ap_fixed<11,4,AP_RND_CONV> 42988 3800 4

Considerations:

  • In case of using AP_SAT: the result with the swapped scores is exactly identical...I guess Vitis is smart enough to figure it out?
  • In case of not using AP_SAT: avoiding to swap scores does bring improvements in both latency and resources
  • Overall with the latest changes ("V2") there is gain everywhere except for the usage of FF in no-AP_SAT case, but given FF are the most abundant resource I think we can afford it?

@thesps if you have further suggestions I'm happy to try them out! If not, and you agree, I can open a PR with these changes.

@thesps
Copy link
Owner Author

thesps commented May 7, 2024

Nice, this looks great to me. Please go ahead and open a PR 👍

@thesps
Copy link
Owner Author

thesps commented May 13, 2024

Resolved by @francescobrivio in #68

@thesps thesps closed this as completed May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants