Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out next areas to vectorize #4580

Closed
onelson opened this issue Mar 21, 2022 · 9 comments
Closed

Figure out next areas to vectorize #4580

onelson opened this issue Mar 21, 2022 · 9 comments
Assignees

Comments

@onelson
Copy link
Contributor

onelson commented Mar 21, 2022

This is my current top priority, but it wasn't anywhere on the sprint board. Filing this issue for visibility.

Using the query archiver to help make a data-driven decision about the next item to vectorize, look at the structure of historical queries to see what the functions received by map actually do, then recommend as far reaching an item as possible.

@onelson onelson self-assigned this Mar 21, 2022
@onelson
Copy link
Contributor Author

onelson commented Mar 22, 2022

Been trying to figure out ways to leverage colm to do a smarter search, but I'm there yet. In an effort to at least get the boll rolling, I dumped out a report for "queries that contain map() calls." The resulting output was just under 3Gb of text. Grepping for just the map() callsites, the count is: 2,992,903.

A haystack of this size is not likely to be easy to make sense of with eyeballs alone, but I'm going to try and use grep to gain some weak insights. Maybe I can carve this up so as to divide and conquer.

@onelson
Copy link
Contributor Author

onelson commented Mar 22, 2022

Accounting for some warts in this methodology.

Some number of the callsites grepped out of the full output will be formatted such that there's nothing visible inside the call on the line where the call appears.

$ grep -e '^.* map(fn: (r) =>\s*$' callsites.txt  | wc -l
222381
$ grep -e '^.* map(fn: (r) => ({\s*$' callsites.txt  | wc -l
137599
$ grep -e '^.* map(\s*$' callsites.txt | wc -l
185749

This means I'm losing 545,729 calls worth of detail right out of the gate 😵‍💫. Not yet sure what to do about this, but for now it is what it is.

I'm going to remove these lines from the callsites file and focus on the one liners that are easier to inspect. With this, the haystack is reduced to 2,447,174 lines, or around 215Mb.

@onelson
Copy link
Contributor Author

onelson commented Mar 22, 2022

Some findings, just grepping against the one liner data. Greps included below in case I goofed and accidentally inflated these somehow.

pie
title Occurrences in one line map calls
"dict.get": 63712
"exists": 7550
"if/then/else": 127544
"LT": 14696
"LTE": 1926
"GT": 16043
"GTE": 14600
"Neq": 2673
"Eq": 66388
"Mult": 97604
"Div": 236285
"Sub": 158218
"Add": 46644
"string": 9352
"int": 71812
"float": 169152
"duration": 15815
"uint": 16008
"regexp": 2218
"or": 7207
"and": 16775
Loading
name count without if/then/else
Div 236,285
float 169,152
Sub 158,218
if/then/else 127,544
Mult 97,604
int 71,812
Eq 66,388 5,718
dict.get 63,712
Add 46,644
and 16,775 371
GT 16,043 1,080
uint 16,008
duration 15,815
LT 14,696 545
GTE 14,600 35
string 9,352
exists 7,550
or 7,207 297
LTE 1,926 1
Neq 2,673 715
regexp 2,218

EDIT: The 3rd column was added to see the counts for boolean/comparison ops that don't first require us to have if/then/else vectorized to really benefit.


# Add 
$ grep -e '^.* + .*$' callsites-oneliners.txt | wc -l
46644

# Sub
$ grep -e '^.* - .*$' callsites-oneliners.txt | wc -l
158218

# Div
# N.b. the whitespace in the regex should mean this doesn't give false positives for comments...
$ grep -e '^.* / .*$' callsites-oneliners.txt | wc -l
236285

# Mult
$ grep -e '^.* \* .*$' callsites-oneliners.txt | wc -l
97604

# Eq
$ grep -e '^.* == .*$' callsites-oneliners.txt | wc -l
66388

# Neq
$ grep -e '^.* != .*$' callsites-oneliners.txt | wc -l
2673

# GT
$ grep -e '^.* > .*$' callsites-oneliners.txt | wc -l
16043

# GTE
$ grep -e '^.* >= .*$' callsites-oneliners.txt | wc -l
14600

# LT
$ grep -e '^.* < .*$' callsites-oneliners.txt | wc -l
14696

# LTE
$ grep -e '^.* <= .*$' callsites-oneliners.txt | wc -l
1926

# If/then/else
$ grep -e '^.* if .*$' callsites-oneliners.txt | wc -l
127544

# Exists
grep -e '^.* exists .*$' callsites-oneliners.txt | wc -l
7550

# dict.get()
$ grep -e '^.* dict\.get(.*$' callsites-oneliners.txt | wc -l
63712

$ grep -e '^.* duration(.*$' callsites-oneliners.txt | wc -l
15815

grep -e '^.* regexp\..*$' callsites-oneliners.txt | wc -l
2218

# Boolean or
$ grep -e ' or ' callsites-oneliners.txt | wc -l
7207

# Boolean and
$ grep -e ' and ' callsites-oneliners.txt | wc -l
16775


# Casts...

# float
$ grep -e '^.* float(.*$' callsites-oneliners.txt | wc -l
169152

# int
$ grep -e '^.* int(.*$' callsites-oneliners.txt | wc -l
71812

# string
$ grep -e '^.* string(.*$' callsites-oneliners.txt | wc -l
9352

$ grep -e '^.* uint(.*$' callsites-oneliners.txt | wc -l
16008

@onelson
Copy link
Contributor Author

onelson commented Mar 22, 2022

Briefly discussed the above numbers with the team and it sounds like arithmetic operators are the next step. Comparisons (eq, neq, lt, lte, gt, gte) are only high-impact once if/then/else can be optimized, but arithmetic is applicable for many cases without any prerequisites.

@nathanielc
Copy link
Contributor

Looking at this data if we can do the first four div, float, sub, if/then/else, we can cover > 50% of uses for map. That's a great position to be in as we would be able to see early impact of this work.

My recommendation would be to work on binary operators generally and then do if/then/else as that seems very similar to binary operations. Then follow up with function invocation for the call to float. Thoughts? That seems like a good balance between impact and cost of implementation.

@onelson
Copy link
Contributor Author

onelson commented Mar 23, 2022

I think when I brought these numbers to slack, there was some concern about if/then/else complexity-wise, but maybe working the boolean ops first will help clarify things? I'm not sure.

@onelson
Copy link
Contributor Author

onelson commented Mar 23, 2022

The epic has been updated to show a "quest tracking" section at the bottom of the description, and I've linked the immediate next hops on the roadmap there.

Issues were filed specifically to get through the arithmetic ops, getting us 2 of the top 4.

@nathanielc I was going to close this issue out now that we have numbers to guide us, and next (immediate) steps filed as issues. If you prefer I can try file issues for the extended roadmap you just laid out. Let me know.

@nathanielc
Copy link
Contributor

Yeah I'd like to create a set of issues for if/then/else and then issues for the float call. Those would come after the issues you have already created. Thanks for investigating this is really nice data.

@onelson
Copy link
Contributor Author

onelson commented Mar 25, 2022

Okay, issues for logical ops, conditional expressions, and float are all in the backlog and linked from the epic.

@onelson onelson closed this as completed Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants