Figure out next areas to vectorize #4580

onelson · 2022-03-21T23:39:52Z

This is my current top priority, but it wasn't anywhere on the sprint board. Filing this issue for visibility.

Using the query archiver to help make a data-driven decision about the next item to vectorize, look at the structure of historical queries to see what the functions received by map actually do, then recommend as far reaching an item as possible.

The text was updated successfully, but these errors were encountered:

onelson · 2022-03-22T18:16:52Z

Been trying to figure out ways to leverage colm to do a smarter search, but I'm there yet. In an effort to at least get the boll rolling, I dumped out a report for "queries that contain map() calls." The resulting output was just under 3Gb of text. Grepping for just the map() callsites, the count is: 2,992,903.

A haystack of this size is not likely to be easy to make sense of with eyeballs alone, but I'm going to try and use grep to gain some weak insights. Maybe I can carve this up so as to divide and conquer.

onelson · 2022-03-22T18:40:21Z

Accounting for some warts in this methodology.

Some number of the callsites grepped out of the full output will be formatted such that there's nothing visible inside the call on the line where the call appears.

$ grep -e '^.* map(fn: (r) =>\s*$' callsites.txt  | wc -l
222381
$ grep -e '^.* map(fn: (r) => ({\s*$' callsites.txt  | wc -l
137599
$ grep -e '^.* map(\s*$' callsites.txt | wc -l
185749

This means I'm losing 545,729 calls worth of detail right out of the gate 😵‍💫. Not yet sure what to do about this, but for now it is what it is.

I'm going to remove these lines from the callsites file and focus on the one liners that are easier to inspect. With this, the haystack is reduced to 2,447,174 lines, or around 215Mb.

onelson · 2022-03-22T20:28:24Z

Some findings, just grepping against the one liner data. Greps included below in case I goofed and accidentally inflated these somehow.

pie
title Occurrences in one line map calls
"dict.get": 63712
"exists": 7550
"if/then/else": 127544
"LT": 14696
"LTE": 1926
"GT": 16043
"GTE": 14600
"Neq": 2673
"Eq": 66388
"Mult": 97604
"Div": 236285
"Sub": 158218
"Add": 46644
"string": 9352
"int": 71812
"float": 169152
"duration": 15815
"uint": 16008
"regexp": 2218
"or": 7207
"and": 16775

name	count	without if/then/else
Div	236,285
float	169,152
Sub	158,218
if/then/else	127,544
Mult	97,604
int	71,812
Eq	66,388	5,718
dict.get	63,712
Add	46,644
and	16,775	371
GT	16,043	1,080
uint	16,008
duration	15,815
LT	14,696	545
GTE	14,600	35
string	9,352
exists	7,550
or	7,207	297
LTE	1,926	1
Neq	2,673	715
regexp	2,218

EDIT: The 3rd column was added to see the counts for boolean/comparison ops that don't first require us to have if/then/else vectorized to really benefit.

# Add 
$ grep -e '^.* + .*$' callsites-oneliners.txt | wc -l
46644

# Sub
$ grep -e '^.* - .*$' callsites-oneliners.txt | wc -l
158218

# Div
# N.b. the whitespace in the regex should mean this doesn't give false positives for comments...
$ grep -e '^.* / .*$' callsites-oneliners.txt | wc -l
236285

# Mult
$ grep -e '^.* \* .*$' callsites-oneliners.txt | wc -l
97604

# Eq
$ grep -e '^.* == .*$' callsites-oneliners.txt | wc -l
66388

# Neq
$ grep -e '^.* != .*$' callsites-oneliners.txt | wc -l
2673

# GT
$ grep -e '^.* > .*$' callsites-oneliners.txt | wc -l
16043

# GTE
$ grep -e '^.* >= .*$' callsites-oneliners.txt | wc -l
14600

# LT
$ grep -e '^.* < .*$' callsites-oneliners.txt | wc -l
14696

# LTE
$ grep -e '^.* <= .*$' callsites-oneliners.txt | wc -l
1926

# If/then/else
$ grep -e '^.* if .*$' callsites-oneliners.txt | wc -l
127544

# Exists
grep -e '^.* exists .*$' callsites-oneliners.txt | wc -l
7550

# dict.get()
$ grep -e '^.* dict\.get(.*$' callsites-oneliners.txt | wc -l
63712

$ grep -e '^.* duration(.*$' callsites-oneliners.txt | wc -l
15815

grep -e '^.* regexp\..*$' callsites-oneliners.txt | wc -l
2218

# Boolean or
$ grep -e ' or ' callsites-oneliners.txt | wc -l
7207

# Boolean and
$ grep -e ' and ' callsites-oneliners.txt | wc -l
16775


# Casts...

# float
$ grep -e '^.* float(.*$' callsites-oneliners.txt | wc -l
169152

# int
$ grep -e '^.* int(.*$' callsites-oneliners.txt | wc -l
71812

# string
$ grep -e '^.* string(.*$' callsites-oneliners.txt | wc -l
9352

$ grep -e '^.* uint(.*$' callsites-oneliners.txt | wc -l
16008

onelson · 2022-03-22T21:18:08Z

Briefly discussed the above numbers with the team and it sounds like arithmetic operators are the next step. Comparisons (eq, neq, lt, lte, gt, gte) are only high-impact once if/then/else can be optimized, but arithmetic is applicable for many cases without any prerequisites.

nathanielc · 2022-03-23T15:01:57Z

Looking at this data if we can do the first four div, float, sub, if/then/else, we can cover > 50% of uses for map. That's a great position to be in as we would be able to see early impact of this work.

My recommendation would be to work on binary operators generally and then do if/then/else as that seems very similar to binary operations. Then follow up with function invocation for the call to float. Thoughts? That seems like a good balance between impact and cost of implementation.

onelson · 2022-03-23T17:00:26Z

I think when I brought these numbers to slack, there was some concern about if/then/else complexity-wise, but maybe working the boolean ops first will help clarify things? I'm not sure.

onelson · 2022-03-23T17:08:43Z

The epic has been updated to show a "quest tracking" section at the bottom of the description, and I've linked the immediate next hops on the roadmap there.

Issues were filed specifically to get through the arithmetic ops, getting us 2 of the top 4.

@nathanielc I was going to close this issue out now that we have numbers to guide us, and next (immediate) steps filed as issues. If you prefer I can try file issues for the extended roadmap you just laid out. Let me know.

nathanielc · 2022-03-24T19:46:19Z

Yeah I'd like to create a set of issues for if/then/else and then issues for the float call. Those would come after the issues you have already created. Thanks for investigating this is really nice data.

onelson · 2022-03-25T17:55:19Z

Okay, issues for logical ops, conditional expressions, and float are all in the backlog and linked from the epic.

onelson self-assigned this Mar 21, 2022

onelson mentioned this issue Mar 22, 2022

EPIC: Vectorize execution of map, filter and other Flux functions #4028

Closed

onelson closed this as completed Mar 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Figure out next areas to vectorize #4580

Figure out next areas to vectorize #4580

onelson commented Mar 21, 2022

onelson commented Mar 22, 2022 •

edited

Loading

onelson commented Mar 22, 2022 •

edited

Loading

onelson commented Mar 22, 2022 •

edited

Loading

onelson commented Mar 22, 2022

nathanielc commented Mar 23, 2022

onelson commented Mar 23, 2022

onelson commented Mar 23, 2022

nathanielc commented Mar 24, 2022

onelson commented Mar 25, 2022

Figure out next areas to vectorize #4580

Figure out next areas to vectorize #4580

Comments

onelson commented Mar 21, 2022

onelson commented Mar 22, 2022 • edited Loading

onelson commented Mar 22, 2022 • edited Loading

onelson commented Mar 22, 2022 • edited Loading

onelson commented Mar 22, 2022

nathanielc commented Mar 23, 2022

onelson commented Mar 23, 2022

onelson commented Mar 23, 2022

nathanielc commented Mar 24, 2022

onelson commented Mar 25, 2022

onelson commented Mar 22, 2022 •

edited

Loading

onelson commented Mar 22, 2022 •

edited

Loading

onelson commented Mar 22, 2022 •

edited

Loading