analytics/datasets: make query more efficient #1387

alxndrsn · 2025-02-06T16:40:53Z

What has been done to verify that this works as intended?

Existing tests/CI.

Why is this the best possible solution? Were any other approaches considered?

This PR improves performance significantly, but also introduces a number of style & formatting changes.

An alternative approach would be to minimise changes in this PR to be 100% performance focussed, and introduce more general changes in other PRs.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Should not affect users.

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

No

Before submitting this PR, please make sure you have:

run make test and confirmed all checks still pass, OR confirmed that the CircleCI build passes
verified that any code from external sources is properly credited in comments, or that everything is internally sourced

ktuite

This looks great, Alex! It's easier to read, and easier for those databases to compute.

lib/model/query/analytics.js

alxndrsn · 2025-02-07T06:51:14Z

There are more changes that might be made, e.g.

- collapsing the two joins on the entities table into a single join, and similarly
- collapsing the multiple joins on the audits table into a single join

However, compared to the other changes here, these feel like micro-optimisations!
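
To illustrate the first idea (collapsing two joins on the same table into one), here is a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL. The schema, column names and sample rows are simplified assumptions for illustration only; these are not the queries in lib/model/query/analytics.js. The sketch relies on the `count((cond) OR NULL)` idiom that also appears in the query plan: a false condition becomes NULL, and `count()` skips NULLs.

```python
import sqlite3

# Hypothetical, simplified schema: NOT the real tables used by the analytics query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (id INTEGER PRIMARY KEY);
    CREATE TABLE entities (
        id        INTEGER PRIMARY KEY,
        datasetId INTEGER NOT NULL,
        updatedAt TEXT,   -- NULL if the entity was never updated
        conflict  TEXT    -- NULL if the entity has no conflict
    );
    INSERT INTO datasets (id) VALUES (1), (2), (3);
    INSERT INTO entities (datasetId, updatedAt, conflict) VALUES
        (1, NULL,         NULL),
        (1, '2025-01-01', 'soft'),
        (1, '2025-01-02', NULL),
        (2, NULL,         'hard');
""")

# Before: the entities table is aggregated twice and joined twice.
TWO_JOINS = """
    SELECT ds.id,
           COALESCE(totals.num_entities, 0),
           COALESCE(totals.num_updated, 0),
           COALESCE(conflicts.num_conflicts, 0)
    FROM datasets ds
    LEFT JOIN (
        SELECT datasetId,
               count(*) AS num_entities,
               count((updatedAt IS NOT NULL) OR NULL) AS num_updated
        FROM entities
        GROUP BY datasetId
    ) AS totals ON totals.datasetId = ds.id
    LEFT JOIN (
        SELECT datasetId, count(*) AS num_conflicts
        FROM entities
        WHERE conflict IS NOT NULL
        GROUP BY datasetId
    ) AS conflicts ON conflicts.datasetId = ds.id
    ORDER BY ds.id
"""

# After: a single pass over entities; the WHERE filter of the second
# subquery becomes one more conditional count in the first.
ONE_JOIN = """
    SELECT ds.id,
           COALESCE(e.num_entities, 0),
           COALESCE(e.num_updated, 0),
           COALESCE(e.num_conflicts, 0)
    FROM datasets ds
    LEFT JOIN (
        SELECT datasetId,
               count(*) AS num_entities,
               count((updatedAt IS NOT NULL) OR NULL) AS num_updated,
               count((conflict IS NOT NULL) OR NULL) AS num_conflicts
        FROM entities
        GROUP BY datasetId
    ) AS e ON e.datasetId = ds.id
    ORDER BY ds.id
"""

two_join_rows = conn.execute(TWO_JOINS).fetchall()
one_join_rows = conn.execute(ONE_JOIN).fetchall()
print(two_join_rows == one_join_rows)  # both forms return the same counts
```

In the real query the conflict counts also involve entity_defs (see the Seq Scan on public.entity_defs in the plan), so an actual collapse would be more involved than this sketch, and any benefit would need confirming with EXPLAIN ANALYZE on PostgreSQL.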

Current query plan:


 Merge Left Join  (cost=54929.19..54958.49 rows=5 width=184) (actual time=653.462..661.422 rows=5 loops=1)
   Output: ds.id, ds."projectId", COALESCE((count(*)), '0'::bigint), COALESCE((count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), '0'::bigint), COALESCE((count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), '0'::bigint), COALESCE((count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), '0'::bigint), COALESCE((count(fd."formId")), '0'::bigint), COALESCE((count(form_attachments."formId")), '0'::bigint), COALESCE((count((a.details -> 'submissionId'::text))), '0'::bigint), COALESCE((count(((a."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), '0'::bigint), COALESCE((count(*)), '0'::bigint), COALESCE((count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), '0'::bigint), COALESCE((count((a_1.details -> 'submissionDefId'::text))), '0'::bigint), COALESCE((count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), '0'::bigint), COALESCE((count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), '0'::bigint), COALESCE((count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), '0'::bigint), COALESCE((count(*)), '0'::bigint), COALESCE((count(((e.conflict IS NULL) OR NULL::boolean))), '0'::bigint), COALESCE((count((a_2.details -> 'submissionDefId'::text))), '0'::bigint), COALESCE((count(((((a_2.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), '0'::bigint), COALESCE((count((((a_2.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), '0'::bigint), COALESCE((count(((((a_2.details -> 'submissionDefId'::text) IS NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), '0'::bigint), COALESCE((count(*)), '0'::bigint), COALESCE((count(((a_3."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), '0'::bigint)
   Inner Unique: true
   Merge Cond: (ds.id = e_3."datasetId")
   Buffers: shared hit=356878 read=33734
   I/O Timings: read=185.453
   ->  Merge Left Join  (cost=34333.45..34343.26 rows=5 width=168) (actual time=309.931..311.638 rows=5 loops=1)
         Output: ds.id, ds."projectId", (count(*)), (count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((e.conflict IS NULL) OR NULL::boolean))), (count(fd."formId")), (count(form_attachments."formId")), (count((a.details -> 'submissionId'::text))), (count(((a."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count((a_1.details -> 'submissionDefId'::text))), (count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), (count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (count((a_2.details -> 'submissionDefId'::text))), (count(((((a_2.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (count((((a_2.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), (count(((((a_2.details -> 'submissionDefId'::text) IS NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)))
         Inner Unique: true
         Merge Cond: (ds.id = e_2."datasetId")
         Buffers: shared hit=5691 read=16883
         I/O Timings: read=90.148
         ->  Merge Left Join  (cost=32480.39..32482.10 rows=5 width=136) (actual time=308.453..310.153 rows=5 loops=1)
               Output: ds.id, ds."projectId", (count(*)), (count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((e.conflict IS NULL) OR NULL::boolean))), (count(fd."formId")), (count(form_attachments."formId")), (count((a.details -> 'submissionId'::text))), (count(((a."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count((a_1.details -> 'submissionDefId'::text))), (count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), (count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)))
               Inner Unique: true
               Merge Cond: (ds.id = e_1."datasetId")
               Buffers: shared hit=5008 read=16883
               I/O Timings: read=90.148
               ->  Merge Left Join  (cost=24544.77..24545.82 rows=5 width=88) (actual time=219.550..219.713 rows=5 loops=1)
                     Output: ds.id, ds."projectId", (count(*)), (count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((e.conflict IS NULL) OR NULL::boolean))), (count(fd."formId")), (count(form_attachments."formId")), (count((a.details -> 'submissionId'::text))), (count(((a."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)))
                     Inner Unique: true
                     Merge Cond: (ds.id = dfd_1."datasetId")
                     Buffers: shared hit=2823 read=16883
                     I/O Timings: read=90.148
                     ->  Merge Left Join  (cost=24485.13..24485.84 rows=5 width=72) (actual time=219.132..219.284 rows=5 loops=1)
                           Output: ds.id, ds."projectId", (count(*)), (count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(*)), (count(((e.conflict IS NULL) OR NULL::boolean))), (count(fd."formId")), (count(form_attachments."formId"))
                           Inner Unique: true
                           Merge Cond: (ds.id = e."datasetId")
                           Buffers: shared hit=2771 read=16883
                           I/O Timings: read=90.148
                           ->  Merge Left Join  (cost=5270.88..5271.54 rows=5 width=56) (actual time=45.543..45.686 rows=5 loops=1)
                                 Output: ds.id, ds."projectId", (count(*)), (count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (count(fd."formId")), (count(form_attachments."formId"))
                                 Inner Unique: true
                                 Merge Cond: (ds.id = entities."datasetId")
                                 Buffers: shared hit=1680
                                 ->  Sort  (cost=51.07..51.08 rows=5 width=24) (actual time=0.745..0.757 rows=5 loops=1)
                                       Output: ds.id, ds."projectId", (count(fd."formId")), (count(form_attachments."formId"))
                                       Sort Key: ds.id
                                       Sort Method: quicksort  Memory: 25kB
                                       Buffers: shared hit=23
                                       ->  Hash Left Join  (cost=49.90..51.01 rows=5 width=24) (actual time=0.725..0.739 rows=5 loops=1)
                                             Output: ds.id, ds."projectId", (count(fd."formId")), (count(form_attachments."formId"))
                                             Inner Unique: true
                                             Hash Cond: (ds.id = form_attachments."datasetId")
                                             Buffers: shared hit=23
                                             ->  Hash Left Join  (cost=23.87..24.95 rows=5 width=16) (actual time=0.248..0.256 rows=5 loops=1)
                                                   Output: ds.id, ds."projectId", (count(fd."formId"))
                                                   Inner Unique: true
                                                   Hash Cond: (ds.id = dfd."datasetId")
                                                   Buffers: shared hit=14
                                                   ->  Seq Scan on public.datasets ds  (cost=0.00..1.06 rows=5 width=8) (actual time=0.008..0.010 rows=5 loops=1)
                                                         Output: ds.id, ds."projectId"
                                                         Filter: (ds."publishedAt" IS NOT NULL)
                                                         Rows Removed by Filter: 1
                                                         Buffers: shared hit=1
                                                   ->  Hash  (cost=23.84..23.84 rows=2 width=12) (actual time=0.224..0.228 rows=2 loops=1)
                                                         Output: (count(fd."formId")), dfd."datasetId"
                                                         Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                         Buffers: shared hit=13
                                                         ->  HashAggregate  (cost=23.80..23.82 rows=2 width=12) (actual time=0.220..0.224 rows=2 loops=1)
                                                               Output: count(fd."formId"), dfd."datasetId"
                                                               Group Key: dfd."datasetId"
                                                               Batches: 1  Memory Usage: 24kB
                                                               Buffers: shared hit=13
                                                               ->  Hash Join  (cost=17.05..22.46 rows=269 width=8) (actual time=0.116..0.185 rows=269 loops=1)
                                                                     Output: dfd."datasetId", fd."formId"
                                                                     Inner Unique: true
                                                                     Hash Cond: (dfd."formDefId" = fd.id)
                                                                     Buffers: shared hit=13
                                                                     ->  Seq Scan on public.dataset_form_defs dfd  (cost=0.00..4.69 rows=269 width=8) (actual time=0.004..0.021 rows=269 loops=1)
                                                                           Output: dfd."datasetId", dfd."formDefId", dfd.actions
                                                                           Buffers: shared hit=2
                                                                     ->  Hash  (cost=13.69..13.69 rows=269 width=8) (actual time=0.096..0.097 rows=269 loops=1)
                                                                           Output: fd."formId", fd.id
                                                                           Buckets: 1024  Batches: 1  Memory Usage: 19kB
                                                                           Buffers: shared hit=11
                                                                           ->  Seq Scan on public.form_defs fd  (cost=0.00..13.69 rows=269 width=8) (actual time=0.005..0.054 rows=269 loops=1)
                                                                                 Output: fd."formId", fd.id
                                                                                 Buffers: shared hit=11
                                             ->  Hash  (cost=25.99..25.99 rows=3 width=12) (actual time=0.462..0.466 rows=3 loops=1)
                                                   Output: (count(form_attachments."formId")), form_attachments."datasetId"
                                                   Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                   Buffers: shared hit=9
                                                   ->  HashAggregate  (cost=25.93..25.96 rows=3 width=12) (actual time=0.454..0.457 rows=4 loops=1)
                                                         Output: count(form_attachments."formId"), form_attachments."datasetId"
                                                         Group Key: form_attachments."datasetId"
                                                         Batches: 1  Memory Usage: 24kB
                                                         Buffers: shared hit=9
                                                         ->  Hash Join  (cost=1.08..22.63 rows=660 width=8) (actual time=0.031..0.339 rows=878 loops=1)
                                                               Output: form_attachments."datasetId", form_attachments."formId"
                                                               Inner Unique: true
                                                               Hash Cond: (form_attachments."formId" = f.id)
                                                               Buffers: shared hit=9
                                                               ->  Seq Scan on public.form_attachments  (cost=0.00..16.80 rows=880 width=8) (actual time=0.004..0.174 rows=880 loops=1)
                                                                     Output: form_attachments."formId", form_attachments."datasetId"
                                                                     Buffers: shared hit=8
                                                               ->  Hash  (cost=1.04..1.04 rows=3 width=4) (actual time=0.011..0.012 rows=3 loops=1)
                                                                     Output: f.id
                                                                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                                     Buffers: shared hit=1
                                                                     ->  Seq Scan on public.forms f  (cost=0.00..1.04 rows=3 width=4) (actual time=0.003..0.004 rows=3 loops=1)
                                                                           Output: f.id
                                                                           Filter: (f."currentDefId" IS NOT NULL)
                                                                           Rows Removed by Filter: 1
                                                                           Buffers: shared hit=1
                                 ->  Finalize GroupAggregate  (cost=5219.81..5220.36 rows=4 width=36) (actual time=44.793..44.887 rows=4 loops=1)
                                       Output: count(*), count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean)), count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), entities."datasetId"
                                       Group Key: entities."datasetId"
                                       Buffers: shared hit=1657
                                       ->  Gather Merge  (cost=5219.81..5220.27 rows=4 width=36) (actual time=44.784..44.871 rows=8 loops=1)
                                             Output: entities."datasetId", (PARTIAL count(*)), (PARTIAL count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (PARTIAL count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (PARTIAL count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)))
                                             Workers Planned: 1
                                             Workers Launched: 1
                                             Buffers: shared hit=1657
                                             ->  Sort  (cost=4219.80..4219.81 rows=4 width=36) (actual time=41.776..41.778 rows=4 loops=2)
                                                   Output: entities."datasetId", (PARTIAL count(*)), (PARTIAL count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (PARTIAL count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean))), (PARTIAL count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)))
                                                   Sort Key: entities."datasetId"
                                                   Sort Method: quicksort  Memory: 25kB
                                                   Buffers: shared hit=1657
                                                   Worker 0:  actual time=39.365..39.366 rows=4 loops=1
                                                     Sort Method: quicksort  Memory: 25kB
                                                     Buffers: shared hit=771
                                                   ->  Partial HashAggregate  (cost=4219.72..4219.76 rows=4 width=36) (actual time=41.735..41.737 rows=4 loops=2)
                                                         Output: entities."datasetId", PARTIAL count(*), PARTIAL count(((entities."createdAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), PARTIAL count(((entities."updatedAt" IS NOT NULL) OR NULL::boolean)), PARTIAL count(((entities."updatedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))
                                                         Group Key: entities."datasetId"
                                                         Batches: 1  Memory Usage: 24kB
                                                         Buffers: shared hit=1650
                                                         Worker 0:  actual time=39.313..39.314 rows=4 loops=1
                                                           Batches: 1  Memory Usage: 24kB
                                                           Buffers: shared hit=764
                                                         ->  Parallel Seq Scan on public.entities  (cost=0.00..2335.26 rows=68526 width=20) (actual time=0.008..15.876 rows=58247 loops=2)
                                                               Output: entities."createdAt", entities."updatedAt", entities."datasetId"
                                                               Buffers: shared hit=1650
                                                               Worker 0:  actual time=0.011..15.345 rows=53843 loops=1
                                                                 Buffers: shared hit=764
                           ->  GroupAggregate  (cost=19214.24..19214.27 rows=1 width=20) (actual time=173.584..173.588 rows=1 loops=1)
                                 Output: count(*), count(((e.conflict IS NULL) OR NULL::boolean)), e."datasetId"
                                 Group Key: e."datasetId"
                                 Buffers: shared hit=1091 read=16883
                                 I/O Timings: read=90.148
                                 ->  Sort  (cost=19214.24..19214.25 rows=1 width=8) (actual time=173.577..173.579 rows=1 loops=1)
                                       Output: e."datasetId", e.conflict
                                       Sort Key: e."datasetId"
                                       Sort Method: quicksort  Memory: 25kB
                                       Buffers: shared hit=1091 read=16883
                                       I/O Timings: read=90.148
                                       ->  Nested Loop  (cost=19206.20..19214.23 rows=1 width=8) (actual time=173.536..173.539 rows=1 loops=1)
                                             Output: e."datasetId", e.conflict
                                             Inner Unique: true
                                             Buffers: shared hit=1091 read=16883
                                             I/O Timings: read=90.148
                                             ->  HashAggregate  (cost=19205.91..19205.92 rows=1 width=4) (actual time=173.497..173.499 rows=1 loops=1)
                                                   Output: entity_defs."entityId"
                                                   Group Key: entity_defs."entityId"
                                                   Batches: 1  Memory Usage: 24kB
                                                   Buffers: shared hit=1088 read=16883
                                                   I/O Timings: read=90.148
                                                   ->  Seq Scan on public.entity_defs  (cost=0.00..19205.91 rows=1 width=4) (actual time=49.886..173.482 rows=1 loops=1)
                                                         Output: entity_defs."entityId"
                                                         Filter: (entity_defs."conflictingProperties" IS NOT NULL)
                                                         Rows Removed by Filter: 123490
                                                         Buffers: shared hit=1088 read=16883
                                                         I/O Timings: read=90.148
                                             ->  Index Scan using entities_pkey on public.entities e  (cost=0.29..8.31 rows=1 width=12) (actual time=0.032..0.033 rows=1 loops=1)
                                                   Output: e.conflict, e."datasetId", e.id
                                                   Index Cond: (e.id = entity_defs."entityId")
                                                   Buffers: shared hit=3
                     ->  GroupAggregate  (cost=59.64..59.92 rows=2 width=20) (actual time=0.414..0.420 rows=1 loops=1)
                           Output: count((a.details -> 'submissionId'::text)), count(((a."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), dfd_1."datasetId"
                           Group Key: dfd_1."datasetId"
                           Buffers: shared hit=52
                           ->  Sort  (cost=59.64..59.67 rows=13 width=73) (actual time=0.393..0.400 rows=16 loops=1)
                                 Output: dfd_1."datasetId", a.details, a."loggedAt"
                                 Sort Key: dfd_1."datasetId"
                                 Sort Method: quicksort  Memory: 32kB
                                 Buffers: shared hit=52
                                 ->  Hash Join  (cost=53.24..59.40 rows=13 width=73) (actual time=0.320..0.372 rows=16 loops=1)
                                       Output: dfd_1."datasetId", a.details, a."loggedAt"
                                       Hash Cond: (dfd_1."formDefId" = sd."formDefId")
                                       Buffers: shared hit=52
                                       ->  Seq Scan on public.dataset_form_defs dfd_1  (cost=0.00..4.69 rows=269 width=8) (actual time=0.008..0.036 rows=269 loops=1)
                                             Output: dfd_1."datasetId", dfd_1."formDefId", dfd_1.actions
                                             Buffers: shared hit=2
                                       ->  Hash  (cost=53.07..53.07 rows=13 width=73) (actual time=0.250..0.254 rows=16 loops=1)
                                             Output: a.details, a."loggedAt", sd."formDefId"
                                             Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                             Buffers: shared hit=50
                                             ->  Nested Loop  (cost=33.26..53.07 rows=13 width=73) (actual time=0.117..0.237 rows=16 loops=1)
                                                   Output: a.details, a."loggedAt", sd."formDefId"
                                                   Buffers: shared hit=50
                                                   ->  Hash Join  (cost=33.11..41.98 rows=13 width=73) (actual time=0.100..0.181 rows=16 loops=1)
                                                         Output: a.details, a."loggedAt", s.id
                                                         Hash Cond: (s.id = ((a.details -> 'submissionId'::text))::integer)
                                                         Buffers: shared hit=18
                                                         ->  Seq Scan on public.submissions s  (cost=0.00..7.30 rows=230 width=4) (actual time=0.006..0.046 rows=230 loops=1)
                                                               Output: s.id
                                                               Buffers: shared hit=5
                                                         ->  Hash  (cost=32.95..32.95 rows=13 width=69) (actual time=0.073..0.074 rows=16 loops=1)
                                                               Output: a.details, a."loggedAt"
                                                               Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                               Buffers: shared hit=13
                                                               ->  Index Scan using audits_action_acteeid_loggedat_index on public.audits a  (cost=0.42..32.95 rows=13 width=69) (actual time=0.028..0.044 rows=16 loops=1)
                                                                     Output: a.details, a."loggedAt"
                                                                     Index Cond: (a.action = 'entity.error'::text)
                                                                     Buffers: shared hit=13
                                                   ->  Index Scan using submission_defs_submissionid_current_index on public.submission_defs sd  (cost=0.14..0.84 rows=1 width=8) (actual time=0.002..0.003 rows=1 loops=16)
                                                         Output: sd.id, sd."submissionId", sd.xml, sd."formDefId", sd."submitterId", sd."createdAt", sd."encDataAttachmentName", sd."localKey", sd.signature, sd.current, sd."instanceName", sd."instanceId", sd."userAgent", sd."deviceId", sd.root
                                                         Index Cond: ((sd."submissionId" = s.id) AND (sd.current = true))
                                                         Buffers: shared hit=32
               ->  Finalize GroupAggregate  (cost=7935.62..7936.19 rows=4 width=52) (actual time=88.897..90.427 rows=3 loops=1)
                     Output: count(*), count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), count((a_1.details -> 'submissionDefId'::text)), count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)), count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean)), count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)), e_1."datasetId"
                     Group Key: e_1."datasetId"
                     Buffers: shared hit=2185
                     ->  Gather Merge  (cost=7935.62..7936.08 rows=4 width=52) (actual time=88.882..90.403 rows=6 loops=1)
                           Output: e_1."datasetId", (PARTIAL count(*)), (PARTIAL count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (PARTIAL count((a_1.details -> 'submissionDefId'::text))), (PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (PARTIAL count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), (PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)))
                           Workers Planned: 1
                           Workers Launched: 1
                           Buffers: shared hit=2185
                           ->  Sort  (cost=6935.61..6935.62 rows=4 width=52) (actual time=78.217..78.223 rows=3 loops=2)
                                 Output: e_1."datasetId", (PARTIAL count(*)), (PARTIAL count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))), (PARTIAL count((a_1.details -> 'submissionDefId'::text))), (PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))), (PARTIAL count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean))), (PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)))
                                 Sort Key: e_1."datasetId"
                                 Sort Method: quicksort  Memory: 25kB
                                 Buffers: shared hit=2185
                                 Worker 0:  actual time=71.030..71.035 rows=3 loops=1
                                   Sort Method: quicksort  Memory: 25kB
                                   Buffers: shared hit=728
                                 ->  Partial HashAggregate  (cost=6935.53..6935.57 rows=4 width=52) (actual time=78.182..78.187 rows=3 loops=2)
                                       Output: e_1."datasetId", PARTIAL count(*), PARTIAL count(((a_1."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), PARTIAL count((a_1.details -> 'submissionDefId'::text)), PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)), PARTIAL count((((a_1.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean)), PARTIAL count(((((a_1.details -> 'submissionDefId'::text) IS NULL) AND (a_1."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean))
                                       Group Key: e_1."datasetId"
                                       Batches: 1  Memory Usage: 24kB
                                       Buffers: shared hit=2178
                                       Worker 0:  actual time=70.987..70.992 rows=3 loops=1
                                         Batches: 1  Memory Usage: 24kB
                                         Buffers: shared hit=721
                                       ->  Parallel Hash Join  (cost=3747.63..6740.58 rows=3899 width=73) (actual time=60.367..71.012 rows=3498 loops=2)
                                             Output: e_1."datasetId", a_1."loggedAt", a_1.details
                                             Inner Unique: true
                                             Hash Cond: (((a_1.details -> 'entityId'::text))::integer = e_1.id)
                                             Buffers: shared hit=2178
                                             Worker 0:  actual time=52.669..61.221 rows=1471 loops=1
                                               Buffers: shared hit=721
                                             ->  Parallel Bitmap Heap Scan on public.audits a_1  (cost=555.79..3537.53 rows=3899 width=69) (actual time=1.798..2.790 rows=3498 loops=2)
                                                   Output: a_1."actorId", a_1.action, a_1."acteeId", a_1.details, a_1."loggedAt", a_1.claimed, a_1.processed, a_1."lastFailure", a_1.failures, a_1.id, a_1.notes
                                                   Recheck Cond: (a_1.action = 'entity.update.version'::text)
                                                   Heap Blocks: exact=285
                                                   Buffers: shared hit=528
                                                   Worker 0:  actual time=1.701..2.197 rows=1471 loops=1
                                                     Buffers: shared hit=243
                                                   ->  Bitmap Index Scan on audits_action_acteeid_loggedat_index  (cost=0.00..554.13 rows=6629 width=0) (actual time=1.612..1.612 rows=6997 loops=1)
                                                         Index Cond: (a_1.action = 'entity.update.version'::text)
                                                         Buffers: shared hit=148
                                                         Worker 0:  actual time=1.612..1.612 rows=6997 loops=1
                                                           Buffers: shared hit=148
                                             ->  Parallel Hash  (cost=2335.26..2335.26 rows=68526 width=8) (actual time=56.837..56.839 rows=58247 loops=2)
                                                   Output: e_1."datasetId", e_1.id
                                                   Buckets: 131072  Batches: 1  Memory Usage: 5600kB
                                                   Buffers: shared hit=1650
                                                   Worker 0:  actual time=50.924..50.926 rows=33352 loops=1
                                                     Buffers: shared hit=478
                                                   ->  Parallel Seq Scan on public.entities e_1  (cost=0.00..2335.26 rows=68526 width=8) (actual time=0.009..24.636 rows=58247 loops=2)
                                                         Output: e_1."datasetId", e_1.id
                                                         Buffers: shared hit=1650
                                                         Worker 0:  actual time=0.009..27.249 rows=33352 loops=1
                                                           Buffers: shared hit=478
         ->  GroupAggregate  (cost=1853.06..1861.06 rows=4 width=36) (actual time=1.474..1.476 rows=1 loops=1)
               Output: count((a_2.details -> 'submissionDefId'::text)), count(((((a_2.details -> 'submissionDefId'::text) IS NOT NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)), count((((a_2.details -> 'submissionDefId'::text) IS NULL) OR NULL::boolean)), count(((((a_2.details -> 'submissionDefId'::text) IS NULL) AND (a_2."loggedAt" >= (CURRENT_DATE - 45))) OR NULL::boolean)), e_2."datasetId"
               Group Key: e_2."datasetId"
               Buffers: shared hit=683
               ->  Sort  (cost=1853.06..1853.56 rows=199 width=73) (actual time=1.210..1.225 rows=184 loops=1)
                     Output: e_2."datasetId", a_2.details, a_2."loggedAt"
                     Sort Key: e_2."datasetId"
                     Sort Method: quicksort  Memory: 73kB
                     Buffers: shared hit=683
                     ->  Nested Loop  (cost=0.71..1845.47 rows=199 width=73) (actual time=0.068..1.032 rows=184 loops=1)
                           Output: e_2."datasetId", a_2.details, a_2."loggedAt"
                           Inner Unique: true
                           Buffers: shared hit=683
                           ->  Index Scan using audits_action_acteeid_loggedat_index on public.audits a_2  (cost=0.42..414.78 rows=199 width=69) (actual time=0.043..0.261 rows=184 loops=1)
                                 Output: a_2."actorId", a_2.action, a_2."acteeId", a_2.details, a_2."loggedAt", a_2.claimed, a_2.processed, a_2."lastFailure", a_2.failures, a_2.id, a_2.notes
                                 Index Cond: (a_2.action = 'entity.create'::text)
                                 Buffers: shared hit=131
                           ->  Index Scan using entities_pkey on public.entities e_2  (cost=0.30..7.19 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=184)
                                 Output: e_2."datasetId", e_2.id
                                 Index Cond: (e_2.id = ((a_2.details -> 'entityId'::text))::integer)
                                 Buffers: shared hit=552
   ->  Finalize GroupAggregate  (cost=20595.74..20615.14 rows=4 width=20) (actual time=343.523..349.766 rows=4 loops=1)
         Output: count(*), count(((a_3."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)), e_3."datasetId"
         Group Key: e_3."datasetId"
         Buffers: shared hit=351187 read=16851
         I/O Timings: read=95.305
         ->  Gather Merge  (cost=20595.74..20615.04 rows=8 width=20) (actual time=341.803..349.747 rows=10 loops=1)
               Output: e_3."datasetId", (PARTIAL count(*)), (PARTIAL count(((a_3."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean)))
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=351187 read=16851
               I/O Timings: read=95.305
               ->  Partial GroupAggregate  (cost=19595.71..19614.09 rows=4 width=20) (actual time=316.218..320.273 rows=3 loops=3)
                     Output: e_3."datasetId", PARTIAL count(*), PARTIAL count(((a_3."loggedAt" >= (CURRENT_DATE - 45)) OR NULL::boolean))
                     Group Key: e_3."datasetId"
                     Buffers: shared hit=351187 read=16851
                     I/O Timings: read=95.305
                     Worker 0:  actual time=310.823..316.801 rows=4 loops=1
                       Buffers: shared hit=126978 read=6112
                       I/O Timings: read=22.134
                     Worker 1:  actual time=305.298..306.786 rows=3 loops=1
                       Buffers: shared hit=100787 read=5088
                       I/O Timings: read=24.037
                     ->  Sort  (cost=19595.71..19598.33 rows=1048 width=12) (actual time=292.550..297.820 rows=38770 loops=3)
                           Output: e_3."datasetId", a_3."loggedAt"
                           Sort Key: e_3."datasetId"
                           Sort Method: quicksort  Memory: 3280kB
                           Buffers: shared hit=351187 read=16851
                           I/O Timings: read=95.305
                           Worker 0:  actual time=283.123..288.938 rows=42093 loops=1
                             Sort Method: quicksort  Memory: 3339kB
                             Buffers: shared hit=126978 read=6112
                             I/O Timings: read=22.134
                           Worker 1:  actual time=282.328..286.018 rows=33388 loops=1
                             Sort Method: quicksort  Memory: 2931kB
                             Buffers: shared hit=100787 read=5088
                             I/O Timings: read=24.037
                           ->  Nested Loop  (cost=469.03..19543.14 rows=1048 width=12) (actual time=0.832..270.241 rows=38770 loops=3)
                                 Output: e_3."datasetId", a_3."loggedAt"
                                 Inner Unique: true
                                 Buffers: shared hit=351173 read=16851
                                 I/O Timings: read=95.305
                                 Worker 0:  actual time=0.932..270.139 rows=42093 loops=1
                                   Buffers: shared hit=126971 read=6112
                                   I/O Timings: read=22.134
                                 Worker 1:  actual time=0.873..260.335 rows=33388 loops=1
                                   Buffers: shared hit=100780 read=5088
                                   I/O Timings: read=24.037
                                 ->  Hash Join  (cost=468.74..19146.62 rows=1048 width=12) (actual time=0.809..107.154 rows=38770 loops=3)
                                       Output: a_3."loggedAt", ed."entityId"
                                       Hash Cond: (ed."sourceId" = eds.id)
                                       Buffers: shared hit=2241 read=16851
                                       I/O Timings: read=95.305
                                       Worker 0:  actual time=0.910..106.142 rows=42093 loops=1
                                         Buffers: shared hit=691 read=6112
                                         I/O Timings: read=22.134
                                       Worker 1:  actual time=0.849..85.968 rows=33388 loops=1
                                         Buffers: shared hit=615 read=5088
                                         I/O Timings: read=24.037
                                       ->  Parallel Seq Scan on public.entity_defs ed  (cost=0.00..18485.55 rows=48498 width=8) (actual time=0.041..85.270 rows=38831 loops=3)
                                             Output: ed."sourceId", ed."entityId"
                                             Filter: ed.root
                                             Rows Removed by Filter: 2332
                                             Buffers: shared hit=1120 read=16851
                                             I/O Timings: read=95.305
                                             Worker 0:  actual time=0.047..87.999 rows=42117 loops=1
                                               Buffers: shared hit=314 read=6112
                                               I/O Timings: read=22.134
                                             Worker 1:  actual time=0.049..66.729 rows=33492 loops=1
                                               Buffers: shared hit=238 read=5088
                                               I/O Timings: read=24.037
                                       ->  Hash  (cost=466.76..466.76 rows=158 width=73) (actual time=0.727..0.730 rows=135 loops=3)
                                             Output: a_3."loggedAt", a_3.details, eds.id
                                             Buckets: 1024  Batches: 1  Memory Usage: 19kB
                                             Buffers: shared hit=1105
                                             Worker 0:  actual time=0.826..0.829 rows=135 loops=1
                                               Buffers: shared hit=369
                                             Worker 1:  actual time=0.764..0.766 rows=135 loops=1
                                               Buffers: shared hit=369
                                             ->  Nested Loop  (cost=0.70..466.76 rows=158 width=73) (actual time=0.097..0.673 rows=135 loops=3)
                                                   Output: a_3."loggedAt", a_3.details, eds.id
                                                   Inner Unique: true
                                                   Buffers: shared hit=1105
                                                   Worker 0:  actual time=0.134..0.773 rows=135 loops=1
                                                     Buffers: shared hit=369
                                                   Worker 1:  actual time=0.098..0.708 rows=135 loops=1
                                                     Buffers: shared hit=369
                                                   ->  Index Scan using audits_action_acteeid_loggedat_index on public.audits a_3  (cost=0.42..330.57 rows=158 width=69) (actual time=0.051..0.247 rows=135 loops=3)
                                                         Output: a_3."actorId", a_3.action, a_3."acteeId", a_3.details, a_3."loggedAt", a_3.claimed, a_3.processed, a_3."lastFailure", a_3.failures, a_3.id, a_3.notes
                                                         Index Cond: (a_3.action = 'entity.bulk.create'::text)
                                                         Buffers: shared hit=290
                                                         Worker 0:  actual time=0.077..0.321 rows=135 loops=1
                                                           Buffers: shared hit=97
                                                         Worker 1:  actual time=0.048..0.275 rows=135 loops=1
                                                           Buffers: shared hit=97
                                                   ->  Index Only Scan using entity_def_sources_pkey on public.entity_def_sources eds  (cost=0.29..0.86 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=405)
                                                         Output: eds.id
                                                         Index Cond: (eds.id = ((a_3.details -> 'sourceId'::text))::integer)
                                                         Heap Fetches: 0
                                                         Buffers: shared hit=815
                                                         Worker 0:  actual time=0.002..0.002 rows=1 loops=135
                                                           Buffers: shared hit=272
                                                         Worker 1:  actual time=0.002..0.002 rows=1 loops=135
                                                           Buffers: shared hit=272
                                 ->  Index Scan using entities_pkey on public.entities e_3  (cost=0.29..0.38 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=116310)
                                       Output: e_3."datasetId", e_3.id
                                       Index Cond: (e_3.id = ed."entityId")
                                       Buffers: shared hit=348932
                                       Worker 0:  actual time=0.003..0.003 rows=1 loops=42093
                                         Buffers: shared hit=126280
                                       Worker 1:  actual time=0.004..0.004 rows=1 loops=33388
                                         Buffers: shared hit=100165
 Query Identifier: 7601977736375776119
 Planning:
   Buffers: shared hit=64
 Planning Time: 3.520 ms
 Execution Time: 661.930 ms
(401 rows)
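
A note on reading a plan this long: the `actual time` on each node is a per-loop average, so a node that looks cheap can still dominate when its `loops` count is high. The sketch below is a hypothetical helper (`rankPlanNodes` is not part of this PR or of any library) that scans the text output and ranks nodes by `actual time × loops`. It assumes the default text format shown above. In parallel plans the product adds up time across processes rather than measuring wall-clock time, so treat the ranking as a rough guide only.

```javascript
// Hypothetical helper for skimming EXPLAIN (ANALYZE) text output.
// Assumption: plan node lines look like
//   "->  Node Name  (cost=...) (actual time=A..B rows=R loops=L)".
const NODE_RE = /^\s*(?:->\s*)?(.+?)\s+\(cost=[^)]*\)\s+\(actual time=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) loops=(\d+)\)/;

function rankPlanNodes(planText) {
  return planText
    .split('\n')
    .map((line) => NODE_RE.exec(line))
    .filter(Boolean) // drop Output:, Buffers:, Worker lines etc.
    .map(([, node, , end, rows, loops]) => ({
      node,
      loops: Number(loops),
      rowsPerLoop: Number(rows),
      // "actual time" is averaged per loop, so scale it by the loop count.
      approxTotalMs: Number(end) * Number(loops),
    }))
    .sort((a, b) => b.approxTotalMs - a.approxTotalMs);
}

// Two node lines copied from the plan above.
const sample = [
  '->  Hash Join  (cost=468.74..19146.62 rows=1048 width=12) (actual time=0.809..107.154 rows=38770 loops=3)',
  '->  Index Scan using entities_pkey on public.entities e_3  (cost=0.29..0.38 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=116310)',
].join('\n');

for (const n of rankPlanNodes(sample)) {
  console.log(n.approxTotalMs.toFixed(1), 'ms', n.node);
}
```

Applied to the plan above, the `entities_pkey` index scan on `e_3` (0.003 ms × 116310 loops) accumulates roughly 349 ms across processes, which is why it ranks ahead of the hash join feeding it.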

ktuite

🚀

alxndrsn added 17 commits February 6, 2025 15:15

restrict tests to the ones which matter

22f4291

formatting

31cf95e

formatting

dd4655c

Improve query a bit

2d08f6b

tidy

4c012c4

wip

e88b531

passing

755e9e7

cloers...

a8c1f3d

move a count upwards

64fcd6a

move another count upwards

9134afc

remove coment

42082aa

remove some distincts

de088ac

remove some distincts

602b23f

sumelsewhere

cd0b862

formatting

2d2f8b8

remove top-level aggregation

2232edf

revert .only() from tests

63fe8cf

alxndrsn requested a review from ktuite February 6, 2025 16:41

remove unused stuff

ee88bf3

ktuite approved these changes Feb 7, 2025

lib/model/query/analytics.js (outdated, resolved)

lindsay-stevens reviewed Feb 7, 2025

lib/model/query/analytics.js (outdated, resolved)

alxndrsn commented Feb 7, 2025

lib/model/query/analytics.js (outdated, resolved)

alxndrsn added 8 commits February 7, 2025 05:22

use _cutoffDate

6757bea

less case statements

d20961a

simpler counting

3804242

less casting

e401445

less sums

f472e09

formatting

1b4d56c

eliminate sums totally

245958e

reduce aliases in subqueries

835d5b7

alxndrsn added 7 commits February 7, 2025 06:01

formatting

656f9b7

move dataset to end of selects

e542d52

standardise casing, quoting

acfb130

no more distinct

c27e8c6

colocate entities queries

0668abd

standardise join conditional ordering

e8e1d22

EXISTs

ff206a6

alxndrsn commented Feb 7, 2025

lib/model/query/analytics.js (resolved)

alxndrsn marked this pull request as ready for review February 7, 2025 06:51

alxndrsn requested review from ktuite and lindsay-stevens February 7, 2025 06:51

formatting

0b60e85

ktuite approved these changes Feb 7, 2025

Merge branch 'master' into query-changes-1

a0b70a3

alxndrsn merged commit 2a9e1bc into getodk:master Feb 8, 2025
6 checks passed

alxndrsn deleted the query-changes-1 branch February 8, 2025 07:49
