Loading bibliographic coupling takes too long for 1k papers in Semantic Scholar #273

Open
olegs opened this issue Jul 20, 2021 · 1 comment
Labels
invalid This doesn't seem right

Comments


olegs commented Jul 20, 2021

The bibliographic coupling step took 26 minutes (17:10:20 to 17:36:21 in the log below) for this request: https://pubtrends.net/result?query=programming%20languages%20theory&source=Semantic%20Scholar&limit=1000&sort=Most%20Cited&noreviews=False&expand=0&jobid=predefined_38f03d8fe1ce0f5b5dbb1cc63a67a22a

[2021-07-20 17:08:22] Analyzing search query
[2021-07-20 17:08:22] Searching 1000 most cited publications matching programming languages theory
[2021-07-20 17:09:03] Loading publication data
[2021-07-20 17:09:05] Analyzing title and abstract texts
[2021-07-20 17:09:15] Loading citations statistics for papers
[2021-07-20 17:10:18] Loading citations information
[2021-07-20 17:10:19] Calculating co-citations for selected papers
[2021-07-20 17:10:20] Processing bibliographic coupling for selected papers
[2021-07-20 17:36:21] Analyzing papers similarity graph
[2021-07-20 17:36:21] Extracting topics from paper similarity graph
[2021-07-20 17:37:20] Analyzing topics descriptions
[2021-07-20 17:37:24] Identifying top papers
[2021-07-20 17:37:24] Analyzing authors and groups
[2021-07-20 17:37:24] Analyzing popular journals
[2021-07-20 17:37:25] Visualizing
[2021-07-20 17:37:39] Done
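
To make the bottleneck explicit, the timestamps in the log above can be turned into per-step durations. The following is a minimal sketch, not part of the project's code: the function name `step_durations` is hypothetical, and it assumes each log line has the form `[YYYY-MM-DD HH:MM:SS] message` as shown above.

```python
from datetime import datetime

# Progress log copied from the issue above.
LOG = """\
[2021-07-20 17:08:22] Analyzing search query
[2021-07-20 17:08:22] Searching 1000 most cited publications matching programming languages theory
[2021-07-20 17:09:03] Loading publication data
[2021-07-20 17:09:05] Analyzing title and abstract texts
[2021-07-20 17:09:15] Loading citations statistics for papers
[2021-07-20 17:10:18] Loading citations information
[2021-07-20 17:10:19] Calculating co-citations for selected papers
[2021-07-20 17:10:20] Processing bibliographic coupling for selected papers
[2021-07-20 17:36:21] Analyzing papers similarity graph
[2021-07-20 17:36:21] Extracting topics from paper similarity graph
[2021-07-20 17:37:20] Analyzing topics descriptions
[2021-07-20 17:37:24] Identifying top papers
[2021-07-20 17:37:24] Analyzing authors and groups
[2021-07-20 17:37:24] Analyzing popular journals
[2021-07-20 17:37:25] Visualizing
[2021-07-20 17:37:39] Done
"""


def step_durations(log_text):
    """Return (step message, seconds until the next log line) pairs."""
    entries = []
    for line in log_text.strip().splitlines():
        # Split "[timestamp] message" at the first "] ".
        stamp, _, message = line.partition("] ")
        when = datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S")
        entries.append((when, message))
    # A step lasts from its own timestamp to the timestamp of the next line.
    return [
        (message, int((next_when - when).total_seconds()))
        for (when, message), (next_when, _) in zip(entries, entries[1:])
    ]


durations = step_durations(LOG)
slowest = max(durations, key=lambda item: item[1])
total = sum(seconds for _, seconds in durations)
print(f"slowest step: {slowest[0]} ({slowest[1]} s of {total} s total)")
```

With this log, the bibliographic coupling step accounts for 1561 of the 1757 seconds between the first and last lines.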
@olegs olegs added the invalid This doesn't seem right label Jul 20, 2021

olegs commented Jul 21, 2021

EXPLAIN ANALYSE output for the part of the query that fetches bibliographic coupling data:

explain analyse
SELECT ssid_out, ssid_in, crc32id_in
FROM sscitations C
WHERE (crc32id_out, ssid_out) IN (VALUES (-2004926960, 'eb33b4f5b7ba0f135f1025cac48d7fa26d43668b'), (-1498603286, 'f673921415d0589621e5d2a086899209c4998c54'), (-1097331807, 'db3e0391be8c586fb57edadcbcb9ee1fab2353a0'), (-1780487288, '4198e76048ccbcfffe66d1d7a7af496dbe4f3263'), (-1620333214, '905748cd0222df99c9755f59fd526c56a94d9da4'), (-736907493, '97e8696138a75c184fd209eb1a88ed3ab36b915f'), (251655193, '68efa14f4b04ff95daa7f273cc05a119338eacaa'), (-2077481228, '6855871e5b3a8fa972c20b4c314b1625628b8cd1'), (573300619, 'f1656f65c17281a7a040dd1b3525330c39645f43'), (1703800989, 'cecbe6b6db513e2f2cd6727aaaa48807d9e33573'))
LIMIT 1000;
Limit  (cost=1050.00..87316.72 rows=1000 width=86) (actual time=20.531..52627.028 rows=1000 loops=1)
  ->  Gather  (cost=1050.00..31337953.61 rows=363256 width=86) (actual time=20.528..52626.764 rows=1000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Hash Semi Join  (cost=50.00..31300628.01 rows=151357 width=86) (actual time=4235.653..52570.177 rows=366 loops=3)
              Hash Cond: ((c.ssid_out)::text = "*VALUES*_1".column1)
              ->  Hash Semi Join  (cost=25.00..31273306.85 rows=9757069 width=86) (actual time=4234.993..52569.171 rows=371 loops=3)
                    Hash Cond: (c.crc32id_out = "*VALUES*".column1)
                    ->  Parallel Seq Scan on sscitations c  (cost=0.00..29513662.80 rows=628979680 width=90) (actual time=0.366..39705.072 rows=51819328 loops=3)
                    ->  Hash  (cost=12.50..12.50 rows=1000 width=4) (actual time=0.405..0.406 rows=1000 loops=3)
                          Buckets: 1024  Batches: 1  Memory Usage: 44kB
                          ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.223 rows=1000 loops=3)
              ->  Hash  (cost=12.50..12.50 rows=1000 width=32) (actual time=0.489..0.490 rows=1000 loops=3)
                    Buckets: 1024  Batches: 1  Memory Usage: 80kB
                    ->  Values Scan on "*VALUES*_1"  (cost=0.00..12.50 rows=1000 width=32) (actual time=0.002..0.226 rows=1000 loops=3)
Planning Time: 16.804 ms
Execution Time: 52627.403 ms
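
For context on what the fetched rows are used for: the query above returns outgoing citations (ssid_out cites ssid_in) for the selected papers, and bibliographic coupling between two papers is commonly defined as the number of references they share. The sketch below illustrates that definition on in-memory (citing, cited) pairs. It is an illustration only, not the project's implementation; the function name `bibliographic_coupling` and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bibliographic_coupling(citations):
    """Count shared references for every pair of citing papers.

    citations: iterable of (citing_id, cited_id) pairs.
    Returns {(a, b): shared_reference_count} with a < b.
    """
    # Group citing papers by the reference they cite.
    citing_by_reference = defaultdict(set)
    for citing, cited in citations:
        citing_by_reference[cited].add(citing)

    # Every pair of papers citing the same reference gains one shared reference.
    coupling = defaultdict(int)
    for citing_papers in citing_by_reference.values():
        for a, b in combinations(sorted(citing_papers), 2):
            coupling[(a, b)] += 1
    return dict(coupling)


# Hypothetical sample: p1 and p2 both cite r1 and r2; p3 cites only r2.
rows = [("p1", "r1"), ("p2", "r1"), ("p1", "r2"), ("p2", "r2"), ("p3", "r2")]
result = bibliographic_coupling(rows)
print(result)
```

Grouping by reference first keeps the pairwise work proportional to how many selected papers share each reference, rather than comparing full reference lists for every pair of papers.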

@olegs olegs changed the title from "Loading bibliographic coupling takes too long" to "Loading bibliographic coupling takes too long for 1k papers in Semantic Scholar" Jul 21, 2021
olegs added a commit that referenced this issue Jul 22, 2021
…iographic coupling takes too long for 1k papers in Semantic Scholar #273