Stop using a web server for the API docs generation #578

arnaucasau · 2024-01-03T08:43:15Z

Summary

This PR is part of #564 and changes the updateApiDocs.ts script to stop using a web server and a web crawler for the API docs generation. Instead, the script converts the HTML in the artifact zip file into markdown and copies the images from the same zip to the respective version folder, without downloading them from any web server.

Details

Convert HTML to markdown

After removing the web crawler used to download the HTML files that the script translated into markdown, we need to take into account HTML files that are used as a redirect, like stubs/qiskit.utils.name_args.html for Qiskit v0.45:

<html><head><meta http-equiv="refresh" content="0; url=https://qiskit.org/documentation/apidoc/utils.html#qiskit.utils.name_args"></head></html>

These files were not downloaded by the web crawler and were therefore not processed by the sphinxHtmlToMarkdown function. Now the script tries to convert those files into markdown as well, which yields empty markdown that we need to discard. That translates into this conditional in updateApiDocs.ts:

if (result.markdown == "") {
  continue;
}
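
A minimal sketch of how such a check skips redirect stubs. Here `convertHtml` is a hypothetical stand-in for `sphinxHtmlToMarkdown` (its real signature is not shown in this PR), and the sketch assumes a redirect stub has no `<body>` and therefore converts to an empty string:

```ts
// Hypothetical result shape; the real type in the repository may differ.
interface ConversionResult {
  markdown: string;
}

interface HtmlFile {
  path: string;
  html: string;
}

// Stand-in for sphinxHtmlToMarkdown: returns the <body> text, or "" when the
// file has no body (as in a meta-refresh redirect stub).
function convertHtml(html: string): ConversionResult {
  const body = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  return { markdown: body ? body[1].trim() : "" };
}

// Convert every file, skipping the ones that produce empty markdown.
function convertAll(files: HtmlFile[]): { path: string; markdown: string }[] {
  const output: { path: string; markdown: string }[] = [];
  for (const file of files) {
    const result = convertHtml(file.html);
    if (result.markdown === "") {
      continue; // redirect stub: nothing worth writing to disk
    }
    output.push({ path: file.path, markdown: result.markdown });
  }
  return output;
}
```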

Save images

As for the images, we no longer need the web server because they are in the _images folder inside the artifact zip file. The script copies all the images to the images folder of the corresponding API version. Moreover, the script saves the release notes images only when generating the current API docs (that is, when the historical argument is not used). This change will let us remove the unnecessary duplicate images we currently download.

Bug fix

In addition to that change, the PR fixes an underlying issue with the old method. We were only downloading images when they were not present in the API images folder by checking if we already had a file with the same name. New versions of the same API have new images with the same name as the old ones, and because of that, we were never saving the new ones.
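
The stale-image problem can be reproduced with a simplified in-memory model. The names below (`ImageFolder`, `saveIfMissing`, `saveAlways`) are hypothetical and only illustrate the two behaviours; they are not the functions in this repository:

```ts
// Simplified in-memory model of an images folder: file name -> file contents.
type ImageFolder = Record<string, string>;

// Old behaviour (as described above): skip the image when a file with the
// same name is already present, so updated images are never saved.
function saveIfMissing(folder: ImageFolder, name: string, contents: string): void {
  if (name in folder) {
    return;
  }
  folder[name] = contents;
}

// New behaviour: always write, replacing any older image with the same name.
function saveAlways(folder: ImageFolder, name: string, contents: string): void {
  folder[name] = contents;
}

const oldWay: ImageFolder = { "circuit.png": "image from previous version" };
saveIfMissing(oldWay, "circuit.png", "image from new version");
// oldWay["circuit.png"] is still "image from previous version"

const newWay: ImageFolder = { "circuit.png": "image from previous version" };
saveAlways(newWay, "circuit.png", "image from new version");
// newWay["circuit.png"] is now "image from new version"
```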

All the historical versions will be regenerated in a follow-up.

Commands used:

npm run gen-api -- -p qiskit -v 0.45.0 -a https://github.com/Qiskit/qiskit/actions/runs/6744953436/artifacts/1026798160
npm run gen-api -- -p qiskit-ibm-provider -v 0.7.3 -a https://github.com/Qiskit/qiskit-ibm-provider/actions/runs/7301486985/artifacts/1131430696
npm run gen-api -- -p qiskit-ibm-runtime -v 0.17.0 -a https://github.com/Qiskit/qiskit-ibm-runtime/suites/18863019852/artifacts/1100724937

Closes #564

frankharkins

Big improvement and quick turnaround, thanks!

Eric-Arellano

Amazing work!

scripts/lib/WebCrawler.ts

scripts/lib/downloadImages.ts

Eric-Arellano · 2024-01-03T11:10:49Z

scripts/lib/downloadImages.ts

-    await closeWebServer();
-  }
+  await pMap(images, async (img) => {
+    const imgName = img.src.split("/").pop() || "";


What's with the || ""? That seems like it would be a bug.

This is used because the pop() function returns string | undefined. We will never get undefined because the image names are always defined and we never work with empty arrays, so I changed it to use ! at the end to tell the compiler that the result will be a string.

const imgName = img.src.split("/").pop()!;

Excellent - that expresses the intent much better. Good change!
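
For reference, a small sketch of why the assertion is safe. `imageName` is a hypothetical helper name used only for illustration:

```ts
// Returns the last path segment of an image URL or path.
// String.prototype.split with a non-empty separator always returns at least
// one element, so pop() cannot return undefined here; the `!` only tells the
// type checker about that.
function imageName(src: string): string {
  return src.split("/").pop()!;
}

console.log(imageName("/_images/example.png")); // prints "example.png"
console.log(imageName("example.png")); // prints "example.png"

// A trailing slash still yields an empty string, which neither `!` nor the
// earlier `|| ""` would have changed.
console.log(imageName("/_images/") === ""); // prints true
```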

Eric-Arellano · 2024-01-03T11:12:25Z

scripts/lib/downloadImages.ts

+      return;
+    }
+
+    await $`cp ${originalImagesFolderPath}/${imgName} public/${img.dest}`;


It's better to use the builtin mechanism to copy files: https://www.geeksforgeeks.org/node-js-fspromises-copyfile-method/. It avoids overhead from spawning a bunch of new distinct processes.

Much better, thanks! TIL
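
A minimal sketch of the suggestion, using Node's `fs/promises` API instead of shelling out to `cp`. The function name `copyImages` and its parameters are hypothetical, and `Promise.all` stands in for the `pMap` call used in the PR so that the sketch has no third-party dependencies:

```ts
import * as fs from "node:fs/promises";
import * as path from "node:path";

// Copies the named images from srcDir to destDir in-process, without
// spawning a `cp` process per file.
// copyFile overwrites an existing destination file by default.
async function copyImages(
  srcDir: string,
  destDir: string,
  names: string[],
): Promise<void> {
  // Make sure the destination folder exists before copying into it.
  await fs.mkdir(destDir, { recursive: true });
  await Promise.all(
    names.map((name) =>
      fs.copyFile(path.join(srcDir, name), path.join(destDir, name)),
    ),
  );
}
```

Because `copyFile` replaces an existing destination by default, this approach also fits the bug fix described in the PR summary: a newer image with the same name overwrites the older one.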

scripts/commands/updateApiDocs.ts

Co-authored-by: Eric Arellano <14852634+Eric-Arellano@users.noreply.github.com>

Closes #511.

Example:

```
$ npm run check-pages-render
Checked 10 / 71 pages
Checked 20 / 71 pages
Checked 30 / 71 pages
Checked 40 / 71 pages
Checked 50 / 71 pages
Checked 60 / 71 pages
Checked 70 / 71 pages
✅ All pages render without crashing
```

## Only checks non-API docs by default

This script is quite slow, at least on my M1, because the Docker image is built for x86. So, to avoid slowing CI down too much, we only check non-API pages in PR builds. Our nightly cron job checks everything else.

## Does not auto-start Docker

We no longer have the file `webServer.ts` thanks to #578, which was a great improvement. Rather than adding back somewhat complex code to auto-start the server (and then periodically ping it until it is ready or times out), we expect the user to start the server. That's acceptable since people will usually rely on CI to run this check. It's too slow for people to run frequently on their own machines.

---------

Co-authored-by: Frank Harkins <frankharkins@hotmail.co.uk>


arnaucasau added 4 commits January 2, 2024 22:07

Drop web server && fix current API images

1b507ed

make-historical: stop copying release notes images

ee6d9ad

use pMap instead of for

5bf403f

remove webServer.ts and webCrawler.ts

efb44b0

frankharkins approved these changes Jan 3, 2024


frankharkins requested a review from Eric-Arellano January 3, 2024 11:00

Eric-Arellano reviewed Jan 3, 2024


arnaucasau and others added 2 commits January 3, 2024 13:58

Incorporate feedback

c68c214

Co-authored-by: Eric Arellano <14852634+Eric-Arellano@users.noreply.github.com>

missing await

0205c2a

Eric-Arellano approved these changes Jan 3, 2024


arnaucasau added this pull request to the merge queue Jan 3, 2024

Merged via the queue into Qiskit:main with commit d06ed65 Jan 3, 2024
3 checks passed

arnaucasau deleted the fix-images branch January 3, 2024 13:17

Eric-Arellano mentioned this pull request Jan 3, 2024

Add check that all pages render #572

Merged
