
[SPARK-50295][INFRA] Add a script to build docs with image #48860

Closed

Conversation


@panbingkun (Contributor) commented Nov 15, 2024

What changes were proposed in this pull request?

This PR aims to add a script to build the docs with a Docker image.

The overall idea is as follows (a minimal sketch of the whole flow appears after this list):

  • Prepare the compiled Spark packages needed by the subsequent doc builds (on the host).
  • Build the image from the cache.
  • Run the image as a container.
    Mount the local Spark directory into the container (this way there is no need to copy the Spark files into the container, and since the compiled Spark package is already prepared in the local Spark folder, it does not need to be compiled again inside the container, which would otherwise re-download many dependency jars and be very time-consuming).
  • Generate the error docs, Scala doc, Python doc, and SQL doc in the container.
  • Generate the R docs on the host.
    Why do the R docs need to be built outside the container?
    Because when building inside the container, the directory /__w/spark/spark/R/pkg/docs automatically created by Rscript has permissions dr-xr--r-x, and subsequent file writes fail with an error like:
    ! [EACCES] Failed to copy '/usr/local/lib/R/site-library/pkgdown/BS5/assets/katex-auto.js' to '/__w/spark/spark/R/pkg/docs/katex-auto.js': permission denied
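
Below is a minimal sketch of that flow, assuming hypothetical names for the image tag, the mount target, and the in-container script; the actual script added by this PR may differ in its details:

#!/usr/bin/env bash
# Hypothetical sketch only: IMG_TAG, the mount target, and the in-container
# script name below are illustrative, not the PR's actual values.
set -euo pipefail

SPARK_HOME="$(pwd)"                        # run from the Spark repo root
IMG_TAG="apache-spark-ci-image-docs:local"

# 1. Prepare the compiled Spark packages on the host (reuses the host's sbt/Ivy caches).
build/sbt package                          # exact sbt tasks/profiles may differ

# 2. Build the image from the cache.
docker build --tag "${IMG_TAG}" dev/spark-test-image/docs/

# 3. Build error docs, Scala/Java, Python, and SQL docs inside the container,
#    with the local repo bind-mounted so nothing is copied or recompiled there.
docker run --rm \
  --mount type=bind,source="${SPARK_HOME}",target=/__w/spark/spark \
  "${IMG_TAG}" /bin/bash -c "sh /__w/spark/spark/dev/build-docs-in-container.sh"

# 4. Build the R docs on the host to avoid the read-only R/pkg/docs permission issue.
(cd docs && SKIP_ERRORDOC=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_SQLDOC=1 \
  bundle exec jekyll build)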

Why are the changes needed?

For PySpark developers, some Python libraries in the docs-generation environment conflict with those in the development environment. This change helps developers verify the docs more easily.

Does this PR introduce any user-facing change?

No, it is only for Spark developers.

How was this patch tested?

  • Pass GA.
  • Manual test (the verification process can be found in the comments below).

Was this patch authored or co-authored using generative AI tooling?

No.


# 3.build docs on container: `error docs`, `scala doc`, `python doc`, `sql doc`
docker run \
--mount type=bind,source="${SPARK_HOME}",target="${DOCKER_MOUNT_SPARK_HOME}" \
Member:

If the container is going to write files to the mounted path, please make sure the permissions won't bother the user accessing/deleting them from the host. For example, if the container writes files as uid 0, the host user may have no permission to delete them.
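
One possible mitigation (a hedged sketch, not necessarily what this PR does) is to run the container with the host user's uid/gid so that files written into the bind mount stay owned by the invoking user:

# Hypothetical: run as the host user so files written to the mount are not owned by root.
docker run --rm \
  --user "$(id -u):$(id -g)" \
  --mount type=bind,source="${SPARK_HOME}",target="${DOCKER_MOUNT_SPARK_HOME}" \
  "${IMG_URL}" /bin/bash -c "sh ${BUILD_DOCS_SCRIPT_PATH}"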

Contributor Author:

(base) ➜  spark-community git:(SPARK-50295) ✗ ls -al docs/api
total 0
drwxr-xr-x    7 panbingkun  staff   224 Nov 18 19:52 .
drwx------@ 286 panbingkun  staff  9152 Nov 18 19:25 ..
drwxr-xr-x@  22 panbingkun  staff   704 Nov 18 19:52 R
drwxr-xr-x   26 panbingkun  staff   832 Nov 18 19:25 java
drwxr-xr-x   15 panbingkun  staff   480 Nov 18 19:47 python
drwxr-xr-x    6 panbingkun  staff   192 Nov 18 19:25 scala
drwxr-xr-x   11 panbingkun  staff   352 Nov 18 19:49 sql

@panbingkun commented Nov 18, 2024

The verification process is as follows:

  • Run the following command:
sh dev/spark-test-image/docs/build-docs-on-local
  • The output of the run is as follows:
[info] Note: Some input files use or override a deprecated API.
[info] Note: Recompile with -Xlint:deprecation for details.
[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list
[success] Total time: 46 s, completed Nov 18, 2024, 7:23:38 PM
[+] Building 93.0s (13/13) FINISHED                                                                                                                                               docker:desktop-linux
 => [internal] load build definition from Dockerfile                                                                                                                                              0.0s
 => => transferring dockerfile: 3.81kB                                                                                                                                                            0.0s
 => [internal] load metadata for docker.io/library/ubuntu:jammy-20240911.1                                                                                                                       88.5s
 => [internal] load .dockerignore                                                                                                                                                                 0.0s
 => => transferring context: 2B                                                                                                                                                                   0.0s
 => importing cache manifest from ghcr.io/apache/spark/apache-spark-github-action-image-docs-cache:master                                                                                         4.4s
 => => inferred cache manifest type: application/vnd.oci.image.index.v1+json                                                                                                                      0.0s
 => [1/7] FROM docker.io/library/ubuntu:jammy-20240911.1@sha256:0e5e4a57c2499249aafc3b40fcd541e9a456aab7296681a3994d631587203f97                                                                  0.0s
 => => resolve docker.io/library/ubuntu:jammy-20240911.1@sha256:0e5e4a57c2499249aafc3b40fcd541e9a456aab7296681a3994d631587203f97                                                                  0.0s
 => [auth] apache/spark/apache-spark-github-action-image-docs-cache:pull token for ghcr.io                                                                                                        0.0s
 => CACHED [2/7] RUN apt-get update && apt-get install -y     build-essential     ca-certificates     curl     gfortran     git     gnupg     libcurl4-openssl-dev     libfontconfig1-dev     li  0.0s
 => CACHED [3/7] RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown', 'rmarkdown', 'testthat'), repos='https://cloud.r-project.org/')" &&     Rscript -e "devtools::install_versi  0.0s
 => CACHED [4/7] RUN add-apt-repository ppa:deadsnakes/ppa                                                                                                                                        0.0s
 => CACHED [5/7] RUN apt-get update && apt-get install -y python3.9 python3.9-distutils     && rm -rf /var/lib/apt/lists/*                                                                        0.0s
 => CACHED [6/7] RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9                                                                                                                    0.0s
 => CACHED [7/7] RUN python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0'   ipython ipython_genutils  0.0s
 => exporting to image                                                                                                                                                                            0.0s
 => => exporting layers                                                                                                                                                                           0.0s
 => => exporting manifest sha256:86549617bcf8050c8b39402be5679e3663adf07de19894b872f11598c173c935                                                                                                 0.0s
 => => exporting config sha256:f8e2afeca787583d05cb7572d71d7ccff2fb8c7f8ec7a4b2e6f61d6fb3061d8d                                                                                                   0.0s
 => => exporting attestation manifest sha256:4998eaa40eace447f735eccf12114f8b63d9b975e33aa98133ee9944f7b3751d                                                                                     0.0s
 => => exporting manifest list sha256:c319e5c6347755831e6cf998bff702f737769a3cad4839965c9c0322f50f7ea7                                                                                            0.0s
 => => naming to docker.io/apache/spark/apache-spark-ci-image-docs:1731928756                                                                                                                     0.0s
 => => unpacking to docker.io/apache/spark/apache-spark-ci-image-docs:1731928756                                                                                                                  0.0s

 5 warnings found (use docker --debug to expand):
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 30)
 - UndefinedVar: Usage of undefined variable '$R_LIBS_SITE' (line 75)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 75)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 27)
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 29)
Fetching bundler-2.4.22.gem
Successfully installed bundler-2.4.22
Parsing documentation for bundler-2.4.22
Installing ri documentation for bundler-2.4.22
Done installing documentation for bundler after 0 seconds
1 gem installed
Don't run Bundler as root. Installing your bundle as root will break this application for all non-root users on this machine.
Bundle complete! 4 Gemfile dependencies, 32 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
Configuration file: /__w/spark/spark/docs/_config.yml
************************
* Building error docs. *
************************
Generated: docs/_generated/error-conditions.html
*************************************
* Building Scala and Java API docs. *
*************************************
Moving back into docs dir.
Removing old docs
Making directory api/scala
cp -r ../target/scala-2.13/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Updating JavaDoc files for badge post-processing
Copying jquery.min.js from Scala API to Java API for page post-processing of badges
Copying api_javadocs.js to Java API for page post-processing of badges
Appending content of api-javadocs.css to JavaDoc stylesheet.css for badge styles
*****************************
* Building Python API docs. *
*****************************
Running Sphinx v4.5.0
/__w/spark/spark/python/pyspark/pandas/__init__.py:43: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
loading pickled environment... done
[autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/errors.rst, development/index.rst, development/logger.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart_connect.ipynb, ..., user_guide/pandas_on_spark/transform_apply.rst, user_guide/pandas_on_spark/typehints.rst, user_guide/pandas_on_spark/types.rst, user_guide/python_packaging.rst, user_guide/sql/arrow_pandas.rst, user_guide/sql/dataframe_column_selections.rst, user_guide/sql/index.rst, user_guide/sql/python_data_source.rst, user_guide/sql/python_udtf.rst, user_guide/sql/type_conversions.rst
[autosummary] generating autosummary for: /__w/spark/spark/python/docs/source/reference/api/pyspark.Accumulator.add.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.Accumulator.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.Accumulator.value.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.AccumulatorParam.addInPlace.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.AccumulatorParam.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.AccumulatorParam.zero.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.BarrierTaskContext.allGather.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.BarrierTaskContext.attemptNumber.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.BarrierTaskContext.barrier.rst, /__w/spark/spark/python/docs/source/reference/api/pyspark.BarrierTaskContext.cpus.rst, ..., /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.status.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.stop.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.active.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.addListener.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.get.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.removeListener.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.resetTerminated.rst, /__w/spark/spark/python/docs/source/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryManager.rst
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 2298 source files that are out of date
updating environment: 0 added, 2298 changed, 0 removed
reading sources... [100%] reference/pyspark.sql/spark_session .. user_guide/pandas_on_spark/supported_pandas_api


looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader .. user_guide/pandas_on_spark/supported_pandas_api
waiting for workers...
generating indices... done
highlighting module code... [100%] pyspark.util
writing additional pages... search done
copying images... [100%] ../../../docs/img/pyspark-spark_core_and_rdds.png
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded.

The HTML pages are in build/html.
Moving back into docs dir.
Making directory api/python
cp -r ../python/docs/build/html/. api/python
**************************
* Building SQL API docs. *
**************************
Generating SQL API Markdown files.
WARNING: Using incubator modules: jdk.incubator.vector


    SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()');
    SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b');
    SELECT xpath_boolean('<a><b>1</b></a>','a/b');
    SELECT xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_float('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_long('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_number('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_short('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
    SELECT xpath_string('<a><b>b</b><c>cc</c></a>','a/c');
Generating HTML files for SQL API documentation.
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /__w/spark/spark/sql/site
INFO    -  Documentation built in 0.79 seconds
/__w/spark/spark/sql
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
            Source: /__w/spark/spark/docs
       Destination: /__w/spark/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 23.136 seconds.
 Auto-regeneration: disabled. Use --watch to enable.
Configuration file: /Users/panbingkun/Developer/spark/spark-community/docs/_config.yml
************************
* Building R API docs. *
************************
Using Scala 2.13
Using R_SCRIPT_PATH = /usr/local/bin


── Installing package SparkR into temporary library ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Building pkgdown site for package SparkR ────────────────────────────────────
Reading from: /Users/panbingkun/Developer/spark/spark-community/R/pkg
Writing to: /Users/panbingkun/Developer/spark/spark-community/R/pkg/docs
── Sitrep ──────────────────────────────────────────────────────────────────────
✖ URLs not ok.
  In DESCRIPTION, URL is missing package url
  (https://spark.apache.org/docs/4.0.0/api/R).
  See details in `vignette(pkgdown::metadata)`.
✔ Favicons ok.
✔ Open graph metadata ok.
✔ Articles metadata ok.
✔ Reference metadata ok.
── Initialising site ───────────────────────────────────────────────────────────
── Building home ───────────────────────────────────────────────────────────────
Reading README.md
Writing 404.html
── Building function reference ─────────────────────────────────────────────────
Warning: SparkR is deprecated in Apache Spark 4.0.0 and will be removed in a future release. To continue using Spark in R, we recommend using sparklyr instead: https://spark.posit.co/get-started/

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window


Reading man/write.stream.Rd
Reading man/write.text.Rd
── Building articles ───────────────────────────────────────────────────────────
Reading vignettes/sparkr-vignettes.Rmd
Writing articles/sparkr-vignettes.html
── Building sitemap ────────────────────────────────────────────────────────────
── Building redirects ──────────────────────────────────────────────────────────
── Building search index ───────────────────────────────────────────────────────
── Checking for problems ───────────────────────────────────────────────────────
── Finished building pkgdown site for package SparkR ───────────────────────────
Warning messages:
1: Failed to parse usage: `` array_aggregate(x, initialValue, merge, ...) array_contains(x, value) array_distinct(x) array_except(x, y) array_exists(x, f) array_forall(x, f) array_filter(x, f)
array_intersect(x, y) array_join(x, delimiter, ...) array_max(x) array_min(x) array_position(x, value) array_remove(x, value) array_repeat(x, count) array_sort(x, ...) array_transform(x, f)
arrays_overlap(x, y) array_union(x, y) arrays_zip(x, ...) arrays_zip_with(x, y, f) concat(x, ...) element_at(x, extraction) explode(x) explode_outer(x) flatten(x) from_json(x, schema, ...)
from_csv(x, schema, ...) map_concat(x, ...) map_entries(x) map_filter(x, f) map_from_arrays(x, y) map_from_entries(x) map_keys(x) map_values(x) map_zip_with(x, y, f) posexplode(x)
posexplode_outer(x) reverse(x) schema_of_csv(x, ...) schema_of_json(x, ...) shuffle(x) size(x) slice(x, start, length) sort_array(x, asc = TRUE) transform_keys(x, f) transform_values(x, f)
to_json(x, ...) to_csv(x, ...) S4method(`reverse`, list(`Column`))(x) S4method(`to_json`, list(`Column`))(x, ...) S4method(`to_csv`, list(`Column`))(x, ...) S4method(`concat`, list(`Column`))(x,
...) S4method(`from_json`, list(`Column`,`characterOrstructTypeOrColumn`))(x, schema, as.json.array = FALSE, ...) S4method(`schema_of_json`, list(`characterOrColumn`))(x, ...) S4method(`from_csv`,
list(`Column`,`characterOrstructTypeOrColumn`))(x, schema, ...) S4method(`schema_of_csv`, list(`characterOrColumn`))(x, ...) S4method(`array_aggregate`,
list(`characterOrColumn`,`Column`,``function``))(x, initialValue, merge, finish = NULL) S4method(`array_contains`, list(`Column`))(x, value) S4method(`array_distinct`, list(`Column`))(x)
S4method(`array_except`, list(`Column`,`Column`))(x, y) S4method(`array_exists`, list(`characterOrColumn`,``function``))(x, f) S4method(`array_filter`, list(`characterOrColumn`,``function``))(x, f)
S4method(`array_forall`, list(`characterOrColumn`,``function``))(x, f) S4method(`array_intersect`, list(`Column`,`Column`))(x, y) S4method(`array_join`, list(`Column`,`character`))(x, delimiter,
nullReplacement = NULL) S4method(`array_max`, list(`Column`))(x) S4method(`array_min`, list(`Column`))(x) S4method(`array_position`, list(`Column`))(x, value) S4method(`array_remove`,
list(`Column`))(x, value) S4method(`array_repeat`, list(`Column`,`numericOrColumn`))(x, count) S4method(`array_sort`, list(`Column`))(x, comparator = NULL) S4method(`array_transform`,
list(`characterOrColumn`,``function``))(x, f) S4method(`arrays_overlap`, list(`Column`,`Column`))(x, y) S4method(`array_union`, list(`Column`,`Column`))(x, y) S4method(`arrays_zip`,
list(`Column`))(x, ...) S4method(`arrays_zip_with`, list(`characterOrColumn`,`characterOrColumn`,``function``))(x, y, f) S4method(`shuffle`, list(`Column`))(x) S4method(`flatten`, list(`Column`))(x)
S4method(`map_concat`, list(`Column`))(x, ...) S4method(`map_entries`, list(`Column`))(x) S4method(`map_filter`, list(`characterOrColumn`,``function``))(x, f) S4method(`map_from_arrays`,
list(`Column`,`Column`))(x, y) S4method(`map_from_entries`, list(`Column`))(x) S4method(`map_keys`, list(`Column`))(x) S4method(`transform_keys`, list(`characterOrColumn`,``function``))(x, f)
S4method(`transform_values`, list(`characterOrColumn`,``function``))(x, f) S4method(`map_values`, list(`Column`))(x) S4method(`map_zip_with`,
list(`characterOrColumn`,`characterOrColumn`,``function``))(x, y, f) S4method(`element_at`, list(`Column`))(x, extraction) S4method(`explode`, list(`Column`))(x) S4method(`size`, list(`Column`))(x)
S4method(`slice`, list(`Column`))(x, start, length) S4method(`sort_array`, list(`Column`))(x, asc = TRUE) S4method(`posexplode`, list(`Column`))(x) S4method(`explode_outer`, list(`Column`))(x)
S4method(`posexplode_outer`, list(`Column`))(x) ``
2: Failed to parse usage: `` dapply(x, func, schema) S4method(`dapply`, list(`SparkDataFrame`,``function``,`characterOrstructType`))(x, func, schema) ``
3: Failed to parse usage: `` dapplyCollect(x, func) S4method(`dapplyCollect`, list(`SparkDataFrame`,``function``))(x, func) ``
+ rm ../_pkgdown.yml
+ popd
~/Developer/spark/spark-community/R ~/Developer/spark/spark-community/R ~/Developer/spark/spark-community/R
+ popd
~/Developer/spark/spark-community/R ~/Developer/spark/spark-community/R
Moving back into docs dir.
Making directory api/R
cp -r ../R/pkg/docs/. api/R
            Source: /Users/panbingkun/Developer/spark/spark-community/docs
       Destination: /Users/panbingkun/Developer/spark/spark-community/docs/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 14.093 seconds.
 Auto-regeneration: disabled. Use --watch to enable.
Untagged: apache/spark/apache-spark-ci-image-docs:1731928756
Deleted: sha256:c319e5c6347755831e6cf998bff702f737769a3cad4839965c9c0322f50f7ea7
Build doc done.

@zhengruifeng (Contributor) left a comment:

Dumb question: can we move the scripts to another directory?

@panbingkun:

Dumb question: can we move the scripts to another directory?

Allow me to try it.

@panbingkun:

Dumb question: can we move the scripts to another directory?

I have moved the script from dev/spark-test-image/docs to dev/spark-test-image-utils/docs, and the local testing is okay.

@panbingkun panbingkun marked this pull request as ready for review November 26, 2024 02:48
@panbingkun:

cc @HyukjinKwon @LuciferYang

@LuciferYang:

will verify the script later

@panbingkun:

will verify the script later

Thank you very much! ❤️

@LuciferYang:

The script can be executed successfully, thank you very much, @panbingkun.

However, should the final generated results only exist in the docs/_site/ directory? It seems that copies of the generated .html files exist in many other places, such as the sql/site/ directory and the docs/api directory. Additionally, since these files cannot currently be cleaned up with commands like sbt clean or mvn clean, many extra .html files are left in the project workspace after each build.

@panbingkun:

The script can be executed successfully, thank you very much, @panbingkun.

However, should the final generated results only exist in the docs/_site/ directory? It seems that copies of the generated .html files exist in many other places, such as the sql/site/ directory and the docs/api directory. Additionally, since these files cannot currently be cleaned up with commands like sbt clean or mvn clean, many extra .html files are left in the project workspace after each build.

@LuciferYang Thank you very much for helping to verify! ❤️

I think the above issue is due to a pre-existing problem in the script build_api_docs.rb itself, as follows:

puts "Making directory api/sql"
mkdir_p "api/sql"
puts "cp -r ../sql/site/. api/sql"
cp_r("../sql/site/.", "api/sql")

Can we solve this issue with a new separate PR?

@LuciferYang:

Yeah, it's fine for me to add some cleanup logic in a separate follow-up. It would be friendlier for local builds.
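
A possible shape for that cleanup (a hedged sketch only; the actual follow-up may differ) would be to remove the intermediate copies after the site is generated:

# Hypothetical cleanup of intermediate doc outputs left outside docs/_site;
# the paths are taken from the build log above, but the exact list may differ.
rm -rf "${SPARK_HOME}/sql/site" \
       "${SPARK_HOME}/docs/api" \
       "${SPARK_HOME}/docs/_generated" \
       "${SPARK_HOME}/python/docs/build" \
       "${SPARK_HOME}/R/pkg/docs"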

@LuciferYang:

LGTM, but it would be best to wait for @zhengruifeng to take another look.

@panbingkun:

Add a note (I communicated privately with @LuciferYang and confirmed this).

  • If you encounter an error like the following:
ERROR: failed to solve: ubuntu:jammy-20240911.1: failed to resolve source metadata for docker.io/library/ubuntu:jammy-20240911.1: failed to authorize: failed to fetch oauth token: Post "https://auth.docker.io/token": read tcp 192.168.1.23:49300->54.236.113.205:443: read: connection reset by peer
  • please add registry-mirrors to the file ~/.docker/daemon.json:
(base) ➜  .docker pwd
/Users/panbingkun/.docker
(base) ➜  .docker cat daemon.json
{
  "builder": {
    "gc": {
      "defaultKeepStorage": "20GB",
      "enabled": true
    }
  },
  "experimental": false,
  "registry-mirrors": [
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://dockerhub.azk8s.cn",
    "https://mirror.ccs.tencentyun.com",
    "https://registry.cn-hangzhou.aliyuncs.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://docker.1panel.live",
    "https://atomhub.openatom.cn/",
    "https://hub.uuuadc.top",
    "https://docker.anyhub.us.kg",
    "https://dockerhub.jobcher.com",
    "https://dockerhub.icu",
    "https://docker.ckyl.me",
    "https://docker.awsl9527.cn"
  ]
}
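
After editing daemon.json, restart Docker and confirm the mirrors are active; for example (hedged):

# Verify the configured registry mirrors took effect after restarting Docker.
docker info | grep -A 15 "Registry Mirrors"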

--interactive --tty "${IMG_URL}" \
/bin/bash -c "sh ${BUILD_DOCS_SCRIPT_PATH}"

# 4.Build docs on host: `r doc`.
Member:

Given that SparkR is deprecated and has fewer changes, how about respecting SKIP_RDOC here, so that developers don't need to have an R environment installed on the host?

@panbingkun commented Nov 27, 2024

Good point, I have also thought about this. What do you all think about it?
This would make the script look much more concise and clean!

Member:

On second thought, we can also pass SKIP_ERRORDOC, SKIP_SCALADOC, SKIP_PYTHONDOC, and SKIP_SQLDOC through into the container, so that all existing flags still work as-is.
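
One way to forward those flags (a hedged sketch built on the docker run invocation shown in the diff above) is to pass them through as environment variables:

# Hypothetical pass-through of the existing doc-build flags into the container;
# `--env VAR` without a value forwards the variable only if it is set on the host.
docker run --rm \
  --env SKIP_ERRORDOC --env SKIP_SCALADOC --env SKIP_PYTHONDOC --env SKIP_SQLDOC \
  --mount type=bind,source="${SPARK_HOME}",target="${DOCKER_MOUNT_SPARK_HOME}" \
  --interactive --tty "${IMG_URL}" \
  /bin/bash -c "sh ${BUILD_DOCS_SCRIPT_PATH}"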

Contributor Author:

There is a difference: we do not want to run sbt compile in the container because there is no Ivy cache there, so executing it would re-download all the dependency jars. If we used a similar mounting workaround for the caches, the complexity would increase significantly.
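
For context, that mounting workaround would look roughly like the following (a hedged sketch; the cache locations are typical defaults and vary per machine), which is part of why compilation stays on the host:

# Hypothetical: mount the host's Ivy/Coursier/sbt caches so that an sbt compile
# inside the container would not re-download every dependency jar.
docker run --rm \
  --mount type=bind,source="${HOME}/.ivy2",target=/root/.ivy2 \
  --mount type=bind,source="${HOME}/.cache/coursier",target=/root/.cache/coursier \
  --mount type=bind,source="${HOME}/.sbt",target=/root/.sbt \
  --mount type=bind,source="${SPARK_HOME}",target="${DOCKER_MOUNT_SPARK_HOME}" \
  "${IMG_URL}" /bin/bash -c "cd ${DOCKER_MOUNT_SPARK_HOME} && build/sbt compile"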

Contributor Author:

If we run sbt compile in the container, it will be very slow.

@pan3793 commented Nov 27, 2024

the script works well on my local machine, thanks @panbingkun

@pan3793 commented Nov 27, 2024

Dumb question: can we move the scripts to another directory?

@zhengruifeng I suppose the script is intended to be used by developers; if so, maybe just put it at dev/build-docs?

@@ -0,0 +1,71 @@
#!/usr/bin/env bash
Member:

The script does not seem to have the x permission; you can grant it with chmod a+x <path>, and git will record the permission correctly on UNIX-like OSes.
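
For example (hedged; the path below is the script location shown earlier in this PR and may have since moved):

# Grant the executable bit; git records the mode change (100644 -> 100755).
chmod a+x dev/spark-test-image/docs/build-docs-on-local
git diff            # shows "old mode 100644 / new mode 100755" for the script
# Alternatively, set the bit directly in the index:
git update-index --chmod=+x dev/spark-test-image/docs/build-docs-on-local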

Contributor Author:

Okay

Contributor Author:

Updated.

@panbingkun:

the script works well on my local machine, thanks @panbingkun

Thank you very much for helping to verify again! ❤️

@panbingkun:

Merged to master.
Thank you, @zhengruifeng, @LuciferYang, @pan3793 .

@@ -0,0 +1,71 @@
#!/usr/bin/env bash
Member:

Can we document this in docs/README.md?

@panbingkun commented Nov 29, 2024

Sure, let me do it as a follow-up PR.

Contributor Author:

Please review: #49013
thanks!
