[release]: OpenSearch Hadoop Client #3385
Hey @prudhvigodithi. Yes, the artifacts will all have the
[bump] We expect the review to be complete by the end of this week and would like to release next week. @gaiksaya @prudhvigodithi
Hi @harshavamsi,
Hi @gaiksaya, the review is complete. We are good to go for the release. We do also need snapshot builds, yes.
Ack! Adding @rishabh6788 to this issue to help on-board to the 1-click release process.
Hi @harshavamsi,
Adding some thoughts from the discussion:
Having this would make it easy to manage multiple clients under the same repo. Can you please add your thoughts? @gaiksaya @harshavamsi @bbarani @dblock
Adding some more details to this issue since our meeting:
Yes, today there are build tasks for each individual JAR, and they are built into separate folders. For example, running
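For illustration only, a sketch of what such per-module build invocations might look like; the subproject task paths here are assumptions based on the folder layout described later in this thread, not confirmed commands:

```sh
# Assumed subproject task paths (illustrative, not confirmed by the repo):
./gradlew :mr:jar      # builds the MR jar under mr/build/
./gradlew :hive:jar    # builds the Hive jar under hive/build/
```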
Again, there are no build tasks for both version combinations, i.e.
Again, the publish script does generate individual POMs for both Scala versions; it is just that the publish script does not like that the project name is
I hope I addressed the questions above. After some inspection I realized that the POMs and JARs are generated separately for each Spark version because the project is split into the individual Spark distributions. Correct me if I'm missing something here, or let me know if you need more info.
@harshavamsi Maven publish does not look at the project name; it publishes based on the values inside the generated POM. Please check this link, where this is explained better. Based on the following error:
Right, yes, the artifactId is being manipulated, and that is by design because of the various Spark variants. The generated POMs for each of the variants have a unique artifactId.
@harshavamsi the
Yeah, it's being changed during the Maven task because the top-level project, for example

One option we considered was to split out each Scala variant into its own folder with its own Gradle build file, but this is not feasible in the long run as we add more variants with permutations and combinations. The number of variants under each project will keep increasing. Take a look at this PR where I added a bunch of new feature variants. In that case, we would need a separate subfolder for each of these variants to prevent the artifactIds from clashing. This is precisely why the
Hi @harshavamsi, we are waiting for the opensearch-spark-30 library, as our Spark 3.x clusters can't connect to our OpenSearch 1.x cluster using the elasticsearch-spark-30.xxx dependency right now. Are these OpenSearch libraries going to be released soon?
@harshavamsi, just want to follow up on this request. With regard to the snapshot versions, have you had a chance to check out the above link? Let us know if you need any assistance or have questions we may be able to help with. Thanks!
Hey, from the discussion with @harshavamsi, the following are the clients to be published:
Wanted to now check what it takes to have a separate task (a Maven publish task) to publish each client to the desired Maven coordinates. I have also noticed the folder the JARs are created in is under

So coming back to the point, @harshavamsi @wbeckler, is it possible to have a separate Maven publish task for each client? Then the GH workflow can use these publish tasks; please check this example.
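As a sketch of the idea (the task paths here are assumptions based on the project layout, not confirmed Gradle tasks), the workflow would invoke something like:

```sh
# Hypothetical per-client publish tasks the GH workflow could call:
./gradlew :mr:publishToMavenLocal
./gradlew :hive:publishToMavenLocal
./gradlew :spark:spark-30:publishToMavenLocal
```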
@prudhvigodithi The process you are proposing would work for all but the Spark versions of the connector. However, Spark is the most popular version. Given the highly involved way that Gradle is set up to produce the permutations of Spark and Scala, there is no way to adapt it to the separate publish tasks. Instead we need to pull in the various publishing files from each build folder. After running the distribution command, the files will be in the following locations. Unfortunately, the Spark/Scala Gradle build process is highly customized and cannot be broken down in the way we typically do this.

- opensearch-hadoop: dist/build/distributions/opensearch-hadoop-1.0.0-SNAPSHOT.pom
- opensearch-hadoop-mr: mr/build/distributions/opensearch-hadoop-mr-1.0.0-SNAPSHOT.pom
- opensearch-hadoop-hive: hive/build/distributions/opensearch-hadoop-hive-1.0.0-SNAPSHOT.pom
- opensearch-spark-20: spark/spark-20/build/distributions/opensearch-spark-20_2.10-1.0.0-SNAPSHOT.pom
- opensearch-spark-30: spark/spark-30/build/distributions/opensearch-spark-30_2.12-1.0.0-SNAPSHOT.pom
Let me quickly summarize everything: Releasing OpenSearch Hadoop is not as straightforward as releasing other clients or plugins. The Hadoop connector comes with multiple versions of Spark and Scala that are compiled together. It also comes with other clients like Hive, MR, and the global client itself.

Background: Publishing the rest of the clients works as you would expect using the

Here are some alternatives we considered:
Based on this assessment, we believe that there should be a custom build script that is able to pull in all the targets above and publish them to Maven. The POM files for each JAR are correctly generated. @prudhvigodithi @gaiksaya @jordarlu @bbarani @wbeckler @VachaShah
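A minimal sketch of what such a custom script could look like, assuming the distribution output locations listed above; the staging directory name is illustrative:

```sh
#!/bin/bash
# Illustrative only: collect the generated POMs and JARs from each
# subproject's build/distributions folder into one staging directory
# for signing and publishing.
set -e
STAGING=staged-artifacts
mkdir -p "$STAGING"
for dir in dist mr hive spark/spark-20 spark/spark-30; do
  cp "$dir"/build/distributions/*.pom "$dir"/build/distributions/*.jar "$STAGING"/
done
```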
Thanks for the detailed explanation @harshavamsi.
Perfect, thank you! This is exactly what I'm looking for. We should be able to build, upload, sign, and publish. This would make our life very easy.
Publish commands for MR and Hive:
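The exact commands were not captured here; a hedged sketch, assuming standard Gradle publish tasks exist for the mr and hive subprojects:

```sh
# Assumed task names; verify against `./gradlew tasks` before using.
./gradlew :mr:publish
./gradlew :hive:publish
```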
Distribution script for snapshots:

Distribution script for release:
Thanks @harshavamsi @prudhvigodithi for the help.
Steps taken:

Set the below variables before building:
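The variable list was not captured in this comment; as a hypothetical example, credentials for the snapshot repository would be exported along these lines:

```sh
# Hypothetical placeholders; the actual variable names may differ.
export SONATYPE_USERNAME=<username>
export SONATYPE_PASSWORD=<password>
```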
Run the below commands to generate the snapshot artifacts:
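Presumably this is the distribution task mentioned earlier in the thread; a sketch, with the invocation assumed rather than confirmed:

```sh
# Assumed invocation; produces the files under */build/distributions/.
./gradlew clean distribution
```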
Copy the output contents to a new folder with the below structure:
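The structure itself was not captured; assuming the publish script expects a standard Maven repository layout, it would look something like this (the groupId path shown is an assumption):

```
org/opensearch/client/opensearch-hadoop-mr/1.0.0-SNAPSHOT/
├── opensearch-hadoop-mr-1.0.0-SNAPSHOT.jar
└── opensearch-hadoop-mr-1.0.0-SNAPSHOT.pom
```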
Clone this (build) repository and run the publish script: https://github.com/opensearch-project/opensearch-build/blob/main/publish/publish-snapshot.sh

Run:
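Assumed usage, pointing the script at the staged artifacts folder (check the script's help text for its exact interface):

```sh
# Hypothetical invocation; the argument is the folder prepared above.
./publish/publish-snapshot.sh ~/staged-artifacts
```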
Next steps

@harshavamsi will work on automating the same via GHA. Please raise the PR and tag me so that I can add the hadoop repo for retrieving Maven credentials. Example for the same: https://github.com/opensearch-project/security/blob/main/.github/workflows/maven-publish.yml#L24-L34
Also @harshavamsi, once the above artifacts are verified, can you look into generating the non-snapshot versions of the artifacts? Following the 1-click release process, the heavy lifting will be done by GHA. The Jenkins workflow will just download the artifacts from the draft release, sign them, and release them.
Did you read the on-boarding document?
Yes.
What is the name of your component?
OpenSearch-Hadoop
What is the link to your GitHub repo?
https://github.com/opensearch-project/opensearch-hadoop
Targeted release date
4/24/2023 (tentative and subject to a successful security review)
Where should we publish this component?
Maven
What type of artifact(s) will be generated for this component?
Should be a standard JAR
Have you completed the required reviews including security reviews, UX reviews?
Ongoing right now.
Have you on-boarded automated security scanning for the GitHub repo associated with this component?
Yes, we have WhiteSource for security scanning.
Additional context
The Hadoop client has hive, spark, mapreduce, pig, and storm as sub-clients. They're each capable of generating their own JARs, and each of those JARs needs to be published individually. So we will need to release:
- opensearch-hadoop
- opensearch-hadoop-mr
- opensearch-spark-20
- opensearch-spark-30
- opensearch-hadoop-hive
cc: @wbeckler @gaiksaya