-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade Elasticsearch to 8 #33135
Upgrade Elasticsearch to 8 #33135
Conversation
Fix failing Test Suit
@@ -2379,7 +2379,7 @@ elasticsearch: | |||
elasticsearch_configs: | |||
description: ~ | |||
options: | |||
use_ssl: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In elasticsearch 7, use_ssl
is an accepted parameter when constructing ElasticSearch
client. See the following source code :
https://github.com/elastic/elasticsearch-py/blob/7.14/elasticsearch/client/__init__.py#L113
However, in elasticsearch 8, it no longer accepts use_ssl
parameter. See the following source code:
Therefore to make the testsuite compile with the ES8 , I use http_compress
as the argument (which is one of the accepted arguments for constructing ES client
Looks good, but it needs a bit "surrounding changes" and "context". I think we need to think a bit carefully about the change/migration (configs?) -and soem changelog documentation should be addded explaining what's going on and explaining context. The Elasticsearch config can. now be moved to provider - this is where it belongs. We should keep the pre-2.7 defaults in wiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning but the whole "elasticsearch" config should be moved to provider.yaml - same as celery, kubernetes and others. |
Thanks. I'll some more contexts into changelog soon. For moving elasticsearch config, do you mean the following lines should be moved to https://github.com/apache/airflow/blob/main/airflow/config_templates/config.yml#L2396-L2410 |
https://github.com/apache/airflow/blob/main/airflow/config_templates/config.yml#L2396-L2410 Yes. See for example: #32777 - this is a new thing that has been added in Airflow 2.7 - providers can now contribute their own whole "sections" and they can be removed from generic "airflow" configuration. Also, you will need to keep old "pre-2.7" defaults in https://github.com/apache/airflow/blob/main/airflow/config_templates/pre_2_7_defaults.cfg for back-compatibility (this one provides the default values for old providers that have not yet moved their config to ). The PR for Hive above does not have it yet as I realised it only afterwards. As you can see elasticsearch_configs is there, but I think - if we decide to also move "elasticsearch" to the provider, we need to add "elasticearch" to the defaults as well before 2.7.0 is released. I will go ahead and add it now, otherwise some back-compatibility scenarios might not work (old provider assuming default value retrieved might stop working). |
The elasticsearch group is likely to be moved to elasticsearch provider. Anticipating that (see apache#33135) we need to move it to pre-2.7 defaults in order to have back-compatibility for providers that assume default values to be there.
Thanks. I've added some contexts in CHANGELOG and also moved the config to |
@@ -27,6 +27,17 @@ | |||
Changelog | |||
--------- | |||
|
|||
5.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5.1.0 | |
6.0.0 |
Since we are suggesting a breaking change, we need to have a major bump
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - will revise accordingly
self.client = elasticsearch.Elasticsearch(host.split(";"), **es_kwargs) # type: ignore[attr-defined] | ||
|
||
self.client = elasticsearch.Elasticsearch(host, **es_kwargs) # type: ignore[attr-defined] | ||
# in airflow.cfg, host of elasticsearch has to be http://dockerhostXxxx:9200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I know what error do we see if the protocol is not included in the set value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -292,27 +292,24 @@ def es_read(self, log_id: str, offset: int | str, metadata: dict) -> list | Elas | |||
} | |||
|
|||
try: | |||
max_log_line = self.client.count(index=self.index_patterns, body=query)["count"] | |||
max_log_line = self.client.count(index=self.index_patterns, body=query)["count"] # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have here a type ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - if we look at the official ES package, the count
API from the ElasticSearch
class actually didn't have the body
as parameter. See this link: https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/_sync/client/__init__.py#L801
But the body parameter is still accepted because there's a decorator at the beginning, which modifies the function to accept body
as the argument. See this few lines : https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/_sync/client/__init__.py#L798-L800
Therefore, without type ignore, the pre-commit job mypy at provider
will actually get failed because it thinks that body is not an accepted parameter (which actually is)
|
||
logs: list[Any] | ElasticSearchResponse = [] | ||
if max_log_line != 0: | ||
try: | ||
query.update({"sort": [self.offset_field]}) | ||
res = self.client.search( | ||
res = self.client.search( # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question regarding type ignore
index=self.index_patterns, | ||
body=query, | ||
size=self.MAX_LINE_PER_PAGE, | ||
from_=self.MAX_LINE_PER_PAGE * self.PAGE, | ||
) | ||
logs = ElasticSearchResponse(self, res) | ||
except elasticsearch.exceptions.ElasticsearchException: | ||
self.log.exception("Could not read log with log_id: %s", log_id) | ||
except Exception as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot we not narrow down the exception we catch? Is the previous exception no longer present? If so, have they added any other similar class and can we use that?
Having such a broad level exception catch and not re-raising it might lead to some silent failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the exception elasticsearch.exceptions.ElasticsearchException
is no longer present. Instead new class of exceptions are defined such as UnsupportedProductError
, NotFoundError
and so on. See this file:
https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/exceptions.py
And I feel like all those errors can occur when calling the ES API. So maybe we should raise the exception after logging to the error log ?
~~~~~~~~~~~~~~~~ | ||
|
||
.. note:: | ||
Upgrade to ElasaticSearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use ElasticSearch 8 package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrade to ElasaticSearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use ElasticSearch 8 package. | |
Upgrade to Elasticsearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use Elasticsearch 8 package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Will revise accordingly
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with | ||
ElasticSearch database that is below version 8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with | |
ElasticSearch database that is below version 8. | |
This also means that the release drops support for Elasticsearch 7 and below and will no longer work with | |
Elasticsearch database that is below version 8. |
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with | ||
ElasticSearch database that is below version 8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this PR is a breaking change?
Airflow wise what exactly gets broken here?
Updating upstream package version by itself is not a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - what I'm trying to say is that this release will drop support for ElasticSearch database version 7 or below since elasticsearch 8
will not work with DB version below 8 (i.e. airflow users should also upgrade their ES database to version 8 to ensure compatibility). I wrote this because I read the following doc and learn that Elasticsearch language clients are only backwards compatible with default distributions and without guarantees made
https://elasticsearch-py.readthedocs.io/en/stable/
Airflow-wise, nothing should really get broken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. @eladkal is right - dependency change is not a reason for major bump - even if it is such a "big" dependency. This very clear in https://semver.org/#what-should-i-do-if-i-update-my-own-dependencies-without-changing-the-public-api:
What should I do if I update my own dependencies without changing the public API?
That would be considered compatible since it does not affect the public API. Software that explicitly depends on the same dependencies as your package should have their own dependency specifications and the author will notice any conflicts. Determining whether the change is a patch level or minor level modification depends on whether you updated your dependencies in order to fix a bug or introduce new functionality. We would usually expect additional code for the latter instance, in which case it’s obviously a minor level increment.
The elasticsearch group is likely to be moved to elasticsearch provider. Anticipating that (see #33135) we need to move it to pre-2.7 defaults in order to have back-compatibility for providers that assume default values to be there.
How about moving "elasticsearch" section as well? I seems very much like it should be in, this section makes no sense when elasticsearch provider is not installed I think? |
@@ -72,3 +73,20 @@ connection-types: | |||
|
|||
logging: | |||
- airflow.providers.elasticsearch.log.es_task_handler.ElasticsearchTaskHandler | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On top of also moving [elasticsearch]
section (which I thin makes sense) it should be accompanied by adding the link to elastisearch provider configuration in the Airflow documentation: (following #32777).
When we have more of those, we might want to choose to do it automatically but for now we need to add it "manually"
… add configurations-ref.rst, and revise the doc
Thanks all. I've moved |
Looks cool from my side. Thanks @Owen-CH-Leung for being so responsive to our comments :) |
..... | ||
|
||
Misc | ||
~~~~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove above including the version.
It is not known yet which version is going to be next.. it could be 5.0.1, 5.1.0, 6.0.0
we know the release version only in run time.
The change we need here is just the note after the changelog title.. during release the release manager set it to the proper classification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Removed.
I think @pankajkoti is also testing on ES remote logging with ES 8. Let's merge the PR when he comes back to us |
I can confirm ES remote logging works fine for me from this branch and Elasticsearch 8.9.0 🎉 . Thanks for your patience @Owen-CH-Leung |
Cool! |
Fantastic news ! |
aaand merged. |
(cherry picked from commit ad9d8d4)
Fixes #31256
Upgrade the
elasticsearch
provider to use officialelasticsearch
package with version >8.Remote logging with latest ES package should also work with this PR merged. To test locally, rebuild the breeze image with elasticsearch constraint set up
elasticsearch>8,<9
, set up elasticsearch 8 , kibana & filebeat in docker respectively, with the following changes to theairflow.cfg
Note that with the latest es package, host field under
elasticsearch
has to be a full URL starting withhttp://
orhttps://
. If you just puthost.docker.internal:9200
that will not work