
ElasticSearch task_log_handler wide scope index filters causing overload of ES cluster #37999

Closed
1 of 2 tasks
nsAstro opened this issue Mar 8, 2024 · 9 comments · Fixed by #38423
Labels
area:providers · kind:bug · needs-triage

Comments

@nsAstro
Contributor

nsAstro commented Mar 8, 2024

Apache Airflow Provider(s)

elasticsearch

Versions of Apache Airflow Providers

Latest version of the ElasticSearch provider

Apache Airflow version

2.9, but all other versions as well

Operating System

Debian

Deployment

Astronomer

Deployment details

Affects any environment where multiple Airflow systems depend on a monolithic ElasticSearch cluster. Not specific to a deployment pattern beyond the fact that the Astronomer model creates indices at a per-day level. I believe this is the same in the OSS Airflow Helm chart with fluentd.

What happened

The Airflow query filter to fetch logs from ElasticSearch searches '_all' indices of the underlying ElasticSearch for a specific string pattern that is a concatenation of dag_run_id, task_run_id, and a few other details. This puts a tremendous amount of stress on the underlying ElasticSearch because the Webserver is not being specific about its queries. ElasticSearch parallelizes the search across all the indices and across its multiple nodes, but at scale this results in significant latency when fetching logs as the number of indices grows. If multiple Airflow clusters use the same ElasticSearch, the issue is further compounded. The inbuilt index_patterns parameter that Airflow exposes is not enough to limit the query radius in all cases: the static filter means an operator cannot dynamically choose a day/month filter for a given task log, and in cases where logging is configured to create a new index per day, filter patterns can still end up querying hundreds of indices.
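
To make the shape of the problem concrete, the fetch issued on a log view looks roughly like the simplified sketch below (elasticsearch-py 8.x-style client, placeholder host and log_id; this is not the provider's exact code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # placeholder host

# log_id is a concatenation of dag_id, task_id, run_id and try_number,
# in the spirit of the provider's log_id_template.
log_id = "example_dag-example_task-manual__2024-03-08T00:00:00+00:00-1"

# With index_patterns left at its default of "_all", every index on the
# cluster is fanned out to for every single log view.
response = es.search(
    index="_all",
    query={"match_phrase": {"log_id": log_id}},
    sort=["offset"],
)
```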

What you think should happen instead

The Airflow Webserver should use a more pointed query filter that queries only the relevant index for a given log lookup. While I understand that Airflow is not responsible for shipping logs to the underlying logging store, it would help if Airflow could be configured with, or make certain assumptions about, the index pattern it should ask for a given log. Alternatively, reindexing on dag_id (with a secondary task_id) could make queries much more scoped and severely reduce cluster load (not an Airflow responsibility). But allowing a user to set a function that provides an index pattern for a given task_run_id would give the most flexibility in constraining queries.
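
As a rough illustration of how much a pointed filter narrows the search, here is the same lookup scoped to a hypothetical per-day index. The "airflow-logs-YYYY.MM.DD" naming is an assumption about how the logs were shipped, not something Airflow knows today:

```python
from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # placeholder host

# Hypothetical: logs were shipped to per-day indices such as "airflow-logs-2024.03.08".
logical_date = datetime(2024, 3, 8)
index_pattern = f"airflow-logs-{logical_date:%Y.%m.%d}"

response = es.search(
    index=index_pattern,  # one index instead of every index on the cluster
    query={"match_phrase": {"log_id": "example_dag-example_task-manual__2024-03-08T00:00:00+00:00-1"}},
    sort=["offset"],
)
```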

How to reproduce

Create multiple Airflow systems using Astronomer or the OSS Chart with FluentD. Configure FluentD to push to a different index prefix per Airflow deployment. For a minimal ElasticSearch cluster with just 3 nodes, the default number of indices is 2000 per node. At around ~5000 indices, any log query results in 100% CPU utilization on the ElasticSearch cluster.

Anything else

At the right scale of Airflow + ElasticSearch this issue occurs every time a user views the task logs of any given task.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nsAstro added the area:providers, kind:bug, and needs-triage labels on Mar 8, 2024
@eladkal
Contributor

eladkal commented Mar 8, 2024

cc @Owen-CH-Leung You contributed several elastic PRs in the past. Can you assist with triaging this issue?

@Owen-CH-Leung
Contributor

Ok let me take a look and will revert soon.

@nsAstro
Contributor Author

nsAstro commented Mar 9, 2024

Made a few edits to the main issue. Also updated the name after some feedback from @jedcunningham.

@nsAstro changed the title from "ElasticSearch log fetching hook is not granular enough in its index filter" to "ElasticSearch task_log_handler wide scope index filters causing overload of ES cluster" on Mar 9, 2024
@Owen-CH-Leung
Contributor

Owen-CH-Leung commented Mar 10, 2024

@nsAstro could you elaborate more on allowing a user to set a function that can provide an index pattern for a given task_run_id? I'm not so sure how Airflow can be further improved to dynamically query a different index based on the task_run_id.

@jedcunningham
Member

That was my idea. Two approaches immediately came to mind:

  • Allow templating in index_patterns, like we do for log_id_template
  • A callable that deployment managers can configure that'll take a TI and return a value for index_patterns. This would allow more complex logic to determine the right index_patterns to search.

The former is probably less work, but I'm not sure what context would be available to template in there (a rough sketch of what templating might look like is below).
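
To make the first option concrete, a templated index_patterns could be rendered per task instance much the way log_id_template is today. The template string and the fields available to it are assumptions for illustration, not current provider behavior:

```python
from datetime import datetime

# Hypothetical template a deployment manager could set instead of a static pattern.
index_patterns_template = "airflow-logs-{dag_id}-{data_interval_start:%Y.%m.%d}"

# Rendered from the task instance whose logs are being viewed, e.g.:
rendered = index_patterns_template.format(
    dag_id="example_dag",
    data_interval_start=datetime(2024, 3, 8),
)
# rendered == "airflow-logs-example_dag-2024.03.08"
```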

@nsAstro
Contributor Author

nsAstro commented Mar 11, 2024

From a responsibilities perspective, if Airflow deems the user responsible for streaming task logs to ElasticSearch in any way they see fit, it should allow equally flexible configuration on the read side, since it can make no real assumptions about the index schema. If the first option can be done in a way that doesn't create any limitations, I'm all for it. However, I think option 2 is the best way to pass the responsibility back to the user to define the query so it aligns with how they index the logs in the first place.
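
For illustration, the second option could look something like the sketch below: a deployment-defined callable that takes a task instance and returns the index pattern(s) to search. The function name, the per-dag/per-day naming, and the idea that the handler would accept such a callable as a setting are assumptions for discussion, not an existing provider option:

```python
from airflow.models import TaskInstance


def index_patterns_for_ti(ti: TaskInstance) -> str:
    """Return a narrow index pattern matching how this deployment ships its logs.

    Hypothetical example: logs go to per-dag, per-day indices.
    """
    day = ti.execution_date  # or whatever date the shipping pipeline keys indices on
    return f"airflow-{ti.dag_id}-{day:%Y.%m.%d}*"
```

The handler would invoke the callable on each log fetch and pass the result to ElasticSearch as the index argument instead of "_all".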

@pankajastro
Member

Making it more configurable seems like the right approach. I'm also leaning towards the second option. @Owen-CH-Leung, what do you think?

@nsAstro, feel free to draft a PR, and please let me know if there's any way I can support you with this.

@Owen-CH-Leung
Contributor

Yeah, the second option sounds like the more flexible approach to me too. Happy to help more once we have a draft PR implementing this feature.

@dstandish
Contributor

just want to say, very good writeup @nsAstro :) made it very easy to understand what is going on here
