Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align timers and timing metrics (ms) across all metrics loggers #39908

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

dirrao
Copy link
Contributor

@dirrao dirrao commented May 29, 2024

Align timers and timing metrics (ms) across all metrics loggers
closed: #39818

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's likely very wrong as it is heavily backwards compatible. It will suddenly change almost all dashboards and alerts people built using those metrics.

Metrics are definitely "public" API of Airflow and we cannot change them without breaking compatibility.

@kaxil
Copy link
Member

kaxil commented May 29, 2024

That's likely very wrong as it is heavily backwards compatible. It will suddenly change almost all dashboards and alerts people built using those metrics.

Metrics are definitely "public" API of Airflow and we cannot change them without breaking compatibility.

100% -- this will break all the operational dashboards for companies, users and service providers

@dirrao
Copy link
Contributor Author

dirrao commented May 30, 2024

That's likely very wrong as it is heavily backwards compatible. It will suddenly change almost all dashboards and alerts people built using those metrics.
Metrics are definitely "public" API of Airflow and we cannot change them without breaking compatibility.

100% -- this will break all the operational dashboards for companies, users and service providers

Yes. This is a breaking change for customers (Datadog and OTEL). As, Statsd log the timer metrics in milli seconds. So, the metrics documentation is inconsistent unless we make the all the metrics loggers emit the same metric unit (milli seconds).
Screen Shot 2024-05-24 at 11 06 41 PM

@dirrao dirrao requested a review from ferruzzi May 30, 2024 01:07
@potiuk
Copy link
Member

potiuk commented May 30, 2024

Yes. But making things consistent with docs is one thing, while breaking existing deployments that baase on "actual" behaviour is another. In this case changing the metrics is a problem and if we want to make documentattion and implementation consistent, we should only update the documentation (even if it means that some metrics will have different and inconsistent behaviour).

In this case - preserving current behaviour trumps consistency.

I think we could have a new parameter that will say "consistent" metrics (disabled by default) where this might get fixed and we should swap the default value with Airflow 3 (to be consistent by default). Simply metrics is one of those things that is undervalued by most people who develop Airlfow but have the potential of wreakiing huge havoc if we change their behaviour, it's heavily public API when you think of breaking effects it might have and probability it will cause problems if you change metric values effectively by 3 orders of magnitude.

@ferruzzi
Copy link
Contributor

Yeah, I'm with @potiuk and @kaxil here. We already ran around in circles on this topic in the previous seconds/milliseconds metrics PR just a couple weeks ago. It's a mess, but breaking everyone's dashboards isn't a viable solution IMHO.

To add to that, even if we do standardize them, why break two (Datadog and Otel) to match one (StatsD)? If we were going to do this, shouldn't we change the one that doesn't match the other two?

This feels like an Airflow3 "big breaking thing" to me.

@dirrao dirrao force-pushed the metrics_unit_correction branch from 9a5b95e to 6bed534 Compare June 6, 2024 17:10
@dirrao
Copy link
Contributor Author

dirrao commented Jun 6, 2024

Hi @potiuk , @kaxil , @ferruzzi,
Thanks for comments. I have introduced a flag to enable the consistency across all the logger timer metrics. By default the consistency flag is disable. I have added the warning for the same to the customer. Can you re-review it once?

@dirrao dirrao requested a review from potiuk June 6, 2024 17:14
@potiuk
Copy link
Member

potiuk commented Jun 13, 2024

I think if we want to implement this consistency, it should be Airlfow 3 feature to have it on by default and possibly we should raise a future deprecation warning informing that it will change in Airflow 3 and that user can switch to consistent metrics by flipping that config. @ferruzzi - WDYT ?

Also if we do that - I think that should be added to our Airflow 3 discussions as a topic to discuss - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes - > maybe you can add it it as command and make sure this is included in our Airflow 3 call at some point in time @dirrao ?

@potiuk
Copy link
Member

potiuk commented Jun 13, 2024

WHY: This is generally a serious deployment change for a number of users - but since it's a deployment and not DAG authoring change, it's likely a good candidate for one of the Airlfow 3 incompatibilities. But we now need to seriously plan what is in and out of Airflow 3 incompatibilities, and if our plan is to make a backwards-incompatible change in Airflow 3, we should warn our users before (via deprecations) and make sure this is captured in our Airflow 3 discussions.

I think it would be good to have someone who will be "championing" that - i.e. making sure it's remembered, properly docuemented, good warnings raised and people are aware of this as incompatibility - each of the "workstreams" there will have a lead/champion so having someone who will be advocating and leading that through Airlfow 3 transition (with all the consequences/documentation/making sure that it's properly communicated to our users and that there is a transition plam, including some guidelines of what our users should do) is a good idea.

@dirrao
Copy link
Contributor Author

dirrao commented Jun 13, 2024

I think if we want to implement this consistency, it should be Airlfow 3 feature to have it on by default and possibly we should raise a future deprecation warning informing that it will change in Airflow 3 and that user can switch to consistent metrics by flipping that config. @ferruzzi - WDYT ?

Good idea. I have updated the warning. So, users can take a call on it when they switching to Airflow 3.

warning: "Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds."

@dirrao
Copy link
Contributor Author

dirrao commented Jun 13, 2024

Also if we do that - I think that should be added to our Airflow 3 discussions as a topic to discuss - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes - > maybe you can add it it as command and make sure this is included in our Airflow 3 call at some point in time @dirrao ?

I don't have the access to wiki. I will remind you on this.

@potiuk
Copy link
Member

potiuk commented Jun 13, 2024

I don't have the access to wiki. I will remind you on this.

If you want access, send me your email in Slack

@dirrao
Copy link
Contributor Author

dirrao commented Jun 26, 2024

Hi @potiuk, @ferruzzi
I have updated the code as per the previous comments. If you can cross check and confirm the same, then I will update the test cases for the same.

@dirrao dirrao requested a review from potiuk June 27, 2024 11:17
@potiuk
Copy link
Member

potiuk commented Jul 2, 2024

looks gpod (sorry for delay)

@dirrao
Copy link
Contributor Author

dirrao commented Jul 4, 2024

@potiuk
I am seeing this deprecation warning everywhere. As it is being triggered during initialization.
is there any place where we can add it ignore it for the all the test cases? is there any better way to handle this?

________________________________________________________________________________ ERROR at setup of TestJdbcHook.test_suppress_and_warn_when_raised_exception_is_not_suppressed ________________________________________________________________________________

request = <SubRequest 'initialize_airflow_tests' for <Function test_jdbc_conn_connection>>

    @pytest.fixture(autouse=True, scope="session")
    def initialize_airflow_tests(request):
        """Helper that setups Airflow testing environment."""
        print(" AIRFLOW ".center(60, "="))

        # Setup test environment for breeze
        home = os.path.expanduser("~")
        airflow_home = os.environ.get("AIRFLOW_HOME") or os.path.join(home, "airflow")

        print(f"Home of the user: {home}\nAirflow home {airflow_home}")

        # Initialize Airflow db if required
        lock_file = os.path.join(airflow_home, ".airflow_db_initialised")
        if not skip_db_tests:
            if request.config.option.db_init:
                print("Initializing the DB - forced with --with-db-init switch.")
                initial_db_init()
            elif not os.path.exists(lock_file):
                print(
                    "Initializing the DB - first time after entering the container.\n"
                    "You can force re-initialization the database by adding --with-db-init switch to run-tests."
                )
>               initial_db_init()

tests/conftest.py:383:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/conftest.py:346: in initial_db_init
    from tests.test_utils.compat import AIRFLOW_V_2_8_PLUS, AIRFLOW_V_2_10_PLUS
tests/test_utils/compat.py:28: in <module>
    from airflow.models import Connection, Operator
airflow/models/__init__.py:79: in __getattr__
    val = import_string(f"{path}.{name}")
airflow/utils/module_loading.py:39: in import_string
    module = import_module(module_path)
/usr/local/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
airflow/models/operator.py:22: in <module>
    from airflow.models.baseoperator import BaseOperator
airflow/models/baseoperator.py:85: in <module>
    from airflow.models.taskinstance import TaskInstance, clear_task_instances
airflow/models/taskinstance.py:103: in <module>
    from airflow.models.xcom import LazyXComSelectSequence, XCom
airflow/models/xcom.py:52: in <module>
    from airflow.utils.json import XComDecoder, XComEncoder
airflow/utils/json.py:27: in <module>
    from airflow.serialization.serde import CLASSNAME, DATA, SCHEMA_ID, deserialize, serialize
airflow/serialization/serde.py:406: in <module>
    _register()
airflow/serialization/serde.py:391: in _register
    _stringifiers[c] = name
airflow/metrics/protocols.py:112: in __exit__
    self.stop()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <airflow.metrics.protocols.Timer object at 0xffff73988370>, send = True

    def stop(self, send: bool = True) -> None:
        """Stop the timer, and optionally send it to stats backend."""
        if self._start_time is not None:
            if metrics_consistency_on:
                self.duration = 1000.0 * (time.perf_counter() - self._start_time)  # Convert to milliseconds.
            else:
>               warnings.warn(
                    "Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.",
                    AirflowProviderDeprecationWarning,
                    stacklevel=2,
                )
E               airflow.exceptions.AirflowProviderDeprecationWarning: Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.

airflow/metrics/protocols.py:127: AirflowProviderDeprecationWarning

@potiuk
Copy link
Member

potiuk commented Jul 4, 2024

Hmm.. seing it - I am not sure if warning is a good idea. The thing with warnings is that you should have an "easy" way to get rid of the warning - and changing your dashboard/metrics is likely not an easy task - and I think one of the worst cases with warnings is to not allow people to silence them easily (by fixing them) . I am just thiking whether we should allow a third state of the switch ("legacy" ?) to allow the users to deliberately set it and silence the warning ?

BTW. The way how to treat warnings in tests is described in the docs: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#handling-warnings

@dirrao dirrao force-pushed the metrics_unit_correction branch from 573fc7f to d2d190b Compare July 5, 2024 09:55
@dirrao
Copy link
Contributor Author

dirrao commented Jul 5, 2024

I have updated the unit tests for the changes. I tried adding the ignore warning in pyproject.toml file. However it's not working.
'ignore:Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.'

@potiuk
Copy link
Member

potiuk commented Jul 26, 2024

I'd say with Airflow 3 work started - we should hold on with "fixing" it in Airflow 2 and simply follow whatever we decide on for Airflow 3.

That needs someone to take a lead on driviing to consensus and agreeing our metrics/telemetry strategy for Airflow 3:

  1. Are we going to stay with Statsd or OTEL only (I.e. is there something missing from OTEL that will make us stay with statsd)
  2. If we are going to stay with statsd - should we align all the timers and "break" compatibility

I think what @shubham22 mentioned at last dev call they are going to verify if that is something that they heavily would rely on and it will be discussed further at the dev calls.

@ferruzzi
Copy link
Contributor

I'm fine with millis. My main problem is standardizing on the statsd and breaking everyone else. In fact, after the last Airflow 3 dev call where they were discussing dropping statsd entirely in favor of OTel, I'll double down on that objection. We should not be basing this decision on how statsd handles it, especially if we may not even be using statsd in the codebase in a few months.

@dirrao
Copy link
Contributor Author

dirrao commented Sep 6, 2024

@potiuk / @ferruzzi / @kaxil
Can someone let us know the latest update on telemetry track? Can we resume this PR?

@potiuk
Copy link
Member

potiuk commented Sep 6, 2024

See discussions we have about OTEL at devlist - both separate discussion by Dennis and summary of the devcal discussion where we discussed it.

@ferruzzi
Copy link
Contributor

ferruzzi commented Sep 6, 2024

and see this one as well. #40800

OTel will be the default starting in 3.0, StatsD will be included but secondary/backup/legacy. Given that context, Here's what I think:

This can be added to the 3.0 changes and:

  1. in 3.0 all timers will emit in millis
  2. the "metrics consistency" flag should default to "on" and let users flip it back to legacy if they need time to migrate, and
  3. add a deprecation path to that consistency flag so we can eventually remove the option from the config (in 3.1?) and clean up the code.

@dirrao
Copy link
Contributor Author

dirrao commented Sep 7, 2024

@ferruzzi
I have enabled the consistency flag on by default. Please take a look at the PR once.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@potiuk
Copy link
Member

potiuk commented Sep 7, 2024

remove the option from the config (in 3.1?)

I think in 4.0 - but also I think we will get to much faster "MAJOR" release cycle. We have not decided it yet, but we might want to do just that and say release new Airflow MAJOR release every year ?

@dirrao dirrao self-assigned this Sep 10, 2024
@dirrao dirrao added the airflow3.0:candidate Potential candidates for Airflow 3.0 label Sep 10, 2024
@dirrao dirrao added the airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes label Sep 10, 2024
@dirrao
Copy link
Contributor Author

dirrao commented Sep 10, 2024

Hi @kaxil / @ferruzzi,
I'm awaiting final approval to proceed with the merge. Could you please take a look when you get a chance?

@potiuk potiuk merged commit 08daffe into apache:main Sep 11, 2024
52 checks passed
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
version_added: 2.10.0
type: string
example: ~
default: "True"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!!!! @potiuk I guess no one noticed, cos this made it out with a default of on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, except this hasn't made it to a release yet, and it wasn't marked for backporting. (Lucky miss there)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't slip through, it was discussed above and intentional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied on #43966 as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed to enable it directly in Airflow 3 without backporting. However, updating this was overlooked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes airflow3.0:candidate Potential candidates for Airflow 3.0 area:metrics kind:documentation telemetry Telemetry-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Airflow Timer Metrics Inconsistancy across multiple metrics loggers
5 participants