-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust the DNS lookup duration metric #93254
Conversation
Tagging subscribers to this area: @dotnet/ncl Issue Details
This is a necessary change to ensure the metric is in sync with the spec and the OTel recommendations. An 8.0 backport is planned to avoid future breaking changes.
|
private static string GetErrorType(Exception exception) => (exception as SocketException)?.SocketErrorCode switch | ||
{ | ||
SocketError.HostNotFound => "host_not_found", | ||
SocketError.TryAgain => "try_again", | ||
SocketError.AddressFamilyNotSupported => "address_family_not_supported", | ||
SocketError.NoRecovery => "no_recovery", | ||
|
||
_ => exception.GetType().Name | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I picked the codes we can realistically get from the PAL by looking at our mapping in the Unix PAL, and getaddrinfo
return values on Windows.
There are some other codes that may theoretically occur, but I wouldn't expect to see them in practice. Ie. seeing InvalidArgument
, TypeNotFound
or SocketNotSupported
means a programming error in our PAL rather than anything else if I'm not mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good. Details about the error attribute need to be added to dotnet/docs#37213 and/or the OTEL semantic conventions
} | ||
} | ||
} | ||
|
||
private static string GetErrorType(Exception exception) => (exception as SocketException)?.SocketErrorCode switch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These strings are specifically for metrics, right? Please add a comment on the method to that effect. Where do these result strings come from? Are they strings dictated by otel, or are you just snake_casing the names from the enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These strings are specifically for metrics, right?
Yes.
Where do these result strings come from?
They are defined by us, by mapping enum values to snake-cased strings which is an OTel convention. We do the same in HTTP metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which is an OTel convention
Why don't we do the same for the _ =>
case then? i.e. it seems like our general goal here is to take any SocketError value or Exception name, and snake case it, and we're then just optimizing a few cases that we expect to be most common by hardcoding the transformation. In the name of robustness, shouldn't we be dynamically creating the snak_cased version of the resulting SocketError if there is one (outside of the range we've hardcoded) or for the exception type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noticed in #93254 (comment), this should be actually the full name of the exception, I will correct it.
Why don't we do the same for the _ => case then?
The specs don't require us to camel-case the exception names (would be very weird with full names). Tagging @lmolkova for a comprehensive explanation.
dynamically creating the snak_cased version of the resulting SocketError
The SocketError
-s that can practically appear here are coming from a limited set. I picked the codes we can realistically get from the PAL by looking at our mapping in the Unix PAL, and getaddrinfo
return values on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And when that mapping is augmented in the future, we're going to remember to update this random method as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And when that mapping is augmented in the future
That would only happen if getaddrinfo
implementations would introduce new error codes which I find very unlikely. Implementing dynamic mapping here to prematurely solve that problem feels like overengineering. Users would see the value host_not_found
in 99+% of the practical cases, sometimes try_again
maybe.
If you think the mapping is problematic in the form I'm proposing, I would rather prefer to remove it, and just report the exception name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like our general goal here is to take any SocketError value or Exception name, and snake case it
I'd frame the goal as disambiguating the most common sources of error that users would appreciate not having grouped together when analyzing failures. I don't think it is a goal that every possible SocketError has been disambiguated, that new future SocketError values are automatically disambiguated, or that exception names are transformed into snake case. The hard-coded list Anton has here seems fine to me. If we ever changed it we'd probably want that to be an intentional choice so we can warn users that their queries might need to be updated.
src/libraries/System.Net.NameResolution/src/System/Net/NameResolutionMetrics.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.NameResolution/src/System/Net/NameResolutionMetrics.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I commented on the other issue that I do think FullName would be preferable for exception types for consistency across other .NET metrics.
/azp run runtime-libraries-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
CI failures are unrelated (eg. #93413). |
I opened #93418 to revert this because it broke the build |
Looks like it didn't break the build after all - see explanation in #93418 (comment) |
dns.lookupS.duration
todns.lookup.duration
#92917dns.lookup.duration
should reporterror.type
#92837This is a necessary change to ensure the metric is in sync with the spec and the OTel recommendations. An 8.0 backport is planned to avoid future breaking changes.