Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for Backup Edge Cache Group #1908

Closed
wants to merge 4 commits into from
Closed

Changes for Backup Edge Cache Group #1908

wants to merge 4 commits into from

Conversation

Vijay-1
Copy link
Contributor

@Vijay-1 Vijay-1 commented Feb 22, 2018

This PR implements solution for the issue: #1907

It places the backup policy in the CZF file

{
"coverageZones": {
"GROUP2": {
"backupList": ["GROUP1"],
"network6": [
"1234:567a::/64",
"1234:567b::/64"
],
"network": [
"10.197.69.0/24"
]
},
"GROUP1": {
"backupList": ["GROUP2"],
"network6": [
"1234:5677::/64",
"1234:5676::/64"
],
"network": [
"10.126.250.0/24"
]
}
}
}

The following test cases has been executed successfully for both DNS and HTTP Routing:
Test Setup

GROUP1 : Two cache Servers
Group2 : One Cache Server
GEO Limit set to CZF only

Request from GROUP1's subnet , with one server in GROUP1 down and make sure the request is getting redirected to the remaining server in GROUP1
Request from GROUP1's subnet , with both servers in GROUP1 down and make sure the request is getting redirected to the GROUP2

Regression cases
Set GEO limit to None without backup list configured in CZF
Request from GROUP1's subnet , with both servers in GROUP1 down and make sure the request is getting redirected to the GROUP2

@asfgit
Copy link
Contributor

asfgit commented Feb 22, 2018

Can one of the admins verify this patch?

@mitchell852 mitchell852 added the Traffic Router related to Traffic Router label Feb 22, 2018
@limited
Copy link
Contributor

limited commented Feb 22, 2018

add to whitelist

@limited
Copy link
Contributor

limited commented Feb 22, 2018

test this please

@limited limited added the new feature A new feature, capability or behavior label Feb 22, 2018
@limited limited added this to the 2.3 milestone Feb 22, 2018
Copy link
Contributor

@rawlinp rawlinp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did an initial pass over this PR and it generally seems good but I did notice a couple things:

  1. The indentation seems off in several places. Please make sure indentation is consistent with the rest of the file.
  2. New features should be clearly documented. We do have documentation around the Coverage Zone File that will need updated for the new field.

Question: if a CZ's backup list exists, but none of the backup CZs have caches available, what happens? After the backup list is exhausted, should we fallback to the original logic of finding the closest available cachegroup?

@limited
Copy link
Contributor

limited commented Feb 22, 2018

@rawlinp This backupList would be used to gain more control over which CGs can be a backup for a group. If a CG is not in the backup list, it won't be used, even if that means the request will be rejected.

We specifically implemented this code to not fall back to getClosestAvailableCachegroup(). This allows operators to compartmentalize traffic within portions of their network. (For example, if all the east coast caches go down, bandwidth-limited backbone/cross-country links will not be overwhelmed by all the requests going from east clients to west-coast caches). (This was actually the driving reason for this feature and why the existing getClosest... isn't sufficient in some cases).

@limited
Copy link
Contributor

limited commented Feb 22, 2018

Code looks good.

Please also add a section to CHANGELOG.md at the top-level with a few line description of the feature and a link to the new docs.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Feb 23, 2018

I have addressed the review comments. Please review and merge if every thing looks good.

.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).
.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).

The ``"backupList"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the configured backup Cache Group(s)). This can be used as an alternative to ``"coordinates"`` therby enabling operators to have a control over choosing back up edge servers rather than depending on Geo-coordinates based routing.
Copy link
Contributor

@limited limited Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"where zone name does not map to a cg name in Traffic Ops"

This might be more clearly written as:

The ``"backupList"`` section is optional and is used by Traffic Router for localization in the case of a CZF
 "hit" when there are no caches available for that DS in the matched cache group. 

.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).
.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).

The ``"backupList"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the configured backup Cache Group(s)). This can be used as an alternative to ``"coordinates"`` therby enabling operators to have a control over choosing back up edge servers rather than depending on Geo-coordinates based routing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also mention that the list is ordered.
Also, include that if there are no caches available in any of the listed groups, the request will be bypassed (if configured) or rejected

@limited
Copy link
Contributor

limited commented Feb 23, 2018

Hopefully last request.

Can you please fix your commit author and email (currently they are root).

Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

Then redo the push (you might need a git push -f)

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Feb 23, 2018

Thank you for guiding me through out this.

Copy link
Contributor

@rawlinp rawlinp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple minor spacing/indentation issues, change section of release notes

CHANGELOG.md Outdated
@@ -16,3 +16,11 @@ Pre-2.2 the HTTP Routing Name is configurable via the `http.routing.name` option
2. Add this parameter to a **single** profile in the affected CDN

With those profile parameters in place Traffic Ops can be safely upgraded to 2.2. Before taking a post-upgrade snapshot, make sure to check your Delivery Service example URLs for unexpected Routing Name changes. Once Traffic Ops has been upgraded to 2.2 and a post-upgrade snapshot has been taken, your Traffic Routers can be upgraded to 2.2 (Traffic Routers must be upgraded *after* Traffic Ops so that they can work with custom per-DeliveryService Routing Names).


#### Backup Edge Cache group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we currently have an RC for 2.2, this is most likely landing in 2.3. So I would add a new v2.3.0 [unreleased] section at the top of this file and add this subsection #### Backup Edge Cache group under a ### Release Notes header, ie:

v2.3.0 [unreleased]
-------------------

### Release Notes

#### my new feature
...

v2.2.0 [unreleased]
...


if (coordinates != null && coordinates.has("latitude") && coordinates.has("longitude")) {
final double latitude = coordinates.get("latitude").asDouble();
final double longitude = coordinates.get("longitude").asDouble();
geolocation = new Geolocation(latitude, longitude);
}

if (!addNetworkNodesToRoot(root, loc, locData, geolocation, useDeep)) {
if (backupList != null) {
bkCacheLoc = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much indentation here

}

@SuppressWarnings({"PMD.CyclomaticComplexity", "PMD.NPathComplexity"})
public CacheLocation getCoverageZoneCacheLocation(final String ip, final String deliveryServiceId, final boolean useDeep) {
public CacheLocation getCoverageZoneCacheLocation(final String ip, final String deliveryServiceId,final Track track, final boolean useDeep) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space between parameters

.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).
.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).

The ``"backupList"`` section is optional and is used by Traffic Router for localization in the case of a CZF "hit" when there are no caches available for that DS in the matched cache group. This backup list is ordered and if there are no caches available in any of the configured backup cache groups, the request will be bypassed (if configured) or rejected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: did you build the docs with this change to make sure the backupList note is contained within the same box as the coordinates note? The syntax highlighting here makes me think that the 2nd note might not be in the same box

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I built the docs and loaded the file in a browser. Didn't find any issue.

@limited limited self-assigned this Feb 26, 2018
@asfgit
Copy link
Contributor

asfgit commented Feb 26, 2018

Can one of the admins verify this patch?

@limited
Copy link
Contributor

limited commented Feb 27, 2018

ok to test

@asfgit
Copy link
Contributor

asfgit commented Feb 27, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1124/
Test FAILed.

@rawlinp
Copy link
Contributor

rawlinp commented Feb 27, 2018

I think we want to allow the backupList to be used for compartmentalizing traffic but also provide a configurable option to fallback to the default behavior of finding the next closest cachegroup once all backupList cachegroups have been exhausted. Perhaps by default we will fall back to finding the next closest cachegroup if all backup cachegroups have been exhausted, but if the coverageZone has a field "fallbackToNextClosest": false the request will be dropped if the backupList has been exhausted?

Maybe the format changes to:

"GROUP2": {
    "backupZones": {
        "list": ["GROUP1"],
        "fallbackToNextClosest": false   <-----default: true or false?
    },
    "network6": [
        "1234:567a::/64",
        "1234:567b::/64"
    ],
    "network": [
        "10.197.69.0/24"
    ]
}

I'm not sure it matters whether the default is true or false as long as it's documented. What do you guys think?

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Feb 28, 2018 via email

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Feb 28, 2018

@asfgit
Looks like the failure is because of the build environment, i dont see anything related to the code under test.

Creating jenkinsincubatortrafficcontrolpr1124_traffic_portal_build_1 ... error
ERROR: for jenkinsincubatortrafficcontrolpr1124_traffic_portal_build_1 Cannot start service traffic_portal_build: network jenkinsincubatortrafficcontrolpr1124_default not found

Request your inputs on fixing these .

@dneuman64
Copy link
Contributor

I think I agree with Rawlin. I understand the reason we have the backup list, and why using getClosestAvailableCachegroup() isn't used when the backup cache groups are exhausted, but I think we want the ability to be able to fallback to getClosestAvailableCachegroup() if need be. This would give us the ability to prefer the backup cache groups defined in the CZF but if all of those are unavailable we would still be able to serve content from the next closest cache group even though it might be a burden on the network. If you completely disagree then we can keep this PR as is and put in an issue to add the default functionality.

@limited
Copy link
Contributor

limited commented Feb 28, 2018

@Vijay-1 I think we could add this functionality for a pretty minor change to the logic in getCoverageZoneCacheLocation(). If we just replace the else block on line 692 with an if for the new true/false parameter I think that will do it.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Feb 28, 2018

@dneuman64 . If every one agrees that this fallback configuration is required,will get it done in this PR.
@limited . Yep that should do it. Will make the code changes , unit test it and update the PR.
I am not sure about the "All checks have failed". I dont see any problem with the code under review. Can you guys give me any inputs on that as well, so that i can cover both these changes, in case if this failure requires anything from my side.

@dneuman64
Copy link
Contributor

I think the All Checks have failed just means it couldn't find the unit tests to run. I think it's just an issue with the build envrionment. @dangogh can you confirm?

@asfgit
Copy link
Contributor

asfgit commented Mar 1, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1142/
Test PASSed.

}
}

if ((networkNode.getBackupLoc() != null && networkNode.isUseClosest()) || (networkNode.getBackupLoc() == null)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the backupZones list is empty and the operator sets useClosest=true, I think we should enter this loop because its equivalent to trying to route to the backupZones when they are all down/unavailable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think , this condition in NetworkNode.java will take care of this, where we parse and store the useClosest flag.
if (backupConfigNode != null && backupConfigNode.has("list")) {
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this the following config will do the closest group if backup list is empty
"backupZones":{
"list": [],
"fallbackToClosestGroup": true
},

and wont do closest group with the following:
"backupZones":{
"list": [],
"fallbackToClosestGroup": false
},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i'm trying to simplify this conditional, because its awkward to have opposite checks on both sides of an OR. Do we really need to check for networkNode.getBackupLoc() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you. Let me fix this.

@asfgit
Copy link
Contributor

asfgit commented Mar 2, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1155/
Test PASSed.

@asfgit
Copy link
Contributor

asfgit commented Mar 2, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1156/
Test PASSed.

@limited
Copy link
Contributor

limited commented Mar 2, 2018

I'm good with this.

@rawlinp Any more comments?

Copy link
Contributor

@rawlinp rawlinp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies in advance, the code looks good, but I have a bunch of nitpicks mostly around variable naming and some minor cleanup

CHANGELOG.md Outdated

### Release Notes

#### Backup Edge Cache group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as a heads up (you shouldn't have to take any action), I might be reformatting the changelog later per the mailing list discussion about using the keepachangelog.com format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog format has now changed to the keepachangelog format (there's now a link to that in the changelog), so you will need to rebase this branch onto the latest version of master and update these notes to use the new format.

@@ -881,7 +899,7 @@ The CZF is an input to the Traffic Control CDN, and as such does not get generat

The script that generates the CZF file is not part of Traffic Control, since it is different for each situation.

.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically).
.. note:: The ``"coordinates"`` section is optional and may be used by Traffic Router for localization in the case of a CZF "hit" where the zone name does not map to a Cache Group name in Traffic Ops (i.e. Traffic Router will route to the closest Cache Group(s) geographically). The ``"backupZones"`` section is optional and is used by Traffic Router for localization in the case of a CZF "hit" when there are no caches available for that DS in the matched cache group. This backup list is ordered and if there are no caches available in any of the configured backup cache groups, the request will be bypassed (if configured) or rejected or gets to the closest cache groups if configured (``"fallbackToClosestGroup"`` flag) for this backup configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would break this last sentence up to make it clearer (starting from "This backup list...":
This backup "list" contains an ordered list of backup groups to choose from if the matched cache group has no caches available for a requested DS. If an available cache cannot be found in any of the backup groups either, the "fallbackToClosestGroup" flag determines the Traffic Router's following behavior. If true, Traffic Router will find the next closest cache group with available caches. If false (the default), Traffic Router will bypass the request (if configured) or reject it.

@@ -106,15 +108,33 @@ public static NetworkNode generateTree(final JsonNode json, final boolean verify
final String loc = czIter.next();
final JsonNode locData = JsonUtils.getJsonNode(coverageZones, loc);
final JsonNode coordinates = locData.get("coordinates");
final JsonNode backupConfigNode = locData.get("backupZones");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of a nitpick but can we call this variable backupConfigJson? All the Node stuff is confusing with JsonNode vs NetworkNode.

@@ -106,15 +108,33 @@ public static NetworkNode generateTree(final JsonNode json, final boolean verify
final String loc = czIter.next();
final JsonNode locData = JsonUtils.getJsonNode(coverageZones, loc);
final JsonNode coordinates = locData.get("coordinates");
final JsonNode backupConfigNode = locData.get("backupZones");
JsonNode fallbackNode = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this variable is unneeded; see comment on L131

if (backupConfigNode != null) {
if (backupConfigNode.has("list")) {
bkCacheLoc = new ArrayList<>();
for (final JsonNode network : JsonUtils.getJsonNode(backupConfigNode, "list")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cachegroup rather than network?

@@ -281,6 +304,14 @@ public Geolocation getGeolocation() {
return geolocation;
}

public List<String> getBackupLoc() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBackupCacheGroups

// Check whether the CZF entry has a geolocation and use it if so.
return getClosestCacheLocation(cacheRegister.filterAvailableLocations(deliveryServiceId), networkNode.getGeolocation(), cacheRegister.getDeliveryService(deliveryServiceId));
if (networkNode.getBackupLoc() != null) {
for (final String element : networkNode.getBackupLoc()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cachegroup rather than element

if (closestCacheLocation != null) {
LOGGER.debug("Got closest CZ cache group " + closestCacheLocation.getId() + " for " + ip + ", ds " + deliveryServiceId);
if (track != null) {
track.setFromBackupCzGroup(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, it wasn't from backupCzGroup but the default behavior of nextClosest. Should we just omit this if-block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, any thing other than the exact match on the subnet, is backup. (i.e) both co-ordinates and Backup Cache Groups are fallback mechanism and hence i feel this if-block should be fine

fallbackNode = backupConfigNode.get("fallbackToClosestGroup");
useClosestOnBackupFailure = (fallbackNode != null) ? fallbackNode.asBoolean(): false;
} else {
useClosestOnBackupFailure = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just initialize useClosestOnBackupFailure to true on L113 and omit this if-else block. It doesn't change behavior; it would just be cleaner because it will be set to false in the previous block if configured anyway.

bkCacheLoc.add(network.asText());
}
}
fallbackNode = backupConfigNode.get("fallbackToClosestGroup");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these two lines can be replaced with:

useClosestOnBackupFailure = JsonUtils.optBoolean(backupConfigJson, "fallbackToClosestGroup", false);

Then you also don't need the fallbackNode intermediate variable

@asfgit
Copy link
Contributor

asfgit commented Mar 3, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1174/
Test PASSed.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 3, 2018

@rawlinp , I have address all your comments except the removal of if-block which tags cache group selection based on coordinates as backup.

I believe, any thing other than the exact match on the subnet, is backup / fallback. (i.e) both co-ordinates and Backup Cache Groups are fallback mechanism and hence i feel this if-block should be fine.

@asfgit
Copy link
Contributor

asfgit commented Mar 5, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1175/
Test PASSed.

@rawlinp
Copy link
Contributor

rawlinp commented Mar 5, 2018

@Vijay-1 thanks for addressing my comments!

I pulled this PR down to run some test scenarios against it, and this case was unexpected:

  1. client IP is found in the CZF
  2. matched zone has no caches available
  3. matched zone's backupZone list is empty
  4. routing logic falls back to a Geo-lookup
  5. client gets 302'd to closest available cache group

I would've expected item 3 to cause the client to be bypassed (if configured) or rejected. With your current code this only happens if the Delivery Service is set to CZ-only.

}
return closestCacheLocation;
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather just returning null here, I think we have to return more information to the caller so that the caller knows the difference between "null because there wasn't a hit in the CZF" and "null because there was a hit in the CZF but we're not returning a CacheLocation due to the "backupZones" config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rawlinp This test case and results looks fine to me. With the Geo Limit any value other than CZF only , say for example None, TR will go all out to find a route and when doing so it will find a route via 'GEO' since this is not limited to CZF only.

For Null returning, may be we can add one more rdtl field to denote backup Zone failure. But then , i think it is supernumerary since we have enough tagging in the access logs to differentiate between selecting caches from CZ, CZ's backup configuration for successful cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, one of the reasons for specifying backupZones is to confine traffic to certain cachegroups when the matched cachegroups are unavailable. In this case we're getting a hit in the CZ (so we know where the client is in the network), that cachegroup is unavailable, but the client will still fall back to a Geo-lookup (even though from the CZ-hit we already know where the client is), best-case scenario get matched to the same cachegroup, find the closest available cachegroup, and altogether disregard the "confine traffic to certain cachegroups" rule in the CZF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"confine traffic to certain cache group" is defined by CZF. So, if we remove that limitation (Geo Limit's value other than CZF only), TR can fetch a matching cache using Geo lookup. I am still not clear on why should TR reject this when the configuration says None as Geo Limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geo is generally the backup case when a client's IP cannot be found in the CZF. If the client is found in the CZF, we shouldn't be falling back to a Geo-lookup which subverts the CZF's backupZones config. We know the client is in our network, so we want it to follow the rules specified in the CZF.

If the client was not found in the CZF, then it should most likely not be in our network. In that case, the congestion might not be a problem because the client should be entering our network from somewhere else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the client was found in the CZF, but the caches in that zone are down (and theres no backupList AND fallthrough == true), then we should try to use thecoordinates of the matching zone to find the next closest cache group. If there are no coordinates, I think the current behavior is to try geolocation, even on a client that was in the CZF.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that, but the scenario I'm concerned about is when fallbackToNextClosest == false. In that case I don't think the request should fallback to Geo when the client was found in the CZF but the matched cachegroup has no caches available. In the best-case geo-lookup that client would still be traversing the same network path to the next available cachegroup (which is what the backup list is supposed to prevent).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i agree that when fallbackToNextClosest ==false that we should NOT geolocate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, fallbackToNextClosest flag of backupZones gets to decide whether TR has to fallback to Geo lookup or not. Thank you @rawlinp and @limited for clarifying. Let me make the changes.

@asfgit
Copy link
Contributor

asfgit commented Mar 7, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1196/
Test PASSed.

CHANGELOG.md Outdated

### Release Notes

#### Backup Edge Cache group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog format has now changed to the keepachangelog format (there's now a link to that in the changelog), so you will need to rebase this branch onto the latest version of master and update these notes to use the new format.

@@ -262,7 +264,8 @@ public Geolocation getLocation(final String clientIP, final DeliveryService deli
}
} else {
// Deep caching not enabled for this Delivery Service; use the regular CZ
cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds);
cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds, track, useDeep);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this line should just pass track into this method and let the method default useDeep to false.

@@ -345,8 +352,9 @@ public DNSRouteResult route(final DNSRequest request, final Track track) throws
return result;
}

final CacheLocation cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds);
final CacheLocation cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds, track, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to change the method signature to have this be getCoverageZoneCacheLocation(request.getClientIP(), ds, track); so that we're not passing useDeep explicitly as false? useDeep should default to false.

@@ -279,7 +282,11 @@ public Geolocation getLocation(final String clientIP, final DeliveryService deli
track.setResult(ResultType.MISS);
track.setResultDetails(ResultDetails.DS_CZ_ONLY);
}
} else {
} else if ((useDeep) || (networkNode != null && networkNode.isUseClosest()) || (networkNode == null)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can remain a simple else if the logic in my comment above is implemented.

@@ -116,7 +116,7 @@ public void setRegionalAlternateCount(final int regionalAlternateCount) {

public enum ResultDetails {
NO_DETAILS, DS_NOT_FOUND, DS_TLS_MISMATCH, DS_NO_BYPASS, DS_BYPASS, DS_CZ_ONLY, DS_CLIENT_GEO_UNSUPPORTED, GEO_NO_CACHE_FOUND,
REGIONAL_GEO_NO_RULE, REGIONAL_GEO_ALTERNATE_WITHOUT_CACHE, REGIONAL_GEO_ALTERNATE_WITH_CACHE
REGIONAL_GEO_NO_RULE, REGIONAL_GEO_ALTERNATE_WITHOUT_CACHE, REGIONAL_GEO_ALTERNATE_WITH_CACHE, DS_CZ_BACKUP_CG
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add DS_CZ_NO_BKP to signify that the matched cachegroup and its configured backup cachegroups were unavailable.

}
}
return closestCacheLocation;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following from my suggestions above, this line would probably be:

} else {
    track.SetNoBackupLocationsAvailable(true);
    return null;
}

Then I'm not sure if we can also remove the return null statement below this.

@@ -262,7 +264,8 @@ public Geolocation getLocation(final String clientIP, final DeliveryService deli
}
} else {
// Deep caching not enabled for this Delivery Service; use the regular CZ
cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds);
cacheLocation = getCoverageZoneCacheLocation(request.getClientIP(), ds, track, useDeep);
networkNode = getNetworkNode(request.getClientIP());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'm not a big fan of polluting this method with extra NetworkNode stuff, especially since getNetworkNode is already called in getCoverageZoneCacheLocation. I think ideally getCoverageZoneCacheLocation should return us the information we need.

Since we're passing in the Track object, we could probably add that information there, ie something like track.noBackupLocationsAvailable() which would get set within getCoverageZoneCacheLocation. Then we can update the logic starting on L277 to something like this:

if (track.noBackupLocationsAvailable() or ds.isCoverageZoneOnly()) { 
    if (ds.geoRedirect) {...}
    else {
        resultDetails = DS_CZ_ONLY if ds.isCoverageZoneOnly else DS_CZ_NO_BKP
        ...
    }
}

@asfgit
Copy link
Contributor

asfgit commented Mar 8, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1202/
Test PASSed.

Addressed the following review comments:
1. Added documentation
2. Updated CHANGELOG.md
3. Fixed indentation

Addressed review comments.
Commit using proper Author fields.

Addressed Rawlin's comments

Fixing Note's section as per Rawlin's comment.

Addressing Rawlin's comments on fallback configuration on backupZones.

Rewrote the ambiguous if condition check based on Eric's comments

Fixing unit test case failure.

Addressing Rawlin's comments on:
1. Naming convention
2. Avoiding use of variable 'fallbackNode '
3. Documentation

Documentation update for Traffic Router's access logs rdtl field

Addressing Rawlin's comment on expanding the scope of backupZones's fallbackToClosestGroup configuration for all Geo Limit values.

Use Track to record network node's configuration accross.
@asfgit
Copy link
Contributor

asfgit commented Mar 8, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1203/
Test PASSed.

@asfgit
Copy link
Contributor

asfgit commented Mar 8, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1204/
Test PASSed.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 8, 2018

@rawlinp
I agree that we should be using track across. I decided against it because i thought it would add more changes. I have now changed the code using track. Given below is some of the deviation i took from your comments.

  1. I didnt use DS_CZ_NO_BKP, since all the failures will be validated against bypass setting and that is what will get tagged in the end. Even if we mark it with DS_CZ_NO_BKP, it will get replaced by NO_BYPASS i believe. I think , we can leave it this way since a successful fetch gets tagged properly.

  2. I used a boolean variable to keep track of fallbacktoNextGroup flag and used it to decide on geo lookups.

  3. Looks like changing the signature of getCoverageZoneLocation, leaving out a boolean, keeping false as default is not possible because getDeepCoverageZoneCacheLocation uses true.

  4. I believe , i have resolved conflicts in CHANGELOG. Please let me know if i missed anything.

  5. I have not tested these changes since my test bed is down temporarily. Hosted for review without testing to give you a heads up on how i am planning to address your comments.

Copy link
Contributor

@rawlinp rawlinp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed the changelog first, will continue reviewing the rest of the PR shortly

CHANGELOG.md Outdated
@@ -1,13 +1,18 @@
# Changelog
All notable changes to this project will be documented in this file.
=======
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is unnecessary, it turns the line above into a large header

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

CHANGELOG.md Outdated

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

## [Unreleased]
### Added
- Per-DeliveryService Routing Names: you can now choose a Delivery Service's Routing Name (rather than a hardcoded "tr" or "edge" name). This might require a few pre-upgrade steps detailed [here](http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/migration_from_20_to_22.html#per-deliveryservice-routing-names)

- Backup Edge Cache group: Backup Edge group for a particular cache group can be configured using coverage zone files. With this, users will be able control traffic within portions of their network, there by avoiding choosing fall back cache groups using geo co-ordinates(getClosestAvailableCachegroup) This can be controlled using "backupZones" which contains configuration for backup cache groups parsed from Coverage zone file as explained [here] (http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/using.html#the-coverage-zone-file-and-asn-table)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things here:

  1. Based on the keepachangelog format, there shouldn't be blank lines between bullet points, so this line should be moved up.
  2. "thereby" not "there by", "fallback" not "fall back", "coordinates" not "co-ordinates"
  3. A period is needed after "(getClosestAvailableCachegroup)"
  4. "the Coverage Zone File" not "Coverage zone file"
  5. Also, the [here] (http://...) link doesn't work if there's a space between [here] and the link (http://..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

CHANGELOG.md Outdated
### Changed
- Reformatted this CHANGELOG file to the keep-a-changelog format

[Unreleased]: https://github.com/apache/incubator-trafficcontrol/compare/RELEASE-2.1.0...HEAD
-[Unreleased]: https://github.com/apache/incubator-trafficcontrol/compare/RELEASE-2.1.0...HEAD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change here breaks the [Unreleased] link above, this shouldn't be changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

CHANGELOG.md Outdated

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

## [Unreleased]
### Added
- Per-DeliveryService Routing Names: you can now choose a Delivery Service's Routing Name (rather than a hardcoded "tr" or "edge" name). This might require a few pre-upgrade steps detailed [here](http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/migration_from_20_to_22.html#per-deliveryservice-routing-names)

- Backup Edge Cache group: Backup Edge group for a particular cache group can be configured using coverage zone files. With this, users will be able control traffic within portions of their network, there by avoiding choosing fall back cache groups using geo co-ordinates(getClosestAvailableCachegroup) This can be controlled using "backupZones" which contains configuration for backup cache groups parsed from Coverage zone file as explained [here] (http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/using.html#the-coverage-zone-file-and-asn-table)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the keepachangelog format there should only be one blank line between sections

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -295,15 +296,15 @@ public Geolocation getLocation(final String clientIP, final DeliveryService deli
track.setResult(ResultType.MISS);
track.setResultDetails(ResultDetails.DS_CZ_ONLY);
}
} else {
} else if (track.isUseNextClosest()) {
//Even with Geo limit none, TR wont do geo look-up, if fallback is diabled via backupZones configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diabled -> disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 8, 2018

@rawlinp
Fixed the Formatting comments and typo. I will run few test cases to on these changes during my daytime and update the results.

@asfgit
Copy link
Contributor

asfgit commented Mar 8, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1210/
Test PASSed.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 9, 2018

I have successfully executed the following test cases:
With Geo Limit CZF only:

  1. Fallback to configured backup CZ Cache groups
  2. Fallback to coordinates if backupZones are not available based on fallbackToClosestGroup flag

With Geo Limit 'None':

  1. Fallback to Geo lookup based on backupZones config's fallbackToClosestGroup flag setting
  2. With no backupZones config, fallback to Geo lookup

@asfgit
Copy link
Contributor

asfgit commented Mar 12, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/incubator-trafficcontrol-PR/1223/
Test PASSed.

@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 21, 2018

This PR will be closed and a new one based on the review comments will be created.

@Vijay-1 Vijay-1 closed this Mar 21, 2018
@Vijay-1 Vijay-1 deleted the EdgeCache-BackupGroup branch March 21, 2018 12:06
@Vijay-1
Copy link
Contributor Author

Vijay-1 commented Mar 21, 2018

New PR: #2029

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
abandoned new feature A new feature, capability or behavior Traffic Router related to Traffic Router
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants