[nexus] Silo IP pools #3985

Merged: 17 commits into main from silo-ip-pools, Aug 30, 2023

Conversation

@david-crespo (Contributor) commented Aug 30, 2023:

Closes #3926

A lot going on here. I will update this description as I finish things out.

Important background

  • [nexus] Silo IP pools schema change #3981 added silo_id and project_id columns to the IP pools table in the hope that that would be sufficient for the rest of the work schema-wise, but it turned out more was needed.
  • Before this change, the concept of a default IP pool was implemented through the name on the IP pool, i.e., the IP pool named default is the one used by default for instance IP allocation if no pool name is specified as part of the instance create POST.
  • Before this change, the concept of an internal IP pool was implemented through a boolean internal column on the ip_pool table.

IP pool selection logic

There are two situations where we pick an IP pool to allocate IPs from. For an instance's source NAT, we always use the default pool. Before this change, with only fleet pools, this simply meant picking the one named default (analogous now to the one with is_default == true). With the possibility of silo pools, we now pick the most specific default available. That means that if there is a silo-scoped pool marked default matching the current silo, we use that. If not, we pick the fleet-level pool marked default, which should always exist (see possible todos at the bottom — we might want to take steps to guarantee this).

For instance ephemeral IPs, the instance create POST body takes an optional pool name. If none is specified, we follow the same defaulting logic as above — the most-specific pool marked is_default. We are leaving pool names globally unique (as opposed to per-scope), which IMO makes the following lookup logic easy to understand and implement: given a pool name, look up the pool by name. (That part we were already doing.) The difference now with scopes is that we need to make sure that the scope of the selected pool (assuming it exists) does not conflict with the current scope, i.e., the current silo. In this situation, we do not care about what's marked default, and we are not trying to get an exact match on scope. We just need to disallow an instance from using an IP pool reserved for a different silo. We can revisit this, but as implemented here you can, for example, specify a non-default pool scoped to fleet or silo (if one exists) even if there exists a default pool scoped to your silo.
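The fallback can be pictured as a single query that prefers a silo-scoped default over the fleet-scoped one. This is only a sketch of the idea, not the actual implementation (the real Diesel query appears later in the review discussion); `is_default` matches the column added by the migrations described below, and `$1` stands in for the current silo's ID:

```sql
-- Sketch only: pick the most specific default pool visible to the current silo.
-- A silo-scoped default sorts before the fleet-scoped one (NULL silo_id),
-- and we take the first row.
SELECT *
FROM omicron.public.ip_pool
WHERE is_default = true
  AND time_deleted IS NULL
  AND (silo_id = $1 OR silo_id IS NULL)
ORDER BY silo_id ASC NULLS LAST
LIMIT 1;
```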

DB migrations on ip_pool table

There are three migrations here, following guidance from @smklein based on the CRDB docs on limitations of online schema changes and some conversations he had with the CockroachDB team. It's possible they could be made into two; I don't think it can be done in one. A rough sketch of the three steps follows the list below.

  • Add is_default column and a unique index ensuring there is only one default IP pool per "scope" (unique silo_id, including null as a distinct value)
  • Populate correct data in new columns
    • Populate is_default = true for any IP pools named default (there should be exactly one, but nothing depends on that)
    • Set silo_id = INTERNAL_SILO_ID for any IP pools marked internal (there should be exactly one, but nothing depends on that)
  • Drop the internal column
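For concreteness, here is a rough sketch of the shape of those three steps, as mentioned above. It is illustrative only: the real migration files live under schema/crdb/, the index and column names are approximate, the index shown is the COALESCE variant discussed further down, and `<INTERNAL_SILO_ID>` is a placeholder for the well-known internal silo UUID rather than a real value.

```sql
-- (1) Add the column plus a partial unique index: one default pool per scope.
ALTER TABLE omicron.public.ip_pool
    ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE;

CREATE UNIQUE INDEX IF NOT EXISTS one_default_pool_per_scope ON omicron.public.ip_pool (
    COALESCE(silo_id, '00000000-0000-0000-0000-000000000000'::uuid),
    COALESCE(project_id, '00000000-0000-0000-0000-000000000000'::uuid)
) WHERE is_default = true AND time_deleted IS NULL;

-- (2) Backfill the new columns from the old conventions.
UPDATE omicron.public.ip_pool
    SET is_default = true
    WHERE name = 'default' AND time_deleted IS NULL;

UPDATE omicron.public.ip_pool
    SET silo_id = '<INTERNAL_SILO_ID>'  -- placeholder, not a real UUID
    WHERE internal = true AND time_deleted IS NULL;

-- (3) Drop the old flag.
ALTER TABLE omicron.public.ip_pool
    DROP COLUMN IF EXISTS internal;
```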

Code changes

  • Add similar-asserts so we can get a usable diff when the schema migration tests fail. Without this you could get a 20k+ line diff with 4 relevant lines.
  • Use silo_id == INTERNAL_SILO_ID everywhere we were previously looking at the internal flag (thanks @luqmana for the suggestion)
  • Add silo_id and default to IpPoolCreate (create POST params) and plumb them down the create chain
    • Leave off project_id for now, we can add that later
  • Fix the index that was failing to prevent multiple default pools for a given scope (see comment [nexus] Silo IP pools #3985 (comment))
  • Source NAT IP allocation uses the new defaulting logic to get the most specific available default IP pool
  • Instance ephemeral IP allocation uses that defaulting logic if no pool name is specified in the create POST; otherwise it looks up the pool by the specified name (but can only get pools matching its scope, i.e., its project, silo, or the whole fleet)

Limitations that we might want to turn into to-dos

  • You can't update default on a pool, i.e., you can't make an existing pool the default. You have to delete it and make a new one. This one isn't that hard — I would think of it like image promotion, where it's not a regular pool update but a special endpoint for making a pool the default, one that can unset the current default if it's a different pool.
  • Project-scoped IP pool endpoints are fake — they just return all IP pools. They were made in anticipation of being able to make them real. I'm thinking we should remove them or make them work, but I don't think we have time to make them work.
  • Ensure somehow that there is always a fleet-level default pool to fall back to

@david-crespo (author):

Only 3 tests are failing, which is surprisingly good.

FAIL [   6.275s] nexus-db-queries db::queries::external_ip::tests::test_ensure_pool_exhaustion_does_not_use_other_pool
FAIL [   4.879s] nexus-db-queries db::queries::external_ip::tests::test_next_external_ip_is_restricted_to_pools
FAIL [  28.225s] omicron-nexus::test_all integration_tests::instances::test_instance_create_in_silo

COALESCE(silo_id, '00000000-0000-0000-0000-000000000000'::uuid),
COALESCE(project_id, '00000000-0000-0000-0000-000000000000'::uuid)
) WHERE
"default" = true AND time_deleted IS NULL;
@david-crespo (author):

Welp, how do we feel about this coalesce @smklein

@smklein (Collaborator):

I think Postgres itself has better support here:

https://www.postgresql.org/docs/current/indexes-unique.html

It has a CREATE UNIQUE INDEX ... NULLS NOT DISTINCT option, which specifies that NULLs are treated as equal.

However, for CockroachDB I don't see similar support.

So: translating NULL to the zero UUID seems a little hacky, but probably fine.
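For reference, the Postgres 15+ form would look roughly like this if CockroachDB supported it (mirroring the index fragment quoted above; the index name is illustrative):

```sql
-- Postgres 15+ only: NULLS NOT DISTINCT makes NULL silo_id/project_id values
-- compare equal, so at most one all-NULL (fleet-scoped) default pool can exist.
CREATE UNIQUE INDEX one_default_pool_per_scope
    ON omicron.public.ip_pool (silo_id, project_id)
    NULLS NOT DISTINCT
    WHERE "default" = true AND time_deleted IS NULL;
```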

david-crespo marked this pull request as ready for review August 30, 2023 16:09
@david-crespo (author) commented Aug 30, 2023:

Still a bit of work to do, but all the existing tests pass. The biggest thing it needs is tests for instance IP allocation with a specific pool name exercising the scope conflict checks.

david-crespo requested review from ahl and luqmana August 30, 2023 16:12
//
// opctx
// .authorize(authz::Action::ListChildren, &authz::IP_POOL_LIST)
// .await?;
@david-crespo (author):

Interesting TODO. Not sure how to do this auth check given the way perms are set up for IpPool and IpPoolList:

# Describes the policy for accessing "/v1/system/ip-pools" in the API
resource IpPoolList {
permissions = [
"list_children",
"modify",
"create_child",
];
# Fleet Administrators can create or modify the IP Pools list.
relations = { parent_fleet: Fleet };
"modify" if "admin" on "parent_fleet";
"create_child" if "admin" on "parent_fleet";
# Fleet Viewers can list IP Pools
"list_children" if "viewer" on "parent_fleet";
}
has_relation(fleet: Fleet, "parent_fleet", ip_pool_list: IpPoolList)
if ip_pool_list.fleet = fleet;
# Any authenticated user can create a child of a provided IP Pool.
# This is necessary to use the pools when provisioning instances.
has_permission(actor: AuthenticatedActor, "create_child", ip_pool: IpPool)
if silo in actor.silo and silo.fleet = ip_pool.fleet;

Contributor:

Can we punt on this for now? Alternatively: if we don't do the most restrictive auth check right now are we concerned by the data exposed? Seems like a potentially reasonable punt.

@david-crespo (author):

Well, right now there is no auth check, at least not on this particular call. We could do something minimal that at least makes us feel better (we are already requiring the user be authenticated with a silo). But this call is made as part of a whole pile of calls that themselves have auth checks, and we're not really leaking info here as far as I can tell.

Contributor:

Yea, we definitely need to update the auth policy to reflect the fact that ip pools are no longer just a fleet-wide resource. Let's make an issue to track that?

@david-crespo (author):

Worth noting that while nearly all the plumbing is there for project-scoped IP pools, my plan for now was to make it so you can't create one (there's no project_id param on IpPoolCreate) and you don't get a project_id back on the IpPool view. So it's as if project-scoped pools don't exist. I would want to be more confident everything was correct first.

@ahl (Contributor) left a review:

looking good

@@ -53,16 +50,24 @@ impl DataStore {
opctx: &OpContext,
ip_id: Uuid,
instance_id: Uuid,
project_id: Uuid,
pool_name: Option<Name>,
Contributor:

Should this be NameOrId? No need to do it now, just asking...

@david-crespo (author):

Yeah, I'm pretty sure it could be. I think the main reason it was name only was that the name "default" was significant, and now it isn't.

/// Looks up the default IP pool by name.
/// Looks up the default IP pool for a given scope, i.e., a given
/// combination of silo and project ID. If there is no default at a given
/// scope, fall back up a level. There should always be a default at fleet
Contributor:

Suggested change:
- /// scope, fall back up a level. There should always be a default at fleet
+ /// scope, fall back to the next level up. There should always be a default at the fleet

? maybe clearer?

/// Looks up the default IP pool for a given scope, i.e., a given
/// combination of silo and project ID. If there is no default at a given
/// scope, fall back up a level. There should always be a default at fleet
/// level, though this query can theoretically fail.
Contributor:

in what situation?

@david-crespo (author), Aug 30, 2023:

Basically you would have to delete the fleet-wide (no silo and no project) IP pool we create by default, which you can't do if it has any IP ranges, and you can't delete those if they have any allocated IPs. So it's pretty hard to do that. Once we add the ability to make something a new default, you could conceivably create a new fleet-level pool, make it the default, and delete it while it's still empty. Then you would have no default. All very contrived situations, obviously. It might be cool to try to ensure that it's always there (maybe, e.g., you can't ever delete a fleet-wide default pool), but on the other hand I don't think it's expected that we can prevent every form of bad configuration here. In fact, we definitely can't because this all ties into customer network config that is outside our control.

.await
// Optional primarily because there are test contexts where we don't care
// about the project. If project ID is None, we will only get back pools
// that themselves have no project.
Contributor:

Suggested change:
- // that themselves have no project.
+ // that themselves have no associated project.

... does that mean we'd just get fleet-wide pools?

@david-crespo (author), Aug 30, 2023:

Or silo-scoped pools. The signature here is a little misleading because it only takes project (because we are pulling the current silo off opctx) but the logic does look at both silo and project ID when picking defaults.

nexus/Cargo.toml Outdated
@@ -68,6 +68,7 @@ serde.workspace = true
serde_json.workspace = true
serde_urlencoded.workspace = true
serde_with.workspace = true
similar-asserts.workspace = true
Contributor:

Should this be under dev-dependencies?

/// Silo, if IP pool is associated with a particular silo. One special use
/// for this is associating a pool with the internal silo oxide-internal,
/// which is used for internal services. If there is no silo ID, the
/// pool is considered a fleet-wide silo and will be used for allocating
Contributor:

Suggested change:
- /// pool is considered a fleet-wide silo and will be used for allocating
+ /// pool is considered a fleet-wide pool and will be used for allocating

want to say the pool is fleet-wide right?

/// which is used for internal services. If there is no silo ID, the
/// pool is considered a fleet-wide silo and will be used for allocating
/// instance IPs in silos that don't have their own pool. Must be non-
/// null if project_id is non-null (this is enforced as a DB constraint).
Contributor:

We definitely have this pattern a lot with Option's on the Rust side and DB constraints on the SQL side (#3152). I think with some wrangling we could have something like

enum PoolVisibility {
    Fleet,
    Silo { silo_id: Uuid },
    Project { silo_id: Uuid, project_id: Uuid },
}

Anyways not a blocker for this PR.

Contributor:

Ah, found the issue #3152

@@ -1131,7 +1131,7 @@ table! {
///
/// This should be updated whenever the schema is changed. For more details,
/// refer to: schema/crdb/README.adoc
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(3, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(3, 0, 2);
Contributor:

What's the standard for major vs non-major increments anyways? Is it just a patch bump because we haven't straddled some release yet?

@david-crespo (author), Aug 30, 2023:

I wanted to imply that these are related migrations, though I admit that is not at all what semver patch versions are supposed to mean.


Some(silo) => {
let (.., authz_silo) = self
.silo_lookup(&opctx, silo)?
.lookup_for(authz::Action::Read)
@luqmana (Contributor), Aug 30, 2023:

Kinda ties into previous points about how authz is supposed to work now, but should there be some kinda check to ensure the current user can't create an ip pool in a given silo?

@david-crespo (author):

Only fleet admin has CreateChild on IpPoolList, and we are already checking for that at IP pool create time in the datastore function.

resource IpPoolList {
permissions = [
"list_children",
"modify",
"create_child",
];
# Fleet Administrators can create or modify the IP Pools list.
relations = { parent_fleet: Fleet };
"modify" if "admin" on "parent_fleet";
"create_child" if "admin" on "parent_fleet";

pub async fn ip_pool_create(
&self,
opctx: &OpContext,
new_pool: &params::IpPoolCreate,
internal: bool,
) -> CreateResult<IpPool> {
use db::schema::ip_pool::dsl;
opctx
.authorize(authz::Action::CreateChild, &authz::IP_POOL_LIST)
.await?;

@@ -20,17 +12,36 @@ SELECT CAST(
);

ALTER TABLE omicron.public.ip_pool
ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE,
Contributor:

Should we be changing an existing schema migration?

@smklein (Collaborator), Aug 30, 2023:

Good catch. Generally, we only want schema migrations (the files themselves) to be additive.

@david-crespo are you relying on the fact that "3.0.0" hasn't been released yet?

@david-crespo (author):

Yes. I think I mentioned it at control plane huddle, though I noted it was a bad idea. There isn't really a reason to do it this way; I can change it to use 3 new migrations.

@david-crespo (author):

Changed to 3.0.1-3.0.3 instead of modifying 3.0.0 in e3c56c4

* internal is true, i.e., internal IP pools must be fleet-level pools.
* Fields representing association with a silo or project. silo_id must
* be non-null if project_id is non-null. silo_id is also used to mark an IP
* pool as "internal" by associating it with the oxide-internal silo.
*/
silo_id UUID,
project_id UUID,

-- if silo_id is null, then project_id must be null
Contributor:

Does it ever make sense to have a non-NULL project_id and silo_id where the silo_id doesn't match the project's silo? If not, I think adjusting this so that at most one of them is non-null might be preferable. A project_id already uniquely identifies a project and therefore the silo it belongs to.

@david-crespo (author):

True. My initial reaction was that if silo_id wasn't there, the query for the default pool would have to be more complicated.

dsl::ip_pool
    .filter(dsl::silo_id.eq(authz_silo_id).or(dsl::silo_id.is_null()))
    .filter(
        dsl::project_id.eq(project_id).or(dsl::project_id.is_null()),
    )
    .filter(dsl::is_default.eq(true))
    .filter(dsl::time_deleted.is_null())
    // this will sort by most specific first, i.e.,
    //
    //   (silo, project)
    //   (silo, null)
    //   (null, null)
    //
    // then by only taking the first result, we get the most specific one
    .order((
        dsl::project_id.asc().nulls_last(),
        dsl::silo_id.asc().nulls_last(),
    ))
    .select(IpPool::as_select())
    .first_async::<IpPool>(self.pool_authorized(opctx).await?)
    .await
    .map_err(|e| public_error_from_diesel_pool(e, ErrorHandler::Server))

But if at most one of the two was non-null, we could do something like

where default = true and (
  silo_id = authz_silo_id 
  or project_id = input_project_id
  or (silo_id is null and project_id is null)
)

and that should get us the same at-most-3 results. If anything the proposed modification is easier to understand rather than harder.

@david-crespo (author):

I have a test to add here and it's passing locally. Because it's only test code, I'm going to merge this so we can get things moving on dogfood and make that a second PR.

david-crespo merged commit f2bc296 into main Aug 30, 2023
david-crespo deleted the silo-ip-pools branch August 30, 2023 23:38
@luqmana (Contributor) left a review:

a couple nits/questions but looks good overall. Thanks!

Comment on lines +78 to +79
/// query can theoretically fail if someone is able to delete that pool or
/// make another one the default and delete that.
@luqmana (Contributor), Aug 30, 2023:

Not a blocker but maybe that's something we should guard against.

.fetch_for(action)
// any authenticated user can CreateChild on an IP pool. this is
// meant to represent allocating an IP
.fetch_for(authz::Action::CreateChild)
Contributor:

Why get rid of the action arg and hardcode CreateChild here? It makes it a little misleading, since the method is named ip_pools_fetch but, as you noted, the CreateChild action is for allocating an IP from the pool. So a caller that just wants to retrieve the pool but not allocate from it gets a slightly stronger check than necessary.

@david-crespo (author):

I think the abstraction is just cut the wrong way here. I'm not sure whether I made it worse — it was already weird before. Both this auth check and the check against the silo scope are fundamentally about the use of this function for retrieving an IP pool in order to allocate IPs from it. Despite the name, this isn't a general-purpose function at all. It only has one non-test call site:

/// Create an Ephemeral IP address for an instance.
pub async fn allocate_instance_ephemeral_ip(
&self,
opctx: &OpContext,
ip_id: Uuid,
instance_id: Uuid,
pool_name: Option<Name>,
) -> CreateResult<ExternalIp> {
// If we have a pool name, look up the pool by name and return it
// as long as its scopes don't conflict with the current scope.
// Otherwise, not found.
let pool = match pool_name {
Some(name) => self.ip_pools_fetch(&opctx, &name).await?,
// If no name given, use the default logic
None => self.ip_pools_fetch_default(&opctx).await?,
};
let pool_id = pool.identity.id;
let data =
IncompleteExternalIp::for_ephemeral(ip_id, instance_id, pool_id);
self.allocate_external_ip(opctx, data).await
}

I'm thinking the thing to do is inline this logic at that call site. The ip_pools_fetch_default function involves a DB query, so it makes sense as a datastore function, plus "default" implies a "relative to what?" that I think makes the silo ID check make more sense. Plus the query needs to know about the silo ID in order to work.

return Err(authz_pool.not_found());

// You can't look up a pool by name if it conflicts with your current
// scope, i.e., if it has a silo it is different from your current silo
Contributor:

Suggested change:
- // scope, i.e., if it has a silo it is different from your current silo
+ // scope, i.e., if it has a silo that is different from your current silo

&opctx,
IpPool::new(&identity, Some(silo_id), /*default= */ false),
)
.await;
Contributor:

check this call didn't fail?

};
let _ = datastore
.ip_pool_create(&opctx, IpPool::new(&identity, Some(silo_id), true))
.await;
Contributor:

check for failure here too

Comment on lines +291 to +293
assert!(created_pool.silo_id.is_some());

let silo_id = created_pool.silo_id.unwrap();
Contributor:

nit: you can just drop that assert since the unwrap would panic anyway; use .expect if you want a nice message.

Comment on lines +15 to +18
DROP COLUMN IF EXISTS project_id,
ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE,
DROP CONSTRAINT IF EXISTS project_implies_silo,
DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;
Contributor:

Is the ordering important here? I.e., is it OK for the DROP COLUMN to come before the DROP CONSTRAINTs?

@david-crespo (author):

The test that applies the migrations and makes sure they produce the right end result didn’t complain, so I think it’s ok.

Comment on lines +22 to +23
-- needs to be in its own transaction because of this thrilling bug
-- https://github.com/cockroachdb/cockroach/issues/83593
Contributor:

would a nested transaction also work? that would get rid of the duplicate schema version checks

david-crespo added a commit that referenced this pull request Sep 1, 2023
Followup to #3985, closes #4005.

- Add `is_default` to IP pool response
- Inline `ip_pools_fetch` into the one callsite and delete it (per
discussion
#3985 (comment))
- Other small test tweaks suggested by @luqmana
david-crespo added a commit that referenced this pull request Jan 5, 2024
Closes #2148
Closes #4002
Closes #4003 
Closes #4006

## Background

#3985 (and followups #3998 and #4007) made it possible to associate an
IP pool with a silo so that instances created in that silo would get
their ephemeral IPs from said pool by default (i.e., without the user
having to say anything other than "I want an ephemeral IP"). An IP pool
associated with a silo was not accessible for ephemeral IP allocation
from other silos — if a disallowed pool was specified by name at
instance create time, the request would 404.

However! That was the quick version, and the data model left much to be
desired. The relation was modeled by adding a nullable `silo_id` and
sort-of-not-really-nullable `is_default` column directly on the IP pool
table, which has the following limitations (and there are probably
more):

* A given IP pool could only be associated with at most one silo, could
not be shared
* The concept of `default` was treated as a property of the pool itself,
rather than a property of the _association_ with another resource, which
is quite strange. Even if you could associate the pool with multiple
silos, you could not have it be the default for one and not for the
other
* There is no way to create an IP pool without associating it with
either the fleet or a silo
* Extending this model to allow association at the project level would
be inelegant — we'd have to add a `project_id` column (which I did in
#3981 before removing it in #3985)

More broadly (and vaguely), the idea of an IP pool "knowing" about silos
or projects doesn't really make sense. Entities aren't really supposed
to know about each other unless they have a parent-child relationship.

## Changes in this PR

### No such thing as fleet-scoped pool, only silo

Thanks to @zephraph for encouraging me to make this change. It is
dramatically easier to explain "link silo to IP pool" than it is to
explain "link resource (fleet or silo) to IP pool". The way to recreate
the behavior of a single default pool for the fleet is to simply
associate a pool with all silos. Data migrations ensure that existing
fleet-scoped pools will be associated with all silos. There can only be
one default pool for a silo, so in the rare case where pool A is a fleet
default and pool B is default on silo S, we associate both A and B with
S, but only B is made the silo default pool.

### API

These endpoints are added. They're pretty self-explanatory.

```
ip_pool_silo_link                        POST     /v1/system/ip-pools/{pool}/silos
ip_pool_silo_list                        GET      /v1/system/ip-pools/{pool}/silos
ip_pool_silo_unlink                      DELETE   /v1/system/ip-pools/{pool}/silos/{silo}
ip_pool_silo_update                      PUT      /v1/system/ip-pools/{pool}/silos/{silo}
```

The `silo_id` and `is_default` fields are removed from the `IpPool`
response as they are now a property of the `IpPoolLink`, not the pool
itself.

I also fixed the silo-scoped IP pools list (`/v1/ip-pools`) and fetch
(`/v1/ip-pools/{pool}`) endpoints, which a) did not actually filter for
the current silo, allowing any user to fetch any pool, and b) took a
spurious `project` query param that didn't do anything.

### DB

The association between IP pools and fleet or silo (or eventually
projects, but not here) is now modeled through a polymorphic join table
called `ip_pool_resource`:

ip_pool_id | resource_type | resource_id | is_default
-- | -- | -- | --
123 | silo | 23 | true
123 | silo | 4 | false
~~65~~ | ~~fleet~~ | ~~FLEET_ID~~ | ~~true~~

Now, instead of setting the association with a silo or fleet at IP pool
create or update time, there are separate endpoints for adding and
removing an association. A pool can be associated with any number of
resources, but a unique index ensures that a given resource can only
have one default pool.
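
A sketch of what that join table and its uniqueness guarantee could look like (column types, names, and the text form of `resource_type` are illustrative rather than the exact schema):

```sql
-- Illustrative sketch of the polymorphic join table described above.
CREATE TABLE omicron.public.ip_pool_resource (
    ip_pool_id UUID NOT NULL,
    resource_type TEXT NOT NULL,   -- currently always 'silo'
    resource_id UUID NOT NULL,     -- the linked silo's ID
    is_default BOOL NOT NULL,
    PRIMARY KEY (ip_pool_id, resource_type, resource_id)
);

-- A given resource (e.g. a silo) can have at most one default pool.
CREATE UNIQUE INDEX one_default_pool_per_resource
    ON omicron.public.ip_pool_resource (resource_type, resource_id)
    WHERE is_default = true;
```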

### Default IP pool logic

If an instance ephemeral IP or a floating IP is created **with a pool
specified**, we simply use that pool if it exists and is linked to the
user's silo.

If an instance ephemeral IP or a floating IP is created **without a pool
specified**, we look for a default pool for the current silo. If there
is a pool linked with the current silo with `is_default=true`, use that.
Otherwise, there is no default pool for the given scope and IP
allocation will fail, which means the instance create or floating IP
create request will fail.

The difference introduced in this PR is that we do not fall back to
fleet default if there is no silo default because we have removed the
concept of a fleet-scoped pool.
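
In terms of that join table, the silo-default lookup amounts to something like the following sketch (not the actual query; `$1` stands in for the current silo's ID):

```sql
-- Sketch: find the default pool linked to the current silo, if any.
-- If this returns no rows, ephemeral/floating IP allocation fails.
SELECT p.*
FROM omicron.public.ip_pool AS p
JOIN omicron.public.ip_pool_resource AS r ON r.ip_pool_id = p.id
WHERE r.resource_type = 'silo'
  AND r.resource_id = $1
  AND r.is_default = true
  AND p.time_deleted IS NULL;
```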

### Tests and test helpers

This is the source of a lot of noise in this PR. Because there can no
longer be a fleet default pool, we can no longer rely on that for tests.
The test setup was really confusing. We assumed a default IP pool
existed, but we still had to populate it (add a range) if we had to do
anything with it. Now, we don't assume it exists, we create it and add a
range and associate it with a silo all in one helper.

## What do customers have to do when they upgrade?

They should not _have_ to do anything at upgrade time.

If they were relying on a single fleet default pool to automatically be
used by new silos, when they create silos in the future they will have
to manually associate each new silo with the desired pool. We are
working on ways to make that easier or more automatic, but that's not in
this change. It is less urgent because silo creation is an infrequent
operation.

If they are _not_ using the previously fleet default IP pool named
`default` and do not want it to exist, they can simply delete any IP
ranges it contains, unlink it from all silos and delete it. If they are
not using it, there should not be any IPs allocated from it (which means
they can delete it).

---------

Co-authored-by: Justin Bennett <git@just-be.dev>