Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testnet Deployment #326

Closed
wants to merge 33 commits into from
Closed

Testnet Deployment #326

wants to merge 33 commits into from

Conversation

joshuakarp
Copy link
Contributor

@joshuakarp joshuakarp commented Feb 8, 2022

Description

Once #310 has been merged, we'll finally be able to move onto the deployment of our testnet into AWS.

I foresee this to be achievable in a few stages:

  1. Deploy Polykey onto AWS and ensure we have a working remote keynode (Testnet Node Deployment (testnet.polykey.io) #194)
  2. Create the testnet seed nodes and store these securely somewhere (Testnet securely maintain a pool of recovery codes #285)
  3. Use these seed nodes for our domain name (Update testnet.polykey.io to point to the list of IPs running seed keynodes #177)

Issues Fixed

Testnet deployment:

NodeGraph Structure:

Node Adding Policies:

Node Removal Policies:

Tasks

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@joshuakarp
Copy link
Contributor Author

I've added a test file for the validation utils (strictly unit tests). Obviously for the more simple validators, it's probably not necessary, but there were a few cases with parseSeedNodes I wanted to verify.

@joshuakarp
Copy link
Contributor Author

I've stupidly rebased, undoing the changes from this PR, and force-pushed, which caused the PR to close. Gonna go through reflog to try to undo and reopen the PR.

@CMCDragonkai CMCDragonkai reopened this Feb 14, 2022
@CMCDragonkai CMCDragonkai self-assigned this Feb 14, 2022
@CMCDragonkai CMCDragonkai added development Standard development and removed development Standard development labels Feb 14, 2022
@joshuakarp
Copy link
Contributor Author

Rebased on master (all changes from #310 and #320).

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 15, 2022

While deploying the HTTP-Demo we learned that if you set an Entrypoint in the docker config, this will will be the prefix to any CMD later set in the Task Definition.

This means since our Entrypoint in HTTP-Demo is /bin/http-mode, the CMD must be 80 so the full command becomes /bin/http-demo 80.

Setting the entrypoint is preferable if our container has only 1 executable. In the case of PK, this is just /bin/polykey, which means we should swap to using Entrypoint instead of Cmd in our release.nix. Then when we do any docker run polykey:latest ... the ... would be parameters appended to /bin/polykey like docker run polykey:latest vaults list...

The only reason to use Cmd would be if there are multiple possible commands, and you want to have a "default" command when you do docker run polykey:latest with no additional parameters. Then any parameters would override the Cmd.

Note that Task Definitions also allow overriding the Entrypoint by setting it explicitly in the TD.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 15, 2022

Additionally we created a new security group just for testing, it's called the DMZ security group. It uses the default VPC, and it allows ALL ipv4 and ipv6 inbound and outbound.

The deployed container requires several resources:

  1. Container repository
  2. Container image in the container repository
  3. Task definition
  4. Cluster
  5. Cluster service
  6. Cluster service runs the task

The end result assuming everything is setup, is that the container is assigned a public IP address that we can access. This is a bit hard to find, but it is in cluster > service > task > network details.

Each public IP here provided by an ENI. It is automatically created by AWS in this case.

@CMCDragonkai
Copy link
Member

Because we have created the http-demo, I think it's a better educational role than the existing socat-echo-server. Going to delete that task definition, but keeping the JSON here for posterity:

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::015248367786:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/socat-echo-server",
          "awslogs-region": "ap-southeast-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": [
        "sh",
        "-c"
      ],
      "portMappings": [
        {
          "hostPort": 1314,
          "protocol": "tcp",
          "containerPort": 1314
        },
        {
          "hostPort": 1315,
          "protocol": "tcp",
          "containerPort": 1315
        },
        {
          "hostPort": 1316,
          "protocol": "udp",
          "containerPort": 1316
        }
      ],
      "command": [
        "socat SYSTEM:\"echo hello; cat\" TCP4-LISTEN:1314`printf \"\\x2c\"`FORK & socat PIPE TCP4-LISTEN:1315`printf \"\\x2c\"`FORK & socat PIPE UDP4-LISTEN:1316`printf \"\\x2c\"`FORK"
      ],
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "alpine/socat:latest",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "socat-echo-server"
    }
  ],
  "placementConstraints": [],
  "memory": "512",
  "taskRoleArn": "arn:aws:iam::015248367786:role/ecsTaskExecutionRole",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:ap-southeast-2:015248367786:task-definition/socat-echo-server:6",
  "family": "socat-echo-server",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "runtimePlatform": null,
  "cpu": "256",
  "revision": 6,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

@CMCDragonkai
Copy link
Member

So now we should be able to setup PK without any of the bells and whistles similar to http-demo.

  1. Build PK
  2. Check that pk agent can run in the foreground and take note of the ENV variables used to bind the ports
  3. Do the port mapping similar to http-demo
  4. Deploy on to ECS
  5. Use telnet to check if the TCP works
  6. Then we can try a GRPC call

@joshuakarp
Copy link
Contributor Author

Success! We have a running Polykey instance on AWS, with a successful unattended bootstrap to create all of its node state.

There were a few things we've done to get this working:

  • The Deployment section in the js-polykey readme is completely sufficient to publish a Polykey image: https://github.com/MatrixAI/js-polykey#deployment (I had to use sudo for the docker commands, but this is likely something that can be fixed on your own end)
  • Port mappings
    • 1314 with TCP (for connecting to client)
    • 1315 with UDP (for connections between keynodes)
  • Environment variable setup (in task definition). We've set these as follows:
    • PK_CLIENT_HOST: 0.0.0.0
    • PK_CLIENT_PORT: 1315
    • PK_INGRESS_HOST: 0.0.0.0
    • PK_INGRESS_PORT: 1314
    • PK_PASSWORD: set as some default password (for unattended bootstrapping)
    • PK_SEED_NODES: '' (empty string - explicitly set no seed nodes to connect to)
    • PK_NODE_PATH: /srv (a directory from the root directory to store Polykey state - can't use the default home directory, as there's no notion of this on the remote)
    • from the logger output, the hosts and ports have been set correctly
  • The original mounted volume on the Polykey task definition was removed (see polykey task definition revision 4 for the original)
    • in the future, we can look into mounting a volume on the /srv directory though (look into an EBS volume, as opposed to an EFS volume)
    • although, there was some discussion of whether we even required any persistence for a seed node at all. All we want to be kept persistent is the node ID, and we can simply do so by using the recovery code to generate a deterministic keypair each time.

@joshuakarp
Copy link
Contributor Author

joshuakarp commented Feb 15, 2022

Success!

Was able to contact our spun up remote keynode on AWS, by setting the client-host (the public IP displayed on AWS), client-port (our specified client port, 1315) and node-id (the node ID of our remote keynode):

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm run polykey -- agent status --verbose --client-host='3.26.131.213' --client-port=1315 --node-id='vp6k0j73e31tm57q8pdk34j567ljr4c86nj71f8l67ukj4gul53ig'

> @matrixai/polykey@1.0.0 polykey /home/josh/Documents/MatrixAI/js-polykey
> ts-node --require tsconfig-paths/register --compiler typescript-cached-transpile --transpile-only src/bin/polykey.ts "agent" "status" "--verbose" "--client-host=3.26.131.213" "--client-port=1315" "--node-id=vp6k0j73e31tm57q8pdk34j567ljr4c86nj71f8l67ukj4gul53ig"

INFO:PolykeyClient:Creating PolykeyClient
INFO:Session:Creating Session
INFO:Session:Setting session token path to /home/josh/.local/share/polykey/token
INFO:Session:Starting Session
INFO:Session:Started Session
INFO:Session:Created Session
INFO:GRPCClientClient:Creating GRPCClientClient connecting to 3.26.131.213:1315
INFO:GRPCClientClient:Created GRPCClientClient connecting to 3.26.131.213:1315
INFO:PolykeyClient:Starting PolykeyClient
INFO:PolykeyClient:Started PolykeyClient
INFO:PolykeyClient:Created PolykeyClient
✔ Please enter the password … ******
INFO:PolykeyClient:Stopping PolykeyClient
INFO:GRPCClientClient:Destroying GRPCClientClient connected to 3.26.131.213:1315
INFO:GRPCClientClient:Destroyed GRPCClientClient connected to 3.26.131.213:1315
INFO:Session:Stopping Session
INFO:Session:Stopped Session
INFO:PolykeyClient:Stopped PolykeyClient
status	LIVE
pid	1
nodeId	vp6k0j73e31tm57q8pdk34j567ljr4c86nj71f8l67ukj4gul53ig
clientHost	0.0.0.0
clientPort	1315
ingressHost	0.0.0.0
ingressPort	1314
egressHost	0.0.0.0
egressPort	38626
agentHost	127.0.0.1
agentPort	40237
proxyHost	127.0.0.1
proxyPort	43961
rootPublicKeyPem	-----BEGIN PUBLIC KEY-----
	MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAmi8SpzqyHM0Kwf1ew7qV
	...
	-----END PUBLIC KEY-----
rootCertPem	-----BEGIN CERTIFICATE-----
	MIIIIDCCBgigAwIBAgIFFkSJczgwDQYJKoZIhvcNAQELBQAwQDE+MDwGA1UEAxM1
        ...
	-----END CERTIFICATE-----
rootCertChainPem	-----BEGIN CERTIFICATE-----
	MIIIIDCCBgigAwIBAgIFFkSJczgwDQYJKoZIhvcNAQELBQAwQDE+MDwGA1UEAxM1
        ...
	-----END CERTIFICATE-----

Initially I was having troubles with this because I had forgotten to include the node ID in the CLI arguments. Without remembering that we needed a node ID as part of the verification step, it was tricky to debug because there was no error thrown (even despite supplying the client host and client port):

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm run polykey -- agent status --verbose --client-host='3.26.131.213' --client-port=1315         

> @matrixai/polykey@1.0.0 polykey /home/josh/Documents/MatrixAI/js-polykey
> ts-node --require tsconfig-paths/register --compiler typescript-cached-transpile --transpile-only src/bin/polykey.ts "agent" "status" "--verbose" "--client-host=3.26.131.213" "--client-port=1315"

status	DEAD

@CMCDragonkai
Copy link
Member

Yep confirmed it is working as well, plus the session token gets saved in in your $HOME/.local/share/polykey/token and is re-used when we call it again.

@CMCDragonkai
Copy link
Member

So the problem is that if you specify --client-host and --client-port but without --node-id, I believe the code ends up trying to use the "default" --node-id that comes from the status.json that should be in your local PK_NODE_PATH. Now in my case all I have is:

»» ~/.local/share/polykey
 ♖ tree .                                                                                                         ⚡(master) pts/0 17:03:48
.
└── token

Which means when I run:

> npm run polykey -- agent status --verbose --client-host='3.26.131.213' --client-port=1315
ErrorCLIStatusMissing: Could not resolve nodeId, clientHost or clientPort from non-existent Status

I actually get an exception, but this is because it's complaining that it cannot "resolve" the nodeId. Cause there's no parameter and no status file.

So if there is a status file, and we try to use it from there, we should get a proper error instead of just:

status	DEAD

@joshuakarp
Copy link
Contributor Author

Confirmed the above was the case. After deleting my local keynode state, I was able to reproduce the expected error that @CMCDragonkai received when not specifying the node ID.

@CMCDragonkai
Copy link
Member

I'm going to change the agent start command so we can see the a dictionary output. That is, it can return a similar information to agent status. This is useful since we would like to know the node id immediately after starting.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 15, 2022

So there are 2 problems with agent status.

The first is due to processClientStatus if the nodeId is not supplied it proceeds to try to read the status. Since no status may exist, we get an error about no status being available.
https://github.com/MatrixAI/js-polykey/blob/master/src/bin/utils/processors.ts#L282-L287

This is a problem because we want the user to understand why we wanted to read the status if one of the parameters is not supplied. So to solve this we need an exception message saying that we read the status because of missing parameter which can be one or more of --node-id, --client-host and --client-port.

These 3 parameters should be especially noted as the parameters controlling connection to an agent.

The second problem is that when the status does exist, it is read but only missing parameters are filled with status values. Somehow this leads saying that the status is dead. Instead what should happen is that it should try contacting the agent with potentially the wrong node id.

 ​  ​if​ ​(statusInfo.status​ ​===​ ​'LIVE')​ ​{ 
 ​    ​if​ ​(nodeId​ ​==​ ​null)​ ​nodeId​ ​=​ ​statusInfo.data.nodeId; 
 ​    ​if​ ​(clientHost​ ​==​ ​null)​ ​clientHost​ ​=​ ​statusInfo.data.clientHost; 
 ​    ​if​ ​(clientPort​ ​==​ ​null)​ ​clientPort​ ​=​ ​statusInfo.data.clientPort; 
 ​    ​return​ ​{ 
 ​      statusInfo​, 
 ​      status​, 
 ​      nodeId​, 
 ​      clientHost​, 
 ​      clientPort​, 
 ​    ​}; 
 ​  ​}

It appears this is because it only overrides the parameters if it is LIVE. So if the local agent is dead, then nothing is overridden and whatever the local agent's status is, is returned.

@CMCDragonkai
Copy link
Member

First change we can do is to override the parameters even if it is not LIVE. However that's not enough to solve the problem.

After this the CommandStart will check if it is LIVE before contacting the agent. If it is not LIVE it just returns what was given back by the processClientStatus.

What we could do is that if any of the parameters are set, we can override with what's read in the status and then return with undefined to indicate a remote status. That would mean we always contact the remote node if any of the --node-id or --client-host, --client-port is set.

Here are the cases:

  1. If local agent is LIVE, it has the status. If no parameters then we use status.
  2. If local agent is LIVE, it has the status. If all parameters we contact.
  3. If local agent is LIVE, it has the status. If some parameters, we default with status and we contact but only if our parameters is different. Otherwise we use status.
  4. If local agent is DEAD. If no parameters then we use status.
  5. If local agent is DEAD. If all parameters we contact.
  6. If local agent is DEAD. If some parameters, we default we can from the status and contact but only if our parameters are different. Otherwise we use status.

If we cannot contact then we have to indicate a client connection exception or a status exception but provide reasoning as to why. Part of it is that no matter what, we couldn't fulfill all 3 required parameters. This should only occur in case 6. And in case 3 if non-LIVE status (or bug in our program).

@CMCDragonkai
Copy link
Member

I'm going to change the agent start command so we can see the a dictionary output. That is, it can return a similar information to agent status. This is useful since we would like to know the node id immediately after starting.

To solve this, we can make use of the StatusLive['data'] type.

Right now this is:

type StatusLive = {
  status: 'LIVE';
  data: {
    pid: number;
    nodeId: NodeId;
    clientHost: Host;
    clientPort: Port;
    ingressHost: Host;
    ingressPort: Port;
    [key: string]: any;
  };
};

Remember the output of the agent status is not StatusLive but instead StatusLive['data'] with status key.

            data: {
              status: 'LIVE',
              pid,
              nodeId,
              clientHost,
              clientPort,
              ingressHost,
              ingressPort,
              egressHost,
              egressPort,
              agentHost,
              agentPort,
              proxyHost,
              proxyPort,
              rootPublicKeyPem,
              rootCertPem,
              rootCertChainPem,
            },

We can say both should output this.

This means we can add the additional properties here to tour StatusLive['data'].

Note that the hosts and ports do not change during the lifetime of the program, but when the key pair changes, that affects the rootPublicKeyPem, rootCertPem, and rootCertChainPem. This is facilitated by the event bus.

@CMCDragonkai
Copy link
Member

The RootKeyPairChangeData so far only has the new TLS config and the nodeId, it will need to be expanded with additional data:

    const data: RootKeyPairChangeData = {
      nodeId: this.getNodeId(),
      tlsConfig: {
        keyPrivatePem: this.getRootKeyPairPem().privateKey,
        certChainPem: await this.getRootCertChainPem(),
      },
    };
    await this.rootKeyPairChange(data);

However this is only used in:

  • renewRootKeyPair
  • resetRootKeyPair

However the certificate itself may be changed in resetRootCert. This means we need to expand these event data. The key is the same, but the certificate may be different (and the cert chain is also broken when this occurs).

Right now the tlsConfig property is based on the root key pair and root cert chain. KeyManager shouldn't need to know how this is structured, so instead we can hand over some generic data like the nodeId, rootKeypairPem, and rootCertChainPem.

So the RootKeyPairChangeData can instead be:

type RootKeyPairChangeData = {
  nodeId: NodeId;
  rootKeyPair: KeyPair;
  rootCert: Certificate;
  rootCertChain: Array<CertificatePem>;
  recoveryCode: RecoveryCode;
}

Note that if the keypair has changed, then the recovery code changes right? So we need to add this in as well!! @emmacasolin

I would also change the name of this to be a callback that is called whenever there's a change to the KeyManager. Right now with:

const eventRootKeyPairChange = Symbol('rootKeyPairChange');

We can use const eventKeyManager = Symbol('KeyManager'), meaning each domain class/manager may have a event symbol. This would mean each domain can be it's own topic in our eventbus. But the KeyManager never actually uses this, it just uses a callback. So I think the PolykeyAgent can instead orchestrate this together.

@CMCDragonkai
Copy link
Member

@joshuakarp when the await this.nodeManager.refreshBuckets(); is executed, it says it must be done upon key renewal.

However how does it get access to the new key? Is it calling the KeyManager internally? For something like this, it'd make sense to pass the the new key if it has to use it. Also is there any downsides to this? Seems like something to incorporate into #317.

@CMCDragonkai
Copy link
Member

It may be better to just pay the cost to start the generalisation process now then. At least for the setNode and syncNodeGraph.

@emmacasolin
Copy link
Contributor

emmacasolin commented Apr 20, 2022

It may be better to just pay the cost to start the generalisation process now then. At least for the setNode and syncNodeGraph.

A problem I can see with doing this is that NodeManager/NodeConnectionManager will need to be able to call methods from the queue class (to add operations to the queue), but the queue will also need to call methods from NodeManager/NodeConnectionManager (e.g. setNode, pingNode etc.). So we have a problem where the queue is dependent on NodeManager/NodeConnectionManager and these are dependent on the queue, which is a circular dependency.

I remember running into a similar problem when I was working on the Discovery Queue, which is one of the reasons why that queue ended up being entirely contained within the Discovery domain rather than being a separate class.


Actually, if the queue class gets passed in during start (like NodeManager for NodeConnectionManager) it should be fine.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Apr 20, 2022

This should easily be resolved by always passing in arrow functions. Arrow functions will encapsulate the this context, so any asynchronous tasks should be self-contained enough, that the queue does not need to know about any other domains.

So imagine:

queue.submit(async () => {
  // I am self contained
  console.log(this);
});

eventBus.on('event', async () => {
  // I am self contained
  console.log(this);
});

`NodeManager.setNode` and `NodeConnectionManager.syncNodeGraph` now utilise a single, shared queue to asynchronously add nodes to the node graph without blocking the main loop. These methods are both blocking by default but can be made non-blocking by setting the `block` parameter to false.
#322
@emmacasolin
Copy link
Contributor

The generic queue class src/nodes/Queue.ts has been implemented and is currently used by NodeManager.setNode and NodeConnectionManager.syncNodeGraph. Arrow functions can be pushed onto the queue and they will be executed asynchronously in the background. The queue at the moment is not persistent, however, this can be changed along with the addition of other features needed by queues in other domains (e.g. priority tags for the Discovery queue #329).

@tegefaulkes
Copy link
Contributor

What is happening to these?

  • tests/utils.ts:setupGlobalAgent
  • NodeManager.refreshBuckets

I currently have them stubbed out. They were commented out but I don't have the story on why. Currently they make up about 2/3rds of the test errors.

@tegefaulkes
Copy link
Contributor

Tests are passing in CICD now.

Just a small refactor. I've renamed some methods since queueStart and queuePush is unnecessarily verbose when Queue is its own class now. Also simplified some logic using the `promise` utility.
This contains fixes for failing tests as well as fixes for tests failing to exit when finished.
This checks if we await things that are not promises. This is not a problem per se, but we generally don't want to await random things.
…es when entering the network

This tests for if the Seed node contains the new nodes when they are created. It also checks if the new nodes discover each other after being created.

Includes a change to `findNode`. It will no longer throw an error when failing to find the node. This will have to be thrown by the caller now. This was required by `refreshBucket` since it's very likely that we can't find the random node it is looking for.
@tegefaulkes
Copy link
Contributor

Some tests are failing randomly. Need to look out for them and take note.

@CMCDragonkai
Copy link
Member

Are they the usual suspects:

  1. Nodes
  2. Vaults

?

Or are the timing problems like key generation taking too long?

@tegefaulkes
Copy link
Contributor

I don't have much details on it right now. I've seen gestalts failing from time to time but also some other domains are failing in CICD that I have to re-trigger. I'll have to look deeper into it when I get a chance.

@tegefaulkes
Copy link
Contributor

This PR has been migrated to #378.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment