Add TTL for agents and advertised queues #294

Merged: 2 commits, Jul 12, 2024
2 changes: 2 additions & 0 deletions agent/testflinger_agent/__init__.py
@@ -133,6 +133,8 @@ def start_agent():
 )
 while agent.check_offline():
 time.sleep(check_interval)
+# Refresh the updated_at timestamp on advertised queues
+client.post_advertised_queues()
 logger.info("Checking jobs")
 agent.process_jobs()
 logger.info("Sleeping for {}".format(check_interval))
25 changes: 24 additions & 1 deletion docs/explanation/agents.rst
@@ -8,4 +8,27 @@ it knows how to run jobs. When it is not running a job, the agent:
 * Asks the server for a job to run from the list of configured :doc:`queues <queues>`
 * Dispatches the device connector to execute each :doc:`phase <../reference/test-phases>` of the job
 * Reports the results of the job back to the server
-* Uploads artifacts (if any) saved from the job to the server
+* Uploads artifacts (if any) saved from the job to the server
+
+You can see a list of agents in the Testflinger web interface by clicking on the
+"Agents" link in the top navigation bar.
+
+Communication with the Server
+-----------------------------
+
+The agent communicates with the server using a REST API. The agent polls the
+server for jobs to run at a configurable interval. When a job is found, the agent
+downloads the job and any associated artifacts and begins running the job. When
+the job is complete, the agent uploads the results and any artifacts to the server.
+
+The server does not push jobs to the agent, and never needs to initiate a connection
+to the agent. This makes it easy to run agents behind firewalls or in other
+network configurations where the agent cannot be directly reached by the server.
+However, it also means that the server has no way of knowing if an agent has gone
+away forever if it stops checking in. If this happens, the server will continue to
+show the agent in the "Agents" list, but it's important to pay attention to the
+timestamp for when the agent was last updated. This timestamp will continue to
+be updated even if the agent is offline as long as the agent is still running and
+able to communicate with the server. If an agent has not checked in after 7 days,
+it will automatically be removed from the database and will no longer appear in
+the "Agents" list.
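The pull model described in the added documentation can be sketched in a few lines. This is a minimal illustration, not Testflinger's implementation: `FakeServer`, `request_job`, and `poll_once` are hypothetical names, an in-memory object stands in for the REST API, and treating a job request as a check-in is an assumption made for the sketch. What it shows is that every interaction starts from the agent, so the server only learns an agent is alive when that agent calls in.

```python
from collections import deque
from datetime import datetime, timezone


class FakeServer:
    """Hypothetical in-memory stand-in for the server's REST API."""

    def __init__(self):
        self.jobs = {}    # queue name -> deque of pending jobs
        self.agents = {}  # agent name -> time of last check-in

    def submit(self, queue, job):
        self.jobs.setdefault(queue, deque()).append(job)

    def request_job(self, agent_name, queues, now):
        # Assumption for this sketch: a job request also counts as a check-in.
        self.agents[agent_name] = now
        for queue in queues:
            pending = self.jobs.get(queue)
            if pending:
                return pending.popleft()
        return None


def poll_once(server, agent_name, queues, now=None):
    """One polling cycle: the agent asks, the server only answers."""
    now = now or datetime.now(timezone.utc)
    return server.request_job(agent_name, queues, now)


server = FakeServer()
server.submit("rpi4", {"job_id": "example-1"})
first = poll_once(server, "agent-1", ["rpi4"])   # picks up the pending job
second = poll_once(server, "agent-1", ["rpi4"])  # nothing left to run
```

Because `FakeServer` never holds a reference to the agent, an agent that stops polling simply leaves a timestamp that stops moving, which is the situation the 7-day cleanup addresses.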
20 changes: 19 additions & 1 deletion docs/explanation/queues.rst
@@ -12,4 +12,22 @@ Trying to force the scheduling of jobs to :doc:`agents <agents>` would require
 the server to maintain state of all :doc:`agents <agents>` at all times, and be
 the arbiter of the entire process. Instead, the :doc:`agents <agents>` can
 operate autonomously, and maintain their own lifecycle. The
-:doc:`agents <agents>` ask for a job when they are available to run one.
+:doc:`agents <agents>` ask for a job when they are available to run one.
+
+Advertised Queues
+-----------------
+
+Advertised queues can be configured for an agent to expose certain "well-known"
+queues along with descriptions and images that are known to work with them. These
+queues can be seen from the CLI by running the `list-queues` command.
+It's important to know that this is not an exhaustive list of all queues that can
+be used, just the ones that have been intentionally advertised in order to add
+a description. Clicking on the "Queues" link at the top of the web UI will show
+both the advertised queues as well as the normal ones, and only the advertised ones
+will have descriptions.
+
+Because the advertised queues are declared in the agent configuration, there is no
+way for the server to know if they are gone forever if an agent goes away. If an
+advertised queue is not updated by the agents for more than 7 days, then it will
+disappear from the list of queues to make it easier to find the ones that are
+still actively being used by agents that are online.
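One way to picture the refresh that keeps an advertised queue alive: each time an agent re-posts its advertised queues, the stored timestamp moves forward, while a queue that no agent re-posts keeps its old timestamp. The sketch below uses a plain dictionary in place of the database. `advertise` is a hypothetical helper, and the queue names and descriptions are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def advertise(store, queue_dict, now):
    """Create or update each advertised queue and stamp it with `now`."""
    for name, description in queue_dict.items():
        store[name] = {"description": description, "updated_at": now}


store = {}
start = datetime(2024, 7, 1, tzinfo=timezone.utc)

# Two agents advertise one queue each on day 0.
advertise(store, {"rpi4": "Raspberry Pi 4", "nuc": "Intel NUC"}, start)

# Five days later only the agent behind "rpi4" is still running.
later = start + timedelta(days=5)
advertise(store, {"rpi4": "Raspberry Pi 4"}, later)

age_of_nuc = later - store["nuc"]["updated_at"]  # keeps growing until expiry
```

In this picture the "nuc" entry is never explicitly deleted by an agent; it just ages until it crosses the 7-day threshold.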
5 changes: 3 additions & 2 deletions server/src/api/v1.py
@@ -18,7 +18,7 @@
 """

 import uuid
-from datetime import datetime
+from datetime import datetime, timezone

 import pkg_resources
 from apiflask import APIBlueprint, abort
@@ -405,10 +405,11 @@ def queues_post():
     the user can check which queues are valid to use.
     """
     queue_dict = request.get_json()
+    timestamp = datetime.now(timezone.utc)
     for queue, description in queue_dict.items():
         database.mongo.db.queues.update_one(
             {"name": queue},
-            {"$set": {"description": description}},
+            {"$set": {"description": description, "updated_at": timestamp}},
             upsert=True,
         )
     return "OK"
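The new `datetime.now(timezone.utc)` call gives the stored timestamp an explicit UTC offset. A short standard-library illustration of the difference between aware and naive values follows; `can_compare` is a hypothetical helper and the dates are arbitrary.

```python
from datetime import datetime, timezone

aware = datetime.now(timezone.utc)    # carries an explicit UTC offset
naive = datetime(2024, 7, 12, 12, 0)  # no tzinfo attached


def can_compare(a, b):
    """Return True if the two datetimes can be ordered against each other."""
    try:
        a < b
    except TypeError:
        return False
    return True


mixed_ok = can_compare(aware, naive)
aware_ok = can_compare(aware, datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Python refuses to order an aware value against a naive one, so using aware UTC timestamps consistently avoids that class of error when timestamps are later compared.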
10 changes: 10 additions & 0 deletions server/src/database.py
@@ -93,6 +93,16 @@ def create_indexes():
         "uploadDate", expireAfterSeconds=DEFAULT_EXPIRATION
     )

+    # Remove agents that haven't checked in for 7 days
+    mongo.db.agents.create_index(
+        "updated_at", expireAfterSeconds=DEFAULT_EXPIRATION
+    )
+
+    # Remove advertised queues that haven't updated in over 7 days
+    mongo.db.queues.create_index(
+        "updated_at", expireAfterSeconds=DEFAULT_EXPIRATION
+    )
+
     # Faster lookups for common queries
     mongo.db.jobs.create_index("job_id")
     mongo.db.jobs.create_index(["result_data.job_state", "job_data.job_queue"])
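For readers unfamiliar with TTL indexes, the plain-Python sketch below approximates what MongoDB does with an index created using `expireAfterSeconds`: a background task deletes documents whose indexed date is older than the threshold. It is a simulation only. `ttl_sweep` is a hypothetical function, and the value 604800 is an assumption derived from the "7 days" in the comments, since the definition of `DEFAULT_EXPIRATION` is not shown in this diff.

```python
from datetime import datetime, timedelta, timezone

SEVEN_DAYS = 60 * 60 * 24 * 7  # assumed value; DEFAULT_EXPIRATION is not shown


def ttl_sweep(docs, field, expire_after_seconds, now):
    """Return the documents that a TTL pass would keep."""
    cutoff = now - timedelta(seconds=expire_after_seconds)
    kept = []
    for doc in docs:
        value = doc.get(field)
        # Documents without a date in the indexed field never expire.
        if not isinstance(value, datetime) or value >= cutoff:
            kept.append(doc)
    return kept


now = datetime(2024, 7, 12, tzinfo=timezone.utc)
docs = [
    {"name": "stale", "updated_at": now - timedelta(days=8)},
    {"name": "fresh", "updated_at": now - timedelta(days=1)},
    {"name": "no-timestamp"},
]
kept = [doc["name"] for doc in ttl_sweep(docs, "updated_at", SEVEN_DAYS, now)]
```

Two details of the real mechanism are worth keeping in mind: MongoDB's TTL task runs roughly once a minute, so an expired document can linger briefly, and documents that lack a date value in the indexed field are never removed.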