New borges design #389
Prototype

The prototype is similar to
It uses go-borges in transactional mode. Crashing the downloading process was tested and the files recovered correctly. Small repositories, or repos without forks and with only one rooted repo, do not see great improvements, as there is already a fast path for them in the latest version of borges. When the repos have forks or several rooted repos the advantage is bigger.

Problems
Benchmarks

Tensorflow

Only the main repo. url: https://github.com/tensorflow/tensorflow
Sinatra

Small repository. url: https://github.com/sinatra/sinatra
Size: 9298261
Size: 9951084
Gerrit

references: 98039
Stopped after 1 hour and 17 minutes, with 523 siva files written.
NOTE: This repository takes a lot of time writing references (18 minutes of the total are spent writing reference files).
Code for the prototype: https://github.com/jfontan/borges/tree/new_borges. It uses Go modules, so clone it outside of GOPATH.
Rovers
Also retrieve whether the repository is a fork and which its parent repository is. This can be done by checking if `"fork": true` appears in the JSON, then getting the repository with the API and checking `source`, which is the first parent. This information comes in handy and can be used to find the rooted repo where to push this repository. With it we can accomplish two things:

It would also be interesting to get `size`, as it could be used to schedule a mix of small and big repositories to decrease the chances of memory starvation.

NOTE: there can be a special discoverer that uses ghtorrent as input.
NOTE: `fork` is already retrieved by rovers and stored in the database. `source` needs to be retrieved from the repository information and would require a second call to the API. Maybe using the new GraphQL API can improve this.
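For illustration, a minimal Go sketch of that extra REST call. The struct fields mirror the documented GitHub repository JSON (`fork`, `size` in kilobytes, and `source` for forks); authentication, rate limiting, and most error handling are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type repoInfo struct {
	Fork   bool `json:"fork"`
	Size   int  `json:"size"` // kilobytes, per the GitHub API docs
	Source *struct {
		FullName string `json:"full_name"`
	} `json:"source"` // only present for forks
}

func fetchRepoInfo(owner, name string) (*repoInfo, error) {
	resp, err := http.Get(fmt.Sprintf("https://api.github.com/repos/%s/%s", owner, name))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var info repoInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return nil, err
	}
	return &info, nil
}

func main() {
	info, err := fetchRepoInfo("sinatra", "sinatra")
	if err != nil {
		panic(err)
	}
	if info.Fork && info.Source != nil {
		fmt.Println("parent:", info.Source.FullName)
	}
	fmt.Println("size (KB):", info.Size)
}
```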
Queues

We are using the queues as databases and they become huge. This makes backups and other maintenance more complex than needed. It also makes us rely too much on the information they contain. The rovers queue will be consumed and repositories will be created in the database instead of creating new jobs. The producer will maintain the number of messages in the jobs queue within a given threshold and refill it as needed.
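A minimal sketch of that refill loop; `Queue` and `Store` are hypothetical interfaces standing in for the real queue and database layers.

```go
package producer

type Queue interface {
	Len() (int, error)
	Enqueue(repositoryID string) error
}

type Store interface {
	// NextPending returns up to n repository IDs, highest priority first.
	NextPending(n int) ([]string, error)
}

// refill tops the jobs queue up to threshold from the repositories table.
func refill(q Queue, s Store, threshold int) error {
	pending, err := q.Len()
	if err != nil {
		return err
	}
	if pending >= threshold {
		return nil // queue already holds enough messages
	}
	ids, err := s.NextPending(threshold - pending)
	if err != nil {
		return err
	}
	for _, id := range ids {
		if err := q.Enqueue(id); err != nil {
			return err
		}
	}
	return nil
}
```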
To make this possible the repositories will have more states and columns.
States:
- `discovered`: the repository was added to the database
- `pending`: the repository is in the queue to download
- `fetching`: it's being downloaded
- `fetched`: it was successfully downloaded
- `error`: download had an error

Database:
- `status_at`: time when it changed its state
- `siva`: name of the siva file where it is located
- `priority`: priority for the repository, used in scheduling
- `fork_endpoint`: used to find the siva file where it should be stored
- `error`: cause for the error

On error the cause will go into the `error` column. There are no `not_found` and `auth_req` statuses.
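For illustration, the proposed states and columns as Go declarations; the names follow the lists above, the field types are assumptions.

```go
package model

import "time"

type Status string

const (
	Discovered  Status = "discovered" // added to the database
	Pending     Status = "pending"    // queued for download
	Fetching    Status = "fetching"   // being downloaded
	Fetched     Status = "fetched"    // successfully downloaded
	StatusError Status = "error"      // download had an error
)

type Repository struct {
	ID           string
	Status       Status
	StatusAt     time.Time // time when it changed its state
	Siva         string    // name of the siva file where it is located
	Priority     int       // used in scheduling
	ForkEndpoint string    // used to find the siva file where it should be stored
	Error        string    // cause of the error (replaces not_found/auth_req statuses)
}
```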
Components

The components will change their names to make them more user friendly. "Consumer" and "producer" do not mean anything to non-developers.
Discoverer
To have the same features as we have now, we will have `mentions` and `file` discoverers. They will work the same as the current producers but will only create the repositories in the database instead of sending jobs to the queue.

Scheduler
We had multiple producers (file, mentions, buried, ...) but now all this functionality will be in the scheduler. This will make it feasible to do more complex scheduling if needed.
There are four main types of jobs to schedule:
- repositories in `discovered` state, the initial download of a repository
- repositories in `fetched` state
- repositories in `error` state
- jobs in intermediate states (`pending`, `fetching`) or in the buried queue

Each state could be configured with a ratio to send to the queue, as in the sketch below. The repositories for each group will be queried taking repository priority into account.
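A sketch of how such ratios could translate into batch composition; the group names follow the list above, while the concrete ratios are illustrative assumptions.

```go
package scheduler

type group struct {
	state string
	ratio float64 // fraction of each batch sent to the queue
}

// Illustrative split: mostly initial downloads, some updates, some retries.
var groups = []group{
	{"discovered", 0.7},
	{"fetched", 0.2},
	{"error", 0.1},
}

// batchSizes splits a batch of n jobs among the groups by their ratios.
// Each group would then be queried ordered by repository priority.
func batchSizes(n int, groups []group) map[string]int {
	sizes := make(map[string]int, len(groups))
	for _, g := range groups {
		sizes[g.state] = int(float64(n) * g.ratio)
	}
	return sizes
}
```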
NOTE: As we improve and optimize download methods we could have different queues for big repositories or repositories that we know beforehand cannot be optimized and will use more memory. We can then process these special repositories in producers that have fewer workers to minimize the problem of memory starvation.
DISCARDED NOTE: using a queue is interesting as it already provides HA and makes it possible to restart the scheduler without stopping production. Calling the scheduler directly is another option and enables even better scheduling, as information like free memory in a worker can be taken into account. Still, for now I believe this adds complexity that may not be worth the effort.
Downloader
Optimizations
Do not separate repositories per rooted repo but maintain them in the same siva file: #380
With this change some optimizations can be applied to the clone step.
go-borges siva
The current transactioner has to change to a customizable locker. In our case locks will be done using an etcd mechanism so it can be used in a distributed fashion.
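A minimal sketch of such a distributed lock using the etcd clientv3 concurrency package; the key layout, endpoint, and per-siva-file granularity are assumptions.

```go
package main

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// One mutex per siva file so only one worker writes to it at a time.
	mutex := concurrency.NewMutex(session, "/borges/locks/0168e2c7.siva")
	if err := mutex.Lock(context.Background()); err != nil {
		log.Fatal(err)
	}
	defer mutex.Unlock(context.Background())

	// ... write to the siva file transactionally ...
}
```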
go-siva will use a storer that shows all objects but mangles references. On writing a reference, its name will be changed to add the repository ID. On read, it will show all references for all repositories plus some virtual references for the current repository with the correct name (repository ID stripped). This allows download optimizations.
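A sketch of that mangling; the separator and the example ID are assumptions, and the real storer would wrap the siva storage rather than work on bare strings.

```go
package main

import (
	"fmt"
	"strings"
)

// mangle appends the repository ID to the reference name on write,
// e.g. refs/heads/master becomes refs/heads/master/<repoID>.
func mangle(name, repoID string) string {
	return name + "/" + repoID
}

// unmangle strips the repository ID on read, returning the virtual name
// and whether the reference belongs to the given repository.
func unmangle(name, repoID string) (string, bool) {
	suffix := "/" + repoID
	if strings.HasSuffix(name, suffix) {
		return strings.TrimSuffix(name, suffix), true
	}
	return name, false
}

func main() {
	stored := mangle("refs/heads/master", "0168e2c7")
	fmt.Println(stored) // refs/heads/master/0168e2c7

	virtual, ok := unmangle(stored, "0168e2c7")
	fmt.Println(virtual, ok) // refs/heads/master true
}
```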
Rooted Repositories
Single siva file for repositories
Siva files will contain complete repositories instead of single history trees. This is explained in a couple of proposals in borges:
TL;DR: The siva file used to store a repository will be the one for the initial commit of its default branch.
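For illustration, a sketch that derives such a name by shelling out to git; note that `rev-list` can print several root commits for histories with multiple roots, so this simply takes the first.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// initialCommit returns the hash of a root commit of the current branch.
func initialCommit(repoPath string) (string, error) {
	out, err := exec.Command("git", "-C", repoPath,
		"rev-list", "--max-parents=0", "HEAD").Output()
	if err != nil {
		return "", err
	}
	roots := strings.Fields(strings.TrimSpace(string(out)))
	if len(roots) == 0 {
		return "", fmt.Errorf("no root commit found")
	}
	return roots[0], nil
}

func main() {
	hash, err := initialCommit(".")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(hash + ".siva")
}
```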
Reference naming
Currently the references for repositories in a rooted repository have a strange nomenclature: the identifier of the repository is added after the original name of the reference, and its remote configuration refspec does not match the stored names.
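A sketch of the current scheme; the repository ID, URL, and exact refspec are hypothetical:

```
# reference name: the repository ID is appended to the original name
refs/heads/master/0168e2c7-eedc-7358-0a09-39ba833bdd54

# remote config: its refspec does not produce names like the one above
[remote "0168e2c7-eedc-7358-0a09-39ba833bdd54"]
	url = https://github.com/foo/bar
	fetch = +refs/heads/*:refs/heads/*
```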
As the repositories are added as remotes in the config file, we can use the same system as git and add them to `refs/remotes`.
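Here's how the modified config and its references could look, following git's standard remote refspec layout (same hypothetical ID and URL as above):

```
[remote "0168e2c7-eedc-7358-0a09-39ba833bdd54"]
	url = https://github.com/foo/bar
	fetch = +refs/heads/*:refs/remotes/0168e2c7-eedc-7358-0a09-39ba833bdd54/*

# resulting reference
refs/remotes/0168e2c7-eedc-7358-0a09-39ba833bdd54/master
```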
This is much closer to what git does with remotes and also lets us work with these repositories using the `git` CLI. For example, we can unpack a siva file and use `git fetch --all` to update the repositories in it.