To get the functionality in quickly, I haven't separated the code out as a plugin. Basic test with a single Glusterd2:
What is working:
Issues yet to be addressed:
Some comments without checking the code:
Further questions to follow in a few days after looking at the code.
@aravindavk Does this code depend on #603?
These 2 cases are covered in #603.
@aravindavk Also, if possible, could you elaborate on the algorithm you have used to select the devices that are used to create bricks?
In this PR I am using the PeerID itself as the zone; once #603 is merged it will be useful.
That is great. The only change to the device struct is the additional FreeSize field and Used flag.
I added a lot of comments. I will add the design note here soon.
Since most of the code is common, I used the same endpoint. As you mentioned, a separate endpoint makes sense; I will work on making it a different endpoint.
Only Size and Volume name are required; all other fields are optional. Default values for the other fields are to be discussed (e.g. the default Volume type is Replica 3).
Calculating the Arbiter brick size is well documented in "5.8.1.2. Arbiter capacity requirements" (https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/creating_arbitrated_replicated_volumes#sect-Arbiter_Requirements). This needs to be adopted by us.
glusterd2/commands/volumes/common.go
// Create Logical Volume
cmd := exec.Command("lvcreate",
    "--poolmetadatasize", "5K",
    "-c", "256K",
For all such defaults, I think we should either keep them dynamic or we should declare constants.
These are dummy values; I will work on the final values while working on SnapshotFactor.
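As a sketch of how those defaults could become named constants (the constant names and the wrapper function are illustrative, not from this patch):

package volume

import "os/exec"

// Illustrative defaults for thin-LV creation; the names and the final
// values are assumptions pending the SnapshotFactor work.
const (
    defaultPoolMetadataSize = "5K"   // --poolmetadatasize
    defaultChunkSize        = "256K" // -c (chunk size)
)

// lvcreateCmd builds an lvcreate command that always carries the
// shared defaults, so they are declared in exactly one place.
func lvcreateCmd(extraArgs ...string) *exec.Cmd {
    args := []string{
        "--poolmetadatasize", defaultPoolMetadataSize,
        "-c", defaultChunkSize,
    }
    return exec.Command("lvcreate", append(args, extraArgs...)...)
}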
Refactored the code as a separate plugin. The REST API is temporarily named /smartvol. Note: this patch is not working as expected after the refactor; I am working on fixing the issues to make it work. I will refresh this patch later today.
plugins/smartvol/transactions.go
}

// Create Logical Volume
cmd := exec.Command("lvcreate",
Can we move the commands outside of the transaction, like https://github.com/gluster/glusterd2/pull/603/files#diff-ee50b0428ecdd9b14aede7e987db1b59? It makes commands easier to handle and makes our code more modular.
Nice, all lv/vg commands can be moved to this file itself. Can we rename that package as deviceutils?
I will move all device commands to that file once that is merged.
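A rough sketch of what such a deviceutils helper could look like, assuming the package rename suggested above (the function names are hypothetical):

package deviceutils

import "os/exec"

// CreateThinPool creates a thin pool on the given VG using the
// standard lvm2 invocation: lvcreate -L <size> --thinpool <pool> <vg>.
func CreateThinPool(vgName, poolName, size string) error {
    return exec.Command("lvcreate", "-L", size, "--thinpool", poolName, vgName).Run()
}

// RemoveLV force-removes an LV or thin pool; useful for rollback.
func RemoveLV(vgName, lvName string) error {
    return exec.Command("lvremove", "-f", vgName+"/"+lvName).Run()
}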
@aravindavk I think we need to handle the case where the recorded free size of a device is different from the actual free size (if the user manually uses the device). On failure to create a brick due to "not enough space", we can then resync the free size of that device (implicitly, i.e. without asking the user to do it, like it's done in heketi).
Valid concern. Volume create performance may get affected by this if we resync devices before volume create (if the cluster has many devices/peers).
Resyncing only that device is not sufficient, since we need to reallocate bricks again with the updated information of all devices. If we resync only that device, the allocator will fail if two devices were manually consumed outside Glusterd2. For accurate (consistent) results, the first transaction step can always refresh all devices.
Another issue we need to solve is concurrent volume create requests. All of them can see the available size and allocate accordingly, but some volume creates may fail.
Created a small visualization for calculating the Arbiter brick size. If the workload is very small files, we do not get much of a size advantage from the Arbiter feature. http://aravindavk.in/sandbox/gluster-arbiter-brick-size.html cc: @itisravi
Resync is only needed if a mismatch is detected, like a failure to create a brick due to not enough space.
This will lead to performance problems, but we can make the code intelligent enough to avoid this: resync only a few devices while the synchronous transaction is going on, and then resync the rest asynchronously.
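A minimal sketch of that retry-on-mismatch idea; every name here is hypothetical, not code from this PR:

package smartvol

import "errors"

// ErrNotEnoughSpace signals that the cached free size was stale.
var ErrNotEnoughSpace = errors.New("not enough space")

// VolCreateReq and the helpers below are stand-ins for the real
// request type and transaction steps.
type VolCreateReq struct {
    Name string
    Size uint64
}

func planAndCreate(req VolCreateReq) error { return nil }
func resyncAllDevices() error              { return nil }

// createVolume retries the allocation once after resyncing device
// free sizes, instead of resyncing before every volume create.
func createVolume(req VolCreateReq) error {
    err := planAndCreate(req)
    if errors.Is(err, ErrNotEnoughSpace) {
        if rerr := resyncAllDevices(); rerr != nil {
            return rerr
        }
        err = planAndCreate(req) // retry with refreshed free sizes
    }
    return err
}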
Can we solve this using locks on devices?
A mismatch will be detected while running the transaction function, which runs on one node. Resync has to be done on every node where devices are registered. We need more transaction steps, as below.
We do not have a way to skip a step function based on the previous step's status; the 3rd step will reach all peer nodes and return (in both the success and failure cases).
Possible, with added complexity. For correctness, we can do the following
This is a generic issue, even in the case of regular volume create: if two volume creates specify the same brick path, only one will win. The lock framework is not yet capable of handling multiple names; in this case a lock is required on the volume name and on all the devices.
4KB is only to be on the safe side. If each file's data is less than 4KB (i.e. du <filename> is 4KB, which is the minimum XFS allocates even if only 1 byte is written to the file), then the arbiter brick will consume ~0KB of disk usage for data. In that case the arbiter brick size is only limited by the number of inodes that can be created on that XFS partition.
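To make the sizing concrete, here is a small sketch of the documented estimate (roughly 4KB of metadata per file; the average file size is an assumption the operator supplies):

package main

import "fmt"

// arbiterBrickSize estimates the arbiter brick size from the data
// brick size and an assumed average file size: the arbiter stores no
// file data, only ~4KB of metadata per file on XFS.
func arbiterBrickSize(dataBrickSize, averageFileSize uint64) uint64 {
    const perFileMetadata = 4 * 1024
    return (dataBrickSize / averageFileSize) * perFileMetadata
}

func main() {
    // 1TiB data bricks with ~1MiB average files -> ~4GiB arbiter brick
    fmt.Printf("%d bytes\n", arbiterBrickSize(1<<40, 1<<20))
}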
Hi all, I'd like to understand how this relates to the project of bringing the heketi functionality into gd2. We kind of agreed on that project; we also need to make sure we don't introduce any big incompatibilities with how heketi works in general (from the consumer perspective). Hence trying to understand how this relates.
This patch implements the following functionalities.
The only overlapping component I see is the Allocator. Note: this allocator has some extra features, like handling the distribute count. Some notes about the Allocator: #466 (comment)
route.Route{
    Name:    "SmartVolumeCreate",
    Method:  "POST",
    Pattern: "/smartvol",
This should not exist. The same URL should serve non-smart and smart volume create.
The URL name can change, but why do you think we should have a single API?
For the end feature, I agree that this shouldn't be a separate endpoint. But I wouldn't mind this for the initial development and testing while we finalize the smart provisioning design.
Also, we don't yet support plugins adding middlewares, so it's not possible right away.
I think the original thought was to have smart and classical volume creation under the same API and just trigger classical or smart creation based on how many of the possible input parameters are set.
If we are sticking with a separate URL, we should consider making the API as heketi-compatible as possible, since heketi is the smart volume create API for gluster that exists out there, and hundreds of users are consuming it via kubernetes provisioners. The more we deviate from what heketi offers, the harder it will be to make the switch once gd2 is ready to take over in this space.
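For illustration only, a single endpoint could branch on the request body roughly like this; the URL, field names, and handlers are hypothetical, not gd2's actual API:

package main

import (
    "encoding/json"
    "net/http"
)

// volCreateReq carries both request styles: an explicit brick list
// (classical create) or just a size (smart create).
type volCreateReq struct {
    Name   string   `json:"name"`
    Size   uint64   `json:"size,omitempty"`
    Bricks []string `json:"bricks,omitempty"`
}

func volumeCreateHandler(w http.ResponseWriter, r *http.Request) {
    var req volCreateReq
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    if len(req.Bricks) == 0 && req.Size > 0 {
        smartCreate(w, req) // no explicit layout: plan bricks automatically
        return
    }
    classicalCreate(w, req) // the user supplied the brick layout
}

// Stand-ins for the two code paths.
func smartCreate(w http.ResponseWriter, req volCreateReq)     {}
func classicalCreate(w http.ResponseWriter, req volCreateReq) {}

func main() {
    http.HandleFunc("/volumes", volumeCreateHandler)
    http.ListenAndServe(":8080", nil)
}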
hundreds of users are consuming it via kubernetes provisioners
IIUC, it is probably this: User --> KubeProvisioner --> Heketi --> Glusterd
But are there users consuming the heketi API directly?
If the kube provisioner is the primary place of heketi API consumption, end users will not be affected if the kube provisioner is modified to use the new APIs.
This is correct. It is just a few places in the code. I am not aware of other direct consumers of the API. But the smoothness of the experience and the feature compatibility will have an impact on that large user base.
Hmmm, I think there's more overlap than this: Here's what heketi implements as components of high-level volume management:
The data store needs to be replaced by gd2's internal state store (based on etcd, afaik). And the API will need to be merged with the gd2 API, ideally in a way that's largely equivalent. But this can imho be the very last step rather than the first one. The placer can be re-used as a package at some point (we're almost there), but I think the disk management code in heketi also has a lot of value to add, since it's production proven and has a lot of lessons learned implemented. Sure, it uses ssh or kube/oc exec in the backend, but that's just a detail. We may not refactor the disk mgmt code in heketi enough to re-use it as a package (since that might not be worth the effort), but the logic pieces, with all the rollback code and defers etc., should at the very least be taken as a guideline. Ideally at some point we could copy code over (including the test cases!) and continue maintaining it in gd2 from that point on. Remember that untested code is broken code. :-)
The heketi allocator (placer) determines the distribute count for us. The main point of intelligent volume provisioning in heketi is that we only get the requested durability type of the volume (replica-3, disperse 4+2, etc...) and the size in the request, and the distribution is handled by heketi. We could make it an optional parameter at some point to require a specific distribution count, if anyone needs this. I think the first and foremost goal is to get to feature parity with the current heketi model, which is how intelligent volume provisioning with Gluster is used out there, instead of creating new functionality that does not quite match. And we can add more features or aspects later on. Let me explain some more about the 'phased approach' that I was mentioning. This is an approach with the goal of bringing high-level volume provisioning and other features from heketi into gd2 for broader consumption, so that eventually heketi will not be there any more (at least for these features), while keeping the current consumers of heketi functionality working.
Here, step 1 might be seen as optional, but it would give us a quick start with little effort. Steps 3 and 4 can be done in parallel with each other and with steps 1 and 2 (they have already started). Only step 5 will start adding the new REST APIs for intelligent volume management to gd2. The benefits of this approach as I see it would be to
If you want to start testing the stuff internally from the gd2 perspective, we can implement test cases that use internal functions, instead of implementing RESTful APIs and doing e2e tests here. We do have the luxury of taking this more careful approach, since we have the heketi REST endpoints available for e2e testing. More than happy to discuss this in greater detail as we move along! I will also look at the patch in more detail. And I am sure there's a lot of goodness in it, but I wanted to prevent having it merged in a hurry, and instead make sure it is aligned with the experiences from heketi and with the broader plan (at least if we can agree to that). @aravindavk @atinmu @prashanthpai @kshlm @rishubhjain @raghavendra-talur @phlogistonjohn @Madhu-1 ....
The overlap I mentioned was with respect to reusability of the code as a library: except for the allocator/placer, a rewrite is required for all the other parts of this feature.
Glusterd2 already has a data store for the volume details and device details (the device APIs are already part of Gd2). The disk management code cannot be reused as-is from Heketi: Glusterd2 already has a transaction framework and we should use that, and it already supports rollback functionality. I agree that best practices from Heketi about disk management should be considered while rewriting the disk management code in Glusterd2.
Distribute count is already considered an optional input; except for Size, all other inputs are optional. I already mentioned automatic distribute count in "issues yet to be addressed" here: #661 (comment)
I agree the phased approach helps Heketi provide the same user experience and not break existing behavior. Sure, Heketi can consume the default volume create REST API for its use case, and all the device management can be managed by Heketi itself. But as an upstream project, Glusterd2 should provide a smart volume create API as well. Since the APIs are independent, it will not affect Heketi functionality/consumers in any way. Once the Glusterd2 smart volume create API becomes stable, Heketi can consider starting to use it.
Note: the behavior of the REST endpoints is different in Glusterd2 and Heketi. In Glusterd2 all the APIs are synchronous right now; we may start working on async APIs, but that is not a high priority at the moment. I agree we can explore the possibility of reusing heketi's test cases for this feature. We also need Glusterd2 e2e tests, since we need to uncover bugs in other parts of the code (REST handler/transaction framework), including the allocator/placer.
Thanks
I am still working on some todos and tests to make this usable.
Split the patch into two for easier review. The first patchset contains the refactor and the deviceutils library; the second patch contains the smartvol plugin code. I will add tests and other fixes as a new patchset.
package bricksplanner
I'm slightly confused by this layout. That's probably due to my background in how heketi treats these concepts. At that high-level/smart view of volume creation, heketi specifies not a volume type but a "durability type": this can for instance be replicate (e.g. with replica count 3), or replicate with arbiter, or disperse (with counts 4,2). Then the distribution factor is just how many copies of such brick-sets of the durability type you have. (So a volume of (durability) type "distribute" does not exist. The "none" durability is just the special case of replica 1 and does not require code of its own.)
For me at least this view and layout adds a lot of clarity. :-)
I am also always confused by what a "subvolume" is. It seems this is what heketi calls a brick-set, i.e. the smallest unit of bricks for the given durability type, correct?
The word 'subvolume' in the gluster world comes from its volfile grammar. Check any file ending with .vol in /var/lib/glusterd/vols/$volname/, and you can see a definition which looks something like below:
volume patchy-dht
type cluster/distribute
option lock-migration off
subvolumes patchy-client-0 patchy-client-1 patchy-client-2 patchy-client-3 patchy-client-4
end-volume
In the above, notice the use of 'subvolumes'. This is the one line which establishes the relations in gluster's translator graph, like where to place the given translator. While most of the translators have just 1 subvolume, the cluster/* translators can have more than 1.
Let me follow up on some previous discussion in this PR that somehow died away. @aravindavk wrote:
I think you are missing my main point here: my main point is not to help heketi but to help gd2! And to help those current consumers of heketi that will become consumers of gd2. I am not mentioning plugging gd2 into heketi in order to release it, but as a vehicle for testing gd2. There are hundreds of users of the current kubernetes->heketi->gluster stack, and I think we should make the later transition to a kubernetes->gd2 stack as safe and seamless as possible, also limiting our work in adapting the code and fixing bugs that we have already fixed in heketi. Hence I am so adamant about this phased approach. Let me get very general and state what I see as the main tasks and responsibilities for our GD2 work, in order of priority:
As part of point 2, I think we should make the APIs and features as compatible as possible.
This, imho, is just a detail and not an inhibitor for trying to be as API-compatible as possible and reasonable (wrt URLs and payloads).
Given all the points I made above, and also based on what @raghavendra-talur and @phlogistonjohn commented in review, I would really prefer and recommend not to rush this patch into the code for gluster 4.1. I think it would give a wrong, or at least confusing, sign to the community. I would much rather have 4.1 concentrate on bringing stability and gd1 feature parity a little further. We can then move forward in master with this very soon after 4.1 is branched off. This way we can bring these features in much more carefully and have them in a good state for 4.2. My 2 cents. @aravindavk @kshlm @prashanthpai @rishubhjain @raghavendra-talur @phlogistonjohn @Madhu-1 @ansiwen @amarts ...
Thanks @raghavendra-talur for the review.
I am still not convinced why we need to merge it into a single API. The CLI is already common for both types of volume creation. A separate API avoids a lot of confusion in API documentation/debuggability.
Not directly related to this PR. @rishubhjain already added PeerID in the device struct, and separating it out of peer is planned as part of the device management APIs (#728).
Sure, noted. But that can be done in future PRs.
Sure, but that can be done in future PRs.
I already mentioned the advantage of having a separate URL above. I feel the issues mentioned above are not blockers for merging.
Thanks @obnoxxx for the review.
As I commented earlier, having the same API increases the complexity for end users.
In Gluster, a subvolume is each distribution group. Gluster is capable of having a distribute subvolume inside a distribution group (e.g. Tiering uses DHT inside DHT). This API also will not accept the volume type directly; it finds it based on inputs like the replica/disperse/arbiter counts.
This is because the API supports specifying the distribute count and can create a pure distribute volume.
With Glusterd2, flexibility is provided in the volume create request to accept a list of subvolumes instead of just a list of bricks (https://github.com/gluster/glusterd2/blob/master/doc/quick-start-user-guide.md#create-a-volume). This layout helps in cleaner volfile generation and gives control over individual subvolumes (for example: replace a subvolume, support for a mix of subvolume types, etc.). The same layout is reused here, since after planning and preparing the bricks the rest of the steps are common to both volume creation APIs.
In my comment, I also agreed on the phased approach for the final product.
I agree with these points for the product. If we plan to integrate the volume provisioner with Heketi only in the final phase, we will not be able to uncover many issues related to this. Since the API is independent of the other volume creation API, the phased approach for the product is not affected.
Added my response above.
Glusterd2 is experimental in the Gluster 4.1 release. To make this feature stable for the 4.2 release, we would like it to be available in 4.1, because integration POCs (Ansible/Tendrl integration) prefer a released branch over master, and this will also help us get feedback from the Gluster community.
I like the conversations on the patch; looks like we need to continue the momentum.
The above is the very thought which crossed my mind. I see that GD2 is still experimental with the 4.1 glusterfs release, and hence it would allow us to get more feedback if we push the experimental features with it, rather than adding them to the stable version. That way, as developers we can uncover most issues in the core part of the product, and it also allows us time to bring changes in incrementally. I recommend merging this feature for 4.1 with a statement saying it is experimental and that the APIs are not finalized. About the same API endpoint for 'volume create' (as in the old way) and the new 'get me a volume': Aravinda started the first version of the patch with a single API, but I am the one who asked for keeping them separate. The idea is, I don't want to confuse the 95% of users who would use only the new APIs (i.e., dynamically create the volume with a given size) with the 5% of power/expert developers who would want to play with the old API for advanced features. We can debate more on this in the coming days. If I go through the reviews on the latest code, I don't see any technical blockers for the patch, other than calling out that the API is not final, and some internal code cleanups; hence I request to go ahead with merging this feature, mainly as it also has CLI functionality and test cases added to it. This means users can actually consume this feature and give feedback. @ShyamsundarR from the release point of view, what do you think?
Thank you everyone for all your valuable comments. While I do agree that there's some more work/discussion required on dynamic provisioning of volumes, especially from the perspective of freezing the APIs, optimizing the workflow et al., I haven't seen a single technical concern raised on the design, and hence, like Aravinda and Amar, that doesn't make me believe that this PR needs to wait. As this feature is by no means being called out as ready to be consumed, let's go with an incremental approach, have users get a feel for what this dynamicness means in GD2, and over the next 3-4 months make it a rock-solid implementation, ready to be used.
Aravinda - can we also please capture all the valid comments in GitHub which need to be worked on in subsequent PRs, so that they don't get missed out?
@aravindavk wrote:
I don't think we need it for any technical reason, but it was what @raghavendra-talur said was the original idea behind it. (See Talur's presentation at the Gluster Summit in Prague, 2017.) It seems this needs to be discussed.
I do agree that the common CLI may be the more important part, and it's good that it is there. For the API, I don't see a big source of confusion for a consumer: it can be very natural for a REST API to have the same endpoint, with fewer or more input parameters provided, triggering different code paths like smart or basic volume creation. Having separate APIs/URLs mainly keeps the code more separate. Doing it this way now and changing it later may imply more work later if we choose to change it. ==> There seems to be a need to reach consensus here.
@aravindavk writes:
That is true, you wrote this, but I am still confused: this PR basically takes one of the last steps of the phased approach and rushes it as the first step, essentially reversing the approach. :-)
I am not talking about the product, but about general technology / upstream here.
I don't understand. First of all, we don't plan to integrate a volume provisioner (of gd2?) into heketi, but to move heketi's volume provisioner into gd2. The phased approach has the integration of gd2 into heketi for a test setup as one of the first steps, not as a final step. In the final situation, heketi won't be there any more. This early step is meant to help catch errors, and to give early access to an additional way of testing and validating the work we do in gd2. And to make sure our current consumers of the smart volume provisioner keep working. I don't see how integrating gd2 into heketi would prevent us from uncovering issues, since it would only be an additional source of tests and verifications...
Creating the smart volume API is part of the phased approach, and rather one of the later steps than the first step, so I don't understand this statement.
If I remember right, the presentation was about using Heketi as middleware in Glusterd2. We already discussed it here: #466 (comment)
@aravindavk About having a single endpoint for creating volumes: our end goal is to have a single endpoint for volume creation. The expectation would be for GD2 to create and hand out volumes when requested. The default should be to expect the least amount of detail from users for creating the volume, with the volume size being the only thing required if nothing else is provided. But the API also needs to allow the user to provide as many details as possible, such that they are able to manually lay out the volume. Our aim should be to provide an automatic create API that supports manual customisation. Right now, as we do this initial development, we are doing it the other way round: we have a manual API, on top of which we're building the automatic API. So at the moment it makes sense (to me at least) to have development happen as a separate API. But we should remember that the end goal is for us to have a single API.
Not really. It was (as I remember) about implementing heketi-like smart volume creation, including disk management, in gd2 as a plugin or middleware, extending the existing volume create API.
@obnoxxx My suggestion (and request) would be to think forward instead of going backward, and I believe we are all working in a phased approach. The action item that both GD2 and Heketi development took was to debate and discuss the pros/cons of reimplementing dynamic volume provisioning in the GD2 layer vs. reusing certain heketi libraries (by refactoring) and using them as modules. AFAIK, Aravinda did provide his perspective on the benefits of having the implementation done at the GD2 layer instead of reusing heketi's libraries, as certain things may not need to be reused at all. @aravindavk, request you to provide the link to that discussion. On the same lines, I haven't seen any objections to that proposal from @raghavendra-talur or @phlogistonjohn. To me, that is when phase 1 already started. The next phases would be collaborating on these pieces to polish and enhance them and get to a solid implementation which can provide all the required user experience (be it for a standalone Gluster user or a k8s user). We can in parallel start feeding API requests from Heketi to the core GD2 APIs to test functionality and stability, which can stay an independent task, but again that doesn't block us from starting to test the IVP APIs in parallel as well. Now that the basic implementation is already in place, and given GD2 continues to be tech-preview in GlusterFS 4.1, I don't see any reason to hold this PR. All the comments can be looked at and addressed in subsequent PRs. Also, that merging the APIs into a single one may get us into a lot of rework is not something I'd agree with. So overall, if we do have any design issues which need to be discussed for this PR, I am happy to get invested; otherwise I'm not sure why that consensus can't be in place.
@amarts wrote:
Yes, gd2 is experimental in 4.0 and will remain experimental in 4.1. For this reason, and given the current state of gd2 stability, I doubt we will get a lot of feedback from the community. I don't even see a lot of core gluster developers using or considering gd2. Experience from other projects shows me that this hoped-for feedback from the community is often a false hope. Hence I think putting even more experimental features in for this release (and thereby risking further destabilization and future trouble making changes), instead of concentrating on stabilization, is the wrong move, imho. I think we should aim for 4.2 for: gd1 feature parity and stability, and feature completeness of the heketi-level features. Whether those can be stable by then is not certain now; we can try, but there is no guarantee. Whether merging this PR now, instead of after 4.1, will speed it up or slow it down is not 100% clear; I expect it to have a slowdown effect rather than a speedup, since once code is released, progress is slower. But as always, I may be wrong. :-) Also, here is how I see gluster: while heketi is not part of gluster releases, to me it belongs to gluster as the high-level service interface for volume management and day-2 operations, implementing stuff that was originally planned for glusterd2 and should now move back into glusterd2. Heketi has been around for more than 2 years now, and it is used mostly in the kubernetes dynamic provisioning world. So from a core gluster perspective, heketi is in a similar but worse situation than gd2, in that it is largely ignored. But there is a community of gluster out there that uses IVP already today and has for a while. My point is to try and involve this existing part of the community and infrastructure as we move the features into gd2!
In the end it's your call, @amarts, @atinmu, @kshlm. It can be done and it's not the end of the world. I still don't see a good reason to do it, but I do see reasons to hold it.
@atinmu writes:
Not sure how I'm thinking backward... I think I'm thinking very much forward, in that I'm trying to help bring gd2 out of ivory-tower development and into the real world, helping us get it tested by real-world consumer scenarios, thereby keeping us honest and giving us direction on where to move.
That's probably true, but everyone seems to have very different looking phased approaches in mind. ;-)
Yeah, ok, if you're doing a retrospective: in parallel to this discussion, this PR was already started. That's not a problem as a PoC, to have something more concrete as the basis for the discussion. But while I had the impression that the outcome of the discussion was that we agreed to re-use as much as possible from the heketi world, either directly (as packages/libs) or indirectly, this PR is now already proposed for inclusion in the release. (Maybe I got that impression of the outcome because I'm biased and that's what I want the outcome to be, but also because in all the discussions we had, I talked and talked and wrote and wrote to explain my motivations and approaches, but only got very few responses... ;-) ) Anyway, to me it is somewhat unfortunate that some of those fundamental discussions only happened in a very direct and honest way in this PR and not earlier.
Yes, please.
I don't see any point in having heketi talk to gd2 core APIs once this PR has been merged. In my original proposal we would have done that first and then gradually implemented IVP features in gd2, taking over much of heketi, and replacing the code in heketi to talk to the IVP APIs in gd2, heketi becoming a thinner and thinner shim layer until it's gone.
Now let's close this discussion. As I said in the other comment, eventually it's your call. If you want to merge for 4.1, it's not the end of the world (if APIs and internal structures are not set in stone yet), but I've made my comments. ;-) I want to make some points clear:
Cheers - Michael
I share the exact same opinion, here and in general too. If getting feedback from the community is the motivation to rush this, IMHO it's misplaced hope. Sorry if this comment seems pessimistic, but it's realistic.
Non-technical
@obnoxxx @prashanthpai I am a hopeful person 😄 I see less participation from people when things are not stable; that is the norm of the world. But all I care about is that for those who want to try, there is an option (that may be less than 1%, but there will be people). Even that 1% of feedback would be good for us. My stand has always been: if we don't experiment with new things, we will fail in the long run. Of course, if we experiment on a stable codebase, it will have the negative impact of driving people away, but as long as the experiment is not causing issues to existing features (or as long as everything is experimental), we should go for it. I agree that even glusterfs developers wouldn't try gd2 if it's not stable, and I guess that is valid, because for their tasks it doesn't help in any way right now. I am pushing for this patch to be merged in 4.1, even with all these discussions, mainly to get it out there in the release notes announcement. I totally understand that we don't have a community which would pick up the new release as soon as it is released, but to even build such a community, we should be providing reasons for them. Also note that even if we announce it in 4.1, it won't be used until someone tests it a month or so later (or even longer). It would be the same for glusterfs 4.2 too. If we say gd2 is the only way forward when they upgrade, people will delay their testing (forget production migration). So, if we push the experimental changes with the glusterfs-4.2 release, we will push out the testing effort even further. This is the exact reason I am pushing for it to be merged before the 4.1 release and to go out into the field. Also, because it has certain tests added, while we are stabilizing the feature we will always have the feature under consideration, which is much better than not considering the whole IVP stuff during development. As I said, the whole project is not yet announced as GA, and is not close to it for another month or so. Hence having more experimental changes, with everyone agreeing that nothing (maybe not even the core APIs) is written in stone yet, is very critical. Until we get regressions of glusterfs running with gd2 builds, there is always a chance that we may need a fundamental change in the design. So, why not get this in and see where it proceeds? We can always have debates and change things.
There are nightly builds for them :) Just to be clear, I have not approved or disapproved the PR, as I haven't reviewed it yet; I cannot comment on the readiness of this PR. If the reviewers/maintainers who have reviewed it think it is reasonably complete and ready to go, please do merge it. My disagreement is with the reasoning behind pushing this. The reasoning quoted:
Integration projects can continue to use the master branch and help us shape and evolve the APIs. I also assume that another implicit reason behind the rush is the upcoming 4.1 release. Given the pace at which commits (lots of fixes, not just features) are being merged into glusterd2 master, and also considering glusterd2 is not reasonably stable yet, I'd prefer that we do not maintain a separate stable release branch for 4.1. We can tag releases (including minor ones) from the master branch. If that's the case, this PR can get into a later minor release. While it's general practice for stable projects not to include features in a minor release, the motivation for that practice is to be careful about not breaking things for users. Given that glusterd2 is still experimental and not stable, we don't have a case for maintaining a separate branch. It also saves us time in book-keeping and backporting, and will help us innovate faster, on the master branch.
I guess that was a given before putting my argument forward. I don't want to push people to merge a patch if something is not technically ready! That beats the whole purpose of open source and community. If there are concerns about the technical part of it, it is totally OK to get this stopped. My concern was about stopping it for the reason that we need to finalize the API. While there are so many patches where 'I will fix things in the next PR' gets merged, I don't want a different yardstick used to stop this patch. If it doesn't get merged by 4.1, that is fine too; it can get merged for 4.1.1 or 4.1.2. But having a clear ask on what more is to be done is helpful even to meet that target; otherwise, we will never be ready to merge the patch. Considering the patch is at least 42 days old, it is good to have progress on it, to keep the project healthy.
Based on the discussion and agreement between @obnoxxx @phlogistonjohn @raghavendra-talur @kshlm @prashanthpai over a call, we all agreed to get this PR in with the following disclaimers: it has to be explicitly called out in the release notes of GlusterFS 4.1 that this feature is still in the development phase and might go through a lot of changes, which can change the internals and APIs. Additionally, we'd also need to initiate a separate discussion to reach a converged decision on whether to have a single API for (non-)smart volumes. @aravindavk @kshlm Can one of you please rebase this patch, as I see some conflicts in the branch, and then get it merged?
@kshlm could we do proper rebases instead of having conflict resolution in the merge commits? This is so ugly and unnecessary. :-)
Based on the discussions we had and the agreement about API and structure stability (or lack thereof) for the next release phase, I'm not objecting to putting the changes in, even for 4.1. I would like to see the conflict resolution not be in the merge commit, though... :-)
Updates: #466 Signed-off-by: Aravinda VK <avishwan@redhat.com>
Opened a new issue to track the changes required to make this feature complete. Please add anything I missed from the comments: #851
With this feature, Gluster can choose the bricks automatically based
on the given Volume size.
Usage
Gluster will not assume the list of devices available on each peer. Register the devices which can be used by Gluster for provisioning bricks.
Once the devices are registered, a Gluster volume can be created with just a name and size.
To create a distributed replicate volume, specify the distribute count as below.
The options available with this command are:
If multiple peers belong to the same zone (or the same rack server), bricks belonging to the same volume/sub volume should not be chosen from the same zone. To set the zone of a peer,
Note: the zone CLI implementation is in progress; the APIs are available.
How it works?
Number of sub volumes and sub volume size
Find the number of sub volumes and the sub volume size using the provided distribute count.
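A minimal sketch of this step (not the actual bricksplanner code; the parameter names are assumed):

package bricksplanner

// subvolPlan derives the number of sub volumes and the size each one
// must provide from the requested volume size and distribute count.
func subvolPlan(volSize uint64, distributeCount int) (numSubvols int, subvolSize uint64) {
    if distributeCount <= 0 {
        distributeCount = 1 // plain, non-distributed volume
    }
    return distributeCount, volSize / uint64(distributeCount)
}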
Volume Type
Find the volume type based on the replica/disperse/arbiter counts.
Brick Size
Find each brick size, the thin pool size, and the number of bricks per sub volume, based on the volume type, the replica/disperse/arbiter counts, the snapshot reserve factor, and the sub volume size.
If Volume Type is Replicate
If Volume Type is Arbiter
If Volume Type is Disperse
Note: Add 0.5% (max 16GB) to the thin pool size for metadata.
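A sketch of that note (not the actual code; the snapshot reserve multiplication is an assumption based on the SnapshotFactor discussion earlier in this thread):

package bricksplanner

// thinPoolSize grows the brick size by the snapshot reserve factor
// and adds 0.5% for thin pool metadata, capped at 16GiB.
func thinPoolSize(brickSize uint64, snapshotReserveFactor float64) uint64 {
    pool := uint64(float64(brickSize) * snapshotReserveFactor)
    meta := pool / 200 // 0.5%
    const maxMetadataSize = 16 << 30 // 16GiB cap
    if meta > maxMetadataSize {
        meta = maxMetadataSize
    }
    return pool + meta
}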
Brick Layout
Construct the brick layout based on the above information. The below example shows the brick layout of testvol with size=1G, distribute=2 and replica=2. Note that the peer and vgname information is not yet available.
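As an illustration of such a layout (the types and field names are invented, not gd2's actual structs): two replica-2 sub volumes, each brick sized at half of the 1G volume, with the peer and VG fields left empty until device selection runs.

package bricksplanner

// Brick and Subvol are illustrative types only.
type Brick struct {
    PeerID string // filled in during device selection
    VgName string // filled in during device selection
    Size   uint64
}

type Subvol struct {
    Type   string
    Bricks []Brick
}

// testvol with size=1G, distribute=2, replica=2
var layout = []Subvol{
    {Type: "Replicate", Bricks: []Brick{{Size: 512 << 20}, {Size: 512 << 20}}},
    {Type: "Replicate", Bricks: []Brick{{Size: 512 << 20}, {Size: 512 << 20}}},
}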
Choose device
To satisfy the requirements of the above brick layout, iterate over the devices list and choose a device which satisfies the following requirements:
- its zone is not already used by a brick of the same sub volume;
- its zone is not already used by bricks of different sub volumes of the same volume, unless subvols-zones-overlap=true.
Many approaches can be followed for choosing the peer and device for each brick:
- Use a device until it is full before using other devices for allocating bricks. For example, if we have 10GB devices in 10 peers, after creating 10 volumes (replica 3) only 3 out of the 10 devices get filled up.
- Give priority to fresh devices while choosing a device for a brick, and allocate from used devices only when no new devices are available. Before allocating the bricks, sort the list of devices by available free size (descending order).
Gluster's intelligent volume provisioning uses the "Spread first" approach, since it distributes the bricks across the peers. Heketi uses a simple ring allocator to choose the devices and peers; more details are available here.
Once the brick layout is decided, device allocation is done in two iterations. In the first iteration, bricks are chosen from fresh devices (number of LVs = 0); if the required number and size of bricks is not satisfied, then in the second iteration bricks are chosen from the other available devices.
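A compact sketch of that two-pass selection (the Device fields are assumed):

package bricksplanner

import "sort"

// Device is an illustrative stand-in for the registered-device info.
type Device struct {
    ID     string
    Free   uint64 // available free size
    NumLVs int    // 0 means a fresh, unused device
}

// pickDevice prefers fresh devices in the first pass and falls back
// to used ones in the second, always trying the largest free size first.
func pickDevice(devs []Device, need uint64) *Device {
    sort.Slice(devs, func(i, j int) bool { return devs[i].Free > devs[j].Free })
    for pass := 0; pass < 2; pass++ {
        wantFresh := pass == 0
        for i := range devs {
            if (devs[i].NumLVs == 0) == wantFresh && devs[i].Free >= need {
                return &devs[i]
            }
        }
    }
    return nil // no device can hold a brick of this size
}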
To summarize:
- Find the volume type using the replica/disperse/arbiter counts.
- Derive the brick count and size from the volume type.
- Prefer fresh devices; allocate from used devices only if the number of bricks required is not satisfied by the previous step.
Steps involved in Brick preparation Transaction
The transaction framework of Glusterd2 will execute the following steps on the brick peers identified by the planner, including recording the brick mount in the fstab file.
If the transaction fails, rollback will try the following steps.
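A sketch of how a prepare step with rollback could be structured; the helper names and the exact step list are assumptions, not the plugin's actual transaction steps:

package smartvol

// Brick is an illustrative stand-in for the planned brick info.
type Brick struct{ Path string }

// Hypothetical step helpers; stand-ins for the real lvm/fs commands.
func createThinPool(b *Brick) error { return nil }
func removeThinPool(b *Brick)       {}
func createLV(b *Brick) error       { return nil }
func removeLV(b *Brick)             {}
func mkfsAndMount(b *Brick) error   { return nil }
func unmount(b *Brick)              {}
func updateFstab(b *Brick) error    { return nil }

// prepareBrick runs the preparation steps in order and, on failure,
// undoes the completed ones in reverse order.
func prepareBrick(b *Brick) error {
    var undo []func()
    rollback := func(err error) error {
        for i := len(undo) - 1; i >= 0; i-- {
            undo[i]()
        }
        return err
    }
    if err := createThinPool(b); err != nil {
        return rollback(err)
    }
    undo = append(undo, func() { removeThinPool(b) })
    if err := createLV(b); err != nil {
        return rollback(err)
    }
    undo = append(undo, func() { removeLV(b) })
    if err := mkfsAndMount(b); err != nil {
        return rollback(err)
    }
    undo = append(undo, func() { unmount(b) })
    return updateFstab(b)
}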
References
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/creating_arbitrated_replicated_volumes#sect-Arbiter_Requirements
https://docs.gluster.org/en/v3/Administrator%20Guide/Setting%20Up%20Volumes/#creating-dispersed-volumes
Updates: #466
Signed-off-by: Aravinda VK <avishwan@redhat.com>