
Intelligent Volume Provisioning #661

Merged
merged 1 commit into from May 31, 2018

Conversation

aravindavk
Member

@aravindavk aravindavk commented Apr 17, 2018

With this feature, Gluster can choose the bricks automatically based
on the given Volume size.

Usage

Gluster does not assume the list of devices available on each
peer. Register the devices which can be used by Gluster for
provisioning the bricks.

glustercli device add <peerid> <device>

Once the devices are registered, Gluster Volume can be created using,

glustercli volume create gv1 --size 1G

To create distributed replicate Volume, specify the distribute count as below.

glustercli volume create gv1 --size 1G --replica 3 --distribute 2

Options available with this command are,

--size                    (required) Size of Volume
--distribute              (optional) Distribute count, Default value is 1
--replica                 (optional) Replica count(Possible values: 2, 3),
                                     Default value is 0
--arbiter                 (optional) Arbiter Count(Possible values: 1)
--disperse                (optional) Disperse Count, Default value is 0
--disperse-data           (optional) Disperse Data Count, Default value is 0
--redundancy              (optional) Disperse Redundancy Count,
                                     Default value is 0
--limit-peers             (optional) List of Peer IDs, Create Volume only
                                     using peers which are given in this
                                     list, Default value is []
--limit-zones             (optional) List of Zones, Create Volume only
                                     using Peers which belong to these
                                     zones, Default value is []
--exclude-peers           (optional) Do not use these Peers for bricks,
                                     Default value is []
--exclude-zones           (optional) Do not use Peers belonging to these
                                     Zones for creating bricks,
                                     Default value is []
--snapshot-enabled        (optional) If Snapshot feature is required for
                                     the Volume, Default value is false
--snapshot-reserve-factor (optional) Additional space for Snapshots,
                                     Default value is 1
--subvols-zones-overlap   (optional) Set to true if bricks belonging to
                                     different sub volumes can be created
                                     in the same zone, Default
                                     value is false

If multiple peers belong to the same zone (for example, the same rack or
server), bricks belonging to the same volume/sub volume should not be
chosen from the same zone. To set the zone of a peer,

glustercli zone set <peerid> <zone>

Note: Zone CLI implementation is in progress, APIs are available

How it works?

Number of sub volumes and subvolume size

Find the number of sub volumes and the sub volume size using the
provided distribute count:

number_of_subvols = distribute_count
subvolume_size = size / number_of_subvols

Volume Type

Find Volume Type based on replica/disperse/arbiter count

if arbiter_count > 0 then volume_type = "Arbiter"
else if disperse_count > 0 then volume_type = "Disperse"
else if replica_count > 0 then volume_type = "Replicate"
else volume_type = "Distribute"

Brick Size

Find each Brick size, Thin pool size and number of bricks per sub
volume based on Volume Type, replica/disperse/arbiter count,
snapshot reserve factor and sub volume size.

If Volume Type is Replicate

each_brick_size = subvolume_size
thinpool_size = each_brick_size * snapshot_reserve_factor
number_of_bricks_per_subvolume = replica_count

If Volume Type is Arbiter

each_brick_size = subvolume_size
thinpool_size = each_brick_size * snapshot_reserve_factor

arbiter_brick_size = 4KB * each_brick_size / average_file_size_kb
arbiter_thinpool_size = arbiter_brick_size * snapshot_reserve_factor

If Volume Type is Disperse

each_brick_size = subvolume_size / disperse_data_count
thinpool_size = each_brick_size * snapshot_reserve_factor
number_of_bricks_per_subvolume = disperse_count

Note: Add 0.5% (max 16GB) to the thin pool size for metadata
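
A minimal Go sketch of the sizing rules above, assuming MB units; the function and constant names here are illustrative, not the actual glusterd2 implementation.

package main

import "fmt"

const (
    poolMetadataPercent = 0.005     // 0.5% of the thin pool for metadata
    poolMetadataMaxMB   = 16 * 1024 // capped at roughly 16GB, per the note above
)

// planBrickSize mirrors the formulas above (hypothetical helper).
func planBrickSize(volType string, subvolSizeMB, replicaCount, disperseCount,
    disperseDataCount uint64, snapshotReserveFactor float64) (brickMB, tpMB, bricksPerSubvol uint64) {

    switch volType {
    case "Disperse":
        brickMB = subvolSizeMB / disperseDataCount
        bricksPerSubvol = disperseCount
    default: // Replicate (and the data bricks of Arbiter)
        brickMB = subvolSizeMB
        bricksPerSubvol = replicaCount
    }

    tpMB = uint64(float64(brickMB) * snapshotReserveFactor)

    // Add 0.5% (max 16GB) to the thin pool size for metadata
    meta := uint64(float64(tpMB) * poolMetadataPercent)
    if meta > poolMetadataMaxMB {
        meta = poolMetadataMaxMB
    }
    tpMB += meta
    return brickMB, tpMB, bricksPerSubvol
}

func main() {
    // size=1000MB, distribute=2, replica=2 -> two sub volumes of 500MB each
    subvolSizeMB := uint64(1000) / 2
    b, tp, n := planBrickSize("Replicate", subvolSizeMB, 2, 0, 0, 1.0)
    fmt.Println(b, tp, n) // 500 502 2 (thin pool includes metadata headroom)
}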

Brick Layout

Construct Brick layout based on the above information. Below
example shows the brick layout of testvol with size=1G,
distribute=2 and replica=2. Note that peer and
vgname information is not yet available.

{
    "name": "testvol",
    "subvols": [
        {
            "type": "replicate",
            "bricks": [
                {
                    "peer": "",
                    "path": "/var/lib/glusterd2/mounts/testvol_s0_b0/brick",
                    "size": 500,
                    "tpsize": 500,
                    "vgname": ""
                },
                {
                    "peer": "",
                    "path": "/var/lib/glusterd2/mounts/testvol_s0_b1/brick",
                    "size": 500,
                    "tpsize": 500,
                    "vgname": ""
                }
            ],
            "replica": 2
        },
        {
            "type": "replicate",
            "bricks": [
                {
                    "peer": "",
                    "path": "/var/lib/glusterd2/mounts/testvol_s1_b0/brick",
                    "size": 500,
                    "tpsize": 500,
                    "vgname": ""
                },
                {
                    "peer": "",
                    "path": "/var/lib/glusterd2/mounts/testvol_s1_b0/brick",
                    "size": 500,
                    "tpsize": 500,
                    "vgname": ""
                }
            ],
            "replica": 2
        }
    ]
}

Choose device

To satisfy the requirements of the above brick layout, iterate the device
list and choose a device which satisfies the following:

  • The same device/peer/zone should not be used for bricks belonging to
    the same sub volume.
  • The same device/peer/zone should not be used for bricks belonging to
    different sub volumes of the same volume unless
    subvols-zones-overlap=true

Many approaches can be followed for choosing the peer and device
for each brick.

  • Fill first - fully utilize a device before picking a new device
    for allocating bricks. For example, if we have 10GB devices in 10
    peers, after creating 10 volumes (replica 3) only 3 out of the 10
    devices get filled up.
  • Spread first - more priority is given to fresh devices
    while choosing a device for a brick. Allocate from used devices only
    when no new devices are available. Before allocating the bricks,
    sort the list of devices based on available free size (descending
    order).

Gluster's intelligent volume provisioning uses the "Spread first"
approach since it distributes the bricks across the peers. Heketi
uses a Simple Ring Allocator to choose the devices and peers. More details
are available here.

Once the brick layout is decided, device allocation is done in
two iterations. In the first iteration, bricks are chosen from
fresh devices (number of LVs = 0); if the required number of bricks and
size is not satisfied, then in the second iteration bricks are
chosen from the other available devices.
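
A minimal sketch of this two-pass, spread-first selection in Go; the Device fields and the pickDevice helper are illustrative, not the actual glusterd2 types.

package allocator

import "sort"

// Device is an illustrative stand-in for the registered-device record.
type Device struct {
    PeerID   string
    Zone     string
    VgName   string
    FreeSize uint64 // available free size
    NumLVs   int    // 0 means a fresh, unused device
}

// pickDevice chooses a device with enough free space for one brick,
// skipping zones already used by the same sub volume (or by the whole
// volume, when subvols-zones-overlap is false).
func pickDevice(devices []Device, brickSize uint64, usedZones map[string]bool) *Device {
    // Spread first: prefer devices with the most free space.
    sort.Slice(devices, func(i, j int) bool {
        return devices[i].FreeSize > devices[j].FreeSize
    })

    // First iteration: fresh devices only (number of LVs = 0).
    for i := range devices {
        d := &devices[i]
        if d.NumLVs == 0 && d.FreeSize >= brickSize && !usedZones[d.Zone] {
            return d
        }
    }
    // Second iteration: any remaining device with enough free space.
    for i := range devices {
        d := &devices[i]
        if d.FreeSize >= brickSize && !usedZones[d.Zone] {
            return d
        }
    }
    return nil // allocation fails; the Volume create request is rejected
}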

To summarize,

  • Find number of sub volumes based on distribute count
  • Find Volume Type based on replica/disperse/arbiter count
  • Find number of bricks per sub volume based on Volume Type and
    replica/disperse/arbiter counts
  • Find each brick size based on Volume size, Number of Sub volumes and
    Volume type.
  • Prepare bricks layout without Peer and device information
  • First iteration, choose fresh devices for bricks
  • Second iteration, choose other devices for bricks if the number of
    bricks required is not satisfied by the previous step.
  • Pass this information to Glusterd2 Transaction to prepare the bricks.

Steps involved in Brick preparation Transaction

The Transaction framework of Glusterd2 will execute the following steps on
the identified brick peers (a command-level sketch follows the rollback list below).

  • Prepare Thin pool based on ThinPool size provided by the planner
  • Prepare Logical Volume based on the Brick size provided by the
    planner
  • Create XFS File system
  • Mount the FS and add to custom fstab file

If the Transaction fails, rollback will try the following steps

  • Unmount the Brick if mounted
  • Remove Logical Volume and Thin Pool
  • Cleanup the temp directories
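
A hedged, command-level sketch of what such a step could look like in Go. The helper names, mount options and exact lvcreate invocations here are illustrative assumptions; the actual transaction step builds its commands from the planner output.

package brickprep

import (
    "fmt"
    "os/exec"
)

func run(name string, args ...string) error {
    out, err := exec.Command(name, args...).CombinedOutput()
    if err != nil {
        return fmt.Errorf("%s %v failed: %v (%s)", name, args, err, out)
    }
    return nil
}

// prepareBrick creates a thin pool and thin LV sized from the planner,
// formats the LV with XFS and mounts it at the brick mount path.
func prepareBrick(vg, tp, lv, mountPath string, tpSizeMB, brickSizeMB uint64) error {
    // Thin pool sized from the planner's thin pool size
    if err := run("lvcreate", "-L", fmt.Sprintf("%dm", tpSizeMB), "--thin", vg+"/"+tp); err != nil {
        return err
    }
    // Thin LV for the brick sized from the planner's brick size
    if err := run("lvcreate", "-V", fmt.Sprintf("%dm", brickSizeMB), "--thin", vg+"/"+tp, "-n", lv); err != nil {
        return err
    }
    dev := "/dev/" + vg + "/" + lv
    if err := run("mkfs.xfs", "-i", "size=512", dev); err != nil {
        return err
    }
    // The real step also records this mount in a custom fstab file.
    return run("mount", "-o", "rw,inode64,noatime", dev, mountPath)
}

// rollbackBrick undoes prepareBrick on failure: unmount, then remove LV and thin pool.
func rollbackBrick(vg, tp, lv, mountPath string) {
    _ = run("umount", mountPath)
    _ = run("lvremove", "-f", vg+"/"+lv)
    _ = run("lvremove", "-f", vg+"/"+tp)
}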

References

Updates: #466
Signed-off-by: Aravinda VK avishwan@redhat.com

@aravindavk
Member Author

aravindavk commented Apr 17, 2018

To get the functionality in quickly, I haven't separated the code as a
plugin; I will work on that once the functionality is working.

Basic Test with single Glusterd2

  • Start Glusterd2

  • Create a device using,

      fallocate -l 1G dev1.img
      mknod /dev/loop0 b 7 0
      losetup /dev/loop0 dev1.img
    
  • Register device with Glusterd2 using, (Note: Change the Peer ID)

      curl -XPOST http://localhost:24007/v1/peers/45222915-26c9-4188-b09f-6708e1e35103/devices -d '{"devices":["/dev/loop0"]}'
    
  • Create a Volume using, (Default size in MB)

      curl -XPOST http://localhost:24007/v1/volumes -d '{"name":"gv1", "size": 100}'
    
  • A Replica 2/3 Volume can be created if 3 nodes are available and 3 devices are registered, using,

      curl -XPOST http://localhost:24007/v1/volumes -d '{"name":"gv1", "size": 100, "replica": 3}'
    

What is working:

  • Register a device
  • Create Volume request with size (Distribute, Replicate, Arbiter)
  • The rest of the Volume Create goodies are still available, like specifying Volume Options during create, specifying Transport Type, etc.
  • A list of Peers can be specified to consider only those peers for brick allocation
  • A list of Zones can be specified to consider only those zones for brick allocation
  • A list of Peers/Zones can be specified to exclude during allocation
  • If Zone is not specified, the Peer ID is used as the Zone
  • No two bricks of the same subvolume will be allocated on the same zone or device
  • If subvols overlap is specified, then bricks of different sub volumes may get allocated from the same zone.
  • Fresh devices get first priority, so that bricks are spread across the peers.

Issues yet to be addressed:

  • Add device cleanup - If device is already added Validation required
  • Add device cleanup - If device is not new do not accept
  • Size units uniformity (MB, GB, etc.)
  • Device list is stored as a string, which causes multiple JSON Marshal/Unmarshal calls; change this to store a list of Device structs
  • Arbiter brick size calculation; as of now it creates an arbiter brick of the same size as the replica brick
  • Update Free Size after Brick create or after Rollback
  • Consider Snapshot Factor
  • Disperse Volume Calculations
  • Auto distribute count
  • Add Tests
  • Persist brick mounts or mount required bricks before Volume start
  • LV cleanup on Volume delete
  • Expand Volume
  • Code separation as plugin

@amarts
Member

amarts commented Apr 17, 2018

Some comments without checking the code:

  1. Is this the same API end point as volume create? I feel it should be different.

  2. As heketi provides some of these functionalities, have you considered the import?

  3. The ideal goal of IVP is to have a single end point, which gives consumable storage (say gluster storage) from available 'peers'.

Further questions to follow in a few days after looking at the code.

@rishubhjain
Contributor

@aravindavk Does this code depend on #603 ?

@rishubhjain
Contributor

@aravindavk

  • Add device cleanup - If device is already added Validation required

  • Add device cleanup - If device is not new do not accept

These 2 cases are covered in #603.

@rishubhjain
Contributor

@aravindavk Also if possible could you elaborate on the algorithm you have used to select devices that are to be used to create bricks?

@aravindavk
Member Author

@aravindavk Does this code depend on #603 ?

In this PR I am using PeerID itself as zone, once #603 is merged it will be useful.

These 2 cases are covered in #603.

That is great. The only change to device struct is additional fields FreeSize and Used flag.

@aravindavk
Member Author

@aravindavk Also if possible could you elaborate on the algorithm you have used to select devices that are to be used to create bricks?

I added a lot of comments in the bricks-allocator.go file: https://github.com/gluster/glusterd2/pull/661/files#diff-88d9582fd347e90db553d8790674d7d1

I will add the design note here soon.

@aravindavk
Member Author

Some comments without checking the code:

Is this the same API end point as volume create? I feel it should be different.

Since most of the code is common, I used the same end point. As you mentioned, a separate end point makes sense; I will work on making it a different end point.

As heketi provides some of these functionalities, have you considered the import?

  • I am referring to heketi for many of these functionalities, like pvcreate/vgcreate/lvcreate/mount, etc. Unfortunately the heketi code can't be imported as is (it would involve Heketi code re-factoring).
  • Since heketi is an external entity, device states are maintained in a separate database. In Glusterd2, devices and zones information will be stored with the Peer information itself.
  • To make many decisions, heketi has to get the Gluster peer status or existing Volume status by running the gluster CLI. Now with Gd2 many of these things are not required since device status co-exists with the Peers and Volumes data.
  • A pre-calculated device list (using the Ring allocator) is not flexible when new conditions like the limit-peers and limit-zones filters are introduced. With this patch, brick allocation is done in-memory using all the device/zone/peer information available in etcd. (Intelligent volume provisioning in GD2 #466 (comment))

The ideal goal of IVP is to have a single end point, which gives consumable storage (say gluster storage) from available 'peers'.

Only Size and Volume name are required; all other fields are optional. Default values of the other fields are to be discussed (e.g. default Volume type is Replica 3).

@aravindavk
Member Author

Calculating the Arbiter brick size is well documented here: "5.8.1.2. Arbiter capacity requirements" (https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/creating_arbitrated_replicated_volumes#sect-Arbiter_Requirements). This needs to be adopted by us.
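
To illustrate the formula referenced above, a tiny Go sketch; the helper name and the 500MB/64KB example figures are illustrative, and 4KB per file is the conservative estimate from the guide.

package main

import "fmt"

// arbiterBrickSizeKB sizes the arbiter brick for the number of files the
// data brick is expected to hold: roughly 4KB of metadata per file.
func arbiterBrickSizeKB(dataBrickSizeKB, averageFileSizeKB uint64) uint64 {
    numFiles := dataBrickSizeKB / averageFileSizeKB
    return 4 * numFiles
}

func main() {
    dataBrickKB := uint64(500 * 1024) // 500MB data brick
    // 64KB average file size -> 8000 files -> 32000KB (~31MB) arbiter brick
    fmt.Println(arbiterBrickSizeKB(dataBrickKB, 64))
}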

// Create Logical Volume
cmd := exec.Command("lvcreate",
"--poolmetadatasize", "5K",
"-c", "256K",
Contributor

For all such defaults, I think we should either keep them dynamic or we should declare constants.

Member Author

These are dummy values; I will work on these values while working on SnapshotFactor.

@aravindavk
Member Author

Re-factored the code as a separate plugin. The REST API is temporarily named /v1/smartvol; this can be changed later once the endpoint name is finalized.

Note: This patch is not working as expected after the re-factor; I am working on fixing the issues to make it work. I will refresh this patch later today.

}

// Create Logical Volume
cmd := exec.Command("lvcreate",
Contributor

Can we move commands outside of the transaction, like https://github.com/gluster/glusterd2/pull/603/files#diff-ee50b0428ecdd9b14aede7e987db1b59? It makes it easier to handle commands and makes our code more modular.

Member Author

Nice, all lv/vg commands can be moved to this file itself. Can we rename that package as deviceutils ?

I will move all device commands to that file once that is merged.

@rishubhjain
Contributor

@aravindavk I think we need to handle the case where the free size of a device is different from the actual free size (if the user manually uses the device). In case of a failure to create a brick due to "not enough space", we can resync the free size of that device (implicitly, i.e. without asking the user to do it, like it's done in heketi).

@aravindavk
Member Author

@aravindavk I think we need to handle the case where the free size of a device is different from the actual free size (if the user manually uses the device). In case of a failure to create a brick due to "not enough space", we can resync the free size of that device (implicitly, i.e. without asking the user to do it, like it's done in heketi).

Valid concern. Volume Create performance may get affected by this if we resync devices before Volume create (if the cluster has many devices/peers).

In case of a failure to create a brick due to "not enough space", we can resync the free size of that device

Resyncing only that device is not sufficient, since we need to reallocate the bricks again with updated information from all devices. If we resync only that device, then the allocator will fail if two devices are manually consumed outside Glusterd2.

For accurate (consistent) results, the first Transaction step can refresh all devices, and planBricks should be converted into a Transaction Step (second step). Note: it can still fail if a device is manually consumed while the Volume Create request is in progress.

@aravindavk
Member Author

Another issue we need to solve is concurrent Volume create requests. All can see the available size and allocate accordingly, but some Volume Creates may fail.

@aravindavk
Member Author

Created a small visualization for calculating the Arbiter brick size. If the workload is very small files, we do not get much size advantage with the Arbiter feature.

http://aravindavk.in/sandbox/gluster-arbiter-brick-size.html

cc: @itisravi

@rishubhjain
Contributor

rishubhjain commented Apr 19, 2018

Volume Create performance may get affected by this if we resync devices before Volume create (if the cluster has many devices/peers).

Resync is only needed if a mismatch is detected, like a failure to create a brick due to not enough memory.

Resyncing only that device is not sufficient, since we need to reallocate the bricks again with updated information from all devices. If we resync only that device, then the allocator will fail if two devices are manually consumed outside Glusterd2.

This will lead to performance problems, but we can make the code intelligent so as to avoid this: resyncing only a few devices while the synchronous transaction is going on and then resyncing the rest asynchronously.

Another issue we need to solve is concurrent Volume create requests. All can see the available size and allocate accordingly, but some Volume Creates may fail.

Can we solve this using locks on devices?

@aravindavk
Member Author

Resync is only needed if a mismatch is detected, like a failure to create a brick due to not enough memory.

A mismatch will be detected while running the transaction function, which runs on one node. Resync has to be done on every node where devices are registered. We need more Transaction steps, as below.

1. Plan Bricks Layout (One node)
2. PrepareBricks (All identified Volume nodes)
3. ResyncIfFailureInPrevStep (All Peer nodes)
4. Plan Bricks Layout Again (One node)
5. PrepareBricks (All identified Volume nodes)

We do not have a way to skip a Step function based on the previous step's status. The 3rd step will reach all peer nodes and return (in both the success and failure cases).

This will lead to performance problems, but we can make the code intelligent so as to avoid this, resyncing only a few bricks while the synchronous transaction is going on and then resyncing the rest asynchronously.

Possible, with added complexity. For correctness, we can do the following

1. Resync devices (All Peer nodes)
2. Plan Bricks Layout (One node)
3. Prepare Bricks (All participating nodes)
4. Other Create Volume steps 

Can we solve this using locks on devices?

This is a generic issue, even in the case of a regular Volume create: if two Volume creates specify the same brick path, only one will win. The lock framework is not yet capable of handling multiple names; in this case a lock is required on the Volume name and on all the devices.

@itisravi
Member

itisravi commented Apr 19, 2018

Created a small visualization for calculating the Arbiter brick size. If the workload is very small files, we do not get much size advantage with the Arbiter feature.

http://aravindavk.in/sandbox/gluster-arbiter-brick-size.html

4KB is only to be on the safe side. If each file's data is less than 4KB, (i.e. du <filename> is 4KB, which is the minimum XFS allocates even if only 1 byte is written to the file), then the arbiter brick will consume 0KB for disk usage. In that case the arbiter brick size is only limited by the no. of inodes that can be created on that XFS partition.

@obnoxxx
Contributor

obnoxxx commented Apr 24, 2018

Hi all, I'd like to understand how this relates to the project of bringing the heketi functionalities into gd2. We kind of agreed on that project, also need to make sure we don't introduce any big incompatibilities on how heketi works in general (from the consumer perspective). Hence trying to understand how this relates.

@aravindavk
Member Author

Hi all, I'd like to understand how this relates to the project of
bringing the heketi functionalities into gd2. We kind of agreed on
that project, also need to make sure we don't introduce any big
incompatibilities on how heketi works in general (from the consumer
perspective). Hence trying to understand how this relates.

This patch implements the following functionalities:

  • REST API route and Handler - Gd2 already has a framework to add a REST
    API route and handler function, so we can't reuse Heketi code for this
    functionality. This REST API handler reuses Transaction steps
    from Glusterd2's regular Volume Create.
  • Prepare Bricks Transaction step - Heketi uses SSH command
    execution for this; Glusterd2 has an internal Transaction framework
    which is already capable of running commands on the required nodes of
    the Cluster.
  • Allocator based on available Vgs and Size - the allocator is made
    pluggable, so that it can be replaced if the Heketi allocator becomes
    available to use as a library. We need not rewrite other parts of the
    code if we change the allocator.

The only overlapping component I see is the Allocator. Since the patch now
implements the REST handler and brick preparation steps, we can now
spend more time in choosing the allocator / reusing the Heketi
allocator / improving this allocator by adding missing features from
Heketi.

Note: This allocator has some extra features like handling distribute
count, limiting peers or zones for allocation, etc. We would also need to
implement these features in Heketi if we finalize on using the Heketi
allocator.

Some notes about Allocator: #466 (comment)

route.Route{
Name: "SmartVolumeCreate",
Method: "POST",
Pattern: "/smartvol",
Member

This should not exist. The same URL should serve non-smart and smart volume create.

Member Author

The URL name can change, but why do you think we should have a single API?

Member

For the end feature, I agree that this shouldn't be a separate endpoint. But I wouldn't mind this for the initial development and testing while we finalize the smart provisioning design.
Also, we don't yet support plugins adding middlewares, so it's not possible right away.

Contributor

I think the original thought was to have smart and classical volume creation under the same API and just trigger the classical or smart creation based on how much of the possible input parameters are set.

If we are sticking with a separate url, we should consider making the API as heketi-compatible as possible, since heketi is the smart volume create api for gluster that exists out there, and hundreds of users are consuming it via kubernetes provisioners. The more we deviate from what heketi offers, the harder it will be to do the switch once gd2 is ready to take over in the space.

Contributor

hundreds of users are consuming it via kubernetes provisioners

IIUC, it is probably this: User --> KubeProvisioner --> Heketi --> Glusterd

But are there users consuming heketi API directly ?

If kube provisioner is the primary place of heketi API consumption, the end users will not be affected if kube provisioner is modified to use new APIs.

Contributor

This is correct. It is just a few code places. Not aware of other direct consumers of the API. But the experience on the smoothness and the feature compatibility will have an impact on that large user base.

@obnoxxx
Contributor

obnoxxx commented Apr 25, 2018

Hi all, I'd like to understand how this relates to the project of bringing the heketi functionalities into gd2. We kind of agreed on that project, also need to make sure we don't introduce any big incompatibilities on how heketi works in general (from the consumer perspective). Hence trying to understand how this relates.

This patch implements the following functionalities:

  • REST API route and Handler - Gd2 already has a framework to add a REST API route and handler function, so we can't reuse Heketi code for this functionality. This REST API handler reuses Transaction steps from Glusterd2's regular Volume Create.
  • Prepare Bricks Transaction step - Heketi uses SSH command execution for this; Glusterd2 has an internal Transaction framework which is already capable of running commands on the required nodes of the Cluster.
  • Allocator based on available Vgs and Size - the allocator is made pluggable, so that it can be replaced if the Heketi allocator becomes available to use as a library. We need not rewrite other parts of the code if we change the allocator.

The only overlapping component I see is the Allocator. Since the patch now implements the REST handler and brick preparation steps, we can now spend more time in choosing the allocator / reusing the Heketi allocator / improving this allocator by adding missing features from Heketi.

Hmmm, I think there's more overlap than this:

Here's what heketi implements as components of high-level volume management:

  • The RESTful API
  • The allocator (now called placer) that identifies possible places for bricks for a volume request.
  • The disk management piece that initializes the disks and carves out the bricks.
  • The data store that stores heketi's internal state.

The data store needs to be replaced by gd2's internal state store (based on etcd, afaik). And the API will need to be merged with the gd2 API, ideally in a way that's largely equivalent. But this can imho be the very last step rather than the first one. The placer can be re-used as a package at some point (we're almost there), but I think also the disk management code in heketi has a lot of value to add since it's production proven and has a lot of lessons learned implemented. Sure it uses ssh or kube/oc exec in the backend, but that's just a detail. We may not refactor the disk mgmt code in heketi enough to re-use it as a package (since that might not be worth the effort), but the logic pieces with all the rollback code and defers etc should at the very least be taken as a guideline. Ideally at some point we could copy code over (including the test cases!) and continue maintaining it in gd2 from that point on. Remember that untested code is broken code. :-)

Note: This allocator has some extra features like handling distribute count, limiting peers or zones for allocation, etc. We would also need to implement these features in Heketi if we finalize on using the Heketi
allocator.

The heketi allocator (placer) determines the distribute count for us. The main point of intelligent volume provisioning in heketi is that we only get the requested durability type of the volume (replica-3, disperse 4+2, etc...) in the request, and the size, and the distribution is handled by heketi. We could make it an optional parameter at some point to require a specific distribution count, if anyone needs this. I think the first and foremost goal is to get to feature parity with the current heketi model, which is how intelligent volume provisioning with Gluster is used out there, instead of creating a new functionality that does not quite match. And we can add more features or aspects later on.

Let me explain some more on the 'phased approach' that I was mentioning. This is an approach with the goal to bring high-level volume provisioning and other features from heketi into gd2 for broader consumption, so that eventually heketi will not be there any more (at least for these features), while keeping the current consumers of heketi functionality working.

  1. Switch heketi over to use gd2 in the backend, as a(n optional) first step using the gd2 cli for low-level volume management. This will give us an end-to-end test vehicle (ci) pretty much in a few days, optionally including the kubernetes consumers.
  2. Switch heketi to use gd2 via RESTful apis of the low level volume mgmt features.
  3. Further refactor heketi code to abstract out the data store, and make allocator(placer) and disk mgmt code reusable.
  4. Further improve gd2 so that it reaches stability and feature parity with gd1 as far as the current heketi use cases are concerned at least. (Some stuff like self-heal and rebalance needs work at least.)
  5. Move functionalities over from heketi into gd2. This will establish the high level APIs for volume creation in gd2, and at first keep the heketi RESTful endpoints just as a thin shim layer.
  6. Finally switch the consumers (e.g. kubernetes provisioner) over to using gd2 instead of heketi.

Here, step 1 might be seen optional, but it would give us a quick start with little effort. Steps 3 and 4 can be done in parallel to each other and to steps 1 and 2 (they already have started). Only step 5 will start adding the new REST apis for intelligent volume management to gd2.

The benefits of this approach as I see it would be to

  • keep working stuff working all the time, and have a ci for it from day 1.
  • benefit from the rather well-tested code and the lessons learned in heketi.
  • make sure we only land the new RESTful apis in the official code once the api and the implementation are what we want them to be.

If you want to start testing the stuff internally from the gd2 perspective, we can implement test cases that use internal functions, instead of implementing restful apis and doing e2e tests here. We do have the luxury of taking this more careful approach, since we have the heketi rest endpoints available for e2e testing.

More than happy to discuss this in greater detail as we move along!

I will also look at the patch in more detail. And I am sure there's a lot of goodness in it, but I wanted to prevent having it merged in a hurry, but instead make sure it is aligned with the experiences from heketi and with the broader plan (at least if we can agree to that).

@aravindavk @atinmu @prashanthpai @kshlm @rishubhjain @raghavendra-talur @phlogistonjohn @Madhu-1 ....

Some notes about Allocator: #466 (comment)

@aravindavk
Member Author

Hmmm, I think there's more overlap than this:

Overlap I mentioned wrt to reusability of the code as library. except allocator/placer rewrite is required for all the other parts of this feature.

The data store needs to be replaced by gd2's internal state store (based on etcd, afaik). And the API will need to be merged with the gd2 API, ideally in a way that's largely equivalent. But this can imho be the very last step rather than the first one. The placer can be re-used as a package at some point (we're almost there), but I think also the disk management code in heketi has a lot of value to add since it's production proven and has a lot of lessons learned implemented. Sure it uses ssh or kube/oc exec in the backend, but that's just a detail. We may not refactor the disk mgmt code in heketi enough to re-use it as a package (since that might not be worth the effort), but the logic pieces with all the rollback code and defers etc should at the very least be taken as a guideline. Ideally at some point we could copy code over (including the test cases!) and continue maintaining it in gd2 from that point on. Remember that untested code is broken code. :-)

Glusterd2 already has a data store to store the Volume details and device details (Device APIs are already part of Gd2). The disk management code cannot be reused as is from Heketi; Glusterd2 already has a Transaction framework and we should use that. The Transaction framework already supports rollback functionality. I agree that best practices from Heketi about disk management should be considered while rewriting the disk management code in Glusterd2.

The heketi allocator (placer) determines the distribute count for us. The main point of intelligent volume provisioning in heketi is that we only get the requested durability type of the volume (replica-3, disperse 4+2, etc...) in the request, and the size, and the distribution is handled by heketi. We could make it an optional parameter at some point to require a specific distribution count, if anyone needs this. I think the first and foremost goal is to get to feature parity with the current heketi model, which is how intelligent volume provisioning with Gluster is used out there, instead of creating a new functionality that does not quite match. And we can add more features or aspects later on.

Distribute count is already an optional input. Except for Size, all other inputs are optional. I already mentioned automatic distribute count in "issues yet to be addressed" here: #661 (comment)

Let me explain some more on the 'phased approach' that I was mentioning. This is an approach with the goal to bring high-level volume provisioning and other features from heketi into gd2 for broader consumption, so that eventually heketi will not be there any more (at least for these features), while keeping the current consumers of heketi functionality working.

I agree the phased approach helps Heketi to provide the same user experience and not break the existing behavior. Sure, Heketi can consume the default Volume Create REST API for its use case, and all the device management can be managed by Heketi itself. But the Glusterd2 project, as an upstream goal, should provide a smart volume create API as well. Since the APIs are independent, it will not affect Heketi functionality/consumers in any way. Once the Glusterd2 smart volume Create API becomes stable, Heketi can consider starting to use it.

The benefits of this approach as I see it would be to

keep working stuff working all the time, and have a ci for it from day 1.
benefit from the rather well-tested code and the lessons learned in heketi.
make sure we only land the new RESTful apis in the official code once the api and the implementation are what we want them to be.

If you want to start testing the stuff internally from the gd2 perspective, we can implement test cases that use internal functions, instead of implementing restful apis and doing e2e tests here. We do have the luxury of taking this more careful approach, since we have the heketi rest endpoints available for e2e testing.

Note: the behavior of REST end points is different in Glusterd2 and Heketi. In Glusterd2 all the APIs are synchronous right now; we may start working on async APIs but it is not a high priority at the moment. I agree we can explore the possibility of reusing heketi's test cases for this feature.

We also need Glusterd2 e2e tests, since we need to uncover bugs from other parts of the code (REST handler/Transaction framework), including the allocator/placer.

More than happy to discuss this in greater detail as we move along!

Thanks

I will also look at the patch in more detail. And I am sure there's a lot of goodness in it, but I wanted to prevent having it merged in a hurry, but instead make sure it is aligned with the experiences from heketi and with the broader plan (at least if we can agree to that).

I am still working on some todos and tests to make this usable.

@aravindavk
Member Author

Split the patch into two for easy review. The first patchset contains the re-factor and the deviceutils library; the second patch contains the smartvol plugin code. I will add tests and other fixes as a new patchset.


@@ -0,0 +1,29 @@
package bricksplanner
Contributor

I'm slightly confused by this layout. That's probably due to my background on how heketi treats these concepts. At that high-level /smart view of volume creation, heketi specifies not a volume type but a "durability type"; this can for instance be replicate (e.g. with replica count 3), or replicate with arbiter, or disperse (with counts 4,2). Then the distribution factor is just how many copies of such brick-sets of the durability type you have. (So a volume of (durability) type "distribute" does not exist; the "none" durability is just the special case of replica 1 and does not require code of its own.)

For me at least this view and layout adds a lot of clarity. :-)

Contributor

I am also always confused by what a "subvolume" is. It seems this is what heketi calls bricks-set, i.e. the smallest unit of bricks for the given durability type, correct?

Member

The word 'subvolume' in the gluster world comes from its volfile grammar. Check any file ending with .vol in /var/lib/glusterd/vols/$volname/, and you can see a definition which looks something like below:

volume patchy-dht
    type cluster/distribute
    option lock-migration off
    subvolumes patchy-client-0 patchy-client-1 patchy-client-2 patchy-client-3 patchy-client-4
end-volume

In the above, notice the use of 'subvolumes'. This is the one line which establishes the relations in gluster's translator graph, like where to place the given translator, etc. While most of the translators have just 1 subvolume, the cluster/* translators can have more than 1 subvolume.

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

Let me follow up on some previous discussion in this PR that somehow died away.

@aravindavk wrote:

@obnoxxx wrote:

Let me explain some more on the 'phased approach' that I was mentioning. This is an approach with the goal to bring high-level volume provisioning and other features from heketi into gd2 for broader consumption, so that eventually heketi will not be there any more (at least for these features), while keeping the current consumers of heketi functionality working.

I agree the phased approach helps Heketi to provide the same user experience and not break the existing behavior. Sure, Heketi can consume the default Volume Create REST API for its use case, and all the device management can be managed by Heketi itself. But the Glusterd2 project, as an upstream goal, should provide a smart volume create API as well. Since the APIs are independent, it will not affect Heketi functionality/consumers in any way. Once the Glusterd2 smart volume Create API becomes stable, Heketi can consider starting to use it.

I think you are missing my main point here: My main point is not to help heketi but to help gd2! And to help those current consumers of heketi that will become consumers of gd2. I am not mentioning plugging gd2 into heketi in order to release it but as a test vehicle for testing gd2.

There are hundreds of users of the current kubernetes->heketi->gluster stack, and I think we should make the later transition to a kubernetes->gd2 stack as safe and seamless as possible, also limiting our work in adapting the code and fixing bugs that we had already fixed in heketi. Hence I am so adamant about this phased approach.

Let me get very general and state what I see as the main tasks and responsibilities for our GD2 work, in order of priority:

  1. Get to a state where gd2 has feature parity with gd1 and is a solid stable replacement.
  2. Integrate heketi functionalities into gd2, so that gd2 can become a stable replacement for the heketi/gd1 stack for our users in kubernetes land. (There are hundreds!)
  3. Add more features.

As part of point 2 I think we should make APIs and features as compatible as possible.

Note: the behavior of REST end points is different in Glusterd2 and Heketi. In Glusterd2 all the APIs are synchronous right now; we may start working on async APIs but it is not a high priority at the moment.

This, imho, is just a detail and not an inhibitor for trying to be as API compatible as possible and reasonable (wrt URLs and payloads).

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

Given all the points I made above, and also based on what @raghavendra-talur and @phlogistonjohn commented per review, I would really prefer and recommend not to rush this patch into the code for gluster 4.1. I think it will give a wrong or at least confusing sign to the community.

I would much rather have 4.1 concentrate on bringing stability and gd1 feature parity a little further. And we can move forward in master with this very soon after 4.1 is branched off. This way we can bring these features in much more carefully and have them in a good state for 4.2.

My 2 cents, @aravindavk @kshlm @prashanthpai @rishubhjain @raghavendra-talur @phlogistonjohn @Madhu-1 @ansiwen @amarts ...

@aravindavk
Member Author

Thanks @raghavendra-talur for the review.

** the api needs to be merged with other volume create api and you have agreed to it and made a note for the same, so it is all good.

I am still not convinced why we need to merge it into a single API? The CLI is already common for both types of Volume creation. A separate API avoids a lot of confusion in API documentation/debuggability.

** Currently the device info is stored/iterated as part of node info and later we would want to have a easy way to migrate a device from a node to another ; so it might mean a little change in the structs there.

Not directly related to this PR. @rishubhjain already added PeerID to the device struct, and as part of the device management APIs (#728), separating it out of the peer is planned.

** brick struct needs to be more modular if we want to support other type of bricks, like ext4 bricks, dir path etc.

Sure, noted. But can be future PRs

** loop that picks device for bricks will need changes/replacement when we start integrating allocator/placer from heketi.

Sure, but can be future PRs

I would prefer to fix issues mentioned above and merge after that. But if we have to merge, then we should agree to not freeze the rest URLs or the API structs with this PR.

I already mentioned the advantage of having a separate URL. I feel the issues mentioned above are not blockers for merging.

@aravindavk
Member Author

Thanks @obnoxxx for the review.

I think the original thought was to have smart and classical volume creation under the same API and just trigger the classical or smart creation based on how much of the possible input parameters are set.

As I commented earlier, having the same API increases the complexity for end users.

If we are sticking with a separate url, we should consider making the API as heketi-compatible as possible, since heketi is the smart volume create api for gluster that exists out there, and hundreds of users are consuming it via kubernetes provisioners. The more we deviate from what heketi offers, the harder it will be to do the switch once gd2 is ready to take over in the space.

I'm slightly confused by this layout. That's probably due to my background on how heketi treats these concepts. At that high-level /smart view of volume creation, heketi specifies not a volume type but a "durability type"; this can for instance be replicate (e.g. with replica count 3), or replicate with arbiter, or disperse (with counts 4,2). Then the distribution factor is just how many copies of such brick-sets of the durability type you have. (So a volume of (durability) type "distribute" does not exist; the "none" durability is just the special case of replica 1 and does not require code of its own.)

In Gluster, a Subvolume is each distribution group. Gluster is capable of having a Distribute subvolume inside a distribution group (e.g. Tiering uses DHT inside DHT). This API also does not accept a Volume type directly; it finds the type based on inputs like the replica/disperse/arbiter count.

(So a volume of (durability) type "distribute" does not exist.

This is because the API supports specifying the distribution count and can create a pure distribute Volume.

For me at least this view and layout adds a lot of clarity. :-)

With Glusterd2, flexibility is provided in the Volume create request to accept a list of sub volumes instead of just accepting a list of bricks (https://github.com/gluster/glusterd2/blob/master/doc/quick-start-user-guide.md#create-a-volume). This layout helps in cleaner Volfile generation and gives control over individual subvolumes (for example: replacing a Subvolume, supporting a mix of sub volume types, etc.).

The same layout is reused here since, after planning and preparing the bricks, the rest of the steps are common to both Volume creation APIs.

I think you are missing my main point here: My main point is not to help heketi but to help gd2! And to help those current consumers of heketi that will become consumers of gd2. I am not mentioning plugging gd2 into heketi in order to release it but as a test vehicle for testing gd2.

There are hundreds of users of the current kubernetes->heketi->gluster stack, and I think we should make the later transition to a kubernetes->gd2 stack as safe and seamless as possible, also limiting our work in adapting the code and fixing bugs that we had already fixed in heketi. Hence I am so adamant about this phased approach.

In my comment, I also agreed on the phased approach for the final product.

Let me get very general and state what I see as the main tasks and responsibilities for our GD2 work, in order of priority:

Get to a state where gd2 has feature parity with gd1 and is a solid stable replacement.
Integrate heketi functionalities into gd2, so that gd2 can become a stable replacement for the heketi/gd1 stack for our users in kubernetes land. (There are hundreds!)
Add more features.

I agree to these points for product. If we plan to integrate Volume provisioner with Heketi in final phase we will not be able to uncover many issues related to this. Since the API is independent of other Volume creation API, the phased approach for product is not affected.

As part of point 2 I think we should make APIs and features as compatible as possible.

Given all the points I made above, and also based on what @raghavendra-talur and @phlogistonjohn commented per review, I would really prefer and recommend not to rush this patch into the code for gluster 4.1. I think it will give a wrong or at least confusing sign to the community.

Added my response above.

I would much rather have 4.1 concentrate on bringing stability and gd1 feature parity a little further. And we can move forward in master with this very soon after 4.1 is branched off. This way we can bring these features in much more carefully and have them in a good state for 4.2.

Glusterd2 is experimental in the Gluster 4.1 release. To make this feature stable for the 4.2 release, we would like this feature to be available in 4.1 because integration POCs (Ansible/Tendril integration) prefer a released branch over master, and this will also help us get feedback from the Gluster community.

@amarts
Member

amarts commented May 31, 2018

I like the conversations on the patch; looks like we need to continue the momentum.

I would much rather have 4.1 concentrate on bringing stability and gd1 feature parity a little further. And we can move forward in master with this very soon after 4.1 is branched off. This way we can bring these features in much more carefully and have them in a good state for 4.2.

Glusterd2 is experimental in the Gluster 4.1 release. To make this feature stable for the 4.2 release, we would like this feature to be available in 4.1 because integration POCs (Ansible/Tendril integration) prefer a released branch over master, and this will also help us get feedback from the Gluster community.

The above is the very thought which crossed my mind. I see that GD2 is still experimental with the 4.1 glusterfs release, and hence it would allow us to get more feedback if we push the experimental features with it rather than adding them to the stable version. That way, as developers, we can uncover most issues in the core part of the product, and it also allows us time to bring in changes incrementally.

I recommend merging this feature for 4.1 with a statement saying it is experimental, and also that the APIs are not finalized.


About having the same API endpoint for 'volume create' (the old way) and the new 'get me a volume': Aravinda started the first version of the patch with a single API, but I am the one who asked for keeping them separate. The idea is, I don't want to confuse the 95% of users who would use only the new APIs (i.e., dynamically create the volume with a given size) with the 5% of power/expert developers who would want to play with the old API for advanced features.

We can debate more on this in coming days


If I go through the reviews on the latest code, I don't see any technical blockers for the patch, other than calling out that the API is not final, and some internal code cleanups; hence I request to go ahead with merging this feature, mainly as it also has CLI functionality and test cases added to it, which means users can actually consume this feature and give feedback.

@ShyamsundarR from the release point of view, what do you think?

@atinmu
Contributor

atinmu commented May 31, 2018 via email

@atinmu
Contributor

atinmu commented May 31, 2018 via email

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

@aravindavk wrote:

@raghavendra-talur wrote:

** the api needs to be merged with other volume create api and you have agreed to it and made a note for the same, so it is all good.
I am still not convinced why we need to merge it into a single API?

I don't think we need it for any technical reason. But it was what @raghavendra-talur said was the original idea behind it. (See Talur's presentation at gluster summit in Prague 2017.) It seems this needs to be discussed.

The CLI is already common for both types of Volume creation. A separate API avoids a lot of confusion in API documentation/debuggability.

I do agree that the common CLI may be the more important thing, and it's good that it is there. For the API, I don't see a big source of confusion for a consumer. It can be very natural for a REST API to have the same endpoint with fewer or more input parameters provided, triggering different code paths like smart or basic volume creation. Having separate APIs/URLs mainly keeps the code more separate. Doing it this way now and changing it later may imply more work later if we choose to change it.

==> There seems to be a need to get to consensus here.

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

@aravindavk writes:

In my comment, I also agreed on the phased approach for the final product.

That is true, you wrote this, but I am still confused: This PR is basically taking one of the last steps of the phased approach and rushing it as the first step, essentially reversing the approach. :-)

Let me get very general and state what I see as the main tasks and responsibilities for our GD2 work, in order of priority:

  1. Get to a state where gd2 has feature parity with gd1 and is a solid stable replacement.
  2. Integrate heketi functionalities into gd2, so that gd2 can become a stable replacement for the heketi/gd1 stack for our users in kubernetes land. (There are hundreds!)
  3. Add more features.

I agree to these points for product.

I am not talking about product but general technology / upstream here.

If we plan to integrate Volume provisioner with Heketi in final phase we will not be able to uncover many issues related to this.

I don't understand. First of all, we don't plan to integrate a volume provisioner (of gd2?) into heketi, but move heketi's volume provisioner into gd2. The phased approach has the integration of gd2 into heketi for a test setup as one of the first steps not as a final step. In the final situation, heketi won't be there any more.

This early step is meant to help catch errors, to give early access to an additional way of testing and validating the work we do in gd2. And to make sure our current consumers of the smart volume privisioner keep working.

I don't see how integrating gd2 into heketi would prevent us from uncovering issues, since it would only be an additional source of tests and verifications...

Since the API is independent of other Volume creation API, the phased approach for product is not affected.

Creating the smart volume api is part of the phased approach and rather one of the later steps than the first step, so I don't understand this statement.

@aravindavk
Member Author

I don't think we need it for any technical reason. But it was what @raghavendra-talur said was the original idea behind it. (See Talur's presentation at gluster summit in Prague 2017.) It seems this needs to be discussed.

If I remember right, the presentation was about using Heketi as Middleware in Glusterd2. We already discussed about it here #466 (comment)

@kshlm
Member

kshlm commented May 31, 2018

@aravindavk About having a single end-point for creating volumes.

Our end-goal is to have a single end-point for volume creation. The expectation would be for GD2 to create and give out volumes when requested. The default should be to expect the least amount of detail from the users for creating the volume, with volume size being the only thing required if nothing else is provided. But the API also needs to allow the user to provide as many details as possible, such that they are able to manually lay out the volume.

Our aim should be to provide an automatic create API that supports manual customisation.

Right now, as we do this initial development, we are doing it the other way round. We have a manual API, on top of which we're building the automatic API. So at the moment, it makes sense (to me at least) to have development happen as a separate API. But we should remember that the end-goal is to have a single API.

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

I don't think we need it for any technical reason. But it was what @raghavendra-talur said was the original idea behind it. (See Talur's presentation at gluster summit in Prague 2017.) It seems this needs to be discussed.

If I remember right, the presentation was about using Heketi as Middleware in Glusterd2.

Not really. It was (as I remember) about implementing heketi-like smart volume creation, including disk management, in gd2 as a plugin or middleware, extending the existing volume create API.

@atinmu
Contributor

atinmu commented May 31, 2018

@obnoxxx My suggestion (and request) would be to think forward instead of going backward, and I believe we are all working in a phased approach. The action item that both the GD2 and Heketi development teams took was to debate and discuss the pros and cons of reimplementing the dynamic volume provisioning in the GD2 layer vs reusing certain heketi libraries (by refactoring) and using them as modules. AFAIK, Aravinda did provide his perspective on the benefits of having the implementation done at the GD2 layer instead of reusing heketi's libraries, as certain things may not need to be reused at all. @aravindavk, I request you to provide the link to that discussion. Along the same lines, I haven't seen any objections to that proposal from @raghavendra-talur or @phlogistonjohn. To me, that is when phase 1 already started. The next phases would be collaborating on these pieces to polish and enhance them and to get to a solid implementation which can provide all the required user experience (be it for a standalone Gluster user or a k8s user). We can start feeding API requests from Heketi to the core GD2 APIs in parallel to test functionality and stability, which can stay an independent task, but that doesn't block us from testing the IVP APIs in parallel as well.

Now that the basic implementation is already in place, and given GD2 continues to be tech-preview in GlusterFS 4.1, I don't see any reason to hold this PR. All the comments can be looked at and addressed in subsequent PRs. Also, merging the APIs into a single one may get us into a lot of rework, so that is not something I'd agree with.

So overall, if we do have any design issues which need to be discussed for this PR, I'm happy to get invested; otherwise I'm not sure why that consensus can't be in place.

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

@amarts wrote:

@aravindavk wrote:

@obnoxxx wrote:

I would much rather have 4.1 concentrate on bringing stability and gd1 feature parity a little further. And we can move forward in master with this very soon after 4.1 is branched off. This way we can bring these features in much more carefully and have them in a good state for 4.2.

Glusterd2 is experimental in the Gluster 4.1 release. To make this feature stable for the 4.2 release, we would like this feature available in 4.1, because integration POCs (Ansible/Tendril integration) prefer a released branch over master, and also this will help us to get feedback from the Gluster community.

The above is the very thought which crossed my mind. I see that GD2 is still experimental with the 4.1 glusterfs release, and hence it would allow us more feedback if we push the experimental features with it rather than adding them to the stable version. That way, as developers, we can uncover most issues in the core part of the product, and it also allows us time to bring in changes incrementally.

Yes, gd2 is experimental in 4.0 and will remain experimental in 4.1. For this reason, and given the current state of gd2 stability, I doubt we will get a lot of feedback from the community. I don't even see a lot of core gluster developers using or considering gd2. Experience from other projects shows me that this feedback from the community is oftentimes a false hope. Hence I think putting even more experimental features in for this release (and thereby risking further destabilization and future trouble making changes) instead of concentrating on stabilization is the wrong move, imho.

I think we should aim for 4.2 for gd1 feature parity and stability, and for feature completeness of the heketi-level features. Whether those can be stable by then is not certain now. We can try, but there is no guarantee. Whether merging this PR now instead of after 4.1 will speed it up or slow it down is not 100% clear; I expect it to have a slowdown effect rather than a speedup, since once code is released, progress is slower. But as always, I may be wrong. :-)

Also, here is how I see gluster: while heketi is not part of gluster releases, to me it belongs to gluster as the high-level service interface for volume management and day-2 operations, implementing functionality that was originally planned for glusterd2 and should now move back into glusterd2. Heketi has been around for more than 2 years now, and it is used mostly in the kubernetes dynamic provisioning world. So from a core gluster perspective, heketi is in a similar but worse situation than gd2, in that it is largely ignored. But there is a part of the gluster community out there that uses IVP already today, and has for a while. My point is to try and involve this existing part of the community and infrastructure as we move the features into gd2!

I recommend merging this feature for 4.1 with a statement saying it is experimental and that the APIs are not finalized.

In the end it's your call, @amarts, @atinmu, @kshlm. It can be done, and it's not the end of the world. I still don't see a good reason to do it, but I do see reasons to hold it.

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

@atinmu writes:

@obnoxxx My suggestion (and request) would be to think forward instead of going backward

Not sure how I'm thinking backward...

I think I'm thinking very much forward, in that I'm trying to help bring gd2 out of ivory-tower development and into the real world, helping us get it tested against real-world consumer scenarios, thereby keeping us honest and giving us direction on where to move.

and I believe we are all working in a phased approach.

That's probably true, but everyone seems to have very different-looking phased approaches in mind. ;-)

The action item that both the GD2 and Heketi development teams took was to debate and discuss the pros and cons of reimplementing the dynamic volume provisioning in the GD2 layer vs reusing certain heketi libraries (by refactoring) and using them as modules.

Yeah, ok, if you're doing a retrospective: in parallel to this discussion, this PR was already started. That's not a problem; as a PoC it gives us something more concrete as the basis for the discussion. But while I had the impression that the outcome of the discussion was that we agreed to re-use as much as possible from the heketi world, either directly (as packages/libs) or indirectly, this PR is now already proposed for inclusion in the release.

(Maybe I got the impression of the outcome of the discussion because I'm biased and that's what I want the outcome to be, but also because in all discussions we had, I talked and talked and wrote and wrote to explain my motivations and approaches, but only got very few responses... ;-) )

Anyways, to me it is somewhat unfortunate that some of those fundamental discussions only happened in a very direct and honest way in this PR and not earlier.

AFAIK, Aravinda did provide his perspective on the benefits of having the implementation done at the GD2 layer instead of reusing heketi's libraries, as certain things may not need to be reused at all. @aravindavk, I request you to provide the link to that discussion.

Yes, please.

We can start feeding API requests from Heketi to the core GD2 APIs in parallel to test functionality and stability, which can stay an independent task, but that doesn't block us from testing the IVP APIs in parallel as well.

I don't see any point in having heketi talk to gd2 core APIs once this PR has been merged. In my original proposal we would have done that first and then gradually implemented IVP features in gd2, taking over much of heketi and replacing the code in heketi to talk to the IVP APIs in gd2, with heketi becoming a thinner and thinner shim layer until it's gone.

Now that the basic implementation is already in place, and given GD2 continues to be tech-preview in GlusterFS 4.1, I don't see any reason to hold this PR. All the comments can be looked at and addressed in subsequent PRs. Also, merging the APIs into a single one may get us into a lot of rework, so that is not something I'd agree with. So overall, if we do have any design issues which need to be discussed for this PR, I'm happy to get invested; otherwise I'm not sure why that consensus can't be in place.


Now let's bring this discussion to a close.

As I said in the other comment, eventually it's your call. If you want to merge for 4.1, it's not the end of the world (as long as APIs and internal structures are not set in stone yet), but I've made my comments. ;-)

I want to make some points clear:

  • I don't want to diminish all the good work that @aravindavk and others have done!!
  • My intention is to help both the gd2 project and the heketi consumers come out with a good solution, via what I think is the best and fastest way.
  • Whatever the decision will be, I'm looking forward to further and improved collaboration on merging the code bases and merging the teams. ;-)

Cheers - Michael

@prashanthpai
Contributor

prashanthpai commented May 31, 2018

I doubt we will get a lot of feedback from the community. I don't even see a lot of core gluster developers using or considering gd2. Experience from other projects shows me that this feedback from the community is oftentimes a false hope.

I share the exact same opinion, here and in general too. If getting feedback from the community is the motivation to rush this, IMHO it's misplaced hope. Sorry if this comment seems pessimistic, but it's realistic.

@amarts
Member

amarts commented May 31, 2018

Non-technical

I doubt we will get a lot of feedback from the community. I don't even see a lot of core gluster developers using or considering gd2. Experience from other projects shows me that this feedback from the community is oftentimes a false hope.

I share the exact same opinion, here and in general too. If getting feedback from the community is the motivation to rush this, IMHO it's misplaced hope. Sorry if this comment seems pessimistic, but it's realistic.

@obnoxxx @prashanthpai I am a hopeful person 😄 While I understand that I see less participation from people when things are not stable, that is the norm of the world. But all I care about is that, for those who want to try, there is an option (that may be less than 1%, but there will be people). Even that 1% feedback would be good for us.

My stand has always been that if we don't experiment with new things, we will fail in the long run. Of course, if we experiment on a stable codebase, it will have the negative impact of driving people away, but as long as the experiment is not causing issues for existing features (or as long as everything is experimental), we should go for it.

I agree that even glusterfs developers wouldn't try gd2 if it's not stable, and I guess that would be valid, because for their tasks it doesn't help in any way right now. I am pushing for this patch to be merged in 4.1, even with all these discussions, mainly to get it out there in the release notes announcement. I totally understand that we don't have a community which would pick up the new release as soon as it is released, but to even build such a community, we should be providing reasons for them to do so.

Also note that, even if we announce it in 4.1, it wouldn't be used by someone who would test it for another month or so (or it could be even longer). It will be the same case for glusterfs 4.2 too. If we say gd2 is the only way forward when they upgrade, people will delay their testing (forget production migration). So if we push the experimental changes with the glusterfs 4.2 release, we will be pushing out the testing effort even further.

This is the exact reason I am pushing for it to be merged before the 4.1 release and to go out into the field. Also, because it has certain tests added, the feature would always stay in consideration while we are stabilizing, which makes it much better than not considering the whole IVP stuff during development.

As I say again, the whole project is not yet announced as GA, and it won't be close to it for another month or so. Hence, having more experimental changes in, and everyone agreeing that nothing (maybe not even the core APIs) is written in stone yet, is very critical.

Until we get regression runs of glusterfs with gd2 builds, there is always a chance that we may need a fundamental change in the design. So why not get this in and see where it proceeds? We can always have debates and change things.

@prashanthpai
Contributor

prashanthpai commented May 31, 2018

But all I care about is that, for those who want to try, there is an option (that may be less than 1%, but there will be people). Even that 1% feedback would be good for us.

There are nightly builds for them :)

Just to be clear, I have not approved or disapproved the PR, as I haven't reviewed it yet. I cannot comment on the readiness of this PR. If reviewers/maintainers who have reviewed this think that this is reasonably complete and ready to go, please do merge it.

My disagreement is with the reasoning behind pushing this. The reasoning quoted:

Glusterd2 is experimental in the Gluster 4.1 release. To make this feature stable for the 4.2 release, we would like this feature available in 4.1, because integration POCs (Ansible/Tendril integration) prefer a released branch over master, and also this will help us to get feedback from the Gluster community.

Integration projects can continue to use the master branch and help us shape and evolve the APIs. I also assume that another implicit reason behind the rush is the 4.1 tagging deadline.

Given the pace at which commits (a lot of fixes, not just features) are being merged into glusterd2 master, and also considering that glusterd2 is not reasonably stable yet, I'd prefer that we do not maintain a separate stable release branch for 4.1. We can tag releases (including minor ones) from the master branch. If that's the case, this PR can get into a 4.1.x release whenever it's deemed ready. This takes away the motivation to rush this before 4.1 gets tagged.

While it's general practice for stable projects not to include features in a minor release, the motivation for that practice is to be careful about not breaking things for users. Given that glusterd2 is still experimental and not stable, we don't have a case for maintaining a separate branch. It also saves us time in book-keeping and backporting, and will help us innovate faster on the master branch.

@amarts
Member

amarts commented May 31, 2018

If reviewers/maintainers who have reviewed this think that this is reasonably complete and ready to go, please do merge it.

I guess that was a given before I put my argument forward. I don't want to push people to merge a patch if something is not technically ready! That defeats the whole purpose of open source and community. If there are concerns about the technical part of it, it is totally OK to get this stopped.

My concern was with stopping it for the reason that we need to finalize the API. While there are so many patches where 'I will fix things in the next PR' gets merged, I don't want a different yardstick used to stop this patch. If it doesn't get merged by 4.1, that is fine too; it can get merged for 4.1.1 or 4.1.2. But having a clear ask on what more is to be done is helpful even to meet that target; otherwise, we will never be ready to merge the patch. Considering the patch is at least 42 days old, it is good to have progress on it to keep the project healthy.

@atinmu
Contributor

atinmu commented May 31, 2018

Based on the discussion between @obnoxxx @phlogistonjohn @raghavendra-talur @kshlm @prashanthpai over a call, we all agreed to get this PR in with the following disclaimers:

It has to be explicitly called out in the release notes of GlusterFS 4.1 that this feature is still in the development phase and might go through a lot of changes, which can change the internals and APIs.

Additionally, we'd also need a separate discussion initiated to reach a converged decision on whether to have a single API for (non-)smart volumes.

@aravindavk @kshlm Can one of you please rebase this patch, as I see some conflicts in the branch, and then get it merged?

@obnoxxx
Contributor

obnoxxx commented May 31, 2018

@kshlm could we do proper rebases instead of having conflict resolution in the merge commits? This is so ugly and unnecessary. :-)

Contributor

@obnoxxx obnoxxx left a comment

Based on the discussions we had and the agreement about API and structure stability (or lack thereof) for the next release phase, I'm not objecting to putting the changes in, even for 4.1. I would prefer not to see the conflict resolution in the merge commit, though... :-)

Updates: #466
Signed-off-by: Aravinda VK <avishwan@redhat.com>
@aravindavk
Member Author

Opened a new issue to track the changes required to make this feature complete. Please add to it if I missed any comments: #851
