
Intelligent volume provisioning in GD2 #466

Closed · 6 tasks
atinmu opened this issue Nov 30, 2017 · 24 comments

Labels: FW: Plugins · FW: ReST · GCS/1.0 (blocker for Gluster for Container Storage integration) · priority: high
Comments

@atinmu (Contributor) commented Nov 30, 2017

As of now in GD1, a user has to provide the exact volume and brick topology details as part of the volume create request to carve out storage. Going forward, GD2 should have a way to intelligently provision volumes: the user mentions only the size, and GD2 relies on its own algorithm to carve out the appropriate volume.

This GitHub issue is a tracker to assess the work required to enable this capability in GD2.

Checklist for Definition of Done

  • ReST APIs availability in wiki
  • CLI commands
  • Sufficient logging
  • Availability of metrics data, if applicable
  • Xlator options table up to date?
  • Test cases & test coverage of the functionality

@rishubhjain (Contributor)

Part of #417

@rishubhjain (Contributor)

@prashanthpai @aravindavk I feel this feature should be a part of core glusterd2, i.e. I should update the existing volume create code to detect whether a size is passed in the request; if it is, the dynamic volume provisioning code path will be executed.

@prashanthpai (Contributor) commented Feb 28, 2018

@rishubhjain I can see that it's easy for you to implement it in volume create. Maybe it's suitable for a refactor in the far future, but not right away. The volume create handler is already non-trivial and quite long, and I'd prefer not to add more complexity there right now. Besides, if parts of heketi become a library, it's cleaner for it to be imported in a plugin or middleware.

@brainfunked commented Mar 1, 2018

Once the initial functionality is in, would it be possible to allow the administrator to specify the workload or type for the volume to be created? This would enable some flexibility, which would be great to have in gadmin.

Also, would it make sense to allow the administrator to optionally specify the hosts on which the bricks would be spread out?

To sum up, I'm wondering if the following format of the volume create command in gadmin would make sense:

gadmin# volume create size <size> <workload <workload> | type <type>> [hosts <list of hosts>]

@prashanthpai (Contributor) commented Mar 1, 2018

@rishubhjain

We can either have a separate request type/API for dynamic volume create, or add fields to the existing one and share it. I'd like inputs from @aravindavk, @kshlm and @raghavendra-talur on this.

For the initial implementation, we are looking at keeping the request API minimal, something along these lines:

type VolCreateDynamicReq struct {
        Name string `json:"name"`
        Size string `json:"size"`
        Type string `json:"type,omitempty"`
}

The Type field is not meant to be a mere one-to-one mapping to volume types. It's generic enough for admins to define types with names such as fast, gold, ssd, region-8, blr, openshift etc. These pre-defined, user-configurable types can then map to a particular volume type, set of bricks, physical location, workload etc.
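
As an illustration, such a registry could be a simple map from type name to provisioning parameters. This is only a sketch; VolumeTypeSpec and its fields are hypothetical names, not a proposed API:

type VolumeTypeSpec struct {
        ReplicaCount int      // e.g. 3 for a replica-3 volume
        DeviceTag    string   // matched against device metadata, e.g. "ssd"
        Groups       []string // restrict bricks to these groups/zones/regions
}

// Admin-defined type names mapping to provisioning parameters.
var volumeTypes = map[string]VolumeTypeSpec{
        "fast":      {ReplicaCount: 3, DeviceTag: "ssd"},
        "gold":      {ReplicaCount: 3, DeviceTag: "ssd", Groups: []string{"region-8"}},
        "blr":       {ReplicaCount: 2, Groups: []string{"blr"}},
        "openshift": {ReplicaCount: 3},
}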

One approach is for the API endpoint to be /volumes with a query param such as /volumes?dynamic=true, and the middleware logic somewhat like the following:

func Heketi(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {

                q := r.URL.Query()
                if _, ok := q["dynamic"]; !ok {
                        // not a dynamic request: pass through and stop here
                        next.ServeHTTP(w, r)
                        return
                }

                // unmarshal request into VolCreateDynamicReq type
                // process the dynamic request

                // do the heketi magic to figure out bricks, replicas and subvols
                // prepare devices/bricks

                // this is where you can (optionally) introduce heketi's
                // async model and reply to client.

                // send a normal volume create request to gd2
        })
}

With a separate API, the middleware can have a simple and efficient pass-through.

Without a separate API for dynamic volume create, we'll have to:

  • Either add Size and Type fields to the existing volume create request. This is pretty straightforward, but note that the core volume create handler doesn't consume these fields.
  • Or drain (and replenish) req.Body in the middleware to check every incoming volume create request for Size. This seems like a less clean approach; see the sketch below.
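
For completeness, a minimal sketch of the drain-and-replenish approach, using only the standard library; the sizeSniffer name is hypothetical:

package middleware

import (
        "bytes"
        "encoding/json"
        "io/ioutil"
        "net/http"
)

// sizeSniffer reads the request body to check for a Size field, then
// restores the body so the downstream handler can read it again.
func sizeSniffer(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                body, err := ioutil.ReadAll(r.Body)
                if err != nil {
                        http.Error(w, err.Error(), http.StatusBadRequest)
                        return
                }
                r.Body.Close()
                // Replenish the body for downstream handlers.
                r.Body = ioutil.NopCloser(bytes.NewReader(body))

                var probe struct {
                        Size string `json:"size"`
                }
                if json.Unmarshal(body, &probe) != nil || probe.Size == "" {
                        // No Size: plain volume create, pass through.
                        next.ServeHTTP(w, r)
                        return
                }

                // Size present: run the dynamic provisioning path here.
        })
}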

@aravindavk (Member)

With the middleware approach, we can't make brick preparation and the actual volume create a single transaction. I think we should have this as part of the volume create handler itself.

// Unmarshal request into VolCreateDynamicReq
// If Size is 0, Unmarshal request into VolCreateReq

Preparing the bricks can be one of the transaction steps; on success, it populates the Subvols info in volinfo, which will be used by the other transaction steps.

We can also add a LimitHosts field to VolCreateDynamicReq so that brick selection can be limited to those nodes (sketched below).
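
A rough sketch of that combined handling; VolCreateReq is the existing GD2 request type, while VolCreateDynamicReq's fields and the surrounding logic are illustrative (imports and error handling elided):

type VolCreateDynamicReq struct {
        Name       string   `json:"name"`
        Size       uint64   `json:"size"`
        Type       string   `json:"type,omitempty"`
        LimitHosts []string `json:"limit-hosts,omitempty"`
}

// Inside the volume create handler: read the body once, then decide
// which request shape applies.
body, _ := ioutil.ReadAll(r.Body)

var dreq VolCreateDynamicReq
if json.Unmarshal(body, &dreq) == nil && dreq.Size > 0 {
        // Dynamic path: choose bricks (honouring LimitHosts), then add
        // a "prepare bricks" step to the volume create transaction.
} else {
        var req VolCreateReq
        _ = json.Unmarshal(body, &req)
        // Existing explicit-brick path.
}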

@rishubhjain (Contributor) commented Mar 7, 2018

@prashanthpai Sharing one API seems to be the better approach, as it makes volume create a less complex process. I think introducing a small flow in the already existing volume create handler should do the job, though it then doesn't make dynamic volume provisioning pluggable.

Also, I think we should discuss which approach (normal volume create or dynamic volume create) will be used more, w.r.t. the new direction that was discussed in meetings, and restructure the documentation accordingly.

@prashanthpai (Contributor)

> With the middleware approach, we can't make brick preparation and the actual volume create a single transaction. I think we should have this as part of the volume create handler itself.

I'm okay with making it part of volume create. However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

> Sharing one API seems to be the better approach...

Maybe.

> as it makes volume create a less complex process

Maybe not. Now you'll have more steps, some of them conditional.

@kshlm Thoughts on this?

@rishubhjain (Contributor)

> However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

Yes, volume create is an async operation in heketi, and it seems keeping the volume create operation async is the better approach for components such as OpenShift and Kubernetes.

@atinmu (Contributor, Author) commented Mar 9, 2018

What's the exact need for volume create to be an async operation? IMO, commands which don't involve a heavy-lifting transaction workflow can remain non-async, and volume create is definitely one of them.

@raghavendra-talur (Member)

> What's the exact need for volume create to be an async operation? IMO, commands which don't involve a heavy-lifting transaction workflow can remain non-async, and volume create is definitely one of them.

I am yet to go through the complete discussion, but I suggested it be an async op because, with brick creation and other transactions coming in with the plugin, it will become a long operation.

@raghavendra-talur (Member)

@prashanthpai

> I'm okay with making it part of volume create. However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

Yes, we discussed the same. Async is required if the unique ID is generated by the server and served back later. In GD2, the client gives the unique ID for the volume, so it is fine if you choose the synchronous model. There are other benefits in using async, though.

@raghavendra-talur (Member) commented Mar 9, 2018

@prashanthpai @kshlm @aravindavk

> Without a separate API for dynamic volume create, we'll have to:
> Either add Size and Type fields to the existing volume create request. This is pretty straightforward, but note that the core volume create handler doesn't consume these fields.
> Or drain (and replenish) req.Body in the middleware to check every incoming volume create request for Size. This seems like a less clean approach.

The other model, where the URL changes for dynamic, is not acceptable. Either of the above two models should work. I suggest Type be replaced by Options if you wish to use it later for other purposes. It could be a map of key-value pairs.

@prashanthpai (Contributor) commented Mar 15, 2018

@rishubhjain @raghavendra-talur

The glusterd2 process/service can support async requests if deemed necessary for certain operations. However, the volume create REST handler will remain synchronous. This means that the core handler will not be dealing with serving/replying to asynchronous requests.

The "asynchronicity" shall be added by the heketi middle-ware. It will handle only async requests, convert them to synchronous requests and pass it down. The middle-ware will maintain state of asynchronous requests (job queue/id) in its store namespace.

                                 +------+
                                 | etcd |
                                 +--+---+
                                    ^
                                    | Maintain state of
                                    | request (job queue/id)
                                    |
+--------+    Async request    +------------+    Synchronous    +-----------+
| CLIENT | -----------------> |   ASYNC    | ----------------> |  Volume   |
+--------+                    | MIDDLEWARE |                   |  Create   |
                              +------------+                   |  Handler  |
                                                               +-----------+
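
For illustration, a minimal sketch of that flow: record a job ID (in memory here; the real middleware would use its etcd store namespace), reply 202 immediately, and finish the synchronous volume create in the background. All names are hypothetical:

package middleware

import (
        "bytes"
        "fmt"
        "io/ioutil"
        "net/http"
        "net/http/httptest"
        "sync"
)

var (
        jobsMu sync.Mutex
        jobs   = map[string]string{} // job ID -> status
        nextID int
)

func asyncMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // Copy the request so it outlives this handler.
                body, _ := ioutil.ReadAll(r.Body)
                bg, _ := http.NewRequest(r.Method, r.URL.String(), bytes.NewReader(body))
                bg.Header = r.Header

                jobsMu.Lock()
                nextID++
                id := fmt.Sprintf("job-%d", nextID)
                jobs[id] = "pending"
                jobsMu.Unlock()

                // Run the synchronous volume create in the background,
                // capturing its response with a recorder.
                go func() {
                        rec := httptest.NewRecorder()
                        next.ServeHTTP(rec, bg)
                        jobsMu.Lock()
                        jobs[id] = http.StatusText(rec.Code)
                        jobsMu.Unlock()
                }()

                // Reply immediately with a handle the client can poll.
                w.Header().Set("Location", "/v1/jobs/"+id)
                w.WriteHeader(http.StatusAccepted)
        })
}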

@aravindavk (Member)

I still think we should not implement this as middleware, because volume create with Size would then not be treated as a single transaction. If the middleware succeeds but the actual volume create fails, there is no rollback available for the middleware (another API would be required to clean up).

We can split this into two parts,

  • The middleware chooses the brick paths and sizes based on the incoming request and the information about the devices. Note: this does not create any bricks; it only picks the node and brick device information.
  • The actual volume create understands this information and adds a prepare-brick transaction step. This step can still be maintained in plugin code itself; the volume create transaction just uses it by name (with extra logic to include this step conditionally).

With this approach, all the steps will be executed in the same transaction (see the sketch below). Let me know your thoughts.
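
A rough sketch of the conditional step, using an illustrative txnStep type rather than GD2's actual transaction API:

steps := []txnStep{}
if req.Size > 0 {
        // Inserted only for size-based requests; Undo provides the
        // rollback if any later step fails.
        steps = append(steps, txnStep{
                Do:   "smartvol.PrepareBricks", // create LVs, format, mount
                Undo: "smartvol.CleanupBricks",
        })
}
steps = append(steps,
        txnStep{Do: "vol-create.ValidateBricks"},
        txnStep{Do: "vol-create.CreateVolinfo"},
        txnStep{Do: "vol-create.StoreVolume"},
)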

@prashanthpai (Contributor) commented Mar 16, 2018

> I still think we should not implement this as middleware, because volume create with Size would then not be treated as a single transaction.

Agreed. That is a better approach.

I'm just not in favour of making the volume create handler async. The middleware should be the one tracking the state of the async request. All our core handlers will be synchronous.

@aravindavk (Member)

> I'm just not in favour of making the volume create handler async. The middleware should be the one tracking the state of the async request.

We can make requests async later if required. We need to work on providing a general framework for handling async requests, including volume create.

@aravindavk (Member)

Sharing some notes about the intelligent volume provisioning required from Glusterd2. Please comment if any changes are required to the logic/implementation.

Cluster Setup

Attach all the nodes using the POST /v1/peers API.

Status: Already available

Register the available devices

Register available devices for each Peer. This also prepares the device by creating a PV and a VG.

Example Request

POST /v1/devices
{
    "peer-id": "peer_id",
    "names": ["device_name1", "device_name2"]
}

This needs to be done for all the Peer nodes.

Status: Already Available

Peer Grouping

If all Peers belong to different failure domains, then configuring a
Group is unnecessary. Peer groups are required if we need to group the
peers such that bricks of the same subvolume will not be created in
the same group. For example, a cluster has 4 Peers P1, P2, P3 and P4,
but P1 and P2 belong to the same group/rack server; group them
accordingly so that, when choosing bricks for subvolumes, multiple
bricks of a subvolume will not be picked from the same group. In this
example, grouping is only required for P1 and P2.

POST /v1/peers/P1/group/P1_P2
POST /v1/peers/P2/group/P1_P2

Once the grouping is available, if a Replica 3 Volume is requested,
bricks will be chosen as below:

  • Brick 1 - From P1 or P2, since they belong to the same Group
  • Brick 2 - From P3
  • Brick 3 - From P4

Without the Group information, bricks would be created as: Brick 1 from P1, Brick 2 from P2 and Brick 3 from P3.

If Group information is modified after Volume Create, no change is
made to the already created Volume. In case of shrink or expand, we
need to consider the newly configured Group information. Choosing
Peers for expand/shrink is still under discussion.

Note: If Group is not configured, PeerID will be used as Group.

Status: Patch under review

Choosing Bricks for a Volume

Heketi's Simple Ring Allocator creates the Ring and stores it in the
DB. This easily goes out of sync when devices are prepared manually
and a Volume is created using the Volume Create API.

Storing the Ring in the DB is not flexible, since the Ring information
would need updating on device add, or when a device is used externally
for other Volume operations. With Glusterd2, we will store only the
details required to choose bricks automatically, for example the
PVFreeSize information in the device list. On a Volume Create request,
Glusterd2 can prepare the brick list based on the chosen logic and
then pass it on to the Volume Create request.

This gives more flexibility in choosing bricks compared to picking
bricks from a prepared list. For example:

  • Allow a brick in the same group if it is part of a different subvolume.
  • Create the Volume using only those Peers/devices whose metadata
    says "SSD", if the user requests it.
  • Restrict Volume creation to Peers belonging to a given list of Groups.
  • Create the Volume using only a given list of Peers.

Pseudocode to choose bricks:

required_bricks = <num>
devices = []
for group in groups {
    if len(devices) == required_bricks {
        break
    }
    peer = pick_a_peer_from_group(group)
    device = pick_a_device_from_peer(peer)
    if peer.is_online() && device.free_size >= brick_size {
        devices.append(device)
    }
}

The above logic looks similar to the Simple Ring Allocator, but
Glusterd2 will create this information as and when required.
Customization is also possible while picking a Peer from a group or
while picking a device from a Peer.

The following details are required from the user:

Size                    - Size of the Volume
Distribute Count        - (Optional, default 1) How many subvolumes are required
Replica Count           - (Optional, default 3) Number of replicas required (2 or 3)
Arbiter Count           - (Optional, default 0) If an Arbiter brick is required
Disperse Count          - (Optional, default 0) Only if a Disperse Volume is required
Redundancy Count        - (Optional, default 0) Only if a Disperse Volume is required
Snapshot Feature Enable - (Optional, default false) If Gluster snapshots will be taken on this Volume
Snapshot Reserve Factor - (Optional, default 1) Snapshots require extra space
Peers List              - (Optional, default empty) Choose bricks only from these Peers
Groups List             - (Optional, default empty) Choose bricks only from these Groups
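
These details map naturally onto a request struct. A hypothetical shape (field and JSON names are illustrative, not the final GD2 API):

type IntelligentVolumeReq struct {
        Name                  string   `json:"name"`
        Size                  uint64   `json:"size"`
        DistributeCount       int      `json:"distribute,omitempty"` // default 1
        ReplicaCount          int      `json:"replica,omitempty"`    // default 3
        ArbiterCount          int      `json:"arbiter,omitempty"`    // default 0
        DisperseCount         int      `json:"disperse,omitempty"`   // default 0
        RedundancyCount       int      `json:"redundancy,omitempty"` // default 0
        SnapshotEnabled       bool     `json:"snapshot,omitempty"`   // default false
        SnapshotReserveFactor float64  `json:"snapshot-reserve-factor,omitempty"` // default 1
        LimitPeers            []string `json:"limit-peers,omitempty"`
        LimitGroups           []string `json:"limit-groups,omitempty"`
}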

FAQs

  • What if a Volume is created by the regular Volume Create API and uses
    devices managed by Glusterd2? The Volume Create/Expand APIs will update
    the PVFreeSize if the device is managed by Glusterd2.
  • What if the Group information is updated after creating the Volume?
    Bricks of an already created Volume will not be modified; only new
    Volume Create and Expand operations will use the new Group information.

@prashanthpai (Contributor)

> We can make requests async later if required. We need to work on providing a general framework for handling async requests, including volume create.

That general framework is the async middleware described above.

@brainfunked commented Mar 20, 2018

From a gadmin perspective (well, UX perspective in general), it would be nice to be able to split up a transaction into steps and provide updates to the user about the steps, so as to provide a 'progress report'. I would like all your thoughts regarding the following:

  • Would it be possible for the async APIs to declare a step-by-step breakdown of an operation to be carried out, upfront? As the operation progresses, the responses could indicate the status of the steps declared initially.
  • Whether we provide such a breakdown or not, would we want to include any details regarding subvolumes in the final output of the transaction?

Looking at debugging scenarios where there's a problem that needs to be traced to the block devices, the composition of the volume via subvolumes on specific per-node block devices would probably be necessary information. We need to consider this as a user experience problem across all the gd2 APIs, rather than just this particular API. The consistency of information presented should, IMHO, be a prime concern.

@aravindavk (Member)

> From a gadmin perspective (well, UX perspective in general), it would be nice to be able to split up a transaction into steps and provide updates to the user about the steps, so as to provide a 'progress report'. I would like all your thoughts regarding the following:

The comment #466 (comment) talks about multiple APIs (cluster register, add devices, volume create). Glusterd2 will not combine all these steps into a single transaction.

Async APIs are low priority right now; we would like to see the functionality working with the synchronous API first.

> Would it be possible for the async APIs to declare a step-by-step breakdown of an operation to be carried out, upfront? As the operation progresses, the responses could indicate the status of the steps declared initially.

A transaction can have multiple steps (already supported). Once we add async API support, a status breakdown is possible for transaction steps (for example, Step 1: Complete, Step 2: In progress, etc.). It is also possible to give other information like the number of steps in the transaction, the time taken for each step, etc.

Note: We are not planning for the API to accept a list of steps from the user and turn it into a transaction. New Glusterd2 APIs should be implemented as plugins or middleware only.
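
For illustration, a per-step status report could take a shape like the following; TxnStepStatus is a hypothetical type, not an existing GD2 one:

type TxnStepStatus struct {
        Name     string `json:"name"`               // e.g. "vol-create.StoreVolume"
        State    string `json:"state"`              // "pending", "running", "complete", "failed"
        Duration string `json:"duration,omitempty"` // time taken, once complete
}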

> Whether we provide such a breakdown or not, would we want to include any details regarding subvolumes in the final output of the transaction?

Volume create will return the volume info of the created volume, which will include subvolume details.

> Looking at debugging scenarios where there's a problem that needs to be traced to the block devices, the composition of the volume via subvolumes on specific per-node block devices would probably be necessary information. We need to consider this as a user experience problem across all the gd2 APIs, rather than just this particular API. The consistency of information presented should, IMHO, be a prime concern.

Subvolume information and device information are already available. Please provide more details about this use case.

@rishubhjain (Contributor)

> If Group information is modified after Volume Create, no change is made to the already created Volume.

@aravindavk If the group is changed and no changes are made to the already created volumes, then won't the volume lose its property of being distributed or replicated?

@aravindavk (Member) commented Mar 27, 2018

> @aravindavk If the group is changed and no changes are made to the already created volumes, then won't the volume lose its property of being distributed or replicated?

Volume functionality is unaffected, but if two peers that were in different zones earlier are now moved to the same zone, then the brick distribution is no longer optimal. It is also possible that multiple bricks of the same replica set reside in the same zone. But moving bricks on a zone/group change is an expensive operation; for now we can keep this as a known issue.

@aravindavk (Member)

New issues have been opened for the enhancements to this feature. Closing this issue since the main feature is merged.
