
Intelligent volume provisioning in GD2 #466

Closed · 6 tasks
atinmu opened this issue Nov 30, 2017 · 24 comments

Labels: FW: Plugins · FW: ReST · GCS/1.0 (blocker for Gluster for Container Storage integration) · priority: high
Comments

@atinmu (Contributor) commented Nov 30, 2017

As of now in GD1, a user has to provide the exact volume and brick topology details as part of the volume create request to carve out storage. Going forward, GD2 should have a way to intelligently provision volumes: the user mentions only the size, and GD2 relies on its own algorithm to carve out the appropriate volume.

This GitHub issue is a tracker to assess the work required to enable this capability in GD2.

Checklist for Definition of Done

  • ReST APIs availability in wiki
  • CLI commands
  • Sufficient logging
  • Availability of metrics data, if applicable
  • Xlator options table up to date?
  • Test cases & test coverage of the functionality

@rishubhjain (Contributor)

Part of #417

@rishubhjain (Contributor)

@prashanthpai @aravindavk I feel this feature should be a part of core glusterd2, i.e. I should update the existing volume create code to detect whether a size is passed in the request; if it is, the dynamic volume provisioning code path will be executed.

@prashanthpai (Contributor) commented Feb 28, 2018

@rishubhjain I can see that it's easy for you to implement it in volume create. Maybe it's suitable for a refactor in the far future, but not right away. The volume create handler is already non-trivial and quite long, and I'd prefer not to add more complexity there right now. Besides, if parts of heketi become a library, it's cleaner for it to be imported in a plugin or middleware.

@brainfunked commented Mar 1, 2018

Once the initial functionality is in, would it be possible to allow the administrator to specify the workload or type for the volume to be created? This would enable some flexibility, which would be great to have in gadmin.

Also, would it make sense to allow the administrator to optionally specify the hosts on which the bricks would be spread out?

To sum up, I'm wondering if the following format of the volume create command in gadmin would make sense:

gadmin# volume create size <size> <workload <workload> | type <type>> [hosts <list of hosts>]

@prashanthpai (Contributor) commented Mar 1, 2018

@rishubhjain

We can either have a separate request type/API for dynamic volume create, or add fields to the existing one and share it. I'd like inputs from @aravindavk, @kshlm and @raghavendra-talur on this.

For the initial implementation, we are looking at keeping the request API minimal, something along these lines:

type VolCreateDynamicReq struct {
        Name string `json:"name"`
        Size string `json:"size"`
        Type string `json:"type,omitempty"`
}

The Type field is not meant to be a mere one-to-one mapping to volume types. It's generic enough for admins to define types with names such as fast, gold, ssd, region-8, blr, openshift etc. These pre-defined, user-configurable types can then map to a particular volume type, set of bricks, physical location, workload etc.
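
As an illustration, such a registry could be a simple map from type name to provisioning parameters. This is only a sketch; VolumeTypeSpec and its fields are hypothetical names, not a proposed API:

type VolumeTypeSpec struct {
        ReplicaCount int      // e.g. 3 for a replica-3 volume
        DeviceTag    string   // matched against device metadata, e.g. "ssd"
        Groups       []string // restrict bricks to these groups/zones/regions
}

// Admin-defined type names mapping to provisioning parameters.
var volumeTypes = map[string]VolumeTypeSpec{
        "fast":      {ReplicaCount: 3, DeviceTag: "ssd"},
        "gold":      {ReplicaCount: 3, DeviceTag: "ssd", Groups: []string{"region-8"}},
        "blr":       {ReplicaCount: 2, Groups: []string{"blr"}},
        "openshift": {ReplicaCount: 3},
}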

One approach is for the API endpoint to be /volumes with a query param such as /volumes?dynamic=true, and the middleware logic somewhat like the following:

func Heketi(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {

                q := r.URL.Query()
                if _, ok := q["dynamic"]; !ok {
                        // not a dynamic request: pass through and stop here
                        next.ServeHTTP(w, r)
                        return
                }

                // unmarshal request into VolCreateDynamicReq type
                // process the dynamic request

                // do the heketi magic to figure out bricks, replicas and subvols
                // prepare devices/bricks

                // this is where you can (optionally) introduce heketi's
                // async model and reply to client.

                // send a normal volume create request to gd2
        })
}

With a separate API, the middleware can have a simple and efficient pass-through.

Without a separate API for dynamic volume create, we'll have to:

  • Either add Size and Type fields to the existing volume create request. This is pretty straightforward, but note that the core volume create handler doesn't consume these fields.
  • Or drain (and replenish) req.Body in the middleware to check every incoming volume create request for Size. This seems like a less clean approach; see the sketch below.
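
For completeness, a minimal sketch of the drain-and-replenish approach, using only the standard library; the sizeSniffer name is hypothetical:

package middleware

import (
        "bytes"
        "encoding/json"
        "io/ioutil"
        "net/http"
)

// sizeSniffer reads the request body to check for a Size field, then
// restores the body so the downstream handler can read it again.
func sizeSniffer(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                body, err := ioutil.ReadAll(r.Body)
                if err != nil {
                        http.Error(w, err.Error(), http.StatusBadRequest)
                        return
                }
                r.Body.Close()
                // Replenish the body for downstream handlers.
                r.Body = ioutil.NopCloser(bytes.NewReader(body))

                var probe struct {
                        Size string `json:"size"`
                }
                if json.Unmarshal(body, &probe) != nil || probe.Size == "" {
                        // No Size: plain volume create, pass through.
                        next.ServeHTTP(w, r)
                        return
                }

                // Size present: run the dynamic provisioning path here.
        })
}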

@aravindavk (Member)

With the middleware approach, we can't make brick preparation and the actual volume create a single transaction. I think we should have this as part of the volume create handler itself.

// Unmarshal request into VolCreateDynamicReq
// If Size is 0, Unmarshal request into VolCreateReq

Preparing the bricks can be one of the transaction steps; on success, it populates the Subvols info in volinfo, which will be used by the other transaction steps.

We can also add a LimitHosts field to VolCreateDynamicReq so that brick selection can be limited to those nodes (sketched below).
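
A rough sketch of that combined handling; VolCreateReq is the existing GD2 request type, while VolCreateDynamicReq's fields and the surrounding logic are illustrative (imports and error handling elided):

type VolCreateDynamicReq struct {
        Name       string   `json:"name"`
        Size       uint64   `json:"size"`
        Type       string   `json:"type,omitempty"`
        LimitHosts []string `json:"limit-hosts,omitempty"`
}

// Inside the volume create handler: read the body once, then decide
// which request shape applies.
body, _ := ioutil.ReadAll(r.Body)

var dreq VolCreateDynamicReq
if json.Unmarshal(body, &dreq) == nil && dreq.Size > 0 {
        // Dynamic path: choose bricks (honouring LimitHosts), then add
        // a "prepare bricks" step to the volume create transaction.
} else {
        var req VolCreateReq
        _ = json.Unmarshal(body, &req)
        // Existing explicit-brick path.
}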

@rishubhjain (Contributor) commented Mar 7, 2018

@prashanthpai Sharing one API seems to be the better approach, as it makes volume create a less complex process. I think introducing a small flow in the already existing volume create handler should do the job, though it then doesn't make dynamic volume provisioning pluggable.

Also, I think we should discuss which approach (normal volume create or dynamic volume create) will be used more, w.r.t. the new direction that was discussed in meetings, and restructure the documentation accordingly.

@prashanthpai (Contributor)

> With the middleware approach, we can't make brick preparation and the actual volume create a single transaction. I think we should have this as part of the volume create handler itself.

I'm okay with making it part of volume create. However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

> Sharing one API seems to be the better approach...

Maybe.

> as it makes volume create a less complex process

Maybe not. Now you'll have more steps, some of them conditional.

@kshlm Thoughts on this?

@rishubhjain (Contributor)

> However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

Yes, volume create is an async operation in heketi, and it seems keeping the volume create operation async is the better approach for components such as OpenShift and Kubernetes.

@atinmu (Contributor, Author) commented Mar 9, 2018

What's the exact need for volume create to be an async operation? IMO, commands which don't involve a heavy-lifting transaction workflow can remain non-async, and volume create is definitely one of them.

@raghavendra-talur (Member)

> What's the exact need for volume create to be an async operation? IMO, commands which don't involve a heavy-lifting transaction workflow can remain non-async, and volume create is definitely one of them.

I am yet to go through the complete discussion, but I suggested it be an async op because, with brick creation and other transactions coming in with the plugin, it will become a long operation.

@raghavendra-talur (Member)

@prashanthpai

> I'm okay with making it part of volume create. However, I prefer the volume create handler not be async. IIRC, the heketi way is async, i.e. give the client back an ID/URI of some sort that the client can check back on. Can someone from heketi confirm this?

Yes, we discussed the same. Async is required if the unique ID is generated by the server and served back later. In GD2, the client gives the unique ID for the volume, so it is fine if you choose the synchronous model. There are other benefits in using async, though.

@raghavendra-talur (Member) commented Mar 9, 2018

@prashanthpai @kshlm @aravindavk

> Without a separate API for dynamic volume create, we'll have to:
> Either add Size and Type fields to the existing volume create request. This is pretty straightforward, but note that the core volume create handler doesn't consume these fields.
> Or drain (and replenish) req.Body in the middleware to check every incoming volume create request for Size. This seems like a less clean approach.

The other model, where the URL changes for dynamic, is not acceptable. Either of the above two models should work. I suggest Type be replaced by Options if you wish to use it later for other purposes. It could be a map of key-value pairs.

@prashanthpai (Contributor) commented Mar 15, 2018

@rishubhjain @raghavendra-talur

The glusterd2 process/service can support async requests if deemed necessary for certain operations. However, the volume create REST handler will remain synchronous. This means that the core handler will not be dealing with serving/replying to asynchronous requests.

The "asynchronicity" shall be added by the heketi middle-ware. It will handle only async requests, convert them to synchronous requests and pass it down. The middle-ware will maintain state of asynchronous requests (job queue/id) in its store namespace.

                                 +------+
                                 | etcd |
                                 +--+---+
                                    ^
                                    | Maintain state of
                                    | request (job queue/id)
                                    |
+--------+    Async request    +------------+    Synchronous    +-----------+
| CLIENT | -----------------> |   ASYNC    | ----------------> |  Volume   |
+--------+                    | MIDDLEWARE |                   |  Create   |
                              +------------+                   |  Handler  |
                                                               +-----------+
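
For illustration, a minimal sketch of that flow: record a job ID (in memory here; the real middleware would use its etcd store namespace), reply 202 immediately, and finish the synchronous volume create in the background. All names are hypothetical:

package middleware

import (
        "bytes"
        "fmt"
        "io/ioutil"
        "net/http"
        "net/http/httptest"
        "sync"
)

var (
        jobsMu sync.Mutex
        jobs   = map[string]string{} // job ID -> status
        nextID int
)

func asyncMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // Copy the request so it outlives this handler.
                body, _ := ioutil.ReadAll(r.Body)
                bg, _ := http.NewRequest(r.Method, r.URL.String(), bytes.NewReader(body))
                bg.Header = r.Header

                jobsMu.Lock()
                nextID++
                id := fmt.Sprintf("job-%d", nextID)
                jobs[id] = "pending"
                jobsMu.Unlock()

                // Run the synchronous volume create in the background,
                // capturing its response with a recorder.
                go func() {
                        rec := httptest.NewRecorder()
                        next.ServeHTTP(rec, bg)
                        jobsMu.Lock()
                        jobs[id] = http.StatusText(rec.Code)
                        jobsMu.Unlock()
                }()

                // Reply immediately with a handle the client can poll.
                w.Header().Set("Location", "/v1/jobs/"+id)
                w.WriteHeader(http.StatusAccepted)
        })
}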

@aravindavk (Member)

I still think we should not implement this as middleware, because volume create with Size would then not be treated as a single transaction. If the middleware succeeds but the actual volume create fails, there is no rollback available for the middleware (another API would be required to clean up).

We can split this into two parts,

  • The middleware chooses the brick paths and sizes based on the incoming request and the information about the devices. Note: this does not create any bricks; it only picks the node and brick device information.
  • The actual volume create understands this information and adds a prepare-brick transaction step. This step can still be maintained in plugin code itself; the volume create transaction just uses it by name (with extra logic to include this step conditionally).

With this approach, all the steps will be executed in the same transaction (see the sketch below). Let me know your thoughts.
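
A rough sketch of the conditional step, using an illustrative txnStep type rather than GD2's actual transaction API:

steps := []txnStep{}
if req.Size > 0 {
        // Inserted only for size-based requests; Undo provides the
        // rollback if any later step fails.
        steps = append(steps, txnStep{
                Do:   "smartvol.PrepareBricks", // create LVs, format, mount
                Undo: "smartvol.CleanupBricks",
        })
}
steps = append(steps,
        txnStep{Do: "vol-create.ValidateBricks"},
        txnStep{Do: "vol-create.CreateVolinfo"},
        txnStep{Do: "vol-create.StoreVolume"},
)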

@prashanthpai (Contributor) commented Mar 16, 2018

> I still think we should not implement this as middleware, because volume create with Size would then not be treated as a single transaction.

Agreed. That is a better approach.

I'm just not in favour of making the volume create handler async. The middleware should be the one tracking the state of the async request. All our core handlers will be synchronous.

@aravindavk (Member)

> I'm just not in favour of making the volume create handler async. The middleware should be the one tracking the state of the async request.

We can make requests async later if required. We need to work on providing a general framework for handling async requests, including volume create.

@aravindavk (Member)

Sharing some notes about the intelligent volume provisioning required from Glusterd2. Please comment if any changes are required to the logic/implementation.

Cluster Setup

Attach all the nodes using the POST /v1/peers API.

Status: Already available

Register the available devices

Register available devices for each Peer. This also prepares the device by creating a PV and a VG.

Example Request

POST /v1/devices
{
    "peer-id": "peer_id",
    "names": ["device_name1", "device_name2"]
}

This needs to be done for all the Peer nodes.

Status: Already Available

Peer Grouping

If all Peers belong to different failure domains, then configuring a
Group is unnecessary. Peer groups are required if we need to group the
peers such that bricks of the same subvolume will not be created in
the same group. For example, a cluster has 4 Peers P1, P2, P3 and P4,
but P1 and P2 belong to the same group/rack server; group them
accordingly so that, when choosing bricks for subvolumes, multiple
bricks of a subvolume will not be picked from the same group. In this
example, grouping is only required for P1 and P2.

POST /v1/peers/P1/group/P1_P2
POST /v1/peers/P2/group/P1_P2

Once the grouping is available, if a Replica 3 Volume is requested,
bricks will be chosen as below:

  • Brick 1 - From P1 or P2, since they belong to the same Group
  • Brick 2 - From P3
  • Brick 3 - From P4

Without the Group information, bricks would be created as: Brick 1 from P1, Brick 2 from P2 and Brick 3 from P3.

If Group information is modified after Volume Create, no change is
made to the already created Volume. In case of shrink or expand, we
need to consider the newly configured Group information. Choosing
Peers for expand/shrink is still under discussion.

Note: If Group is not configured, PeerID will be used as Group.

Status: Patch under review

Choosing Bricks for a Volume

Heketi's Simple Ring Allocator creates the Ring and stores it in the
DB. This easily goes out of sync when devices are prepared manually
and a Volume is created using the Volume Create API.

Storing the Ring in the DB is not flexible, since the Ring information
would need updating on device add, or when a device is used externally
for other Volume operations. With Glusterd2, we will store only the
details required to choose bricks automatically, for example the
PVFreeSize information in the device list. On a Volume Create request,
Glusterd2 can prepare the brick list based on the chosen logic and
then pass it on to the Volume Create request.

This gives more flexibility in choosing bricks compared to picking
bricks from a prepared list. For example:

  • Allow a brick in the same group if it is part of a different subvolume.
  • Create the Volume using only those Peers/devices whose metadata
    says "SSD", if the user requests it.
  • Restrict Volume creation to Peers belonging to a given list of Groups.
  • Create the Volume using only a given list of Peers.

Pseudocode to choose bricks:

required_bricks = <num>
devices = []
for group in groups {
    if len(devices) == required_bricks {
        break
    }
    peer = pick_a_peer_from_group(group)
    device = pick_a_device_from_peer(peer)
    if peer.is_online() && device.free_size >= brick_size {
        devices.append(device)
    }
}

The above logic looks similar to the Simple Ring Allocator, but
Glusterd2 will create this information as and when required.
Customization is also possible while picking a Peer from a group or
while picking a device from a Peer.

The following details are required from the user:

Size                    - Size of the Volume
Distribute Count        - (Optional, default 1) How many subvolumes are required
Replica Count           - (Optional, default 3) Number of replicas required (2 or 3)
Arbiter Count           - (Optional, default 0) If an Arbiter brick is required
Disperse Count          - (Optional, default 0) Only if a Disperse Volume is required
Redundancy Count        - (Optional, default 0) Only if a Disperse Volume is required
Snapshot Feature Enable - (Optional, default false) If Gluster snapshots will be taken on this Volume
Snapshot Reserve Factor - (Optional, default 1) Snapshots require extra space
Peers List              - (Optional, default empty) Choose bricks only from these Peers
Groups List             - (Optional, default empty) Choose bricks only from these Groups
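
These details map naturally onto a request struct. A hypothetical shape (field and JSON names are illustrative, not the final GD2 API):

type IntelligentVolumeReq struct {
        Name                  string   `json:"name"`
        Size                  uint64   `json:"size"`
        DistributeCount       int      `json:"distribute,omitempty"` // default 1
        ReplicaCount          int      `json:"replica,omitempty"`    // default 3
        ArbiterCount          int      `json:"arbiter,omitempty"`    // default 0
        DisperseCount         int      `json:"disperse,omitempty"`   // default 0
        RedundancyCount       int      `json:"redundancy,omitempty"` // default 0
        SnapshotEnabled       bool     `json:"snapshot,omitempty"`   // default false
        SnapshotReserveFactor float64  `json:"snapshot-reserve-factor,omitempty"` // default 1
        LimitPeers            []string `json:"limit-peers,omitempty"`
        LimitGroups           []string `json:"limit-groups,omitempty"`
}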

FAQs

  • What if a Volume is created by the regular Volume Create API and uses
    devices managed by Glusterd2? The Volume Create/Expand APIs will update
    the PVFreeSize if the device is managed by Glusterd2.
  • What if the Group information is updated after creating the Volume?
    Bricks of an already created Volume will not be modified; only new
    Volume Create and Expand operations will use the new Group information.

@prashanthpai (Contributor)

> We can make requests async later if required. We need to work on providing a general framework for handling async requests, including volume create.

That general framework is the async middleware described above.

@brainfunked commented Mar 20, 2018

From a gadmin perspective (well, UX perspective in general), it would be nice to be able to split up a transaction into steps and provide updates to the user about the steps, so as to provide a 'progress report'. I would like all your thoughts regarding the following:

  • Would it be possible for the async APIs to declare a step-by-step breakdown of an operation to be carried out, upfront? As the operation progresses, the responses could indicate the status of the steps declared initially.
  • Whether we provide such a breakdown or not, would we want to include any details regarding subvolumes in the final output of the transaction?

Looking at debugging scenarios where there's a problem that needs to be traced to the block devices, the composition of the volume via subvolumes on specific per-node block devices would probably be necessary information. We need to consider this as a user experience problem across all the gd2 APIs, rather than just this particular API. The consistency of information presented should, IMHO, be a prime concern.

@aravindavk (Member)

> From a gadmin perspective (well, UX perspective in general), it would be nice to be able to split up a transaction into steps and provide updates to the user about the steps, so as to provide a 'progress report'. I would like all your thoughts regarding the following:

The comment #466 (comment) talks about multiple APIs (cluster register, add devices, volume create). Glusterd2 will not combine all these steps into a single transaction.

Async APIs are low priority right now; we would like to see the functionality working with the synchronous API first.

> Would it be possible for the async APIs to declare a step-by-step breakdown of an operation to be carried out, upfront? As the operation progresses, the responses could indicate the status of the steps declared initially.

A transaction can have multiple steps (already supported). Once we add async API support, a status breakdown is possible for transaction steps (for example, Step 1: Complete, Step 2: In progress, etc.). It is also possible to give other information like the number of steps in the transaction, the time taken for each step, etc.

Note: We are not planning for the API to accept a list of steps from the user and turn it into a transaction. New Glusterd2 APIs should be implemented as plugins or middleware only.
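
For illustration, a per-step status report could take a shape like the following; TxnStepStatus is a hypothetical type, not an existing GD2 one:

type TxnStepStatus struct {
        Name     string `json:"name"`               // e.g. "vol-create.StoreVolume"
        State    string `json:"state"`              // "pending", "running", "complete", "failed"
        Duration string `json:"duration,omitempty"` // time taken, once complete
}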

> Whether we provide such a breakdown or not, would we want to include any details regarding subvolumes in the final output of the transaction?

Volume create will return the volume info of the created volume, which will include subvolume details.

> Looking at debugging scenarios where there's a problem that needs to be traced to the block devices, the composition of the volume via subvolumes on specific per-node block devices would probably be necessary information. We need to consider this as a user experience problem across all the gd2 APIs, rather than just this particular API. The consistency of information presented should, IMHO, be a prime concern.

Subvolume information and device information are already available. Please provide more details about this use case.

@rishubhjain (Contributor)

> If Group information is modified after Volume Create, no change is made to the already created Volume.

@aravindavk If the group is changed and no changes are made to the already created volumes, then won't the volume lose its property of being distributed or replicated?

@aravindavk (Member) commented Mar 27, 2018

> @aravindavk If the group is changed and no changes are made to the already created volumes, then won't the volume lose its property of being distributed or replicated?

Volume functionality is unaffected, but if two peers that were in different zones earlier are now moved to the same zone, then the brick distribution is no longer optimal. It is also possible that multiple bricks of the same replica set reside in the same zone. But moving bricks on a zone/group change is an expensive operation; for now we can keep this as a known issue.

@aravindavk (Member)

New issues have been opened for the enhancements to this feature. Closing this issue since the main feature is merged.
