Pagination Deployment to Service Handlers and CLI Commands and Domain Collections #237

Open
9 tasks
CMCDragonkai opened this issue Sep 2, 2021 · 22 comments
Labels
design Requires design development Standard development enhancement New feature or request r&d:polykey:supporting activity Supporting core activity


@CMCDragonkai
Member

CMCDragonkai commented Sep 2, 2021

Specification

Pagination is the process by which a client program can acquire a subset of data from a larger stream of data.

Polykey maintains potentially large streams of data:

  • Vaults
  • Sigchain
  • Nodes Database
  • Gestalts

At the moment, all of this data is either returned in 1 unary call, which returns an in-memory deserialised array of data, or returned as a stream. This creates a problem when the amount of data is large, or when you want to go to a specific point in the stream without having to stream from the beginning again.

The standard for doing this is "pagination". Pagination uses a "cursor" to index into a larger dataset. The 2 main ways of pagination are:

  • Cursor - use an ordered key and a limit
  • Offset - start with an offset

Of the 2, cursor pagination is the simpler and more flexible form, and it fits our use case quite well.
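
To make the difference concrete, here is a minimal sketch of both styles over an in-memory sorted array. The `Item` type and the `pageByOffset` and `pageByCursor` helpers are hypothetical names used only for illustration; they are not part of Polykey or js-pagination.

```typescript
// Hypothetical record type; `id` is the ordered key used as the cursor.
type Item = { id: number; name: string };

// Offset pagination: skip `offset` rows, then take `limit` rows.
function pageByOffset(items: Item[], offset: number, limit: number): Item[] {
  return items.slice(offset, offset + limit);
}

// Cursor pagination: take `limit` rows whose key is strictly after `seek`.
// Assumes `items` is already sorted ascending by `id`.
function pageByCursor(
  items: Item[],
  seek: number | undefined,
  limit: number,
): Item[] {
  const start =
    seek === undefined ? 0 : items.findIndex((item) => item.id > seek);
  if (start === -1) return [];
  return items.slice(start, start + limit);
}

const items: Item[] = [1, 2, 3, 4, 5].map((id) => ({ id, name: `item-${id}` }));
const first = pageByCursor(items, undefined, 2); // ids 1, 2
const second = pageByCursor(items, first[first.length - 1].id, 2); // ids 3, 4
```

If a row is inserted ahead of the current position between two requests, an offset shifts by one, while a cursor stays anchored to the last key the client saw.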

In addition to this, one can combine cursors with streaming to return a stream of results based on the cursor. The only difference at this point is holding things in memory when you are streaming versus accumulating the results in memory and returning the result.

In the case of returning a static result in-memory, you free-up locks but you use up more memory. In the case of returning a stream, you may use less memory but it ends up being more complicated and more locking takes place. Due to our usage of leveldb and its streams making use of leveldb snapshots this may hide some of the complexity.

As a first stage prototype, let's add in pagination to all unary calls, and return static result arrays to be used by the CLI/GUI. Later we can explore using streaming.

We've built a pagination protocol before here: https://github.com/MatrixAI/js-pagination. That library is intended to be used on the client, but it describes what you might expect the server to take. It would mean GRPC methods will need:

  • direction
  • seek
  • limit
  • seekBefore
  • seekAfter

The last 2 may not be necessary; they only add more flexibility.

We've done this before on the Prism project, so there are some fiddly things here to note that require further discussion.

Another point is that streaming may be more useful for "reactivity". Or observational APIs. PK isn't configured to push events out anywhere. If we intend to do some CQRS to the GUI to maintain eventual consistency, we may need to figure out if we designate streams as "live" events, so that downstream UIs can react to changes in state. See: https://stackoverflow.com/questions/39439653/events-vs-streams-vs-observables-vs-async-iterators/47214496

We want to apply the following parameters to any generator method. This should be the standard we use to allow pagination on streams. It can be applied to any GRPC streams as well.

    {
      order = 'asc',
      seek,
      limit
    }: {
      order?: 'asc' | 'desc';
      seek?: ClaimId;
      limit?: number;
    } = {},

Service handlers

GRPC service handlers for RPC calls that provide streaming need to support this pagination. The pagination parameters need to be supplied either as part of the requesting message or the call metadata. These parameters are provided to the generator we are consuming for the stream.

This will need to be applied to every RPC call that returns a stream.

CLI Commands

Some CLI commands output a list as a result. We need to apply pagination here as well. The CLI command will need to have parameters for seek, order and limit as specified above. Using these parameters it should be simple to make the GRPC call with them.

Domain Collections

In a few domains we provide a getXs method that returns a generator for a collection. Examples of this are sigchain.getClaims and gestaltGraph.getGestalts. These will need to take the pagination parameters as specified. This will be the basis for the other two sections. Reference the sigchain's implementation for how this is done.

    {
      order = 'asc',
      seek,
      limit
    }: {
      order?: 'asc' | 'desc';
      seek?: ClaimId;
      limit?: number;
    } = {},
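
As an illustration of how a domain generator might consume these parameters, here is a minimal sketch over an in-memory list. `SigchainSketch`, the string-based `ClaimId`, the `Claim` shape and the `collect` helper are all hypothetical stand-ins, and treating `seek` as an inclusive starting key is an assumption; the real sigchain implementation should be taken as the reference.

```typescript
// Hypothetical stand-ins: a string ClaimId and an in-memory sorted claim list.
type ClaimId = string;
type Claim = { id: ClaimId; payload: string };

type PaginationOptions = {
  order?: 'asc' | 'desc';
  seek?: ClaimId;
  limit?: number;
};

class SigchainSketch {
  // Claims are kept sorted ascending by id.
  constructor(protected claims: Claim[]) {}

  public async *getClaims({
    order = 'asc',
    seek,
    limit,
  }: PaginationOptions = {}): AsyncGenerator<Claim> {
    const ordered = order === 'asc' ? this.claims : [...this.claims].reverse();
    let count = 0;
    for (const claim of ordered) {
      if (limit !== undefined && count >= limit) return;
      // Assumption: `seek` is inclusive and bounds the start of iteration.
      if (seek !== undefined) {
        if (order === 'asc' && claim.id < seek) continue;
        if (order === 'desc' && claim.id > seek) continue;
      }
      yield claim;
      count++;
    }
  }
}

// Small helper to drain an async iterable into an array.
async function collect<T>(iterable: AsyncIterable<T>): Promise<T[]> {
  const results: T[] = [];
  for await (const item of iterable) results.push(item);
  return results;
}
```

A service handler could then pass the decoded request parameters straight through, for example `chain.getClaims({ seek, limit, order })`.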

Additional context

Tasks

  1. Identify RPC methods requiring pagination; these are the method calls returning lists of results
  2. Identify which can be streams, and which should just return static arrays
  3. Identify which streams, if any, should be part of "reactive" APIs
  4. Change the protobuf message types to incorporate pagination parameters
  5. Update GRPC service handlers to make use of the pagination parameters in a common way; create utility functions to be shared between all of them
  6. Update tests to include testing the pagination mechanism; create common utility functions to test all paginated APIs
  7. CLI code may not need to make use of the pagination, and may just ask for the entire stream of data. CLI workflows rarely have any pagination built in, so we don't need this.
  8. GUI code does make use of pagination. Pagination is a more common use case in GUIs, so this should be considered.
  9. Pagination parameters described above need to be applied to all relevant iterators across all domains.
@CMCDragonkai CMCDragonkai added development Standard development design Requires design enhancement New feature or request labels Sep 2, 2021
@CMCDragonkai
Member Author

When working on pagination in Prism we discovered one UI/UX issue.

Noticed a bug in pagination. Actually I think I saw this before. The problem is that you're on the "first page", but the direction is still false. So if you change the limit, it changes the limit but doesn't reset the direction. Now I remember that the reset button was intended for this. But this is a bit unintuitive behaviour regarding cursor. Might be a good idea to visualise the cursor direction so that users know to control it.

With cursor pagination direction is essential for being able to "go back" in pages. If the pagination only allows going forward in pages, that's fine, but that usually is not the full functionality expected for user interfaces. Once you have the seek key, if you want to paginate backwards, it can only be done if direction is flipped to false. But if you do this, it can be complicated if the other parameters are being changed, such as changing the limit.
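
A minimal sketch of the "go back" mechanics described here, assuming an exclusive seek key over an ascending list of numeric keys. The helper names are hypothetical, and the sketch uses the `order` parameter from the spec rather than the boolean direction used in Prism.

```typescript
type Page = number[];

// Hypothetical fetch over an ascending-sorted key list; `seek` is exclusive here.
function fetchPage(
  keys: number[],
  seek: number | undefined,
  order: 'asc' | 'desc',
  limit: number,
): Page {
  const candidates =
    order === 'asc'
      ? keys.filter((k) => seek === undefined || k > seek)
      : keys.filter((k) => seek === undefined || k < seek).reverse();
  return candidates.slice(0, limit);
}

// Next page: seek from the last key of the current page, ascending.
function nextPage(keys: number[], current: Page, limit: number): Page {
  return fetchPage(keys, current[current.length - 1], 'asc', limit);
}

// Previous page: seek from the first key of the current page with the
// direction flipped, then reverse so the page still displays ascending.
function previousPage(keys: number[], current: Page, limit: number): Page {
  return fetchPage(keys, current[0], 'desc', limit).reverse();
}
```

In this sketch the direction is derived from the action (next or previous) on every call rather than stored as state, which is one way to avoid a stale direction when the limit changes.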

So there has to be a "redesign" of the pagination UX at least expected from the library (maybe after a review of CQRS as well), there is still too much logic being done on the client side. More logic needs to be incorporated into the js-pagination with examples of how it's done with VueX store and dealing with GC as well.

Also possible issues with dealing with the last page/end of the paginated stream should be considered as well.

@CMCDragonkai CMCDragonkai mentioned this issue Oct 18, 2021
14 tasks
@CMCDragonkai
Member Author

This should be reviewed with respect to push-pull dataflow and control flow.

Push vs Pull

Push and pull are 2 paradigms of "reactivity" (https://en.wikipedia.org/wiki/Reactive_programming).

These concepts are applicable widely in many scenarios.

  • Configuration Management
    • Push configuration like Ansible, that push desired state to a target system
    • Pull configuration like Chef or agentful systems that pull desired state
    • Always Push then Pull, pulling requires bootstrapping off a pushed configuration first
    • Target system is not aware of origin system (origin system is where the origin of change begins)
  • Pagination vs Streaming
    • Pagination is pulling
    • Streaming is pushing
    • Hybrid systems require a client to initiate a pull, which can establish a push channel for the server to stream results.
  • Amphibious Operations
    • Initial beachhead is a push
    • Subsequent logistics is pull
  • Polykey Integration
    • Pushing capabilities require the pusher to have knowledge of the target system, and how to push - PK integrates into target systems
    • Pulling capabilities require the target system to know how to pull from PK - target systems "integrate into PK"
  • Framework vs Library
    • You call the Library
    • You get called by the Framework
    • Framework can push into your system, and Framework may pull from your system, either way the framework dictates your API
    • Your system can push into the library, and you may pull from the Library, you dictate the library's API
  • Dependency Graphs
    • These are pull based systems
    • Downstream dependencies pull upstream dependencies
    • Dataflow is from upstream to downstream
    • Control flow is from downstream to upstream

You want both depending on the circumstance, and many systems are both push and pull, just in different ways.

Push and pull systems when composed together can form a graph. This graph does not need to be acyclic, cycles in the graph can occur. Reactive systems are ultimately something that can be cyclic. But cyclic does not imply unproductive infinite evaluation. Productivity can still occur with infinite evaluation. Complete evaluation of the graph is not possible. Fundamentally systems are lazy and eventually consistent. Consider 2 agents communicating to each other. Each agent is a state machine. Each transition of the state may trigger transition in state on the other agent. The relationship is not one way, but 2 ways. Even in configuration systems, real state forms feedback into desired state. Thus an iterative system occurs as long as the system is "unstable". Stability may never be reached... it's possible divergence can occur. Managing divergence is an exercise in complexity. Think about machine learning systems: convergence and divergence. Stability may be a "process", not an end state, just like security. Perturbations occur in complex systems simply due to change and entropy.

The Origin of Change is an important concept. In a push interaction, the origin of change starts at the system pushing. In a pull interaction, the origin of change still occurs at the system being pulled. It's the change being applied in a configuration management system.

The initiator of the transaction is also important. This dictates which system has knowledge about the other system. This is independent of the origin of change (which indicates the direction of dataflow). The initiator of the transaction implies a "dependency" relationship in terms of integration direction.

The direction of dataflow may be opposite to the direction of dependency (data flow vs control flow).

                      Data Flow
        ┌─────────────────────────────────────┐
        │                                     │
┌───────┴───────┐                    ┌────────▼───────┐
│               │                    │                │
│ Desired State ├────────Push────────► Realised State │
│               │                    │                │
└───────┬───────┘                    └────────▲───────┘
        │                                     │
        └─────────────────────────────────────┘
                    Control Flow



                      Data Flow
        ┌─────────────────────────────────────┐
        │                                     │
┌───────┴───────┐                    ┌────────▼───────┐
│               │                    │                │
│ Desired State ├────────Pull────────► Realised State │
│               │                    │                │
└───────▲───────┘                    └────────┬───────┘
        │                                     │
        └─────────────────────────────────────┘
                    Control Flow

In push based systems, the dataflow is the direction of the pusher to the pushed.

In pull based systems, the dataflow is the direction of the pulled to the puller.

In push based systems, the control flow is the direction of the pusher to the pushed. The pusher is aware of the pushed.

In pull based systems, the control flow is the direction of the puller to the pulled. The puller is aware of the pulled.

Primitives in JS in push vs pull:

https://stackoverflow.com/questions/39439653/events-vs-streams-vs-observables-vs-async-iterators

https://github.com/kriskowal/gtor/blob/master/presentation/README.md

So all computing systems are reactive systems.

@tegefaulkes tegefaulkes changed the title GRPC Pagination via Cursors and Streams Pagination Deployment to Service Handlers and CLI Commands and Domain Collections Dec 6, 2022
@CMCDragonkai
Member Author

CMCDragonkai commented Dec 6, 2022

This is being pulled from #327 to here:

  1. Refactor src/agent/service/nodesChainDataGet.ts to instead be src/agent/service/sigchainClaimsGet.ts and this should also receive pagination parameters. This includes seek, limit and order. The order should be a protobuf enum: https://developers.google.com/protocol-buffers/docs/proto3#enum

@CMCDragonkai
Member Author

With the transition to JSON RPC, this is still valid. We will still be returning collection data as a stream of individual JSON messages. However we will need to take input parameters to act as a cursor to control where the stream starts.

@CMCDragonkai
Member Author

For the input JSON request, we can reserve a meta keyword in the params property. And this can be where we put "authentication" details.

Of course things like direction, limit, seek, they would just be at the root level of the params property.

@CMCDragonkai
Member Author

When we move from the GRPC to the JSON RPC, we want to have the seek, limit and order as parameters on the JSONRPCRequest object in the params subobject.

These will translate directly into the server streaming handlers, which themselves will just hand them over to any async generator method.
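
A rough sketch of what such a request could look like on the wire, following the JSON-RPC 2.0 envelope. Only `seek`, `limit`, `order` and the reserved `meta` key come from this discussion; the `JSONRPCRequestSketch` type, the `makeListRequest` helper, the `vaultsList` method name and the contents of `meta` are assumptions for illustration.

```typescript
// Hypothetical shapes; only `seek`, `limit`, `order` and `meta` come from the discussion.
type PaginationParams = {
  seek?: string;
  limit?: number;
  order?: 'asc' | 'desc';
};

type JSONRPCRequestSketch<P> = {
  jsonrpc: '2.0';
  method: string;
  params: P & { meta?: Record<string, unknown> };
  id: number | string;
};

function makeListRequest(
  method: string,
  pagination: PaginationParams,
  meta?: Record<string, unknown>,
): JSONRPCRequestSketch<PaginationParams> {
  return {
    jsonrpc: '2.0',
    method,
    // Pagination sits at the root of `params`; `meta` is reserved alongside it.
    params: { ...pagination, ...(meta !== undefined ? { meta } : {}) },
    id: 1,
  };
}

const request = makeListRequest(
  'vaultsList', // hypothetical method name
  { seek: 'abc', limit: 10, order: 'asc' },
  { authorization: 'Bearer <token>' }, // hypothetical meta content
);
const wire = JSON.stringify(request);
```

On the server side, the handler would strip `meta` and pass the remaining pagination fields to the generator unchanged.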

@CMCDragonkai
Member Author

On the caller side, these parameters should be passable from the CLI parameters.

So for example:

pk vaults list --seek <vaultId> --limit 10 --order asc

So technically our CLI doesn't really do much here. It doesn't become that useful until you get the GUI ready.

Normally...

pk vaults list

Will just stream the entire collection fully.

On the calling side, if it calls a server streaming method, it should use the output formatter at each iteration; it shouldn't accumulate all the data and then output it. This is what will enable the CLI to also be streamable.
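
A minimal sketch of formatting per iteration instead of accumulating. The `vaultsListStream` generator, the `formatRecord` formatter and the `streamToOutput` helper are hypothetical stand-ins for the real client call and output formatter.

```typescript
// Hypothetical stand-in for a server streaming call yielding vault names.
async function* vaultsListStream(): AsyncGenerator<{ vaultName: string }> {
  yield { vaultName: 'vault-a' };
  yield { vaultName: 'vault-b' };
}

// Hypothetical formatter: one record in, one output line out.
function formatRecord(record: { vaultName: string }): string {
  return `${record.vaultName}\n`;
}

// Writes each record as soon as it arrives instead of buffering the collection.
async function streamToOutput<T>(
  stream: AsyncIterable<T>,
  format: (item: T) => string,
  write: (chunk: string) => void,
): Promise<void> {
  for await (const item of stream) {
    write(format(item));
  }
}
```

In a CLI command the `write` callback would typically be something like `(chunk) => process.stdout.write(chunk)`, so memory use stays constant regardless of collection size.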

@CMCDragonkai
Member Author

@tegefaulkes I remember we discussed this especially in reference to changes you're doing for deadlines, did you add in the pagination capability to the stream handlers on the server side? And all that needs to be done is to propagate --seek, --limit and --order parameters from CLI to the client calls?

@tegefaulkes
Contributor

I don't think I've made changes for this yet.

@CMCDragonkai
Member Author

Neither the server nor client side?

@tegefaulkes
Contributor

I recall looking it over and seeing the generators implementing the seeking behaviour. Right now I don't recall if it's standard across the board. I don't think all the bin commands where seeking applies have all of the seeking options right now.

@CMCDragonkai
Member Author

All bin commands have seeking? But do all client handlers have relevant seeking parameters?

@tegefaulkes
Contributor

I don't recall at this time. It's something I'll have to check.

@CMCDragonkai
Member Author

Check all CLI commands for --seek and --limit and --direction.

Check all JSON RPC handlers for seek, limit and direction.

Apply them to the generator codes.

@tegefaulkes
Contributor

nodesClaimsGet handler has the seeking disabled currently. It needs to be re-enabled as part of this issue.

@CMCDragonkai
Member Author

@addievo also review this too.

@CMCDragonkai
Member Author

CMCDragonkai commented Nov 14, 2023

In the context of audit domain - this will be important.

You start with a DB transactional snapshot iterator. That becomes an AsyncIterable through AsyncGenerator syntax, and then at the client service it becomes a server streaming call.

#599 - dashboard backend may use js-rpc and js-ws to make a server streaming call to the seed cluster agents.

When doing so, it will need to provide some parameters to control the result.

Normally the result is finite. You can control the finiteness using pagination parameters as expressed above.

@amydevs - there are no client service handlers that currently behave as per the OP spec.

We can start with the audit domain to do so.

If you do a while loop where you are continuously calling the server streaming call to get the new results, while preserving a cursor, that is equivalent to having an infinite iterator. That is one way to get live updates for #599. This is still a pull-based architecture.
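
A minimal sketch of that loop, assuming a finite streaming call that yields events strictly after a `seek` key. The `AuditEvent` shape, the `FetchEvents` signature, the delay between empty polls and the `shouldStop` callback are all assumptions for illustration.

```typescript
type AuditEvent = { id: number; data: string };

// Hypothetical finite streaming call: yields up to `limit` events after `seek`.
type FetchEvents = (params: {
  seek?: number;
  limit: number;
}) => AsyncIterable<AuditEvent>;

// Turns repeated finite calls into one unbounded iterator by preserving a cursor.
async function* pollEvents(
  fetchEvents: FetchEvents,
  limit: number,
  delayMs: number,
  shouldStop: () => boolean = () => false,
): AsyncGenerator<AuditEvent> {
  let cursor: number | undefined;
  while (!shouldStop()) {
    let received = 0;
    for await (const event of fetchEvents({ seek: cursor, limit })) {
      cursor = event.id; // remember the last seen key
      received++;
      yield event;
    }
    // Nothing new: wait before pulling again.
    if (received === 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The consumer sees a single endless `for await` loop, while every underlying call remains finite and restartable from the cursor.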

Alternatively there could be a server streaming call that would always be alive. It is then the handler side's responsibility to push data into the call, and the client is pulling forever. It would only close if the client decides to close the stream.

Dashboard service singleton which could do both.

If we do both, there should be a standard of distinguishing between these kinds of server stream calls.

  • getNodes() - default to finite
  • getNodesInfinite() - default to infinite?

Another way is to provide a parameter that distinguishes the 2.

    {
      order = 'asc',
      seek,
      limit
    }: {
      order?: 'asc' | 'desc';
      seek?: ClaimId;
      limit?: number;
    } = {},

Imagine:

  • limit = 10 - this is always finite
  • limit = undefined - ?
  • limit = Infinity - ? - the problem with this, is that it doesn't exist in JSON - however it does become null
  • limit = 0 - this is still finite
  • limit = -1 - ?

Therefore we could do something like:

  • limit = 10 - that is finite
  • limit = undefined - that is whatever the handler decides it is
  • limit = null - this means Infinity - at this point we could choose to have an infinite server stream - then you could have the audit handler facilitate this internally - if the handler cannot support infinite streams, it can treat this as equivalent to undefined
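
The convention above could be captured in a small helper like the following sketch. The `LimitMode` type and `interpretLimit` function are hypothetical names, and rejecting negative or non-integer limits is an assumption, since the discussion leaves `limit = -1` undecided.

```typescript
// Outcome of interpreting `limit` under the convention proposed above.
type LimitMode =
  | { type: 'finite'; limit: number }
  | { type: 'default' }
  | { type: 'infinite' };

function interpretLimit(limit: number | null | undefined): LimitMode {
  if (limit === undefined) return { type: 'default' }; // handler decides
  if (limit === null) return { type: 'infinite' }; // stands in for Infinity
  // Assumption: negative or fractional limits are treated as invalid.
  if (!Number.isInteger(limit) || limit < 0) {
    throw new RangeError(`Invalid limit: ${limit}`);
  }
  return { type: 'finite', limit };
}

// JSON has no Infinity; JSON.stringify serialises it as null,
// and omits properties whose value is undefined.
const encoded = JSON.stringify({ a: Infinity, b: undefined, c: 10 });
// encoded === '{"a":null,"c":10}'
```

Because an omitted property and `undefined` are indistinguishable after a JSON round trip, `null` is the only one of the three values that needs to be sent explicitly.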

Also, by default I'd prefer undefined to mean a result set of 1 rather than 0.

@CMCDragonkai
Member Author

Also, if you want to limit by a seek, you could add one more parameter, seekLimit, which represents the end of the seek.

It is mutually exclusive to limit. So you can only use one of them at a time.
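
One way to express the mutual exclusivity is a union type backed by a runtime check, as in this sketch. The `Bound` and `RangeParams` types and the `validateRange` and `range` helpers are hypothetical, and the sketch assumes ascending order with both `seek` and `seekLimit` inclusive.

```typescript
// Either `limit` or `seekLimit` may be set, never both.
type Bound =
  | { limit?: number; seekLimit?: never }
  | { seekLimit?: string; limit?: never };

type RangeParams = { seek?: string } & Bound;

// Runtime check for untyped input such as decoded JSON.
function validateRange(params: { limit?: unknown; seekLimit?: unknown }): void {
  if (params.limit !== undefined && params.seekLimit !== undefined) {
    throw new TypeError('`limit` and `seekLimit` are mutually exclusive');
  }
}

// Assumption: `keys` is sorted ascending and both ends are inclusive.
function* range(keys: string[], params: RangeParams): Generator<string> {
  validateRange(params);
  let count = 0;
  for (const key of keys) {
    if (params.seek !== undefined && key < params.seek) continue;
    if (params.seekLimit !== undefined && key > params.seekLimit) return;
    if (params.limit !== undefined && count >= params.limit) return;
    yield key;
    count++;
  }
}
```

The type catches the conflict at compile time for typed callers, while `validateRange` covers parameters that arrive over the wire.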

@CMCDragonkai CMCDragonkai assigned okneigres and amydevs and unassigned tegefaulkes Nov 14, 2023
@CMCDragonkai
Member Author

We can start this issue in the audit domain first - but closing this issue will require full adoption in the client service and agent service.

@tegefaulkes
Contributor

I'm moving this to todo since it's not actively worked on.

@CMCDragonkai
Member Author

@tegefaulkes I'm hoping that we can start this now with all the new Unix commands.

@aryanjassal should be reviewing this issue. I want to make sure that our pagination system is being put together properly.

@CMCDragonkai
Member Author

For something like secrets ls, there may not be a natural "seek key", because when you read a directory, you're not really able to propagate it in. In that case seek may not make sense.

However things like limit and order might make sense. The order usually is based on whatever is the natural order of the filesystem. Which I think is based on the EFS's usage of the inode index - although I don't remember if that is ordered numerically in the DB.

4 participants