Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git status functionality #309

Closed
paraseba opened this issue Oct 21, 2024 · 12 comments · Fixed by #711
Closed

Add git status functionality #309

paraseba opened this issue Oct 21, 2024 · 12 comments · Fixed by #711

Comments

@paraseba
Copy link
Collaborator

No description provided.

@dcherian
Copy link
Contributor

dcherian commented Nov 5, 2024

Design Doc

Goal

Our goal is to have .status() surface useful information to the user. Since status is inherently backward-looking, I expect users will approach it with four questions in mind:

  1. Where did I start?
  2. What did I do?
    • Did I create a new group?
    • Did I delete an existing?
    • Did I update an existing group?
      • Did I update attributes of this group?
      • Did I create a new array?
      • Did I delete an array?
      • Did I modify existing arrays?
        • Did I modify attributes?
        • Are these chunk modifications?
          • did I write new chunks?
          • did I delete existing chunks?
          • did I overwrite existing chunks?
        • How did I modify the chunks (if relevant)?
          • What fraction of the array was modified?
          • Did I append?
          • Did I write a region?
  3. Wait, I didn't know I did that?
    • We should let the user interrogate the changeset to see what exact changes were made. E.g. you added an attribute to an array at this path.
  4. What do I do next?
    • Commit!

Information to surface

  1. For Q1, we should surface VCS history information:
    a. Repo bucket
    b. Base Snapshot ID and commit time.
    c. Do we need to surface a truncated commit message too?
    d. Current branch

  2. For Q2, we should surface information in the current ChangeSet:

    • new_groups, new_arrays
    • deleted_groups, deleted_arrays
    • updated_arrays -> zarr array metadata
    • updated_attributes -> group/array user attributes
    • set_chunks -> modified chunks (create/delete/overwrite)
  3. For Q3, we punt to later :)

  4. For Q4, we should encourage the user to make a commit by saying loud and clear that these are "uncommitted changes" and will be lost (?).

How to surface

As a tree?

One particularly neat way to surface this information would be to show the hierarchy as tree. We could construct a tree for the snapshot + changeset: new-tree. Then iterate through the changeset and add information to the appropriate node of the tree. During diffing, if two nodes are in the same place, we examine the changeset for metadata, chunk modifications and annotate the new-tree appropriately. The diffed-tree structure could then be rendered to text in Rust, and transferred to Python for rendering with rich (for example)

As plain text?

We could simply output formatted lists of created/updated/deleted groups/arrays.

Misc

Some things to be careful about:

  1. Initial commit should not be confusing to the user.
  2. Comparing two different status outputs should be easy
    1. Information should be easy to scan, sorting is a good idea.

@paraseba
Copy link
Collaborator Author

paraseba commented Nov 6, 2024

I'll add another option to "How to surface":

  • full structured list of changes.

That could help us in the future have an answer to 3.

Example:

struct Status(Vec<(Path, Vec[Change])>)

enum Change {
   ArrayCreated(ZarrMetadata),
   ArrayMetadataUpdated(ZarrMetadata),
   ChunksDeleted(Vec<ChunkIndices>),
   ChunksWritten(Vec<ChunkIndices>),
   GroupDeleted,
    ....
}

This would give us a high level language. Then we can build different types of formaters on top of it.

@paraseba
Copy link
Collaborator Author

paraseba commented Nov 6, 2024

@dcherian I realize my proposal is very similar to work I'll have to do for transaction log support. That structured status looks a lot like a transaction log. Maybe we shouldn't try to do both things in parallel, but we can discuss more.

@dcherian
Copy link
Contributor

dcherian commented Nov 6, 2024

Ah nice, yes that makes sense.

I'm also thinking the formatting of the structure should live in Rust, and we may want to expose it through a CLI in the future?

@rabernat
Copy link
Contributor

rabernat commented Nov 6, 2024

We should probably design both status and transaction log as abstract data structures and then build ways to transform them into something for display in different contexts, e.g.

  • json
  • python class with a nice __repr__
  • rich formatted text for CLI
  • etc

@paraseba paraseba moved this to Backlog in Icechunk 1.0 Dec 5, 2024
@paraseba paraseba moved this from Backlog to Ready in Icechunk 1.0 Jan 14, 2025
@TomNicholas
Copy link
Contributor

TomNicholas commented Feb 5, 2025

I was about to raise an issue asking for a way to do a "git diff" between two icechunk commits, which in my mind would return a ChangeSet object. Seems like that would be a prerequisite for part 2 of @dcherian's plan for "git status" as described here.

expose it through a CLI in the future?

I also think this would be useful.

@mpiannucci
Copy link
Contributor

mpiannucci commented Feb 5, 2025

git status when in a writable session is somewhat easy because we can just us the ChangeSet to make something useful. Getting a diff between commits is harder though because we will need to use the transaction logs to look at what happened and when

@TomNicholas
Copy link
Contributor

A diff between the current state and the last commit would also be very useful in the short term, and presumably doesn't require going back through and summarizing the whole transaction log.

@paraseba
Copy link
Collaborator Author

paraseba commented Feb 5, 2025

Every commit generates a Transaction Log. it should be quite easy to convert a Transaction Log (or a merge of many transaction logs) into a textual description. Tx logs are a pretty simple datastructure:

pub struct TransactionLog {

    pub new_groups: HashSet<NodeId>,
    pub new_arrays: HashSet<NodeId>,
    pub deleted_groups: HashSet<NodeId>,
    pub deleted_arrays: HashSet<NodeId>,
    pub updated_user_attributes: HashSet<NodeId>,
    pub updated_zarr_metadata: HashSet<NodeId>,
    pub updated_chunks: HashMap<NodeId, HashSet<ChunkIndices>>,
}

We introduced Tx logs for two reasons: diffs, and rebases.

@TomNicholas
Copy link
Contributor

@paraseba how hard would it be for me to add this feature now? I just want to be able to run something from the command line that prints some textual representation of the diff.

@paraseba
Copy link
Collaborator Author

paraseba commented Feb 5, 2025

@TomNicholas I don't think it would be hard, have you done any Rust? I can be your guide through the Rust code. Worst case scenario, we could expose the ChangeSet to python, and implement the status representation in python, but we are trying to avoid that and put all logic in Rust.

@TomNicholas
Copy link
Contributor

I don't think it would be hard, have you done any Rust?

I have not, but this seems like a good opportunity to learn.

I can be your guide through the Rust code.

I'll raise a separate issue for this now, as it seems like a smaller feature than described by this issue.

@paraseba paraseba self-assigned this Feb 10, 2025
@paraseba paraseba moved this from Ready to In progress in Icechunk 1.0 Feb 10, 2025
paraseba added a commit that referenced this issue Feb 10, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in Icechunk 1.0 Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants