Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Work in Progress] 3rd-party API: ETag for faster synchronization #51

Closed
wants to merge 4 commits into from

Conversation

korelstar
Copy link
Member

This is my prototype which introduces ETags for notes. See #49 for further details. For now, it's work in progress but I'm looking forward for any kind of feedback and a lively discussion in #49. 😃

@korelstar korelstar force-pushed the api-sync-speed branch 4 times, most recently from 3cc9b01 to d585c82 Compare January 20, 2017 18:53
@korelstar korelstar added the feature: API Related to the API for 3rd-party apps, i.e. Android or iOS label Jan 22, 2017
@korelstar
Copy link
Member Author

I'm running my approach now for several weeks and it is working very well. However, there is one issue I want to discuss: It scales only up to approximately 400 notes.

Background: The ETag uses the md5 algorithm in order to generate a unique value. md5 generates a 128-bit (16 Byte) binary hash value. Our API uses HTTP's GET in order to request notes. Hence, the ETag has to be transmitted in the HTTP header and, therefore, md5's binary hash value has to be encoded in an ASCII string. For this case, the best approach is base85 which encodes the 16-Byte binary value into a 20-Byte ASCII string. Unfortunately, the HTTP header has a practical size limit, which differs between server software and configuration. A common limit is 8KiB (requests with a larger header are not processed). Since we have to transfer some other data in the header, there is space for the ETag of approximately 400 notes.

Solutions: There are several possible solutions for this:

  1. Change the API's HTTP request method for getting notes, so that we can transfer ETags in the content of the requests (unlimited size).
    Pro: full scalability in a single request.
    Con: breaks backward-compatibility (i.e. the client has to know which API version the server talks).
  2. Use HTTP header Range in order to split the request into multiple parts of only <400 notes.
    Pro: full scalability, full backward-compatibility.
    Con: requires multiple requests (if #notes>400), makes clients more complex.
  3. The client bounds the number of sent ETags to <400. If there are more notes, the supplementary notes will be transferred always (like in the current release).
    Pro: full backward-compatibility, easy implementation.
    Con: Speed-up works only for the first ~400 notes.

Conclusion: Is this (>400 notes) an edge case? Should we merge this pull request now and care about the open issue later? I would argue for this. But what do you think?

Do you even understand what I'm talking about? Maybe my English is not good enough? 😎
Please give feedback!

@nextcloud/notes @Henni @stefan-niedermann and everybody else 😀

@BernhardPosselt
Copy link
Member

The solution should be to use one etag per note if you request a specific note and one etag for all notes (e.g. etag of the latest note or hash all notes' etags in a specific order)

So basically: make it so an etag will always be 32 chars long

@korelstar
Copy link
Member Author

Thanks for you comment! 😃
I thought about that solution, too. But I don't like the low performance: In order to speed-up my update process, I don't want to check every single note, but all notes in one request. However, if I have one ETag for all, and only one single note has changed, the ETag changes as well. Then all notes have to be transferred, although most of them haven't changed.

I wanted to improve this, and therefore, in my approach the client sends all known ETags in a single request when updating all notes.

@BernhardPosselt
Copy link
Member

Well if you send only the etag of the latest one you could actually send a diff

@BernhardPosselt
Copy link
Member

BTW this is probably a scenario where the last modified header would make more sense than an etag

@BernhardPosselt
Copy link
Member

Another idea: you send back all notes but notes with a smaller last modified time than the one that was requested will only contain an Id. That way you can in fact keep track of deleted notes as well.

@korelstar
Copy link
Member Author

Unfortunately, we can't use the filemtime here, because of the following scenario:

  • a note is updated on a client A at time t=1, but not synchronized (no network)
  • client B synchronizes at time t=2
  • client A synchronizes at time t=3, the note is saved on the server with filemtime=1 (!)
  • client B synchronizes at time t=4 with If-Modified-Since: 2 and does not receive an update for the changed note, because it looks that it wasn't change since last synchronization (filemtime=1!).

Nevertheless, please continue thinking with about this, maybe we find a better solution 😃

We would need another type of modified time, i.e. we need to distinguish between the real filemtime (here t=3) and the application-level modified time (here t=1). However, Nextcloud doesn't provide such an approach -- as far as I know. Or could we utilize that the database has two attributes mtime and storage_mtime? Currently, both cols have the same value.

Nevertheless, the approach of using the modified time of the file does not take into account that meta-data (i.e. the category or favorite attribute) can change independently of the file content (see my analysis in #49).

@BernhardPosselt
Copy link
Member

BernhardPosselt commented Feb 10, 2017

AFAIK filemtime is not synchronised. If a client created a note he won't have an id for it yet so you can pretty easily check which files are in the server and which aren't

@korelstar
Copy link
Member Author

Oh yes, the filemtime is synchronized. I wrote that part: 7d914c8. It's in the release since v2.1.0.

To your second part: Yes, note creating detection is easy. But updates have to be handled correctly, as well. ;-)

@BernhardPosselt
Copy link
Member

Seems like a bad idea then. ;)

I don't see the value, client and server time don't have to match at all and I'm unsure if this could even cause sync client issues. If the client update date is important for you, why not store it as an additional fields in the dB?

@BernhardPosselt
Copy link
Member

What I'm trying to say is: there is this really simple, fast restful approach which is blocked by a minor feature that can be solved in a different fashion :)

@korelstar
Copy link
Member Author

I understand what do you mean. However, I like that feature (there are at least 3 different use-cases for this*) and -- which is more important -- the server core has the same feature (in WebDAV using propset or the HTTP header X-OC-MTime and since Nextcloud 11 also in the Webupload, see owncloud/core#21237 resp. nextcloud/server#1283).
Hence, if somebody synchronizes her notes using WebDAV or uploads a note in the files app (and that's an explicit use-case for the notes app!), then we can have problems if we use the filemtime. It looks like this is a show-stopper -- regardless we have the feature in this app or not.

*) use-cases:

  1. When synchronization happens a week after editing (e.g. no network on vacations), then it's confusing if last change date is wrong
  2. Undo-delete in the Android app: Recreates the note with the old date after an accidentally done "swipe to delete": Undo for "Swipe to delete" notes-android#158
  3. Cancel edit restores the old note (with old date) if auto-save has already saved intermediate changed: New feature: Cancel edit (restores old state) notes-android#146

@BernhardPosselt
Copy link
Member

BernhardPosselt commented Feb 11, 2017

(in WebDAV using propset or the HTTP header X-OC-MTime and since Nextcloud 11 also in the Webupload, see owncloud/core#21237 resp. nextcloud/server#1283).

I see, that's a bummer then. I guess this is where simple APIs like webdav fall apart and you need custom handling :)

In that case I'd order all files by filemtime, concatenate their etags and hash them again and use that as etag. If the E-Tag doesn't match, then send all files again.

@korelstar
Copy link
Member Author

In that case I'd order all files by filemtime, concatenate their etags and hash them again and use that as etag. If the E-Tag doesn't match, then send all files again.

Now we are at a point, we were already before: #51 (comment) (see above) 😉

Sending all notes although only one of them may have changed is not reasonable (especially in mobile environments).

Maybe, do you have another idea, how we can improve this situation?

The only other way I could imagine, is to have two steps:

  1. Use a ETag for all notes together to detect whether something was changed. But in contrast to your approach, the server sends then only a list of note IDs and respective ETags (no ressource wasting).
  2. The client compares the ETags and makes a second request for all notes that have different ETags.

For the second step, the API has to introduce a new routing. But I think, backward-compatibility can be achieved, because the client can distinguish the server version on behalf of the response from 1.

Pro: If nothing has changed, the synchronization will be faster than all my other suggestions from #51 (comment).
Con: If something has changed, we need two requests to get the updated data. Since latency is an issue for mobile connections, this will slow down the synchronization.

@BernhardPosselt
Copy link
Member

The less requests (especially on mobile) the faster it will be because latency is the killer and you can compress text pretty well using gzip.

Another approach would be to hook into file creation and update process and maintain a modified date on the server so you can make the "give me all notes modified since x" approach would work.

@korelstar
Copy link
Member Author

That's a promising idea! I wasn't aware of the existence of hooks. But according to the documentation, this should work for us. I thought about this for some time, and finally, I came to the insight that it would be better to do it without hooks.

Variant using hooks

We would have to introduce a database table with two columns fileid and realmtime. Using the hooks, we can fill/update that table on every postWrite,postCreate,postTouch,postCopy,postRename and postScanFile (don't know if we really need them all) if the target is inside the notes directory. Then we need a hook for postDelete and postRename which removes the respective row from the database table if the source was in the notes directory.

But what happens if the notes app is deactivated temporarily and files are modified or deleted in this time period? We will get orphaned rows in the database table and we will lost updates of the files! When thinking about a solution for this, I came to this variant:

Variant without hook

Introduce a database table with three columns fileid, etag and lastupdate which work more like a cache. When performing GET /notes, the table content is loaded and updated:

  • orphaned entries (note was deleted) are removed
  • entries with differing etag are removed
  • notes that are then not in database (which means they are new or update=changed ETag), are inserted with lastupdate=now()

Then, lastupdate doesn't reflect the real time of the last update. Instead it will be typically higher. But that is sufficient for the use-case. What do you think about this solution?

@korelstar korelstar changed the title 3rd-party API: ETag for faster synchronization [Work in Progress] 3rd-party API: ETag for faster synchronization Mar 3, 2017
@jancborchardt
Copy link
Member

@korelstar what’s the current state here? Can you rebase? :) @Henni @irgendwie @BernhardPosselt @Gomez do you have feedback? :)

@korelstar
Copy link
Member Author

korelstar commented May 1, 2017 via email

@Henni
Copy link
Member

Henni commented Jun 27, 2017

@korelstar we can close this, right?

@korelstar
Copy link
Member Author

yes

@korelstar korelstar closed this Jun 27, 2017
@korelstar korelstar deleted the api-sync-speed branch June 27, 2017 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request feature: API Related to the API for 3rd-party apps, i.e. Android or iOS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants