Solution to speed up syncing large folder structures

I don’t really want to implement this myself since you guys seem to do such a good job, but I think it should be possible to greatly speed up the initial sync on startup by keeping track of the last-synced time (debouncing as necessary to minimize disk writes).

By keeping track of this time, you can quickly scan for local changes by just comparing the last modified time on every file recursively. This can be done very quickly since you only need to read the file metadata and not the whole file. I can enumerate all 180k files recursively this way (also grabbing the modified time) in a couple of seconds in Python 3.6 using os.scandir, versus the hour that Insync is taking for some reason.
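
Roughly what I mean, as a sketch (the folder path and the stored timestamp are placeholders):

```python
import os

def scan_tree(root):
    """Yield (path, mtime) for every file under root, reading only
    metadata -- never file contents."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # On Windows the stat result comes for free with the
                    # directory listing; on Linux it is one cheap lstat.
                    yield entry.path, entry.stat(follow_symlinks=False).st_mtime

# Anything modified after the stored last-synced time is a sync candidate.
last_synced = 1_600_000_000.0  # placeholder epoch timestamp
changed = [path for path, mtime in scan_tree("/path/to/synced/folder")
           if mtime > last_synced]
```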

I assume Insync is careful when downloading a file that has been updated remotely: it first writes the data to a temp file and then renames that file into the correct position. That avoids corrupted files, since a rename is the only way to guarantee atomicity.
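
In Python that pattern would look something like this (just a sketch of the technique, not Insync’s actual code; `atomic_write` is my own name for it):

```python
import os
import tempfile

def atomic_write(dest_path, data):
    """Write data to dest_path atomically: readers see either the old
    file or the complete new one, never a half-written file."""
    # The temp file must live in the same directory (same filesystem) as
    # the destination, otherwise the rename degrades to a non-atomic copy.
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX and NTFS
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```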

I might be missing some really weird edge cases where this doesn’t work, but the only one I can come up with is that you need to be careful that the last-synced time gets written/updated fairly often. When many files are being updated continuously, care must be taken that this time is only advanced once the sync has actually processed all the changed files in flight. I know Windows is particularly stupid with its file watching API, since it can notify watchers before the file is actually changed.
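
By “written fairly often” I mean something like this (a rough sketch; the class name and interval are mine): coalesce rapid updates in memory and flush to disk at most once per interval.

```python
import threading

class DebouncedTimestamp:
    """Persist a last-synced timestamp, but coalesce rapid updates so the
    disk is written at most once per flush_interval seconds."""

    def __init__(self, path, flush_interval=5.0):
        self.path = path
        self.flush_interval = flush_interval
        self._lock = threading.Lock()
        self._pending = None
        self._timer = None

    def update(self, timestamp):
        # Callers should only pass timestamps for work that has fully
        # completed; otherwise a crash could skip changes on next startup.
        with self._lock:
            self._pending = timestamp
            if self._timer is None:
                self._timer = threading.Timer(self.flush_interval, self._flush)
                self._timer.daemon = True
                self._timer.start()

    def _flush(self):
        with self._lock:
            value, self._pending = self._pending, None
            self._timer = None
        with open(self.path, "w") as f:
            f.write(str(value))
```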

Also, if the concern is with moved files, those are very easily tracked by checking inodes/file IDs as long as the file system is not ancient (i.e. it should work for NTFS and all Linux file systems). While a file might move or change locally, its inode stays the same unless the data is copied. In the case where someone moves a file/directory by copying it and then deleting the original, I don’t think it is even possible in general to detect this.
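
A sketch of what I mean (the file names in the demo are throwaway placeholders):

```python
import os

def file_id(path):
    """Stable identity for the underlying file object. st_ino is the
    inode on Linux/macOS; since Python 3.5, os.stat fills it with the
    NTFS file index on Windows, so the same check works there."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

# Tiny demo: a rename keeps the identifier, so a move is detectable
# without re-reading or re-transferring the file's content.
open("a.txt", "w").close()
before = file_id("a.txt")
os.rename("a.txt", "b.txt")
assert file_id("b.txt") == before
os.remove("b.txt")
```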

I don’t think this is a solution to speed up the initial scan.

Insync isn’t reading whole files; if it were, the initial scan would take significantly longer than it currently does.
Insync also already caches enough information to quite reliably identify changed files by looking only at their metadata.

There must be something else that’s slowing down the initial scan. For example, it needs to check whether there are any changes on the cloud side, and there could be loads of other operations that need to be done. The logic of a solid real-time sync client is quite complicated.
