File Integrity Checking

Hi,

What sort of mechanism are you employing to check file integrity after a sync? In other words, how do you ensure that a file that was uploaded to Google Drive is an exact copy of the one on the local machine?

Normally something like a file hash check would be run on both sides to confirm the match as a lot can go wrong during a file sync and leave you with an alternative version of a file on the other end.

Nothing specific is mentioned in your product feature set about this, and it concerns me that no checking may be done.

Werner

Hi @WvdW, I think I have responded to your email, but I will also respond here for other users’ reference:

Integrity checking for file upload/download is built into the protocols we’re using (TCP/IP + TLS), so there’s a negligible chance that files are corrupted in transit. As an extra check we use the MD5 checksum metadata provided by Google Drive.
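For readers who want to run this kind of comparison themselves, here is a minimal Python sketch. It is not Insync's code. It assumes you have already obtained the checksum string for the remote file by some other means (the Drive API v3 exposes an `md5Checksum` field in file metadata for files with binary content). The function names `local_md5` and `matches_remote` are hypothetical, chosen only for illustration.

```python
import hashlib


def local_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_remote(path, remote_md5):
    """Compare a local file with a checksum string obtained from Drive.

    remote_md5 is an assumed input; fetching it is outside this sketch.
    """
    return local_md5(path) == remote_md5.strip().lower()
```

A mismatch from a check like this only says the two checksums differ. As discussed later in the thread, the remote metadata itself can be wrong, so a mismatch is a reason to investigate rather than proof of corruption.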

Thanks @jaduenas. Your mail response was after I already posted this :slight_smile:

Additional questions relating to the same subject matter:

  1. In the event where a large file is being synced and Insync is either paused or the machine is rebooted halfway through the sync, what happens? Will Insync automatically continue with that sync as soon as it comes back online, or will it restart that file sync from the beginning again?
  2. Seeing as you are doing MD5 sum checks, I am assuming that no file will be shown on the opposite side as complete until the sums match, correct? What then happens with a file that is indicated as having been synced but the MD5 sums do not match… does Insync not show it on the opposite side, remove the corrupt version and then start the sync again, repeating this process until the MD5 sums do match?

Werner

Tagging our engineer @marte so he can help you out with this :slight_smile:

I’d like to know more about this too. I have synced about 2TB of data so far with the idea that at some point I will stop maintaining local copies of certain folders. However, I’m concerned that I have no way to confirm whether everything made it to my Google Drive with no issues or not. Would love to hear how you guys are handling this within the app. Thanks!

@Jason_Miller making it there is easy to confirm through basic checks that can be done running insync-headless commands (I am assuming you are using headless as that’s the only way we run it) but the same can be done with the GUI tools as well, although I can’t tell you exactly where to find the equivalent options in the GUI.
The concern that I have is not whether it made it there but rather whether it made it there in exactly the same state as it was on the local source. We sync about 25TB, so you can understand that there are many chances of things getting corrupted along the way if the integrity checks aren’t being done.

The 4 main checks you can do to confirm that everything got there are:

  1. insync-headless get_sync_progress: Will show you which files are syncing right now and how many in total are still left to sync.
  2. insync-headless get_actions_required: Will show any pending actions that Insync must take that for some reason or another it was not possible to complete successfully.
  3. insync-headless get_errors: Will show any specific errors Insync encountered. You can keep on retrying those until the list is clear.
  4. insync-headless get_status: Will show if Insync is busy actively syncing right now or just sitting and waiting.

The objective with the 4 above commands is to get them to display the following (in the same sequence as above):

  1. No syncing activities
  2. None
  3. None
  4. SHARED

When you get to this state you will know that Insync has done everything it has to, and that all files in the source folders have been synced.
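The four checks can be scripted. The following Python sketch runs each `insync-headless` subcommand named above and compares its output with the idle values listed in this post. It assumes the commands print exactly those strings, which may differ between Insync versions, so treat the `EXPECTED` table as something to adjust. The helper names `run_headless` and `sync_is_idle` are hypothetical.

```python
import subprocess

# Idle output per command, copied from the list above. These literal
# strings are an assumption and may vary between Insync versions.
EXPECTED = {
    "get_sync_progress": "No syncing activities",
    "get_actions_required": "None",
    "get_errors": "None",
    "get_status": "SHARED",
}


def run_headless(command):
    """Run one insync-headless subcommand and return its trimmed output."""
    result = subprocess.run(
        ["insync-headless", command],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def sync_is_idle(runner=run_headless):
    """Return (ok, report); report maps each command to its output."""
    report = {command: runner(command) for command in EXPECTED}
    ok = all(report[command] == EXPECTED[command] for command in EXPECTED)
    return ok, report
```

Calling `ok, report = sync_is_idle()` gives a single yes/no answer plus the raw output for anything that did not match. Note that this only confirms Insync believes it is finished; it says nothing about content integrity.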

Now all we need is confirmation of the integrity checking and then you will know that it’s an exact copy of the source.

We sync everything to a staging folder in Drive and then once we’re happy it’s synced, we just move all the content in this folder to another which isn’t being sync-checked and there you have it - no local copies and all original content available in Drive. Insync will clean up after itself as soon as you do the Drive folder content move, and it will remove all the local content automatically for you.

Werner

Hello WvdW,

  1. We continue from where we left off, as long as the “upload session” we use with Google Drive for that file is still valid.

  2. We used to show an error when there was an MD5 mismatch. However, because of an API issue in Google Drive where the queried metadata does not match for some files of some users, the supposed error became a red herring: the files were actually downloaded correctly. So right now we’re just logging it if there’s an MD5 mismatch. File corruption due to network transfer shouldn’t be a concern.
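To illustrate point 1, here is a small Python sketch of the decision a client can make after asking a Google Drive resumable upload session how much it has received. It follows Google's documented behaviour for resumable uploads (a 308 response carrying a `Range: bytes=0-N` header means bytes 0 through N are stored; a 404 means the session is gone). Whether Insync implements it exactly this way is not stated in this thread, and the function name `next_upload_step` is hypothetical.

```python
import re


def next_upload_step(status_code, range_header=None):
    """Decide how to continue after querying a resumable upload session.

    Returns ("done", None), ("resume", offset) or ("restart", 0).
    """
    if status_code in (200, 201):
        return ("done", None)
    if status_code == 308:
        if not range_header:
            # Session is still valid but no bytes have been stored yet.
            return ("resume", 0)
        match = re.fullmatch(r"bytes=0-(\d+)", range_header.strip())
        if match:
            # The header names the last stored byte, so continue after it.
            return ("resume", int(match.group(1)) + 1)
    # Anything else (for example 404, or an unparseable Range header)
    # is treated as an expired session: upload from the beginning.
    return ("restart", 0)
```

The "restart" branch corresponds to the case where the upload session is no longer valid and the file has to be sent again from the start.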

Hi @marte. Thanks for the feedback.

General comment:
I am a little concerned with your responses however (or maybe I’m just misunderstanding :slight_smile: ), as so far I have not received a clear answer from anyone in a simple format stating categorically “Yes, we check file integrity for each file after it’s synced using mechanism XX (or not)”.
If we as users are to use your product with full trust and without hesitation accept its results, then we should have a guarantee that it’s doing its job 100%. Now, correct me if I’m wrong, but a file syncing tool has 2 jobs as far as I am concerned - 1. provide a channel to move the content of a file from side A to side B, and 2. check that both sides match after the move has completed.
Making a statement like “File corruption due to network transfer shouldn’t be a concern” is very dangerous as there are lots of things that can go wrong with a network transfer over a WAN link. Sure you are using TLS to secure the data packets and confirm packet integrity for each packet as it goes across, but that in itself still does not guarantee overall file integrity for a file as a whole (especially with large files synced over WAN links that may experience temporary breaks at any time). One can “assume” that the file will be okay but unless the file as a whole is checked in some way (like a hash check) on both sides after a sync, it’s impossible to guarantee its integrity.
As more users are totally replacing their local file servers with Google Drive, it’s imperative that we know when using your tool to help us sync content from local to cloud that there is no way whatsoever for us to corrupt any of our valuable data in the process.

Question responses:

  1. So what happens if it’s not in the same upload session and it times out? Does the sync for that file start again? What happens to the partially synced file on the target side? Does it get removed or just left there before the new sync starts? Who cleans up all these partials over time?
  2. If you are logging MD5 mismatches, where is this logged? In your logs.db? So other than logging it, you are basically not taking any corrective action at all? What if there is a mismatch not because of an API metadata failure but a genuine file difference? If I were to search in logs.db for MD5 mismatches, what field or parameter value should I be searching for?

Werner

Wow, this thread is becoming very concerning!

And thanks @WvdW for your reply, however, I think we are viewing this differently. You are being very specific about the technical details, but in my mind, if the remote file doesn’t end up being the exact same thing as the local file, then “it didn’t make it”! Whether that’s the file never transferred, it partially transferred, or it fully transferred but it’s corrupted in some way. You can break it down into as many technical pieces as you want, but at the end of the day, my file either made it or it didn’t.

To your point regarding using Insync tools to confirm transfer success (whether UI or command line), I don’t really understand where that provides any peace of mind. If I am questioning whether Insync properly transferred all my files, then why would I trust the outcome of those checks?

For example, one of the transfers I did was around 1TB in size and during that transfer, I lost power at the house due to a windstorm and the transfer was of course cut off. When power returned, I restarted everything and Insync resumed transferring my files as expected after doing a full scan. However, like an hour or so later it showed that the sync was complete. So yeah, I call BS on that. There is NO way that it scanned, uploaded and checked 1TB worth of files in roughly 4 hours or so of total runtime. In addition to that, when I check the size of my local Google Drive folder against the size of my Google Drive storage, there is a difference of about 500GB. So yeah, I am super concerned now, especially after reading @marte 's response.

So maybe Google is doing some fancy stuff to detect duplicates or anything else along those lines that would cause a difference in file sizes and counts, but right now I am completely lacking any confidence that this is all happening as it should be.

Not the most scientific way to compare the two, but clearly, something is off.

@WvdW,

I understand your general concerns, but because of some issues we encountered with the Google Drive API, we were not able to rely on their metadata to use it for corrective action. What I mean by the statement “File corruption due to network transfer shouldn’t be a concern” is that the network transfer itself shouldn’t be an issue. If there’s corruption, it would come from other causes like disk corruption, memory corruption, bugs (either client-side or server-side), and that’s why we still added the MD5 check in case of those other issues. We haven’t encountered any file corruption issues so far and when we thought we did (because of MD5 mismatch), it was only the metadata that was mismatched but not the actual contents themselves. We’ll revisit this and try to add it back while working around the issues we encountered before.

For your questions:

  1. It’s the Google Drive API that handles the session. If it times out, then it becomes inaccessible (and presumably already cleaned up by them – we don’t have control over it), so the upload starts again.
  2. Yes, you can find it in logs.db. You can use the term MD5 for searching in the message field.
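As a rough illustration of that search, the Python sketch below looks for the term in any table that has a `message` column. It assumes logs.db is an SQLite database, which is not confirmed in this thread, and it does not assume any table name. The function name `find_md5_messages` is hypothetical. It is safest to run this against a copy of the file rather than the live database.

```python
import sqlite3


def find_md5_messages(db_path, term="MD5"):
    """Return (table, row) pairs whose `message` column contains term."""
    hits = []
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            # Quote the identifier because table names cannot be bound
            # as SQL parameters.
            quoted = '"' + table.replace('"', '""') + '"'
            columns = [row[1] for row in
                       conn.execute(f"PRAGMA table_info({quoted})")]
            if "message" not in columns:
                continue
            query = f'SELECT * FROM {quoted} WHERE "message" LIKE ?'
            # SQLite's LIKE is case-insensitive for ASCII by default.
            for row in conn.execute(query, (f"%{term}%",)):
                hits.append((table, row))
    finally:
        conn.close()
    return hits
```

Each hit is returned with the name of the table it came from, so the surrounding columns (timestamps, file paths, and so on, if present) can be inspected by hand.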

Thanks, we appreciate your feedback.

@Jason_Miller,

We save the state of our sync often so it’s possible that Insync was still able to quickly continue the sync from where it left off without issues. If you’re able to identify discrepancies, let us know by emailing support.

As for the sizes, there are many ways that they can differ:

  • “Size on Disk” is normally larger than “Size”, and it’s only the latter that Google Drive counts.
  • Some files, like Google Docs and shared files, are not counted toward your quota.
  • The local folder has an .insync-trash folder that contains files that were recently deleted remotely (to help with recovery).
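To compare like with like, the local total can be measured in a way that accounts for the first and third points. The Python sketch below sums the logical size ("Size") and, where the platform reports it, the allocated size ("Size on Disk"), while skipping the `.insync-trash` folder. It cannot account for Google Docs or shared files, and the function name `folder_sizes` is hypothetical.

```python
import os


def folder_sizes(root, skip_dirs=(".insync-trash",)):
    """Return (logical_bytes, on_disk_bytes) for files under root.

    Symbolic links are not followed.
    """
    logical = on_disk = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped folders in place so os.walk does not enter them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))
            logical += info.st_size
            # st_blocks counts 512-byte units on Unix. It is missing on
            # Windows, where on_disk will simply stay at 0.
            on_disk += getattr(info, "st_blocks", 0) * 512
    return logical, on_disk
```

The first number is the one to compare against the storage figure Google Drive reports.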

Thanks for your feedback.