File scanning after reboot takes a long time. Can it be done in parallel?

On my main computer I have more than 500,000 files, amounting to ~1TB, in my Google Drive.

Every time I start the computer, Insync does the initial “Scanning…”, which takes a bit over 30 minutes. Working during these 30 minutes is quite risky, because if I modify a file that was changed on my laptop before the scan gets to it, bad things happen.

In the system settings, I can see a single CPU thread taking care of the scanning. Would it be possible to do this using multiple threads, ideally with the number configurable by the user? For example, I would love to use 8 of the 16 threads I have available for the initial scanning.

Cheers.

That would only be faster if the process is CPU limited. I expect it isn’t, as it has to scan the (presumably single) disk, so I expect it is I/O limited.

You are completely right, if there is an I/O bottleneck, more CPUs won’t make a difference.

In any case, I wonder if there can be a better way (faster) to do this. ~30 minutes waiting for the initial scan is a long daily wait…

Cheers

You could have a look in Task Manager to see if it is actually using 100% of a core. It could be CPU limited if it is coded badly. Otherwise it is down to the speed of your storage, e.g. an SSD will be faster than a hard disk. I keep my PC on 24/7 and only have the free Google Drive limit, so it doesn’t take long for me.

It seems to be using mostly 100% of one of the threads…

I have one of the fast Samsung PRO SSDs, so there is not much room for improvement there. And keeping the PC always on is a no-no for me…

Hopefully the insync folks can take a look at this to see if there is anything they can do to improve things.

Thanks!

Hey @Gorka_Navarrete! I’d be happy to send your logs to our team for investigation. Do you mind sending the logs.db and out.txt files to support@insynchq.com with the link to this post?

Shoutout to @nop_head for the assist :slight_smile:

Thanks @mia, logs sent.

I feel like you’re asking the wrong question.

Why does Insync have to do a full scan every time it restarts?
Why can’t it start where it left off last time?
Like a save point.
And if the user wants to do a full sync, they can push a button.
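
A minimal sketch of the save-point idea, in R (used only for illustration; nothing here reflects how Insync works internally). It assumes that a file whose size and modification time match the saved state can be skipped, which is a heuristic and not a guarantee. The helper names `scan_state` and `needs_rescan` are made up for this example.

```r
# Hypothetical helper: record size and modification time of every file under `root`.
scan_state <- function(root) {
  files <- list.files(root, recursive = TRUE, all.files = TRUE)
  info <- file.info(file.path(root, files))
  data.frame(file = files, size = info$size, mtime = as.numeric(info$mtime),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: files that are new, or whose size or mtime differs
# from the saved state. Everything else is assumed unchanged and skipped.
needs_rescan <- function(saved, current) {
  merged <- merge(current, saved, by = "file", all.x = TRUE,
                  suffixes = c("", ".saved"))
  changed <- is.na(merged$size.saved) |
    merged$size != merged$size.saved |
    merged$mtime != merged$mtime.saved
  merged$file[changed]
}

# Demo on a throwaway directory so nothing real is touched.
root <- tempfile("savepoint_demo")
dir.create(root)
writeLines("one", file.path(root, "a.txt"))
writeLines("two", file.path(root, "b.txt"))

state_path <- tempfile(fileext = ".rds")
saveRDS(scan_state(root), state_path)   # the "save point"

# Simulate changes made while the sync client was not running.
writeLines("two, but longer now", file.path(root, "b.txt"))
writeLines("three", file.path(root, "c.txt"))

to_rescan <- needs_rescan(readRDS(state_path), scan_state(root))
print(to_rescan)
```

This sketch only finds new and changed files; deleted files would need a separate comparison of the two file lists, and the "full sync" button would simply ignore the saved state.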

Otherwise all that happens is Insync memory usage goes up the moment you restart.

But a process like this comes with a caveat: Insync’s memory management needs to be cleaned up. At present it is out of control.
The app needs to be lighter.

Good points raised as well @Jamie_Browning. I’ve sought our engineers’ help on this to gain more insight on what’s going on.

@Gorka_Navarrete, it seems like my colleague has received your email containing the logs :slight_smile:

@mia, following up on @Jamie_Browning’s point, I was thinking about how Insync really does not need a full rescan on every restart.

When we start computer B, we only need a list of all files modified and deleted on computer A, so that we check only those files on computer B.

I tried this in R, which is not known for being particularly fast.

1) Get a hash for all files modified in my computer in the last 24h

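# List regular files modified in the last 24 hours (-type f skips directories, which md5sum cannot hash)
ALL_changes = tibble::tibble(filename = system("find ~/myinsyncfolder/* -type f -mtime -1", intern = TRUE))
DF = ALL_changes |> dplyr::mutate(HASH = tools::md5sum(filename))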

It takes 2 seconds for ~9000 files.

2) Get a list of all files deleted

# Snapshot of all paths at t0, and again later at t1
ALL_files_t0 = tibble::tibble(file = system("find ~/myinsyncfolder/*", intern = TRUE)) # 0.7 seconds
ALL_files_t1 = tibble::tibble(file = system("find ~/myinsyncfolder/*", intern = TRUE)) # 0.7 seconds
# Paths present at t0 but missing at t1 were deleted
ALL_files_t0 |> dplyr::anti_join(ALL_files_t1, by = "file") # 2.3 seconds

Checking ~ 500K files in 3.7 seconds

I compared a snapshot of all files in my system at t0 with the files at t1. Given that the first snapshot would already have been taken at startup, the time to get the deleted files is ~3s (one snapshot plus the comparison).


That is around 5 seconds to get all the files modified and deleted in the last 24 hours on computer A.

We could then limit the scanning on computer B to those ~9000 files, which would take mere seconds instead of the current tens of minutes in my case.
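
To make the idea a bit more concrete, here is a rough sketch in base R of the change list computer A could produce. It compares two MD5 manifests of the same folder and reports modified, deleted and new files. The function names `build_manifest` and `diff_manifests` are my own invention for this example, and the demo hashes every file, which would be far too slow for ~1TB; in practice the hashing would be restricted to recently modified files, as in step 1.

```r
# Hypothetical helper: MD5 manifest of every file under `root`, keyed by relative path.
build_manifest <- function(root) {
  files <- list.files(root, recursive = TRUE, all.files = TRUE)
  hashes <- tools::md5sum(file.path(root, files))
  names(hashes) <- files
  hashes
}

# Hypothetical helper: what changed between two manifests of the same folder.
diff_manifests <- function(old, new) {
  common <- intersect(names(old), names(new))
  list(
    modified = common[old[common] != new[common]],  # same path, different hash
    deleted  = setdiff(names(old), names(new)),      # in old manifest only
    created  = setdiff(names(new), names(old))       # in new manifest only
  )
}

# Demo on a throwaway directory so nothing real is touched.
root <- tempfile("manifest_demo")
dir.create(root)
writeLines("unchanged", file.path(root, "keep.txt"))
writeLines("version 1", file.path(root, "edit.txt"))
writeLines("going away", file.path(root, "gone.txt"))

old_manifest <- build_manifest(root)

writeLines("version 2", file.path(root, "edit.txt"))
unlink(file.path(root, "gone.txt"))
writeLines("brand new", file.path(root, "new.txt"))

changes <- diff_manifests(old_manifest, build_manifest(root))
print(changes)
```

Computer B would then only need to look at the paths in `changes` instead of walking the whole folder.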

Maybe something along these lines could be done every hour or so, and always before shutting down or suspending a computer? If the expected log from the other computer is not present (we don’t have the hourly log, or the shutdown log, or the suspend log), kick off a full rescan…
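
The fallback rule could look roughly like this. It is only a sketch: the log format (an `.rds` file holding `modified` and `deleted` path vectors) and the name `plan_scan` are assumptions of mine, not anything Insync provides.

```r
# Hypothetical rule: use the other computer's change log if it exists,
# otherwise fall back to a full rescan.
plan_scan <- function(log_path) {
  if (!file.exists(log_path)) {
    return(list(mode = "full", files = NULL))
  }
  log <- readRDS(log_path)  # assumed to hold `modified` and `deleted` path vectors
  list(mode = "incremental", files = union(log$modified, log$deleted))
}

# Demo: one log that exists and one that does not.
log_path <- tempfile(fileext = ".rds")
saveRDS(list(modified = c("a.txt", "b.txt"), deleted = "old.txt"), log_path)

with_log <- plan_scan(log_path)
without_log <- plan_scan(tempfile(fileext = ".rds"))
print(with_log$mode)
print(without_log$mode)
```

A real version would also have to check that the log is recent enough and complete, which this sketch does not attempt.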

Anyway, I am sure all this is more complex and there are lots of nuances I am missing, but wanted to keep the discussion going.

Thank you so much for sharing your thoughts regarding this, @Gorka_Navarrete!

Rest assured I’ve sent all of your feedback to our engineers so we can continue to map out how to make Insync more seamless :slight_smile: