Rewriting the repository history to remove large binary files

For years now I’ve wanted to clean up some really old mistakes in the Thrive git repository. Way back before the switch to Subversion, large binary files were included directly in the Thrive repo. These haven’t been used for 8+ years at this point, but they still take up download bandwidth whenever the Thrive repo is cloned.

What I’d like to do to solve this is run a git history cleanup to remove all the large binary files from history. I’d obviously create a new branch to preserve the original history so that nothing is lost, but the default branch would be cleaned.
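Before a cleanup like this, it helps to know which files in the history are actually large. Here is a minimal sketch of one way to list them, not necessarily what was used for Thrive. It relies on the standard `git rev-list --objects --all` and `git cat-file --batch-check` commands; the function names (`parse_batch_check`, `largest_blobs`, `list_history_objects`) are made up for illustration. The actual removal would then be done with a history rewriting tool such as git filter-repo or BFG Repo-Cleaner; this thread doesn't say which tool was chosen.

```python
import subprocess


def parse_batch_check(lines):
    """Parse '<type> <sha> <size> <path>' lines and keep only blobs as (size, path)."""
    blobs = []
    for line in lines:
        # Split into at most 4 parts so that paths containing spaces stay intact.
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return blobs


def largest_blobs(lines, limit=10):
    """Return the `limit` biggest blobs, largest first."""
    return sorted(parse_batch_check(lines), reverse=True)[:limit]


def list_history_objects(repo_path="."):
    """Ask git for every object reachable from any ref, with type and size."""
    rev_list = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    )
    cat_file = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=rev_list.stdout, capture_output=True, text=True, check=True,
    )
    return cat_file.stdout.splitlines()
```

Calling `largest_blobs(list_history_objects("path/to/clone"))` inside a full clone should give the biggest files that ever existed in any branch. The reported sizes are uncompressed object sizes, so they only roughly indicate how much each file adds to the download.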

This has the impact that all other branches will become unmergeable until rebased onto the cleaned branch (this is because the cleaning literally rewrites the repo history). And everyone who has cloned the repo will need to do force pulls / other cleanup on their local copies to stay up to date. That’s why I haven’t undertaken this effort yet, but I think I finally want to do this. Unless people really object to this I want to get this done this year.
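To make the local cleanup a bit more concrete, here is a rough sketch of the git commands a contributor would typically run after a history rewrite. The remote name `origin`, the branch name `master`, and the helper names are assumptions for illustration, so adjust them to your own setup. By default the helper only prints the commands instead of running them.

```python
import subprocess


def sync_commands(remote="origin", branch="master"):
    """Commands that make the local default branch match the rewritten remote one.

    WARNING: `git reset --hard` throws away uncommitted changes and any local
    commits on that branch, so back up anything you still need first.
    """
    return [
        ["git", "fetch", remote],
        ["git", "checkout", branch],
        ["git", "reset", "--hard", f"{remote}/{branch}"],
    ]


def rebase_command(feature_branch, old_base, remote="origin", branch="master"):
    """Command that replays a feature branch onto the cleaned history.

    `old_base` is the commit in the OLD history that the feature branch was
    started from; only the commits after it get replayed onto the new branch.
    """
    return ["git", "rebase", "--onto", f"{remote}/{branch}", old_base, feature_branch]


def run_all(commands, repo_path=".", dry_run=True):
    """Print the commands, or actually execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, cwd=repo_path, check=True)
```

For example `run_all(sync_commands())` prints the three commands for updating the default branch, and `run_all([rebase_command("my-feature", "abc123")])` prints the rebase for a hypothetical `my-feature` branch that was started from the old commit `abc123`.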

In order to minimize disruptions, I’ll say now that I plan to do this on June 5th 2023. The disruption should be just a few hours at most, as I’ll lock Weblate and merge the latest changes before performing the branch cleaning. One week before that date I’ll warn all open PRs that there’s just one week left to get merged to avoid having to do a rebase.

Seeing as no one complained, I’ve now marked this in my own calendar so that I remember to actually do this.

Agree with this course of action, better now than never.

I originally said that I would do this today, however that doesn’t line up very nicely with the 0.6.3 release. So instead I’m giving this three more weeks. That way, once the 0.6.3 feature freeze begins, I’ll do the history rewrite on the following Monday (the 26th of June). That’ll work much better with the scheduled 0.6.3 release.

As a reminder, everyone who currently has an open pull request or has Thrive cloned locally will need to take some action when this happens. I’ll now comment on all open PRs about this to give people some time to still work on them before they need to do a bit of extra work.

Well I did the rewrite, but it seems that it didn’t have that big of an impact on the overall repo size. I think everything was properly removed from history, but it seems like maybe the constant translation changes etc. mean that our normal commit history is actually pretty big by itself, so the old binary assets didn’t matter that much.
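For anyone who wants to check the size difference themselves, here is a small sketch that reads the output of `git count-objects -v`, which reports the `size` and `size-pack` values in KiB. The helper names are made up for illustration. Note that an existing local clone keeps the old objects around until they are garbage collected, and a normal clone fetches every branch, so this sketch assumes it is run in a fresh `git clone --single-branch` copy of the cleaned branch.

```python
import subprocess


def parse_count_objects(output):
    """Turn 'key: value' lines from `git count-objects -v` into a dict of ints."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Ignore anything that is not a plain 'key: number' line.
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def total_size_mib(stats):
    """Loose objects plus packs; git reports both in KiB, so convert to MiB."""
    return (stats.get("size", 0) + stats.get("size-pack", 0)) / 1024


def repo_size_mib(repo_path="."):
    """Run git in the given repository and return its object store size in MiB."""
    result = subprocess.run(
        ["git", "-C", repo_path, "count-objects", "-v"],
        capture_output=True, text=True, check=True,
    )
    return total_size_mib(parse_count_objects(result.stdout))
```

Running `repo_size_mib()` on a fresh clone of the old history and on a fresh clone of the cleaned branch gives two numbers that can be compared directly.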

Also it seems like Weblate got really unhappy with everything I tried in order to get it to accept the new history. I really don’t want to undertake this right now, but Weblate being entirely unusable certainly motivates me to switch to something else for translations…

Using something other than Weblate could allow much neater translation / code workflow integration with fewer “useless” commits. It could also make the language-specific files smaller, as they wouldn’t have to include the reference line numbers. It wouldn’t be ideal to have translations closed for multiple months, but it isn’t the end of the world, as right now we have just 1 or 2 weekly translators.

Edit: translating Thrive on weblate is now closed until the situation is resolved: