Sub-modules are the Devil

Sub-modules (or, if you prefer Mercurial, sub-repositories) are a popular method of dealing with complex dependencies spread across multiple repositories. In theory, you can develop libraries alongside the multiple applications (or other libraries) that use them, and you don’t have to set up your build environment beyond what a simple build requires.

In practice they lead to lots of issues.

Now, there are some places where sub-modules are okay. If you have external dependencies required only at build time (e.g., a testing framework or code formatter), it’s no big deal having them as sub-modules; we can argue about whether or not it’s a good idea, but it doesn’t cause problems by itself.

Let’s say you have a dependency graph like the one above (my terrible drawing skills mean circles are libraries and squares are programs). In a perfect world none of this is an issue, since B, C, and D will point to the same version of E, while A and B will point to the same version of D. Let’s be negative though and think of all the problems we can run into if they point to different versions.

ABI changes between commits can lead to runtime problems. This will bite us especially hard if B and D point to different versions of E, but it can happen whenever any versions are out of sync. Depending on what the changes are and who “wins” when installing artifacts, the problem could be hard or impossible to detect until weird things start happening. If we create types in one repository but pass them between others (in the example above, passing a type provided by E between B and D), even static linking can’t solve the problem, since each component’s compiled code will use the memory layout from its own version of E.
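To make the memory layout point concrete, here is a minimal sketch using Python's ctypes to imitate C struct layouts. The `WidgetV1` and `WidgetV2` types are hypothetical stand-ins for a type provided by E before and after a commit that inserts a field; nothing here comes from a real library.

```python
import ctypes


class WidgetV1(ctypes.Structure):
    """Hypothetical type from E, as seen by code compiled against the old commit."""
    _fields_ = [("id", ctypes.c_uint32),
                ("size", ctypes.c_uint32)]


class WidgetV2(ctypes.Structure):
    """The same type after a later commit in E inserted a field in the middle."""
    _fields_ = [("id", ctypes.c_uint32),
                ("flags", ctypes.c_uint32),
                ("size", ctypes.c_uint32)]


def read_as_v1(widget):
    """Reinterpret the raw bytes the way code built against V1 would."""
    raw = bytes(widget)
    return WidgetV1.from_buffer_copy(raw[:ctypes.sizeof(WidgetV1)])


# D was built against the new E, B against the old one.
made_by_d = WidgetV2(id=7, flags=0xFF, size=10)
seen_by_b = read_as_v1(made_by_d)

print("offset of size:", WidgetV1.size.offset, "vs", WidgetV2.size.offset)
print("D wrote size =", made_by_d.size, "but B reads size =", seen_by_b.size)
```

Nothing fails loudly here: B simply reads the `flags` bytes as if they were `size`, which is the kind of quiet corruption that only shows up as weird behavior later.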

I care about my systems being reliable, so I want to make absolutely sure the same versions of D and E are used by everybody who depends on those libraries. We could probably write scripts to walk the repositories and validate that they are in sync, but I doubt this scales in practice. The problems can be solved, of course, but I doubt my ability to get it right, especially in complex systems.
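For a sense of what such a validation script involves, here is a rough sketch of the easy part. It assumes each repository's pins have already been collected (for example from the output of `git submodule status`) and that the same library is keyed by the same name in every repository; the function names and the short commit IDs are made up for illustration.

```python
from collections import defaultdict


def parse_submodule_status(text):
    """Parse `git submodule status` output into {path: commit}."""
    pins = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The first character is a status flag (' ', '-', '+' or 'U'),
        # followed by the commit and the sub-module path.
        commit, path = line[1:].split()[:2]
        pins[path] = commit
    return pins


def find_mismatches(pins_by_repo):
    """Return {library: {commit: [repos]}} for libraries pinned inconsistently."""
    seen = defaultdict(lambda: defaultdict(list))
    for repo, pins in pins_by_repo.items():
        for library, commit in pins.items():
            seen[library][commit].append(repo)
    return {lib: dict(commits) for lib, commits in seen.items() if len(commits) > 1}


# The graph from this post, with D pointing at a different E than B and C.
example = {
    "A": {"D": "d1"},
    "B": {"D": "d1", "E": "e1"},
    "C": {"E": "e1"},
    "D": {"E": "e2"},
}
for library, commits in find_mismatches(example).items():
    print(library, "is out of sync:", commits)
```

This only handles the flat case. Nested sub-modules, differing paths for the same library, and deciding which commit should win are where my doubts about scaling come in.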

Instead, let’s look to a source-based package manager like Portage and see how it solves the problem. Like most package managers, its solution includes explicit dependency management. I tried to make something more generic, just for developers.

I call it dev-pipeline, and it’s available on GitHub under the two-clause BSD license. The design is pretty simple: all you have to do is tell it what you need to build a project, including dependencies, in a human-readable ini-style configuration file (documentation is in the repository). I’ve tried it at small scale with full builds (it worked!) and with a fake project much more complex than the one here (it mirrors some real-world examples I’ve dealt with), and it still appeared to scale well.

# fetch the latest version of a repo and all its dependencies
dev-pipeline checkout some-project
# build 'em
dev-pipeline build some-project

You have to specify some things like cmake arguments, but you only need to do that once. In my head, the config file lives in a bootstrap repository (for lack of a better term); engineers can then check that out and ignore the small details. Since the configuration is ini-style and even supports comments (thanks, Python), it plays nicely with source control and even code reviews.
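To illustrate why explicit dependencies in an ini file are convenient, here is a small sketch that reads a commented ini file with Python's configparser and derives a build order with graphlib (Python 3.9 or newer). The schema, one section per component with a comma-separated `depends` key, is an assumption made for this example and is not dev-pipeline's actual format; see the documentation in the repository for that.

```python
import configparser
from graphlib import TopologicalSorter

EXAMPLE_CONFIG = """
# Hypothetical schema: one section per component, with a
# comma-separated "depends" key. Not dev-pipeline's real format.
[E]

[D]
depends = E

[B]
depends = D, E

[C]
depends = E

[A]
depends = D
"""


def build_order(config_text):
    """Return component names ordered so dependencies come first."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    graph = {}
    for name in parser.sections():
        raw = parser[name].get("depends", "")
        graph[name] = {dep.strip() for dep in raw.split(",") if dep.strip()}
    # TopologicalSorter expects {node: predecessors} and raises
    # graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(build_order(EXAMPLE_CONFIG))
```

Because every component names its dependencies exactly once, there is a single version of D and E to agree on, and a circular dependency shows up as an error rather than as a surprise at link time.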

Right now dev-pipeline supports git for checkouts and cmake for builds, but it should be modular enough internally that supporting new tools is simple (I use git and cmake for everything, so I started with my own needs); patches welcome.

I’ll assert at this point dev-pipeline is suitable for more users than just me, although it’s definitely rough around the edges.
