Distributing Initial Data

Having to deal with pre-populated datasets can be painful. A little architecture trick can help.

Quite often we face a chicken-and-egg problem, especially when creating a data-driven application that requires some initial data to work correctly.

Examples of such applications:

  • An e-mail service based on templates (no mail without at least a basic template)
  • A recommendation engine with dynamic weights (no useful recommendations without a basic trained model)
  • An authentication system (no administration without at least one admin account)

Sometimes people include special scripts in their applications to work around this initial data problem. However, this has some severe downsides:

  • If the script is running during deployment it cannot cope with removed data (or could overwrite modified data)
  • The script will be poorly tested, especially if it is rarely needed
  • If the script runs at service startup, we have a race condition / service scaling issue (additionally, the script cannot deal with removed data unless the service crashes and restarts under such circumstances)
  • The same applies if the script runs during runtime

The additional logic and database checks also make this kind of solution undesirable. Still, supplying a script is probably the most popular option for delivering initial data.

Instead, I argue we should publish this "hardcoded" initial data with the code (i.e., not as a script that writes it into our database). The data could be managed in a separate repository and distributed via a package manager (e.g., NuGet or NPM). This way we gain an audit trail and (independent) versioning for the initial data we reference.

How is the data now served and wired up? Since we should abstract the specific database connection anyway, we only need to take care of a custom implementation for retrieving data from the database. By preference, we always use the data from the database; if an entry is not found there, we return the matching entry from the associated initial data set.
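A minimal sketch of such a retrieval layer could look as follows. It assumes an e-mail template service; the names `Template`, `TemplateStore`, `TemplateRepository`, and `InMemoryStore` are made up for illustration, and the store is synchronous only to keep the example short (a real database client would most likely be asynchronous).

```typescript
// Hypothetical shape of a mail template; all names here are illustrative.
interface Template {
  id: string;
  subject: string;
  body: string;
}

// Minimal abstraction over the real database (synchronous for brevity).
interface TemplateStore {
  find(id: string): Template | undefined;
}

// Initial data as it might be shipped in a separately versioned package.
const initialTemplates: ReadonlyMap<string, Template> = new Map<string, Template>([
  ["welcome", { id: "welcome", subject: "Welcome!", body: "Hello {{name}}" }],
]);

// Retrieval layer: prefer the database, fall back to the initial data set.
class TemplateRepository {
  private readonly store: TemplateStore;
  private readonly initial: ReadonlyMap<string, Template>;

  constructor(store: TemplateStore, initial: ReadonlyMap<string, Template>) {
    this.store = store;
    this.initial = initial;
  }

  get(id: string): Template | undefined {
    return this.store.find(id) ?? this.initial.get(id);
  }
}

// In-memory stand-in for the database, used only for this example.
class InMemoryStore implements TemplateStore {
  private readonly rows = new Map<string, Template>();

  find(id: string): Template | undefined {
    return this.rows.get(id);
  }

  save(template: Template): void {
    this.rows.set(template.id, template);
  }
}

const db = new InMemoryStore();
const repo = new TemplateRepository(db, initialTemplates);
```

With an empty database, `repo.get("welcome")` returns the packaged template. Once a template with the same id is saved to `db`, the stored one takes precedence. No seeding script runs at deployment or startup, so removed rows simply fall back to the initial data again.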

To fully support this model, the returned data should indicate whether it is a read-only entry (i.e., one from the initial data set) or a normal one. The UI and the service are then free to decide what to do with it. We could either use the initial data as the basis for a modification (which is then stored in the database) or prevent any changes to items contained in the initial data set.
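The two options can be sketched roughly like this. The `Entry` wrapper, the `readOnly` flag name, and the `InitialDataPolicy` switch are assumptions made for this example, and plain `Map` instances stand in for both the packaged initial data and the real database.

```typescript
// Hypothetical wrapper; `readOnly` marks items served from the initial data set.
interface Entry<T> {
  value: T;
  readOnly: boolean;
}

type Account = { id: string; name: string };

// Assumed policy switch: either copy an initial entry into the database when
// it is modified, or refuse to change initial entries at all.
type InitialDataPolicy = "copy-on-write" | "locked";

const initialAccounts = new Map<string, Account>([
  ["admin", { id: "admin", name: "Administrator" }],
]);
const database = new Map<string, Account>(); // stand-in for the real database

function getAccount(id: string): Entry<Account> | undefined {
  const stored = database.get(id);
  if (stored !== undefined) {
    return { value: stored, readOnly: false };
  }
  const initial = initialAccounts.get(id);
  return initial !== undefined ? { value: initial, readOnly: true } : undefined;
}

function updateAccount(account: Account, policy: InitialDataPolicy): boolean {
  const current = getAccount(account.id);
  if (current?.readOnly && policy === "locked") {
    return false; // the initial entry stays untouched
  }
  database.set(account.id, account); // the stored copy now shadows the initial entry
  return true;
}
```

Under the `"locked"` policy an update to the `admin` account is rejected. Under `"copy-on-write"` the modified account is written to the database and subsequent reads return it with `readOnly` set to `false`. The packaged initial data itself is never changed in either case.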

The chicken-and-egg problem can be circumvented by abusing the middle layer between the real database and our business logic. As a result, the initial data is tested as well, does not run into scaling issues, and can be properly audited and packaged.
