Lessons from my PhD

Some things I would do differently if I had to start again.

One is always smarter afterwards. Pursuing a PhD is no exception to this rule; if anything, it amplifies the experience, since you pursue it in the first place to be smarter afterwards. Besides a number of valuable lessons in computer science, physics, and mathematics, I also took away some organizational insights that may be helpful, especially for upcoming PhD students. This is a short summary of those organizational aspects.

The Importance of Micromanagement

Before starting my PhD I outlined a very ambitious research timeline. It was clear from the beginning that this outline represented an upper limit and was nearly impossible to achieve in full. Such a timeline is a plan for macromanagement, and I would guess that one of my problems was trying to stick to it too closely. It is nice to have such a guideline, but it is much more important to do proper micromanagement.

It makes sense to focus on micromanagement in scientific research, for several reasons. First, you probably have only a vague idea of what to do and how to do it. Second, scientific work is about small steps: if you encounter an interesting anomaly, behavior, or sub-topic, you should pursue it. This will inevitably cause a deviation from the plan, which affects a micromanagement outline much less than the overall (macromanagement) plan. Lastly, it lets you structure upcoming tasks early. The big picture is nice, but in the end it is determined entirely by all the little steps.

Start with Something Easy and Fun

Focusing on new topics is interesting, but it can take a long time before anything interesting comes out of them. Instead, you should try to mix in a topic that can be covered with existing knowledge and potentially yields new results. Most importantly, it should be fun, e.g., something practical. Personally, I made the mistake of focusing too long on novel methods and their application to serious problems. While this is interesting, it certainly is long and frustrating work, and it does not lead to immediate publications. If publications are important to you, or formally required, then relevant side-projects have to be incorporated early. The closer a side-project is to the big picture, the better.

In the long run it may be beneficial to include smaller, potentially unrelated topics; one never knows what they will be good for. It can also be advantageous to look at other fields or methods from time to time.

Organize Your Data

Data organization is something that is rarely taught or even recommended in any systematic way. This is unfortunate, as data is our most precious good. We have to come up with a scheme that makes it easy to retrieve former data and pin it to evaluations, models, or the exact point in the research. Even though this sounds quite trivial, it is not: everybody I know came up with a different scheme, and most schemes seem to be efficient only locally.

I have to admit that my own data organization scheme has also been far from ideal. It seemed well worked out at first, but in retrospect it was not. My main issue, I believe, was that I did not write an application to do this kind of job; instead I relied on good filenames, folders, and auxiliary files. I will describe a good format for output files in the next section, but for now I want to focus on what I would do better in the future.

Open a git repository and create a branch for every evaluation you do. The output should be text-based. If it is binary, handle this by referring to the original binary files from text files. The binary files should be placed in one common location and should have unique identifiers as their file names. Don't worry about these file names, as they will be cross-referenced anyway. Why a git repository? You could choose any other (distributed) VCS, so there is no particular reason (I personally use nothing else, hence the decision was easy for me). Replacing potentially outdated data cleans up your file system representation, and if you really do need the supposedly outdated data later on, you can simply go back in time.

The structure should be as follows: a master branch with a README explaining all the different (independent) branches. Each branch may come with its own README explaining the contained folders, each folder contains a file explaining the different files, and each file should ideally be self-explanatory.

The Right Data Format

For my applications I wrote a small JSON serializer / deserializer in C++; I regard JSON as a lightweight, yet expressive, text format. In the output of a simulation I store some meta information, such as the configuration used (this is more or less the original input file, i.e., I don't need the original input file to understand the contents of the output file), the runtime, or the application. And there is one more thing ...

I also store the SHA of the latest commit at the time the currently used application was built. Why is that information useful? Well, it could be that an evaluation does not look special at first. So you tweak the application, change algorithms, and so forth. Of course, you'll track your changes in your VCS, and potentially you'll also use different (development) branches; eventually, however, everything gets merged, and it may go unnoticed that the former evaluation was actually great. Then the situation arises where you want to go back to exactly that point in time in your application code, to run similar (or the same) simulations or evaluations. Storing the SHA in the generated output helps a lot with that.

Use GitLab

I've written several times about GitLab, and I am actually a heavy GitLab user. Shortly after I set up my own GitLab server in the office, the whole group got one, which rendered my personal one both obsolete and inefficient. One of the best things about GitLab (besides having an additional centralized git server) is that the whole issue management is more or less an exact copy of the one available on GitHub. And that rocks!

Write issues (use them for planning too, i.e., create milestones and note bugs plus potential features) and dump ideas. Collaboration also becomes much easier. Why use Dropbox (or any other cloud storage service)? GitLab (or git in general) is good not only for code, but also for documents, ideas, or general project management. Make use of it!

(The previous statements all apply equally to a paid GitHub account, or any other repository manager with wiki and issue tracking features.)

Conclusions

Well, I know this was quite data-centric, but I hope some of this advice proves useful. Most planning problems come from not knowing the problem well enough. A PhD is largely a problem that cannot be planned; knowing this, however, we can try to come up with a system that is at least adaptive to changing requirements.

