How to debug code in your mind (most useful dev skill imo)
Whether you're the person doing it, or watching someone else have one of those moments - it's always pretty cool.
If I look back on my 15 years in software development, I believe this "debugging in our minds" is the main source of some people's success in their careers. I know I have gotten compliments, promotions and pay raises for things that were obvious, where the error message was right there in front of me telling me what needed to be done - and it still appeared like black magic fuckery to the ones who had asked for my help. Similarly, I have learned a lot from asking more senior developers for their help, only to be schooled in the efficient art of not-single-stepping-through-lines-of-code.
It's a big thing, and I haven't really seen much, if anything, in the way of tutorials or books on this topic. It's something that we either pick up along the way, or are lucky enough to learn from more experienced developers who happen to be in the right place at the right time.
Understand the flow that data takes through the process chain
Many times, I was able to figure out the source of a problem simply because it made no sense. When you understand that data only passes through a particular route when the input data is correct, you can deduce that the other data field causing issues can't possibly hold a correct value, because the two contradict each other - so the solution is likely not to be found in the code, but in the database, aka faulty input data.
In other cases, you can deduce that thing A can't be the problem, because then the data wouldn't even have reached point B - so we can safely work under the assumption that the ever-popular problem A has already been passed, and not look into it until we have exhausted all other options.
Similarly, when we look at a timeout issue and there is no clear way to judge which system or endpoint caused it, we can at least narrow it down to one of the three endpoints that this component can potentially call.
A lot of it comes down to time spent existing in the ecosystem at work, but then again, a lot of it comes down to training our minds to think in terms of inputs and outputs, and knowing which hops our data takes to even get to the point where an error has happened. That is crucial, and I don't think there is any way to learn this other than repeated exposure to architecturally-thinking-beings. I learned most of this from a guy at work who always took the time to sit me down, talk about my problem, and then think out loud and quiz me on these very same processes.
You don't find people this calm and helpful that often. Most people don't have the time to do the grunt work slowly with you - they would much rather jump to the conclusion and spit it out for you to take and copy, and feel superior to those who came after them.
Working with probabilities in a black-and-white world can save hours
After 15 years, my most surprising insight about programming is that there is a lot of grey in a world that is so black-and-white at first glance. Everything is a boolean; the system either works or it doesn't.
However, the higher up the chain you work, the more other things like politics, formerly-good-choices and side effects of other systems start to matter - and in similar ways, it becomes harder to debug things that are technically true, but objectively wrong. But that is not the main point I want to make here. Rather it is this: I don't always know the exact source of a problem, I just know where it most likely is.
The main thing that I see other developers struggling with is the "I don't know for sure" phase of identifying the problem. Yeah, me neither - the issue might just end up being on the other side of the system, and there is a reasonable chance that I am completely wrong and wasting daylight while I'm zeroing in on this one component. But here's the thing: I still have a savings account of previous guesses that came out right and saved a lot of hours, just by working off probabilities in a clear-cut world.
Sometimes, this is as simple as noticing that one component causes significantly more issues than another, so it makes sense to look there first. Other times, I would start with the part of the system that relies on user inputs, because we all know that humans have squishy brains and oftentimes manage to find the one place in the GUI that isn't hardened against user inputs.
Other times, it makes sense to completely ignore the system raising the error and go right into the server log of the system that it calls, because the job log inside a server usually has better output than what is transported back in the HTTP response. A lot of servers and third-party tools live and breathe by the principle of security by obscurity, and would rather not report to the outside world what is going on.
Learn to read stack traces
The most common knowledge or skill gap that I see with developers is that surprisingly many are either afraid of stack traces, or have never learned how to interact with them.
Look, don't get me wrong, I would rather read anything else. Stack traces that span a whole screen are not exactly fun fiction books about dragons and beautiful ladies being rescued by strong and unwavering heroes - even if our minds drift in that direction and wish that a strong lady dragon in armor would rescue us from the fate of having to understand this ancient scroll of illegible text.
However, once you develop some very simple patterns, stack traces are a godsend, especially compared to getting a call from your coworker who says "it all doesn't work" and can't be brought to tell you what doesn't work, where and when, and if he can repeat the behavior with other datasets. A stack trace is honestly pretty great, and it tells you exactly what went wrong, where and when, down to the individual character of a single line of code. As always, the more I deal with people, the more I value machine outputs.
Here is my short list of tricks that make stack traces fun:
Most tools tell you either at the beginning or at the end what exactly went wrong. They are usually consistent about which of the two it is, so most of the time you only need the first few or the last few lines.
If you deal with messy output that is hard to filter, search for "error", "fatal", "refused" and "timeout" - those are usually good ways to get to the source while ignoring all the regular status updates that most tools produce (there is a small triage sketch right after this list).
Ideally, you can filter the output with some kind of dashboard.
If that is not an option, copy it all into VS Code or open the log file there directly - it has a nice selection of tricks and plugins that can help you out.
One of the most useful ones is a plugin that converts server timestamps to your own locale, or turns milliseconds-since-the-birth-of-Christ style date notations into human-readable output. Quite often, you only need to zero in on the time that an error happened to find it in the logs, even if you have nothing else to filter by.
There aren't many kinds of issues that can happen in most servers: bad input data, encoding issues, timeouts and refused connections. Edge cases exist, but usually it's one of these.
If an error message seems odd, that is usually a follow-up issue masking the real cause. For example, something like "invalid character ','" will often mean that no input data was received, and the server tries to stitch the returned JSON together with something else and fails, because it doesn't get the data it expects. This is one of the areas where other people can look at you like you are a magician, because they could not jump from the nonsensical message to the most probable solution (a tiny example of this follows below).
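To make the filtering and timestamp tricks above a bit more concrete, here is a minimal sketch in Python. The file name and the "epoch milliseconds, then message" line layout are assumptions for the example - adjust both to whatever your tooling actually produces.

```python
from datetime import datetime, timezone

KEYWORDS = ("error", "fatal", "refused", "timeout")

# "server.log" and the "<epoch-ms> <message>" line layout are assumptions for
# this sketch - adjust both to whatever your tooling actually spits out.
with open("server.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if not any(word in line.lower() for word in KEYWORDS):
            continue  # skip the regular status chatter
        epoch_ms, _, message = line.partition(" ")
        try:
            # convert milliseconds-since-epoch into a readable UTC timestamp
            stamp = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)
            print(stamp.isoformat(), message.rstrip())
        except ValueError:
            print(line.rstrip())  # no leading timestamp - print the line as-is
```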
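And a tiny, hypothetical illustration of the "odd error message" case: a service stitches an upstream response into a JSON template, the upstream call returns nothing, and the parse error then points at the template instead of the real cause.

```python
import json

upstream_response = ""  # the upstream call silently returned no data at all

# hypothetical template the service uses to wrap the upstream payload
wrapped = '{"status": "ok", "payload": ' + upstream_response + '}'

try:
    json.loads(wrapped)
except json.JSONDecodeError as exc:
    # The parser complains about the spot where the payload should have been,
    # not about the real problem: empty input data from the upstream system.
    print(exc)
```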
Can you repeat the issue?
At least in my world, there are some issues that simply arise from fallible systems. We do a lot of migration from ancient tech to modern (or less ancient) systems, and there is just a lot of leeway for things to go bad one second and run fine the next. You always try to reduce those cases as close to zero as possible, but they still happen with some frequency.
This question is super important for figuring out where to look, because the problem might not be in the code but in the infrastructure. I have seen a lot of issues that looked like "our errors", when in the end it turned out that there was an update of a server that wasn't even part of the process chain, but that consumed all the resources of the Kubernetes cluster during that time - so our jobs would fail because the endpoint they needed on that cluster was unresponsive for a second.
I always do this first - trying to repeat the issue - because a lot of times, cases that are a week old have been overtaken by current commits, and might already be fixed, or broken in a different way than when the ticket was created. Also, users can find weird ways to click things, so you might not be able to reproduce the bug by clicking "the right way".
It is surprisingly easy to waste hours on finding a bug that isn't even there; I have seen it often enough to put it at number two of the worst wastes of time in industrial-scale distributed architecture, right after not knowing for weeks that an error is even happening and then having to clean it all up, when it would have been much easier to fix had we known about it from the day it started. Monitoring is a big thing in these architectures - I have seen gruesome errors persist for weeks until someone noticed them by accident.
Try to repeat the issue, and everything will become much clearer.
Learn the common issues in systems of this pattern
Again, this often only comes from time spent in the ecosystem, or having an experienced dev there to tell you - but there are some patterns that always repeat. I already mentioned the case of empty input data leading to obscure error messages like unexpected closing characters, but there are many others:
Encoding issues anytime you jump from one system to another, or persist state on the file system or in a database. Line endings, encodings, pretty-printing following a different pattern - these aren't usually issues while you stay in one system, but jumping can mean different operating systems, default encodings, and endpoints that are configured differently. Lots of opportunity for the input not to resemble the output (there is a small example after this list).
Timeouts when calling internal API endpoints
Timeouts aren't always timeouts, but often just the symptom of the underlying illness. For example, an invalid server certificate or outdated authentication can also surface as a timeout, when the response isn't read correctly or the server has a policy of not responding at all to invalid credentials (the sketch after this list separates those cases).
Anytime you work with parsing JSON files, expect the point that the error message reports to not be the actual issue. If there is an invalid character in the JSON, you often get complaints about an unexpected closing character, not about the accidentally escaped opening one, and things like that. Always copy the JSON into a better editor and validate it - that's usually the easiest path to the solution (see the validation snippet after this list).
Try/catch blocks can be the devil's most devious invention when they allow an issue to be passed on to the next step in the chain, when they should instead lead to a much clearer and more obvious error in that first system.
Converting text to booleans is a fun way to mess up someone's day. I once saw a function at work that was called "convertToBool()" and laughed about it - right until I looked into the code and saw just how many different ways there are to write True or False (I mean true/false, 0/1, yes/no, Yes/No, Ja/Nein, y/n, Y/N, J/N, j/n, wahr/falsch, x/null). A sketch of what such a function ends up looking like follows below.
Quotes can mess up things quite a bit. Not only are there several types of quotes, but they are also interpreted differently by different systems, and once you save something to the database and read it back, you might end up with extra or missing quotes.
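A few sketches for the points above. First, encoding: a minimal Python example (the string is made up) of how the exact same bytes turn into mojibake the moment a downstream system assumes a different default encoding.

```python
text = "Müller GmbH, 5 µm tolerance"

# System A persists the value as UTF-8 bytes ...
data = text.encode("utf-8")

# ... and system B reads the same bytes back with its own default, e.g. Latin-1.
# Same data, different interpretation - the input no longer resembles the output:
print(data.decode("latin-1"))  # "Müller" comes out as "MÃ¼ller", "µm" as "Âµm"
```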
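For the timeout point: a timeout is often just how the underlying illness shows up on the client side. This sketch assumes the requests library and a made-up internal URL; separating the exception types is usually enough to decide which server log to open next.

```python
import requests

URL = "https://internal.example.com/api/v1/orders"  # hypothetical endpoint

try:
    response = requests.get(URL, timeout=(3, 10))  # (connect, read) timeouts in seconds
    response.raise_for_status()
except requests.exceptions.SSLError as exc:
    print("Certificate problem, not a 'real' timeout:", exc)
except requests.exceptions.ConnectTimeout:
    print("Never reached the server - think network, DNS, firewall or a dead host.")
except requests.exceptions.ReadTimeout:
    print("Server accepted the connection but never answered, e.g. silently dropping bad credentials.")
except requests.exceptions.HTTPError as exc:
    print("Server answered, but with an error status:", exc)
```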
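For the JSON point: if no better editor is at hand, even the standard library gives you a position to start from - just remember that the reported position is where the parser gave up, not necessarily where the actual mistake is. On the command line, "python -m json.tool file.json" does the same job.

```python
import json

broken = '{"name": "Alice", "role": "admin"'  # closing brace got lost somewhere upstream

try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    # Reports where parsing gave up (the end of the string in this case),
    # which is a starting point - not necessarily the real mistake.
    print(exc.msg, "at line", exc.lineno, "column", exc.colno)
```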
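And for the boolean conversion: this is not the convertToBool() from work, just a hypothetical re-creation of what such a function tends to look like once it has met enough real-world input.

```python
# Hypothetical sketch, not the actual convertToBool() mentioned above.
TRUTHY = {"true", "1", "yes", "ja", "y", "j", "wahr", "x"}
FALSY = {"false", "0", "no", "nein", "n", "falsch", "null", ""}

def convert_to_bool(value: str) -> bool:
    """Map the many real-world spellings of true/false onto an actual bool."""
    normalized = value.strip().lower()
    if normalized in TRUTHY:
        return True
    if normalized in FALSY:
        return False
    # Refuse to guess - a silent default here is how the next obscure bug is born.
    raise ValueError(f"Cannot interpret {value!r} as a boolean")

print(convert_to_bool("Ja"))    # True
print(convert_to_bool("x"))     # True (the "marked with an x" convention)
print(convert_to_bool("Nein"))  # False
```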