Professional development in data science

Last updated 2020-06-25

There are lots of great resources that help data scientists land their first job, or learn about specific subjects. But I haven’t seen any that focus on long-term professional development. This is how I’m currently thinking about it.

What is the goal?

In a sense, the goal is constant improvement, but not just so that you can make more money or pass a test. It’s about continuously breathing new life into your self-efficacy—giving yourself increasingly strong evidence that you can take on more responsibility and be more ambitious.

You’re lucky if you can get this evidence in an organic way through the work you’re already doing. But I have the impression that many data scientists, like me, eventually start to worry that certain skills are atrophying and that they are becoming over-specialized or complacent.

What should you focus on?

If you want to continually increase your self-efficacy, I think there are four areas to focus on:

  1. Concepts
  2. Skills
  3. Tools
  4. Resources

My suggestion is that you should try not to neglect any one of these categories for too long. If you go a whole year without making an active effort to increase your conceptual understanding, something is probably missing. But at the same time, you don’t need to constantly work on them all in parallel. One or another might be more or less rewarding depending on your mood at a given time. Think of them like directions on a joystick.

1. Concepts

Concepts are about what you know. Think:

  • Bias-variance tradeoff
  • Central limit theorem
  • Curse of dimensionality
  • Bayes’ theorem

Ideally, you are constantly deepening your knowledge of fundamental concepts, and branching out to new ones that interest you. Some people do this by regularly reading papers or textbooks that force them to call fundamentals to mind. It’s a good approach (much better than nothing!), but for me it lacks something crucial: testing yourself explicitly on your understanding.

I see two main ways to test yourself. One is through a spaced repetition practice, using software like Anki to help you stay sharp. The other is communication: explaining concepts to others through writing or speaking. I’m especially excited by the prospect of evergreen note-taking as a practice for engaging more deeply with data science concepts. If you haven’t heard of this idea, check out Andy Matuschak’s notes on it.
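To make the spaced repetition idea a bit more concrete, here is a minimal sketch of a Leitner-style scheduler in Python. It is not how Anki actually schedules cards. The interval values, the `Card` class, and the `review` and `due_cards` functions are all hypothetical names I made up for illustration. The point is only that a card you recall correctly comes back less often, while a card you miss comes back soon.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumed review intervals in days, one per "box". These numbers are
# illustrative only and are not taken from Anki or any other tool.
INTERVALS_DAYS = (1, 3, 7, 14, 30)


@dataclass
class Card:
    prompt: str
    answer: str
    box: int = 0  # index into INTERVALS_DAYS
    due: date = field(default_factory=date.today)


def review(card: Card, correct: bool, today: date) -> Card:
    """Promote the card one box if recalled, otherwise reset it to the first box."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS_DAYS) - 1)
    else:
        card.box = 0
    card.due = today + timedelta(days=INTERVALS_DAYS[card.box])
    return card


def due_cards(cards: list[Card], today: date) -> list[Card]:
    """Return the cards whose due date is today or earlier."""
    return [card for card in cards if card.due <= today]
```

In use, you would call `due_cards(deck, date.today())` at the start of a session, quiz yourself on each card it returns, and pass the outcome to `review`. A real tool adjusts intervals per card rather than using fixed boxes, so treat this as a toy model of the practice, not a replacement for the software.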

2. Skills

Skills are about what you can do. Think:

  • Wrangle data
  • Run statistical tests
  • Write software packages
  • Create data visualizations
  • Train and validate models

Again, the idea is to constantly go deeper on fundamental skills and go broader on ones that interest you.

I see two main approaches to improving skills. One is to work on projects—ideally ones that are meaningful to you. The other is to regularly work through targeted exercises, like on Project Euler or Brilliant. The key is that you’re doing something.

When developing skills, it’s also worth asking whether targeted practice might be a bit overrated relative to the pursuit of tacit knowledge, the kind of know-how that can’t be (or isn’t) expressed in words. Cedric Chin has a great series of essays on this.

3. Tools

Tools are about what you use to do the things you do. Think:

  • Specific programming languages
  • Packages
  • IDEs

It’s hard to say whether there even are fundamentals when it comes to tools, since they change year by year. But you shouldn’t ignore them. There’s overlap here with concepts and skills—e.g. becoming a better Python programmer involves both learning new Python concepts and writing more Python code. So you can use spaced repetition, communication, projects, or targeted exercises. The key is to not neglect your tools as a subject of your learning.

4. Resources

Resources are about what you reference when you haven’t stored important information in your brain. Think:

  • Textbooks
  • Online courses
  • Cheatsheets
  • Subscriptions
  • Wikis

The goal here is to build yourself a “second brain”, to borrow Tiago Forte’s phrase. Improving your familiarity with data science resources is probably the easiest way to improve your professional self-efficacy, because it requires little effort but is a force multiplier for your efforts in learning about concepts, skills, and tools. Essentially, you want to build yourself a central repository of the best resources on every relevant topic, so that you always know where to go to find information you haven’t memorized or to refresh your understanding of material you feel rusty on.
