Karpathy Talk - Multitask learning

A ton of insight into Tesla’s process for coming up with a particular NN architecture and how to go about training it. Also includes insights into the dynamics of team development on a common network.


Great find. Thanks.

I learned so much about Tesla’s NN design from this. For instance, I hadn’t understood why the network seemed so monolithic given that it generates so many different outputs. The architecture of the current camera nets has a large Inception-style ‘backbone’, but the heads that feed the various outputs are pretty small - basically they just deconvolve, refactor, or minimally interpret what seems to be a single massive representation generated by the Inception backbone. This shouldn’t be possible, or at least it shouldn’t be very efficient, for outputs that are very different.

So the answer seems to be that, for training purposes, the network is actually tree-shaped, with large sections of the higher layers devoted to particular outputs or groups of outputs, but for inference purposes they preserve the functionally monolithic nature of the backbone because it’s computationally efficient. To pull this off they need to perform backprop from each output only to the neurons which feed that output, while leaving the other branches unchanged. If you do that while managing the total neuron count in each layer, you can get the benefit of a network that at inference time performs like a single big backbone while minimizing weight conflict in the higher layers.
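To make the "backprop only to the neurons which feed that output" idea concrete, here is a toy sketch. It is not Tesla's code and nothing in the talk specifies it: the scalar weights, the names `forward` and `task_step`, and the squared-error loss are all assumptions chosen for illustration. It just shows that a training step on task "a" moves the shared trunk and branch "a", while branch "b" is left untouched.

```python
def forward(params, x):
    """Shared trunk followed by one small branch per task (all scalars)."""
    h = params["shared"] * x
    return h, {"a": params["a"] * h, "b": params["b"] * h}


def task_step(params, x, target, task, lr=0.1):
    """One SGD step on a single task's squared error.

    Only the shared trunk and the branch feeding `task` get a gradient;
    every other branch is copied through unchanged.
    """
    h, outputs = forward(params, x)
    err = outputs[task] - target
    grad_branch = 2.0 * err * h                  # d(err^2) / d(branch weight)
    grad_shared = 2.0 * err * params[task] * x   # d(err^2) / d(shared weight)
    new_params = dict(params)
    new_params[task] = params[task] - lr * grad_branch
    new_params["shared"] = params["shared"] - lr * grad_shared
    return new_params


# Hypothetical starting weights for a two-task network.
params = {"shared": 0.5, "a": 1.0, "b": -1.0}
updated = task_step(params, x=2.0, target=3.0, task="a")
print(updated)
```

In a real framework the same effect falls out of automatic differentiation: the loss for one head has zero gradient with respect to another head's private weights, so only the shared layers see updates from more than one task.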


Do you think the benefit of a big multitask network is just less compute required for inference? Or do different tasks boost each other’s accuracy because they deal with correlated visual features of the environment?

The main benefit of the approach described is that you get better (potentially much better) inference efficiency, but the issue of synergy/conflict is an important one that will impact a lot of decisions about how to structure and train the network’s branches. A lot of task pairs will have positive synergy: training common elements together will improve both. Others will conflict: if one improves, it harms the other. There will be a certain craft to finding an overall structure that links synergistic tasks while separating (and appropriately allocating resources to) tasks that have adversarial resource utilization at a particular layer. This is the part where AK says that you can do it manually if you have just two tasks (heads/branches), but if there are more - and Tesla has 30 to 50 - you can’t do it manually. He doesn’t say what they do beyond the manual approach, other than that grid search isn’t really an option. But we can guess, though I’ll omit that here.
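One common way to put a number on synergy versus conflict between two tasks (not something the talk describes, so treat it as an assumption) is to compare the gradients each task's loss produces on the shared weights. A positive cosine similarity suggests the tasks pull the shared layers in a similar direction; a negative one suggests they fight. The function names and the gradient values below are made up for illustration.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a zero gradient carries no directional information
    return dot / (norm1 * norm2)


def relationship(g1, g2):
    """Label a task pair by the sign of their gradient cosine similarity."""
    sim = cosine_similarity(g1, g2)
    if sim > 0.0:
        return "synergy"
    if sim < 0.0:
        return "conflict"
    return "neutral"


# Hypothetical gradients of three task losses w.r.t. the same shared weights.
grad_lanes = [1.0, 0.5, 0.0]
grad_road_edges = [0.8, 0.6, 0.1]
grad_signs = [-1.0, -0.4, 0.2]

print(relationship(grad_lanes, grad_road_edges))
print(relationship(grad_lanes, grad_signs))
```

With 30 to 50 tasks this gives a pairwise matrix rather than a single number, which is one way to see why hand-tuning the grouping stops being practical.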


Please speculate!

If you need to search but you can’t do grid search, then you have to do some kind of smart search. The most obvious thing for neural network types is gradient descent, but it requires a lot of samples. Random search is surprisingly effective if your parameter space is smaller than 25 dimensions or so. And beyond that there is the ne plus ultra: training a neural network to search the space for you!
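For anyone who hasn't seen it, random search really is as simple as it sounds. Here is a minimal sketch. The objective is a made-up stand-in (in practice it would be something expensive, like validation loss after training with a given set of per-task loss weights), and `random_search`, `toy_objective`, and the bounds are all hypothetical names and values.

```python
import random


def random_search(objective, bounds, n_trials=200, seed=0):
    """Sample points uniformly inside `bounds` and keep the lowest-scoring one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best_point, best_score = None, float("inf")
    for _ in range(n_trials):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        score = objective(point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score


def toy_objective(weights):
    """Stand-in for an expensive evaluation; lowest when every weight is 0.3."""
    return sum((w - 0.3) ** 2 for w in weights)


# Five hypothetical per-task loss weights, each searched over [0, 1].
bounds = [(0.0, 1.0)] * 5
best_point, best_score = random_search(toy_objective, bounds)
print(best_point, best_score)
```

Unlike grid search, the number of trials here is set directly and does not grow exponentially with the number of dimensions, although how good the result is for a fixed budget still depends on the problem.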


This is one of the best blogs out there on Multi-task Learning (with research paper citations and important intuitions) -
