James Douma on Tesla’s Fleet Data Collection Effort

@jimmy_d I quit Twitter so I hope I can reach you here. Do you think Tesla is meaningfully constrained by FSD Computer compute capacity right now, especially in light of verygreen’s recent tweets about multi-node computing?

I don’t think what Green is seeing is sufficient evidence to draw that conclusion, no. NNs are by nature opaque under simple observation, so neither I nor anyone else outside Tesla can say how close they are to needing HW4. But my sense of the problem is that they aren’t getting close to needing it yet.

I think we’ll see them move to HW4 as soon as they think it might be needed.

What do you know about multi-node computing for neural networks? Is there any reason for Tesla not to do that?

They’ll do it. They will need it for reliability when they start moving beyond ADAS functions. Until then it is of much lower priority, because there is always an attentive driver available to take over in the extremely rare situations where you have a hardware failure on one of the nodes. I am sure they have had people working on it for some time but haven’t yet deployed it. Perhaps it isn’t a priority right now, or perhaps it is complicated enough that they want to get it thoroughly polished before they push it out to the fleet. It’s possible that what Green is seeing is some initial versions of multi-node going out to the fleet.

Green was claiming that Tesla is using the two nodes to run two different halves of the same neural network, i.e., spreading the NN across two nodes rather than running the NN on one node.

Green’s interpretation is that redundancy is out the window and Tesla is trying to squeeze as much compute as possible out of HW3 using multi-node computing.
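For anyone unfamiliar with the distinction, here is a toy Python sketch of the two ways a two-node board could be used: both nodes running the whole network (redundancy) versus each node running half of it (split). Everything in it is made up for illustration: the `Node` class, the four scalar "layers", and the function names are assumptions, and none of it reflects how Tesla's software actually works.

```python
from typing import Callable, List, Optional

Layer = Callable[[float], float]

# Hypothetical four-layer "network": each layer is just a scalar function.
LAYERS: List[Layer] = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * 0.5,
]


class Node:
    """Toy compute node that counts how many layers it has executed."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.layers_run = 0

    def run(self, layers: List[Layer], x: float) -> Optional[float]:
        # A failed node produces no output at all.
        if not self.healthy:
            return None
        for layer in layers:
            x = layer(x)
            self.layers_run += 1
        return x


def redundant(node_a: Node, node_b: Node, x: float) -> Optional[float]:
    """Both nodes run the full network; one surviving result is enough."""
    results = [node_a.run(LAYERS, x), node_b.run(LAYERS, x)]
    alive = [r for r in results if r is not None]
    return alive[0] if alive else None


def split(node_a: Node, node_b: Node, x: float) -> Optional[float]:
    """Node A runs the first half, node B the second; both must be healthy."""
    half = len(LAYERS) // 2
    intermediate = node_a.run(LAYERS[:half], x)
    if intermediate is None:
        return None
    return node_b.run(LAYERS[half:], intermediate)


if __name__ == "__main__":
    print(redundant(Node("A"), Node("B", healthy=False), 2.0))  # survives one failure
    print(split(Node("A"), Node("B", healthy=False), 2.0))      # no output
```

In the split case each node executes only half the layers, which is where the extra headroom comes from, but a failure on either node leaves you with no output. In the redundant case each node does all the work and the result survives a single node failure.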

@jimmy_d Do you know how practical multi-node NN computing is and would you take it as a sign Tesla is running out of compute resources on the HW3 board?

I don’t see a good reason to favor that particular interpretation. I can imagine various things that might lead to this change in how the networks are being implemented. Most of them have nothing to do with critical features running into capacity limits.

