Email from Elon to Tesla employees:
Tesla needs a few hundred more internal participants in the full self-driving program, which is about to accelerate significantly with the introduction of the Tesla designed neural net computer (known internally as Hardware 3). This has over 1000% more capability than HW2!
Not clear when wide-scale Hardware 3 testing will begin (or whether it’s begun already). Also not clear what software features employees will be (or already are) testing.
The main takeaway for me is that Hardware 3 has “over 1000% more capability than HW2”. Previously Elon had tweeted the increase would be somewhere between 500% and 2000%.
Based on my previous surmisings, this means that HW3 has over 100 teraops of neural network processing.
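For anyone who wants to check the arithmetic, here is a rough sketch of how "X% more" turns into a multiplier and a Tops figure. The 10 Tops baseline for HW2 is an assumption, taken from the upper end of the GP106 estimate quoted in the comments below; the function and constant names are made up for illustration.

```python
def multiplier_from_percent_more(percent_more):
    """'X% more' means the new value is (1 + X/100) times the old one."""
    return 1 + percent_more / 100


# ASSUMPTION: roughly 10 Tops for HW2 (upper end of the GP106 estimate).
HW2_TOPS_ASSUMED = 10

# 1000% is the figure from the email; 500% and 2000% are the earlier tweet's bounds.
for pct in (500, 1000, 2000):
    mult = multiplier_from_percent_more(pct)
    print(f"{pct}% more -> {mult:.0f}x -> ~{mult * HW2_TOPS_ASSUMED:.0f} Tops")
```

Read strictly, "over 1000% more" is over 11x, which lands slightly above the 100 Tops mark under this baseline.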
Edit (Dec. 31): Small update to this story. As was the case previously, employees who participate in the Full Self-Driving testing program get a free premium interior (a $5,000 value).
Yeah, and if they are using an NN-specific architecture (as one would presume), then they can probably get much higher utilization of that 100 Tops than you normally get in a GPU. GP106 has somewhere between 5 and 10 Tops UINT8, but it’s hard to get all of that from a GPU due to insufficient on-chip memory.
So even with a 10x higher peak Tops rating, the practical impact may well be closer to 20x.
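A minimal sketch of that reasoning: the practical speedup is the peak ratio scaled by how much of the peak each chip can actually sustain. The utilization figures below are purely hypothetical, picked only to show how a 10x peak advantage can turn into roughly 20x in practice.

```python
def effective_speedup(peak_ratio, old_utilization, new_utilization):
    """Ratio of sustained throughput, given peak ratio and utilization of each chip."""
    return peak_ratio * new_utilization / old_utilization


# HYPOTHETICAL utilizations: the GPU sustains 40% of its peak,
# the dedicated NN chip sustains 80% of its peak.
print(effective_speedup(peak_ratio=10, old_utilization=0.40, new_utilization=0.80))
```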
I wonder if Elon’s 1000% already refers to the amount of compute that can be used in practice. On Twitter, his terminology was “useful ops/sec”.
GP106 isn’t optimized for learning or inference, but it can run NN models thanks to the flexibility of GPUs. So in that case, which would be better for inference: INT8 or FP16 precision?
Or am I completely off base and trying to compare an apple to an orange?
The vast majority of mainstream models can be easily converted to ‘quantized’ integer versions without loss of accuracy. The conversion involves running sample frames through the network and gathering activation statistics at each layer boundary, then adjusting the weights so that the mean and standard deviation of the distribution are well represented in a short integer. You can go as low as 5 bits, but because 8-bit operations are widely supported in existing hardware, that has become the standard. GP106 can run UINT8 twice as fast as FP16 and 4 times as fast as FP32. Even greater gains are possible if you look at performance per watt, where UINT8 can be well over a 10x improvement.
AP2 models have been mainly UINT8 for the last 18 months or so. Earlier AP models had some FP32 mixed in, but that disappeared over a year ago.
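To make the calibration idea concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions, not how any particular toolchain (or Tesla) does it: it quantizes a single list of activation values, picks the clipping range as mean ± 3 standard deviations, and all function names are hypothetical.

```python
import statistics


def calibrate(activations, num_std=3.0):
    """Choose a clipping range from sample statistics: mean +/- num_std * stdev.

    Returns (lo, scale), where scale is the real-valued width of one UINT8 step.
    ASSUMPTION: a 3-sigma range; real tools use a variety of range heuristics.
    """
    mean = statistics.mean(activations)
    std = statistics.pstdev(activations)
    if std == 0:
        raise ValueError("need activations with non-zero spread to calibrate")
    lo = mean - num_std * std
    hi = mean + num_std * std
    return lo, (hi - lo) / 255


def quantize(x, lo, scale):
    """Map a real value to a UINT8 code, clipping anything outside the range."""
    return max(0, min(255, round((x - lo) / scale)))


def dequantize(q, lo, scale):
    """Map a UINT8 code back to an approximate real value."""
    return lo + q * scale


# Stand-in for activations gathered by running sample frames through a layer.
samples = [0.1, 0.5, 0.9, 1.3, 2.0, -0.4, 0.7]
lo, scale = calibrate(samples)

for x in samples:
    q = quantize(x, lo, scale)
    print(f"{x:+.2f} -> {q:3d} -> {dequantize(q, lo, scale):+.4f}")
```

Values inside the calibrated range come back within half a quantization step of the original; values outside it are clipped, which is the trade-off the range heuristic controls.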
FP16 is a great numerical format for efficient training, especially if you use the Google version (bfloat16), which has a 7-bit mantissa and an 8-bit exponent. But for inference it’s overkill for any model that you can gather activation statistics on. Because UINT8 is twice as memory-efficient as FP16, and because memory capacity/bandwidth is the limiter for current-generation NN hardware, it makes sense to go with UINT8 right now. Even better results can be had with lower resolution, variable resolution, and support for sparsity, but that will probably come in the 2nd generation of custom NN hardware.
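A quick sketch of the memory argument: the weight footprint scales directly with bytes per value. The parameter count is a hypothetical round number, not a figure for any actual AP2 network.

```python
# Storage size of one value in each numeric format.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "UINT8": 1}


def weight_megabytes(num_params, fmt):
    """Memory needed to hold num_params weights in the given format, in MB."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e6


# HYPOTHETICAL model size, for illustration only.
NUM_PARAMS_ASSUMED = 25_000_000

for fmt in BYTES_PER_VALUE:
    print(f"{fmt}: {weight_megabytes(NUM_PARAMS_ASSUMED, fmt):.0f} MB")
```

The same ratio applies to the bytes that have to move over the memory bus per inference, which is why halving the width matters when bandwidth is the bottleneck.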