Interesting post that is worth a read. My personal thoughts are below.
“It would be extremely dangerous to let a predictive system like a neural net (infamously cryptic to debug and prone to confusion) take control of a self-driving car directly (the end-to-end learning approach).”

“Ultimately this means leaning even more heavily on neural nets — with their unpredictable and extreme failure cases — for safety critical systems.”
It’s somewhat surprising to me that a company with “AI” in its name seems to take a philosophical anti-neural-network stance (at least when it comes to safety-critical robotics applications). But these are popular arguments, so they’re worth engaging with.
My own gut feeling is that the best hope for self-driving cars is something closer to end-to-end learning than what Waymo currently seems to be deploying in its Waymo One minivans. If behaviour generation (a.k.a. planning, a.k.a. path planning and driving policy) relies on engineers manually tweaking hand-written code after watching simulations and real-world tests, that seems like a recipe for failure. Insisting that hand-tuned code is the best way to play Go, Dota, StarCraft, or Quake would seem quaint today. As far as I know, there is no track record of success with this approach on complex robotics problems, or on virtual agent problems comparable to driving. But I could be wrong!
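For concreteness, here is a minimal sketch (in PyTorch, my choice) of what “end-to-end” means: a single network maps raw camera pixels straight to control outputs and is trained by imitating logged human driving. Every layer size and the behaviour-cloning loss here are illustrative assumptions, not anyone’s production architecture.

```python
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    """Toy end-to-end policy: raw camera image in, steering/throttle out.

    Hypothetical architecture for illustration only; real systems are far
    larger and consume multiple cameras plus vehicle state.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)  # [steering angle, throttle]

    def forward(self, image):
        return self.head(self.encoder(image))

policy = EndToEndPolicy()
image = torch.randn(1, 3, 120, 160)        # one RGB camera frame
human_action = torch.tensor([[0.1, 0.4]])  # logged steering/throttle
loss = nn.functional.mse_loss(policy(image), human_action)
loss.backward()  # behaviour cloning: imitate the human driver
```

The contrast with the hand-written approach is that nothing in the driving policy is a rule an engineer tweaked; everything is a parameter the data tuned.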
With perception, there are all kinds of entities that either have near-zero depth or are made of light, which a self-driving car nonetheless has to see with super high accuracy: the colour of traffic lights, turn signals and brake lights, painted road markings like lane lines and crosswalks, and the surfaces of road signs. Lidar can’t see these things, so we’re going to need to solve seeing them with cameras and neural networks, whether we use lidar or not.
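These are exactly the kinds of entities that get treated as small supervised-learning problems today. A hedged sketch of the conventional pipeline (the class list, crop size, and architecture are all my assumptions): a detector proposes a traffic-light crop, and a small network classifies its colour from human-labelled examples.

```python
import torch
import torch.nn as nn

# Conventional supervised approach to a "lidar-invisible" entity:
# a detector proposes traffic-light crops, and a small classifier
# labels the colour. Classes and sizes are illustrative assumptions.
CLASSES = ["red", "yellow", "green", "off"]

classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, len(CLASSES)),
)

crop = torch.randn(1, 3, 32, 32)  # traffic-light crop from a camera image
label = torch.tensor([0])         # human-annotated: "red"
loss = nn.functional.cross_entropy(classifier(crop), label)
loss.backward()  # standard supervised training step on a human label
```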
To solve computer vision for these entities, maybe we’ll need to employ a new approach like self-supervised learning or end-to-end learning. I would say we shouldn’t rule out new approaches until we’re confident these entities can be solved with conventional supervised learning and human labelling of images/videos. And if we’re confident these entities can be solved that way, why not other entities like vehicles, pedestrians, and cyclists? This isn’t a rhetorical question; I’m really asking whether there is a good reason to think cameras + human labels + neural nets can solve lidar-invisible entities but not lidar-visible entities. Maybe there is. I’m not an expert.
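To make “self-supervised” concrete, here is one classic pretext task, rotation prediction; the choice of task is my own illustration, not a claim about what any AV company actually uses. The point is that the label is generated from the data itself, so unlabelled fleet video becomes training signal with no human annotators in the loop.

```python
import torch
import torch.nn as nn

# Self-supervised pretext task (rotation prediction): rotate an
# unlabelled image by 0/90/180/270 degrees and train a network to say
# which rotation was applied. No human labelling required; the hope is
# that the learned features transfer to real perception tasks.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 4),  # one logit per possible rotation
)

frame = torch.randn(1, 3, 64, 64)  # unlabelled fleet video frame
k = torch.randint(0, 4, (1,))      # rotation chosen at random
rotated = torch.rot90(frame, int(k), dims=(2, 3))
loss = nn.functional.cross_entropy(encoder(rotated), k)
loss.backward()  # the "label" came from the data itself
```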
Failing to detect vehicles carries obvious risks. But the post also makes a claim about the importance of getting the exact size and location of vehicles right:

“Additionally, almost all self-driving stacks visualize the world in top-down perspective for planning purposes, so misjudging the width of a car (as we saw in the first example) can lead the planning system to incorrectly predict what maneuvers other cars on the road have the space to perform or even to propose a path that would lead to the AV side-swiping the other vehicle.”
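To see why a width error can translate into a side-swipe, here is a toy version of the top-down geometry (all numbers invented for illustration): the planner checks lateral clearance using the estimated width, while reality is governed by the true width.

```python
def lateral_clearance(path_offset_m, other_centre_m, other_width_m):
    """Gap between our path and the side of another (top-down) vehicle."""
    return abs(path_offset_m - other_centre_m) - other_width_m / 2

true_width = 1.9       # the other car's actual width (metres)
estimated_width = 1.5  # perception underestimates by 40 cm
safety_margin = 0.5    # minimum gap the planner requires

planned = lateral_clearance(path_offset_m=0.0, other_centre_m=1.3,
                            other_width_m=estimated_width)
actual = lateral_clearance(path_offset_m=0.0, other_centre_m=1.3,
                           other_width_m=true_width)
print(planned >= safety_margin)  # True: the planner accepts the path
print(actual >= safety_margin)   # False: the real gap is too tight
```

Real planners use far richer geometry than a single offset, of course; the point is only that width errors propagate directly into clearance decisions.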
At parking speeds and distances, the human margin of error is something on the order of 10 centimetres. At highway speeds and distances, it’s probably many metres. In any case, you want to keep a certain amount of distance between yourself and other vehicles. So if an AV keeps a similar buffer, a computer vision error only starts to matter once it exceeds that margin.
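Some rough numbers to anchor that intuition; the two-second rule and the specific figures are my own back-of-envelope assumptions, not from the post.

```python
# Back-of-envelope: how big is a 20 cm perception error relative to the
# buffer you keep anyway? All figures are illustrative assumptions.
error_m = 0.2

highway_buffer_m = 2.0 * (110 / 3.6)  # two-second gap at ~110 km/h
parking_buffer_m = 0.5                # typical gap left while parking

for name, buffer_m in [("highway", highway_buffer_m),
                       ("parking", parking_buffer_m)]:
    print(f"{name}: buffer {buffer_m:.1f} m, "
          f"a 20 cm error is {100 * error_m / buffer_m:.1f}% of it")
```

At highway distances a 20 cm error disappears into the buffer (about 0.3% of it); at parking distances it’s a meaningful fraction (about 40%), which matches the intuition that exactness matters most at low speed and close range.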
This is robustness vs. exactness. You want a perception system that is robust to rare objects, diverse lighting and weather conditions, and other variations in visual conditions. But you only need so much exactness. It is better to have a system that detects vehicles 99.999% of the time with 20 cm of accuracy than a system that detects them 99.995% of the time with 1 cm of accuracy.
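Here’s what that trade looks like as time between misses, under my own assumptions (10 Hz detection, independent per-frame misses, no tracking smoothing):

```python
# Translate per-frame detection rates into time between misses, at an
# assumed 10 Hz detection frequency (real stacks also smooth single-frame
# misses with object tracking, which this ignores).
frames_per_second = 10

for detection_rate in (0.99999, 0.99995):
    miss_prob = 1 - detection_rate
    seconds = 1 / (miss_prob * frames_per_second)
    print(f"{detection_rate:.5f}: one missed frame "
          f"every {seconds / 60:.0f} minutes")
```

That works out to one miss roughly every 167 minutes versus one every 33 minutes: five times fewer misses, in exchange for a centimetre-level accuracy loss that is mostly swallowed by the safety margin anyway.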
The more convincing argument for lidar, in my opinion, is that it helps with robustness, not that it helps with exactness. Timothy B. Lee recently pointed out on Twitter that Tesla could have gotten the benefits of large-scale fleet learning without the liability of promising “Full Self-Driving” by simply saying the new vehicle hardware is for advanced driver assistance. A company like Tesla or maybe even General Motors could use fleet learning to collect training data and run large-scale testing while equipping a separate small fleet of robotaxis with lidar. I personally think this is the best argument for lidar, since it combines the strengths of lidar and the fleet learning approach. (I don’t know whether it would be at all feasible for Tesla to retrofit its Hardware 2/Hardware 3 vehicles with lidar in some hypothetical future where lack of lidar is the only thing holding the fleet back from superhuman full autonomy.)
A distinct but related argument from the quote above is that exactness in computer vision is important for behaviour prediction. I don’t know enough about this to comment. I think it’s an interesting argument that deserves further thought.