Data labelling may cost less than I thought

From the Financial Times (semi-paywalled):

Mother Jones says that Indian call centre employees:

…will earn as much as 20,000 rupees per month—around $2 per hour, or $5,000 per year if they last that long, which most will not. In a country where per-capita income is about $900 per year, a BPO salary qualifies as middle-class.

Suppose that a labelling company pays employees a 50% premium over what the call centres pay: $3/hour. If it truly takes only 8 hours to label an hour of video, then it would cost $24 to label an hour of video.

Waymo has driven around 15 million miles. At an average speed of 25 miles per hour, that’s 600,000 hours of driving. At $24 per hour, it would cost $14.4 million to annotate 600,000 hours of video, or about $115 million if it’s video from 8 cameras.
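For anyone who wants to check the arithmetic, here is a rough sketch using only the figures above ($3/hour wage, 8 labour hours per hour of video, 25 mph average speed). The function and variable names are just illustrative, and the inputs are this post's guesses, not measured values.

```python
# Back-of-the-envelope labelling cost, using the post's assumed inputs.
def labelling_cost(video_hours, labour_hours_per_video_hour=8,
                   wage_per_hour=3.0, cameras=1):
    """Dollar cost to label `video_hours` of driving across `cameras` streams."""
    return video_hours * cameras * labour_hours_per_video_hour * wage_per_hour

# 15 million miles at an assumed 25 mph average speed
waymo_hours = 15_000_000 / 25

one_camera = labelling_cost(waymo_hours)               # single video stream
eight_cameras = labelling_cost(waymo_hours, cameras=8)  # eight video streams

print(f"Hours driven: {waymo_hours:,.0f}")
print(f"One camera:   ${one_camera:,.0f}")
print(f"Eight cameras: ${eight_cameras:,.0f}")
```

Changing the wage or the labour-hours figure scales the result linearly, so any error in either assumption carries straight through to the total.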

Waymo drives about 1 million miles (40,000 hours) per month or 12 million miles (480,000 hours) per year. Waymo is believed to spend about $1 billion per year.

This would suggest that data collection may actually be more of a bottleneck than data labelling, especially since only a fraction of video might be considered useful or interesting and therefore be sent for labelling.

Tesla has 530,000 cars driving about 1 hour per day on average, so about 530,000 hours per day, or roughly 193 million hours per year. To label all video from all 8 cameras at $24 per hour of video would cost $37 billion per year, which is obviously infeasible. Yet to annotate 6,000,000 hours of driving across all 8 cameras (3% of the annual total from 530,000 cars and 10x more than Waymo’s total hours driven) would cost $1.15 billion, a very large but not totally infeasible amount, especially if spread out over multiple years.
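The Tesla numbers can be reproduced the same way. This sketch assumes 8 cameras per car, which is what makes the totals come out to the figures quoted here; the variable names are my own.

```python
# Fleet-scale labelling cost, using the post's figures.
COST_PER_VIDEO_HOUR = 24   # dollars: 8 labour hours at $3/hour
CAMERAS = 8                # assumption: 8 video streams per car

cars = 530_000
hours_per_car_per_day = 1  # the post's rough average

fleet_hours_per_year = cars * hours_per_car_per_day * 365

# Labelling every hour from every camera
label_everything = fleet_hours_per_year * CAMERAS * COST_PER_VIDEO_HOUR

# Labelling only a 6,000,000-hour subset of driving
subset_hours = 6_000_000
label_subset = subset_hours * CAMERAS * COST_PER_VIDEO_HOUR
subset_share = subset_hours / fleet_hours_per_year

print(f"Fleet hours per year: {fleet_hours_per_year:,}")
print(f"Label everything: ${label_everything:,}")
print(f"Label subset:     ${label_subset:,} ({subset_share:.1%} of driving)")
```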

The caveat is whether the 8 hours figure cited by the Financial Times is accurate. Drive.ai said it would take 800 hours (100x more!) to label an hour of video (or is it an hour of driving, including multiple video streams?), but that automation speeds up the process. By how much? We’re not told. Companies like Scale AI are focused on bringing down the time it takes to label video in order to get more productivity out of their workers.
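To see how much the conclusion hinges on that figure, here is a small sensitivity sketch. The speedup factors are purely hypothetical placeholders, since nobody has told us the real number; the $3/hour wage is the same guess as before.

```python
# How the cost per hour of video moves between the 8-hour and 800-hour figures.
WAGE_PER_HOUR = 3.0  # dollars, the post's assumed labeller wage

def cost_per_video_hour(manual_labour_hours, automation_speedup=1.0):
    """Dollar cost to label one hour of video.

    `automation_speedup` is a hypothetical factor by which tooling cuts
    the manual labour time; its true value is unknown.
    """
    return manual_labour_hours / automation_speedup * WAGE_PER_HOUR

ft_figure = cost_per_video_hour(8)         # Financial Times figure
drive_ai_raw = cost_per_video_hour(800)    # Drive.ai figure, no automation

print(f"8 hours/hour of video:   ${ft_figure:,.0f}")
print(f"800 hours/hour of video: ${drive_ai_raw:,.0f}")

# Hypothetical speedups applied to the 800-hour figure
for speedup in (10, 100):
    cost = cost_per_video_hour(800, speedup)
    print(f"800 hours with {speedup}x speedup: ${cost:,.0f}")
```

On these numbers, automation would need to deliver about a 100x speedup for the Drive.ai figure to match the Financial Times one.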

Another source for the 8 hours figure.


In North America, you can pay a safety driver $25/hour to drive around collecting data. Then there’s the capital cost of getting your sensors and computers into vehicles. If your test vehicle costs $50,000 then deploying 100,000 of them will cost $5 billion. Employing 100,000 safety drivers will cost around $5 billion/year, assuming roughly 2,000 paid hours per driver per year. You bear those costs unless you can actually sell cars to paying customers, or figure out some other strategy.
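A quick sketch of the collection-cost arithmetic. The 2,000 paid hours per driver per year is my assumption (roughly full-time); it is the value that makes the $5 billion/year labour figure work out.

```python
# Rough data collection costs for a dedicated test fleet.
fleet_size = 100_000
vehicle_cost = 50_000            # dollars per test vehicle
driver_wage = 25                 # dollars per hour
hours_per_driver_year = 2_000    # assumption: roughly full-time

capital_cost = fleet_size * vehicle_cost                         # one-off
labour_cost_per_year = fleet_size * driver_wage * hours_per_driver_year

print(f"Vehicles (one-off):       ${capital_cost:,}")
print(f"Safety drivers (per year): ${labour_cost_per_year:,}")
```

This leaves out energy, maintenance, insurance, bandwidth, and storage, so it is a floor rather than a full estimate.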

Tesla’s trick is eliminating the labour cost of the safety driver and the capital cost of deploying vehicles.

If labelling video truly costs $24 per hour, then labelling the most valuable 5% of video from 100,000 test cars would cost $1.8 billion/year. The most valuable 1%? $360 million/year. On these assumptions, labelling data would be much cheaper than collecting it.
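The paragraph above doesn't spell out how many hours each test car drives or how many cameras it carries. As a sanity check, this sketch backs out the hours per car implied by the two dollar figures, assuming 8 cameras per car. That camera count and the function name are my assumptions, not something stated above.

```python
# Back out the driving hours per car implied by the quoted labelling costs.
COST_PER_VIDEO_HOUR = 24   # dollars per hour of video, per camera
FLEET_SIZE = 100_000
CAMERAS = 8                # assumption: not stated for this estimate

def implied_hours_per_car(annual_cost, percent_labelled):
    """Yearly driving hours per car that would produce `annual_cost`."""
    labelled_video_hours = annual_cost / COST_PER_VIDEO_HOUR
    total_video_hours = labelled_video_hours * 100 / percent_labelled
    return total_video_hours / (CAMERAS * FLEET_SIZE)

from_five_percent = implied_hours_per_car(1_800_000_000, 5)
from_one_percent = implied_hours_per_car(360_000_000, 1)

print(f"Implied by the 5% figure: {from_five_percent:,.0f} hours/car/year")
print(f"Implied by the 1% figure: {from_one_percent:,.0f} hours/car/year")
```

Both figures imply the same per-car hours, which is close to a full-time driving schedule, so the two estimates are at least consistent with each other under the 8-camera assumption.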

Related:

There are degrees of labeling. No chance you can accurately draw lane splines at 1/8th real time. Nor are you going to be drawing accurate 3D bounding boxes around more than 1 car, even with key frames at one per second.

Maybe if you have a “pretty good” existing algorithm that is 90% right and you’re just fixing errors you could hit 1/8th real time.

This is the benefit of closing the loop, though, as Tesla does. Your QC tools need to be integrated into your stack very closely. You need to be just correcting errors, not wholesale labeling.

I think that’s exactly it. If your neural network is already 98% accurate and you’re trying to get to 99.999% accurate, then labelling consists in reviewing the automatically labelled data and manually correcting the 2% that is wrong.

Data collection costs:

  • buying/building the test vehicles
  • operating the test vehicles (energy, maintenance, and insurance)
  • paying the safety drivers
  • bandwidth and storage

Data labelling costs:

  • paying the labellers
  • bandwidth and storage

These are marginal costs, ignoring fixed costs like management and R&D.

If I’m getting this right, paying data labellers in India $4.50/hour would put them in the top ~2% of wage earners in the country, which says something incredibly disheartening about the state of poverty in India. It also means companies that need labelling can pay people in India much more than the average Indian wage yet much less than the U.S. minimum wage.

India is just an illustrative example. Companies outsource labelling to many countries.

If you want to do something about global poverty right now, donate to GiveWell or one of its top charities: