Machine learning and computer vision for autonomous rover racing -- step one


Some interesting news on my project to apply machine vision to the subject of autonomous rover racing.
I've been renting GPU processing power on the spot market (yes, there is such a thing) and training neural networks.

I've used some of the videos that jesse helpfully provided from onboard his rovers during AVC 2015/2016, painstakingly painting red/green "stop/go" areas in each frame across some selected snippets.
Then I randomly cut small sub-sections out of each frame, labeled by whether the center spot is "stop" or "go," and built and trained a fully convolutional deep network on those pictures. (I had about 500 pictures when I was done.)
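The patch-extraction step above can be sketched roughly like this. This is a minimal illustration, not the actual code from the project: the function name, the 64x64 size (the only number the post gives), and the mask representation are all assumptions for the example.

```python
import numpy as np

PATCH = 64  # sub-section size, as mentioned in the post

def sample_patches(frame, go_mask, n, rng):
    """Crop n random PATCH x PATCH windows from `frame`; each window
    is labeled by the painted stop/go value at its CENTER pixel.

    frame:   H x W x 3 image array
    go_mask: H x W boolean array, True where the frame was painted "go"
    (shapes and names here are illustrative, not from the original post)
    """
    h, w = go_mask.shape
    patches, labels = [], []
    for _ in range(n):
        y = rng.integers(0, h - PATCH + 1)
        x = rng.integers(0, w - PATCH + 1)
        patches.append(frame[y:y + PATCH, x:x + PATCH])
        # the label comes from the center of the window, not the whole crop
        labels.append(int(go_mask[y + PATCH // 2, x + PATCH // 2]))
    return np.stack(patches), np.array(labels)

# tiny synthetic example: left half painted "stop", right half "go"
rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))
go_mask = np.zeros((120, 160), dtype=bool)
go_mask[:, 80:] = True
X, y = sample_patches(frame, go_mask, 32, rng)
```

Labeling by the center pixel only is what lets the same network later be slid across a whole frame to build a per-location stop/go map.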

Here are random examples of "stop" and "go" from this training set:

The randomly-selected "stop" is probably cropped from the top of some picture, and the blue lines is likely a fence.

The randomly-selected "go" is probably cropped from the right of a picture, and we can see the track and a bit of a wheel in it.

Then I took some footage from Money Pit 3 in 2015, when I was just driving around a local park. What's interesting is that this park has brighter pavement than the AVC track, and it also faces a road where cars are parked, which is different from the hay bales on the side.
I picked three random pictures, two "go" ones and one "stop" one (note: the model classifies the CENTER of the image, not the near field of the image). Here are the results, on totally brand-new images the model has never seen before, taken with different cameras, etc.:

It should go on those paths, but should probably stop for those cars.

This feels amaaaaaziiing!
(...until I find out that, say, it actually thinks anything with white along the bottom is OK and anything with green along the bottom is not OK, at which point I'll be sad and need to apply the Knobbly Stick of Network Enlightenment to it)

The next step is to extend this 64x64 snippet classifier into a wider convolutional network that classifies every pixel in a full video frame -- this is known as semantic segmentation. The cool thing is that I can basically take the model that I have and run it across sections of the video frame, sliding it over a little bit each time, and then color the center of each image window with whatever stop/go verdict the model gives me.
Then I should have a "map" of what's okay or not in front of the camera out to the horizon.