To solve both of these problems, ALVINN takes each input image and computes other views of the road by applying various perspective transformations (shifting, rotating, and filling in missing pixels) to simulate what the vehicle would see if its position and orientation on the road were incorrect. For each of these synthesized views of the road, a "correct" steering direction is approximated. Both the real and the synthesized images are then used to train the network.
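ALVINN's actual transformation is a full perspective warp of the camera image; as a simplified illustration, the sketch below synthesizes a new view by shifting the image laterally, filling the exposed pixels, and adjusting the steering label to compensate. The function name, the `fill_value`, and the `gain` constant relating pixel shift to steering correction are all hypothetical, not taken from ALVINN.

```python
import numpy as np

def synthesize_view(image, steering, shift_px, fill_value=0, gain=0.01):
    """Simulate the view after a lateral displacement: shift the image
    horizontally, fill the newly exposed pixels, and approximate the
    "correct" steering for the shifted view.

    `gain` is a hypothetical constant converting a pixel shift into a
    steering correction; ALVINN derives this geometrically.
    """
    shifted = np.full_like(image, fill_value)   # fill in missing pixels
    if shift_px > 0:
        shifted[:, shift_px:] = image[:, :-shift_px]
    elif shift_px < 0:
        shifted[:, :shift_px] = image[:, -shift_px:]
    else:
        shifted[:] = image
    # Steering must compensate for the simulated displacement.
    return shifted, steering + gain * shift_px
```

Each synthesized (image, steering) pair is then a labeled training example, even though the vehicle never actually occupied that position on the road.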
To avoid overfitting to just the most recently captured images, ALVINN maintains a buffer pool of 200 images (both real and synthetic). When a new image is obtained, it replaces one of the images in the pool, chosen so that the average steering direction over all 200 examples remains straight ahead. In this way, the buffer pool always retains images spanning many different steering directions.
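One way to realize this replacement rule is to evict the stored example whose removal leaves the pool's mean steering closest to zero (straight ahead). This is a sketch of that idea, not ALVINN's exact bookkeeping; examples are assumed to be `(image, steering)` pairs with steering measured as a signed angle.

```python
def update_buffer(buffer, new_example, capacity=200):
    """Insert `new_example` = (image, steering) into the pool. If the pool
    exceeds `capacity`, evict the example whose removal keeps the mean
    steering of the remaining examples closest to zero (straight ahead).
    A simplified sketch of the replacement rule described in the text."""
    buffer.append(new_example)
    if len(buffer) <= capacity:
        return buffer
    total = sum(s for _, s in buffer)
    # Removing example i leaves mean (total - s_i) / (n - 1); since n - 1
    # is the same for every candidate, minimize |total - s_i|.
    i = min(range(len(buffer)), key=lambda k: abs(total - buffer[k][1]))
    del buffer[i]
    return buffer
```

Because the evicted example is the one "pulling" the mean away from zero, the pool keeps a balanced mix of left-turn, right-turn, and straight-ahead examples.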
Initially, a human driver controls the vehicle for about 5 minutes while the network learns, starting from random initial weights. After that, one epoch of training on the 200 examples in the buffer pool is performed approximately every 2 seconds.
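The resulting training schedule can be sketched as a simple loop that runs one epoch over the buffered examples at a fixed period; here `train_epoch` is a hypothetical stand-in for one pass of backpropagation over the pool, and time is simulated rather than measured from a real clock.

```python
def online_training(buffer, train_epoch, duration_s, epoch_period_s=2.0):
    """Run one training epoch over `buffer` every `epoch_period_s` seconds
    of simulated driving time; return the number of epochs performed.
    `train_epoch` is a hypothetical callback (one backprop pass)."""
    epochs = 0
    t = 0.0
    while t < duration_s:
        train_epoch(buffer)   # one epoch over the ~200 buffered examples
        epochs += 1
        t += epoch_period_s
    return epochs
```

At a 2-second period this gives roughly 30 epochs per minute of driving, so the network tracks the road while the buffer pool keeps the examples diverse.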