What kind of network to use for real-time object position and orientation tracking?

I have Python and TensorFlow set up, and I have played with a couple of examples provided by Siraj Raval, but now I want to try something original.

The idea: detect an object in a picture and return a position and orientation vector for it, fast enough for real-time tracking from a video feed.

The questions: on the encoder side, do I need to go as far as using a convolutional neural network? And how do I decode the encoded representation on the other side to produce a position and orientation vector?
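
To make the question concrete, here is a minimal sketch of one arrangement I could imagine, not something I know to be the right approach: a small convolutional encoder followed by dense layers that regress the pose directly, instead of a decoder that reconstructs the image. Everything specific here is an assumption for illustration: the 64x64 RGB input, a 2D pose in the image plane, the output layout `[x, y, cos(theta), sin(theta)]`, and the hypothetical names `build_pose_regressor` and `decode_pose`.

```python
import numpy as np
import tensorflow as tf


def build_pose_regressor(input_shape=(64, 64, 3)):
    """Small CNN mapping an image to 4 numbers: [x, y, cos(theta), sin(theta)].

    The layer sizes and the output layout are assumptions for this sketch.
    """
    inputs = tf.keras.Input(shape=input_shape)
    # Convolutional "encoder": strided convolutions shrink the image
    # while increasing the number of feature channels.
    x = tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    # Flatten keeps the spatial layout of the features, which position needs.
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    # Regression head: linear outputs, no softmax, since this is not classification.
    outputs = tf.keras.layers.Dense(4, name="pose")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model


def decode_pose(vec):
    """Split a 4-vector into a position and an angle in radians."""
    x, y, c, s = vec
    # arctan2 recovers the angle even if (c, s) is not exactly unit length.
    return np.array([x, y]), float(np.arctan2(s, c))


if __name__ == "__main__":
    model = build_pose_regressor()
    frame = np.zeros((1, 64, 64, 3), dtype="float32")  # placeholder video frame
    prediction = model(frame, training=False).numpy()[0]
    position, angle = decode_pose(prediction)
    print(position, angle)
```

Encoding the orientation as a cosine and sine pair rather than a raw angle is meant to avoid the jump between 359 and 0 degrees in the loss. Training would need labelled images, with known position and orientation for each frame, passed to `model.fit`.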

