Face Tracking Duck (and Sheep!)


Did you know that you can control your DSLR camera via command line instructions? I certainly didn't until I attempted this project. But that's beside the point; this project goes beyond simply using a DSLR camera to obtain a video feed. I used a 2-DOF pan-tilt rig powered by an Arduino and the OpenCV library to track the movement of a person's face. In a way, one could call this a rudimentary example of an animatronic head.

If you move upwards, the head will tilt upwards to look at you, almost as if its gaze follows you in whichever direction you move. (Provided you remain within the frame, of course!) I tried placing several cute toys on the pan-tilt rig, such as a duck, a sheep and a ladybird.

Setting up a DSLR camera for your video feed:

I used a DSLR camera so that I could position the camera wherever I wanted to. In addition, you get a higher frame rate, better image quality and so on. This project works with a laptop camera as well, which is what I was using at the beginning of the project. You can also try any other USB webcam or any camera supported by the gphoto2 library.

I carried out this project on Ubuntu 18.04, so all the instructions are for Ubuntu. I followed the instructions at [12] to set it up. gphoto2 is a great library if you want to control specific features of your camera such as the ISO, white balance, flash mode and so on.

Using gphoto2 --list-config you can see which camera parameters you can control from the command line; for instance, my camera's config list is shown below. Different cameras will expose a different list of parameters.

 


Using gphoto2 --set-config [name]=value, you can set a parameter to a certain value. For instance, if you want to turn off the flash, you could run gphoto2 --set-config flashmode=0 (the same setting that appears in the capture command further down).

Each time you start up your system, make sure you set up another video stream output (other than your laptop camera). Here /dev/video1 is used as the video stream output; usually video0 refers to your laptop camera. To set this up you run:

sudo modprobe v4l2loopback exclusive_caps=1 max_buffers=2

This line is in the shell script provided along with this project but has been commented out; you can uncomment it if you want to.

To make sure that your terminal doesn't complain with something along the lines of "Could not claim the USB device", you have to kill the gphoto2 process that is holding on to the camera. I've written the UNIX commands for that in the shell script: it finds the ID of the process (called "gphoto2 – spawner" on my machine) and then kills it. If you want to read more about getting the video stream, see [13]. In this case, I used the UNIX command shown below (also present in the shell script).

gphoto2 --set-config flashmode=0 --stdout --capture-movie | ffmpeg -i pipe:0 -vcodec rawvideo -pix_fmt yuv420p -threads 0 -f v4l2 /dev/video1

Note: I am outputting the video stream to /dev/video1 and setting flashmode=0 (flash off). There are more settings you can play around with; just read the ffmpeg and gphoto2 documentation.

Using Serial Communication:

Serial communication is an easy-to-use communication protocol, so for this project I stuck with that. However, if I ever decide to set up the detached rig as a display piece, I'd use some form of wireless connection instead. The shell script provided uses the command shown below to enable the serial port for reading and writing.

sudo chmod a+rw /dev/ttyACM0

I set the baud rate to 38400. A higher value is chosen so that data is sent faster between the face tracking module and the Arduino. A lot of the code was inspired by [14], whereby the data you send from the Python app is encoded in the form "X__Y__Z". The blank spaces are filled with the x and y coordinates of the centre point of the detected face. The Arduino code then reads the data from the serial port; again, this code was obtained from [14].
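To give a rough idea of what the Python side looks like, here is a minimal sketch using pyserial. The exact message framing and the helper name send_coords() are illustrative assumptions based on [14]; the port and baud rate match the settings above.

import serial

#Same port and baud rate as set up above; adjust /dev/ttyACM0 if your Arduino enumerates differently
ser=serial.Serial('/dev/ttyACM0',38400,timeout=1)

def send_coords(x,y):
	#Frame the face centre as "X<x>Y<y>Z", the style of framing used in [14]
	ser.write("X{}Y{}Z".format(x,y).encode('ascii'))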

 

Setting up the Pan-tilt rig:

I got the pan-tilt servo rig from Thingiverse [10]. The original STL files were designed by the user fbuenonet. I screwed it onto a cardboard box, making sure one side of the box was hollow so that I could place my Arduino, breadboard and wires underneath it.




I stacked a bunch of boxes and placed my DSLR camera on top of them, right next to the pan-tilt servo rig. I then taped a duck head and a sheep head onto the pan-tilt servo rig's surface. Notice that I placed a tape measure under the DSLR camera for good measure. Sorry for the pun :)




Had to cut open this duck toy unfortunately.

Viola-Jones Algorithm (Haar Cascades) for Face Detection:

I used a feature-based algorithm called the Viola-Jones algorithm for face detection. The Viola-Jones algorithm, also known as the Haar cascade, is usually used for detection rather than tracking. Algorithms such as template matching can be used for face tracking after a face has been detected in an earlier frame. For this project, however, I'm using Haar cascades to detect a face in every new frame.


Image Obtained from [11]

The Haar cascade features are shown above. A feature's value is the difference between the sum of the pixels under the white region and the sum of the pixels under the black region. Each Haar feature captures a certain detail of your face. [7] We use a window size of 24x24 pixels, which is slid across the rows and columns of the image. [2]
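Just to illustrate how a single feature is evaluated cheaply, here is a small sketch (my own illustration, not part of the project code) of a two-rectangle feature computed with an integral image, which is the trick that makes these rectangle sums fast:

import numpy as np

def integral_image(gray):
	#Pad with a leading row/column of zeros so rectangle sums need no edge checks
	ii=np.zeros((gray.shape[0]+1,gray.shape[1]+1),dtype=np.int64)
	ii[1:,1:]=gray.cumsum(axis=0).cumsum(axis=1)
	return ii

def rect_sum(ii,x,y,w,h):
	#Sum of the pixels in the rectangle with top-left corner (x,y), width w, height h
	return ii[y+h,x+w]-ii[y,x+w]-ii[y+h,x]+ii[y,x]

def two_rect_feature(ii,x,y,w,h):
	#An "edge" feature: white (left) half minus black (right) half
	white=rect_sum(ii,x,y,w//2,h)
	black=rect_sum(ii,x+w//2,y,w-w//2,h)
	return white-black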

AdaBoost selects the relevant features needed for classification from a large pool of candidate features. [4] Every feature is evaluated on the training images and an appropriate threshold is chosen, which classifies each window as either "face" or "non-face". The features selected are the ones that give the lowest error rate. [9] AdaBoost builds a series of weak classifiers that form a strong classifier when combined in a weighted sum. [11] Each weak classifier is assigned a weight, and the weights are recomputed after each round depending on the error produced. This is repeated until you reach the target detection rate or the maximum acceptable false positive rate. In total, around 6000 features end up being used. [11]
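Conceptually (this is a simplified sketch of the idea, not the OpenCV implementation), a weak classifier is just a threshold on one feature value, and the strong classifier is a weighted vote over the selected weak classifiers:

def weak_classify(feature_value,threshold,polarity):
	#Returns 1 (face) or 0 (non-face) based on a single Haar feature value
	return 1 if polarity*feature_value < polarity*threshold else 0

def strong_classify(feature_values,thresholds,polarities,alphas):
	#Weighted vote over the selected weak classifiers (the AdaBoost output)
	score=sum(a*weak_classify(f,t,p) for f,t,p,a in zip(feature_values,thresholds,polarities,alphas))
	return 1 if score >= 0.5*sum(alphas) else 0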

Now, we don't want to spend all our time testing all 6000 features at every row and column. This is where Viola and Jones suggested a cascade of classifiers, whereby the "easier" features are tested in the early stages and the "more complex" features are only tested if the window passes the earlier stages.

For instance, in the first stage, two features are evaluated, covering the eye region. The classifier checks whether your eyes are darker than your nose and cheeks. What would happen if you had dark circles under your eyes? Would the program fail to detect your face? And if so, would dabbing some concealer under your eyes be a worthwhile solution to make this algorithm work? I'm kidding, obviously; the algorithm is pretty robust, so no concealer will be needed.

The second stage checks 10 features, and so on. [4] In the later stages, the thresholds become stricter; hence the term "cascade of classifiers". There are 38 stages in total. If the window fails at any stage, it is immediately marked as "non-face" and the detector moves on to the next window. The image below shows one stage of the classifier, where the Haar features are being used to detect certain details.


Image obtained from [7]
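The overall flow of the cascade can be sketched as follows (again a simplification, assuming each stage is represented as a strong classifier returning pass/fail):

def passes_cascade(window_features,stages):
	#Each stage is a strong classifier; reject the window as soon as one stage fails
	for stage in stages:
		if not stage(window_features):
			return False #marked "non-face", no further stages are evaluated
	return True #survived every stage (38 of them here), so the window is a face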




Overlapping rectangles can be seen in the video above. If there are multiple rectangles in one place, one can be fairly sure that there is a face present in that region. Notice how the window's scale changes in the video as well; the detector is scaled up/down using a user-defined variable called scaleFactor. [4]

The trained classifier is stored in an XML file. The OpenCV library ships with several XML files containing trained cascade classifiers for detecting frontal faces, profile shots, noses, eyes and the entire body. If you're curious about what the XML file actually contains, check out the article at [3].

This algorithm is mainly targeted towards frontal face detection. [7] However, the OpenCV library also includes XML files that can detect the side profile of a face. Because people don't normally keep their heads stiffly facing the camera at all times, I needed to account for instances when a person shifts their face away from the camera, revealing their side profile. So my code tries to find a front-facing face first, and only then looks for a side profile.

Explanation of Python Code:

Note: Most of the Python code was inspired by the code in the video tutorial at [6] and by [14]. My main addition was the profile detection part.

Capturing the Video Feed:

Firstly, we capture the video feed from our DSLR camera, using "/dev/video1" to access the video output. Each frame is converted to grayscale, as it is computationally less intensive to process. [11] The image is flipped horizontally as well; the reason for this is explained later.

import cv2

#Accesses the stream from the DSLR/USB camera
#if you want to use your laptop camera use -> "/dev/video0"
video=cv2.VideoCapture("/dev/video1")
while True:
	ret,frame=video.read() #Stores each incoming frame in frame
	frame=cv2.resize(frame,(0,0),fx=0.5,fy=0.5) #Makes the image size smaller
	frame=cv2.flip(frame,1) #Flips the image horizontally
	
	#uncomment the line below if you want to print the size of the image
	#print("Height:{}, Width:{}".format(frame.shape[0],frame.shape[1]))
	gray=cv2.cvtColor(frame,cv2.COLOR_BGR2GRAY) #Converts to grayscale
	cv2.imshow("Frame",frame) #Displays the frame with detected faces labelled
	#The code below is written so that you can exit the window using "q"
	ch=cv2.waitKey(1)
	if ch & 0xFF==ord('q'):
		break
video.release()
cv2.destroyAllWindows()

Haar Cascade part:

First, an instance of the CascadeClassifier class is created for each detector and loaded with the pretrained classifier's file path. For this project, two cascade classifiers are loaded: facecascade (for frontal face detection) and profcascade (for profile detection).
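A minimal sketch of the loading step, assuming the standard XML files that ship with the opencv-python package (your file paths and XML file names may differ):

import cv2

#cv2.data.haarcascades points to the XML files bundled with the opencv-python package
facecascade=cv2.CascadeClassifier(cv2.data.haarcascades+"haarcascade_frontalface_default.xml")
profcascade=cv2.CascadeClassifier(cv2.data.haarcascades+"haarcascade_profileface.xml")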

One thing to mention is that the profile XML file only accounts for the left side of the face (I know, weird, right?). So the frames I run the algorithm on are flipped, so that the right side of my face is perceived as the left side of my face. (Even weirder, right?)

Using the detectMultiScale() method of the cv::CascadeClassifier class, we can start detecting faces. But first we have to define certain parameters and tune them so that we get the correct detections. In my case, I'm only tuning three parameters, namely:

scaleFactor: sets the factor by which the image is scaled down between detection passes

minNeighbors: defines how many neighbouring detections each candidate region needs before it is accepted as a face or profile [2]

minSize: Minimum rectangle size for a region to be considered as a face or profile.

faces=facecascade.detectMultiScale(gray,scaleFactor=1.5,minNeighbors=6,minSize=(15,15))
profile=profcascade.detectMultiScale(gray,scaleFactor=1.5,minNeighbors=3,minSize=(10,10))

There is a relationship between scaleFactor, minSize, minNeighbors and accuracy. [9] More false positives appear if the minNeighbors value is too low; if it is set too high, the detector ends up missing some profile shots. I would rather have a few false positives than no detections at all.

The code draws the detected faces' boundaries onto the frame and sends the data to the Arduino. One important thing to note is that, for every frame, I send the coordinates only once: a profile position is sent only if no front-facing face is found, regardless of whether both are detected at the same time (it happens!) and drawn on the same frame.

A rectangle is only drawn for a front-facing face or a profile face when its area is above a minimum threshold. This was set to a value that works whether you are far from the camera or close to it. Setting this condition also helped stabilise the detections, making sure only one major detection was made at each iteration.
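Pieced together, that logic looks roughly like the sketch below. MIN_AREA is an illustrative name and value, and send_coords() is the hypothetical helper from the serial section earlier; they are not necessarily the names used in the actual project code.

MIN_AREA=1500 #minimum rectangle area; tune this for your frame size
sent=False
for (x,y,w,h) in faces:
	if w*h>MIN_AREA:
		cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2) #draw the frontal face
		if not sent:
			send_coords(x+w//2,y+h//2) #send the centre of the frontal face
			sent=True
if not sent:
	for (x,y,w,h) in profile:
		if w*h>MIN_AREA:
			cv2.rectangle(frame,(x,y),(x+w,y+h),(0,255,0),2) #draw the profile face
			send_coords(x+w//2,y+h//2) #only sent when no frontal face was found
			sent=True
			break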

Mapping the values from the Python code to a Servo motor:

The map() function is used in the Arduino code to take in the X-Y pixel coordinates and map them to servo positions.

map(value, fromLow, fromHigh, toLow, toHigh)

So in this case, value would correspond to either the X or Y coordinate you get from the Python code.

Although the image has a defined height and width, we do not use those values for fromHigh, nor do we set fromLow to 0. This is because your face is unlikely to be detected at the extreme edges of the image; it will most likely remain within a certain range of rows and columns, no matter how much you move around. These values will vary depending on your image size.
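A simple way to find sensible fromLow/fromHigh values (my own suggestion, not part of the original code) is to print the detected face centres while moving around in front of the camera and note the range they cover:

for (x,y,w,h) in faces:
	print("centre: x={}, y={}".format(x+w//2,y+h//2)) #note the minimum/maximum values you see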

The toLow and toHigh values have to be tuned heavily depending on how you have positioned your servo and how much you want it to pan and tilt. In my case, it took some trial and error to figure out the values. My servos have 180 degrees of rotation, but obviously, if my animatronic head is going to follow your gaze, it will not be rotating the full 180 degrees.

The values I set for the pan motion:

fromLow: 40, fromHigh: 244, toLow: 30, toHigh: 80

For the tilt motion:

fromLow: 128, fromHigh: 48, toLow: 140, toHigh: 170

Shown below is the code written for mapping the values, whereby x and y are the coordinates that we get from our Python code, and prevX and prevY are the previous iteration's x and y coordinates. pos and pos2 hold the servo positions for tilt and pan respectively. The max and min functions make sure the values don't exceed the ranges set for both servos. Finally, we write the positions to the servos using the write() method.

//The function Pos() maps the values we got from the Python code to
//a value that lies within the range of positions we want the servo to sweep through
//These values may vary depending on the size of your image -> fromLow, fromHigh
//or the way you have positioned your servos -> toLow, toHigh
//The tuned values will differ for each servo
void Pos()
{
  if (prevX != x || prevY != y) //Checks whether the x and y coordinates have changed
  {
    //pos2 stores the mapped x coordinate for the PAN servo
    pos2 = map(x, 40, 244, 30, 80);
    //pos stores the mapped y coordinate for the TILT servo
    pos = map(y, 128, 48, 140, 170);

    //Making sure that the values don't go beyond the set ranges
    pos2 = max(pos2, 30);
    pos2 = min(pos2, 80);

    pos = max(pos, 140);
    pos = min(pos, 170);

    //Stores the current x and y coordinates in prevX and prevY
    prevX = x;
    prevY = y;

    //Writes the positions to the servos
    myservo.write(pos);
    myservo2.write(pos2);
  }
}


Finally, here is a working demo. The video stream in the left-hand corner is the stream I get from the DSLR camera after the Python code has processed it. Although it isn't very evident from the angle I filmed at, the rig is positioned in such a way that the animatronic head follows me in whichever direction I move.







Also, this video is great for understanding the Haar cascade (Viola-Jones algorithm). [5]



Here's another one; it's a bit long but it is veryyyy useful! [1]



The full code is available on GitHub, follow the instructions there on how to run the project.


References:

[1] https://www.youtube.com/watch?v=WfdYYNamHZ8

[2] https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html

[3] https://medium.com/swlh/haar-cascade-classifiers-in-opencv-explained-visually-f608086fc42c

[4] https://www.face-rec.org/algorithms/Boosting-Ensemble/16981346.pdf

[5] https://www.youtube.com/watch?v=uEJ71VlUmMQ

[6] https://www.youtube.com/watch?v=88HdqNDQsEk

[7] https://web.archive.org/web/20171204220159/http://www.makematics.com/research/viola-jones/

[8] https://vimeo.com/12774628

[9] https://www.researchgate.net/publication/343127730_Face_Detection_Using_OpenCV_and_Haar_Cascades_Classifiers

[10] https://www.thingiverse.com/thing:708819

[11] https://unipython.com/deteccion-rostros-caras-ojos-haar-cascad/

[12] https://medium.com/nerdery/dslr-webcam-setup-for-linux-9b6d1b79ae22

[13] http://www.gphoto.org/doc/remote/

[14] https://create.arduino.cc/projecthub/WolfxPac/face-tracking-using-arduino-b35b6b

