PERCEPTUAL COMPUTING: Depth Data Techniques
1. Introduction
Many developers have had an opportunity to explore the power of perceptual computing thanks to the Intel® Perceptual Computing SDK, and have created applications that track faces, gestures, and voice commands. Some developers go beyond the confines of the SDK and use the raw camera depth data to create amazing new techniques. One such developer (me) would like to share some of these techniques and provide a springboard for creating innovative solutions of your own.
Figure 1: 16-bit Depth Data as seen from a Gesture Camera
This article provides beginners with a basic knowledge of how to extract the raw depth data and interpret it to produce application control systems. It also suggests a number of more advanced concepts for you to continue your research.
I’m assuming that readers are familiar with the Creative* Interactive Gesture Camera and the Intel® Perceptual Computing SDK. Although the code samples are given in C++, the concepts explained are applicable to Unity* and C# developers as well.
2. Why Is This Important?
To understand why depth data is important, you must consider the fact that the high level gesture functions in the SDK are derived from this low level raw data. Features such as finger detection and gesture control begin when the SDK reads the depth data to produce its interpretations. Should the SDK run out of features before you run out of ideas, you need the ability to supplement the existing features with functions of your own.
To that end, having a basic knowledge of app control through the use of depth data will be an invaluable skill to build on.
Figure 2: Unsmoothed raw depth data includes many IR artefacts
As shown in Figure 2, the raw depth data that comes from the camera can be a soup of confusing values when unfiltered, but it also contains interesting metadata for those coders who want to go beyond conventional wisdom. For example, notice how the blobs in the upper right corner of the image contain bright and dark pixels, which might suggest very spiky camera facing objects. In fact, these are glass picture frames scattering the IR signal and creating unreliable readings for the purpose of depth determination. Knowing that such a randomly spiky object could not exist in real depth, your code could determine the ‘material type’ of the object based on these artifacts.
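To make this idea concrete, here is a minimal sketch (my own illustrative helper, not part of the SDK) of how such an artifact check might look. It assumes you already have a pointer to the 320x240 array of 16-bit depth values (the same buffer the later samples access through ddepth.planes[0]), and the maxJump threshold is an arbitrary starting value you would tune by experiment.

#include <vector>

// pxcU16 is the SDK's 16-bit unsigned depth type (an unsigned short).
// Flag depth pixels whose left and right neighbors both differ from the
// pixel by more than 'maxJump', a pattern far spikier than any real
// surface and typical of IR scatter from glass or other shiny materials.
// Returns one byte per pixel: 1 = suspect reading.
std::vector<unsigned char> FlagScatterPixels(const pxcU16* depth,
                                             int width, int height,
                                             int maxJump)
{
    std::vector<unsigned char> suspect(width * height, 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 1; x < width - 1; x++)
        {
            int centre = depth[y * width + x];
            int left   = depth[y * width + (x - 1)];
            int right  = depth[y * width + (x + 1)];
            bool spikeNear = (left - centre) > maxJump && (right - centre) > maxJump;
            bool spikeFar  = (centre - left) > maxJump && (centre - right) > maxJump;
            if (spikeNear || spikeFar)
                suspect[y * width + x] = 1;
        }
    }
    return suspect;
}

A region dense with flagged pixels is a strong hint that the surface is glass, metal, or some other IR-unfriendly material rather than genuinely spiky geometry.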
This is just one example of how the raw depth data can be mined for ‘as yet’ unexplored techniques, and helps to show how much more functionality we can obtain from outside of the SDK framework.
3. Depth Data Techniques
Given this relatively new field, there is currently no standard set of techniques that apply directly to controlling an application using depth data. You will find a scattering of white papers on data analysis that may apply and fragments of code available from the early Perceptual pioneers, but nothing you could point to as a definitive tome.
Therefore, the following techniques should be viewed as unorthodox attempts at describing possible ways to obtain specific information from the depth data and should not be viewed as definitive solutions. It is hoped that these ideas will spark your own efforts, customized to the requirements of your particular project.
Below is a summary of the techniques we will be looking at in detail:
(a) Basic depth rendering
(b) Filtering the data yourself
(c) Edge detection
(d) Body mass tracker
Basic Depth Rendering
The simplest technique to start with is also the most essential, which is to read the depth data and represent it visually on the screen. The importance of reading the data is a given, but it’s also vital to present a visual of what you are reading so that subsequent technique coding can be debugged and optimized.
The easiest way to start is to run the binary example from the Intel Perceptual Computing SDK:
\PCSDK\bin\win32\depth_smoothing.exe
Figure 3: Screenshot of the Depth Smoothing example from the SDK
The corresponding source project can be found at:
\PCSDK\sample\depth_smoothing\depth_smoothing.sln
The project that accompanies this example is quite revealing in that there is very little code to distract and confuse you. It’s so small in fact that we can include it in this article:
int wmain(int argc, WCHAR* argv[])
{
    // render window and pipeline for the unsmoothed depth stream
    UtilRender rraw(L"Raw Depth Stream");
    UtilPipeline praw;
    praw.QueryCapture()->SetFilter(PXCCapture::Device::PROPERTY_DEPTH_SMOOTHING,false);
    praw.EnableImage(PXCImage::COLOR_FORMAT_DEPTH);
    if (!praw.Init())
    {
        wprintf_s(L"Failed to initialize the pipeline with a depth stream input\n");
        return 3;
    }

    // render window and pipeline for the driver-smoothed depth stream
    UtilRender rflt(L"Filtered Depth Stream");
    UtilPipeline pflt;
    pflt.QueryCapture()->SetFilter(PXCCapture::Device::PROPERTY_DEPTH_SMOOTHING,true);
    pflt.EnableImage(PXCImage::COLOR_FORMAT_DEPTH);
    pflt.Init();

    // pump both pipelines until both render windows are closed
    for (bool br=true,bf=true;br || bf;Sleep(5))
    {
        if (br) if (praw.AcquireFrame(!bf))
        {
            if (!rraw.RenderFrame(praw.QueryImage(PXCImage::IMAGE_TYPE_DEPTH))) br=false;
            praw.ReleaseFrame();
        }
        if (bf) if (pflt.AcquireFrame(!br))
        {
            PXCImage* depthimage = pflt.QueryImage(PXCImage::IMAGE_TYPE_DEPTH);
            if (!rflt.RenderFrame(depthimage)) bf=false;
            pflt.ReleaseFrame();
        }
    }
    return 0;
}
Thanks to the SDK, what could have been many pages of complicated device acquisition and GUI rendering code has been reduced to a few lines. The SDK documentation does an excellent job of explaining each line and there is no need to repeat it here, except to note that in this example the depth data is given directly to the renderer with no intermediate layer where the data can be read or manipulated. A better example for understanding how to add that intermediate layer is the camera_uvmap sample:
\PCSDK\bin\win32\camera_uvmap.exe
\PCSDK\sample\camera_uvmap\camera_uvmap.sln
Familiarizing yourself with these two examples, with the help of the SDK documentation, will give you a working knowledge of initializing and syncing with the camera device, reading and releasing the depth data image, and understanding the different channels of data available to you.
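Before moving on, it is worth seeing where the raw 16-bit values actually come from. The fragment below is a minimal sketch based on the acquisition pattern used in the camera_uvmap sample; any names not shown in this article's own listings (such as the exact AcquireAccess status check) should be treated as approximations to verify against the SDK documentation.

UtilPipeline pipeline;
pipeline.EnableImage(PXCImage::COLOR_FORMAT_DEPTH);
if (!pipeline.Init()) return 3;

// main acquisition loop: lock each depth frame, read it, release it
while (pipeline.AcquireFrame(true))
{
    PXCImage* depthimage = pipeline.QueryImage(PXCImage::IMAGE_TYPE_DEPTH);
    PXCImage::ImageData ddepth;
    if (depthimage->AcquireAccess(PXCImage::ACCESS_READ, &ddepth) >= PXC_STATUS_NO_ERROR)
    {
        // 320x240 array of 16-bit depth values from the Gesture Camera
        pxcU16* depthbuffer = (pxcU16*)ddepth.planes[0];

        // ... run your own depth technique here ...

        depthimage->ReleaseAccess(&ddepth);
    }
    pipeline.ReleaseFrame();
}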
Filtering the Data Yourself
Not to be confused with depth smoothing performed by the SDK/driver, filtering in this sense would be removing depth layers your application is not interested in. As an example, imagine you are sitting at your desk in a busy office with colleagues walking back and forth in the background. You do not wish your application to respond to these intrusions, so you need a way to block them out. Alternatively, you may only want to focus on the middle depth, excluding any objects in the foreground such as desktop microphones and stray hand movements.
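Before looking at the UVMAP plumbing, here is the core of the idea in isolation. The sketch below is my own illustrative helper rather than SDK code: it reads the raw 16-bit buffer and writes a new buffer in which anything outside a chosen near/far band is zeroed out, leaving only the depth layer of interest for later analysis. The clip values are assumptions you would find by experiment for your own camera and scene.

// Keep only depth values inside [nearClip, farClip]; everything else is
// written out as zero so later passes can ignore it. 'depth' is the
// 320x240 pxcU16 plane obtained from the SDK as shown earlier.
void BandPassDepth(const pxcU16* depth, pxcU16* out,
                   int width, int height,
                   pxcU16 nearClip, pxcU16 farClip)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            pxcU16 d = depth[y * width + x];
            out[y * width + x] = (d >= nearClip && d <= farClip) ? d : 0;
        }
    }
}

A single call such as BandPassDepth(depthbuffer, filtered, 320, 240, nearClip, farClip) would discard both a desktop microphone a few centimeters from the lens and colleagues walking past in the background, provided the clip values bracket the depth band you care about.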
The technique involves only a single pass, reading the smoothed depth data and writing out to a new depth data image for eventual rendering or analysis. First, let’s look at code taken directly from the CAMERA_UVMAP example:
int cwidth2=dcolor.pitches[0]/sizeof(pxcU32); // aligned color width
// dwidth2 (the aligned depth width) and uvmap (the UV map plane) are
// defined earlier in the camera_uvmap sample
for (int y=0;y<(int)240;y++)
{
    for (int x=0;x<(int)320;x++)
    {
        int xx=(int)(uvmap[(y*dwidth2+x)*2+0]*pcolor.imageInfo.width+0.5f);
        int yy=(int)(uvmap[(y*dwidth2+x)*2+1]*pcolor.imageInfo.height+0.5f);
        if (xx>=0 && xx<(int)pcolor.imageInfo.width)
            if (yy>=0 && yy<(int)pcolor.imageInfo.height)
                ((pxcU32 *)dcolor.planes[0])[yy*cwidth2+xx]=0xFFFFFFFF;
    }
}
As you can see, we have an image ‘dcolor’ for the picture image coming from the camera. Consider that the depth region is only 320x240 compared to the camera picture of 640x480, so the UVMAP reference array translates depth data coordinates to camera picture data coordinates.
The key element to note here is the nested loop that will iterate through every pixel in the 320x240 region and perform a few lines of code. As you can see, there is no depth data reading in the above code, only camera picture image writing via dcolor.planes[0]. Running the above code would produce a final visual render that looks something like this:
Figure 4: Each white dot in this picture denotes a mapped depth data coordinate
Modifying the example slightly, we can read the depth value at each pixel and decide whether we want to render out the respective camera picture pixel. The problem of course is that for every depth value that has a corresponding camera picture pixel, many more picture pixels are unrepresented. This means we would still see lots of unaffected picture pixels for the purpose of our demonstration. To resolve this, you might suppose we could reverse the nested loop logic to traverse the 640x480 camera picture image and obtain depth values at the respective coordinate.
Alas, there is no inverse UVMAP reference provided by the current SDK/driver, and so we are left to concoct a little fudge. In the code below, the 640x480 region of the camera picture is traversed, but the depth value coordinate is arrived at by creating an artificial UVMAP array that contains the inverse of the original UV references, so instead of depth data coordinates converted to camera picture image references, we have picture coordinates converted to depth data coordinates.
Naturally, there will be gaps in the data, but we can fill those by copying the depth coordinates from a neighbor. Here is some code that creates the reverse UVMAP reference data. It’s not a perfect reference set, but sufficient to demonstrate how we can manipulate the raw data to our own ends:
// 1/2 : fill picture UVMAP with known depth coordinates
if ( g_biguvmap==NULL )
{
    g_biguvmap = new int[640*481*2];
}
memset( g_biguvmap, 0, sizeof(int)*640*481*2 );
for (int y=0;y<240;y++)
{
    for (int x=0;x<320;x++)
    {
        int dx=(int)(uvmap[(y*320+x)*2+0]*pcolor.imageInfo.width+0.5f);
        int dy=(int)(uvmap[(y*320+x)*2+1]*pcolor.imageInfo.height+0.5f);
        g_biguvmap[((dy*640+dx)*2)+0] = x;
        g_biguvmap[((dy*640+dx)*2)+1] = y;
    }
}

// 2/2 : populate gaps in picture UVMAP, horizontal then vertical
int storex=0, storey=0, storecount=5;
for (int y=0;y<480;y++)
{
    for (int x=0;x<640;x++)
    {
        int depthx = g_biguvmap[((y*640+x)*2)+0];
        int depthy = g_biguvmap[((y*640+x)*2)+1];
        if ( depthx!=0 || depthy!=0 )
        {
            // a real mapping exists here; remember it for the gap fill
            storex = depthx; storey = depthy; storecount = 5;
        }
        else
        {
            if ( storecount > 0 )
            {
                // copy the last known depth coordinate into the gap
                g_biguvmap[((y*640+x)*2)+0] = storex;
                g_biguvmap[((y*640+x)*2)+1] = storey;
                storecount--;
            }
        }
    }
}
for (int x=0;x<640;x++)
{
    for (int y=0;y<480;y++)
    {
        int depthx = g_biguvmap[((y*640+x)*2)+0];
        int depthy = g_biguvmap[((y*640+x)*2)+1];
        if ( depthx!=0 || depthy!=0 )
        {
            storex = depthx; storey = depthy; storecount = 5;
        }
        else
        {
            if ( storecount > 0 )
            {
                g_biguvmap[((y*640+x)*2)+0] = storex;
                g_biguvmap[((y*640+x)*2)+1] = storey;
                storecount--;
            }
        }
    }
}
We can now modify the example to use this new reverse UVMAP reference data, and then limit which pixels are written to using the depth value as the qualifier:
// manipulate picture image
for (int y=0;y<(int)480;y++)
{
    for (int x=0;x<(int)640;x++)
    {
        int dx=g_biguvmap[(y*640+x)*2+0];
        int dy=g_biguvmap[(y*640+x)*2+1];
        pxcU16 depthvalue = ((pxcU16*)ddepth.planes[0])[dy*320+dx];
        if ( depthvalue>65535/5 )
            ((pxcU32 *)dcolor.planes[0])[y*640+x]=depthvalue;
    }
}
When the modified example is run, we can see that the more distant pixels are colored, while pixels whose depth values do not meet the added condition are left unaffected, allowing the original camera picture image to show through.
Figure 5: All pixels outside the depth data threshold are colored green
As it stands, this technique could act as a crude green screen chroma key effect, or it could separate out the color data for further analysis. Either way, it demonstrates how a few extra lines of code can pull out specific information.
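As a sketch of the second option, separating out the color data, the illustrative helper below (again my own code rather than SDK code, reusing the reverse UVMAP buffer and the 65535/5 threshold from the example above) copies only the near-enough color pixels into a separate buffer and blanks the rest, ready for whatever analysis comes next.

// Copy color pixels whose mapped depth value is nearer than 'threshold'
// into a separate 640x480 buffer; everything else becomes zero, which is
// treated here as "no data".
void ExtractForeground(const pxcU32* color, const pxcU16* depth,
                       const int* biguvmap, pxcU32* out, pxcU16 threshold)
{
    for (int y = 0; y < 480; y++)
    {
        for (int x = 0; x < 640; x++)
        {
            int dx = biguvmap[(y*640+x)*2+0];
            int dy = biguvmap[(y*640+x)*2+1];
            pxcU16 depthvalue = depth[dy*320+dx];
            out[y*640+x] = (depthvalue < threshold) ? color[y*640+x] : 0;
        }
    }
}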
Edge Detection
Thanks to many decades of research into 2D graphics, there is a plethora of papers on edge detection of static images, used widely in art packages and image processing tools. Edge detection for the purposes of perceptual computing requires that the technique be performance friendly and able to run against a real-time image stream. An edge detection algorithm that takes 2 seconds and produces perfect contours is of no use in a real-time application. You need a system that can find an edge within a single pass and feed the required data directly to your application.
There are numerous types of edges your application may want to detect, from locating where the top of the head is all the way through to defining the outline of any shape in front of the camera. Here is a simple method to determine the location of the head in real-time. The technique makes use of edge detection to determine the extent of certain features represented in the depth data.
Figure 6: Depth data with colored dots to show steps in head tracking
In the illustration above, the depth data has been marked with a number of colored dots. These dots will help explain the techniques used to detect the position of the head at any point in time. More advanced tracking techniques can be employed for more accurate solutions, or if you require multiple head tracking. In this case, our objective is fast zero-history, real-time head detection.
Our technique begins by identifying the depth value closest to the camera, which in the above case will be the nose. A ‘peace gesture’ has been added to the shot to illustrate the need to employ the previous technique of clipping any foreground objects that would interfere with the head tracking code. Once the nearest point has been found, the nose, we mark this coordinate for the next step.
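A minimal sketch of that first step might look like the following (plain C++ over the 320x240 buffer; the decision to skip zero values as 'no reading' is an assumption you should verify against your own camera's behavior):

// Find the coordinate of the valid depth pixel closest to the camera.
void FindNearestPoint(const pxcU16* depth, int width, int height,
                      int* nearestX, int* nearestY)
{
    pxcU16 best = 65535;
    *nearestX = width / 2;
    *nearestY = height / 2;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            pxcU16 d = depth[y * width + x];
            if (d > 0 && d < best) // zero is treated as "no reading"
            {
                best = d;
                *nearestX = x;
                *nearestY = y;
            }
        }
    }
}

In practice, you would run the earlier filtering step first so that stray foreground objects, like the peace gesture in Figure 6, cannot claim the nearest point.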
Scanning out from the nose position, we march through the depth data in a left to right direction until we detect a sharp change in the depth value. This tells us we have reached the edge of the object we are traversing. At this stage, be aware that IR interference could make this edge quite erratic, but the good news for this technique is that any sharp increase in depth value means we’re either very near or on the edge of interest. We then record these coordinates, indicated as green dots in Figure 6, and proceed to the third step.
Using the direction vector between the red and green dots, we can project out half the distance between the center of the object and the edge to arrive at the two blue dots as marked. From each of these coordinates, we scan downwards until a new edge is detected. The definition of a head, for the purposes of this technique, is that it must sit on shoulders. If the depth value drops suddenly, indicating a near object (i.e., a shoulder), we record the coordinate and confirm one side of the scan. When both left and right shoulders have been located, the technique reports that the head has been found. A few extra conditions are applied so that other types of objects do not trigger head detection, such as rejecting candidates whose shoulders sit too far down the image. The technique also depends on the user not getting too close to the camera, where the shoulders might lie outside of the depth camera view.
Once the existence of the head has been determined, we can average the positions of the two green dot markers to arrive at the center position of the head. Optionally, you can traverse the depth data upwards to find the top of the head, marked as a purple dot. The downside to this approach, however, is that hair does not reflect IR very well and produces wild artifacts in the depth data. A better approach to arriving at the vertical coordinate for the head center is to take the average Y coordinate of the two sets of blue dot markers.
The code to traverse the depth data as described above is relatively straightforward and very performance friendly. You can either read from the depth data directly or copy the contents to a prepared filtered buffer in a format of your choice.
// traverse from center to left edge
int iX=160, iY=120, found=0;
while ( found==0 )
{
    iX--;
    pxcU16 depthvalue = ((pxcU16*)ddepth.planes[0])[iY*320+iX];
    if ( depthvalue>65535/5 ) found=1;
    if ( iX<1 ) found=-1;
}
Naturally, your starting position would be at a coordinate detected as near the camera, but the code shown above would return the left edge depth coordinate of the object of interest in the variable iX. Similar loops would provide the other marker coordinates and from those the averaged head center position can be calculated.
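For completeness, here is a sketch of the downward shoulder scan and the final averaging step. It is illustrative only: the helper name, the reuse of the 65535/5 threshold from the sample, and the marker variable names are all assumptions.

// Scan downwards from one of the blue dot positions (bx, by) until the
// depth value indicates a near object again, which we take to be the
// shoulder. Returns the shoulder's y coordinate, or -1 if the scan runs
// off the bottom of the 320x240 image without finding one.
int FindShoulderY(const pxcU16* depth, int bx, int by)
{
    for (int y = by; y < 240; y++)
    {
        pxcU16 depthvalue = depth[y*320 + bx];
        if ( depthvalue<65535/5 ) return y; // near object found: shoulder
    }
    return -1; // no shoulder below this point
}

// Once both shoulders are confirmed, the head center is the average of
// the two green dot x coordinates, and its height is the average of the
// blue dot y coordinates:
// int headX = (leftGreenX + rightGreenX) / 2;
// int headY = (leftBlueY + rightBlueY) / 2;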
The above technique is a very performance friendly approach, but sacrifices accuracy and lacks an initial object validation step. For example, the technique can be fooled into thinking a hand is a head if positioned directly in front of the camera. These discoveries will become commonplace when developing depth data techniques, and resolving them will improve your final algorithm to the point where it will become real-world capable.
Body Mass Tracker
One final technique is included to demonstrate true out-of-the-box thinking when it comes to digesting depth data and excreting interesting information. It is also a very simple and elegant technique.
By using the depth value to decide how much each pixel coordinate contributes to a cumulative sum, you can arrive at a single coordinate that indicates, in general terms, where the user is in front of the camera. That is, when the user leans to the left, your application can detect this and provide a suitable coordinate to track them. When they lean to the right, the application will continue to follow them. When the user bows forward, this too can be tracked. Because the sample is taken across the entire frame, individual details like hand movements, background objects, and other distractions are absorbed into a ‘whole view average.’
The code is divided into two simple steps. The first averages the coordinates of all qualifying depth pixels to produce a single coordinate, and the second draws a dot onto the camera picture image render so we can see whether the technique works. When run, you will see the dot center itself around the activity of the depth data.
// find body mass center
int iAvX = 0;
int iAvY = 0;
int iAvCount = 0;
for (int y=0;y<(int)480;y++)
{
    for (int x=0;x<(int)640;x++)
    {
        int dx=g_biguvmap[(y*640+x)*2+0];
        int dy=g_biguvmap[(y*640+x)*2+1];
        pxcU16 depthvalue = ((pxcU16*)ddepth.planes[0])[dy*320+dx];
        if ( depthvalue<65535/5 )
        {
            iAvX = iAvX + x;
            iAvY = iAvY + y;
            iAvCount++;
        }
    }
}
if ( iAvCount>0 ) // guard against frames with no qualifying pixels
{
    iAvX = iAvX / iAvCount;
    iAvY = iAvY / iAvCount;

    // draw body mass dot
    for ( int drx=-8; drx<=8; drx++ )
        for ( int dry=-8; dry<=8; dry++ )
            ((pxcU32*)dcolor.planes[0])[(iAvY+dry)*640+(iAvX+drx)]=0xFFFFFFFF;
}
In Figure 7 below, notice the white dot that has been rendered to represent the body mass coordinate. As the user leans right, the dot follows the general distribution by smoothly floating right; when he leans left, the dot smoothly floats to the left, all in real time.
Figure 7: The white dot represents the average position of all relevant depth pixels
These are just some of the techniques you can try yourself, as a starting point to greater things. The basic principle is straightforward and the code relatively simple. The real challenge is creating concepts that go beyond what we’ve seen here and re-imagine uses for this data.
You are encouraged to experiment with the two examples mentioned and insert the above techniques into your own projects to see how easy it can be to interpret the data in new ways. It is also recommended that you regularly apply your findings to real-world application control so you stay grounded in what works and what does not.
4. Tricks and Tips
Do’s
- When you are creating a new technique from depth data to control your application in a specific way, perform as much user testing as you can. Place your family and friends in front of your application and see if it responds as expected. Also try changing your environment, such as moving the chair, rotating your Ultrabook™ device, switching off the light, and playing your app at 6 AM.
- Be aware that image resolutions from a camera device can vary, and the depth resolution can be different from the color resolution. For example, the depth data size used in the above techniques is 320x240 and the color is 640x480. The techniques used these fixed regions to keep the code simple. In real-world scenarios, the SDK can detect numerous camera types with different resolutions for each stream. Always detect these dimensions and feed them directly to your techniques, as sketched below.
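As a minimal sketch of that last point (the structure and names below are illustrative, and the real dimensions should come from the SDK rather than being hardcoded), the earlier loops can be written against dimensions that are passed in once and reused everywhere. Here is the body mass average from the previous section, computed directly in depth coordinates:

// Hold the detected stream dimensions in one place instead of scattering
// 320/240/640/480 literals through every technique.
struct StreamDims
{
    int depthWidth;
    int depthHeight;
    int colorWidth;
    int colorHeight;
};

// The body mass average rewritten against detected dimensions.
void FindBodyMass(const pxcU16* depth, const StreamDims& dims,
                  pxcU16 threshold, int* outX, int* outY)
{
    int sumX = 0, sumY = 0, count = 0;
    for (int y = 0; y < dims.depthHeight; y++)
    {
        for (int x = 0; x < dims.depthWidth; x++)
        {
            if (depth[y * dims.depthWidth + x] < threshold)
            {
                sumX += x; sumY += y; count++;
            }
        }
    }
    *outX = (count > 0) ? sumX / count : dims.depthWidth / 2;
    *outY = (count > 0) ? sumY / count : dims.depthHeight / 2;
}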
Don’ts
- Until depth data resolution matches and exceeds camera color resolution, the reverse UVMAP reference technique noted above cannot be relied on to produce 100% accurate depth readings. With this in mind, avoid applications that require a perfect mapping between color and depth streams.
- Avoid multi-pass algorithms whenever possible and use solutions that traverse the depth data in a single nested loop. Even though 320x240 may not seem a significant resolution to traverse, it only takes a few math divisions within your loop code to impact your final application frame rate. If a technique requires multiple passes, check to see if you can achieve the same result in a single pass or store the results from the previous pass to use in the next cycle.
- Do not assume the field of view (FOV) for the color camera is the same as the depth camera. A common mistake is to assume the FOV angles are identical and simply divide the color image coordinate by two to get the depth coordinate. This will not work and will result in a disparity between your color and depth reference points.
- Avoid streaming full color, depth, and audio from the Creative Gesture Camera at full frame rate when possible, as this consumes a huge amount of bandwidth that impacts overall application performance. An example is detecting gestures while detecting voice control commands and rendering the camera stream to the screen at the same time. You may find voice recognition fails in this scenario. If possible, when voice recognition is required, deactivate one of the camera image streams to recover bandwidth.
- Do not assume the depth data values are accurate. By their very nature they rely on IR signal bounces to estimate depth, and some material surfaces and environmental agents can affect this reading. Your techniques should account for an element of variance in the data returned.
5. Advanced Concepts
The techniques discussed here are simplified forms of the code you will ultimately implement in your own applications. They provide a general overview and starting point for handling depth data.
There are a significant number of advanced techniques that can be applied to the same data, some of which are suggested below:
(a) Produce an IK skeleton from head and upper arms
(b) Sculpt more accurate point data from a constant depth data stream
(c) Gaze and eye detection using a combination of depth and color data
(d) Detect the mood of the user by associating body, hand, and head movements
(e) Predict which button Bob is going to press before he actually presses it
(f) Count the number of times Sue picks up her tea cup and takes a sip
It is apparent that we have really just scratched the surface of what is possible with the availability of depth data as a permanent feature of computing. As such, the techniques shown here merely hint at the possibilities, and as this sector of the industry matures, we will see some amazing feats of engineering materialize from the very same raw depth data we have right now.
Since the mouse and pointer were commercialized, we’ve not seen such a fundamentally new input medium as Perceptual Computing. Touch input had been around for two decades before it gained widespread popularity, and it was software that finally made touch technology shine. It is reasonable to suppose that we need a robust, predictable, fast, and intuitive software layer to complement the present Perceptual Computing hardware. It is also reasonable to expect that these innovations will occur in the field, created by coders who are not just solving a specific application issue but are, in fact, contributing to the arsenal of weaponry Perceptual Computing can eventually draw from. The idea that your PC or Ultrabook can predict what you want before you press or touch anything is the stuff of fiction now, but in a few years’ time it may not only be possible but commonplace in our increasingly connected lives.
About The Author
When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and surrounding community of game makers are responsible for many popular brands including Dark Basic, FPS Creator, and most recently App Game Kit (AGK).
The application that inspired this article and the blog that tracked its seven week development can be found here: http://ultimatecoderchallenge.blogspot.co.uk/2013/02/lee-going-perceptual-part-one.html
Lee also chronicles his daily life as a coder, complete with screen shots and the occasional video here: http://fpscreloaded.blogspot.co.uk
Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of third-party vendors and their devices. For optimization information, see software.Intel.com/en-us/articles/optimization-notice/. All products, dates, and plans are based on current expectations and subject to change without notice. Intel, the Intel logo, and Ultrabook are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 2013. Intel Corporation. All rights reserved.