In September of 2020, I spent some time scripting augmented reality “lenses” for Snapchat using their Lens Studio software. I had seen an announcement that they allowed developers to import their own Tensorflow/Torch models for tasks like style transfer, image segmentation, and binary classification. I noticed that all of the official examples used image inputs. I decided to explore the potential for building models with non-image inputs. The short answer: it is possible, but I had to trick the software into thinking the inputs were images.
I ended up building a lens that produces the effect shown above. First, I trained an autoencoder on the classic [MNIST hand-written digits dataset](https://en.wikipedia.org/wiki/MNIST_database), then exported the decoder part of the model in ONNX format. Then, I used hand tracking to interpolate through the model’s latent space and watch as digits morphed into each other.
Pros and Cons of developing for Snapchat lenses
- Pros
    - Easy to test and deploy for a webcam or mobile device
    - Easy to share with others, though they must be Snapchat users
    - The separate SnapCamera app lets you use a Snapchat lens as a virtual webcam, so you can integrate with Zoom or Google Meet.
- Cons
    - No Linux distribution for Lens Studio
    - The scripting API isn’t open source
    - Documentation and community support are less than stellar
Scripting in Lens Studio
Lens Studio enables developers to write custom scripts using a scripting format built atop JavaScript. I’m going to cover some of the big ideas here, but see the official documentation for the full details.
Scripts are implemented through the UI and attached to camera layers as effects. We can expose parameters that get set in the UI using a special syntax inside a JS comment. For example, here I set up some inputs for the hand tracking part of the lens:

```javascript
// @input Component.ScreenTransform screenTransform
// @input Component.ObjectTracking hand
```
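Each `@input` declared this way shows up as a property on the global `script` object, which is how the rest of the code gets at it. A minimal sketch (the null check and the `print` call are just illustrative):

```javascript
// Sketch: reading the UI-configured inputs inside the script.
// Assumes the two @input lines above, so script.hand and
// script.screenTransform are populated from the Lens Studio panel.
if (script.hand && script.screenTransform) {
    // ObjectTracking exposes its scene object's transform via getTransform()
    var handTransform = script.hand.getTransform();
    print("Hand world position: " + handTransform.getWorldPosition());
}
```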
Importing other scripts
I wanted to organize my code into multiple scripts, but Snapchat didn’t make this very easy. Imports are handled partially within the UI `@input` syntax described above, so we have to specify the right script files in a point-and-click panel. We import a script as a `Component.ScriptComponent` object. Then, we access an `api` object within that `ScriptComponent`, which gives us access to functions and other objects named within that helper script. This looks something like the following:
```javascript
// @input Component.ScriptComponent lib
var api = script.lib.api;
var getXY = api.getXY;
```
So `lib` ends up inside a global object called `script`, and `getXY` is a function I defined in `lib.js`.
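On the other side, the helper script exposes things by attaching them to its own `script.api` object (the actual `getXY` defined this way appears in the next section). A minimal sketch with a hypothetical helper function:

```javascript
// lib.js -- sketch of how a helper script publishes functions.
// Anything attached to script.api here becomes visible to the
// importing script as script.lib.api.<name>.
script.api.clamp01 = function (value) {
    // Hypothetical utility: clamp a number into the [0, 1] range.
    return Math.max(0, Math.min(1, value));
};
```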
World/Local/Screen space
My Snapchat lens tracks my hand location within a rectangle at the center of the screen, and that location becomes an input for the MNIST decoder. Here are the basic steps involved in that process:
1. Import an `ObjectTracking` object from the scripting API. This object gives us a location in `(X, Y)` pixel space on the screen, based on a built-in method like hand tracking or face tracking.
2. Turn this screen-space location into world coordinates with `transform.getWorldPosition()`.
3. Turn those world coordinates into new screen-space coordinates with `screenTransform.worldPointToLocalPoint`.
```javascript
script.api.getXY = function(hand, screenTransform){
    var transform = hand.getTransform();
    var position = transform.getWorldPosition();
    var screenPosition = screenTransform.worldPointToLocalPoint(position);
    return screenPosition;
};
```
To clarify, this “world position” exists because Lens Studio also lets us create objects in 3D space, which we then view with a camera. This is useful for 3D effects or AR effects that place objects in the real world. Something might be at (0,0) in world space, but it is projected into screen space by a camera pointing at it. That camera might have the object centered or off-center. It might view the object from the front, back, etc. In effect, there is a camera projection necessary for converting the world coordinates to screen coordinates, and that’s what `worldPointToLocalPoint` is doing.
Why do we have to use `worldPointToLocalPoint` in the first place? The `ObjectTracking` object already gives us screen-space coordinates. Again, I wanted to confine the model inputs to a small box in the center, so I couldn’t use coordinates from the whole screen. To set up the coordinate system for the box, we import a `screenTransform` object through the UI. For some reason, this object can only project world points, so we have to convert the `ObjectTracking` object’s output into world coordinates before we can convert it back to screen coordinates.
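Putting the pieces together, the per-frame lookup ends up looking roughly like the sketch below. This is not the exact lens code: the `[-1, 1]` local range and the remap to `[0, 1]` are my assumptions about how the box-relative coordinates get normalized before being fed to the decoder.

```javascript
// Sketch: get the hand position relative to the center box each frame.
// Assumes getXY from lib.js and the two @input objects from earlier.
function getNormalizedHandPosition() {
    var local = getXY(script.hand, script.screenTransform);
    var clamp01 = function (v) {
        return Math.max(0, Math.min(1, v));
    };
    return {
        x: clamp01((local.x + 1) / 2), // assumed: left edge of box -> 0, right edge -> 1
        y: clamp01((local.y + 1) / 2)  // assumed: same mapping vertically
    };
}
```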
Drawing images in Lens Studio
The MNIST autoencoder lens needed to output an image, but Lens Studio doesn’t let us create a new image within a script. Instead, we need to create a placeholder image in the UI, then modify the pixels within the script. To do this, we create a `ProceduralTextureProvider` in the script and assign it to our image. Here is what some of that looks like:
```javascript
// Constants
var width = 28;
var height = 28;
var channels = 4; // RGBA

var newTex = ProceduralTextureProvider.create(width, height, Colorspace.RGBA);
var newData = new Uint8Array(width * height * channels);

//-------Draw decoded digit in box------
function drawDigit(data, targetTexture, targetImage){
    for (var y = 0; y < height; y++) {
        for (var x = 0; x < width; x++) {
            // Calculate index
            var index = (y * width + x) * channels;
            var dataIndex = ((height - y) * width + x);
            // Set R, G, B, A
            var color = Math.min(255, 100 + (data[dataIndex] * 255));
            newData[index] = 255;
            newData[index + 1] = 255;
            newData[index + 2] = 255;
            if (color < 200) {
                newData[index + 3] = 0;
            } else {
                newData[index + 3] = 255;
            }
            if (y < 1) newData[index + 3] = 0;
        }
    }
    targetTexture.control.setPixels(0, 0, width, height, newData);
    targetImage.mainPass.baseTex = newTex;
}
```
I haven’t gotten into this part yet, but we have a Tensorflow model outputting decoded images and writing them into a `Uint8Array` called `data`. The model outputs a grayscale image, and I wanted to increase contrast, so I did a simple threshold operation. If the gray value was above 200 (where 255 is totally white), it becomes 255 and fully opaque; otherwise the pixel becomes transparent. To do this, I define a new `Uint8Array` called `newData` and assign it new values based on what we see in `data`. Then, we set the pixels of our texture object based on `newData`, and set the content of our image based on our texture object.
Tensorflow models in Lens Studio
Finally, let’s talk about how defining the Tensorflow model works. Lens Studio expects us to use models that operate on image tensors as inputs. Note that it expects these inputs in the Torch-style `(channel, X, Y)` format, whereas Tensorflow typically uses `(X, Y, channel)`. The inputs must conform to this image-style layout, so:
- Inputs must be 3-dimensional
- The first dimension is channel: for instance, we would have 4 channels for an RGBA image, or 1 channel for a grayscale image without transparency.
- The first dimension must equal 1, 3, or 4, if I’m not mistaken, since these are appropriate numbers of channels for images.
My model just needs two input values, which are the X and Y position of the hand within the bounding box. To make this work with Lens Studio, I gave my model an input shape of `(1, 2, 1)`, which would be a grayscale image 2 pixels across and 1 pixel tall.
Now comes the part where we build the model object in the script. Let’s see what the code looks like:
```javascript
// Machine Learning Model: Input tensor must be named 'x'
// @input Asset.MLAsset model

// Build Tensorflow Model
var mlComponent = script.sceneObject.createComponent('MLComponent');
mlComponent.model = script.model;

var inputBuilder = MachineLearning.createInputBuilder();
inputBuilder.setName("x"); // Needs to match the input name from Tensorflow
inputBuilder.setShape(new vec3(1, 2, 1)); // Required number of dimensions for Lens Studio
var inputPlaceholder = inputBuilder.build();

var outputBuilder = MachineLearning.createOutputBuilder();
outputBuilder.setName('Identity'); // Needs to match the output name from Tensorflow
outputBuilder.setShape(new vec3(1, 1, 784));
outputBuilder.setOutputMode(MachineLearning.OutputMode.Data);

mlComponent.onLoadingFinished = onLoadingFinished;
mlComponent.onRunningFinished = onRunningFinished;

var outputPlaceholder = outputBuilder.build();
mlComponent.build([inputPlaceholder, outputPlaceholder]);

//-------MLComponent Callbacks------
function onLoadingFinished(){
    mlInput = mlComponent.getInput("x");
    inputData = mlInput.data;
    inputData[0] = getXY()['x'];
    inputData[1] = getXY()['y'];
    mlComponent.runScheduled(true,
        MachineLearning.FrameTiming.OnRender,
        MachineLearning.FrameTiming.None);
}

function onRunningFinished() {
    // process output
    var outputData = mlComponent.getOutput("Identity");
    data = outputData.data;
    drawDigit(data, newTex, script.image);
    var interpolated = getInterpolatedInput(getXY()['x'], getXY()['y']);
    inputData[0] = interpolated['x'];
    inputData[1] = interpolated['y'];
    setDisplayText(interpolated);
}
```
There are a number of preliminary steps to building a machine learning component in Lens Studio, and based on the documentation I’m not exactly sure how they differ. Basically:
- We create an `mlComponent` object and assign our model to it. This is also where we will specify some callback functions, but more on that later.
- We create an `InputBuilder` and an `OutputBuilder` object. This is where we specify the input and output shapes as well as the names of the input and output tensors in our imported model. In the `OutputBuilder`, we also specify the `outputMode`. For this model, the mode is `MachineLearning.OutputMode.Data` because we want to output an array we can access in the script.
- We assign some callback functions to the `mlComponent` to fetch and process inputs, run the model, and save the outputs. These are `onLoadingFinished` and `onRunningFinished`. The first one is run once, when the model is loaded. The next one is run every time the model completes a forward pass. More on these later.
- We call `InputBuilder.build()` to create an `InputPlaceholder` object, and similarly we call `OutputBuilder.build()` to create an `OutputPlaceholder`.
- We call `mlComponent.build([inputPlaceholder, outputPlaceholder])` to compile the final object.
MLComponent Callbacks
A bit more on our callback functions:
We have to do a few things in our `onLoadingFinished` function. We set up the values for an initial run of the model, then set up when and how often the model will run. Here, I call the method `mlComponent.runScheduled` and specify that I want this model to run every time a frame is rendered, which is after the user camera is captured and the hand tracking is measured.
In the `onRunningFinished` function, we capture the model outputs and set up the inputs for the next run of the model. Here we have to call `mlComponent.getOutput` to get an `OutputData` object. Then, `outputData.data` is the actual array of pixel values we want. We use these to draw an image. Then we get the new hand tracking position and change the values of our `inputData` array. Note here that to call the model repeatedly, we repeatedly modify an input array rather than calling the model on a new array. I’m also running this `setDisplayText` function, which creates the text display you see in the lens effect.
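The post doesn’t show `getInterpolatedInput` itself, so here is a minimal sketch of how I’d expect it to work, assuming it simply eases the current decoder input toward the latest hand position so the digits morph smoothly rather than jumping. The module-level state and the smoothing factor are my assumptions.

```javascript
// Sketch of a possible getInterpolatedInput: ease the decoder input
// toward the latest hand position instead of jumping straight to it.
// The starting values and the 0.1 smoothing factor are assumptions.
var currentInput = { x: 0.5, y: 0.5 };
var SMOOTHING = 0.1;

function getInterpolatedInput(targetX, targetY) {
    currentInput.x += SMOOTHING * (targetX - currentInput.x);
    currentInput.y += SMOOTHING * (targetY - currentInput.y);
    return { x: currentInput.x, y: currentInput.y };
}
```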