State of the State of the Art in ML and AI — Q4 2019

The goal of this post is to provide a rundown of the immediately usable and new developments in machine learning and artificial intelligence.

At Pioneer Square Labs, we are often starting companies that use recently-developed technology. The idea here is to balance several principles that make new developments usable:

  • A result too far into the “research” arena will be too slow to develop, carry too much technical risk, and often be somewhat uncertain even in the literature. Therefore we will look for results with usable code implementations that are new but can be leveraged on other data sets.
  • A result that is well established and has been around for many years will already be implemented and is unlikely to drive innovation by itself. Therefore we will look for the latest new concepts and implementations possible. Ideally these will have just been demonstrated in the last few months.
  • A result that only works on the one data set it is trained on is unlikely to make a good product without significant effort. Therefore we will attempt to estimate the breadth of a new technique and comment on how robust it seems to be.

Before we get into the specific developments, this might be the most entertaining video you’ll see on reinforcement learning for a long while, and at around 3 minutes it’s well worth it. This showcases an important point, and one that will come up again - the emergent and unexpected behaviors that can come from training modern algorithms. 

Reinforcement learning in action.

The first half of 2019 has been an amazing time in the data science world. Let’s dig into some of the newest developments:

Voice generation, Voice Coding, Text-to-Speech, etc

A lot of voice generation work is focused on two tasks: voice recognition (via phonemes, full-word recognition and other techniques), and speech generation. So far, efforts to combine these in real time (directly turning one voice into another via style transfer) haven’t been very successful on arbitrary voices, though some results have been seen with limited sets of voices. But the work on generation has gotten very good indeed. You may recall recent developments like Samuel L. Jackson’s voice becoming a usable add-on to Alexa. This practice isn’t strictly new; the process has been a work in progress for quite some time. Generally the old way took a huge amount of high-quality real audio, and sometimes even relied on pre-recorded phrases rather than arbitrary generation. The latest stuff is getting pretty good though - a very recent approach lets you “clone” anyone’s voice based on just seconds of arbitrary audio, and then use an existing text-to-speech generator to make that voice say whatever you would like. The discerning reader may ask “is someone already spinning out a company to take advantage?” And I do know of one - Lyrebird, which says they can do just that, then give you a speech tool with your own (or presumably, someone else’s) voice.

The chief difficulty in this field (as I see it) is that no one has made a usable “voice changer” that turns one person’s real voice into another’s in real time - so far they all have the issue of the slightly “unreal” cadence that the TTS generator provides. Some approaches have been made using cyclic GAN structures on Mel-spectrograms of voices, but results aren’t ready for real use yet without significant further research. Stay tuned!

NLP developments

By now almost everyone has heard of GPT-2, the text generator that was “too dangerous to release as a trained model.” Well, OpenAI has been more or less releasing it anyway, in stages. Just now, GPT-2 “Extra Large” is out - significantly more complex and capable than all of the models before. This is their “full model” of which we have previously seen only the results. But really, this new explosion of text generation is due to a novel structure in the deep learning world - the “Transformer”. You can read much more about it here, but in brief, the transformer structure fixed some of the issues with the older ways of doing text translation - namely, it allows “remembering” longer-term relationships within and between sentences (i.e. words and phrases that may be very far apart) and it has some advantages in processing power, so it can be trained on very large corpora of text data and produce very large and complex models. That said, there are several good text generation tools that are “ready to use”, and right here you can find a bunch of fun working examples, including the biggest available GPT-2.
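To make the “far apart” point concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation of the transformer. It is plain Python and a toy: in a real model the queries, keys and values are learned projections of the token embeddings, whereas this sketch assumes they are just the raw token vectors. The function names are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence.

    Each argument is a list of equal-length vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j, no matter how far apart they sit.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        mixed = [sum(w[j] * values[j][d] for j in range(len(values)))
                 for d in range(len(values[0]))]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights

# Toy sequence (assumption: raw vectors stand in for learned projections).
# Tokens 0 and 3 are similar, with two unrelated tokens between them.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

In this toy, token 0 puts as much weight on the distant token 3 as on itself, and less on its immediate neighbor. Distance in the sequence plays no role in the score, which is what lets the structure relate words that are far apart.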

The upshot: a lot of good text generation is possible, but it still requires significant human interaction to be usable. I have been tinkering with a method I call “Reductive Iterative Human-In-the-Loop” (RIHIL), which is really just a fancy way of saying “let a human add lots of text using a generator but then only keep the stuff that seems relevant in context and rewrite/override continuously.” Check out this example I made with GPT-2 by writing a prompt and hammering on the tab key until I found things that sounded approximately right. GPT-2 text is highlighted:

There are really only two types of scientists. One is a scientific scholar and the other is an ideologist. The former has to be a scientist in his or her field. The ideologists have to be experts at understanding and applying mathematics and science. The science scholar can only do this by virtue of being fluent in an area of mathematics and science; the ideologists have to be experts at understanding and applying mathematics and science.

Sounds almost-right, doesn’t it? It could be the start of a blog post at least, and I created this in about 60 seconds. Food for thought.
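For the curious, the RIHIL loop can be sketched in a few lines. This is only an illustration of the idea, not the tooling I actually used: `rihil`, `fake_generate` and `fake_accept` are hypothetical names, the generator is a canned stand-in for a model like GPT-2, and the “human” is a simple rule instead of a person at a keyboard.

```python
def rihil(prompt, generate, accept, max_rounds=5):
    """Grow a text by proposing continuations and keeping only accepted ones.

    generate(text) -> list of candidate continuations (stand-in for a model)
    accept(text, candidate) -> the string to keep (possibly rewritten by the
                               human), or None to reject the candidate
    """
    text = prompt
    for _ in range(max_rounds):
        kept = None
        for candidate in generate(text):
            kept = accept(text, candidate)
            if kept is not None:
                break  # first acceptable candidate wins this round
        if kept is None:
            break  # nothing relevant was offered: stop generating
        text = text + " " + kept
    return text

# Hypothetical stand-ins so the sketch runs without a model or a person.
def fake_generate(text):
    return ["Bananas are yellow.",
            "One is a scholar and the other is an ideologist."]

def fake_accept(text, candidate):
    # "Human" rule: keep on-topic sentences that are not already in the text.
    if "scholar" in candidate and candidate not in text:
        return candidate
    return None

result = rihil("There are really only two types of scientists.",
               fake_generate, fake_accept)
print(result)
```

The “reductive” part is the `accept` step: most of what the generator offers gets thrown away, and the human can return a rewritten string instead of the raw candidate.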

Image/Video analysis techniques

By now, you’re probably familiar with the GAN models used to make faces, and how the latest stuff can produce images that are photorealistic and malleable. So what’s next in the GAN world? Well, some recent research has used a similar trick, only to produce 3D face models. These models then require some rendering scheme/color mapping to make them look correct, but one big step forward has been made here - this approach also leverages a database of facial expressions to allow coupling of desired expression with the 3D model.

Looks pretty good, but even though the code is proclaimed to be out there, it is unclear to me whether there is an accessible working implementation.

Here’s one more that uses the Variational Autoencoder technique, a bit like a GAN, but with some additional niceties, such as a built-in encoder stage to take existing faces and put them quickly into the latent space. While GANs have been producing the best face-generation results lately, the VAE is also a contender for “top performance.” So stay tuned - there are a lot of “competitors” coming out of the research space, and many of them look pretty promising. 

A network reconstructing faces using the Variational Autoencoder technique.

Speaking of GANs, a lot of people have tried out dynamic image manipulation. The idea here is to take an existing image and tweak parts of it, maintaining realism while varying the specifics in one area - basically AI repainting. Conceptually you would be able to, say, remove a car and have the algorithm fill in those parts of the image with what it guesses would be there without the car. Or, say, expand someone’s actual lawn in a picture of their house to show them what additional landscaping could look like. There have been some more developments in these areas. To be honest, they have some of the same limitations as style-transfer techniques - they are pretty good at changing small-scale structures and textures, and not so good with larger or more complex structures. Take a look at the demo of this one and try:

  • Painting “grass” along the ground (usually decent results)
  • Adding a new door somewhere (very marginal results)

So, this may not be quite ready for use yet, but it points the way toward a real “semantic approach” to automatic image generation.

You may well ask: if this technique can do local textures and small structures, and has a little trouble adding large structure by itself, what if we made a version that took user input for all the large structure? Well, someone has. Even better, there’s a fun and usable demo you can try out right here. GauGAN is a lot of fun to play around with:

The last thing in this category is a VR advancement. Really, it is too new, and if it has an implementation, it’s likely Facebook will keep that for its own use. However, this is too important not to bring up, as it takes a big leap toward the general acceptance of VR meetings in lieu of in-person interaction. Facebook seems to have solved the human-expression problem. Remember Snow Crash? If not, it’s a worthy read. If so, you may remember the prophetic vision that VR-based alternate reality will require high-fidelity human face representation, so that we will “believe the interaction” on a visceral level. I’ve been waiting for someone to pair in-visor face tracking with generation, and it looks very near now.

Reinforcement learning/automation techniques

If you’ve made it this far, here’s a video of a tiny t-rex learning to play soccer. No, seriously though, a lot of frameworks for reinforcement learning have appeared recently. In particular, we are approaching a milestone I have been calling “skill abstraction,” in which a robot can be trained to do smaller-scale movements, and then more complex movements can be constructed by a second algorithm that learns to put the small movements together. It is my belief that all sufficiently complex autonomous systems will eventually do this, training for subtasks and then “abstracting” those behaviors to make more complex control schemes. Here’s one example of that - this system combines pre-trained motions to attempt to mimic arbitrary motion-captured video frames. Interesting all around!
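Here is a toy sketch of what I mean by “skill abstraction.” It is an illustration under heavy assumptions, not the system in the example above: the low-level skills are hard-coded grid moves standing in for pre-trained policies, and the high-level controller is a simple greedy search rather than a second learned algorithm. All names are hypothetical.

```python
# Low-level "skills". In a real system each would be a pre-trained policy;
# here they are hard-coded moves on a 2D grid (an assumption for the sketch).
SKILLS = {
    "step_right": lambda pos: (pos[0] + 1, pos[1]),
    "step_left":  lambda pos: (pos[0] - 1, pos[1]),
    "step_up":    lambda pos: (pos[0], pos[1] + 1),
    "step_down":  lambda pos: (pos[0], pos[1] - 1),
}

def distance(a, b):
    """Manhattan distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def compose_skills(start, goal, max_steps=50):
    """High-level controller: chain low-level skills to reach the goal.

    Greedy stand-in for a learned controller. At each step it picks the
    skill whose outcome lands closest to the goal. It only chooses between
    skills by name and never looks at how a skill is implemented.
    """
    pos, plan = start, []
    for _ in range(max_steps):
        if pos == goal:
            break
        name = min(SKILLS, key=lambda n: distance(SKILLS[n](pos), goal))
        pos = SKILLS[name](pos)
        plan.append(name)
    return plan, pos

plan, final = compose_skills((0, 0), (2, -1))
print(plan, final)
```

The point of the split is that the controller works in the space of named skills rather than raw motor commands, so swapping in better low-level skills does not require changing the layer above.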

And then... sometimes it’s just used to be cruel to imaginary animals. Don’t show this one to Skynet.

The big takeaway from this in late 2019 is really the proliferation of systems that allow quick model training. This used to be pretty tough - the frameworks were complex and specific, required simulations that were hard to adapt to the training stage, and so on. But many platforms coming out now allow much faster, much more flexible training. It may be worth considering what tasks we would like to automate, and what areas may have just become feasible.