Introducing CodeBrew: Our quest for the mythical AI tool that turns designs into code

Here at PSL engineering, we've been exploring how to use AI/ML to build next-generation developer tools. Today, we're excited to share CodeBrew, our exploration of how ML can be used to quickly move teams like ours from visual designs to running code. If you don’t feel like reading about how we built this (and about our process for quickly validating ideas like this), just check out the video because, to be honest, it’s a pretty cool demo.

At PSL, we work side-by-side with some talented designers. And, as we turn ideas into companies, we have a pretty solid toolset:

A big part of my day-to-day work is taking these designs and converting them into code. And like most developers, I’m lazy and want to automate as much of my job as possible. However, I know that this automatic design-to-code landscape is littered with countless failed companies over the years. Why did they fail? Two main reasons.

  1. The code it produced was bad and you should feel bad using it

The most common problem with one-click design-to-code tools is that they produce awful code that might technically work but would not be maintainable over time. Here’s a real example from a well-known tool (MS Word). Have some text that you want to be bold? Great, just right click, save as HTML and…well… 

to be fair, this isn't much worse than most web code from 2002

Will this produce a web page with bold text? Sure! Will looking at your code in the future make you want to cry tears of deep shame? Yes, yes, it will. And besides being a developer badge of dishonor, there are real business costs to incurring tech debt. Everything takes longer to ship, everything is buggier and harder to test, recruiting/retaining quality engineers is harder, etc.

  2. The code it produced was fine! But I’m a hipster and it didn’t fit my bespoke choices of language/framework/test suite/etc.

OK, fine. Cherry-picking worst-case examples from tools built 20 years ago is not very fair. As part of this exploration I tried out a bunch of modern design-to-code generation tools and some of them worked quite well, such as this Figma to Code plugin and this popular open-source plugin from builder.io. The output was relatively clean-looking modern HTML. 

Which is fine, except I don’t write HTML, I use JavaScript as my front-end language of choice. Which again is fine, because there are tools that will output JavaScript.

But…I don’t actually use vanilla JavaScript, I use TypeScript. With React. And again there are tools that output React AND TypeScript.

 Except…yeah, is that inline CSS for styling? Did I mention I’m a hipster who prefers using the Tailwind utility class framework? And what about my tests? And all of my quirky and draconian formatting and linting rules? And…do I see tabs instead of spaces in the output?!?

Source: https://me.me/i/19258175

The reality is, unless the code generated matches the languages, frameworks, and tools that I use, it’s probably going to take more time to update the generated code than it’s worth. And with all the various combinations that each developer uses, you quickly get an exponential number of code patterns that need to be supported (and added to) over time. It’s just a really hard problem to solve. 

So why did I, a single, mediocre-at-best developer with a time-boxed limit of two weeks, try to tackle one of the most well-trodden and intractable problems out there? The reason is an innocuous-looking little code editor plugin I’ve been using called Copilot, by GitHub and OpenAI.

A little backstory you can skip if you are familiar with GPT-3

Back in 2020, OpenAI released the first version of GPT-3, a new deep learning language generator. At the time, our team did a bit of exploration and it felt like giving a 12-year-old a Ferrari. It looked really cool, and you just knew that in the future it was going to be awesome, but there just wasn’t much that could be done with it today. Then about six months ago, GitHub partnered with OpenAI to create Copilot, a simple auto-complete plugin for Visual Studio Code, the IDE of choice for the PSL engineering team. Copilot takes the basic model used for GPT-3, only instead of being trained on text, it’s trained on billions of lines of code. And unlike text, code follows more concrete patterns, so it does a much better job of not going completely off the rails when trying to provide suggestions.

In fact, the more I used Copilot, the more I started to love it. I found myself adjusting my coding approach to coax better suggestions from Copilot (Is the computer teaching me or am I teaching the computer, hmmm). And it was fun! Throughout my day I’d get a little moment of joy when Copilot would provide a particularly difficult completion. I’d shake my head and say out loud “Oh Copilot, you’re right, that’s exactly what I wanted!” more times than I’d care to admit. It wasn’t perfect, but it probably made me about 10% more efficient, which is a pretty big win given the amount of coding I do on a daily basis.

Given my (and my coworkers’) positive experience with Copilot, I started to think that maybe this would be the key to creating my mythical design-to-code tool. Copilot is built on top of an OpenAI API called Codex, which I had access to as part of a private beta. Since I had a limited amount of time, this seemed like a good candidate to use for my design-to-code exploration.

Enough with the talking, let’s see some code

Now that I had the “magic code creator” box checked, I needed to come up with a reasonably scoped prototype. My plan was to build a tool that would not try to code up the entire site in one attempt. Instead it would use a pattern that many front-end dev teams use, which is to build out a component library of the various buttons, dropdowns, headers, etc. and use those as building blocks to fit into each page. Ideally I could click on each of those items right from the design, hit a few buttons, and get the code I needed for each component.

Luckily our design team uses Figma, which has great support for building third-party plugins. After a few days of learning the system, I built a simple and very ugly plugin that would let you click on an element on the page and pull out the information I needed, such as the colors, font size, rounded corners, etc. Then I just needed to send that information to the Codex API and I should be able to get back code. Right? Well, not quite.
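To give a rough idea of what that extraction step involves, here is a minimal TypeScript sketch. It is not the plugin's actual code: the `DesignNode` shape is a simplified stand-in assumed for illustration (it is not the Figma plugin API's node type, though it mirrors the way design tools commonly report colors as 0 to 1 RGB channels), and the function names are hypothetical.

```typescript
// Simplified stand-in for a selected design element. This shape is an
// assumption for illustration, not the real Figma plugin API type.
interface RgbColor {
  r: number; // each channel is expected in the range 0..1
  g: number;
  b: number;
}

interface DesignNode {
  name: string;
  fill?: RgbColor;
  textColor?: RgbColor;
  fontSize?: number;
  cornerRadius?: number;
}

// Convert one 0..1 channel to a two-digit hex string, clamping stray values.
function channelToHex(value: number): string {
  const clamped = Math.min(1, Math.max(0, value));
  const hex = Math.round(clamped * 255).toString(16);
  return ("0" + hex).slice(-2);
}

function toHex(color: RgbColor): string {
  return "#" + channelToHex(color.r) + channelToHex(color.g) + channelToHex(color.b);
}

// Flatten the node into "Label: value" lines that can later go into a prompt.
// Properties that are missing on the node are simply skipped.
function describeNode(node: DesignNode): string[] {
  const lines: string[] = [];
  if (node.fill) lines.push(`Fill: ${toHex(node.fill)}`);
  if (node.textColor) lines.push(`Text color: ${toHex(node.textColor)}`);
  if (node.fontSize !== undefined) lines.push(`Font size: ${node.fontSize}px`);
  if (node.cornerRadius !== undefined) lines.push(`Corner radius: ${node.cornerRadius}px`);
  return lines;
}
```

Keeping the output as plain text lines is a deliberate choice in this sketch: the model only ever sees text, so a readable description is more useful downstream than a structured object.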

First of all, just having some basic design details wasn’t enough to get started. We needed a way to give more of a hint of what type of code needed to be created. It turned out a simple solution was to have the user provide the name of the component. This is something that needed to be done anyway, and having a properly descriptive name helped guide the system to create a button vs a dropdown box or something entirely different. 

The system can only complete text from an initial prompt, but luckily it’s common in many programming languages to put a comment before a block of code. So we can pass in a descriptive comment and hopefully the system will generate some usable code. After a few tweaks we ended up with something like this:

The gray text was created by me, the AI wrote the rest

It’s…pretty good! Overall it’s creating relatively clean working React code that has the right colors and styles. It even does some neat things like correctly identifying that a “Fill” in the design tool translates to “backgroundColor”. But it’s not there yet. The code I actually want is written in TypeScript instead of plain JavaScript. This is using inline CSS instead of Tailwind utility CSS classes, and this button has hardcoded “Click Me!” instead of the ability to pass in the correct label. Finally…this button doesn’t actually do anything! We need to be able to send it a function that will execute when the button is clicked.
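For anyone curious how the comment-as-prompt trick might look in code, here is a minimal TypeScript sketch. The function name `buildCommentPrompt` and the exact wording of the comment are assumptions made for illustration, not the prompt used in the prototype. The idea is to turn the component name and design details into a code comment, then leave the code open-ended so a completion model picks up from there.

```typescript
// Build a comment-style prompt for a code-completion model.
// Hypothetical helper: the comment wording is an assumption for illustration.
function buildCommentPrompt(componentName: string, designDetails: string[]): string {
  const header = `// A React component called ${componentName} with the following styles:`;
  const details = designDetails.map((detail) => `// - ${detail}`);
  // End with the start of the function so the model continues from here.
  const opening = `function ${componentName}(`;
  return [header, ...details, opening].join("\n");
}
```

Calling `buildCommentPrompt("PrimaryButton", ["Fill: #0000ff", "Corner radius: 4px"])` produces a three-line comment followed by the opening of `function PrimaryButton(`, which is the text the model would be asked to complete.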

Time to make my hipster dreams come true

Now here’s where things start to get interesting. It turns out you can just…ask Codex to do all that stuff in plain English and it…just does it! It takes a bit of trial and error to figure out the right series of steps to ask, but in general it turns out that a pretty straightforward set of instructions gets you a very good result. 

This is the part that I created...

The very good result:

...and this is what the AI came up with as a result

So…🤯.  Overall this is almost exactly the code that I would have written myself. The system correctly added all the right types. It came up with very sensible names. And it even did some pretty wild conversions to come up with (mostly) correct Tailwind classes. It knew that “4px” matched with the “rounded” class and even came up with a subtle “hover” effect based on a slightly darker shade of blue! It wasn’t perfect - I’d move those props into a separate interface and we’re missing the font name, and the colors aren’t quite right. But overall as a first attempt? It’s very promising.  Pretty wild stuff.
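The plain-English instructions described above could be assembled along these lines. This is a sketch under stated assumptions rather than the prototype's real prompt: the `StackPreferences` type, the `buildInstructions` name, and the wording of each instruction are all hypothetical. What it tries to show is that the developer's bespoke choices become ordinary parameters instead of hardcoded templates.

```typescript
// A developer's "hipster" preferences, expressed as plain strings.
// Hypothetical type for illustration.
interface StackPreferences {
  framework: string; // e.g. "React" or "Angular"
  language: string; // e.g. "TypeScript"
  styling: string; // e.g. "Tailwind utility classes"
}

// Produce the ordered list of plain-English instructions for the model.
// The instruction wording is an assumption, not the prototype's actual text.
function buildInstructions(componentName: string, stack: StackPreferences): string[] {
  return [
    `Create a component called ${componentName} using ${stack.framework}.`,
    `Write it in ${stack.language}.`,
    `Use ${stack.styling} instead of inline CSS.`,
    "Accept the label as an input instead of hardcoding it.",
    "Accept a click handler function and call it when the component is clicked.",
  ];
}
```

With this shape, switching frameworks means changing only the `framework` field, which alters the first instruction and leaves the rest untouched.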

The mythical personalized design-to-code dream gets one step closer

Now let’s say that instead of React, the team was using a different front-end framework. Will it work with something completely different like Angular? Let’s change that first sentence a bit and see what we get:

The result:

As if Angular developers didn't already have enough to worry about...

As you can see, just changing a few words in one instruction was enough to get an entirely different output. Is it perfect? Probably not. Do I have any idea if it works since I don’t write Angular code? Nope! But again it sure looks promising. 🤓

So now that we had a working prototype, I started using it on a new design and quickly hit some stumbling blocks. First, the results were pretty inconsistent. Sometimes it would be great, other times it would spit out nonsensical Tailwind classes or other randomly bad results. And second, often the result would be almost right but not quite what I wanted.

Sometimes it’s better to be lucky than to be good

Fortunately for me, right as I was hitting this wall, OpenAI released a brand new API endpoint called “edit”. Now, instead of just providing one very long comment at the top and crossing my fingers, I could instead send over a block of code and some instructions on how I wanted to modify it. The output could then be used as the input for the next step, and the system could now iteratively improve as needed. And once the code is returned to the user, they can provide simple written instructions to get the code from pretty good to exactly right.

For example, here the system generated code that was mostly right, but has a bug where the className is inside the “styles” block. 

So how to fix it? Just write: “Move the className out of the style block” and the bug is gone. 🤯 🧙

I honestly didn't expect this to work, pretty wild stuff
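The iterative flow made possible by the edit endpoint, where each step's output becomes the next step's input, can be sketched in TypeScript as follows. The `EditFn` type is a hypothetical wrapper around whatever call actually reaches the model; the real request and response format is not shown here, and `fakeEdit` is only a stand-in so the loop can be tried without calling any API.

```typescript
// A function that sends code plus one instruction to an edit-style model
// endpoint and resolves with the revised code. Hypothetical signature.
type EditFn = (code: string, instruction: string) => Promise<string>;

// Apply each instruction in order, feeding the output of one step into the next.
async function applyEdits(
  edit: EditFn,
  initialCode: string,
  instructions: string[],
): Promise<string> {
  let code = initialCode;
  for (const instruction of instructions) {
    code = await edit(code, instruction);
  }
  return code;
}

// Stand-in editor for local experiments: it only appends a note per step.
const fakeEdit: EditFn = async (code, instruction) =>
  `${code}\n// applied: ${instruction}`;
```

The steps run sequentially rather than in parallel on purpose, since every instruction has to see the result of the one before it. A follow-up fix typed by the user, such as “Move the className out of the style block”, is just one more entry at the end of the `instructions` array.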

The next part of our development process includes writing tests for the component and also creating Storybook stories for visual testing. We could use the same approach to automate the creation of these tests. We just take the generated code, ask Codex to give us some tests, and it just works. Same with the Storybook stories. And honestly, this is where a lot of the time savings comes into play. These aren’t hard things to do, they just take time, and just automating this part can easily get that 10% efficiency improvement that I’m looking for.

Cool tech, bro. But is it a company?

So after about two weeks of experimentation, we’ve got a neat demo and something that looks pretty promising. But is it a company? Let’s start by highlighting some drawbacks to this approach. 

The first is a big one. The real value here is in the OpenAI API and those Codex endpoints are expensive to use! It’s currently free in private beta, but if it were live, each call would cost about $0.25 on average and it takes about 20 calls to create one component. This could quickly get expensive if used to build out a large number of screens and components. However, Microsoft has invested a billion dollars to partner with OpenAI and recently launched the Azure OpenAI Service in a closed beta. So it could be worth a chat with the OpenAI team to see if maybe there are some startup credits or discounts available for Microsoft’s Azure cloud platform that could be applied here. But it’s a dangerous game to build a venture-scale company with a critical component of your value being tied so directly to another company’s platform.
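To put rough numbers on that, the two estimates above (about $0.25 per call and about 20 calls per component) work out to roughly $5 per component. The small TypeScript sketch below just does that multiplication; the function name is hypothetical, the defaults are the approximate figures from this post rather than published pricing, and the 100-component library in the usage note is an assumed size for illustration.

```typescript
// Back-of-the-envelope cost estimate. The defaults are the rough figures
// quoted in this post, not actual pricing.
function estimateCostUsd(
  components: number,
  callsPerComponent: number = 20,
  costPerCallUsd: number = 0.25,
): number {
  return components * callsPerComponent * costPerCallUsd;
}
```

Under those assumptions, `estimateCostUsd(1)` returns 5 and a hypothetical 100-component library, `estimateCostUsd(100)`, comes to 500.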

The second is that Codex will quite often insert subtle bugs or omit code that is needed. And that’s OK, because remember that the goal is not to replace developers but just help them be 10-20% more efficient. Experienced developers can usually spot the issue pretty quickly, but newer developers or non-coders would have a much harder time. And sometimes it’s just plain wrong. It should get better over time, but while it does make for a neat demo, it just doesn’t feel like it works consistently right now.

So what are we planning to do next? Our team can see today that applying this technology is comparable to hiring a novice engineer. Currently, it needs a lot of coaching for mediocre results, and it's a bit expensive. We can't put it in charge of the business yet, but we're investing in this junior technology because we can see the potential for rapid coding that enables us to stay ahead in the future. This can be a great tool to make us more productive at PSL and we can see a path forward where this becomes even more powerful and "could" be a company in the future. We’ll keep experimenting and using this tool internally and we’ll be sure to share more details in a future post.