Have you ever wondered if generative AI can be used to create fun new Magic: The Gathering cards? I wondered that, and I wrote some code to do it. The code is on GitHub, and I’m going to talk about how it works and what I learned.
This is a post about a card game, but it’s also a post about prompt engineering and the use of programmatic scaffolding around an LLM, and I hope you’ll find it interesting even if you’ve never played Magic.
Let’s Just Ask ChatGPT
First of all, what happens if you just ask GPT-4 to create a Magic card?
Me: I’d like help designing a new Magic the Gathering card for a custom set I’m designing that’s X-Files themed. Can you make a Magic card to represent Agent Mulder? Please use the MTGJSON format.
ChatGPT: Sure I can help with blah, blah, blah:
{
"name": "Agent Mulder, Truth Seeker",
"manaCost": "{3}{U}{B}",
"type": "Legendary Creature — Human Investigator",
"text": "Whenever Agent Mulder, Truth Seeker enters the battlefield or uncovers a conspiracy, draw a card. \\n {2}{U}, {T}: Look at the top card of your library. If it's a nonland card, you may reveal it and put it into your hand. \\n {2}{B}, {T}: Exile the top card of an opponent's library. You may play that card for as long as it remains exiled, and you may spend mana as though it were mana of any color to cast it.",
"power": "3",
"toughness": "4",
"rarity": "Mythic Rare",
"set": "XFLS",
"artist": "Your Artist Name",
"flavorText": "\\"The truth is out there, and I will find it.\\""
}
We can use the Magic Set Editor to render this into a card:
I used Midjourney to create the image, from the prompt “Agent Fox Mulder from the X Files, investigating the paranormal, depicted as a fantasy character in the style of Rebecca Guay, Magic the Gathering card art”. I asked for the style of Rebecca Guay because she’s my favorite MTG artist. You can see Midjourney got a little confused and drew a fox in the background, but overall it’s good art.
Considering the actual mechanics of the card, some problems immediately jump out at me:
- It uses the phrase “uncovers a conspiracy”. While conspiracies are a card type in MTG, they cannot be “uncovered”, so this rules text is meaningless.
- It has two abilities that are very similar to each other. I would advise ChatGPT to drop one of them to simplify the card.
- It’s quite powerful. Arguably it’s not overpowered for a mythic rare, but there’s a clear trend of ChatGPT creating extremely strong cards.
Art
Before talking more about the text, I should say that the art for these cards is generated with a script that accesses Midjourney. The basic approach is to ask the LLM to fill out a MadLibs-style prompt:
“[Card Name], from [Source], [describe the scene in one sentence or less], [list several adjectives], in the style of [style], [visual medium], by [artist name] and [historical artist]”
But before filling that out, I ask it to brainstorm options for each of those details. The prompt includes the Card Name along with the flavor and mechanics of the card, and the LLM brainstorms options for all the rest. It’s able to do both brainstorming and MadLibs in the same prompt response, and just a little bit of parsing is required to extract the final answer. I find that this brainstorming significantly improves the resulting Midjourney prompt, because it allows the LLM to consider several options, and it doesn’t always go with the first option it puts in the list.
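As a sketch, the brainstorm-then-fill step might look like this. The helper names and the “FINAL:” marker are my own inventions for illustration; the actual repo’s prompt and parsing differ.

```python
# Sketch of the brainstorm-then-fill approach for art prompts.
# Helper names and the "FINAL:" convention are illustrative, not the repo's.

ART_TEMPLATE = ("[Card Name], from [Source], [describe the scene in one "
                "sentence or less], [list several adjectives], in the style "
                "of [style], [visual medium], by [artist name] and "
                "[historical artist]")

def build_art_prompt(card_name: str, flavor: str, mechanics: str) -> str:
    """Ask the LLM to brainstorm options for each slot, then fill the template."""
    return (
        f"Card: {card_name}\nFlavor: {flavor}\nMechanics: {mechanics}\n\n"
        "First, brainstorm 3-5 options for each bracketed detail below.\n"
        "Then, on a final line beginning with 'FINAL:', write the filled-out "
        f"template, choosing the best option for each slot:\n{ART_TEMPLATE}\n"
    )

def parse_final_line(llm_response: str) -> str:
    """Pull the filled template out of the brainstorm-plus-answer response."""
    for line in reversed(llm_response.splitlines()):
        if line.strip().startswith("FINAL:"):
            return line.strip()[len("FINAL:"):].strip()
    raise ValueError("no FINAL: line found in LLM response")
```

Because the brainstorm and the final answer come back in one response, the only scaffolding needed is that last-line scan.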
For example, here is some card art that was generated from an LLM-sourced prompt.
I also ask it to tell me who the artist is, by filling this out:
Artist Credit: [artist name], [second artist, if more than one]
Extracting the artist’s name directly from the previous answer would be difficult, because I don’t have an exhaustive list of artists to match against.
We then use that as the artist credit at the bottom of the card, with the template “Insp[ired]. by [Artist Name]”, like this:
I know there’s currently a lot of debate about how to credit or compensate artists whose work is used in the training set for generative AI. I like this credit line as a compromise. I’m very grateful to be able to see art in the style of famous artists, and I feel like this project has expanded my appreciation of art.
I include some diversity prompting in the prompt that I send to the LLM, where I say “(Some of the other cards are in the style of: {other_artist_names})”, where the other artist names are randomly selected from a list of artists. In this case, “diversity” refers to diversity of artists. By suggesting other artists to think about, the LLM is encouraged to consider similar artists. If I don’t include this, it tends to fall back on the same few artists, like Alphonse Mucha and John Howe; I’m not sure why.
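That diversity hint only takes a couple of lines to generate. The artist list here is illustrative; the real project draws from a longer list.

```python
import random

# Illustrative list; the real project uses a much longer one.
KNOWN_ARTISTS = ["Rebecca Guay", "John Avon", "Seb McKinnon",
                 "Terese Nielsen", "Alphonse Mucha", "John Howe"]

def diversity_hint(rng: random.Random, k: int = 3) -> str:
    """Build the parenthetical that nudges the LLM away from its favorites."""
    others = rng.sample(KNOWN_ARTISTS, k)
    return f"(Some of the other cards are in the style of: {', '.join(others)})"
```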
Improving on 1-Shot Generation
We can do better than just asking ChatGPT to make a whole card directly. We can do some prompt engineering that’s similar to what we did with the art.
Brainstorm Multiple Options and Review
The next thing I tried was to generate the same card multiple times, and then send those options back to the LLM with a prompt asking it to choose the best one. This works okay, but it turns out that for Magic cards we can do better: we can brainstorm specific details for the card and remix those together.
The first step is to ask the LLM to brainstorm 15 possible mechanics for the card. I tried just sending that list back to the LLM, asking it to pick a few of the best options that go well together. But that just prompted the LLM to write really text-heavy cards that have way too many mechanics on them.
So instead, I ask the LLM to rate each mechanic on a 1-5 scale for complexity and for power, again with a MadLibs template, like this:
1. Text of the mechanic. Similar to [Card]. Complexity X. Flavor X. Synergy X.
I forget why I asked it to suggest a similar card, but it seems to improve results; I speculate that it helps anchor the model.
So the program ends up with a list of mechanics, each of which is rated for complexity and power level. I thought I could just send those mechanics to the LLM and ask it to create a well-balanced set, but it turns out LLMs are bad at math. I tried telling it to give me a set of mechanics whose power levels add up to 10, and it’s unable to do that task. It will give a list of mechanics with power levels like 5, 4, 5, 3, and then blatantly lie, claiming those numbers add up to 10.
So I had to write a script that generates sets of mechanics that are “well balanced” in the sense that they are approximately the right amount of power level and complexity.
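A balancing script like that can be sketched as a brute-force subset search. The field names, set sizes, and thresholds below are my own illustrative choices, not the project’s actual values.

```python
import itertools

def balanced_sets(mechanics, target_power=10, tolerance=1, max_complexity=7):
    """Yield combinations of 2-3 mechanics whose summed power lands near
    target_power and whose summed complexity stays under max_complexity.
    Brute force is fine here: the candidate list is only ~15 mechanics."""
    for size in (2, 3):
        for combo in itertools.combinations(mechanics, size):
            power = sum(m["power"] for m in combo)
            complexity = sum(m["complexity"] for m in combo)
            if abs(power - target_power) <= tolerance and complexity <= max_complexity:
                yield combo
```

Unlike the LLM, this code actually adds the numbers up correctly, which is the whole point of moving the step out of the prompt.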
That turns out to work well. I can take a few good sets of mechanics, and have the LLM choose from among them, like this. This ends up being a good amount of guidance for the LLM; it’s able to consistently generate cards with a good power and complexity level, and by seeing lots of mechanics, it’s able to choose some of the best ones.
For example, the system is generating a card called “Journey of Restoration”. This is the idea that it’s working from, and a description of the color archetype that I want this card to fit into:
Journey of Restoration. Sorcery. Rare. Blue/White. This card represents Elysia and Theron's quest to repair the Heart of Argentum. It's known for its ability to restore and rejuvenate, reflecting the siblings' hope to restore balance to their world.
**White/Blue (Artifact Control)** - Draft cards that control the board and make good use of artifacts. Cards with Improvise and Scry are particularly good here.
Using that guidance, the system proposes this list of potential mechanics:
- Gain 5 life. Similar to Healing Salve. Complexity 1. Flavor 4. Synergy 2.
- Scry 2. Similar to Preordain. Complexity 1. Flavor 3. Synergy 4.
- Return target artifact from your graveyard to your hand. Similar to Argivian Restoration. Complexity 2. Flavor 5. Synergy 5.
- Draw a card. Similar to Opt. Complexity 1. Flavor 2. Synergy 3.
- Target creature gains protection from the color of your choice until end of turn. Similar to Gods Willing. Complexity 2. Flavor 3. Synergy 2.
- Return up to two target artifacts or enchantments from your graveyard to your hand. Similar to Dowsing Shaman. Complexity 3. Flavor 5. Synergy 5.
- Create a token that's a copy of target artifact you control. Similar to Saheeli's Artistry. Complexity 4. Flavor 4. Synergy 4.
- For each artifact you control, create a token that's a copy of that artifact. Similar to Mechanized Production. Complexity 5. Flavor 4. Synergy 5.
- You may cast artifact spells from your graveyard this turn. If an artifact card put into your graveyard this turn would be put into your graveyard, exile it instead. Similar to Emry, Lurker of the Loch. Complexity 5. Flavor 4. Synergy 5.
- Exile all artifacts and enchantments. For each card exiled this way, its controller creates a 4/4 white Angel artifact creature token with flying. Similar to Cataclysmic Gearhulk. Complexity 5. Flavor 4. Synergy 3.
...
That list undergoes a process to filter it down into a smaller set of mechanics that are about the right power level and complexity for a 4 mana cost rare:
The power level and complexity of this card are acceptable, and it’s mechanically playable. But, like many cards that this process produces, the mechanics are a bit scattered and without synergy. In this case, the “scry 2” part of the card is completely disconnected from the rest of the card.
Self Criticism
OK, the brainstorming process works well, but it still generates some cards that are simply broken, or that don’t make sense at all.
So once we have a card idea, we send it back to the LLM and ask it to criticize the card.
First, I ask it some questions to address fundamental problems with the card. These questions are taken directly from the LLM’s prompt:
- Is it missing any important details like mana cost, or power and toughness if it’s a creature? If so, write “Needs work: Missing details” and write what’s missing
- Does the text on the card make sense mechanically? Is it possible to do what the card says? If not, write “Needs work: Mechanical Issues”
- Is it written in the right style for MTG? Would a judge have trouble interpreting it? If it’s not written in the style of Oracle text, write “Needs work: Wrong Style”
Those questions identify serious problems with the card, and they tell the LLM to write an easily parsable response. But I also want to make sure the card falls within acceptable boundaries for qualities that are a bit harder to determine: power level and complexity.
For this, I again ask it to rate cards on a numeric scale. I find that this is a good technique to push it to produce output that’s easy to parse. I give it specific guidance, like “Rate the power level of the card on a scale from 0-6, where 0 is unplayable, 1 is very weak, like a bad common, 5 is very strong, like a bomb rare, and 6 is super overpowered, like a Black Lotus.” Giving it a scale increases consistency among its responses, and also pushes it away from a pattern of responding 5 to most 1-5 scales, presumably because that’s the most common rating in online data.
The prompt then asks the LLM to compare its numerical scores against desired ranges, for example commons should not be higher than a 3 on the power scale. I find that asking it to give a number and then compare against a scale yields more consistent results than simply asking it to provide a binary classification of whether the card is too strong or not.
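As a sketch, that numeric check might look like this. The “Power: N” response format and the per-rarity ceilings are my own assumptions, not the project’s exact numbers.

```python
import re

# Illustrative power ceilings per rarity -- my guesses, not the project's.
MAX_POWER = {"common": 3, "uncommon": 4, "rare": 5, "mythic": 6}

def power_ok(critique: str, rarity: str) -> bool:
    """Parse a 'Power: N' line out of the critic's response and compare it
    against the ceiling for the card's rarity."""
    m = re.search(r"Power:\s*(\d)", critique)
    if not m:
        return False  # an unparsable critique counts as a failure
    return int(m.group(1)) <= MAX_POWER[rarity.lower()]
```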
I give it some advice that explicitly addresses failure modes that I’ve observed it having. For example, if the card is a land, I say “Lands need to have a downside, like entering the battlefield tapped”, because otherwise it produces really overpowered lands.
I’ve noticed that stories from mythology often include places that make sense as lands, but the mechanics of Magic set design don’t allow for enough legendary lands to match the story.
In the self-criticism step, I alternate between a critic LLM and a fixer LLM, which tries to fix the problems the critic found. This seems to work well; it allows the critic to be harsh, and the fixer to do its best.
This step is done repeatedly until the critic LLM can no longer find any problems with the card. Unfortunately, that doesn’t mean that no flaws remain, just that ChatGPT is no longer able to see them.
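The critic/fixer loop itself can be sketched in a few lines. Here `critic` and `fixer` stand in for the two LLM calls, and the “Needs work” check matches the parsable phrase the questions above ask for; the round cap is my own addition so a hopeless card can’t loop forever.

```python
def refine(card, critic, fixer, max_rounds=5):
    """Alternate the critic and fixer until the critic stops flagging
    problems, or we hit the round cap."""
    for _ in range(max_rounds):
        critique = critic(card)
        if "Needs work" not in critique:
            break
        card = fixer(card, critique)
    return card
```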
In the future, I think this step could be improved by adding more specific and actionable advice about what makes good cards, like the guidance from this page.
Generating Flavor Text Separately
The flavor text for a Magic card is just the non-rules text that appears in italics in the card’s text box.
ChatGPT includes flavor text if you just ask it to write the full card JSON, but I find it kind of uninspired. So I do a separate creative writing exercise with the LLM to workshop potential phrases.
Here’s some of the advice I give it:
First, I want you to brainstorm some ideas for the flavor text. Choose 3 of these writing prompts, and write out about {lines_guide} of text for each one:
* A quote attributed to a character in the story about the card, like: "A quotation here" -- [Character Name]
* Poetry relevant to the card
* Something humorous
* Reference to another part of the story
* A single line describing a pivotal moment in the story
* A shocking reveal from the story
* A quote from the character that illustrates their personality, like "A quotation here" -- [Character Name]
* A paragraph that narrates a surprising action like betrayal or sacrifice
Then I ask it to choose its favorite option from among the options it generates.
This works pretty well, and I think it improves the quality. The only downside seems to be that it pushes the LLM to produce shorter flavor text, because it treats list items as one-liners. That’s fine, though, because the process usually makes cards with too many lines of rules text anyway.
Generating Whole Sets
Generating individual cards works well, but I want to be able to generate entire sets of Magic cards that will play well together. My dream is to be able to run a program with the prompt “Magic the Gathering set based on Polynesian mythology”, or whatever, and have it create a whole set that I can draft and play with.
In order to do that, I have the program go through several steps:
- It writes a story for the set, which will inform the content that it includes
- Then it creates a list of mechanics for the set. Again, I have it brainstorm a lot of potential options, and then narrow the list down. It also includes guidance for each of Magic’s 5 colors, to hopefully make the set more draftable
- It then generates a bunch of possibilities for cards, where each card is just a one sentence idea. This is actually a sticking point for larger sets, because if I ask it to generate 100 or more suggestions, it perseverates and starts to suggest the same thing repeatedly
- The previous step actually overgenerates card ideas by a factor of 3. I take the list of cards it generates, and extract information about the color and rarity of each card idea, and use an algorithm to create a balanced MTG set that includes an equal number of each of the 5 colors (plus some colorless and multicolored cards). Without this step, the LLM creates way too many rare cards. I tried getting the LLM to perform this algorithm, but again it turns out that LLMs are embarrassingly bad at math. For example, I would give it a list of 20 cards and ask it to return a list of 10 cards, 2 of each color. It would include like 4 white cards and 0 blue cards, and then at the end it would say “as you can see, I have included a balanced set of 2 of each color”. In fairness to the LLM, it’s genuinely hard to create a list which is simultaneously balanced by both color and rarity.
- Finally, each of those cards is converted into a fully specified JSON object using the processes I described above, and the Magic Set Editor is used to turn them into full cards
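The color/rarity balancing step from the list above can be sketched as a greedy filter over the overgenerated ideas. The quota numbers here are illustrative, and a greedy pass can leave quotas unfilled on unlucky orderings, so the real script has to be a bit smarter.

```python
COLORS = ["W", "U", "B", "R", "G"]

def pick_balanced(ideas, per_color=4, rarity_quota=None):
    """Greedy pass: keep an idea only while its color and its rarity both
    still have quota left. Quota numbers are illustrative."""
    rarity_left = dict(rarity_quota or
                       {"common": 10, "uncommon": 6, "rare": 3, "mythic": 1})
    color_left = {c: per_color for c in COLORS}
    chosen = []
    for idea in ideas:
        if color_left.get(idea["color"], 0) > 0 and rarity_left.get(idea["rarity"], 0) > 0:
            chosen.append(idea)
            color_left[idea["color"]] -= 1
            rarity_left[idea["rarity"]] -= 1
    return chosen
```

Overgenerating by 3x is what makes this work: with plenty of candidates per color and rarity, even a simple filter can usually fill every slot.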
With all these steps together, the program is able to create a pretty coherent set.
Next Steps
There are ways that this project could be improved in the future.
Better Example Cards
One of the first things I did was try showing the LLM a random card as an example. My thinking was that this would help demonstrate the correct data format for its response, as well as act as diversity prompting, pushing it to think about a similar card.
It turns out you can download all MTG cards as JSON files from here, and from there it’s easy enough to pick random ones.
This didn’t really work, though. It turned out to be better to just tell it the JSON format explicitly. I also realized that it doesn’t really copy the example card very much; GPT-4 has more imagination than that.
I think a better approach would be to find existing cards which are most mechanically similar to the card being generated, and present them as comparisons for the LLM. I think this will help keep new cards in line with the power level of existing cards.
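One simple way to find mechanically similar cards is a bag-of-words cosine similarity over the rules text; a real version might use embeddings instead, but this sketch shows the idea.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(new_card_text, existing_cards, k=3):
    """Rank existing cards by bag-of-words similarity of their rules text."""
    target = Counter(new_card_text.lower().split())
    return sorted(existing_cards,
                  key=lambda card: cosine(target, Counter(card["text"].lower().split())),
                  reverse=True)[:k]
```

The top-k matches could then be pasted into the prompt as power-level comparisons for the critic step.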
If You’d Like to Help…
You can run the code and generate cards of your own. You’ll need to set your OpenAI key and Midjourney credentials. If you’d like to help improve the project, I welcome pull requests on GitHub.