Puzzle String Formats
When sharing puzzles over the Internet, your first inclination might be to simply take a screenshot or a photograph of the puzzle and share that. However, images can have large file sizes and some websites don't allow you to post them at all. In many cases it is best to use a textual representation (string) of the puzzle which can be easily copied and pasted into different programs. This is especially useful when you want to store puzzles or lists of puzzles on your computer without using up much disk space.

An example of this is the common Sudoku notation, where all 81 of the puzzle's givens are simply listed top-down left-right, with empty cells being either a 0 or another character such as a period (.). This example puzzle is thus represented by the string 21..........7.8..9.....4..6..69.....1...8..944....7..............56..7...74....3.

There are more compact formats using various mathematical tricks but this one is popular for its ease of use. The purpose of this thread is to document a few common Calcudoku puzzle formats and raise the topic for discussion - perhaps you have your own ideas. All formats here are intended to represent empty, clean-slate puzzles with only the clues necessary to define the puzzle and no other information. I apologise in advance for how long this post ended up being.

I'll use this 6x6 puzzle for all examples.
[Image: the 6x6 example puzzle]

In Calcudoku the givens aren't digits: they are cages, which have a shape, an operator, and a total. Given cells are often represented as single-cell cages with either an addition operator or no operator at all. There are two major families of string formats: formats that list cages with their layout, operator and total one by one, and formats that split these into a string for a cage pattern and a separate string containing all the cage operators & totals. In general you want a format to be fairly short in both the average & worst case, to be easy to read/write as a human (for copying puzzles out of books etc), and to be flexible enough to handle all the different Calcudoku variants such as twin puzzles, no-op, custom digit ranges, custom operators, etc... it's not possible to maximise all 3 of these but you can do pretty well if you try.

On this site's puzzle submission page a format is explained in detail. I have no idea if this site uses this format internally. It's clearly designed to maximise human legibility for ease of transcription; the main drawback is that the resulting puzzle definition is very long (consider that the 2025-09-20 15x15 puzzle has 109 cages). There is one cage per line, and each cage's attributes are listed one by one and separated with commas. Cage layouts are defined by listing the positions of every cell contained within the cage at two characters per cell. The example puzzle becomes:

30,x,a1b1c1
24,x,d1e1f1
2,-,a2a3
2,/,b2c2
2,/,d2e2
1,-,f2f3
3,-,b3b4
11,+,c3d3d4
2,/,e3e4
12,+,a4a5b5
7,+,c4c5
3,/,f4f5
24,x,d5c6d6
30,x,e5e6f6
1,-,a6b6

This comes out to 152 characters in total (counting newlines). It seems flexible enough for custom operators: you would just write (for example) % for modulus. Puzzle size isn't explicitly encoded but it can be inferred from the data. Custom digit ranges would have to be included in some sort of header or metadata. The commas aren't necessary but I suppose they're there for legibility - removing them saves two characters per cage. A less obvious trick is that the top-left cage cell's position is always already known (as long as the cages are listed in reading order, it is the first cell not claimed by an earlier cage), so you could get away with only listing the co-ordinates for every other cell in the cage. This saves another two characters per cage, bringing it down to 92 characters.
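To make the one-cage-per-line format concrete, here is a minimal Python sketch of a reader for it. The function names (parse_cage_lines, infer_size) are my own invention for illustration, not anything the site provides, and it assumes columns are lettered from "a" and rows are numbered from 1 as in the example.

```python
import re

EXAMPLE = """30,x,a1b1c1
24,x,d1e1f1
2,-,a2a3
2,/,b2c2
2,/,d2e2
1,-,f2f3
3,-,b3b4
11,+,c3d3d4
2,/,e3e4
12,+,a4a5b5
7,+,c4c5
3,/,f4f5
24,x,d5c6d6
30,x,e5e6f6
1,-,a6b6"""


def parse_cage_lines(text):
    """Parse 'total,operator,cells' lines into (total, op, cells) tuples."""
    cages = []
    for line in text.strip().splitlines():
        total, op, cells = line.strip().split(",")
        # Each cell is a column letter followed by a 1-based row number;
        # store it as a 0-based (column, row) pair.
        coords = [(ord(col) - ord("a"), int(row) - 1)
                  for col, row in re.findall(r"([a-z])(\d+)", cells)]
        cages.append((int(total), op, coords))
    return cages


def infer_size(cages):
    """Grid width = 1 + the largest column or row index used by any cage."""
    return 1 + max(max(cell) for _, _, coords in cages for cell in coords)


cages = parse_cage_lines(EXAMPLE)
print(infer_size(cages), len(cages), len(EXAMPLE))  # 6 15 152
```

The size inference only works because every row and column of a valid puzzle is covered by some cage.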

I found a solver on GitHub with its own input format; an example can be found here. It's a variant of the previous format, although it introduces a clever idea to save space when defining the cage layouts: first each cage is assigned a key and then the cage layouts are defined all in one go, by drawing the grid and writing the key into each cell. This is the second family of formats mentioned earlier. Again our example puzzle:

0=30 x
1=24 x
2=2 -
3=2 /
4=2 /
5=1 -
6=3 -
7=11 +
8=2 /
9=12 +
a=7 +
b=3 /
c=24 x
d=30 x
e=1 -
START
0 0 0 1 1 1
2 3 3 4 4 5
2 6 7 7 8 5
9 6 a 7 8 b
9 9 a c d b
e e c c d d

This is 173 characters long (counting newlines) but it can be improved a lot. First note that the "0=", "1=", "2=" assignments are implicit: they'll always be exactly the same in every puzzle, so we can skip those. We also don't need the spaces dividing totals and operators because there's no possibility of confusing them, as one is composed of digits and the other is composed of letters & symbols - in fact we don't even need the newlines. The "START" string is totally unnecessary; a single character is all you need, and I'll use a comma. You can remove all the spaces & newlines from the grid definition too...

30x24x2-2/2/1-3-11+2/12+7+3/24x30x1-,00011123344526778596a78b99acdbeeccdd

This is 73 characters long which is an improvement over everything else so far. Each cage definition requires a variable number of characters depending on the size of its total, but most are 2 characters long, with some being 3 or rarely 4 long. The grid definition takes 36 characters, one per cell in the grid; however, note that if there are more cages than single-character keys (62 if you use the digits plus lower- and upper-case letters) you will have to use 2-character keys, which doubles the length of the grid definition.
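A rough Python sketch of reading the compressed string back into cages, in case it helps. The key alphabet (digits, then lower-case, then upper-case letters) and the name parse_compact are assumptions of mine; the GitHub solver itself does not define this shortened form.

```python
import math
import re
import string

# Assumed key alphabet: digits, then lower-case, then upper-case letters.
KEYS = string.digits + string.ascii_lowercase + string.ascii_uppercase

COMPACT = "30x24x2-2/2/1-3-11+2/12+7+3/24x30x1-,00011123344526778596a78b99acdbeeccdd"


def parse_compact(text):
    """Return (width, cages), each cage being (total, op, [(col, row), ...])."""
    clue_part, grid_part = text.split(",")
    # A clue is a run of digits followed by one non-digit operator character.
    clues = [(int(total), op) for total, op in re.findall(r"(\d+)(\D)", clue_part)]
    width = math.isqrt(len(grid_part))
    if width * width != len(grid_part):
        raise ValueError("grid part is not a square number of cells")
    cells = [[] for _ in clues]
    for i, key in enumerate(grid_part):
        # The n-th clue belongs to the cage whose key is the n-th key character.
        cells[KEYS.index(key)].append((i % width, i // width))
    return width, [(total, op, cells[k]) for k, (total, op) in enumerate(clues)]


width, cages = parse_compact(COMPACT)
print(width, len(cages), cages[12])
```

Splitting clues with a regular expression only works while operators are single non-digit characters, which holds for the four classical operators.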

Now is a good time to talk about Andrew Stuart's format. His solver on SudokuWiki uses a very similar format to the above, although the cage definition is much longer because it's built to include the current state of the candidates as well. I won't go into that but I will point out something interesting he does with the grid definition: instead of assigning each cage a unique key, he colours the grid in such a way that no two neighbouring cages are the same colour. It is a well-known result that you only need four colours to achieve this, but his site accepts a fifth colour for ease of implementation. Sure, why not. This completely eliminates the need for multi-character keys on puzzles with a lot of cages. The example puzzle looks like this in Andrew's format.

111222233113212243314241334121221122,30x,,,24x,,,2-,2/,,2/,,1-,,3-,11+,,2/,,12+,,7+,,,3/,,,,24x,30x,,1-,,,,,,
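Because neighbouring cages never share a colour, the cages can be recovered from the colour string with a flood fill. The Python below is only a sketch of that idea under my own assumptions (the function name is hypothetical, and I am treating each comma-separated field as the clue of the corresponding cell); it is not how the SudokuWiki solver is implemented.

```python
STUART = ("111222233113212243314241334121221122,"
          "30x,,,24x,,,2-,2/,,2/,,1-,,3-,11+,,2/,,12+,,7+,,,3/,,,,24x,30x,,1-,,,,,,")


def cages_from_colours(colours, width):
    """Group orthogonally connected cells that share a colour into cages."""
    seen, cages = set(), []
    for start in range(width * width):
        if start in seen:
            continue
        seen.add(start)
        stack, cage = [start], []
        while stack:
            i = stack.pop()
            cage.append(i)
            x, y = i % width, i // width
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                j = ny * width + nx
                if (0 <= nx < width and 0 <= ny < width
                        and j not in seen and colours[j] == colours[i]):
                    seen.add(j)
                    stack.append(j)
        cages.append(sorted(cage))
    return cages


parts = STUART.split(",")
colours = parts[0]
fields = parts[1:1 + len(colours)]          # one clue field per cell
cages = cages_from_colours(colours, 6)
# Each clue sits on the first cell (in reading order) of its cage.
clues = {tuple(cage): fields[cage[0]] for cage in cages}
print(len(cages))  # 15
```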

I have one more format to describe before I describe my own ideas & go on about compression, and it's the one I personally use as the default in my own solver. This is the format used by Simon Tatham's excellent puzzles page. It's only documented in the source code but like the previous formats it has two main components: the grid layout definition and the cages definition. The format also begins with a header which is simply the grid size followed by a colon.

The grid layout looks at every dividing line in the puzzle, first the vertical lines going top-down left-right, then the horizontal lines going left-right top-down. Here's a diagram of the order in case that's confusing:

[Image: diagram showing the order in which the dividing lines are read]

If the dividing line is solid, add an underscore "_" to the grid layout string. If the dividing line is not solid, count how many non-solid lines there are in a row, up to the first solid line. If there is only one you add "a" to the string, if there are two you add "b", 3 is "c", up to 25 = "y". For longer runs you add "z" for each full 25 non-solid lines (with no solid line implied after it) and then encode the remainder as usual, so 26 is "za". Don't add an underscore after each letter - the letter already includes the solid line that ends the run. Also, in case the very last dividing line is non-solid, pretend there's a solid line immediately after everything else in the puzzle; this virtual line is always encoded, which is why the string below ends in "_". Our example puzzle gives this grid layout definition:

bbaa__a______a___aaaaa__a____a__aa_aaaa_

This string is then RLE-encoded. If a character is present N times in a row, and N>2, then the run is replaced by "characterN". For example "aaa" becomes "a3".

bbaa__a_6a_3a5__a_4a__aa_a4_

This is appended to the header, then a comma is added to divide it from the cage definition. The cage definition is just the operator (a, s, m or d for add, subtract, multiply or divide) then the total for each cage, top-down left-right.

6:bbaa__a_6a_3a5__a_4a__aa_a4_,m30m24s2d2d2s1s3a11d2a12a7d3m24m30s1
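Here is a Python sketch of the whole encoding as I understand it, starting from the 73-character keyed string from earlier. The function names are mine, the handling of runs longer than 25 follows my reading of the comment in Tatham's keen.c and should be treated as an assumption, and it relies on the cages being listed in reading order of their first cell.

```python
import math
import re

OPS = {"+": "a", "-": "s", "x": "m", "/": "d"}
COMPACT = "30x24x2-2/2/1-3-11+2/12+7+3/24x30x1-,00011123344526778596a78b99acdbeeccdd"


def layout_string(grid, w):
    """Encode the dividing lines of a w x w grid of cage keys (before RLE)."""
    edges = []
    for y in range(w):                      # vertical lines, row by row
        for x in range(w - 1):
            edges.append(grid[y * w + x] != grid[y * w + x + 1])
    for x in range(w):                      # horizontal lines, column by column
        for y in range(w - 1):
            edges.append(grid[y * w + x] != grid[(y + 1) * w + x])
    edges.append(True)                      # virtual solid line at the very end

    out, run = [], 0
    for solid in edges:
        if not solid:
            run += 1
            continue
        while run > 25:                     # assumption: 'z' = 25 gaps, no line
            out.append("z")
            run -= 25
        out.append(chr(ord("a") + run - 1) if run else "_")
        run = 0
    return "".join(out)


def rle(text):
    """Replace runs of 3 or more identical characters with char + count."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) + str(len(m.group(0))), text)


def to_tatham(compact):
    clue_part, grid = compact.split(",")
    w = math.isqrt(len(grid))
    clues = "".join(OPS[op] + total
                    for total, op in re.findall(r"(\d+)(\D)", clue_part))
    return f"{w}:{rle(layout_string(grid, w))},{clues}"


print(to_tatham(COMPACT))
```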

This is 67 characters long, not amazingly shorter than the 73-character string from earlier, but thanks to the RLE step the grid layout definition varies in length: its worst-case length is 2*width*(width-1)+1, the average case is somewhere around 80% of that, but in symmetric grids with a repeating pattern the definition can be extremely short... for example Friday's 10x10 gets the string "10:_90a45,s1a10[...]".

As far as I can tell this is the end of the road where human-readable formats are concerned. The shorter the format, the more annoying the required processing becomes. Tatham's format is the best middle ground in my opinion. It's completely URL-safe and almost entirely alphanumeric and you can convert puzzles manually with some practice.

Unreadable formats

What if we ignore the legibility requirement and go all-in on string length? This is best achieved by representing the puzzles as a binary string and then converting it into URL-safe base64 (or base 256 or whatever is most convenient for storage). In this section all we care about is having some short string that a computer can read and write, which we can use to store 2 billion puzzles in a text file or something. There are four major components we need to optimise:

Header (puzzle info)
Cage layout
Cage operators
Cage totals

Cage layout:
The four-colouring from earlier uses 2 bits per cell, or 3 if you use Andrew Stuart's shortcut of allowing 5-colouring, but really I see no reason to allow this. This allows us to represent any width-N grid with 2N^2 bits, 72 for the 6x6 case. We can do better.
Tatham's idea of reading the gridlines themselves nets us a shorter string: 2N(N-1) bits, 60 for the 6x6 case, which would be 10 characters in base64, compared to 28 from Tatham's RLE format earlier.
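As a sanity check on the 60-bit / 10-character figure, this Python sketch writes one bit per dividing line and packs six bits into each character. The URL-safe alphabet is the usual base64url one; the helper names and the choice of 1 = solid line are my own assumptions.

```python
import string

B64URL = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
GRID = "00011123344526778596a78b99acdbeeccdd"   # cage key of each cell, 6x6


def edge_bits(grid, w):
    """One bit per dividing line (1 = solid), verticals first, then horizontals."""
    bits = []
    for y in range(w):
        for x in range(w - 1):
            bits.append(int(grid[y * w + x] != grid[y * w + x + 1]))
    for x in range(w):
        for y in range(w - 1):
            bits.append(int(grid[y * w + x] != grid[(y + 1) * w + x]))
    return bits


def bits_to_text(bits):
    """Pack bits six at a time into base64url characters, zero-padding the end."""
    padded = list(bits) + [0] * (-len(bits) % 6)
    chars = []
    for i in range(0, len(padded), 6):
        value = 0
        for bit in padded[i:i + 6]:
            value = (value << 1) | bit
        chars.append(B64URL[value])
    return "".join(chars)


def text_to_bits(text, count):
    """Inverse of bits_to_text; count drops any padding bits."""
    bits = []
    for ch in text:
        value = B64URL.index(ch)
        bits.extend((value >> shift) & 1 for shift in range(5, -1, -1))
    return bits[:count]


bits = edge_bits(GRID, 6)
text = bits_to_text(bits)
print(len(bits), len(text))  # 60 10
```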

Perhaps it could be possible to encode the shape of the cages themselves, seeing as there are so few possible shapes when the cages are small. A 2-cell cage only has two possible orientations, which can be represented with a single bit. The issue is that to avoid dividing characters (which are expensive), the exact number of bits per cage needs to be calculated beforehand. My first idea was to store the count of cages & the size of the largest cage in the header, work out how many bits are needed to enumerate all possible cage shapes of this size and below, then assign each shape an ID and our cage layout definition will be a list of all those IDs.

For a quick estimate of the results from this solution: the grid size (N) is presumably already known, so the maximum cage size is N^2, requiring ceil(log2(N^2)) bits in the header - 6 for 6x6. Looking at the OEIS table for fixed polyomino counts I noticed something irksome: the cumulative counts for maxsize=2 and 3 (3 and 9 shapes) are both 1 above a power of 2. Most of the puzzles in my data sets don't contain any single-cell cages, and a very large number have their max cage size at 2 or 3, so I decided to add a bit to the header that tells the program to ignore the 1-cell cage if it's set. The example puzzle now uses 7 header bits + (15*3)=45 bits for the layout, 52 total. That's a tad better than the 60 bits of Tatham's gridline approach but unfortunately the results are only this good if all the cages are similar sizes. Consider a puzzle with one 6-cell cage and 30 single-cell cages. There are 307 fixed shapes of 6 cells or fewer, so each cage needs 9 bits and the layout string would be 31*9=279 bits long, which is horrific. We clearly can't use this on every puzzle. Perhaps you could add a bit to the header that tells the computer whether to import the puzzle using Tatham's gridlines or using this over-complicated format; that would still save 7 bits on the example puzzle (53 versus 60)... but that's a bandaid solution. It would be nice to be able to deduce the number of bits we need per cage, but it would cost too many bits to store the information of how many cells each cage has, so it may not be solvable.
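The bit counts in this estimate can be reproduced with a few lines of Python. The polyomino counts are the first terms of the OEIS sequence for fixed polyominoes (A001168); the function name and the skip_singles flag are just my labels for the header bit described above.

```python
# Fixed polyomino counts for sizes 1..6 (OEIS A001168).
FIXED_POLYOMINOES = [1, 2, 6, 19, 63, 216]


def bits_per_cage(max_size, skip_singles=False):
    """Bits needed to give every cage shape up to max_size cells its own ID."""
    shapes = sum(FIXED_POLYOMINOES[:max_size])
    if skip_singles:
        shapes -= 1          # header bit says the puzzle has no 1-cell cages
    # IDs run from 0 to shapes-1, so we need the bit length of shapes-1.
    return (shapes - 1).bit_length()


print(15 * bits_per_cage(3, skip_singles=True))  # 45 bits for the example
print(31 * bits_per_cage(6))                     # 279 bits for the bad case
```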

Cage operators:
This is an easy one. There are 4 classical operators: addition, subtraction, multiplication, division. Each can be represented with 2 bits. If you know these are all you need, then the size of this component is 2*cagecount. If you're dabbling in custom operators then you could add a few mode bits to the header and expand the representation to 3 or 4 bits per cage.
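For completeness, a tiny Python sketch of the 2-bits-per-operator packing. The particular code assignment (+ is 0, - is 1, x is 2, / is 3) is an arbitrary choice of mine; any fixed mapping works as long as reader and writer agree.

```python
OP_CODES = {"+": 0, "-": 1, "x": 2, "/": 3}
OP_SYMBOLS = {code: op for op, code in OP_CODES.items()}

# Operators of the example puzzle's 15 cages, in reading order.
EXAMPLE_OPS = list("xx-//--+/++/xx-")


def pack_operators(ops):
    """Pack operators into one integer, 2 bits each; returns (value, bit count)."""
    value = 0
    for op in ops:
        value = (value << 2) | OP_CODES[op]
    return value, 2 * len(ops)


def unpack_operators(value, count):
    """Recover count operators from a value made by pack_operators."""
    return [OP_SYMBOLS[(value >> shift) & 3]
            for shift in range(2 * (count - 1), -1, -2)]


packed, nbits = pack_operators(EXAMPLE_OPS)
print(nbits)  # 30
```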

Cage totals:
This one is more of a problem. Cage totals can be very large numbers, particularly so if there are large multiplication cages, especially so if there are large exponentiation cages. Naively storing these in as many bits as are required to store the largest number on the puzzle would waste a ton of bits if there's even a single large multiplication cage present. I have two ideas, one is more practical and the other is the nuclear option. The first is to use the enumeration trick again - we know the operator, size and shape of each cage already, and there are only so many potential totals for each set of these. A two-cell multiplication cage in a 6x6 puzzle can have 13 unique totals so you only need 4 bits to cover all of them.
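The enumeration can be brute-forced for small cages. This Python sketch (names are mine, and it only handles the four classical operators, with subtraction and division limited to two-cell cages) tries every digit assignment that respects the row/column rule and collects the distinct totals.

```python
from itertools import product
from math import prod


def possible_totals(cells, op, n):
    """Return the set of totals a cage can take on an n x n grid.

    cells is a list of (x, y) positions; two cells that share a row or a
    column must hold different digits, as in any Latin square.
    """
    clashes = [(i, j) for i in range(len(cells)) for j in range(i)
               if cells[i][0] == cells[j][0] or cells[i][1] == cells[j][1]]
    totals = set()
    for digits in product(range(1, n + 1), repeat=len(cells)):
        if any(digits[i] == digits[j] for i, j in clashes):
            continue
        if op == "+":
            totals.add(sum(digits))
        elif op == "x":
            totals.add(prod(digits))
        elif op == "-":
            totals.add(max(digits) - min(digits))       # two-cell cages only
        elif op == "/" and max(digits) % min(digits) == 0:
            totals.add(max(digits) // min(digits))      # two-cell cages only
    return totals


def bits_needed(count):
    """Bits required to index count distinct values (IDs 0..count-1)."""
    return (count - 1).bit_length()


DOMINO = [(0, 0), (1, 0)]
totals = possible_totals(DOMINO, "x", 6)
print(len(totals), bits_needed(len(totals)))  # 13 4
```

Note that the shape matters: in an L-shaped cage the two end cells share neither a row nor a column, so they may hold the same digit.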

The example puzzle has 1 2-cell addition cage, 4 2-cell subtraction cages, 4 2-cell division cages, 2 3-cell addition cages in an L shape, 2 3-cell multiplication cages in an L shape, and 2 3-cell multiplication cages in a straight line. These cases have 9, 5, 5, 14, 35, 16 possible totals respectively, which need 4, 3, 3, 4, 6 and 4 bits each. The resulting string is therefore 4 + 4*3 + 4*3 + 2*4 + 2*6 + 2*4 = 56 bits long. The final string would be 8 + 45 + 30 + 56 = 139 bits, 24 characters in base64. Much more compact than the 67 character string from earlier.

What was the nuclear option for cage totals I mentioned? Well, if we already know the cage layouts & operators, then the solution grid uniquely defines the resulting cage totals. There are 812,851,200 solutions to the 6x6 Latin square; all you have to do is order these and assign each one a 30-bit ID. This works great and saves us 26 bits (for a 113-bit/19-char string) but it doesn't scale well at all - 9x9 is already pretty much impossible to enumerate quickly.
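To show what "order these and assign each one an ID" could look like, here is a deliberately naive Python sketch that enumerates Latin squares in lexicographic order and ranks one by position. It is only practical for tiny orders (I would not run it on 6x6), and the function names are hypothetical; a real implementation would need a much smarter ranking scheme.

```python
def latin_squares(n):
    """Yield every n x n Latin square on 1..n in lexicographic order."""
    grid = [[0] * n for _ in range(n)]

    def fill(pos):
        if pos == n * n:
            yield tuple(tuple(row) for row in grid)
            return
        y, x = divmod(pos, n)
        for digit in range(1, n + 1):
            in_row = digit in grid[y][:x]
            in_col = any(grid[r][x] == digit for r in range(y))
            if not in_row and not in_col:
                grid[y][x] = digit
                yield from fill(pos + 1)

    yield from fill(0)


def rank(square):
    """Position of a square in the enumeration (its ID)."""
    for index, candidate in enumerate(latin_squares(len(square))):
        if candidate == square:
            return index
    raise ValueError("not a Latin square")


def unrank(index, n):
    """The square with the given ID."""
    for i, candidate in enumerate(latin_squares(n)):
        if i == index:
            return candidate
    raise IndexError(index)


count_4x4 = sum(1 for _ in latin_squares(4))
print(count_4x4, (count_4x4 - 1).bit_length())  # 576 10
```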

As of this thread's posting there is no way to import an arbitrary puzzle to Calcudoku.org and play it on the site. In fact I don't know any websites or programs that allow the import of puzzles in any format other than their own, private formats. So this thread may be a tad premature, but still I hope it provides a good enough overview of the topic to be useful as a resource for would-be programmers. My solver can accept & convert between most of these formats but it's not ready for the public yet.

Cheers
▄▀
▀▀▀
Re: Puzzle String Formats
I am also interested to know if there is a standard ascii text file format for puzzles. This would be very helpful when discussing problems in the forum. Often there are references to a problem referring to a specific date, but it is not possible to refer to old problems such as "The difficult killer sudoku on 29 July 2022".

My own solver has the ability to save problems in its own private format. The format is not concise, but it is easy to read and understand. Here is an example for the difficult 8x8 puzzle on Dec 9, 2025.

The first line shows the size of the puzzle and the origin if different from 1.
In a 7x7 puzzle with the range -3 to +3 the line would read '7 -3'.
In the example below, the next lines show the vertical bars that form the edges of the cages.
In this case, an 8x8 puzzle, there are 8 rows and each row has 7 potential vertical separators.
A value of one indicates that the line is a boundary between cages and a value of zero indicates that the two squares
on the left and right belong to the same cage.
Similarly, the next 7 lines show the horizontal cage boundaries between cells. Each line has 8 characters.
A 1 indicates that the two cells above and below are in different cages and a 0 indicates that they are in the same cage.
Finally the next line shows the amount for each of the cages. Each is a number followed by the operator.
The operators are:
+ addition
- subtraction
* multiplication
/ division
~ mystery (can be any of +-*/)
# mystery (can be either + or -)
^ exponentiation
% modulo
| bitwise or


8
1011101
0111111
1111111
1111101
1111101
1011111
1010110
0101010
01100110
11011001
00100110
11011111
00100100
11111011
01111111
7| 7| 5| 14| 7+ 7| 15| 7| 5+ 40* 10+ 7| 5+ 9| 6| 15* 5- 9+ 12| 2- 13+ 14* 4/ 6* 13| 7| 7| 3- 28* 40*

There is no reason why the whole file couldn't be put into one or two lines.
Like this:
1011101011111111111111111101111110110111111010110010101001100110110110010010011011011111001001001111101101111111
7| 7| 5| 14| 7+ 7| 15| 7| 5+ 40* 10+ 7| 5+ 9| 6| 15* 5- 9+ 12| 2- 13+ 14* 4/ 6* 13| 7| 7| 3- 28* 40*

When reading in a puzzle, the solver figures out the location of cages from the bars.
The puzzle editor presents the grid, initially with every square being a single cage. Cages are created by dragging the
mouse over the edges, which toggles the edge status between off and on.
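
A minimal Python sketch of how a reader could work out the cages from the bars, using a union-find over the cells. This is not my solver's actual code, and it assumes the clues on the last line are listed in reading order of each cage's first cell; the function name is made up for the example.

```python
EXAMPLE = """8
1011101
0111111
1111111
1111101
1111101
1011111
1010110
0101010
01100110
11011001
00100110
11011111
00100100
11111011
01111111
7| 7| 5| 14| 7+ 7| 15| 7| 5+ 40* 10+ 7| 5+ 9| 6| 15* 5- 9+ 12| 2- 13+ 14* 4/ 6* 13| 7| 7| 3- 28* 40*"""


def parse_bar_format(text):
    """Return (size, origin, [(clue, cell indices), ...])."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    n = int(header[0])
    origin = int(header[1]) if len(header) > 1 else 1
    vbars = lines[1:1 + n]          # n rows of n-1 vertical separators
    hbars = lines[1 + n:2 * n]      # n-1 rows of n horizontal separators
    clues = lines[2 * n].split()

    parent = list(range(n * n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for y in range(n):
        for x in range(n - 1):
            if vbars[y][x] == "0":          # no bar: left and right cells join
                parent[find(y * n + x + 1)] = find(y * n + x)
    for y in range(n - 1):
        for x in range(n):
            if hbars[y][x] == "0":          # no bar: upper and lower cells join
                parent[find((y + 1) * n + x)] = find(y * n + x)

    groups = {}
    for i in range(n * n):
        groups.setdefault(find(i), []).append(i)
    ordered = sorted(groups.values(), key=min)
    # Assumption: clues are listed in reading order of each cage's first cell.
    return n, origin, list(zip(clues, ordered))


size, origin, cages = parse_bar_format(EXAMPLE)
print(size, origin, len(cages))  # 8 1 30
```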